I'm seeing strange errors when running SynapseML LightGBM in streaming mode (dataTransferMode: streaming). If I run it with one or more executors that each have more than one core, the app hits either SIGSEGV/SIGBUS errors or memory corruption errors on the LightGBM side.
The exact error may look like:
Example 1
JRE version: OpenJDK Runtime Environment Temurin-17.0.13+11 (17.0.13+11) (build 17.0.13+11)
Java VM: OpenJDK 64-Bit Server VM Temurin-17.0.13+11 (17.0.13+11, mixed mode, sharing, tiered, compressed oops, compressed class ptrs, g1 gc, linux-amd64)
Problematic frame:
[thread 277 also had an error]
C [lib_lightgbm.so+0x1acac3][LightGBM] [Warning] [LightGBM] [Warning] std::bad_alloc
std::bad_alloc
[LightGBM] [Warning] std::bad_alloc
[LightGBM] [Warning] std::bad_alloc
LightGBM::SparseBin::Push(int, int, unsigned int)+0x33
[LightGBM] [Warning] [LightGBM] [Warning] std::bad_alloc
std::bad_alloc
[LightGBM] [Warning] std::bad_alloc
[LightGBM] [Warning] std::bad_alloc
[LightGBM] [Warning] std::bad_alloc
[LightGBM] [Warning] std::bad_alloc
[LightGBM] [Warning] std::bad_alloc
[LightGBM] [Warning] std::bad_alloc
terminate called without an active exception
Example 2
[LightGBM] [Info] Loaded reference dataset: 59 features, 1199951 num_data
double free or corruption (!prev)
Example 3
SIGBUS (0x7) at pc=0x00007f2d7475aac3, pid=15, tid=348
Problematic frame:
C [lib_lightgbm.so+0x1acac3] LightGBM::SparseBin::Push(int, int, unsigned int)+0x33
[LightGBM] [Warning] [LightGBM] [Warning] std::bad_allocstd::bad_alloc
[LightGBM] [Warning] std::bad_alloc
[LightGBM] [Warning] std::bad_alloc
[LightGBM] [Warning] std::bad_alloc
[thread 261 also had an error]
malloc_consolidate(): invalid chunk size
I have noticed several things about the behavior of SynapseML LightGBM:
Errors don't appear if executors have only one core.
Errors don't appear if the app runs in bulk mode (dataTransferMode: bulk).
Errors appear on two more datasets of different sizes (used_cars and lama_test_dataset).
Some configurations lead to only sporadic errors. For instance, if the app runs with 2 executors that each have 2 cores, the error appears in only a fraction of runs.
The datasets on which the errors appear don't contain any columns or values that look suspicious to me (there are no NaN values in them).
All the datasets I'm using are available at this link: Google Drive
The script that runs SynapseML LightGBM is attached. The exact LightGBM parameters can be found there.
I would appreciate any help with this issue. To me it looks like a bug caused by a race condition in dataset / buffer preparation for native LightGBM, but maybe there is something wrong with my settings.
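To make the setup easier to discuss, here is a minimal sketch of the run matrix described in this issue. It is not the attached script: the helper names (`spark_conf`, `run_matrix`, `SETTINGS`) are hypothetical, and it only assumes the standard Spark keys `spark.executor.instances` and `spark.executor.cores` plus the two `dataTransferMode` values mentioned above.

```python
# Hypothetical helpers for enumerating the runs described in this issue.
# Only the Spark config keys and the dataTransferMode values are real names;
# everything else is illustrative.

# (executor instances, cores per executor) combinations reported in this issue.
SETTINGS = [(1, 1), (1, 4), (2, 1), (2, 2)]


def spark_conf(instances, cores):
    """Spark settings for one run: executor count and cores per executor."""
    return {
        "spark.executor.instances": str(instances),
        "spark.executor.cores": str(cores),
    }


def run_matrix(settings=SETTINGS, modes=("streaming", "bulk")):
    """Build one entry per (executor setting, data transfer mode) pair."""
    runs = []
    for instances, cores in settings:
        for mode in modes:
            runs.append(
                {
                    "conf": spark_conf(instances, cores),
                    "dataTransferMode": mode,
                    # Combination this report associates with the crashes.
                    "multi_core_streaming": mode == "streaming" and cores > 1,
                }
            )
    return runs


if __name__ == "__main__":
    for run in run_matrix():
        print(run)
```

Each `conf` entry could be passed to `SparkSession.builder.config(key, value)`, and `dataTransferMode` to the estimator parameter of the same name. The attached script remains the authoritative source for the actual parameters.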
SynapseML version
1.0.8
System information
Describe the problem
The runs covered three datasets: company_bankruptcy_dataset (https://www.kaggle.com/datasets/fedesoriano/company-bankruptcy-prediction), used_cars and lama_test_dataset.
Statistics of failed and successful runs for different app settings are presented in the table below:
index | dataset | instances | cores | success_percent
0 | company_bankruptcy_dataset | 1 | 1 | 100.000000
1 | company_bankruptcy_dataset | 1 | 4 | 100.000000
2 | company_bankruptcy_dataset | 2 | 1 | 100.000000
3 | company_bankruptcy_dataset | 2 | 2 | 100.000000
4 | lama_test_dataset | 1 | 1 | 100.000000
5 | lama_test_dataset | 1 | 4 | 0.000000
6 | lama_test_dataset | 2 | 1 | 100.000000
7 | lama_test_dataset | 2 | 2 | 100.000000
8 | used_cars_dataset | 1 | 1 | 100.000000
9 | used_cars_dataset | 1 | 4 | 0.000000
10 | used_cars_dataset | 2 | 1 | 66.666667
11 | used_cars_dataset | 2 | 2 | 54.545455
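For anyone who wants to reproduce the statistics, the sketch below shows one way a success_percent column like this could be computed from per-run outcomes. The function name and the run counts are hypothetical: the issue does not state how many runs were made per setting, and the counts below (6 of 11, 2 of 3) are only examples that happen to give the same percentages.

```python
from collections import defaultdict


def success_percent(runs):
    """Aggregate (dataset, instances, cores, succeeded) records into percentages."""
    totals = defaultdict(lambda: [0, 0])  # key -> [successes, total runs]
    for dataset, instances, cores, succeeded in runs:
        key = (dataset, instances, cores)
        totals[key][0] += int(succeeded)
        totals[key][1] += 1
    return {key: 100.0 * ok / n for key, (ok, n) in totals.items()}


# Hypothetical outcomes: the real number of runs per setting is not given.
example_runs = (
    [("used_cars_dataset", 2, 2, True)] * 6
    + [("used_cars_dataset", 2, 2, False)] * 5
    + [("used_cars_dataset", 2, 1, True)] * 2
    + [("used_cars_dataset", 2, 1, False)] * 1
)

if __name__ == "__main__":
    for key, pct in sorted(success_percent(example_runs).items()):
        print(key, f"{pct:.6f}")
```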
Code to reproduce issue
Other info / logs
log_error_1.txt
log_error_2.txt
log_error_3.txt
log_error_4.txt
log_error_5.txt
What component(s) does this bug affect?
area/cognitive: Cognitive project
area/core: Core project
area/deep-learning: DeepLearning project
area/lightgbm: Lightgbm project
area/opencv: Opencv project
area/vw: VW project
area/website: Website
area/build: Project build system
area/notebooks: Samples under notebooks folder
area/docker: Docker usage
area/models: models related issue
What language(s) does this bug affect?
language/scala: Scala source code
language/python: Pyspark APIs
language/r: R APIs
language/csharp: .NET APIs
language/new: Proposals for new client languages
What integration(s) does this bug affect?
integrations/synapse: Azure Synapse integrations
integrations/azureml: Azure ML integrations
integrations/databricks: Databricks integrations