Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

torch fails on new Mac M3 architecture #1167

Open
gilbertocamara opened this issue May 18, 2024 · 13 comments
Open

torch fails on new Mac M3 architecture #1167

gilbertocamara opened this issue May 18, 2024 · 13 comments

Comments

@gilbertocamara
Copy link

Dear @dfalbel I have bought a new MacBook Air with the M3 chip which has 8 CPUs, 10 GPUs and 16GB integrated memory. My R torch apps are crashing. I have put together a MWE which works on all other architectures, including in MacBook Air M1 and MacMini. The OS is the same (Sonoma 14.5). The MWE follows:

# ==== MWE

# Download the training samples
rds_file <- "https://raw.githubusercontent.com/e-sensing/sitsdata/master/inst/extdata/torch/train_samples.rds?raw=true"
dest_file <- paste0(tempdir(),"/train_samples.rds")
download.file(rds_file,
              destfile = dest_file,
              method = "curl")
train_samples <- readRDS(dest_file)

# Sample labels
labels <- c("Cerrado", "Forest", "Pasture", "Soy_Corn")

# Create numeric labels vector
code_labels <- seq_along(labels)
names(code_labels) <- labels

# Split the data into training and validation data sets
# Create partitions different splits of the input data
frac <- 0.2
train_samples <- dplyr::group_by(train_samples, .data[["label"]])
test_samples <- train_samples |>
    dplyr::slice_sample(prop = frac) |>
    dplyr::ungroup()
    
# Remove the lines used for validation
sel <- !train_samples[["sample_id"]] %in% test_samples[["sample_id"]]
train_samples <- train_samples[sel, ]

# Shuffle the data
train_samples <- train_samples[sample(nrow(train_samples), nrow(train_samples)), ]
test_samples <- test_samples[sample(nrow(test_samples), nrow(test_samples)), ]

# Organize data for model training
train_x <- as.matrix(train_samples[, -2:0])
train_y <- unname(code_labels[train_samples[["label"]]])

# Create the test data
test_x <- as.matrix(test_samples[, -2:0])
test_y <- unname(code_labels[test_samples[["label"]]])

# Set torch seed
torch::torch_manual_seed(sample.int(10^5, 1))

# Avoid a global variable for 'self'
self <- NULL

# function to create a simple sequential NN module
.torch_linear_relu_dropout <- torch::nn_module(
    classname = "torch_linear_batch_norm_relu_dropout",
    initialize = function(input_dim,
                          output_dim,
                          dropout_rate) {
        self$block <- torch::nn_sequential(
            torch::nn_linear(input_dim, output_dim),
            torch::nn_relu(),
            torch::nn_dropout(dropout_rate)
        )
    },
    forward = function(x) {
        self$block(x)
    }
)

# Define the MLP architecture
mlp_model <- torch::nn_module(
    initialize = function(num_pred, layers, dropout_rates, y_dim) {
        tensors <- list()
        # input layer
        tensors[[1]] <- .torch_linear_relu_dropout(
            input_dim = num_pred,
            output_dim = 512,
            dropout_rate = 0.40
        )
        # output layer
        tensors[[length(tensors) + 1]] <-
            torch::nn_linear(layers[length(layers)], y_dim)
        # add softmax tensor
        tensors[[length(tensors) + 1]] <- torch::nn_softmax(dim = 2)
        # create a sequential module that calls the layers in the same
        # order.
        self$model <- torch::nn_sequential(!!!tensors)
    },
    forward = function(x) {
        self$model(x)
    }
)
# Train the model using luz

torch_model <- luz::setup(
    module = mlp_model,
    loss = torch::nn_cross_entropy_loss(),
    metrics = list(luz::luz_metric_accuracy()),
    optimizer = torch::optim_adamw,
)
torch_model <- luz::set_hparams(
    torch_model,
    num_pred = ncol(train_x),
    layers = 512,
    dropout_rates = 0.3,
    y_dim = length(code_labels)
)
torch_model <- luz::set_opt_hparams(
    torch_model,
    lr = 0.001,
    eps = 1e-08,
    weight_decay = 1.0e-06
)
torch_model <- luz::fit(
    torch_model,
    data = list(train_x, train_y),
    epochs = 100,
    valid_data = list(test_x, test_y),
    callbacks = list(luz::luz_callback_early_stopping(
        patience = 20,
        min_delta = 0.01
    )),
  verbose = TRUE
)

The error occurs in the luz::fit function. Inside RStudio, the code gets stuck and then RStudio asks to restart R. When running R from the terminal, the output is:

 *** caught bus error ***
address 0x16daa0000, cause 'invalid alignment'

 *** caught segfault ***
address 0x9, cause 'invalid permissions'
zsh: segmentation fault  R

The sessionInfo() output is as follows:


R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Sao_Paulo
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] crayon_1.5.2      vctrs_0.6.5       cli_3.6.2         zeallot_0.1.0    
 [5] rlang_1.1.3       processx_3.8.4    generics_0.1.3    torch_0.12.0.9000
 [9] coro_1.0.4        glue_1.7.0        bit_4.0.5         prettyunits_1.2.0
[13] luz_0.4.0         ps_1.7.6          hms_1.1.3         fansi_1.0.6      
[17] tibble_3.2.1      progress_1.2.3    lifecycle_1.0.4   compiler_4.4.0   
[21] dplyr_1.1.4       fs_1.6.4          Rcpp_1.0.12       pkgconfig_2.0.3  
[25] rstudioapi_0.16.0 R6_2.5.1          tidyselect_1.2.1  utf8_1.2.4       
[29] pillar_1.9.0      callr_3.7.6       magrittr_2.0.3    tools_4.4.0      
[33] bit64_4.0.5     
@dfalbel
Copy link
Member

dfalbel commented May 20, 2024

Can you show me the output of torch::install_torch(reinstall = TRUE) ? Also, I'assuming it doesnt fail if you run eg: torch_randn(10)`?

@gilbertocamara
Copy link
Author

Sure!

torch::install_torch(reinstall = TRUE)
trying URL 'https://github.com/mlverse/libtorch-mac-m1/releases/download/LibTorch-for-R/libtorch-v2.0.1.zip'
Content type 'application/octet-stream' length 49631992 bytes (47.3 MB)
==================================================
downloaded 47.3 MB

trying URL 'https://torch-cdn.mlverse.org/binaries/refs/heads/main/latest/lantern-0.12.0.9000+cpu+arm64-Darwin.zip'
Content type 'application/zip' length 3602457 bytes (3.4 MB)
==================================================
downloaded 3.4 MB

✔ torch dependencies have been installed.
ℹ You must restart your session to use torch correctly.

Running a simple command such as torch_randn(10) works.

torch::torch_randn(10)
torch_tensor
 0.8753
 0.9061
-1.8905
-0.2683
-0.4204
-0.3306
 1.1119
 0.0052
 0.3246
-0.2530
[ CPUFloatType{10} ]

torch also can access the M3 MPS. The following works.

x <- torch::torch_randn(10, 10)$to(device="mps")
y <- torch::torch_randn(10, 10)$to(device="mps")

torch::torch_mm(x, y)

The problems appear on the luz::fit() function. We compiled the lantern library from source, and tried to install it as follows.

# compiled lantern from source and configured env variables as follows
devtools::install(build = FALSE)
Running /Library/Frameworks/R.framework/Resources/bin/R CMD INSTALL \
  /Users/gilberto/torch --install-tests 
* installing to library ‘/Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library’
* installing *source* package ‘torch’ ...
** using staged installation
CMAKE_FLAGS: 
** libs
con compilatore C++: ‘Apple clang version 15.0.0 (clang-1500.3.9.4)’
con SDK: ‘MacOSX14.4.sdk’
*** Building lantern!
mkdir -p ../build-lantern
cd ../build-lantern && cmake ../src/lantern -DCMAKE_INSTALL_PREFIX=/Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/00LOCK-torch/00new/torch -DCMAKE_INSTALL_MESSAGE="LAZY"  && cmake --build . --target install --config Release
### Lots of output...
-- Build files have been written to: /Users/gilberto/torch/build-lantern

## We then configured the env variables
Sys.setenv(LANTERN_URL="/Users/gilberto/torch/build-lantern")
Sys.setenv(TORCH_URL="/Users/gilberto/torch/build-lantern/libtorch")
## We then tried to install torch after this, but if falis

Either there is a problem with the lantern code when using M3, or we have failed to install correctly after compiling from source.

@dfalbel
Copy link
Member

dfalbel commented May 20, 2024

You might want to try setting the env var BUILD_LANTERN=1 then running remotes::install_github("mlverse/torch") to build lantern from source. Although, I don't think lantern is the culprit here, as it's just a relatively thin wrapper around LibTorch. You might also need to build LibTorch from source.

@dfalbel
Copy link
Member

dfalbel commented May 20, 2024

Also, have you tried installing pre-built binaries from with eg:

kind <- "cpu"
version <- "0.12.0.9000"
options(repos = c(
  torch = sprintf("https://torch-cdn.mlverse.org/packages/%s/%s/", kind, version),
  CRAN = "https://cloud.r-project.org" # or any other from which you want to install the other R dependencies.
))
install.packages("torch", type = "binary")

@gilbertocamara
Copy link
Author

Thanks! I have tried, but failed.

@dfalbel
Copy link
Member

dfalbel commented May 20, 2024

Can you also try disabling MPS on luz, just so we can narrow a little more the problem.

You can do something like:

torch_model <- luz::fit(
    torch_model,
    data = list(train_x, train_y),
    epochs = 100,
    valid_data = list(test_x, test_y),
    callbacks = list(luz::luz_callback_early_stopping(
        patience = 20,
        min_delta = 0.01
    )),
  verbose = TRUE,
  accelerator = accelerator(cpu = TRUE)
)

@gilbertocamara
Copy link
Author

Works!!! Can we now make luz work on MPS?

@dfalbel
Copy link
Member

dfalbel commented May 20, 2024

I think we will need to figure out why torch fails on M3 + MPS for that model. I believe it's possible that you will need to build LibTorch from source to fix this issue.

@gilbertocamara
Copy link
Author

How do I build libtorch and liblantern from source?

@dfalbel
Copy link
Member

dfalbel commented May 20, 2024

To build LibTorch from source, you can follow instructions the steps in this workflow file:

https://github.com/mlverse/libtorch-mac-m1/blob/main/.github/workflows/libtorch.yaml

Then copy the libtorch files into src/lantern/build and run load_all or dev tools::install with BUILD_LANTERN=1 set.

@gilbertocamara
Copy link
Author

Thanks!! I will try

@gilbertocamara
Copy link
Author

Dear @dfalbel we tried to build torch from source, but it did not work on Mac M3 chip. Looking at the pytorch github, other developers are having similar problems with the new M3 chip. Please see the following issue:

pytorch/pytorch#125803

@DenaJGibbon
Copy link

Hello. I had a similar issue, but after I upgraded to macOS Sonoma 14.4.1 on a Mac M2. I posted on the Luz GitHub, but was happy to see some discussion here.

mlverse/luz#143

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants