torch fails on new Mac M3 architecture #1167

gilbertocamara · 2024-05-18T21:46:13Z

Dear @dfalbel, I have bought a new MacBook Air with the M3 chip, which has 8 CPU cores, 10 GPU cores, and 16 GB of integrated memory. My R torch apps are crashing. I have put together an MWE that works on all other architectures, including a MacBook Air M1 and a Mac mini. The OS is the same (Sonoma 14.5). The MWE follows:

# ==== MWE

# Download the training samples
rds_file <- "https://raw.githubusercontent.com/e-sensing/sitsdata/master/inst/extdata/torch/train_samples.rds?raw=true"
dest_file <- paste0(tempdir(),"/train_samples.rds")
download.file(rds_file,
              destfile = dest_file,
              method = "curl")
train_samples <- readRDS(dest_file)

# Sample labels
labels <- c("Cerrado", "Forest", "Pasture", "Soy_Corn")

# Create numeric labels vector
code_labels <- seq_along(labels)
names(code_labels) <- labels

# Split the data into training and validation data sets
# Create partitions (different splits of the input data)
frac <- 0.2
train_samples <- dplyr::group_by(train_samples, .data[["label"]])
test_samples <- train_samples |>
    dplyr::slice_sample(prop = frac) |>
    dplyr::ungroup()
    
# Remove the lines used for validation
sel <- !train_samples[["sample_id"]] %in% test_samples[["sample_id"]]
train_samples <- train_samples[sel, ]

# Shuffle the data
train_samples <- train_samples[sample(nrow(train_samples), nrow(train_samples)), ]
test_samples <- test_samples[sample(nrow(test_samples), nrow(test_samples)), ]

# Organize data for model training
train_x <- as.matrix(train_samples[, -2:0])
train_y <- unname(code_labels[train_samples[["label"]]])

# Create the test data
test_x <- as.matrix(test_samples[, -2:0])
test_y <- unname(code_labels[test_samples[["label"]]])

# Set torch seed
torch::torch_manual_seed(sample.int(10^5, 1))

# Avoid a global variable for 'self'
self <- NULL

# function to create a simple sequential NN module
.torch_linear_relu_dropout <- torch::nn_module(
    classname = "torch_linear_batch_norm_relu_dropout",
    initialize = function(input_dim,
                          output_dim,
                          dropout_rate) {
        self$block <- torch::nn_sequential(
            torch::nn_linear(input_dim, output_dim),
            torch::nn_relu(),
            torch::nn_dropout(dropout_rate)
        )
    },
    forward = function(x) {
        self$block(x)
    }
)

# Define the MLP architecture
mlp_model <- torch::nn_module(
    initialize = function(num_pred, layers, dropout_rates, y_dim) {
        tensors <- list()
        # input layer
        tensors[[1]] <- .torch_linear_relu_dropout(
            input_dim = num_pred,
            output_dim = 512,
            dropout_rate = 0.40
        )
        # output layer
        tensors[[length(tensors) + 1]] <-
            torch::nn_linear(layers[length(layers)], y_dim)
        # add softmax tensor
        tensors[[length(tensors) + 1]] <- torch::nn_softmax(dim = 2)
        # create a sequential module that calls the layers in the same
        # order.
        self$model <- torch::nn_sequential(!!!tensors)
    },
    forward = function(x) {
        self$model(x)
    }
)
# Train the model using luz

torch_model <- luz::setup(
    module = mlp_model,
    loss = torch::nn_cross_entropy_loss(),
    metrics = list(luz::luz_metric_accuracy()),
    optimizer = torch::optim_adamw
)
torch_model <- luz::set_hparams(
    torch_model,
    num_pred = ncol(train_x),
    layers = 512,
    dropout_rates = 0.3,
    y_dim = length(code_labels)
)
torch_model <- luz::set_opt_hparams(
    torch_model,
    lr = 0.001,
    eps = 1e-08,
    weight_decay = 1.0e-06
)
torch_model <- luz::fit(
    torch_model,
    data = list(train_x, train_y),
    epochs = 100,
    valid_data = list(test_x, test_y),
    callbacks = list(luz::luz_callback_early_stopping(
        patience = 20,
        min_delta = 0.01
    )),
    verbose = TRUE
)

The error occurs in the luz::fit function. Inside RStudio, the code gets stuck and then RStudio asks to restart R. When running R from the terminal, the output is:

 *** caught bus error ***
address 0x16daa0000, cause 'invalid alignment'

 *** caught segfault ***
address 0x9, cause 'invalid permissions'
zsh: segmentation fault  R

The sessionInfo() output is as follows:


R version 4.4.0 (2024-04-24)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.5

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Sao_Paulo
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] crayon_1.5.2      vctrs_0.6.5       cli_3.6.2         zeallot_0.1.0    
 [5] rlang_1.1.3       processx_3.8.4    generics_0.1.3    torch_0.12.0.9000
 [9] coro_1.0.4        glue_1.7.0        bit_4.0.5         prettyunits_1.2.0
[13] luz_0.4.0         ps_1.7.6          hms_1.1.3         fansi_1.0.6      
[17] tibble_3.2.1      progress_1.2.3    lifecycle_1.0.4   compiler_4.4.0   
[21] dplyr_1.1.4       fs_1.6.4          Rcpp_1.0.12       pkgconfig_2.0.3  
[25] rstudioapi_0.16.0 R6_2.5.1          tidyselect_1.2.1  utf8_1.2.4       
[29] pillar_1.9.0      callr_3.7.6       magrittr_2.0.3    tools_4.4.0      
[33] bit64_4.0.5

dfalbel · 2024-05-20T17:26:54Z

Can you show me the output of torch::install_torch(reinstall = TRUE)? Also, I'm assuming it doesn't fail if you run e.g. torch_randn(10)?

gilbertocamara · 2024-05-20T17:48:21Z

Sure!

torch::install_torch(reinstall = TRUE)
trying URL 'https://github.com/mlverse/libtorch-mac-m1/releases/download/LibTorch-for-R/libtorch-v2.0.1.zip'
Content type 'application/octet-stream' length 49631992 bytes (47.3 MB)
==================================================
downloaded 47.3 MB

trying URL 'https://torch-cdn.mlverse.org/binaries/refs/heads/main/latest/lantern-0.12.0.9000+cpu+arm64-Darwin.zip'
Content type 'application/zip' length 3602457 bytes (3.4 MB)
==================================================
downloaded 3.4 MB

✔ torch dependencies have been installed.
ℹ You must restart your session to use torch correctly.

Running a simple command such as torch_randn(10) works.

torch::torch_randn(10)
torch_tensor
 0.8753
 0.9061
-1.8905
-0.2683
-0.4204
-0.3306
 1.1119
 0.0052
 0.3246
-0.2530
[ CPUFloatType{10} ]

torch can also access the MPS device on the M3. The following works.

x <- torch::torch_randn(10, 10)$to(device="mps")
y <- torch::torch_randn(10, 10)$to(device="mps")

torch::torch_mm(x, y)
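
Before moving tensors to "mps", it can help to confirm what torch itself reports about the backend. The following is a minimal sketch, not part of the original report: `mps_status()` is a hypothetical helper name, and it assumes that `torch::torch_is_installed()` and `torch::backends_mps_is_available()` are available in the installed torch version.

```r
# Hypothetical helper: summarise whether torch can see the MPS backend.
# Returns a single descriptive string and never touches the GPU itself.
mps_status <- function() {
    if (!requireNamespace("torch", quietly = TRUE)) {
        return("torch not installed")
    }
    if (!torch::torch_is_installed()) {
        return("libtorch not installed")
    }
    # Assumption: backends_mps_is_available() exists in this torch version.
    if (torch::backends_mps_is_available()) {
        "mps available"
    } else {
        "mps not available"
    }
}

mps_status()
```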

The problems appear in the luz::fit() function. We compiled the lantern library from source and tried to install it as follows.

# compiled lantern from source and configured env variables as follows
devtools::install(build = FALSE)
Running /Library/Frameworks/R.framework/Resources/bin/R CMD INSTALL \
  /Users/gilberto/torch --install-tests 
* installing to library ‘/Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library’
* installing *source* package ‘torch’ ...
** using staged installation
CMAKE_FLAGS: 
** libs
using C++ compiler: ‘Apple clang version 15.0.0 (clang-1500.3.9.4)’
using SDK: ‘MacOSX14.4.sdk’
*** Building lantern!
mkdir -p ../build-lantern
cd ../build-lantern && cmake ../src/lantern -DCMAKE_INSTALL_PREFIX=/Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/library/00LOCK-torch/00new/torch -DCMAKE_INSTALL_MESSAGE="LAZY"  && cmake --build . --target install --config Release
### Lots of output...
-- Build files have been written to: /Users/gilberto/torch/build-lantern

## We then configured the env variables
Sys.setenv(LANTERN_URL="/Users/gilberto/torch/build-lantern")
Sys.setenv(TORCH_URL="/Users/gilberto/torch/build-lantern/libtorch")
## We then tried to install torch after this, but it fails

Either there is a problem with the lantern code when using M3, or we have failed to install correctly after compiling from source.

dfalbel · 2024-05-20T18:38:51Z

You might want to try setting the env var BUILD_LANTERN=1 then running remotes::install_github("mlverse/torch") to build lantern from source. Although, I don't think lantern is the culprit here, as it's just a relatively thin wrapper around LibTorch. You might also need to build LibTorch from source.
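
The suggestion above could be sketched as follows. This is only an illustration under stated assumptions: `install_torch_from_source()` is a hypothetical helper name, the remotes package is assumed to be installed, and only `BUILD_LANTERN=1` and `remotes::install_github("mlverse/torch")` come from the comment itself. The `run` argument is added so the helper can be inspected without starting a long build.

```r
# Hypothetical helper: install torch with lantern built from source.
# BUILD_LANTERN is set only for the duration of the call and then restored.
install_torch_from_source <- function(run = FALSE) {
    old <- Sys.getenv("BUILD_LANTERN", unset = NA)
    Sys.setenv(BUILD_LANTERN = "1")
    on.exit(
        if (is.na(old)) {
            Sys.unsetenv("BUILD_LANTERN")
        } else {
            Sys.setenv(BUILD_LANTERN = old)
        },
        add = TRUE
    )
    if (run) {
        # Assumption: the remotes package is installed.
        remotes::install_github("mlverse/torch")
    }
    # Report the value that was active while the install would have run.
    invisible(Sys.getenv("BUILD_LANTERN"))
}

# Dry run: sets and restores the variable without installing anything.
install_torch_from_source(run = FALSE)
```

Calling `install_torch_from_source(run = TRUE)` would start the actual build, which can take a long time.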

dfalbel · 2024-05-20T18:40:53Z

Also, have you tried installing the pre-built binaries with, e.g.:

kind <- "cpu"
version <- "0.12.0.9000"
options(repos = c(
  torch = sprintf("https://torch-cdn.mlverse.org/packages/%s/%s/", kind, version),
  CRAN = "https://cloud.r-project.org" # or any other from which you want to install the other R dependencies.
))
install.packages("torch", type = "binary")

gilbertocamara · 2024-05-20T18:59:01Z

Thanks! I have tried, but failed.

dfalbel · 2024-05-20T19:02:01Z

Can you also try disabling MPS in luz, just so we can narrow down the problem a little more?

You can do something like:

torch_model <- luz::fit(
    torch_model,
    data = list(train_x, train_y),
    epochs = 100,
    valid_data = list(test_x, test_y),
    callbacks = list(luz::luz_callback_early_stopping(
        patience = 20,
        min_delta = 0.01
    )),
    verbose = TRUE,
    # luz is not attached in the MWE, so the call is namespaced
    accelerator = luz::accelerator(cpu = TRUE)
)
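
If the same script has to run on several machines, the CPU fallback can be made conditional. The sketch below is an assumption-laden illustration, not something from luz: `force_cpu()` is a hypothetical helper and `LUZ_ALLOW_MPS` is a made-up environment variable used here only as an opt-in switch. It forces CPU on Apple Silicon Macs unless that variable is set to "1".

```r
# Hypothetical helper: decide whether training should be forced onto the CPU.
# Assumption: Apple Silicon reports sysname "Darwin" and machine "arm64".
# LUZ_ALLOW_MPS is an invented opt-in switch, not a luz or torch setting.
force_cpu <- function(sysname = Sys.info()[["sysname"]],
                      machine = Sys.info()[["machine"]],
                      allow_mps = Sys.getenv("LUZ_ALLOW_MPS", "0")) {
    is_apple_silicon <- identical(sysname, "Darwin") &&
        identical(machine, "arm64")
    is_apple_silicon && !identical(allow_mps, "1")
}

# Usage with the MWE objects (requires luz, so it is left commented out):
# torch_model <- luz::fit(
#     torch_model,
#     data = list(train_x, train_y),
#     epochs = 100,
#     valid_data = list(test_x, test_y),
#     accelerator = luz::accelerator(cpu = force_cpu())
# )
```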

gilbertocamara · 2024-05-20T19:11:07Z

Works!!! Can we now make luz work on MPS?

dfalbel · 2024-05-20T20:29:47Z

I think we will need to figure out why torch fails on M3 + MPS for that model. I believe it's possible that you will need to build LibTorch from source to fix this issue.
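
One possible way to narrow this down without losing the R session on every crash is to run each candidate operation in a separate process and record which ones exit cleanly. The sketch below is only an illustration: `run_isolated()` and the `probes` vector are hypothetical, and the probe snippets are guesses at operations worth testing, not a confirmed reproduction of the failure.

```r
# Run a piece of R code in a separate Rscript process.
# Returns TRUE if the child exits with status 0, FALSE otherwise, so a bus
# error or segfault kills only the child process.
run_isolated <- function(code) {
    rscript <- file.path(R.home("bin"), "Rscript")
    status <- system2(rscript, c("-e", shQuote(code)),
                      stdout = FALSE, stderr = FALSE)
    status == 0
}

# Hypothetical probes, from a plain matrix product to layers used in the MWE.
probes <- c(
    matmul = 'x <- torch::torch_randn(10, 10, device = "mps"); torch::torch_mm(x, x)',
    linear = 'm <- torch::nn_linear(10, 4)$to(device = "mps"); m(torch::torch_randn(8, 10, device = "mps"))',
    softmax = 'x <- torch::torch_randn(8, 4, device = "mps"); torch::nnf_softmax(x, dim = 2)'
)

# Not run here, since it needs torch and an MPS device:
# results <- vapply(probes, run_isolated, logical(1))
# print(results)
```

Probes that return FALSE point at the operations to report upstream; extending the list with a backward pass or an optimizer step would follow the same pattern.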

gilbertocamara · 2024-05-20T20:32:49Z

How do I build libtorch and liblantern from source?

dfalbel · 2024-05-20T20:38:32Z

To build LibTorch from source, you can follow the steps in this workflow file:

https://github.com/mlverse/libtorch-mac-m1/blob/main/.github/workflows/libtorch.yaml

Then copy the libtorch files into src/lantern/build and run devtools::load_all() or devtools::install() with BUILD_LANTERN=1 set.

gilbertocamara · 2024-05-20T20:44:21Z

Thanks!! I will try

gilbertocamara · 2024-05-21T17:32:55Z

Dear @dfalbel, we tried to build torch from source, but it did not work on the Mac M3 chip. Looking at the PyTorch GitHub repository, other developers are having similar problems with the new M3 chip. Please see the following issue:

pytorch/pytorch#125803

DenaJGibbon · 2024-05-25T11:09:29Z

Hello. I had a similar issue, but it appeared after I upgraded to macOS Sonoma 14.4.1 on a Mac M2. I posted on the luz GitHub, but was happy to see some discussion here.

mlverse/luz#143

DenaJGibbon mentioned this issue May 25, 2024

Apparent issue with macOS Sonoma 14.4.1 mlverse/luz#143

Open
