
Not able to convert Llama 3.2 1B Instruct to Tflite format #269

Open
atultiwari opened this issue Sep 29, 2024 · 5 comments

Comments

@atultiwari

Description of the bug:

I am using Google Colab Pro+ (with high RAM) to convert the Llama 3.2 1B Instruct model to TFLite format (for later use in a MediaPipe Android app). To do that:

  1. I downloaded the safetensors file from the unsloth Hugging Face repo (link).
  2. I updated the convert script with the path of the downloaded safetensors file.
  3. I solved a torch-xla related issue with the following install:
    !pip install https://storage.googleapis.com/pytorch-xla-releases/wheels/cuda/12.4/torch_xla-2.4.0-cp310-cp310-linux_x86_64.whl

However, I am now getting the error flatbuffers.builder.BuilderSizeError: flatbuffers: cannot grow buffer beyond 2 gigabytes.
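
For context, here is a rough back-of-envelope I did (my own estimate, not from the converter's output), assuming Llama 3.2 1B has roughly 1.24 billion parameters. FlatBuffers uses 32-bit offsets, so a single serialized buffer cannot exceed 2 GiB, and unquantized 16-bit weights alone already blow past that; 8-bit weights would fit:

# Rough size estimate (assumption: ~1.24e9 parameters for Llama 3.2 1B Instruct).
# FlatBuffers uses 32-bit offsets, so a serialized model buffer is capped at 2 GiB.
params = 1.24e9

fp16_gib = params * 2 / 2**30   # 16-bit weights -> ~2.31 GiB (over the 2 GiB cap)
int8_gib = params * 1 / 2**30   # 8-bit weights  -> ~1.15 GiB (under the cap)

print(f"fp16: {fp16_gib:.2f} GiB, int8: {int8_gib:.2f} GiB")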

If it helps, here is a link to the Google Colab notebook.

Actual vs expected behavior:

Expected behavior

  • The TFLite file should have been created without any errors.

Actual behavior

  • I got the error mentioned above; the error logs are as follows.

Any other information you'd like to share?

Error Log

/content/ai-edge-torch/ai_edge_torch/generative/examples/llama
2024-09-29 19:20:17.359533: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable TF_ENABLE_ONEDNN_OPTS=0.
2024-09-29 19:20:17.377103: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1727637617.398465 5173 cuda_dnn.cc:8312] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1727637617.405059 5173 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-09-29 19:20:17.426425: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
/usr/local/lib/python3.10/dist-packages/torch_xla/__init__.py:202: UserWarning: `tensorflow` can conflict with `torch-xla`. Prefer `tensorflow-cpu` when using PyTorch/XLA. To silence this warning, `pip uninstall -y tensorflow && pip install tensorflow-cpu`. If you are in a notebook environment such as Colab or Kaggle, restart your notebook runtime afterwards.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/_subclasses/functional_tensor.py:362: UserWarning: At pre-dispatch tracing, we will assume that any custom op that is marked with CompositeImplicitAutograd and functional are safe to not decompose. We found xla.mark_tensor.default to be one such op.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/torch/_subclasses/functional_tensor.py:362: UserWarning: At pre-dispatch tracing, we will assume that any custom op that is marked with CompositeImplicitAutograd and functional are safe to not decompose. We found xla.mark_tensor.default to be one such op.
  warnings.warn(
W0929 19:22:15.024696 133865848435328 runtime.py:42] PJRT is now the default runtime. For more information, see https://github.com/pytorch/xla/blob/master/docs/pjrt.md
W0929 19:22:15.024884 133865848435328 runtime.py:59] Defaulting to PJRT_DEVICE=CPU
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
I0000 00:00:1727637735.028522 5173 cpu_client.cc:467] TfrtCpuClient created.
2024-09-29 19:22:37.086796: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
I0000 00:00:1727637757.086944 5173 gpu_device.cc:2022] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 38554 MB memory: -> device: 0, name: NVIDIA A100-SXM4-40GB, pci bus id: 0000:00:04.0, compute capability: 8.0
I0929 19:22:41.535187 133865848435328 signature_serialization.py:156] Function `inner` contains input name(s) resource with unsupported characters which will be renamed to xlacallmodule_readvariableop_117_resource in the SavedModel.
I0929 19:22:41.651360 133865848435328 signature_serialization.py:156] Function `inner` contains input name(s) resource with unsupported characters which will be renamed to xlacallmodule_readvariableop_117_resource in the SavedModel.
I0929 19:22:42.652768 133865848435328 functional_saver.py:440] Sharding callback duration: 67
I0929 19:22:46.771306 133865848435328 functional_saver.py:440] Sharding callback duration: 105
INFO:tensorflow:Assets written to: /tmp/tmphr1fv8ev/assets
I0929 19:22:58.322300 133865848435328 builder_impl.py:836] Assets written to: /tmp/tmphr1fv8ev/assets
I0929 19:22:58.358078 133865848435328 fingerprinting_utils.py:49] Writing fingerprint to /tmp/tmphr1fv8ev/fingerprint.pb
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
W0000 00:00:1727637787.839623 5173 tf_tfl_flatbuffer_helpers.cc:365] Ignored output_format.
W0000 00:00:1727637787.839659 5173 tf_tfl_flatbuffer_helpers.cc:368] Ignored drop_control_dependency.
2024-09-29 19:23:07.840485: I tensorflow/cc/saved_model/reader.cc:83] Reading SavedModel from: /tmp/tmphr1fv8ev
2024-09-29 19:23:07.847678: I tensorflow/cc/saved_model/reader.cc:52] Reading meta graph with tags { serve }
2024-09-29 19:23:07.847723: I tensorflow/cc/saved_model/reader.cc:147] Reading SavedModel debug info (if present) from: /tmp/tmphr1fv8ev
I0000 00:00:1727637787.889566 5173 mlir_graph_optimization_pass.cc:401] MLIR V1 optimization pass is not enabled
2024-09-29 19:23:07.895510: I tensorflow/cc/saved_model/loader.cc:236] Restoring SavedModel bundle.
2024-09-29 19:23:10.512704: I tensorflow/cc/saved_model/loader.cc:220] Running initialization op on SavedModel bundle at path: /tmp/tmphr1fv8ev
2024-09-29 19:23:10.591254: I tensorflow/cc/saved_model/loader.cc:466] SavedModel load for tags { serve }; Status: success: OK. Took 2750774 microseconds.
2024-09-29 19:23:10.649508: I tensorflow/compiler/mlir/tensorflow/utils/dump_mlir_util.cc:268] disabling MLIR crash reproducer, set env var `MLIR_CRASH_REPRODUCER_DIRECTORY` to enable.
2024-09-29 19:32:25.463338: I tensorflow/compiler/mlir/lite/flatbuffer_export.cc:3893] Estimated count of arithmetic ops: 2586.261 G ops, equivalently 1293.130 G MACs
Traceback (most recent call last):
  File "/content/ai-edge-torch/ai_edge_torch/generative/examples/llama/convert_to_tflite.py", line 68, in <module>
    app.run(main)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.10/dist-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/content/ai-edge-torch/ai_edge_torch/generative/examples/llama/convert_to_tflite.py", line 59, in main
    converter.convert_to_tflite(
  File "/content/ai-edge-torch/ai_edge_torch/generative/utilities/converter.py", line 62, in convert_to_tflite
    ai_edge_torch.signature(
  File "/content/ai-edge-torch/ai_edge_torch/_convert/converter.py", line 163, in convert
    return conversion.convert_signatures(
  File "/content/ai-edge-torch/ai_edge_torch/_convert/conversion.py", line 105, in convert_signatures
    tflite_model = lowertools.exported_programs_to_tflite(
  File "/content/ai-edge-torch/ai_edge_torch/lowertools/_shim.py", line 75, in exported_programs_to_tflite
    return utils.merged_bundle_to_tfl_model(
  File "/content/ai-edge-torch/ai_edge_torch/lowertools/torch_xla_utils.py", line 280, in merged_bundle_to_tfl_model
    tflite_model = translate_recipe.quantize_model(
  File "/content/ai-edge-torch/ai_edge_torch/lowertools/translate_recipe.py", line 162, in quantize_model
    result = qt.quantize()
  File "/usr/local/lib/python3.10/dist-packages/ai_edge_quantizer/quantizer.py", line 243, in quantize
    quantized_model = self._get_quantized_model(quant_params)
  File "/usr/local/lib/python3.10/dist-packages/ai_edge_quantizer/quantizer.py", line 331, in _get_quantized_model
    return model_modifier_instance.modify_model(quant_params)
  File "/usr/local/lib/python3.10/dist-packages/ai_edge_quantizer/model_modifier.py", line 85, in modify_model
    return self._serialize_small_model(quantized_model)
  File "/usr/local/lib/python3.10/dist-packages/ai_edge_quantizer/model_modifier.py", line 178, in _serialize_small_model
    model_bytearray = flatbuffer_utils.convert_object_to_bytearray(
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/lite/tools/flatbuffer_utils.py", line 122, in convert_object_to_bytearray
    model_offset = model_object.Pack(builder)
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/lite/python/schema_py_generated.py", line 18390, in Pack
    bufferslist.append(self.buffers[i].Pack(builder))
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/lite/python/schema_py_generated.py", line 17650, in Pack
    data = builder.CreateNumpyVector(self.data)
  File "/usr/local/lib/python3.10/dist-packages/flatbuffers/builder.py", line 503, in CreateNumpyVector
    self.StartVector(x.itemsize, x.size, x.dtype.alignment)
  File "/usr/local/lib/python3.10/dist-packages/flatbuffers/builder.py", line 400, in StartVector
    self.Prep(N.Uint32Flags.bytewidth, elemSize*numElems)
  File "/usr/local/lib/python3.10/dist-packages/flatbuffers/builder.py", line 354, in Prep
    self.growByteBuffer()
  File "/usr/local/lib/python3.10/dist-packages/flatbuffers/builder.py", line 303, in growByteBuffer
    raise BuilderSizeError(msg)
flatbuffers.builder.BuilderSizeError: flatbuffers: cannot grow buffer beyond 2 gigabytes
I0000 00:00:1727638470.142136 5173 cpu_client.cc:470] TfrtCpuClient destroyed.

@atultiwari atultiwari added the type:bug Bug label Sep 29, 2024
@pkgoogle pkgoogle self-assigned this Sep 30, 2024
@pkgoogle
Contributor

Hi @atultiwari, can you gain access to the "official" weights? https://huggingface.co/meta-llama/Llama-3.2-3B ... I'm not entirely sure what the differences between the official release and the unsloth version are, but we will likely run into fewer issues with that route. There is also a dedicated convert_3b_to_tflite.py script available now; perhaps that will resolve your issue: llama example

@pkgoogle pkgoogle added the status:awaiting user response When awaiting user response label Sep 30, 2024
@atultiwari
Author

Hi @atultiwari, can you gain access to the "official" weights? https://huggingface.co/meta-llama/Llama-3.2-3B ... I'm not entirely sure what the differences between the official release and the unsloth version are, but we will likely run into fewer issues with that route. There is also a dedicated convert_3b_to_tflite.py script available now; perhaps that will resolve your issue: llama example

Hi, thank you for your reply.
I tried both the official Llama 3.2 1B and 3B models. Unfortunately:

  1. For the 3B version, my Google Colab Pro+ account (A100, 40 GB VRAM, high RAM) hit an out-of-memory error:

     INFO:tensorflow:Assets written to: /tmp/tmpbqvjlqee/assets
     I1001 07:05:05.760098 135071553254016 builder_impl.py:836] Assets written to: /tmp/tmpbqvjlqee/assets
     I1001 07:05:05.835448 135071553254016 fingerprinting_utils.py:49] Writing fingerprint to /tmp/tmpbqvjlqee/fingerprint.pb
     2024-10-01 07:05:28.240182: W external/local_xla/xla/tsl/framework/bfc_allocator.cc:497] Allocator (GPU_0_bfc) ran out of memory trying to allocate 96.00MiB (rounded to 100663296) requested by op Identity

  2. For the 1B version, I got the same error with the official Llama 3.2 1B model as in my original message (which I received with the unsloth weights):

     2024-10-01 07:23:19.558938: I tensorflow/compiler/mlir/lite/flatbuffer_export.cc:3893] Estimated count of arithmetic ops: 2586.261 G ops, equivalently 1293.130 G MACs
     flatbuffers.builder.BuilderSizeError: flatbuffers: cannot grow buffer beyond 2 gigabytes

For detailed error logs, kindly have a look at the Google Colab notebook (link), which uses the official Llama 3.2 3B and 1B models.
Thank you.

@pkgoogle
Contributor

pkgoogle commented Oct 1, 2024

I'm getting a different error with Python 3.11 and the nightly install (and an A100):

python convert_3b_to_tflite.py --checkpoint_path=/xxxxxxxxx/git/Llama-3.2-3B
2024-10-01 20:35:51.862016: I tensorflow/core/util/port.cc:153] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2024-10-01 20:35:51.880245: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1727814951.903249    6648 cuda_dnn.cc:8312] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1727814951.910202    6648 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-10-01 20:35:51.932554: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 AVX512F AVX512_VNNI FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/xxxxxxxxx/git/ai-edge-torch/ai_edge_torch/generative/examples/llama/convert_3b_to_tflite.py", line 68, in <module>
    app.run(main)
  File "/xxxxxxxxx/git/ai-edge-torch/aet_269/lib/python3.11/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/xxxxxxxxx/git/ai-edge-torch/aet_269/lib/python3.11/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
             ^^^^^^^^^^
  File "/xxxxxxxxx/git/ai-edge-torch/ai_edge_torch/generative/examples/llama/convert_3b_to_tflite.py", line 54, in main
    pytorch_model = llama.build_3b_model(
                    ^^^^^^^^^^^^^^^^^^^^^
  File "/xxxxxxxxx/git/ai-edge-torch/aet_269/lib/python3.11/site-packages/ai_edge_torch/generative/examples/llama/llama.py", line 202, in build_3b_model
    loader.load(model, strict=False)
  File "/xxxxxxxxx/git/ai-edge-torch/aet_269/lib/python3.11/site-packages/ai_edge_torch/generative/utilities/loader.py", line 153, in load
    state = self._loader(self._file_name)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/xxxxxxxxx/git/ai-edge-torch/aet_269/lib/python3.11/site-packages/ai_edge_torch/generative/utilities/loader.py", line 49, in load_safetensors
    with safe_open(file, framework="pt") as fp:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge

I get the same error with 1B and the other script.
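
For what it's worth, HeaderTooLarge from safe_open usually means the file on disk is not a valid safetensors file (for example, a Git LFS pointer or a truncated download), since the loader rejects headers above a fixed size limit. Here is a minimal sanity-check sketch (not from this thread; the checkpoint path is a hypothetical placeholder):

# Hedged sketch: inspect the 8-byte little-endian header length that every
# safetensors file starts with. An absurdly large value means the file is
# likely an LFS pointer or a corrupted download, which safe_open reports as
# HeaderTooLarge.
import json
import struct

path = "Llama-3.2-3B/model.safetensors"  # hypothetical path to the checkpoint

with open(path, "rb") as f:
    (header_len,) = struct.unpack("<Q", f.read(8))
    print("declared header length (bytes):", header_len)
    if header_len < 100 * 1024 * 1024:  # a real header is far smaller than 100 MiB
        header = json.loads(f.read(header_len))
        print("entries in header (tensors + optional __metadata__):", len(header))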

@Arya-Hari

Hi! Is there any update on this? I'm facing the same issues when running on Colab as well. I also wanted to ask: is there a specific preferred platform or environment for running the examples? Thanks!

@pkgoogle
Contributor

Hi @Arya-Hari, for using the library and doing the conversion, the installation requirements at https://github.com/google-ai-edge/ai-edge-torch?tab=readme-ov-file#installation are your best bet. For inference on Android/iOS, the LiteRT samples at https://github.com/google-ai-edge/litert-samples/tree/main/examples are your best bet.
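
For reference, here is a minimal sketch of the general ai_edge_torch conversion flow, based on my reading of the repo's README example (the ResNet18 model and input shape are only illustrative, and details may differ across versions; the LLM examples instead go through their dedicated convert_*_to_tflite.py scripts):

# Minimal sketch of the general conversion flow (assumed README pattern);
# the generative/LLM examples wrap this in their own convert scripts.
import ai_edge_torch
import torch
import torchvision

model = torchvision.models.resnet18(
    weights=torchvision.models.ResNet18_Weights.IMAGENET1K_V1
).eval()
sample_inputs = (torch.randn(1, 3, 224, 224),)  # illustrative input shape

edge_model = ai_edge_torch.convert(model, sample_inputs)  # trace + lower to TFLite
edge_model.export("resnet18.tflite")                      # write the .tflite file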
