INTERNAL: bitcode module not found at ./opencl.bc when running with "TF_XLA_FLAGS=--tf_xla_auto_jit=2" #1591

Closed
tedliosu opened this issue Feb 27, 2022 · 11 comments

Comments

@tedliosu

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux Ubuntu 20.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): binary using pip
  • TensorFlow version (use command below): v2.8.0-rc1-3269-g5b009178df4 2.8.0
  • Python version: Python 3.9.10
  • Bazel version (if compiling from source): N/A
  • GCC/Compiler version (if compiling from source): N/A
  • ROCm version: 5.0.1
  • GPU model and memory: RX 6800 Reference 16 GB VRAM

You can collect some of this information using our environment capture script.
You can also obtain the TensorFlow version with:

  1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
  2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior

  1. git clone https://github.com/tensorflow/benchmarks.git
  2. cd ./benchmarks/scripts/tf_cnn_benchmarks/
  3. Running:
    • TF_XLA_FLAGS=--tf_xla_auto_jit=2 TF_ROCM_FUSION_ENABLE=1 python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50
      results in the following error:
    2022-02-27 16:36:42.512830: E tensorflow/compiler/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:292] bitcode module is required by this HLO module but was not found at ./opencl.bc
    2022-02-27 16:36:42.513381: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at xla_ops.cc:436 : INTERNAL: bitcode module not found at ./opencl.bc
    INFO:tensorflow:Error reported to Coordinator: <class 'tensorflow.python.framework.errors_impl.InternalError'>, Graph execution error:
    
    2 root error(s) found.
      (0) INTERNAL: bitcode module not found at ./opencl.bc
    	 [[{{node cluster_3_1/xla_compile}}]]
    	 [[cluster_1_1/merge_oidx_0/_567]]
      (1) INTERNAL: bitcode module not found at ./opencl.bc
    	 [[{{node cluster_3_1/xla_compile}}]]
    0 successful operations.
    0 derived errors ignored.
    
    I've attached the full output of the command at this step in this file.

Describe the expected behavior

  • The same three steps above shouldn't produce any errors that abort execution of the benchmark script tf_cnn_benchmarks.py; the errors did not appear with ROCm version 4.5.2 and tensorflow-rocm version 2.7.0. (I tried tensorflow-rocm versions 2.7.0 and 2.7.1 with ROCm 5.0.1, but TensorFlow complained that it couldn't find "libamdhip64.so.4".)

Contributing

  • Do you want to contribute a PR? (yes/no): no, because I have no idea how to fix the issue
  • Briefly describe your candidate solution (if contributing): N/A

Standalone code to reproduce the issue
Please refer to the reproduction steps above for the code to git clone from GitHub.
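
For convenience, the same steps collected into one snippet (commands and flags exactly as in the steps above):

# Reproduction steps from "Describe the current behavior" above.
git clone https://github.com/tensorflow/benchmarks.git
cd ./benchmarks/scripts/tf_cnn_benchmarks/

# This run fails with "INTERNAL: bitcode module not found at ./opencl.bc".
TF_XLA_FLAGS=--tf_xla_auto_jit=2 TF_ROCM_FUSION_ENABLE=1 \
    python3 tf_cnn_benchmarks.py --num_gpus=1 --batch_size=128 --model=resnet50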

@tedliosu
Author

Please let me know if there's anything else I may be able to contribute in order to resolve this issue.

@tedliosu
Author

tedliosu commented Oct 1, 2022

OK, after doing some programming that refreshed my knowledge of how executables look for missing files on Linux, I discovered a pretty hacky workaround for this issue:

(tensorflow_rocm) bkupuntu@opencl-os:~/github_repo_installs/benchmarks/scripts/tf_cnn_benchmarks$ ls -l *.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 32 Oct  1 04:12 ockl.bc -> /opt/rocm/amdgcn/bitcode/ockl.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 58 Oct  1 04:15 oclc_correctly_rounded_sqrt_on.bc -> /opt/rocm/amdgcn/bitcode/oclc_correctly_rounded_sqrt_on.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 44 Oct  1 04:14 oclc_daz_opt_off.bc -> /opt/rocm/amdgcn/bitcode/oclc_daz_opt_off.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 48 Oct  1 04:13 oclc_finite_only_off.bc -> /opt/rocm/amdgcn/bitcode/oclc_finite_only_off.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 49 Oct  1 04:17 oclc_isa_version_1030.bc -> /opt/rocm/amdgcn/bitcode/oclc_isa_version_1030.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 48 Oct  1 04:15 oclc_unsafe_math_off.bc -> /opt/rocm/amdgcn/bitcode/oclc_unsafe_math_off.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 51 Oct  1 04:16 oclc_wavefrontsize64_on.bc -> /opt/rocm/amdgcn/bitcode/oclc_wavefrontsize64_on.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 32 Oct  1 04:12 ocml.bc -> /opt/rocm/amdgcn/bitcode/ocml.bc
lrwxrwxrwx 1 bkupuntu bkupuntu 34 Oct  1 04:11 opencl.bc -> /opt/rocm/amdgcn/bitcode/opencl.bc

So the tensorflow-rocm build is simply NOT looking in the correct directory for the bitcode files, which live under /opt/rocm/amdgcn/bitcode... I'm still not sure what patches the TensorFlow source needs so that it looks in the correct directory for those bitcode files 😕

Btw, for others running into the same problem: YMMV on exactly which bitcode files need to be linked into the current working directory, depending on which GPU you have (mine is a gfx1030-based RX 6800).
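
For anyone who wants to script the workaround, here is a rough sketch of the symlinking shown above, assuming ROCm 5.x keeps its bitcode under /opt/rocm/amdgcn/bitcode and a gfx1030 card like mine (swap the oclc_isa_version_* file for your architecture, and adjust the file list based on which files the errors complain about):

# Hacky workaround sketch: symlink the ROCm bitcode files into the
# directory you run tf_cnn_benchmarks.py from. The file list matches
# what a gfx1030 card needed here; yours may differ.
BITCODE_DIR=/opt/rocm/amdgcn/bitcode   # assumed ROCm 5.x layout
cd ./benchmarks/scripts/tf_cnn_benchmarks/   # run from wherever you cloned the repo

for f in ockl.bc ocml.bc opencl.bc \
         oclc_correctly_rounded_sqrt_on.bc oclc_daz_opt_off.bc \
         oclc_finite_only_off.bc oclc_unsafe_math_off.bc \
         oclc_wavefrontsize64_on.bc oclc_isa_version_1030.bc; do
    ln -sf "$BITCODE_DIR/$f" .
done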

@Mushoz

Mushoz commented Dec 10, 2022

I am getting the exact same error message; however, it happens even without any environment variables set. I am unable to run tensorflow-rocm on a 6900 XT under ROCm 5.4.0. It used to work just fine previously.

There are similar reports here: ROCm/ROCm#1796

Any idea on how to get it to work again? Are you symlinking the .bc files? Or what exactly are you proposing as a hacky solution? Right now, tensorflow is unusable on RDNA3 cards as far as I can tell.

@tedliosu
Author

> I am getting the exact same error message; however, it happens even without any environment variables set. I am unable to run tensorflow-rocm on a 6900 XT under ROCm 5.4.0. It used to work just fine previously.
>
> There are similar reports here: RadeonOpenCompute/ROCm#1796
>
> Any idea on how to get it to work again? Are you symlinking the .bc files? Or what exactly are you proposing as a hacky solution? Right now, tensorflow is unusable on RDNA3 cards as far as I can tell.

@Mushoz yes, I am simply symlinking the appropriate files into the current working directory, as shown in my previous comment; YMMV as to exactly which files to symlink (I just kept symlinking each file the errors said was missing until all the errors went away) because it appears to be architecture dependent. Sorry to hear that you're running into even worse issues, and hopefully my solution helps fix them 🥺
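
One way to figure out which oclc_isa_version_*.bc your card needs is to ask rocminfo for the GPU's gfx target; a small sketch (the grep pattern is an assumption about rocminfo's output format, so adjust as needed):

# Print the gfx target of the first GPU agent, e.g. "gfx1030",
# which maps to oclc_isa_version_1030.bc.
rocminfo | grep -m1 -oE 'gfx[0-9a-f]+'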

@Mushoz

Mushoz commented Dec 12, 2022

Cheers, that worked wonderfully! I really wonder why this isn't reported by more people. A simple model with just one dense layer and some randomly generated features and targets refuses to run on my 6900 XT, so even the simplest of cases is completely broken without the symlinks. I did not have to do that previously, so this is a big regression. This is all without any switches, just a purely stock tensorflow-rocm installation and execution.
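
For reference, a rough sketch of the kind of minimal case described above (assuming tensorflow-rocm is installed; the shapes and hyperparameters are arbitrary):

# One dense layer on random data; on an affected setup this reportedly
# fails with the same "bitcode module not found" error.
python3 - <<'EOF'
import numpy as np
import tensorflow as tf

x = np.random.rand(256, 8).astype("float32")   # random features
y = np.random.rand(256, 1).astype("float32")   # random targets

model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(8,))])
model.compile(optimizer="sgd", loss="mse")
model.fit(x, y, epochs=1, batch_size=32)
EOF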

@xupit3r

xupit3r commented Dec 25, 2022

Hey, so I was having this same issue with a Radeon Pro VII (gfx906) on Ubuntu 22.04 using ROCm 5.4.1, and it turns out that if I set ROCM_PATH to /opt/rocm (which is where all the library and bitcode goodies are), XLA could compile and run.
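
A sketch of that fix, assuming the default /opt/rocm install prefix (the script name below is just a placeholder):

# Point the ROCm/XLA backend at the ROCm install so it can find the
# amdgcn bitcode files, then run your workload as usual.
export ROCM_PATH=/opt/rocm
TF_XLA_FLAGS=--tf_xla_auto_jit=2 python3 your_training_script.py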

@jasondrusso

jasondrusso commented Dec 30, 2022

@tedliosu I can also confirm this issue with my 6800 XT, and your solution works for me as well. It seems like there should be an environment variable to resolve what is essentially a path problem, but updating ROCM_PATH didn't help for me.

FYI, I am observing this problem with ROCm 5.4.1.

@vsrikarunyan

> Hey, so I was having this same issue with a Radeon Pro VII (gfx906) on Ubuntu 22.04 using ROCm 5.4.1, and it turns out that if I set ROCM_PATH to /opt/rocm (which is where all the library and bitcode goodies are), XLA could compile and run.

I used to take care of this issue by setting ROCM_HOME in the past, but this time it needed ROCM_PATH. As far as my environment is concerned, it's identical to yours.
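
If it's unclear which variable a particular build reads, exporting both before launching is a cheap way to cover either case; a sketch (both variable names come from this thread rather than from checking the TensorFlow source):

# Older setups reportedly honored ROCM_HOME; newer ones needed ROCM_PATH.
export ROCM_PATH=/opt/rocm
export ROCM_HOME=/opt/rocm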

@tedliosu
Author

> @tedliosu I can also confirm this issue with my 6800 XT, and your solution works for me as well. It seems like there should be an environment variable to resolve what is essentially a path problem, but updating ROCM_PATH didn't help for me.
>
> FYI, I am observing this problem with ROCm 5.4.1.

@jasondrusso Unfortunately, since maintenance of the original codebase used to reproduce this issue has long been abandoned (see this comment for more info), I am no longer able to test whether setting ROCM_PATH solves the issue as presented here, at least not without using a completely different codebase as a reproducer. By the way, attempting to run the benchmark that originally led me to this issue now results in the following error:

Traceback (most recent call last):
  File "/home/bkupuntu/github_repo_installs/old_tf_benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 68, in <module>
    app.run(main)  # Raises error on invalid flags, unlike tf.app.run()
  File "/home/bkupuntu/tensorflow_rocm/lib/python3.10/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/home/bkupuntu/tensorflow_rocm/lib/python3.10/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/home/bkupuntu/github_repo_installs/old_tf_benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py", line 59, in main
    tfversion = cnn_util.tensorflow_version_tuple()
  File "/home/bkupuntu/github_repo_installs/old_tf_benchmarks/scripts/tf_cnn_benchmarks/cnn_util.py", line 27, in tensorflow_version_tuple
    major, minor, patch = v.split('.')
ValueError: too many values to unpack (expected 3)

So if you don't mind, could you please provide a minimal working example of the code you were running that produced the same error I originally reported? Otherwise I unfortunately can't confirm whether or not this is purely a user configuration issue 🙁

Thanks in advance 😃

@tedliosu
Author

Since I broke the system containing my RX 6800 while attempting to upgrade its memory, and I no longer have the time or energy to maintain my own desktop, I sold the RX 6800 (my only AMD GPU). As I will not be able to repro any potential fix for this issue anymore, I am closing it for the time being. I will be more than willing to reopen it if anyone else runs into the same problem.

@tedliosu
Author

Sorry, pressed the wrong button; closing now.
