GPUv2 numerical inaccuracy in simple Add + Mul #66740

Open
gustavla opened this issue Apr 30, 2024 · 1 comment
Assignees
Labels
Android comp:lite TF Lite related issues

Comments

@gustavla
Contributor

gustavla commented Apr 30, 2024

System information

  • Google Pixel 7 / Android 14 / Google Tensor G2
  • TensorFlow 2.16.1

Running a simple model with an Add + Mul on both GPU and CPU gives vastly different results (the on-device CPU result was confirmed separately to be correct):

Arrays are not almost equal to 7 decimals

Mismatched elements: 392911 / 393216 (99.9%)
Max absolute difference: 2.719718
Max relative difference: 0.99997413
 x: array([[[[0.408207 , 0.2999736, 0.1134637, ..., 0.1838855, 0.1379557,
          0.2322327],
         [1.3259705, 0.9743981, 0.3685618, ..., 0.5973114, 0.4481186,...
 y: array([[[[0.4082031, 0.       , 0.       , ..., 0.       , 0.       ,
          1.2128906],
         [1.3251953, 0.       , 0.       , ..., 0.       , 0.       ,...
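One observation on the printed digits only, not something confirmed in this report: the nonzero values in `y` look like numbers that survive a float16 round trip almost unchanged, while the values in `x` do not. A minimal sketch of that check, using only the values visible in the output above (the helper name `fp16_roundtrip_error` is made up for illustration, and this says nothing about why most elements of `y` are zero):

```python
import numpy as np

# Values copied from the printed arrays above (only 7 decimals are shown,
# so tiny residual errors from the truncated printout are expected).
cpu_vals = np.array([0.408207, 1.3259705], dtype=np.float32)
gpu_vals = np.array([0.4082031, 1.2128906, 1.3251953], dtype=np.float32)


def fp16_roundtrip_error(values):
    """Absolute error introduced by rounding float32 values to float16 and back."""
    values = np.asarray(values, dtype=np.float32)
    return np.abs(values - values.astype(np.float16).astype(np.float32))


gpu_err = fp16_roundtrip_error(gpu_vals)
cpu_err = fp16_roundtrip_error(cpu_vals)
print("GPU round-trip error:", gpu_err)
print("CPU round-trip error:", cpu_err)
```

If the GPU errors are at the level of the print precision and the CPU errors are orders of magnitude larger, that would be consistent with (but not proof of) the GPU path producing half-precision results.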

Standalone code to reproduce the issue

Inference repro
I reproduced this using https://aihub.qualcomm.com/ out of convenience (but I'm sure it will repro through other means as well), so I'm attaching the script here.

import numpy as np
import qai_hub as hub

model = hub.upload_model("tflite_66740_add_mul_gpu_numerically_incorrect.tflite")

rng = np.random.RandomState(1234)
x = rng.uniform(size=(1, 64, 64, 1)).astype(np.float32)

inputs = hub.upload_dataset({"model_5/0/body/0/0/norm/LayerNormalization/batchnorm/Rsqrt": [x]})

device = hub.Device("Google Pixel 7")
cpu_job = hub.submit_inference_job(
    model,
    device=device,
    inputs=inputs,
    options="--compute_unit cpu",
)

gpu_job = hub.submit_inference_job(
    model,
    device=device,
    inputs=inputs,
    options="--compute_unit gpu",
)

cpu_output = cpu_job.download_output_data()
gpu_output = gpu_job.download_output_data()

np.testing.assert_almost_equal(cpu_output["output_0"][0], gpu_output["output_0"][0])
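Since `assert_almost_equal` only reports the failure, a small helper can make the pattern (most GPU elements being exactly zero) easier to quantify. This is a sketch under assumptions: the function name `summarize_mismatch` and the tolerance `atol=1e-2` are illustrative choices, and the arrays below are synthetic stand-ins rather than data from the device.

```python
import numpy as np


def summarize_mismatch(reference, candidate, atol=1e-2):
    """Summarize how a candidate output deviates from a reference output."""
    reference = np.asarray(reference, dtype=np.float32)
    candidate = np.asarray(candidate, dtype=np.float32)
    abs_diff = np.abs(reference - candidate)
    # Elements the candidate reports as exactly zero although the reference is nonzero.
    zeroed = (candidate == 0) & (reference != 0)
    return {
        "max_abs_diff": float(abs_diff.max()),
        "mismatch_fraction": float((abs_diff > atol).mean()),
        "zeroed_fraction": float(zeroed.mean()),
    }


# Synthetic stand-in data (hypothetical, not taken from the device run).
reference = np.array([0.5, 1.0, 1.5, 2.0], dtype=np.float32)
candidate = np.array([0.5, 0.0, 0.0, 2.0], dtype=np.float32)
summary = summarize_mismatch(reference, candidate)
print(summary)
```

With the script above, it could be called as `summarize_mismatch(cpu_output["output_0"][0], gpu_output["output_0"][0])`.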

Model visualization
[image: model visualization]

Any other info / logs

Logs from the GPU inference job (via https://aihub.qualcomm.com/):

[01/May/2024:08:11:15 +10:00: profiler/info] Android system property: ro.board.platform = gs201
[01/May/2024:08:11:15 +10:00: profiler/info] Android system property: ro.boot.hardware = panther
[01/May/2024:08:11:15 +10:00: profiler/info] Android system property: ro.boot.hardware.platform = gs201
[01/May/2024:08:11:15 +10:00: profiler/info] Android system property: ro.system.build.id = UP1A.231005.007.A1
[01/May/2024:08:11:15 +10:00: profiler/info] Android system property: ro.system.build.version.release = 14
[01/May/2024:08:11:15 +10:00: profiler/info] Android system property: ro.hardware = panther
[01/May/2024:08:11:15 +10:00: profiler/info] Android system property: ro.hardware.chipname = 
[01/May/2024:08:11:15 +10:00: profiler/info] Android system property: ro.product.board = panther
[01/May/2024:08:11:15 +10:00: profiler/info] Android system property: ro.product.brand = google
[01/May/2024:08:11:15 +10:00: profiler/info] Android system property: ro.product.device = panther
[01/May/2024:08:11:15 +10:00: profiler/info] Android system property: ro.product.build.fingerprint = google/panther/panther:14/UP1A.231005.007.A1/10762838:user/release-keys
[01/May/2024:08:11:15 +10:00: profiler/info] Android system property: ro.product.manufacturer = Google
[01/May/2024:08:11:15 +10:00: profiler/info] Android system property: ro.product.model = Pixel 7
[01/May/2024:08:11:15 +10:00: profiler/info] Android system property: ro.product.name = panther
[01/May/2024:08:11:15 +10:00: profiler/info] Android system property: ro.soc.manufacturer = Google
[01/May/2024:08:11:15 +10:00: profiler/info] Android system property: ro.soc.model = GS201
[01/May/2024:08:11:15 +10:00: profiler/info] [Manager] DeviceManager::DeviceManager
[01/May/2024:08:11:15 +10:00: profiler/info] [Manager] findAvailableDevices
[01/May/2024:08:11:15 +10:00: profiler/info] [Manager] Found interface google-edgetpu (version = 2.0)
[01/May/2024:08:11:15 +10:00: profiler/info] [Manager] Found interface google-armnn (version = ArmNN)
[01/May/2024:08:11:15 +10:00: profiler/info] NNAPI devices: google-edgetpu,google-armnn,nnapi-reference
[01/May/2024:08:11:15 +10:00: profiler/info] GPU device: ARM Mali-G710
[01/May/2024:08:11:15 +10:00: profiler/info] OpenGL Version: OpenGL ES 3.2 v1.r43p0-01eac0.ff5c643eda65d3f5ba9886fdffb12673
[01/May/2024:08:11:15 +10:00: profiler/info] OpenCL Version: OpenCL C 1.2 v1.r43p0-01eac0.ff5c643eda65d3f5ba9886fdffb12673
[01/May/2024:08:11:15 +10:00: profiler/info] -=- Tungsten Running Task: Loading -=-
[01/May/2024:08:11:15 +10:00: profiler/info] Detected chipset 3101, made by 3000.
[01/May/2024:08:11:15 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] Loading tflite model Models/model.tflite
[01/May/2024:08:11:15 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] Malloc VM size before: 28600.0 kB, allocated: 18032.4 kB, slack: 10567.6 kB.
[01/May/2024:08:11:15 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] Current memory baseline range: 61792.4-72360.0 kB.
[01/May/2024:08:11:15 +10:00: profiler/debug] [job_id: jlpej1x85] [model.tflite] Runtime metadata not found in Models/model.tflite/trt_metadata.json or Models/model.tflite/trt_metadata.pb
[01/May/2024:08:11:15 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] TF Lite version 2.16.1. Loading model from Models/model.tflite.
[01/May/2024:08:11:15 +10:00: profiler/debug] [job_id: jlpej1x85] [model.tflite] Mapping resource file in Models/model.tflite
[01/May/2024:08:11:15 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] Loaded model. Minimum TF Lite version = 1.5.0.
[01/May/2024:08:11:15 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] No delegates specified; using compute unit=cpu_and_gpu.
[01/May/2024:08:11:15 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] [tflite] Initialized TensorFlow Lite runtime.
[01/May/2024:08:11:15 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] GPUV2 delegate requested. OpenCL detected.
[01/May/2024:08:11:15 +10:00: profiler/debug] [job_id: jlpej1x85] [model.tflite] Enabling delegate cache in dir=/data/user/0/ai.tetra.tungsten/cache/1714515075350/ai.tetra.runtime/0.6.0/model.tflite_8945450969824422876_1714515066045/gpuv2.
[01/May/2024:08:11:15 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] [tflite] Created TensorFlow Lite delegate for GPU.
[01/May/2024:08:11:15 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] [tflite] Replacing 2 out of 2 node(s) with delegate (TfLiteGpuDelegateV2) node, yielding 1 partitions for the whole graph.
[01/May/2024:08:11:15 +10:00: profiler/warning] [job_id: jlpej1x85] [model.tflite] [tflite] File /data/user/0/ai.tetra.tungsten/cache/1714515075350/ai.tetra.runtime/0.6.0/model.tflite_8945450969824422876_1714515066045/gpuv2/gpuv2_5609552586171313032.bin couldn't be opened for reading: No such file or directory
[01/May/2024:08:11:15 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] [tflite] Initialized OpenCL-based API.
[01/May/2024:08:11:15 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] [tflite] Created 1 GPU delegate kernels.
[01/May/2024:08:11:15 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] Applied 1 delegates: GPUV2/OpenCL. Model is fully delegated=true.
[01/May/2024:08:11:15 +10:00: profiler/debug] [job_id: jlpej1x85] [model.tflite] Saving delegate selection for subsequent steps.
[01/May/2024:08:11:15 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] Malloc VM size after: 36240.0 kB, allocated: 25209.7 kB, slack: 11030.3 kB.
[01/May/2024:08:11:15 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] Status Successfully Loaded Cold with t = 132027 us and usage: before = 72360.0 kB; peakBefore = 72360.0 kB; mallocUnusedBefore = 10567.6 kB; after = 81780.0 kB; peakAfter = 81744.0 kB; mallocUnusedAfter = 11030.3 kB; increase = 0.0-8957.3 kB; peak = 9384.0-19951.6 kB
[01/May/2024:08:11:15 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] Saving results to /storage/emulated/0/Android/data/ai.tetra.tungsten/files/Results/job_jlpej1x85/job_jlpej1x85_results.bin
[01/May/2024:08:11:16 +10:00: profiler/info] -=- Tungsten Running Task: Loading -=-
[01/May/2024:08:11:16 +10:00: profiler/info] Detected chipset 3101, made by 3000.
[01/May/2024:08:11:16 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] Loading previously saved results in /storage/emulated/0/Android/data/ai.tetra.tungsten/files/Results/job_jlpej1x85/job_jlpej1x85_results.bin
[01/May/2024:08:11:16 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] Loading tflite model Models/model.tflite
[01/May/2024:08:11:16 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] Malloc VM size before: 30640.0 kB, allocated: 17338.9 kB, slack: 13301.1 kB.
[01/May/2024:08:11:16 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] Current memory baseline range: 54914.9-68216.0 kB.
[01/May/2024:08:11:16 +10:00: profiler/debug] [job_id: jlpej1x85] [model.tflite] Runtime metadata not found in Models/model.tflite/trt_metadata.json or Models/model.tflite/trt_metadata.pb
[01/May/2024:08:11:16 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] TF Lite version 2.16.1. Loading model from Models/model.tflite.
[01/May/2024:08:11:16 +10:00: profiler/debug] [job_id: jlpej1x85] [model.tflite] Mapping resource file in Models/model.tflite
[01/May/2024:08:11:16 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] Loaded model. Minimum TF Lite version = 1.5.0.
[01/May/2024:08:11:16 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] GPUV2 delegate requested. OpenCL detected.
[01/May/2024:08:11:16 +10:00: profiler/debug] [job_id: jlpej1x85] [model.tflite] Enabling delegate cache in dir=/data/user/0/ai.tetra.tungsten/cache/1714515075350/ai.tetra.runtime/0.6.0/model.tflite_8945450969824422876_1714515066045/gpuv2.
[01/May/2024:08:11:16 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] [tflite] Replacing 2 out of 2 node(s) with delegate (TfLiteGpuDelegateV2) node, yielding 1 partitions for the whole graph.
[01/May/2024:08:11:16 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] [tflite] Found serialized data for model gpuv2 (790680 B) at /data/user/0/ai.tetra.tungsten/cache/1714515075350/ai.tetra.runtime/0.6.0/model.tflite_8945450969824422876_1714515066045/gpuv2/gpuv2_5609552586171313032.bin
[01/May/2024:08:11:16 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] [tflite] Initialized OpenCL-based API from serialized data.
[01/May/2024:08:11:16 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] [tflite] Created 1 GPU delegate kernels.
[01/May/2024:08:11:16 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] Applied 1 delegates: GPUV2/OpenCL. Model is fully delegated=true.
[01/May/2024:08:11:16 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] Malloc VM size after: 36496.0 kB, allocated: 25417.5 kB, slack: 11078.5 kB.
[01/May/2024:08:11:16 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] Status Successfully Loaded Warm with t = 94180 us and usage: before = 68216.0 kB; peakBefore = 68216.0 kB; mallocUnusedBefore = 13301.1 kB; after = 78828.0 kB; peakAfter = 78076.0 kB; mallocUnusedAfter = 11078.5 kB; increase = 0.0-12834.6 kB; peak = 9860.0-23161.1 kB
[01/May/2024:08:11:16 +10:00: profiler/info] [job_id: jlpej1x85] [model.tflite] Saving results to /storage/emulated/0/Android/data/ai.tetra.tungsten/files/Results/job_jlpej1x85/job_jlpej1x85_results.bin
@sawantkumar

Hi @gustavla,

I am in the process of replicating your issue. I will get back to you as soon as possible.
