Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Two slices of a device-only buffer pointing to the same memory do not have the same identity #7004

Open
knzivid opened this issue Sep 9, 2022 · 1 comment

Comments

@knzivid
Copy link
Contributor

knzivid commented Sep 9, 2022

Expected

Making two slices from a buffer with the same dimension and index should point to the same buffer and should be interchangeable. Passing the first slice to a halide func and using the second slice to copy to host should copy the result of the halide func execution to the host.

Actual

Copying the original buffer, as well as the second slice are not marked dirty. Only the first slice is marked dirty after running the pipeline. Is this documented behavior?

main.cpp
Halide::Runtime::Buffer<uint32_t, 3> buf(std::vector{8, 8, 2});
buf.fill(0);
buf.set_host_dirty();
buf.copy_to_device(halide_opencl_device_interface());

auto slice = buf.sliced(2, 0);
consumer(slice);

printf(">> Copying the parent buffer (%s) to host\n", buf.device_dirty() ? "dirty" : "notdirty");
buf.copy_to_host();
print(buf);

auto newslice = buf.sliced(2, 0);
printf(">> Making a new slice (%s)\n", newslice.device_dirty() ? "dirty" : "notdirty");
newslice.copy_to_host();
print(newslice);

printf(">> Using existing slice (%s)\n", slice.device_dirty() ? "dirty" : "notdirty");
slice.copy_to_host();
print(slice);

auto sliceagain = buf.sliced(2, 0);
printf(">> Making a new slice again (%s)\n", sliceagain.device_dirty() ? "dirty" : "notdirty");
sliceagain.copy_to_host();
print(sliceagain);
main.py
consumer = Func("consumer")
x, y = Var(), Var()
consumer[x, y] = u32(1)
xo, yo, xi, yi = Var(), Var(), Var(), Var()
consumer.gpu_tile(x, y, xo, yo, xi, yi, 8, 8)
Verbose debug output
halide_copy_to_device validating input buffer: buffer(0, 0x0, 0x55d8b7facf00, 1, uint32, {0, 8, 1}, {0, 8, 8}, {0, 2, 64})
halide_device_malloc validating input buffer: buffer(0, 0x0, 0x55d8b7facf00, 1, uint32, {0, 8, 1}, {0, 8, 8}, {0, 2, 64})
halide_device_malloc: target device interface 0x55d8b6b792a8
CL: halide_opencl_device_malloc (user_context: 0x0, buf: 0x7ffc499e82d0)
    load_libopencl (user_context: 0x0)
    Loaded OpenCL runtime library: libOpenCL.so
    create_opencl_context (user_context: 0x0)
CL: platform 0 NVIDIA CUDA
CL: platform 1 Intel(R) OpenCL HD Graphics
    Got platform 'Intel(R) OpenCL HD Graphics', about to create context (t=60640070)
      device name: Intel(R) UHD Graphics 630 [0x3e98]
      device vendor: Intel(R) Corporation
      device profile: FULL_PROFILE
      global mem size: 25552 MB
      max mem alloc size: 4095 MB
      local mem size: 65536
      max compute units: 24
      max workgroup size: 256
      max work item dimensions: 3
      max work item sizes: 256x256x256x0
    clCreateContext -> 0x55d8b82bbde0
    clCreateCommandQueue 0x55d8b7fb0130
    allocating buffer(0, 0x0, 0x55d8b7facf00, 1, uint32, {0, 8, 1}, {0, 8, 8}, {0, 2, 64})
    clCreateBuffer -> 512 0x55d8b82bc670 device_handle: 0x55d8b82bc020
    Allocated device buffer 0x55d8b82bc020 for buffer 0x7ffc499e82d0
CL: validate 0x55d8b82bc670 offset: 0: asked for 512, actual allocated 512
    Time: 3.579700e-02 ms
halide_copy_to_device 0x7ffc499e82d0 host is dirty
halide_copy_to_device 0x7ffc499e82d0 calling copy_to_device()
CL: halide_opencl_buffer_copy (user_context: 0x0, src: 0x7ffc499e82d0, dst: 0x7ffc499e82d0)
CL: validate 0x55d8b82bc670 offset: 0: asked for 0, actual allocated 512
    from host to device, 0x55d8b7facf00 + 0 -> 0x55d8b82bc020 + 0, 512 bytes
    Time: 3.430599e+00 ms
halide_device_crop 
 src: buffer(94389291171872, 0x55d8b6b792a8, 0x55d8b7facf00, 0, uint32, {0, 8, 1}, {0, 8, 8}, {0, 2, 64})
 dst: buffer(94389291174752, 0x55d8b6b792a8, 0x55d8b7facf00, 0, uint32, {0, 8, 1}, {0, 8, 8})
Entering Pipeline consumer
Target: x86-64-linux-cl_doubles-debug-no_runtime-opencl
 Output Buffer consumer: 0x7ffc499e8248
CL: halide_opencl_initialize_kernels (user_context: 0x0, state_ptr: 0x55d8b6b80d80, program: 0x55d8b6b74a90, size: 3179
halide_cuda_initialize_kernels got compilation_cache mutex.
    clCreateProgramWithSource -> 0x55d8b82c0560
    clBuildProgram 0x55d8b82c0560 -D MAX_CONSTANT_BUFFER_SIZE=4294959104 -D MAX_CONSTANT_ARGS=8 
Caching compiled kernel: 0x55d8b82c0560 id 2 context 0x55d8b82bbde0
    Time: 2.683257e+02 ms
halide_copy_to_device validating input buffer: buffer(94389291174752, 0x55d8b6b792a8, 0x55d8b7facf00, 0, uint32, {0, 8, 1}, {0, 8, 8})
halide_copy_to_device 0x7ffc499e8248 skipped (host is not dirty)
CL: halide_opencl_run (user_context: 0x0, entry: _kernel_consumer_s0_v4_v6___block_id_y, blocks: 1x1x1, threads: 8x8x1, shmem: 0
    clCreateKernel _kernel_consumer_s0_v4_v6___block_id_y ->     Time: 4.110000e-03 ms
    clSetKernelArg 0 8 [0x55d8b82bcb60 ...] 1
Mapped dev handle is: 0x55d8b82bc670
    clSetKernelArg 1 4 [0x7f7c00000008 ...] 0
    clSetKernelArg 2 4 [0x55d800000000 ...] 0
    clSetKernelArg 3 4 [0x8 ...] 0
    clSetKernelArg 4 4 [0x55d800000001 ...] 0
    clSetKernelArg 5 4 [0x7f7c00000000 ...] 0
    clSetKernelArg 6 4 [0x8 ...] 0
    clSetKernelArg 7 0 [nullptr]
    clEnqueueNDRangeKernel 1x1x1, 8x8x1 -> CL_SUCCESS
    Releasing kernel 0x55d8b9418260
    clReleaseKernel finished0x55d8b9418260
    Time: 6.657300e-01 ms
Exiting Pipeline consumer
CL: halide_opencl_finalize_kernels (user_context: 0x0, state_ptr: 0x2
>> Copying the parent buffer (notdirty) to host
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
halide_device_crop 
 src: buffer(94389291171872, 0x55d8b6b792a8, 0x55d8b7facf00, 0, uint32, {0, 8, 1}, {0, 8, 8}, {0, 2, 64})
 dst: buffer(94389292046784, 0x55d8b6b792a8, 0x55d8b7facf00, 0, uint32, {0, 8, 1}, {0, 8, 8})
>> Making a new slice (notdirty)
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
>> Using existing slice (dirty)
halide_copy_to_host validating input buffer: buffer(94389291174752, 0x55d8b6b792a8, 0x55d8b7facf00, 2, uint32, {0, 8, 1}, {0, 8, 8})
copy_to_host_already_locked 0x7ffc499e8248 dev_dirty is true
CL: halide_opencl_buffer_copy (user_context: 0x0, src: 0x7ffc499e8248, dst: 0x7ffc499e8248)
CL: validate 0x55d8b82bc670 offset: 0: asked for 0, actual allocated 512
    from device to host, 0x55d8b82bcb60 + 0 -> 0x55d8b7facf00 + 0, 256 bytes
    Time: 8.687100e-02 ms
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
halide_device_crop 
 src: buffer(94389291171872, 0x55d8b6b792a8, 0x55d8b7facf00, 0, uint32, {0, 8, 1}, {0, 8, 8}, {0, 2, 64})
 dst: buffer(94389293454400, 0x55d8b6b792a8, 0x55d8b7facf00, 0, uint32, {0, 8, 1}, {0, 8, 8})
>> Making a new slice again (notdirty)
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 
CL: halide_opencl_device_release_crop(user_context: 0x0, buf: 0x7ffc499e8158) cl_mem: 0x55d8b82bc670 offset: 0
CL: validate 0x55d8b82bc670 offset: 0: asked for 0, actual allocated 512
    clReleaseMemObject 0x55d8b82bc670
    Time: 1.605000e-03 ms
CL: halide_opencl_device_release_crop(user_context: 0x0, buf: 0x7ffc499e81d0) cl_mem: 0x55d8b82bc670 offset: 0
CL: validate 0x55d8b82bc670 offset: 0: asked for 0, actual allocated 512
    clReleaseMemObject 0x55d8b82bc670
    Time: 1.224000e-03 ms
CL: halide_opencl_device_release_crop(user_context: 0x0, buf: 0x7ffc499e8248) cl_mem: 0x55d8b82bc670 offset: 0
CL: validate 0x55d8b82bc670 offset: 0: asked for 0, actual allocated 512
    clReleaseMemObject 0x55d8b82bc670
    Time: 1.242000e-03 ms
halide_device_free validating input buffer: buffer(94389291171872, 0x55d8b6b792a8, 0x0, 0, uint32, {0, 8, 1}, {0, 8, 8}, {0, 2, 64})
CL: halide_opencl_device_free (user_context: 0x0, buf: 0x7ffc499e82d0) cl_mem: 0x55d8b82bc670
CL: validate 0x55d8b82bc670 offset: 0: asked for 0, actual allocated 512
    clReleaseMemObject 0x55d8b82bc670
    Time: 5.286000e-03 ms
Releasing cached compilation: 0x55d8b82c0560 id 2 context 0x55d8b82bbde0
CL: halide_opencl_device_release (user_context: 0x0)
    clReleaseCommandQueue 0x55d8b7fb0130
    clReleaseContext 0x55d8b82bbde0
halide_memoization_cache_cleanup
@mcourteaux
Copy link
Contributor

mcourteaux commented Nov 5, 2024

copy_to_host() will only copy the buffer back if it is marked as device_dirty. Unfortunately, dirty-bits are not automatically synchronized between slices/crops of other buffers and the original. While this is currently normal behavior, it is somewhat error-prone for new users. It's currently still your responsibility to correctly mark sliced buffers as dirty, or the parent buffer as dirty, depending on your use case.

Related for an in-depth discussion and me struggling to debug something related: #8395.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants