
Feature request: more KnetArray indexing #198

Closed

ngphuoc opened this issue Nov 23, 2017 · 33 comments

@ngphuoc commented Nov 23, 2017

The following would be useful:

getindex(::Knet.KnetArray{Float32,3}, ::Colon, ::Int64, ::Int64)
getindex(::Knet.KnetArray{Float32,3}, ::Colon, ::Array{Int64,1}, ::Int64)
getindex(::Knet.KnetArray{Float32,3}, ::Colon, ::Array{Int64,1},  ::Array{Int64,1})
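
(A slow but general stopgap until such methods exist is to round-trip through the CPU. A sketch; cpuindex is a hypothetical helper, not a Knet function, and relies only on Knet's standard Array/KnetArray conversions:)

using Knet

cpuindex(a::KnetArray, I...) = KnetArray(Array(a)[I...])  # gather on the CPU, move the result back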
@ngphuoc (Author) commented Nov 30, 2017

And these are needed to reorder hidden states between runs in RNNs with dynamic sizes:

getindex(::Knet.KnetArray{Float32,3}, ::Colon, ::Array{Int64,1}, ::Colon)
getindex(::Knet.KnetArray{Float32,3}, ::Colon, ::UnitRange{Int64}, ::Colon)
getindex(::Knet.KnetArray{Float32,3}, ::Colon, ::StepRange{Int64,Int64}, ::Colon)

@denizyuret (Owner)

@maleadt I wonder if we can somehow use the kernels generated by CUDAnative instead of trying to implement all these indexing variations by hand. I would generate code for these using CUDAnative and put it in libknet8.so for now, I'm just not sure how. This would be a stopgap measure until CUDAnative no longer requires a Julia recompile and I can use CuArrays etc.

@maleadt (Collaborator) commented Dec 1, 2017

Yeah, definitely. I'm not sure what the kernels would look like, but you could use metaprogramming to generate function definitions, call code_ptx on them (possibly using different type signatures), collect the generated PTX, and link it all together into a single PTX to send to the CUDA driver when Knet.jl is loaded. For performance, you'll want to look into using CUDAnative's DevicePtr and performing aligned loads, but that has already been implemented in CUDAnative and is just a matter of figuring out how to use that infrastructure.

If you can come up with a simple example I could look into creating a proof of concept.
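
A minimal sketch of that pipeline (assuming code_ptx accepts an IO first argument like Base's code_llvm; copycol is a placeholder standing in for one indexing specialization):

using CUDAnative

function copycol(dst, src, j)
    i = (blockIdx().x-1) * blockDim().x + threadIdx().x
    if i <= size(src, 1)
        @inbounds dst[i] = src[i, j]
    end
    return nothing
end

buf = IOBuffer()
for T in (Float32, Float64)   # one specialization per element type
    sig = Tuple{CUDAnative.CuDeviceArray{T,1,CUDAnative.AS.Global},
                CUDAnative.CuDeviceArray{T,2,CUDAnative.AS.Global}, Int}
    CUDAnative.code_ptx(buf, copycol, sig; kernel=true)
end
ptx = String(take!(buf))      # concatenated PTX that could be shipped with libknet8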

@denizyuret (Owner)

OK, first I wanted to make sure that we can cover the indexing operations with CUDAnative using CuArrays. I got a SegFault in Julia release-0.6. Tagging @maleadt, @MikeInnes, @SimonDanisch.
Here is the test script: https://gist.github.com/denizyuret/0100c8383df9cf4cd0106f54599603c5
Here is the segfault: https://gist.github.com/denizyuret/336ed2e44bf68bb6ff79a85b082231fd

@ilkerkesen (Collaborator)

I got a CUDA error instead of a segfault.

https://gist.github.com/ilkerkesen/c921d41894b8667d8acc6e71803f04fe

I am using Julia v0.6, and the full path is:

/KUFS/apps/julia/0.6.0-compiled/usr/bin

@SimonDanisch

CLArrays seems to work fine, so I guess this is CUDAnative-specific (CuArrays and CLArrays should share the same indexing code, if I'm not missing something).

@SimonDanisch

Ah no, it's just almost the same code... Let me check the specific differences.

@SimonDanisch

I'm running out of time and couldn't get CuArrays running with all the changes in LLVM/CUDAnative.
But for anyone who has it working already, there is a simple thing you can try.

Just comment out the indexing include in CuArrays.jl at https://github.com/JuliaGPU/CuArrays.jl/blob/master/src/CuArrays.jl#L13

And try it again. Then it should use the same code as CLArrays for indexing!

@maleadt (Collaborator) commented Dec 2, 2017

> I'm running out of time and couldn't get CuArrays running with all the changes in LLVM/CUDAnative.

Tagged versions / using Pkg should work, respecting REQUIRE is ... required.

@SimonDanisch

Hm, maybe I messed up somewhere, but all I did was Pkg.update, and it told me I need 0.7 for LLVM.jl. Then I started manually downgrading relevant packages, and since each one took a long time, I ended up in the mentioned situation ;)

@maleadt (Collaborator) commented Dec 2, 2017

Huh, 0.5.1 is the last version of LLVM.jl that supports 0.6; 0.9.0 does indeed require 0.7, but that shouldn't get auto-installed, right?

EDIT: this is getting a bit off topic; we shouldn't be hijacking this issue, I guess. @SimonDanisch, could you open an issue on LLVM.jl if Pkg did effectively update to an incompatible version?

@SimonDanisch

Weird! Yeah not sure what's going on ;)

@denizyuret (Owner)

@SimonDanisch how can I repeat your test with CLArrays? I guess I am confused about the relationship between GPUArrays, CLArrays, CuArrays etc.

@denizyuret (Owner)

I can confirm that, as @SimonDanisch suggested, commenting out https://github.com/JuliaGPU/CuArrays.jl/blob/master/src/CuArrays.jl#L13 (i.e. include("indexing.jl")) stopped the error, and the test now runs fine.
@maleadt can we use one of these indexing operations as the proof of concept for generating and loading code?

@SimonDanisch

Well, both inherit from GPUArrays.GPUArray, which is to GPU arrays what Base.AbstractArray is to CPU arrays.

GPUArrays also contains hardware-independent kernels (e.g. for broadcasting and indexing), and in theory CuArrays and CLArrays should share all those kernels and interfaces.

Long story short: @MikeInnes wanted to take things slowly, so almost all code in CuArrays was kept for now, overriding the hardware-independent kernels from GPUArrays. The replacement with the GPUArrays kernels will come piece by piece.

But for every kernel in CuArrays there should already be a hardware-independent version in GPUArrays (which should offer the same functionality at the same execution speed).

So CLArrays itself doesn't contain a single kernel and uses GPUArrays' hardware-independent kernels exclusively - which are pretty much a mixture of the best kernels from CuArrays (rewritten to be hardware independent) and the old GPUArrays.

Regarding using CLArrays:

using CLArrays

a = rand(4,4,4)
@show summary(a)
@show summary(a[:,2,2])
@show summary(a[:,[2,4],2])
@show summary(a[:,[2,4],[1,3]])
@show summary(a[:,[2,4],:])
@show summary(a[:,2:4,:])
@show summary(a[:,1:2:4,:])
a = CLArray(a) 
@show summary(a)
@show summary(a[:,2,2])
@show summary(a[:,[2,4],2])
@show summary(a[:,[2,4],[1,3]])
@show summary(a[:,[2,4],:])
@show summary(a[:,2:4,:])
@show summary(a[:,1:2:4,:])

@denizyuret (Owner)

@SimonDanisch, @maleadt how can I see the PTX code behind these indexing operations with CLArrays or CuArrays? Is there a separate kernel compiled for each distinct getindex arg signature? How does the same PTX handle different index ranges or different index arrays?

@maleadt (Collaborator) commented Dec 5, 2017

You can use CUDAnative's @code_ptx or code_ptx, similar to @code_llvm and code_llvm (i.e. prepending it to an invocation, or passing a tuple type). Codegen behavior is pretty much identical to Base's, meaning new PTX code is generated for every new specialization.

EDIT: of course, @code_ptx needs to be prepended to a @cuda invocation, and doesn't work when stuck in front of e.g. summary. I guess that's a little inconvenient, and we might have to make a more convenient version of @code_... for that use case.
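
For example, with a hand-written kernel (a sketch using the Julia 0.6-era @cuda launch syntax):

using CuArrays, CUDAnative

function vadd(c, a, b)
    i = threadIdx().x
    @inbounds c[i] = a[i] + b[i]
    return nothing
end

a = CuArray(rand(Float32, 16)); b = CuArray(rand(Float32, 16)); c = similar(a)
CUDAnative.@code_ptx @cuda (1, 16) vadd(c, a, b)   # @code_ptx prepended to the launch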

@SimonDanisch

I don't really have a nice API for this, but you can quite easily access any compiled kernels like this:

julia> a[:,2,2] # compile an indexing operation
GPU: 4-element Array{Float64,1}:
 0.337476
 0.673492
 0.28916 
 0.260523

julia> CLArrays.compiled_functions # the compiled kernel cache
Dict{Any,Any} with 1 entry:
  (Ptr{Void} @0x00000000036b3760, GPUArrays.index_kernel, (CLArrays => CLArrays.CLFunction{GPUArrays.#index_kernel,Tuple{CLArrays.KernelState,CLArrays.CLArray{Float64,1},CLArrays.CLArray{Float64,3},Tupl…

julia> signature = first(CLArrays.compiled_functions)[1][2:end] # the signature of a compiled kernel
(GPUArrays.index_kernel, (CLArrays.KernelState, CLArrays.DeviceArray{Float64,1,Transpiler.CLIntrinsics.GlobalPointer{Float64}}, CLArrays.DeviceArray{Float64,3,Transpiler.CLIntrinsics.GlobalPointer{Float64}}, Tuple{UInt32,UInt32,UInt32}, Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64}))

julia> m = Transpiler.CLMethod(signature)
GPUArrays.index_kernel(CLArrays.KernelState, CLArrays.DeviceArray{Float64,1,Transpiler.CLIntrinsics.GlobalPointer{Float64}}, CLArrays.DeviceArray{Float64,3,Transpiler.CLIntrinsics.GlobalPointer{Float64}}, Tuple{UInt32,UInt32,UInt32}, Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64})

julia> Sugar.getast!(m) # compiled julia expr
quote 
    $(Expr(:inbounds, true))
    I_3::Int64::Int64
    I_2::Int64::Int64
    I_1::Int64::Int64
    is::Tuple{UInt32,UInt32,UInt32}::Tuple{UInt32,UInt32,UInt32}
    i::UInt32::UInt32
    _7::UInt32 = (GPUArrays.linear_index)(_2::CLArrays.KernelState)::UInt32
    if (>)(_7::UInt32, (length)(_3::CLArrays.DeviceArray{Float64,1,Transpiler.CLIntrinsics.GlobalPointer{Float64}})::UInt32)::Bool
        $(Expr(:return))
    end
    _8::Tuple{UInt32,UInt32,UInt32} = (GPUArrays.gpu_ind2sub)(_5::Tuple{UInt32,UInt32,UInt32}, _7::UInt32)::Tuple{UInt32,UInt32,UInt32}
    _9::Int64 = (getindex)((getfield)(_6::Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64}, field1)::Base.Slice{Base.OneTo{Int64}}, (Int64)((getfield)(_8::Tuple{UInt32,UInt32,UInt32}, s0)::UInt32)::Int64)::Int64
    _10::Int64 = (getindex)((getfield)(_6::Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64}, field2)::Int64, (Int64)((getfield)(_8::Tuple{UInt32,UInt32,UInt32}, s1)::UInt32)::Int64)::Int64
    _11::Int64 = (getindex)((getfield)(_6::Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64}, field3)::Int64, (Int64)((getfield)(_8::Tuple{UInt32,UInt32,UInt32}, s2)::UInt32)::Int64)::Int64
    _12::Float64::Float64::Float64
    _12::Float64 = (getindex)(_4::CLArrays.DeviceArray{Float64,3,Transpiler.CLIntrinsics.GlobalPointer{Float64}}, (Tuple{Int64,Int64,Int64}){_9::Int64, _10::Int64, _11::Int64}::Tuple{Int64,Int64,Int64})::Float64
    (setindex!)(_3::CLArrays.DeviceArray{Float64,1,Transpiler.CLIntrinsics.GlobalPointer{Float64}}, _12::Float64, (Tuple{UInt32}){_7::UInt32}::Tuple{UInt32})::Void
    $(Expr(:return))
    $(Expr(:inbounds, :pop))
end

julia> Sugar.getsource!(m) |> println # opencl code
void index_kernel_29(KernelState state, DeviceArray_double_1___global1double121 dest, DeviceArray_double_3___global1double121 src, uint3 idims, Tuple_Slice_OneTo_long_long_long Is)
{
    ;
    long I_3;
    long I_2;
    long I_1;
    uint3 is;
    uint i;
    i = linear_index_5(state);
    if(i > length_8(dest)){
        return;
    };
    is = gpu_ind2sub_15(idims, i);
    I_1 = getindex_16(Is.field1, (long)(is.s0));
    I_2 = getindex_17(Is.field2, (long)(is.s1));
    I_3 = getindex_17(Is.field3, (long)(is.s2));
    double _ssavalue_0;
    _ssavalue_0 = getindex_25(src, (long3){I_1, I_2, I_3});
    setindex9_28(dest, _ssavalue_0, (uint){i});
    return;
    ;
}

@denizyuret (Owner)

How about with CUDAnative or CuArrays?
So far I've got:

julia> @code_ptx a[:,2,2]
ERROR: AssertionError: func isa Core.Function
julia> @code_ptx getindex(a,:,2,2)
ERROR: ArgumentError: invalid device function getindex(CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global}, Colon, Int64, Int64): is a varargs function
julia> code_ptx(getindex, (CuArray,Colon,Int64,Int64))
ERROR: ArgumentError: invalid call to device function getindex(CuArrays.CuArray, Colon, Int64, Int64): passing abstract arguments

@SimonDanisch

For CuArrays, you need to find out the signature of the called kernel, maybe with ASTInterpreter2 or print statements in the indexing code. If it uses the GPUArrays indexing code, you can also insert a statement here: https://github.com/JuliaGPU/CuArrays.jl/blob/master/src/gpuarray_interface.jl#L61, since that's where all kernel calls should route through.
I think you can just put a CUDAnative.@code_ptx here:

function GPUArrays._gpu_call(f, A::CuArray, args::Tuple, blocks_threads::Tuple{T, T}) where T <: NTuple{N, Integer} where N
    blocks, threads = blocks_threads
    # Debug hack: dump the typed, LLVM, and PTX code for every kernel launch.
    CUDAnative.@code_typed @cuda (blocks, threads) f(CuKernelState(), args...)
    CUDAnative.@code_llvm @cuda (blocks, threads) f(CuKernelState(), args...)
    CUDAnative.@code_ptx @cuda (blocks, threads) f(CuKernelState(), args...)

    @cuda (blocks, threads) f(CuKernelState(), args...)
end

@maleadt (Collaborator) commented Dec 5, 2017

With Cassette, I think we could create an interface (eg. @device_code_llvm) that overdubs the call to the compiler and visualizes that code. For now, I'd rather enable the following print statement: https://github.com/JuliaGPU/CUDAnative.jl/blob/6ab4665dcd3a865c506c86a5d082eb979ceb48b7/src/jit.jl#L394

Doing so, you get to see the following:

julia> using CuArrays, CUDAnative

julia> a = rand(4,4,4)

julia> da = cu(a)

julia> da[:, 2,2]

DEBUG: (Re)compiling index_kernel(CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global}, Tuple{Int64,Int64,Int64}, Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64}) for capability 3.5.0

With this info, you can:

julia> CUDAnative.code_ptx(CuArrays.index_kernel, Tuple{CUDAnative.CuDeviceArray{Float32,1,CUDAnative.AS.Global}, CUDAnative.CuDeviceArray{Float32,3,CUDAnative.AS.Global}, Tuple{Int64,Int64,Int64}, Tuple{Base.Slice{Base.OneTo{Int64}},Int64,Int64}}; cap=v"3.5", kernel=true)

So yeah, you'll probably need a better interface if you want to use CUDAnative as a PTX compiler.

@denizyuret (Owner)

I got some code out with @SimonDanisch's hack; I'll see if I can use it. Different calls to the same function generate the same PTX code except for the symbols ptxcall_index_kernel_#####_param_#. I'll go learn some PTX now.

@maleadt (Collaborator) commented Dec 5, 2017

Before we delve into making CUDAnative behave like a static compiler, why can't you use the stack "as intended", i.e. to execute GPU code (where the JIT compilation to PTX is an implementation detail)? Installation issues for CUDAnative should be resolved for 1.0 (they already are on 0.7).
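
(For contrast, the "as intended" path is simply a sketch like the following, with the PTX JIT as a hidden implementation detail, assuming the GPUArrays indexing fallback discussed above is active:)

using CuArrays

da = cu(rand(Float32, 4, 4, 4))
db = da[:, [2, 4], :]   # first use JIT-compiles index_kernel; later calls hit the cache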

@denizyuret (Owner) commented Dec 6, 2017 via email

@ngphuoc (Author) commented Dec 19, 2017

Currently I only need this indexing:

function getindex(a::KnetArray{T,3}, i::Colon, j, k::Colon) where T
  s = size(a)
  a = permutedims(a, [1,3,2])             # (s1,s2,s3) -> (s1,s3,s2)
  a = reshape(a, (:,s[2]))                # flatten to (s1*s3, s2)
  a = a[:,j]                              # 2-D indexing with a colon and an index vector
  a = reshape(a, (s[1],s[3],length(j)))   # length(j), not s[2], so subsets of j work
  permutedims(a, [1,3,2])                 # back to (s1, length(j), s3)
end

Could you help me with its setindex! counterpart? (and its primitive if needed). Many thanks!
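
A sketch of such a counterpart, mirroring the permutedims trick above (untested; assumes KnetArray supports 2-D setindex! with a colon plus an index vector and copy! between arrays, and that v has size (size(a,1), length(j), size(a,3))):

function setindex!(a::KnetArray{T,3}, v, i::Colon, j, k::Colon) where T
  s = size(a)
  b = reshape(permutedims(a, [1,3,2]), (:,s[2]))                 # (s1*s3, s2)
  b[:,j] = reshape(permutedims(v, [1,3,2]), (:,length(j)))       # scatter the new columns
  copy!(a, permutedims(reshape(b, (s[1],s[3],s[2])), [1,3,2]))   # write the result back into a
  a
end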

@ngphuoc (Author) commented Dec 21, 2017

No worries. I just made it a function. It's interesting that using 2D indexing with sub2ind is 20 times faster than using permutedims in my case:

function getat(x::KnetArray{T,3}, i::Colon, j, k) where T
  I,J,K = size(x)
  k == (:) && (k = 1:K)
  x = reshape(x, (I, J*K))
  js = repeat(j, outer=length(k))   # column index within each slice
  ks = repeat(k, inner=length(j))   # slice index for each column
  ix = sub2ind((J,K), js, ks)       # linear column indices into the (I, J*K) matrix
  x = x[:,ix]
  reshape(x, (I, length(j), length(k)))
end
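
A quick sanity check of the same trick on a plain Array (Julia 0.6 sub2ind):

x = rand(Float32, 3, 4, 5)
j, k = [2, 4], [1, 3]
js = repeat(j, outer=length(k))   # [2, 4, 2, 4]
ks = repeat(k, inner=length(j))   # [1, 1, 3, 3]
ix = sub2ind((4, 5), js, ks)      # column indices into the reshaped (3, 20) matrix
@assert reshape(reshape(x, (3, 20))[:, ix], (3, 2, 2)) == x[:, j, k]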

@CarloLucibello (Collaborator)

Hi @ngphuoc, you could also take a look at PR #229, which amounts to:

function getindex(a::KnetArray, I...) 
    crange = CartesianRange(to_indices(a, I))
    linind = [sub2ind(size(a), t.I...) for t in crange]
    b = getindex(a, vec(linind))
    shape = size(crange) # TODO drop scalar dimension
    reshape(b, shape)
end

function setindex!(a::KnetArray, v, I...) 
    crange = CartesianRange(to_indices(a, I))
    linind = [sub2ind(size(a), t.I...) for t in crange]
    setindex!(a, v, vec(linind))
end
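
A quick check of that approach on a plain Array (sticking to unit ranges, which the CartesianRange constructor handles):

a = rand(3, 4, 5)
crange = CartesianRange(to_indices(a, (:, 2:3, 1:1)))
linind = [sub2ind(size(a), t.I...) for t in crange]
@assert reshape(a[vec(linind)], size(crange)) == a[:, 2:3, 1:1]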

@ngphuoc (Author) commented Dec 21, 2017

That's excellent! Thanks @CarloLucibello

@ngphuoc (Author) commented Dec 22, 2017

It gives the following error on Julia 0.6.2:

julia> a[:,i,:];
ERROR: MethodError: Cannot `convert` an object of type Tuple{Base.Slice{Base.OneTo{Int64}},Array{Int64,1},Base.Slice{Base.OneTo{Int64}}} to an object of type CartesianRange
This may have arisen from a call to the constructor CartesianRange(...),
since type constructors fall back to convert methods.
Stacktrace:
 [1] getindex(::Knet.KnetArray{Float32,3}, ::Colon, ::Array{Int64,1}, ::Colon) at ./REPL[11]:3

@CarloLucibello (Collaborator)

Supported indices are colons, scalars, and unit ranges for the time being.

@ngphuoc (Author) commented Dec 22, 2017

I found that calling sub2ind one index at a time is 300 times slower than calling it on all the indices at once in my case.
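
Roughly, the two styles being compared (J, K, js, ks as in getat above):

ix = sub2ind((J, K), js, ks)                                # all indices in one vectorized call
ix = [sub2ind((J, K), js[n], ks[n]) for n in 1:length(js)]  # one call per index: the slow variant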

@denizyuret (Owner)

The CuArrays fallbacks fix this in 85c4125.
