Help with complex groupby and combine in an efficient way #34

Open
bicepjai opened this issue Aug 24, 2021 · 0 comments
bicepjai commented Aug 24, 2021

Here is a sample dataset with 7.5 million rows:

using DataFrames, Dates

# some ids have a single date entry
data = DataFrame(
    "unique_id" => collect(1:1500000),
    "datestamp" => fill(Date("2021-08-16"), 1500000),
)

# some ids have 2 date entries
data = vcat(data, DataFrame(
    "unique_id" => repeat(1500001:4500000, 2),
    "datestamp" => vcat(fill(Date("2021-08-16"), 3000000), fill(Date("2021-08-22"), 3000000)),
))

# lots of real-valued columns that need to be processed
real_number_columns = [string(c) * "_" * string(i) for c in 'a':'z' for i in 1:4]
for c in real_number_columns
    data[!, c] = rand(7500000)
end

# a categorical dimension column
data[!, "dimension"] = rand([string(c) for c in 'A':'Z'], 7500000)
size(data)

This gives us a dataset of size (7500000, 107).

I am looking to process this large dataset faster by whatever means possible; currently it takes around 40 minutes without any parallel processing. The resulting data frame should have (a toy sketch of the layout follows this list):

  1. unique_id
  2. each of real_number_columns with the suffix "_DIFF": the difference between the two values when the unique_id has 2 datestamps, otherwise missing
  3. each of real_number_columns with the suffix "_RECENT": the most recent value, whether the unique_id has 1 or 2 datestamps
  4. just the most recent dimension
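
For concreteness, a minimal sketch of the target layout. The ids, values, and the single "a_1" column here are illustrative stand-ins for the full set of real_number_columns:

using DataFrames

# hypothetical 2-id illustration: id 1 has one datestamp, id 2 has two
expected = DataFrame(
    "unique_id"  => [1, 2],
    "a_1_RECENT" => [0.42, 0.87],            # most recent value of a_1
    "a_1_DIFF"   => [missing, 0.87 - 0.31],  # recent minus older, else missing
    "dimension"  => ["K", "C"],              # most recent dimension value
)

My current implementation is below.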
# get the difference between real_number_columns across dates if 2 dates exist
using Chain

df = @chain data begin
    transform!(:datestamp => ByRow(==(Date("2021-08-22"))) => :recent_date)
    groupby(:unique_id)
    combine(df1 -> begin
        df1_max = nothing
        df1_min = nothing
        try
            d = Dict()
            if nrow(df1) == 1
                d["day_diff"] = [0]
                df1_max = df1[1, :]
                df1_min = df1[1, :]
            else
                d["day_diff"] = [6]
                df1_max = df1[findfirst(df1.recent_date), :]
                df1_min = df1[findfirst(.!df1.recent_date), :]
            end
            d["dimension"] = [df1_max.dimension]
            for m in real_number_columns
                d[m * "_RECENT"] = [df1_max[m]]
                if nrow(df1) > 1
                    d[m * "_DIFF"] = [df1_max[m] - df1_min[m]]
                else
                    d[m * "_DIFF"] = [missing]
                end
            end
            DataFrame(d)
        catch e
            @error "Something went wrong" exception = (e, catch_backtrace())
            rethrow()
        end
    end)
end
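
One possible direction, as a sketch only (not benchmarked here): skip the per-group Dict and one-row DataFrame construction entirely, and express the reduction as column-wise source => function => destination pairs, which lets combine use its optimized aggregation path. This assumes the data is sorted so the most recent row comes last within each group; recent and diffval are helper names introduced for this sketch, and day_diff is computed from the actual datestamps rather than hard-coded to 6:

using DataFrames, Dates

# sort so the older datestamp comes first (and the most recent last) in each group
sort!(data, [:unique_id, :datestamp])

recent(x) = last(x)                                          # most recent value
diffval(x) = length(x) == 2 ? last(x) - first(x) : missing   # recent minus older

df = combine(groupby(data, :unique_id),
    :datestamp => (d -> length(d) == 2 ? Dates.value(last(d) - first(d)) : 0) => :day_diff,
    :dimension => last => :dimension,
    [c => recent => c * "_RECENT" for c in real_number_columns]...,
    [c => diffval => c * "_DIFF" for c in real_number_columns]...,
)

Starting Julia with multiple threads (e.g. julia -t auto) may help further, since recent DataFrames.jl versions can parallelize grouped operations across groups.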

Questions

  1. How can one make this processing more efficient?
  2. Can we trade memory for faster processing time (say, on a device with around 1 TB of RAM)?