Document custom generation of column names in manual #3430

Open
schlichtanders opened this issue Mar 8, 2024 · 9 comments · May be fixed by #3433

@schlichtanders

I am looking for a fix or workaround for using AsTable when several columns are transformed at once, i.e. with broadcasting via .=>.

I always get ERROR: ArgumentError: Duplicate column name(s) returned:

df = DataFrame(a = 1:10, b = 4:13)
function myextrema(a)
    ex = extrema(a)
    (min=ex[1], max=ex[2])
end

combine(df, :a => myextrema => AsTable)  # works 
combine(df, [:a, :b] .=> myextrema .=> AsTable)  # fails

The second call throws the following error:

ERROR: ArgumentError: Duplicate column name(s) returned: :min, :max
Stacktrace:
[1] select_transform!(::Base.RefValue{…}, df::DataFrame, newdf::DataFrame, transformed_cols::Set{…}, copycols::Bool, allow_resizing_newdf::Base.RefValue{…}, column_to_copy::BitVector)
@ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:838
[2] _manipulate(df::DataFrame, normalized_cs::Vector{Any}, copycols::Bool, keeprows::Bool)
@ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:1778
[3] manipulate(::DataFrame, ::Any, ::Vararg{Any}; copycols::Bool, keeprows::Bool, renamecols::Bool)
@ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:1698
[4] #manipulate#599
@ ~/.julia/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:1833 [inlined]
[5] combine(df::DataFrame, args::Any; renamecols::Bool, threads::Bool)
@ DataFrames ~/.julia/packages/DataFrames/58MUJ/src/abstractdataframe/selection.jl:1669
[6] top-level scope
@ REPL[125]:1

My ideal behaviour would be that AsTable prepends the column name, but of course this would be breaking.
Maybe there could be a PrependColName(AsTable) wrapper or something similar?

@bkamins
Member

bkamins commented Mar 8, 2024

This is the intended way to do it:

julia> combine(df, [:a, :b] .=> myextrema .=> x -> x .* ["_min", "_max"])
1×4 DataFrame
 Row │ a_min  a_max  b_min  b_max
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1     10      4     13

You can then even do just e.g.:

julia> combine(df, [:a, :b] .=> Ref∘extrema .=> x -> x .* ["_min", "_max"])
1×4 DataFrame
 Row │ a_min  a_max  b_min  b_max
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1     10      4     13
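
For anyone wondering what `Ref∘extrema` produces, here is a small sketch in plain Julia (no DataFrames needed). It only shows what the composed function returns; how `combine` then treats that value is covered in the DataFrames.jl documentation, so take this as an illustration rather than a full explanation.

```julia
# extrema returns a plain 2-element tuple
t = extrema(1:10)

# Ref∘extrema calls extrema first and then wraps the result in a Ref
r = (Ref∘extrema)(1:10)

println(t)          # the tuple itself
println(r[])        # the same tuple, read back out of the Ref
println(length(t))  # the tuple has two elements
println(length(r))  # the Ref is a one-element container
```

So the wrapped value is a single item that holds the whole (min, max) pair, instead of a collection of two separate values.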

@schlichtanders
Author

Thank you very much - I couldn't find such an example in the documentation.

I still don't understand why your second version works 😅.

This approach has the disadvantage that one needs to replicate which fields the transformation function has. It looks flexible and easy to understand, which is really great, but it also feels like duplication.

@bkamins
Member

bkamins commented Mar 8, 2024

  1. It is documented that to produce multiple columns you have to either pass AsTable or a vector of column names.
  2. It is documented that you can auto-generate the target column names using a function (to dynamically generate them). In this case the function takes source column names as input.
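
To illustrate point 2, here is a minimal sketch of such a renaming function in plain Julia. It assumes the source column name arrives as a string; `rename` is just an illustrative name, not something from DataFrames.jl.

```julia
# illustrative helper: build target column names from one source column name
rename = x -> x .* ["_min", "_max"]

# broadcasting treats the string as a scalar and appends each suffix
a_names = rename("a")
b_names = rename("b")
println(a_names)
println(b_names)
```

With `.=>`, a function like this is applied once per source column, which is why each column gets its own pair of names.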

> This approach has the disadvantage that one needs to replicate which fields the transformation function has.

Yes - this is a disadvantage. That is why I have commented that you do not have to pass these column names in the function (the example with Ref, which skips defining target column names).


We could allow for a function taking both "source column names" and "names returned by a function" and allowing combining them, but it seemed overly complex (i.e. the API would be hard for typical users to understand and learn). What I have given you was the most concise variant.

The variant that you want is available, and it avoids duplication, but the disadvantage is that the code is longer (so I thought that it is less interesting):

julia> using DataFrames

julia> df = DataFrame(a = 1:10, b = 4:13)
10×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      4
   2 │     2      5
   3 │     3      6
   4 │     4      7
   5 │     5      8
   6 │     6      9
   7 │     7     10
   8 │     8     11
   9 │     9     12
  10 │    10     13

julia> function myextrema(a)
           ex = extrema(a[1])
           n = propertynames(a)[1]
           (; Symbol(n, "_min") => ex[1], Symbol(n, "_max") => ex[2])
       end
myextrema (generic function with 1 method)

julia> combine(df, AsTable.([:a, :b]) .=> myextrema .=> AsTable) 
1×4 DataFrame
 Row │ a_min  a_max  b_min  b_max
     │ Int64  Int64  Int64  Int64
─────┼────────────────────────────
   1 │     1     10      4     13
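
The key step in `myextrema` is building a NamedTuple whose field names are computed at run time. A small sketch in plain Julia, assuming the function receives a one-column NamedTuple such as `(a = 1:10,)` as a stand-in for what `AsTable(:a)` passes in:

```julia
function myextrema(a)
    ex = extrema(a[1])        # (min, max) of the first (only) column
    n = propertynames(a)[1]   # name of that column as a Symbol
    # `(; name => value, ...)` builds a NamedTuple from computed names
    (; Symbol(n, "_min") => ex[1], Symbol(n, "_max") => ex[2])
end

nt = myextrema((a = 1:10,))
println(nt)
println(keys(nt))
```

Because the source column name is already part of the field names, the results for `:a` and `:b` no longer collide.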

@schlichtanders
Author

> 2. It is documented that you can auto-generate the target column names using a function (to dynamically generate them). In this case the function takes source column names as input.

Could an example be added to https://dataframes.juliadata.org/stable/man/working_with_dataframes/?
This was my source of truth and there I couldn't find it.

@bkamins
Member

bkamins commented Mar 9, 2024

There is an example in the docstring: https://dataframes.juliadata.org/stable/lib/functions/#DataFrames.combine. We could also add something to the intro manual. Could you propose something that you would find most useful?

bkamins changed the title from "AsTable is not compatible with .=>" to "Document custom generation of column names in manual" on Mar 9, 2024
bkamins added the doc label and removed the question label on Mar 9, 2024
bkamins added this to the 1.7 milestone on Mar 9, 2024
@schlichtanders
Author

I think placing it just below the .=> example within the combine section would be nice:

julia> combine(df, names(df) .=> sum, names(df) .=> prod)
1×4 DataFrame
 Row │ A_sum  B_sum    A_prod  B_prod
     │ Int64  Float64  Int64   Float64
─────┼─────────────────────────────────
   1 │    10     10.0      24     24.0

# this is new:
julia> combine(df, names(df) .=> Ref∘extrema .=> (c -> c .* ["_min", "_max"]))

Probably with a little extra explanation of what Ref is doing here (I haven't entirely understood why it is needed yet).

bkamins linked a pull request on Mar 27, 2024 that will close this issue
@bkamins
Member

bkamins commented Mar 27, 2024

See #3433 for an update of the manual. Of course please comment if something is not clear or should be improved.

@schlichtanders
Author

Looks especially good. Thank you for the detailed documentation improvement!
