-
Notifications
You must be signed in to change notification settings - Fork 368
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Updateindex #3401
base: main
Are you sure you want to change the base?
Updateindex #3401
Conversation
…te columns when `makeunique` is false. If `mergeduplicates` is passed a function then that function is invoked on the values of all duplicate columns and its return value is assigned to that named column.
will only be done two at a time.
As I have commented before - it will be super hard to review such a big PR. That is why I have recommended to split it into smaller PRs and merge them incrementally. But I will try to comment on this PR (however, take note that because it is so big it will be hard to properly review it and be sure that all issues are caught/discussed). As a side note - it seems that you did not use latest |
rename!(index(df), vals, makeunique=makeunique) | ||
makeunique::Bool=false, mergeduplicates::MergeDuplicates=nothing) | ||
if !makeunique && isa(mergeduplicates, Function) | ||
(new_columns, colindex) = process_updates(UpdateIndex(vals), _columns(df), mergeduplicates) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_columns
is not defined for general AbstractDataFrame
.
makeunique::Bool=false) | ||
rename!(index(df), vals, makeunique=makeunique) | ||
makeunique::Bool=false, mergeduplicates::MergeDuplicates=nothing) | ||
if !makeunique && isa(mergeduplicates, Function) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the docstring seems not to have been updated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in particular, I am not clear what rename!
/rename
should do when mergeduplicates
is passed.
# renaming columns of SubDataFrame has to clean non-note metadata in its parent | ||
_drop_all_nonnote_metadata!(parent(df)) | ||
return df | ||
end | ||
|
||
function rename!(idx::Index, new_index::Index) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
functions for Index
should be added in other/index.jl
# renaming columns of SubDataFrame has to clean non-note metadata in its parent | ||
_drop_all_nonnote_metadata!(parent(df)) | ||
return df | ||
end | ||
|
||
function rename!(idx::Index, new_index::Index) | ||
splice!(idx.names, 1:length(idx.names), new_index.names) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we first check that idx
and new_index
are independent?
@@ -353,9 +365,11 @@ julia> rename(uppercase, df) | |||
``` | |||
""" | |||
rename(df::AbstractDataFrame, vals::AbstractVector{Symbol}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
docstring update is missing
Wherever the `mergeduplicates` keyword argument is available it is either `nothing` or | ||
a `Function` that will be executed to combine duplicated columns (when `makeunique=false`) | ||
""" | ||
MergeDuplicates = Union{Nothing,Function} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am OK to add this definition, but then its docstring should be more precise I think.
will be combined by invoking the function with all values from those columns. | ||
e.g. `mergeduplicates=coalesce` will use the first non-missing value. Since `hcat` and | ||
`hcat!` are performed recursively for more than two frames, this `mergeduplicates` | ||
function will only combine two columns at a time. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not clear what happens if makeuniqe=true
and mergeduplicates is
Function`.
""" | ||
hcat(df::AbstractDataFrame...; | ||
makeunique::Bool=false, copycols::Bool=true) | ||
makeunique::Bool=false, mergeduplicates::MergeDuplicates=nothing, copycols::Bool=true) | ||
|
||
Horizontally concatenate data frames. | ||
|
||
If `makeunique=false` (the default) column names of passed objects must be unique. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this statement does not seem to be true after this PR.
if it is `true` a new unique name will be generated by adding a suffix, | ||
if it is `false` an error will be thrown unless a `mergeduplicates` functiom is provided. | ||
- `mergeduplicates` : defines what to do if `name` already exists in `df` and `makeunique` | ||
is false. It should be given a Function that combines the values of all of the duplicated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is false. It should be given a Function that combines the values of all of the duplicated | |
is false. It should be given a `Function` that combines the values of all of the duplicated |
if it is `false` an error will be thrown unless a `mergeduplicates` functiom is provided. | ||
- `mergeduplicates` : defines what to do if `name` already exists in `df` and `makeunique` | ||
is false. It should be given a Function that combines the values of all of the duplicated | ||
columns which will be passed as a varargs. The return value is used. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is not clear if the passed function takes elements of the columns iteratively or whole columns.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also it is not clear how things are processed if multiple duplicate columns are provided.
I have not finished reviewing the PR. I will try to get back to it later. |
Estimate on when this functionality will be merged into DataFrames.jl? Has this PR been replaced with yet another one? |
The is a replacement for #3366. It defines a new keyword
mergeduplicates
that can be set to aFunction
that combines values to merge columns instead of erroring when there are duplicate column names andmakeunique=false
.It is implemented in stages, the first of which creates a temporary struct
UpdateIndex
which is used to initialize aDataFrame
or represent a set of column names and columns that will be merged into the resultingDataFrame
in anhcat
orhcat!
operation. It then follows up with commits that extendpermutedims
and joins to resolve column clashes in the same fashion.