pivot_wider generated names not consistent #838
Comments
Hi @blset, I feel like what the I also didn't follow what you meant by this:
If there's a specific issue you're having, could you post an example of that? |
When pivot_wider receives a single column in values_from, there is indeed no need to include the original column name in order to name the new columns without ambiguity. Semantically, though, you lose a name, and you no longer formally know what is inside the table. The same goes for a variable with only one modality that is moved into columns. I'm using pivot_wider to build a dynamic table constructor from dynamic SQL. For the HTML rendering of the table, when there are one or more "in column" variables, the names after pivot_wider are consistent unless there is only one values_from column, or there is an in-column variable with only one modality (which also disappears from the names in some circumstances). All these edge cases greatly complicate post-processing the table names; it would be much easier if the naming were consistent. And if you think about it, with two values_from columns, the ambiguity could be removed with a simple 1 and 2 suffix. If the names are used instead, it is because they help you remember what is inside the table. It is the same with a single values_from column: we need a clue, encoded in the name, for what is in the table. Thanks |
I understand what you mean by losing information. But I feel like no prefix is a reasonable default for the single column case since you can always use the names_prefix option:
require Explorer.DataFrame, as: DF
df = DF.new(
weekday: ["Mon", "Tue", "Wed", "Thu", "Fri", "Mon", "Tue", "Wed", "Thu", "Fri"],
team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16],
score: [0, 9, 0, 0, 1, 5, 4, 6, 4, 6],
league: ["L", "L", "L", "L", "L", "L", "L", "L", "L", "L"]
)
Without names_prefix:
df |> DF.pivot_wider("weekday", "hour") |> DF.print(limit: :infinity)
# +---------------------------------------------------------------------+
# |              Explorer DataFrame: [rows: 9, columns: 8]              |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# |   team   | score |  league  |  Mon  |  Tue  |  Wed  |  Thu  |  Fri  |
# | <string> | <s64> | <string> | <s64> | <s64> | <s64> | <s64> | <s64> |
# +==========+=======+==========+=======+=======+=======+=======+=======+
# |    A     |   0   |    L     |  10   |       |       |  10   |       |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# |    B     |   9   |    L     |       |   9   |       |       |       |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# |    C     |   0   |    L     |       |       |  10   |       |       |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# |    B     |   1   |    L     |       |       |       |       |  11   |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# |    C     |   5   |    L     |  15   |       |       |       |       |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# |    A     |   4   |    L     |       |  14   |       |       |       |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# |    B     |   6   |    L     |       |       |  16   |       |       |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# |    C     |   4   |    L     |       |       |       |  14   |       |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# |    A     |   6   |    L     |       |       |       |       |  16   |
# +----------+-------+----------+-------+-------+-------+-------+-------+
vs. with names_prefix:
df |> DF.pivot_wider("weekday", "hour", names_prefix: "hour_") |> DF.print(limit: :infinity)
# +------------------------------------------------------------------------------------+
# |                     Explorer DataFrame: [rows: 9, columns: 8]                      |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# |   team   | score |  league  | hour_Mon | hour_Tue | hour_Wed | hour_Thu | hour_Fri |
# | <string> | <s64> | <string> |  <s64>   |  <s64>   |  <s64>   |  <s64>   |  <s64>   |
# +==========+=======+==========+==========+==========+==========+==========+==========+
# |    A     |   0   |    L     |    10    |          |          |    10    |          |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# |    B     |   9   |    L     |          |    9     |          |          |          |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# |    C     |   0   |    L     |          |          |    10    |          |          |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# |    B     |   1   |    L     |          |          |          |          |    11    |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# |    C     |   5   |    L     |    15    |          |          |          |          |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# |    A     |   4   |    L     |          |    14    |          |          |          |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# |    B     |   6   |    L     |          |          |    16    |          |          |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# |    C     |   4   |    L     |          |          |          |    14    |          |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# |    A     |   6   |    L     |          |          |          |          |    16    |
# +----------+-------+----------+----------+----------+----------+----------+----------+
I'm not totally opposed to a default that works like:
DF.pivot_wider(df, "a", "b", names_prefix: "a_")
# or
DF.pivot_wider(df, "a", "b", names_prefix: "a_b_")
# or
DF.pivot_wider(df, "a", "b", names_prefix: "b_when_a_is_")
Since, if present, you could always achieve the current default by doing:
DF.pivot_wider(df, "a", "b", names_prefix: "")
But this behavior doesn't seem to be the norm across other libraries. |
names_prefix is OK as it is; the problem happens after the names_prefix fragment. The desired pattern is clearly visible when you iterate several pivot_wider calls in a row, at stage 1
because that is the pattern for two successive pivot_wider in a row,
the hour_weekday_value pattern, that is to say
does not seem to be in elixir, only the name prefix seems to be which is |
If you really need them to match, I think you can do this:
DF.pivot_wider(df, "weekday", "hour", names_prefix: "weekday@hour_weekday_")
Should be straightforward to do dynamically as well:
a = "weekday"
b = "hour"
DF.pivot_wider(df, a, b, names_prefix: "#{a}@#{b}_#{a}_") |
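As a rough sketch of the dynamic approach, the prefix could be chosen according to how many values_from columns are requested. `PivotPrefix` is a hypothetical helper, not part of Explorer, and it assumes (as described elsewhere in this thread) that the multi-column case already yields names of the form valuesFrom_namesFrom_value.

```elixir
defmodule PivotPrefix do
  @moduledoc "Hypothetical helper for illustration; not part of Explorer."

  # Single values_from column: the generated names are bare values ("Mon"),
  # so supply the "<values_from>_<names_from>_" part as a prefix.
  def names_prefix(names_from, [single]), do: "#{single}_#{names_from}_"

  # Several values_from columns: assumed (per this thread) to be named
  # "<values_from>_<names_from>_<value>" already, so add nothing.
  def names_prefix(_names_from, values_from) when is_list(values_from), do: ""
end

IO.inspect(PivotPrefix.names_prefix("weekday", ["hour"]))
IO.inspect(PivotPrefix.names_prefix("weekday", ["hour", "score"]))
```

The returned string would then be passed as the names_prefix option of DF.pivot_wider. This only covers the single values_from case; it does not address a one-modality variable dropping out of the names.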
I cannot do it manually because it is already done automatically, but not consistently... Forget about the prefix and you will see that it is not consistent and not easily predictable |
Here is a summary that does not resort to names_prefix, which is not important for the consistency. In short, to get consistent names there is only one requirement: valueFromName_columnName_columnValue, etc. This requirement is met automatically when there are at least two values_from columns. So, for sanity, it would be better to fix it at the source, in the pivot_wider name-generating algorithm. Here are some illustrations:
Now what happens when you iterate pivot_wider:
|
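To make the valueFromName_columnName_columnValue requirement concrete, here is a minimal sketch in plain Elixir of the names one would expect if the rule applied regardless of the number of values_from columns. `PivotNames.expected/3` is a hypothetical function written for illustration; it does not call Explorer and is not how Explorer or Polars generate names.

```elixir
defmodule PivotNames do
  @moduledoc "Hypothetical helper for illustration; not part of Explorer."

  # Builds "<values_from>_<names_from>_<value>" for every combination,
  # whether values_from is a single column name or a list of them.
  def expected(values_from, names_from, values) do
    for v <- List.wrap(values_from), value <- values do
      "#{v}_#{names_from}_#{value}"
    end
  end
end

IO.inspect(PivotNames.expected("hour", "weekday", ["Mon", "Tue"]))
IO.inspect(PivotNames.expected(["hour", "score"], "weekday", ["Mon", "Tue"]))
```

Under this rule the single-column call gives "hour_weekday_Mon" and "hour_weekday_Tue" rather than the bare "Mon" and "Tue".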
I think I better understand what you're saying. Your use case is that you have to infer the original column from the new column names? That seems like it would be a little error prone. However, I think the team would consider a PR that changes what And I apologize but I still don't follow the part about what you think should happen when you call |
Sorry for not being clear enough. When values_from contains more than one variable (e.g. hour and score), all is OK in the name-generating process for weekday in column. What would be more consistent, in my opinion, is that even with a single variable in values_from you would have
To pass several variables in column, you iterate and put all the new columns from the previous step in the values_from list, which concatenates the variable names. So starting from this
league in column with values from hour would give names then iteration with weekday in column with values from You see that in the iteration process, if for some reason you get only a single values_from at one stage (because you asked for a single one, or because the previous step produced only one through a one-modality variable), the names will currently be missing information. What's more insidious, if you start with a variable with two or more modalities, since this gives many values_from variables, you will But if you start with a one-modality variable, you get stuck unless you also start with many values_from variables.
The benefit is that you never lose track of what's in the table from the names, and, for instance, you can generate a nice hierarchical HTML table of the dataframe instead of using the flat, lengthy names. Can you point me to the name-generating process for pivot_wider in the code? It seems it's in Polars.
|
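The hierarchical header idea mentioned above could look roughly like the sketch below, which groups flat names into a top-level label with a column span. `HeaderSpans` is a hypothetical helper, and it assumes every name follows the valuesFrom_namesFrom_value pattern and that the names_from column is known up front.

```elixir
defmodule HeaderSpans do
  @moduledoc "Hypothetical helper for illustration; not part of Explorer."

  # Assumes every name looks like "<values_from>_<names_from>_<value>".
  # Returns one {values_from, span, values} tuple per run of adjacent columns.
  def spans(names, names_from) do
    names
    |> Enum.map(fn name ->
      # A bare name such as "Mon" fails this match, which is the problem
      # discussed in this thread.
      [values_from, value] = String.split(name, "_#{names_from}_", parts: 2)
      {values_from, value}
    end)
    |> Enum.chunk_by(fn {values_from, _value} -> values_from end)
    |> Enum.map(fn [{values_from, _} | _] = group ->
      {values_from, length(group), Enum.map(group, &elem(&1, 1))}
    end)
  end
end

names = ["hour_weekday_Mon", "hour_weekday_Tue", "score_weekday_Mon"]
IO.inspect(HeaderSpans.spans(names, "weekday"))
```

Each tuple maps to one header cell whose span is the second element, with the individual values forming the row beneath it.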
Ok, the iterated calls part was just another example of the inconsistency. Got it.
https://github.com/elixir-explorer/explorer/blob/main/native/explorer/src/dataframe.rs#L593-L648 Note that there's some post-processing done in our Rust code. |
From what I can guess, not being familiar with Rust, the name post-processing in the Explorer Rust code is not at the root of the inconsistency. Moreover, from the example in the Polars docs one can see the single values column case giving bare names. |
Hello,
With version 0.7.2 (but I think there is no change in 0.8), when using pivot_wider and the names_prefix option, the generated names are not consistent in at least two situations:
1 - a row variable going to column with only one modality is not always found in the name
2 - an indicator variable alone in the values_from list is not found in the generated name
This is a problem when you want to manipulate the names dynamically afterwards, for instance to generate a header with spans.