Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pivot_wider generated names not consistent #838

Open
blset opened this issue Jan 24, 2024 · 11 comments
Open

pivot_wider generated names not consistent #838

blset opened this issue Jan 24, 2024 · 11 comments

Comments

@blset
Copy link

blset commented Jan 24, 2024

Hello,

with version 0.7.2, but I think there is no change in 0.8

when using pivot wider and the names_prefix option the names generated are not consistent in at least two situations :
1 - a row variable going to column with only one modality is not always found in the name
2 - an indicator variable alone in values_from list is not found in the generated name

this is a problem when you want to dynamically manipulate the names afterwards, for instance to generate a header with span.


 d1 =
      Explorer.DataFrame.new(
        weekday: [
          "Monday",
          "Tuesday",
          "Wednesday",
          "Thursday",
          "Friday",
          "Monday",
          "Tuesday",
          "Wednesday",
          "Thursday",
          "Friday"
        ],
        team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
        hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16],
        score: [0, 9, 0, 0, 1, 5, 4, 6, 4, 6],
        league: ["L", "L", "L", "L", "L", "L", "L", "L", "L", "L"]
      )
  
+---------------------------------------------------------+
|       Explorer DataFrame: [rows: 10, columns: 5]        |
+-----------+----------+-----------+-----------+----------+
|  weekday  |   team   |   hour    |   score   |  league  |
| <string>  | <string> | <integer> | <integer> | <string> |
+===========+==========+===========+===========+==========+
| Monday    | A        | 10        | 0         | L        |
+-----------+----------+-----------+-----------+----------+
| Tuesday   | B        | 9         | 9         | L        |
+-----------+----------+-----------+-----------+----------+
| Wednesday | C        | 10        | 0         | L        |
+-----------+----------+-----------+-----------+----------+
| Thursday  | A        | 10        | 0         | L        |
+-----------+----------+-----------+-----------+----------+
| Friday    | B        | 11        | 1         | L        |
+-----------+----------+-----------+-----------+----------+


d2 = Explorer.DataFrame.pivot_wider(d1, "weekday", "hour", names_prefix: "weekday@")


names should be weekday@hour_weekday_Monday etc...
+----------------------------------------------------------------------------------------------------------------------------+
|                                         Explorer DataFrame: [rows: 9, columns: 8]                                          |
+----------+-----------+----------+----------------+-----------------+-------------------+------------------+----------------+
|   team   |   score   |  league  | weekday@Monday | weekday@Tuesday | weekday@Wednesday | weekday@Thursday | weekday@Friday |
| <string> | <integer> | <string> |   <integer>    |    <integer>    |     <integer>     |    <integer>     |   <integer>    |
+==========+===========+==========+================+=================+===================+==================+================+
| A        | 0         | L        | 10             |                 |                   | 10               |                |
+----------+-----------+----------+----------------+-----------------+-------------------+------------------+----------------+
| B        | 9         | L        |                | 9               |                   |                  |                |
+----------+-----------+----------+----------------+-----------------+-------------------+------------------+----------------+
| C        | 0         | L        |                |                 | 10                |                  |                |
+----------+-----------+----------+----------------+-----------------+-------------------+------------------+----------------+
| B        | 1         | L        |                |                 |                   |                  | 11             |
+----------+-----------+----------+----------------+-----------------+-------------------+------------------+----------------+
| C        | 5         | L        | 15             |                 |                   |                  |                |
+----------+-----------+----------+----------------+-----------------+-------------------+------------------+----------------+

as it is when you have more than one value column
  d3 =
      Explorer.DataFrame.pivot_wider(d1, "weekday", ["hour", "score"], names_prefix: "weekday@")

+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                                                                                                    Explorer DataFrame: [rows: 3, columns: 12]                                                                                                                                                    |
+----------+----------+-----------------------------+------------------------------+--------------------------------+-------------------------------+-----------------------------+------------------------------+-------------------------------+---------------------------------+--------------------------------+------------------------------+
|   team   |  league  | weekday@hour_weekday_Monday | weekday@hour_weekday_Tuesday | weekday@hour_weekday_Wednesday | weekday@hour_weekday_Thursday | weekday@hour_weekday_Friday | weekday@score_weekday_Monday | weekday@score_weekday_Tuesday | weekday@score_weekday_Wednesday | weekday@score_weekday_Thursday | weekday@score_weekday_Friday |
| <string> | <string> |          <integer>          |          <integer>           |           <integer>            |           <integer>           |          <integer>          |          <integer>           |           <integer>           |            <integer>            |           <integer>            |          <integer>           |
+==========+==========+=============================+==============================+================================+===============================+=============================+==============================+===============================+=================================+================================+==============================+
| A        | L        | 10                          | 14                           |                                | 10                            | 16                          | 0                            | 4                             |                                 | 0                              | 6                            |
+----------+----------+-----------------------------+------------------------------+--------------------------------+-------------------------------+-----------------------------+------------------------------+-------------------------------+---------------------------------+--------------------------------+------------------------------+
| B        | L        |                             | 9                            | 16                             |                               | 11                          |                              | 9                             | 6                               |                                | 1                            |
+----------+----------+-----------------------------+------------------------------+--------------------------------+-------------------------------+-----------------------------+------------------------------+-------------------------------+---------------------------------+--------------------------------+------------------------------+
| C        | L        | 15                          |                              | 10                             | 14                            |                             | 5                            |                               | 0                               | 4                              |                              |
+----------+----------+-----------------------------+------------------------------+--------------------------------+-------------------------------+-----------------------------+------------------------------+-------------------------------+---------------------------------+--------------------------------+------------------------------+

Now with League with only one modality

    d2p = Explorer.DataFrame.pivot_wider(d1, "league", "hour", names_prefix: "league@")

should be league@hour_league_L
+----------------------------------------------+
|  Explorer DataFrame: [rows: 10, columns: 4]  |
+-----------+----------+-----------+-----------+
|  weekday  |   team   |   score   | league@L  |
| <string>  | <string> | <integer> | <integer> |
+===========+==========+===========+===========+
| Monday    | A        | 0         | 10        |
+-----------+----------+-----------+-----------+
| Tuesday   | B        | 9         | 9         |
+-----------+----------+-----------+-----------+
| Wednesday | C        | 0         | 10        |
+-----------+----------+-----------+-----------+
| Thursday  | A        | 0         | 10        |
+-----------+----------+-----------+-----------+
| Friday    | B        | 1         | 11        |
+-----------+----------+-----------+-----------+

but is'ok if there was already weekday in the wide dimension, except no trace of "hour"

    values = d2.names -- ["team", "score", "league"]
    d5 = Explorer.DataFrame.pivot_wider(d2, "league", values, names_prefix: "league@")

+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                                            Explorer DataFrame: [rows: 9, columns: 7]                                                                            |
+----------+-----------+--------------------------------+---------------------------------+-----------------------------------+----------------------------------+--------------------------------+
|   team   |   score   | league@weekday@Monday_league_L | league@weekday@Tuesday_league_L | league@weekday@Wednesday_league_L | league@weekday@Thursday_league_L | league@weekday@Friday_league_L |
| <string> | <integer> |           <integer>            |            <integer>            |             <integer>             |            <integer>             |           <integer>            |
+==========+===========+================================+=================================+===================================+==================================+================================+
| A        | 0         | 10                             |                                 |                                   | 10                               |                                |
+----------+-----------+--------------------------------+---------------------------------+-----------------------------------+----------------------------------+--------------------------------+
| B        | 9         |                                | 9                               |                                   |                                  |                                |
+----------+-----------+--------------------------------+---------------------------------+-----------------------------------+----------------------------------+--------------------------------+
| C        | 0         |                                |                                 | 10                                |                                  |                                |
+----------+-----------+--------------------------------+---------------------------------+-----------------------------------+----------------------------------+--------------------------------+
| B        | 1         |                                |                                 |                                   |                                  | 11                             |
+----------+-----------+--------------------------------+---------------------------------+-----------------------------------+----------------------------------+--------------------------------+
| C        | 5         | 15                             |                                 |                                   |                                  |                                |
+----------+-----------+--------------------------------+---------------------------------+-----------------------------------+----------------------------------+--------------------------------+

@billylanchantin
Copy link
Contributor

Hi @blset,

I feel like what the names_prefix option is doing is reasonable. In the case where pivot_wider receives a single column for the values_from argument, there is no need to include a reference to the original column name in the new column names by default. It's only when you pass a list of columns for values_from that the new column names need to be disambiguated.

I also didn't follow what you meant by this:

this is a problem when you want to dynamically manipulate the names afterwards, for instance to generate a header with span.

If there's a specific issue you're having, could you post an example of that?

@blset
Copy link
Author

blset commented Jan 24, 2024

In the case pivot_wider receives a single column for the values_from, there is indeed no need to include a reference to the original column name, for what concerns the naming of columns without any ambiguity.

But for what concerns semantics, you loose a name, and you don't formally know anymore what is inside the table.
In my example d2 above, you don't know that the table contains hours.

the same for a variable with only one modality that gets in column.

I'm using pivot_wider to build a dynamic table constructor from dynamic sql.
the sql computes everything with regular group by,
then with pivot_wider i can afterwards position the group variables either in row (where there are by default) or in column.

then for the html rendering of the table, when there is one or more "in column" variables,
I need to compute the span for each floor of the hierarchy and other things, I need to reorder the columns, etc..
and for that I need the naming to be consistent, because it is then a simple function of the names of the dataframe.

the names after pivot_wider are consistent unless there is only one value_for or if there is a variable in column with one modality only (that disappears from the naming also in some circumstances). All these edge cases highly complicate post processing the names of the table

it would be so much easier if the naming was consistent.

And if you think about it, with two values_from, ambiguity could be removed by a simple 1 and 2 suffix. I f you use the names, that is because it helps remember what's inside the table. Well it's the same with one value_from, we need a clue for what's in the table that is encoded in the name.

thanks

@billylanchantin
Copy link
Contributor

billylanchantin commented Jan 24, 2024

I understand what you mean by losing information. But I feel like no prefix is a reasonable default for the single column case since you can always use the names_prefix option if you want. That's what it's there for:

require Explorer.DataFrame, as: DF

df = DF.new(
  weekday: ["Mon", "Tue", "Wed", "Thu", "Fri", "Mon", "Tue", "Wed", "Thu", "Fri"],
  team: ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
  hour: [10, 9, 10, 10, 11, 15, 14, 16, 14, 16],
  score: [0, 9, 0, 0, 1, 5, 4, 6, 4, 6],
  league: ["L", "L", "L", "L", "L", "L", "L", "L", "L", "L"]
)

Without names_prefix:

df |> DF.pivot_wider("weekday", "hour") |> DF.print(limit: :infinity)
# +---------------------------------------------------------------------+
# |              Explorer DataFrame: [rows: 9, columns: 8]              |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# |   team   | score |  league  |  Mon  |  Tue  |  Wed  |  Thu  |  Fri  |
# | <string> | <s64> | <string> | <s64> | <s64> | <s64> | <s64> | <s64> |
# +==========+=======+==========+=======+=======+=======+=======+=======+
# | A        | 0     | L        | 10    |       |       | 10    |       |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# | B        | 9     | L        |       | 9     |       |       |       |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# | C        | 0     | L        |       |       | 10    |       |       |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# | B        | 1     | L        |       |       |       |       | 11    |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# | C        | 5     | L        | 15    |       |       |       |       |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# | A        | 4     | L        |       | 14    |       |       |       |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# | B        | 6     | L        |       |       | 16    |       |       |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# | C        | 4     | L        |       |       |       | 14    |       |
# +----------+-------+----------+-------+-------+-------+-------+-------+
# | A        | 6     | L        |       |       |       |       | 16    |
# +----------+-------+----------+-------+-------+-------+-------+-------+

vs. with names_prefix

df |> DF.pivot_wider("weekday", "hour", names_prefix: "hour_") |> DF.print(limit: :infinity)
# +------------------------------------------------------------------------------------+
# |                     Explorer DataFrame: [rows: 9, columns: 8]                      |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# |   team   | score |  league  | hour_Mon | hour_Tue | hour_Wed | hour_Thu | hour_Fri |
# | <string> | <s64> | <string> |  <s64>   |  <s64>   |  <s64>   |  <s64>   |  <s64>   |
# +==========+=======+==========+==========+==========+==========+==========+==========+
# | A        | 0     | L        | 10       |          |          | 10       |          |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# | B        | 9     | L        |          | 9        |          |          |          |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# | C        | 0     | L        |          |          | 10       |          |          |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# | B        | 1     | L        |          |          |          |          | 11       |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# | C        | 5     | L        | 15       |          |          |          |          |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# | A        | 4     | L        |          | 14       |          |          |          |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# | B        | 6     | L        |          |          | 16       |          |          |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# | C        | 4     | L        |          |          |          | 14       |          |
# +----------+-------+----------+----------+----------+----------+----------+----------+
# | A        | 6     | L        |          |          |          |          | 16       |
# +----------+-------+----------+----------+----------+----------+----------+----------+

I'm not totally opposed to a default that works like:

DF.pivot_wider(df, "a", "b", prefix_names: "a_")
# or
DF.pivot_wider(df, "a", "b", prefix_names: "a_b_")
# or
DF.pivot_wider(df, "a", "b", prefix_names: "b_when_a_is_")

Since, if present, you could always achieve the current default by doing:

DF.pivot_wider(df, "a", "b", prefix_names: "")

But this behavior doesn't seem to be the norm across other libraries.

@blset
Copy link
Author

blset commented Jan 24, 2024

names_prefix is ok, as it is, the problem happens after the names_prefix fragment.

the desired pattern is well visible when you iterate several pivot_wider in a row, at stage 1

weekday@Monday should be weekday@hour_weekday_Monday

because that is the pattern for two successive pivot_wider in a row,
fully revealed starting from 2 values_from

league@weekday@hour_weekday_Monday_league_L league@weekday@score_weekday_Monday_league_L

the hour_weekday_value pattern, that is to say

originalvalueFromName_originalColumnName_originalColumnNameValue1

originalvalueFromName_originalColumnName_originalColumnNameValue2 etc..

does not seem to be in elixir, only the name prefix seems to be which is league@weekday@ in my examples

@billylanchantin
Copy link
Contributor

weekday@Monday should be weekday@hour_weekday_Monday

If you really need them to match, I think you can do this:

DF.pivot_wider(df, "weekday", "hour", prefix_names: "weekday@hour_weekday_")

Should be straightforward to do dynamically as well:

a = "weekday"
b = "hour"
DF.pivot_wider(df, a, b, prefix_names: "#{a}@#{b}_#{a}_")

@blset
Copy link
Author

blset commented Jan 25, 2024

I cannot do it manually because it is already done automatically, but not consistently...
moreover I don't know the values that are pulled from the row var and in what order they are used for the new names of columns

forget about the prefix
compare the names generated when there are at least two vars that go in column and only one
compare the names when there are at least two values_from and only one
compare the names when the variable that goes in column has at least two modalities and only one

and you will see that it is not consistent and not easely predictable
the simplest approach would be to generate the names consistently but it's maybe in the polar implementation

@blset
Copy link
Author

blset commented Jan 25, 2024

here is a resume without any resort to names_prefix which is not important for the consistency

in short to get consistent names, there is only one requirement:
when values_from is only one column, put the name of the values_from column in the resultant column name (even if not necessary to remove ambiguity) and put the name of the variable column as prefix to variable value all the time

valueFromName_columnName_columnValue etc..

this requirement is automatic when there is at least two values from columns.
Ok the end programmer could detect the one value column and tweak the names but it is tricky when the variable
in column has only one modality and you don't know it and you iterate further other variables in column

so for sanity it would be better to fix it from the source the pivot_wider names generating algorithm

here are some illustrations

d1 original table
note that league has only one modality

+-----------------------------------------------------------+
|         Explorer DataFrame: [rows: 2, columns: 6]         |
+----------+----------+-------+-------+----------+----------+
| weekday  |   team   | hour  | score |  league  |    id    |
| <string> | <string> | <s64> | <s64> | <string> | <string> |
+==========+==========+=======+=======+==========+==========+
| Monday   | A        | 10    | 0     | L        | row1     |
+----------+----------+-------+-------+----------+----------+
| Tuesday  | B        | 9     | 9     | L        | row2     |
+----------+----------+-------+-------+----------+----------+

d2 = Explorer.DataFrame.pivot_wider(d1, "weekday", "hour")

d2 weekday in column / hour: one variable 2 modalities 1 value
here the names generated should be  hour_weekday_Monday hour_weekday_Tuesday for consistency
+ -----------------------------------------------------------+
|         Explorer DataFrame: [rows: 2, columns: 6]         |
+----------+-------+----------+----------+--------+---------+
|   team   | score |  league  |    id    | Monday | Tuesday |
| <string> | <s64> | <string> | <string> | <s64>  |  <s64>  |
+==========+=======+==========+==========+========+=========+
| A        | 0     | L        | row1     | 10     |         |
+----------+-------+----------+----------+--------+---------+
| B        | 9     | L        | row2     |        | 9       |
+----------+-------+----------+----------+--------+---------+

d3 league in column/ hour: one variable 1 modality 1 value
d3 = Explorer.DataFrame.pivot_wider(d1, "league", "hour")
same as above hour_league_L
so the problem is not in the number of modalities of the variable (same pattern as with weekday)

+------------------------------------------------+
|   Explorer DataFrame: [rows: 2, columns: 5]    |
+----------+----------+-------+----------+-------+
| weekday  |   team   | score |    id    |   L   |
| <string> | <string> | <s64> | <string> | <s64> |
+==========+==========+=======+==========+=======+
| Monday   | A        | 0     | row1     | 10    |
+----------+----------+-------+----------+-------+
| Tuesday  | B        | 9     | row2     | 9     |
+----------+----------+-------+----------+-------+

d4 weakday in column / hour score: one variable 2 modalities 2 values
d4 = Explorer.DataFrame.pivot_wider(d1, "weekday", ["hour", "score"])

desired consistency, as soon as there are more than one value column
+----------------------------------------------------------------------------------------------------------------------------+
|                                         Explorer DataFrame: [rows: 2, columns: 7]                                          |
+----------+----------+----------+---------------------+----------------------+----------------------+-----------------------+
|   team   |  league  |    id    | hour_weekday_Monday | hour_weekday_Tuesday | score_weekday_Monday | score_weekday_Tuesday |
| <string> | <string> | <string> |        <s64>        |        <s64>         |        <s64>         |         <s64>         |
+==========+==========+==========+=====================+======================+======================+=======================+
| A        | L        | row1     | 10                  |                      | 0                    |                       |
+----------+----------+----------+---------------------+----------------------+----------------------+-----------------------+
| B        | L        | row2     |                     | 9                    |                      | 9                     |
+----------+----------+----------+---------------------+----------------------+----------------------+-----------------------+

d4b league in column / hour score: one variable 2 modalities 2 values
d4b = Explorer.DataFrame.pivot_wider(d1, "league", ["hour", "score"])
desired consistency, as soon as there are more than one value column
+-----------------------------------------------------------------+
|            Explorer DataFrame: [rows: 2, columns: 5]            |
+----------+----------+----------+---------------+----------------+
| weekday  |   team   |    id    | hour_league_L | score_league_L |
| <string> | <string> | <string> |     <s64>     |     <s64>      |
+==========+==========+==========+===============+================+
| Monday   | A        | row1     | 10            | 0              |
+----------+----------+----------+---------------+----------------+
| Tuesday  | B        | row2     | 9             | 9              |
+----------+----------+----------+---------------+----------------+

Now what happens when you iterate pivot_wider

d5 one variable 1 modality, then another variable 2 modalities, 2 values
d5 = Explorer.DataFrame.pivot_wider(d3, "weekday", ["L"])

should be hour_league_L from d3  then hour_league_L_weekday_Monday etc...
+------------------------------------------------+
|   Explorer DataFrame: [rows: 2, columns: 5]    |
+----------+-------+----------+--------+---------+
|   team   | score |    id    | Monday | Tuesday |
| <string> | <s64> | <string> | <s64>  |  <s64>  |
+==========+=======+==========+========+=========+
| A        | 0     | row1     | 10     |         |
+----------+-------+----------+--------+---------+
| B        | 9     | row2     |        | 9       |
+----------+-------+----------+--------+---------+

d6 one variable 2 modalities, then another variable 1 modality, many values
values_from = (d4.names -- d1.names) -- ["weekday"]
value_from =  ["hour_weekday_Monday", "hour_weekday_Tuesday", "score_weekday_Monday",
 "score_weekday_Tuesday"]

d6 = Explorer.DataFrame.pivot_wider(d4, "league", values_from)
values_from ok from d4 so it's ok because many values_from
+-----------------------------------------------------------------------------------------------------------------------------------------------------+
|                                                      Explorer DataFrame: [rows: 2, columns: 6]                                                      |
+----------+----------+------------------------------+-------------------------------+-------------------------------+--------------------------------+
|   team   |    id    | hour_weekday_Monday_league_L | hour_weekday_Tuesday_league_L | score_weekday_Monday_league_L | score_weekday_Tuesday_league_L |
| <string> | <string> |            <s64>             |             <s64>             |             <s64>             |             <s64>              |
+==========+==========+==============================+===============================+===============================+================================+
| A        | row1     | 10                           |                               | 0                             |                                |
+----------+----------+------------------------------+-------------------------------+-------------------------------+--------------------------------+
| B        | row2     |                              | 9                             |                               | 9                              |
+----------+----------+------------------------------+-------------------------------+-------------------------------+--------------------------------+



@billylanchantin
Copy link
Contributor

I think I better understand what you're saying. Your use case is that you have to infer the original column from the new column names?

That seems like it would be a little error prone. However, I think the team would consider a PR that changes what names_prefix defaults to so that it's consistent for the single and multi-column case.

And I apologize but I still don't follow the part about what you think should happen when you call pivot_wider multiple times. But perhaps some tests in the PR would help clarify.

@blset
Copy link
Author

blset commented Jan 26, 2024

Sorry for not being clear enough

when values from contains more than one variable (eg hour and score), all is ok, in the names generating process for weekday in column
hour_weekday_Monday score_weekday_Monday, etc...

what would be more consistent in my opinion is that even with a single variable in values_from you would have

hour_league_L instead of only L for a league variable with a single modality L and values_from hour

to pass several variables in column, you iterate, and put all new columns from previous step in the values_from list, which concatenates variable names

so starting from this



+-----------------------------------------------------------+
|         Explorer DataFrame: [rows: 2, columns: 6]         |
+----------+----------+-------+-------+----------+----------+
| weekday  |   team   | hour  | score |  league  |    id    |
| <string> | <string> | <s64> | <s64> | <string> | <string> |
+==========+==========+=======+=======+==========+==========+
| Monday   | A        | 10    | 0     | L        | row1     |
+----------+----------+-------+-------+----------+----------+
| Tuesday  | B        | 9     | 9     | L        | row2     |
+----------+----------+-------+-------+----------+----------+


league in column with values from hour would give names
weekday team score id hour_league_L (instead of only L because there is only one variable)

then iteration with weekday in column with values from hour_league_L would give names
team score id hour_League_L_weekday_Monday hour_League_L_weekday_Tuesday (instead of only weekday_Monday ``weekday_Tuesday currently because there is only one values_from variable)

you see that in the iteration process, if for some reason at one stage you get only a single values_from (because you ask a single one or because the previous step gave only one through a one modality variable) currently the names will be missing information.

what's more insidious, if you start with a two or more modality variable , since this gives many values_from variables, you will
not get any problem if further in the iteration you get a one modality variable, because you already carry several values_from variables and it cannot decrease.

But if you start with a one modality variable, you get stucked unless you also start with many values_from variables.


the remedy for all this is:

generate the name indicator_colvar_colvalue1 indicator_colvar_colvalue2 ...

 for pivot_wider(df, colvar, indicator) even in the case indicator is a single name and colvar has only one modality

the benefit is that you never loose track of what's in the table from the names, and for instance you can generate nice hierarchical html table of the dataframe instead of using the flat lengthy names

can you point the name generating process for pivot wider in the code ? it seems it's in polar.


 def pivot_wider(df, id_columns, names_from, values_from, names_prefix) do
    names_prefix_optional = unless names_prefix == "", do: names_prefix
    Shared.apply(:df_pivot_wider, [
      df.data,
      id_columns,
      names_from,
      values_from,
      names_prefix_optional
    ])
    |> Shared.create_dataframe()
  end

explorer/lib/explorer/polars_backend/shared.ex

  def apply(fun, args \\ []) do
    case apply(Native, fun, args) do
      {:ok, value} -> value
      {:error, error} -> raise runtime_error(error)
    end
  end

@billylanchantin
Copy link
Contributor

Ok, the iterated calls part was just another example of the inconsistency. Got it.

can you point the name generating process for pivot wider in the code ?

https://github.com/elixir-explorer/explorer/blob/main/native/explorer/src/dataframe.rs#L593-L648

Note that there's some post-processing done in our Rust code.

@blset
Copy link
Author

blset commented Jan 27, 2024

For what I can guess, not being familiar with rust, the name post processing in the Explorer Rust code is not at the root of the inconsistency
one can see the names_prefix in action but precisely, the inconsistency happens independently of names_prefix

moreover from the example in the polar doc one can see the one value_column case giving bare names

https://docs.pola.rs/user-guide/transformations/pivot/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants