Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Should addcol-window pad first list with None to indicate no rows above first row? #2279

Closed
daviewales opened this issue Jan 31, 2024 · 12 comments
Labels

Comments

@daviewales
Copy link
Contributor

Small description

Given a column of values, I want to create a column containing the values of this column from the previous row.

test
A
B
C
D

I press w, and input 1 0 to create a column with 1 previous value, and 0 subsequent values:

image

I now create a new column with the previous values using =test_window[0]:
This yields exactly what I want, except in the first row.
Rather than the previous (NULL) value, I get the value of the current row:

image

Expected result

I expect the window list to always contain 2 values, and be padded with None when the previous value does not exist, such as for the first window record. This might look like the following:

image

Steps to reproduce with sample data and a .vd

test-window.zip

Additional context
Please include the version of VisiData and Python.

saul.pw/VisiData v3.0.1
Python 3.10.12

@saulpw
Copy link
Owner

saulpw commented Sep 5, 2024

I think you're right, @daviewales. Thanks for putting the windowing feature through its paces!

@saulpw saulpw closed this as completed Sep 5, 2024
@saulpw saulpw added fixed wart Little warts and quirky bumps labels Sep 5, 2024
@daviewales
Copy link
Contributor Author

Thanks!

@midichef
Copy link
Contributor

midichef commented Sep 9, 2024

I relied on the old behavior to detect the first or last values in the sheet, checking when the window list becomes 1 element shorter. It's 1 element shorter at the first row for windows like w 1 0, or at the last row for windows like w 0 1, or both, for windows like w 1 1.

An example where I used this was to rank columns by groups. However, that specific use case will be made unnecessary by #2417

So here is a good place to mention alternate methods to test for the first/last rows.
To identify the first row, use addcol-incr (by pressing i) and test for rows where A == 1, by making a column expression: = A == 1.
And to identify the last row: A == len(sheet.rows). After that you may want to run freeze-col by pressing '. Otherwise the True value for the last row will move or disappear when rows above it are inserted/deleted.

@daviewales
Copy link
Contributor Author

@midichef, reminds me of:
https://xkcd.com/1172/

@thejud
Copy link
Contributor

thejud commented Oct 9, 2024

Is there any chance this could be undone? This behavior is not like other tools that provide window functions, for example neither postgres nor pandas pad the window. It makes it quite difficult to determine how many values were actually available for the window when it was created. You can rely on tricks like adding a row ID, and comparing the row ID to the windows size. However, it is dependent on the row id being kept in every instance where you want to evaluate the window function.

Trying to inspect the window itself to figure it out is a problem as well, since we can't determine if padded value is there because of some conversion that resulted in None, or because of a missing value.

Why it matters: Computations that rely on the number of items, (avg/mean in particular) will not work correctly.

@saulpw
Copy link
Owner

saulpw commented Oct 9, 2024

Well, we now have 2 votes for the old behavior (fewer values for the first and last rows), and 1 vote for the new behavior (None-padded to the same number of values for each row). I'm inclined to revert this change as requested (sorry @daviewales), and I'd like to come to a final decision before the release, which will happen within the next week. Please register your vote on this comment; 👍 to revert to the old behavior and 👎 to keep the new behavior.

@saulpw saulpw reopened this Oct 9, 2024
@saulpw saulpw added 3.1 on deck and removed bug fixed wart Little warts and quirky bumps labels Oct 9, 2024
@saulpw saulpw changed the title addcol-window should pad first list with None to indicate no rows above first row Should addcol-window pad first list with None to indicate no rows above first row? Oct 9, 2024
@daviewales
Copy link
Contributor Author

For my specific use case, I was trying to essentially create a lag function. For example, see the SQLite docs:

The first form of the lag() function returns the result of evaluating expression expr against the previous row in the partition. Or, if there is no previous row (because the current row is the first), NULL.

I can see the benefit of the window function returning only the actual values without padding, to ensure that aggregation functions work correctly.

Perhaps we could add a lag (and lead) function to support my use case too?

@thejud
Copy link
Contributor

thejud commented Oct 9, 2024

One more surprise with the current implementation is that if you specify a windows size larger than the total number of rows in the table, it will gladly pad up to your specified size. For example, to create a cumulative sum, I add a window >= the number of rows in the table. If you make the before rows a LOT bigger, it will be padded with that many Nones.

@thejud
Copy link
Contributor

thejud commented Oct 9, 2024

@daviewales I was experimenting getting the delta between rows.
I created a window with a of size 1, 0, which puts both the current and previous (if present) value into the window. Then, for instance for a delta (lag of 1), I add an expression column (with a window named win) like:

=win[1] - win[0] if len(win) > 1 else None

Does that work? It doesn't handle the partitioning part of a lag function, however.

@daviewales
Copy link
Contributor Author

daviewales commented Oct 9, 2024

Yes. If we revert to the old behaviour, I could probably just do:

win[0] if len(win) > 1 else None

@daviewales
Copy link
Contributor Author

I had a play, and came up with a basic lag function for my .visidatarc:

from visidata.features.window import WindowColumn

def lag(col):
    return col[0] if len(col) > 1 else None

class LagColumn(WindowColumn):
    def getValue(self, row):
        window = self.windowrows.get(id(row), None)
        return lag(window)

@Sheet.api
def addcol_lag(sheet, curcol):
    before, after = 1, 0
    newcol = LagColumn(
        curcol.name+"_lag",
        sourcecol=curcol,
        before=before,
        after=after
    )
    sheet.addColumnAtCursor(newcol)

Sheet.addCommand('', 'addcol-lag', 'addcol_lag(cursorCol)', 'add column where each row contains the value from the previous row of the current column')

Naively hacked up from here:

class WindowColumn(Column):

saulpw added a commit that referenced this issue Oct 10, 2024
@saulpw
Copy link
Owner

saulpw commented Oct 10, 2024

Okay, I'm reverting this commit, if only to have less churn from release to release. Thanks to everyone for chiming in!

@saulpw saulpw closed this as completed Oct 10, 2024
@saulpw saulpw removed the on deck label Oct 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants