Upcoming pandas (>2.2.0) raises "read-only" errors #919

Open
martindurant opened this issue Feb 7, 2024 · 3 comments

@martindurant
Member

Upcoming pandas no longer allows setting Series values in place. Thanks, pandas.

@jorisvandenbossche
Member

You're welcome!

Returning read-only numpy arrays is one of the parts of the large CoW change (https://pandas.pydata.org/pdeps/0007-copy-on-write.html) that we are least certain about, so feedback from downstream developers is very welcome.

I assume the issue arises because you allocate an empty DataFrame first and then get "view" arrays to write into. For the index, in one of the code paths, that happens here:

views[col] = index.values

The return value of .values is now a read-only numpy array (https://pandas.pydata.org/docs/user_guide/copy_on_write.html#read-only-numpy-arrays). You know you just created this data yourself, so you can safely change its writeable flag to True as a workaround.
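The workaround described above could look roughly like the following. This is only a minimal sketch, not fastparquet's actual code: `writable_values` is a hypothetical helper name, and the zero-filled index stands in for whatever preallocated index the reader creates. The flag is only flipped when the array really is read-only, so the same code should also run on pandas versions that still return a writable array.

```python
import numpy as np
import pandas as pd


def writable_values(index: pd.Index) -> np.ndarray:
    """Hypothetical helper: return index.values with the writeable flag set."""
    arr = index.values
    if isinstance(arr, np.ndarray) and not arr.flags.writeable:
        # Only reasonable because the caller allocated this data itself,
        # so no other pandas object is expected to share the buffer.
        arr.flags.writeable = True
    return arr


# Stand-in for a preallocated, empty index that is filled in afterwards.
index = pd.Index(np.zeros(3, dtype="int64"))
arr = writable_values(index)
arr[:] = [10, 20, 30]  # in-place write that would fail on a read-only array
```

Flipping the flag back to `True` works here because the read-only array is a view on a buffer that is itself still writable; if the underlying buffer were read-only, numpy would refuse the change.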

And I suppose this only happens for the Index, because for columns you rely on the Block.values, where we didn't add this protection as this is regarded as internal anyway.


It's probably already covered by the failing tests in fastparquet's own test suite, but here are some tests that are failing on the pandas side (they have been skipped with CoW enabled for some time; we should have reported that earlier):

import datetime

import pandas as pd

# dataframe with a non-default (i.e. non-RangeIndex) index
df = pd.DataFrame({"A": [1, 2, 3]}, index=list("abc"))
df.to_parquet("test.parquet", engine="fastparquet")
pd.read_parquet("test.parquet", engine="fastparquet")

# probably same underlying issue; tz-aware datetime index
idx = [datetime.datetime.now(datetime.timezone.utc)] * 5
df = pd.DataFrame(index=idx, data={"index_as_col": idx})
df.to_parquet("test.parquet", engine="fastparquet")
pd.read_parquet("test.parquet", engine="fastparquet")

@martindurant
Member Author

Thanks for the info, @jorisvandenbossche. Any idea of the release timeline?

@jorisvandenbossche
Member

The current goal is April.
