Get list of valid parquet files in directory #795
Hi @pyrito,

```python
import os

import fastparquet as fp
import pandas as pd

# Example data
df = pd.DataFrame({'a': range(6)})
pq_path = os.path.expanduser('~/Documents/code/data/pq_test')
fp.write(pq_path, df, row_group_offsets=[0, 2, 4], file_scheme='hive')
pf = fp.ParquetFile(pq_path)

# Get the base path (make sure to prefix it to subsequent file paths)
pf.basepath
# '/home/yoh/Documents/code/data/pq_test'

# Get the list of file paths
my_path = [rg.columns[0].file_path for rg in pf.row_groups]
my_path
# ['part.0.parquet', 'part.1.parquet', 'part.2.parquet']
```

You could also use a built-in function; the result will be the same, without duplicates in case several row groups are in the same file.

```python
from fastparquet.api import row_groups_map

rg_map = list(row_groups_map(pf.row_groups).keys())
rg_map
# ['part.0.parquet', 'part.1.parquet', 'part.2.parquet']
```

If you are using partitions, partition names will show up in the file paths as well, as in the PyArrow example you provide.

PS: if you would like the documentation to state this, please go ahead; I am sure a PR about this will be welcome ;)
PPS: if this answers your request, please feel free to close the ticket.
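Putting the two snippets above together, a small helper could prepend `basepath` to the relative paths and drop duplicates. This is a sketch, not part of fastparquet; the function name `full_file_paths` is hypothetical, and it assumes a `ParquetFile` written with `file_scheme='hive'`, where each row group carries a relative `file_path`:

```python
import os


def full_file_paths(pf):
    """Return the full paths of the data files backing a ParquetFile.

    Assumes `pf` is a fastparquet.ParquetFile whose row groups carry
    relative `file_path` entries (hive file scheme). Duplicates from
    several row groups sharing one file are removed, order preserved.
    """
    seen = []
    for rg in pf.row_groups:
        rel = rg.columns[0].file_path
        if rel is not None and rel not in seen:
            seen.append(rel)
    return [os.path.join(pf.basepath, rel) for rel in seen]
```

With the example dataset above, this would yield paths like `/home/yoh/Documents/code/data/pq_test/part.0.parquet`.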
@yohplala thank you for the quick response! I don't think this would work for every case. For example, if I do something like this:

```python
import fastparquet
import numpy as np
import pandas

df = pandas.DataFrame(np.random.randint(0, 100, size=(int(2**18), 2**8))).add_prefix('col')
df.to_parquet("testing/test.parquet")

# This works as expected
df = pandas.read_parquet("testing/test.parquet", engine='fastparquet')
f = fastparquet.ParquetFile("testing/test.parquet")

f.basepath
# 'testing/test.parquet'

my_path = [rg.columns[0].file_path for rg in f.row_groups]
# This should still contain at least `test.parquet`
my_path
# [None]
```
The logic is already kind of implemented here: https://github.com/dask/fastparquet/blob/main/fastparquet/api.py#L151-L155
You are right, that is exactly the logic that is used, and I don't mind it being moved or replicated in a utility function. However, fastparquet always allows you to pass a single data file path or a list of paths, and will in that case read them unmodified, without any filename filter. This is what happens in your example.
@martindurant that makes sense. You mention an important caveat, but I think it would still be helpful to have the file list saved as an attribute or exposed through another function.
Note that for the single-file case, you do have the path available as the […]
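The caveat discussed above can be folded into one best-effort helper. This is a sketch, not a fastparquet API; `data_files` and the `opened_path` parameter are hypothetical names, and `opened_path` (the path originally passed to `ParquetFile`) is needed precisely because in single-file mode the row groups carry no `file_path`:

```python
import os


def data_files(pf, opened_path):
    """Best-effort list of data files behind a fastparquet.ParquetFile.

    In single-file mode every row group's file_path is None, so the
    metadata alone cannot recover the file name; fall back to the path
    the caller opened. Otherwise, join relative paths onto basepath.
    """
    rels = {rg.columns[0].file_path for rg in pf.row_groups}
    if rels == {None}:
        # Single-file case: metadata has no per-file paths.
        return [opened_path]
    return sorted(os.path.join(pf.basepath, r) for r in rels if r is not None)
```

This mirrors the two situations in the thread: the hive-scheme dataset returns joined paths, while the `testing/test.parquet` example falls back to the opened path instead of `[None]`.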
Every engine handles valid parquet files in a directory differently. PyArrow has this property that allows users to get a list of absolute paths in the Dataset source (see here). Could we do something similar for fastparquet?