
Get list of valid parquet files in directory #795

Open
pyrito opened this issue Aug 10, 2022 · 7 comments

pyrito commented Aug 10, 2022

Every engine handles valid parquet files in a directory differently. PyArrow has this property that allows users to get a list of absolute paths in the Dataset source (see here). Could we do something similar for fastparquet?

@yohplala

Hi @pyrito,
If I understand correctly, fastparquet already supports this, though perhaps not quite directly.

import fastparquet as fp
import pandas as pd
from os import path as os_path

# Example data: a hive-style dataset split into three files
df = pd.DataFrame({'a': range(6)})
pq_path = os_path.expanduser('~/Documents/code/data/pq_test')
fp.write(pq_path, df, row_group_offsets=[0, 2, 4], file_scheme='hive')
pf = fp.ParquetFile(pq_path)

# Get the base path (prefix it to the file paths below)
In [19]: pf.basepath
Out[19]: '/home/yoh/Documents/code/data/pq_test'

# Get the list of file paths, relative to basepath
my_path = [rg.columns[0].file_path for rg in pf.row_groups]

In [21]: my_path
Out[21]: ['part.0.parquet', 'part.1.parquet', 'part.2.parquet']

You could also use a built-in function; the result is the same, but without duplicates when several row groups are in the same file.

from fastparquet.api import row_groups_map

rg_map = list(row_groups_map(pf.row_groups).keys())

In[23]: rg_map
Out[23]: ['part.0.parquet', 'part.1.parquet', 'part.2.parquet']

If you are using partitions, the partition names will show up in the file paths as well, as in the PyArrow example you linked.
Best regards,

PS: if you would like the documentation to state this, I am sure a PR about it would be welcome ;)
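To turn the relative row-group paths above into absolute file paths, one can join them onto `pf.basepath`. The helper below is a hypothetical sketch (not part of fastparquet), taking `pf.basepath` and the per-row-group `file_path` values:

```python
from os import path

def absolute_file_paths(basepath, row_group_paths):
    """Join a dataset's basepath onto each relative row-group file path.

    Hypothetical helper, not a fastparquet API. Duplicates are dropped,
    since several row groups can live in the same file.
    """
    unique = list(dict.fromkeys(row_group_paths))  # order-preserving dedup
    return [path.join(basepath, p) for p in unique]
```

For the hive-style dataset above, this would yield three paths under the `pq_test` directory.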

@yohplala

PPS: if this answers your request, feel free to close the ticket.


pyrito commented Aug 10, 2022

@yohplala thank you for the quick response! I don't think this would work for every case. For example, if I do something like this:

import pandas
import fastparquet
import numpy as np

df = pandas.DataFrame(np.random.randint(0, 100, size=(int(2**18), 2**8))).add_prefix('col')
df.to_parquet("testing/test.parquet")
# This works as expected
df = pandas.read_parquet("testing/test.parquet", engine='fastparquet')

f = fastparquet.ParquetFile("testing/test.parquet")

In [30]: f.basepath
Out[30]: 'testing/test.parquet'

In [28]: my_path = [rg.columns[0].file_path for rg in f.row_groups]

# This should still contain at least `test.parquet`
In [29]: my_path
Out[29]: [None]


pyrito commented Aug 10, 2022

The logic is already kind of implemented here: https://github.com/dask/fastparquet/blob/main/fastparquet/api.py#L151-L155

@martindurant

You are right, that is exactly the logic that is used, and I don't mind it being moved or replicated in a utility function. However, fastparquet always allows you to pass a single data-file path or a list of paths, and in that case it reads them unmodified, without any filename filtering. This is what happens in your example (the .file_path attributes are paths relative to the root directory of a dataset, but in this case there is no directory).


pyrito commented Aug 10, 2022

@martindurant that makes sense. You mention an important caveat, but I think it would still be helpful to have the file list available as an attribute or through a utility function.

@martindurant

Note that for the single-file case, you do have the path available as the .fn attribute. In the case of multi-file datasets, this will be the effective root of the dataset.
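Combining the two observations (relative `.file_path` values for multi-file datasets, `None` with `.fn` pointing at the file itself for the single-file case), a file-listing utility could be sketched as below. `data_files` is a hypothetical helper, not a fastparquet API; it would take the values of `pf.fn`, `pf.basepath`, and the row-group `file_path`s:

```python
import os

def data_files(fn, basepath, row_group_paths):
    """Return the dataset's data files as paths.

    Hypothetical sketch, not part of fastparquet. For a multi-file
    (hive) dataset the row groups carry relative file_path values;
    for a single file they are all None and fn is the file itself.
    """
    rel = [p for p in dict.fromkeys(row_group_paths) if p is not None]
    if not rel:
        # Single-file case: fn already points at the data file
        return [fn]
    return [os.path.join(basepath, p) for p in rel]
```

This would return `['testing/test.parquet']` for the single-file example above, and the three `part.*.parquet` paths for the hive dataset.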
