
Get list of valid parquet files in directory #795

Open
pyrito opened this issue Aug 10, 2022 · 7 comments

pyrito commented Aug 10, 2022

Every engine handles valid parquet files in a directory differently. PyArrow has this property that allows users to get a list of absolute paths in the Dataset source (see here). Could we do something similar for fastparquet?

@yohplala

Hi @pyrito,
If I understand correctly, fastparquet already supports this, though perhaps not quite directly.

import fastparquet as fp
import pandas as pd
from os import path as os_path

# Example data: a hive-style dataset split into three files
df = pd.DataFrame({'a': range(6)})
pq_path = os_path.expanduser('~/Documents/code/data/pq_test')
fp.write(pq_path, df, row_group_offsets=[0, 2, 4], file_scheme='hive')
pf = fp.ParquetFile(pq_path)

# Get the base path (prefix it to the file paths below)
In [19]: pf.basepath
Out[19]: '/home/yoh/Documents/code/data/pq_test'

# Get the list of file paths, relative to basepath
my_path = [rg.columns[0].file_path for rg in pf.row_groups]

In [21]: my_path
Out[21]: ['part.0.parquet', 'part.1.parquet', 'part.2.parquet']

You could also use a built-in function; the result is the same, but without duplicates when several row groups are in the same file.

from fastparquet.api import row_groups_map

rg_map = list(row_groups_map(pf.row_groups).keys())

In[23]: rg_map
Out[23]: ['part.0.parquet', 'part.1.parquet', 'part.2.parquet']

If you are using partitions, the partition names will show up in the file paths as well, as in the PyArrow example you linked.
Best regards,

PS: if you would like the documentation to state this, I am sure a PR about it would be welcome ;)
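To turn the relative row-group paths above into absolute file paths, one can join them onto `pf.basepath`. The helper below is a hypothetical sketch (not part of fastparquet), taking `pf.basepath` and the per-row-group `file_path` values:

```python
from os import path

def absolute_file_paths(basepath, row_group_paths):
    """Join a dataset's basepath onto each relative row-group file path.

    Hypothetical helper, not a fastparquet API. Duplicates are dropped,
    since several row groups can live in the same file.
    """
    unique = list(dict.fromkeys(row_group_paths))  # order-preserving dedup
    return [path.join(basepath, p) for p in unique]
```

For the hive-style dataset above, this would yield three paths under the `pq_test` directory.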

@yohplala

PPS: if this answers your request, feel free to close the ticket.


pyrito commented Aug 10, 2022

@yohplala thank you for the quick response! I don't think this would work for every case. For example, if I do something like this:

import pandas
import fastparquet
import numpy as np

df = pandas.DataFrame(np.random.randint(0, 100, size=(int(2**18), 2**8))).add_prefix('col')
df.to_parquet("testing/test.parquet")
# This works as expected
df = pandas.read_parquet("testing/test.parquet", engine='fastparquet')

f = fastparquet.ParquetFile("testing/test.parquet")

In [30]: f.basepath
Out[30]: 'testing/test.parquet'

In [28]: my_path = [rg.columns[0].file_path for rg in f.row_groups]

# This should still contain at least `test.parquet`
In [29]: my_path
Out[29]: [None]


pyrito commented Aug 10, 2022

The logic is already kind of implemented here: https://github.com/dask/fastparquet/blob/main/fastparquet/api.py#L151-L155

@martindurant

You are right, that is exactly the logic that is used, and I don't mind it being moved or replicated in a utility function. However, fastparquet always allows you to pass a single data-file path or a list of paths, and in that case it reads them unmodified, without any filename filtering. This is what happens in your example (the .file_path attributes are paths relative to the root directory of a dataset, but in this case there is no directory).


pyrito commented Aug 10, 2022

@martindurant that makes sense. You mention an important caveat, but I think it would still be helpful to have the file list available as an attribute or through a utility function.

@martindurant

Note that for the single-file case, you do have the path available as the .fn attribute. In the case of multi-file datasets, this will be the effective root of the dataset.
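Combining the two observations (relative `.file_path` values for multi-file datasets, `None` with `.fn` pointing at the file itself for the single-file case), a file-listing utility could be sketched as below. `data_files` is a hypothetical helper, not a fastparquet API; it would take the values of `pf.fn`, `pf.basepath`, and the row-group `file_path`s:

```python
import os

def data_files(fn, basepath, row_group_paths):
    """Return the dataset's data files as paths.

    Hypothetical sketch, not part of fastparquet. For a multi-file
    (hive) dataset the row groups carry relative file_path values;
    for a single file they are all None and fn is the file itself.
    """
    rel = [p for p in dict.fromkeys(row_group_paths) if p is not None]
    if not rel:
        # Single-file case: fn already points at the data file
        return [fn]
    return [os.path.join(basepath, p) for p in rel]
```

This would return `['testing/test.parquet']` for the single-file example above, and the three `part.*.parquet` paths for the hive dataset.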
