Inspired by #450 and #62, I thought I would see how things perform with the cuDF GPU-accelerated backend for Dask. Not an official submission, as it's not Java and also uses GPUs.
This was run on a workstation with a 12-core CPU and two NVIDIA RTX 8000 GPUs using the rapidsai/notebooks:23.12-cuda12.0-py3.10 container image. I know this is an extravagant hardware setup compared to the MacBooks used in other tests, so take that into consideration. It would be fun to run this on GPU servers in the cloud, as well as gaming PCs and gaming laptops, to see how different GPUs perform.
🔥 4.5 s ± 27.7 ms per loop (mean ± std. dev. of 4 runs, 3 loops each)
```python
import dask
import dask.dataframe as dd
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Start one Dask worker per GPU
client = Client(LocalCUDACluster())

# Tell Dask to use the GPU DataFrame backend
dask.config.set({"dataframe.backend": "cudf"})

df = dd.read_csv(
    "measurements.txt",
    sep=";",
    header=None,
    names=["station", "measure"],
)
df = df.groupby("station").agg(["min", "max", "mean"])
df.columns = df.columns.droplevel()
df = df.compute().to_pandas().sort_values("station")
```
I needed to cast the DataFrame back to Pandas before calling sort_values() because of a bug in cuDF. But sorting a ~400-row DataFrame is a small effort anyway, so I'm not sure fixing the bug would improve the time much.
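For a sense of scale (not a measurement from the original run): sorting a few hundred rows on the CPU takes a tiny fraction of a millisecond, so the to_pandas() round trip is unlikely to matter. A minimal timing sketch with pandas, using made-up station names:

```python
import time
import pandas as pd

# Hypothetical stand-in for the ~400-station aggregate result, shuffled
df = pd.DataFrame({
    "station": [f"station_{i:03d}" for i in range(400)],
    "min": 0.0,
    "max": 0.0,
    "mean": 0.0,
}).sample(frac=1)

start = time.perf_counter()
df = df.sort_values("station")
elapsed = time.perf_counter() - start
print(f"sorted {len(df)} rows in {elapsed * 1000:.3f} ms")
```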
I also experimented with rewriting the data generation for the GPU and managed to get that down to 24 seconds. It starts with the station seed data, then uses cuDF and CuPy to generate the random data in big chunks and append them to the file. This way we can quickly generate even larger datasets, like 10 billion rows.
```python
import cupy as cp
import cudf

n = 1_000_000_000  # Number of rows of data to generate
lookup_df = cudf.read_csv("lookup.csv")  # Load our lookup table of stations and their mean temperatures
std = 10.0  # We assume temperatures are normally distributed with a standard deviation of 10
chunksize = 2e8  # Set the number of rows to generate in one go (reduce this if you run into GPU RAM limits)
filename = "measurements.txt"  # Choose where to write to


def generate_chunk(filename, chunksize, std, lookup_df):
    """Generate some sample data based on the lookup table."""
    df = cudf.DataFrame({
        # Choose a random station from the lookup table for each row in our output
        # (the upper bound is exclusive, so len(lookup_df) includes the last station)
        "station": cp.random.randint(0, len(lookup_df), int(chunksize)),
        # Generate a normal distribution around zero for each row in our output.
        # Because the std is the same for every station we can adjust the mean for each row afterwards
        "measure": cp.random.normal(0, std, int(chunksize)),
    })

    # Offset each measurement by the station's mean value
    df.measure += df.station.map(lookup_df.mean_temp)

    # Round the temperature to one decimal place
    df.measure = df.measure.round(decimals=1)

    # Convert the station index to the station name
    df.station = df.station.map(lookup_df.station)

    # Append this chunk to the output file
    with open(filename, "a") as fh:
        df.to_csv(fh, sep=";", chunksize=10_000_000, header=False, index=False)


# Loop over chunks and generate data
for i in range(int(n / chunksize)):
    generate_chunk(filename, chunksize, std, lookup_df)
```
I tried this data generation method with just Pandas and NumPy, but it's way slower than the Java implementation.