Inspired by #450 and #62, I thought I would see how things perform with the cuDF GPU-accelerated backend for Dask. Not an official submission, as it's not Java and also uses GPUs.
This was run on a workstation with a 12-core CPU and two NVIDIA RTX 8000 GPUs using the rapidsai/notebooks:23.12-cuda12.0-py3.10 container image. I know this is an extravagant hardware setup compared to the MacBooks used in other tests, so take that into consideration. It would be fun to run this on GPU servers in the cloud, as well as gaming PCs and gaming laptops, to see how different GPUs perform.
🔥 4.5 s ± 27.7 ms per loop (mean ± std. dev. of 4 runs, 3 loops each)
```python
import dask
import dask.dataframe as dd
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# Start one Dask worker per GPU
client = Client(LocalCUDACluster())

# Tell Dask to use the GPU DataFrame backend
dask.config.set({"dataframe.backend": "cudf"})

df = dd.read_csv(
    "measurements.txt",
    sep=";",
    header=None,
    names=["station", "measure"],
)
df = df.groupby("station").agg(["min", "max", "mean"])
df.columns = df.columns.droplevel()
df = df.compute().to_pandas().sort_values("station")
```
I needed to cast the DataFrame back to Pandas before calling sort_values() because of a bug in cuDF. But sorting a ~400-row DataFrame is a small effort anyway, so I'm not sure fixing the bug would improve the time much.
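For a sense of scale (not a measurement from the original run): sorting a few hundred rows on the CPU takes a tiny fraction of a millisecond, so the to_pandas() round trip is unlikely to matter. A minimal timing sketch with pandas, using made-up station names:

```python
import time
import pandas as pd

# Hypothetical stand-in for the ~400-station aggregate result, shuffled
df = pd.DataFrame({
    "station": [f"station_{i:03d}" for i in range(400)],
    "min": 0.0,
    "max": 0.0,
    "mean": 0.0,
}).sample(frac=1)

start = time.perf_counter()
df = df.sort_values("station")
elapsed = time.perf_counter() - start
print(f"sorted {len(df)} rows in {elapsed * 1000:.3f} ms")
```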
I also experimented with rewriting the data generation for the GPU and managed to get that down to 24 seconds. It starts with the station seed data, then uses cuDF and CuPy to generate the random data in big chunks and append them to the file. This way we can quickly generate even larger datasets, like 10 billion rows.
```python
import cupy as cp
import cudf

n = 1_000_000_000  # Number of rows of data to generate
lookup_df = cudf.read_csv("lookup.csv")  # Load our lookup table of stations and their mean temperatures
std = 10.0  # We assume temperatures are normally distributed with a standard deviation of 10
chunksize = 2e8  # Set the number of rows to generate in one go (reduce this if you run into GPU RAM limits)
filename = "measurements.txt"  # Choose where to write to


def generate_chunk(filename, chunksize, std, lookup_df):
    """Generate some sample data based on the lookup table."""
    df = cudf.DataFrame({
        # Choose a random station from the lookup table for each row in our output
        # (the upper bound is exclusive, so len(lookup_df) includes the last station)
        "station": cp.random.randint(0, len(lookup_df), int(chunksize)),
        # Generate a normal distribution around zero for each row in our output.
        # Because the std is the same for every station we can adjust the mean for each row afterwards
        "measure": cp.random.normal(0, std, int(chunksize)),
    })

    # Offset each measurement by the station's mean value
    df.measure += df.station.map(lookup_df.mean_temp)

    # Round the temperature to one decimal place
    df.measure = df.measure.round(decimals=1)

    # Convert the station index to the station name
    df.station = df.station.map(lookup_df.station)

    # Append this chunk to the output file
    with open(filename, "a") as fh:
        df.to_csv(fh, sep=";", chunksize=10_000_000, header=False, index=False)


# Loop over chunks and generate data
for i in range(int(n / chunksize)):
    generate_chunk(filename, chunksize, std, lookup_df)
```
I tried this data generation method with just Pandas and NumPy, but it's way slower than the Java implementation.