[QST] aggregate function that operates on vector(array of numeric) data #15741

Open
Rhett-Ying opened this issue May 14, 2024 · 5 comments
Labels
question Further information is requested

Comments

@Rhett-Ying

What is your question?
I am wondering whether cudf has native or built-in support for aggregate functions that run on vector data. Namely, text/image embeddings are stored in a column of a CSV/Parquet file, and I'd like to run various aggregate functions on them, such as mean and max. All of these operations are element-wise: for example, the mean is taken over the values at the same index in every array, and the result is an array of the same length. I'd also like to run K-Nearest-Neighbor search.

If this is not natively supported, how can I achieve these operations efficiently?

example code:

import cudf
import numpy as np
import pandas as pd

# Sample DataFrame with Pandas to cuDF conversion
data = {
    'category': ['A', 'A', 'B', 'B'],
    'values': [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9]), np.array([10, 11, 12])]
}
pdf = pd.DataFrame(data)
df = cudf.DataFrame.from_pandas(pdf)

result = df.groupby('category').agg({'values': ['sum', 'mean']})

print(result)

# Expected output
'''
category
A     [2.5, 3.5, 4.5]
B    [8.5, 9.5, 10.5]
Name: values, dtype: object
'''
Rhett-Ying added the question (Further information is requested) label May 14, 2024
@vyasr
Contributor

vyasr commented May 20, 2024

This kind of operation is not natively supported, unfortunately. The fundamental issue is that pandas allows you to put arbitrary objects into a Series/DataFrame and will run Python operations on them. In this case, since you put numpy arrays in, pandas will happily leave them as numpy arrays and apply numpy's binary operations to them, so this works as expected. cudf does not support arbitrary objects in this way, so we have to be a bit more clever about rearranging the data ourselves to handle this kind of operation. Per-row array data is supported through the list dtype, which is what you're getting from the from_pandas call in your snippet. To work with that in a vectorized fashion, the typical approach is to use the explode method, which flattens out the data. Here is a snippet that gives you an essentially equivalent result (slight differences in column names etc):

import cudf
import numpy as np
import pandas as pd

# Sample DataFrame with Pandas to cuDF conversion
data = {
    'category': ['A', 'A', 'B', 'B'],
    'values': [np.array([1, 2, 3]), np.array([4, 5, 6]), np.array([7, 8, 9]), np.array([10, 11, 12])]
}
pdf = pd.DataFrame(data)
df = cudf.DataFrame.from_pandas(pdf)

print("pandas result")
print(pdf.groupby('category').agg({'values': ['sum', 'mean']}))
print()

exploded_values = df[["values"]].explode("values")
df = df[["category"]].merge(exploded_values, left_index=True, right_index=True)
df["index"] = np.tile(np.arange(3), 4)

print("cudf result")
print(df.groupby(["category", "index"]).agg({"values": ["sum", "mean"]}).groupby("category").collect())

This outputs:

pandas result
                values                  
                   sum              mean
category                                
A            [5, 7, 9]   [2.5, 3.5, 4.5]
B         [17, 19, 21]  [8.5, 9.5, 10.5]

cudf result
         (values, sum)    (values, mean)
category                                
A            [5, 9, 7]   [2.5, 4.5, 3.5]
B         [19, 17, 21]  [9.5, 8.5, 10.5]
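
Note that the collected elements in the cudf output above are not in positional order ([5, 9, 7] versus pandas' [5, 7, 9]), presumably because the intermediate groupby result is not sorted before the final collect. The sketch below shows the same explode-then-regroup idea with an explicit sort on (category, index) before the lists are rebuilt. It is written against pandas so that it runs on a CPU; carrying the sort step over to the cudf snippet is an assumption that has not been checked on a GPU here, and the names `long` and `per_position` are illustrative only.

```python
import pandas as pd

pdf = pd.DataFrame({
    "category": ["A", "A", "B", "B"],
    "values": [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]],
})

# One row per array element; explode repeats the original row label.
long = pdf.explode("values").astype({"values": "int64"})

# Position of each element inside its source array (0, 1, 2 per row).
long["index"] = long.groupby(level=0).cumcount().to_numpy()

# Aggregate per (category, position), then sort so that positions are
# in order before the per-category lists are rebuilt.
per_position = (
    long.groupby(["category", "index"])["values"]
    .agg(["sum", "mean"])
    .sort_index()
    .reset_index()
)

# Rebuild one list per category; within-group row order is preserved.
result = per_position.groupby("category")[["sum", "mean"]].agg(list)
print(result)
```

Deriving the position column from the exploded data (rather than hard-coding the array length and row count) also keeps the sketch valid if the number of rows changes.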

@Rhett-Ying
Author

@vyasr Thanks for your suggestion. Is the approach you gave above equivalent to splitting the array into separate columns, applying sum()/mean() to each column, and then merging the output back into an array?

@vyasr
Contributor

vyasr commented May 23, 2024

Yes, that is basically equivalent. You cannot operate on the numpy arrays directly, but assuming they are all of the same length you could split them into multiple columns if you have control of that on construction. Otherwise the list-based approach I showed is the way you could process it if you have to take the numpy array-based inputs from pandas as-is.
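
A minimal sketch of the split-into-columns alternative, assuming all vectors have the same length and are available as a 2-D array at construction time. It uses pandas so that it runs without a GPU; a cudf DataFrame is expected to accept the same groupby calls, but that is not verified here. The column names `v0`, `v1`, `v2` are made up for illustration.

```python
import numpy as np
import pandas as pd

categories = ["A", "A", "B", "B"]
vectors = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])

# One scalar column per vector component (hypothetical names v0, v1, v2).
columns = [f"v{i}" for i in range(vectors.shape[1])]
wide = pd.DataFrame(vectors, columns=columns)
wide.insert(0, "category", categories)

# Ordinary column-wise aggregations now act element-wise on the vectors.
sums = wide.groupby("category")[columns].sum()
means = wide.groupby("category")[columns].mean()

# Each result row is the aggregated vector for one category.
mean_vectors = {cat: row.to_numpy() for cat, row in means.iterrows()}
print(mean_vectors)
```

Because the component order is fixed by the column order, this layout does not need a separate position column or a sort step.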

@vyasr
Contributor

vyasr commented May 30, 2024

@Rhett-Ying does the above solution address your needs?

@Rhett-Ying
Author

@vyasr Thanks for your suggestion. One major concern for me is performance, especially when I want to apply more advanced operations to vector data, such as K-Nearest-Neighbor search. Should I leverage tools like cuVS for operations on vector data?
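
For reference, the sketch below shows what a brute-force K-Nearest-Neighbor search over embedding vectors computes. It is a plain NumPy CPU baseline under the assumption of squared Euclidean distance, not the cuVS API; the function name `knn_brute_force` and the sample data are hypothetical. A dedicated vector-search library would replace this function when performance matters.

```python
import numpy as np

def knn_brute_force(embeddings, queries, k):
    """Return (indices, squared distances) of the k nearest embeddings per query."""
    # Pairwise differences, shape (n_queries, n_embeddings, dim).
    diffs = queries[:, None, :] - embeddings[None, :, :]
    # Squared Euclidean distances, shape (n_queries, n_embeddings).
    dists = np.einsum("qnd,qnd->qn", diffs, diffs)
    # Indices of the k smallest distances, ordered nearest first.
    idx = np.argsort(dists, axis=1, kind="stable")[:, :k]
    return idx, np.take_along_axis(dists, idx, axis=1)

embeddings = np.array([[1.0, 2.0, 3.0],
                       [4.0, 5.0, 6.0],
                       [7.0, 8.0, 9.0],
                       [10.0, 11.0, 12.0]])
queries = np.array([[1.0, 2.0, 2.0]])

neighbors, sq_dists = knn_brute_force(embeddings, queries, k=2)
print(neighbors, sq_dists)
```

This materializes the full query-by-embedding distance matrix, so memory grows with the product of the two sizes; it is meant only to pin down the expected result, not as a scalable implementation.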
