[BUG] cudf.Series.duplicated returns error 'Series' object has no attribute 'duplicated' #15777

Closed
tiraldj opened this issue May 17, 2024 · 5 comments
Labels
  • 0 - Waiting on Author: Waiting for author to respond to review
  • bug: Something isn't working
  • cuDF (Python): Affects Python cuDF API.

Comments

@tiraldj

tiraldj commented May 17, 2024

Describe the bug
I am trying to check whether a column of a DataFrame contains duplicate values, using duplicated().

Steps/Code to reproduce bug

First I tried calling duplicated() on the column itself:

df['job_title'].duplicated()

Then I explicitly converted the column to a Series of string values and called duplicated() on that:

dups = cudf.Series(df['job_title']).astype('string')
dups = dups.duplicated()

In both cases I get the error: 'Series' object has no attribute 'duplicated'

Expected behavior
Something like this, where 'True' means duplicate of something that came before:

0    False
1    False
2     True
3    False
4     True
dtype: bool
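
For readers unsure what "duplicate of something that came before" means, here is a minimal pure-Python sketch of that behavior. It is not how cuDF or pandas implement duplicated(); the helper name `duplicated_keep_first` and the job titles are hypothetical, chosen only so that the result matches the expected output above.

```python
def duplicated_keep_first(values):
    """Return a list of booleans: True where the value already appeared earlier."""
    seen = set()
    flags = []
    for value in values:
        # A value is a duplicate only if an earlier position held the same value.
        flags.append(value in seen)
        seen.add(value)
    return flags


# Hypothetical job titles (not from the original report).
job_titles = ["analyst", "engineer", "analyst", "manager", "engineer"]
print(duplicated_keep_first(job_titles))  # [False, False, True, False, True]
```

The first occurrence of each value is marked False and every later repeat is marked True, which is the pattern shown in the expected output.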

Environment overview (please complete the following information)

  • Environment location: Cloud (Paperspace, RAPIDS image)
  • Method of cuDF install: cudf came preinstalled in the image

Environment details
I couldn't run print_env.sh; it wasn't found in the directory.

nvidia-smi says: NVIDIA-SMI 525.116.04 Driver Version: 525.116.04 CUDA Version: 12.0
I was using a P5000 on Paperspace.


tiraldj added the bug label May 17, 2024
@mroeschke
Contributor

Thanks for the report. Could you share which version of cudf you are running, as well as a reproducible example? For example, this works on cudf 24.04:

>>> import cudf
>>> cudf.Series(list("121")).duplicated()
0    False
1    False
2     True
dtype: bool
>>> cudf.__version__
'24.04.00'

mroeschke added the 0 - Waiting on Author and cuDF (Python) labels May 17, 2024
@tiraldj
Author

tiraldj commented May 18, 2024

Thank you for the reply.

cudf.__version__ says '22.10.01+2.gca9a422da9'

Again, this is the Paperspace cloud service's RAPIDS image. I will come back with reproducible code later today.

@mroeschke
Contributor

I do not see duplicated implemented for 22.10.01, so that's the reason for the failure.

It appears 23.02 is the first release where duplicated is implemented.
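
For anyone hitting the same error on a preinstalled image, a minimal sketch of checking a version string against that cutoff is below. It takes 23.02 as the first release with duplicated, as stated above (which "appears" to be the case, so treat the constant as an assumption). The names `FIRST_RELEASE_WITH_DUPLICATED`, `release_tuple`, and `has_duplicated` are hypothetical helpers, not part of cudf.

```python
import re

# Assumption taken from the comment above: 23.02 appears to be the first
# release where duplicated is implemented.
FIRST_RELEASE_WITH_DUPLICATED = (23, 2)


def release_tuple(version):
    """Extract (major, minor) from a string such as '22.10.01+2.gca9a422da9'."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    return int(match.group(1)), int(match.group(2))


def has_duplicated(version):
    """Return True if the given release is at or after the assumed cutoff."""
    return release_tuple(version) >= FIRST_RELEASE_WITH_DUPLICATED


print(has_duplicated("22.10.01+2.gca9a422da9"))  # False
print(has_duplicated("24.04.00"))                # True
```

In a live session you would pass cudf.__version__ to has_duplicated. A simpler check that does not rely on version numbers at all is hasattr(cudf.Series, "duplicated").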

@tiraldj
Author

tiraldj commented May 21, 2024

thank you for the clarification

@mroeschke
Contributor

Closing, since the behavior reported in the original post is expected for that cudf version.
