[BUG] Series/Single Column DataFrame Groupby value_counts fails (DataFrame Groupby value_counts succeeds) #15696

Open
beckernick opened this issue May 7, 2024 · 1 comment
Labels
bug Something isn't working cuDF (Python) Affects Python cuDF API.

Comments

@beckernick
Member

Groupby value_counts fails when selecting individual columns from a DataFrame, but succeeds when run on the entire DataFrame.

import pandas as pd
import cudf

gdf = cudf.datasets.randomdata(dtypes={"id": int, "x": int})
pdf = gdf.to_pandas()

print(pdf.groupby("id").x.value_counts().head())
print(gdf.groupby("id").x.value_counts())
id   x   
942  988     1
961  1026    1
965  1062    1
984  981     1
993  999     1
Name: count, dtype: int64
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:2783, in _Grouping._handle_by_or_level(self, by, level)
   2782 try:
-> 2783     self._handle_label(by)
   2784 except (KeyError, TypeError):

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:2845, in _Grouping._handle_label(self, by)
   2844     else:
-> 2845         raise e
   2846 self.names.append(by)

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:2839, in _Grouping._handle_label(self, by)
   2838 try:
-> 2839     self._key_columns.append(self._obj._data[by])
   2840 except KeyError as e:
   2841     # `by` can be index name(label) too.

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/column_accessor.py:155, in ColumnAccessor.__getitem__(self, key)
    154 def __getitem__(self, key: Any) -> ColumnBase:
--> 155     return self._data[key]

KeyError: 'id'

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
Cell In[31], line 8
      5 pdf = gdf.to_pandas()
      7 print(pdf.groupby("id").x.value_counts().head())
----> 8 print(gdf.groupby("id").x.value_counts())

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:2598, in GroupBy.value_counts(self, subset, normalize, sort, ascending, dropna)
   2591     raise ValueError(
   2592         f"Keys {set(subset) & set(groupings)} in subset "
   2593         "cannot be in the groupby column keys."
   2594     )
   2596 df["__placeholder"] = 1
   2597 result = (
-> 2598     df.groupby(groupings + list(subset), dropna=dropna)[
   2599         "__placeholder"
   2600     ]
   2601     .count()
   2602     .sort_index()
   2603     .astype(np.int64)
   2604 )
   2606 if normalize:
   2607     levels = list(range(len(groupings), result.index.nlevels))

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
    113 @wraps(func)
    114 def inner(*args, **kwargs):
    115     libnvtx_push_range(self.attributes, self.domain.handle)
--> 116     result = func(*args, **kwargs)
    117     libnvtx_pop_range(self.domain.handle)
    118     return result

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/series.py:3426, in Series.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
   3400 @_cudf_nvtx_annotate
   3401 @docutils.doc_apply(
   3402     groupby_doc_template.format(
   (...)
   3424     dropna=True,
   3425 ):
-> 3426     return super().groupby(
   3427         by,
   3428         axis,
   3429         level,
   3430         as_index,
   3431         sort,
   3432         group_keys,
   3433         squeeze,
   3434         observed,
   3435         dropna,
   3436     )

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/nvtx/nvtx.py:116, in annotate.__call__.<locals>.inner(*args, **kwargs)
    113 @wraps(func)
    114 def inner(*args, **kwargs):
    115     libnvtx_push_range(self.attributes, self.domain.handle)
--> 116     result = func(*args, **kwargs)
    117     libnvtx_pop_range(self.domain.handle)
    118     return result

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/indexed_frame.py:5337, in IndexedFrame.groupby(self, by, axis, level, as_index, sort, group_keys, squeeze, observed, dropna)
   5331 if group_keys is None:
   5332     group_keys = False
   5334 return (
   5335     self.__class__._resampler(self, by=by)
   5336     if isinstance(by, cudf.Grouper) and by.freq
-> 5337     else self.__class__._groupby(
   5338         self,
   5339         by=by,
   5340         level=level,
   5341         as_index=as_index,
   5342         dropna=dropna,
   5343         sort=sort,
   5344         group_keys=group_keys,
   5345     )
   5346 )

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:283, in GroupBy.__init__(self, obj, by, level, sort, as_index, dropna, group_keys)
    281     self.grouping = self._by
    282 else:
--> 283     self.grouping = _Grouping(obj, self._by, level)

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:2751, in _Grouping.__init__(self, obj, by, level)
   2748 # Need to keep track of named key columns
   2749 # to support `as_index=False` correctly
   2750 self._named_columns = []
-> 2751 self._handle_by_or_level(by, level)
   2753 if len(obj) and not len(self._key_columns):
   2754     raise ValueError("No group keys passed")

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:2785, in _Grouping._handle_by_or_level(self, by, level)
   2783     self._handle_label(by)
   2784 except (KeyError, TypeError):
-> 2785     self._handle_misc(by)

File /nvme/0/nicholasb/miniconda3/envs/rapids-24.06/lib/python3.10/site-packages/cudf/core/groupby/groupby.py:2868, in _Grouping._handle_misc(self, by)
   2866 by = cudf.core.column.as_column(by)
   2867 if len(by) != len(self._obj):
-> 2868     raise ValueError("Grouper and object must have same length")
   2869 self._key_columns.append(by)
   2870 self.names.append(None)

ValueError: Grouper and object must have same length
print(gdf.groupby("id").value_counts()) # succeeds
# print(gdf.groupby("id")[["x"]].value_counts()) # same error as above
@mroeschke
Contributor

It appears that groupby.value_counts is only properly implemented for DataFrameGroupby. In pandas, value_counts has a different signature depending on whether the grouped object is a Series or a DataFrame: the DataFrame version takes a subset argument, while the Series version takes bins instead.

# DataFrameGroupby
    def value_counts(
        self,
        subset: Sequence[Hashable] | None = None,
        normalize: bool = False,
        sort: bool = True,
        ascending: bool = False,
        dropna: bool = True,
    ) -> DataFrame | Series:
# SeriesGroupby
    def value_counts(
        self,
        normalize: bool = False,
        sort: bool = True,
        ascending: bool = False,
        bins=None,
        dropna: bool = True,
    ) -> Series | DataFrame:
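
For reference, the counting strategy visible in the traceback (group on the keys plus the value column, then count rows) can be sketched for the Series case in plain pandas. This is only an illustrative sketch, not the cuDF implementation or a proposed fix: `grouped_value_counts` is a hypothetical helper name, it ignores the `normalize`, `sort`, `ascending` and `bins` arguments, and it has only been reasoned about against pandas, not run against cuDF.

```python
import numpy as np
import pandas as pd


def grouped_value_counts(df, by, column, dropna=True):
    """Hypothetical helper: count each value of `column` within each `by` group."""
    # Group on the key column together with the value column, then count
    # the rows that fall into each (key, value) pair.
    counts = df.groupby([by, column], dropna=dropna).size()
    # Sort by (key, value) and use int64 counts, as in the traceback's code.
    return counts.sort_index().astype(np.int64)


pdf = pd.DataFrame({"id": [1, 1, 1, 2, 2], "x": [10, 10, 20, 30, 30]})

# pandas' own SeriesGroupBy.value_counts, sorted by index for comparison.
expected = pdf.groupby("id").x.value_counts().sort_index()
result = grouped_value_counts(pdf, "id", "x")

print(result)
```

The result is compared after `sort_index()` because pandas orders `value_counts` output by count within each group by default, whereas this sketch orders by the (key, value) index.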
