Tracking Issue for the GroupBy and GroupByMut iterators #80552

Kerollmops · 2020-12-31T11:10:13Z

Feature gate: #![feature(slice_group_by)]

This is a tracking issue for the GroupBy and GroupByMut iterators.

This feature exposes the group_by and group_by_mut methods on the slice and mutable slice types, these methods return the GroupBy and GroupByMut iterators structs respectively. Those two iterators return subslices of the original slice where a user-defined function returns true for two following elements of the slice.

Public API

These methods can return subslices that contains equal elements:

let slice = &[1, 1, 1, 3, 3, 2, 2, 2];

let mut iter = slice.group_by(|a, b| a == b);

assert_eq!(iter.next(), Some(&[1, 1, 1][..]));
assert_eq!(iter.next(), Some(&[3, 3][..]));
assert_eq!(iter.next(), Some(&[2, 2, 2][..]));
assert_eq!(iter.next(), None);

they can also be used to extract the sorted subslices:

let slice = &[1, 1, 2, 3, 2, 3, 2, 3, 4];

let mut iter = slice.group_by(|a, b| a <= b);

assert_eq!(iter.next(), Some(&[1, 1, 2, 3][..]));
assert_eq!(iter.next(), Some(&[2, 3][..]));
assert_eq!(iter.next(), Some(&[2, 3, 4][..]));
assert_eq!(iter.next(), None);

Steps / History

Initial RFC discussion: RFC: Add the group_by and group_by_mut methods to slice rfcs#2477 (it was determined that an RFC wasn't needed)
Implementation: The return of the GroupBy and GroupByMut iterators on slice #79895
Stabilization PR

Unresolved Questions

Should this be group_by? Or should we reserve that name for another higher-level combinator?
RFC: Add the group_by and group_by_mut methods to slice rfcs#2477 (comment)

The text was updated successfully, but these errors were encountered:

purpleP · 2021-04-24T09:50:38Z

It's probably too late, but would you consider renaming this functions to a more proper name like split_on_change or group_on_change?

group_by kind of should give a transitive closure of a relationship.

mrpink76 · 2021-06-11T08:48:04Z

Just an alternative name suggestion:
split_by_neighbour

marcospb19 · 2021-06-25T06:55:58Z

I think that split_on_change and split_by_neighbour are confusing because other split_* iterators skip elements, except for split_inclusive*.

Here a "Group" is a subslice of contiguous elements, each of the subslices is delimited by the element(s) where the predicate fails. Is there other important definition of a group that should be prioritized over this?

Minor thing, but I would prefer is there was a indication that the elements are an adjacent/contiguous subslice, is that a common definition of a group? Some names I thought about:

group_subslices
group_contiguous
subslice_by
group_subsequences

SkiFire13 · 2021-06-25T11:23:33Z

Is there other important definition of a group that should be prioritized over this?

I usually think of a group as a set of elements, not necessarily contiguous. This is the meaning it has for example in Java and SQL.

In the original RFC someone proposed calling them chunks_by or split_by, which IMO seems more fitting. Between the two I would go for the chunks_by. split_by seems to be overlapping with the Needle API, although you could also argue that this may be part of it.

lacasaprivata2 · 2021-09-06T13:31:19Z

Yes - group_by in most languages (and my own macros in c/c++ by example) does not require a contiguous range but rather the entire width of the iterator

lacasaprivata2 · 2021-09-06T13:33:05Z

In response to the comment in Java, the common approach in that language is to apply a

collect(___)

to the end of a stream (iterator)

and inside the parens choose the type of grouping, such as Collections.toList() or Collections.toMap(key_fn, value_fn)

vbfox · 2021-09-23T09:10:28Z

I'm looking at this issue after a confusion with itertool's group_by on Vec that behave the same (group only contiguous elements) and was wondering if the naming of the std proposal was better.

So i'm adding my voice here to a change of name, group_by doesn't evoke the fact that only contiguous elements are grouped. group_contiguous from @marcospb19 suggestions would feel clearer to me.

mjhanninen · 2021-10-17T08:15:03Z

Couple more candidates

runs_by
streaks_by

rsalmei · 2021-12-02T22:57:26Z

For me the name is perfect, it has the same meaning as Python's groupby: https://docs.python.org/3/library/itertools.html#itertools.groupby
Looking forward to this!

purpleP · 2021-12-03T08:26:36Z

For me the name is perfect, it has the same meaning as Python's groupby: https://docs.python.org/3/library/itertools.html#itertools.groupby
Looking forward to this!

Not to argue with you as you’re just expressing your subjective opinion, but it got me thinking about it more deeply.

I’m not a linguist, but I think the function with group inside it’s name should produce groups by a particular criterion which doesn’t have anything to do with the ordering (kind of what SQL group by clause). And what pythons groupby (which I think was inspired by Haskell) is doing is more appropriately described as splitting the sequence and not grouping it’s elements.

I would like to use objective criteria here and not historical reasons. Are there any linguists or mathematicians here?

By the way I almost never seen pythons groupby used without sorting the sequence first. I mean it’s main use IMHO is to produce equivalence classes from the sequence.

obscurans · 2021-12-21T06:06:51Z

Mathematician here - don't think you're going to get much from math:

The overriding meaning of "group" is the noun for abstract algebra
Haskellers had no problem with going for groupBy

The clearest parallel, in my 2c, is to the str::split series, which returns iterator over slices. My proposal is to invert the sense of the predicate and make it split_between. group_between also seems reasonable.

jblachly · 2022-01-02T20:13:33Z

D calls this operation chunkBy, although Rust already uses chunk to mean fixed group of N elements in std::slice IIRC.

Re @purpleP comment about language and meaning:
Agree that "group" does not semantically convey precisely what's happening. chunk_by IMO better conveys that the operation will "fail" if the Vec is not already sorted. However, given naming constraints (chunk already having a separate meaning in rust) and other languages' naming conventions, group_by seems the most sense at this time.

PS: Looking very much forward to this feature as I'm porting functional-style D to Rust; what's blocking besides naming?

[0] https://dlang.org/library/std/algorithm/iteration/chunk_by.html

SkiFire13 · 2022-01-02T21:08:10Z

In my opinion split_by (or some other variation starting with split) is the one that makes most sense.

split because we already have split, which has very similar semantics with the proposed method. The only difference between them is that split takes an FnMut(&T) -> bool while the proposed method takes an FnMut(&T, &T) -> bool;
_by because we already have other methods that end in _by and take a closure that takes two consecutive elements, the most clear example being Vec::dedup_by. Following this naming a _by_key would also make sense.

I would like to voice against group_by because even though some languages use it for this exact same function, others don't. Many people have the expectation that a group is not limited to consecutive elements, and it would be broken if this method was named group_by.
The current code examples (the first one in particular) aggravate this naming issue even more by making it look like the group will contain all the elements that match the predicate, matching the expectation of users whose previous language gave another meaning to group_by.

although Rust already uses chunk to mean fixed group of N elements in std::slice IIRC.

That's not entirely true, chunks can return a slice of less than N elements in the last iteration. And if you take away the fixed size constraint then the meaning is the same, just an iterator yielding subslices.

leonardo-m · 2022-01-17T21:59:05Z

I'm finding this function quite useful in my code.

It's similar to a split, but it takes a function (predicate) with two arguments (diadic) instead of a function with one argument. So if you don't like the "group_by" Python-inspired name (that takes a single argument predicate) then do you like split_on_adjacent? :-)

Regarding the implementation, currently it performs a linear scan:

if (self.predicate)(l, r) { len += 1 } else { break }

When (after the sort) the runs are long enough, a more aggressive strategy could be faster. Something like galloping of D language (exponential grow, followed by binary search), or an arithmetic grow followed by binary search. This can't work if the input slice isn't sorted.

Kerollmops · 2022-01-18T09:19:47Z

Hey @leonardo-m,

If you need something more polished than this nightly method in the standard library, I have also worked on a crate with something similar to the galloping algorithm you describe here. The library is called slice-group-by and is available on crates.io.

leonardo-m · 2022-01-18T09:58:00Z

I've just tried that crate and it works well (in my case it gives no speedup probably because my runs are generally very short).

leonardo-m · 2022-01-18T12:52:18Z

Another note, inside next() of GroupBy there's this line:

let (head, tail) = self.slice.split_at(len);

Inside slice::split_at there's this assert:

assert!(mid <= self.len());

I've seen that such assert isn't removed in the asm of my group_by usages.

I think it's because of this missed LLVM optimization:
https://users.rust-lang.org/t/hare-and-tortoise-indexes-analysis/63985

So are benchmarks telling us that it is performance-wise worth giving a hint (inside next() of GroupBy) to help LLVM remove that assert on mid?

SkiFire13 · 2022-01-18T14:50:27Z

Note that with a carefully placed assert you can convince LLVM to remove any unnecessary bound check. https://godbolt.org/z/c4TxMM9Tr

leonardo-m · 2022-01-18T17:51:21Z

SkiFire13, despite I have some experience with optimizing Rust code and finding ways to remove bound tests, your code feels a bit magical. Very nice. I think that change is worth pushing in the Rust stdlib.

sdroege · 2022-05-09T12:39:13Z

One thing that I needed in my use-case in addition to what GroupBy (and the slice-group-by crate) provide is access to the remainder of the iterator at any point in time, i.e. an accessor for the GroupBy::slice field.

arniu · 2022-09-04T12:23:24Z

The chunks are slices and do not overlap.

So the name chunks_by / chunks_by_mut is a better one.

BurntSushi · 2023-08-22T13:50:37Z

I also like chunk_by better as well, for consistency with our use of "chunks" in other APIs.

jblachly · 2023-08-22T16:56:59Z

I also like chunk_by better as well, for consistency with our use of "chunks" in other APIs.

also semantically hinting that it would ~~fail~~ not return expected results if the collection is not ~~sorted~~ ordered as assumed, whereas "group" doesn't carry this implication

marcospb19 · 2023-08-22T17:28:31Z

it would fail if the collection is not sorted

The proposed functions do not make any assumptions about ordering.

chunk_by communicates that matched elements are sequential, is this what you meant?

jblachly · 2023-08-22T17:45:29Z

it would fail if the collection is not sorted

The proposed functions do not make any assumptions about ordering.

chunk_by communicates that matched elements are sequential, is this what you meant?

I apologize for imprecision. I've updated my comment to reflect my intention. (namely, that "group by" could carry an implicit meaning that it might not operate sequentially, as you suggest)

Zenithsiz · 2023-09-16T10:16:08Z

Would it be possible for group_by_mut to allow a Fn(&mut T, &mut T) instead? I don't have a specific use case, but looking at Vec::retain vs Vec::retain_mut, there are use-cases for predicates receiving unique references.

For example, a type might need to cache a result to use in it's PartialEq impl, which might require &mut to do efficiently without inner mutability. This would allow not having to calculate this cached value up-front before group_by_mut, as well as not have to do the expensive calculation twice during group_by_mut (since it wouldn't be able to cache it with a &T).

In terms of implementation, since group_by_mut never calls it's predicate with the same element in both arguments, this should be fine.

I can see how there might be some "weirdness" about passing a mutable reference to a predicate, but if we think that it's a unique reference instead, then it make sense, since the algorithm is designed in such a way that each item passed to the predicate is unique (i.e. both arguments are never the same and won't overlap with yielded slices).

rfcbot · 2023-10-26T01:32:27Z

🔔 This is now entering its final comment period, as per the review above. 🔔

marcospb19 · 2023-10-28T16:36:09Z

@Zenithsiz oi Filipe 👋.

I don't think Fn(&mut T, &mut T) is a great idea because most elements are actually called twice.

0	1	2	3	4
(0	1)
	(1	2)
		(2	3)
			(3	4)

First as the right element, then, as the left element, with the exception of the first and last elements.

If you mutate an element in the previous predicate call, this might lead to a confusing pitfall where the yielded groups do not respect the predicate (as expected).

For the caching usage case you mentioned, I'd say that you can do this instead:

slice.iter_mut().for_each(perform_cache_operation);
// then, call `group_by`

marcospb19 · 2023-10-28T16:45:09Z

I also like chunk_by better as well, for consistency with our use of "chunks" in other APIs.

Looks like most people do agree.

Do we need to wait for FCP to finish before creating a PR to rename group_by to chunk_by?

rfcbot · 2023-11-05T01:44:13Z

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

This will be merged soon.

Ten0 · 2023-12-14T14:08:03Z

As suggested here by the original author, I'm porting my comment to here:

fn linear_group_by_key<F, K>(&self, func: F) -> LinearGroupByKey<T, F>
    where F: FnMut(&T) -> K,
          K: PartialEq

Does not allow:

let strings = vec!["A".to_owned()];
strings.linear_group_by_key(|s| s.as_str());

In order for that to work, we need:

fn linear_group_by_key<'a, F, K>(&'a self, func: F) -> LinearGroupByKey<T, F>
    where F: FnMut(&'a T) -> K,
          K: PartialEq

Since nothing moves in memory it should be possible.

While there is (unfortunately) no by_key variant ATM proposed in the std, I'm guessing that chunk_by might also benefit from having lifetime information about the source slice (e.g. if it's storing the reference somewhere) so the same thing somewhat applies.

That would also match better what is done on binary_search_by (it would feel weird that it works with one and not the other).

SkiFire13 · 2023-12-14T16:43:37Z

I'm guessing that chunk_by/chunk_by_mut might also benefit from having lifetime information about the source slice

I agree it might be useful for chunk_by, but unfortunately I don't think this is sound for chunk_by_mut

Ten0 · 2023-12-14T17:06:27Z

I agree it might be useful for chunk_by, but unfortunately I don't think this is sound for chunk_by_mut

Aha you're right, whoops 😅

Yeah so only chunk_by of course. (updated message above)

Renamed "group by" to "chunk by" a per rust-lang#80552. Newly stable items: * `core::slice::ChunkBy` * `core::slice::ChunkByMut` * `[T]::chunk` * `[T]::chunk_by` Closes rust-lang#80552.

…r=dtolnay Stabilize `slice_group_by` Renamed "group by" to "chunk by" a per rust-lang#80552. Newly stable items: * `core::slice::ChunkBy` * `core::slice::ChunkByMut` * `[T]::chunk` * `[T]::chunk_by` Closes rust-lang#80552.

Rollup merge of rust-lang#117678 - niklasf:stabilize-slice_group_by, r=dtolnay Stabilize `slice_group_by` Renamed "group by" to "chunk by" a per rust-lang#80552. Newly stable items: * `core::slice::ChunkBy` * `core::slice::ChunkByMut` * `[T]::chunk` * `[T]::chunk_by` Closes rust-lang#80552.

Kerollmops added C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Dec 31, 2020

Kerollmops mentioned this issue Dec 31, 2020

The return of the GroupBy and GroupByMut iterators on slice #79895

Merged

m-ou-se added A-iterators Area: Iterators A-slice Area: `[T]` Libs-Tracked Libs issues that are tracked on the team's project board. labels Dec 31, 2020

KodrAus mentioned this issue Jan 14, 2021

RFC: Add the group_by and group_by_mut methods to slice rust-lang/rfcs#2477

Closed

Kerollmops mentioned this issue Jan 23, 2022

Will simd be introduced？ RoaringBitmap/roaring-rs#60

Closed

cuviper mentioned this issue Nov 11, 2022

Added A par_chunk_by method rayon-rs/rayon#925

Merged

rfcbot added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. labels Oct 26, 2023

rfcbot added finished-final-comment-period The final comment period is finished for this PR / Issue. to-announce Announce this issue on triage meeting and removed final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. labels Nov 5, 2023

niklasf mentioned this issue Nov 7, 2023

Stabilize slice_group_by #117678

Merged

apiraino removed the to-announce Announce this issue on triage meeting label Nov 9, 2023

jswrenn mentioned this issue Nov 17, 2023

group_by is weird rust-itertools/itertools#374

Closed

Kerollmops mentioned this issue Dec 14, 2023

LinearGroupByKey could allow provided closure to borrow from input Kerollmops/slice-group-by#25

Open

bors closed this as completed in 0bccdb3 Jan 26, 2024

imartayan mentioned this issue Feb 2, 2024

Discussion / question imartayan/CBL#1

Open

phenylshima mentioned this issue Feb 20, 2024

jlabelを導入する VOICEVOX/voicevox_core#742

Merged

alexander-buerger-met-no mentioned this issue Jul 24, 2024

use stable rust gauteh/hidefix#39

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tracking Issue for the GroupBy and GroupByMut iterators #80552

Tracking Issue for the GroupBy and GroupByMut iterators #80552

Kerollmops commented Dec 31, 2020 •

edited

Loading

purpleP commented Apr 24, 2021

mrpink76 commented Jun 11, 2021

marcospb19 commented Jun 25, 2021

SkiFire13 commented Jun 25, 2021

lacasaprivata2 commented Sep 6, 2021

lacasaprivata2 commented Sep 6, 2021 •

edited

Loading

vbfox commented Sep 23, 2021

mjhanninen commented Oct 17, 2021 •

edited

Loading

rsalmei commented Dec 2, 2021

purpleP commented Dec 3, 2021 •

edited

Loading

obscurans commented Dec 21, 2021

jblachly commented Jan 2, 2022 •

edited

Loading

SkiFire13 commented Jan 2, 2022 •

edited

Loading

leonardo-m commented Jan 17, 2022

Kerollmops commented Jan 18, 2022 •

edited

Loading

leonardo-m commented Jan 18, 2022

leonardo-m commented Jan 18, 2022 •

edited

Loading

SkiFire13 commented Jan 18, 2022

leonardo-m commented Jan 18, 2022

sdroege commented May 9, 2022

arniu commented Sep 4, 2022

BurntSushi commented Aug 22, 2023

jblachly commented Aug 22, 2023 •

edited

Loading

marcospb19 commented Aug 22, 2023

jblachly commented Aug 22, 2023

Zenithsiz commented Sep 16, 2023

rfcbot commented Oct 26, 2023

marcospb19 commented Oct 28, 2023 •

edited

Loading

marcospb19 commented Oct 28, 2023

rfcbot commented Nov 5, 2023

Ten0 commented Dec 14, 2023 •

edited

Loading

SkiFire13 commented Dec 14, 2023

Ten0 commented Dec 14, 2023 •

edited

Loading

Tracking Issue for the GroupBy and GroupByMut iterators #80552

Tracking Issue for the GroupBy and GroupByMut iterators #80552

Comments

Kerollmops commented Dec 31, 2020 • edited Loading

Public API

Steps / History

Unresolved Questions

purpleP commented Apr 24, 2021

mrpink76 commented Jun 11, 2021

marcospb19 commented Jun 25, 2021

SkiFire13 commented Jun 25, 2021

lacasaprivata2 commented Sep 6, 2021

lacasaprivata2 commented Sep 6, 2021 • edited Loading

vbfox commented Sep 23, 2021

mjhanninen commented Oct 17, 2021 • edited Loading

rsalmei commented Dec 2, 2021

purpleP commented Dec 3, 2021 • edited Loading

obscurans commented Dec 21, 2021

jblachly commented Jan 2, 2022 • edited Loading

SkiFire13 commented Jan 2, 2022 • edited Loading

leonardo-m commented Jan 17, 2022

Kerollmops commented Jan 18, 2022 • edited Loading

leonardo-m commented Jan 18, 2022

leonardo-m commented Jan 18, 2022 • edited Loading

SkiFire13 commented Jan 18, 2022

leonardo-m commented Jan 18, 2022

sdroege commented May 9, 2022

arniu commented Sep 4, 2022

BurntSushi commented Aug 22, 2023

jblachly commented Aug 22, 2023 • edited Loading

marcospb19 commented Aug 22, 2023

jblachly commented Aug 22, 2023

Zenithsiz commented Sep 16, 2023

rfcbot commented Oct 26, 2023

marcospb19 commented Oct 28, 2023 • edited Loading

marcospb19 commented Oct 28, 2023

rfcbot commented Nov 5, 2023

Ten0 commented Dec 14, 2023 • edited Loading

SkiFire13 commented Dec 14, 2023

Ten0 commented Dec 14, 2023 • edited Loading

Kerollmops commented Dec 31, 2020 •

edited

Loading

lacasaprivata2 commented Sep 6, 2021 •

edited

Loading

mjhanninen commented Oct 17, 2021 •

edited

Loading

purpleP commented Dec 3, 2021 •

edited

Loading

jblachly commented Jan 2, 2022 •

edited

Loading

SkiFire13 commented Jan 2, 2022 •

edited

Loading

Kerollmops commented Jan 18, 2022 •

edited

Loading

leonardo-m commented Jan 18, 2022 •

edited

Loading

jblachly commented Aug 22, 2023 •

edited

Loading

marcospb19 commented Oct 28, 2023 •

edited

Loading

Ten0 commented Dec 14, 2023 •

edited

Loading

Ten0 commented Dec 14, 2023 •

edited

Loading