Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking Issue for the GroupBy and GroupByMut iterators #80552

Closed
2 of 3 tasks
Kerollmops opened this issue Dec 31, 2020 · 41 comments · Fixed by #117678
Closed
2 of 3 tasks

Tracking Issue for the GroupBy and GroupByMut iterators #80552

Kerollmops opened this issue Dec 31, 2020 · 41 comments · Fixed by #117678
Labels
A-iterators Area: Iterators A-slice Area: `[T]` C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. finished-final-comment-period The final comment period is finished for this PR / Issue. Libs-Tracked Libs issues that are tracked on the team's project board. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.

Comments

@Kerollmops
Copy link
Contributor

Kerollmops commented Dec 31, 2020

Feature gate: #![feature(slice_group_by)]

This is a tracking issue for the GroupBy and GroupByMut iterators.

This feature exposes the group_by and group_by_mut methods on the slice and mutable slice types, these methods return the GroupBy and GroupByMut iterators structs respectively. Those two iterators return subslices of the original slice where a user-defined function returns true for two following elements of the slice.

Public API

These methods can return subslices that contains equal elements:

let slice = &[1, 1, 1, 3, 3, 2, 2, 2];

let mut iter = slice.group_by(|a, b| a == b);

assert_eq!(iter.next(), Some(&[1, 1, 1][..]));
assert_eq!(iter.next(), Some(&[3, 3][..]));
assert_eq!(iter.next(), Some(&[2, 2, 2][..]));
assert_eq!(iter.next(), None);

they can also be used to extract the sorted subslices:

let slice = &[1, 1, 2, 3, 2, 3, 2, 3, 4];

let mut iter = slice.group_by(|a, b| a <= b);

assert_eq!(iter.next(), Some(&[1, 1, 2, 3][..]));
assert_eq!(iter.next(), Some(&[2, 3][..]));
assert_eq!(iter.next(), Some(&[2, 3, 4][..]));
assert_eq!(iter.next(), None);

Steps / History

Unresolved Questions

@Kerollmops Kerollmops added C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. labels Dec 31, 2020
@m-ou-se m-ou-se added A-iterators Area: Iterators A-slice Area: `[T]` Libs-Tracked Libs issues that are tracked on the team's project board. labels Dec 31, 2020
@purpleP
Copy link

purpleP commented Apr 24, 2021

It's probably too late, but would you consider renaming this functions to a more proper name like split_on_change or group_on_change?

group_by kind of should give a transitive closure of a relationship.

@mrpink76
Copy link

Just an alternative name suggestion:
split_by_neighbour

@marcospb19
Copy link
Contributor

I think that split_on_change and split_by_neighbour are confusing because other split_* iterators skip elements, except for split_inclusive*.

Here a "Group" is a subslice of contiguous elements, each of the subslices is delimited by the element(s) where the predicate fails. Is there other important definition of a group that should be prioritized over this?

Minor thing, but I would prefer is there was a indication that the elements are an adjacent/contiguous subslice, is that a common definition of a group? Some names I thought about:

  • group_subslices
  • group_contiguous
  • subslice_by
  • group_subsequences

@SkiFire13
Copy link
Contributor

Is there other important definition of a group that should be prioritized over this?

I usually think of a group as a set of elements, not necessarily contiguous. This is the meaning it has for example in Java and SQL.

In the original RFC someone proposed calling them chunks_by or split_by, which IMO seems more fitting. Between the two I would go for the chunks_by. split_by seems to be overlapping with the Needle API, although you could also argue that this may be part of it.

@lacasaprivata2
Copy link

Yes - group_by in most languages (and my own macros in c/c++ by example) does not require a contiguous range but rather the entire width of the iterator

@lacasaprivata2
Copy link

lacasaprivata2 commented Sep 6, 2021

In response to the comment in Java, the common approach in that language is to apply a

collect(___)

to the end of a stream (iterator)

and inside the parens choose the type of grouping, such as Collections.toList() or Collections.toMap(key_fn, value_fn)

@vbfox
Copy link

vbfox commented Sep 23, 2021

I'm looking at this issue after a confusion with itertool's group_by on Vec that behave the same (group only contiguous elements) and was wondering if the naming of the std proposal was better.

So i'm adding my voice here to a change of name, group_by doesn't evoke the fact that only contiguous elements are grouped. group_contiguous from @marcospb19 suggestions would feel clearer to me.

@mjhanninen
Copy link

mjhanninen commented Oct 17, 2021

Couple more candidates

  • runs_by
  • streaks_by

@rsalmei
Copy link

rsalmei commented Dec 2, 2021

For me the name is perfect, it has the same meaning as Python's groupby: https://docs.python.org/3/library/itertools.html#itertools.groupby
Looking forward to this!

@purpleP
Copy link

purpleP commented Dec 3, 2021

For me the name is perfect, it has the same meaning as Python's groupby: https://docs.python.org/3/library/itertools.html#itertools.groupby
Looking forward to this!

Not to argue with you as you’re just expressing your subjective opinion, but it got me thinking about it more deeply.

I’m not a linguist, but I think the function with group inside it’s name should produce groups by a particular criterion which doesn’t have anything to do with the ordering (kind of what SQL group by clause). And what pythons groupby (which I think was inspired by Haskell) is doing is more appropriately described as splitting the sequence and not grouping it’s elements.

I would like to use objective criteria here and not historical reasons. Are there any linguists or mathematicians here?

By the way I almost never seen pythons groupby used without sorting the sequence first. I mean it’s main use IMHO is to produce equivalence classes from the sequence.

@obscurans
Copy link

Mathematician here - don't think you're going to get much from math:

  1. The overriding meaning of "group" is the noun for abstract algebra
  2. Haskellers had no problem with going for groupBy

The clearest parallel, in my 2c, is to the str::split series, which returns iterator over slices. My proposal is to invert the sense of the predicate and make it split_between. group_between also seems reasonable.

@jblachly
Copy link

jblachly commented Jan 2, 2022

D calls this operation chunkBy, although Rust already uses chunk to mean fixed group of N elements in std::slice IIRC.

Re @purpleP comment about language and meaning:
Agree that "group" does not semantically convey precisely what's happening. chunk_by IMO better conveys that the operation will "fail" if the Vec is not already sorted. However, given naming constraints (chunk already having a separate meaning in rust) and other languages' naming conventions, group_by seems the most sense at this time.

PS: Looking very much forward to this feature as I'm porting functional-style D to Rust; what's blocking besides naming?

[0] https://dlang.org/library/std/algorithm/iteration/chunk_by.html

@SkiFire13
Copy link
Contributor

SkiFire13 commented Jan 2, 2022

In my opinion split_by (or some other variation starting with split) is the one that makes most sense.

  • split because we already have split, which has very similar semantics with the proposed method. The only difference between them is that split takes an FnMut(&T) -> bool while the proposed method takes an FnMut(&T, &T) -> bool;
  • _by because we already have other methods that end in _by and take a closure that takes two consecutive elements, the most clear example being Vec::dedup_by. Following this naming a _by_key would also make sense.

I would like to voice against group_by because even though some languages use it for this exact same function, others don't. Many people have the expectation that a group is not limited to consecutive elements, and it would be broken if this method was named group_by.
The current code examples (the first one in particular) aggravate this naming issue even more by making it look like the group will contain all the elements that match the predicate, matching the expectation of users whose previous language gave another meaning to group_by.

although Rust already uses chunk to mean fixed group of N elements in std::slice IIRC.

That's not entirely true, chunks can return a slice of less than N elements in the last iteration. And if you take away the fixed size constraint then the meaning is the same, just an iterator yielding subslices.

@leonardo-m
Copy link

I'm finding this function quite useful in my code.

It's similar to a split, but it takes a function (predicate) with two arguments (diadic) instead of a function with one argument. So if you don't like the "group_by" Python-inspired name (that takes a single argument predicate) then do you like split_on_adjacent? :-)

Regarding the implementation, currently it performs a linear scan:

if (self.predicate)(l, r) { len += 1 } else { break }

When (after the sort) the runs are long enough, a more aggressive strategy could be faster. Something like galloping of D language (exponential grow, followed by binary search), or an arithmetic grow followed by binary search. This can't work if the input slice isn't sorted.

@Kerollmops
Copy link
Contributor Author

Kerollmops commented Jan 18, 2022

Hey @leonardo-m,

If you need something more polished than this nightly method in the standard library, I have also worked on a crate with something similar to the galloping algorithm you describe here. The library is called slice-group-by and is available on crates.io.

@leonardo-m
Copy link

I've just tried that crate and it works well (in my case it gives no speedup probably because my runs are generally very short).

@leonardo-m
Copy link

leonardo-m commented Jan 18, 2022

Another note, inside next() of GroupBy there's this line:

let (head, tail) = self.slice.split_at(len);

Inside slice::split_at there's this assert:

assert!(mid <= self.len());

I've seen that such assert isn't removed in the asm of my group_by usages.

I think it's because of this missed LLVM optimization:
https://users.rust-lang.org/t/hare-and-tortoise-indexes-analysis/63985

So are benchmarks telling us that it is performance-wise worth giving a hint (inside next() of GroupBy) to help LLVM remove that assert on mid?

@SkiFire13
Copy link
Contributor

Note that with a carefully placed assert you can convince LLVM to remove any unnecessary bound check. https://godbolt.org/z/c4TxMM9Tr

@leonardo-m
Copy link

SkiFire13, despite I have some experience with optimizing Rust code and finding ways to remove bound tests, your code feels a bit magical. Very nice. I think that change is worth pushing in the Rust stdlib.

@sdroege
Copy link
Contributor

sdroege commented May 9, 2022

One thing that I needed in my use-case in addition to what GroupBy (and the slice-group-by crate) provide is access to the remainder of the iterator at any point in time, i.e. an accessor for the GroupBy::slice field.

@arniu
Copy link

arniu commented Sep 4, 2022

The chunks are slices and do not overlap.

So the name chunks_by / chunks_by_mut is a better one.

@BurntSushi
Copy link
Member

I also like chunk_by better as well, for consistency with our use of "chunks" in other APIs.

@jblachly
Copy link

jblachly commented Aug 22, 2023

I also like chunk_by better as well, for consistency with our use of "chunks" in other APIs.

also semantically hinting that it would fail not return expected results if the collection is not sorted ordered as assumed, whereas "group" doesn't carry this implication

@marcospb19
Copy link
Contributor

it would fail if the collection is not sorted

The proposed functions do not make any assumptions about ordering.

chunk_by communicates that matched elements are sequential, is this what you meant?

@jblachly
Copy link

it would fail if the collection is not sorted

The proposed functions do not make any assumptions about ordering.

chunk_by communicates that matched elements are sequential, is this what you meant?

I apologize for imprecision. I've updated my comment to reflect my intention. (namely, that "group by" could carry an implicit meaning that it might not operate sequentially, as you suggest)

@Zenithsiz
Copy link

Would it be possible for group_by_mut to allow a Fn(&mut T, &mut T) instead? I don't have a specific use case, but looking at Vec::retain vs Vec::retain_mut, there are use-cases for predicates receiving unique references.

For example, a type might need to cache a result to use in it's PartialEq impl, which might require &mut to do efficiently without inner mutability. This would allow not having to calculate this cached value up-front before group_by_mut, as well as not have to do the expensive calculation twice during group_by_mut (since it wouldn't be able to cache it with a &T).

In terms of implementation, since group_by_mut never calls it's predicate with the same element in both arguments, this should be fine.

I can see how there might be some "weirdness" about passing a mutable reference to a predicate, but if we think that it's a unique reference instead, then it make sense, since the algorithm is designed in such a way that each item passed to the predicate is unique (i.e. both arguments are never the same and won't overlap with yielded slices).

@rfcbot rfcbot added final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. and removed proposed-final-comment-period Proposed to merge/close by relevant subteam, see T-<team> label. Will enter FCP once signed off. labels Oct 26, 2023
@rfcbot
Copy link

rfcbot commented Oct 26, 2023

🔔 This is now entering its final comment period, as per the review above. 🔔

@marcospb19
Copy link
Contributor

marcospb19 commented Oct 28, 2023

@Zenithsiz oi Filipe 👋.

I don't think Fn(&mut T, &mut T) is a great idea because most elements are actually called twice.

0 1 2 3 4
(0 1)
(1 2)
(2 3)
(3 4)

First as the right element, then, as the left element, with the exception of the first and last elements.

If you mutate an element in the previous predicate call, this might lead to a confusing pitfall where the yielded groups do not respect the predicate (as expected).

For the caching usage case you mentioned, I'd say that you can do this instead:

slice.iter_mut().for_each(perform_cache_operation);
// then, call `group_by`

@marcospb19
Copy link
Contributor

I also like chunk_by better as well, for consistency with our use of "chunks" in other APIs.

Looks like most people do agree.

Do we need to wait for FCP to finish before creating a PR to rename group_by to chunk_by?

@rfcbot rfcbot added finished-final-comment-period The final comment period is finished for this PR / Issue. to-announce Announce this issue on triage meeting and removed final-comment-period In the final comment period and will be merged soon unless new substantive objections are raised. labels Nov 5, 2023
@rfcbot
Copy link

rfcbot commented Nov 5, 2023

The final comment period, with a disposition to merge, as per the review above, is now complete.

As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed.

This will be merged soon.

@Ten0
Copy link

Ten0 commented Dec 14, 2023

As suggested here by the original author, I'm porting my comment to here:

fn linear_group_by_key<F, K>(&self, func: F) -> LinearGroupByKey<T, F>
    where F: FnMut(&T) -> K,
          K: PartialEq

Does not allow:

let strings = vec!["A".to_owned()];
strings.linear_group_by_key(|s| s.as_str());

In order for that to work, we need:

fn linear_group_by_key<'a, F, K>(&'a self, func: F) -> LinearGroupByKey<T, F>
    where F: FnMut(&'a T) -> K,
          K: PartialEq

Since nothing moves in memory it should be possible.

While there is (unfortunately) no by_key variant ATM proposed in the std, I'm guessing that chunk_by might also benefit from having lifetime information about the source slice (e.g. if it's storing the reference somewhere) so the same thing somewhat applies.

That would also match better what is done on binary_search_by (it would feel weird that it works with one and not the other).

@SkiFire13
Copy link
Contributor

I'm guessing that chunk_by/chunk_by_mut might also benefit from having lifetime information about the source slice

I agree it might be useful for chunk_by, but unfortunately I don't think this is sound for chunk_by_mut

@Ten0
Copy link

Ten0 commented Dec 14, 2023

I agree it might be useful for chunk_by, but unfortunately I don't think this is sound for chunk_by_mut

Aha you're right, whoops 😅

Yeah so only chunk_by of course. (updated message above)

matthiaskrgr pushed a commit to matthiaskrgr/rust that referenced this issue Jan 26, 2024
Renamed "group by" to "chunk by" a per rust-lang#80552.

Newly stable items:

* `core::slice::ChunkBy`
* `core::slice::ChunkByMut`
* `[T]::chunk`
* `[T]::chunk_by`

Closes rust-lang#80552.
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Jan 26, 2024
…r=dtolnay

Stabilize `slice_group_by`

Renamed "group by" to "chunk by" a per rust-lang#80552.

Newly stable items:

* `core::slice::ChunkBy`
* `core::slice::ChunkByMut`
* `[T]::chunk`
* `[T]::chunk_by`

Closes rust-lang#80552.
matthiaskrgr added a commit to matthiaskrgr/rust that referenced this issue Jan 26, 2024
…r=dtolnay

Stabilize `slice_group_by`

Renamed "group by" to "chunk by" a per rust-lang#80552.

Newly stable items:

* `core::slice::ChunkBy`
* `core::slice::ChunkByMut`
* `[T]::chunk`
* `[T]::chunk_by`

Closes rust-lang#80552.
@bors bors closed this as completed in 0bccdb3 Jan 26, 2024
rust-timer added a commit to rust-lang-ci/rust that referenced this issue Jan 26, 2024
Rollup merge of rust-lang#117678 - niklasf:stabilize-slice_group_by, r=dtolnay

Stabilize `slice_group_by`

Renamed "group by" to "chunk by" a per rust-lang#80552.

Newly stable items:

* `core::slice::ChunkBy`
* `core::slice::ChunkByMut`
* `[T]::chunk`
* `[T]::chunk_by`

Closes rust-lang#80552.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-iterators Area: Iterators A-slice Area: `[T]` C-tracking-issue Category: An issue tracking the progress of sth. like the implementation of an RFC disposition-merge This issue / PR is in PFCP or FCP with a disposition to merge it. finished-final-comment-period The final comment period is finished for this PR / Issue. Libs-Tracked Libs issues that are tracked on the team's project board. T-libs-api Relevant to the library API team, which will review and decide on the PR/issue.
Projects
None yet
Development

Successfully merging a pull request may close this issue.