Context
Prerequisites: PipelineDP terminology, especially privacy unit and partition key.
Note: we're interested in processing large datasets. Performing a group by key on such a dataset requires sending all the data corresponding to a specific key to one machine. That's called shuffling, and it's expensive. This task is about implementing a method for doing 1 shuffle instead of 2 in a specific case.

One part of the anonymization pipeline is contribution bounding, namely limiting the contributions from 1 privacy unit. A common way to specify the bounds is with max_partitions_contributed and max_contributions_per_partition. At the moment this is done with 2 samplings (each of which performs a group by):

1. Sample max_contributions_per_partition contributions per (privacy_id, partition_key) (code), i.e. with a group by (privacy_id, partition_key).
2. Sample max_partitions_contributed partitions per privacy_id (code), i.e. with a group by privacy_id.

A minimal sketch of this two-shuffle flow is given below.
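To make the two shuffles concrete, here is a minimal pure-Python sketch of the current approach. The names (two_shuffle_bound, the arguments) are illustrative, not PipelineDP's actual API, and each in-memory group by stands in for a distributed shuffle.

```python
# Illustrative sketch only: names are hypothetical, not PipelineDP's API.
# Each group-by dict below stands in for a distributed shuffle.
import random
from collections import defaultdict

def two_shuffle_bound(rows, max_contributions_per_partition,
                      max_partitions_contributed):
    """rows: iterable of (privacy_id, partition_key, value) tuples."""
    # Shuffle 1: group by (privacy_id, partition_key), then keep at most
    # max_contributions_per_partition values per group.
    per_pair = defaultdict(list)
    for pid, pk, value in rows:
        per_pair[(pid, pk)].append(value)
    sampled = {
        key: random.sample(values,
                           min(len(values), max_contributions_per_partition))
        for key, values in per_pair.items()
    }
    # Shuffle 2: group by privacy_id, then keep at most
    # max_partitions_contributed partitions per privacy unit.
    per_privacy_id = defaultdict(dict)
    for (pid, pk), values in sampled.items():
        per_privacy_id[pid][pk] = values
    bounded = []
    for pid, partitions in per_privacy_id.items():
        kept = random.sample(list(partitions),
                             min(len(partitions), max_partitions_contributed))
        for pk in kept:
            bounded.extend((pid, pk, v) for v in partitions[pk])
    return bounded
```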
Another way to do the sampling is to do a single group by privacy_key and perform both samplings in memory (i.e. having only 1 shuffle), as in the sketch below.
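A sketch of this single-shuffle variant, with the same hypothetical names and conventions as above; both bounds are enforced in memory once each privacy unit's data is on one machine.

```python
import random
from collections import defaultdict

def one_shuffle_bound(rows, max_contributions_per_partition,
                      max_partitions_contributed):
    """rows: iterable of (privacy_id, partition_key, value) tuples."""
    # The only shuffle: bring all data of each privacy unit to one machine.
    by_privacy_id = defaultdict(list)
    for pid, pk, value in rows:
        by_privacy_id[pid].append((pk, value))
    bounded = []
    for pid, contributions in by_privacy_id.items():
        # Everything below happens in memory, with no further shuffle.
        per_partition = defaultdict(list)
        for pk, value in contributions:
            per_partition[pk].append(value)
        kept = random.sample(
            list(per_partition),
            min(len(per_partition), max_partitions_contributed))
        for pk in kept:
            values = per_partition[pk]
            sample = random.sample(
                values, min(len(values), max_contributions_per_partition))
            bounded.extend((pid, pk, v) for v in sample)
    return bounded
```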
Goal
Implement sampling with one group by privacy_key and do the sampling in memory.

Note: since one privacy unit can contain too many data points, we can limit its size with some large constant, for example 10**7.
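One standard way to enforce such a cap without materializing a whole privacy unit's data first is reservoir sampling, which keeps a uniform sample of at most k elements from a stream of unknown length. A minimal sketch; the constant name is made up for illustration, and the source doesn't prescribe this particular technique.

```python
import random

MAX_DATAPOINTS_PER_PRIVACY_UNIT = 10**7  # hypothetical name for the cap

def reservoir_sample(stream, k=MAX_DATAPOINTS_PER_PRIVACY_UNIT):
    """Uniformly keeps at most k items from an arbitrarily long stream."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            # Replace an existing element with probability k/(i+1).
            j = random.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir
```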
Code pointers