Contribution bounding with Group By privacy unit #488

Open
dvadym opened this issue Sep 13, 2023 · 0 comments
Labels
Type: New Feature ➕ Introduction of a completely new addition to the codebase

dvadym commented Sep 13, 2023

Context

Prerequisites: PipelineDP terminology, especially privacy unit and partition key.

Note: we're interested in processing large datasets. Performing a group by key on such a dataset requires sending all data corresponding to a specific key to one machine. That's called shuffling, and it's expensive. This task is about implementing a method that does 1 shuffle instead of 2 in one specific case.

One part of the anonymization pipeline is contribution bounding, namely limiting the contributions from 1 privacy unit. One of the common ways to specify the bounds is with max_partitions_contributed and max_contributions_per_partition. Currently it's done with 2 samplings (each of which performs a group by):

  1. Sample max_contributions_per_partition per (privacy_id, partition_key) (code) (i.e. with group by per (privacy_id, partition_key))
  2. Sample max_partitions_contributed per privacy_id (code) (i.e. with group by per privacy_id)
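
To make the two stages concrete, here is a minimal plain-Python sketch in which dictionaries stand in for the two distributed group-by operations. It is not PipelineDP code: the function name `two_stage_bound`, the `(privacy_id, partition_key, value)` row layout, and the `seed` parameter are assumptions made for illustration only.

```python
import random
from collections import defaultdict


def two_stage_bound(rows, max_partitions_contributed,
                    max_contributions_per_partition, seed=0):
    """Bounds contributions with two group-bys (hypothetical helper).

    rows: iterable of (privacy_id, partition_key, value) tuples.
    """
    rng = random.Random(seed)

    # Group by 1 (a shuffle in a real pipeline): key (privacy_id, partition_key).
    per_pid_pk = defaultdict(list)
    for pid, pk, value in rows:
        per_pid_pk[(pid, pk)].append(value)

    # Keep at most max_contributions_per_partition values per key.
    sampled = {}
    for key, values in per_pid_pk.items():
        if len(values) > max_contributions_per_partition:
            values = rng.sample(values, max_contributions_per_partition)
        sampled[key] = values

    # Group by 2 (a second shuffle): key privacy_id.
    per_pid = defaultdict(list)
    for (pid, pk), values in sampled.items():
        per_pid[pid].append((pk, values))

    # Keep at most max_partitions_contributed partitions per privacy unit.
    result = []
    for pid, partitions in per_pid.items():
        if len(partitions) > max_partitions_contributed:
            partitions = rng.sample(partitions, max_partitions_contributed)
        for pk, values in partitions:
            result.extend((pid, pk, v) for v in values)
    return result
```

Each `defaultdict` fill corresponds to one shuffle, which is why this scheme costs two of them.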

Another way to do the sampling is to group by privacy_id once and do both samplings in memory (i.e. having only 1 shuffle).

Goal

Implement sampling with one group by privacy_id, doing both samplings in memory.

Note: Since one privacy unit can contain too many data points, we can limit them with some large constant, for example 10**7.
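
A rough plain-Python sketch of this goal, again with a dictionary standing in for the single distributed group by. All names here (`bound_one_privacy_id`, `single_group_by_bound`, `max_rows_per_privacy_id`, `seed`) are hypothetical and not part of PipelineDP. The cap is applied to an already materialized list, which is an assumption for simplicity; a real pipeline would probably need to apply it while grouping.

```python
import random
from collections import defaultdict


def bound_one_privacy_id(records, max_partitions_contributed,
                         max_contributions_per_partition, rng,
                         max_rows_per_privacy_id=10**7):
    """Bounds one privacy unit in memory (hypothetical helper).

    records: list of (partition_key, value) pairs of a single privacy unit.
    """
    # Cap the number of rows held for one privacy unit.
    if len(records) > max_rows_per_privacy_id:
        records = rng.sample(records, max_rows_per_privacy_id)

    # In-memory grouping by partition_key; no shuffle is involved.
    by_partition = defaultdict(list)
    for pk, value in records:
        by_partition[pk].append(value)

    # Cross-partition bounding: keep at most max_partitions_contributed keys.
    partitions = list(by_partition)
    if len(partitions) > max_partitions_contributed:
        partitions = rng.sample(partitions, max_partitions_contributed)

    # Per-partition bounding for the partitions that were kept.
    out = []
    for pk in partitions:
        values = by_partition[pk]
        if len(values) > max_contributions_per_partition:
            values = rng.sample(values, max_contributions_per_partition)
        out.extend((pk, v) for v in values)
    return out


def single_group_by_bound(rows, max_partitions_contributed,
                          max_contributions_per_partition, seed=0,
                          max_rows_per_privacy_id=10**7):
    """rows: iterable of (privacy_id, partition_key, value) tuples."""
    rng = random.Random(seed)

    # The only group by (one shuffle in a real pipeline): key privacy_id.
    per_pid = defaultdict(list)
    for pid, pk, value in rows:
        per_pid[pid].append((pk, value))

    result = []
    for pid, records in per_pid.items():
        bounded = bound_one_privacy_id(
            records, max_partitions_contributed,
            max_contributions_per_partition, rng, max_rows_per_privacy_id)
        result.extend((pid, pk, v) for pk, v in bounded)
    return result
```

This sketch selects partitions first and only then samples values inside the kept partitions, so no sampling work is spent on partitions that get dropped.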

Code pointers

  1. ContributionBounder is the abstract base class for ContributionBounders.
  2. SamplingCrossAndPerPartitionContributionBounder is the class which does current 2 stage sampling.
  3. SamplingPerPrivacyIdContributionBounder is a class which samples a fixed number of contributions per privacy unit (it's here more as an example)
  4. Tests for contribution bounders
  5. Contribution bounder creation