Erik Taubeneck (@eriktaubeneck), Martin Thomson (@martinthomson), Benjamin Savage (@benjaminsavage), Benjamin Case (@bmcase), Daniel Masny (@danielmasny), Richa Jain (@richajaindce)
Originally posted on 2022/07/28.
Interoperable Private Attribution (IPA) is a proposal for a new web platform API for advertising attribution. It consists of two new user-agent APIs: 1) a `set_match_key()` API, and 2) a `get_encrypted_match_key()` API. The match keys which leave the user agent are always encrypted towards a privacy preserving measurement system, i.e., a distributed multi-party computation (MPC) operated by helper parties who are only trusted to not collude. The helper parties that operate this privacy preserving measurement system participate in a protocol to produce an aggregate and differentially private attribution result. Our goal is that a web platform which implements this API only needs to trust the helper parties to not collude in order to have assurance that the API does not enable cross-context tracking. (The PATCG is working on a threat model, and our goal is that IPA will satisfy that model.)
**Update to `set_match_key()` (May 3, 2023):** We discovered a privacy attack against the `set_match_key()` API that could be carried out by a malicious match key provider. The attack does not apply to the case where the device generates a random match key, and we propose focusing on that on-device match key version as an MVP. It remains an open research question how to support setting the match key by a match key provider.
This document provides an end-to-end overview of that protocol, focusing primarily on the MPC performed by the helper parties. In exploring an MPC-based approach, we have made several design choices in favor of simplicity. This is especially true for our approach to DP, as well as for the focus on last touch attribution. We intend to improve the functionality of IPA by adding other approaches to DP and attribution as IPA progresses.
- IPA End to End Protocol
- Overview
- Protocol
- Setup
- Client Side: Setting the match key
- Client Side: Getting the encrypted match key report
- Constraints on encrypted match keys
- Clarifying allowed uses
- Additional Data
- Generating source and trigger reports by the report collector
- Secure Multi Party Computation between a helper party network (P1, P2, P3)
- Technical Discussion and Remarks
- Thanks and Acknowledgements
There are a number of different parties that are involved with the protocol in one way or another:
- User Agent: A piece of software used by an individual person.
- User Agent Vendor: The distributor of the user agent.
- Helper Party: A party who participates in the MPC protocol.
- Helper Party Network: A set of three helper parties. We aim to find helper party networks which are trusted by user agent vendors to be non-colluding.
- Websites/apps: An individual website (identified by its eTLD+1) or an individual app (identified by the mechanism for the OS it operates on.)
- Match key providers: Websites/apps which call the `set_match_key()` API, typically websites/apps with a large logged-in set of users.
- Source websites/apps: Websites/apps on which a user produces a source event.
- Trigger websites/apps: Websites/apps on which a user produces a trigger event.
- Report collectors: A specific website/app or a delegate acting on their behalf, that issues queries to the helper party network.
- Delegate report collectors: Service providers to whom websites/apps delegate their query issuing. In the case where a service provider acts as a delegate for two distinct websites/apps, we consider these two separate report collectors for the purpose of this document.
- Fanout queries: The sets of queries that can be run by a report collector. A fanout query must either be a source fanout query or a trigger fanout query.
- Source fanout query: A query issued by a report collector composed of source reports from a single source website/app, which is the website/app tied to that report collector. A source fanout query may contain trigger reports from many trigger websites/apps.
- Trigger fanout query: A query issued by a report collector composed of trigger reports from a single trigger website/app, which is the website/app tied to that report collector. A trigger fanout query may contain source reports from many source websites/apps.
- Epoch: a set period of time, e.g., a week.
- Match key: An identifier, set in the user agent, which identifies an individual person. This must never be released (beyond the match key provider) to any party in unencrypted form.
- Breakdown key: A key, specified by the report collector, which allows for producing aggregates across many groups (or breakdowns.)
- Attribution Constraint ID: An identifier, specified by the report collector, to denote if a given pair of source and trigger events can be attributed (beyond having the same match key.)
The privacy goals of the IPA proposal are achieved by limiting (through differential privacy) the amount of information about people revealed to individual websites/apps. However, many (if not most) websites/apps that use this API will likely do so by working with one or more service providers, which we call delegated report collectors. This results in a somewhat complex many-to-many relationship between websites/apps and delegated report collectors. For the sake of this document, a report collector is assumed to work with a single website/app; in the case where the same service provider works with multiple websites/apps, we consider those distinct report collectors. Each report collector may be one of many report collectors for that website/app, which is primarily meaningful in differential privacy budgeting, discussed below.
Attribution measurement is a basic measurement approach used in online digital advertising. For a given conversion event (e.g., a purchase), we aim to attribute that conversion to an ad impression (if one exists.) We generalize into source events (i.e., an ad impression) and trigger events (i.e., a purchase.) Source events happen on a source website/app, and trigger events happen on a trigger website/app. In order to attribute, a source event must occur before the trigger event and must be from the same individual. In the event that a single query contains trigger events from multiple trigger websites/apps, we must also constrain the attribution to only attribute trigger events to source events from relevant campaigns.
There are various approaches to addressing the situation when a trigger event can be attributed to multiple source events; for simplicity, in this document we currently only address last touch attribution, i.e., trigger events are attributed to the most recent source event (although it is feasible to extend this approach to more complex attribution approaches).
Source and trigger events are represented by source and trigger reports (more details in section Generating source and trigger reports.) Source reports include a breakdown key, which is used to identify the granularity of the attribution queries. Both source and trigger reports include an attribution constraint ID, for cases where trigger events may not be eligible to attribute to all source events. The end result of attribution measurement is aggregates (counts and sums of values included with trigger reports), grouped by the breakdown key.
As a SQL query, we can approximate last touch attribution as:
```sql
SELECT breakdown_key, SUM(value)
FROM (
    SELECT t.id,
           ANY(t.value) AS value,
           MAX_BY(s.breakdown_key, s.timestamp) AS breakdown_key
    FROM source_reports s
    JOIN trigger_reports t
      ON s.match_key = t.match_key
    WHERE s.timestamp < t.timestamp
      AND s.attribution_constraint_id = t.attribution_constraint_id
    GROUP BY t.id
)
GROUP BY breakdown_key;
```
IPA begins with different websites collecting information on important events. Sites define what events are interesting to them. Any information about that event can be collected along with the encrypted information produced by `get_encrypted_match_key()`. At the time that an event is collected, IPA doesn't distinguish between source events and trigger events.
These events will occur on different websites/apps, so they then need to be gathered together by a report collector and assembled into a fanout query. This means that websites/apps -- or entities acting on their behalf, such as an agency, ad network, SSP, or DSP -- will need to coordinate the collection of events.
The bulk of the information that is collected for each event is chosen by the website/app and is not passed to the user agent. This information is not protected, so websites/apps can choose what information they share about each event.
There are two types of queries which can be issued by a report collector, each designed to provide information to a single site:
- A source fanout query is designed to help a website/app understand the effect that the ads it shows have on outcomes for its advertisers. A source fanout query can only contain source reports from a single source website/app. A source fanout query can include trigger reports from multiple sites, where each report might represent a conversion event.
- A trigger fanout query helps a site that buys ads to understand how its advertising is helping to drive outcomes on its website/app. A trigger fanout query can only contain trigger reports from a single trigger website/app. A trigger fanout query includes source reports from multiple sites, where each report might represent an ad impression or click.
Before making a query, source websites/apps and trigger websites/apps provide a report collector with additional information, enabling it to generate source reports and trigger reports by associating encrypted match keys with other values. A collection of source reports and trigger reports with associated values determines what a query result means:

- A breakdown key is associated with each source report in a query. The breakdown key is used to split source reports into groups for aggregation. A breakdown key can represent an advertising campaign, an advertising channel, a set of creatives, or any combination of these with other data from the event. The output of the query includes an aggregate value for each breakdown key.
- A trigger value is associated with each trigger report. If the trigger report is matched with a source report by the MPC, the trigger value contributes to the aggregated value for the breakdown key from the source report. Note that sensitivity capping might mean that some trigger values do not contribute to the final result.
- An attribution constraint ID can be associated with both source reports and trigger reports. Reports with different attribution constraint IDs will not be matched to each other, which allows a site to include source reports and trigger reports that it does not wish to have attributed together. For instance, if the goal is to perform attribution for distinct product segments, assigning a different attribution constraint ID to reports from each distinct segment will ensure that impressions from one segment are not attributed to conversions from the other segment.
The report collector uses these associated values on events it receives to construct queries. Choosing which events are included and the values that are associated with each gives the report collector the ability to make different queries from the same set of events, subject only to differential privacy constraints on those queries.
Each site has a finite privacy budget for making queries. The total number of queries is not directly limited, but each query made by (or for) a website/app expends a portion of a finite differential privacy budget. The smaller the portion that is expended, the more differential privacy noise that is added to any results.
The budget is associated with the website/app that provides source reports for a source fanout query or the website/app that provides trigger reports for a trigger fanout query. A website/app can provide source reports for trigger fanout queries on any other website/app without expending their budget; similarly, no budget is spent by providing trigger reports to another website/app for source fanout queries.
For the MPC run by a helper party network we are proposing a 3-party, malicious, honest majority MPC such that even if one of the helper parties actively tries to attack the protocol and runs malicious code to do so, they will be unable to learn any of the sensitive inputs and any actively malicious behavior will be detectable. We believe this will satisfy a threat model where we are trying to prevent any helper party from leaking data in the event that they are curious, corrupted or compelled to do so.
With this being an honest majority MPC that does mean that if two of the three helper parties in a helper party network collude with each other the sensitive inputs can be leaked without detection. To mitigate this risk we propose the helper party networks be carefully chosen (e.g., to protect against multiple helper parties being compelled, they should be located in different jurisdictions unlikely to work together).
This three party, honest majority setting allows for very efficient MPC protocols that can achieve malicious security at only a reasonable cost over semi-honest security.
In our current proposal, we have restricted ourselves to basic techniques to ensure differential privacy (DP). Our current approach favors simplicity, but we plan to continue researching ways to get better utility with the same privacy guarantee. Fundamentally, any approach to ensure DP is (in principle) compatible with MPC, but not every approach might be sufficiently efficient in an MPC.
We are using DP to ensure that for a specific period of time, an epoch (e.g., a week,) the amount of information revealed about an individual person is bounded. Below, we describe how to achieve DP for an individual query, and then how to manage a privacy budget across queries over the course of an epoch.
In order to provide differential privacy, aggregate results have differentially private random noise added. Differential privacy is parameterized by ε (epsilon), which (inversely) scales the noise added to an aggregate output. The noise that is added needs to be proportional to the sensitivity, or the amount that inputs from an individual can influence the aggregate.
Adding more noise to an aggregate result can make the value less useful. A small ε means more noise and better privacy; a large ε means less noise and better utility. To ensure that noise is finite, the amount that each individual could contribute to the aggregate needs to be bounded. This bounding is done by sensitivity capping.
In order to provide differential privacy at the level of individual user contributions (as identified by match keys), each match key must be limited in the total contribution it can make to the aggregate. This maximum contribution is the sensitivity of the attribution function, and together with the ε, determines the amount of noise required to achieve differential privacy.
The sensitivity cap that is set for a query will be determined as a parameter to the query. Any contribution that exceeds this cap for a single individual will be lost. The exact value of what is lost due to this capping will be unknown to all entities involved - the helper parties and the report collector. For example, a report collector might set a maximum contribution of $100, but would be unaware how many users (if any) exceeded that cap and by how much the cap was exceeded.
Note that because individual contributions are capped, our protocol also provides robustness against malicious inputs over that cap. In addition to providing privacy assurances, the sensitivity cap also limits the amount by which a malicious input can alter a result.
The output of each query is a sum per breakdown key. The contribution from a single user is limited across all breakdown keys, but this contribution can be allocated all to a single breakdown key or distributed across multiple breakdown keys. Consequently, the sensitivity of the value of each breakdown is determined by the global sensitivity cap.
Random noise is added to each breakdown, using ε and the sensitivity to inform the variance of the noise distribution. Noise will be added to each breakdown sum to provide global DP. The exact noise distribution (Laplace, Gauss, ...) and method of application (in-MPC, by helpers, ...) has not yet been determined. This needs to consider the effect of the DP composition theorem, especially for multiple queries.
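As an illustration of how this calibration might look, here is a small in-the-clear sketch assuming a Laplace mechanism (only one of the candidate distributions mentioned above); in the actual protocol the noise would be sampled and added inside the MPC, and the names here are ours:

```python
import numpy as np

def add_dp_noise(breakdown_sums, sensitivity_cap, epsilon):
    """Add Laplace noise scaled to sensitivity_cap / epsilon to each breakdown sum.

    In-the-clear sketch only: in IPA the noise would be sampled and added
    inside the MPC, and the exact distribution is still an open design choice.
    The cap is assumed to bound each user's total (L1) contribution across
    all breakdown keys.
    """
    scale = sensitivity_cap / epsilon
    return {
        breakdown_key: total + np.random.laplace(loc=0.0, scale=scale)
        for breakdown_key, total in breakdown_sums.items()
    }

# Example: two breakdown keys, a cap of 100, and a per-query budget of 0.5.
print(add_dp_noise({2: 0, 3: 295}, sensitivity_cap=100, epsilon=0.5))
```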
The previous section focuses on applying differential privacy to individual queries. However, we need to further design a system such that over an epoch, the amount of information released about people is bounded. Specifically, we propose that for a given epoch and website/app (represented by one or more report collectors), individuals (represented by match keys) can have a bounded impact on the results, as measured by ε differential privacy.
In our current approach, we achieve this by providing each website/app with a budget, ε. That website/app is then able to allocate that budget across their report collectors such that each report collector has budget ε_r (and the sum of all ε_r equals ε). Websites/apps will need to commit to a specific budget allocation, and will not be able to change that commitment until the next epoch. The mechanism for managing this commitment is an open problem.

When report collectors run queries they will specify how much budget to use, ε_i, which will be deducted from their remaining budget. For example, given an epoch limit of ε_r, a report collector could perform 10 queries, each with global DP applied at ε_r/10, or a more complicated set of queries such as three with ε_r/5 and four with ε_r/10. When split over multiple queries, smaller ε_i values result in more noise on each query; each query has greater privacy protection, but the overall privacy loss is similar.

Each query will include both the budget, ε_i, and the sensitivity cap, which together determine the amount of noise required for that query. There is a tradeoff between noise level and sensitivity cap: larger noise allows less sensitivity capping, i.e., larger values after the capping. The exact tradeoff between noise level and cap would be chosen by the report collector according to how they want to use their budget. Statistically, this is a tradeoff between bias (introduced by the capping) and accuracy (decreased by the differentially private noise.)

The helper party network will need to maintain this budget, per report collector and reflecting the commitments made by websites/apps, over the course of an epoch, preventing report collectors from issuing additional queries once that budget is exhausted. At the beginning of the next epoch, every report collector’s budget will refresh to ε_r (barring any changes in allocation by the respective website/app.) Note that in infinite time, the information revealed also goes to infinity; our aim here is to control how quickly that divergence happens.
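The bookkeeping the helper party network would perform can be sketched as follows; this is a minimal Python illustration, and the class and method names are our own rather than part of the proposal:

```python
class BudgetManager:
    """Illustrative per-epoch budget accounting for report collectors."""

    def __init__(self, epoch_budgets):
        # epoch_budgets maps report_collector -> the epsilon_r committed to by
        # the corresponding website/app for the current epoch.
        self.epoch_budgets = dict(epoch_budgets)
        self.remaining = dict(epoch_budgets)

    def try_spend(self, report_collector, epsilon_i):
        """Deduct epsilon_i from the collector's remaining budget, or refuse the query."""
        remaining = self.remaining.get(report_collector, 0.0)
        if epsilon_i <= 0 or epsilon_i > remaining:
            return False  # budget exhausted (or invalid request): query is rejected
        self.remaining[report_collector] = remaining - epsilon_i
        return True

    def start_new_epoch(self):
        """Budgets refresh to the committed epsilon_r at the start of each epoch."""
        self.remaining = dict(self.epoch_budgets)
```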
Part of the protocol involves a helper party network with three parties who the user-agent vendor trusts to be non-colluding. We assume that the user agent will be configured with the vendor approved helper party networks.
For each helper party network, some amount of setup needs to exist:
- The helper parties, P1, P2, P3, each generate one public key, secret key pair, i.e., `(pk_1, sk_1)`, `(pk_2, sk_2)`, and `(pk_3, sk_3)`, of an IND-CCA secure public key encryption.
  - We use `Encrypt(pk, data)` and `Decrypt(sk, data)` to denote its encryption and decryption under keys `(pk, sk)`.
- The public keys, i.e., `pk_1`, `pk_2`, and `pk_3`, are retrieved by the user agents. Each of the private keys is stored by exactly one of the helper parties.
Every epoch, report collectors will need to make a commitment to use a specific helper party network and match key provider. These commitments are important for enabling the differential privacy model we outline above; for a given report collector, a single helper party network needs to manage its budget, and the commitment to a single match key provider prevents duplicating user reports.
The mechanism for making such a commitment, and making sure that it’s visible to all helper party networks is still an open problem.
For IPA we assume there will be unique match keys for each person that can be processed under encryption in the MPC. We are currently considering two main methods for setting match keys:
- Using the `set_match_key()` API
- Generating it randomly on the device

The `set_match_key()` API allows for the setting of a write-only `match_key` in the user agent by match key providers. Match key providers are websites/apps that are approved by the user agent vendor to set match keys. They would provide the most value when they are parties with large logged-in user bases, particularly across devices. As a user logs in to a match key provider service with different user agents (across different browsers and apps, potentially on different devices), the match key provider will set the same match key for that user on each user agent. The reason we need match key providers to be approved by the user agent vendor is that in the current construction the match key providers would need to be trusted not to be malicious.
The second method (currently recommended for the initial version as most secure) is to have the device generate a random match key.
In either case, the match key cannot be read directly; it is only used to produce secret shares encrypted towards a specific helper party network.
The second API proposed in IPA, `get_encrypted_match_key()`, allows report collectors (acting on behalf of a specific source/trigger website/app) to request an encrypted match key.

This function requires three inputs: the report collector (acting on behalf of a specific source/trigger website/app), the match key provider, and the helper party network. The report collector will need to make a commitment to use the same match key provider and helper party network for the duration of an epoch; however, this will be validated later by the helper party network when the query is submitted.
To generate the encrypted match key, the user agent will perform the following:
- Look up in its local, private storage to see if a match key from the match key provider exists.
  - If no match key exists, it will generate a random local one, and store it.
- Generate XOR secret shares of the match key, `match_key = mk_1 ^ mk_2 ^ mk_3`.
- Generate the tuple `data_i = (mk_i, report_collector, current_website/app, match_key_provider)`.
- Encrypt each data tuple under the public keys of the helper party network: `Encrypt(pk_1, data_1), Encrypt(pk_2, data_2), Encrypt(pk_3, data_3)`.
- These reports are provided back to the report collector that called the API.
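As a rough sketch of these client-side steps, the following Python illustration uses byte-wise XOR for the secret sharing and a stand-in `encrypt(pk, data)` function in place of the IND-CCA public key encryption; the key width and data layout are assumptions made for illustration only:

```python
import os

MATCH_KEY_BYTES = 32  # illustrative width; the real size is a protocol choice

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encrypted_match_key(match_key, report_collector, current_site,
                        match_key_provider, public_keys, encrypt):
    """Secret-share the match key with XOR and encrypt one share per helper.

    `encrypt(pk, data)` stands in for the helper party network's IND-CCA
    secure public key encryption; `public_keys` is (pk_1, pk_2, pk_3).
    """
    mk_1 = os.urandom(MATCH_KEY_BYTES)
    mk_2 = os.urandom(MATCH_KEY_BYTES)
    mk_3 = xor_bytes(xor_bytes(match_key, mk_1), mk_2)  # match_key = mk_1 ^ mk_2 ^ mk_3
    return tuple(
        encrypt(pk, (mk_i, report_collector, current_site, match_key_provider))
        for pk, mk_i in zip(public_keys, (mk_1, mk_2, mk_3))
    )
```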
The key privacy property in generating the encrypted match key is to prevent the report collector from learning anything new about the user from the user agent. As a different secret sharing and encryption is used across the contexts of different websites/apps, encrypted match keys cannot be used by the report collector to enable cross-context tracking. Since an encrypted match key is always returned upon request, the existence of a report reveals no new information to the report collector. Finally, since a random fallback value is used, a report collector cannot learn even if the match key was set.
An encrypted match key is bound both to the site that requested it and to the epoch in which it was generated. This prevents a site from misrepresenting where the match key was generated and having it used for multiple queries.
But there is another key privacy property enabled by match keys, and that is capping the total contribution a single user can make to the aggregate output of an IPA query. As such, each site (or the report collector they delegate to) will need to make a commitment (for at least an epoch) to a specific match key provider, to prevent exceeding the intended privacy budget by leveraging multiple providers. What’s more, although a match key provider can call `set_match_key()` at any time, any change they make will not be applied by the user agent until the start of the next epoch.

This implies that the user-agent need not generate a new secret-sharing and encryption of the match key every time. If the `get_encrypted_match_key()` API is invoked multiple times, on the same website/app, for the same report collector (and, implicitly, the same match key provider and helper party network), the user-agent can just return the same cached value. This will not reveal any cross-context information. The only exception would be if the user clears that website/app’s storage. In this case, a new secret-sharing and encryption will need to be generated in order to prevent re-identification across that clear.
Finally, to prevent a report collector from circumventing their privacy budget by running duplicate queries, we need the helper party network to enforce that a report collector (acting on behalf of a specific source or trigger website/app) can only run source or trigger fanout queries related to that website/app. We can achieve this by having the helper party network ensure that:

- They (the helper party network) are the helper party network committed to by the report collector.
- All provided encrypted match keys are from the match key provider committed to by the report collector.
- The current source or trigger fanout query only contains source or trigger reports (respectively) with a match key generated on the report collector’s associated website/app, which is recorded in the current website/app field of `data_i`.
It is possible for more than one report collector to invoke the `get_encrypted_match_key()` API on the same source or trigger website/app.

For example, `shoes.example` might buy ads on multiple source websites/apps. They would run trigger fanout queries to measure the effectiveness of the ads they pay for across all source websites/apps. Each of those source websites/apps would in turn like to receive trigger reports from `shoes.example` for use in their own source fanout queries.

Suppose that `search.example` is one of those source websites/apps where `shoes.example` buys ads. In order to obtain an encrypted match key which could be used for `search.example`’s source fanout queries, the `get_encrypted_match_key()` function will need to be invoked on the `shoes.example` site, specifying `search.example` as the report collector. In this example, `search.example` would either need to have a script embedded on `shoes.example`, or would need to coordinate with `shoes.example` to generate the events, as laid out in the next section.
Let’s define the additional data a report collector must generate for source reports and trigger reports, beyond just the encrypted match key.
When a source event occurs, a report collector should also generate:
- `timestamp`, the time of the event represented as a unix timestamp integer. The report collector is responsible for deciding how to determine the time of the event.
- `breakdown_key`, represented as an integer. A breakdown key known at the time the ad was shown, which will be used to group the aggregated values.
- `attribution_constraint_id`, a value that a source and trigger report will also need to match to be attributed.
Similarly, when a trigger event occurs, the report collector should also generate:
- `timestamp`, the time of the event represented as a unix timestamp integer. The report collector is responsible for deciding how to determine the time of the event.
- `trigger_value`, the value of the trigger report to be aggregated.
- `attribution_constraint_id`, a value that a source and trigger report will also need to match to be attributed.
When source and trigger events occur, the only information that must be prepared by the user agent is the encrypted match key. All of the additional data listed above can be produced by the report collector without the need for any newly designed APIs.
Next, the report collector must produce source and trigger reports in a standard format that the helper party network expects to receive. The following two structs provide that for the source and trigger reports, respectively:
```
struct {
    tuple[Ciphertext] encrypted_match_key_shares;
    tuple[Ciphertext] encrypted_attribution_constraint_id;
    tuple[Ciphertext] encrypted_timestamp_shares;
    tuple[Ciphertext] encrypted_breakdown_key_shares;
} source_report

struct {
    tuple[Ciphertext] encrypted_match_key_shares;
    tuple[Ciphertext] encrypted_attribution_constraint_id;
    tuple[Ciphertext] encrypted_timestamp_shares;
    tuple[Ciphertext] encrypted_trigger_value;
} trigger_report
```
Aside from the encrypted match key (which is already in the correct format), for each of the fields of these structs, the report collector must do the following steps:
- Generate a 3-way secret sharing, e.g., `v = v_1 + v_2 + v_3`
- Encrypt each of the 3 shares towards one of the three helper parties, e.g., `Encrypt(pk_1, v_1), Encrypt(pk_2, v_2), Encrypt(pk_3, v_3)`
Report collectors might choose to do this processing either client-side or server-side; either is acceptable. There are advantages and disadvantages associated with each approach.
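The following sketch, which is illustrative rather than normative, shows the two steps above using additive sharing over an assumed 32-bit modulus and the same stand-in `encrypt` function as before:

```python
import secrets

MODULUS = 2**32  # illustrative; the real share domain is a protocol choice

def additive_shares(value, modulus=MODULUS):
    """Split value into three additive shares: v = v_1 + v_2 + v_3 (mod modulus)."""
    v_1 = secrets.randbelow(modulus)
    v_2 = secrets.randbelow(modulus)
    v_3 = (value - v_1 - v_2) % modulus
    return v_1, v_2, v_3

def encrypt_field(value, public_keys, encrypt):
    """Encrypt each of the three shares towards one helper party."""
    return tuple(
        encrypt(pk, share)
        for pk, share in zip(public_keys, additive_shares(value))
    )
```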
From here, the report collector is able to construct either a source or trigger fanout query, which are limited to having source reports or trigger reports, respectively, with match keys only from the website/app they are acting on behalf of. The report collector issues that query by sending the collection of reports to the helper party network which they are committed to.
To give a high level overview of what happens in the MPC:
- First the helper parties receive and decrypt their shares for source or trigger reports.
  - Helper party j receives either a `source_report` or a `trigger_report`, and decrypts their shares of the attributes using `sk_j`.
  - Helper party j validates the report and builds a `generic_report` from their decrypted shares. Section 2.8.1 Validating reports expands on the details.
- The helper parties supply these shares as private inputs into the MPC.
- The helper parties sort the generic reports by `match_key`, `attribution_constraint_id`, then `timestamp`.
  - We explore a handful of possible optimizations for this in section Technical Discussion and Remarks.
- The MPC also sets a secret shared helper bit to indicate if a row has the same `(match_key, attribution_constraint_id)` as the one before it.
  - This could happen in a customized sorting protocol designed to also create these bits, or after the sort as a standalone step.
- The MPC computes an oblivious attribution, a protocol we describe below that uses a tree-like pattern to obliviously attribute trigger values to the nearest preceding source report with the same match key.
- The MPC enforces a sensitivity cap on the sum of trigger values from a particular match key.
- The MPC adds the `trigger_value` of the attributed trigger report to the sum corresponding to the `breakdown_key` of the source report to which it was attributed.
- Using the sensitivity cap and the DP budget parameter, the MPC samples random noise and adds it to the sum corresponding to each breakdown key to provide a global differential privacy guarantee.
- For each epoch (e.g., one week), the MPC helpers track the total budget consumed by each report collector, to limit the total information leakage from the system for each epoch and report collector.
We now describe each step of the MPC in a bit more detail. Our MPC protocol relies heavily on the concept of oblivious algorithms: algorithms whose access patterns do not depend on their input. Oblivious algorithms allow for efficient MPC protocols, and we give more background in the section Technical Discussion and Remarks. In the following, we discuss our current approaches to oblivious sorting and oblivious last touch attribution.
When each helper party decrypts their share of the match key data, it also includes the report collector who invoked `get_encrypted_match_key()`, along with the website/app where it was invoked and the match key provider. The helper party must validate a few items:
- That the report collector’s committed helper party network is the currently running helper party network.
- This should be implicitly true, otherwise the helper parties would be unable to decrypt the encrypted match key data.
- That the report collector’s committed match key provider is the match key provider specified in all decrypted match key data.
- That the report collector issuing the query is the report collector specified in all decrypted match key data.
- That the query is either a source fanout or trigger fanout query for the website/app that the report collector is issuing the queries on behalf of. Explicitly, it is either:
  - A source fanout query where all source reports have match key data that was invoked on the website/app that the report collector is issuing the queries on behalf of, or
  - A trigger fanout query where all trigger reports have match key data that was invoked on the website/app that the report collector is issuing the queries on behalf of.
In the MPC, source and trigger reports will need to be indistinguishable, so they will have the same format: generic report. This prevents information leakage by ensuring the helper parties are unable to differentiate between source and trigger reports throughout the entire protocol. The generic report shares (seen from the viewpoint of an individual helper party) use the following struct:
```
struct {
    int match_key_share;
    int attribution_constraint_id_share;
    int timestamp_share;
    int is_trigger_report_share;
    int breakdown_key_share;
    int trigger_value_share;
} generic_report_share
```
Source reports do not have any associated `trigger_value`, and trigger reports do not have any associated `breakdown_key`, so for these each helper party can just set their share to zero. Additionally, each helper party will initially know which reports are source reports and which are trigger reports, and set the `is_trigger_report` accordingly. Without loss of generality, we can assign helper party 1 to set the value of `is_trigger_report` to 1 for trigger reports, and all other helper parties set the value to 0. (All helper parties set the value to 0 for source reports.)
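Purely for illustration, a single helper party j might assemble its share of a generic report along these lines; the field handling mirrors the rules above, while the data layout and helper names are our own:

```python
from dataclasses import dataclass

@dataclass
class GenericReportShare:
    match_key_share: int
    attribution_constraint_id_share: int
    timestamp_share: int
    is_trigger_report_share: int
    breakdown_key_share: int
    trigger_value_share: int

def to_generic_share(decrypted, is_trigger_report, helper_index):
    """Build one helper's generic report share from its decrypted field shares.

    Fields a report does not carry become zero shares, and only helper party 1
    contributes a share of 1 for is_trigger_report on trigger reports.
    """
    return GenericReportShare(
        match_key_share=decrypted["match_key"],
        attribution_constraint_id_share=decrypted["attribution_constraint_id"],
        timestamp_share=decrypted["timestamp"],
        is_trigger_report_share=1 if (is_trigger_report and helper_index == 1) else 0,
        breakdown_key_share=decrypted.get("breakdown_key", 0),
        trigger_value_share=decrypted.get("trigger_value", 0),
    )
```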
In the next step, the values will be obliviously sorted, which reorders the events and re-randomizes all of these secret shares in such a way that each row is indistinguishable from any other, and unlinkable to their original position. At this point, due to the structure of the `generic_report`, source reports and trigger reports are indistinguishable.
The first step of the MPC is to sort the reports by the match key, then the attribution constraint ID, then the timestamp. This sort must be oblivious, i.e., it cannot reveal (even to the helper parties) the sorted location of any report. As such, the sorting must also involve re-sharing the secret shares of all the data in the report (to prevent the helper parties from linking the input rows to the output rows). For better readability, we give some examples in the clear, i.e., not secret shared. We emphasize that all the following lists would be secret shared among the helper parties and no party would have access to the data in the clear. The sorted list has the following format:
Example Data In the Clear (Would be secret shared in MPC):

Match Key | Attribution Constraint | Timestamp | Helper Bit | Trigger Bit | Breakdown Key | Purchase Value |
--- | --- | --- | --- | --- | --- | --- |
1454 | 53 | 1:07 | 0 | 0 | 2 | 0 |
1454 | 53 | 1:27 | 1 | 0 | 3 | 0 |
1454 | 53 | 3:46 | 1 | 1 | -1 | 250 |
1454 | 53 | 4:12 | 1 | 1 | -1 | 25 |
1454 | 53 | 4:39 | 1 | 1 | -1 | 20 |
1454 | 72 | 5:52 | 0 | 1 | -1 | 130 |
1454 | 72 | 7:48 | 1 | 1 | -1 | 50 |
5422 | 53 | 3:24 | 0 | 0 | 0 | 0 |
9086 | 14 | 2:56 | 0 | 1 | -1 | 100 |
As mentioned above, the helper bit indicates if the preceding row has the same `(match_key, attribution_constraint_id)` as the current element. We also may benefit from simply concatenating `match_key` and `attribution_constraint_id` together. We are exploring protocols to generate this helper bit during the sorting, and ultimately it’s an optimization to prevent a more expensive equality check between two rows in the attribution step (next.) We are actively researching which sorting algorithm will be best to use here, particularly in combination with creating the helper bit.
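In the clear, the helper bit is simply an equality check against the preceding row. The sketch below shows the intended semantics; in the MPC this would be computed on secret shares, ideally as a by-product of the sort, and the function here is only our illustration:

```python
def compute_helper_bits(sorted_rows):
    """helper_bit = 1 when a row shares (match_key, attribution_constraint_id)
    with the row immediately before it; the first row of each group gets 0."""
    bits = []
    previous = None
    for row in sorted_rows:
        key = (row["match_key"], row["attribution_constraint_id"])
        bits.append(1 if key == previous else 0)
        previous = key
    return bits
```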
In this section, we describe our oblivious algorithm for last touch attribution. As mentioned earlier, we constrain ourselves within this document to last touch attribution for the sake of simplicity. We aim to extend the IPA attribution functionality in the future to other attribution heuristics, such as equal credit.
Our oblivious attribution algorithm operates on the sorted list of secret-shared values (described above) and has O(log N) iterations, where N is the list size. During each iteration, `i = [0, log_2(N))`, each row is compared to the row `2^i` below it in the list. This approach establishes a tree-like structure in which, starting from the leaf nodes, each node accesses and accumulates data of its children. By increasing the distance between the interacting nodes during each iteration by a factor of two, we ensure that each node accumulates the value of each successor only once. We stop accumulating data when `(match_key, attribution_constraint_id)` are not equal, as indicated by the helper bit.
We must perform all iterations even if all nodes have already stopped accumulating data, because stopping early would reveal information (making it a non-oblivious algorithm.) Note that we could optimize this to a fixed, but less than `log(N)`, number of iterations, if we believe that match keys will only have at most `2^i` reports in a given query.
In each iteration of the oblivious last touch attribution, each row is only compared with one other row in the list, which makes it highly parallelizable. The following image shows this interaction pattern over the course of 4 iterations.
This interaction pattern is oblivious because it does not depend on the values of the match keys. Further, it allows for accumulation of data across all successors, which can be modeled as a tree with nodes. For example, if we think of the first row as the root of a tree, we can accumulate the following 8 rows in only 4 iterations:
We are adding two additional variables that will help us to accumulate the last touch attribution. One is a stop bit that will indicate whether the current row has stopped accumulating credit. It will be initialized to 1, and will stop accumulating credit once it is updated to 0. The other one is the credit variable that stores the accumulated credit. For trigger reports, we initialize the credit with its purchase value, and use -1 for the breakdown key (since it’s unused). The list has the following format:
Example Data In the Clear (Would be secret shared in MPC):

Match Key | Attribution Constraint | Timestamp | Helper Bit | Trigger Bit | Stop Bit | Breakdown Key | Purchase Value | Credit |
--- | --- | --- | --- | --- | --- | --- | --- | --- |
1454 | 53 | 1:07 | 0 | 0 | 1 | 2 | 0 | 0 |
1454 | 53 | 1:27 | 1 | 0 | 1 | 3 | 0 | 0 |
1454 | 53 | 3:46 | 1 | 1 | 1 | -1 | 250 | 250 |
1454 | 53 | 4:12 | 1 | 1 | 1 | -1 | 25 | 25 |
1454 | 53 | 4:39 | 1 | 1 | 1 | -1 | 20 | 20 |
1454 | 72 | 5:52 | 0 | 1 | 1 | -1 | 130 | 130 |
1454 | 72 | 7:48 | 1 | 1 | 1 | -1 | 50 | 50 |
5422 | 53 | 3:24 | 0 | 0 | 1 | 0 | 0 | 0 |
9086 | 14 | 2:56 | 0 | 1 | 1 | -1 | 100 | 100 |
Using the oblivious access pattern, we can implement last touch attribution using the following logic, which is executed for each row. We use `current` for the current row, and `successor` for the row `2^i` below it.
```
if (current.stop_bit == 1 and successor.helper_bit == 1 and successor.trigger_bit == 1):
    current.credit += successor.credit
    current.stop_bit = successor.stop_bit
else:
    current.stop_bit = 0
```
**MPC Optimization for Last Touch.** We can improve the efficiency of the last touch logic by not using branching logic, replacing the `if/else` form in the above code with arithmetic operations. Fortunately, we can easily represent the logic of the last touch attribution as arithmetic operations, as follows:
```
flag = current.stop_bit * successor.helper_bit * successor.trigger_bit
current.credit = current.credit + flag * successor.credit
current.stop_bit = flag * successor.stop_bit
```
We can further optimize this by not computing it fresh every step, but storing it explicitly on the list and updating it.
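Putting the iterations together, an in-the-clear reference for the full pass might look like the sketch below. The MPC would execute the same fixed comparison pattern on secret shares; here we operate directly on plaintext rows, and the structure is our own illustration of the arithmetic logic above:

```python
import math

def oblivious_last_touch(rows):
    """Run the fixed ceil(log2(N)) iterations of the attribution pass on sorted rows.

    Each row is a dict with keys helper_bit, trigger_bit, stop_bit, and credit
    (credit starts at the purchase value for trigger rows and 0 for source rows).
    The comparison pattern depends only on the number of rows, never on the data,
    which is what makes the access pattern oblivious.
    """
    n = len(rows)
    iterations = math.ceil(math.log2(n)) if n > 1 else 0
    for i in range(iterations):
        step = 2 ** i
        # Compute every row's update against a snapshot of the current state,
        # mirroring the fact that all comparisons in an iteration happen in parallel.
        updates = []
        for idx, current in enumerate(rows):
            if idx + step < n:
                successor = rows[idx + step]
                flag = current["stop_bit"] * successor["helper_bit"] * successor["trigger_bit"]
                updates.append((current["credit"] + flag * successor["credit"],
                                flag * successor["stop_bit"]))
            else:
                # No successor at this distance: keep the credit, stop accumulating.
                updates.append((current["credit"], 0))
        for row, (credit, stop_bit) in zip(rows, updates):
            row["credit"], row["stop_bit"] = credit, stop_bit
    return rows
```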
Using this logic during each iteration for each row, we obtain the following intermediate results for our example list.
In iteration 0, each row is compared to the next row (i.e., `index + 1 == index + 2^0`) to update the Stop Bit and Credit columns. Values that change from the previous round are highlighted in bold.
Example Data In the Clear (Would be secret shared in MPC):

Match Key | Attribution Constraint | Timestamp | Helper Bit | Trigger Bit | Stop Bit | Breakdown Key | Purchase Value | Credit | Updated Stop Bit: Helper Bit[+1] & Trigger Bit[+1] & Stop Bit | Updated Credit: Credit + Credit[+1] * (Updated Stop Bit) |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
1454 | 53 | 1:07 | 0 | 0 | 1 | 2 | 0 | 0 | (1 & 0 & 1) = 0 | 0 + 0*0 = 0 | |
1454 | 53 | 1:27 | 1 | 0 | 1 | 3 | 0 | 0 | (1 & 1 & 1) = 1 | 0 + 1*250 = 250 | |
1454 | 53 | 3:46 | 1 | 1 | 1 | -1 | 250 | 250 | (1 & 1 & 1) = 1 | 250 + 1*25 = 275 | |
1454 | 53 | 4:12 | 1 | 1 | 1 | -1 | 25 | 25 | (1 & 1 & 1) = 1 | 25 + 1*20 = 45 | |
1454 | 53 | 4:39 | 1 | 1 | 1 | -1 | 20 | 20 | (0 & 1 & 1) = 0 | 20 + 0*130 = 20 | |
1454 | 72 | 5:52 | 0 | 1 | 1 | -1 | 130 | 130 | (1 & 1 & 1) = 1 | 130 + 1*50 = 180 | |
1454 | 72 | 7:48 | 1 | 1 | 1 | -1 | 50 | 50 | (0 & 0 & 1) = 0 | 50 + 0*0 = 50 | |
5422 | 53 | 3:24 | 0 | 0 | 1 | 0 | 0 | 0 | (0 & 1 & 1) = 0 | 0 + 0*100 = 0 | |
9086 | 14 | 2:56 | 0 | 1 | 1 | -1 | 100 | 100 | (No successor) = 0 | 100 |
In iteration 1, each row is compared to the row 2 below it (i.e., `index + 2^1 == index + 2`):
Example Data In the Clear (Would be secret shared in MPC):

Match Key | Attribution Constraint | Timestamp | Helper Bit | Trigger Bit | Stop Bit | Breakdown Key | Purchase Value | Credit | Updated Stop Bit: Helper Bit[+2] & Trigger Bit[+2] & Stop Bit | Updated Credit: Credit + Credit[+2] * (Updated Stop Bit) |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
1454 | 53 | 1:07 | 0 | 0 | 0 | 2 | 0 | 0 | (1 & 1 & 0) = 0 | 0 + 0*275 = 0 | |
1454 | 53 | 1:27 | 1 | 0 | 1 | 3 | 0 | 250 | (1 & 1 & 1) = 1 | 250 + 1*45 = 295 | |
1454 | 53 | 3:46 | 1 | 1 | 1 | -1 | 250 | 275 | (1 & 1 & 1) = 1 | 275 + 1*20 = 295 | |
1454 | 53 | 4:12 | 1 | 1 | 1 | -1 | 25 | 45 | (0 & 1 & 1) = 0 | 45 + 0*180 = 45 | |
1454 | 53 | 4:39 | 1 | 1 | 0 | -1 | 20 | 20 | (1 & 1 & 0) = 0 | 20 + 0*50 = 20 |
1454 | 72 | 5:52 | 0 | 1 | 1 | -1 | 130 | 180 | (0 & 0 & 1) = 0 | 180 + 0*0 = 180 | |
1454 | 72 | 7:48 | 1 | 1 | 0 | -1 | 50 | 50 | (0 & 1 & 0) = 0 | 50 + 0*100 = 50 | |
5422 | 53 | 3:24 | 0 | 0 | 0 | 0 | 0 | 0 | (No successor) = 0 | 0 | |
9086 | 14 | 2:56 | 0 | 1 | 0 | -1 | 100 | 100 | (No successor) = 0 | 100 |
In iteration 2, each row is compared to the row 4 below it (i.e., `index + 2^2 == index + 4`):
Example Data In the Clear (Would be secret shared in MPC):

Match Key | Attribution Constraint | Timestamp | Helper Bit | Trigger Bit | Stop Bit | Breakdown Key | Purchase Value | Credit | Updated Stop Bit: Helper Bit[+4] & Trigger Bit[+4] & Stop Bit | Updated Credit: Credit + Credit[+4] * (Updated Stop Bit) |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
1454 | 53 | 1:07 | 0 | 0 | 0 | 2 | 0 | 0 | (1 & 1 & 0) = 0 | 0 + 0*20 = 0 | |
1454 | 53 | 1:27 | 1 | 0 | 1 | 3 | 0 | 295 | (0 & 1 & 1) = 0 | 295 + 0*180 = 295 | |
1454 | 53 | 3:46 | 1 | 1 | 1 | -1 | 250 | 295 | (1 & 1 & 1) = 1 | 295 + 1*50 = 345 | |
1454 | 53 | 4:12 | 1 | 1 | 0 | -1 | 25 | 45 | (0 & 0 & 0) = 0 | 45 + 0*0 = 45 | |
1454 | 53 | 4:39 | 1 | 1 | 0 | -1 | 20 | 20 | (0 & 1 & 0) = 0 | 20 + 0*100 = 20 | |
1454 | 72 | 5:52 | 0 | 1 | 0 | -1 | 130 | 180 | (No successor) = 0 | 180 | |
1454 | 72 | 7:48 | 1 | 1 | 0 | -1 | 50 | 50 | (No successor) = 0 | 50 | |
5422 | 53 | 3:24 | 0 | 0 | 0 | 0 | 0 | 0 | (No successor) = 0 | 0 | |
9086 | 14 | 2:56 | 0 | 1 | 0 | -1 | 100 | 100 | (No successor) = 0 | 100 |
In iteration 3, each row is compared to the row 8 below it (i.e., `index + 2^3 == index + 8`.) This is our final iteration, because the largest gap in our sample dataset is 8.
Example Data In the Clear (Would be secret shared in MPC):

Match Key | Attribution Constraint | Timestamp | Helper Bit | Trigger Bit | Stop Bit | Breakdown Key | Purchase Value | Credit | Updated Stop Bit: Helper Bit[+8] & Trigger Bit[+8] & Stop Bit | Updated Credit: Credit + Credit[+8] * (Updated Stop Bit) |
--- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
1454 | 53 | 1:07 | 0 | 0 | 0 | 2 | 0 | 0 | (0 & 1 & 0) = 0 | 0 + 0*100 = 0 | |
1454 | 53 | 1:27 | 1 | 0 | 0 | 3 | 0 | 295 | (No successor) = 0 | 295 | |
1454 | 53 | 3:46 | 1 | 1 | 1 | -1 | 250 | 345 | (No successor) = 0 | 345 | |
1454 | 53 | 4:12 | 1 | 1 | 0 | -1 | 25 | 45 | (No successor) = 0 | 45 | |
1454 | 53 | 4:39 | 1 | 1 | 0 | -1 | 20 | 20 | (No successor) = 0 | 20 | |
1454 | 72 | 5:52 | 0 | 1 | 0 | -1 | 130 | 180 | (No successor) = 0 | 180 | |
1454 | 72 | 7:48 | 1 | 1 | 0 | -1 | 50 | 50 | (No successor) = 0 | 50 | |
5422 | 53 | 3:24 | 0 | 0 | 0 | 0 | 0 | 0 | (No successor) = 0 | 0 | |
9086 | 14 | 2:56 | 0 | 1 | 0 | -1 | 100 | 100 | (No successor) = 0 | 100 |
Note that no credits change in this final iteration, but since the helpers can only see secret shares, they are not aware that the values are unchanged. In fact, this is the point of an oblivious algorithm; if the algorithm were to stop as soon as values stopped changing, it would leak unintended information (beyond the intended, purpose-constrained output.)
We’ve now outlined an algorithm for oblivious attribution, but in order to ensure differential privacy, we need to limit the overall contribution that an individual user (as defined by a match key) can make to the entire query. More specifically, we ensure that for each match key the sum of all attributed trigger values does not exceed some bound. When a match key’s sum of attributed trigger values does exceed this bound, we need to reduce them in some way. One approach is to reduce all contributions proportionally, e.g., if the cap is 100 but a match key contributes 300, all contributions made by that match key are reduced to a third of their value.
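An in-the-clear formulation of the proportional reduction described above might look like the following; the actual realization of capping inside the MPC is still being designed, and this function is only our sketch of the idea:

```python
def cap_contributions(values_by_match_key, cap):
    """Proportionally scale down each match key's attributed trigger values
    so that their sum does not exceed the sensitivity cap."""
    capped = {}
    for match_key, values in values_by_match_key.items():
        total = sum(values)
        if total > cap:
            scale = cap / total  # e.g., cap 100 and total 300 gives scale 1/3
            capped[match_key] = [v * scale for v in values]
        else:
            capped[match_key] = list(values)
    return capped
```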
After the final iteration, we can then sum up the credit for each breakdown key by iterating through the list. Note that we ignore the -1 value, which we used as a dummy value for the trigger events.
Example Data In the Clear (Would be secret shared in MPC):

Breakdown Key | Credit |
--- | --- |
0 | 0 |
1 | 0 |
2 | 0 |
3 | 295 |
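In the clear, this final aggregation step is a simple sum of credits per breakdown key, skipping the -1 placeholder used for trigger rows; the sketch below is illustrative only, since in the protocol the sums would be computed on secret shares:

```python
from collections import defaultdict

def aggregate_by_breakdown(rows):
    """Sum credits per breakdown key, skipping the -1 dummy used on trigger rows."""
    sums = defaultdict(int)
    for row in rows:
        if row["breakdown_key"] != -1:
            sums[row["breakdown_key"]] += row["credit"]
    return dict(sums)
```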
After computing the aggregates within each category, the helper parties P1, P2, P3 generate additive noise to ensure DP on the user level. The amount of noise added depends on the sensitivity, i.e., on the cap of how much an individual match key owner can contribute.
The noise is generated within an MPC protocol. This could be done by having each party sample noise independently and then add it to the secret shared aggregated value. This simple approach has the drawback that the privacy guarantees are weaker if an MPC party is corrupted, since the corrupted party knows its own share of the noise and can subtract it from the aggregated value. A better but more complicated approach is to run a secure coin tossing protocol between parties P1, P2, P3 where the coins are private, and then use these coins to run a noise sampling algorithm within the MPC to generate secret shares of the DP noise. This noise is then added to the secret shared aggregated value. Using the second approach, a malicious party cannot manipulate the protocol to see a noisy aggregated value with less noise. Hence, the privacy guarantees match the amount of DP noise generated and added as specified in the protocol description, even when one party is malicious.
For efficiency, it might be possible to have the user-agent and report collector server only generate two shares, saving a third of the external bandwidth (between user-agent to report collector and report collector to MPC helper parties) with a possible increase for the initial rounds of MPC interactions to convert into a 2:3 sharing.
Sorting match keys gets more expensive as the size of the match key grows (as measured by bit length.) There are different approaches that could be used to compress the match key within MPC. This would lead to more efficient sorting, which comes at the cost of a slight increase in the probability of spurious collisions between users. The amount of compression would depend on the expected cardinality of unique match keys within a query, and the tolerance of the report collector for bias introduced by spurious collisions.
To improve the performance of the sorting step, we could presort the encrypted reports according to their timestamp. Later, the helper parties could use a stable sort to sort according to the match key. This approach comes with the drawback that the timestamps need to be in the clear for either the helper parties or the report collector, which the source website/app or trigger website/app may not wish to share. This leakage seems minimal, however we need to explore this approach further and ensure that this leakage does not allow any significant attacks. Presorting by the attribution constraint ids can be handled similarly instead of concatenating them with the match key.
We can compute the aggregation for a large number of breakdowns more efficiently using sorting and prefix sum algorithms. We can use the same strategy as before by using sort and the tree-like structure to aggregate across breakdown keys. We omit the details here, but it may provide useful optimization in an implementation.
IPA is intentionally designed to be compatible with measuring activities beyond the web, such as mobile devices and connected TVs. If a match key provider is able (and willing), they could extend this even further by performing user-level linkage to other contexts (e.g., email based matching with offline merchants), then distribute encrypted match keys, enabling businesses to bring offline user activity from these other contexts into the MPC.
The impact this may have on the overall ecosystem is not obvious. On one hand, it may drive an increase in sharing of PII between parties in an effort to gain access to this new measurement capability. On the other hand, it may cause a shift from the direct sharing of people’s individual behavior towards sharing of encrypted matchkeys, where it is possible to provide much stronger differential privacy protection against leaking individual behavior.
Determining the extent to which this might require mitigation is still an open question. PATCG participants and web platform vendors might be able to provide more input that will help resolve this issue.
Capping has the disadvantage that it introduces a level of inaccuracy that cannot be quantified by an advertiser or publisher. One approach that ensures DP while providing more transparency into this inaccuracy is to release a differentially private estimate of the mean and covariance of individual users' uncapped contributions to the breakdown sums. Publishing this mean and covariance would provide greater insight into the inaccuracy added by the sensitivity capping. We could follow the approach of [BDKU2020].
We have discussed IPA with various parties, and received several suggestions for additional extensions to the proposal. One such suggestion was the idea of query transparency. The idea is that the helper party network could publicly publish various statistics about the IPA queries that each report collector issued. This would allow non-profits and privacy watchdogs to audit the types of queries being run. For example, transparency about the number of queries run, the number of rows processed per query, the cardinality of breakdown keys used, and more could bring much more transparency to the digital advertising ecosystem.
Oblivious algorithms are algorithms with data access patterns that do not depend on the input. This is important to protect the secrecy of the private inputs. If the data access patterns did depend on the input, a malicious helper party might be able to learn some information by just observing which secret shares were accessed in what order.
An example of an oblivious algorithm is the bitonic sorting algorithm [1]. Another advantage of oblivious algorithms is that in many cases they can be very efficiently computed in a secure MPC. Instead of running one large MPC protocol with a very large input, oblivious algorithms allow us to massively parallelize the algorithm, with each thread performing the same type of processing on tiny slices of the input data.
The last-touch attribution algorithm described above is oblivious. The algorithm can be parallelized into a large number of small computations that only compare two rows at a time. The pairs of rows which will be compared are always the same, regardless of the values of the input. As such, the helper party learns nothing about the input, and can easily parallelize all of these comparisons.
On some operating systems, browser components can be integrated into applications in a number of ways. Two models are considered for these web views:
- Where the embedded browser component permits the application to access (or modify) content. Examples of these include Windows WebView2, iOS WKWebView, or Android WebView.
- Where the operation of the browser component is not accessible to the application. Examples of this approach are Android Custom Tabs and SFSafariViewController.
In all cases, IPA provides the greatest measurement utility if there is a single system-wide (or user-wide) store of match keys. To facilitate this, support for storing and managing match keys might be provided at the level of the operating system. A system-wide utility will ensure that attribution can work across different apps, including browsing. Without a system-wide store, attribution is possible, though it will be limited to impressions and conversions that occur in the same app.
If the embedded browser makes content accessible to the application, then the application is effectively acting as an independent browser. An application therefore needs distinct storage from the regular browser so that one application cannot access private information hosted by another application or browser.
When IPA integrates into this sort of environment, the `getEncryptedMatchKey()` API needs to return an encryption of the match key that is specific to the combination of application and site.
The same value cannot be used across different apps or sites, or the ciphertext that is provided might link activity across contexts.
Integration of the IPA API at the system level therefore needs to segment any storage of encrypted match keys based on the identity of the application that requests it, in addition to the site.
Note that this is an imperfect model because activity on sites shown in a web view might be linkable to activity in browser or other apps. Many sites attempt to detect webviews and disable login, which can prevent the most direct forms of linkability. However, some applications use embedded browsers to manage login and authorization of access to online services. Consequently, people can become accustomed to having to log in, despite that being a poor security practice, and in doing so create strong correlation between contexts.
Aside from partitioning the storage of encrypted match keys by the embedding app, no other adjustment is being considered. Sites visited within the web view can use match keys as usual. An embedding app will be able to view or even change the encrypted match key. The app cannot learn anything directly from doing so, but this access might allow actions that could be seen to be malicious, such as:
- The embedding app can violate user privacy by providing identical or otherwise linkable encrypted match keys to different sites.
- An embedding app that captures an encrypted match key from site content might be able to submit the value in an IPA query against the wishes of the embedded site.
- The embedding app can provide sites with bogus encrypted match keys, or encrypted match keys sourced from other contexts, with a goal of spoiling attribution results.
- The embedding app might deny sites in the web view access to the IPA APIs.
- The embedding app might misrepresent the identity of the site to the operating system, obtaining encrypted match keys that are attributed to a site of its choice.
An embedding app that behaves in this way is no different than operating an untrustworthy or malicious browser. For the most part, the risks are attributable to use of malicious software. However, the misrepresentation of site identity by an app allows the site to increase privacy loss in a way that is unique to IPA. The encrypted match keys obtained in this way can be used in IPA queries across multiple sites, with a separate privacy budget for each site. Because the encrypted match keys are known to be for the same person, these queries release more information about that person than is intended.
Limiting the number of encrypted match keys is an option being explored for managing sensitivity capping in some designs for scaling the IPA MPC components. Having per-app encrypted match keys works against that goal, as encrypted match keys might come from many more apps than devices.
A system might offer control over how apps participate in the API, including per-app authorization of access to the IPA APIs. Apps that are denied access to the API can receive encryptions of a randomized match key, which might be stable (and so enable attribution within the app) or randomized (preventing all use of attribution).
Further investigation will be needed to better understand the implications of making IPA available to web views that are accessible to apps.
The use of custom tabs allows an application to use a browser, but without gaining access to the content of sites, either to view it or modify it. It is possible to treat this form of embedding as an extension of another browser, rather than being considered part of the application. Other than choosing a URL, the embedding application cannot access content, read cookies, or otherwise affect how the browser operates. This allows the browser to provide sites with the cookies that are used in the full browsing experience.
In this situation, the encrypted match key can be provided to the site as though this were part of a normal browsing session. A site could, if it is aware of being embedded in an app, share the encrypted match keys it obtains with the application. This might be possible via side channels in the embedding context, which would allow for linkability between the app context and the browsing context. For example, the embedding app might provide the site with URL parameters that convey information from the app, which might include user identity; in the other direction, the site might use resource loading, script execution, or other side channels to pass signals to the app.
We would like to thank Google for their Attribution Reporting API with Aggregatable Reports proposal. This was of great inspiration to us, and led to lots of ideation about how a server-side aggregation service could be used to provide private, aggregate reports.
We would also like to thank Apple for their Private Click Measurement proposal, first proposed in May 2019. The introduction of this proposal was a watershed moment, when the discussion pivoted to the approach of adding new, purpose-limited APIs to the web platform, to narrowly support various important ads use-cases.
We would also like to thank our academic collaborators for their feedback on early versions of the IPA proposal. Their input has been invaluable in evolving this proposal to be both more private, and more useful.
Finally, many thanks to all of the researchers who have published papers about protocols applicable to a helper party network with the same target threat model (i.e., 3 non-colluding parties, where at most one party can act maliciously). Their work on protocols like modulus conversion, sorting, and other algorithms is what has made this proposal possible. It’s an honor to be building on the shoulders of giants.