This is the landing page for the code and data release accompanying Scalable Influence and Fact Tracing for Large Language Model Pretraining (Chang et al., 2024).
Specifically, this includes:
- Data files in JSON Lines (`.jsonl`) format for:
  - The set of 5.4k prompts (queries) used for fact tracing evaluation, as well as the full set of 1.2M queries these are sampled from.
  - TDA method outputs (retrieved and scored proponents) corresponding to the experiments in Section 5 and Section 6 of the paper.
  - TDA method outputs corresponding to additional evaluation tasks in Appendix A.5 of the paper.
  - The corpus of 19.6M sentences from T-REx Wikipedia abstracts (Sections 4.2 and 5 of the paper).
- A data viewer app to make it easier to look at and analyze sets of retrieved proponents.
Lists of proponent passages are challenging to work with in spreadsheets or plain text files, due to the amount of text on-screen and the difficulty of quickly looking at scores, identifying string matches, or filtering to specific types of examples. We found it useful to write custom HTML visualizations, and packaged these into a simple viewer app:
https://pair-code.github.io/pretraining-tda/demo
You can load a `.jsonl` file of TDA results (a set of test examples and their retrieved proponents from the training set) from a URL or by uploading from your computer; see below for links to load the experiments from the paper. The app runs entirely in-browser and doesn't send your data to any server. For more information, see the user guide and app documentation.
- The set of 5.4k triples and associated prompts (factual queries) used in the experiments in the paper: https://storage.googleapis.com/tda-resources/2410.17413/public/trex_facts_sample.jsonl
- The full set of 1.2M triples which these are sampled from: https://storage.googleapis.com/tda-resources/2410.17413/public/trex_facts.jsonl (1GB file). Note that the set of 5.4k is not sampled uniformly from this; see Section 4.2 of the paper for more details.
Each record has the following fields:

- `fact_id`
- `kilt_id`
- `entity0`, `relation`, and `entity1`
- `entity0_uri`, `predicate_uri`, and `entity1_uri`
- `entity0_alias` and `entity1_alias` - alternative surface forms
- `trex_sentences` - mapping to the T-REx sentences, below
- `c4_frequency` - annotation, based on string matching, of how frequently this fact appears in the C4 pretraining corpus
- `is_repetition` - whether the fact contains repetition between `entity0` and `entity1`
- `prompt0`, `prompt1`, `prompt2` - input prompts for this fact, generated using different templates. Unless otherwise specified, we use `prompt0` for experiments in the paper.
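Records in these query files can be streamed with Python's standard `json` module, one object per line. A minimal sketch; the inline record below is an invented illustration with the documented fields, not real data:

```python
import json

# One JSON object per line (JSON Lines). This record is a made-up
# illustration using the documented field names, not actual dataset content.
line = json.dumps({
    "fact_id": "f0",
    "kilt_id": "k0",
    "entity0": "Paris", "relation": "capital of", "entity1": "France",
    "prompt0": "Paris is the capital of",
    "c4_frequency": 12,
})

record = json.loads(line)
# prompt0 is the default prompt used for experiments in the paper.
print(record["prompt0"], "->", record["entity1"])
```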
The results files as used in the main paper are linked in the tables below. Each record has the following fields:

- `example_id`
- `query_set`
- `inputs_plaintext` - the prompt (query) string; for T-REx facts, this is `prompt0` from the query files above
- `targets_plaintext` - the target string, generally `entity1` from the query files above
- `proponents` (as `string[]`) - proponent passage text
- `proponent_ids` (as `string[]`) - passage IDs (for T-REx or C4)
- `proponent_scores` (as `number[]`) - passage scores from the TDA method (e.g. Equation (1) of the paper)
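Since `proponents`, `proponent_ids`, and `proponent_scores` are parallel arrays, it is often convenient to zip them together and rank by score. A minimal sketch; the record below uses invented placeholder values:

```python
# Hypothetical results record with parallel proponent arrays, as documented above.
record = {
    "inputs_plaintext": "Paris is the capital of",
    "targets_plaintext": "France",
    "proponents": [
        "France is a country in Western Europe.",
        "Paris is the capital and largest city of France.",
    ],
    "proponent_ids": ["trex:002", "trex:001"],
    "proponent_scores": [0.47, 0.91],
}

# Zip the parallel arrays and sort by score, highest first.
ranked = sorted(
    zip(record["proponent_ids"], record["proponent_scores"], record["proponents"]),
    key=lambda t: t[1],
    reverse=True,
)
for pid, score, text in ranked:
    print(f"{score:.2f}  {pid}  {text[:60]}")
```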
For TDA methods that support a notion of "opponents" (this includes most gradient-based methods, but not the BM25 or Gecko baselines), we also include fields analogous to the proponents:

- `opponents` (as `string[]`)
- `opponent_ids` (as `string[]`)
- `opponent_scores` (as `number[]`)
And for T-REx records in Tables 1 and 2 (some fields marked optional):

- `fact_id`
- `relation`
- `8b_generations` (as `string[]`) - decoder samples from the 8B model, for estimating confidence in the LLM's answer
- `8b_confidence` (as `number`) - fraction of samples from the 8B model that match the target entity or an alias
- `c4_frequency` and `c4_frequency_bucket` - frequency of the fact in the C4 corpus, based on string matching. The bucket groups this into 0, 1, 2, 3, 4, 5, with 5 containing the most common facts.
- `has_trex_sentence` - for retrievals from T-REx sentences, whether there exists any sentence in T-REx containing this fact (optional, only Table 1)
- `proponent_correct` (as `boolean[]`) - for retrievals from T-REx sentences, whether each proponent contains the fact, according to the T-REx annotations (optional, only Table 1)
- `proponent_ais_scores` (as `number[]`) - for retrievals from C4, scores from the AIS (entailment) model for each proponent (optional, only Table 2)
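For Table 1 files, the `proponent_correct` booleans make it straightforward to compute retrieval metrics such as recall@k (whether any of the top-k proponents contains the fact). A sketch, under the assumption that each record's list is already ordered by descending proponent score, with invented toy records:

```python
def recall_at_k(records, k=10):
    """Fraction of queries where any of the top-k proponents contains the fact.

    Assumes each record's `proponent_correct` list is already ordered by
    descending proponent score.
    """
    hits = sum(any(r["proponent_correct"][:k]) for r in records)
    return hits / len(records)

# Invented toy records for illustration:
records = [
    {"proponent_correct": [False, True, False]},
    {"proponent_correct": [False, False, False]},
]
print(recall_at_k(records, k=2))  # -> 0.5
```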
For all tasks outside of T-REx, we retrieve proponents using TrackStar with the non-task-specific Hessian approximation (see Appendix A.5 in the paper). The additional tasks have the following optional fields:

- `is_8b_correct` - for T-REx and arithmetic tasks, whether the 8B model generation matches the ground truth; for PIQA and COPA, whether the 8B model assigns higher probability to the ground truth than to the alternative completion; for story generation, this field is not included (no "ground truth" to compare to).
- `groundtruth` - for T-REx incorrect predictions, the ground-truth target (entity); otherwise, this field is omitted and `targets_plaintext` is equal to the ground-truth answer.
Table 1: T-REx facts, retrievals from T-REx sentences
Method | Download .jsonl file | Viewer link |
---|---|---|
BM25 | trex_retrievals_bm25.jsonl | view in app |
Gecko | trex_retrievals_gecko.jsonl | view in app |
TRAK | trex_retrievals_trak.jsonl | view in app |
Exp 1 | trex_retrievals_exp1.jsonl | view in app |
Exp 2 | trex_retrievals_exp2.jsonl | view in app |
Exp 3 | trex_retrievals_exp3.jsonl | view in app |
Exp 4 | trex_retrievals_exp4.jsonl | view in app |
Exp 5 | trex_retrievals_exp5.jsonl | view in app |
TrackStar | trex_retrievals_trackstar.jsonl | view in app |
Table 2: T-REx facts, retrievals from C4
Method | Download .jsonl file | Viewer link |
---|---|---|
BM25 | c4_trex_retrievals_bm25.jsonl | view in app |
Gecko | c4_trex_retrievals_gecko.jsonl | view in app |
Gradient dot product | c4_trex_retrievals_grad_dot.jsonl | view in app |
Gradient cosine | c4_trex_retrievals_grad_cosine.jsonl | view in app |
TrackStar | c4_trex_retrievals_trackstar.jsonl | view in app |
Appendix A.5: Additional tasks, retrievals from C4
Task | Download .jsonl file | Viewer link |
---|---|---|
T-REx incorrect predictions | c4_trex_incorrectpred_retrievals_trackstar.jsonl | view in app |
COPA | c4_copa_retrievals_trackstar_nontaskspecific.jsonl | view in app |
PIQA | c4_piqa_retrievals_trackstar_nontaskspecific.jsonl | view in app |
Arithmetic word problems | c4_arithmeticwordproblem_retrievals_trackstar_nontaskspecific.jsonl | view in app |
Simple arithmetic | c4_arithmetic_retrievals_trackstar_nontaskspecific.jsonl | view in app |
Story generation | c4_storygeneration_retrievals_trackstar_nontaskspecific.jsonl | view in app |
This is the corpus of 19.6M sentences as described in Sections 4.2 and 5, and used for the experiments in Section 5 of the paper. The data is approximately 6GB, split across 20 shards:

https://storage.googleapis.com/tda-resources/2410.17413/public/trex_sentences.jsonl-000[XY]-of-00020
Or fetch all using gsutil:
gsutil -m cp 'gs://tda-resources/2410.17413/public/trex_sentences.jsonl-*' /path/to/local/dir
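Once downloaded, the shards can be streamed one record at a time with the standard library. A minimal sketch, assuming the files sit in a local directory:

```python
import glob
import json

def iter_sentences(local_dir):
    """Yield one sentence record at a time across all downloaded shards."""
    for path in sorted(glob.glob(f"{local_dir}/trex_sentences.jsonl-*")):
        with open(path) as f:
            for line in f:
                yield json.loads(line)

# Example usage (path is a placeholder):
# for rec in iter_sentences("/path/to/local/dir"):
#     print(rec["sentence_id"], rec["text"][:60])
```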
Each record has the following fields:

- `sentence_id`
- `text`
- `abstract_uri`
- `sent_idx_in_abst` - index of this sentence in the original abstract
- `fact_triples` - relevant fact triples which are found in this sentence
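One way to use `fact_triples` is to build an inverted index from each triple back to the sentences that express it. A sketch only: the records below are invented, and the exact structure of `fact_triples` (assumed here to be a flat list) should be checked against the released files:

```python
from collections import defaultdict

# Invented toy records; verify the real shape of `fact_triples` in the data.
records = [
    {"sentence_id": "s0", "text": "Paris is the capital of France.",
     "fact_triples": ["Paris|capital_of|France"]},
    {"sentence_id": "s1", "text": "France is in Western Europe.",
     "fact_triples": ["France|part_of|Europe"]},
]

# Inverted index: fact triple -> IDs of sentences expressing it.
index = defaultdict(list)
for rec in records:
    for triple in rec["fact_triples"]:
        index[triple].append(rec["sentence_id"])

print(dict(index))
```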
If you use this data or find the data viewer useful, please cite our paper at:
@article{chang2024scalable,
title={Scalable Influence and Fact Tracing for Large Language Model Pretraining},
author={Chang, Tyler A. and Rajagopal, Dheeraj and Bolukbasi, Tolga and Dixon, Lucas and Tenney, Ian},
journal={arXiv preprint arXiv:2410.17413},
year={2024}
}
Copyright 2024 DeepMind Technologies Limited. All software is licensed under the Apache License, Version 2.0 (Apache 2.0); you may not use this file except in compliance with the Apache 2.0 license. You may obtain a copy of the Apache 2.0 license at: https://www.apache.org/licenses/LICENSE-2.0
All other materials, except as set out below, are licensed under the Creative Commons Attribution 4.0 International License (CC-BY). You may obtain a copy of the CC-BY license at: https://creativecommons.org/licenses/by/4.0/legalcode. This dataset contains passages from:
- C4, which are made available under the ODC Attribution License. You may obtain a copy of the ODC Attribution License at: https://opendatacommons.org/licenses/by/1-0/.
- T-REx, which are made available under the CC-BY-SA-4.0 License. Page IDs for the corresponding Wikipedia articles are included in the data files. You may obtain a copy of the CC-BY-SA-4.0 License at: https://creativecommons.org/licenses/by-sa/4.0/deed.en
- COPA, Copyright (c) 2010, University of Southern California, which are made available under the BSD 2-Clause License. You may obtain a copy of the BSD 2-Clause License at: https://opensource.org/license/bsd-2-clause.
- PIQA, which are made available under the Academic Free License v3.0. You may obtain a copy of the Academic Free License v3.0 at: https://opensource.org/license/afl-3-0-php.
- CogComp, Copyright (c) 2022 CogComp, which are made available under the MIT License. You may obtain a copy of the MIT License at: https://opensource.org/license/mit.
Unless required by applicable law or agreed to in writing, all software and materials distributed here under the Apache 2.0 or CC-BY licenses are distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the licenses for the specific language governing permissions and limitations under those licenses.
This is not an official Google product.