Inquiry about Web Pipeline Availability #151

Open
codefly13 opened this issue Apr 22, 2024 · 1 comment

Comments

@codefly13

I hope you are doing well. I came across a reference to the "Web Pipeline" in the paper "Dolma: an Open Corpus of Three Trillion Tokens for Language Model Pretraining Research" and I am very interested in exploring it further. However, it seems that the pipeline is still in preparation. I would like to kindly inquire about the availability of the "Web Pipeline". Is there any information on when it might be released for public use?

@codefly13 codefly13 changed the title Inquiry about CommonCrawl WARC Pipeline Availability Inquiry about Web Pipeline Availability Apr 22, 2024
@dumitrac

Hi @codefly13 - all of it is already available in the dolma toolkit (i.e. this repo).
Please let me know if you're looking for something different.
