Web page dataset (flamingo-like) #27
Comments
A recent question came up about the best format. The suggestion there is text + 5 images + a sequence of local ids (token ids interleaved with image ids).
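For illustration only, a sketch of what one record in that suggested format might look like (the field names and the negative-id convention for images are made up here, not part of the suggestion):

```python
# Hypothetical per-page record: tokenized text, at most 5 images, and one
# interleaved sequence of local ids. Image ids are negative so they cannot
# collide with (non-negative) token ids.
page_record = {
    "text_tokens": [1012, 744, 90, 311, 67, 2953],        # tokenized page text
    "images": ["0.jpg", "1.jpg"],                          # up to 5 images per page
    "sequence": [1012, 744, -1, 90, 311, -2, 67, 2953],    # -1 -> images[0], -2 -> images[1]
}
```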
I think I would recommend tar inside tar; it's the most straightforward. In practice it would look like the sketch below.
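A minimal sketch of the tar-inside-tar idea, assuming a WebDataset-style shard where the outer tar holds one inner tar per page and each inner tar holds the page text plus its images (the file names and the `make_page_tar` helper are hypothetical):

```python
import io
import tarfile

def add_bytes(tar: tarfile.TarFile, name: str, data: bytes) -> None:
    """Add an in-memory byte string to a tar archive under the given name."""
    info = tarfile.TarInfo(name=name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

def make_page_tar(text: str, images: dict[str, bytes]) -> bytes:
    """Pack one web page (text + its images) into an inner tar, returned as bytes."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        add_bytes(tar, "page.txt", text.encode("utf-8"))
        for name, img_bytes in images.items():
            add_bytes(tar, name, img_bytes)
    return buf.getvalue()

# Outer tar: one inner tar per page, so a shard can be streamed page by page.
with tarfile.open("shard-00000.tar", "w") as outer:
    page = make_page_tar("Some page text ...", {"0.jpg": b"<jpeg bytes>"})
    add_bytes(outer, "page-000000.tar", page)
```

A consumer would stream the outer shard, pull out one inner tar per page into memory, and read that page's text and images from it.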
So I'm very interested in this and have a dataset in mind. I'll try to figure out how to scrape it; I like the structure that rom suggests above. I wouldn't mind some help with the architectural implementation if someone is interested in this.
@christophschuhmann made a video on it: https://youtu.be/NAspBQmxK4U
Next step: write a small doc.
I think we should discuss this more thoroughly. In the video, the suggestion is to use CLIP to decide which text fits which image in a sequence of text and images, applying CLIP only to neighboring text and images. I think this is problematic, because CLIP is trained only on image-text pairs and does not respect sequence structures with more than two elements, ignoring higher-order dependencies (for example, among arbitrary triplets of sequence elements).
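To make the concern concrete, here is a rough sketch of the pairwise neighbour scoring suggested in the video, assuming the open_clip library and one of the LAION pretrained checkpoints; note that every decision is made from a single (text, image) pair, so a (text, image, text) triplet is never scored jointly:

```python
import torch
import open_clip
from PIL import Image

# Assumed checkpoint; any open_clip model/pretrained tag would work the same way.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")

def pair_score(text: str, image: Image.Image) -> float:
    """Cosine similarity between one text snippet and one image under CLIP."""
    with torch.no_grad():
        t = model.encode_text(tokenizer([text]))
        i = model.encode_image(preprocess(image).unsqueeze(0))
        t = t / t.norm(dim=-1, keepdim=True)
        i = i / i.norm(dim=-1, keepdim=True)
        return (t @ i.T).item()

def assign_text_to_image(image: Image.Image, before: str, after: str) -> str:
    """Pick whichever neighbouring snippet (before/after the image) CLIP scores higher.
    This only ever compares pairs; it cannot model higher-order sequence structure."""
    return before if pair_score(before, image) >= pair_score(after, image) else after
```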
Here is my proposal for the outputs of the Common Crawl preprocessing.
I would suggest only considering text of up to 256 tokens/words before and after each image, audio, or video file.

Independent of how we eventually filter the interleaved dataset, we could additionally use the CLIP filtering approach we applied in LAION (maybe with a more powerful model like CLIP ViT-H or ViT-L/14 this time) to create an even bigger CLIP-filtered image-text pair dataset. Once we have decent CLAP / video-CLIP models, we can also apply the same filtering to the surrounding texts to get audio-text and video-text pair datasets analogous to LAION-5B.

In my opinion it would also make sense to compute the CLIP embeddings of all images, deduplicate them by those embeddings, and derive aesthetics scores from the embeddings -> get many, many more images with high aesthetics scores for training generative models.
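A small sketch of the 256-token window idea, assuming the page has already been parsed into an interleaved list of text and media elements (the `Element` representation here is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Element:
    kind: str   # "text" or "image" (could also be "audio"/"video")
    value: str  # text content, or an image URL / file name

def windows_around_images(elements: list[Element], max_tokens: int = 256) -> list[dict]:
    """For each image, keep at most `max_tokens` words of text before and after it."""
    out = []
    for idx, el in enumerate(elements):
        if el.kind != "image":
            continue
        before = " ".join(e.value for e in elements[:idx] if e.kind == "text").split()
        after = " ".join(e.value for e in elements[idx + 1:] if e.kind == "text").split()
        out.append({
            "image": el.value,
            "text_before": " ".join(before[-max_tokens:]),
            "text_after": " ".join(after[:max_tokens]),
        })
    return out
```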
Here is a first proof of concept for filtering natural-language text from Common Crawl WARC files: https://colab.research.google.com/drive/1d10Stm4J2IIPcbjHF4HzwkBQJsXGJMhi#scrollTo=1NxragaBqgHZ&uniqifier=3
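In the same spirit, a sketch of how WARC responses could be reduced to plain text, assuming the warcio and BeautifulSoup libraries; the language heuristic at the end is a crude placeholder, not the filter used in the notebook:

```python
from warcio.archiveiterator import ArchiveIterator
from bs4 import BeautifulSoup

def looks_like_natural_language(text: str, min_words: int = 50) -> bool:
    """Crude placeholder heuristic: enough words and mostly alphabetic characters."""
    words = text.split()
    if len(words) < min_words:
        return False
    alpha = sum(c.isalpha() or c.isspace() for c in text)
    return alpha / max(len(text), 1) > 0.8

def extract_pages(warc_path: str):
    """Yield (url, plain_text) for HTML responses in a WARC / WARC.GZ file."""
    with open(warc_path, "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            ctype = record.http_headers.get_header("Content-Type") or ""
            if "text/html" not in ctype:
                continue
            html = record.content_stream().read()
            text = BeautifulSoup(html, "html.parser").get_text(separator=" ", strip=True)
            if looks_like_natural_language(text):
                yield record.rec_headers.get_header("WARC-Target-URI"), text
```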
https://arxiv.org/abs/2204.14198
43M pages, yielding 183M images and 182GB of text
Max 5 images (they limit to that) per page.
Sequences of text and images that are broadly in the same context
I think we would need some filtering beyond "dump the whole page"
Quite likely some images are very unrelated to the rest of the text, which isn't useful
I guess we could use CLIP to tell us what to keep (rough sketch below)