Dataset recipes #153

Open
27 of 69 tasks
lorenzoh opened this issue Aug 10, 2021 · 14 comments
Labels
good first issue Good for newcomers help wanted Contributions welcome!

Comments

@lorenzoh
Member

lorenzoh commented Aug 10, 2021

With #151, FastAI.jl is getting high-level interfaces for searching datasets (finddatasets) and loading datasets into task-specific data containers (loaddataset). There is also a new DatasetRecipe that encapsulates configuration for loading a data container and the block information from a path. These recipes can be registered with a dataset so that they can be found using the above high-level functions.

The fastai dataset collection comes with quite a lot of datasets, and only a few have recipes so far. This issue tracks the progress on adding recipes to all the datasets. Contributions of recipe types and recipe configs for datasets are welcome.

See src/datasets/recipes.jl for example recipe implementations and src/datasets/fastairegistry for how recipes are registered. listdatasources() gives you a list of all dataset sources and datasetpath(name) downloads them and returns the download folder.

Progress

For datasets that can be used for multiple tasks, the tasks are listed below the dataset. Otherwise, a checked dataset means that at least one recipe is already implemented.

  • CUB_200_2011
  • bedroom (not sure how the folders are laid out)
  • caltech_101
  • cifar10
  • cifar100
  • food-101
  • imagenette-160
  • imagenette-320
  • imagenette
  • imagenette2-160
  • imagenette2-320
  • imagenette2
  • imagewang-160
  • imagewang-320
  • imagewang
  • imagewoof-160
  • imagewoof-320
  • imagewoof
  • imagewoof2-160
  • imagewoof2-320
  • imagewoof2
  • mnist_png
  • mnist_var_size_tiny
  • oxford-102-flowers
  • oxford-iiit-pet
  • stanford-cars
  • ag_news_csv
  • amazon_review_full_csv
  • amazon_review_polarity_csv
  • dbpedia_csv
  • giga-fren
  • imdb
  • sogou_news_csv
  • wikitext-103
  • wikitext-2
  • yahoo_answers_csv
  • yelp_review_full_csv
  • yelp_review_polarity_csv
  • biwi_head_pose
  • camvid
  • pascal-voc
  • pascal_2007
    • multi-label image classification ((Image{2}, LabelMulti))
    • object detection
  • pascal_2012
  • siim_small
  • skin-lesion
  • tcga-small
  • adult_sample
  • biwi_sample
  • camvid_tiny
  • dogscats
  • human_numbers
  • imdb_sample
  • mnist_sample
  • mnist_tiny
  • movie_lens_sample
  • planet_sample
  • planet_tiny
  • coco_sample
  • coco-train2017
  • coco-val2017
  • coco-test2017
  • coco-unlabeled2017
  • coco-image_info_test2017
  • coco-image_info_unlabeled2017
  • coco-annotations_trainval2017
  • coco-stuff_annotations_trainval2017
  • coco-panoptic_annotations_trainval2017
@lorenzoh lorenzoh added good first issue Good for newcomers help wanted Contributions welcome! labels Aug 10, 2021
@ToucheSir
Member

ToucheSir commented Aug 10, 2021

Do you think we can pull some or all of this into MLDatasets.jl? Obviously some parts like the block API won't be applicable, but it would be nice to expose the registry functionality, for example.

Edit: ref. JuliaML/MLDatasets.jl#73 as well.

@darsnack
Member

It might be worth also looking at DataSets.jl announced at JuliaCon.

@lorenzoh
Member Author

Do you think we can pull some or all of this into MLDatasets.jl? Obviously some parts like the block API won't be applicable, but it would be nice to expose the registry functionality, for example.

At some point, all the dataset functionality should be merged down to MLDatasets.jl and MLDataPattern.jl.

The registry itself is pretty barebones; if you take away the functionality related to blocks, then you could replace it with a Dict{String, Vector{DatasetRecipe}} that maps each dataset name to a list of recipes.
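
A minimal sketch of what such a dictionary-backed registry could look like, using only Base Julia. The `DatasetRecipe` type here is a stand-in declared locally, and `DummyRecipe`, `RECIPES`, `registerrecipe!` and `findrecipes` are hypothetical names chosen for illustration; none of them are taken from FastAI.jl's actual implementation.

```julia
# Local stand-in for the recipe supertype; only the name comes from the thread.
abstract type DatasetRecipe end

# Hypothetical recipe carrying just a task description (an assumption).
struct DummyRecipe <: DatasetRecipe
    description::String
end

# The registry: dataset name => recipes registered for that dataset.
const RECIPES = Dict{String, Vector{DatasetRecipe}}()

# Register a recipe, creating the dataset's entry on first use.
function registerrecipe!(registry, name::AbstractString, recipe::DatasetRecipe)
    push!(get!(() -> DatasetRecipe[], registry, String(name)), recipe)
    return registry
end

# Look up all recipes for a dataset; unknown names give an empty list.
findrecipes(registry, name::AbstractString) =
    get(registry, String(name), DatasetRecipe[])

registerrecipe!(RECIPES, "imagenette2-160", DummyRecipe("image classification"))
registerrecipe!(RECIPES, "pascal_2007", DummyRecipe("multi-label image classification"))
registerrecipe!(RECIPES, "pascal_2007", DummyRecipe("object detection"))
```

With this layout a dataset that supports several tasks, such as pascal_2007 in the list above, simply holds several entries in its vector.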

@lorenzoh
Member Author

lorenzoh commented Aug 11, 2021

At some point we'll have to think about iterable datasets, and then some rearchitecting around DataSets.jl could be useful. It should also not be too hard to add iterable support to DataLoaders.jl.

For now I want to provide a useful core of offline datasets here in FastAI.jl with this simple approach. Rearchitecting should probably flow into the efforts in MLDatasets.jl (or perhaps a DLDatasets.jl if everything will be deprecated anyway?). I'll give a larger reply in JuliaML/MLDatasets.jl#73 later.

In any case, any recipe logic associated with the fastai datasets here should be easily relocatable later. 👍

@lorenzoh
Member Author

Some are being added in #163

@Chandu-4444
Contributor

Hey, I'd like to work on this issue. Since it is labeled good first issue, I believe I can help. Could you please tell me what still has to be done? I see the list above hasn't been updated.

@lorenzoh
Member Author

Hey! The list above is up to date. The easiest thing to get started with should be adding recipes for the csv datasets and registering some TableDatasetRecipes.

@Chandu-4444
Contributor

Next I want to add recipes for dbpedia_csv and ag_news_csv. Both are in CSV format, but the labels are stored in a separate file and the CSV files refer to them by index. In that case, I think it would be better to replace the label indices with the actual labels in the recipe code itself and then wrap the result with TableClassificationRecipe. Any ideas on how to do this?

@lorenzoh
Member Author

Might need a new recipe type that wraps TableRecipe, but can't say without looking at the folder structure

@Chandu-4444
Contributor

fastai-dbpedia_csv/
└── dbpedia_csv
     ├── classes.txt
     ├── readme.txt
     ├── test.csv
     └── train.csv

This is the folder structure for both datasets (dbpedia_csv, ag_news_csv).

@Chandu-4444
Contributor

Might need a new recipe type that wraps TableRecipe, but can't say without looking at the folder structure

Is it necessary to make a new recipe for datasets that have folder structures similar to the one above? Or is it possible to tweak the existing ones to get the job done?

@lorenzoh
Member Author

lorenzoh commented Mar 1, 2022

I think in this case it may be possible to create a new recipe that wraps TableRecipe (which loads the table) and then reads in the labels and converts label indices to label strings. I don't have the bandwidth to look into this in more detail currently, though.
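
A rough sketch of the label conversion step only, in Base Julia. It assumes that classes.txt holds one class name per line and that the CSV files use 1-based indices into that list; neither is confirmed in this thread. `readclasses` and `indextolabel` are hypothetical helper names, the class names are placeholders, and the wrapping around TableRecipe is left out because its interface is not shown here.

```julia
# Read class names, one per line, skipping blank lines.
readclasses(io::IO) =
    [String(strip(line)) for line in eachline(io) if !isempty(strip(line))]

# Replace 1-based label indices with the corresponding class names.
function indextolabel(indices::AbstractVector{<:Integer},
                      classes::AbstractVector{<:AbstractString})
    all(i -> 1 <= i <= length(classes), indices) ||
        throw(ArgumentError("label index out of range"))
    return [classes[i] for i in indices]
end

# Placeholder data standing in for classes.txt and a label column
# (hypothetical values, not the real dataset contents).
classes = readclasses(IOBuffer("class_a\nclass_b\n\nclass_c\n"))
labels = indextolabel([3, 1, 2, 3], classes)
```

In a real recipe, `readclasses` would be called on an opened classes.txt and `indextolabel` on the label column of the loaded table.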

@Chandu-4444
Contributor

I think in this case it may be possible to create a new recipe that wraps TableRecipe (which loads the table) and then reads in the labels and converts label indices to label strings.

I'll work on this.

@arcAman07

After the community meeting, I explored FastAI, MLUtils and a couple of other libraries and tried to understand the codebase. I would love to get started with adding a dataset. Could you please specify which of the datasets above would be a good one to start with? Also, I believe the list above isn't updated.


5 participants