
Split training images into training and test sets #16

Open
gwaybio opened this issue Nov 18, 2019 · 8 comments
Labels
good first issue Good for newcomers

Comments

@gwaybio (Contributor) commented Nov 18, 2019

The EBImage feature data has already been split into training and test sets. This additional split of the training set lets us evaluate model performance on a test set without touching the final validation set.

In this issue, the goal is to find a way to split the training image data into training and testing sets using exactly the same splits as the EBImage data.

EBImage data: https://github.com/cytodata/single-cell-classifier/tree/master/2.process-data/data
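One way to reuse an existing split is to partition the raw images by the sample identifiers that already appear in the EBImage train/test files. The sketch below is illustrative only; the function name and the idea of matching on shared identifiers are assumptions, not the repository's actual layout:

```python
# Hypothetical sketch: partition raw image ids using a pre-existing
# train/test split (e.g. ids taken from the EBImage feature files).
# Names here are made up for illustration.

def split_images(image_ids, train_ids, test_ids):
    """Assign each image id to train or test based on an existing split.

    Ids found in neither split are collected under "unmatched" so they can
    be inspected rather than silently dropped.
    """
    train_ids, test_ids = set(train_ids), set(test_ids)
    split = {"train": [], "test": [], "unmatched": []}
    for img_id in image_ids:
        if img_id in train_ids:
            split["train"].append(img_id)
        elif img_id in test_ids:
            split["test"].append(img_id)
        else:
            split["unmatched"].append(img_id)
    return split
```

For example, `split_images(["a", "b", "c"], train_ids=["a"], test_ids=["b"])` puts `"a"` in train, `"b"` in test, and flags `"c"` as unmatched.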

@wagenrace (Collaborator) commented:

Does that mean we are going to split the data into four parts? The training and testing sets of 2.process_data, plus the test data @fheigwer has hosted, and now another one?

@gwaybio (Contributor, author) commented Nov 19, 2019

Three parts:

  1. Training
  2. Testing
  3. Validation

We have enough data to do this. We need to apply the same split to the corresponding samples in the raw image data.

@wagenrace (Collaborator) commented:

Where does the data from http://193.196.20.37/scoringapp/ fit into that? The data we download is already only a part of the full dataset.

@gwaybio (Contributor, author) commented Nov 19, 2019

Good point - not really a "true" validation set, then. Do we know of similar data collected in a different experiment? If so, we don't need to split the training set.

@wagenrace (Collaborator) commented:

I would use the test set that @fheigwer keeps secret as the real test set. We only use it at the very end, to find out what works.

The test set of 2.process_data can be used to compare different models to each other (so it serves as the validation set). It has the downside of leaked features: the data was split per cell rather than per well, so one well can contribute both training and validation data.

If a model needs to be trained, we use the training set.

I don't think the leaked-features problem can be solved; looking at the number of wells, we have too few to make a good split.

We also have leaked features because we are trying to reproduce the hackathon results: only models that worked well on the test data we are using here were carried over. But I will ignore that for now.

My proposal:

  • Train with the training data
  • Compare models using the validation set (the test set of 2.process_data)
  • After the best model is found, retrain it on all the data
  • Test it with @fheigwer's secret data (we do this only once, at the end)

@gwaybio (Contributor, author) commented Nov 20, 2019

I agree with this plan 👍

@wagenrace (Collaborator) commented:

@gwaygenomics can I close this issue?

@gwaybio (Contributor, author) commented Nov 22, 2019

The task in this issue is not complete. The issue is to split the images into the same training and test set splits as the EBImage features. Does this make sense? It is somewhat separate from the plan you outlined above.

@wagenrace mentioned this issue Nov 29, 2019 (3 tasks)