
Split training images into training and test sets #16

Open
gwaybio opened this issue Nov 18, 2019 · 8 comments
Labels
good first issue Good for newcomers

Comments

@gwaybio (Contributor) commented Nov 18, 2019

The EBImage feature data has already been split into training and test sets. This additional split of the training set lets us evaluate model performance on a test set without touching the final validation set.

In this issue, the goal is to find a way to split the training image data into training and testing sets using exactly the same splits as the EBImage data.

EBImage data: https://github.com/cytodata/single-cell-classifier/tree/master/2.process-data/data
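One way to reuse an existing split is to partition the raw images by the sample identifiers that already appear in the EBImage train/test files. The sketch below is illustrative only; the function name and the idea of matching on shared identifiers are assumptions, not the repository's actual layout:

```python
# Hypothetical sketch: partition raw image ids using a pre-existing
# train/test split (e.g. ids taken from the EBImage feature files).
# Names here are made up for illustration.

def split_images(image_ids, train_ids, test_ids):
    """Assign each image id to train or test based on an existing split.

    Ids found in neither split are collected under "unmatched" so they can
    be inspected rather than silently dropped.
    """
    train_ids, test_ids = set(train_ids), set(test_ids)
    split = {"train": [], "test": [], "unmatched": []}
    for img_id in image_ids:
        if img_id in train_ids:
            split["train"].append(img_id)
        elif img_id in test_ids:
            split["test"].append(img_id)
        else:
            split["unmatched"].append(img_id)
    return split
```

For example, `split_images(["a", "b", "c"], train_ids=["a"], test_ids=["b"])` puts `"a"` in train, `"b"` in test, and flags `"c"` as unmatched.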

@wagenrace (Collaborator) commented:

Does that mean we are going to split the data into four parts? The training and testing sets of 2.process_data, plus the test data @fheigwer has hosted, and now another one?

@gwaybio (Contributor, author) commented Nov 19, 2019

Three parts:

  1. Training
  2. Testing
  3. Validation

We have enough data to do this. We need to apply the same split to the corresponding samples in the raw image data.

@wagenrace (Collaborator) commented:

Where does the data from http://193.196.20.37/scoringapp/ fit into that? The data we download is already only a part of the full dataset.

@gwaybio (Contributor, author) commented Nov 19, 2019

Good point - not really a "true" validation set, then. Do we know of similar data collected in a different experiment? If so, we don't need to split the training set.

@wagenrace (Collaborator) commented:

I would use the test set that @fheigwer keeps secret as the real test set. We only use it at the very end, to find out what works.

The test set of 2.process_data can be used to compare different models to each other (so it serves as the validation set). It has the downside of leaked features: the data was split per cell rather than per well, so one well can contribute both training and validation data.

If a model needs to be trained, we use the training set.

I don't think the leaked-features problem can be solved; looking at the number of wells, we have too few to make a good split.

We also have leaked features because we are trying to reproduce the hackathon results: only models that worked well on the test data we are using here were carried over. But I will ignore that for now.

My proposal:

  • Train with the training data
  • Compare models using the validation set (the test set of 2.process_data)
  • After the best model is found, retrain it on all the data
  • Test it with @fheigwer's secret data (we do this only once, at the end)

@gwaybio (Contributor, author) commented Nov 20, 2019

I agree with this plan 👍

@wagenrace (Collaborator) commented:

@gwaygenomics can I close this issue?

@gwaybio (Contributor, author) commented Nov 22, 2019

The task in this issue is not complete. The issue is to split the images into the same training and test set splits as the EBImage features. Does this make sense? It is somewhat separate from the plan you outlined above.

@wagenrace mentioned this issue Nov 29, 2019 (3 tasks)