Split training images into training and test sets #16
Comments
Does that mean we are going to split the data into 4 parts? The training and test sets of 2.process_data, plus the test data @fheigwer has hosted, and now another one?
Three parts:
We have enough data to do this. We need to apply the same sample split to the raw image data.
Where does the data from http://193.196.20.37/scoringapp/ fit into that? The data we download is already just a part of the full data set.
Good point - not really a "true" validation set then. Do we know of similar data that was collected in a different experiment? If so, then we don't need to split the training set.
I would use the test set that @fheigwer keeps secret as the real test set. We only use it at the end to find out what works. The test set of 2.process_data we can use to compare different models to each other (so it serves as the validation set). This one has the downside of leaked features, because the data are split not on a well basis but on a cell basis, so one well can contribute both training and validation data. If a model needs to be trained, we use the training set. I don't think the leaked-features problem can be solved: if we look at the number of wells, we have too few to make a good split. We also have leakage because we are trying to reproduce the results from the hackathon, using only the models that work well on the test data we are using here. But I will ignore that for now. My proposal:
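For reference, the difference between a cell-level and a well-level split mentioned above can be illustrated with a small sketch. This is not the project's code: the field names `plate` and `well`, the function `split_by_well`, and the default `test_fraction` are hypothetical choices made only for illustration. It holds out whole wells, so no well contributes cells to both sets.

```python
import random


def split_by_well(cells, test_fraction=0.2, seed=0):
    """Split cell records so that every well lands entirely in one set.

    `cells` is a list of dicts with hypothetical "plate" and "well" keys.
    """
    # Collect the distinct wells; sort first so the shuffle is reproducible.
    wells = sorted({(c["plate"], c["well"]) for c in cells})
    rng = random.Random(seed)
    rng.shuffle(wells)

    # Hold out at least one whole well for testing.
    n_test = max(1, round(len(wells) * test_fraction))
    test_wells = set(wells[:n_test])

    train = [c for c in cells if (c["plate"], c["well"]) not in test_wells]
    test = [c for c in cells if (c["plate"], c["well"]) in test_wells]
    return train, test
```

With only a handful of wells, the held-out set may cover very few conditions, which is the limitation raised in the comment above.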
I agree with this plan 👍
@gwaygenomics can I close this issue?
The task in this issue is not complete. The issue is to split the images into the same training and test splits as the EBImage features. Does this make sense? It's kind of separate from the plan you outlined above.
The EBImage feature data has already been split into training and test sets. The additional splitting of the training set was performed so that we can evaluate model performance on a test set without evaluating performance on the final validation set.
In this issue, one should identify a solution to split the training image data into a training and testing set using the exact same splits as the EBImage data.
EBImage data: https://github.com/cytodata/single-cell-classifier/tree/master/2.process-data/data
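One possible approach, sketched under assumptions: build a lookup from a sample key to the split recorded in the feature data, then assign each image record to the split of its matching key. The column names (`plate`, `well`, `cell_id`, `split`, `path`) and both function names are hypothetical and not taken from the repository. The key has to be at the same granularity at which the features were split; if the same key shows up in both splits, the lookup raises an error instead of guessing.

```python
def load_split_lookup(feature_rows, key_fields=("plate", "well", "cell_id")):
    """Map each sample key to the split its feature row was assigned to."""
    lookup = {}
    for row in feature_rows:
        key = tuple(row[f] for f in key_fields)
        split = row["split"]
        # A key seen in two different splits cannot be assigned unambiguously.
        if lookup.setdefault(key, split) != split:
            raise ValueError(f"{key} appears in more than one split")
    return lookup


def assign_images(image_rows, lookup, key_fields=("plate", "well", "cell_id")):
    """Group image paths by the split of the matching feature row."""
    assigned = {}
    for row in image_rows:
        key = tuple(row[f] for f in key_fields)
        # Images without a matching feature row are kept aside for inspection.
        split = lookup.get(key, "unmatched")
        assigned.setdefault(split, []).append(row["path"])
    return assigned
```

Keeping unmatched images in their own group, rather than silently dropping them, makes it easier to check that the image split really mirrors the feature split.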