Preprocessing steps and tutorial reproducibility #86

Open
dyuan1111 opened this issue Jan 2, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@dyuan1111

Hello CREsted developers,

Thanks for making such a wonderful tool publicly available for use.

I am trying to reproduce the results from the tutorial, but the model does not seem to train well. I noticed that the 1.2.0 release fixed a reproducibility issue, and I was using version 1.1. I therefore upgraded CREsted to the latest version (v1.2.1) and am currently retraining the model.

Before delving into the model specifics, I wanted to ask if there were any particular preprocessing steps applied, such as dividing adata.X by a factor of 100 after normalization.

I ask because the notebook titled "Introduction to CREsted with Peak Regression" contains a bar plot showing the ground truth values for the region chr18:3892771-3894885. The y-axis in that plot ranges from 0 to 14, whereas the tutorial data after normalization ranges from 0 to 1400. The bar heights also differ slightly in ways that a simple scale factor of 100 does not explain. I wonder whether additional preprocessing steps account for this difference and could be the main reason why the model did not train well.

Thank you in advance for any guidance!
Dan

Version information

No response
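The scale-factor question raised above can be checked numerically. The following is a minimal sketch using only NumPy; the function name `scale_factor_report` and the example values are hypothetical and are not taken from the tutorial or from CREsted. It fits a single factor `k` by least squares and reports whether the remaining deviation stays within a tolerance.

```python
import numpy as np


def scale_factor_report(reference, candidate, rtol=0.05):
    """Fit candidate ~= k * reference and report how well one factor explains it.

    Returns (k, max_relative_deviation, is_pure_scale).
    """
    reference = np.asarray(reference, dtype=float)
    candidate = np.asarray(candidate, dtype=float)
    # Least-squares estimate of a single multiplicative factor.
    k = float(reference @ candidate / (reference @ reference))
    # Relative deviation of each value from the scaled reference.
    rel_dev = np.abs(candidate - k * reference) / np.maximum(np.abs(candidate), 1e-12)
    return k, float(rel_dev.max()), bool(np.all(rel_dev <= rtol))


# Hypothetical bar heights read off a plot, and hypothetical values from the data.
plotted = [1.0, 7.0, 14.0, 3.5]
exact_copy = [100.0, 700.0, 1400.0, 350.0]     # same profile, scaled by 100
different = [100.0, 900.0, 1400.0, 200.0]      # profile differs beyond scaling

print(scale_factor_report(plotted, exact_copy))
print(scale_factor_report(plotted, different))
```

If the last element of the result is `False`, the two sets of values differ by more than a constant factor, which would point to a different underlying signal rather than a division by 100.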

dyuan1111 added the bug label on Jan 2, 2025
@LukasMahieu
Collaborator

LukasMahieu commented Jan 3, 2025

Hey Dan,

Sorry to hear that you're having issues training. What kind of performance are you getting on your test set?

I think the y-axis range in the plot might be an error in the tutorial. The target value used to be the 'mean' but seems to have changed to 'count' in the latest tutorial version, so the tutorial plot might be an artifact from when it was still the mean. I'll check with Niklas, who wrote the tutorial, when he gets back next week.

@nkempynck
Collaborator

nkempynck commented Jan 8, 2025

Hey Dan

Thanks for noticing. I found that the bigwigs used in the tutorial were no longer up to date with the ones I use myself. Originally, we were using coverage bigwigs (which assign a count to every base pair between two cut sites), but I recently switched to cut-site bigwigs, which only contain count values at the actual cut sites. This gives sparser data, which is the reason for switching to the 'count' scalar: mean values would be too low.

We are working on fixing the bigwigs on our server to the correct ones asap.

cheers
Niklas
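To illustrate the difference between the two bigwig types described above, here is a small NumPy-only sketch. It does not use CREsted or read real bigwig files. The fragment coordinates and the helper `build_tracks` are hypothetical, and it assumes that the 'count' scalar corresponds to the sum of per-base values over a region and 'mean' to their average.

```python
import numpy as np


def build_tracks(fragments, region_len):
    """Build a coverage track and a cut-site track from half-open fragments."""
    coverage = np.zeros(region_len)
    cut_sites = np.zeros(region_len)
    for start, end in fragments:
        # Coverage: every base pair between the two cut sites gets a count.
        coverage[start:end] += 1
        # Cut sites: only the two fragment ends get a count.
        cut_sites[start] += 1
        cut_sites[end - 1] += 1
    return coverage, cut_sites


# Hypothetical fragments inside a 200 bp region.
fragments = [(10, 60), (40, 120), (100, 180)]
coverage, cut_sites = build_tracks(fragments, region_len=200)

# Assumed meaning of the scalars: 'mean' = average, 'count' = sum over the region.
print("coverage  mean:", coverage.mean(), " count:", coverage.sum())
print("cut sites mean:", cut_sites.mean(), " count:", cut_sites.sum())
```

In this toy region the mean of the cut-site track is far smaller than the mean of the coverage track because most positions are zero, while the summed count stays on a usable scale. This is consistent with the reasoning given for moving from 'mean' to 'count'.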

3 participants