Preprocessing steps and tutorial reproducibility #86

dyuan1111 · 2025-01-02T03:58:47Z

Report

Hello CREsted developers,

Thanks for making such a wonderful tool publicly available for use.

I am trying to reproduce the results from the tutorial, but the model does not seem to train well. I noticed that the 1.2.0 release fixed a reproducibility issue, and I was using version 1.1. I therefore upgraded CREsted to the latest version (v1.2.1) and am currently retraining the model.

Before delving into the model specifics, I wanted to ask if there were any particular preprocessing steps applied, such as dividing adata.X by a factor of 100 after normalization.

I ask because, in the notebook titled "Introduction to CREsted with Peak Regression", there is a bar plot showing the ground truth values for the region chr18:3892771-3894885. The y-axis in that plot ranges from 0 to 14, whereas the tutorial data after normalization ranges from 0 to 1400. There are also minor differences in the bar heights that a simple scale factor of 100 does not explain. I wonder whether preprocessing steps account for this difference and whether that could be the main reason the model did not train well.
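One way to make the "beyond a simple scale factor" observation concrete is to compare the two sets of bar heights ratio by ratio. The snippet below is only a sketch: the helper name `scale_check` and both value lists are hypothetical placeholders, not values read from the tutorial or from CREsted.

```python
from statistics import median


def scale_check(reference, observed, rel_tol=0.01):
    """Estimate a common scale factor and report how far any bar deviates from it.

    Returns (factor, max_relative_deviation, is_pure_scaling).
    """
    # Ratio per bar; skip zero reference values to avoid division by zero.
    ratios = [o / r for r, o in zip(reference, observed) if r != 0]
    factor = median(ratios)  # robust estimate of the overall scale
    max_dev = max(abs(ratio / factor - 1) for ratio in ratios)
    return factor, max_dev, max_dev <= rel_tol


# Hypothetical bar heights: values read off the tutorial plot vs. values
# obtained from one's own normalized data for the same region.
plot_values = [14.0, 7.0, 3.5, 1.0]
my_values = [1400.0, 690.0, 360.0, 100.0]

factor, max_dev, is_pure_scaling = scale_check(plot_values, my_values)
print(f"factor={factor:.1f}, max deviation={max_dev:.2%}, pure scaling={is_pure_scaling}")
```

If `is_pure_scaling` comes out false, the two datasets differ by more than a constant factor, which points at different input data or a different target rather than a missing division step.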

Thank you in advance for any guidance!
Dan

Version information

No response

LukasMahieu · 2025-01-03T09:07:52Z

Hey Dan,

Sorry to hear that you're having issues training. What kind of performance are you getting on your test set?

I think the y-axis range in the plot might be an error in the tutorial. The target value used to be 'mean' but seems to have changed to 'count' in the latest tutorial version, so the tutorial plot might be an artifact from when it was still the mean. I'll check with Niklas, who wrote the tutorial, when he gets back next week.

nkempynck · 2025-01-08T08:48:58Z

Hey Dan

Thanks for noticing. I noticed that the bigwigs used in the tutorial no longer matched the ones I use myself. Originally, we were using coverage bigwigs (which give a count to every base pair between two cut sites), but recently I switched to cut-site bigwigs, which only contain count values at the actual cut sites. This gives sparser data, which is the reason for switching to the 'count' scalar: mean values would be too low.
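To illustrate why the mean becomes very small with cut-site bigwigs, here is a toy sketch in plain Python. It does not use CREsted or real bigwig files: the fragments are hypothetical, and it assumes that 'count' means the sum of per-base values over the region and 'mean' is that sum divided by the region width.

```python
def coverage_track(fragments, width):
    """Coverage signal: +1 on every base between a fragment's two cut sites (inclusive)."""
    track = [0] * width
    for start, end in fragments:
        for pos in range(start, end + 1):
            track[pos] += 1
    return track


def cut_site_track(fragments, width):
    """Cut-site signal: +1 only at the two cut sites of each fragment."""
    track = [0] * width
    for start, end in fragments:
        track[start] += 1
        track[end] += 1
    return track


def summarize(track):
    """Scalar targets for one region (assumed definitions, see lead-in)."""
    total = sum(track)
    return {"count": total, "mean": total / len(track)}


# Width of the region from the issue, chr18:3892771-3894885 (2114 bp).
WIDTH = 3894885 - 3892771
# Hypothetical fragments as (start, end) offsets within the region.
FRAGMENTS = [(100, 249), (180, 329), (400, 549)]

coverage = summarize(coverage_track(FRAGMENTS, WIDTH))
cut_sites = summarize(cut_site_track(FRAGMENTS, WIDTH))
print("coverage :", coverage)
print("cut sites:", cut_sites)
```

With these three toy fragments the coverage track sums to 450 while the cut-site track sums to 6, so the cut-site mean is 75 times smaller than the coverage mean. That is the sparsity effect described above, and it also means values derived from the two kinds of bigwigs are not related by one fixed factor in general, since the ratio depends on fragment lengths.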

We are working on fixing the bigwigs on our server to the correct ones asap.

cheers
Niklas

dyuan1111 added the bug label on Jan 2, 2025
