size of subsetted BSseq object is larger than the original object #83

Open
MohamedRefaat92 opened this issue Jun 20, 2019 · 8 comments

@MohamedRefaat92

Hi developers,

I am having a weird problem that results in the size of a subsetted bsseq object being larger than the original one:

> pryr::object_size(bs)
#> 37.4 GB
> bs
An object of type 'BSseq' with
  28614065 methylation loci
  75 samples
has not been smoothed
All assays are in-memory
> bs[, -c(1, 2)]
An object of type 'BSseq' with
  28614065 methylation loci
  73 samples
has not been smoothed
All assays are in-memory
> pryr::object_size(bs[, -c(1, 2)])
#> 71.3 GB

I am using the github version of bsseq.

Best,
Mohamed Shoeb

@PeteHaitch
Contributor

Hi,

Briefly, it's because the subsetting is stored as a 'delayed operation' by the DelayedArray package rather than being applied immediately.
I'm a little surprised it doubles in size, however.

You might try using the HDF5Array backend for such a large object. It'll reduce the memory footprint to under roughly 1 GB.
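
As a minimal illustration of what a delayed operation is, the sketch below uses a small made-up matrix rather than a BSseq object, so it does not reproduce the sizes reported above. It only shows that subsetting a DelayedArray records the operation on top of the original data instead of copying out the kept columns:

```r
library(DelayedArray)

# Small stand-in for a coverage matrix (loci x samples); values are invented.
m <- matrix(seq_len(20), nrow = 5, ncol = 4)
d <- DelayedArray(m)

# Dropping two columns is recorded as a delayed operation, not executed.
sub <- d[, -c(1, 2)]
showtree(sub)    # prints the delayed operations stacked on the original seed
dim(seed(sub))   # the underlying seed still holds all 4 columns

# Realizing in memory applies the subsetting and keeps only the 2 columns.
realized <- as.matrix(sub)
dim(realized)
```

The subsetted object still carries the full original matrix as its seed, which is why it cannot be smaller than the object it was derived from until it is realized.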

@PeteHaitch
Contributor

I'm travelling for the next few days but will be happy to give some suggestions for processing a large dataset like this, as I've done quite a bit of this sort of thing.

@MohamedRefaat92
Author

Hi,

Thanks for the clarification. I actually have no experience dealing with data of this size. I have spent the last two days following some of the tutorials available online (here and here) about the DelayedArray format, and I understand what 'delayed operations' mean. Nevertheless, I don't know how to use this architecture to reduce the memory footprint. I would really appreciate it if you could guide me through an efficient way to process the data.

Best regards,
Mohamed Shoeb

@PeteHaitch
Contributor

If starting from Bismark files, you could try read.bismark(..., BACKEND = "HDF5Array").
But since you've already created your BSseq object, you can create an HDF5-backed version using HDF5Array::saveHDF5SummarizedExperiment().

Alternatively, if you do want to keep your data in-memory, you can do realize(bsseq, BACKEND = NULL), which will 'realize' the delayed operations, and you'll end up with an object whose size is more in line with your expectations.
However, this requires that you have enough memory to make another (temporary) copy.
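
A minimal sketch of the HDF5-backed route, assuming the bsseq and HDF5Array packages are installed. The toy BSseq object, its values, and the output directory are all made up for illustration; with a real dataset you would pass your existing object and a directory of your choice:

```r
library(bsseq)
library(HDF5Array)

# Toy data: 4 loci x 3 samples; all values are invented for illustration.
M   <- matrix(c(1L, 0L, 2L, 3L,  0L, 1L, 1L, 2L,  2L, 2L, 0L, 1L), nrow = 4)
Cov <- matrix(c(2L, 1L, 3L, 3L,  1L, 1L, 2L, 4L,  2L, 3L, 1L, 1L), nrow = 4)
bs <- BSseq(chr = rep("chr1", 4), pos = c(100L, 200L, 300L, 400L),
            M = M, Cov = Cov, sampleNames = c("s1", "s2", "s3"))

# Write the assays to HDF5 in a new directory (placeholder path; the
# directory must not already exist unless replace = TRUE is given).
h5_dir <- tempfile("bs_hdf5_")
saveHDF5SummarizedExperiment(bs, dir = h5_dir)

# Reload: the assays now point at the HDF5 file instead of sitting in RAM.
bs_h5 <- loadHDF5SummarizedExperiment(h5_dir)

# Subsetting the HDF5-backed object is still delayed, but the data stay on disk.
bs_sub <- bs_h5[, -c(1, 2)]
```

The saved directory can be reloaded in a later session with loadHDF5SummarizedExperiment(), so the in-memory object only needs to be built once.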

I'm actually giving an updated tutorial on DelayedArray next week at BioC2019.
You can see the material at https://github.com/PeteHaitch/BioC2019_DelayedArray_workshop
After the conference there will be a rendered version online.
Any feedback you have on it is much appreciated! (I recognise that it's difficult to learn DelayedArray and it can lead to unexpected results like yours).

@MohamedRefaat92
Author

Hi,

Thank you for your valuable input. I am also looking forward to learning more from the updated tutorial to be given at BioC2019.

Actually, I start with the raw coverage and methylation matrices and then create a BSseq object from scratch. I find that using HDF5Array::saveHDF5SummarizedExperiment() to save the object and HDF5Array::loadHDF5SummarizedExperiment() to read it afterward is very straightforward.

I think that another method might be to filter the raw coverage and methylation matrices before creating the BSseq object. I don't know whether that would be a good alternative, but it seems worth trying.
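
A rough sketch of that pre-filtering idea, in base R only. The toy values and the min_covered threshold are hypothetical, and the right filtering criterion depends on the analysis:

```r
# Toy raw data: 5 loci x 3 samples; all values are invented.
Cov <- as.matrix(data.frame(s1 = c(0L, 4L, 0L, 7L, 2L),
                            s2 = c(0L, 5L, 1L, 6L, 0L),
                            s3 = c(0L, 3L, 0L, 8L, 0L)))
M   <- as.matrix(data.frame(s1 = c(0L, 2L, 0L, 7L, 1L),
                            s2 = c(0L, 1L, 1L, 3L, 0L),
                            s3 = c(0L, 3L, 0L, 2L, 0L)))
chr <- rep("chr1", 5)
pos <- c(100L, 200L, 300L, 400L, 500L)

# Hypothetical criterion: keep loci covered in at least min_covered samples.
min_covered <- 2L
keep <- rowSums(Cov > 0) >= min_covered

# Apply the same row filter to every input so they stay aligned.
Cov_f <- Cov[keep, , drop = FALSE]
M_f   <- M[keep, , drop = FALSE]
chr_f <- chr[keep]
pos_f <- pos[keep]
```

The filtered chr_f, pos_f, M_f and Cov_f would then be passed to BSseq() in place of the unfiltered inputs.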

One last question: is it recommended to create a BSseq object from DelayedMatrix objects?
I ask because I start with data frames containing the raw coverage and methylation values, and I convert them to class matrix in the following way:

bs <- BSseq(chr = chr,
            pos = pos,
            M = as.matrix(M), Cov = as.matrix(Cov),
            sampleNames = sampleNames)

Best regards,
Mohamed Shoeb

@PeteHaitch
Contributor

I think that should be fine. The matrix will be wrapped in a DelayedMatrix, but that's cheap and shouldn't copy the data.

@kasperdanielhansen
Contributor

kasperdanielhansen commented Jun 20, 2019 via email

@MohamedRefaat92
Author

Thank you @kasperdanielhansen for pointing this out. It's good to keep that in mind because I have no experience with this level of analysis. I would be glad if you could share a "conservative" approach to follow in similar situations.

Best,
Mohamed Shoeb
