size of subsetted BSseq object is larger than the original object #83

MohamedRefaat92 · 2019-06-20T06:55:10Z

Hi developers,

I am having a weird problem that results in the size of a subsetted bsseq object being larger than the original one:

>pryr::object_size(bs)
#>37.4 GB

>bs
An object of type 'BSseq' with
  28614065 methylation loci
  75 samples
has not been smoothed
All assays are in-memory

>bs[,-c(1,2)]
An object of type 'BSseq' with
  28614065 methylation loci
  73 samples
has not been smoothed
All assays are in-memory

>pryr::object_size(bs[,-c(1,2)])
71.3 GB

I am using the github version of bsseq.

Best,
Mohamed Shoeb

PeteHaitch · 2019-06-20T08:47:42Z

Hi,

Briefly, it's because the subsetting is stored as a 'delayed operation' using the DelayedArray package.
I'm a little surprised it doubles in size, however.

You might try using the HDF5Array backend for such a large object. It'll reduce the memory footprint to under roughly 1 GB.
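
As a rough illustration of what a 'delayed operation' looks like, here is a minimal sketch using a plain DelayedArray rather than a BSseq object. The toy matrix `m` is made up for the example. It only shows that a subset keeps a reference to the full original data plus a recorded subsetting step; it does not by itself account for the doubling in reported size.

```r
library(DelayedArray)

# Toy 5 x 6 matrix standing in for one assay (illustrative values only).
m <- matrix(seq_len(30), nrow = 5)

# Wrap the ordinary matrix; the matrix becomes the "seed" of the DelayedMatrix.
x <- DelayedArray(m)

# Dropping two columns is recorded as a delayed subsetting operation
# instead of being carried out immediately.
y <- x[, -c(1, 2)]

# Print the tree of delayed operations stacked on top of the seed.
showtree(y)

# The subset reports the reduced dimensions ...
dim(y)
# ... while the underlying seed is still the full original matrix.
dim(seed(y))
```

The values are only computed when something asks for them, for example via `as.matrix(y)`.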

PeteHaitch · 2019-06-20T08:51:00Z

I'm travelling for the next few days but will be happy to give some suggestions for processing a large dataset like this, as I've done quite a bit of this sort of thing.

MohamedRefaat92 · 2019-06-20T09:19:11Z

Hi,

Thanks for the clarification. I actually have no experience dealing with data of this size. I have spent the last two days following some of the tutorials available online (here and here) about the DelayedArray format, and I understand what 'delayed operations' means. However, I don't know how to use this architecture to reduce the memory footprint. I would really appreciate it if you could guide me toward an efficient way to process the data.

Best regards,
Mohamed Shoeb

PeteHaitch · 2019-06-20T11:02:04Z

If starting from Bismark files, you could try read.bismark(..., BACKEND = "HDF5Array").
But since you've already created your BSseq object, you can create an HDF5-backed version using HDF5Array::saveHDF5SummarizedExperiment().

Alternatively, if you do want to keep your data in memory, you can call realize(bsseq, BACKEND = NULL), which will 'realize' the delayed operations, and you'll end up with an object whose size is more in line with your expectations.
However, this requires that you have enough memory to make another (temporary) copy.
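
The two options above might look roughly like the following sketch. It uses a small simulated BSseq object, and all values, sample names, and the temporary directory are made up for illustration. For the in-memory route the sketch realizes a single assay matrix rather than the whole object, which is a simplifying assumption made to keep the example small.

```r
library(bsseq)
library(HDF5Array)

# Small simulated dataset standing in for the real one (hypothetical values).
set.seed(1)
n_loci <- 1000
sample_names <- paste0("s", 1:6)
Cov <- matrix(rpois(n_loci * 6, lambda = 10), nrow = n_loci)
M <- matrix(rbinom(n_loci * 6, size = Cov, prob = 0.5), nrow = n_loci)
bs <- BSseq(chr = rep("chr1", n_loci), pos = seq_len(n_loci),
            M = M, Cov = Cov, sampleNames = sample_names)

# Option 1: write an HDF5-backed copy to disk and load it back.
# The assays of the loaded object live on disk, not in memory.
h5_dir <- tempfile("bs_h5_")
saveHDF5SummarizedExperiment(bs, dir = h5_dir)
bs_h5 <- loadHDF5SummarizedExperiment(h5_dir)
bs_h5_sub <- bs_h5[, -c(1, 2)]

# Option 2: stay in memory and realize the delayed subsetting.
# Here only the coverage assay of the subset is realized.
bs_sub <- bs[, -c(1, 2)]
cov_sub <- realize(getCoverage(bs_sub, type = "Cov"), BACKEND = NULL)
dim(cov_sub)
```

Which route is preferable depends on the available memory: the first trades memory for disk reads, while the second needs room for a temporary copy of whatever is realized.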

I'm actually giving an updated tutorial on DelayedArray next week at BioC2019.
You can see the material at https://github.com/PeteHaitch/BioC2019_DelayedArray_workshop
After the conference there will be a rendered version online.
Any feedback you have on it is much appreciated! (I recognise that it's difficult to learn DelayedArray and it can lead to unexpected results like yours).

MohamedRefaat92 · 2019-06-20T11:20:50Z

Hi,

Thank you for your valuable input. I am also looking forward to learning more from the updated tutorial to be given at BioC2019.

Actually, I start with the raw coverage and methylation matrices and then create a BSseq object from scratch. I find that using HDF5Array::saveHDF5SummarizedExperiment() to save the object and HDF5Array::loadHDF5SummarizedExperiment() to read it afterward is very straightforward.

I think another option might be to filter the raw coverage and methylation matrices before creating the BSseq object. I don't know whether that would be a good alternative, but it seems worth trying.
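
A minimal sketch of that filtering idea in base R, assuming toy data frames in place of the real tables. The values, sample names, and the zero-coverage filter are illustrative assumptions, not something tested on the full dataset.

```r
# Toy methylation (M) and coverage (Cov) tables with made-up values.
set.seed(42)
sampleNames <- paste0("sample", 1:5)
Cov_mat <- matrix(rpois(40, lambda = 10), nrow = 8,
                  dimnames = list(NULL, sampleNames))
M_mat <- matrix(rbinom(40, size = Cov_mat, prob = 0.7), nrow = 8,
                dimnames = list(NULL, sampleNames))
Cov <- as.data.frame(Cov_mat)
M <- as.data.frame(M_mat)
chr <- rep("chr1", 8)
pos <- seq(100L, by = 50L, length.out = 8)

# Drop the unwanted samples (here the first two) before building the object.
drop_samples <- c(1, 2)
M_keep <- as.matrix(M[, -drop_samples])
Cov_keep <- as.matrix(Cov[, -drop_samples])
names_keep <- sampleNames[-drop_samples]

# Optionally also drop loci with zero coverage in every remaining sample.
keep_loci <- rowSums(Cov_keep) > 0
M_keep <- M_keep[keep_loci, , drop = FALSE]
Cov_keep <- Cov_keep[keep_loci, , drop = FALSE]
chr_keep <- chr[keep_loci]
pos_keep <- pos[keep_loci]
```

The filtered `chr_keep`, `pos_keep`, `M_keep`, `Cov_keep`, and `names_keep` would then be passed to BSseq() in place of the unfiltered inputs.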

One last question: is it recommended to create a BSseq object from DelayedMatrix objects?
I ask because I start with data frames containing the raw coverage and methylation values, and I convert them to class matrix in the following way:

bs <- BSseq(chr = chr,
            pos = pos,
            M = as.matrix(M), Cov = as.matrix(Cov),
            sampleNames = sampleNames)

Best regards,
Mohamed Shoeb

PeteHaitch · 2019-06-20T12:15:58Z

I think that should be fine. The matrix will be wrapped in a DelayedMatrix, but that's cheap and shouldn't copy the data.

kasperdanielhansen · 2019-06-20T13:33:29Z

Broadly, we (well, Pete) have processed extremely large datasets using this backend, but it is probably somewhat finicky, i.e. you can do some things that will make it explode and other things that will work fine. And it is pretty clear that which is which is not well explained (and perhaps not well understood).

Best,
Kasper

MohamedRefaat92 · 2019-06-20T13:47:21Z

Thank you @kasperdanielhansen for pointing this out. It's good to keep that in mind because I have no experience with this level of analysis. I would be glad if you could share a "conservative" approach to follow in similar situations.

Best,
Mohamed Shoeb
