
XlsxWorksheet constructor is parsing the whole sheet on construction #618

Open

OwnageIsMagic opened this issue Nov 20, 2022 · 7 comments

@OwnageIsMagic
Contributor

OwnageIsMagic commented Nov 20, 2022

```csharp
while ((record = sheetStream.Read()) != null)
{
    switch (record)
    {
        case SheetDataBeginRecord _:
            inSheetData = true;
            break;
        case SheetDataEndRecord _:
            inSheetData = false;
            break;
        case RowHeaderRecord row when inSheetData:
            rowIndexMaximum = Math.Max(rowIndexMaximum, row.RowIndex);
            break;
        case CellRecord cell when inSheetData:
            columnIndexMaximum = Math.Max(columnIndexMaximum, cell.ColumnIndex);
            break;
        case ColumnRecord column:
            columnWidths.Add(column.Column);
            break;
        case SheetFormatPrRecord sheetFormatProperties:
            if (sheetFormatProperties.DefaultRowHeight != null)
                DefaultRowHeight = sheetFormatProperties.DefaultRowHeight.Value;
            break;
        case SheetPrRecord sheetProperties:
            CodeName = sheetProperties.CodeName;
            break;
        case MergeCellRecord mergeCell:
            cellRanges.Add(mergeCell.Range);
            break;
        case HeaderFooterRecord headerFooter:
            HeaderFooter = headerFooter.HeaderFooter;
            break;
    }
}
```

So basically, calling reader.NextResult() contributes 50% of the time in my workload and allocates tons of Cell, Cell[], and boxed double instances for worksheets I don't even want to read.

Profiler result:

(image: profiler screenshot)

Maybe make those properties lazily calculated?
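A minimal sketch of the lazy-calculation idea using `Lazy<T>` (hypothetical names; `scanSheet` stands in for the record loop that the constructor currently runs eagerly):

```csharp
using System;

// Hypothetical sketch: defer the sheet scan until a dimension property
// is actually read, instead of running it in the worksheet constructor.
class LazyWorksheetDimensions
{
    private readonly Lazy<(int Rows, int Cols)> _dims;

    public LazyWorksheetDimensions(Func<(int Rows, int Cols)> scanSheet)
    {
        // scanSheet stands in for the record loop above; Lazy<T> ensures
        // it only runs on the first access to RowCount/FieldCount.
        _dims = new Lazy<(int Rows, int Cols)>(scanSheet);
    }

    public int RowCount => _dims.Value.Rows;
    public int FieldCount => _dims.Value.Cols;
}
```

Worksheets that are never touched would then never pay for the scan; the first property access on a used worksheet pays it once.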

@appel1
Collaborator

appel1 commented Nov 20, 2022

For sure we can be smarter here. Not sure how much difference lazily parsing cell content would make, but something to test and benchmark. Perhaps it would be enough to skip over unnecessary things when doing the pre-scan to determine properties like FieldCount and RowCount.

I've also been toying with the idea of making the properties that require us to scan the files twice optional via configuration, at least for the XML based formats. Not sure it is possible for the binary formats due to the many different ways files break the spec.

@OwnageIsMagic
Contributor Author

You can estimate FieldCount/RowCount from <dimension /> element, I think just skipping sheetData would be a huge win.
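Estimating from `<dimension ref="A1:F100"/>` could look something like this (a sketch with a hypothetical `DimensionParser` helper; as the next comment notes, the attribute is often missing or wrong, so at best this is an estimate):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sketch: derive row/column counts from the ref attribute
// of the <dimension /> element, e.g. "A1:F100" -> 100 rows, 6 columns.
static class DimensionParser
{
    // Converts column letters to a 1-based index: "A" -> 1, "F" -> 6, "AA" -> 27.
    static int ColumnToIndex(string letters)
    {
        int n = 0;
        foreach (char c in letters)
            n = n * 26 + (c - 'A' + 1);
        return n;
    }

    public static (int Rows, int Cols) Parse(string dimensionRef)
    {
        var m = Regex.Match(dimensionRef, @"^([A-Z]+)(\d+):([A-Z]+)(\d+)$");
        if (!m.Success)
            throw new FormatException("Unexpected dimension ref: " + dimensionRef);
        int rows = int.Parse(m.Groups[4].Value) - int.Parse(m.Groups[2].Value) + 1;
        int cols = ColumnToIndex(m.Groups[3].Value) - ColumnToIndex(m.Groups[1].Value) + 1;
        return (rows, cols);
    }
}
```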

@appel1
Collaborator

appel1 commented Nov 20, 2022

It is often missing or incorrect, so unfortunately it can't be used.

@victor-gutemberg

victor-gutemberg commented Feb 16, 2023

+1 on this issue. I'd like to implement my own method of detecting the data boundaries and this step is just consuming unnecessary time since I need to go through all the data again for my logic to work.

For now, there could simply be a configuration option to disable it and allow the boundaries to be set via a property.

This has also been reported previously in an issue that was closed without resolution. #585

@ArjunVachhani

I am also facing memory issues when uploading huge files. In my case the files are more than 200MB in size.

So I have decided to build a library to read huge xlsx files. The library is mostly ready; if you try it today it should work fine. Please feel free to try it and share your feedback and bugs.
To reduce memory usage, the library does the following:

- Does not load the whole worksheet; instead it streams one row at a time.
- Instead of loading shared strings into RAM, it uses an indexed file to quickly look up values.

In the coming days I will push a few more changes to iron out any issues and add documentation.

Link to repository https://github.com/ArjunVachhani/XlsxHelper
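For readers curious how an indexed shared-string file can work, here is an illustrative sketch (not XlsxHelper's actual implementation): strings are appended to a backing store and only (offset, length) pairs stay in memory, so lookups seek into the store instead of holding every string in RAM.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Illustrative sketch of the indexed shared-string idea: keep the string
// bytes in a backing stream (e.g. a temp file) and hold only (offset, length)
// pairs in memory, seeking on each lookup.
class DiskBackedSharedStrings : IDisposable
{
    private readonly Stream _store;
    private readonly List<(long Offset, int Length)> _index = new List<(long, int)>();

    public DiskBackedSharedStrings(Stream store) => _store = store;

    public int Add(string value)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(value);
        _store.Seek(0, SeekOrigin.End);
        _index.Add((_store.Position, bytes.Length));
        _store.Write(bytes, 0, bytes.Length);
        return _index.Count - 1;
    }

    public string Get(int i)
    {
        var (offset, length) = _index[i];
        byte[] buf = new byte[length];
        _store.Seek(offset, SeekOrigin.Begin);
        _store.Read(buf, 0, length); // short reads ignored; fine for a sketch
        return Encoding.UTF8.GetString(buf);
    }

    public void Dispose() => _store.Dispose();
}
```

The trade-off is an extra seek per string lookup in exchange for O(1)-per-string memory instead of holding the whole shared-string table.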

@appel1
Collaborator

appel1 commented May 18, 2024

Just skipping reading the cell contents when figuring out the field count for an .xlsx made quite a big difference. I'll have to test whether something similar can be done for the other formats and what impact that would have.

Current implementation
|         Method |     Mean |    Error |   StdDev |      Gen0 | Allocated |
|--------------- |---------:|---------:|---------:|----------:|----------:|
| OpenSingleFile | 82.66 ms | 0.713 ms | 0.667 ms | 3000.0000 |  33.69 MB |

Skip cell content when pre-scanning
|         Method |     Mean |    Error |   StdDev |      Gen0 |     Gen1 | Allocated |
|--------------- |---------:|---------:|---------:|----------:|---------:|----------:|
| OpenSingleFile | 56.83 ms | 0.215 ms | 0.168 ms | 2222.2222 | 222.2222 |  23.11 MB |
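A hedged sketch of what "skip cell content when pre-scanning" means in practice: the pre-scan only needs row and column indices, so it can avoid materializing cell values (and the boxed doubles the issue mentions). The record types here are stand-ins for ExcelDataReader's internals, not its real API.

```csharp
using System;
using System.Collections.Generic;

// Stand-in record types; the real reader's records carry parsed values,
// which is exactly what a dimensions-only pre-scan can skip.
abstract class Record { }
class RowHeaderRecord : Record { public int RowIndex; }
class CellHeaderRecord : Record { public int ColumnIndex; } // no Value field

static class PreScanner
{
    // Walks the record stream tracking only the maximum row/column index;
    // every value-bearing record falls through the switch untouched.
    public static (int MaxRow, int MaxCol) Scan(IEnumerable<Record> records)
    {
        int maxRow = -1, maxCol = -1;
        foreach (var record in records)
        {
            switch (record)
            {
                case RowHeaderRecord row:
                    maxRow = Math.Max(maxRow, row.RowIndex);
                    break;
                case CellHeaderRecord cell:
                    maxCol = Math.Max(maxCol, cell.ColumnIndex);
                    break;
                // Cell values, styles, etc. are deliberately not parsed here.
            }
        }
        return (maxRow, maxCol);
    }
}
```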

@appel1
Collaborator

appel1 commented May 27, 2024

Something to also look into is if we can use a library like TurboXml when parsing XML. If the performance gains are enough perhaps it is worth the cost of having an additional dependency and the complication of having one code path for < net 8.0 and another for >= net 8.0.

That specific library won't work very well though because of its push nature, but maybe there are others out there that aren't quite as allocation-happy as XmlReader is.
