
XlsxWorksheet constructor is parsing the whole sheet on construction #618

Open

OwnageIsMagic opened this issue Nov 20, 2022 · 7 comments

@OwnageIsMagic
Contributor

OwnageIsMagic commented Nov 20, 2022

```csharp
while ((record = sheetStream.Read()) != null)
{
    switch (record)
    {
        case SheetDataBeginRecord _:
            inSheetData = true;
            break;
        case SheetDataEndRecord _:
            inSheetData = false;
            break;
        case RowHeaderRecord row when inSheetData:
            rowIndexMaximum = Math.Max(rowIndexMaximum, row.RowIndex);
            break;
        case CellRecord cell when inSheetData:
            columnIndexMaximum = Math.Max(columnIndexMaximum, cell.ColumnIndex);
            break;
        case ColumnRecord column:
            columnWidths.Add(column.Column);
            break;
        case SheetFormatPrRecord sheetFormatProperties:
            if (sheetFormatProperties.DefaultRowHeight != null)
                DefaultRowHeight = sheetFormatProperties.DefaultRowHeight.Value;
            break;
        case SheetPrRecord sheetProperties:
            CodeName = sheetProperties.CodeName;
            break;
        case MergeCellRecord mergeCell:
            cellRanges.Add(mergeCell.Range);
            break;
        case HeaderFooterRecord headerFooter:
            HeaderFooter = headerFooter.HeaderFooter;
            break;
    }
}
```

So basically, calling reader.NextResult() contributes 50% of the time in my workload and allocates tons of Cell, Cell[], and boxed double instances for worksheets I don't even want to read.

Profiler result:

(image: profiler screenshot)

Maybe make those properties lazily calculated?
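A minimal sketch of the lazy-calculation idea using `Lazy<T>` (hypothetical names; `scanSheet` stands in for the record loop that the constructor currently runs eagerly):

```csharp
using System;

// Hypothetical sketch: defer the sheet scan until a dimension property
// is actually read, instead of running it in the worksheet constructor.
class LazyWorksheetDimensions
{
    private readonly Lazy<(int Rows, int Cols)> _dims;

    public LazyWorksheetDimensions(Func<(int Rows, int Cols)> scanSheet)
    {
        // scanSheet stands in for the record loop above; Lazy<T> ensures
        // it only runs on the first access to RowCount/FieldCount.
        _dims = new Lazy<(int Rows, int Cols)>(scanSheet);
    }

    public int RowCount => _dims.Value.Rows;
    public int FieldCount => _dims.Value.Cols;
}
```

Worksheets that are never touched would then never pay for the scan; the first property access on a used worksheet pays it once.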

@appel1
Collaborator

appel1 commented Nov 20, 2022

For sure we can be smarter here. Not sure how much difference lazily parsing cell content would make, but something to test and benchmark. Perhaps it would be enough to skip over unnecessary things when doing the pre-scan to determine properties like FieldCount and RowCount.

I've also been toying with the idea of making the properties that require us to scan the files twice optional via configuration, at least for the XML based formats. Not sure it is possible for the binary formats due to the many different ways files break the spec.

@OwnageIsMagic
Contributor Author

You can estimate FieldCount/RowCount from <dimension /> element, I think just skipping sheetData would be a huge win.
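Estimating from `<dimension ref="A1:F100"/>` could look something like this (a sketch with a hypothetical `DimensionParser` helper; as the next comment notes, the attribute is often missing or wrong, so at best this is an estimate):

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sketch: derive row/column counts from the ref attribute
// of the <dimension /> element, e.g. "A1:F100" -> 100 rows, 6 columns.
static class DimensionParser
{
    // Converts column letters to a 1-based index: "A" -> 1, "F" -> 6, "AA" -> 27.
    static int ColumnToIndex(string letters)
    {
        int n = 0;
        foreach (char c in letters)
            n = n * 26 + (c - 'A' + 1);
        return n;
    }

    public static (int Rows, int Cols) Parse(string dimensionRef)
    {
        var m = Regex.Match(dimensionRef, @"^([A-Z]+)(\d+):([A-Z]+)(\d+)$");
        if (!m.Success)
            throw new FormatException("Unexpected dimension ref: " + dimensionRef);
        int rows = int.Parse(m.Groups[4].Value) - int.Parse(m.Groups[2].Value) + 1;
        int cols = ColumnToIndex(m.Groups[3].Value) - ColumnToIndex(m.Groups[1].Value) + 1;
        return (rows, cols);
    }
}
```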

@appel1
Collaborator

appel1 commented Nov 20, 2022

It is often missing or incorrect, so unfortunately it can't be used.

@victor-gutemberg

victor-gutemberg commented Feb 16, 2023

+1 on this issue. I'd like to implement my own method of detecting the data boundaries and this step is just consuming unnecessary time since I need to go through all the data again for my logic to work.

For now, there could simply be a configuration option to disable it and allow the boundaries to be set via a property.

This has also been reported previously in an issue that was closed without resolution. #585

@ArjunVachhani

I am also facing memory issues when uploading huge files. In my case the files are more than 200MB in size.

So I have decided to build a library to read huge xlsx files. The library is mostly ready; if you try it today it should work fine. Please feel free to try it and share your feedback and bugs.
To reduce memory usage, the library does the following:

- Does not load the whole worksheet; instead it streams one row at a time.
- Instead of loading shared strings into RAM, it uses an indexed file to quickly look up values.

In the coming days I will push a few more changes to iron out any issues and add documentation.

Link to repository https://github.com/ArjunVachhani/XlsxHelper
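For readers curious how an indexed shared-string file can work, here is an illustrative sketch (not XlsxHelper's actual implementation): strings are appended to a backing store and only (offset, length) pairs stay in memory, so lookups seek into the store instead of holding every string in RAM.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Illustrative sketch of the indexed shared-string idea: keep the string
// bytes in a backing stream (e.g. a temp file) and hold only (offset, length)
// pairs in memory, seeking on each lookup.
class DiskBackedSharedStrings : IDisposable
{
    private readonly Stream _store;
    private readonly List<(long Offset, int Length)> _index = new List<(long, int)>();

    public DiskBackedSharedStrings(Stream store) => _store = store;

    public int Add(string value)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(value);
        _store.Seek(0, SeekOrigin.End);
        _index.Add((_store.Position, bytes.Length));
        _store.Write(bytes, 0, bytes.Length);
        return _index.Count - 1;
    }

    public string Get(int i)
    {
        var (offset, length) = _index[i];
        byte[] buf = new byte[length];
        _store.Seek(offset, SeekOrigin.Begin);
        _store.Read(buf, 0, length); // short reads ignored; fine for a sketch
        return Encoding.UTF8.GetString(buf);
    }

    public void Dispose() => _store.Dispose();
}
```

The trade-off is an extra seek per string lookup in exchange for O(1)-per-string memory instead of holding the whole shared-string table.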

@appel1
Collaborator

appel1 commented May 18, 2024

Just skipping reading the cell contents when figuring out the field count for an .xlsx made quite a big difference. I'll have to test whether something similar can be done for the other formats and what impact that would have.

Current implementation
|         Method |     Mean |    Error |   StdDev |      Gen0 | Allocated |
|--------------- |---------:|---------:|---------:|----------:|----------:|
| OpenSingleFile | 82.66 ms | 0.713 ms | 0.667 ms | 3000.0000 |  33.69 MB |

Skip cell content when pre-scanning
|         Method |     Mean |    Error |   StdDev |      Gen0 |     Gen1 | Allocated |
|--------------- |---------:|---------:|---------:|----------:|---------:|----------:|
| OpenSingleFile | 56.83 ms | 0.215 ms | 0.168 ms | 2222.2222 | 222.2222 |  23.11 MB |
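A hedged sketch of what "skip cell content when pre-scanning" means in practice: the pre-scan only needs row and column indices, so it can avoid materializing cell values (and the boxed doubles the issue mentions). The record types here are stand-ins for ExcelDataReader's internals, not its real API.

```csharp
using System;
using System.Collections.Generic;

// Stand-in record types; the real reader's records carry parsed values,
// which is exactly what a dimensions-only pre-scan can skip.
abstract class Record { }
class RowHeaderRecord : Record { public int RowIndex; }
class CellHeaderRecord : Record { public int ColumnIndex; } // no Value field

static class PreScanner
{
    // Walks the record stream tracking only the maximum row/column index;
    // every value-bearing record falls through the switch untouched.
    public static (int MaxRow, int MaxCol) Scan(IEnumerable<Record> records)
    {
        int maxRow = -1, maxCol = -1;
        foreach (var record in records)
        {
            switch (record)
            {
                case RowHeaderRecord row:
                    maxRow = Math.Max(maxRow, row.RowIndex);
                    break;
                case CellHeaderRecord cell:
                    maxCol = Math.Max(maxCol, cell.ColumnIndex);
                    break;
                // Cell values, styles, etc. are deliberately not parsed here.
            }
        }
        return (maxRow, maxCol);
    }
}
```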

@appel1
Collaborator

appel1 commented May 27, 2024

Something to also look into is if we can use a library like TurboXml when parsing XML. If the performance gains are enough perhaps it is worth the cost of having an additional dependency and the complication of having one code path for < net 8.0 and another for >= net 8.0.

That specific library won't work very well though because of its push nature, but maybe there are others out there that aren't quite as allocation-happy as XmlReader is.
