Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rangeToArray tries to read every possible cell leading to OOM #3982

Closed
2 tasks done
tommasoiovane opened this issue Apr 12, 2024 · 4 comments · Fixed by #4035
Closed
2 tasks done

rangeToArray tries to read every possible cell leading to OOM #3982

tommasoiovane opened this issue Apr 12, 2024 · 4 comments · Fixed by #4035

Comments

@tommasoiovane
Copy link

tommasoiovane commented Apr 12, 2024

This is:

  • a bug report

What is the expected behavior?

I would expect readActiveSheet to return only cells with value or something in it.

What is the current behavior?

With certain xlsx files rangeToArray calculate $maxCol as "AMK" and $maxRow as "1048579" leading to very slow performance and OOM memory issues.

What are the steps to reproduce?

<?php

require __DIR__ . '/vendor/autoload.php';

// Create new Spreadsheet object
$spreadsheet = \PhpOffice\PhpSpreadsheet\IOFactory::load("./test_file.xlsx");
$data = $spreadsheet->getActiveSheet()->toArray(null, true, false, true);

This code will fail running out of memory. I managed to read the xls file changing Worksheet.php rangeToArray like this:

$maxRow = $this->getHighestDataRow();
$maxCol = $this->getHighestDataColumn();

Is there any reason why it should'nt be like this?

What features do you think are causing the issue

  • Reader

Which versions of PhpSpreadsheet and PHP are affected?

I tried both on 1.3 and 2.0.0

Thank you in advance
test_file.xlsx

@MarkBaker
Copy link
Member

Not a bug, and nothing to do with the Reader. The Reader reads the spreadsheet file, and the values returned for a worksheet by getHighestColumn() and getHighestRow() are calculated based on values stored in the file, and cells that are referenced in the content of the file (perhaps by style information or range settings). getHighestDataRow() and getHighestDataColumn() are calculated at the point when they are called, and based purely on cells that contain data.
The toArray() methods use the getHighestColumn() and getHighestRow() values, changing it would be a BC break; but rangeToArray() can always be used with the getHighestDataRow() and getHighestDataColumn() values... that's why we provide the method.
We don't recommend the use of toArray() or rangeToArray, because creating a large PHP array will always demand a lot of memory. Instead, we'd recommend using the row and column iterators.

@oleibman
Copy link
Collaborator

@tommasoiovane PR #3839 (part of release 2.0.0) and PR #3906 (part of release 2.1.0) address performance problems in toArray. I think they significantly help your issue. The code below is identical to the code you supplied, except for the addition of the echo statements.

        echo "starting read ", (new DateTime())->format('Y-m-d H:i:s'), "\n";
        $spreadsheet = \PhpOffice\PhpSpreadsheet\IOFactory::load("issue.3982.xlsx");
        echo "starting toArray ", (new DateTime())->format('Y-m-d H:i:s'), "\n";
        $data = $spreadsheet->getActiveSheet()->toArray(null, true, false, true);
        echo "ending toArray ", (new DateTime())->format('Y-m-d H:i:s'), "\n";
        echo "peak memory is ", memory_get_peak_usage(),"\n";

Output:

starting read 2024-05-16 22:09:33
starting toArray 2024-05-16 22:09:35
ending toArray 2024-05-16 22:09:36
peak memory is 44725760

This should be nowhere close to OOM on most systems - I believe PHP ships with a configurable memory limit of 128M, more than 3 times the memory usage of this script. My results are similarly good with 2.0.0 and 2.1.0, and I agree they are much worse on 1.29.0 for both memory and time. Please try again with the newest release.

@tommasoiovane
Copy link
Author

Performance seems to have significantly improved, however the toArray method still retuns every possible rows and column even thought they are empty. I tried adding the IGNORE_EMPTY_CELLS flag but it does not seems to make a difference.

echo "starting read ", (new DateTime())->format('Y-m-d H:i:s'), "<br>";
$spreadsheet = \PhpOffice\PhpSpreadsheet\IOFactory::load("test_file.xlsx", \PhpOffice\PhpSpreadsheet\Reader\IReader::IGNORE_EMPTY_CELLS);
echo "starting toArray ", (new DateTime())->format('Y-m-d H:i:s'), "<br>";
$data = $spreadsheet->getActiveSheet()->toArray(null, true, false, true);
echo "ending toArray ", (new DateTime())->format('Y-m-d H:i:s'), "<br>";
echo "peak memory is ", memory_get_peak_usage(),"<br>";
echo "Rows: " . count($data);

output

starting read 2024-05-17 09:24:08
starting toArray 2024-05-17 09:24:08
ending toArray 2024-05-17 09:24:09
peak memory is 35172424
Rows: 1048576

It seems to get the wrong rows number because it reads some styles like rowHeight from row 1048570.
image

Using libreoffice calc i deleted every row after row 6, now it worked with no problems. So is it a bug from excel online (or the program that was used to generate the xls, which unfortunately i don't know) that set style to the last row of the excel?

@oleibman
Copy link
Collaborator

I don't know how they got there, but the following lines in the xml are the source of the problem:

<row r="1048570" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>
<row r="1048571" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>
<row r="1048572" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>
<row r="1048573" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>
<row r="1048574" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>
<row r="1048575" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>
<row r="1048576" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>

In other words, as far as this spreadsheet is concerned, rows 1048570-1048576 exist. The processing of the row attributes is entirely separate from the processing of the cells, which is why IGNORE_EMPTY_CELLS had no effect.

Now, since there are no cells associated with those rows, it might be possible to have Xlsx Reader optionally ignore these rows. It's not something we can do automatically (that would be a breaking change), and we probably don't want to use IGNORE_EMPTY_CELLS (ditto), but it might be possible to add a new option to the read filter or to Xlsx reader. I will think about it.

oleibman added a commit to oleibman/PhpSpreadsheet that referenced this issue May 22, 2024
Fix PHPOffice#3982. A number of issues submitted about Xlsx read performance have a common theme, namely that row 1,048,576 and a few rows before it are defined in the worksheet Xml with no cells attached to them. These might be the work of a third party product. While these extraneous rows do not cause any problems for the cells that are actually used on the worksheet, they can lead to excessive memory use. This PR provides an option for the application to ignore rows with no cells when loading.

Recent changes to the load logic had already made a significant difference to memory consumption and load time. For the spreadsheet attached to issue 3982, which had caused out-of-memory errors on the user's system, peak memory usage was already reduced to 40-odd MB. With the new option, this is drastically reduced again, to just over 9MB. Specifying the new option is very easy:
```php
$reader->setIgnoreRowsWithNoCells(true);
```
Note that there are cases where you might not want this (non-default) behavior. For example, if you set a row height on a row with no cells, the height would be lost with this option. Unfortunately, the extraneous row definitions in the problematic spreadsheets claim to have a custom height, so I can't just use "no custom row styles" as an additional filter.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging a pull request may close this issue.

3 participants