rangeToArray tries to read every possible cell leading to OOM #3982

tommasoiovane · 2024-04-12T14:41:36Z

This is:

a bug report

What is the expected behavior?

I would expect readActiveSheet to return only cells with value or something in it.

What is the current behavior?

With certain xlsx files rangeToArray calculate $maxCol as "AMK" and $maxRow as "1048579" leading to very slow performance and OOM memory issues.

What are the steps to reproduce?

<?php

require __DIR__ . '/vendor/autoload.php';

// Create new Spreadsheet object
$spreadsheet = \PhpOffice\PhpSpreadsheet\IOFactory::load("./test_file.xlsx");
$data = $spreadsheet->getActiveSheet()->toArray(null, true, false, true);

This code will fail running out of memory. I managed to read the xls file changing Worksheet.php rangeToArray like this:

$maxRow = $this->getHighestDataRow();
$maxCol = $this->getHighestDataColumn();

Is there any reason why it should'nt be like this?

What features do you think are causing the issue

Reader

Which versions of PhpSpreadsheet and PHP are affected?

I tried both on 1.3 and 2.0.0

Thank you in advance
test_file.xlsx

The text was updated successfully, but these errors were encountered:

MarkBaker · 2024-04-12T17:43:00Z

Not a bug, and nothing to do with the Reader. The Reader reads the spreadsheet file, and the values returned for a worksheet by getHighestColumn() and getHighestRow() are calculated based on values stored in the file, and cells that are referenced in the content of the file (perhaps by style information or range settings). getHighestDataRow() and getHighestDataColumn() are calculated at the point when they are called, and based purely on cells that contain data.
The toArray() methods use the getHighestColumn() and getHighestRow() values, changing it would be a BC break; but rangeToArray() can always be used with the getHighestDataRow() and getHighestDataColumn() values... that's why we provide the method.
We don't recommend the use of toArray() or rangeToArray, because creating a large PHP array will always demand a lot of memory. Instead, we'd recommend using the row and column iterators.

oleibman · 2024-05-17T05:31:16Z

@tommasoiovane PR #3839 (part of release 2.0.0) and PR #3906 (part of release 2.1.0) address performance problems in toArray. I think they significantly help your issue. The code below is identical to the code you supplied, except for the addition of the echo statements.

        echo "starting read ", (new DateTime())->format('Y-m-d H:i:s'), "\n";
        $spreadsheet = \PhpOffice\PhpSpreadsheet\IOFactory::load("issue.3982.xlsx");
        echo "starting toArray ", (new DateTime())->format('Y-m-d H:i:s'), "\n";
        $data = $spreadsheet->getActiveSheet()->toArray(null, true, false, true);
        echo "ending toArray ", (new DateTime())->format('Y-m-d H:i:s'), "\n";
        echo "peak memory is ", memory_get_peak_usage(),"\n";

Output:

starting read 2024-05-16 22:09:33
starting toArray 2024-05-16 22:09:35
ending toArray 2024-05-16 22:09:36
peak memory is 44725760

This should be nowhere close to OOM on most systems - I believe PHP ships with a configurable memory limit of 128M, more than 3 times the memory usage of this script. My results are similarly good with 2.0.0 and 2.1.0, and I agree they are much worse on 1.29.0 for both memory and time. Please try again with the newest release.

tommasoiovane · 2024-05-17T07:40:54Z

Performance seems to have significantly improved, however the toArray method still retuns every possible rows and column even thought they are empty. I tried adding the IGNORE_EMPTY_CELLS flag but it does not seems to make a difference.

echo "starting read ", (new DateTime())->format('Y-m-d H:i:s'), "<br>";
$spreadsheet = \PhpOffice\PhpSpreadsheet\IOFactory::load("test_file.xlsx", \PhpOffice\PhpSpreadsheet\Reader\IReader::IGNORE_EMPTY_CELLS);
echo "starting toArray ", (new DateTime())->format('Y-m-d H:i:s'), "<br>";
$data = $spreadsheet->getActiveSheet()->toArray(null, true, false, true);
echo "ending toArray ", (new DateTime())->format('Y-m-d H:i:s'), "<br>";
echo "peak memory is ", memory_get_peak_usage(),"<br>";
echo "Rows: " . count($data);

output

starting read 2024-05-17 09:24:08
starting toArray 2024-05-17 09:24:08
ending toArray 2024-05-17 09:24:09
peak memory is 35172424
Rows: 1048576

It seems to get the wrong rows number because it reads some styles like rowHeight from row 1048570.

Using libreoffice calc i deleted every row after row 6, now it worked with no problems. So is it a bug from excel online (or the program that was used to generate the xls, which unfortunately i don't know) that set style to the last row of the excel?

oleibman · 2024-05-17T16:01:55Z

I don't know how they got there, but the following lines in the xml are the source of the problem:

<row r="1048570" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>
<row r="1048571" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>
<row r="1048572" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>
<row r="1048573" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>
<row r="1048574" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>
<row r="1048575" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>
<row r="1048576" customFormat="false" ht="12.8" hidden="false" customHeight="true" outlineLevel="0" collapsed="false"/>

In other words, as far as this spreadsheet is concerned, rows 1048570-1048576 exist. The processing of the row attributes is entirely separate from the processing of the cells, which is why IGNORE_EMPTY_CELLS had no effect.

Now, since there are no cells associated with those rows, it might be possible to have Xlsx Reader optionally ignore these rows. It's not something we can do automatically (that would be a breaking change), and we probably don't want to use IGNORE_EMPTY_CELLS (ditto), but it might be possible to add a new option to the read filter or to Xlsx reader. I will think about it.

Fix PHPOffice#3982. A number of issues submitted about Xlsx read performance have a common theme, namely that row 1,048,576 and a few rows before it are defined in the worksheet Xml with no cells attached to them. These might be the work of a third party product. While these extraneous rows do not cause any problems for the cells that are actually used on the worksheet, they can lead to excessive memory use. This PR provides an option for the application to ignore rows with no cells when loading. Recent changes to the load logic had already made a significant difference to memory consumption and load time. For the spreadsheet attached to issue 3982, which had caused out-of-memory errors on the user's system, peak memory usage was already reduced to 40-odd MB. With the new option, this is drastically reduced again, to just over 9MB. Specifying the new option is very easy: ```php $reader->setIgnoreRowsWithNoCells(true); ``` Note that there are cases where you might not want this (non-default) behavior. For example, if you set a row height on a row with no cells, the height would be lost with this option. Unfortunately, the extraneous row definitions in the problematic spreadsheets claim to have a custom height, so I can't just use "no custom row styles" as an additional filter.

oleibman mentioned this issue May 22, 2024

Xlsx Reader Optionally Ignore Rows With No Cells #4035

Merged

11 tasks

oleibman closed this as completed in #4035 May 25, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

rangeToArray tries to read every possible cell leading to OOM #3982

rangeToArray tries to read every possible cell leading to OOM #3982

tommasoiovane commented Apr 12, 2024 •

edited

MarkBaker commented Apr 12, 2024

oleibman commented May 17, 2024

tommasoiovane commented May 17, 2024

oleibman commented May 17, 2024

rangeToArray tries to read every possible cell leading to OOM #3982

rangeToArray tries to read every possible cell leading to OOM #3982

Comments

tommasoiovane commented Apr 12, 2024 • edited

What is the expected behavior?

What is the current behavior?

What are the steps to reproduce?

What features do you think are causing the issue

Which versions of PhpSpreadsheet and PHP are affected?

MarkBaker commented Apr 12, 2024

oleibman commented May 17, 2024

tommasoiovane commented May 17, 2024

oleibman commented May 17, 2024

tommasoiovane commented Apr 12, 2024 •

edited