Skip to content

Performance issues when working with large amounts of cells #4607

@Fuzuki785

Description

@Fuzuki785

While trying to work with data from a ≈8.5 MiB xlsx file containing ≈96500 rows for 19 columns on one sheet (≈1.8 million cells, most of which contain data), we noticed significant slowdowns while reading sequentially, row by row using the rangeToArray method on the Worksheet class.

Through some tracing and after following call chains, we found the following issues:

  1. The rangeToArrayYieldRows method on the Worksheet class uses a linear search (with an increment of 1) to find the right cell index, this is quick enough when reading at the start of the file but it gets progressively slower as it has to read further along the file.
  2. The Collection\Cells class always sorts its $index array when calling getSortedCoordinates and getSortedCoordinatesInt even when no change was done to the array in between calls.
  3. The Collection\Cells class uses array_keys and array_values to get the keys and values of its $index array when getCoordinates, getSortedCoordinates, and getSortedCoordinatesInt are called, even when the array was not modified between calls to these methods.

We fixed these issues on our end temporarily by implementing the following changes, respectively:

  1. Using a binary search algorithm to approach the correct index value quickly and finding the proper value afterwards. The searched index value is not always present, which is why, we assume, a linear search was used in the first place, so we also need to fix the value after the binary search, which we did with a linear backwards search followed by a linear forward search, but it limits the iterations to just a row's worth of indices rather than the whole file's worth.
  2. Using a flag to mark when the $index array is sorted, and clearing that flag whenever the array is changed.
  3. Using arrays to cache the keys and values of the $index array and clearing these whenever it is changed.

For reference, we changed the following methods of the Collections\Cells class:

  • delete clears the keys and values caches (if the array is sorted, removing a value should not change that, if the array is not sorted, the flag should not be set, so this method probably shouldn't need to mark the array as needing to be sorted)
  • getCoordinates builds the keys cache if it does not exist and returns the keys cache.
  • getSortedCoordinates sorts the array if it needs sorting, marks it as sorted, builds the keys cache if it does not exist, and returns the keys cache.
  • getSortedCoordinatesInt sorts the array if it needs sorting, marks it as sorted, builds the values cache if it does not exist, and returns the values cache.
  • cloneCellCollection probably needs to clear the sorted flag and both the keys and values cache arrays from the newly created collection (we did it anyway to be sure)
  • add clears the sorted flag and both cache arrays
  • unsetWorksheetCells clears the sorted flag and both cache arrays

Through the modifications listed above, we improved our processing time from a few hours to a few minutes.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions