-
Notifications
You must be signed in to change notification settings - Fork 3.6k
Open
Description
While trying to work with data from a ≈8.5 MiB xlsx file containing ≈96500 rows for 19 columns on one sheet (≈1.8 million cells, most of which contain data), we noticed significant slowdowns while reading sequentially, row by row using the rangeToArray
method on the Worksheet
class.
Through some tracing and after following call chains, we found the following issues:
- The
rangeToArrayYieldRows
method on theWorksheet
class uses a linear search (with an increment of 1) to find the right cell index, this is quick enough when reading at the start of the file but it gets progressively slower as it has to read further along the file. - The
Collection\Cells
class always sorts its$index
array when callinggetSortedCoordinates
andgetSortedCoordinatesInt
even when no change was done to the array in between calls. - The
Collection\Cells
class usesarray_keys
andarray_values
to get the keys and values of its$index
array whengetCoordinates
,getSortedCoordinates
, andgetSortedCoordinatesInt
are called, even when the array was not modified between calls to these methods.
We fixed these issues on our end temporarily by implementing the following changes, respectively:
- Using a binary search algorithm to approach the correct index value quickly and finding the proper value afterwards. The searched index value is not always present, which is why, we assume, a linear search was used in the first place, so we also need to fix the value after the binary search, which we did with a linear backwards search followed by a linear forward search, but it limits the iterations to just a row's worth of indices rather than the whole file's worth.
- Using a flag to mark when the
$index
array is sorted, and clearing that flag whenever the array is changed. - Using arrays to cache the keys and values of the
$index
array and clearing these whenever it is changed.
For reference, we changed the following methods of the Collections\Cells
class:
delete
clears the keys and values caches (if the array is sorted, removing a value should not change that, if the array is not sorted, the flag should not be set, so this method probably shouldn't need to mark the array as needing to be sorted)getCoordinates
builds the keys cache if it does not exist and returns the keys cache.getSortedCoordinates
sorts the array if it needs sorting, marks it as sorted, builds the keys cache if it does not exist, and returns the keys cache.getSortedCoordinatesInt
sorts the array if it needs sorting, marks it as sorted, builds the values cache if it does not exist, and returns the values cache.cloneCellCollection
probably needs to clear the sorted flag and both the keys and values cache arrays from the newly created collection (we did it anyway to be sure)add
clears the sorted flag and both cache arraysunsetWorksheetCells
clears the sorted flag and both cache arrays
Through the modifications listed above, we improved our processing time from a few hours to a few minutes.
williamyiu
Metadata
Metadata
Assignees
Labels
No labels