
Streaming XLSX reader optimizations #17

Open
sagebind opened this issue Sep 8, 2018 · 3 comments
Labels
a:excel e:medium Effort: Medium

@sagebind
Member

sagebind commented Sep 8, 2018

There are some potential optimizations we can do when reading XLSX files from an InputStream to reduce memory usage and/or avoid the need for a temporary file.

One optimization would be to read the XLSX file byte by byte. An XLSX file is simply a ZIP archive containing XML files and other supporting pieces that make the format work, but we only care about the following files:

  • xl/_rels/workbook.xml.rels: Tells us what the filenames are inside the archive for each object ID. If we don't have this we can usually guess, though, as most software that saves to XLSX uses standard filenames for things.
  • xl/sharedStrings.xml: This file is very important, and contains a table of text data that cells often point to, instead of storing the cell contents directly. If we don't read this most cells will be empty.
  • xl/workbook.xml: Contains a list of sheets in the workbook and other metadata. We can ignore this if we don't care about sheet titles or sheet order.
  • xl/worksheets/*.xml: Contains the actual sheets, rows, and cells. Usually these are named sheet1.xml, sheet2.xml, sheet3.xml etc, but this is not a guarantee. We can infer sheet order based on filenames if we don't have the workbook metadata.
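
To make the shared-strings indirection concrete, here is a hand-written sketch of the two XML fragments involved (simplified; real files carry more attributes, but the `t="s"` cell type and the index-into-the-table mechanism are standard SpreadsheetML):

```xml
<!-- xl/sharedStrings.xml: the shared string table -->
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
     count="1" uniqueCount="1">
  <si><t>Hello</t></si>
</sst>

<!-- xl/worksheets/sheet1.xml: cell A1 has type "s" (shared string) and
     stores only the index 0 into the table above, not the text itself -->
<c r="A1" t="s"><v>0</v></c>
```

This is why a sheet file is unreadable on its own: without the table, index 0 resolves to nothing.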

The problem with ZIP files is that instead of a header, they have a trailer at the end of the file (the central directory) that records where each file is located in the archive. With a read-once InputStream, by the time we get to the trailer, we can't seek back to read any files any more. Fortunately, we can work around this by ignoring the trailer entirely and reading each entry's local header, which includes that entry's filename.
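
This sequential, trailer-free reading is exactly what `java.util.zip.ZipInputStream` already does. A minimal self-contained sketch (the entry names and contents are made up for the demo):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.zip.*;

public class StreamingZipDemo {
    /** Build a tiny two-entry archive in memory; entry order mirrors write order. */
    static byte[] buildSample() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buf)) {
            for (String name : new String[] {
                    "xl/sharedStrings.xml", "xl/worksheets/sheet1.xml"}) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write("<xml/>".getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        }
        return buf.toByteArray();
    }

    /** Walk the per-entry local headers in file order; the central directory
        (the trailer) at the end of the stream is never consulted. */
    static List<String> entryNames(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        for (String name : entryNames(new ByteArrayInputStream(buildSample()))) {
            System.out.println(name);
        }
    }
}
```

Note that `getNextEntry()` yields entries strictly in the order they were written, which is what makes entry ordering matter in the first place.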

This poses a new problem: we are dependent on the order in which files were written to the zip file. If sheet2.xml shows up in the archive before sheet1.xml, we might read the second sheet first. Even worse, if sharedStrings.xml is the last entry in the archive, then we have to consume the entire archive before we can begin reading any rows!

I propose a "graceful degradation" approach, where XLSX files with an optimal layout will use constant memory similar to the XLS reader, but sub-optimal XLSX files may use more memory or even a temporary file.

We can read the archive entries in order, and alter our behavior depending on the filename encountered:

  • If we encounter xl/_rels/workbook.xml.rels, do parse it and store it in memory for later.
  • If we encounter xl/workbook.xml, do parse it and store it in memory, for the purposes of knowing the proper name of each worksheet, but only if we already parsed the rels file (it is useless otherwise).
  • If we encounter xl/sharedStrings.xml, parse it and store it in memory.
  • If we encounter a file in the xl/worksheets/ directory:
    • If we have not encountered the shared strings table yet, fall back to "tempfile mode" by writing all entries to disk until we reach the shared strings.
    • If we have not parsed the workbook file yet, guess the sheet name by its filename.
    • Begin reading and returning rows from this sheet.
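
The decision logic above can be sketched as a small dispatcher. This is a hand-written illustration, not the project's actual API; the state flags and action strings are hypothetical names:

```java
public class XlsxEntryDispatcher {
    // Hypothetical state flags tracking what we've seen so far.
    boolean haveRels, haveWorkbook, haveSharedStrings;

    /** Decide what to do with the next archive entry, per the plan above. */
    public String dispatch(String entryName) {
        if (entryName.equals("xl/_rels/workbook.xml.rels")) {
            haveRels = true;
            return "parse-rels";
        }
        if (entryName.equals("xl/workbook.xml")) {
            // Only useful if the rels file already mapped the object IDs.
            if (!haveRels) return "skip";
            haveWorkbook = true;
            return "parse-workbook";
        }
        if (entryName.equals("xl/sharedStrings.xml")) {
            haveSharedStrings = true;
            return "parse-shared-strings";
        }
        if (entryName.startsWith("xl/worksheets/")) {
            // Can't resolve cell text without the table: degrade to disk.
            if (!haveSharedStrings) return "spill-to-tempfile";
            return haveWorkbook ? "stream-rows" : "stream-rows-guessed-name";
        }
        return "skip";
    }

    public static void main(String[] args) {
        XlsxEntryDispatcher d = new XlsxEntryDispatcher();
        System.out.println(d.dispatch("xl/worksheets/sheet2.xml")); // spill-to-tempfile
        System.out.println(d.dispatch("xl/sharedStrings.xml"));     // parse-shared-strings
        System.out.println(d.dispatch("xl/worksheets/sheet1.xml")); // stream-rows-guessed-name
    }
}
```

The usage in main shows the degradation path: a sheet arriving before the shared strings forces tempfile mode, while a sheet arriving after them streams directly.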

A disadvantage of this approach is that sheets are not necessarily read in their designated order, but rather in the order they appear in the file. In addition, we still use a tempfile if the shared strings do not appear before all of the sheet files inside the archive, because we cannot properly read the sheet files without them.

@twilco
Contributor

twilco commented Sep 8, 2018

This sounds great! Do you have any idea how often XLSXs in the wild actually have this optimal layout? It would be interesting to know how the entries end up in the order they do, and whether Excel and other XLSX manipulation tools deliberately consider the ordering of these files.

@sagebind
Member Author

sagebind commented Sep 8, 2018

I checked a couple of XLSX files in my documents and they all have the shared strings table near the top, though a few had the workbook and rels after the actual sheets.

They are laid out in the order they are added to the archive, so it would depend on the actual code. I'd be surprised if the same tool sometimes produced different orders. I bet the differences are from different versions of Excel.


I decided to test this theory by searching the web for random XLSX files and testing them out:

# Download the top 25 results for "space" on DuckDuckGo in XLSX format
$ ddgr --json -n25 space filetype:xlsx | jq -r '.[].url' | rg '\.xlsx$' | xargs wget
# Dump their file entries in the order they appear
$ echo *.xlsx | xargs -n1 7z l

Of those results, 4 of them had sharedStrings.xml that showed up after at least one sheet file in the archive, which would force us to write a tempfile anyway.

So it looks like it may be common, which lessens the value of this optimization.

@twilco
Contributor

twilco commented Sep 9, 2018

With some major extrapolation of our small sample size, I'd say optimizing 84% of XLSXs would still be pretty great 😄
