
Streaming XLSX reader optimizations #17

Open
sagebind opened this issue Sep 8, 2018 · 3 comments
Labels
a:excel e:medium Effort: Medium

@sagebind
Member

sagebind commented Sep 8, 2018

There are some potential optimizations we can do when reading XLSX files from an InputStream to reduce memory usage and/or avoid the need for a temporary file.

One optimization would be to read the XLSX file byte by byte. An XLSX file is simply a ZIP archive containing XML files and other supporting pieces that make the format work, but we only care about the following files:

  • xl/_rels/workbook.xml.rels: Tells us what the filenames are inside the archive for each object ID. If we don't have this we can usually guess, though, as most software that saves to XLSX uses standard filenames for things.
  • xl/sharedStrings.xml: This file is very important, and contains a table of text data that cells often point to, instead of storing the cell contents directly. If we don't read this most cells will be empty.
  • xl/workbook.xml: Contains a list of sheets in the workbook and other metadata. We can ignore this if we don't care about sheet titles or sheet order.
  • xl/worksheets/*.xml: Contains the actual sheets, rows, and cells. Usually these are named sheet1.xml, sheet2.xml, sheet3.xml etc, but this is not a guarantee. We can infer sheet order based on filenames if we don't have the workbook metadata.
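
To make the shared-strings indirection concrete, here is a hand-written sketch of the two XML fragments involved (simplified; real files carry more attributes, but the `t="s"` cell type and the index-into-the-table mechanism are standard SpreadsheetML):

```xml
<!-- xl/sharedStrings.xml: the shared string table -->
<sst xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main"
     count="1" uniqueCount="1">
  <si><t>Hello</t></si>
</sst>

<!-- xl/worksheets/sheet1.xml: cell A1 has type "s" (shared string) and
     stores only the index 0 into the table above, not the text itself -->
<c r="A1" t="s"><v>0</v></c>
```

This is why a sheet file is unreadable on its own: without the table, index 0 resolves to nothing.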

The problem with ZIP files is that instead of a header, they have a trailer at the end of the file (the central directory) that records where each file is located in the archive. With a read-once InputStream, by the time we get to the trailer, we can't seek back to read any files any more. Fortunately, we can work around this by ignoring the trailer entirely and reading each entry's local header, which includes that entry's filename.
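
This sequential, trailer-free reading is exactly what `java.util.zip.ZipInputStream` already does. A minimal self-contained sketch (the entry names and contents are made up for the demo):

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;
import java.util.zip.*;

public class StreamingZipDemo {
    /** Build a tiny two-entry archive in memory; entry order mirrors write order. */
    static byte[] buildSample() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (ZipOutputStream zos = new ZipOutputStream(buf)) {
            for (String name : new String[] {
                    "xl/sharedStrings.xml", "xl/worksheets/sheet1.xml"}) {
                zos.putNextEntry(new ZipEntry(name));
                zos.write("<xml/>".getBytes(StandardCharsets.UTF_8));
                zos.closeEntry();
            }
        }
        return buf.toByteArray();
    }

    /** Walk the per-entry local headers in file order; the central directory
        (the trailer) at the end of the stream is never consulted. */
    static List<String> entryNames(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zis = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zis.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        for (String name : entryNames(new ByteArrayInputStream(buildSample()))) {
            System.out.println(name);
        }
    }
}
```

Note that `getNextEntry()` yields entries strictly in the order they were written, which is what makes entry ordering matter in the first place.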

This poses a new problem: we are dependent on the order in which files were written to the zip file. If sheet2.xml shows up in the archive before sheet1.xml, we might read the second sheet first. Even worse, if sharedStrings.xml is the last entry in the archive, then we have to consume the entire archive before we can begin reading any rows!

I propose a "graceful degradation" approach, where XLSX files with an optimal layout will use constant memory similar to the XLS reader, but sub-optimal XLSX files may use more memory or even a temporary file.

We can read the archive entries in order, and alter our behavior depending on the filename encountered:

  • If we encounter xl/_rels/workbook.xml.rels, do parse it and store it in memory for later.
  • If we encounter xl/workbook.xml, do parse it and store it in memory, for the purposes of knowing the proper name of each worksheet, but only if we already parsed the rels file (it is useless otherwise).
  • If we encounter xl/sharedStrings.xml, parse it and store it in memory.
  • If we encounter a file in the xl/worksheets/ directory:
    • If we have not encountered the shared strings table yet, fall back to "tempfile mode" by writing all entries to disk until we reach the shared strings.
    • If we have not parsed the workbook file yet, guess the sheet name by its filename.
    • Begin reading and returning rows from this sheet.
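
The decision logic above can be sketched as a small dispatcher. This is a hand-written illustration, not the project's actual API; the state flags and action strings are hypothetical names:

```java
public class XlsxEntryDispatcher {
    // Hypothetical state flags tracking what we've seen so far.
    boolean haveRels, haveWorkbook, haveSharedStrings;

    /** Decide what to do with the next archive entry, per the plan above. */
    public String dispatch(String entryName) {
        if (entryName.equals("xl/_rels/workbook.xml.rels")) {
            haveRels = true;
            return "parse-rels";
        }
        if (entryName.equals("xl/workbook.xml")) {
            // Only useful if the rels file already mapped the object IDs.
            if (!haveRels) return "skip";
            haveWorkbook = true;
            return "parse-workbook";
        }
        if (entryName.equals("xl/sharedStrings.xml")) {
            haveSharedStrings = true;
            return "parse-shared-strings";
        }
        if (entryName.startsWith("xl/worksheets/")) {
            // Can't resolve cell text without the table: degrade to disk.
            if (!haveSharedStrings) return "spill-to-tempfile";
            return haveWorkbook ? "stream-rows" : "stream-rows-guessed-name";
        }
        return "skip";
    }

    public static void main(String[] args) {
        XlsxEntryDispatcher d = new XlsxEntryDispatcher();
        System.out.println(d.dispatch("xl/worksheets/sheet2.xml")); // spill-to-tempfile
        System.out.println(d.dispatch("xl/sharedStrings.xml"));     // parse-shared-strings
        System.out.println(d.dispatch("xl/worksheets/sheet1.xml")); // stream-rows-guessed-name
    }
}
```

The usage in main shows the degradation path: a sheet arriving before the shared strings forces tempfile mode, while a sheet arriving after them streams directly.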

A disadvantage of this approach is that sheets are not necessarily read in their designated order, but rather in the order they appear in the file. In addition, we still use a tempfile if the shared strings do not appear before all of the sheet files inside the archive, because we cannot properly read the sheet files without them.

@twilco
Contributor

twilco commented Sep 8, 2018

This sounds great! Do you have any idea how often XLSXs in the wild actually have this optimal layout? It would be interesting to know how the entries end up in the order they do, and whether Excel and other XLSX manipulation tools deliberately consider the ordering of these files.

@sagebind
Member Author

sagebind commented Sep 8, 2018

I checked a couple of XLSX files in my documents and they all have the shared strings table near the top, though a few had the workbook and rels after the actual sheets.

They are laid out in the order they are added to the archive, so it would depend on the actual code. I'd be surprised if the same tool sometimes produced different orders. I bet the differences are from different versions of Excel.


I decided to test this theory by searching the web for random XLSX files and testing them out:

# Download the top 25 results for "space" on DuckDuckGo in XLSX format
$ ddgr --json -n25 space filetype:xlsx | jq -r '.[].url' | rg '\.xlsx$' | xargs wget
# Dump their file entries in the order they appear
$ echo *.xlsx | xargs -n1 7z l

Of those results, 4 of them had sharedStrings.xml that showed up after at least one sheet file in the archive, which would force us to write a tempfile anyway.

So it looks like it may be common, which lessens the value of this optimization.

@twilco
Contributor

twilco commented Sep 9, 2018

With some major extrapolation of our small sample size, I'd say optimizing 84% of XLSXs would still be pretty great 😄
