Support reading OpenDocument spreadsheet (.ods) files #3

Open
sagebind opened this issue May 22, 2017 · 2 comments
Labels
e:medium Effort: Medium
Milestone

Comments

@sagebind
Member

sagebind commented May 22, 2017

OpenDocument spreadsheets are the preferred open spreadsheet format outside of Microsoft Office. We should support reading them at the very least for completeness.

@sagebind sagebind added this to the 1.0.0 milestone May 22, 2017
@sagebind sagebind added the ready label Jun 7, 2017
@sagebind sagebind self-assigned this Jul 28, 2017
@sagebind sagebind removed this from the 1.0.0 milestone Jul 28, 2017
@sagebind sagebind added the e:medium Effort: Medium label Oct 20, 2017
@sagebind sagebind removed their assignment Apr 26, 2018
@twilco
Contributor

twilco commented Sep 19, 2018

Did some digging into this one. There certainly aren't many ODF/ODS JVM processing libraries, but I did find a couple.

  1. ODFDOM

This library seems to be a fairly low-level wrapper over the raw ODS XML. We could potentially write our own ODS implementation on top of it, pending some more investigation.

  2. Apache Simple ODF

This is a higher-level library that appears to be written by the same people who created ODFDOM, and is a nice wrapper over ODFDOM. This seemed to be what we were looking for, so I decided to give it a test run to check whether it used constant memory. Here is my test program:

import org.odftoolkit.simple.SpreadsheetDocument;
import org.odftoolkit.simple.table.Table;

public class Sandbox {

    public static void main(String[] args)
    {
        try {
            // Load the test spreadsheet from the classpath.
            SpreadsheetDocument doc = SpreadsheetDocument.loadDocument(Sandbox.class.getResourceAsStream("one_hundred_thousand.ods"));

            // Walk the rows of the first sheet and print each one.
            Table table = doc.getSheetByIndex(0);
            table.getRowIterator().forEachRemaining(row -> System.out.println(row.toString()));
        } catch (Exception e) {
            // Note: OutOfMemoryError is an Error, not an Exception, so it is not caught here.
            System.err.println(e);
        }
    }
}

Shortly after starting this up with a 100k row ODS, memory usage for my Java process jumped to the maximum (currently configured to be ~2GB), and the program printed nothing for the couple of minutes I let it run. I tried again with an ODS file that had 1 million rows and was immediately met with an Exception in thread "main" java.lang.OutOfMemoryError: Java heap space. So this library won't work.
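
For anyone repeating this kind of check against another library, here is a minimal sketch of a heap-sampling helper. It only uses the JDK; `HeapProbe`, `usedHeap`, and `drainAndMeasurePeak` are hypothetical names made up for illustration, and the numbers it reports are approximate because garbage that has not been collected yet is counted as used.

```java
import java.util.Iterator;
import java.util.stream.IntStream;

public class HeapProbe {

    /** Approximate heap in use, in bytes (includes garbage not yet collected). */
    public static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /**
     * Drains the iterator, sampling heap usage every sampleEvery items,
     * and returns the largest sample seen (including one taken before starting).
     */
    public static <T> long drainAndMeasurePeak(Iterator<T> rows, int sampleEvery) {
        long peak = usedHeap();
        long count = 0;
        while (rows.hasNext()) {
            rows.next();
            if (++count % sampleEvery == 0) {
                peak = Math.max(peak, usedHeap());
            }
        }
        return peak;
    }

    public static void main(String[] args) {
        // Stand-in for a real row iterator such as table.getRowIterator().
        Iterator<Integer> rows = IntStream.range(0, 1_000_000).iterator();
        long peak = drainAndMeasurePeak(rows, 10_000);
        System.out.printf("peak heap ~ %d MiB%n", peak / (1024 * 1024));
    }
}
```

A cruder but often sufficient alternative is to run the test program with a deliberately small heap, for example `java -Xmx64m Sandbox`: a reader that really uses constant memory should still finish, while one that loads the whole document will fail quickly.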

  3. jOpenDocument

Their license scheme requires money to be paid for closed source applications, which isn't what we seem to be going for with this project: http://www.jopendocument.org/support.html. Seems like a showstopper.

@sagebind
Member Author

sagebind commented Oct 1, 2018

The ODFDOM part of the Apache ODF Toolkit is also out of the picture. It is well designed, and they have a clean split between the XML handling and the ZIP handling, but neither layer supports streaming.

  • The OdfPackage API requires either a file or the input to be cached in memory.
  • The DOM API is just that: a DOM API. The whole XML file must be parsed and loaded into memory before you can read any data.
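
For contrast, the ZIP layer on its own can be read without caching by using the JDK's `java.util.zip.ZipInputStream`, which walks the archive entries in the order they appear in the stream. The following is a minimal sketch, not a finished design: `OdsZip` and `openContentXml` are hypothetical names, and it assumes the sheet data lives in the package's `content.xml` entry, as in the standard ODF package layout.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

public class OdsZip {

    /**
     * Advances through the ZIP entries in stream order and returns a stream
     * positioned at the start of content.xml, or null if there is no such entry.
     * Skipped entries are discarded as they are read; nothing is buffered whole.
     */
    public static InputStream openContentXml(InputStream ods) throws IOException {
        ZipInputStream zip = new ZipInputStream(ods);
        ZipEntry entry;
        while ((entry = zip.getNextEntry()) != null) {
            if ("content.xml".equals(entry.getName())) {
                // Reads on the returned stream report end-of-stream at the end of this entry.
                return zip;
            }
        }
        return null;
    }
}
```

Closing the returned stream also closes the underlying input, so the caller only has one resource to manage.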

While the file format of an ODS file is surprisingly similar to an XLSX (they are both ZIP files with XML in them), the OpenDocument format is much better designed and has none of the problems outlined in #17 if we want to write our own streaming reader.

I will see if I can write a basic reader from scratch that supports streaming. The whole format spec is online, and this document actually is really helpful to see how you might implement your own parser: http://incubator.apache.org/odftoolkit/odfdom/Layers.html. Almost as easy as parsing CSV, dare I say...
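
To give a rough idea of what such a reader could look like, here is a minimal sketch of the XML half using the JDK's StAX API (`javax.xml.stream`), which pulls one event at a time instead of building a DOM. `OdsRowReader` and `readRows` are hypothetical names, not anything that exists in this project. The sketch treats the text of the `text:p` paragraphs inside each `table:table-cell` as the cell value, and it deliberately ignores repeated rows and columns (`table:number-rows-repeated`, `table:number-columns-repeated`), typed values, and multiple sheets, all of which a real reader would need to handle.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class OdsRowReader {
    static final String TABLE_NS = "urn:oasis:names:tc:opendocument:xmlns:table:1.0";
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    /** Streams content.xml and hands each row to onRow as soon as the row ends. */
    public static void readRows(InputStream contentXml, Consumer<List<String>> onRow)
            throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        factory.setProperty(XMLInputFactory.SUPPORT_DTD, false);  // no DTD processing needed
        factory.setProperty(XMLInputFactory.IS_COALESCING, true); // merge adjacent text events
        XMLStreamReader xml = factory.createXMLStreamReader(contentXml);

        List<String> row = null;    // non-null while inside a table:table-row
        StringBuilder cell = null;  // non-null while inside a table:table-cell
        int paragraphDepth = 0;     // > 0 while inside a text:p within a cell

        while (xml.hasNext()) {
            switch (xml.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    if (is(xml, TABLE_NS, "table-row")) {
                        row = new ArrayList<>();
                    } else if (row != null && is(xml, TABLE_NS, "table-cell")) {
                        cell = new StringBuilder();
                    } else if (cell != null && is(xml, TEXT_NS, "p")) {
                        // Separate multiple paragraphs in one cell with a newline.
                        if (paragraphDepth == 0 && cell.length() > 0) cell.append('\n');
                        paragraphDepth++;
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (cell != null && paragraphDepth > 0) cell.append(xml.getText());
                    break;
                case XMLStreamConstants.END_ELEMENT:
                    if (cell != null && is(xml, TEXT_NS, "p")) {
                        paragraphDepth--;
                    } else if (cell != null && is(xml, TABLE_NS, "table-cell")) {
                        row.add(cell.toString());
                        cell = null;
                    } else if (row != null && is(xml, TABLE_NS, "table-row")) {
                        onRow.accept(row); // the row can be garbage collected after this call
                        row = null;
                    }
                    break;
                default:
                    break;
            }
        }
        xml.close();
    }

    private static boolean is(XMLStreamReader xml, String ns, String localName) {
        return ns.equals(xml.getNamespaceURI()) && localName.equals(xml.getLocalName());
    }
}
```

Because only the current row is held at any time, memory use should depend on the widest row rather than on the number of rows, which is the property the libraries above lacked.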

@sagebind sagebind changed the title Support for OpenDocument spreadsheet (.ods) files Support reading OpenDocument spreadsheet (.ods) files Oct 1, 2018
@sagebind sagebind removed the s:ready label Oct 5, 2018
@sagebind sagebind modified the milestones: 1.0.0, 0.5.0, 0.6.0 Oct 5, 2018

2 participants