
Does not really stream, keeps objects in memory #58

Open
nkbt opened this issue Dec 11, 2015 · 8 comments

Comments

@nkbt

nkbt commented Dec 11, 2015

We used it for some time, until we found the node process crashing because of excessive memory usage.

After deeper investigation we realized that "streaming" a 250 MB XML file through xml-stream is not possible: node crashes somewhere in the first third of it.

Every time a new chunk of interesting data was found, memory increased by about 1 MB and never went back down.

I did not investigate what exactly leaks memory; instead I switched directly to node-expat. Now processing a 350 MB XML file takes at most ~400 MB of memory (it depends on the largest object passed through, usually up to 50 MB with rare spikes), but memory never grows unboundedly and it actually streams.

I am not sure I will have time to take a deeper look at the code and figure out what exactly holds on to extra objects.

Might be related to #16

@Artazor
Collaborator

Artazor commented Dec 11, 2015

Do you have any "collect" calls in your xml-stream configuration? Collecting can produce garbage if you collect items that appear both inside and outside of your nodes of interest. In that case you only need to collect on a more specific selector (one that includes the parent), as sketched below.
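
A minimal sketch of that fix, assuming a file of `<order>` elements containing `<item>` children (the names and file are made up for illustration, and the `parent > child` selector form is taken from the suggestion above):

```js
var fs = require('fs');
var XmlStream = require('xml-stream');

var xml = new XmlStream(fs.createReadStream('orders.xml'));

// Too broad: collects every <item> anywhere in the document,
// including ones outside the nodes of interest, so they pile up.
// xml.collect('item');

// More specific: include the parent in the selector so only <item>
// elements under <order> are collected.
xml.collect('order > item');

xml.on('endElement: order', function (order) {
  // order.item is an array of the collected child elements
  console.log(order.item.length);
});
```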

@Artazor
Collaborator

Artazor commented Dec 11, 2015

FYI, some users have reported that they were able to parse 2 GB files with xml-stream without any problems. Nevertheless, I have plans to rewrite xml-stream to make it a true stream, and more configurable.

@nkbt
Author

nkbt commented Dec 11, 2015

Sure, we did use collect, since most of the data we extract consists of collections.

Making it a true stream does not seem like anything crazy; I reimplemented it yesterday. I also added a feature to detect "arrays": when another tag with the same name appears at the same level, the parent key becomes an array (see the sketch below).

That does not work perfectly, since if there is only one item (which is meant to be an element of an array), it cannot be detected as such.
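
Something like this hypothetical helper captures the detection rule (not the actual reimplementation, just an illustration of the heuristic):

```js
// Attach a parsed child to its parent. The first child with a given name
// is stored as a plain value; a second sibling with the same name promotes
// the key to an array. A single item that is "really" an array element
// cannot be told apart from a scalar -- the limitation noted above.
function addChild(parent, name, child) {
  if (!(name in parent)) {
    parent[name] = child;
  } else if (Array.isArray(parent[name])) {
    parent[name].push(child);
  } else {
    parent[name] = [parent[name], child];
  }
}
```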

@nkbt
Author

nkbt commented Dec 11, 2015

To avoid leftovers in memory, I keep a reference to the parent in each tag until the tag is ended. When it ends, the current element is cleaned up and reassigned to its parent, as in the sketch below.

This relies on object mutability, but it was a reasonable trade-off for speed and memory.
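
A rough sketch of that scheme on top of a node-expat parser (assumed shape, not the actual code; `addChild` is the hypothetical helper from the previous comment):

```js
var expat = require('node-expat');
var parser = new expat.Parser('UTF-8');

// Only the chain of currently open elements is held in memory; once a
// tag ends, its subtree is folded into the parent and the back-reference
// is dropped so nothing keeps finished elements alive.
var current = { name: '#root', parent: null, children: {} };

parser.on('startElement', function (name) {
  current = { name: name, parent: current, children: {} };
});

parser.on('endElement', function (name) {
  var parent = current.parent;
  addChild(parent.children, name, current.children); // mutate parent in place
  current.parent = null; // clean up so the finished subtree can be GC'd
  current = parent;
});
```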

@stephengardner

@nkbt do you have a recommendation for a module that actually streams? I'm parsing a 1 GB file right now and am worried about what happens as things get larger.

@nkbt
Author

nkbt commented Oct 6, 2016

@stephengardner I used the plain XML parser that underlies this lib (node-expat) with custom parsing rules tailored to our data. It works perfectly, actually streams, and never uses more memory than is needed for any single extracted object. It easily goes through 350 MB XML files.

It is actually used in AWS Lambda, so it had to be super efficient.

Expat is surprisingly good and easy to use. I am not sure why anyone would need an extra wrapper around it.
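
A minimal sketch of using node-expat directly, assuming the objects of interest are `<record>` elements with simple leaf children (`record` and the `handleRecord` sink are hypothetical; real parsing rules would be tailored to your data):

```js
var fs = require('fs');
var expat = require('node-expat');

var parser = new expat.Parser('UTF-8');
var record = null; // the one object currently being extracted
var text = '';

parser.on('startElement', function (name) {
  if (name === 'record') record = {};
  text = '';
});

parser.on('text', function (chunk) {
  text += chunk;
});

parser.on('endElement', function (name) {
  if (record && name !== 'record') {
    record[name] = text.trim(); // simplistic: keep leaf text only
  }
  if (name === 'record') {
    handleRecord(record); // process and release; nothing accumulates
    record = null;
  }
  text = '';
});

parser.on('error', function (err) {
  console.error(err);
});

// node-expat's Parser is a writable stream, so the file can be piped in.
fs.createReadStream('big.xml').pipe(parser);
```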

@statico

statico commented Dec 5, 2018

I found this issue while debugging a memory leak when parsing a 10 GB XML file. It looks like xml-stream is indeed leaking memory somewhere. In my case it might be because I'm using pause() and resume() to manage back pressure (see the sketch below).
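
For reference, the pattern being described looks roughly like this (`xml` is an XmlStream instance as in the earlier sketch; `item` and the async `writeToDb` sink are hypothetical):

```js
xml.on('endElement: item', function (item) {
  xml.pause(); // stop emitting events while the slow sink catches up
  writeToDb(item, function (err) {
    if (err) throw err;
    xml.resume(); // continue parsing once the write has flushed
  });
});
```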


@AdnanCukur

Just experienced this too: xml.collect has a memory leak and will keep everything in memory.
