
Does not really stream, keeps objects in memory #58

Open
nkbt opened this issue Dec 11, 2015 · 8 comments

Comments

@nkbt

nkbt commented Dec 11, 2015

We used it for some time, until we found the node process crashing because of excessive memory usage.

After deeper investigation we realized that "streaming" a 250 MB XML file through xml-stream is not possible: node crashes somewhere in the first third of it.

Every time a new chunk of interesting data was found, memory increased by about 1 MB and never went back down.

I did not investigate what exactly leaks memory; instead I switched directly to node-expat. Now processing a 350 MB XML file takes at most ~400 MB of memory (it depends on the largest object passed through, usually up to 50 MB with rare spikes), but memory never grows unboundedly and it actually streams.

I am not sure I will have time to take a deeper look at the code and figure out what exactly holds on to extra objects.

Might be related to #16

@Artazor
Collaborator

Artazor commented Dec 11, 2015

Do you have any "collect" calls in your xml-stream configuration? Collecting can produce garbage if you collect items that appear both inside and outside of your nodes of interest. In that case you only need to collect on a more specific selector (one that includes the parent), as sketched below.
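
A minimal sketch of that fix, assuming a file of `<order>` elements containing `<item>` children (the names and file are made up for illustration, and the `parent > child` selector form is taken from the suggestion above):

```js
var fs = require('fs');
var XmlStream = require('xml-stream');

var xml = new XmlStream(fs.createReadStream('orders.xml'));

// Too broad: collects every <item> anywhere in the document,
// including ones outside the nodes of interest, so they pile up.
// xml.collect('item');

// More specific: include the parent in the selector so only <item>
// elements under <order> are collected.
xml.collect('order > item');

xml.on('endElement: order', function (order) {
  // order.item is an array of the collected child elements
  console.log(order.item.length);
});
```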

@Artazor
Collaborator

Artazor commented Dec 11, 2015

FYI, some users have reported that they were able to parse 2 GB files with xml-stream without any problems. Nevertheless, I have plans to rewrite xml-stream to make it a true stream, and more configurable.

@nkbt
Author

nkbt commented Dec 11, 2015

Sure, we did use collect, since most of the data we extract consists of collections.

Making it a true stream does not seem like anything crazy; I reimplemented it yesterday. I also added a feature to detect "arrays": when another tag with the same name appears at the same level, the parent key becomes an array (see the sketch below).

That does not work perfectly, since if there is only one item (which is meant to be an element of an array), it cannot be detected as such.
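
Something like this hypothetical helper captures the detection rule (not the actual reimplementation, just an illustration of the heuristic):

```js
// Attach a parsed child to its parent. The first child with a given name
// is stored as a plain value; a second sibling with the same name promotes
// the key to an array. A single item that is "really" an array element
// cannot be told apart from a scalar -- the limitation noted above.
function addChild(parent, name, child) {
  if (!(name in parent)) {
    parent[name] = child;
  } else if (Array.isArray(parent[name])) {
    parent[name].push(child);
  } else {
    parent[name] = [parent[name], child];
  }
}
```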

@nkbt
Author

nkbt commented Dec 11, 2015

To avoid leftovers in memory, I keep a reference to the parent in each tag until the tag is ended. When it ends, the current element is cleaned up and reassigned to its parent, as in the sketch below.

This relies on object mutability, but it was a reasonable trade-off for speed and memory.
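
A rough sketch of that scheme on top of a node-expat parser (assumed shape, not the actual code; `addChild` is the hypothetical helper from the previous comment):

```js
var expat = require('node-expat');
var parser = new expat.Parser('UTF-8');

// Only the chain of currently open elements is held in memory; once a
// tag ends, its subtree is folded into the parent and the back-reference
// is dropped so nothing keeps finished elements alive.
var current = { name: '#root', parent: null, children: {} };

parser.on('startElement', function (name) {
  current = { name: name, parent: current, children: {} };
});

parser.on('endElement', function (name) {
  var parent = current.parent;
  addChild(parent.children, name, current.children); // mutate parent in place
  current.parent = null; // clean up so the finished subtree can be GC'd
  current = parent;
});
```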

@stephengardner

@nkbt do you have a recommendation for a module that actually streams? I'm parsing a 1 GB file right now and am worried about what happens as things get larger.

@nkbt
Author

nkbt commented Oct 6, 2016

@stephengardner I used the plain XML parser that underlies this lib (node-expat) with custom parsing rules tailored to our data. It works perfectly, actually streams, and never uses more memory than is needed for any single extracted object. It easily goes through 350 MB XML files.

It is actually used in AWS Lambda, so it had to be super efficient.

Expat is surprisingly good and easy to use. I am not sure why anyone would need an extra wrapper around it.
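
A minimal sketch of using node-expat directly, assuming the objects of interest are `<record>` elements with simple leaf children (`record` and the `handleRecord` sink are hypothetical; real parsing rules would be tailored to your data):

```js
var fs = require('fs');
var expat = require('node-expat');

var parser = new expat.Parser('UTF-8');
var record = null; // the one object currently being extracted
var text = '';

parser.on('startElement', function (name) {
  if (name === 'record') record = {};
  text = '';
});

parser.on('text', function (chunk) {
  text += chunk;
});

parser.on('endElement', function (name) {
  if (record && name !== 'record') {
    record[name] = text.trim(); // simplistic: keep leaf text only
  }
  if (name === 'record') {
    handleRecord(record); // process and release; nothing accumulates
    record = null;
  }
  text = '';
});

parser.on('error', function (err) {
  console.error(err);
});

// node-expat's Parser is a writable stream, so the file can be piped in.
fs.createReadStream('big.xml').pipe(parser);
```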

@statico

statico commented Dec 5, 2018

I found this issue while debugging a memory leak when parsing a 10 GB XML file. It looks like xml-stream is indeed leaking memory somewhere. In my case it might be because I'm using pause() and resume() to manage back pressure (see the sketch below).
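
For reference, the pattern being described looks roughly like this (`xml` is an XmlStream instance as in the earlier sketch; `item` and the async `writeToDb` sink are hypothetical):

```js
xml.on('endElement: item', function (item) {
  xml.pause(); // stop emitting events while the slow sink catches up
  writeToDb(item, function (err) {
    if (err) throw err;
    xml.resume(); // continue parsing once the write has flushed
  });
});
```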


@AdnanCukur

Just experienced this too: xml.collect has a memory leak and will keep everything in memory.
