Does not really stream, keeps objects in memory #58
Comments
Do you have any "collect" in your xml-stream configuration? It can produce garbage if you collect items that appear both inside and outside your nodes of interest. In that case you only need to collect on a more specific selector (including the parent in it).
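For illustration, a minimal sketch of that difference, assuming xml-stream's documented collect/on API; the file name and element names are made up, and the qualified selector syntax follows the maintainer's advice rather than a verified feature, so check your xml-stream version's selector support:

```js
var fs = require('fs');
var XmlStream = require('xml-stream');

var xml = new XmlStream(fs.createReadStream('data.xml'));

// Too broad: collects every <subitem>, even ones outside the <item> nodes
// we care about, so collected garbage can accumulate.
// xml.collect('subitem');

// More specific: qualify the selector with the parent, as suggested above
// (hypothetical selector syntax; verify against your xml-stream version).
xml.collect('item > subitem');

xml.on('endElement: item', function (item) {
  // item.subitem should now hold only the subitems inside this <item>
  console.log(item);
});
```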
FYI, some users reported that with xml-stream they were able to parse 2GB files without any problems. Nevertheless, I have plans to rewrite xml-stream to make it a true stream, and more configurable.
Sure, we did use collect, since most of the data we collect are collections. Making it a true stream does not seem like something crazy; I reimplemented it yesterday. I also have a feature to detect "arrays": when another tag with the same name appears at the same level, the parent becomes an array. That does not work perfectly, since a single item (which is meant to be an element of an array) cannot be detected.
To avoid leftovers in memory, I keep a reference to the parent in the tag until the tag is ended. When it ends, the current element is cleaned up and reassigned to its parent. This relies on object mutability, but it was a reasonable trade-off for speed and memory.
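A minimal sketch of the two ideas in that comment (the same-name-sibling array heuristic, and the parent back-reference that is dropped when a tag ends), built here on node-expat; the object shape and names ($parent, $attrs, emit) are illustrative assumptions, not the commenter's actual code:

```js
var expat = require('node-expat');

var parser = new expat.Parser('UTF-8');
var current = null; // element currently being built

function emit(obj) {
  // hand a completed top-level object to the consumer
  console.log(JSON.stringify(obj));
}

parser.on('startElement', function (name, attrs) {
  // keep a back-reference to the parent only while this tag is open
  current = { $parent: current, $attrs: attrs };
});

parser.on('endElement', function (name) {
  var parent = current.$parent;
  delete current.$parent;       // cleaned up...
  var finished = current;
  current = parent;

  if (parent === null) {
    emit(finished);
    return;
  }

  // ...and reassigned to its parent. Array heuristic: a second child with
  // the same name promotes the parent's value for that name to an array.
  // The caveat from the comment applies: a lone child cannot be told apart
  // from a one-element array.
  var existing = parent[name];
  if (existing === undefined) {
    parent[name] = finished;
  } else if (Array.isArray(existing)) {
    existing.push(finished);
  } else {
    parent[name] = [existing, finished];
  }
});

// (character data handling omitted for brevity)
```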
@nkbt do you have a recommendation for a module that actually streams? I'm parsing a 1GB file right now and am worried about what happens as files get larger.
@stephengardner I used the plain XML parser this lib uses underneath (node-expat) with custom parsing rules tailored to our data. It works perfectly, actually streams, and never uses more memory than is needed for any single extracted object. It easily goes through 350MB XMLs. It is actually used in an AWS Lambda, so it had to be super efficient. Expat is surprisingly good and easy to use; I'm not sure why anyone would need an extra wrapper around it.
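As a concrete illustration of that approach, here is a minimal sketch of pulling one kind of element out of a large file with node-expat directly. The event names (startElement, text, endElement) and piping into the parser are node-expat's documented API; the element name 'record' and the handler are assumptions for the example:

```js
var fs = require('fs');
var expat = require('node-expat');

var parser = new expat.Parser('UTF-8');
var depth = 0;      // depth inside the element being extracted
var record = null;  // the single object currently held in memory

parser.on('startElement', function (name, attrs) {
  if (name === 'record' && depth === 0) {
    record = { attrs: attrs, text: '' };
    depth = 1;
  } else if (depth > 0) {
    depth += 1;
  }
});

parser.on('text', function (text) {
  if (record) record.text += text;
});

parser.on('endElement', function (name) {
  if (depth > 0 && --depth === 0) {
    handleRecord(record); // consume it, then let it go
    record = null;        // nothing else is retained between records
  }
});

function handleRecord(r) {
  // ... do something with one extracted object ...
}

// node-expat's Parser is a writable stream, so a file can be piped into it
fs.createReadStream('big.xml').pipe(parser);
```

At any moment only the object currently being assembled is held, which matches the bounded-memory behaviour described in the comment.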
Just experienced this too: xml.collect has a memory leak and will keep everything in memory.
We used it for some time until we found the node process crashing because of excessive memory usage.
After deeper investigation we realized that "streaming" 250MB XMLs through xml-stream is not possible; somewhere in the first third of the file, node crashes.
Every time a new chunk of interesting data was found, memory increased by about 1MB and never went back down.
I did not investigate what exactly leaks memory and switched directly to node-expat; now node processing a 350MB XML takes at most 400MB of memory (depending on the largest object passed through, usually up to 50MB with rare spikes), but memory never grows unbounded and it actually streams.
I am not sure I will have time to take a deeper look at the code and figure out what exactly holds on to extra objects.
Might be related to #16
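For anyone reproducing these numbers, one hypothetical way to watch for this kind of growth is to sample Node's own process.memoryUsage() while the parse runs; a leak shows up as an RSS that climbs with each extracted chunk and never comes back down:

```js
// Log resident set size every few seconds during a long parse.
var sampler = setInterval(function () {
  var rssMb = Math.round(process.memoryUsage().rss / (1024 * 1024));
  console.log('rss: ' + rssMb + ' MB');
}, 5000);
sampler.unref(); // don't keep the process alive just for the sampler
```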