Problems with special encoded character #61

Open
vincentsaluzzo opened this issue Feb 23, 2016 · 2 comments

Comments

@vincentsaluzzo

I am parsing a large XML file (700 MB), and one line contains a special encoded character.
When xml-stream reaches this line, it fails with this error:

events.js:141
      throw er; // Unhandled 'error' event
      ^

Error: reference to invalid character number in line 12482025
    at parseChunk (/Users/.../node_modules/xml-stream/lib/xml-stream.js:514:26)
    at ReadStream.<anonymous> (/Users/.../node_modules/xml-stream/lib/xml-stream.js:521:7)
    at emitOne (events.js:77:13)
    at ReadStream.emit (events.js:169:7)
    at readableAddChunk (_stream_readable.js:146:16)
    at ReadStream.Readable.push (_stream_readable.js:110:10)
    at onread (fs.js:1744:12)
    at FSReqWrap.wrapper [as oncomplete] (fs.js:576:17)

Any ideas?
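
The message "reference to invalid character number" appears to be what the underlying parser reports for a numeric character reference (such as `&#x1A;`) that points at a code point XML 1.0 does not allow. A minimal sketch of one possible workaround follows, assuming it is acceptable to drop such references before parsing. The helper names `isValidXmlCodePoint` and `stripInvalidCharRefs` are hypothetical and not part of xml-stream.

```javascript
// Hypothetical helpers, not part of xml-stream.
// XML 1.0 permits only these code points:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
function isValidXmlCodePoint(cp) {
  return cp === 0x9 || cp === 0xA || cp === 0xD ||
    (cp >= 0x20 && cp <= 0xD7FF) ||
    (cp >= 0xE000 && cp <= 0xFFFD) ||
    (cp >= 0x10000 && cp <= 0x10FFFF);
}

// Remove numeric character references (&#NN; or &#xHH;) whose target
// is not a legal XML 1.0 character; leave every other reference alone.
function stripInvalidCharRefs(text) {
  return text.replace(/&#(x[0-9a-fA-F]+|[0-9]+);/g, (match, body) => {
    const cp = body[0] === 'x'
      ? parseInt(body.slice(1), 16)
      : parseInt(body, 10);
    return isValidXmlCodePoint(cp) ? match : '';
  });
}
```

This works on a complete string. On a chunked stream, a reference can be split across two chunks, so a streaming version would need to hold back any trailing partial reference until the next chunk arrives.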

@yufengyw

I ran into the same problem. My file has a line containing 2 & 3, and the parser throws an error when it gets there. What can I do about this?
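
If the line really does contain a bare `&`, the input is not well-formed XML, and one option is to escape such ampersands before parsing. The sketch below assumes the text contains no CDATA sections or comments (where a literal `&` is legal and should stay as-is). The function name `escapeBareAmpersands` is hypothetical.

```javascript
// Hypothetical helper, not part of xml-stream.
// Escape every '&' that does not begin an entity reference (&name;)
// or a numeric character reference (&#NN; or &#xHH;).
function escapeBareAmpersands(text) {
  return text.replace(
    /&(?!(?:[A-Za-z_:][\w.:-]*|#[0-9]+|#x[0-9a-fA-F]+);)/g,
    '&amp;'
  );
}
```

For example, `escapeBareAmpersands('2 & 3')` yields `2 &amp; 3`, while an already escaped `&amp;` is left untouched.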

@jvills

jvills commented Apr 12, 2016

I came here hoping to find an option for preserving the encoding in $text, but I'm interested in the above, too.

Similarly, I've used this to stream the contents of WordPress site backups, and I often run into errors like this early in the process; we end up chalking them up to data that needs cleaning. Microsoft's nonprinting control characters are probably the biggest problem, and we've hit them often enough that we built a find/replace tool with a list of them. You're seeing one seemingly random character, but there are many, many more like it.
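
A find/replace pass like the one described could be sketched as a Node.js `Transform` stream that sits in front of the parser. This is only an illustration: the class name `ControlCharStripper`, the helper `stripControlChars`, and the exact character list are assumptions (here, the C0 control characters that XML 1.0 forbids), not the list our tool used, and the input is assumed to be UTF-8.

```javascript
const { Transform } = require('stream');
const { StringDecoder } = require('string_decoder');

// Assumed list: C0 controls other than tab (0x09), LF (0x0A) and
// CR (0x0D), which are the ones XML 1.0 does not allow.
const FORBIDDEN_CONTROLS = /[\u0000-\u0008\u000B\u000C\u000E-\u001F]/g;

function stripControlChars(text) {
  return text.replace(FORBIDDEN_CONTROLS, '');
}

// Hypothetical Transform that cleans each chunk before it is parsed.
class ControlCharStripper extends Transform {
  constructor(options) {
    super(options);
    // StringDecoder keeps multi-byte UTF-8 sequences intact when they
    // are split across chunk boundaries.
    this.decoder = new StringDecoder('utf8');
  }

  _transform(chunk, encoding, callback) {
    callback(null, stripControlChars(this.decoder.write(chunk)));
  }

  _flush(callback) {
    // Emit whatever the decoder was still holding at end of input.
    callback(null, stripControlChars(this.decoder.end()));
  }
}
```

The idea would be to pipe the file's read stream through `new ControlCharStripper()` and hand the result to xml-stream in place of the raw stream.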

Control characters and encoded characters make some sense to me. I imagine each item is evaluated to see whether it contains an XML child, and that evaluation throws this error. The weirdest case to me is when an item in the XML contains many consecutive spaces (10 or more is enough): xml-stream spits out the same error. I wish I had the exact error lines so I could add them, but we ran across that case about this time last year, so they're lost in rather old logs. It could be unrelated, but if this does get fixed, I'd appreciate that being addressed as well.
