Add encoding option to marcxml.record_to_xml #105
Comments
I'm curious which MARC-based system you are using that rejected the record with the Unicode character entities. It seems to me that utf-8 should be the default encoding for XML so hard coding
We're using Invenio, but I don't know what its internals are doing, since I don't have access to the source code our vendor maintains. The system didn't reject the Unicode characters themselves. Rather, the output file ended up with a MIME encoding of us-ascii (as reported by the file --mime-encoding command), and the Invenio batch uploader module rejected it as improperly encoded. Agreed that utf-8 should be the default for XML encoding, which is why I find the function description in ET.tostring so strange:
If you have time to put together a pull request for the change and an accompanying test I would be grateful.
This might cover it, but I admit I am still new to writing tests and may have taken the wrong approach. https://github.com/dag-hammarskjold-library/pymarc/tree/marcxml-encode
Unless I am missing something about the marcxml functionality, the record_to_xml function returns text encoded in us-ascii, which causes problems when systems expect utf-8. Tracing the issue to its source showed that xml.etree.ElementTree.tostring takes an optional encoding parameter, which defaults to us-ascii. I propose adding an optional encoding parameter to marcxml.record_to_xml that is passed through to its call to ET.tostring.
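The default described above can be reproduced with the standard library alone. The following is a minimal sketch, not pymarc code: the element name and the sample text are made up for illustration, and it assumes Python 3, where ET.tostring returns bytes.

```python
import xml.etree.ElementTree as ET

# Stand-in element for illustration only; a real MARCXML tree would come from pymarc.
field = ET.Element("subfield", {"code": "a"})
field.text = "Dag Hammarskjöld"

# With no encoding argument, tostring() serializes as us-ascii and writes
# non-ASCII characters as numeric character references.
default_bytes = ET.tostring(field)

# With an explicit encoding, the same character is written as UTF-8 bytes.
utf8_bytes = ET.tostring(field, encoding="utf-8")

print(default_bytes)
print(utf8_bytes)
```

Both results are well-formed XML that parses back to the same text. The difference is only in the bytes on disk, which is what tools such as file --mime-encoding inspect.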
In my local fork, I have made the following change:
Without the change, my output for record_to_xml on UTF-8 strings that contain diacritics looks like this:
The resulting file ends up with a us-ascii encoding, which causes the record import to fail on the MARC-based system we are using.
With the change, I get output that looks like this when I pass the optional encoding:
I invoke as follows:
And the resulting file ends up with a utf-8 encoding.
Note that I tried forcing the encoding to utf-8 at each successive level, beginning with the open() function and working backward to the record itself. The only thing I found that actually works is passing an encoding parameter in this particular function. If I am missing something (obvious or not), I'd be interested in correcting my oversight.
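One plausible reason the open() approach has no effect can be sketched as follows. It assumes Python 3 and uses an in-memory buffer in place of a real file. The idea is that once tostring() has escaped a character with its default encoding, the text contains only ASCII, so a UTF-8 writer has nothing non-ASCII left to encode.

```python
import io
import xml.etree.ElementTree as ET

# Illustrative element, not taken from a real record.
elem = ET.Element("subfield")
elem.text = "é"

# The default serialization has already replaced the character
# with a numeric character reference.
serialized = ET.tostring(elem).decode("ascii")

# Writing through a UTF-8 text layer (as open(..., encoding="utf-8") would)
# cannot undo that escaping.
buffer = io.BytesIO()
writer = io.TextIOWrapper(buffer, encoding="utf-8")
writer.write(serialized)
writer.flush()
written = buffer.getvalue()

# Every byte is still in the ASCII range.
only_ascii = all(byte < 128 for byte in written)
print(only_ascii)
```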
The change looks trivial to me and preserves the default functionality, but I don't know if there are tests that depend on it.
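A regression test for a change of this kind might look roughly like the sketch below. The to_xml helper is a hypothetical stand-in for record_to_xml with the proposed optional parameter, not the actual pymarc function, and the test data is invented.

```python
import unittest
import xml.etree.ElementTree as ET


def to_xml(node, encoding="us-ascii"):
    # Hypothetical stand-in: forwards the optional encoding to ET.tostring.
    return ET.tostring(node, encoding=encoding)


class EncodingOptionTest(unittest.TestCase):
    def setUp(self):
        self.node = ET.Element("subfield", {"code": "a"})
        self.node.text = "Hammarskjöld"

    def test_default_is_unchanged(self):
        # Callers that pass no encoding should get the same bytes as before.
        self.assertEqual(to_xml(self.node), ET.tostring(self.node))

    def test_utf8_round_trips(self):
        # UTF-8 output should contain the raw character and parse back intact.
        data = to_xml(self.node, encoding="utf-8")
        self.assertIn("ö".encode("utf-8"), data)
        self.assertEqual(ET.fromstring(data).text, "Hammarskjöld")
```

Saved in a test module, these cases can be run with python -m unittest. The first one is the check that would catch an accidental change to the default behaviour.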