MARCReader obscure warning: couldn't find 0xa0 in g0=66 g1=69 #114
Comments
Ah, the input is UNIMARC. Does this make the report invalid? |
I didn't think UNIMARC was a problem. Is that the only error you see? Perhaps a codepoint was added to a MARC-8 character set that pymarc doesn't know about yet? It would sadden me a great deal to learn MARC-8 was still being actively developed. |
I don't think our data is so advanced! It might just be some control character entered by mistake, because this data endured some funny travels between various platforms. |
Actually it looks like pymarc doesn't have any character mappings for g0=66 or g1=69. https://github.com/edsu/pymarc/blob/master/pymarc/marc8_mapping.py Depending on what you are doing (a one off, or part of a workflow) you might want to consider converting your data to utf-8 with yaz-marcdump and then work with it from python? |
Ed Summers, 13/03/2018 23:24:
Depending on what you are doing (a one off, or part of a workflow) you
might want to consider converting your data to utf-8 with yaz-marcdump
<https://software.indexdata.com/yaz/doc/yaz-marcdump.html> and then work
with it from python?
Thank you a lot for the suggestion. It's been a while since I last used
yaz so I had neglected to consider it. I'll let you know how it goes (if
it's relevant for this report; feel free to close as invalid!).
|
After
|
That's weird. Why would it be processing MARC-8 if it had been converted to UTF-8? |
Smaller test case attached, from http://id.sbn.it/bid/BVE0764705
Note the UNIMARC has None of the yaz-marcdump conversion options which do something seem to help:
The yaz-marcdump default conversion to UTF-8 appears correct in itself (cf. https://lists.uni-bielefeld.de/mailman2/unibi/public/librecat-dev/2017-January/000175.html):
Sorry if I'm missing something obvious... |
The obscure warning is coming from lines 135-136 of the marc8.py file. Generally, this section:
It's unable to read the character, so it emits this message: "couldn't find 0x%x in g0=%s g1=%s\n", with the %x and %s replaced by the relevant values. That is really not helpful if you don't already know what the code is doing. A simple change on line 135 would make the error much more human friendly:
In my case, I was able to correct the single character that was giving me the problem: a math symbol that wasn't being read correctly or had been corrupted. |
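To illustrate what a more readable version of that message could look like, here is a minimal sketch. It is not pymarc's code and not the change proposed in the comment above; `CHARSET_NAMES`, `describe_charset` and `friendly_warning` are hypothetical names, and the two character-set names are the ones the LoC MARC-8 specification gives for 0x42 and 0x45.

```python
# Hypothetical helper, not part of pymarc: builds a more descriptive
# version of the "couldn't find 0x%x in g0=%s g1=%s" warning.

# Names taken from the LoC MARC-8 specification; only the two sets
# mentioned in this thread are listed here.
CHARSET_NAMES = {
    0x42: "Basic Latin (ASCII)",
    0x45: "Extended Latin (ANSEL)",
}


def describe_charset(code):
    """Render a character-set code as decimal, hex and a name."""
    name = CHARSET_NAMES.get(code, "unknown character set")
    return f"{code} (0x{code:02x}, {name})"


def friendly_warning(byte, g0, g1):
    """Return a warning naming the byte and both active character sets."""
    return (
        f"couldn't find byte 0x{byte:02x} in the active MARC-8 character sets: "
        f"g0={describe_charset(g0)}, g1={describe_charset(g1)}"
    )


print(friendly_warning(0xA0, 66, 69))
```

With the values from this report, the message spells out that the two sets are the ordinary defaults, which points at the byte rather than at the mappings.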
I tested your data and found that setting "to_unicode=True, force_utf8=True" when reading the file removes all of the "couldn't find" errors. From the MARCReader class docstring:
|
I think 66 (0x42) and 69 (0x45) are actually the default character sets:
per: https://www.loc.gov/marc/specifications/speccharmarc8.html#field066 Based on the comment above: #114 (comment) |
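The arithmetic behind that reading can be checked with a short sketch. It assumes the byte ranges described in the linked MARC-8 specification (G0 graphics in 0x21-0x7E, G1 graphics in 0xA1-0xFE); `final_character` and `graphic_area` are hypothetical helper names, not pymarc functions.

```python
# Hypothetical helpers for reading the numbers in the warning;
# the byte ranges are an assumption based on the MARC-8 specification.

def final_character(code):
    """Return the ASCII character for a character-set code, e.g. 66 -> 'B'."""
    return chr(code)


def graphic_area(byte):
    """Say which graphic area a byte falls in, or None if it is in neither."""
    if 0x21 <= byte <= 0x7E:
        return "G0"
    if 0xA1 <= byte <= 0xFE:
        return "G1"
    return None


for code in (66, 69):
    print(code, hex(code), final_character(code))

print(hex(0xA0), graphic_area(0xA0))
```

Under these assumptions 0xa0 sits just below the G1 range, so no G1 set could map it, whichever set is active. In Latin-1 the same byte is a non-breaking space, which may be what the records contain, but that is a guess rather than something this thread confirms.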
Not all, but a good portion of the records in the associated mrc file, when read, produce the warning "couldn't find 0xa0 in g0=66 g1=69". Is this expected?
IE001_MIL_EL_00017104.zip