MARCReader obscure warning: couldn't find 0xa0 in g0=66 g1=69 #114

nemobis · 2018-03-12T20:40:21Z

Not all, but a good portion of the records in the associated mrc file, when read, produce the warning "couldn't find 0xa0 in g0=66 g1=69". Is this expected?

>>> record.as_marc()
'01182nam0 22003253i 450 001001100000005001700011010001800028100004100046101000800087102000700095181002000102182001100122200008100133205001700214210003400231215001800265225001000283300003100293300004800324410003200372500004800404676004100452700003700493702004000530790004800570801002800618850001900646950017800665977001300843\x1eMIL0864540\x1e20180302002150.0\x1e  \x1fa9788804642091\x1e  \x1fa20140730d2014    ||||0itac50      ba\x1e| \x1faita\x1e  \x1fait\x1e 1\x1f6z01\x1fai \x1fbxxxe  \x1e 1\x1f6z01\x1fan\x1e1 \x1faRaccolto di sangue\x1fe[thriller]\x1ffSharon Bolton\x1fgtraduzione di Manuela Faimali\x1e  \x1faEd. speciale\x1e  \x1faMilano\x1fcOscar Mondadori\x1fd2014\x1e  \x1fa453 p.\x1fd20 cm\x1e| \x1faOscar\x1e  \x1faIn copertina: Oscar estate\x1e  \x1faA pagina IV di copertina: ebook disponibile\x1e 0\x1f1001CFI0000102\x1f12001 \x1faOscar\x1e10\x1faHThe Iblood harvest\x1f3UBO3836087\x1f9RAVV580629\x1e  \x1fa823.92\x1f9Narrativa inglese. 2000-\x1fv22\x1e 1\x1faBolton\x1fb, S. J.\x1f3RAVV580629\x1f4070\x1e 1\x1faFaimali\x1fb, Manuela\x1f3LO1V356745\x1f4070\x1e 1\x1faBolton\x1fb, Sharon\x1f3CFIV315469\x1fzBolton, S. J.\x1e 3\x1faIT\x1fbIT-000000\x1fc20140730\x1e  \x1faIT-\x1faIT-MI0185\x1e 0\x1faArch. della  Produzione Editoriale della Lombardia\x1fc1 v.\x1fd ELAPE-M     F18                     4698\x1fe ELAPE0001648725  VMN                       1 v.\x1ffB \x1fh20141126\x1fi20141126\x1e  \x1fa EL\x1fa NB\x1e\x1d'
>>> record = reader.next()
>>> record = reader.next()
>>> record = reader.next()
>>> record = reader.next()
>>> record = reader.next()
couldn't find 0xa0 in g0=66 g1=69
couldn't find 0xa0 in g0=66 g1=69
>>> record.as_marc()
'01030nam0 22003013i 450 001001100000005001700011010001800028010001800046100004100064101001300105102000700118181002000125182001100145200004200156210002500198215001800223225001600241300003300257410003800290500004800328517003100376676004200407700004100449801002800490850001900518950017800537977001300715\x1eMIL0864555\x1e20180302002150.0\x1e  \x1fa9788856639339\x1e  \x1fa9788856646948\x1e  \x1fa20140730d2014    ||||0itac50      ba\x1e| \x1faita\x1fcita\x1e  \x1fait\x1e 1\x1f6z01\x1fai \x1fbxxxe  \x1e 1\x1f6z01\x1fan\x1e1 \x1faTutta mia la citt\xa9 \x1ffCarlotta Pistone\x1e  \x1faMilano\x1fcPiemme\x1fd2014\x1e  \x1fa306 p.\x1fd22 cm\x1e| \x1faPiemme voci\x1e  \x1faIn copertina: Milano in love\x1e 0\x1f1001CAG1804037\x1f12001 \x1faPiemme voci\x1e10\x1faTutta mia la citt\xa9 \x1f3LO11530364\x1f9RMLV077939\x1e1 \x1faMilano in love\x1f9BVE0684571\x1e  \x1fa853.92\x1f9Narrativa italiana. 2000-\x1fv22\x1e 1\x1faPistone\x1fb, Carlotta\x1f3RMLV077939\x1f4070\x1e 3\x1faIT\x1fbIT-000000\x1fc20140730\x1e  \x1faIT-\x1faIT-MI0185\x1e 0\x1faArch. della  Produzione Editoriale della Lombardia\x1fc1 v.\x1fd ELAPE-M     F18                     4725\x1fe ELAPE0001649025  VMN                       1 v.\x1ffB \x1fh20141126\x1fi20141126\x1e  \x1fa EL\x1fa NB\x1e\x1d'
>>> record.as_dict()
{'fields': [{'001': u'MIL0864555'}, {'005': u'20180302002150.0'}, {'010': {'ind1': u' ', 'subfields': [{u'a': u'9788856639339'}], 'ind2': u' '}}, {'010': {'ind1': u' ', 'subfields': [{u'a': u'9788856646948'}], 'ind2': u' '}}, {'100': {'ind1': u' ', 'subfields': [{u'a': u'20140730d2014    ||||0itac50      ba'}], 'ind2': u' '}}, {'101': {'ind1': u'|', 'subfields': [{u'a': u'ita'}, {u'c': u'ita'}], 'ind2': u' '}}, {'102': {'ind1': u' ', 'subfields': [{u'a': u'it'}], 'ind2': u' '}}, {'181': {'ind1': u' ', 'subfields': [{u'6': u'z01'}, {u'a': u'i '}, {u'b': u'xxxe  '}], 'ind2': u'1'}}, {'182': {'ind1': u' ', 'subfields': [{u'6': u'z01'}, {u'a': u'n'}], 'ind2': u'1'}}, {'200': {'ind1': u'1', 'subfields': [{u'a': u'Tutta mia la citt\xa9 '}, {u'f': u'Carlotta Pistone'}], 'ind2': u' '}}, {'210': {'ind1': u' ', 'subfields': [{u'a': u'Milano'}, {u'c': u'Piemme'}, {u'd': u'2014'}], 'ind2': u' '}}, {'215': {'ind1': u' ', 'subfields': [{u'a': u'306 p.'}, {u'd': u'22 cm'}], 'ind2': u' '}}, {'225': {'ind1': u'|', 'subfields': [{u'a': u'Piemme voci'}], 'ind2': u' '}}, {'300': {'ind1': u' ', 'subfields': [{u'a': u'In copertina: Milano in love'}], 'ind2': u' '}}, {'410': {'ind1': u' ', 'subfields': [{u'1': u'001CAG1804037'}, {u'1': u'2001 '}, {u'a': u'Piemme voci'}], 'ind2': u'0'}}, {'500': {'ind1': u'1', 'subfields': [{u'a': u'Tutta mia la citt\xa9 '}, {u'3': u'LO11530364'}, {u'9': u'RMLV077939'}], 'ind2': u'0'}}, {'517': {'ind1': u'1', 'subfields': [{u'a': u'Milano in love'}, {u'9': u'BVE0684571'}], 'ind2': u' '}}, {'676': {'ind1': u' ', 'subfields': [{u'a': u'853.92'}, {u'9': u'Narrativa italiana. 
2000-'}, {u'v': u'22'}], 'ind2': u' '}}, {'700': {'ind1': u' ', 'subfields': [{u'a': u'Pistone'}, {u'b': u', Carlotta'}, {u'3': u'RMLV077939'}, {u'4': u'070'}], 'ind2': u'1'}}, {'801': {'ind1': u' ', 'subfields': [{u'a': u'IT'}, {u'b': u'IT-000000'}, {u'c': u'20140730'}], 'ind2': u'3'}}, {'850': {'ind1': u' ', 'subfields': [{u'a': u'IT-'}, {u'a': u'IT-MI0185'}], 'ind2': u' '}}, {'950': {'ind1': u' ', 'subfields': [{u'a': u'Arch. della  Produzione Editoriale della Lombardia'}, {u'c': u'1 v.'}, {u'd': u' ELAPE-M     F18                     4725'}, {u'e': u' ELAPE0001649025  VMN                       1 v.'}, {u'f': u'B '}, {u'h': u'20141126'}, {u'i': u'20141126'}], 'ind2': u'0'}}, {'977': {'ind1': u' ', 'subfields': [{u'a': u' EL'}, {u'a': u' NB'}], 'ind2': u' '}}], 'leader': u'01030nam0 22003013i 450 '}

IE001_MIL_EL_00017104.zip


nemobis · 2018-03-13T17:21:48Z

Ah, the input is UNIMARC. Does this make the report invalid?

edsu · 2018-03-13T21:10:55Z

I didn't think UNIMARC was a problem. Is that the only error you see? Perhaps a codepoint was added to a MARC-8 character set that pymarc doesn't know about yet? It would sadden me a great deal to learn MARC-8 was still being actively developed.

nemobis · 2018-03-13T21:17:30Z

I don't think our data is so advanced! It might just be some control character entered by mistake, because this data has endured some funny travels between various platforms.

edsu · 2018-03-13T21:24:12Z

Actually it looks like pymarc doesn't have any character mappings for g0=66 or g1=69.

https://github.com/edsu/pymarc/blob/master/pymarc/marc8_mapping.py

Depending on what you are doing (a one off, or part of a workflow) you might want to consider converting your data to utf-8 with yaz-marcdump and then work with it from python?

nemobis · 2018-03-13T21:30:47Z

Ed Summers, 13/03/2018 23:24:

Depending on what you are doing (a one off, or part of a workflow) you might want to consider converting your data to utf-8 with yaz-marcdump <https://software.indexdata.com/yaz/doc/yaz-marcdump.html> and then work with it from python?

Thank you a lot for the suggestion. It's been a while since I last used yaz so I had neglected to consider it. I'll let you know how it goes (if it's relevant for this report; feel free to close as invalid!).

nemobis · 2018-03-15T19:04:14Z

After yaz-marcdump -i marc -f marc8 -t utf8 -o marc I get

  11710 couldn't find 0xaf in g0=66 g1=69
   7844 couldn't find 0x80 in g0=66 g1=69
   3205 couldn't find 0xbf in g0=66 g1=69
   1335 couldn't find 0xca in g0=66 g1=69
   1175 couldn't find 0xa0 in g0=66 g1=69
   1042 couldn't find 0xcc in g0=66 g1=69
    299 couldn't find 0xbb in g0=66 g1=69
    122 couldn't find 0xbe in g0=66 g1=69
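
A tally like the one above can be produced by saving the warnings pymarc writes to stderr and counting identical byte values. A minimal sketch in Python 3, assuming the warnings were captured one per line; tally_warnings is a hypothetical helper name, not part of pymarc:

```python
import re
from collections import Counter

# Matches the warning text that pymarc writes to stderr.
WARNING_RE = re.compile(r"couldn't find (0x[0-9a-f]+) in g0=(\d+) g1=(\d+)")


def tally_warnings(lines):
    """Count how often each unmapped byte value appears in the warnings."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative input only; real input would be the captured stderr lines.
sample = [
    "couldn't find 0xa0 in g0=66 g1=69",
    "couldn't find 0xaf in g0=66 g1=69",
    "couldn't find 0xa0 in g0=66 g1=69",
]
for byte, count in tally_warnings(sample).most_common():
    print(f"{count:7d} couldn't find {byte}")
```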

edsu · 2018-03-15T20:53:37Z

That's weird. Why would it be processing MARC-8 if it had been converted to UTF-8?

nemobis · 2018-04-24T19:46:03Z

Smaller test case attached, from http://id.sbn.it/bid/BVE0764705

>>> from pymarc import MARCReader
>>> print(MARCReader(open('BVE0764705.marc21.mrc', 'rb')).next().get_fields('650')[0].subfields[3])
Attività professionale
>>> print(MARCReader(open('BVE0764705.unimarc.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
couldn't find 0xa0 in g0=66 g1=69
Attivit©  professionale

Note the UNIMARC file has 0 in Leader/09, neither a space nor "a" (cf. https://www.loc.gov/marc/bibliographic/bdleader.html ).
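
Leader positions are conventionally counted from zero, so it may help to check the offset directly. A minimal sketch, using the leader from the as_dict() output quoted earlier in this thread rather than the attached test file (which is not reproduced here); leader_position is a hypothetical helper, not a pymarc function:

```python
def leader_position(leader, index=9):
    """Return the single character at a zero-based leader offset."""
    if len(leader) != 24:
        raise ValueError("a MARC leader is exactly 24 characters long")
    return leader[index]


# Leader copied from the as_dict() output quoted earlier in this thread.
leader = "01030nam0 22003013i 450 "
print(repr(leader_position(leader, 8)))
print(repr(leader_position(leader, 9)))
```

For that earlier leader, the 0 sits at offset 8 and offset 9 is a blank. Whether the same holds for the attached file depends on which counting convention was used above.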

None of the yaz-marcdump conversion options which do something seem to help:

$ for code in iso5426 iso8859-1 marc8; do yaz-marcdump -i marc -o marc -t utf8 -f $code BVE0764705.unimarc.mrc > BVE0764705.unimarc.$code.mrc.new ; done
$ python 
>>> print(MARCReader(open('BVE0764705.unimarc.iso5426.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
couldn't find 0xcc in g0=66 g1=69
Attivit  professionale
>>> print(MARCReader(open('BVE0764705.unimarc.marc8.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
Attivit℗♭ professionale
>>> print(MARCReader(open('BVE0764705.unimarc.iso8859-1.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
couldn't find 0xa0 in g0=66 g1=69
Attivit©℗  professionale

The yaz-marcdump default conversion to UTF-8 appears correct in itself (cf. https://lists.uni-bielefeld.de/mailman2/unibi/public/librecat-dev/2017-January/000175.html):

$ yaz-marcdump -o marcxml -t utf8 BVE0764705.unimarc.mrc | grep 606 -A 4
  <datafield tag="606" ind1=" " ind2=" ">
    <subfield code="a">Operatori turistici</subfield>
    <subfield code="x">Attività professionale</subfield>
    <subfield code="2">FN </subfield>
    <subfield code="3">IT\ICCU\MILC\267308</subfield>
$ yaz-marcdump -t utf8 BVE0764705.unimarc.mrc | grep 606
606    $a Operatori turistici $x Attività professionale $2 FN  $3 IT\ICCU\MILC\267308
$ yaz-marcdump BVE0764705.unimarc.mrc | grep 606
606    $a Operatori turistici $x Attività professionale $2 FN  $3 IT\ICCU\MILC\267308

Sorry if I'm missing something obvious...

BVE0764705.marc21.mrc.gz
BVE0764705.unimarc.mrc.gz

josephalway · 2018-12-12T23:26:24Z

The obscure warning is coming from lines 135-136 of the marc8.py file.

Generally, this section:

            try:
                if code_point > 0x80 and not mb_flag:
                    (uni, cflag) = marc8_mapping.CODESETS[self.g1][code_point]
                else:
                    (uni, cflag) = marc8_mapping.CODESETS[self.g0][code_point]
            except KeyError:
                try:
                    uni = marc8_mapping.ODD_MAP[code_point]
                    uni_list.append(unichr(uni))
                    # we can short circuit because we know these mappings
                    # won't be involved in combinings.  (i hope?)
                    continue
                except KeyError:
                    pass
                if not self.quiet:
                    sys.stderr.write("couldn't find 0x%x in g0=%s g1=%s\n" %
                        (code_point, self.g0, self.g1))

It's unable to map the byte to a character and prints this bit of information: "couldn't find 0x%x in g0=%s g1=%s\n", with %x replaced by the code point and the two %s replaced by the active G0 and G1 character sets. That is really not helpful if you don't already know what the decoder is doing.

A simple change on line 135 would make the error much more human friendly:
sys.stderr.write("Unable to read character, couldn't find 0x%x in g0=%s g1=%s\n" %

In my case, I was able to correct the single character that was giving me the problem. A math symbol that wasn't being read correctly or had been corrupted.
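
The lookup-then-warn flow quoted above can be sketched in isolation. This is a simplified illustration in Python 3, not pymarc's implementation: the tables below are toy stand-ins for marc8_mapping.CODESETS and marc8_mapping.ODD_MAP, the multibyte flag is left out, and lookup is a hypothetical function name. It uses the friendlier message proposed above:

```python
import sys

# Toy tables for illustration only; pymarc's real tables are far larger.
CODESETS = {
    0x42: {0x41: (0x0041, 0)},  # Basic Latin: "A"
    0x45: {0xC3: (0x00A9, 0)},  # Extended Latin (ANSEL): copyright mark
}
ODD_MAP = {}


def lookup(code_point, g0=0x42, g1=0x45, quiet=False):
    """Return the mapped character, or None after warning if unmapped."""
    # High bytes are looked up in G1, everything else in G0.
    charset = g1 if code_point > 0x80 else g0
    try:
        uni, _cflag = CODESETS[charset][code_point]
        return chr(uni)
    except KeyError:
        pass
    # Second chance: a small table of known oddities.
    if code_point in ODD_MAP:
        return chr(ODD_MAP[code_point])
    if not quiet:
        sys.stderr.write(
            "Unable to read character, couldn't find 0x%x in g0=%s g1=%s\n"
            % (code_point, g0, g1)
        )
    return None
```

With these toy tables, lookup(0xC3) finds the copyright mark while lookup(0xA0) falls through both tables and triggers the warning.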

josephalway · 2018-12-12T23:39:16Z

import pymarc as pym

with open('C:\\Users\\MY_USER\\Downloads\\IE001_MIL_EL_00017104\\IE001_MIL_EL_00017104.mrc', 'rb') as fh:
    reader = pym.MARCReader(fh, to_unicode=True, force_utf8=True)
    for record in reader:
        for field in record.get_fields('020'):
            # Print subfield a (the ISBN) when present, otherwise a placeholder
            if field['a'] is not None:
                print(field['a'])
            else:
                print('No ISBN')

I tested your data and found that setting "to_unicode=True, force_utf8=True" when reading the file removes all of the "couldn't find" errors.

From the MARCReader class docstring:

If you find yourself in the unfortunate position of having data that
is utf-8 encoded without the leader set appropriately you can use
the force_utf8 parameter:

    reader = MARCReader(file('file.dat'), to_unicode=True,
        force_utf8=True)

tfmorris · 2019-12-11T18:05:13Z

Actually it looks like pymarc doesn't have any character mappings for g0=66 or g1=69.

I think 66 (0x42) and 69 (0x45) are actually the default character sets:

42(hex) [ASCII graphic: B] = Basic Latin (ASCII)
21(hex)45(hex) [ASCII graphics: !E] = Extended Latin (ANSEL) (the 21(hex) technically is a second character of the Intermediate segment of this escape sequence.)

per: https://www.loc.gov/marc/specifications/speccharmarc8.html#field066

Based on the comment above: #114 (comment)
it sounds like the MARC file contains UTF-8 encoded characters.
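
A quick way to see why UTF-8 bytes read as MARC-8 would produce exactly this symptom: "à" is the two bytes 0xC3 0xA0 in UTF-8, 0xC3 is the copyright mark in ANSEL, and 0xA0 has no ANSEL mapping. A minimal sketch in Python 3, assuming a one-entry excerpt of the ANSEL table and a space as the placeholder for unmapped bytes (which matches the output shown earlier in the thread); misread_as_marc8 is a hypothetical helper, not pymarc code:

```python
# One-entry excerpt of the ANSEL (MARC-8 Extended Latin) table, enough for this demo.
ANSEL_EXCERPT = {0xC3: "\u00a9"}  # 0xC3 is the copyright mark in ANSEL


def misread_as_marc8(text):
    """Encode text as UTF-8, then map each byte the way a MARC-8 reader might."""
    out = []
    for byte in text.encode("utf-8"):
        if byte < 0x80:
            out.append(chr(byte))  # ASCII passes through unchanged
        elif byte in ANSEL_EXCERPT:
            out.append(ANSEL_EXCERPT[byte])
        else:
            out.append(" ")  # assumed placeholder for an unmapped byte
    return "".join(out)


print("à".encode("utf-8"))
print(misread_as_marc8("Attività professionale"))
```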

nemobis mentioned this issue Dec 11, 2019

Broken diacritics in MARC8 binary imports internetarchive/openlibrary#713

Closed
