This repository has been archived by the owner on Feb 4, 2020. It is now read-only.

MARCReader obscure warning: couldn't find 0xa0 in g0=66 g1=69 #114

Open
nemobis opened this issue Mar 12, 2018 · 11 comments

Comments

@nemobis
Contributor

nemobis commented Mar 12, 2018

Not all, but a good portion of the records in the associated mrc file, when read, produce the warning "couldn't find 0xa0 in g0=66 g1=69". Is this expected?

>>> record.as_marc()
'01182nam0 22003253i 450 001001100000005001700011010001800028100004100046101000800087102000700095181002000102182001100122200008100133205001700214210003400231215001800265225001000283300003100293300004800324410003200372500004800404676004100452700003700493702004000530790004800570801002800618850001900646950017800665977001300843\x1eMIL0864540\x1e20180302002150.0\x1e  \x1fa9788804642091\x1e  \x1fa20140730d2014    ||||0itac50      ba\x1e| \x1faita\x1e  \x1fait\x1e 1\x1f6z01\x1fai \x1fbxxxe  \x1e 1\x1f6z01\x1fan\x1e1 \x1faRaccolto di sangue\x1fe[thriller]\x1ffSharon Bolton\x1fgtraduzione di Manuela Faimali\x1e  \x1faEd. speciale\x1e  \x1faMilano\x1fcOscar Mondadori\x1fd2014\x1e  \x1fa453 p.\x1fd20 cm\x1e| \x1faOscar\x1e  \x1faIn copertina: Oscar estate\x1e  \x1faA pagina IV di copertina: ebook disponibile\x1e 0\x1f1001CFI0000102\x1f12001 \x1faOscar\x1e10\x1faHThe Iblood harvest\x1f3UBO3836087\x1f9RAVV580629\x1e  \x1fa823.92\x1f9Narrativa inglese. 2000-\x1fv22\x1e 1\x1faBolton\x1fb, S. J.\x1f3RAVV580629\x1f4070\x1e 1\x1faFaimali\x1fb, Manuela\x1f3LO1V356745\x1f4070\x1e 1\x1faBolton\x1fb, Sharon\x1f3CFIV315469\x1fzBolton, S. J.\x1e 3\x1faIT\x1fbIT-000000\x1fc20140730\x1e  \x1faIT-\x1faIT-MI0185\x1e 0\x1faArch. della  Produzione Editoriale della Lombardia\x1fc1 v.\x1fd ELAPE-M     F18                     4698\x1fe ELAPE0001648725  VMN                       1 v.\x1ffB \x1fh20141126\x1fi20141126\x1e  \x1fa EL\x1fa NB\x1e\x1d'
>>> record = reader.next()
>>> record = reader.next()
>>> record = reader.next()
>>> record = reader.next()
>>> record = reader.next()
couldn't find 0xa0 in g0=66 g1=69
couldn't find 0xa0 in g0=66 g1=69
>>> record.as_marc()
'01030nam0 22003013i 450 001001100000005001700011010001800028010001800046100004100064101001300105102000700118181002000125182001100145200004200156210002500198215001800223225001600241300003300257410003800290500004800328517003100376676004200407700004100449801002800490850001900518950017800537977001300715\x1eMIL0864555\x1e20180302002150.0\x1e  \x1fa9788856639339\x1e  \x1fa9788856646948\x1e  \x1fa20140730d2014    ||||0itac50      ba\x1e| \x1faita\x1fcita\x1e  \x1fait\x1e 1\x1f6z01\x1fai \x1fbxxxe  \x1e 1\x1f6z01\x1fan\x1e1 \x1faTutta mia la citt\xa9 \x1ffCarlotta Pistone\x1e  \x1faMilano\x1fcPiemme\x1fd2014\x1e  \x1fa306 p.\x1fd22 cm\x1e| \x1faPiemme voci\x1e  \x1faIn copertina: Milano in love\x1e 0\x1f1001CAG1804037\x1f12001 \x1faPiemme voci\x1e10\x1faTutta mia la citt\xa9 \x1f3LO11530364\x1f9RMLV077939\x1e1 \x1faMilano in love\x1f9BVE0684571\x1e  \x1fa853.92\x1f9Narrativa italiana. 2000-\x1fv22\x1e 1\x1faPistone\x1fb, Carlotta\x1f3RMLV077939\x1f4070\x1e 3\x1faIT\x1fbIT-000000\x1fc20140730\x1e  \x1faIT-\x1faIT-MI0185\x1e 0\x1faArch. della  Produzione Editoriale della Lombardia\x1fc1 v.\x1fd ELAPE-M     F18                     4725\x1fe ELAPE0001649025  VMN                       1 v.\x1ffB \x1fh20141126\x1fi20141126\x1e  \x1fa EL\x1fa NB\x1e\x1d'
>>> record.as_dict()
{'fields': [{'001': u'MIL0864555'}, {'005': u'20180302002150.0'}, {'010': {'ind1': u' ', 'subfields': [{u'a': u'9788856639339'}], 'ind2': u' '}}, {'010': {'ind1': u' ', 'subfields': [{u'a': u'9788856646948'}], 'ind2': u' '}}, {'100': {'ind1': u' ', 'subfields': [{u'a': u'20140730d2014    ||||0itac50      ba'}], 'ind2': u' '}}, {'101': {'ind1': u'|', 'subfields': [{u'a': u'ita'}, {u'c': u'ita'}], 'ind2': u' '}}, {'102': {'ind1': u' ', 'subfields': [{u'a': u'it'}], 'ind2': u' '}}, {'181': {'ind1': u' ', 'subfields': [{u'6': u'z01'}, {u'a': u'i '}, {u'b': u'xxxe  '}], 'ind2': u'1'}}, {'182': {'ind1': u' ', 'subfields': [{u'6': u'z01'}, {u'a': u'n'}], 'ind2': u'1'}}, {'200': {'ind1': u'1', 'subfields': [{u'a': u'Tutta mia la citt\xa9 '}, {u'f': u'Carlotta Pistone'}], 'ind2': u' '}}, {'210': {'ind1': u' ', 'subfields': [{u'a': u'Milano'}, {u'c': u'Piemme'}, {u'd': u'2014'}], 'ind2': u' '}}, {'215': {'ind1': u' ', 'subfields': [{u'a': u'306 p.'}, {u'd': u'22 cm'}], 'ind2': u' '}}, {'225': {'ind1': u'|', 'subfields': [{u'a': u'Piemme voci'}], 'ind2': u' '}}, {'300': {'ind1': u' ', 'subfields': [{u'a': u'In copertina: Milano in love'}], 'ind2': u' '}}, {'410': {'ind1': u' ', 'subfields': [{u'1': u'001CAG1804037'}, {u'1': u'2001 '}, {u'a': u'Piemme voci'}], 'ind2': u'0'}}, {'500': {'ind1': u'1', 'subfields': [{u'a': u'Tutta mia la citt\xa9 '}, {u'3': u'LO11530364'}, {u'9': u'RMLV077939'}], 'ind2': u'0'}}, {'517': {'ind1': u'1', 'subfields': [{u'a': u'Milano in love'}, {u'9': u'BVE0684571'}], 'ind2': u' '}}, {'676': {'ind1': u' ', 'subfields': [{u'a': u'853.92'}, {u'9': u'Narrativa italiana. 
2000-'}, {u'v': u'22'}], 'ind2': u' '}}, {'700': {'ind1': u' ', 'subfields': [{u'a': u'Pistone'}, {u'b': u', Carlotta'}, {u'3': u'RMLV077939'}, {u'4': u'070'}], 'ind2': u'1'}}, {'801': {'ind1': u' ', 'subfields': [{u'a': u'IT'}, {u'b': u'IT-000000'}, {u'c': u'20140730'}], 'ind2': u'3'}}, {'850': {'ind1': u' ', 'subfields': [{u'a': u'IT-'}, {u'a': u'IT-MI0185'}], 'ind2': u' '}}, {'950': {'ind1': u' ', 'subfields': [{u'a': u'Arch. della  Produzione Editoriale della Lombardia'}, {u'c': u'1 v.'}, {u'd': u' ELAPE-M     F18                     4725'}, {u'e': u' ELAPE0001649025  VMN                       1 v.'}, {u'f': u'B '}, {u'h': u'20141126'}, {u'i': u'20141126'}], 'ind2': u'0'}}, {'977': {'ind1': u' ', 'subfields': [{u'a': u' EL'}, {u'a': u' NB'}], 'ind2': u' '}}], 'leader': u'01030nam0 22003013i 450 '}

IE001_MIL_EL_00017104.zip

@nemobis
Contributor Author

nemobis commented Mar 13, 2018

Ah, the input is UNIMARC. Does this make the report invalid?

@edsu
Owner

edsu commented Mar 13, 2018

I didn't think UNIMARC was a problem. Is that the only error you see? Perhaps a codepoint was added to a MARC-8 character set that pymarc doesn't know about yet? It would sadden me a great deal to learn MARC-8 was still being actively developed.

@nemobis
Contributor Author

nemobis commented Mar 13, 2018

I don't think our data is so advanced! It might just be some control character entered by mistake, because this data endured some funny travels between various platforms.

@edsu
Owner

edsu commented Mar 13, 2018

Actually it looks like pymarc doesn't have any character mappings for g0=66 or g1=69.

https://github.com/edsu/pymarc/blob/master/pymarc/marc8_mapping.py

Depending on what you are doing (a one off, or part of a workflow) you might want to consider converting your data to utf-8 with yaz-marcdump and then work with it from python?

@nemobis
Contributor Author

nemobis commented Mar 13, 2018 via email

@nemobis
Contributor Author

nemobis commented Mar 15, 2018

After yaz-marcdump -i marc -f marc8 -t utf8 -o marc I get

  11710 couldn't find 0xaf in g0=66 g1=69
   7844 couldn't find 0x80 in g0=66 g1=69
   3205 couldn't find 0xbf in g0=66 g1=69
   1335 couldn't find 0xca in g0=66 g1=69
   1175 couldn't find 0xa0 in g0=66 g1=69
   1042 couldn't find 0xcc in g0=66 g1=69
    299 couldn't find 0xbb in g0=66 g1=69
    122 couldn't find 0xbe in g0=66 g1=69
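
A tally like the one above can be reproduced by capturing the warnings pymarc writes to stderr and counting the distinct lines. This is only a minimal sketch, not the command actually used here: `tally_warnings` is a hypothetical helper name, and the only thing it assumes is the warning format quoted in this issue.

```python
import re
from collections import Counter

# Matches the warning text quoted in this issue, e.g.
# "couldn't find 0xa0 in g0=66 g1=69".
WARNING_RE = re.compile(r"couldn't find (0x[0-9a-f]+) in g0=(\d+) g1=(\d+)")


def tally_warnings(lines):
    """Count warnings per (code point, g0, g1) triple; other lines are ignored."""
    counts = Counter()
    for line in lines:
        match = WARNING_RE.search(line)
        if match:
            counts[match.groups()] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical captured stderr lines, for illustration only.
    sample = [
        "couldn't find 0xa0 in g0=66 g1=69",
        "couldn't find 0xaf in g0=66 g1=69",
        "couldn't find 0xa0 in g0=66 g1=69",
    ]
    for (code_point, g0, g1), n in tally_warnings(sample).most_common():
        print(f"{n:7d} couldn't find {code_point} in g0={g0} g1={g1}")
```

Grouping by code point makes it easier to see which bytes are unmapped most often, which in turn hints at whether the data is in a different encoding than the decoder assumes.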

@edsu
Owner

edsu commented Mar 15, 2018

That's weird. Why would it be processing MARC-8 if it had been converted to UTF-8?

@nemobis
Contributor Author

nemobis commented Apr 24, 2018

Smaller test case attached, from http://id.sbn.it/bid/BVE0764705

>>> from pymarc import MARCReader
>>> print(MARCReader(open('BVE0764705.marc21.mrc', 'rb')).next().get_fields('650')[0].subfields[3])
Attività professionale
>>> print(MARCReader(open('BVE0764705.unimarc.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
couldn't find 0xa0 in g0=66 g1=69
Attivit©  professionale

Note that the UNIMARC record has 0 in Leader/09, neither a space nor "a" (cf. https://www.loc.gov/marc/bibliographic/bdleader.html ).
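
For anyone wanting to check Leader/09 without going through pymarc, the byte can be read straight from the raw record. In MARC 21 a blank there means MARC-8 and "a" means UCS/Unicode. The sketch below uses only the standard library; both function names are made up for illustration, and the interpretation is MARC 21's, which is not assumed to carry over to UNIMARC.

```python
def leader_coding_scheme(raw_record: bytes) -> str:
    """Return Leader/09 of a raw ISO 2709 record as a one-character string."""
    if len(raw_record) < 24:
        raise ValueError("a MARC leader is 24 bytes long")
    return chr(raw_record[9])


def describe_marc21_coding(flag: str) -> str:
    """Interpret Leader/09 under MARC 21 rules only."""
    if flag == " ":
        return "MARC-8"
    if flag == "a":
        return "UCS/Unicode"
    return "undefined in MARC 21"


if __name__ == "__main__":
    # Leader copied from the second record quoted at the top of this issue.
    leader = b"01030nam0 22003013i 450 "
    flag = leader_coding_scheme(leader)
    print(repr(flag), "->", describe_marc21_coding(flag))
```

For the leader quoted at the top of this issue, position 9 is a space, so software applying MARC 21 rules would treat that record as MARC-8.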

None of the yaz-marcdump conversion options which do something seem to help:

$ for code in iso5426 iso8859-1 marc8; do yaz-marcdump -i marc -o marc -t utf8 -f $code BVE0764705.unimarc.mrc > BVE0764705.unimarc.$code.mrc.new ; done
$ python 
>>> print(MARCReader(open('BVE0764705.unimarc.iso5426.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
couldn't find 0xcc in g0=66 g1=69
Attivit  professionale
>>> print(MARCReader(open('BVE0764705.unimarc.marc8.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
Attivit℗♭ professionale
>>> print(MARCReader(open('BVE0764705.unimarc.iso8859-1.mrc', 'rb')).next().get_fields('606')[0].subfields[3])
couldn't find 0xa0 in g0=66 g1=69
Attivit©℗  professionale

The yaz-marcdump default conversion to UTF-8 appears correct in itself (cf. https://lists.uni-bielefeld.de/mailman2/unibi/public/librecat-dev/2017-January/000175.html):

$ yaz-marcdump -o marcxml -t utf8 BVE0764705.unimarc.mrc | grep 606 -A 4
  <datafield tag="606" ind1=" " ind2=" ">
    <subfield code="a">Operatori turistici</subfield>
    <subfield code="x">Attività professionale</subfield>
    <subfield code="2">FN </subfield>
    <subfield code="3">IT\ICCU\MILC\267308</subfield>
$ yaz-marcdump -t utf8 BVE0764705.unimarc.mrc | grep 606
606    $a Operatori turistici $x Attività professionale $2 FN  $3 IT\ICCU\MILC\267308
$ yaz-marcdump BVE0764705.unimarc.mrc | grep 606
606    $a Operatori turistici $x Attività professionale $2 FN  $3 IT\ICCU\MILC\267308

Sorry if I'm missing something obvious...

BVE0764705.marc21.mrc.gz
BVE0764705.unimarc.mrc.gz

@josephalway

The obscure warning is coming from lines 135-136 of the marc8.py file.

Generally, this section:

            try:
                if code_point > 0x80 and not mb_flag:
                    (uni, cflag) = marc8_mapping.CODESETS[self.g1][code_point]
                else:
                    (uni, cflag) = marc8_mapping.CODESETS[self.g0][code_point]
            except KeyError:
                try:
                    uni = marc8_mapping.ODD_MAP[code_point]
                    uni_list.append(unichr(uni))
                    # we can short circuit because we know these mappings
                    # won't be involved in combinings.  (i hope?)
                    continue
                except KeyError:
                    pass
                if not self.quiet:
                    sys.stderr.write("couldn't find 0x%x in g0=%s g1=%s\n" %
                        (code_point, self.g0, self.g1))

The decoder can't map the byte to a character, so it prints the format string "couldn't find 0x%x in g0=%s g1=%s\n", with %x replaced by the byte value and the two %s replaced by the current g0 and g1 character set codes. That is really not helpful if you don't already know what the decoder is doing.

A simple change on line 135 would make the error much more human friendly:
sys.stderr.write("Unable to read character, couldn't find 0x%x in g0=%s g1=%s\n" %

In my case, I was able to correct the single character that was giving me the problem. It was a math symbol that either wasn't being read correctly or had been corrupted.

@josephalway

josephalway commented Dec 12, 2018

import pymarc as pym

with open('C:\\Users\\MY_USER\\Downloads\\IE001_MIL_EL_00017104\\IE001_MIL_EL_00017104.mrc', 'rb') as fh:
    # force_utf8 makes pymarc decode as UTF-8 regardless of the leader.
    reader = pym.MARCReader(fh, to_unicode=True, force_utf8=True)
    for record in reader:
        # 020 is the MARC 21 ISBN field; in UNIMARC data the ISBN is in 010.
        for field in record.get_fields('020'):
            if field['a'] is not None:
                print(field['a'])
            else:
                print('No ISBN')

I tested your data and found that setting "to_unicode=True, force_utf8=True" when reading the file removes all of the "couldn't find" warnings.

From the MARCReader class docstring:

If you find yourself in the unfortunate position of having data that
is utf-8 encoded without the leader set appropriately you can use
the force_utf8 parameter:

    reader = MARCReader(file('file.dat'), to_unicode=True,
        force_utf8=True)
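
Before reaching for force_utf8, it may be worth checking whether the raw bytes really are UTF-8. The following is a rough heuristic sketch using only the standard library; `looks_like_utf8` is a hypothetical helper, not part of pymarc, and a MARC-8 byte sequence could in principle pass the check by coincidence.

```python
def looks_like_utf8(raw: bytes) -> bool:
    """True if raw contains non-ASCII bytes and decodes strictly as UTF-8."""
    if raw.isascii():
        # Plain ASCII text reads the same either way, so it tells us nothing.
        return False
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    print(looks_like_utf8("città".encode("utf-8")))  # UTF-8 bytes for "à"
    print(looks_like_utf8(b"citt\xe1a"))             # not valid UTF-8
```

If most records in a file pass this check while the leader does not declare Unicode, that matches the situation the docstring describes.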

@tfmorris

Quoting @edsu: "Actually it looks like pymarc doesn't have any character mappings for g0=66 or g1=69."

I think 66 (0x42) and 69 (0x45) are actually the default character sets:

42(hex) [ASCII graphic: B] = Basic Latin (ASCII)
21(hex)45(hex) [ASCII graphics: !E] = Extended Latin (ANSEL) (the 21(hex) technically is a second character of the Intermediate segment of this escape sequence.)

per: https://www.loc.gov/marc/specifications/speccharmarc8.html#field066

Based on the comment above: #114 (comment)
it sounds like the MARC file contains UTF-8 encoded characters.
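
That would also explain the exact symptom. In UTF-8, "à" is the two bytes 0xC3 0xA0; in the MARC-8 Extended Latin (ANSEL) set 0xC3 is the copyright sign, and 0xA0 has no mapping, hence "citt©" followed by a warning about 0xa0. The sketch below is a toy decoder, not pymarc's code: the one-entry table and the replacement of unmapped bytes with a space are assumptions chosen to match the output shown in this issue.

```python
import sys

# Toy excerpt of the MARC-8 Extended Latin (ANSEL) table: 0xC3 is the
# copyright sign. 0xA0 is left out, as in the warning reported in this issue.
TOY_ANSEL = {0xC3: "\u00a9"}


def toy_marc8_decode(raw: bytes, warn=sys.stderr) -> str:
    """Decode byte by byte: ASCII as-is, high bytes through TOY_ANSEL."""
    out = []
    for byte in raw:
        if byte < 0x80:
            out.append(chr(byte))
        elif byte in TOY_ANSEL:
            out.append(TOY_ANSEL[byte])
        else:
            # Assumption: unmapped bytes are reported and replaced by a space.
            warn.write("couldn't find 0x%x in g0=66 g1=69\n" % byte)
            out.append(" ")
    return "".join(out)


if __name__ == "__main__":
    utf8_bytes = "città".encode("utf-8")  # ends in b"\xc3\xa0"
    print(repr(toy_marc8_decode(utf8_bytes)))
```

The g0=66 and g1=69 in the message are just the decimal values of the ASCII letters "B" and "E" from the escape sequences listed above, so the warning is about an unmapped byte inside the default sets, not about unknown sets.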
