Can't get pdfminer to parse the amazon sample similar to pdftotext
We support different PDF parsers. Each has its own strengths and weaknesses.
Testing each parser against the same template does not lead to consistent results.
The current test cases might need some work.
One test only checks whether a string is returned; see:
        print(res)  # Check why logger.info is not working, for the time being using print
    except ImportError:
        # print("pdfminer module not installed!")
        self.assertTrue(False, "pdfminer is not installed")
    self.assertTrue(type(res) is str, "return is not a string")
This test is likely to pass.
However, when the actual result is compared, it fails,
as in the case of the amazon.pdf example.
Parsing with pdfminer results in a different text layout than parsing with pdftotext,
which causes the regexes to fail.
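To illustrate the failure mode, here is a minimal sketch. The two strings below are made up for illustration and are not real output of either parser; the regex is likewise a hypothetical template pattern. It only shows how the same fields in a different layout can defeat a regex written against one layout.

```python
import re

# Hypothetical extractions of the same invoice fields (NOT real parser output).
# Layout A keeps label and value on one line; layout B emits labels first,
# then values, as can happen when text blocks are ordered differently.
layout_a = "Invoice Number:   12345\nTotal:   EUR 10.00\n"
layout_b = "Invoice Number:\nTotal:\n12345\nEUR 10.00\n"

# Hypothetical template regex written against layout A.
pattern = re.compile(r"Invoice Number:[ \t]+(\d+)")

match_a = pattern.search(layout_a)
match_b = pattern.search(layout_b)

print(match_a.group(1) if match_a else None)
print(match_b.group(1) if match_b else None)
```

Both strings pass a "result is a string" check, but only the first one yields the invoice number.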
Proposed solution:
Update the testing mechanism: create parser-specific tests.
Adapt the template format so that a template file can specify the preferred parser and its settings.
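A parser-specific test could look roughly like the sketch below. Everything here is an assumption for illustration: `fake_pdftotext` and `fake_pdfminer` are stand-ins that return made-up text rather than the real parser wrappers, and `PARSER_CASES` is a hypothetical table pairing each parser with a regex suited to its layout. The point is that each parser is checked against an expected field value, not just against `str`.

```python
import re
import unittest


# Hypothetical stand-ins for the real parser wrappers; they return made-up text.
def fake_pdftotext(path):
    return "Invoice Number:   12345\n"


def fake_pdfminer(path):
    return "Invoice Number:\n12345\n"


# Hypothetical table: parser name -> (to_text function, regex for its layout).
PARSER_CASES = {
    "pdftotext": (fake_pdftotext, r"Invoice Number:[ \t]+(\d+)"),
    "pdfminer": (fake_pdfminer, r"Invoice Number:\s*(\d+)"),
}


class TestParserSpecific(unittest.TestCase):
    def test_invoice_number(self):
        for name, (to_text, pattern) in PARSER_CASES.items():
            # subTest reports each parser separately instead of stopping
            # at the first failing one.
            with self.subTest(parser=name):
                text = to_text("amazon.pdf")
                self.assertIsInstance(text, str)
                match = re.search(pattern, text)
                self.assertIsNotNone(match, "regex did not match")
                self.assertEqual(match.group(1), "12345")
```

Saved as a test module, this can be run with `python -m unittest`; a failure then names the parser whose layout no longer matches.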
As an example, consider these use cases:
A) Invoices in which the issuer data is encapsulated in an image
(VAT number, issuer name and address).
That data is needed to match a template,
so the image needs to be parsed by OCR.
As far as I know, pdftotext is unable to do that,
but pdfminer.six would be capable of it (--all-texts).
B) Using invoice2data as a module. An invoice is parsed by default with the pdftotext parser.
The extracted text is enough to match a template, but from experience we know that a different parser, e.g. pdfplumber, could be needed for full detection of the fields.
A key could be added to the template that triggers re-parsing the invoice with that specific parser.
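Use case B could be sketched as follows. This is not how invoice2data currently works: the `parser` template key, the `PARSERS` registry, and `text_for_template` are all hypothetical names, and the parser functions are stubs that return placeholder text so the example runs on its own.

```python
from typing import Any, Callable, Dict, Mapping


# Stub parsers returning placeholder text (stand-ins for real wrappers).
def fake_pdftotext(path: str) -> str:
    return f"pdftotext text of {path}"


def fake_pdfplumber(path: str) -> str:
    return f"pdfplumber text of {path}"


# Hypothetical registry of available parsers.
PARSERS: Dict[str, Callable[[str], str]] = {
    "pdftotext": fake_pdftotext,
    "pdfplumber": fake_pdfplumber,
}
DEFAULT_PARSER = "pdftotext"


def text_for_template(path: str, template: Mapping[str, Any], default_text: str) -> str:
    """Return the text to run field extraction on for a matched template.

    If the template names a preferred parser (hypothetical "parser" key)
    other than the default, the invoice is parsed again with that parser;
    otherwise the already extracted default text is reused.
    """
    preferred = template.get("parser", DEFAULT_PARSER)
    if preferred == DEFAULT_PARSER:
        return default_text
    if preferred not in PARSERS:
        raise ValueError(f"unknown parser {preferred!r}")
    return PARSERS[preferred](path)


# Usage: the template was matched on the default text, then asks for pdfplumber.
template = {"issuer": "Example Corp", "parser": "pdfplumber"}
default_text = PARSERS[DEFAULT_PARSER]("invoice.pdf")
text = text_for_template("invoice.pdf", template, default_text)
print(text)
```

Reusing the default text when no key is set keeps existing templates working unchanged and avoids a second parse unless a template asks for one.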
bosd changed the title from "PDFminer implementation broken" to "pdfminer test broken" on Sep 24, 2022
PDFminer tests are broken.
Referenced snippet: invoice2data/tests/test_lib.py, lines 78 to 87 in f6080ba