Extract terms through OCR for non-text source documents #752

clementbiron · 2022-02-22T13:38:26Z

With the following declaration (available on the dedicated branch at OpenTermsArchive/france-declarations@5d1c1c3):

{
  "name": "Desigual",
  "documents": {
    "Commercial Terms": {
      "fetch": "https://www.desigual.com/on/demandware.static/-/Library-Sites-DsglSharedLibrary/default/dw98507c8d/docs/legal/Footer_legal_documents/Francia/FRANCIA-Condiciones_Generales_Venta_Vfinal_FR_230321.pdf"
    },
    "Privacy Policy": {
      "fetch": "https://www.desigual.com/on/demandware.static/-/Library-Sites-DsglSharedLibrary/default/dw77e5bf6a/docs/legal/Footer_legal_documents/Francia/FRANCIA-POLITICA_DE_PRIVACIDAD_Vfinal_FR_230321.pdf"
    }
  }
}

I get an empty version for Commercial Terms and the following wrong version for Privacy Policy.

The snapshots are good.


MattiSG · 2022-03-04T18:25:55Z

Unfortunately these documents are protected: if I access the PDF and try to copy their contents, I also only get spaces. I don't think this is an issue with Open Terms Archive (or rather, with the dependency @accordproject). However, it is worth reflecting on whether we can detect this automatically and how we should handle such cases, as it is pretty much the PDF equivalent to an HTTP 403.
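To make the "detect this automatically" idea concrete, here is a minimal sketch of a heuristic that flags an extraction result containing almost nothing but whitespace. It is written in Python for illustration only and is not taken from the Open Terms Archive code base; the function name `looks_unreadable` and the threshold of 20 visible characters are assumptions chosen for the example.

```python
def looks_unreadable(extracted_text: str, min_visible_chars: int = 20) -> bool:
    """Return True when the extracted text has too few visible characters.

    Hypothetical helper: both the name and the default threshold are
    illustrative assumptions, not part of any existing API.
    """
    # Count every character that is not whitespace (spaces, tabs, newlines, form feeds).
    visible = sum(1 for ch in extracted_text if not ch.isspace())
    return visible < min_visible_chars


if __name__ == "__main__":
    # A PDF whose text layer only yields spaces would be flagged.
    print(looks_unreadable("   \n \t  \f "))
    # A normal extraction would not.
    print(looks_unreadable("Conditions générales de vente applicables en France"))
```

A check like this could let the engine report such a document as inaccessible instead of recording an empty version, though the right threshold would need to be tuned on real cases.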

martinratinaud · 2022-05-11T10:42:38Z

And for the record, it is NOT fixed by #836

Considering how quickly accordproject answered on the whitespace matter, I suggest we create an issue in their repo to see if they can do something about it (even though I doubt it).

MattiSG · 2023-04-24T09:35:35Z

The source file has been vectorised. There is indeed no text in the PDF. The only way to obtain the content would be to use OCR. This could be useful. I'll rename this issue accordingly. Please add other example cases where this would enable extraction!
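As a rough sketch of how an OCR fallback could be wired in: try the regular text extraction first, and only run OCR when the text layer comes back empty. This is Python for illustration, not the project's actual code. The names `extract_terms`, `extract_text` and `run_ocr` are hypothetical, and the two extractors are passed in as plain functions so that no particular PDF or OCR library is assumed.

```python
from typing import Callable, Tuple


def extract_terms(
    pdf_bytes: bytes,
    extract_text: Callable[[bytes], str],
    run_ocr: Callable[[bytes], str],
) -> Tuple[str, str]:
    """Return (text, method), using OCR only when the text layer is empty.

    Hypothetical helper: `extract_text` stands for the existing PDF text
    extraction and `run_ocr` for whichever OCR tool ends up being chosen.
    """
    text = extract_text(pdf_bytes)
    # A vectorised PDF has no text layer, so extraction yields only whitespace.
    if text.strip():
        return text, "text-layer"
    return run_ocr(pdf_bytes), "ocr"


if __name__ == "__main__":
    # Stub extractors standing in for real tools.
    text, method = extract_terms(
        b"%PDF",
        lambda _: "   \n ",
        lambda _: "Politique de confidentialité",
    )
    print(method, text)
```

Returning the method alongside the text would make it possible to record that a version was obtained through OCR, which matters since OCR output is usually less reliable than a real text layer.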

MattiSG · 2024-11-04T13:28:21Z

An alternative could be https://ds4sd.github.io/docling/

clementbiron added the bug label Feb 22, 2022

MattiSG removed the bug label Mar 4, 2022

MattiSG changed the title ~~Empty and wrong PDF~~ Handle unreadable PDF files Mar 4, 2022

MattiSG moved this to To assess in OpenTermsArchive Core Mar 4, 2022

MattiSG added this to OpenTermsArchive Core Mar 4, 2022

MattiSG mentioned this issue May 3, 2022

Missing whitespace when generating version from PDF #836

Closed

MattiSG changed the title ~~Handle unreadable PDF files~~ Extract terms through OCR for non-text source documents Apr 24, 2023
