Use Mozilla's PDF.js to extract OCR text #59

figadore · 2024-05-04T19:52:20Z

Intended to address #21

I'm currently unable to effectively test changes to the code

scambier · 2024-05-05T05:49:19Z

lib/src/pdf/pdf-worker.ts

-onmessage = async evt => {
-  const buffer = Uint8Array.from(decodedPlugin, c => c.charCodeAt(0))
-  await plugin.default(Promise.resolve(buffer))
+onmessage = async path => {


Log it to make sure, but this should probably stay async evt =>. This call is triggered when you call worker.run() in pdf-manager.ts, so you should receive an event object of the shape

{ data: { path: string, name: string } }
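
To make that shape concrete, here is a minimal sketch of a handler that reads the payload from the event object. The `ExtractRequest` and `WorkerMessage` types and the `describeRequest` helper are hypothetical names for illustration; only the `{ data: { path, name } }` shape comes from the comment above.

```typescript
// Hypothetical types for illustration; only the { data: { path, name } }
// shape is taken from the discussion.
interface ExtractRequest {
  path: string
  name: string
}

interface WorkerMessage {
  data: ExtractRequest
}

// Reads the payload from evt.data instead of treating the event itself as the path.
function describeRequest(evt: WorkerMessage): string {
  const { path, name } = evt.data
  return `extract ${name} from ${path}`
}

// In the worker this would be wired up roughly as:
//   onmessage = async evt => { console.log(describeRequest(evt)) }
console.log(describeRequest({ data: { path: 'attachments/a.pdf', name: 'a.pdf' } }))
```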

scambier · 2024-05-05T05:58:11Z

lib/package.json

@@ -47,8 +47,9 @@
    "@apollo/utils.createhash": "^3.0.0",
    "mammoth": "^1.6.0",
    "p-queue": "^7.4.1",
+    "pdfjs-dist": "^4.2.67",


You probably don't need to add a dependency on pdf.js, as it's already bundled in Obsidian, and you should be able to use it with something like this:

const arrayBuffer = await app.vault.readBinary(file);
// @ts-ignore
const document = await window.pdfjsLib.getDocument(arrayBuffer).promise;
for (let i = 1; i <= document.numPages; i++) {
  const page = await document.getPage(i);
  // etc.
}

But! The bundled version might cause issues so it's still worth trying with an external dependency 👍
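
As a rough sketch of what the loop body could do, the function below collects the text layer of each page through PDF.js's `getTextContent()`. The `DocumentLike`, `PageLike`, and `TextItemLike` interfaces and the `extractAllText` name are assumptions made here so the example is self-contained; they cover only the parts of the PDF.js document object that the loop touches.

```typescript
// Structural types covering only the parts of a PDF.js document used here.
// These names are illustrative, not PDF.js's own type names.
interface TextItemLike {
  str?: string
}

interface PageLike {
  getTextContent(): Promise<{ items: TextItemLike[] }>
}

interface DocumentLike {
  numPages: number
  getPage(pageNumber: number): Promise<PageLike>
}

// Walks the pages in order (PDF.js pages are 1-indexed) and joins their text items.
async function extractAllText(doc: DocumentLike): Promise<string> {
  const pages: string[] = []
  for (let i = 1; i <= doc.numPages; i++) {
    const page = await doc.getPage(i)
    const content = await page.getTextContent()
    // Items without a `str` field (marked content) contribute an empty string.
    pages.push(content.items.map(item => item.str ?? '').join(' '))
  }
  return pages.join('\n')
}
```

Inside Obsidian, the object resolved from `window.pdfjsLib.getDocument(arrayBuffer).promise` would be passed as `doc`, assuming the bundled build behaves like the published one.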

I'll try the bundled version first to reduce the number of variables; I'm having a hard enough time just getting a basic development/iteration workflow going.

figadore · 2024-05-05T20:48:18Z

@scambier I made a few changes and I think I got pdf.js working on a small number of files. I added a 10 second delay, but with a large number of files, it still crashes within a few seconds, ~~which makes me wonder if the problem is related to the queueing somehow (or maybe I just added the delay incorrectly)~~ edit: I've either added the delay incorrectly or am not rebuilding/reloading correctly, since a dozen PDFs still process within a few seconds... I'll keep working

figadore · 2024-05-06T05:15:46Z

Thanks @scambier for the help and guidance so far. The latest place I'm stuck at seems to be related to how omnisearch and text-extractor work together. At least that's my best guess so far.

My workflow to test my changes has been to quit Obsidian, load a bunch of PDFs into the vault directory, and re-open Obsidian. Once open, since Omnisearch has "PDFs content indexing" enabled, the developer console lists a whole bunch of entries like

Omnisearch - 0:39:280 - Generating IndexedDocument from attachments/Y2OOMETPKX3ZMEMJJYX5MHOHTIO3JVSP.pdf

If there are a large enough number of PDFs, Obsidian crashes. Otherwise it successfully extracts all the text from all the new PDFs. None of the debug messages that I placed in the pdf-worker.ts or pdf-manager.ts show up though, like it's somehow shortcutting the queue and calling the extraction library directly. When I right click on a file and choose "Text Extractor" -> "Extract Text to clipboard", however, then my debug log messages show up (and the pdf-worker script throws an exception about Uncaught ReferenceError: obsidian is not defined for the argument in the closure, but I'm guessing that's a separate issue)

Any thoughts? Am I on the right track in thinking Omnisearch may not be using the pdf queue mechanism?

scambier · 2024-05-06T06:21:15Z

Omnisearch will get a list of all indexable files, and asynchronously convert them to IndexedDocuments.

This conversion is done in 3 different ways:

1. If the IndexedDocument already exists in the "live cache" (RAM), it's read from there
2. Else if the file is a markdown file, it is read from Obsidian (and saved in the live cache)
3. Else if it's a pdf/image/docx/etc., Omnisearch calls Text Extractor
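
The three branches above can be sketched as follows. Every name here (`IndexedDocument` fields, `toIndexedDocument`, `readMarkdown`, `extractText`) is hypothetical and chosen for illustration; this is not Omnisearch's actual code, only the decision order described in the list.

```typescript
// Hypothetical shapes for illustration only.
interface IndexedDocument {
  path: string
  content: string
}

type Reader = (path: string) => Promise<string>

async function toIndexedDocument(
  path: string,
  liveCache: Map<string, IndexedDocument>,
  readMarkdown: Reader,
  extractText: Reader
): Promise<IndexedDocument> {
  // 1. Already in the live cache: reuse it.
  const cached = liveCache.get(path)
  if (cached) return cached

  // 2. Markdown: read it and store the result in the live cache.
  if (path.endsWith('.md')) {
    const doc = { path, content: await readMarkdown(path) }
    liveCache.set(path, doc)
    return doc
  }

  // 3. Anything else (pdf, image, docx, ...): delegate to the extractor.
  return { path, content: await extractText(path) }
}
```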

The queue management happens in Text Extractor: extractText() calls the manager, which uses the queue.

I'm quite confident it works as intended; I had several problems when spawning too many web workers to process the files, with CPU usage going through the roof (hence this small trick to leave the CPU some room to breathe)
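
For readers unfamiliar with the idea, here is a minimal sketch of a queue that caps concurrency and pauses briefly after each task. It is a hand-rolled stand-in written for this example, not p-queue and not Text Extractor's actual implementation; `ThrottledQueue`, `sleep`, and the `pauseMs` parameter are hypothetical names.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise<void>(resolve => setTimeout(resolve, ms))

// Runs at most `concurrency` tasks at once and waits `pauseMs` after each
// task before freeing its slot, so the CPU gets a short break between jobs.
class ThrottledQueue {
  private running = 0
  private readonly pending: Array<() => void> = []

  constructor(
    private readonly concurrency: number,
    private readonly pauseMs: number
  ) {}

  add<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const run = () => {
        this.running++
        Promise.resolve()
          .then(task)
          .then(resolve, reject) // settle the caller's promise either way
          .then(() => sleep(this.pauseMs))
          .then(() => {
            this.running--
            this.next()
          })
      }
      this.pending.push(run)
      this.next()
    })
  }

  // Starts the next pending task if a slot is free.
  private next(): void {
    if (this.running < this.concurrency) {
      const run = this.pending.shift()
      if (run) run()
    }
  }
}
```

With `new ThrottledQueue(1, 100)`, files would be processed strictly one at a time with a 100 ms gap between them; both numbers are placeholders to tune.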

figadore · 2024-05-08T06:15:13Z

Progress report:

Gah, I wasted so many hours testing changes to the text-extractor repo, and none of my changes were showing up, and I finally realized it was because of my original attempt at solving this by modifying the omnisearch plugin back when I cloned #290. My changes there made it so nothing I have been testing was having any effect for Omnisearch indexing/extracting (so it was sort of skipping the queue mechanism, just not for the reasons I guessed)

I also somehow missed that pdf-worker.ts is a "web worker", which has its own set of rules for sharing state, loading libraries, etc. I'm not able to use window.pdfjsLib or import { loadPdfJs } from 'obsidian'. I guess my next step is to figure out how to load either the bundled pdf.js lib in the worker, or load it as an external dependency. My assumption is that this is a rollup thing.

scambier · 2024-05-08T06:25:40Z

mmmh I think PDF.js uses its own web worker(s), so you should be able to remove pdf-worker.ts and instead directly call PDF.js here.

Because yeah, web workers only take serializable data as input/output so they're kinda difficult to work with for anything that requires external dependencies :/
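
A small sketch of that constraint: `postMessage` copies its argument with the structured clone algorithm, which rejects functions, so a plain `{ path, name }` payload can cross the worker boundary while an object carrying library functions cannot. The `canPostToWorker` helper is a hypothetical name, and the example assumes a runtime that exposes the global `structuredClone` (modern browsers, Node 17+).

```typescript
// Returns true when a value survives the structured clone algorithm,
// which is what postMessage applies to data sent to or from a worker.
function canPostToWorker(value: unknown): boolean {
  try {
    structuredClone(value)
    return true
  } catch {
    // Functions (and objects containing them) raise a DataCloneError.
    return false
  }
}

// Plain data is fine.
console.log(canPostToWorker({ path: 'attachments/a.pdf', name: 'a.pdf' }))
// An object exposing functions, like a loaded library, is not.
console.log(canPostToWorker({ getDocument: () => null }))
```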

figadore · 2024-12-21T00:31:54Z

Adding a delay and limiting concurrency do seem to help to some degree, but with enough PDFs it still crashes eventually ("listeners" count continues to increase?). On the plus side, every time Obsidian starts, it extracts and caches text for more PDFs, so it will process all of them if Obsidian is re-opened enough times. I understand that this is not production ready, but it is good enough for me since I now have all of my PDFs searchable. I wish I knew more about typescript/javascript and how to solve the underlying issue. Hopefully someone can build on this some day and solve the out of memory issue, but until then, I'll likely be using my fork since it is functional for my use case.

No delay:

With delay:

I also noticed that the number of instances of pdf.worker.min.js is large

figadore added 2 commits May 4, 2024 11:31

limit to 1 worker

8e8e69c

wip, not working yet

918725d

figadore changed the title ~~Pdfjs~~ Use Mozilla's PDF.js to extract OCR text May 4, 2024

scambier reviewed May 5, 2024

figadore added 3 commits May 5, 2024 08:43

use bundled pdfjs lib

bebd6f5

ts-ignore everything

a182992

fix a few things and add delay before resolve

a90d1e7

try another method of delaying the result

dd3bc67

figadore added 4 commits December 20, 2024 23:58

Merge remote-tracking branch 'upstream/master' into pdfjs

d8c0b1c

add some development instructions

66fa256

use the pdfjs that comes with obsidian

992d078

add concurrency logic back, add queue length console output

5f3b33f
