Use Mozilla's PDF.js to extract OCR text #59
lib/src/pdf/pdf-worker.ts
Outdated
onmessage = async evt => {
const buffer = Uint8Array.from(decodedPlugin, c => c.charCodeAt(0))
await plugin.default(Promise.resolve(buffer))
onmessage = async path => {
Log it to make sure, but this should probably stay `async evt =>`. This call is triggered when you call `worker.run()` in pdf-manager.ts, so you should receive an event object of the shape
{
data: {
path: string,
name: string
}
}
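A minimal sketch of unpacking that event inside the worker, assuming the shape above is accurate. `ExtractRequest`, `WorkerMessage`, and `readRequest` are hypothetical names for illustration, not identifiers from the repo.

```typescript
// Hypothetical types mirroring the event shape described in the review comment.
interface ExtractRequest {
  path: string
  name: string
}

interface WorkerMessage {
  data: ExtractRequest
}

// Pull `path` and `name` out of the event and fail loudly if the shape is off,
// which makes a mismatch between pdf-manager.ts and the worker easy to spot.
function readRequest(evt: WorkerMessage): ExtractRequest {
  const { path, name } = evt.data
  if (typeof path !== "string" || typeof name !== "string") {
    throw new TypeError("expected evt.data to contain string `path` and `name`")
  }
  return { path, name }
}

// Inside the worker this could be wired up roughly as:
//   onmessage = async evt => { const { path, name } = readRequest(evt) /* ... */ }
```

Keeping the handler signature as `async evt =>` and destructuring `evt.data` avoids treating the whole event object as if it were the path.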
updated
lib/package.json
Outdated
@@ -47,8 +47,9 @@
"@apollo/utils.createhash": "^3.0.0",
"mammoth": "^1.6.0",
"p-queue": "^7.4.1",
"pdfjs-dist": "^4.2.67",
You probably don't need to add a dependency on pdf.js, as it's already bundled in Obsidian and you should be able to use it with something like this:
const arrayBuffer = await app.vault.readBinary(file);
// @ts-ignore
const document = await window.pdfjsLib.getDocument(arrayBuffer).promise;
for (let i = 1; i <= document.numPages; i++) {
  const page = await document.getPage(i);
  // etc.
}
But! The bundled version might cause issues so it's still worth trying with an external dependency 👍
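To make the "etc." part concrete, here is a rough sketch of how the per-page loop could collect text. It relies on pdf.js pages exposing `getTextContent()`, whose result has an `items` array where text items carry a `str` field. The `*Like` interfaces and the `joinItems` / `extractText` helpers are hypothetical and only describe the small slice of the API used here. Whether the version bundled in Obsidian behaves the same way is an assumption worth checking.

```typescript
// Minimal structural types covering only what this sketch touches.
// They are stand-ins, not the real pdf.js typings.
interface TextItemLike {
  str?: string // marked-content items have no `str`
}
interface PageLike {
  getTextContent(): Promise<{ items: TextItemLike[] }>
}
interface DocumentLike {
  numPages: number
  getPage(pageNumber: number): Promise<PageLike>
}

// Join the text fragments of one page, skipping items without text.
function joinItems(items: TextItemLike[]): string {
  return items
    .map(item => item.str ?? "")
    .filter(text => text.length > 0)
    .join(" ")
}

// Walk the pages (1-based, as in the snippet above) and concatenate their text.
async function extractText(doc: DocumentLike): Promise<string> {
  const pages: string[] = []
  for (let i = 1; i <= doc.numPages; i++) {
    const page = await doc.getPage(i)
    const content = await page.getTextContent()
    pages.push(joinItems(content.items))
  }
  return pages.join("\n")
}
```

Since the types are structural, the document returned by `getDocument(...).promise` could be passed to `extractText` directly, and a fake document can be used to test without loading a PDF.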
I'll try the bundled version first to reduce the number of variables; I'm having a hard enough time just getting a basic development/iteration workflow going.
@scambier I made a few changes and I think I got pdf.js working on a small number of files. I added a 10-second delay, but with a large number of files, it still crashes within a few seconds,
Thanks @scambier for the help and guidance so far. The latest place I'm stuck at seems to be related to how omnisearch and text-extractor work together. At least that's my best guess so far. My workflow to test my changes has been to quit Obsidian, load a bunch of PDFs into the vault directory, and re-open Obsidian. Once open, since Omnisearch has "PDFs content indexing" enabled, the developer console lists a whole bunch of entries like
If there are a large enough number of PDFs, Obsidian crashes. Otherwise it successfully extracts all the text from all the new PDFs. None of the debug messages that I placed in the pdf-worker.ts or pdf-manager.ts show up though, like it's somehow shortcutting the queue and calling the extraction library directly. When I right click on a file and choose "Text Extractor" -> "Extract Text to clipboard", however, then my debug log messages show up (and the pdf-worker script throws an exception about Any thoughts? Am I on the right track in thinking Omnisearch may not be using the pdf queue mechanism?
Omnisearch will get a list of all indexable files, and asynchronously convert them to This conversion is done in 3 different ways:
The queue management happens in Text Extractor: I'm quite confident it works as intended; I had several problems when spawning too many web workers to process the files, and the CPU usage goes through the roof (hence this small trick to leave some room to breathe for the CPU).
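For readers following along, the general idea of capping how many extractions run at once can be sketched as below. This is a hand-rolled stand-in for illustration only, not Text Extractor's actual queue code. `TaskQueue` and its members are hypothetical names.

```typescript
// A tiny concurrency-limited queue: at most `concurrency` tasks run at a time,
// the rest wait until a running task settles.
class TaskQueue {
  private running = 0
  private readonly waiting: Array<() => void> = []
  private readonly concurrency: number

  constructor(concurrency: number) {
    this.concurrency = Math.max(1, concurrency)
  }

  get activeCount(): number {
    return this.running
  }

  get pendingCount(): number {
    return this.waiting.length
  }

  add<T>(task: () => Promise<T>): Promise<T> {
    return new Promise<T>((resolve, reject) => {
      const start = () => {
        this.running++
        // Free the slot and start the next waiting task, if any.
        const done = () => {
          this.running--
          const next = this.waiting.shift()
          if (next) next()
        }
        // Promise.resolve().then(task) also turns a synchronous throw into a rejection.
        Promise.resolve()
          .then(task)
          .then(
            value => { done(); resolve(value) },
            error => { done(); reject(error) },
          )
      }
      if (this.running < this.concurrency) start()
      else this.waiting.push(start)
    })
  }
}
```

With `new TaskQueue(2)`, a third `add()` call stays pending until one of the first two tasks finishes. The `p-queue` package listed in lib/package.json offers a `concurrency` option that serves the same purpose.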
Progress report: Gah, I wasted so many hours testing changes to the text-extractor repo, and none of my changes were showing up, and I finally realized it was because of my original attempt at solving this by modifying the omnisearch plugin back when I cloned #290. My changes there made it so nothing I have been testing was having any effect for Omnisearch indexing/extracting (so it was sort of skipping the queue mechanism, just not for the reasons I guessed) I also somehow missed that pdf-worker.ts is a "web worker", which has its own set of rules for sharing state, loading libraries, etc. I'm not able to use
mmmh I think PDF.js uses its own web worker(s), so you should be able to remove pdf-worker.ts and instead directly call PDF.js here. Because yeah, web workers only take serializable data as input/output so they're kinda difficult to work with for anything that requires external dependencies :/
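A quick way to see the serialization limit: `postMessage` copies its payload with the structured clone algorithm, which accepts plain data and typed arrays but rejects functions. The sketch below uses the global `structuredClone` (available in current browsers and Node 17+) as a stand-in check. `canPostToWorker` is a hypothetical helper name.

```typescript
// Returns true when `value` survives the structured clone algorithm,
// i.e. when it could be sent to (or returned from) a web worker.
function canPostToWorker(value: unknown): boolean {
  try {
    structuredClone(value)
    return true
  } catch {
    // Functions and other non-cloneable values raise a DataCloneError.
    return false
  }
}

// Plain data such as { path, name } or a Uint8Array is fine;
// an object carrying a callback (or a library instance with methods) is not.
const plainOk = canPostToWorker({ path: "a.pdf", name: "a.pdf" })
const bytesOk = canPostToWorker(new Uint8Array([1, 2, 3]))
const callbackOk = canPostToWorker({ onDone: () => 1 })
```

This is why handing a loaded library object to the worker fails, while passing a file path or raw bytes works.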
Intended to address #21
I'm currently unable to effectively test changes to the code