OCR: Free Online OCR seems like the best solution #3
Replies: 6 comments 3 replies
-
Here's an example of the conversion to TXT of the first page, with no edits!
-
It's also semi-smart with single-byte hex numbers. If an ambiguous zero/O character is next to a digit it will pick a zero, but if it's next to an alpha character it will pick a capital O. Given how similar the two glyphs are and the quality of the listing, that's relatively easy to take into account (though my eyes hurt from correcting and combining ecbbs2.sa today).
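The zero/O mix-up described above can be partly undone mechanically. What follows is only a minimal sketch, assuming hex literals in the listing are written with a `$` prefix; the function name `fix_hex_literals` is made up for illustration and is not part of any tool mentioned in this thread. It only touches tokens that consist entirely of hex digits and capital O, so anything else still needs a manual check.

```python
import re

# Assumption: hex literals look like $0F or $FFD0. A capital O inside such a
# literal cannot be valid hex, so it is treated as a misread zero.
_HEX_LITERAL = re.compile(r"\$[0-9A-FO]+\b")


def fix_hex_literals(line: str) -> str:
    """Replace capital O with zero inside $-prefixed hex literals only."""
    return _HEX_LITERAL.sub(lambda m: m.group(0).replace("O", "0"), line)


if __name__ == "__main__":
    print(fix_hex_literals("LDA #$OF"))   # hex literal is corrected
    print(fix_hex_literals("LDX #FOO"))   # no $ prefix, left alone
```

Restricting the substitution to `$`-prefixed tokens is deliberately conservative: labels and mnemonics containing a real letter O are never changed.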
-
That is pretty good. Is the objective to OCR the listing as a whole, or just OCR out the source code?
-
If you sign up they give you 50 pages for free, which has the advantage that you can then upload a zip file with multiple pages rather than doing one page at a time. 50 pages is more than enough for even the largest source file, so it saves a bit of time.
-
I ended up buying 300 pages as I really like this service, so I'll OCR and convert to UTF-8 and then commit the raw uncorrected OCR so that you don't need to do this step.
-
The OCR results for ECBO64.SA were pretty grotty, and so with just ECBOEM.SA remaining (which Duncan may do) I thought I'd try to re-OCR those pages to make Duncan's job easier, given they were even worse. I cropped the tractor feed (and magazine bleed), resubmitted, and the result is significantly better. I wish I'd done that for all the source images :)
-
I talked with Richard about his experience of capturing dot matrix printouts, which included trying to train Tesseract, and he wasn't hopeful. I installed a few different front ends for Tesseract and tried various methods, and it was ugly. It would be faster to transcribe the listings manually.
I next searched for results using some of the cheaper commercial offerings and found this discussion about ReadIris 8. The current version is 17 and I was impressed by it finding the text areas of a page (and the ability to select an area yourself), but the inline correction editor was awful.
Eventually I stumbled upon Free Online OCR, and although the website is reminiscent of 1995 the results of converting a few test pages to TXT files were quite superb. There was no need to crop the image to remove the tractor feed or the magazine we used to hold down the paper, and even whitespace was accurately preserved. This service is free for up to 20 pages per hour and doesn't appear to have any other service restrictions in its T&Cs.
I think we should plan to just scan the pages as-is, including the title lines, line numbers etc. Some pages may need cropping or rotating by a few degrees; I didn't test with pages that could do with the latter. During reconstruction we can always use an editor that lets you crop columns (virtually any code editor) to remove line numbers, assembled output etc.
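If you'd rather script the column cropping than do it in an editor, something like the following would work. This is a rough sketch, assuming the OCR preserved whitespace well enough that the source code starts at the same column on every line; the helper name `crop_columns` and the 18-column width in the sample are made up for illustration, not measured from the actual listings.

```python
from typing import Optional


def crop_columns(text: str, start: int, end: Optional[int] = None) -> str:
    """Keep only columns [start, end) of every line (0-based), like a column select."""
    lines = [line[start:end].rstrip() for line in text.splitlines()]
    # Preserve a trailing newline if the input had one.
    return "\n".join(lines) + ("\n" if text.endswith("\n") else "")


if __name__ == "__main__":
    # Hypothetical listing rows: line number, address and object code
    # padded to 18 columns, followed by the source text.
    sample = (
        "0001 8000 86 0F".ljust(18) + "START  LDA  #$0F\n"
        + "0002 8002 39".ljust(18) + "       RTS\n"
    )
    print(crop_columns(sample, 18), end="")
```

Lines where the OCR dropped or added a space will come out misaligned, so the cropped result still wants a visual check against the scan.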
Note: Free Online OCR returns .txt files using UTF-16 LE encoding. You'll need to resave them in your editor as UTF-8 or ANSI for Git diffing to work well.
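For more than a handful of files, the re-encoding can also be scripted instead of resaving each file by hand. A minimal sketch, assuming the files really are UTF-16 LE (with or without a byte-order mark); the function names and the `raw_ocr` folder are placeholders for illustration.

```python
from pathlib import Path


def utf16le_to_utf8(data: bytes) -> bytes:
    """Re-encode UTF-16 LE bytes as UTF-8, dropping a leading BOM if present."""
    if data.startswith(b"\xff\xfe"):  # UTF-16 LE byte-order mark
        data = data[2:]
    return data.decode("utf-16-le").encode("utf-8")


def convert_file(src: Path, dst: Path) -> None:
    """Convert one OCR output file; line endings are passed through unchanged."""
    dst.write_bytes(utf16le_to_utf8(src.read_bytes()))


if __name__ == "__main__":
    # "raw_ocr" is a placeholder folder name; files are converted in place.
    # Run this only once per file: a second pass would misread the UTF-8 bytes.
    for path in Path("raw_ocr").glob("*.txt"):
        convert_file(path, path)
```

Working on bytes rather than in text mode avoids any newline translation, so the only change Git sees is the encoding.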