OCR: Free Online OCR seems like the best solution #3

davidlinsley · 2022-12-01T06:04:04Z

davidlinsley
Dec 1, 2022
Maintainer

I talked with Richard about his experience of capturing dot matrix printouts, which included trying to train Tesseract, and he wasn't hopeful. I installed a few different front ends for Tesseract and tried various methods, and the results were ugly. It would have been faster to transcribe the pages manually.

I next searched for results using some of the cheaper commercial offerings and found this discussion about ReadIris 8. The current version is 17 and I was impressed by it finding the text areas of a page (and the ability to select an area yourself), but the inline correction editor was awful.

Eventually I stumbled upon Free Online OCR, and although the website is reminiscent of 1995 the results of converting a few test pages to TXT files were quite superb. There was no need to crop the image to remove the tractor feed or the magazine we used to hold down the paper, and even whitespace was accurately preserved. This service is free for up to 20 pages per hour and doesn't appear to have any other service restrictions in its T&Cs.

I think we should plan to just scan the pages as-is, including the title lines, line numbers etc. Some may need cropping or rotating a few degrees; I didn't test with pages that needed rotating. During reconstruction we can always use an editor that supports column selection (virtually any code editor) to remove line numbers, assembled output etc.
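For anyone who prefers a script to an editor's column selection, here is a rough sketch of the same idea in Python. The function name `crop_columns` and the column offsets are hypothetical. It assumes the columns have already been lined up, which the raw OCR output does not guarantee.

```python
from typing import Optional


def crop_columns(text: str, start: int, end: Optional[int] = None) -> str:
    """Keep only characters [start:end) of every line, like a column select.

    Assumption: the listing columns are already aligned, so one fixed
    offset is valid for every line in the text.
    """
    cropped = []
    for line in text.splitlines():
        # Slicing past the end of a short line simply yields an empty string.
        cropped.append(line[start:end].rstrip())
    return "\n".join(cropped)


if __name__ == "__main__":
    # Hypothetical, already-aligned sample: line number in columns 0-9.
    sample = "00700     OPT   L,LLE=120\n00701     NAM   TXRRAM\n00702"
    print(crop_columns(sample, 10))
```

The offset has to be chosen per file by looking at where the label field starts.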

Note: Free Online OCR returns .txt files using UTF-16 LE encoding. You'll need to resave them in your editor as UTF-8 or ANSI for Git diffing to work well.
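A minimal sketch of that re-encoding step in Python, for anyone who would rather batch it than resave each file by hand. The function names are hypothetical, and it assumes the files are UTF-16 LE with an optional byte-order mark.

```python
import codecs
from pathlib import Path


def utf16le_bytes_to_utf8(raw: bytes) -> bytes:
    """Re-encode UTF-16 LE bytes as UTF-8, dropping a leading BOM if present."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        raw = raw[len(codecs.BOM_UTF16_LE):]
    # Decode explicitly as little-endian rather than relying on platform defaults.
    return raw.decode("utf-16-le").encode("utf-8")


def convert_file(src: Path, dst: Path) -> None:
    """Read one OCR result file and write a UTF-8 copy (paths are up to the caller)."""
    dst.write_bytes(utf16le_bytes_to_utf8(src.read_bytes()))
```

Working on bytes rather than text mode leaves the line endings exactly as the service returned them.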

davidlinsley · 2022-12-10T04:55:09Z

davidlinsley
Dec 10, 2022
Maintainer Author

Here's an example of the conversion to TXT of the first page. No edits!

`
PAGE 001 ECBCOM .SA:0 TXRRAM Extension ROM Direct Page RAM

            00700                                                                            OPT              L,LLE=120


            00701                                                                            NAM •            TXRRAM               -     RAM Off The Direct Page For TCC BASIC
                                                                                                          ,

            00702 


            00703                                                            ****************************4***************************************** 


            00704                                                            *                                                                                                                                                                   *


            ()0705                                                           *           Copyright               1982 by Microsoft Corporation,                                           all       rights reserved                              *

            00706                                                            *                                                                                                                                          .                        *

            00707                                                            ************************************************************t*********


            00708


            00709                                                                            TTL              External Declarations


            00710                                                                                                                                                                                             -    •


            00711D           0000                                                            DSCT


            00712


            00/13                                                                            XREF             FCERR,FUNDSP,FUNLST


            00714                                                                            XREF             NFUNTK,NNRMTK


            00715                                                                            XREF             POWRUP


            00716                                                                            XREF             RESLST


            00717                                                                            XREF             SNERR,STMDSP


            00718


            00719                                                                            TTL              RAM off the direct page.


            00720


            00721                                                                            XDEF             VSWI3


            00722D           0000                    0003               A VSWI3              RMB              &3                   SWI3 vector.


            00723                                                                            XDEF             VSWI2


            00724D           0003                    0003               A VSWI2              RMB              &3                   SWI2 vector.


            00725                                                                            XDEF             VSWI


            00726D           0006                    0003               A VSWI               RMB              &3                   SWI vector.


            00727                                                                            XDEF             VNMI


            00728D           0009                    0003               A VNMI               RMB              &3                   NMI vector.

`

1 reply

davidlinsley Dec 14, 2022
Maintainer Author

It also works well with the first skewed page I came across: OriginalScans\ECBBS2.SA\CCI10222022_0058.png

davidlinsley · 2022-12-15T01:13:42Z

davidlinsley
Dec 15, 2022
Maintainer Author

It's also semi-smart with single-byte hex numbers. If the ambiguous character is next to a digit it will pick a zero, but if it's next to an alpha character it will pick a capital O. Given the similarity and listing quality, that's relatively easy to take into account (though my eyes hurt from correcting and combining ecbbs2.sa today).
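A rough sketch of how that correction could be scripted, assuming you already know a given field is meant to be hex (for example the address or object code columns). The function name `normalize_hex_field` is hypothetical, and it must not be run over labels or comments, where a capital O may be genuine.

```python
import re

# A field made up only of hex digits plus the capital O that OCR substitutes for zero.
_HEX_WITH_O = re.compile(r"[0-9A-FO]+")


def normalize_hex_field(field: str) -> str:
    """Replace capital O with zero in a field the caller knows to be hex.

    Fields containing any other character are returned unchanged so that
    a mis-selected label or mnemonic is not silently damaged.
    """
    if _HEX_WITH_O.fullmatch(field):
        return field.replace("O", "0")
    return field
```

For example, `normalize_hex_field("AO")` gives `"A0"`, while `normalize_hex_field("VSWI")` is left alone.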


bluearcus · 2022-12-15T15:23:07Z

bluearcus
Dec 15, 2022
Collaborator

That is pretty good.

Is the objective to OCR the listing as a whole, or just OCR out the source code?

1 reply

davidlinsley Dec 17, 2022
Maintainer Author

I updated the readme.md on the repo page to reflect what I've done so far, and the steps for that. I've been preserving the listing (and then converting to src) so that we have it exactly as Duncan printed it out. It is a bunch more work, but I think provides a little extra for the Dragon history side (though we do have the scans). I could also see someone writing a script to extract the bytes and compare to the ROM dumps and figure out what is then needed to recreate the build steps in the Motorola Exor emulators - including adding the DNS patch :)
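As a starting point for that kind of script, here is a minimal sketch of just the comparison half in Python. Everything in it is an assumption: the name `find_mismatches`, the idea that the bytes have already been pulled out of the listing into an address-to-bytes mapping, and the ROM base address used in the example. Parsing the listing itself is left out because the object code columns would need checking against the real pages first.

```python
from typing import Dict, List, Optional, Tuple

Mismatch = Tuple[int, int, Optional[int]]  # (address, listing byte, ROM byte or None)


def find_mismatches(chunks: Dict[int, bytes], rom: bytes, rom_base: int) -> List[Mismatch]:
    """Compare bytes taken from a listing against a ROM dump.

    chunks maps a start address to the bytes the listing shows there.
    rom_base is the address at which the first ROM byte is mapped.
    """
    mismatches: List[Mismatch] = []
    for address, data in sorted(chunks.items()):
        for offset, value in enumerate(data):
            rom_index = address + offset - rom_base
            if not 0 <= rom_index < len(rom):
                # The listing claims a byte outside the dumped range.
                mismatches.append((address + offset, value, None))
            elif rom[rom_index] != value:
                mismatches.append((address + offset, value, rom[rom_index]))
    return mismatches


if __name__ == "__main__":
    # Made-up data purely to show the call shape.
    rom = bytes([0x7E, 0x80, 0x00, 0x39])
    chunks = {0x8000: b"\x7e\x80\x00", 0x8003: b"\x3a"}
    for addr, listed, dumped in find_mismatches(chunks, rom, 0x8000):
        print(f"{addr:04X}: listing {listed:02X}, ROM {dumped}")
```

A mismatch could equally be an OCR error or a real difference between the printed source and the shipped ROM, so each one would still need a look at the scan.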

Unless I get through the core Microsoft files first, I'm leaving the four Dragon Data and customized files. I thought they might be the more interesting ones from a pure Dragon perspective, and a little less joyless for anyone else!

davidlinsley · 2022-12-17T22:57:00Z

davidlinsley
Dec 17, 2022
Maintainer Author

If you sign up they give you 50 pages for free, which has the advantage that you can then just send up a zip file with multiple pages rather than doing one at a time. 50 is more than enough for even the largest source file and so saves a bit of time.


davidlinsley · 2022-12-28T18:20:42Z

davidlinsley
Dec 28, 2022
Maintainer Author

I ended up buying 300 pages as I really like this service, so I'll OCR everything, convert it to UTF-8, and then commit the raw uncorrected OCR so that you don't need to do this step.

1 reply

davidlinsley Dec 29, 2022
Maintainer Author

The raw OCR files (converted to UTF-8) are now committed. I had to redo a couple of pages, which worked after cropping, except for one which will just need to be done fully by hand (but I don't think it had a huge amount of content). Not bad for 340 pages of 39-year-old dot matrix!

davidlinsley · 2023-03-07T04:48:26Z

davidlinsley
Mar 7, 2023
Maintainer Author

The OCR results for ECBO64.SA were pretty grotty, and so with just ECBOEM.SA remaining (which Duncan may do) I thought I'd try to re-OCR those pages to make Duncan's job easier, given they were even worse. I cropped the tractor feed (and magazine bleed), resubmitted, and the result is significantly better. I wish I'd done that for all the source images :)

