a workflow to transform Arabic classical works in printed form to structured text

among/fusus

Project Status: WIP. Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.

Fusus

This is a workflow that transforms scanned pages into readable text.

The pages come from several printed Arabic books from the past few centuries.

The workflow takes care of cleaning, OCR, proofing, and conversion to tab-separated files, and from there to Text-Fabric, where the text material can be processed further.

(Figure: the pipeline)

Features

  • cleaning is included: specks and symbols can be specified for cleaning by copying and pasting such fragments and storing them in a designated directory;
  • column layout and line boundaries are detected prior to OCRing;
  • individual lines are passed to the OCR engine, which is Kraken using a model trained on many printed Arabic books, see model;
  • the results are stored in tab-separated files, retaining bounding boxes and confidences;
  • proofing pages can be generated for manually checking the OCR results;
  • the OCR results of each book are composed into Text-Fabric datasets.
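
As an illustration of how the tab-separated OCR results could be inspected, here is a minimal Python sketch that reads word records with bounding boxes and confidences and flags the words that deserve a closer look on a proofing page. The column names (page, line, left, top, right, bottom, confidence, text), the sample rows and the threshold are assumptions made for this example; the actual files produced by the workflow may be laid out differently.

```python
import csv
import io

# Hypothetical column layout and sample rows; the real Fusus files may differ.
SAMPLE_TSV = (
    "page\tline\tleft\ttop\tright\tbottom\tconfidence\ttext\n"
    "1\t1\t420\t50\t500\t90\t97\tفصوص\n"
    "1\t1\t310\t50\t400\t90\t42\tالحكم\n"
    "1\t2\t150\t100\t240\t140\t88\tالكتاب\n"
)


def read_words(tsv_text):
    """Parse tab-separated OCR output into a list of dicts with typed fields."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    words = []
    for row in reader:
        words.append(
            {
                "page": int(row["page"]),
                "line": int(row["line"]),
                # Bounding box as (left, top, right, bottom) in pixels.
                "box": tuple(
                    int(row[key]) for key in ("left", "top", "right", "bottom")
                ),
                "confidence": int(row["confidence"]),
                "text": row["text"],
            }
        )
    return words


def low_confidence(words, threshold):
    """Return the words whose OCR confidence is below the threshold."""
    return [word for word in words if word["confidence"] < threshold]


words = read_words(SAMPLE_TSV)
for word in low_confidence(words, threshold=80):
    print(word["page"], word["line"], word["box"], word["text"])
```

Keeping the bounding box next to each word means a flagged word can be located on the scanned page, which is what makes manual proofing practical.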

This lays the foundations for:

  • correcting OCR mistakes;
  • enriching the text with morphological/linguistic annotations, named entities;
  • performing intertextuality research between the ground work (the "Fusus" by Ibn Arabi) and its commentary books.

A lot of cleaning has been carried out on two editions of the Fusus: Lakhnawi and Afifi. After that, these editions were aligned and brought together in a single dataset, from which it is possible to read back the individual editions.

Text-Fabric interface

Get started with the tutorial.

We have also generated a static search interface.

Just click fusus-search and off you go.

You can do full-text search via regular expressions, not only in the full text, but also in attributes of the text, notably the bounding box information of each word.
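
To make the idea of searching in both text and attributes concrete, here is a small Python sketch. It is not the implementation of fusus-search; the record layout, the attribute name `left` and the sample values are assumptions made for this example. The same regular-expression matching is applied either to the word text or to the string form of an attribute.

```python
import re

# Hypothetical word records: the text plus one bounding box attribute.
WORDS = [
    {"text": "فصوص", "left": 420},
    {"text": "الحكم", "left": 310},
    {"text": "الكتاب", "left": 150},
]


def search(words, feature, pattern):
    """Return the words whose given feature matches the regular expression.

    Attribute values are converted to strings first, so numeric attributes
    such as bounding box coordinates can be matched like text.
    """
    regex = re.compile(pattern)
    return [word for word in words if regex.search(str(word[feature]))]


# Words that start with the definite article.
with_article = search(WORDS, "text", r"^ال")

# Words whose left coordinate lies between 100 and 199.
near_left = search(WORDS, "left", r"^1\d\d$")

print([word["text"] for word in with_article])
print([word["text"] for word in near_left])
```

Treating attributes as strings keeps one query mechanism for everything, at the cost of expressing numeric ranges as digit patterns.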

Authors

Project

Fusus has been funded by the IT Research Innovation Fund.

It was developed between 2020-03-01 and 2021-03-01.

Correction, enrichment and alignment of the two Fusus editions was done from the end of the project till the end of 2021.

Docs

There is more documentation about sources, the research project, and how to use this software in the docs.