Docling Converter

1. Introduction

Docling Converter is a Streamlit web application that uses the Docling library to convert documents into Markdown, JSON, or YAML. Supported sources are PDF (with optional OCR), Word, HTML, PowerPoint, images, AsciiDoc, and Markdown.

The demo is available at: https://doclingconvert.streamlit.app/.

2. Features

• Multi-format input – PDF, DOCX, HTML, PPTX, images, AsciiDoc, and Markdown.
• Flexible output – choose between Markdown, JSON, or YAML.
• OCR support – extract text from scanned PDFs/images with one click.
• Adjustable image resolution – fine-tune the DPI multiplier (1.0-4.0).
• Streamlit UI – modern, reactive interface with instant previews & downloads.

3. Quick Start

# clone the repository
$ git clone https://github.com/hparreao/doclingconverter.git
$ cd doclingconverter

# create virtual environment (optional but recommended)
$ python -m venv venv && source venv/bin/activate

# install dependencies
$ pip install -r requirements.txt

# run the Streamlit server
$ streamlit run app.py

Navigate to http://localhost:8501 in your browser and start converting documents.

4. Installation

The project has only two runtime dependencies:

Docling  # heavy-lifting document conversion engine
Streamlit # frontend/UI framework

Both are automatically installed via requirements.txt.

For development you might also want:

pip install black isort flake8 pre-commit

5. Running the Application

| Command | Description |
| --- | --- |
| `streamlit run app.py` | Launch the local development server. |
| `streamlit run app.py --server.headless true` | Run headless (useful for remote/Docker deployments). |

Environment variables (all optional):

| Variable | Purpose | Default |
| --- | --- | --- |
| `DOC_CONVERTER_MAX_PAGES` | Override `AppConfig.MAX_PAGES` | `100` |
| `DOC_CONVERTER_MAX_FILE_SIZE` | Override `AppConfig.MAX_FILE_SIZE` (bytes) | `20971520` |
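
As an illustration of how such overrides could be read, here is a minimal sketch. The helper name `read_int_override` and its fallback behaviour (ignore unset, non-numeric, or non-positive values) are assumptions for this example, not necessarily what app.py does.

```python
import os
from typing import Mapping


def read_int_override(name: str, default: int, env: Mapping[str, str] = os.environ) -> int:
    """Return env[name] as a positive int, or `default` if unset or invalid."""
    raw = env.get(name)
    if raw is None:
        return default
    try:
        value = int(raw)
    except ValueError:
        # Non-numeric values are ignored rather than crashing the app.
        return default
    return value if value > 0 else default


# Defaults taken from the table above.
MAX_PAGES = read_int_override("DOC_CONVERTER_MAX_PAGES", 100)
MAX_FILE_SIZE = read_int_override("DOC_CONVERTER_MAX_FILE_SIZE", 20 * 1024 * 1024)
```

Passing the environment as a parameter keeps the helper easy to test without touching the real process environment.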

6. Usage Guide

1. Select the document type from the left sidebar.
2. Upload the file (max 20 MB; max 100 pages).
3. Pick the desired output format (Markdown / JSON / YAML).
4. Toggle OCR and adjust image resolution (if available).
5. Hit Start Conversion.
6. Download the generated file or inspect the preview inline.

7. Application Architecture

app.py        # Streamlit entry-point & main module
│
├── AppConfig                    # Centralised runtime configuration
├── DocumentConverterUI          # UI/UX helpers (layout & widgets)
├── DocumentProcessor            # Wrapper around Docling's DocumentConverter
└── handle_conversion_output()   # Result post-processing & download link

The heavy lifting is delegated to Docling's DocumentConverter (importable from docling.document_converter). This repository only configures the converter (pipelines, OCR, page limits) and wraps it in a Streamlit interface.

8. Programmatic API Reference

Below is a high-level overview of the public classes/functions you may import in your own scripts.

8.1 `AppConfig`

@dataclass
class AppConfig:
    SUPPORTED_TYPES: Dict[str, List[str]]
    OUTPUT_FORMATS: List[str]
    MAX_PAGES: int
    MAX_FILE_SIZE: int
    DEFAULT_IMAGE_SCALE: float

Configuration defaults controlling allowed extensions, limits, and UI presets. You can subclass it or override attributes on an instance at runtime.
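
A minimal sketch of both override styles follows. The class here is a stand-in that mirrors the field names above; the limits come from this README, while the extension lists and the `DEFAULT_IMAGE_SCALE` value are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class AppConfig:
    # Stand-in for the real class; extension lists below are assumptions.
    SUPPORTED_TYPES: Dict[str, List[str]] = field(
        default_factory=lambda: {"PDF": ["pdf"], "DOCX": ["docx"], "HTML": ["html", "htm"]}
    )
    OUTPUT_FORMATS: List[str] = field(default_factory=lambda: ["Markdown", "JSON", "YAML"])
    MAX_PAGES: int = 100
    MAX_FILE_SIZE: int = 20 * 1024 * 1024  # 20 MB in bytes
    DEFAULT_IMAGE_SCALE: float = 2.0  # assumed default, not confirmed by the source


# Override at construction time ...
strict = AppConfig(MAX_PAGES=10)

# ... or change an attribute on an existing instance at runtime.
config = AppConfig()
config.MAX_FILE_SIZE = 5 * 1024 * 1024
```

Using `default_factory` for the mutable fields gives every instance its own dictionary and list, so a runtime change to one config does not leak into another.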

8.2 `DocumentConverterUI`

Responsible for setting the Streamlit page parameters and rendering the widget tree.

Key methods:

• setup_page() – initialises page metadata and header.
• render_main_content() – returns a settings dictionary capturing all user-selected options (file type, OCR flag, resolution, etc.).

8.3 `DocumentProcessor`

class DocumentProcessor:
    @staticmethod
    @st.cache_resource
    def get_converter(use_ocr: bool = True) -> DocumentConverter: ...

    @staticmethod
    def process_document(file, settings: dict, config: AppConfig): ...

• get_converter() – creates (and caches) a docling.document_converter.DocumentConverter with customised pipelines:
    - PDFs ➜ StandardPdfPipeline + PyPdfiumDocumentBackend
    - DOCX/HTML/PPTX ➜ SimplePipeline
• process_document() – orchestrates the conversion, enforces MAX_PAGES and MAX_FILE_SIZE, and returns Docling's conversion result object (ConversionResult in current Docling releases).
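
The pipeline setup described above can be sketched roughly as follows. This is not the code from app.py: the function names `build_converter` and `convert_with_limits`, the `image_scale` default of 2.0, and the choice to let Docling itself enforce the limits through `convert()` are assumptions for illustration, and Streamlit caching is left out. Docling is imported inside the function so the module can be loaded without it.

```python
from pathlib import Path


def build_converter(use_ocr: bool = True, image_scale: float = 2.0):
    """Build a Docling converter with explicit PDF and DOCX pipelines."""
    # Deferred imports: Docling is heavy and only needed when converting.
    from docling.backend.pypdfium2_backend import PyPdfiumDocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import (
        DocumentConverter,
        PdfFormatOption,
        WordFormatOption,
    )
    from docling.pipeline.simple_pipeline import SimplePipeline
    from docling.pipeline.standard_pdf_pipeline import StandardPdfPipeline

    pdf_options = PdfPipelineOptions()
    pdf_options.do_ocr = use_ocr            # toggle OCR for scanned pages
    pdf_options.images_scale = image_scale  # resolution multiplier for page images

    return DocumentConverter(
        allowed_formats=[
            InputFormat.PDF,
            InputFormat.DOCX,
            InputFormat.HTML,
            InputFormat.PPTX,
        ],
        # Formats without an entry here fall back to Docling's default options.
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_cls=StandardPdfPipeline,
                backend=PyPdfiumDocumentBackend,
                pipeline_options=pdf_options,
            ),
            InputFormat.DOCX: WordFormatOption(pipeline_cls=SimplePipeline),
        },
    )


def convert_with_limits(path: Path, max_pages: int = 100, max_file_size: int = 20 * 1024 * 1024):
    """Convert one file, passing the page and size limits to Docling."""
    converter = build_converter()
    return converter.convert(path, max_num_pages=max_pages, max_file_size=max_file_size)
```

In the real application the converter would be built once and reused (for example via st.cache_resource), since constructing it for every upload is wasteful.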

8.4 `handle_conversion_output(result, settings, file)`

Formats the conversion result into the chosen output representation, injects a one-click download link, and renders an inline preview using Streamlit utilities.
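
The formatting step might look something like the sketch below. The helper names `serialize` and `download_name` are hypothetical, and the YAML branch assumes PyYAML is installed; only the three output formats themselves come from this README.

```python
import json
from pathlib import Path
from typing import Any, Dict

# Output format -> file extension for the download.
EXTENSIONS: Dict[str, str] = {"Markdown": "md", "JSON": "json", "YAML": "yaml"}


def serialize(markdown: str, data: Dict[str, Any], output_format: str) -> str:
    """Return the text to preview and download for the chosen format."""
    if output_format == "Markdown":
        return markdown
    if output_format == "JSON":
        return json.dumps(data, indent=2, ensure_ascii=False)
    if output_format == "YAML":
        import yaml  # PyYAML (third-party), imported only when needed

        return yaml.safe_dump(data, sort_keys=False, allow_unicode=True)
    raise ValueError(f"Unsupported output format: {output_format}")


def download_name(source_name: str, output_format: str) -> str:
    """Swap the uploaded file's extension for the output format's extension."""
    return f"{Path(source_name).stem}.{EXTENSIONS[output_format]}"
```

With a Docling result, `markdown` and `data` would typically come from `result.document.export_to_markdown()` and `result.document.export_to_dict()`, and the returned string can be handed to `st.download_button` together with `download_name(...)` as the file name.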

9. Extending the Converter

Want to add new formats or tweak pipelines? Follow these steps:

1. Import the corresponding InputFormat and FormatOption implementation from Docling.
2. Update DocumentProcessor.get_converter() by appending a new element to allowed_formats and its mapping in format_options.
3. (Optional) Extend AppConfig.SUPPORTED_TYPES to expose the new extension in the UI dropdown.

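Example for adding EPUB support (illustrative only: InputFormat.EPUB and EpubFormatOption are placeholder names, so check which InputFormat members and format options your installed Docling version actually provides):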

from docling.datamodel.base_models import InputFormat
from docling.document_converter import EpubFormatOption

# inside get_converter()
allowed_formats=[..., InputFormat.EPUB]
format_options={
    ...,
    InputFormat.EPUB: EpubFormatOption(),
}

10. Deployment

Docker (recommended)

# Dockerfile
FROM python:3.11-slim

WORKDIR /app
COPY . .
RUN pip install --no-cache-dir -r requirements.txt

EXPOSE 8501
CMD ["streamlit", "run", "app.py", "--server.headless", "true"]

Then build & run:

docker build -t docling-converter .
docker run -p 8501:8501 docling-converter

Streamlit Community Cloud

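1. Push your fork to GitHub.
2. Create a new app in Streamlit Community Cloud, then select the repo and app.py as the entry point.
3. Make sure requirements.txt is committed at the repository root so the dependencies are installed during the build.
4. Deploy. That's all!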

11. Contributing

Pull requests and issues are welcome! Please open a discussion if you plan major changes.

Development guidelines:

# lint & format
$ black . && isort . && flake8

# run app with hot-reload
$ streamlit run app.py

12. License

This project is licensed under the terms of the MIT License – see the LICENSE file for details.
