
Extract Structured Data from PDFs, Word Docs and Images. Embeddable directly into your application, regardless of the stack.


ikantkode/data-wizard

 
 


🪄 Data Wizard

Extract Structured Data from Any Document with LLMs

Data Wizard is an open-source tool designed to simplify and automate the extraction of structured data from unstructured documents like PDFs, Word files, and images using Large Language Models. Turn complex documents into validated, machine-readable JSON effortlessly.

Quick Start Guide | Documentation | Homepage | Made by Lukas Mateffy




Key Features

  • Powerful Extraction Engine: Define exactly the data structure you need using standard JSON Schema and Data Wizard will extract it from your documents.
  • LLM Agnostic: Works with various LLMs including OpenAI (GPT-4, GPT-3.5), Anthropic (Claude), Google AI (Gemini), Mistral AI, local models via Ollama/LMStudio, and more through OpenRouter. Powered by mateffy/llm-magic.
  • Multiple Extraction Strategies: Choose the best strategy (Simple, Sequential, Parallel, Auto-Merging, Double-Pass) for your document type and complexity, or create your own Custom Strategy.
  • Handles Diverse File Types: Process PDFs (native and scanned, via OCR), Word (DOCX), Excel (XLSX), images (PNG, JPG), and more.
  • Visual Context: Utilizes embedded images and page screenshots to provide crucial visual context to the LLM, improving accuracy.
  • Data Validation: Rigorously validates the LLM's output against your JSON Schema, ensuring clean, reliable, and immediately usable data.
  • Seamless Integration: Embed the UI easily via iFrame with a JavaScript API, or interact programmatically using the comprehensive RESTful and GraphQL APIs.
  • Open Source & Self-Hostable: Deploy easily using Docker for complete data control and privacy on your own infrastructure. Licensed under AGPL-3.0.

How it Works

  1. Configure Extractor: Define your desired output data structure using JSON Schema, select an LLM, and choose an Extraction Strategy.
  2. Upload Documents: Upload your files (PDFs, DOCX, images, etc.) via the UI, embedded iFrame, or programmatically via the API. Files are pre-processed to extract text and images.
  3. Get Structured Data: The chosen strategy directs the LLM interaction. The AI extracts the data, which is then validated against your schema. Receive clean JSON via the UI, webhook, or API.
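As an illustration of step 1, here is a minimal sketch of a JSON Schema for an invoice, written to a file from the shell. Every field name (invoice_number, issued_on, total, line_items) is a hypothetical choice for this example and not something Data Wizard prescribes; only the keywords themselves (type, required, minLength, minimum, items) are standard JSON Schema.

```shell
# Write a hypothetical invoice schema to disk. All field names are
# illustrative assumptions, not part of Data Wizard.
cat > invoice.schema.json <<'EOF'
{
  "type": "object",
  "required": ["invoice_number", "total", "line_items"],
  "properties": {
    "invoice_number": { "type": "string", "minLength": 1 },
    "issued_on": { "type": "string", "format": "date" },
    "total": { "type": "number", "minimum": 0 },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["description", "quantity"],
        "properties": {
          "description": { "type": "string", "minLength": 1 },
          "quantity": { "type": "integer", "minimum": 1 }
        }
      }
    }
  }
}
EOF
```

A schema like this could then be pasted into an extractor's configuration. Constraints such as required and minLength give the validation step something concrete to reject, for example an empty invoice number.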

Example Use Cases

  • Automate Data Entry: Extract data from invoices, receipts, and forms into ERP/accounting systems.
  • SaaS Smart Import: Allow users to upload documents to populate data in your CRM, SaaS app, etc.
  • Document Conversion: Turn batches of PDFs or scans into structured JSON/CSV.
  • Core Extraction Engine: Power your own document processing platform using Data Wizard's API.
  • Market Research: Gather product/pricing data from competitor materials.
  • Compliance Checks: Extract specific clauses or data points from contracts or reports.

Usage / Getting Started

The easiest way to run Data Wizard is with the pre-built Docker image.

1. Generate APP_KEY: Before running, generate a secure application key:

openssl rand -base64 32

Copy the generated key.
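Instead of copying the key by hand, it can be kept in a shell variable. This is a small sketch, assuming a POSIX-style shell with openssl available; the variable name APP_KEY is chosen here only to mirror the container's environment variable.

```shell
# Generate the key once and keep it in a shell variable for later commands.
APP_KEY="$(openssl rand -base64 32)"

# Sanity check: 32 random bytes encode to 44 Base64 characters.
echo "${#APP_KEY}"
```

With the variable set, the docker run command in step 2 could pass it as -e APP_KEY="$APP_KEY" in place of the placeholder. The quotes matter because Base64 output may contain characters such as + and /.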

2. Run Docker Container:

  • Using docker run:

    docker run \
      --name data-wizard \
      -p 9090:80 \
      -p 4430:443 \
      -p 4430:443/udp \
      -v data_wizard_storage:/app/storage \
      -v data_wizard_sqlite_data:/app/database \
      -v data_wizard_caddy_data:/data \
      -v data_wizard_caddy_config:/config \
      -e APP_KEY=<REPLACE_WITH_GENERATED_APP_KEY> \
      mateffy/data-wizard:latest

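    (Replace <REPLACE_WITH_GENERATED_APP_KEY> with the key from step 1. If the placeholder is left as-is, the shell interprets the angle brackets as redirections and the command fails.)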

  • Using docker-compose: Create a docker-compose.yml file:

    version: '3.8'
    
    services:
      data-wizard:
        container_name: data-wizard # Optional: Define a specific container name
        image: mateffy/data-wizard:latest
        ports:
          - "9090:80"
          - "4430:443"
          - "4430:443/udp"
        volumes:
          - data_wizard_storage:/app/storage
          - data_wizard_sqlite_data:/app/database
          - data_wizard_caddy_data:/data
          - data_wizard_caddy_config:/config
        environment:
          - APP_KEY=<REPLACE_WITH_GENERATED_APP_KEY> # Replace with your generated key
    
    volumes:
      data_wizard_storage:
      data_wizard_sqlite_data:
      data_wizard_caddy_data:
      data_wizard_caddy_config:

    Then run: docker-compose up -d
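To keep the key out of docker-compose.yml, one option is a .env file in the same directory, which Compose reads for variable substitution. The sketch below assumes you change the environment entry in the file above to APP_KEY=${APP_KEY}; that change is a suggestion for this example, not part of the original setup.

```shell
# Store a freshly generated key in a .env file next to docker-compose.yml.
# Compose reads .env from the project directory for ${VAR} substitution.
printf 'APP_KEY=%s\n' "$(openssl rand -base64 32)" > .env

# Restrict access, since the file now contains a secret.
chmod 600 .env
```

Keep .env out of version control. Generating the key only once matters here: re-running the command overwrites the file with a different key.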

3. Access Data Wizard: You can then access the application at https://localhost:4430. You may need to accept a self-signed certificate warning in your browser for local access.
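A quick way to check from the command line that the container is answering is a small curl wrapper. This is a sketch under assumptions: the function name check_data_wizard is made up for this example, and the default URL is the one from step 3. The --insecure flag skips certificate verification, which is only appropriate for the local self-signed certificate.

```shell
# Print the HTTP status code returned by the local Data Wizard instance.
# check_data_wizard is a hypothetical helper name; the default URL comes
# from the port mapping above (host 4430 -> container 443).
check_data_wizard() {
  curl --insecure --silent --show-error \
       --output /dev/null --write-out '%{http_code}\n' \
       "${1:-https://localhost:4430}"
}
```

Calling check_data_wizard with no arguments queries https://localhost:4430; pass a different URL as the first argument if you mapped another port.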

➡️ For more details, see the Quick Start Guide and Deployment Documentation.

Requirements for Manual Installation

Data Wizard is a Laravel application, so you'll need everything Laravel requires in order to run it. Most databases should work, but only SQLite and Postgres have been tested.

Data Wizard uses mateffy/llm-magic for LLM interaction and file data extraction.

For file extraction to work, uv must be installed on your machine and available on your PATH. You can also configure custom paths in the llm-magic.php config file. For more on this, see the llm-magic documentation.

While llm-magic uses a custom Python script to extract text and images from PDFs, Blaspsoft/doxswap is used for converting Word and other rich text documents to PDF beforehand. doxswap requires that LibreOffice is installed on your machine. You may need to set the LIBRE_OFFICE_PATH environment variable to the path of the soffice executable.
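Before a manual installation, it can help to confirm that the external tools mentioned above are reachable. The helper below is a minimal sketch; missing_tools is a hypothetical name, and the check only looks at PATH, so it cannot see a LibreOffice install that you point to through LIBRE_OFFICE_PATH.

```shell
# Print the name of every given tool that is not found on PATH.
# missing_tools is a hypothetical helper, not part of Data Wizard.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}
```

For example, missing_tools uv soffice prints nothing when both are on PATH. If soffice is listed as missing but LibreOffice is installed elsewhere, setting LIBRE_OFFICE_PATH as described above is the intended workaround.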

Thesis

This project was made as part of my 2025 BSc thesis at Leuphana University Lüneburg. The thesis is available here.

Screenshots


The standalone UI allows you to run extraction tasks manually, which also helps you evaluate and debug your extractor.

Create reusable extractors for different documents. The built-in extractor editor allows you to define the JSON Schema, configure extra instructions & the context window, as well as the extraction strategy.

Users can upload files via the UI. The files are pre-processed in the background, with the text and any embedded images being extracted from the PDF or Word file.

Easily embed Data Wizard in your app. Users can upload documents, edit JSON, and view results in a user-friendly interface.

The JSON output is validated against the JSON Schema, including rules like minLength or multipleOf.

Data Wizard is not limited to a single LLM provider. You can choose from a variety of LLMs, including GPT-4, Claude and Gemini.

Choose from a variety of extraction strategies, including simple, sequential, parallel and auto-merging.

Copyright and License

This project is made by Lukas Mateffy and is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

Contributing

At the moment, this project is not open for contributions. However, if you have ideas, bugs or suggestions, feel free to open an issue or start a discussion!

