🪄 Data Wizard

Extract Structured Data from Any Document with LLMs

Data Wizard is an open-source tool designed to simplify and automate the extraction of structured data from unstructured documents like PDFs, Word files, and images using Large Language Models. Turn complex documents into validated, machine-readable JSON effortlessly.

Quick Start Guide | Documentation | Homepage | Made by Lukas Mateffy

Key Features

Powerful Extraction Engine: Define exactly the data structure you need using standard JSON Schema and Data Wizard will extract it from your documents.
LLM Agnostic: Works with various LLMs including OpenAI (GPT-4, GPT-3.5), Anthropic (Claude), Google AI (Gemini), Mistral AI, local models via Ollama or LM Studio, and more through OpenRouter. Powered by mateffy/llm-magic.
Multiple Extraction Strategies: Choose the best strategy (Simple, Sequential, Parallel, Auto-Merging, Double-Pass) for your document type and complexity, or create your own Custom Strategy.
Handles Diverse File Types: Process PDFs (native & scanned w/ OCR), Word (DOCX), Excel (XLSX), images (PNG, JPG), and more.
Visual Context: Utilizes embedded images and page screenshots to provide crucial visual context to the LLM, improving accuracy.
Data Validation: Rigorously validates the LLM's output against your JSON Schema, ensuring clean, reliable, and immediately usable data.
Seamless Integration: Embed the UI easily via iFrame with a JavaScript API, or interact programmatically using the comprehensive RESTful and GraphQL APIs.
Open Source & Self-Hostable: Deploy easily using Docker for complete data control and privacy on your own infrastructure. Licensed under AGPL-3.0.
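
To make the schema-driven validation concrete, here is a minimal sketch in Python. It hand-rolls a small subset of JSON Schema (`type`, `required`, `properties`, `minLength`, `multipleOf`) using only the standard library. The invoice schema and the `validate` helper are hypothetical illustrations, not Data Wizard's actual validator.

```python
from decimal import Decimal

# Hypothetical schema for illustration only.
INVOICE_SCHEMA = {
    "type": "object",
    "required": ["invoice_number", "total"],
    "properties": {
        "invoice_number": {"type": "string", "minLength": 3},
        "total": {"type": "number", "multipleOf": 0.01},
    },
}

_TYPES = {
    "object": dict,
    "array": list,
    "string": str,
    "number": (int, float),
    "boolean": bool,
}


def _matches(value, type_name):
    # bool is a subclass of int in Python, so handle it before "number".
    if isinstance(value, bool):
        return type_name == "boolean"
    return isinstance(value, _TYPES[type_name])


def validate(value, schema, path="$"):
    """Return a list of error strings; an empty list means the value is valid."""
    type_name = schema.get("type")
    if type_name in _TYPES and not _matches(value, type_name):
        return [f"{path}: expected {type_name}"]

    errors = []
    if isinstance(value, str) and len(value) < schema.get("minLength", 0):
        errors.append(f"{path}: shorter than minLength {schema['minLength']}")
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    if "multipleOf" in schema and is_number:
        # Decimal avoids float rounding noise such as 19.99 % 0.01 != 0.
        if Decimal(str(value)) % Decimal(str(schema["multipleOf"])) != 0:
            errors.append(f"{path}: not a multiple of {schema['multipleOf']}")
    if isinstance(value, dict):
        for key in schema.get("required", []):
            if key not in value:
                errors.append(f"{path}.{key}: required property is missing")
        for key, sub_schema in schema.get("properties", {}).items():
            if key in value:
                errors.extend(validate(value[key], sub_schema, f"{path}.{key}"))
    return errors


good = validate({"invoice_number": "INV-001", "total": 19.99}, INVOICE_SCHEMA)
bad = validate({"invoice_number": "A", "total": 19.995}, INVOICE_SCHEMA)
```

Here `good` is an empty list, while `bad` holds one error per violated rule. A production setup would rely on a complete JSON Schema implementation rather than a subset like this.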

How it Works

Configure Extractor: Define your desired output data structure using JSON Schema, select an LLM, and choose an Extraction Strategy.
Upload Documents: Upload your files (PDFs, DOCX, images, etc.) via the UI, embedded iFrame, or programmatically via the API. Files are pre-processed to extract text and images.
Get Structured Data: The chosen strategy directs the LLM interaction. The AI extracts the data, which is then validated against your schema. Receive clean JSON via the UI, webhook, or API.
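
The three steps above can be sketched roughly as follows. This is a simplified illustration, not Data Wizard's internal code: the `Extractor` fields, `run_extraction`, and `fake_llm` are all hypothetical names, the model call is stubbed so the example runs offline, and validation is reduced to a check for required properties.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass
class Extractor:
    """Step 1: the extractor configuration (field names are hypothetical)."""
    schema: dict
    model: str
    strategy: str = "simple"


def run_extraction(extractor: Extractor, document_text: str,
                   llm: Callable[[str], str]) -> dict:
    """Steps 2 and 3: send pre-processed text to the LLM, then check the reply."""
    prompt = (
        "Return JSON matching this schema:\n"
        f"{json.dumps(extractor.schema)}\n\nDocument:\n{document_text}"
    )
    data = json.loads(llm(prompt))  # raises ValueError if the reply is not JSON
    missing = [key for key in extractor.schema.get("required", [])
               if key not in data]
    if missing:
        raise ValueError(f"missing required properties: {missing}")
    return data


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs without network access."""
    return '{"vendor": "ACME GmbH", "total": 119.0}'


extractor = Extractor(
    schema={"type": "object", "required": ["vendor", "total"]},
    model="example-model",  # placeholder, not a real model identifier
)
result = run_extraction(extractor, "Invoice from ACME GmbH, total 119.00 EUR", fake_llm)
```

Passing the model call in as a function keeps the flow independent of any particular provider, which mirrors the LLM-agnostic design described above.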

Example Use Cases

Automate Data Entry: Extract data from invoices, receipts, and forms into ERP/accounting systems.
SaaS Smart Import: Allow users to upload documents to populate data in your CRM, SaaS app, etc.
Document Conversion: Turn batches of PDFs or scans into structured JSON/CSV.
Core Extraction Engine: Power your own document processing platform using Data Wizard's API.
Market Research: Gather product/pricing data from competitor materials.
Compliance Checks: Extract specific clauses or data points from contracts or reports.
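
For the document conversion use case, extracted JSON records can be flattened to CSV with a few lines of standard-library Python. This is a minimal sketch that assumes each record is a flat object; `records_to_csv` and the sample records are hypothetical.

```python
import csv
import io


def records_to_csv(records):
    """Serialize a list of flat dicts to CSV text.

    The header is the union of all keys in first-seen order; keys missing
    from a record are written as empty cells.
    """
    fieldnames = []
    for record in records:
        for key in record:
            if key not in fieldnames:
                fieldnames.append(key)

    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical extraction results from two documents.
csv_text = records_to_csv([
    {"invoice_number": "INV-001", "total": 19.99},
    {"invoice_number": "INV-002", "vendor": "ACME GmbH"},
])
```

Nested objects or arrays in the extracted JSON would need to be flattened or serialized before this step.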

Usage / Getting Started

The easiest way to run Data Wizard is with the pre-built Docker image.

1. Generate APP_KEY: Before running, generate a secure application key:

openssl rand -base64 32

Copy the generated key.
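
If openssl is not available, a key of the same shape can be produced with Python's standard library. This is a minimal sketch; `generate_app_key` is a hypothetical helper name, and it mirrors the openssl command above by Base64-encoding 32 cryptographically secure random bytes.

```python
import base64
import secrets


def generate_app_key(num_bytes: int = 32) -> str:
    """Return num_bytes of secure randomness as a Base64 string."""
    return base64.b64encode(secrets.token_bytes(num_bytes)).decode("ascii")


print(generate_app_key())
```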

2. Run Docker Container:

Using docker run:

docker run \
  --name data-wizard \
  -p 9090:80 \
  -p 4430:443 \
  -p 4430:443/udp \
  -v data_wizard_storage:/app/storage \
  -v data_wizard_sqlite_data:/app/database \
  -v data_wizard_caddy_data:/data \
  -v data_wizard_caddy_config:/config \
  -e APP_KEY=<REPLACE_WITH_GENERATED_APP_KEY> \
  mateffy/data-wizard:latest

(Remember to replace <REPLACE_WITH_GENERATED_APP_KEY> with the key you generated in step 1.)

Using docker-compose: Create a docker-compose.yml file:

version: '3.8'

services:
  data-wizard:
    container_name: data-wizard # Optional: Define a specific container name
    image: mateffy/data-wizard:latest
    ports:
      - "9090:80"
      - "4430:443"
      - "4430:443/udp"
    volumes:
      - data_wizard_storage:/app/storage
      - data_wizard_sqlite_data:/app/database
      - data_wizard_caddy_data:/data
      - data_wizard_caddy_config:/config
    environment:
      - APP_KEY=<REPLACE_WITH_GENERATED_APP_KEY> # Replace with your generated key

volumes:
  data_wizard_storage:
  data_wizard_sqlite_data:
  data_wizard_caddy_data:
  data_wizard_caddy_config:

Then run: docker-compose up -d

3. Access Data Wizard: You can then access the application at https://localhost:4430. You may need to accept a self-signed certificate warning in your browser for local access.
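
To check from a script that the container is answering, the following sketch probes the HTTPS port while skipping certificate verification, which is what accepting the browser warning amounts to. The helper names are hypothetical, and disabling verification is only appropriate for a local instance like this one.

```python
import ssl
import urllib.error
import urllib.request


def local_tls_context() -> ssl.SSLContext:
    """Build a TLS context that accepts the local self-signed certificate."""
    context = ssl.create_default_context()
    context.check_hostname = False  # must be disabled before verify_mode
    context.verify_mode = ssl.CERT_NONE
    return context


def is_up(url: str = "https://localhost:4430", timeout: float = 5.0) -> bool:
    """Return True if the server answers with a non-5xx HTTP status."""
    try:
        with urllib.request.urlopen(
            url, timeout=timeout, context=local_tls_context()
        ) as response:
            return response.status < 500
    except urllib.error.HTTPError as error:
        # The server responded, just not with a 2xx/3xx status.
        return error.code < 500
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or TLS failure.
        return False
```

Calling `is_up()` shortly after `docker-compose up -d` may return False until the container has finished starting.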

➡️ For more details, see the Quick Start Guide and Deployment Documentation.

Requirements for Manual Installation

Data Wizard is a Laravel application, so you'll need everything that Laravel requires in order to run. Most databases should work, but SQLite and Postgres have been tested.

Data Wizard uses mateffy/llm-magic for LLM interaction and file data extraction.

For file extraction to work, uv must be installed on your machine and available on your PATH. Alternatively, you can configure custom paths in the llm-magic.php config file. For details, see the llm-magic documentation.

While llm-magic uses a custom Python script to extract text and images from PDFs, Blaspsoft/doxswap is used for converting Word and other rich text documents to PDF beforehand. doxswap requires that LibreOffice is installed on your machine. You may need to set the LIBRE_OFFICE_PATH environment variable to the path of the soffice executable.
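
Before a manual installation, it can help to confirm that both external tools are discoverable. The sketch below is an assumption-laden convenience check, not part of Data Wizard: `check_prerequisites` is a hypothetical helper that looks for uv on the PATH and for soffice either via LIBRE_OFFICE_PATH or on the PATH.

```python
import os
import shutil


def check_prerequisites(env=None):
    """Report where uv and LibreOffice's soffice were found (None if missing)."""
    env = os.environ if env is None else env
    search_path = env.get("PATH")

    # Prefer the explicit LIBRE_OFFICE_PATH, then fall back to a PATH lookup.
    soffice = env.get("LIBRE_OFFICE_PATH") or shutil.which("soffice", path=search_path)
    if soffice and not os.path.isfile(soffice):
        soffice = None

    return {
        "uv": shutil.which("uv", path=search_path),
        "soffice": soffice,
    }


for tool, location in check_prerequisites().items():
    print(f"{tool}: {location or 'NOT FOUND'}")
```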

Thesis

This project was made as part of my 2025 BSc thesis at Leuphana University Lüneburg. The thesis is available here.

Screenshots


The standalone UI allows you to run extraction tasks manually, which also helps with evaluating and debugging your extractor.
Create reusable extractors for different documents. The built-in extractor editor allows you to define the JSON Schema, configure extra instructions and the context window, and choose the extraction strategy.
Users can upload files via the UI. The files are pre-processed in the background, with the text and any embedded images being extracted from the PDF or Word file.
Easily embed Data Wizard in your app. Users can upload documents, edit JSON, and view results in a user-friendly interface.
The JSON output is validated against the JSON Schema, including rules like `minLength` or `multipleOf`.
Data Wizard is not limited to a single LLM provider. You can choose from a variety of LLMs, including GPT-4, Claude, and Gemini.
Choose from a variety of extraction strategies, including simple, sequential, parallel, and auto-merging.

Copyright and License

This project is made by Lukas Mateffy and is licensed under the GNU Affero General Public License v3.0 (AGPL-3.0).

Contributing

At the moment, this project is not open for contributions. However, if you have ideas, bugs or suggestions, feel free to open an issue or start a discussion!
