
Commit

specified docs status while switching backend services
emcf committed Jul 8, 2024
1 parent 17c4fc8 commit fa9e3ff
Showing 9 changed files with 265 additions and 165 deletions.
49 changes: 25 additions & 24 deletions README.md
@@ -20,72 +20,73 @@
 </a>
 </div>
 
-### Extract markdown and visuals from PDFs URLs, slides, videos, and more, ready for multimodal LLMs. ⚡
+### Extract markdown and visuals from PDFs, URLs, docs, slides, videos, and more, ready for multimodal LLMs. ⚡
 
-thepi.pe is an AI-native scraping engine that generates LLM-ready markdown and visuals from any document, media, or web page. It is built for multimodal language models such as GPT-4o, and works out-of-the-box with any LLM or vector database. thepi.pe is available as a [hosted API](https://thepi.pe), or it can be self-hosted.
+thepi.pe is an API that can scrape multimodal data via `thepipe.scrape` or extract structured data via `thepipe.extract` from a wide range of sources. It is built to interface with LLMs such as GPT-4o, and works out-of-the-box with any LLM or vector database. thepi.pe can be used right away on a [hosted GPU cloud](https://thepi.pe), or it can be self-hosted.
 
 ## Features 🌟
 
-- Extract clean markdown, tables, and images from any document or web page
-- Output works out-of-the-box with all multimodal LLMs and RAG frameworks
-- GPU-accelerated AI layout analysis, chunking, and structured data extraction
-- Quick-start integrations for web data like Twitter, YouTube, GitHub, and more
-- Self-hosted or hosted API options available
+- Extract markdown, tables, and images from any document or webpage
+- Extract complex structured data from any document or webpage
+- Works out-of-the-box with all LLMs and RAG frameworks
+- AI-native filetype detection, layout analysis, and structured data extraction
+- Multimodal scraping for video, audio, and image sources

 ## Get started in 5 minutes 🚀
 
 thepi.pe can read a wide range of filetypes and web sources, so it requires a few dependencies. It also requires a strong machine (16GB+ VRAM for optimal PDF & video response times) for AI extraction features. For these reasons, we host a REST API that works out-of-the-box at [thepi.pe](https://thepi.pe).
 
 ### Hosted API (Python)
 
+> ⚠️ **Warning:** the docs and functionality in this repo differ significantly from the current working version on pip. To use a working version, please refer to the [API docs](https://thepi.pe/docs), not these docs.

 ```bash
 pip install thepipe-api
 setx THEPIPE_API_KEY your_api_key
 setx OPENAI_API_KEY your_openai_key
 ```

 ```python
-import thepipe
+from thepipe.scraper import scrape_file
 from openai import OpenAI
 
-# scrape markdown + images
-chunks = thepipe.scrape(source="example.pdf")
+# scrape markdown, tables, visuals
+chunks = scrape_file(filepath="paper.pdf")
 
-# call LLM
+# call LLM with clean, comprehensive data
 client = OpenAI()
 response = client.chat.completions.create(
     model="gpt-4o",
     messages=thepipe.chunks_to_messages(chunks),
 )
 ```
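
Note that the snippet above still references `thepipe.chunks_to_messages` even though `import thepipe` was dropped; a minimal fix, assuming `chunks_to_messages` lives in `thepipe.core` as `tests/test_core.py` in this commit suggests:

```python
# chunks_to_messages is imported from thepipe.core in tests/test_core.py;
# pulling it in directly keeps the snippet above runnable.
from thepipe.core import chunks_to_messages

messages = chunks_to_messages(chunks)
```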

-### Local Installation
+### Local Installation (Python)
 
 For a local installation, you can use the following command:
 
 ```bash
 pip install thepipe-api[local]
 ```
 
-```python
-import thepipe
-from openai import OpenAI
-
-# scrape markdown + images
-chunks = thepipe.scrape_file(source="example.pdf", local=True)
-```
+Then append `local=True` to your API calls:
+
+```python
+chunks = scrape_url(url="https://example.com", local=True)
+```
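
Assembled into a runnable whole (the `scrape_url` import path follows the `thepipe/__init__.py` change in this commit):

```python
# Local scraping sketch; scrape_url is exported from thepipe.scraper
# according to thepipe/__init__.py below.
from thepipe.scraper import scrape_url

chunks = scrape_url(url="https://example.com", local=True)
```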

 You can also use The Pipe from the command line:
 
 ```bash
 thepipe path/to/folder --include_regex .*\.tsx
 ```

 ## Supported File Types 📚
 
 | Source Type | Input types | Multimodal Scraping | Notes |
 |-------------|-------------|---------------------|-------|
-| Webpage | URLs starting with `http`, `https`, `ftp` | ✔️ | Scrapes markdown, images, and tables from web pages |
-| PDF | `.pdf` | ✔️ | Extracts page markdown and page images. Opt-in `ai_extraction` for advanced layout analysis (extracts markdown, LaTeX equations, tables, and figures) |
+| Webpage | URLs starting with `http`, `https`, `ftp` | ✔️ | Scrapes markdown, images, and tables from web pages. `ai_extraction` available for AI layout analysis |
+| PDF | `.pdf` | ✔️ | Extracts page markdown and page images. `ai_extraction` available for AI layout analysis |
 | Word Document | `.docx` | ✔️ | Extracts text, tables, and images |
 | PowerPoint | `.pptx` | ✔️ | Extracts text and images from slides |
 | Video | `.mp4`, `.mov`, `.wmv` | ✔️ | Uses Whisper for transcription and extracts frames |
@@ -102,7 +103,7 @@ thepipe path/to/folder --include_regex .*\.tsx

 ## How it works 🛠️
 
-thepi.pe uses computer vision models and heuristics to extract clean content from the source and process it for downstream use with [language models](https://en.wikipedia.org/wiki/Large_language_model), or [vision transformers](https://en.wikipedia.org/wiki/Vision_transformer). The output from thepi.pe is a prompt (a list of messages) containing all content from the source document. The messages returned should look like this:
+thepi.pe uses computer vision models and heuristics to extract clean content from the source and process it for downstream use with [language models](https://en.wikipedia.org/wiki/Large_language_model) or [vision transformers](https://en.wikipedia.org/wiki/Vision_transformer). The output from thepi.pe is a list of chunks containing all content within the source document. These chunks can easily be converted to a prompt format compatible with any LLM or multimodal model via `thepipe.chunks_to_messages`, which gives the following format:
 ```json
 [
   {
@@ -123,10 +124,10 @@ thepi.pe uses computer vision models and heuristics to extract clean content fro
 ]
 ```

-You can feed these messages directly into the model, or you can use `thepipe_api.chunk_by_page`, `thepipe_api.chunk_by_section`, `thepipe_api.chunk_semantic` to chunk these messages for a vector database such as ChromaDB or a RAG framework (a chunk can be converted to LlamaIndex Document/ImageDocument with `.to_llamaindex`).
+You can feed these messages directly into the model, or you can use `thepipe_api.chunk_by_document`, `thepipe_api.chunk_by_page`, `thepipe_api.chunk_by_section`, or `thepipe_api.chunk_semantic` to chunk these messages for a vector database such as ChromaDB or a RAG framework. A chunk can be converted to a LlamaIndex Document/ImageDocument with `.to_llamaindex`.
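
As an illustration, a sketch of the chunking path, assuming the exports listed in this commit's `thepipe/__init__.py` (note the README's `thepipe_api` prefix versus the package's `thepipe` module name):

```python
# Sketch: section-level chunking for a RAG pipeline. Module paths follow
# thepipe/__init__.py in this commit; treat the keyword names as assumptions.
from thepipe.scraper import scrape_file
from thepipe.chunker import chunk_by_section

chunks = scrape_file(filepath="paper.pdf", local=True)
sections = chunk_by_section(chunks)

# Each chunk converts to LlamaIndex Document/ImageDocument objects for indexing
documents = [doc for chunk in sections for doc in chunk.to_llamaindex()]
```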

 > ⚠️ **It is important to be mindful of your model's token limit.**
-GPT-4o does not work with too many images in the prompt (see discussion [here](https://community.openai.com/t/gpt-4-vision-maximum-amount-of-images/573110/6)). Large documents should be extracted with `text_only=True` to avoid this issue, or alternatively they can be chunked and saved into a vector database or RAG framework.
+GPT-4o does not work with too many images in the prompt (see discussion [here](https://community.openai.com/t/gpt-4-vision-maximum-amount-of-images/573110/6)). To remedy this issue, either use an LLM with a larger context window, extract large documents with `text_only=True`, or embed the chunks into a vector database.
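
For large documents, a minimal sketch (treating `text_only` as a keyword accepted by the scraper functions, per the note above):

```python
from thepipe.scraper import scrape_file

# text_only skips page images so image-limited models stay within bounds
# (the exact keyword signature is an assumption based on the warning above)
chunks = scrape_file(filepath="large_report.pdf", text_only=True, local=True)
```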

# Sponsors

3 changes: 2 additions & 1 deletion requirements.txt
@@ -4,4 +4,5 @@ charset-normalizer
 colorama
 requests
 pillow
-pydantic
+pydantic
+supabase
2 changes: 1 addition & 1 deletion tests/test_chunker.py
@@ -3,7 +3,7 @@
 import sys
 from typing import List
 sys.path.append('..')
-from thepipe import chunker
+import thepipe.chunker as chunker
 from thepipe.core import Chunk
 
 class test_chunker(unittest.TestCase):
8 changes: 4 additions & 4 deletions tests/test_core.py
@@ -4,8 +4,8 @@
 import os
 import sys
 sys.path.append('..')
-from thepipe import core
-from thepipe import scraper
+import thepipe.core as core
+import thepipe.scraper as scraper
 from PIL import Image
 from io import BytesIO
 
@@ -28,7 +28,7 @@ def test_chunk_to_llamaindex(self):
         self.assertEqual(len(llama_index), 1)
 
     def test_chunks_to_messages(self):
-        chunks = scraper.scrape_file(source=self.files_directory+"/example.md", local=True)
+        chunks = scraper.scrape_file(filepath=self.files_directory+"/example.md", local=True)
         messages = core.chunks_to_messages(chunks)
         self.assertEqual(type(messages), list)
         for message in messages:
 
@@ -44,7 +44,7 @@ def test_save_outputs(self):
             text = file.read()
         self.assertIn('Hello, World!', text)
         # verify with images
-        chunks = scraper.scrape_file(source=self.files_directory+"/example.jpg", local=True)
+        chunks = scraper.scrape_file(filepath=self.files_directory+"/example.jpg", local=True)
         core.save_outputs(chunks)
         self.assertTrue(any('.jpg' in f for f in os.listdir(self.outputs_directory)))
6 changes: 3 additions & 3 deletions tests/test_scraper.py
@@ -2,8 +2,8 @@
 import os
 import sys
 sys.path.append('..')
-from thepipe import core
-from thepipe import scraper
+import thepipe.core as core
+import thepipe.scraper as scraper
 
 class test_scraper(unittest.TestCase):
     def setUp(self):
 
@@ -83,7 +83,7 @@ def test_scrape_audio(self):
         self.assertTrue(any('citizens' in chunk.texts[0].lower() for chunk in chunks if chunk.texts is not None))
 
     def test_scrape_video(self):
-        chunks = scraper.scrape_file(source=self.files_directory+"/example.mp4", verbose=True, local=True)
+        chunks = scraper.scrape_file(self.files_directory+"/example.mp4", verbose=True, local=True)
         # verify it scraped the video file into chunks
         self.assertEqual(type(chunks), list)
         self.assertNotEqual(len(chunks), 0)
3 changes: 1 addition & 2 deletions thepipe/__init__.py
@@ -1,7 +1,6 @@
 import os
 from .scraper import scrape_file, scrape_url, scrape_directory
 from .chunker import chunk_by_document, chunk_by_page, chunk_by_section, chunk_semantic
-from .core import Chunk, calculate_tokens, chunks_to_messages, parse_arguments, save_outputs
+from .core import parse_arguments, save_outputs
 
 def main() -> None:
     args = parse_arguments()
4 changes: 2 additions & 2 deletions thepipe/chunker.py
@@ -1,6 +1,6 @@
 import re
-from typing import Dict, List, Optional, Tuple
-from .core import Chunk, calculate_tokens
+from typing import List
+from .core import Chunk
 from sklearn.metrics.pairwise import cosine_similarity
 
 def chunk_by_document(chunks: List[Chunk]) -> List[Chunk]:
39 changes: 31 additions & 8 deletions thepipe/core.py
@@ -1,8 +1,8 @@
 import argparse
 import base64
-from io import BytesIO
 import json
 import os
+import re
 import time
 from typing import Dict, List, Optional, Union
 import requests
 
@@ -26,20 +26,43 @@ def to_llamaindex(self) -> List[Union[Document, ImageDocument]]:
         else:
             return [Document(text=document_text)]
 
-    def to_message(self, host_images: bool = False, max_resolution : Optional[int] = None) -> Dict:
+    def to_message(self, host_images: bool = False, max_resolution: Optional[int] = None) -> Dict:
         message = {"role": "user", "content": []}
-        if self.texts:
-            prompt = "\n```\n" + '\n'.join(self.texts) + "\n```\n"
-            message["content"].append({"type": "text", "text": prompt})
-        for image in self.images:
-            image_url = make_image_url(image, host_images, max_resolution)
+        image_urls = [make_image_url(image, host_images, max_resolution) for image in self.images]
+
+        message_text = "\n\n"
+        img_index = 0
+
+        for text in self.texts:
+            if host_images:
+                def replace_image(match):
+                    nonlocal img_index
+                    if img_index < len(image_urls):
+                        url = image_urls[img_index]
+                        img_index += 1
+                        return f"![image]({url})"
+                    return match.group(0)  # If we run out of images, leave the original text
+
+                # Replace markdown image references with hosted URLs
+                text = re.sub(r'!\[([^\]]*)\]\([^\)]+\)', replace_image, text)
+
+            message_text += text + "\n\n"
+
+        # clean up, add to message
+        message_text = re.sub(r'\n{3,}', '\n\n', message_text).strip()
+        message["content"].append({"type": "text", "text": message_text})
+
+        # Add remaining images that weren't referenced in the text
+        for image_url in image_urls:
+            message["content"].append({"type": "image_url", "image_url": image_url})
+
         return message
 
     def to_json(self, host_images: bool = False) -> Dict:
         data = {
             'path': self.path,
-            'texts': self.texts,
+            'texts': [text.strip() for text in self.texts],
             'images': [make_image_url(image=image, host_images=host_images) for image in self.images],
             'audios': self.audios,
             'videos': self.videos,
 
@@ -61,7 +84,7 @@ def from_json(data: Dict, host_images: bool = False) -> 'Chunk':
             images.append(image)
         return Chunk(
             path=data['path'],
-            texts=data['texts'],
+            texts=[text.strip() for text in data['texts']],
             images=images,
             audios=data['audios'],
             videos=data['videos'],
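
To make the new image-reference handling concrete, here is a standalone sketch of the same regex-substitution technique `to_message` uses above (the function name is hypothetical; the pattern and counter mirror the diff):

```python
import re
from typing import List

def replace_markdown_images(text: str, image_urls: List[str]) -> str:
    """Swap markdown image references for hosted URLs in order, leaving
    any references beyond the available URLs untouched."""
    img_index = 0

    def replace_image(match) -> str:
        nonlocal img_index
        if img_index < len(image_urls):
            url = image_urls[img_index]
            img_index += 1
            return f"![image]({url})"
        return match.group(0)  # ran out of hosted URLs; keep the original reference

    return re.sub(r'!\[([^\]]*)\]\([^\)]+\)', replace_image, text)

print(replace_markdown_images("See ![diagram](local/fig.png).", ["https://example.com/fig.png"]))
# -> See ![image](https://example.com/fig.png).
```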