
Commit

specified docs status while switching backend services
emcf committed Jul 8, 2024
1 parent 17c4fc8 commit fa9e3ff
Showing 9 changed files with 265 additions and 165 deletions.
49 changes: 25 additions & 24 deletions README.md
@@ -20,72 +20,73 @@
 </a>
 </div>
 
-### Extract markdown and visuals from PDFs URLs, slides, videos, and more, ready for multimodal LLMs. ⚡
+### Extract markdown and visuals from PDFs, URLs, docs, slides, videos, and more, ready for multimodal LLMs. ⚡
 
-thepi.pe is an AI-native scraping engine that generates LLM-ready markdown and visuals from any document, media, or web page. It is built for multimodal language models such as GPT-4o, and works out-of-the-box with any LLM or vector database. thepi.pe is available as a [hosted API](https://thepi.pe), or it can be self-hosted.
+thepi.pe is an API that can scrape multimodal data via `thepipe.scrape` or extract structured data via `thepipe.extract` from a wide range of sources. It is built to interface with LLMs such as GPT-4o, and works out-of-the-box with any LLM or vector database. thepi.pe can be used right away on a [hosted GPU cloud](https://thepi.pe), or it can be self-hosted.
 
 ## Features 🌟
 
-- Extract clean markdown, tables, and images from any document or web page
-- Output works out-of-the-box with all multimodal LLMs and RAG frameworks
-- GPU-accelerated AI layout analysis, chunking, and structured data extraction
-- Quick-start integrations for web data like Twitter, YouTube, GitHub, and more
-- Self-hosted or hosted API options available
+- Extract markdown, tables, and images from any document or webpage
+- Extract complex structured data from any document or webpage
+- Works out-of-the-box with all LLMs and RAG frameworks
+- AI-native filetype detection, layout analysis, and structured data extraction
+- Multimodal scraping for video, audio, and image sources

 ## Get started in 5 minutes 🚀
 
 thepi.pe can read a wide range of filetypes and web sources, so it requires a few dependencies. It also requires a strong machine (16GB+ VRAM for optimal PDF & video response times) for AI extraction features. For these reasons, we host a REST API that works out-of-the-box at [thepi.pe](https://thepi.pe).
 
 ### Hosted API (Python)
 
+> ⚠️ **Warning:** the docs and functionality in this repo differ significantly from the current working version on pip. To use a working version, please refer to the [API docs](https://thepi.pe/docs), not these docs.

 ```bash
 pip install thepipe-api
 setx THEPIPE_API_KEY your_api_key
 setx OPENAI_API_KEY your_openai_key
 ```

 ```python
-import thepipe
+from thepipe.scraper import scrape_file
 from openai import OpenAI
 
-# scrape markdown + images
-chunks = thepipe.scrape(source="example.pdf")
+# scrape markdown, tables, visuals
+chunks = scrape_file(filepath="paper.pdf")
 
-# call LLM
+# call LLM with clean, comprehensive data
 client = OpenAI()
 response = client.chat.completions.create(
     model="gpt-4o",
     messages=thepipe.chunks_to_messages(chunks),
 )
 ```
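
Note that the snippet above still references `thepipe.chunks_to_messages` even though `import thepipe` was dropped; a minimal fix, assuming `chunks_to_messages` lives in `thepipe.core` as `tests/test_core.py` in this commit suggests:

```python
# chunks_to_messages is imported from thepipe.core in tests/test_core.py;
# pulling it in directly keeps the snippet above runnable.
from thepipe.core import chunks_to_messages

messages = chunks_to_messages(chunks)
```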

-### Local Installation
+### Local Installation (Python)
 
 For a local installation, you can use the following command:
 
 ```bash
 pip install thepipe-api[local]
 ```
 
-```python
-import thepipe
-from openai import OpenAI
-
-# scrape markdown + images
-chunks = thepipe.scrape_file(source="example.pdf", local=True)
-```
+Then append `local=True` to your API calls:
+
+```python
+chunks = scrape_url(url="https://example.com", local=True)
+```
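
Assembled into a runnable whole (the `scrape_url` import path follows the `thepipe/__init__.py` change in this commit):

```python
# Local scraping sketch; scrape_url is exported from thepipe.scraper
# according to thepipe/__init__.py below.
from thepipe.scraper import scrape_url

chunks = scrape_url(url="https://example.com", local=True)
```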

 You can also use The Pipe from the command line:
 
 ```bash
 thepipe path/to/folder --include_regex .*\.tsx
 ```

 ## Supported File Types 📚
 
 | Source Type | Input types | Multimodal Scraping | Notes |
 |-------------|-------------|---------------------|-------|
-| Webpage | URLs starting with `http`, `https`, `ftp` | ✔️ | Scrapes markdown, images, and tables from web pages |
-| PDF | `.pdf` | ✔️ | Extracts page markdown and page images. Opt-in `ai_extraction` for advanced layout analysis (extracts markdown, LaTeX equations, tables, and figures) |
+| Webpage | URLs starting with `http`, `https`, `ftp` | ✔️ | Scrapes markdown, images, and tables from web pages. `ai_extraction` available for AI layout analysis |
+| PDF | `.pdf` | ✔️ | Extracts page markdown and page images. `ai_extraction` available for AI layout analysis |
 | Word Document | `.docx` | ✔️ | Extracts text, tables, and images |
 | PowerPoint | `.pptx` | ✔️ | Extracts text and images from slides |
 | Video | `.mp4`, `.mov`, `.wmv` | ✔️ | Uses Whisper for transcription and extracts frames |
@@ -102,7 +103,7 @@ thepipe path/to/folder --include_regex .*\.tsx

 ## How it works 🛠️
 
-thepi.pe uses computer vision models and heuristics to extract clean content from the source and process it for downstream use with [language models](https://en.wikipedia.org/wiki/Large_language_model), or [vision transformers](https://en.wikipedia.org/wiki/Vision_transformer). The output from thepi.pe is a prompt (a list of messages) containing all content from the source document. The messages returned should look like this:
+thepi.pe uses computer vision models and heuristics to extract clean content from the source and process it for downstream use with [language models](https://en.wikipedia.org/wiki/Large_language_model) or [vision transformers](https://en.wikipedia.org/wiki/Vision_transformer). The output from thepi.pe is a list of chunks containing all content within the source document. These chunks can easily be converted to a prompt format compatible with any LLM or multimodal model via `thepipe.chunks_to_messages`, which gives the following format:
 ```json
 [
   {
@@ -123,10 +124,10 @@ thepi.pe uses computer vision models and heuristics to extract clean content fro
 ]
 ```

-You can feed these messages directly into the model, or you can use `thepipe_api.chunk_by_page`, `thepipe_api.chunk_by_section`, `thepipe_api.chunk_semantic` to chunk these messages for a vector database such as ChromaDB or a RAG framework (a chunk can be converted to LlamaIndex Document/ImageDocument with `.to_llamaindex`).
+You can feed these messages directly into the model, or you can use `thepipe_api.chunk_by_document`, `thepipe_api.chunk_by_page`, `thepipe_api.chunk_by_section`, or `thepipe_api.chunk_semantic` to chunk these messages for a vector database such as ChromaDB or a RAG framework. A chunk can be converted to a LlamaIndex Document/ImageDocument with `.to_llamaindex`.
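
As an illustration, a sketch of the chunking path, assuming the exports listed in this commit's `thepipe/__init__.py` (note the README's `thepipe_api` prefix versus the package's `thepipe` module name):

```python
# Sketch: section-level chunking for a RAG pipeline. Module paths follow
# thepipe/__init__.py in this commit; treat the keyword names as assumptions.
from thepipe.scraper import scrape_file
from thepipe.chunker import chunk_by_section

chunks = scrape_file(filepath="paper.pdf", local=True)
sections = chunk_by_section(chunks)

# Each chunk converts to LlamaIndex Document/ImageDocument objects for indexing
documents = [doc for chunk in sections for doc in chunk.to_llamaindex()]
```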

 > ⚠️ **It is important to be mindful of your model's token limit.**
-GPT-4o does not work with too many images in the prompt (see discussion [here](https://community.openai.com/t/gpt-4-vision-maximum-amount-of-images/573110/6)). Large documents should be extracted with `text_only=True` to avoid this issue, or alternatively they can be chunked and saved into a vector database or RAG framework.
+GPT-4o does not work with too many images in the prompt (see discussion [here](https://community.openai.com/t/gpt-4-vision-maximum-amount-of-images/573110/6)). To remedy this issue, either use an LLM with a larger context window, extract large documents with `text_only=True`, or embed the chunks into a vector database.
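
For large documents, a minimal sketch (treating `text_only` as a keyword accepted by the scraper functions, per the note above):

```python
from thepipe.scraper import scrape_file

# text_only skips page images so image-limited models stay within bounds
# (the exact keyword signature is an assumption based on the warning above)
chunks = scrape_file(filepath="large_report.pdf", text_only=True, local=True)
```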

# Sponsors

3 changes: 2 additions & 1 deletion requirements.txt
@@ -4,4 +4,5 @@ charset-normalizer
 colorama
 requests
 pillow
-pydantic
+pydantic
+supabase
2 changes: 1 addition & 1 deletion tests/test_chunker.py
@@ -3,7 +3,7 @@
 import sys
 from typing import List
 sys.path.append('..')
-from thepipe import chunker
+import thepipe.chunker as chunker
 from thepipe.core import Chunk
 
 class test_chunker(unittest.TestCase):
8 changes: 4 additions & 4 deletions tests/test_core.py
@@ -4,8 +4,8 @@
 import os
 import sys
 sys.path.append('..')
-from thepipe import core
-from thepipe import scraper
+import thepipe.core as core
+import thepipe.scraper as scraper
 from PIL import Image
 from io import BytesIO
 
@@ -28,7 +28,7 @@ def test_chunk_to_llamaindex(self):
         self.assertEqual(len(llama_index), 1)
 
     def test_chunks_to_messages(self):
-        chunks = scraper.scrape_file(source=self.files_directory+"/example.md", local=True)
+        chunks = scraper.scrape_file(filepath=self.files_directory+"/example.md", local=True)
         messages = core.chunks_to_messages(chunks)
         self.assertEqual(type(messages), list)
         for message in messages:
 
@@ -44,7 +44,7 @@ def test_save_outputs(self):
             text = file.read()
         self.assertIn('Hello, World!', text)
         # verify with images
-        chunks = scraper.scrape_file(source=self.files_directory+"/example.jpg", local=True)
+        chunks = scraper.scrape_file(filepath=self.files_directory+"/example.jpg", local=True)
         core.save_outputs(chunks)
         self.assertTrue(any('.jpg' in f for f in os.listdir(self.outputs_directory)))
6 changes: 3 additions & 3 deletions tests/test_scraper.py
@@ -2,8 +2,8 @@
 import os
 import sys
 sys.path.append('..')
-from thepipe import core
-from thepipe import scraper
+import thepipe.core as core
+import thepipe.scraper as scraper
 
 class test_scraper(unittest.TestCase):
     def setUp(self):
 
@@ -83,7 +83,7 @@ def test_scrape_audio(self):
         self.assertTrue(any('citizens' in chunk.texts[0].lower() for chunk in chunks if chunk.texts is not None))
 
     def test_scrape_video(self):
-        chunks = scraper.scrape_file(source=self.files_directory+"/example.mp4", verbose=True, local=True)
+        chunks = scraper.scrape_file(self.files_directory+"/example.mp4", verbose=True, local=True)
         # verify it scraped the video file into chunks
         self.assertEqual(type(chunks), list)
         self.assertNotEqual(len(chunks), 0)
3 changes: 1 addition & 2 deletions thepipe/__init__.py
@@ -1,7 +1,6 @@
 import os
 from .scraper import scrape_file, scrape_url, scrape_directory
 from .chunker import chunk_by_document, chunk_by_page, chunk_by_section, chunk_semantic
-from .core import Chunk, calculate_tokens, chunks_to_messages, parse_arguments, save_outputs
+from .core import parse_arguments, save_outputs
 
 def main() -> None:
     args = parse_arguments()
4 changes: 2 additions & 2 deletions thepipe/chunker.py
@@ -1,6 +1,6 @@
 import re
-from typing import Dict, List, Optional, Tuple
-from .core import Chunk, calculate_tokens
+from typing import List
+from .core import Chunk
 from sklearn.metrics.pairwise import cosine_similarity
 
 def chunk_by_document(chunks: List[Chunk]) -> List[Chunk]:
39 changes: 31 additions & 8 deletions thepipe/core.py
@@ -1,8 +1,8 @@
 import argparse
 import base64
-from io import BytesIO
 import json
 import os
+import re
 import time
 from typing import Dict, List, Optional, Union
 import requests
 
@@ -26,20 +26,43 @@ def to_llamaindex(self) -> List[Union[Document, ImageDocument]]:
         else:
             return [Document(text=document_text)]
 
-    def to_message(self, host_images: bool = False, max_resolution : Optional[int] = None) -> Dict:
+    def to_message(self, host_images: bool = False, max_resolution: Optional[int] = None) -> Dict:
         message = {"role": "user", "content": []}
-        if self.texts:
-            prompt = "\n```\n" + '\n'.join(self.texts) + "\n```\n"
-            message["content"].append({"type": "text", "text": prompt})
-        for image in self.images:
-            image_url = make_image_url(image, host_images, max_resolution)
+        image_urls = [make_image_url(image, host_images, max_resolution) for image in self.images]
+
+        message_text = "\n\n"
+        img_index = 0
+
+        for text in self.texts:
+            if host_images:
+                def replace_image(match):
+                    nonlocal img_index
+                    if img_index < len(image_urls):
+                        url = image_urls[img_index]
+                        img_index += 1
+                        return f"![image]({url})"
+                    return match.group(0)  # If we run out of images, leave the original text
+
+                # Replace markdown image references with hosted URLs
+                text = re.sub(r'!\[([^\]]*)\]\([^\)]+\)', replace_image, text)
+
+            message_text += text + "\n\n"
+
+        # clean up, add to message
+        message_text = re.sub(r'\n{3,}', '\n\n', message_text).strip()
+        message["content"].append({"type": "text", "text": message_text})
+
+        # Add remaining images that weren't referenced in the text
+        for image_url in image_urls:
+            message["content"].append({"type": "image_url", "image_url": image_url})
+
         return message
 
     def to_json(self, host_images: bool = False) -> Dict:
         data = {
             'path': self.path,
-            'texts': self.texts,
+            'texts': [text.strip() for text in self.texts],
             'images': [make_image_url(image=image, host_images=host_images) for image in self.images],
             'audios': self.audios,
             'videos': self.videos,
 
@@ -61,7 +84,7 @@ def from_json(data: Dict, host_images: bool = False) -> 'Chunk':
             images.append(image)
         return Chunk(
             path=data['path'],
-            texts=data['texts'],
+            texts=[text.strip() for text in data['texts']],
             images=images,
             audios=data['audios'],
             videos=data['videos'],
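
To make the new image-reference handling concrete, here is a standalone sketch of the same regex-substitution technique `to_message` uses above (the function name is hypothetical; the pattern and counter mirror the diff):

```python
import re
from typing import List

def replace_markdown_images(text: str, image_urls: List[str]) -> str:
    """Swap markdown image references for hosted URLs in order, leaving
    any references beyond the available URLs untouched."""
    img_index = 0

    def replace_image(match) -> str:
        nonlocal img_index
        if img_index < len(image_urls):
            url = image_urls[img_index]
            img_index += 1
            return f"![image]({url})"
        return match.group(0)  # ran out of hosted URLs; keep the original reference

    return re.sub(r'!\[([^\]]*)\]\([^\)]+\)', replace_image, text)

print(replace_markdown_images("See ![diagram](local/fig.png).", ["https://example.com/fig.png"]))
# -> See ![image](https://example.com/fig.png).
```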