Added nonetype check for videos, made url checks more general #15

Merged 4 commits on Apr 30, 2024
33 changes: 8 additions & 25 deletions README.md
@@ -4,11 +4,11 @@
<a href="https://github.com/emcf/thepipe/blob/main/README.md">English</a> | <a href="https://github.com/emcf/thepipe/blob/main/README_cn.md">中文</a>
</p>

-[![codecov](https://codecov.io/gh/emcf/thepipe/graph/badge.svg?token=OE7CUEFUL9)](https://codecov.io/gh/emcf/thepipe) ![python-gh-action](https://github.com/emcf/thepipe/actions/workflows/python-ci.yml/badge.svg) <a href="https://thepi.pe/">![Website](https://img.shields.io/website?url=https%3A%2F%2Fthepipe.up.railway.app%2F&label=API%20status)</a> <a href="https://thepi.pe/">![get API](https://img.shields.io/badge/API-access-blue)</a>
+[![codecov](https://codecov.io/gh/emcf/thepipe/graph/badge.svg?token=OE7CUEFUL9)](https://codecov.io/gh/emcf/thepipe) ![python-gh-action](https://github.com/emcf/thepipe/actions/workflows/python-ci.yml/badge.svg) <a href="https://thepi.pe/">![Website](https://img.shields.io/website?url=https%3A%2F%2Fthepipe.up.railway.app%2F&label=API%20status)</a> <a href="https://thepi.pe/">![get API](https://img.shields.io/badge/API-access-blue)</a> <a href="https://discord.gg/bXfKeGs5qV">![Join discord](https://img.shields.io/discord/1227806200478044274?color=4f69ef&label=Discord&logo=discord&logoColor=ffffff)</a>

-### Feed PDFs, web pages, word docs, slides, videos, CSV, and more into Vision-LLMs with one line of code ⚡
+### Feed PDFs, URLs, Slides, YouTube videos, Word docs and more into Vision-Language models with one line of code ⚡

-The Pipe is a multimodal-first tool for feeding files and web pages into vision-language models such as GPT-4V. It is best for LLM and RAG applications that require a deep understanding of tricky data sources. The Pipe is available as a hosted API at [thepi.pe](https://thepi.pe), or it can be set up locally.
+The Pipe is a multimodal-first tool for feeding files and web pages into vision-language models such as GPT-4V. It is best for LLM and RAG applications that want to support comprehensive textual and visual understanding across a wide range of data sources. The Pipe is available as a 24/7 hosted API at [thepi.pe](https://thepi.pe), or it can be set up locally to let you run the compute.

![Science assistant demo](https://rpnutzemutbrumczwvue.supabase.co/storage/v1/object/public/assets/science_assistantpy2.png)

@@ -74,9 +74,9 @@ thepipe path/to/folder --match tsx --ignore tests
| Microsoft PowerPoint Presentation | `.pptx` | ✔️ | ✔️ | Extracts text and images from PowerPoint presentations |
| Video | `.mp4`, `.avi`, `.mov`, `.wmv` | ✔️ | ✔️ | Extracts frames from video files; supports frame extraction and OCR for text extraction from frames |
| Audio | `.mp3`, `.wav` | ✔️ | ❌ | Extracts text from audio files; supports speech-to-text conversion |
-| Website | URLs (inputs containing `http`, `https`, `ftp`) | ✔️ | ✔️ | Extracts text from web page along with image (or images if scrollable); text-only extraction available |
-| GitHub Repository | GitHub repo URLs | ✔️ | ✔️ | Extracts from GitHub repositories; supports branch specification |
-| YouTube Video | YouTube video URLs | ✔️ | ✔️ | Extracts text from YouTube videos; supports subtitles extraction |
+| Website | URLs (inputs starting with `http`, `https`, `ftp`) | ✔️ | ✔️ | Extracts text from web page along with image (or images if scrollable); text-only extraction available |
+| GitHub Repository | GitHub repo URLs (inputs starting with `https://github.com` or `https://www.github.com`) | ✔️ | ✔️ | Extracts from GitHub repositories; supports branch specification |
+| YouTube Video | YouTube video URLs (inputs starting with `https://youtube.com` or `https://www.youtube.com`) | ✔️ | ✔️ | Extracts frames and transcript from YouTube videos in per-minute chunks |
| ZIP File | `.zip` | ✔️ | ✔️ | Extracts contents of ZIP files; supports nested directory extraction |

## How it works 🛠️
@@ -113,25 +113,8 @@ It uses a variety of heuristics for optimal performance with vision-language mod

## Local Installation 🛠️

-The Pipe handles a wide array of complex filetypes, and thus requires installation of many different packages to function. It also requires a very capable machine for good response times. For this reason, we host it as an API that works out-of-the-box. To use The Pipe locally for free instead, you will need [playwright](https://github.com/microsoft/playwright), [ctags](https://github.com/universal-ctags/), [pytesseract](https://github.com/h/pytesseract), and the local python requirements, which differ from the more lightweight API requirements:
-
-```bash
-git clone https://github.com/emcf/thepipe
-pip install -r requirements_local.txt
-```
-
-Tip for windows users: Install the python-libmagic binaries with `pip install python-magic-bin`. Ensure the `tesseract-ocr` binaries and the `ctags` binaries are in your PATH.
-
-Now you can use The Pipe with Python:
-```bash
-from thepipe_api import thepipe
-chunks = thepipe.extract("example.pdf", local=True)
-```
-
-or from the command line:
-```bash
-thepipe path/to/folder --local
-```
+If you do not wish to use our API, you are welcome to host The Pipe for yourself locally.
+If you choose to do this, you must install a number of dependencies for the code to function correctly, some of which may incur compute costs and/or require a GPU for reasonable performance. Additional installed dependencies are required: pytorch, universal-ctags, playwright, pytesseract, llmlingua, moviepy, and pytube. This installation process will depend on your system and compute capabilities. After installing them, follow these steps for a local setup:

Arguments are:
- `source` (required): can be a file path, a URL, or a directory path.
16 changes: 10 additions & 6 deletions thepipe_api/extractor.py
@@ -115,9 +115,9 @@ def extract_from_file(file_path: str, source_type: str, verbose: bool = False, a
return [Chunk(path=file_path)]

 def detect_type(source: str) -> Optional[SourceTypes]:
-    if source.startswith("https://www.youtube.com"):
+    if source.startswith("https://www.youtube.com") or source.startswith("https://youtube.com"):
         return SourceTypes.YOUTUBE_VIDEO
-    if source.startswith("https://github.com"):
+    elif source.startswith("https://github.com") or source.startswith("https://www.github.com"):
         return SourceTypes.GITHUB
     elif source.startswith("http") or source.startswith("ftp."):
         return SourceTypes.URL
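
The prefix checks in this hunk can be tried in isolation. The following is a minimal sketch, not the project's code: `SourceTypes` here is a hypothetical three-member stand-in for the enum in `thepipe_api`, and the paired prefixes are written as tuples, which `str.startswith` accepts, instead of chained `or` expressions. Anything that matches no prefix returns `None` in this sketch; what the real function does past this point is not shown in the diff.

```python
from enum import Enum, auto
from typing import Optional


class SourceTypes(Enum):
    # Hypothetical stand-in: only the members this hunk touches.
    YOUTUBE_VIDEO = auto()
    GITHUB = auto()
    URL = auto()


def detect_type(source: str) -> Optional[SourceTypes]:
    # Order matters: the specific hosts must be tested before the
    # generic "http" prefix, which would otherwise match them too.
    if source.startswith(("https://www.youtube.com", "https://youtube.com")):
        return SourceTypes.YOUTUBE_VIDEO
    elif source.startswith(("https://github.com", "https://www.github.com")):
        return SourceTypes.GITHUB
    elif source.startswith(("http", "ftp.")):
        return SourceTypes.URL
    return None  # out of scope for this sketch


bare_youtube = detect_type("https://youtube.com/watch?v=abc")
www_github = detect_type("https://www.github.com/emcf/thepipe")
plain_site = detect_type("https://thepi.pe")
local_file = detect_type("example.pdf")
```

Because the match is on the full `https://` prefix, an `http://youtube.com/...` input falls through to the generic `URL` branch, which is the same behavior the merged code has.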
@@ -352,15 +352,19 @@ def extract_video(file_path: str, verbose: bool = False, text_only: bool = False
         image = Image.fromarray(frame)
         # extract and transcribe audio for the current chunk
         audio_path = os.path.join(tempfile.gettempdir(), f"temp_audio_{i}.wav")
-        video.subclip(start_time, end_time).audio.write_audiofile(audio_path, codec='pcm_s16le')
-        result = model.transcribe(audio_path, verbose=verbose)
-        transcription = result['text']
+        audio = video.subclip(start_time, end_time).audio
+        if audio is None:
+            transcription = None
+        else:
+            audio.write_audiofile(audio_path, codec='pcm_s16le')
+            result = model.transcribe(audio_path, verbose=verbose)
+            transcription = result['text']
+            os.remove(audio_path)
         # add chunk
         if not text_only:
             chunks.append(Chunk(path=file_path, text=transcription, image=image, source_type=SourceTypes.VIDEO))
         else:
             chunks.append(Chunk(path=file_path, text=transcription, image=None, source_type=SourceTypes.VIDEO))
-        os.remove(audio_path)
     return chunks

 def extract_youtube(youtube_url: str, text_only: bool = False, verbose: bool = False) -> List[Chunk]:
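
The `None` check added to `extract_video` can be illustrated without moviepy or a speech model. The sketch below is an assumption-laden stand-in, not the merged code: `transcribe_chunk_audio`, `FakeAudio`, and `fake_transcribe` are hypothetical names, and the fake objects only imitate the two calls the hunk relies on (`write_audiofile` and a transcriber returning a dict with a `'text'` key). It also uses `try`/`finally` for cleanup, which the PR does not do.

```python
import os
import tempfile
from typing import Callable, Optional


def transcribe_chunk_audio(audio, transcribe: Callable[[str], dict],
                           audio_path: str) -> Optional[str]:
    """Return the transcript of one chunk, or None if the clip is silent."""
    if audio is None:
        # A clip without an audio track has nothing to write or transcribe.
        return None
    try:
        audio.write_audiofile(audio_path, codec="pcm_s16le")
        return transcribe(audio_path)["text"]
    finally:
        # Remove the temp file only if it was actually created.
        if os.path.exists(audio_path):
            os.remove(audio_path)


class FakeAudio:
    """Hypothetical stand-in for an audio clip; writes a one-byte file."""

    def write_audiofile(self, path: str, codec: str) -> None:
        with open(path, "wb") as f:
            f.write(b"\x00")


def fake_transcribe(path: str) -> dict:
    """Hypothetical stand-in for a speech-to-text model."""
    return {"text": "hello"}


path = os.path.join(tempfile.gettempdir(), "temp_audio_sketch.wav")
silent = transcribe_chunk_audio(None, fake_transcribe, path)
spoken = transcribe_chunk_audio(FakeAudio(), fake_transcribe, path)
```

The point mirrored from the diff is that the temporary file is deleted only on the path where it was written, so a silent video no longer reaches `os.remove` on a file that does not exist.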