Commit

Merge pull request #15 from emcf/14-some-videos-without-audio-fail-to…
…-extract

Added nonetype check for videos, made url checks more general
emcf authored Apr 30, 2024
2 parents 9276bbc + 4fd2235 commit 17323dd
Showing 2 changed files with 18 additions and 31 deletions.
33 changes: 8 additions & 25 deletions README.md
@@ -4,11 +4,11 @@
 <a href="https://github.com/emcf/thepipe/blob/main/README.md">English</a> | <a href="https://github.com/emcf/thepipe/blob/main/README_cn.md">中文</a>
 </p>

-[![codecov](https://codecov.io/gh/emcf/thepipe/graph/badge.svg?token=OE7CUEFUL9)](https://codecov.io/gh/emcf/thepipe) ![python-gh-action](https://github.com/emcf/thepipe/actions/workflows/python-ci.yml/badge.svg) <a href="https://thepi.pe/">![Website](https://img.shields.io/website?url=https%3A%2F%2Fthepipe.up.railway.app%2F&label=API%20status)</a> <a href="https://thepi.pe/">![get API](https://img.shields.io/badge/API-access-blue)</a>
+[![codecov](https://codecov.io/gh/emcf/thepipe/graph/badge.svg?token=OE7CUEFUL9)](https://codecov.io/gh/emcf/thepipe) ![python-gh-action](https://github.com/emcf/thepipe/actions/workflows/python-ci.yml/badge.svg) <a href="https://thepi.pe/">![Website](https://img.shields.io/website?url=https%3A%2F%2Fthepipe.up.railway.app%2F&label=API%20status)</a> <a href="https://thepi.pe/">![get API](https://img.shields.io/badge/API-access-blue)</a> <a href="https://discord.gg/bXfKeGs5qV">![Join discord](https://img.shields.io/discord/1227806200478044274?color=4f69ef&label=Discord&logo=discord&logoColor=ffffff)</a>

-### Feed PDFs, web pages, word docs, slides, videos, CSV, and more into Vision-LLMs with one line of code ⚡
+### Feed PDFs, URLs, Slides, YouTube videos, Word docs and more into Vision-Language models with one line of code ⚡

-The Pipe is a multimodal-first tool for feeding files and web pages into vision-language models such as GPT-4V. It is best for LLM and RAG applications that require a deep understanding of tricky data sources. The Pipe is available as a hosted API at [thepi.pe](https://thepi.pe), or it can be set up locally.
+The Pipe is a multimodal-first tool for feeding files and web pages into vision-language models such as GPT-4V. It is best for LLM and RAG applications that want to support comprehensive textual and visual understanding across a wide range of data sources. The Pipe is available as a 24/7 hosted API at [thepi.pe](https://thepi.pe), or it can be set up locally to let you run the compute.

 ![Science assistant demo](https://rpnutzemutbrumczwvue.supabase.co/storage/v1/object/public/assets/science_assistantpy2.png)

@@ -74,9 +74,9 @@ thepipe path/to/folder --match tsx --ignore tests
 | Microsoft PowerPoint Presentation | `.pptx` | ✔️ | ✔️ | Extracts text and images from PowerPoint presentations |
 | Video | `.mp4`, `.avi`, `.mov`, `.wmv` | ✔️ | ✔️ | Extracts frames from video files; supports frame extraction and OCR for text extraction from frames |
 | Audio | `.mp3`, `.wav` | ✔️ || Extracts text from audio files; supports speech-to-text conversion |
-| Website | URLs (inputs containing `http`, `https`, `ftp`) | ✔️ | ✔️ | Extracts text from web page along with image (or images if scrollable); text-only extraction available |
-| GitHub Repository | GitHub repo URLs | ✔️ | ✔️ | Extracts from GitHub repositories; supports branch specification |
-| YouTube Video | YouTube video URLs | ✔️ | ✔️ | Extracts text from YouTube videos; supports subtitles extraction |
+| Website | URLs (inputs starting with `http`, `https`, `ftp`) | ✔️ | ✔️ | Extracts text from web page along with image (or images if scrollable); text-only extraction available |
+| GitHub Repository | GitHub repo URLs (inputs starting with `https://github.com` or `https://www.github.com`) | ✔️ | ✔️ | Extracts from GitHub repositories; supports branch specification |
+| YouTube Video | YouTube video URLs (inputs starting with `https://youtube.com` or `https://www.youtube.com`) | ✔️ | ✔️ | Extracts frames and transcript from YouTube videos in per-minute chunks |
 | ZIP File | `.zip` | ✔️ | ✔️ | Extracts contents of ZIP files; supports nested directory extraction |

 ## How it works 🛠️
@@ -113,25 +113,8 @@ It uses a variety of heuristics for optimal performance with vision-language mod

 ## Local Installation 🛠️

-The Pipe handles a wide array of complex filetypes, and thus requires installation of many different packages to function. It also requires a very capable machine for good response times. For this reason, we host it as an API that works out-of-the-box. To use The Pipe locally for free instead, you will need [playwright](https://github.com/microsoft/playwright), [ctags](https://github.com/universal-ctags/), [pytesseract](https://github.com/h/pytesseract), and the local python requirements, which differ from the more lightweight API requirements:
-
-```bash
-git clone https://github.com/emcf/thepipe
-pip install -r requirements_local.txt
-```
-
-Tip for windows users: Install the python-libmagic binaries with `pip install python-magic-bin`. Ensure the `tesseract-ocr` binaries and the `ctags` binaries are in your PATH.
-
-Now you can use The Pipe with Python:
-```bash
-from thepipe_api import thepipe
-chunks = thepipe.extract("example.pdf", local=True)
-```
-
-or from the command line:
-```bash
-thepipe path/to/folder --local
-```
+If you do not wish to use our API, you are welcome host The Pipe for yourself locally.
+If you choose to do this, you must install a number of dependencies for the code to function correctly, some of which may incur compute costs and/or require a GPU for reasonable performance. Additional installed dependencies are required: pytorch, universal-ctags, playwright, pytesseract, llmlingua, moviepy, and pytube. This installation process will depend on your system and compute capabilities. After installing them, follow these steps for a local setup:

 Arguments are:
 - `source` (required): can be a file path, a URL, or a directory path.
16 changes: 10 additions & 6 deletions thepipe_api/extractor.py
@@ -115,9 +115,9 @@ def extract_from_file(file_path: str, source_type: str, verbose: bool = False, a
     return [Chunk(path=file_path)]

 def detect_type(source: str) -> Optional[SourceTypes]:
-    if source.startswith("https://www.youtube.com"):
+    if source.startswith("https://www.youtube.com") or source.startswith("https://youtube.com"):
         return SourceTypes.YOUTUBE_VIDEO
-    if source.startswith("https://github.com"):
+    elif source.startswith("https://github.com") or source.startswith("https://www.github.com"):
         return SourceTypes.GITHUB
     elif source.startswith("http") or source.startswith("ftp."):
         return SourceTypes.URL
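
Review note on the hunk above: the new checks are still string-prefix matches, so an `http://` YouTube link or a host such as `https://youtube.com.example.org` is classified by whichever prefix it happens to hit. A minimal sketch of a hostname-based alternative follows, assuming the standard library's `urllib.parse` is acceptable here. `SourceKind` and `detect_url_kind` are hypothetical names used only for illustration, not part of thepipe, and unlike the committed code this sketch treats `ftp` as a URL scheme rather than matching the `ftp.` host prefix.

```python
from enum import Enum
from typing import Optional
from urllib.parse import urlparse


class SourceKind(Enum):
    """Hypothetical stand-in for the project's SourceTypes enum."""
    YOUTUBE_VIDEO = "youtube"
    GITHUB = "github"
    URL = "url"


def detect_url_kind(source: str) -> Optional[SourceKind]:
    """Classify a URL by its parsed hostname instead of by string prefix."""
    parsed = urlparse(source)
    # Anything without a recognised scheme (e.g. a local file path) is not a URL.
    if parsed.scheme not in ("http", "https", "ftp"):
        return None
    host = (parsed.hostname or "").lower()
    # Treat "www.example.com" and "example.com" as the same site.
    if host.startswith("www."):
        host = host[4:]
    if host == "youtube.com":
        return SourceKind.YOUTUBE_VIDEO
    if host == "github.com":
        return SourceKind.GITHUB
    return SourceKind.URL


print(detect_url_kind("https://youtube.com/watch?v=abc"))
print(detect_url_kind("https://youtube.com.example.org/"))
```

Comparing the whole hostname keeps the `www.` and bare-domain cases in one branch and avoids lookalike hosts being routed to the YouTube or GitHub extractors.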
@@ -352,15 +352,19 @@ def extract_video(file_path: str, verbose: bool = False, text_only: bool = False
         image = Image.fromarray(frame)
         # extract and transcribe audio for the current chunk
         audio_path = os.path.join(tempfile.gettempdir(), f"temp_audio_{i}.wav")
-        video.subclip(start_time, end_time).audio.write_audiofile(audio_path, codec='pcm_s16le')
-        result = model.transcribe(audio_path, verbose=verbose)
-        transcription = result['text']
+        audio = video.subclip(start_time, end_time).audio
+        if audio is None:
+            transcription = None
+        else:
+            audio.write_audiofile(audio_path, codec='pcm_s16le')
+            result = model.transcribe(audio_path, verbose=verbose)
+            transcription = result['text']
+            os.remove(audio_path)
         # add chunk
         if not text_only:
             chunks.append(Chunk(path=file_path, text=transcription, image=image, source_type=SourceTypes.VIDEO))
         else:
             chunks.append(Chunk(path=file_path, text=transcription, image=None, source_type=SourceTypes.VIDEO))
-        os.remove(audio_path)
     return chunks

 def extract_youtube(youtube_url: str, text_only: bool = False, verbose: bool = False) -> List[Chunk]:
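
Review note on the hunk above: the guard matters because a clip without an audio track exposes `audio` as `None`, so the old chained `.audio.write_audiofile(...)` call raised before any chunk was produced. The pattern can be sketched in isolation as below. This is a sketch under stated assumptions, not the project's code: `transcribe_clip`, `_FakeAudio`, and `_FakeClip` are hypothetical names, the fakes only mimic the `audio` attribute and `write_audiofile(path, codec=...)` call seen in the diff, and the `try`/`finally` cleanup is an addition that the commit does not contain.

```python
import os
import tempfile
from typing import Callable, Optional


def transcribe_clip(clip, transcribe: Callable[[str], str], index: int = 0) -> Optional[str]:
    """Return the transcript for one clip, or None when the clip has no audio track."""
    audio = clip.audio
    if audio is None:
        return None  # silent clip: nothing is written, so nothing needs deleting
    audio_path = os.path.join(tempfile.gettempdir(), f"temp_audio_{index}.wav")
    try:
        audio.write_audiofile(audio_path, codec="pcm_s16le")
        return transcribe(audio_path)
    finally:
        # Remove the temporary file even if writing or transcription raised.
        if os.path.exists(audio_path):
            os.remove(audio_path)


class _FakeAudio:
    """Hypothetical stand-in for a clip's audio object."""
    def write_audiofile(self, path, codec=None):
        with open(path, "wb") as handle:
            handle.write(b"\0")


class _FakeClip:
    """Hypothetical stand-in for a video clip; audio may be None."""
    def __init__(self, audio):
        self.audio = audio


silent = transcribe_clip(_FakeClip(None), lambda path: "hello")
spoken = transcribe_clip(_FakeClip(_FakeAudio()), lambda path: "hello", index=1)
```

Callers then receive `None` as the text of a silent chunk, so any code that joins or measures chunk text needs to tolerate a missing transcript.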
