Merge pull request #15 from emcf/14-some-videos-without-audio-fail-to-extract

Added nonetype check for videos, made url checks more general
emcf · Apr 30, 2024 · 17323dd
2 parents 9276bbc + 4fd2235
commit 17323dd

Showing 2 changed files with 18 additions and 31 deletions.
diff --git a/README.md b/README.md
@@ -4,11 +4,11 @@
   <a href="https://github.com/emcf/thepipe/blob/main/README.md">English</a> | <a href="https://github.com/emcf/thepipe/blob/main/README_cn.md">中文</a>
 </p>
 
-[![codecov](https://codecov.io/gh/emcf/thepipe/graph/badge.svg?token=OE7CUEFUL9)](https://codecov.io/gh/emcf/thepipe) ![python-gh-action](https://github.com/emcf/thepipe/actions/workflows/python-ci.yml/badge.svg) <a href="https://thepi.pe/">![Website](https://img.shields.io/website?url=https%3A%2F%2Fthepipe.up.railway.app%2F&label=API%20status)</a> <a href="https://thepi.pe/">![get API](https://img.shields.io/badge/API-access-blue)</a>
+[![codecov](https://codecov.io/gh/emcf/thepipe/graph/badge.svg?token=OE7CUEFUL9)](https://codecov.io/gh/emcf/thepipe) ![python-gh-action](https://github.com/emcf/thepipe/actions/workflows/python-ci.yml/badge.svg) <a href="https://thepi.pe/">![Website](https://img.shields.io/website?url=https%3A%2F%2Fthepipe.up.railway.app%2F&label=API%20status)</a> <a href="https://thepi.pe/">![get API](https://img.shields.io/badge/API-access-blue)</a> <a href="https://discord.gg/bXfKeGs5qV">![Join discord](https://img.shields.io/discord/1227806200478044274?color=4f69ef&label=Discord&logo=discord&logoColor=ffffff)</a>
 
-### Feed PDFs, web pages, word docs, slides, videos, CSV, and more into Vision-LLMs with one line of code ⚡
+### Feed PDFs, URLs, Slides, YouTube videos, Word docs and more into Vision-Language models with one line of code ⚡
 
-The Pipe is a multimodal-first tool for feeding files and web pages into vision-language models such as GPT-4V. It is best for LLM and RAG applications that require a deep understanding of tricky data sources. The Pipe is available as a hosted API at [thepi.pe](https://thepi.pe), or it can be set up locally. 
+The Pipe is a multimodal-first tool for feeding files and web pages into vision-language models such as GPT-4V. It is best for LLM and RAG applications that want to support comprehensive textual and visual understanding across a wide range of data sources. The Pipe is available as a 24/7 hosted API at [thepi.pe](https://thepi.pe), or it can be set up locally to let you run the compute.
 
 ![Science assistant demo](https://rpnutzemutbrumczwvue.supabase.co/storage/v1/object/public/assets/science_assistantpy2.png)
 
@@ -74,9 +74,9 @@ thepipe path/to/folder --match tsx --ignore tests
 | Microsoft PowerPoint Presentation     | `.pptx`                                 | ✔️               | ✔️               | Extracts text and images from PowerPoint presentations                              |
 | Video                                 | `.mp4`, `.avi`, `.mov`, `.wmv`     | ✔️               | ✔️                | Extracts frames from video files; supports frame extraction and OCR for text extraction from frames |
 | Audio                                 | `.mp3`, `.wav`          | ✔️               | ❌                | Extracts text from audio files; supports speech-to-text conversion        | 
-| Website                               | URLs (inputs containing `http`, `https`, `ftp`)             | ✔️                | ✔️    | Extracts text from web page along with image (or images if scrollable); text-only extraction available          |
-| GitHub Repository                     | GitHub repo URLs                         | ✔️               | ✔️                | Extracts from GitHub repositories; supports branch specification         |
-| YouTube Video                         | YouTube video URLs                      | ✔️               | ✔️                | Extracts text from YouTube videos; supports subtitles extraction          |
+| Website                               | URLs (inputs starting with `http`, `https`, `ftp`)             | ✔️                | ✔️    | Extracts text from web page along with image (or images if scrollable); text-only extraction available          |
+| GitHub Repository                     | GitHub repo URLs (inputs starting with `https://github.com` or `https://www.github.com`)                          | ✔️               | ✔️                | Extracts from GitHub repositories; supports branch specification         |
+| YouTube Video                         | YouTube video URLs (inputs starting with `https://youtube.com` or `https://www.youtube.com`)                     | ✔️               | ✔️                | Extracts frames and transcript from YouTube videos in per-minute chunks          |
 | ZIP File                              | `.zip`                                  | ✔️               | ✔️                | Extracts contents of ZIP files; supports nested directory extraction     |
 
 ## How it works 🛠️
@@ -113,25 +113,8 @@ It uses a variety of heuristics for optimal performance with vision-language mod
 
 ## Local Installation 🛠️
 
-The Pipe handles a wide array of complex filetypes, and thus requires installation of many different packages to function. It also requires a very capable machine for good response times. For this reason, we host it as an API that works out-of-the-box. To use The Pipe locally for free instead, you will need [playwright](https://github.com/microsoft/playwright), [ctags](https://github.com/universal-ctags/), [pytesseract](https://github.com/h/pytesseract), and the local python requirements, which differ from the more lightweight API requirements:
-
-```bash
-git clone https://github.com/emcf/thepipe
-pip install -r requirements_local.txt
-```
-
-Tip for windows users: Install the python-libmagic binaries with `pip install python-magic-bin`. Ensure the `tesseract-ocr` binaries and the `ctags` binaries are in your PATH.
-
-Now you can use The Pipe with Python:
-```bash
-from thepipe_api import thepipe
-chunks = thepipe.extract("example.pdf", local=True)
-```
-
-or from the command line:
-```bash
-thepipe path/to/folder --local
-```
+If you do not wish to use our API, you are welcome to host The Pipe yourself locally.
+If you choose to do this, you must install a number of dependencies for the code to function correctly, some of which may incur compute costs and/or require a GPU for reasonable performance. Additional installed dependencies are required: pytorch, universal-ctags, playwright, pytesseract, llmlingua, moviepy, and pytube. This installation process will depend on your system and compute capabilities. After installing them, follow these steps for a local setup:
 
 Arguments are:
 - `source` (required): can be a file path, a URL, or a directory path.

diff --git a/thepipe_api/extractor.py b/thepipe_api/extractor.py
@@ -115,9 +115,9 @@ def extract_from_file(file_path: str, source_type: str, verbose: bool = False, a
         return [Chunk(path=file_path)]
 
 def detect_type(source: str) -> Optional[SourceTypes]:
-    if source.startswith("https://www.youtube.com"):
+    if source.startswith("https://www.youtube.com") or source.startswith("https://youtube.com"):
         return SourceTypes.YOUTUBE_VIDEO
-    if source.startswith("https://github.com"):
+    elif source.startswith("https://github.com") or source.startswith("https://www.github.com"):
         return SourceTypes.GITHUB
     elif source.startswith("http") or source.startswith("ftp."):
         return SourceTypes.URL
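
Note on the hunk above: the prefix checks can also be written with the tuple form of `str.startswith`. The following is a minimal, self-contained sketch rather than the project's code; the `SourceTypes` enum here is a hypothetical stand-in with only three members, and `detect_url_type` is an illustrative name. The specific YouTube and GitHub prefixes are tested first because those URLs also start with `http`.

```python
from enum import Enum
from typing import Optional


class SourceTypes(Enum):
    # Hypothetical stand-in for the project's SourceTypes; only the
    # members needed for this sketch are defined.
    YOUTUBE_VIDEO = "youtube_video"
    GITHUB = "github"
    URL = "url"


# Prefixes copied from the hunk above.
YOUTUBE_PREFIXES = ("https://www.youtube.com", "https://youtube.com")
GITHUB_PREFIXES = ("https://github.com", "https://www.github.com")
URL_PREFIXES = ("http", "ftp.")


def detect_url_type(source: str) -> Optional[SourceTypes]:
    """Classify a source string by URL prefix, most specific first."""
    # str.startswith accepts a tuple and matches if any prefix matches.
    if source.startswith(YOUTUBE_PREFIXES):
        return SourceTypes.YOUTUBE_VIDEO
    if source.startswith(GITHUB_PREFIXES):
        return SourceTypes.GITHUB
    if source.startswith(URL_PREFIXES):
        return SourceTypes.URL
    return None  # not recognised as a URL by these prefixes
```

With the prefixes exactly as written in the hunk, `ftp://example.com/file` returns `None`, because the check is for the literal prefix `ftp.` rather than `ftp`.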
@@ -352,15 +352,19 @@ def extract_video(file_path: str, verbose: bool = False, text_only: bool = False
         image = Image.fromarray(frame)
         # extract and transcribe audio for the current chunk
         audio_path = os.path.join(tempfile.gettempdir(), f"temp_audio_{i}.wav")
-        video.subclip(start_time, end_time).audio.write_audiofile(audio_path, codec='pcm_s16le')
-        result = model.transcribe(audio_path, verbose=verbose)
-        transcription = result['text']
+        audio = video.subclip(start_time, end_time).audio
+        if audio is None:
+            transcription = None
+        else:
+            audio.write_audiofile(audio_path, codec='pcm_s16le')
+            result = model.transcribe(audio_path, verbose=verbose)
+            transcription = result['text']
+            os.remove(audio_path)
         # add chunk
         if not text_only:
             chunks.append(Chunk(path=file_path, text=transcription, image=image, source_type=SourceTypes.VIDEO))
         else:
             chunks.append(Chunk(path=file_path, text=transcription, image=None, source_type=SourceTypes.VIDEO))
-        os.remove(audio_path)
     return chunks
 
 def extract_youtube(youtube_url: str, text_only: bool = False, verbose: bool = False) -> List[Chunk]:
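
Note on the `extract_video` hunk above: the fix guards against clips whose `audio` attribute is `None`, which is what a video without an audio track yields. The sketch below isolates that guard in a small helper so it can be exercised without a real video. It is an illustration under assumptions, not the project's code: `transcribe_clip_audio` and its `transcribe` callback are hypothetical names, and `clip` is any object exposing an `audio` attribute with a `write_audiofile(path, codec=...)` method, as in the hunk.

```python
import os
import tempfile
from typing import Callable, Optional


def transcribe_clip_audio(
    clip,
    transcribe: Callable[[str], str],
    index: int = 0,
) -> Optional[str]:
    """Return the transcript of clip's audio, or None if it has no audio."""
    audio = clip.audio
    if audio is None:
        # Silent video: nothing to write or transcribe.
        return None
    audio_path = os.path.join(tempfile.gettempdir(), f"temp_audio_{index}.wav")
    try:
        audio.write_audiofile(audio_path, codec="pcm_s16le")
        return transcribe(audio_path)
    finally:
        # Remove the temporary file even if transcription raises.
        if os.path.exists(audio_path):
            os.remove(audio_path)
```

Unlike the hunk, this sketch puts the cleanup in a `finally` block, so the temporary `.wav` file is removed even when transcription fails. That is a design choice of the sketch, not something the commit does.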