cleanup

yachty66 · Nov 11, 2023 · 2e76d86
1 parent ac67b47

Showing 15 changed files with 529 additions and 215 deletions.
diff --git a/.DS_Store b/.DS_Store
diff --git a/.gitignore b/.gitignore
@@ -1,5 +1,8 @@
 .env
-archive/
+
+gpt_pdf_md/build/
+gpt_pdf_md/dist/
+gpt_pdf_md/gpt_pdf_md.egg-info/ 
 
 .idea
 __pycache__

diff --git a/README.md b/README.md
@@ -1,6 +1,73 @@
-# gpt_vision_plus
+# gpt_pdf_md
 
-## limitations
+gpt_pdf_md is a Python package that uses GPT-4V and other tools to convert PDF files into Markdown. Raw GPT-4V currently has two limitations: the API does not accept PDF documents, and when it is prompted to convert text containing figures to Markdown, the figures are not converted correctly because the image URLs are missing from the output. It turns out that gpt_pdf_md even comes close to the OCR quality of Mathpix!
+
+## Features
+
+- Extracts figures from PDF files using the `pdffigures2` Scala library.
+- Converts PDF pages to images and uploads them to a Google Cloud Storage bucket.
+- Uses GPT-4V to generate Markdown content from the PDF and then inserts the image URLs into the Markdown.
+
+## Additional Dependencies
+
+This package requires the `pdffigures2` Scala library to extract figures from PDF files. You need to install all dependencies of that library: https://github.com/allenai/pdffigures2. (This can be quite a hassle because parts of the library are written in Scala, so you need the right versions of Java and Scala installed. We are looking for an alternative, easier way to extract images from a PDF; if you have any ideas, feel free to open an [issue](https://github.com/yachty66/gpt_vision_plus/issues).)
+
+## Installation
+
+Once you have `pdffigures2` set up, you can install gpt_pdf_md via pip:
+
+```bash
+pip install gpt-pdf-md
+```
+
+Configure the required environment variables in your .env file without spaces or unnecessary quotes:
+
+```env
+OPENAI_API_KEY=open_ai_key
+GOOGLE_ID=google_project_id
+GOOGLE_BUCKET=google_bucket_name
+```
+
+NOTE: the project requires a public Google Cloud Storage bucket; the images that are later rendered in the Markdown are uploaded to it.
+
+## Usage
+
+To process a PDF and generate Markdown content, it is important that your Python file is in the same directory as the `pdffigures2` folder. You can use gpt_pdf_md as follows:
+
+```python
+from gpt_pdf_md.reader import process_pdf
+import os
+from dotenv import load_dotenv
+
+load_dotenv()
+
+OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
+GOOGLE_ID = os.getenv('GOOGLE_ID')
+GOOGLE_BUCKET = os.getenv('GOOGLE_BUCKET')
+
+absolute_path = os.path.dirname(os.path.abspath(__file__))
+# absolute path to the PDF file
+PDF = absolute_path + "/example.pdf"
+# absolute path to pdffigures2
+PDFFIGURES2_PATH = absolute_path + "/pdffigures2/"
+process_pdf(PDF, PDFFIGURES2_PATH, OPENAI_API_KEY, GOOGLE_ID, GOOGLE_BUCKET)
+```
+
+This processes the specified PDF and writes a Markdown file with the extracted content to the same directory. For example, `output.md` is the converted result of `example.pdf`.
+
+## Next steps
+
+- [ ] try rust [vortex](https://github.com/omkar-mohanty/vortex) for pdf image extraction
+- [ ] use gpt-4 128k for final formatting of markdown
+- [ ] clearer readme to make it easier for everyone to use the python package
+- [ ] error handling  
+
+## Contributing & Support
+
+We welcome contributions! Please open an issue or submit a pull request on our GitHub repository.
+
+## License
+
+This project is licensed under the terms of the [MIT License](gpt_pdf_md/LICENSE).
 
-in the case of having two images side by side with one reference as figure like page 4 in https://arxiv.org/abs/1706.03762 the system does not know how to handle this
 
diff --git a/example.pdf b/example.pdf
diff --git a/experiments.py b/experiments.py
@@ -1,3 +1,16 @@
-from reader import main
+from gptpdfreader.reader import process_pdf
+import os
+from dotenv import load_dotenv
 
-main('page_4.pdf')
+load_dotenv()
+
+OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
+GOOGLE_ID = os.getenv('GOOGLE_ID')
+GOOGLE_BUCKET = os.getenv('GOOGLE_BUCKET')
+
+absolute_path = os.path.dirname(os.path.abspath(__file__))
+# absolute path to the PDF file
+PDF = absolute_path + "/example.pdf"
+# absolute path to pdffigures2
+PDFFIGURES2_PATH = absolute_path + "/pdffigures2/"
+process_pdf(PDF, PDFFIGURES2_PATH, OPENAI_API_KEY, GOOGLE_ID, GOOGLE_BUCKET)
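
Review note (not part of the commit): `os.getenv` returns `None` for an unset variable, so the script above would pass `None` into `process_pdf` whenever `.env` is incomplete. A small check along the following lines could fail early instead. This is only a sketch: `require_env` and `REQUIRED_VARS` are hypothetical names and are not part of gpt_pdf_md.

```python
import os

# The three variables the script above reads from the environment.
REQUIRED_VARS = ("OPENAI_API_KEY", "GOOGLE_ID", "GOOGLE_BUCKET")


def require_env(names, environ=None):
    """Return {name: value} for every name, or raise if any is unset or empty."""
    environ = os.environ if environ is None else environ
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}
```

Calling `require_env(REQUIRED_VARS)` right after `load_dotenv()` would surface a missing key before any PDF processing or upload starts.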
diff --git a/gpt_pdf_md/LICENSE b/gpt_pdf_md/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2023 Max Hager
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/gpt_pdf_md/README.md b/gpt_pdf_md/README.md
@@ -0,0 +1,73 @@
+# gpt_pdf_md
+
+gpt_pdf_md is a Python package that uses GPT-4V and other tools to convert PDF files into Markdown. Raw GPT-4V currently has two limitations: the API does not accept PDF documents, and when it is prompted to convert text containing figures to Markdown, the figures are not converted correctly because the image URLs are missing from the output. It turns out that gpt_pdf_md even comes close to the OCR quality of Mathpix!
+
+## Features
+
+- Extracts figures from PDF files using the `pdffigures2` Scala library.
+- Converts PDF pages to images and uploads them to a Google Cloud Storage bucket.
+- Uses GPT-4V to generate Markdown content from the PDF and then inserts the image URLs into the Markdown.
+
+## Additional Dependencies
+
+This package requires the `pdffigures2` Scala library to extract figures from PDF files. You need to install all dependencies of that library: https://github.com/allenai/pdffigures2. (This can be quite a hassle because parts of the library are written in Scala, so you need the right versions of Java and Scala installed. We are looking for an alternative, easier way to extract images from a PDF; if you have any ideas, feel free to open an [issue](https://github.com/yachty66/gpt_vision_plus/issues).)
+
+## Installation
+
+Once you have `pdffigures2` set up, you can install gpt_pdf_md via pip:
+
+```bash
+pip install gpt-pdf-md
+```
+
+Configure the required environment variables in your .env file without spaces or unnecessary quotes:
+
+```env
+OPENAI_API_KEY=open_ai_key
+GOOGLE_ID=google_project_id
+GOOGLE_BUCKET=google_bucket_name
+```
+
+NOTE: the project requires a public Google Cloud Storage bucket; the images that are later rendered in the Markdown are uploaded to it.
+
+## Usage
+
+To process a PDF and generate Markdown content, it is important that your Python file is in the same directory as the `pdffigures2` folder. You can use gpt_pdf_md as follows:
+
+```python
+from gpt_pdf_md.reader import process_pdf
+import os
+from dotenv import load_dotenv
+
+load_dotenv()
+
+OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
+GOOGLE_ID = os.getenv('GOOGLE_ID')
+GOOGLE_BUCKET = os.getenv('GOOGLE_BUCKET')
+
+absolute_path = os.path.dirname(os.path.abspath(__file__))
+# absolute path to the PDF file
+PDF = absolute_path + "/example.pdf"
+# absolute path to pdffigures2
+PDFFIGURES2_PATH = absolute_path + "/pdffigures2/"
+process_pdf(PDF, PDFFIGURES2_PATH, OPENAI_API_KEY, GOOGLE_ID, GOOGLE_BUCKET)
+```
+
+This processes the specified PDF and writes a Markdown file with the extracted content to the same directory. For example, `output.md` is the converted result of `example.pdf`.
+
+## Next steps
+
+- [ ] try rust [vortex](https://github.com/omkar-mohanty/vortex) for pdf image extraction
+- [ ] use gpt-4 128k for final formatting of markdown
+- [ ] clearer readme to make it easier for everyone to use the python package
+- [ ] error handling  
+
+## Contributing & Support
+
+We welcome contributions! Please open an issue or submit a pull request on our GitHub repository.
+
+## License
+
+This project is licensed under the terms of the [MIT License](LICENSE).
+
+
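
Review note (not part of the commit): the README says figure images are uploaded to a public bucket and their URLs are then inserted into the generated Markdown. A minimal sketch of that last step follows, under stated assumptions. The function names, the `Figure N` caption convention, and the object names are hypothetical and are not taken from gpt_pdf_md; only the `https://storage.googleapis.com/<bucket>/<object>` URL form for publicly readable Cloud Storage objects is standard.

```python
import re
from urllib.parse import quote


def public_url(bucket, object_name):
    """Public URL of an object in a publicly readable Cloud Storage bucket."""
    return f"https://storage.googleapis.com/{bucket}/{quote(object_name)}"


def insert_figures(markdown, figure_urls):
    """Put an image link before each line that starts with 'Figure N'.

    figure_urls maps a figure number (int) to its image URL. Lines whose
    figure number has no known URL are left unchanged.
    """
    out = []
    for line in markdown.splitlines():
        match = re.match(r"\s*(?:\*\*)?Figure (\d+)", line)
        if match and int(match.group(1)) in figure_urls:
            number = int(match.group(1))
            out.append(f"![Figure {number}]({figure_urls[number]})")
            out.append("")  # blank line so the caption stays its own paragraph
        out.append(line)
    return "\n".join(out)
```

Matching on the start of a line is a simplification: a body sentence that happens to begin with "Figure 1" would also receive an image, so a real implementation would need a stricter caption test.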
diff --git a/gpt_pdf_md/gpt_pdf_md/__init__.py b/gpt_pdf_md/gpt_pdf_md/__init__.py