Feature Request: Vector DB Support & Python API Enhancement #13

Open
NDA-Github opened this issue Nov 26, 2024 · 7 comments

Comments

@NDA-Github

First of all, thank you!

I want to start by thanking you for this amazing library. wdoc takes RAG to another level with its powerful features, great documentation and overall thoughtful implementation. The way it handles document processing, querying and summarization is really impressive.

Feature requests

I have two suggestions that could make wdoc even more versatile:

1. Support for vector databases

It would be great to have the option to store embeddings in vector databases like ChromaDB or Pinecone. This would allow:

  • Better scalability for large document collections
  • Persistence of embeddings across sessions
  • Potential for distributed deployments
  • Real-time updates to the document collection
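To make the request above concrete, here is a minimal sketch of what a pluggable embedding store could look like. Everything in it is an assumption for illustration: `VectorStore`, `JsonVectorStore` and `cosine` are hypothetical names, not part of wdoc, ChromaDB or Pinecone, and a real backend would replace the JSON file with a proper database client.

```python
import json
import math
from pathlib import Path
from typing import Protocol


class VectorStore(Protocol):
    """Hypothetical minimal interface a vector DB backend would satisfy."""

    def add(self, doc_id: str, vector: list[float]) -> None: ...
    def search(self, query: list[float], k: int = 3) -> list[tuple[str, float]]: ...


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class JsonVectorStore:
    """Toy store that persists embeddings to a JSON file across sessions."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)
        # Reload previously saved vectors, if any.
        self.vectors: dict[str, list[float]] = (
            json.loads(self.path.read_text()) if self.path.exists() else {}
        )

    def add(self, doc_id: str, vector: list[float]) -> None:
        # Store the vector and rewrite the file so it survives a restart.
        self.vectors[doc_id] = vector
        self.path.write_text(json.dumps(self.vectors))

    def search(self, query: list[float], k: int = 3) -> list[tuple[str, float]]:
        # Brute-force scan: score every stored vector, return the k best.
        scored = [(doc_id, cosine(query, vec)) for doc_id, vec in self.vectors.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

Coding against a small interface like this would let the same calling code work with an in-memory store, a local file, or a hosted vector database. The brute-force search here only suits small collections.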

2. Python API for easier integration

While the CLI interface is great, having a proper Python API would make it easier to integrate wdoc into other applications. For example:

# Proposed usage (illustrative)
from wdoc import WDoc

wdoc = WDoc()
db = None  # placeholder for any vector DB client (e.g. ChromaDB or Pinecone)

# Embedding
embeddings = wdoc.create_embeddings(
    documents=["doc1.pdf", "doc2.pdf"],
    model="openai/text-embedding-3-small",
    db=db,
)

# Query
response = wdoc.query(
    query="What is the main topic?",
    documents=embeddings,
)

This would make it simpler to:

  • Use wdoc as a library in other Python projects
  • Chain operations programmatically
  • Customize the workflow for specific use cases

Let me know if you'd like me to elaborate on any of these suggestions. Thanks again for this great tool!

@thiswillbeyourgithub
Owner

(I just want you to know that I saw your message right away but don't have the time to fully reply yet!)

@thiswillbeyourgithub
Owner

Hi!

First of all, thank you for your kind words. Can I ask where you heard about wdoc, and what you use it for?

Second, I've finally finished a side project that updates the TODO list in the README, so you can take a look at what I envision.

I had already noticed that the Python API was pretty bad, but fixing it requires refactoring quite a lot of entangled code. Because you asked for it, I moved it to the top of my wdoc priorities. Unfortunately, I will remain awfully busy for quite some time. Would you be interested in contributing to the changes? I could give you pointers!

Regarding support for other DBs, I'm totally on board, but it comes after a few other things (notably refactoring the API and making the openwebui pipeline).

And sorry, but I have to ask :) Did you ask an LLM to use flattery to sugarcoat a feature request?

@Dalavidhy

Hi!

Thank you for your feedback. I used it for extracting some financial information from financial reports.

After exploring various approaches, starting with basic RAG and experimenting with different methods, I spent time researching multiple GitHub projects before finding that yours truly stood out.

Thank you for updating the priorities - it's great to see active development and to be able to follow this project even more closely.

I'm interested in contributing to the project. I'd appreciate it if we could have a discussion beforehand to ensure I fully understand the direction and requirements.

And yes, I did use an LLM to help formulate this request 😄

Let me know when would be a good time to discuss potential contributions.

@thiswillbeyourgithub
Owner

thiswillbeyourgithub commented Dec 4, 2024

Hi @davidalhyar, sorry for the wait!

Thanks for sharing your use case with financial reports - it's great to see wdoc being used in such practical applications!

Regarding contributions, I'd love to have your help. Here's the current refactoring roadmap, with tasks that need to be completed in this specific order:

  1. Write unit tests for core features, which will serve as a safety net for the refactoring. (Done, but the tests now need to be made comprehensive.)
  2. Reorganize the codebase:
    • Move query/search code to tasks/query.py
    • Extract argument validation to its own method
    • Split the initialization code to create a cleaner API
    • Untangle the current "spaghetti code" state of the wdoc class declaration
  3. Verify that critical features still work properly:
    • the decorator of the wdoc class, and the dynamic docstring
    • the --help flag
    • the USAGE.md file
    • the mechanism in __init__ and __main__ that allows calling from the CLI
    • the wdoc_parse_file mechanism
    • the README examples
    • Finally, update the files in scripts to make sure they respect the new API.
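As a rough illustration of two roadmap items above (extracting argument validation into its own function, and covering it with unit tests), here is a minimal sketch. The names `validate_args`, `Args` and `ALLOWED_TASKS`, as well as the specific checks, are assumptions made up for this example and do not describe wdoc's actual code.

```python
from dataclasses import dataclass

# Hypothetical task names, for illustration only; not wdoc's real option set.
ALLOWED_TASKS = {"query", "search", "summarize"}


@dataclass(frozen=True)
class Args:
    task: str
    path: str
    top_k: int = 5


def validate_args(task: str, path: str, top_k: int = 5) -> Args:
    """Check arguments in one place and return an immutable, validated bundle."""
    if task not in ALLOWED_TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {sorted(ALLOWED_TASKS)}")
    if not path:
        raise ValueError("path must be a non-empty string")
    if top_k < 1:
        raise ValueError("top_k must be >= 1")
    return Args(task=task, path=path, top_k=top_k)


# pytest-style tests: plain functions whose names start with "test_".
def test_validate_args_accepts_valid_input() -> None:
    assert validate_args("query", "doc.pdf").top_k == 5


def test_validate_args_rejects_unknown_task() -> None:
    try:
        validate_args("translate", "doc.pdf")
    except ValueError:
        return
    raise AssertionError("expected ValueError")
```

Keeping validation in a standalone function means it can be tested without constructing the whole class, which is what makes such tests relatively self-contained.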

Only after these steps are completed can we properly implement the vector DB support you suggested. Would you be interested in helping with any of these specific tasks? We could start with the unit tests, as they're relatively self-contained.

Let me know which part interests you most, and I can provide more detailed technical guidance. We could use GitHub Discussions for the technical deep-dive if you prefer.

As a medical student, my keyboard time is limited, and I'm currently spread pretty thin across my other projects, so helping out is a sure way to get this much sooner than if you wait for me :)

@thiswillbeyourgithub
Owner

Addendum: also something I should do but haven't taken the time to learn is to create a readthedocs website for the documentation. Have you experience with that?

@thiswillbeyourgithub
Owner

> Addendum: also something I should do but haven't taken the time to learn is to create a readthedocs website for the documentation. Have you experience with that?

Well, actually it was simpler than I thought, so never mind.

@thiswillbeyourgithub
Owner

(update: I made some pytests but they need to be comprehensive now.)
