
Documentation Ingestion lacks accuracy #28

Open
stefanfaistenauer opened this issue Feb 17, 2025 · 3 comments

Comments

@stefanfaistenauer
Contributor

Right now we make an axios request to the documentation URL.
If the documentation is a GraphQL schema or an OpenAPI spec, we do an OK job of parsing it and acting accordingly.
If the documentation is an HTML page, we convert it to Markdown and use roughly the first 20k characters (see documentation.ts for the exact logic) in our context window for API configuration generation. This is a very limited approach, particularly for longer documentation; we can do better.
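To make the limitation concrete, here is a minimal sketch of the flow described above. The names and the conversion logic are illustrative assumptions, not the actual documentation.ts implementation:

```typescript
// Hypothetical sketch of the current ingestion flow (names are
// illustrative, not the real documentation.ts code).
const MAX_CONTEXT_CHARS = 20_000; // assumed truncation limit from the issue

function htmlToMarkdown(html: string): string {
  // Stand-in for the real HTML-to-Markdown conversion.
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

function buildDocContext(html: string): string {
  const markdown = htmlToMarkdown(html);
  // Naive truncation: everything past the limit is silently dropped,
  // which is exactly the accuracy problem this issue describes.
  return markdown.slice(0, MAX_CONTEXT_CHARS);
}
```

Anything an API provider documents past the cutoff (often the less common endpoints) never reaches the model.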

@stefanfaistenauer
Contributor Author

We put some work into improving it in GLU-104. Still not amazing, but better. Need to merge this at some point.

@Krishnaidnani

Krishnaidnani commented Feb 18, 2025

Here is my solution:
We can improve HTML parsing with cheerio. Instead of converting the entire HTML to Markdown, we can extract only the relevant sections (headings, descriptions, and code blocks) while removing unnecessary elements (nav, footer, sidebar).
We can also streamline the data fetching with an Axios stream and process large documentation files in chunks, preventing memory overload and avoiding the arbitrary truncation at 20k characters.
Let me know how you feel about this approach; I will start coding this logic if it is OK on your end.
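A dependency-free sketch of the shape of this proposal. In a real implementation the element filtering would use cheerio selectors (e.g. `$("nav, footer, aside").remove()`) rather than these illustrative regexes, and the chunks would come from an Axios response stream:

```typescript
// Illustrative stand-in for cheerio-based filtering: drop boilerplate
// containers so only documentation content survives.
const BOILERPLATE = /<(nav|footer|aside)\b[\s\S]*?<\/\1>/gi;

function extractRelevantHtml(html: string): string {
  return html.replace(BOILERPLATE, "");
}

// Process text in fixed-size chunks instead of truncating at 20k chars,
// so arbitrarily long documentation is consumed piece by piece.
function* chunked(text: string, size = 4_000): Generator<string> {
  for (let i = 0; i < text.length; i += size) {
    yield text.slice(i, i + size);
  }
}
```

The chunk size here is an arbitrary placeholder; in practice it would be tuned to the context window of the model doing the configuration generation.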

@PaperBoardOfficial

That's a solid approach for improving HTML parsing and memory efficiency! Using Cheerio for targeted extraction would definitely help get better quality content from HTML docs.
While we implement that, I'm also exploring embedding-based retrieval, which would help find semantically relevant sections even when they don't explicitly mention the endpoint name. The two approaches could complement each other: better HTML parsing to extract quality content, and embeddings to find the most relevant parts of that content.
Would you be interested in collaborating on both improvements?
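For reference, the embedding-based retrieval idea could be sketched as follows. The `embed()` step is assumed to be handled elsewhere by a real embedding model; here the chunks arrive with precomputed vectors and are ranked by cosine similarity against the query:

```typescript
// Sketch of embedding-based retrieval over documentation chunks.
// Vectors are assumed to come from an external embedding model.
interface EmbeddedChunk {
  text: string;
  vector: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the k chunks most similar to the query embedding.
function topK(query: number[], chunks: EmbeddedChunk[], k: number): EmbeddedChunk[] {
  return [...chunks]
    .sort((x, y) => cosine(query, y.vector) - cosine(query, x.vector))
    .slice(0, k);
}
```

Only the top-ranked chunks would then be placed in the context window, instead of whatever happens to appear in the first 20k characters.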
