
Documentation Ingestion lacks accuracy #28

Open
stefanfaistenauer opened this issue Feb 17, 2025 · 3 comments

Comments

@stefanfaistenauer
Contributor

Right now we make an axios request to the documentation URL.
If the documentation is a GraphQL schema or an OpenAPI spec, we do an OK job of parsing it and acting accordingly.
If the documentation is an HTML page, we convert it to Markdown and use roughly the first 20k characters (see documentation.ts for the exact logic) in our context window for API configuration generation. This is a very limited approach, particularly for longer documentation; we can do better.
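To make the limitation concrete, here is a minimal sketch of the flow described above. The names and the conversion logic are illustrative assumptions, not the actual documentation.ts implementation:

```typescript
// Hypothetical sketch of the current ingestion flow (names are
// illustrative, not the real documentation.ts code).
const MAX_CONTEXT_CHARS = 20_000; // assumed truncation limit from the issue

function htmlToMarkdown(html: string): string {
  // Stand-in for the real HTML-to-Markdown conversion.
  return html
    .replace(/<(script|style)[\s\S]*?<\/\1>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

function buildDocContext(html: string): string {
  const markdown = htmlToMarkdown(html);
  // Naive truncation: everything past the limit is silently dropped,
  // which is exactly the accuracy problem this issue describes.
  return markdown.slice(0, MAX_CONTEXT_CHARS);
}
```

Anything an API provider documents past the cutoff (often the less common endpoints) never reaches the model.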

@stefanfaistenauer
Contributor Author

We put some work into improving it in GLU-104. Still not amazing, but better. Need to merge this at some point.

@Krishnaidnani

Krishnaidnani commented Feb 18, 2025

Here is my solution:
We can improve HTML parsing with cheerio. Instead of converting the entire HTML to Markdown, we can extract only the relevant sections (headings, descriptions, and code blocks) while removing unnecessary elements (nav, footer, sidebar).
We can also streamline the data fetching with an Axios stream and process large documentation files in chunks, preventing memory overload and avoiding the arbitrary truncation at 20k characters.
Let me know how you feel about this approach; I will start coding this logic if it is OK on your end.
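A dependency-free sketch of the shape of this proposal. In a real implementation the element filtering would use cheerio selectors (e.g. `$("nav, footer, aside").remove()`) rather than these illustrative regexes, and the chunks would come from an Axios response stream:

```typescript
// Illustrative stand-in for cheerio-based filtering: drop boilerplate
// containers so only documentation content survives.
const BOILERPLATE = /<(nav|footer|aside)\b[\s\S]*?<\/\1>/gi;

function extractRelevantHtml(html: string): string {
  return html.replace(BOILERPLATE, "");
}

// Process text in fixed-size chunks instead of truncating at 20k chars,
// so arbitrarily long documentation is consumed piece by piece.
function* chunked(text: string, size = 4_000): Generator<string> {
  for (let i = 0; i < text.length; i += size) {
    yield text.slice(i, i + size);
  }
}
```

The chunk size here is an arbitrary placeholder; in practice it would be tuned to the context window of the model doing the configuration generation.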

@PaperBoardOfficial

That's a solid approach for improving HTML parsing and memory efficiency! Using Cheerio for targeted extraction would definitely help get better quality content from HTML docs.
While we implement that, I'm also exploring embedding-based retrieval, which would help find semantically relevant sections even when they don't explicitly mention the endpoint name. The two approaches could complement each other: better HTML parsing to extract quality content, and embeddings to find the most relevant parts of that content.
Would you be interested in collaborating on both improvements?
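For reference, the embedding-based retrieval idea could be sketched as follows. The `embed()` step is assumed to be handled elsewhere by a real embedding model; here the chunks arrive with precomputed vectors and are ranked by cosine similarity against the query:

```typescript
// Sketch of embedding-based retrieval over documentation chunks.
// Vectors are assumed to come from an external embedding model.
interface EmbeddedChunk {
  text: string;
  vector: number[];
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the k chunks most similar to the query embedding.
function topK(query: number[], chunks: EmbeddedChunk[], k: number): EmbeddedChunk[] {
  return [...chunks]
    .sort((x, y) => cosine(query, y.vector) - cosine(query, x.vector))
    .slice(0, k);
}
```

Only the top-ranked chunks would then be placed in the context window, instead of whatever happens to appear in the first 20k characters.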
