Commit d7bbca9

Merge pull request #896 from helixml/fix/rag-hotfixes
Fix/rag hotfixes
2 parents 0443b66 + be5d8e3 commit d7bbca9

7 files changed, +185 -34 lines changed

api/pkg/prompts/templates/knowledge.tmpl

+60 -8
@@ -11,21 +11,47 @@ We have found the following context you may refer to in your answer:
 </article>
 {{- end }}
 
-Always provide references in the body of your answer in the format '[DOC_ID:DocumentID]'. For example, "The answer is 42 [DOC_ID:f6962c8007]."
+IMPORTANT: When referencing documents, always use EXACTLY the document_id values provided above. DO NOT extract or use page IDs from URLs within the content. Always provide references in the body of your answer in the format '[DOC_ID:DocumentID]'. For example, "The answer is 42 [DOC_ID:f6962c8007]." NOT "[DOC_ID:123456]" where 123456 might be a page ID in a URL.
 
 Always provide references in the body of your answer!
 
-After your answer, include one excerpt per document_id in XML format surrounded by three dashes like ---. These should be short sentence-long excerpts from the content that you referenced when answering the question, in the form below. Provide one excerpt per document. Provide one EXACT QUOTE per document. Do not include any other text inside the --- markers.
+After completing your answer, create an excerpt section with important quotes from each referenced document.
+
+Follow these steps:
+1. Identify each unique document_id you cited in your answer
+2. For each document, select a representative quote (1-2 sentences) that best supports your answer
+3. Include each document exactly once in the excerpt block using the format below
+
+⚠️ SYSTEM ERROR PREVENTION NOTICE ⚠️
+▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
+The document database processes each document_id only once. Duplicate entries
+will trigger this error:
+
+ERROR: Duplicate document_id detected. Excerpt processing failed.
+Document with duplicate entries: [document_id]. Please provide exactly
+one excerpt per document_id.
+▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
+
+Required excerpt format:
 
 ---
 <excerpts>
 <excerpt>
-<document_id>[DocumentID]</document_id>
-<snippet>[Excerpt]</snippet>
+<document_id>document-id-you-cited</document_id>
+<snippet>A representative quote from this document that supports your answer.</snippet>
+</excerpt>
+<excerpt>
+<document_id>another-document-id-you-cited</document_id>
+<snippet>A key quote from this document that supports your answer.</snippet>
 </excerpt>
 </excerpts>
 ---
 
+FINAL CHECK:
+- Each document appears exactly once in your excerpts
+- No introductory text appears before the excerpt block
+- All document_ids match those you cited in your answer
+
 {{- end }}
 
 {{- if .KnowledgeResults }}
@@ -53,21 +79,47 @@ We have found the following context you may refer to in your answer:
 </article>
 {{- end }}
 
-Always provide references in the body of your answer in the format '[DOC_ID:DocumentID]'. For example, "The answer is 42 [DOC_ID:f6962c8007]."
+IMPORTANT: When referencing documents, always use EXACTLY the document_id values provided above. DO NOT extract or use page IDs from URLs within the content. Always provide references in the body of your answer in the format '[DOC_ID:DocumentID]'. For example, "The answer is 42 [DOC_ID:f6962c8007]." NOT "[DOC_ID:123456]" where 123456 might be a page ID in a URL.
 
 Always provide references in the body of your answer!
 
-After your answer, include one excerpt per document_id in XML format surrounded by three dashes like ---. These should be short sentence-long excerpts from the content that you referenced when answering the question, in the form below. Provide one excerpt per document. Provide one EXACT QUOTE per document. Do not include any other text inside the --- markers.
+After completing your answer, create an excerpt section with important quotes from each referenced document.
+
+Follow these steps:
+1. Identify each unique document_id you cited in your answer
+2. For each document, select a representative quote (1-2 sentences) that best supports your answer
+3. Include each document exactly once in the excerpt block using the format below
+
+⚠️ SYSTEM ERROR PREVENTION NOTICE ⚠️
+▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
+The document database processes each document_id only once. Duplicate entries
+will trigger this error:
+
+ERROR: Duplicate document_id detected. Excerpt processing failed.
+Document with duplicate entries: [document_id]. Please provide exactly
+one excerpt per document_id.
+▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓
+
+Required excerpt format:
 
 ---
 <excerpts>
 <excerpt>
-<document_id>[DocumentID]</document_id>
-<snippet>[Excerpt]</snippet>
+<document_id>document-id-you-cited</document_id>
+<snippet>A representative quote from this document that supports your answer.</snippet>
+</excerpt>
+<excerpt>
+<document_id>another-document-id-you-cited</document_id>
+<snippet>A key quote from this document that supports your answer.</snippet>
 </excerpt>
 </excerpts>
 ---
 
+FINAL CHECK:
+- Each document appears exactly once in your excerpts
+- No introductory text appears before the excerpt block
+- All document_ids match those you cited in your answer
+
 {{- end }}
 
 Here is the question from the user:
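
The excerpt rules in this template (one `<excerpt>` per cited `document_id`, with IDs matching the `[DOC_ID:...]` citations) could also be checked on the consuming side. The following is a minimal sketch and not code from this commit: `check_excerpts` and the wording of its messages are hypothetical, and it assumes the model emits the block between two `---` markers exactly as the template requests.

```python
import re
import xml.etree.ElementTree as ET

# Same ID alphabet as the updated ParseDocumentIDs regex in this commit.
CITATION_RE = re.compile(r"\[DOC_ID:([a-zA-Z0-9_-]+)\]")
# Assumption: the excerpt block sits between two '---' markers.
BLOCK_RE = re.compile(r"---\s*(<excerpts>.*?</excerpts>)\s*---", re.DOTALL)


def check_excerpts(answer: str) -> list[str]:
    """Return a list of problems found in the excerpt block (empty if none)."""
    match = BLOCK_RE.search(answer)
    if match is None:
        return ["no excerpt block found"]

    # Citations are collected from the answer body that precedes the block.
    cited = set(CITATION_RE.findall(answer[: match.start()]))
    try:
        root = ET.fromstring(match.group(1))
    except ET.ParseError as exc:
        return [f"excerpt block is not well-formed XML: {exc}"]

    problems = []
    seen = set()
    for excerpt in root.findall("excerpt"):
        doc_id = (excerpt.findtext("document_id") or "").strip()
        if doc_id in seen:
            problems.append(f"duplicate excerpt for {doc_id}")
        seen.add(doc_id)
        if doc_id not in cited:
            problems.append(f"excerpt for uncited document {doc_id}")
    for doc_id in sorted(cited - seen):
        problems.append(f"cited document {doc_id} has no excerpt")
    return problems
```

A snippet that contains a raw `&` or `<` makes the block invalid XML, which this sketch reports as a problem rather than trying to repair.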

api/pkg/rag/rag_haystack.go

+45 -20
@@ -10,6 +10,7 @@ import (
 	"net/http"
 	"path/filepath"
 	"strconv"
+	"strings"
 
 	"github.com/helixml/helix/api/pkg/types"
 	"github.com/rs/zerolog/log"
@@ -40,24 +41,22 @@ func (h *HaystackRAG) Index(ctx context.Context, chunks ...*types.SessionRAGInde
 
 	logger.Debug().Msg("Indexing documents")
 
-	for _, chunk := range chunks {
-		logger.Debug().
-			Str("data_entity_id", chunk.DataEntityID).
-			Str("document_id", chunk.DocumentID).
-			Msg("indexing chunk")
+	// Early exit if no chunks to index
+	if len(chunks) == 0 {
+		logger.Warn().Msg("no chunks to index, skipping")
+		return nil
+	}
 
-		// Metadata check before processing
-		if chunk.Metadata != nil {
-			logger.Info().
-				Str("document_id", chunk.DocumentID).
-				Interface("chunk_metadata", chunk.Metadata).
-				Msg("Chunk contains metadata")
-		} else {
-			logger.Info().
+	for _, chunk := range chunks {
+		// Skip chunks with empty content
+		if chunk.Content == "" {
+			logger.Warn().
 				Str("document_id", chunk.DocumentID).
-				Msg("Chunk does NOT contain metadata")
+				Msg("skipping chunk with empty content")
+			continue
 		}
 
+		// Create multipart/form-data
 		var b bytes.Buffer
 		w := multipart.NewWriter(&b)
 

@@ -67,15 +66,16 @@ func (h *HaystackRAG) Index(ctx context.Context, chunks ...*types.SessionRAGInde
 
 		logger.Debug().Str("filename", filename).Msg("Indexing file")
 
-		// Add the file as a part
+		// Create a form file for the document
 		part, err := w.CreateFormFile("file", filename)
 		if err != nil {
 			return fmt.Errorf("creating form file: %w", err)
 		}
 
+		// Write the content - preserve original content including any NUL bytes
 		_, err = part.Write([]byte(chunk.Content))
 		if err != nil {
-			return fmt.Errorf("writing file content: %w", err)
+			return fmt.Errorf("writing content: %w", err)
 		}
 
 		// Add metadata for the document
// Add metadata for the document
@@ -90,7 +90,7 @@ func (h *HaystackRAG) Index(ctx context.Context, chunks ...*types.SessionRAGInde
 			// Add other metadata as needed
 		}
 
-		// Add any custom metadata from the chunk
+		// Add user metadata if present
 		if chunk.Metadata != nil {
 			logger.Info().
 				Str("document_id", chunk.DocumentID).
@@ -196,6 +196,18 @@ func (h *HaystackRAG) Query(ctx context.Context, q *types.SessionRAGQuery) ([]*t
 		Interface("document_id_list", q.DocumentIDList).
 		Logger()
 
+	// Remove NUL bytes from the prompt first
+	sanitizedPrompt := removeNULBytes(q.Prompt)
+	if sanitizedPrompt != q.Prompt {
+		logger.Warn().Msg("query prompt contained NUL bytes that were removed")
+	}
+
+	// Check for empty prompt after sanitizing - return early with error
+	if sanitizedPrompt == "" {
+		logger.Error().Msg("empty query prompt received (or only NUL bytes), rejecting request")
+		return nil, fmt.Errorf("query prompt cannot be empty")
+	}
+
 	// Build document ID conditions
 	documentIDConditions := make([]Condition, len(q.DocumentIDList))
 	for i, documentID := range q.DocumentIDList {
@@ -208,7 +220,7 @@ func (h *HaystackRAG) Query(ctx context.Context, q *types.SessionRAGQuery) ([]*t
 
 	// Build the complete query request
 	queryReq := QueryRequest{
-		Query: q.Prompt,
+		Query: sanitizedPrompt,
 		TopK: q.MaxResults,
 		Filters: QueryFilter{
 			Operator: "AND",
@@ -339,10 +351,18 @@ func (h *HaystackRAG) Delete(ctx context.Context, req *types.DeleteIndexRequest)
 
 	logger.Debug().Msg("Deleting documents from Haystack")
 
-	// Create delete request
+	// Create delete request with properly formatted filters
+	// The Haystack service expects filters with operator and conditions
 	deleteReq := map[string]interface{}{
 		"filters": map[string]interface{}{
-			"data_entity_id": req.DataEntityID,
+			"operator": "AND",
+			"conditions": []map[string]interface{}{
+				{
+					"field": "meta.data_entity_id",
+					"operator": "==",
+					"value": req.DataEntityID,
+				},
+			},
 		},
 	}
 

@@ -421,3 +441,8 @@ func toString(value interface{}) string {
 		return fmt.Sprint(v)
 	}
 }
+
+// removeNULBytes removes NUL bytes from a string
+func removeNULBytes(s string) string {
+	return strings.ReplaceAll(s, "\x00", "")
+}
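
The `Delete` hunk above replaces a flat `{"data_entity_id": ...}` filter with the operator/conditions shape. For readers working on the Python service side, the same payload can be sketched as a plain dictionary. This is illustrative only: `build_delete_request` is a hypothetical helper, and only the field names and values visible in the diff are taken from the commit.

```python
import json


def build_delete_request(data_entity_id: str) -> dict:
    """Build the delete payload in the operator/conditions filter shape."""
    return {
        "filters": {
            "operator": "AND",
            "conditions": [
                {
                    "field": "meta.data_entity_id",
                    "operator": "==",
                    "value": data_entity_id,
                },
            ],
        },
    }


if __name__ == "__main__":
    # "example-entity" is a placeholder value, not an ID from the repository.
    print(json.dumps(build_delete_request("example-entity"), indent=2))
```

Keeping the single condition inside a `conditions` list means further conditions can be added later without changing the outer shape.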

api/pkg/rag/util.go

+2 -1
@@ -7,7 +7,8 @@ import (
 
 // Extract document IDs from the prompt
 func ParseDocumentIDs(prompt string) []string {
-	re := regexp.MustCompile(`\[DOC_ID:([0-9a-f]+)\]`)
+	// Updated regex to match any alphanumeric characters, not just digits
+	re := regexp.MustCompile(`\[DOC_ID:([a-zA-Z0-9_-]+)\]`)
 	matches := re.FindAllStringSubmatch(prompt, -1)
 
 	// Convert matches to slice of strings
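
To see what the widened pattern changes, the two expressions from this hunk can be compared in Python's `re` module, which accepts the same syntax for these patterns. Note that the old pattern accepted lowercase hex (`0-9`, `a-f`) rather than digits only; the new one also accepts uppercase letters, the rest of the alphabet, `_` and `-`. The helper name `parse_document_ids` and the sample ID `Report_2024-v2` are illustrative assumptions, and this sketch keeps duplicates, which may differ from the Go function.

```python
import re

OLD_PATTERN = re.compile(r"\[DOC_ID:([0-9a-f]+)\]")
NEW_PATTERN = re.compile(r"\[DOC_ID:([a-zA-Z0-9_-]+)\]")


def parse_document_ids(prompt: str, pattern: re.Pattern = NEW_PATTERN) -> list[str]:
    """Return every captured ID in order of appearance (duplicates kept)."""
    return pattern.findall(prompt)


if __name__ == "__main__":
    text = "See [DOC_ID:f6962c8007] and [DOC_ID:Report_2024-v2]."
    print(parse_document_ids(text, OLD_PATTERN))  # only the hex-style ID
    print(parse_document_ids(text))               # both IDs
```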

haystack_service/app/api.py

+49 -5
@@ -86,9 +86,25 @@ async def process_file(
         # Get file extension
         _, ext = os.path.splitext(file.filename)
 
-        # Save file temporarily
+        # Read the file content
+        content = await file.read()
+
+        # Check for empty content
+        if not content:
+            logger.error("Empty file content received")
+            raise HTTPException(status_code=422, detail="Input validation error: File content cannot be empty")
+
+        # For binary files like PDFs, we should NOT sanitize content as it will corrupt the file
+        # PDF files and other binary formats may contain NUL bytes as part of their format
+        # NUL bytes will be handled after text extraction in the converter
+
+        # Only check if the content is ONLY NUL bytes (which would be invalid)
+        if content == b'\x00' * len(content):
+            logger.error("File contained only NUL bytes")
+            raise HTTPException(status_code=422, detail="Input validation error: File content cannot be empty (contained only NUL bytes)")
+
+        # Save file temporarily with original binary content intact
         with tempfile.NamedTemporaryFile(delete=False, suffix=ext) as temp:
-            content = await file.read()
             temp.write(content)
             temp_path = temp.name
 
94110

@@ -125,9 +141,20 @@ async def extract_text(
         # Get file extension
         _, ext = os.path.splitext(file.filename)
 
-        # Save file temporarily
+        # Read the file content
+        content = await file.read()
+
+        # Check for empty content
+        if not content:
+            logger.error("Empty file content received")
+            raise HTTPException(status_code=422, detail="Input validation error: File content cannot be empty")
+
+        # For binary files like PDFs, we should NOT sanitize content as it will corrupt the file
+        # PDF files and other binary formats may contain NUL bytes as part of their format
+        # NUL bytes will be handled after text extraction in the converter
+
+        # Save file temporarily with original binary content intact
         with tempfile.NamedTemporaryFile(delete=False, suffix=ext) as temp:
-            content = await file.read()
             temp.write(content)
             temp_path = temp.name
 
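
The upload checks in the two hunks above could be factored into one helper. The sketch below is an assumption about how that might look, not code from the commit: `validate_upload` is a hypothetical name, and it raises `ValueError` instead of FastAPI's `HTTPException` so that it runs without dependencies. In the diff itself only `process_file` performs the all-NUL check; `extract_text` checks for empty content only.

```python
def validate_upload(content: bytes) -> None:
    """Reject empty uploads and uploads made up solely of NUL bytes."""
    if not content:
        raise ValueError("File content cannot be empty")
    # Interior NUL bytes are left untouched: binary formats such as PDF may
    # legitimately contain them, and stripping them would corrupt the file.
    if not content.strip(b"\x00"):
        raise ValueError("File content cannot be empty (contained only NUL bytes)")


if __name__ == "__main__":
    validate_upload(b"%PDF-1.7\x00binary payload")  # accepted, returns None
    try:
        validate_upload(b"\x00\x00\x00")
    except ValueError as exc:
        print(exc)
```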
133160

@@ -150,12 +177,29 @@ async def query(
     """Query for relevant documents"""
 
     try:
+        # Check for empty query text
+        if not request.query or request.query.strip() == "":
+            raise HTTPException(status_code=422, detail="Input validation error: `query` cannot be empty")
+
+        # Remove NUL bytes from query if present
+        sanitized_query = request.query.replace('\x00', '')
+        if sanitized_query != request.query:
+            logger.warning("Query contained NUL bytes that were removed")
+
+        # Check again for emptiness after sanitizing
+        if not sanitized_query or sanitized_query.strip() == "":
+            logger.error("Query contained only NUL bytes")
+            raise HTTPException(status_code=422, detail="Input validation error: `query` cannot be empty (contained only NUL bytes)")
+
         results = await service.query(
-            query_text=request.query,
+            query_text=sanitized_query,
             filters=request.filters,
             top_k=request.top_k
         )
         return {"results": results}
+    except HTTPException:
+        # Re-raise HTTP exceptions
+        raise
     except Exception as e:
         logger.error(f"Error querying: {str(e)}")
         raise HTTPException(status_code=500, detail=f"Error querying: {str(e)}")
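
The added `except HTTPException: raise` clause matters because the validation errors are raised inside the same `try` block as the generic handler. A dependency-free sketch of that control flow follows; `HTTPError`, `sanitize_query` and `handle_query` are hypothetical stand-ins (the real endpoint uses FastAPI's `HTTPException` and calls the service).

```python
class HTTPError(Exception):
    """Stand-in for fastapi.HTTPException so the sketch has no dependencies."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def sanitize_query(query: str) -> str:
    """Strip NUL bytes, then reject queries that are empty or whitespace-only."""
    cleaned = (query or "").replace("\x00", "")
    if not cleaned.strip():
        raise HTTPError(422, "Input validation error: `query` cannot be empty")
    return cleaned


def handle_query(query: str) -> dict:
    try:
        return {"query": sanitize_query(query)}
    except HTTPError:
        # Without this clause the generic handler below would catch the 422
        # and report it to the caller as a 500.
        raise
    except Exception as exc:
        raise HTTPError(500, f"Error querying: {exc}") from exc
```

Sanitizing before the emptiness check, as here, folds the diff's two separate checks into one; the commit keeps them separate so that it can return a more specific message for the NUL-only case.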

haystack_service/app/converters.py

+5
@@ -80,6 +80,11 @@ def run(
             ]
             text = "\n\n".join(el for el in markdown_elements if el)
 
+            # Filter out NUL bytes from text after extraction
+            if '\x00' in text:
+                logger.warning(f"Filtered NUL bytes from document text extracted from {path}")
+                text = text.replace('\x00', '')
+
             if text.strip():
                 # Create document with metadata
                 doc_meta = meta.copy()

haystack_service/app/service.py

+17
@@ -93,6 +93,7 @@ def _init_indexing_pipeline(self):
         splitter.warm_up()
 
         # Writer for the vector store (which now handles both embeddings and BM25)
+        # NUL bytes are filtered out in VectorchordDocumentStore.write_documents method
         vector_writer = DocumentWriter(
             document_store=self.document_store,
             policy=DuplicatePolicy.OVERWRITE # Use overwrite policy to handle duplicate documents
@@ -330,6 +331,16 @@ async def query(self, query_text: str, filters: Dict[str, Any] = None, top_k: in
         Returns:
             List of dictionaries with document data
         """
+        # Remove NUL bytes from query if present
+        if "\x00" in query_text:
+            logger.warning("Query contained NUL bytes that will be removed")
+            query_text = query_text.replace("\x00", "")
+
+        # Validate query text after sanitizing
+        if not query_text or query_text.strip() == "":
+            logger.error("Empty query text received or contained only NUL bytes")
+            raise ValueError("Query text cannot be empty")
+
         logger.info(f"Querying with: '{query_text}', filters: {filters}, top_k: {top_k}")
 
         # Format filters correctly if they're provided
@@ -522,6 +533,12 @@ async def query(self, query_text: str, filters: Dict[str, Any] = None, top_k: in
             documents = output.get("document_joiner", {}).get("documents", [])
             logger.info(f"Document joiner returned {len(documents)} documents")
 
+            # Filter out NUL bytes from document content
+            for doc in documents:
+                if doc.content and '\x00' in doc.content:
+                    logger.warning(f"Filtering NUL bytes from retrieval result document: {doc.id}")
+                    doc.content = doc.content.replace('\x00', '')
+
             # Debug the joined results
             for i, doc in enumerate(documents):
                 logger.info(f"DEBUG: Final joined result {i+1}: id={getattr(doc, 'id', 'unknown')}, "

haystack_service/app/vectorchord/document_store/document_store.py

+7
@@ -655,6 +655,13 @@ def write_documents(self, documents: List[Document], policy: DuplicatePolicy = D
         if not documents:
             return 0
 
+        # Filter out any NUL bytes in document content before writing to PostgreSQL
+        logger = logging.getLogger(__name__)
+        for doc in documents:
+            if doc.content and '\x00' in doc.content:
+                logger.warning(f"Document store: removing NUL bytes from document content before database write: {doc.id}")
+                doc.content = doc.content.replace('\x00', '')
+
         # Convert Document objects to Postgres compatible format
         pg_documents = self._from_haystack_to_pg_documents(documents)
 
