-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Bug fix for, error when inserting more than 500 documents at once #29
base: main
Are you sure you want to change the base?
Conversation
nit: typo in description: |
def _validate_batch_size(self, batch_size: int) -> int: | ||
if batch_size is None or batch_size <= 0: | ||
return DEFAULT_BATCH_SIZE | ||
elif batch_size > 419: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Define 419 as a const - MAX_BATCH_SIZE
self._bind: Union[Connection, Engine] = ( | ||
connection if connection else self._create_engine() | ||
) | ||
self._prepare_json_data_type() | ||
self._embedding_store = self._get_embedding_store(self.table_name, self.schema) | ||
self._create_table_if_not_exists() | ||
|
||
def _validate_batch_size(self, batch_size: int) -> int: | ||
if batch_size is None or batch_size <= 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't a -ve or 0 batch_size be invalid value as well? Shouldn't we error on that value? - with a different error message than the maximum one
@@ -824,16 +849,36 @@ def add_texts( | |||
texts: Iterable of strings to add into the vectorstore. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is the batch_size argument added to add_texts? If this function doesn't take it as an argument, the PR description should be fixed, and so does the argument list
texts = list(texts) | ||
|
||
# Validate batch_size again to confirm if it is still valid. | ||
batch_size = self._validate_batch_size(self._batch_size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why would it not be valid if it was already validated?
|
||
# Validate batch_size again to confirm if it is still valid. | ||
batch_size = self._validate_batch_size(self._batch_size) | ||
for i in range(0, len(list(texts)), batch_size): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
isnt texts already a list(texts) as per 865? If yes, shouldnt this be 0, len(texts) ?
# Initialize a list to store results from each batch | ||
embedded_texts = [] | ||
|
||
# Loop through the list of documents and process in batches |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it a list of texts or documents? If texts why do we say documents here in the comment? If documents, why is the variable named texts?
# Validate batch_size again to confirm if it is still valid. | ||
batch_size = self._validate_batch_size(self._batch_size) | ||
for i in range(0, len(list(texts)), batch_size): | ||
batch = texts[i : i + batch_size] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if texts is < batch_size, does this indexing mechanism account for that? and is this excluding the last value i + batch_size
?
@@ -13,35 +13,35 @@ files = [ | |||
|
|||
[[package]] | |||
name = "anyio" | |||
version = "4.6.2.post1" | |||
version = "4.8.0" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what are these changes? and are they required?
@@ -824,16 +849,36 @@ def add_texts( | |||
texts: Iterable of strings to add into the vectorstore. | |||
metadatas: List of metadatas (python dicts) associated with the input texts. | |||
ids: List of IDs for the input texts. | |||
batch_size: Number of documents to be inserted at once to Db, max 419. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is no longer an arg for add_texts
. We should remove this.
elif batch_size > 419: | ||
logging.error("The request contains an invalid batch_size.") | ||
raise ValueError( | ||
"""The request contains an invalid batch_size. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can include the actual value passed in by the user in the error message too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need more accurate testing
|
||
text_splitter = RecursiveCharacterTextSplitter(chunk_size=3, chunk_overlap=1) | ||
split_documents = text_splitter.create_documents(texts) | ||
store._batch_size = 400 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the test is modifying the batch_size private member directly. This should not be how we test it. It should be tested via the interface how batch_size is exposed to the user. Is it as an argument to add_documents, add_texts, or from_documents or from_texts API?
"""Test that `add_texts` raises an exception, | ||
when batch_size is updated to more than 419""" | ||
texts *= 200 | ||
store._batch_size = 490 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to add a test with -ve value of batch_size
What is this change about?
There is a bug in the current implementation of add_texts, where if we insert more than 500 texts or documents at once into the DB, it throws an exception as below.
How is it fixed?
Use batching to insert the documents into the database based on the batch_sie value.
How are the changes tested?
Integration tests were added to insert 636 documents at once and also 1000 texts at once using add_texts API. All the tests were successful.
