Bug fix for error when inserting more than 500 documents at once #29

Open
wants to merge 8 commits into main

Conversation

anushakolan
Collaborator

@anushakolan anushakolan commented Feb 6, 2025

What is this change about?

There is a bug in the current implementation of add_texts: inserting more than 500 texts or documents at once into the DB throws the exception below.

2024-12-02 11:02:35,662 - ERROR - Add text failed:
 ('07002', '[07002] [Microsoft][ODBC Driver 18 for SQL Server]COUNT field incorrect or syntax error (0) (SQLExecDirectW)')

How is it fixed?

Use batching to insert the documents into the database based on the batch_sie value.

  1. The batch_size is an optional parameter for inserting documents / data.
  2. The default batch_size will be 100.
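The batching described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `insert_fn` is a hypothetical stand-in for the real SQL insert, and only `DEFAULT_BATCH_SIZE = 100` comes from the description.

```python
DEFAULT_BATCH_SIZE = 100  # default from the PR description

def add_texts_batched(texts, insert_fn, batch_size=None):
    """Insert texts in batches so no single INSERT statement exceeds
    the driver's parameter limit. insert_fn is a hypothetical stand-in
    for the real SQL insert call."""
    if batch_size is None or batch_size <= 0:
        batch_size = DEFAULT_BATCH_SIZE
    texts = list(texts)
    inserted = 0
    # Step through the list in fixed-size windows; the final slice
    # may be shorter than batch_size.
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        insert_fn(batch)
        inserted += len(batch)
    return inserted
```

With 636 inputs and the default batch size this issues 7 inserts, the last one carrying 36 rows.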

How are the changes tested?

Integration tests were added that insert 636 documents at once and 1000 texts at once via the add_texts API. All tests passed.

@Aniruddh25
Contributor

nit: typo in description: based on the batch_sie value. <- should be batch_size here

def _validate_batch_size(self, batch_size: int) -> int:
    if batch_size is None or batch_size <= 0:
        return DEFAULT_BATCH_SIZE
    elif batch_size > 419:
Contributor

Define 419 as a const - MAX_BATCH_SIZE
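Lifting the magic number into a module-level constant, as suggested, might look like this standalone sketch (the real method lives on the vector store class; behavior otherwise mirrors the PR's validator):

```python
DEFAULT_BATCH_SIZE = 100  # default batch size per the PR description
MAX_BATCH_SIZE = 419      # named constant replacing the magic number 419

def validate_batch_size(batch_size):
    # Mirrors the PR's fallback for missing or non-positive values.
    if batch_size is None or batch_size <= 0:
        return DEFAULT_BATCH_SIZE
    if batch_size > MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be at most {MAX_BATCH_SIZE}")
    return batch_size
```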

    self._bind: Union[Connection, Engine] = (
        connection if connection else self._create_engine()
    )
    self._prepare_json_data_type()
    self._embedding_store = self._get_embedding_store(self.table_name, self.schema)
    self._create_table_if_not_exists()

def _validate_batch_size(self, batch_size: int) -> int:
    if batch_size is None or batch_size <= 0:
Contributor

Shouldn't a negative or 0 batch_size be an invalid value as well? Shouldn't we error on that value, with a different error message than the maximum one?
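A stricter variant along the lines of this comment could reject non-positive values with their own message instead of silently substituting the default. This is a hypothetical sketch, not the PR's code; the function name and signature are illustrative:

```python
def validate_batch_size_strict(batch_size, default=100, maximum=419):
    # Reject non-positive values explicitly, with a message distinct
    # from the too-large case, per this review comment.
    if batch_size is None:
        return default
    if batch_size <= 0:
        raise ValueError(f"batch_size must be positive, got {batch_size}")
    if batch_size > maximum:
        raise ValueError(f"batch_size must be at most {maximum}, got {batch_size}")
    return batch_size
```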

@@ -824,16 +849,36 @@ def add_texts(
texts: Iterable of strings to add into the vectorstore.
Contributor

@Aniruddh25 Aniruddh25 Feb 18, 2025

Where is the batch_size argument added to add_texts? If this function doesn't take it as an argument, the PR description should be fixed, and so should the argument list.

    texts = list(texts)

    # Validate batch_size again to confirm if it is still valid.
    batch_size = self._validate_batch_size(self._batch_size)
Contributor

why would it not be valid if it was already validated?


    # Validate batch_size again to confirm if it is still valid.
    batch_size = self._validate_batch_size(self._batch_size)
    for i in range(0, len(list(texts)), batch_size):
Contributor

Isn't texts already a list(texts) as per line 865? If yes, shouldn't this be 0, len(texts)?
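Since texts was already materialized with `texts = list(texts)` earlier in the function, the extra `list()` call just makes a throwaway copy to measure its length. A corrected loop, extracted as a standalone helper for illustration:

```python
def iter_batches(texts, batch_size):
    # texts is assumed to be a list already, so len(texts) suffices;
    # len(list(texts)) would copy the whole list only to take its length.
    for i in range(0, len(texts), batch_size):
        yield texts[i : i + batch_size]
```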

    # Initialize a list to store results from each batch
    embedded_texts = []

    # Loop through the list of documents and process in batches
Contributor

Is it a list of texts or documents? If texts why do we say documents here in the comment? If documents, why is the variable named texts?

    # Validate batch_size again to confirm if it is still valid.
    batch_size = self._validate_batch_size(self._batch_size)
    for i in range(0, len(list(texts)), batch_size):
        batch = texts[i : i + batch_size]
Contributor

@Aniruddh25 Aniruddh25 Feb 18, 2025

What if texts is < batch_size, does this indexing mechanism account for that? And is this excluding the last value i + batch_size?
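For reference, Python's half-open slices clamp out-of-range endpoints, so both cases the comment asks about are handled: a list shorter than batch_size comes back as one short batch, and `texts[i : i + batch_size]` stops just before index `i + batch_size`. A quick check of both behaviors:

```python
# Fewer texts than the batch size: the slice end is clamped to len(texts).
texts = ["a", "b", "c"]
small = [texts[i : i + 100] for i in range(0, len(texts), 100)]

# Half-open slicing: the element at index i + batch_size is excluded,
# and the final batch may be partial.
nums = list(range(10))
windows = [nums[i : i + 4] for i in range(0, len(nums), 4)]
```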

@@ -13,35 +13,35 @@ files = [

[[package]]
name = "anyio"
-version = "4.6.2.post1"
+version = "4.8.0"
Contributor

What are these changes? And are they required?

@@ -824,16 +849,36 @@ def add_texts(
texts: Iterable of strings to add into the vectorstore.
metadatas: List of metadatas (python dicts) associated with the input texts.
ids: List of IDs for the input texts.
batch_size: Number of documents to be inserted at once to Db, max 419.
Collaborator

This is no longer an arg for add_texts. We should remove this.

elif batch_size > 419:
    logging.error("The request contains an invalid batch_size.")
    raise ValueError(
        """The request contains an invalid batch_size.
Collaborator

Maybe we can include the actual value passed in by the user in the error message too.
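Including the rejected value in the message, as suggested, could look like the following. The wording and function name are hypothetical:

```python
MAX_BATCH_SIZE = 419

def check_batch_size(batch_size):
    if batch_size > MAX_BATCH_SIZE:
        # Echo the user's value so the failure is self-explanatory.
        raise ValueError(
            f"Invalid batch_size {batch_size}: "
            f"expected a value between 1 and {MAX_BATCH_SIZE}."
        )
    return batch_size
```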

Contributor

@Aniruddh25 Aniruddh25 left a comment

Need more accurate testing


text_splitter = RecursiveCharacterTextSplitter(chunk_size=3, chunk_overlap=1)
split_documents = text_splitter.create_documents(texts)
store._batch_size = 400
Contributor

the test is modifying the batch_size private member directly. This should not be how we test it. It should be tested via the interface how batch_size is exposed to the user. Is it as an argument to add_documents, add_texts, or from_documents or from_texts API?

"""Test that `add_texts` raises an exception,
when batch_size is updated to more than 419"""
texts *= 200
store._batch_size = 490
Contributor

Need to add a test with a negative value of batch_size.
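Such a test might look like the following, assuming the PR's current fallback semantics. `validate_batch_size` here is a stand-in for the real `_validate_batch_size`, and the default/maximum values are taken from the PR description and review:

```python
def validate_batch_size(batch_size, default=100, maximum=419):
    # Stand-in mirroring the PR's _validate_batch_size behavior.
    if batch_size is None or batch_size <= 0:
        return default
    if batch_size > maximum:
        raise ValueError(f"invalid batch_size: {batch_size}")
    return batch_size

def test_negative_batch_size_falls_back_to_default():
    # Under the PR's current semantics a negative value silently becomes
    # the default; if the stricter proposal in this review lands, this
    # should be changed to expect a ValueError instead.
    assert validate_batch_size(-5) == 100
```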
