Bug fix for error when inserting more than 500 documents at once #29

Open
wants to merge 8 commits into main

Conversation

anushakolan
Collaborator

@anushakolan anushakolan commented Feb 6, 2025

What is this change about?

There is a bug in the current implementation of add_texts: inserting more than 500 texts or documents at once into the DB throws the exception below.

2024-12-02 11:02:35,662 - ERROR - Add text failed:
 ('07002', '[07002] [Microsoft][ODBC Driver 18 for SQL Server]COUNT field incorrect or syntax error (0) (SQLExecDirectW)')

How is it fixed?

Use batching to insert the documents into the database based on the batch_sie value.

  1. The batch_size is an optional parameter for inserting documents / data.
  2. The default batch_size will be 100.
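The batching described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `insert_fn` is a hypothetical stand-in for the real SQL insert, and only `DEFAULT_BATCH_SIZE = 100` comes from the description.

```python
DEFAULT_BATCH_SIZE = 100  # default from the PR description

def add_texts_batched(texts, insert_fn, batch_size=None):
    """Insert texts in batches so no single INSERT statement exceeds
    the driver's parameter limit. insert_fn is a hypothetical stand-in
    for the real SQL insert call."""
    if batch_size is None or batch_size <= 0:
        batch_size = DEFAULT_BATCH_SIZE
    texts = list(texts)
    inserted = 0
    # Step through the list in fixed-size windows; the final slice
    # may be shorter than batch_size.
    for i in range(0, len(texts), batch_size):
        batch = texts[i : i + batch_size]
        insert_fn(batch)
        inserted += len(batch)
    return inserted
```

With 636 inputs and the default batch size this issues 7 inserts, the last one carrying 36 rows.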

How are the changes tested?

Integration tests were added that insert 636 documents at once and 1000 texts at once via the add_texts API. All tests passed.

@Aniruddh25
Contributor

nit: typo in description: based on the batch_sie value. <- should be batch_size here

def _validate_batch_size(self, batch_size: int) -> int:
    if batch_size is None or batch_size <= 0:
        return DEFAULT_BATCH_SIZE
    elif batch_size > 419:
Contributor

Define 419 as a const - MAX_BATCH_SIZE
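Lifting the magic number into a module-level constant, as suggested, might look like this standalone sketch (the real method lives on the vector store class; behavior otherwise mirrors the PR's validator):

```python
DEFAULT_BATCH_SIZE = 100  # default batch size per the PR description
MAX_BATCH_SIZE = 419      # named constant replacing the magic number 419

def validate_batch_size(batch_size):
    # Mirrors the PR's fallback for missing or non-positive values.
    if batch_size is None or batch_size <= 0:
        return DEFAULT_BATCH_SIZE
    if batch_size > MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be at most {MAX_BATCH_SIZE}")
    return batch_size
```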

    self._bind: Union[Connection, Engine] = (
        connection if connection else self._create_engine()
    )
    self._prepare_json_data_type()
    self._embedding_store = self._get_embedding_store(self.table_name, self.schema)
    self._create_table_if_not_exists()

def _validate_batch_size(self, batch_size: int) -> int:
    if batch_size is None or batch_size <= 0:
Contributor

Shouldn't a negative or 0 batch_size be an invalid value as well? Shouldn't we error on that value, with a different error message than the maximum one?
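A stricter variant along the lines of this comment could reject non-positive values with their own message instead of silently substituting the default. This is a hypothetical sketch, not the PR's code; the function name and signature are illustrative:

```python
def validate_batch_size_strict(batch_size, default=100, maximum=419):
    # Reject non-positive values explicitly, with a message distinct
    # from the too-large case, per this review comment.
    if batch_size is None:
        return default
    if batch_size <= 0:
        raise ValueError(f"batch_size must be positive, got {batch_size}")
    if batch_size > maximum:
        raise ValueError(f"batch_size must be at most {maximum}, got {batch_size}")
    return batch_size
```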

@@ -824,16 +849,36 @@ def add_texts(
texts: Iterable of strings to add into the vectorstore.
Contributor

@Aniruddh25 Aniruddh25 Feb 18, 2025

Where is the batch_size argument added to add_texts? If this function doesn't take it as an argument, the PR description should be fixed, and so should the argument list.

    texts = list(texts)

    # Validate batch_size again to confirm if it is still valid.
    batch_size = self._validate_batch_size(self._batch_size)
Contributor

why would it not be valid if it was already validated?


    # Validate batch_size again to confirm if it is still valid.
    batch_size = self._validate_batch_size(self._batch_size)
    for i in range(0, len(list(texts)), batch_size):
Contributor

Isn't texts already a list(texts) as per line 865? If yes, shouldn't this be 0, len(texts)?
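Since texts was already materialized with `texts = list(texts)` earlier in the function, the extra `list()` call just makes a throwaway copy to measure its length. A corrected loop, extracted as a standalone helper for illustration:

```python
def iter_batches(texts, batch_size):
    # texts is assumed to be a list already, so len(texts) suffices;
    # len(list(texts)) would copy the whole list only to take its length.
    for i in range(0, len(texts), batch_size):
        yield texts[i : i + batch_size]
```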

    # Initialize a list to store results from each batch
    embedded_texts = []

    # Loop through the list of documents and process in batches
Contributor

Is it a list of texts or documents? If texts why do we say documents here in the comment? If documents, why is the variable named texts?

    # Validate batch_size again to confirm if it is still valid.
    batch_size = self._validate_batch_size(self._batch_size)
    for i in range(0, len(list(texts)), batch_size):
        batch = texts[i : i + batch_size]
Contributor

@Aniruddh25 Aniruddh25 Feb 18, 2025

What if texts is < batch_size, does this indexing mechanism account for that? And is this excluding the last value i + batch_size?
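For reference, Python's half-open slices clamp out-of-range endpoints, so both cases the comment asks about are handled: a list shorter than batch_size comes back as one short batch, and `texts[i : i + batch_size]` stops just before index `i + batch_size`. A quick check of both behaviors:

```python
# Fewer texts than the batch size: the slice end is clamped to len(texts).
texts = ["a", "b", "c"]
small = [texts[i : i + 100] for i in range(0, len(texts), 100)]

# Half-open slicing: the element at index i + batch_size is excluded,
# and the final batch may be partial.
nums = list(range(10))
windows = [nums[i : i + 4] for i in range(0, len(nums), 4)]
```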

@@ -13,35 +13,35 @@ files = [

[[package]]
name = "anyio"
-version = "4.6.2.post1"
+version = "4.8.0"
Contributor

What are these changes? And are they required?

@@ -824,16 +849,36 @@ def add_texts(
texts: Iterable of strings to add into the vectorstore.
metadatas: List of metadatas (python dicts) associated with the input texts.
ids: List of IDs for the input texts.
batch_size: Number of documents to be inserted at once to Db, max 419.
Collaborator

This is no longer an arg for add_texts. We should remove this.

elif batch_size > 419:
    logging.error("The request contains an invalid batch_size.")
    raise ValueError(
        """The request contains an invalid batch_size.
Collaborator

Maybe we can include the actual value passed in by the user in the error message too.
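Including the rejected value in the message, as suggested, could look like the following. The wording and function name are hypothetical:

```python
MAX_BATCH_SIZE = 419

def check_batch_size(batch_size):
    if batch_size > MAX_BATCH_SIZE:
        # Echo the user's value so the failure is self-explanatory.
        raise ValueError(
            f"Invalid batch_size {batch_size}: "
            f"expected a value between 1 and {MAX_BATCH_SIZE}."
        )
    return batch_size
```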

Contributor

@Aniruddh25 Aniruddh25 left a comment

Need more accurate testing


text_splitter = RecursiveCharacterTextSplitter(chunk_size=3, chunk_overlap=1)
split_documents = text_splitter.create_documents(texts)
store._batch_size = 400
Contributor

the test is modifying the batch_size private member directly. This should not be how we test it. It should be tested via the interface how batch_size is exposed to the user. Is it as an argument to add_documents, add_texts, or from_documents or from_texts API?

"""Test that `add_texts` raises an exception,
when batch_size is updated to more than 419"""
texts *= 200
store._batch_size = 490
Contributor

Need to add a test with a negative value of batch_size.
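Such a test might look like the following, assuming the PR's current fallback semantics. `validate_batch_size` here is a stand-in for the real `_validate_batch_size`, and the default/maximum values are taken from the PR description and review:

```python
def validate_batch_size(batch_size, default=100, maximum=419):
    # Stand-in mirroring the PR's _validate_batch_size behavior.
    if batch_size is None or batch_size <= 0:
        return default
    if batch_size > maximum:
        raise ValueError(f"invalid batch_size: {batch_size}")
    return batch_size

def test_negative_batch_size_falls_back_to_default():
    # Under the PR's current semantics a negative value silently becomes
    # the default; if the stricter proposal in this review lands, this
    # should be changed to expect a ValueError instead.
    assert validate_batch_size(-5) == 100
```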
