[BUG] RecordManager Does Not Properly Cleanup Vector Store's OLD VALUES #3570
Comments
can you help me confirm 2 things:
Hey @HenryHengZJ , let me upgrade to the latest version to confirm if this bug still exists. Will get you another update with your answers after that. Thanks!
Hey @dkindlund , having the exact same issue. Did you manage to get it sorted or figure out a way around it? I have also updated to the latest version and am still having issues.
Hi @dkindlund and @BenMushen, did you manage to figure out a way around this when using the Postgres vector store?
Hey @HenryHengZJ apologies for the delay on this.
I don't think that's the case -- as others have commented, this problem exists with other data sources (non-Airtable).
It doesn't matter if it's FULL or INCREMENTAL -- the same problem exists from what I see. The stale records never get removed; I think it's because there's no match between the vector store's id values and the record manager's key values.
Looking at the Postgres table schemas more closely:
I'm thinking there has to be some sort of linkage between the vector store's id column and the record manager's key column.
Okay, after reviewing the code more deeply, I think I have some theories as to what is happening here... The code is split between the indexing subsystem and the PostgreSQL RecordManager implementation:
Flowise/packages/components/src/indexing.ts, line 310 in c0a7478
Flowise/packages/components/nodes/recordmanager/PostgresRecordManager/PostgresRecordManager.ts, line 278 in c0a7478
Here are my raw notes so far: When you see that a document’s uid (used as the record manager’s key) doesn’t match the vector store’s id, it indicates that somewhere in the insertion workflow the intended identifier isn’t “flowing through” as expected. Here are some theories and scenarios that might explain this discrepancy:
Summary: The most likely theories are:
In any of these cases, the intended one-to-one mapping (i.e. vec.id should equal rm.key) breaks down, which in turn disrupts cleanup logic that relies on matching these identifiers. The remedy usually involves either configuring the vector store to accept externally supplied ids or adjusting the workflow to update the record manager after retrieving the actual inserted id from the vector store.
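To make those two remedies concrete, here is a rough TypeScript sketch. The interfaces are simplified stand-ins for the real LangChain/Flowise types (the update options loosely mirror LangChain's RecordManager API), so treat this as an illustration of the idea rather than the actual implementation.

```typescript
// Simplified stand-in types -- not the real LangChain/Flowise interfaces.
interface DocLike {
  pageContent: string
  metadata: Record<string, unknown>
}
interface VectorStoreLike {
  addDocuments(docs: DocLike[], options?: { ids?: string[] }): Promise<string[] | void>
}
interface RecordManagerLike {
  update(keys: string[], opts?: { timeAtLeast?: number; groupIds?: (string | null)[] }): Promise<void>
}

// Remedy 1: pass the record manager keys as the vector store ids, so
// vec.id === rm.key by construction and cleanup-by-id finds stale rows.
async function upsertWithExternalIds(
  store: VectorStoreLike,
  rm: RecordManagerLike,
  docs: DocLike[],
  uids: string[],
  indexStartDt: number,
  sourceIds: (string | null)[]
): Promise<void> {
  await store.addDocuments(docs, { ids: uids })
  await rm.update(uids, { timeAtLeast: indexStartDt, groupIds: sourceIds })
}

// Remedy 2: let the store generate its own ids, then record whatever it
// actually used (requires an addDocuments variant that returns the inserted ids).
async function upsertRecordingGeneratedIds(
  store: VectorStoreLike,
  rm: RecordManagerLike,
  docs: DocLike[],
  indexStartDt: number,
  sourceIds: (string | null)[]
): Promise<void> {
  const maybeIds = await store.addDocuments(docs)
  const insertedIds = Array.isArray(maybeIds) ? maybeIds : []
  await rm.update(insertedIds, { timeAtLeast: indexStartDt, groupIds: sourceIds })
}
```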
After further analysis, I think there's some sort of subtle bug in how the indexing code hands the uids off to the vector store here:
Flowise/packages/components/src/indexing.ts, line 300 in c0a7478
Still investigating...
Okay, so because we're using the TypeORM-based implementation of the Postgres vector store, the vector store's table (the documents table) gets its schema from a TypeORM EntitySchema. Notice the column definition for id:
const TypeORMDocumentEntity = new EntitySchema<TypeORMVectorStoreDocument>({
name: fields.tableName ?? defaultDocumentTableName,
columns: {
id: {
generated: "uuid",
type: "uuid",
primary: true,
},
pageContent: { type: String },
metadata: { type: "jsonb" },
embedding: { type: String },
},
});
Here, the id field is marked with generated: "uuid", which means TypeORM declares the column so that the database generates a uuid automatically whenever a row is inserted without an explicit id.
In the ensureTableInDatabase method, the table is created with a SQL statement that includes:
CREATE TABLE IF NOT EXISTS ${this.tableName} (
"id" uuid NOT NULL DEFAULT uuid_generate_v4() PRIMARY KEY,
"pageContent" text,
metadata jsonb,
embedding vector
);
This means that if an insert does not explicitly provide a value for id, PostgreSQL will automatically generate one using uuid_generate_v4().
When documents are added, the following happens:
const documentRow = {
pageContent: documents[idx].pageContent,
embedding: embeddingString,
metadata: documents[idx].metadata,
};
WARNING: Notice that no id field is included in documentRow. Every insert therefore falls back to the table's DEFAULT uuid_generate_v4(), and the generated row id has no relationship to the record manager's key.
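For illustration, here is a hedged sketch of what the row construction could look like if the caller-supplied id were threaded through (this is not the upstream code, just the shape of the fix):

```typescript
// Sketch only: build the row with an explicit id when one is supplied, so
// Postgres never falls back to the DEFAULT uuid_generate_v4() for that column.
interface DocLike {
  pageContent: string
  metadata: Record<string, unknown>
}

const buildDocumentRow = (doc: DocLike, embeddingString: string, id?: string) => ({
  ...(id ? { id } : {}), // id = the record manager's uid, when provided
  pageContent: doc.pageContent,
  embedding: embeddingString,
  metadata: doc.metadata
})
```

With the id carried in the row, the inserted record keeps the record manager key, and the later cleanup-by-id can actually find it.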
Pinpointing the exact location of the code where this problem exists:
Flowise/packages/components/src/indexing.ts, line 300 in c0a7478
See the second parameter that is getting passed into addDocuments here:
await vectorStore.addDocuments(docsToIndex, { ids: uids })
                                            ^^^^^^^^^^^^^
If we look at the TypeORM vector store's addDocuments signature:
async addDocuments(documents: Document[]): Promise<void> {
Notice how the method does not declare a second parameter at all, so the { ids: uids } argument is silently ignored and the record manager's uids never reach the vector store rows.
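To see why that call still compiles and runs even though the ids go nowhere, here is a self-contained sketch (the type and class names are made up; only the shape of the signatures mirrors the snippets above):

```typescript
// Made-up names; only the signature shapes mirror the snippets above.
type Doc = { pageContent: string; metadata: Record<string, unknown> }

interface VectorStoreLike {
  addDocuments(documents: Doc[], options?: { ids?: string[] }): Promise<void>
}

class TypeOrmishStore implements VectorStoreLike {
  // Declaring fewer parameters than the interface is perfectly legal in
  // TypeScript, so this compiles -- and the caller's ids are simply dropped.
  async addDocuments(documents: Doc[]): Promise<void> {
    for (const doc of documents) {
      // no id in the row -> the table's DEFAULT uuid_generate_v4() kicks in
      console.log('inserting row without caller-supplied id:', doc.pageContent)
    }
  }
}

void (async () => {
  const uids = ['rm-key-1', 'rm-key-2'] // record manager keys
  const store: VectorStoreLike = new TypeOrmishStore()
  await store.addDocuments(
    [{ pageContent: 'a', metadata: {} }, { pageContent: 'b', metadata: {} }],
    { ids: uids } // accepted by the interface, never seen by the implementation
  )
})()
```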
Further note -- this appears to be a bug specific to the TypeORM implementation. Let me explain: the PGVectorStore implementation does not suffer from the same problem. Here's why:
The addDocuments method in PGVectorStore is defined to accept an optional second argument:
async addDocuments(
documents: Document[],
options?: { ids?: string[] }
): Promise<void> { ... }
It then passes these ids down into the SQL insert it builds.
In the code that builds each row's values, the provided id is appended whenever ids are given:
if (ids) {
values.push(ids[i]);
}
So each row will contain the document’s content, embedding, metadata, optionally a collection id, and finally the provided id.
if (rows.length !== 0 && columns.length === rows[0].length - 1) {
columns.push(this.idColumnName);
}
This dynamically includes the id column (e.g. "id") in the generated INSERT statement whenever ids are supplied.
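As a rough illustration of the resulting statement shape (the table and column names here are assumptions for the example, not the library's exact output):

```typescript
// Rough illustration only -- not the library's actual query builder.
function buildInsert(tableName: string, idColumnName: string, ids?: string[]): string {
  const columns = ['"pageContent"', 'metadata', 'embedding']
  const row: unknown[] = ['some content', JSON.stringify({ source: 'airtable' }), '[0.1,0.2,0.3]']
  if (ids) {
    row.push(ids[0])                  // the record manager uid rides along as the row id
    columns.push(`"${idColumnName}"`) // and the id column is appended to the column list
  }
  const placeholders = row.map((_, i) => `$${i + 1}`).join(', ')
  return `INSERT INTO ${tableName} (${columns.join(', ')}) VALUES (${placeholders})`
}

// With ids supplied, the inserted row's id equals the record manager key,
// so a later delete-by-id can actually find the stale vectors.
console.log(buildInsert('documents', 'id', ['00000000-0000-4000-8000-000000000001']))
// INSERT INTO documents ("pageContent", metadata, embedding, "id") VALUES ($1, $2, $3, $4)
```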
Conclusion: the TypeORM-based vector store drops the externally supplied ids and lets Postgres generate brand-new uuids, while the PGVector implementation writes the supplied ids into the rows. That id mismatch is exactly what breaks the record manager's cleanup.
A decent workaround here might be to try to use the PGVector driver option for the Postgres vector store instead of the TypeORM one.
Okay, from this commit, I think I see why it was disabled. This comment explains why, in v2.2.5:
So while the workaround to use the PGVector driver might address the id mismatch, it sounds like it was disabled because of connection problems, so it may not be a safe drop-in fix yet.
Hey @HenryHengZJ , I took a closer look at your PGVector driver code. Based on analyzing this version of the code: the bug is in the custom override of the pool's query method. The override runs the base query function and then releases the client:
const queryResult = await basePoolQueryFn(queryString, parameters)
instance.client.release()
instance.client = undefined
return queryResult
However, if the base query function throws an error, the release call is skipped. Without a try/finally block, the connection remains open, gradually increasing the number of open connections.
There is also a second concern with how the shared client is managed:
if (!instance.client) {
instance.client = await instance.pool.connect()
}
If multiple queries are fired concurrently, they might conflict over the same shared instance.client -- one call can release or clear the client while another call is still using it.
In summary, the lack of a try/finally block (or equivalent error handling) and the unsafe management of a shared client in the overridden query method are the most likely causes of the steadily growing number of open connections.
Below is one recommended diff that wraps the query execution in a try/finally block so that the connection is always released, even when an error occurs. In the original code, the custom override does this:
// Run base function
const queryResult = await basePoolQueryFn(queryString, parameters)
// ensure connection is released
instance.client.release()
instance.client = undefined
return queryResult
Instead, we can change it to:
- // Run base function
- const queryResult = await basePoolQueryFn(queryString, parameters)
-
- // ensure connection is released
- instance.client.release()
- instance.client = undefined
-
- return queryResult
+ let queryResult;
+ try {
+ queryResult = await basePoolQueryFn(queryString, parameters);
+ } finally {
+ if (instance.client) {
+ instance.client.release();
+ instance.client = undefined;
+ }
+ }
+ return queryResult;
Explanation: with the try/finally in place, instance.client is always released and reset, even when basePoolQueryFn throws, so failed queries no longer leak connections.
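For context, here is roughly how the whole override could look with that change applied. This is a sketch assuming a node-postgres Pool; the instance.client and basePoolQueryFn naming mirrors the snippets quoted in this thread, not the exact Flowise source.

```typescript
import { Pool, PoolClient, QueryResult } from 'pg'

// Stand-in for the object that carries the pool and the shared client.
interface PoolHolder {
  pool: Pool
  client?: PoolClient
}

// Builds a query function that always releases the checked-out client,
// even when the underlying query throws.
function makeSafeQuery(instance: PoolHolder) {
  const basePoolQueryFn = instance.pool.query.bind(instance.pool) as (
    text: string,
    values?: unknown[]
  ) => Promise<QueryResult>

  return async (queryString: string, parameters?: unknown[]): Promise<QueryResult> => {
    if (!instance.client) {
      instance.client = await instance.pool.connect()
    }
    let queryResult: QueryResult
    try {
      queryResult = await basePoolQueryFn(queryString, parameters)
    } finally {
      // release even when the query throws, so connections never leak
      if (instance.client) {
        instance.client.release()
        instance.client = undefined
      }
    }
    return queryResult
  }
}
```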
Note: This diff should help mitigate the problem of unreleased connections in the PGVector driver’s custom query override. Hope this helps, @HenryHengZJ
Hey @BenMushen and @iristhiawl95 , so hopefully this explains the root cause behind the issue. I'm not sure how @HenryHengZJ wants to resolve it. Essentially, if the TypeORM implementation is going to eventually be replaced with the PGVector driver, then I could see him just fixing the PGVector driver, which ultimately solves this issue. Otherwise, the fix for the TypeORM issue looks complicated, as it might require committing some upstream fixes in the langchainjs library. So, to recap (TL;DR version):
1. The TypeORM-based Postgres vector store ignores the ids passed in by the indexing layer, so the vector store ids never match the record manager keys and stale records are never cleaned up.
2. The PGVector driver does handle those ids correctly, but it was disabled because its custom query override can leak connections.
3. Fixing that connection leak (see the diff above) and switching to the PGVector driver is probably the shortest path to a working Record Manager cleanup.
Hope that helps!
Another possible temp option:
Update: I got a chance to use Option #4 and I can confirm that Record Manager operations work perfectly -- both FULL and INCREMENTAL mode now work as expected. I'm working with about 30K records, and I can say that FULL cleanup mode takes about 10x longer to run than INCREMENTAL (which is sort of expected). That said, the whole indexing algorithm still feels a bit slow. The only way the system can determine which records are duplicates is by inserting ALL records into the DB table again and then performing some sort of differential SQL query to prune the duplicates. This ends up causing very large IO overhead on the DB. A faster way to accomplish this might be to implement a bloom filter within the RecordManager codebase -- as records are inserted into the DB, store each record's hash in a bloom filter. Then, during the next update, cross-check the bloom filter before inserting new records into the DB. That costs more CPU, but it should still be orders of magnitude less overhead than the extra DB calls.
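For what it's worth, here is a minimal, single-process sketch of that idea (in-memory only, with a simple FNV-1a-style hash for illustration -- nothing like the Redis-backed design proposed below):

```typescript
// Minimal in-memory bloom filter: k seeded hash functions over one bit array.
class TinyBloom {
  private bits: Uint8Array
  constructor(private size = 1 << 20, private k = 4) {
    this.bits = new Uint8Array(Math.ceil(size / 8))
  }
  private hash(item: string, seed: number): number {
    let h = 2166136261 ^ seed // FNV-1a style hash, seeded per hash function
    for (let i = 0; i < item.length; i++) {
      h ^= item.charCodeAt(i)
      h = Math.imul(h, 16777619)
    }
    return (h >>> 0) % this.size
  }
  add(item: string): void {
    for (let i = 0; i < this.k; i++) {
      const idx = this.hash(item, i)
      this.bits[idx >> 3] |= 1 << (idx & 7)
    }
  }
  mightContain(item: string): boolean {
    for (let i = 0; i < this.k; i++) {
      const idx = this.hash(item, i)
      if (!(this.bits[idx >> 3] & (1 << (idx & 7)))) return false
    }
    return true
  }
}

// Usage: skip the DB round-trip for uids the filter has definitely never seen.
const seen = new TinyBloom()
function shouldCheckDatabase(uid: string): boolean {
  if (!seen.mightContain(uid)) {
    seen.add(uid)
    return false // definitely new -- insert without an existence query
  }
  return true    // maybe a duplicate -- confirm with recordManager.exists()
}
```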
Hey @HenryHengZJ , after analyzing the indexing code further, here's a proposal for optimizing it.
Original Problem: The indexing system was potentially performing unnecessary database operations and duplicating work across multiple containers when dealing with large document sets.
Key Improvements:
1. Bloom-filter pre-checks on document uids, so most recordManager.exists() round-trips to the database can be skipped.
2. Redis-backed Bloom filters with distributed locking (Redlock), so multiple containers can share the same filter state.
3. A batched FULL-cleanup loop that consults the processed-documents filter before deleting keys.
Implementation Notes for Maintainers: everything is opt-in behind a new useBloomFilter option (with redisUrl, bloomFilterFalsePositiveRate, and bloomFilterTTL companions), and it introduces optional dependencies on ioredis, redlock, xxhash-wasm, and murmurhash3js-revisited.
And here's an approximate DIFF that illustrates these improvements to the indexing code. NOTE: This code has NOT been fully tested and MAY contain errors.
// New imports
+ import Redis from 'ioredis'
+ import { default as Redlock } from 'redlock'
+ import XXH from 'xxhash-wasm'
+ import { MurmurHash3 } from 'murmurhash3js-revisited'
// Modified IndexOptions
export type IndexOptions = {
// ... existing options ...
+ /**
+ * Enable Bloom filter optimization
+ */
+ useBloomFilter?: boolean
+ /**
+ * Redis URL for BloomFilter store
+ */
+ redisUrl?: string
+ /**
+ * Bloom filter false positive rate
+ */
+ bloomFilterFalsePositiveRate?: number
+ /**
+ * TTL for Bloom filters in Redis
+ */
+ bloomFilterTTL?: number
}
// New class definitions
+ /**
+ * Optimized Bloom Filter implementation using multiple hash functions
+ * for better distribution and false positive rate control
+ */
+ class BloomFilter {
+ private bitArray: Uint8Array
+ private hashFunctions: number
+ private size: number
+ private xxhash: any
+ private itemCount: number
+ private maxItems: number
+ private readonly MAX_FILL_RATIO = 0.75 // Threshold before needing resize
+
+ constructor(expectedItems: number, falsePositiveRate = 0.01) {
+ this.size = this.calculateOptimalSize(expectedItems, falsePositiveRate)
+ this.hashFunctions = this.calculateOptimalHashFunctions(expectedItems, this.size)
+ this.bitArray = new Uint8Array(Math.ceil(this.size / 8))
+ this.itemCount = 0
+ this.maxItems = expectedItems
+ // Initialize xxhash asynchronously - will be ready before first use
+ XXH().then(xxh => { this.xxhash = xxh })
+ }
+
+ private calculateOptimalSize(n: number, p: number): number {
+ return Math.ceil(-(n * Math.log(p)) / (Math.log(2) ** 2))
+ }
+
+ private calculateOptimalHashFunctions(n: number, m: number): number {
+ return Math.max(1, Math.round((m / n) * Math.log(2)))
+ }
+
+ private async getHashValues(item: string): Promise<number[]> {
+ const results: number[] = []
+ // Use xxhash for first hash
+ const hash1 = this.xxhash ? this.xxhash.h64(item) : MurmurHash3.x64.hash128(item)
+ // Use MurmurHash for second hash
+ const hash2 = MurmurHash3.x64.hash128(item)
+
+ // Double hashing technique for generating multiple hash values
+ for (let i = 0; i < this.hashFunctions; i++) {
+ const combinedHash = (hash1 + i * hash2) % this.size
+ results.push(Math.abs(combinedHash))
+ }
+ return results
+ }
+
+ private isFull(): boolean {
+ return this.itemCount >= this.maxItems * this.MAX_FILL_RATIO
+ }
+
+ private async resize(): Promise<{ needsReconstruction: boolean; newSize?: number }> {
+ const newSize = this.maxItems * 2
+
+ // The existing bit array cannot be migrated in place, so grow the filter
+ // parameters and signal the caller to rebuild the contents from source.
+ this.maxItems = newSize
+ this.size = this.calculateOptimalSize(newSize, this.getFalsePositiveRate())
+ this.hashFunctions = this.calculateOptimalHashFunctions(newSize, this.size)
+ this.bitArray = new Uint8Array(Math.ceil(this.size / 8))
+
+ // Signal that filter needs reconstruction
+ return {
+ needsReconstruction: true,
+ newSize: newSize
+ }
+ }
+
+ private getFalsePositiveRate(): number {
+ // Calculate current false positive rate based on item count and size
+ return Math.pow(1 - Math.exp(-this.hashFunctions * this.itemCount / this.size), this.hashFunctions)
+ }
+
+ async add(item: string): Promise<{ needsReconstruction: boolean; newSize?: number }> {
+ if (this.isFull()) {
+ return await this.resize()
+ }
+
+ const hashes = await this.getHashValues(item)
+ for (const hash of hashes) {
+ const byteIndex = Math.floor(hash / 8)
+ const bitIndex = hash % 8
+ this.bitArray[byteIndex] |= (1 << bitIndex)
+ }
+ this.itemCount++
+ return { needsReconstruction: false }
+ }
+
+ async test(item: string): Promise<boolean> {
+ const hashes = await this.getHashValues(item)
+ for (const hash of hashes) {
+ const byteIndex = Math.floor(hash / 8)
+ const bitIndex = hash % 8
+ if (!(this.bitArray[byteIndex] & (1 << bitIndex))) {
+ return false
+ }
+ }
+ return true
+ }
+
+ toJSON(): {
+ bitArray: number[],
+ hashFunctions: number,
+ size: number,
+ itemCount: number,
+ maxItems: number
+ } {
+ return {
+ bitArray: Array.from(this.bitArray),
+ hashFunctions: this.hashFunctions,
+ size: this.size,
+ itemCount: this.itemCount,
+ maxItems: this.maxItems
+ }
+ }
+
+ static fromJSON(data: {
+ bitArray: number[],
+ hashFunctions: number,
+ size: number,
+ itemCount: number,
+ maxItems: number
+ }): BloomFilter {
+ const filter = new BloomFilter(1) // Temporary initialization
+ filter.size = data.size
+ filter.hashFunctions = data.hashFunctions
+ filter.bitArray = new Uint8Array(data.bitArray)
+ filter.itemCount = data.itemCount
+ filter.maxItems = data.maxItems
+ return filter
+ }
+ }
+
+ /**
+ * Redis-based store for BloomFilters with distributed locking
+ */
+ class RedisBloomFilterStore {
+ private redis: Redis
+ private redlock: Redlock
+ private ttlSeconds: number
+ private lockTimeout: number
+
+ constructor(redisUrl: string, options: {
+ ttlSeconds?: number,
+ lockTimeout?: number
+ } = {}) {
+ this.redis = new Redis(redisUrl)
+ this.redlock = new Redlock(
+ [this.redis],
+ {
+ retryJitter: 200,
+ retryCount: 5,
+ driftFactor: 0.01,
+ }
+ )
+ this.ttlSeconds = options.ttlSeconds ?? 3600
+ this.lockTimeout = options.lockTimeout ?? 5000
+ }
+
+ private getKey(namespace: string, filterType: 'existing' | 'processed'): string {
+ return `bloom:${namespace}:${filterType}`
+ }
+
+ private getLockKey(namespace: string, filterType: 'existing' | 'processed'): string {
+ return `lock:${this.getKey(namespace, filterType)}`
+ }
+
+ async get(namespace: string, filterType: 'existing' | 'processed'): Promise<BloomFilter | null> {
+ try {
+ const data = await this.redis.get(this.getKey(namespace, filterType))
+ if (!data) return null
+ return BloomFilter.fromJSON(JSON.parse(data))
+ } catch (error) {
+ console.error('Error retrieving BloomFilter:', error)
+ return null
+ }
+ }
+
+ async set(
+ namespace: string,
+ filterType: 'existing' | 'processed',
+ filter: BloomFilter
+ ): Promise<void> {
+ const lockKey = this.getLockKey(namespace, filterType)
+ let lock
+
+ try {
+ lock = await this.redlock.acquire([lockKey], this.lockTimeout)
+ // Get current filter to check if we need to merge during resize
+ const currentFilter = await this.get(namespace, filterType)
+ if (currentFilter) {
+ // If current filter exists and new filter is larger, merge them
+ if (filter.maxItems > currentFilter.maxItems) {
+ // Merge operation would happen here
+ // We'd need to add all items from the current filter to the new one
+ // This would require keeping track of actual items or reconstructing from source
+ }
+ }
+ await this.redis
+ .multi()
+ .set(
+ this.getKey(namespace, filterType),
+ JSON.stringify(filter.toJSON()),
+ 'EX',
+ this.ttlSeconds
+ )
+ .exec()
+ } finally {
+ if (lock) await lock.release().catch(console.error)
+ }
+ }
+
+ async delete(namespace: string, filterType: 'existing' | 'processed'): Promise<void> {
+ await this.redis.del(this.getKey(namespace, filterType))
+ }
+ }
// Modified index function
export async function index(args: IndexArgs): Promise<IndexingResult> {
const { docsSource, recordManager, vectorStore, options } = args
- const { batchSize = 100, cleanup, sourceIdKey, cleanupBatchSize = 1000, forceUpdate = false, vectorStoreName } = options ?? {}
+ const {
+ batchSize = 100,
+ cleanup,
+ sourceIdKey,
+ cleanupBatchSize = 1000,
+ forceUpdate = false,
+ vectorStoreName,
+ useBloomFilter = false,
+ redisUrl,
+ bloomFilterFalsePositiveRate = 0.01,
+ bloomFilterTTL = 3600
+ } = options ?? {}
+ // New validation
+ if (useBloomFilter && !redisUrl) {
+ throw new Error("redisUrl is required when useBloomFilter is enabled")
+ }
+ // Initialize BloomFilter store
+ let bloomFilterStore: RedisBloomFilterStore | null = null
+ if (useBloomFilter && redisUrl) {
+ bloomFilterStore = new RedisBloomFilterStore(redisUrl, {
+ ttlSeconds: bloomFilterTTL
+ })
+ }
// ... existing setup code ...
+ // Initialize Bloom filters
+ let existingDocsFilter: BloomFilter | null = null
+ let processedDocsFilter: BloomFilter | null = null
+
+ if (useBloomFilter && bloomFilterStore) {
+ // Try to retrieve existing filters
+ existingDocsFilter = await bloomFilterStore.get(namespace, 'existing')
+ processedDocsFilter = await bloomFilterStore.get(namespace, 'processed')
+
+ // Create new filters if they don't exist
+ if (!existingDocsFilter || !processedDocsFilter) {
+ const existingRecordCount = (await recordManager.listKeys({})).length
+ const estimatedTotalSize = Math.max(existingRecordCount, docs.length) * 1.5
+
+ existingDocsFilter = new BloomFilter(
+ estimatedTotalSize,
+ bloomFilterFalsePositiveRate
+ )
+ processedDocsFilter = new BloomFilter(
+ estimatedTotalSize,
+ bloomFilterFalsePositiveRate
+ )
+
+ // Pre-populate existing docs filter
+ for await (const keyBatch of streamingBatch(
+ await recordManager.listKeys({}),
+ {
+ maxMemoryMB: 512, // Default memory limit
+ maxBatchSize: 1000,
+ minBatchSize: 100
+ }
+ )) {
+ for (const key of keyBatch) {
+ await existingDocsFilter.add(key)
+ }
+ }
+
+ // Store the initialized filters
+ await bloomFilterStore.set(namespace, 'existing', existingDocsFilter)
+ await bloomFilterStore.set(namespace, 'processed', processedDocsFilter)
+ }
+ }
for (const batch of batches) {
const hashedDocs = _deduplicateInOrder(batch.map((doc) => _HashedDocument.fromDocument(doc)))
- const batchExists = await recordManager.exists(hashedDocs.map((doc) => doc.uid))
+ // Modified existence check
+ let batchExists: boolean[]
+ if (useBloomFilter && existingDocsFilter) {
+ // Use Bloom filter for initial check
+ batchExists = await Promise.all(
+ hashedDocs.map(doc => existingDocsFilter.test(doc.uid))
+ )
+
+ // Verify positive matches with database
+ const probableExists = hashedDocs.filter((_, i) => batchExists[i])
+ if (probableExists.length > 0) {
+ const confirmedExists = await recordManager.exists(
+ probableExists.map(doc => doc.uid)
+ )
+
+ // Update batchExists with confirmed results
+ let confirmedIndex = 0
+ batchExists = batchExists.map(exists =>
+ exists ? confirmedExists[confirmedIndex++] : false
+ )
+ }
+ } else {
+ batchExists = await recordManager.exists(hashedDocs.map((doc) => doc.uid))
+ }
// ... existing document processing ...
+ // Update Bloom filters
+ if (useBloomFilter && processedDocsFilter) {
+ for (const doc of hashedDocs) {
+ const addResult = await processedDocsFilter.add(doc.uid)
+ if (addResult.needsReconstruction) {
+ // Create new filter with larger size
+ const newFilter = new BloomFilter(
+ addResult.newSize!,
+ bloomFilterFalsePositiveRate
+ )
+
+ // We need to reconstruct the filter from source
+ // This might require reading all processed documents again
+ const processedDocs = await recordManager.listKeys({
+ after: indexStartDt
+ })
+
+ for (const processedDoc of processedDocs) {
+ await newFilter.add(processedDoc)
+ }
+
+ // Update the filter in Redis
+ await bloomFilterStore!.set(namespace, 'processed', newFilter)
+ processedDocsFilter = newFilter
+ }
+ }
+ }
if (cleanup === 'incremental') {
// ... existing code ...
- const uidsToDelete = await recordManager.listKeys({
+ let uidsToDelete = await recordManager.listKeys({
before: indexStartDt,
groupIds: sourceIds
})
+ // Add Bloom filter check for cleanup
+ if (useBloomFilter && processedDocsFilter) {
+ uidsToDelete = (await Promise.all(
+ uidsToDelete.map(async uid => ({
+ uid,
+ processed: await processedDocsFilter.test(uid)
+ }))
+ )).filter(item => !item.processed)
+ .map(item => item.uid)
+ }
}
}
+ // Final update of processed filter
+ if (useBloomFilter && processedDocsFilter && bloomFilterStore) {
+ await bloomFilterStore.set(namespace, 'processed', processedDocsFilter)
+ }
if (cleanup === 'full') {
+ let continuing = true
- let uidsToDelete = await recordManager.listKeys({
- before: indexStartDt,
- limit: cleanupBatchSize
- })
- while (uidsToDelete.length > 0) {
- await vectorStore.delete({ ids: uidsToDelete })
- await recordManager.deleteKeys(uidsToDelete)
- numDeleted += uidsToDelete.length
+ while (continuing) {
+ let uidsToDelete = await recordManager.listKeys({
+ before: indexStartDt,
+ limit: cleanupBatchSize
+ })
+
+ if (uidsToDelete.length === 0) {
+ continuing = false
+ continue
+ }
+
+ // Use Bloom filter to check which docs weren't processed
+ if (useBloomFilter && processedDocsFilter) {
+ uidsToDelete = (await Promise.all(
+ uidsToDelete.map(async uid => ({
+ uid,
+ processed: await processedDocsFilter.test(uid)
+ }))
+ )).filter(item => !item.processed)
+ .map(item => item.uid)
+ }
+
+ if (uidsToDelete.length > 0) {
+ await vectorStore.delete({ ids: uidsToDelete })
+ await recordManager.deleteKeys(uidsToDelete)
+ numDeleted += uidsToDelete.length
+ }
+
uidsToDelete = await recordManager.listKeys({
before: indexStartDt,
limit: cleanupBatchSize
})
}
}
// ... rest of the function remains the same ...
}
And here's a comprehensive example of how this code could be called upstream:
const result = await index({
docsSource,
recordManager,
vectorStore,
options: {
// Basic indexing options (original)
cleanup: 'full', // or 'incremental'
batchSize: 1000, // Default: 100
sourceIdKey: 'source', // Required for incremental cleanup
cleanupBatchSize: 1000, // Default: 1000
forceUpdate: false, // Default: false
vectorStoreName: 'my_store', // Namespace for this store
// Bloom filter options (new)
useBloomFilter: true, // Enable Bloom filter optimization
redisUrl: 'redis://localhost:6379', // Required if useBloomFilter is true
bloomFilterFalsePositiveRate: 0.01, // Default: 0.01 (1%)
bloomFilterTTL: 3600, // Default: 3600 (1 hour)
// Memory management options (new)
maxMemoryMB: 512, // Default: 512
maxBatchSize: 1000, // Default: 1000
minBatchSize: 100, // Default: 100
// Redis lock options (new)
lockTimeout: 5000, // Default: 5000 (5 seconds)
}
});
// TypeScript type reference:
interface IndexOptions {
// Original options
batchSize?: number;
cleanup?: 'full' | 'incremental';
sourceIdKey?: string | ((doc: DocumentInterface) => string);
cleanupBatchSize?: number;
forceUpdate?: boolean;
vectorStoreName?: string;
// New Bloom filter options
useBloomFilter?: boolean;
redisUrl?: string;
bloomFilterFalsePositiveRate?: number;
bloomFilterTTL?: number;
// New memory management options
maxMemoryMB?: number;
maxBatchSize?: number;
minBatchSize?: number;
// New Redis lock options
lockTimeout?: number;
}
All of these options are optional except: redisUrl must be provided whenever useBloomFilter is true (the call throws otherwise), and sourceIdKey is still required when cleanup is 'incremental'.
A minimal configuration with Bloom filter enabled would be:
const result = await index({
docsSource,
recordManager,
vectorStore,
options: {
useBloomFilter: true,
redisUrl: 'redis://localhost:6379',
vectorStoreName: 'my_store'
}
});
The implementation will use sensible defaults for all other options.
I've opened an issue in the upstream langchainjs project to hopefully track the long-term fix for this:
Describe the bug
In Flowise, I've created a new Document Store where an Airtable Document Loader feeds data into a Postgres Vector Store. We're also using Postgres to store the Record Manager records. Whenever I update an existing record in Airtable and then proceed to perform the Upsert operation, I see the new vector store record created but the OLD vector store record still EXISTS. The OLD vector store record does not get properly deleted!
To Reproduce
Steps to reproduce the behavior:
Expected behavior
I would expect that Upserting would purge the OLD vector store record.
Screenshots
If applicable, add screenshots to help explain your problem.
Flow
If applicable, add an exported flow to help replicate the problem.
Setup
Additional context
So my vector store table is named mechanisms_vec_v1 and my record manager table is named mechanisms_rm_v1. I THINK the way the deletion logic is supposed to work is a match on where mechanisms_vec_v1.id = mechanisms_rm_v1.key. HOWEVER, in my case, there is NO OVERLAP IN VALUES. For some reason, when a vector store document is inserted, that record's mechanisms_vec_v1.id IS NEVER THE SAME VALUE as its corresponding mechanisms_rm_v1.key value.
Because of this, I think there's some sort of BUG between when the Airtable data gets loaded and when it gets added to the vector store -- specifically, I think the mechanisms_vec_v1.id computed value might NOT be getting set the same way as the mechanisms_rm_v1.key value.
It's exceptionally hard to figure out exactly where the bug is located, because of all the layers of indirection and use of langchain built-in libraries, but I hope these clues help, @HenryHengZJ . Thanks in advance!