Handle zero chunk size #264

Closed
wants to merge 74 commits into from
74 commits
7b13491
Index sections (#165)
Thejas-bhat Nov 3, 2023
7b302a5
memory leak fixes (#176)
Thejas-bhat Nov 8, 2023
c7c5cc5
bug fix: reconstruct failing due to key not found (#177)
Thejas-bhat Nov 10, 2023
b8c2117
Use latest of blevesearch/go-faiss (#178)
abhinavdangeti Nov 10, 2023
badde32
MB-59569: Upgrade to blevesearch/[email protected] (#183)
abhinavdangeti Nov 17, 2023
bbe6b87
MB-59692: Respect choice of similarity metric, [email protected]
abhinavdangeti Nov 17, 2023
2ff712a
optimization: avoiding storing of bitmaps when vector is in a single …
Thejas-bhat Nov 17, 2023
24c8620
bug fix: accounting the vecIDs appropriately while merging (#179)
Thejas-bhat Nov 17, 2023
225d0b0
replacing the flat index with the IVF family index. (#180)
Thejas-bhat Nov 20, 2023
cd8af4c
command line tool support for vector search (#175)
Thejas-bhat Nov 21, 2023
c182fcb
choosing the index types more responsibly (#186)
Thejas-bhat Nov 22, 2023
9be5eb5
fixing IO stats computation (#188)
Thejas-bhat Nov 30, 2023
d176983
bug fix: fixing the logic of tracking vector ids during a merge (#187)
Thejas-bhat Nov 30, 2023
c86b701
Resetting the fields of opaque for reusability (#189)
Thejas-bhat Nov 30, 2023
6e8258a
Separate functions to read and search vector indexes. (#184)
metonymic-smokey Nov 30, 2023
e11ed23
Improve search path for multi kNN and more necessarily for single kNN…
abhinavdangeti Dec 1, 2023
b9a5731
Graceful exit when min arguments not provided for vector (#193)
abhinavdangeti Dec 7, 2023
8d58766
Beautify & Fix cmd line tooling output for vector content (#194)
abhinavdangeti Dec 7, 2023
d8031a6
MB-60068: Case when dimensionality of index and query vector mismatch…
abhinavdangeti Dec 12, 2023
972e54b
MB-60071 - using the appropriate error value when the closeCh is clos…
Thejas-bhat Dec 13, 2023
13ff83d
MB-60084: Change index merge procedure to account for existing index …
metonymic-smokey Dec 14, 2023
4c4c02d
MB-59447: support nested/chunked vectors (#185)
moshaad7 Dec 14, 2023
335bab0
MB-60071: fixing out of bound slice access panic (#198)
Thejas-bhat Dec 15, 2023
b31d5e8
MB-60152: Panic fix while trying to fetch metadata of a vector index …
Thejas-bhat Dec 22, 2023
7a95724
TestVecPostingsIterator unit test fix (#200)
Thejas-bhat Dec 22, 2023
71c6f00
MB-60269 - Merge path fixes (#202)
metonymic-smokey Jan 5, 2024
d96321b
MB-60029 - Changing the defaults for centroid count and nprobe. (#203)
metonymic-smokey Jan 10, 2024
fb96e93
MB-60202 - Filtering deleted documents during FAISS vector index sear…
metonymic-smokey Jan 12, 2024
ec40d20
MB-60029 - Persisting faiss index optimization setting (#204)
metonymic-smokey Jan 17, 2024
9cb449c
MB-60367: Handle memory allocations more responsibly (#206)
Thejas-bhat Jan 18, 2024
478f2b8
Upgrade to [email protected] + additional commentary (#207)
abhinavdangeti Jan 18, 2024
a592edf
MB-60400: Reduce Garbage generated (#208)
CascadingRadium Jan 30, 2024
f7c6ff1
Defer mergeBufferPool.Put to cover error exits (#211)
abhinavdangeti Feb 2, 2024
a0ebb93
Introducing a size API for vector index on query path (#212)
abhinavdangeti Feb 2, 2024
84ca6fe
Early exit condition for vector index merging (#213)
metonymic-smokey Feb 5, 2024
03e71e1
MB-60662: Revert merge buffer pool (#214)
CascadingRadium Feb 5, 2024
a9ccd23
Upgrade go-faiss to address a possible overflow during slice access (…
abhinavdangeti Feb 6, 2024
5515ef0
MB-60367 Restructuring the code for better memory cleanup (#217)
Thejas-bhat Feb 8, 2024
06f9883
MB-60739: Faiss UP (#218)
abhinavdangeti Feb 13, 2024
9d28635
MB-60781: Writing the vector section only when there are valid number…
Thejas-bhat Feb 15, 2024
18bb93e
MB-60816: Upgrade go-faiss for fix (#220)
abhinavdangeti Feb 21, 2024
a4265eb
Upgrade blevesearch/go-faiss (#221)
abhinavdangeti Feb 22, 2024
96431e1
MB-60662: revert doc id cache (#222)
CascadingRadium Feb 23, 2024
8cea0fc
Update zap.md to reflect v16 file format (#224)
Thejas-bhat Mar 12, 2024
9ec1077
MB-60914: Zap field specific metrics (#223)
Likith101 Mar 18, 2024
40d804f
Upgrade to RoaringBitmap/[email protected] (#227)
abhinavdangeti Mar 26, 2024
579fb96
MB-60791: Avoid compute of vectorIDsToExclude on every search (#229)
abhinavdangeti Mar 26, 2024
f72c611
MB-60943 - Add a coarse quantiser to the IVF indexes (#225)
metonymic-smokey Apr 1, 2024
c4f4234
Revert "MB-60943 - Add a coarse quantiser to the IVF indexes (#225)" …
abhinavdangeti Apr 10, 2024
1beccd1
MB-61029: Deferring the closing of vector index (#226)
Thejas-bhat Apr 12, 2024
0aa74ef
minor optimizations and bug fixes (#233)
CascadingRadium Apr 17, 2024
08feb6f
MB-61029: Caching Vec To DocID Map (#231)
Likith101 Apr 18, 2024
6a955fa
MB-61650: Upgrade blevesearch/go-faiss (#236)
abhinavdangeti Apr 29, 2024
3d28f8e
MB-60697: Upgrade blevesearch/go-faiss for windows fix (#238)
abhinavdangeti Apr 29, 2024
a8febb7
add map capacity (#237)
CascadingRadium Apr 29, 2024
52342b1
MB-60943 - Reduce number of centroids for IVF indexes. (#234)
metonymic-smokey Apr 30, 2024
d16a7e9
Stop the ticker in vector cache (#240)
CascadingRadium May 6, 2024
9d26375
Move up to [email protected] & [email protected] (#242)
abhinavdangeti May 14, 2024
164c638
MB-59575: Handling memory mapped content more responsibly (#241)
Thejas-bhat May 15, 2024
f3fb79e
bug fix: de-duplicating fieldsInv value (#239)
Thejas-bhat May 15, 2024
0ec3ad3
MB-60943 - Change quantiser when vector index is tuned for memory (#244)
metonymic-smokey May 21, 2024
7860ad1
MB-60943: Index optimization refactor memory -> memory-efficient (#245)
abhinavdangeti Jun 3, 2024
b6b0428
MB-62167: Fix windows crash (#246)
CascadingRadium Jun 6, 2024
2d9329f
MB-61930: Use the go-faiss OpenMP API for setting the default number …
CascadingRadium Jun 11, 2024
ab311ca
MB-62221: Upgrade go-faiss (#248)
abhinavdangeti Jun 12, 2024
75dad85
MB-62221: Upgrade go-faiss to v1.0.19 (#249)
abhinavdangeti Jun 13, 2024
2108c55
MB-61889: support search with params (#250)
moshaad7 Jul 25, 2024
af88577
MB-62427: seeking the pointer correctly while reading segmentBase dat…
Thejas-bhat Jul 26, 2024
1b549a1
MB-61889: fix test cases (#253)
moshaad7 Jul 29, 2024
90e0ee6
MB-62354: Support Cosine Similarity (#251)
CascadingRadium Aug 5, 2024
e2e3b77
MB-61889: Upgrade go-faiss for corner case fix (#260)
abhinavdangeti Sep 4, 2024
9049151
MB-62230 - Support for pre-filtering with kNN (#255)
metonymic-smokey Sep 9, 2024
4d44092
Fix unit test breakage from previous commit (API change) (#262)
abhinavdangeti Sep 9, 2024
af6bef4
Handle zero chunk size
moshaad7 Sep 12, 2024
3 changes: 2 additions & 1 deletion .github/workflows/tests.yml
@@ -2,6 +2,7 @@ on:
   push:
     branches:
       - master
+      - v15.x
       - v14.x
       - v13.x
       - v12.x
@@ -12,7 +13,7 @@ jobs:
   test:
     strategy:
       matrix:
-        go-version: [1.18.x, 1.19.x, 1.20.x]
+        go-version: [1.20.x, 1.21.x, 1.22.x]
         platform: [ubuntu-latest, macos-latest]
     runs-on: ${{ matrix.platform }}
     steps:
44 changes: 26 additions & 18 deletions build.go
@@ -24,8 +24,8 @@ import (
 	"github.com/blevesearch/vellum"
 )
 
-const Version uint32 = 15
-
+const Version uint32 = 16
+const IndexSectionsVersion uint32 = 16
 const Type string = "zap"
 
 const fieldNotUninverted = math.MaxUint64
@@ -98,7 +98,7 @@ func persistSegmentBaseToWriter(sb *SegmentBase, w io.Writer) (int, error) {
 		return 0, err
 	}
 
-	err = persistFooter(sb.numDocs, sb.storedIndexOffset, sb.fieldsIndexOffset,
+	err = persistFooter(sb.numDocs, sb.storedIndexOffset, sb.fieldsIndexOffset, sb.sectionsIndexOffset,
 		sb.docValueOffset, sb.chunkMode, sb.memCRC, br)
 	if err != nil {
 		return 0, err
@@ -159,25 +159,33 @@ func persistStoredFieldValues(fieldID int,
 
 func InitSegmentBase(mem []byte, memCRC uint32, chunkMode uint32,
 	fieldsMap map[string]uint16, fieldsInv []string, numDocs uint64,
-	storedIndexOffset uint64, fieldsIndexOffset uint64, docValueOffset uint64,
-	dictLocs []uint64) (*SegmentBase, error) {
+	storedIndexOffset uint64, dictLocs []uint64,
+	sectionsIndexOffset uint64) (*SegmentBase, error) {
 	sb := &SegmentBase{
-		mem:               mem,
-		memCRC:            memCRC,
-		chunkMode:         chunkMode,
-		fieldsMap:         fieldsMap,
-		fieldsInv:         fieldsInv,
-		numDocs:           numDocs,
-		storedIndexOffset: storedIndexOffset,
-		fieldsIndexOffset: fieldsIndexOffset,
-		docValueOffset:    docValueOffset,
-		dictLocs:          dictLocs,
-		fieldDvReaders:    make(map[uint16]*docValueReader),
-		fieldFSTs:         make(map[uint16]*vellum.FST),
+		mem:                 mem,
+		memCRC:              memCRC,
+		chunkMode:           chunkMode,
+		fieldsMap:           fieldsMap,
+		numDocs:             numDocs,
+		storedIndexOffset:   storedIndexOffset,
+		fieldsIndexOffset:   sectionsIndexOffset,
+		sectionsIndexOffset: sectionsIndexOffset,
+		fieldDvReaders:      make([]map[uint16]*docValueReader, len(segmentSections)),
+		docValueOffset:      0, // docValueOffsets identified automatically by the section
+		dictLocs:            dictLocs,
+		fieldFSTs:           make(map[uint16]*vellum.FST),
+		vecIndexCache:       newVectorIndexCache(),
 	}
 	sb.updateSize()
 
-	err := sb.loadDvReaders()
+	// load the data/section starting offsets for each field
+	// by via the sectionsIndexOffset as starting point.
+	err := sb.loadFieldsNew()
+	if err != nil {
+		return nil, err
+	}
+
+	err = sb.loadDvReaders()
 	if err != nil {
 		return nil, err
 	}
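In the build.go diff, fieldDvReaders changes from a single map to a slice of maps sized by len(segmentSections). In Go, make on a slice of maps allocates only the outer slice, so every inner map starts out nil and must be allocated before the first write. The sketch below illustrates that pattern in isolation. It is not zapx's code: readerTable, newReaderTable, set and get are hypothetical names, docValueReader is a stand-in type, and lazy allocation on first write is an assumption about how such a field could be populated.

```go
package main

import "fmt"

// docValueReader is a hypothetical stand-in for the zapx type of the same name.
type docValueReader struct {
	fieldID uint16
}

// readerTable mirrors the new field shape: one map of readers per section.
type readerTable []map[uint16]*docValueReader

// newReaderTable allocates the outer slice only; every inner map is nil.
func newReaderTable(sectionCount int) readerTable {
	return make(readerTable, sectionCount)
}

// set stores a reader, allocating the section's map on first use.
// Writing to a nil map panics in Go, so the nil check is required.
func (t readerTable) set(section int, fieldID uint16, r *docValueReader) {
	if t[section] == nil {
		t[section] = make(map[uint16]*docValueReader)
	}
	t[section][fieldID] = r
}

// get looks up a reader; reading from a nil map is safe and reports "not found".
func (t readerTable) get(section int, fieldID uint16) (*docValueReader, bool) {
	r, ok := t[section][fieldID]
	return r, ok
}

func main() {
	table := newReaderTable(2) // assumption: two sections, for illustration
	table.set(1, 7, &docValueReader{fieldID: 7})

	_, inSection0 := table.get(0, 7)
	_, inSection1 := table.get(1, 7)
	fmt.Println(inSection0, inSection1)
}
```

Keeping the readers per section means the same field ID can carry separate doc-value readers in different sections without the entries colliding.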
14 changes: 14 additions & 0 deletions chunk.go
@@ -15,6 +15,7 @@
 package zap
 
 import (
+	"errors"
 	"fmt"
 )
 
@@ -26,6 +27,13 @@ var LegacyChunkMode uint32 = 1024
 // be used by default.
 var DefaultChunkMode uint32 = 1026
 
+var ErrChunkSizeZero = errors.New("chunk size is zero")
+
+// getChunkSize returns the chunk size for the given chunkMode, cardinality, and
+// maxDocs.
+//
+// In error cases, the returned chunk size will be 0. Caller can differentiate
+// between a valid chunk size of 0 and an error by checking for ErrChunkSizeZero.
 func getChunkSize(chunkMode uint32, cardinality uint64, maxDocs uint64) (uint64, error) {
 	switch {
 	// any chunkMode <= 1024 will always chunk with chunkSize=chunkMode
@@ -46,6 +54,9 @@ func getChunkSize(chunkMode uint32, cardinality uint64, maxDocs uint64) (uint64,
 		// chunk-size items.
 		// no attempt is made to tweak any other case
 		if cardinality <= 1024 {
+			if maxDocs == 0 {
+				return 0, ErrChunkSizeZero
+			}
 			return maxDocs, nil
 		}
 		return 1024, nil
@@ -61,6 +72,9 @@ func getChunkSize(chunkMode uint32, cardinality uint64, maxDocs uint64) (uint64,
 		// 2. convert to chunkSize, dividing into maxDocs
 		numChunks := (cardinality / 1024) + 1
 		chunkSize := maxDocs / numChunks
+		if chunkSize == 0 {
+			return 0, ErrChunkSizeZero
+		}
 		return chunkSize, nil
 	}
 	return 0, fmt.Errorf("unknown chunk mode %d", chunkMode)
2 changes: 1 addition & 1 deletion cmd/zap/cmd/dict.go
@@ -21,7 +21,7 @@ import (
 
 	"github.com/RoaringBitmap/roaring"
 	"github.com/blevesearch/vellum"
-	zap "github.com/blevesearch/zapx/v15"
+	zap "github.com/blevesearch/zapx/v16"
 	"github.com/spf13/cobra"
 )
 