@ManasviGoyal ManasviGoyal commented Aug 19, 2025

This PR makes Lucene99HnswVectorsFormat accept an injected FlatVectorsFormat, so that a custom flat vectors format can be used with the existing codec.

For testing and BWC, the original package-private 5-arg constructor is retained. No on-disk format or runtime behavior changes occur unless a custom format/scorer is provided.

@ManasviGoyal ManasviGoyal marked this pull request as draft August 19, 2025 04:24
Contributor

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@ManasviGoyal
Author

@ChrisHegarty @benwtrent Please have a look. Thanks!

@benwtrent
Member

All Lucene formats are name-based, meaning no arguments can actually be applied to the reader.

How does this actually work in practice?

I would expect that a new format class is required for each different flat format used with HNSW, because of the named format loader.

@ManasviGoyal
Author

ManasviGoyal commented Aug 20, 2025

> All Lucene formats are name-based, meaning no arguments can actually be applied to the reader.
>
> How does this actually work in practice?
>
> I would expect that a new format class is required for each different flat format used with HNSW, because of the named format loader.

Yes, that's correct: each distinct flat format still needs its own named FlatVectorsFormat.

This PR just lets Lucene99HnswVectorsFormat delegate to whichever named flat format is provided at write time, so we don't need a new outer HNSW format class for each variation. The HNSW format itself stays the same.

@rmuir
Member

rmuir commented Aug 20, 2025

This isn't how it works: you should create an outer format for each variation.

We can't support backwards compatibility for custom formats.
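
For illustration, "an outer format for each variation" can be sketched as below. This is a simplified, self-contained model and not Lucene code: `BaseHnswFormat`, `HnswWithDefaultFlat`, `HnswWithCustomFlat`, and the format names are all hypothetical, and whether the real Lucene classes can be extended this way is not established in this thread. The only point shown is that each variation has its own name and a no-arg constructor that fixes its flat format.

```java
// Simplified model of "one named outer format per variation".
// All class and format names here are hypothetical; none are Lucene classes.
final class OuterFormatSketch {

  /** Hypothetical base that would hold the shared HNSW logic. */
  abstract static class BaseHnswFormat {
    private final String name;           // name a loader would resolve at read time
    private final String flatFormatName; // flat format this variation always uses

    protected BaseHnswFormat(String name, String flatFormatName) {
      this.name = name;
      this.flatFormatName = flatFormatName;
    }

    String getName() {
      return name;
    }

    String flatFormatName() {
      return flatFormatName;
    }
  }

  /** Variation 1: its own name, its own fixed flat format, no-arg constructor. */
  static final class HnswWithDefaultFlat extends BaseHnswFormat {
    HnswWithDefaultFlat() {
      super("HnswWithDefaultFlat", "DefaultFlat");
    }
  }

  /** Variation 2: a different name is what keeps the custom flat format recoverable. */
  static final class HnswWithCustomFlat extends BaseHnswFormat {
    HnswWithCustomFlat() {
      super("HnswWithCustomFlat", "CustomFlat");
    }
  }

  public static void main(String[] args) {
    BaseHnswFormat[] formats = {new HnswWithDefaultFlat(), new HnswWithCustomFlat()};
    for (BaseHnswFormat format : formats) {
      System.out.println(format.getName() + " -> " + format.flatFormatName());
    }
  }
}
```

Because the flat format is fixed inside each no-arg constructor, constructing a variation by name alone always yields the same flat format.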

@ManasviGoyal
Author

> This isn't how it works: you should create an outer format for each variation.
>
> We can't support backwards compatibility for custom formats.

More context: currently we do have outer HNSW format wrappers for each variation of the flat vectors format. But to do so, we create multiple duplicates of Lucene99HnswVectorsFormat whose only difference is the flat vectors format. This means losing out on any future updates to Lucene99HnswVectorsFormat and manually porting the changes to each variation.

Can you please elaborate on your point regarding backward compatibility concerns?

@benwtrent
Member

> Currently we do have outer HNSW format wrappers for each variation of the flat vectors format. But to do so, we create multiple duplicates of Lucene99HnswVectorsFormat whose only difference is the flat vectors format. This means losing out on any future updates to Lucene99HnswVectorsFormat and manually porting the changes to each variation.

I understand the desire to get new changes out of the box. However, all formats are name-based. If there was a substantial change to the HNSW format that required a new name, you would need a new inherited class that provides a NEW name for your format that utilizes a new flat format.

The SPI loader cannot provide any parameters. When constructing the reader, it's done with the default ctor (e.g. new Lucene99HnswVectorsFormat()). Without a custom name, you lose your custom flat format.

With how things are now, this API change just will not work. Adding further complexity to the named format loader to handle recursively named things seems way too complicated to justify the 20-30 lines of code saved.
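
The read-time problem described above can be sketched with a small, self-contained model. It deliberately avoids Lucene's real classes: `HnswFormat`, `REGISTRY`, `forName`, and the names `DefaultHnsw`, `DefaultFlat`, and `CustomFlat` are hypothetical stand-ins, and the registry only approximates a loader that resolves a format by name and calls a no-arg constructor.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Supplier;

// Simplified stand-in for a name-based format loader; not Lucene's actual SPI code.
final class NamedFormatSketch {

  /** Hypothetical stand-in for an HNSW format that wraps a flat format. */
  static final class HnswFormat {
    final String name;       // name recorded alongside the data at write time
    final String flatFormat; // flat format this instance delegates to

    HnswFormat(String name, String flatFormat) {
      this.name = name;
      this.flatFormat = flatFormat;
    }
  }

  // Maps a format name to a no-arg factory, mirroring a loader that can
  // only invoke the default constructor.
  static final Map<String, Supplier<HnswFormat>> REGISTRY = new HashMap<>();

  static {
    REGISTRY.put("DefaultHnsw", () -> new HnswFormat("DefaultHnsw", "DefaultFlat"));
  }

  /** Read-time lookup: only the name is available, no constructor arguments. */
  static HnswFormat forName(String name) {
    Supplier<HnswFormat> factory = REGISTRY.get(name);
    if (factory == null) {
      throw new IllegalArgumentException("No format registered under name: " + name);
    }
    return factory.get();
  }

  public static void main(String[] args) {
    // Write time: a custom flat format is injected, but the outer name is unchanged.
    HnswFormat writer = new HnswFormat("DefaultHnsw", "CustomFlat");

    // Read time: the recorded name is resolved through the no-arg factory.
    HnswFormat reader = forName(writer.name);

    System.out.println("writer flat format: " + writer.flatFormat);
    System.out.println("reader flat format: " + reader.flatFormat);
  }
}
```

In this model the writer reports `CustomFlat` while the reader resolved by name reports `DefaultFlat`: the injected argument never reaches the read path, which is the mismatch the comment describes.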

@ChrisHegarty
Contributor

ChrisHegarty commented Aug 28, 2025

I can understand the reasoning behind this request - I encountered a somewhat similar situation in the past and had considered making a similar change, but didn't (for the same reasons as given by @benwtrent and @rmuir).

That said, I'm not sure this PR is addressing the core issue. If there's a meaningful, reusable piece here, it might make sense to refactor it out of the format so that custom formats can take advantage of it - but it's unclear to me what that reusable part would be.
