Read path #7

piodul · 2025-01-09T08:18:04Z

The goal is to make it possible to issue an ANN (approximate nearest neighbor) query, assuming a populated index in an OpenSearch cluster.

The idea is to reuse the existing infrastructure for querying secondary indexes. A secondary index is a helper table whose partition key is the indexed column and whose clustering key is the full key of the table being indexed. Currently, when you query a table WHERE indexed_column = X, the following happens:

1. The partition corresponding to the value of the indexed column is queried in the index (SELECT * FROM index_table WHERE pk = X).
2. For each row in the result of the previous query, we extract the key of the base table and query the row in the base table.

The idea is to plug into step (1) and replace it with a query to the OpenSearch instance.
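To make the plan concrete, here is a minimal Python sketch of the two-step lookup with a pluggable step (1). It is not Scylla code: the tables are in-memory dicts, every name (`read_posting_list_from_index`, `fake_ann_search`, `select_via_keys`) is hypothetical, and the brute-force distance search only stands in for a real OpenSearch query.

```python
from typing import Callable, Dict, List, Tuple

Key = Tuple[int, ...]      # full primary key of a base-table row
Row = Dict[str, object]


def read_posting_list_from_index(index_table: Dict[object, List[Key]],
                                 value: object) -> List[Key]:
    """Step (1) today: read the index partition for the indexed value."""
    return list(index_table.get(value, []))


def fake_ann_search(vectors: Dict[Key, List[float]],
                    query: List[float], k: int) -> List[Key]:
    """Hypothetical replacement for step (1): stand-in for an OpenSearch
    ANN query, done here as an exact brute-force search by squared
    Euclidean distance."""
    def dist(key: Key) -> float:
        return sum((a - b) ** 2 for a, b in zip(vectors[key], query))
    return sorted(vectors, key=dist)[:k]


def select_via_keys(base_table: Dict[Key, Row],
                    fetch_keys: Callable[[], List[Key]]) -> List[Row]:
    """Step (2): fetch each base-table row by the key produced in step (1)."""
    rows = []
    for key in fetch_keys():
        row = base_table.get(key)
        if row is not None:   # the index entry may be stale
            rows.append(row)
    return rows


# Toy data, for illustration only.
base_table = {
    (1,): {"id": 1, "color": "red"},
    (2,): {"id": 2, "color": "blue"},
    (3,): {"id": 3, "color": "red"},
}
index_table = {"red": [(1,), (3,)], "blue": [(2,)]}
vectors = {(1,): [0.0, 0.0], (2,): [1.0, 1.0], (3,): [5.0, 5.0]}

by_index = select_via_keys(
    base_table, lambda: read_posting_list_from_index(index_table, "red"))
by_ann = select_via_keys(
    base_table, lambda: fake_ann_search(vectors, [0.9, 0.9], 2))
```

Step (2) is identical in both calls; only the source of the keys changes, which is the property the proposal relies on.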

In addition to this, the syntax for ANN queries needs to be implemented (ORDER BY column_of_vector_type ANN OF ...).
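For reference, the query shape described in the Cassandra documentation is `SELECT ... FROM <table> ORDER BY <vector_column> ANN OF [<floats>] LIMIT <n>`. The helper below is a hypothetical test-side convenience that only builds such a string; the table and column names are made up, and whether Scylla will accept exactly this form is still to be decided by the implementation.

```python
from typing import Sequence


def build_ann_query(table: str, column: str,
                    vector: Sequence[float], limit: int) -> str:
    """Build a CQL ANN query string in the Cassandra-style syntax.

    Hypothetical helper: it does no escaping or validation of identifiers.
    """
    literal = ", ".join(repr(float(x)) for x in vector)
    return (f"SELECT * FROM {table} "
            f"ORDER BY {column} ANN OF [{literal}] LIMIT {limit}")


query = build_ann_query("ks.items", "embedding", [0.1, 0.2, 0.3], 3)
```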

Tips

- The code responsible for implementing the two-step algorithm above is in the cql3/statements/select_statement.cc file, in the indexed_table_select_statement class, and the code responsible for step (1) seems to be in the read_posting_list method.
- Short note about the syntax of ANN queries: https://cassandra.apache.org/doc/latest/cassandra/vector-search/vector-search-working-with.html#query-vector-data-with-cql. For now, implementing the similarity_{dot_product,cosine,euclidean} functions can be skipped.
- For now, tests can assume that a vector index is already created in the OpenSearch cluster; the tests should pre-populate the index manually and only query data through Scylla.
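As a rough sketch of what the manual pre-population in tests might involve, the functions below build request bodies for the OpenSearch k-NN plugin: one to create an index with a `knn_vector` field and one for a k-NN search. The function names and the field name `embedding` are assumptions, as is how a document would carry the base-table key back to Scylla; no HTTP request is sent here.

```python
from typing import Dict, List


def knn_index_body(field: str, dimension: int) -> Dict[str, object]:
    """Body for creating an index with a k-NN vector field (PUT /<index>)."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                field: {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


def knn_query_body(field: str, vector: List[float], k: int) -> Dict[str, object]:
    """Body for a k-NN search (POST /<index>/_search)."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": vector, "k": k}}},
    }


index_body = knn_index_body("embedding", 3)
search_body = knn_query_body("embedding", [0.1, 0.2, 0.3], 2)
```

A test would send these bodies with whichever HTTP or OpenSearch client the test suite already uses, index a few documents, and then issue the ANN query through Scylla only.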
