SASI has been incorporated into Apache Cassandra trunk, updated doc is available as SASI.md
SSTableAttachedSecondaryIndex
,
or "SASI" for short, is an implementation of Cassandra's
SecondaryIndex
interface that can be used as an alternative to the
existing implementations. SASI's indexing and querying improves on
existing implementations by tailoring it specifically to Cassandra's
needs. SASI, efficiently, satisfies complex queries not previously
supported, including compound queries using OR
, and has superior
performance in cases where queries would previously require
filtering. In achieving this performance, SASI aims to be
significantly less resource intensive than existing implementations,
in memory, disk, and CPU usage.
The changes to Cassandra are available in two parts: the internal
patches that make minor changes to Cassandra, and the SASI components
that implement the SecondaryIndex
and accompanying interfaces, as
well as supporting classes. Builds for specific versions of Cassandra
including both change-sets are
available.
The following goes on describe how to get up and running with SASI, demonstrates usage with examples, and provides some details on its implementation.
Builds including SASI against 2.0.17 are available
here. The included
build-sasi.sh
script can also be run from the root of the
repository, to apply the patches and build the project locally.
The examples below walk through creating a table and indexes on its
columns, and performing queries on some inserted data. The patchset in
this repository includes support for the Thrift and CQL3 interfaces,
however, the CQL3 tables must be using
COMPACT STORAGE. Support
for CQL3 without COMPACT STORAGE
has been implemented but its not
available in this repository at this time.
The examples below assume the demo
keyspace has been created and is
in use.
cqlsh> CREATE KEYSPACE demo WITH replication = {
... 'class': 'SimpleStrategy',
... 'replication_factor': '1'
... };
cqlsh> USE demo;
All examples are performed on the sasi
table:
cqlsh:demo> CREATE TABLE sasi (id uuid, first_name text, last_name text,
... age int, height int, created_at bigint, primary key (id))
... WITH COMPACT STORAGE;
To create SASI indexes use CQLs CREATE CUSTOM INDEX
statement:
cqlsh:demo> CREATE CUSTOM INDEX ON sasi (first_name) USING 'org.apache.cassandra.db.index.SSTableAttachedSecondaryIndex'
... WITH OPTIONS = {
... 'analyzer_class':
... 'org.apache.cassandra.db.index.sasi.analyzer.NonTokenizingAnalyzer',
... 'case_sensitive': 'false'
... };
cqlsh:demo> CREATE CUSTOM INDEX ON sasi (last_name) USING 'org.apache.cassandra.db.index.SSTableAttachedSecondaryIndex'
... WITH OPTIONS = {'mode': 'SUFFIX'};
cqlsh:demo> CREATE CUSTOM INDEX ON sasi (age) USING 'org.apache.cassandra.db.index.SSTableAttachedSecondaryIndex';
cqlsh:demo> CREATE CUSTOM INDEX ON sasi (created_at) USING 'org.apache.cassandra.db.index.SSTableAttachedSecondaryIndex'
... WITH OPTIONS = {'mode': 'SPARSE'};
The indexes created have some options specified that customize their
behaviour and potentially performance. The index on first_name
is
case-insensitive. The analyzers are discussed more in a subsequent
example. The NonTokenizingAnalyzer
performs no analysis on the
text. Each index has a mode: NORMAL
, SUFFIX
, or SPARSE
, the
first being the default. The last_name
index is created with the
mode SUFFIX
which matches terms on suffixes instead of prefix
only. Examples of this are available below and more detail can be
found in the section on
OnDiskIndex.The
created_at
column is created with its mode set to SPARSE
, which is
meant to improve performance of querying large, dense number ranges
like timestamps for data inserted every millisecond. Details of the
SPARSE
implementation can also be found in the section on the
OnDiskIndex. The age
index is created with the default NORMAL
mode and no
case-sensitivity or text analysis options are specified since the
field is numeric.
After inserting the following data and performing a nodetool flush
,
SASI performing index flushes to disk can be seen in Cassandra's logs
-- although the direct call to flush is not required (see
IndexMemtable for more details).
cqlsh:demo> INSERT INTO sasi (id, first_name, last_name, age, height, created_at)
... VALUES (556ebd54-cbe5-4b75-9aae-bf2a31a24500, 'Pavel', 'Yaskevich', 27, 181, 1442959315018);
cqlsh:demo> INSERT INTO sasi (id, first_name, last_name, age, height, created_at)
... VALUES (5770382a-c56f-4f3f-b755-450e24d55217, 'Jordan', 'West', 26, 173, 1442959315019);
cqlsh:demo> INSERT INTO sasi (id, first_name, last_name, age, height, created_at)
... VALUES (96053844-45c3-4f15-b1b7-b02c441d3ee1, 'Mikhail', 'Stepura', 36, 173, 1442959315020);
cqlsh:demo> INSERT INTO sasi (id, first_name, last_name, age, height, created_at)
... VALUES (f5dfcabe-de96-4148-9b80-a1c41ed276b4, 'Michael', 'Kjellman', 26, 180, 1442959315021);
cqlsh:demo> INSERT INTO sasi (id, first_name, last_name, age, height, created_at)
... VALUES (2970da43-e070-41a8-8bcb-35df7a0e608a, 'Johnny', 'Zhang', 32, 175, 1442959315022);
cqlsh:demo> INSERT INTO sasi (id, first_name, last_name, age, height, created_at)
... VALUES (6b757016-631d-4fdb-ac62-40b127ccfbc7, 'Jason', 'Brown', 40, 182, 1442959315023);
cqlsh:demo> INSERT INTO sasi (id, first_name, last_name, age, height, created_at)
... VALUES (8f909e8a-008e-49dd-8d43-1b0df348ed44, 'Vijay', 'Parthasarathy', 34, 183, 1442959315024);
cqlsh:demo> SELECT first_name, last_name, age, height, created_at FROM sasi;
first_name | last_name | age | height | created_at
------------+---------------+-----+--------+---------------
Michael | Kjellman | 26 | 180 | 1442959315021
Mikhail | Stepura | 36 | 173 | 1442959315020
Jason | Brown | 40 | 182 | 1442959315023
Pavel | Yaskevich | 27 | 181 | 1442959315018
Vijay | Parthasarathy | 34 | 183 | 1442959315024
Jordan | West | 26 | 173 | 1442959315019
Johnny | Zhang | 32 | 175 | 1442959315022
(7 rows)
SASI supports simple queries already supported by CQL, however, for
text fields like first_name
equals queries perform prefix searches
-- this is similar to first_name LIKE 'M*'
in SQL (excluding case
sensitivity, which is dependent on the index configuration). The
semantics of CQL's =
were modified instead of making further
modifications of the grammar with the introduction of a LIKE
operator. Ideally, CQL would be modified to include such an operator,
supporting both prefix and suffix searches.
cqlsh:demo> SELECT first_name, last_name, age, height, created_at FROM sasi
... WHERE first_name = 'M';
first_name | last_name | age | height | created_at
------------+-----------+-----+--------+---------------
Michael | Kjellman | 26 | 180 | 1442959315021
Mikhail | Stepura | 36 | 173 | 1442959315020
(2 rows)
Of course, the case of the query does not matter for the first_name
column because of the options provided at index creation time.
cqlsh:demo> SELECT first_name, last_name, age, height, created_at FROM sasi
... WHERE first_name = 'm';
first_name | last_name | age | height | created_at
------------+-----------+-----+--------+---------------
Michael | Kjellman | 26 | 180 | 1442959315021
Mikhail | Stepura | 36 | 173 | 1442959315020
(2 rows)
SASI supports queries with multiple predicates, however, due to the
nature of the default indexing implementation, CQL requires the user
to specify ALLOW FILTERING
to opt-in to the potential performance
pitfalls of such a query. With SASI, while the requirement to include
ALLOW FILTERING
remains, to reduce modifications to the grammar, the
performance pitfalls do not exist because filtering is not
performed. Details on how SASI joins data from multiple predicates is
available below in the
Implementation Details
section.
cqlsh:demo> SELECT first_name, last_name, age, height, created_at FROM sasi
... WHERE first_name = 'M' and age < 30 ALLOW FILTERING;
first_name | last_name | age | height | created_at
------------+-----------+-----+--------+---------------
Michael | Kjellman | 26 | 180 | 1442959315021
(1 rows)
SASI also supports multiple predicates joined by OR
instead of
AND
. This is not currently supported by the existing Cassandra
indexing implementation. ALLOW FILTERING
again must be specified but
like the query above the performance concern, due to filtering, is not present.
cqlsh:demo> SELECT * FROM sasi
... WHERE first_name = 'P' OR first_name = 'j' OR first_name = 'M' ALLOW FILTERING;
id | age | created_at | first_name | height | last_name
--------------------------------------+-----+---------------+------------+--------+-----------
f5dfcabe-de96-4148-9b80-a1c41ed276b4 | 26 | 1442959315021 | Michael | 180 | Kjellman
96053844-45c3-4f15-b1b7-b02c441d3ee1 | 36 | 1442959315020 | Mikhail | 173 | Stepura
6b757016-631d-4fdb-ac62-40b127ccfbc7 | 40 | 1442959315023 | Jason | 182 | Brown
556ebd54-cbe5-4b75-9aae-bf2a31a24500 | 27 | 1442959315018 | Pavel | 181 | Yaskevich
5770382a-c56f-4f3f-b755-450e24d55217 | 26 | 1442959315019 | Jordan | 173 | West
2970da43-e070-41a8-8bcb-35df7a0e608a | 32 | 1442959315022 | Johnny | 175 | Zhang
(6 rows)
SASI also adds support for not equals, as demonstrated by these queries:
cqlsh:demo> SELECT * FROM sasi WHERE age > 30 AND first_name = 'j' ALLOW FILTERING;
id | age | created_at | first_name | height | last_name
-------------------------------------+-----+---------------+------------+--------+-----------
6b757016-631d-4fdb-ac62-40b127ccfbc7 | 40 | 1442959315023 | Jason | 182 | Brown
2970da43-e070-41a8-8bcb-35df7a0e608a | 32 | 1442959315022 | Johnny | 175 | Zhang
(2 rows)
cqlsh:demo> SELECT * FROM sasi WHERE age > 30 AND first_name = 'j' AND age != 32 ALLOW FILTERING;
id | age | created_at | first_name | height | last_name
-------------------------------------+-----+---------------+------------+--------+-----------
6b757016-631d-4fdb-ac62-40b127ccfbc7 | 40 | 1442959315023 | Jason | 182 | Brown
(1 rows)
How the not equals check is performed is dependent on the other expressions included in the query. If there is another expression for the same column as the one the not equals is specified on, then the data is filtered out during the search of the index. However, if the not equals is the only expression for that column, the predicate is applied to rows as they are read off disk.
Additionally, with the introduction of OR
comes the need to be able
to specify groupings of predicates using parentheses. The queries
below demonstrate the difference in semantics when using
parentheses. In the first query the row whose first_name
equals
"Pavel" is filtered out because of the created_at
predicate. By
using parentheses in the second query, to indicate the correct
intention, the row is included in the result set.
cqlsh:demo> SELECT * FROM sasi WHERE age > 26 OR first_name = 'Pavel' AND created_at > 1442959315018 ALLOW FILTERING;
id | age | created_at | first_name | height | last_name
--------------------------------------+-----+---------------+------------+--------+---------------
96053844-45c3-4f15-b1b7-b02c441d3ee1 | 36 | 1442959315020 | Mikhail | 173 | Stepura
6b757016-631d-4fdb-ac62-40b127ccfbc7 | 40 | 1442959315023 | Jason | 182 | Brown
8f909e8a-008e-49dd-8d43-1b0df348ed44 | 34 | 1442959315024 | Vijay | 183 | Parthasarathy
2970da43-e070-41a8-8bcb-35df7a0e608a | 32 | 1442959315022 | Johnny | 175 | Zhang
(4 rows)
cqlsh:demo> SELECT * FROM sasi WHERE (created_at > 1442959315018 OR first_name = 'P') AND age > 26 ALLOW FILTERING;
id | age | created_at | first_name | height | last_name
--------------------------------------+-----+---------------+------------+--------+---------------
96053844-45c3-4f15-b1b7-b02c441d3ee1 | 36 | 1442959315020 | Mikhail | 173 | Stepura
6b757016-631d-4fdb-ac62-40b127ccfbc7 | 40 | 1442959315023 | Jason | 182 | Brown
556ebd54-cbe5-4b75-9aae-bf2a31a24500 | 27 | 1442959315018 | Pavel | 181 | Yaskevich
8f909e8a-008e-49dd-8d43-1b0df348ed44 | 34 | 1442959315024 | Vijay | 183 | Parthasarathy
2970da43-e070-41a8-8bcb-35df7a0e608a | 32 | 1442959315022 | Johnny | 175 | Zhang
(5 rows)
As expected, parentheses can be used as needed and may contain sub-expressions also including parentheses.
The next example demonstrates SUFFIX
mode on the last_name
column. By using this mode predicates can search for any strings
containing the search string as a sub-string. In this case the strings
containing "a" or "an".
cqlsh:demo> SELECT * FROM sasi WHERE last_name = 'a';
id | age | created_at | first_name | height | last_name
--------------------------------------+-----+---------------+------------+--------+---------------
f5dfcabe-de96-4148-9b80-a1c41ed276b4 | 26 | 1442959315021 | Michael | 180 | Kjellman
96053844-45c3-4f15-b1b7-b02c441d3ee1 | 36 | 1442959315020 | Mikhail | 173 | Stepura
556ebd54-cbe5-4b75-9aae-bf2a31a24500 | 27 | 1442959315018 | Pavel | 181 | Yaskevich
8f909e8a-008e-49dd-8d43-1b0df348ed44 | 34 | 1442959315024 | Vijay | 183 | Parthasarathy
2970da43-e070-41a8-8bcb-35df7a0e608a | 32 | 1442959315022 | Johnny | 175 | Zhang
(5 rows)
cqlsh:demo> SELECT * FROM sasi WHERE last_name = 'an';
id | age | created_at | first_name | height | last_name
--------------------------------------+-----+---------------+------------+--------+-----------
f5dfcabe-de96-4148-9b80-a1c41ed276b4 | 26 | 1442959315021 | Michael | 180 | Kjellman
2970da43-e070-41a8-8bcb-35df7a0e608a | 32 | 1442959315022 | Johnny | 175 | Zhang
(2 rows)
SASI also supports filtering on non-indexed columns like height
. The
expression can only narrow down an existing query using AND
.
cqlsh:demo> SELECT * FROM sasi WHERE last_name = 'a' AND height >= 175 ALLOW FILTERING;
id | age | created_at | first_name | height | last_name
--------------------------------------+-----+---------------+------------+--------+---------------
f5dfcabe-de96-4148-9b80-a1c41ed276b4 | 26 | 1442959315021 | Michael | 180 | Kjellman
556ebd54-cbe5-4b75-9aae-bf2a31a24500 | 27 | 1442959315018 | Pavel | 181 | Yaskevich
8f909e8a-008e-49dd-8d43-1b0df348ed44 | 34 | 1442959315024 | Vijay | 183 | Parthasarathy
2970da43-e070-41a8-8bcb-35df7a0e608a | 32 | 1442959315022 | Johnny | 175 | Zhang
(4 rows)
Lastly, to demonstrate text analysis an additional column is needed on the table. Its definition, index, and statements to update rows are shown below.
cqlsh:demo> ALTER TABLE sasi ADD bio text;
cqlsh:demo> CREATE CUSTOM INDEX ON sasi (bio) USING 'org.apache.cassandra.db.index.SSTableAttachedSecondaryIndex'
... WITH OPTIONS = {
... 'analyzer_class': 'org.apache.cassandra.db.index.sasi.analyzer.StandardAnalyzer',
... 'tokenization_enable_stemming': 'true',
... 'analyzed': 'true',
... 'tokenization_normalize_lowercase': 'true',
... 'tokenization_locale': 'en'
... };
cqlsh:demo> UPDATE sasi SET bio = 'Software Engineer, who likes distributed systems, doesnt like to argue.' WHERE id = 5770382a-c56f-4f3f-b755-450e24d55217;
cqlsh:demo> UPDATE sasi SET bio = 'Software Engineer, works on the freight distribution at nights and likes arguing' WHERE id = 556ebd54-cbe5-4b75-9aae-bf2a31a24500;
cqlsh:demo> SELECT * FROM sasi;
id | age | bio | created_at | first_name | height | last_name
--------------------------------------+-----+----------------------------------------------------------------------------------+---------------+------------+--------+---------------
f5dfcabe-de96-4148-9b80-a1c41ed276b4 | 26 | null | 1442959315021 | Michael | 180 | Kjellman
96053844-45c3-4f15-b1b7-b02c441d3ee1 | 36 | null | 1442959315020 | Mikhail | 173 | Stepura
6b757016-631d-4fdb-ac62-40b127ccfbc7 | 40 | null | 1442959315023 | Jason | 182 | Brown
556ebd54-cbe5-4b75-9aae-bf2a31a24500 | 27 | Software Engineer, works on the freight distribution at nights and likes arguing | 1442959315018 | Pavel | 181 | Yaskevich
8f909e8a-008e-49dd-8d43-1b0df348ed44 | 34 | null | 1442959315024 | Vijay | 183 | Parthasarathy
5770382a-c56f-4f3f-b755-450e24d55217 | 26 | Software Engineer, who likes distributed systems, doesnt like to argue. | 1442959315019 | Jordan | 173 | West
2970da43-e070-41a8-8bcb-35df7a0e608a | 32 | null | 1442959315022 | Johnny | 175 | Zhang
(7 rows)
Index terms and query search strings are stemmed for the bio
column
because it was configured to use the
StandardAnalyzer
and analyzed
is set to true
. The
tokenization_normalize_lowercase
is similar to the case_sensitive
property but for the
StandardAnalyzer
. These
query demonstrates the stemming applied by StandardAnalyzer
.
cqlsh:demo> SELECT * FROM sasi WHERE bio = 'distributing';
id | age | bio | created_at | first_name | height | last_name
--------------------------------------+-----+----------------------------------------------------------------------------------+---------------+------------+--------+-----------
556ebd54-cbe5-4b75-9aae-bf2a31a24500 | 27 | Software Engineer, works on the freight distribution at nights and likes arguing | 1442959315018 | Pavel | 181 | Yaskevich
5770382a-c56f-4f3f-b755-450e24d55217 | 26 | Software Engineer, who likes distributed systems, doesnt like to argue. | 1442959315019 | Jordan | 173 | West
(2 rows)
cqlsh:demo> SELECT * FROM sasi WHERE bio = 'they argued';
id | age | bio | created_at | first_name | height | last_name
--------------------------------------+-----+----------------------------------------------------------------------------------+---------------+------------+--------+-----------
556ebd54-cbe5-4b75-9aae-bf2a31a24500 | 27 | Software Engineer, works on the freight distribution at nights and likes arguing | 1442959315018 | Pavel | 181 | Yaskevich
5770382a-c56f-4f3f-b755-450e24d55217 | 26 | Software Engineer, who likes distributed systems, doesnt like to argue. | 1442959315019 | Jordan | 173 | West
(2 rows)
cqlsh:demo> SELECT * FROM sasi WHERE bio = 'working at the company';
id | age | bio | created_at | first_name | height | last_name
--------------------------------------+-----+----------------------------------------------------------------------------------+---------------+------------+--------+-----------
556ebd54-cbe5-4b75-9aae-bf2a31a24500 | 27 | Software Engineer, works on the freight distribution at nights and likes arguing | 1442959315018 | Pavel | 181 | Yaskevich
(1 rows)
cqlsh:demo> SELECT * FROM sasi WHERE bio = 'soft eng';
id | age | bio | created_at | first_name | height | last_name
--------------------------------------+-----+----------------------------------------------------------------------------------+---------------+------------+--------+-----------
556ebd54-cbe5-4b75-9aae-bf2a31a24500 | 27 | Software Engineer, works on the freight distribution at nights and likes arguing | 1442959315018 | Pavel | 181 | Yaskevich
5770382a-c56f-4f3f-b755-450e24d55217 | 26 | Software Engineer, who likes distributed systems, doesnt like to argue. | 1442959315019 | Jordan | 173 | West
(2 rows)
While SASI, at the surface, is simply an implementation of the
SecondaryIndex
interface, at its core there are several data
structures and algorithms used to satisfy it. These are described
here. Additionally, the changes internal to Cassandra to support SASIs
integration are described.
The SecondaryIndex
interface divides responsibility of the
implementer into two parts: Indexing and Querying. Further, Cassandra
makes it possible to divide those responsibilities into the memory and
disk components. SASI takes advantage of Cassandra's write-once,
immutable, ordered data model to build indexes along with the flushing
of the memtable to disk -- this is the origin of the name "SSTable
Attached Secondary Index".
The SASI index data structures are built in memory as the SSTable is being written and they are flushed to disk before the writing of the SSTable completes. The writing of each index file only requires sequential writes to disk. In some cases, partial flushes are performed, and later stitched back together, to reduce memory usage. These data structures are optimized for this use case.
Taking advantage of Cassandra's ordered data model, at query time, candidate indexes are narrowed down for searching minimize the amount of work done. Searching is then performed using an efficient method that streams data off disk as needed.
Per SSTable, SASI writes an index file for each indexed column. The
data for these files is built in memory using the
OnDiskIndexBuilder
. Once
flushed to disk, the data is read using the
OnDiskIndex
class. These are composed of bytes representing indexed terms,
organized for efficient writing or searching respectively. The keys
and values they hold represent tokens and positions in an SSTable and
these are stored per-indexed term in
TokenTreeBuilder
s
for writing, and
TokenTree
s
for querying. These index files are memory mapped after being written
to disk, for quicker access. For indexing data in the memtable SASI
uses its
IndexMemtable
class.
Each
OnDiskIndex
is an instance of a modified
Suffix Array data
structure. The
OnDiskIndex
is comprised of page-size blocks of sorted terms and pointers to the
terms' associated data, as well as the data itself, stored also in one
or more page-sized blocks. The
OnDiskIndex
is structured as a tree of arrays, where each level describes the
terms in the level below, the final level being the terms
themselves. The PointerLevel
s and their PointerBlock
s contain
terms and pointers to other blocks that end with those terms. The
DataLevel
, the final level, and its DataBlock
s contain terms and
point to the data itself, contained in TokenTree
s.
The terms written to the
OnDiskIndex
vary depending on its "mode": either NORMAL
, SUFFIX
, or
SPARSE
. In the NORMAL
and SPARSE
cases terms exact values are
written exactly once per OnDiskIndex
. For example, a NORMAL
index
with terms Jason
, Jordan
, Pavel
, all three will be included in
the index. A SUFFIX
index writes additional terms for each suffix of
each term recursively. Continuing with the example, a SUFFIX
index
storing the previous terms would also store ason
, ordan
, avel
,
son
, rdan
, vel
, etc. This allows for queries on the suffix of
strings. The SPARSE
mode differs from NORMAL
in that for every 64
blocks of terms a
TokenTree
is built merging all the
TokenTree
s
for each term into a single one. This copy of the data is used for
efficient iteration of large ranges of e.g. timestamps. The index
"mode" is configurable per column at index creation time.
The
TokenTree
is an implementation of the well-known
B+-tree that has been
modified to optimize for its use-case. In particular, it has been
optimized to associate tokens, longs, with a set of positions in an
SSTable, also longs. Allowing the set of long values accommodates
the possibility of a hash collision in the token, but the data
structure is optimized for the unlikely possibility of such a
collision.
To optimize for its write-once environment the
TokenTreeBuilder
completely loads its interior nodes as the tree is built and it uses
the well-known algorithm optimized for bulk-loading the data
structure.
TokenTree
s provide the means to iterate a tokens, and file
positions, that match a given term, and to skip forward in that
iteration, an operation used heavily at query time.
The
IndexMemtable
handles indexing the in-memory data held in the memtable. The
IndexMemtable
in turn manages either a
TrieMemIndex
or a
SkipListMemIndex
per-column. The choice of which index type is used is data
dependent. The
TrieMemIndex
is used for literal types. AsciiType
and UTF8Type
are literal
types by defualt but any column can be configured as a literal type
using the is_literal
option at index creation time. For non-literal
types the
SkipListMemIndex
is used. The
TrieMemIndex
is an implementation that can efficiently support prefix queries on
character-like data. The
SkipListMemIndex
,
conversely, is better suited for Cassandra other data types like
numbers.
The
TrieMemIndex
is built using either the ConcurrentRadixTree
or
ConcurrentSuffixTree
from the com.goooglecode.concurrenttrees
package. The choice between the two is made based on the indexing
mode, NORMAL
or other modes, and SUFFIX
mode, respectively.
The
SkipListMemIndex
is built on top of java.util.concurrent.ConcurrentSkipListSet
.
Responsible for converting the internal IndexExpression
representation into SASI's
Operation
and
Expression
tree, optimizing the tree to reduce the amount of work done, and
driving the query itself the
QueryPlan
is the work horse of SASI's querying implementation. To efficiently
perform union and intersection operations SASI provides several
iterators similar to Cassandra's MergeIterator
but tailored
specifically for SASIs use, and with more features. The
RangeUnionIterator
,
like its name suggests, performs set union over sets of tokens/keys
matching the query, only reading as much data as it needs from each
set to satisfy the query. The
RangeIntersectionIterator
,
similar to its counterpart, performs set intersection over its data.
The
QueryPlan
instantiated per search query is at the core of SASIs querying
implementation. Its work can be divided in two stages: analysis and
execution.
During the analysis phase,
QueryPlan
converts from Cassandra's internal representation of
IndexExpression
s, which has also been modified to support encoding
queries that contain ORs and groupings of expressions using
parentheses (see the
Cassandra Internal Changes
section below for more details). This process produces a tree of
Operation
s, which in turn may contain Expression
s, all of which
provide an alternative, more efficient, representation of the query.
During execution the
QueryPlan
uses the DecoratedKey
-generating iterator created from the
Operation
tree. These keys are read from disk and a final check to
ensure they satisfy the query is made, once again using the
Operation
tree. At the point the desired amount of matching data has
been found, or there is no more matching data, the result set is
returned to the coordinator through the existing internal components.
The number of queries (total/failed/timed-out), and their latencies, are maintined per-table/column family.
SASI also supports concurrently iterating terms for the same index
accross SSTables. The concurrency factor is controlled by the
cassandra.search_concurrency_factor
system property. The default is
1
.
Each
QueryPlan
references a
QueryController
used throughout the execution phase. The
QueryController
has two responsibilities: to manage and ensure the proper cleanup of
resources (indexes), and to strictly enforce the time bound for query,
specified by the user via the range slice timeout. All indexes are
accessed via the
QueryController
so that they can be safely released by it later. The
QueryController
's
checkpoint
function is called in specific places in the execution
path to ensure the time-bound is enforced.
While in the analysis phase, the
QueryPlan
performs several potential optimizations to the query. The goal of
these optimizations is to reduce the amount of work performed during
the execution phase.
The simplest optimization performed is compacting multiple expressions
joined by logical intersection (AND
) into a single Operation
with
three or more Expression
s. For example, the query WHERE age < 100 AND fname = 'p*' AND first_name != 'pa*' AND age > 21
would,
without modification, have the following tree:
┌───────┐
┌────────│ AND │──────┐
│ └───────┘ │
▼ ▼
┌───────┐ ┌──────────┐
┌─────│ AND │─────┐ │age < 100 │
│ └───────┘ │ └──────────┘
▼ ▼
┌──────────┐ ┌───────┐
│ fname=p* │ ┌─│ AND │───┐
└──────────┘ │ └───────┘ │
▼ ▼
┌──────────┐ ┌──────────┐
│fname!=pa*│ │ age > 21 │
└──────────┘ └──────────┘
QueryPlan
will remove the redundant right branch whose root is the final AND
and has leaves fname != pa*
and age > 21
. These Expression
s will
be compacted into the parent AND
, a safe operation due to AND
being associative and commutative. The resulting tree looks like the
following:
┌───────┐
┌────────│ AND │──────┐
│ └───────┘ │
▼ ▼
┌───────┐ ┌──────────┐
┌───────────│ AND │────────┐ │age < 100 │
│ └───────┘ │ └──────────┘
▼ │ ▼
┌──────────┐ │ ┌──────────┐
│ fname=p* │ ▼ │ age > 21 │
└──────────┘ ┌──────────┐ └──────────┘
│fname!=pa*│
└──────────┘
When excluding results from the result set, using !=
, the
QueryPlan
determines the best method for handling it. For range queries, for
example, it may be optimal to divide the range into multiple parts
with a hole for the exclusion. For string queries, such as this one,
it is more optimal, however, to simply note which data to skip, or
exclude, while scanning the index. Following this optimization the
tree looks like this:
┌───────┐
┌────────│ AND │──────┐
│ └───────┘ │
▼ ▼
┌───────┐ ┌──────────┐
┌───────│ AND │────────┐ │age < 100 │
│ └───────┘ │ └──────────┘
▼ ▼
┌──────────────────┐ ┌──────────┐
│ fname=p* │ │ age > 21 │
│ exclusions=[pa*] │ └──────────┘
└──────────────────┘
The last type of optimization applied, for this query, is to merge
range expressions across branches of the tree -- without modifying the
meaning of the query, of course. In this case, because the query
contains all AND
s the age
expressions can be collapsed. Along with
this optimization, the initial collapsing of unneeded AND
s can also
be applied once more to result in this final tree using to execute the
query:
┌───────┐
┌──────│ AND │───────┐
│ └───────┘ │
▼ ▼
┌──────────────────┐ ┌────────────────┐
│ fname=p* │ │ 21 < age < 100 │
│ exclusions=[pa*] │ └────────────────┘
└──────────────────┘
As discussed, the
QueryPlan
optimizes a tree represented by
Operation
s
as interior nodes, and
Expression
s
as leaves. The
Operation
class, more specifically, can have zero, one, or two
Operation
s
as children and an unlimited number of expressions. The iterators used
to perform the queries, discussed below in the
"Range(Union|Intersection)Iterator" section, implement the necessary
logic to merge results transparently regardless of the
Operation
s
children.
Besides participating in the optimizations performed by the
QueryPlan
,
Operation
is also responsible for taking a row that has been returned by the
query and making a final validation that it in fact does match. This
satisfiesBy
operation is performed recursively from the root of the
Operation
tree for a given query. These checks are performed directly on the
data in a given row. For more details on how satisfiesBy
works see
the documentation
in the code.
The abstract RangeIterator
class provides a unified interface over
the two main operations performed by SASI at various layers in the
execution path: set intersection and union. These operations are
performed in a iterated, or "streaming", fashion to prevent unneeded
reads of elements from either set. In both the intersection and union
cases the algorithms take advantage of the data being pre-sorted using
the same sort order, e.g. term or token order.
The
RangeUnionIterator
performs the "Merge-Join" portion of the
Sort-Merge-Join
algorithm, with the properties of an outer-join, or union. It is
implemented with several optimizations to improve its performance over
a large number of iterators -- sets to union. Specifically, the
iterator exploits the likely case of the data having many sub-groups
of overlapping ranges and the unlikely case that all ranges will
overlap each other. For more details see the
javadoc.
The
RangeIntersectionIterator
itself is not a subclass of RangeIterator
. It is a container for
several classes, one of which, AbstractIntersectionIterator
,
sub-classes RangeIterator
. SASI supports two methods of performing
the intersection operation, and the ability to be adaptive in choosing
between them based on some properties of the data.
BounceIntersectionIterator
, and the BOUNCE
strategy, works like
the
RangeUnionIterator
in that it performs a "Merge-Join", however, its nature is similar to
a inner-join, where like values are merged by a data-specific merge
function (e.g. merging two tokens in a list to lookup in a SSTable
later). See the
javadoc
for more details on its implementation.
LookupIntersectionIterator
, and the LOOKUP
strategy, performs a
different operation, more similar to a lookup in an associative data
structure, or "hash lookup" in database terminology. Once again,
details on the implementation can be found in the
javadoc.
The choice between the two iterators, or the ADAPTIVE
strategy, is
based upon the ratio of data set sizes of the minimum and maximum
range of the sets being intersected. If the number of the elements in
minimum range divided by the number of elements is the maximum range
is less than or equal to 0.01
, then the ADAPTIVE
strategy chooses
the LookupIntersectionIterator
, otherwise the
BounceIntersectionIterator
is chosen.
The above components are glued together by the
SSTableAttachedSecondaryIndex
class which sub-classes PerRowSecondaryIndex
, and is instantiated
per-table containing SASI indexes. It manages all indexes for a table
via the
sasi.conf.DataTracker
and
sasi.conf.view.View
components, controls writing of all indexes for an SSTable via its
PerSSTableIndexWriter
, and initiates searches with
LocalSecondaryIndexSearcher
. These classes glue the previously
mentioned indexing components together with Cassandra's SSTable
life-cycle ensuring indexes are not only written when Memtable's flush
but also as SSTable's are compacted. For querying, the
LocalSecondaryIndexSearcher
does little but defer to
QueryPlan
and update e.g. latency metrics exposed by SASI.
To support the above changes and integrate them into Cassandra a few minor internal changes were made to Cassandra itself. These are described here.
The SSTableWriterListener
is an observer pattern-like interface,
whose sub-classes can register to be notified about events in the
life-cycle of writing out a SSTable. Sub-classes can be notified when a
flush begins and ends, as well as when each next row is about to be
written, and each next column. SASI's PerSSTableIndexWriter
,
discussed above, is the only current subclass.
Changes to both the thrift and CQL interfaces were necessary to add
support for "OR" and "does not equal", and in CQL's case to support
parentheses to group expressions in the WHERE
clause.
For thrift, simple changes were made to the Thirft definitions to add
NOT_EQ
to IndexOperator
and LogicalIndexOperator
was introduced
to encode AND
/OR
.
For CQL, changes were made to both the ANTLR grammar and
SelectStatement
, to add support for OR
, !=
, and parentheses. The
grammar was modified to produce a tree instead of a
list. SelectStatement
s modifications allow it to flatten this
tree, using a pre-order tree traversal, so that no modifications were
necessary to the messaging service.
The following are items that can be addressed in future updates but are not available in this repository or are not currently implemented.
- The cluster must be configured to use a partitioner that produces
LongToken
s, e.g.Murmur3Partitioner
. Other existing partitioners which don't produce LongToken e.g.ByteOrderedPartitioner
andRandomPartitioner
will not work with SASI. - CQL3 tables are only supported if created with
COMPACT STORAGE
or from the CQL2/Thrift interfaces. - SASI indexes must be the only indexes defined for a table
- Only Cassandra 2.0.x is supported
- CQL3
ALLOW FILTERING
, requirement of at least one indexes=
expression, and lack ofLIKE
limit SASIs feature-set. Modifications to the grammar to allowSecondaryIndex
implementations to enumerate its supported features would allow SASI to expose more features without need to support them in other implementations.
- CQL3 & Cassandra 3.0 Support
- Complete Property Based Tests
- Modify CQL3 Grammar to better support SASI feature set
- Extend the 3.0
SecondaryIndex
interface with SASIs iterators, so results can be streamed from the data node to the coordinator.
Also, see the current Issues