>>>markdown
# Malloy Dimensional Indexes
Malloy has a special query operator that builds _Dimensional Search Indexes_ for sources. A Dimensional Search Index is a table with four columns. Dimensional indexes are useful for a variety of things, including filter suggestions and LLMs.
When filtering data, you might know a term, but not necessarily which column in the source, or in one of its joins, contains it. Indexing the data on field names and high-cardinality fields lets you quickly find the term and the column it appears in.
Indexing could also be used by LLMs to find interesting column/term mappings in the dataset.
* *fieldName* - The path to the column in the source
* *fieldValue* - The dimensional value for the field (or a range, if fieldType is not a string)
* *fieldType* - The type of the column
* *weight* - A weighting indicating the importance of the distinct dimensional value. Defaults to the number of rows on which the value occurs.
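The four columns can be referenced by name in a later query stage. A minimal, illustrative sketch using the airports table from the examples below:
>>>malloy
#(docs) size=medium limit=100
run: duckdb.table('../data/airports.parquet') -> {
  index: *
} -> {
  // each index row is a fieldName/fieldValue/fieldType/weight tuple
  select: fieldName, fieldValue, fieldType, weight
  order_by: weight desc
  limit: 5
}
>>>markdown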
## Simple Example
We're going to take the airports table and index it. The result is an unordered list of the distinct *fieldName/fieldValue* pairs that appear in the table. The weight, in this case, is the number of rows on which that particular value occurs.
>>>malloy
#(docs) size=medium limit=100
run: duckdb.table('../data/airports.parquet') -> {
index: *
}
>>>markdown
## Add Ordering
Adding a second query stage that filters to _string_ columns and orders by weight descending shows us the most common *fieldName/fieldValue* pairs in the dataset.
All Malloy queries run as a single SQL query. The `index:` operator is no different. Click the **SQL** tab to see how this works.
>>>malloy
#(docs) size=medium limit=100
run: duckdb.table('../data/airports.parquet') -> {
index: *
} -> {
where: fieldType = 'string'
select: *
order_by: weight desc
}
>>>markdown
## Index For Filtering User Interfaces
Indexes can be used to find the best way to filter a dataset. For example, suppose we'd like to find 'SANTA CRUZ' in the dataset, but we don't know which column might contain it. In a UI, you might imagine typing 'SANTA' and getting suggestions for values that might be appropriate. In the results we can see that the top value, 'SANTA ROSA', appears as a `county` on 26 rows in the table. We can also see that 'SANTA CRUZ' is both a `city` and a `county`.
>>>malloy
#(docs) size=medium limit=100
run: duckdb.table('../data/airports.parquet') -> {
index: *
} -> {
where: fieldValue ~ r'SANTA'
select: *
order_by: weight desc
limit: 15
}
>>>markdown
We can then write a simple query to show the rows. It turns out that 'SANTA CRUZ' is a county in both California and Arizona.
>>>malloy
#(docs) size=medium limit=100
run: duckdb.table('../data/airports.parquet') -> {
where: county ~ 'SANTA CRUZ'
select: *
}
>>>markdown
## Indexing to show top values for each dimension
It is often difficult to approach a new dataset. The index operator provides an interesting way to quickly gain an understanding of it. By piping the results of an index into another stage, we can quickly see the top values for each of the interesting dimensions. Again, the weight shows the number of rows for that particular dimension/value pair.
>>>malloy
#(docs) size=medium limit=100
run: duckdb.table('../data/airports.parquet') -> {
index: *
} -> {
group_by: fieldName
nest: values is {
group_by: fieldValue, weight
order_by: weight desc
limit: 10
}
order_by: fieldName
}
>>>markdown
## Sampling
With large datasets, you can also sample a small subsection using the `sample:` parameter. Sampled indexes are great at identifying the important low-cardinality fields.
>>>malloy
#(docs) size=medium limit=100
run: duckdb.table('../data/airports.parquet') -> {
index: *
sample: 5000 // sample only 5000 rows
} -> {
group_by: fieldName
nest: values is {
group_by: fieldValue, weight
order_by: weight desc
limit: 10
}
order_by: fieldName
}
>>>markdown
<img src="./imdb_schema.png" style="width: 230px; float: right; padding: 3px 3px 3px 3px">
## A More Complex Example
The rest of this page uses the model below. The data is an excerpt from IMDB. The Malloy schema for this model is shown on the right. The core source is movies, joined to principals (the people who worked on each movie) and people (the data about the individuals themselves).
We use the measure `total_ratings` to determine a movie's popularity. An individual's popularity is determined by the sum of the ratings of all the movies that person has worked on.
>>>malloy
source: movies is duckdb.table('../imdb/data/titles.parquet') extend {
join_many: principals is duckdb.table('../imdb/data/principals.parquet') extend {
join_one: people is duckdb.table('../imdb/data/names.parquet')
on nconst = people.nconst
} on tconst = principals.tconst
measure: total_ratings is numVotes.sum()
}
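>>>markdown
As a quick sanity check of the model, we can rank movies by `total_ratings`. This is a minimal sketch; it assumes the titles table has a `primaryTitle` column for the movie title:
>>>malloy
#(docs) size=small limit=200
run: movies -> {
  // primaryTitle is assumed to be the title column in titles.parquet
  group_by: primaryTitle
  aggregate: total_ratings
  limit: 10
}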
>>>markdown
## Weights can be any measure
Often a row count works nicely as a weight, but sometimes there is something better. For movies, the sum of the number of votes will surface not only the most interesting movies but also the most interesting people. For example, the most interesting people in the dataset:
>>>malloy
#(docs) size=small limit=200
run: movies -> {
group_by: principals.people.primaryName
aggregate: total_ratings
}
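>>>markdown
The same measure can also be used as the index weight itself, so the highest-weighted entries come from the most-voted-on movies. A minimal sketch, using the `by` clause shown in full in the next section:
>>>malloy
#(docs) size=small limit=200
run: movies -> {
  // weight each person's name by total_ratings instead of the default row count
  index: principals.people.primaryName by total_ratings
}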
>>>markdown
## Index the entire graph
Indexing can work across an entire network of joins and can be selective.
>>>malloy
#(docs) size=small limit=200
run: movies -> {
index:
*
genres.*
principals.category, principals.job
principals.characters.*
principals.people.primaryName
by total_ratings
sample: 5000
} -> {
select: *
order_by: weight desc
}
>>>markdown
## Search Index Naming Convention
By convention, indexes in sources are named `search_index`.
>>>malloy
source: movies2 is duckdb.table('../imdb/data/titles.parquet') extend {
join_many: principals is duckdb.table('../imdb/data/principals.parquet') extend {
join_one: people is duckdb.table('../imdb/data/names.parquet')
on nconst = people.nconst
} on tconst = principals.tconst
measure: total_ratings is numVotes.sum()
view: search_index is {
index:
*
genres.*
principals.category, principals.job
principals.characters.*
principals.people.primaryName
by total_ratings
}
}
>>>markdown
So, to look for 'Brad':
>>>malloy
#(docs) size=small limit=200
run: movies2 -> search_index -> {
select: *
where: fieldValue ~ 'Brad%'
order_by: weight desc
}
>>>markdown
And to look for 'Bat':
>>>malloy
#(docs) size=small limit=200
run: movies2 -> search_index -> {
select: *
where: fieldValue ~ 'Bat%'
order_by: weight desc
}
>>>markdown