>>>markdown
# Malloy Dimensional Indexes
Malloy has a special query operator that builds _Dimensional Search Indexes_ for sources. A Dimensional Search Index is a table with four columns. Dimensional indexes are useful for a variety of things, including filter suggestions and LLMs.
When filtering data, you might know a term, but not necessarily which column in the source, or in one of its joins, contains it. Indexing the data on field names and high-cardinality fields lets you quickly find the term and the column it appears in.
Indexing could also be used by LLMs to find interesting column/term mappings in the dataset.
* *fieldName* - The path to the column in the source
* *fieldValue* - The dimensional value for the field (or a range, if fieldType is not a string)
* *fieldType* - The type of the column
* *weight* - A weighting indicating the importance of the distinct dimensional value. Defaults to the number of rows on which the value occurs.
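The four columns can be referenced by name in a later query stage. A minimal, illustrative sketch using the airports table from the examples below:
>>>malloy
#(docs) size=medium limit=100
run: duckdb.table('../data/airports.parquet') -> {
  index: *
} -> {
  // each index row is a fieldName/fieldValue/fieldType/weight tuple
  select: fieldName, fieldValue, fieldType, weight
  order_by: weight desc
  limit: 5
}
>>>markdown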
## Simple Example
We're going to take the airports table and index it. The result is an unordered list of the distinct *fieldName/fieldValue* pairs that appear in the table. The weight, in this case, is the number of rows on which that particular value occurs.
>>>malloy
#(docs) size=medium limit=100
run: duckdb.table('../data/airports.parquet') -> {
index: *
}
>>>markdown
## Add Ordering
Adding a second query stage that filters to _string_ columns and orders by weight descending shows us the most common *fieldName/fieldValue* pairs in the dataset.
All Malloy queries run as a single SQL query. The `index:` operator is no different. Click the **SQL** tab to see how this works.
>>>malloy
#(docs) size=medium limit=100
run: duckdb.table('../data/airports.parquet') -> {
index: *
} -> {
where: fieldType = 'string'
select: *
order_by: weight desc
}
>>>markdown
## Index For Filtering User Interfaces
Indexes can be used to find the best way to filter a dataset. For example, suppose we'd like to find 'SANTA CRUZ' in the dataset, but we don't know which column might contain it. In a UI, you might imagine typing 'SANTA' and getting suggestions for values that might be appropriate. In the results we can see that the top value, 'SANTA ROSA', appears as a `county` on 26 rows in the table. We can also see that 'SANTA CRUZ' is both a `city` and a `county`.
>>>malloy
#(docs) size=medium limit=100
run: duckdb.table('../data/airports.parquet') -> {
index: *
} -> {
where: fieldValue ~ r'SANTA'
select: *
order_by: weight desc
limit: 15
}
>>>markdown
We can then write a simple query to show the rows. It turns out that 'SANTA CRUZ' is a county in both California and Arizona.
>>>malloy
#(docs) size=medium limit=100
run: duckdb.table('../data/airports.parquet') -> {
where: county ~ 'SANTA CRUZ'
select: *
}
>>>markdown
## Indexing to show top values for each dimension
It is often difficult to approach a new dataset. The index operator provides an interesting way to quickly gain an understanding of it. By piping the results of an index into another stage, we can quickly see the top values for each of the interesting dimensions. Again, the weight shows the number of rows for that particular dimension/value pair.
>>>malloy
#(docs) size=medium limit=100
run: duckdb.table('../data/airports.parquet') -> {
index: *
} -> {
group_by: fieldName
nest: values is {
group_by: fieldValue, weight
order_by: weight desc
limit: 10
}
order_by: fieldName
}
>>>markdown
## Sampling
With large datasets, you can also sample a small subsection using the `sample:` parameter. Sampled indexes are great at identifying the important low-cardinality fields.
>>>malloy
#(docs) size=medium limit=100
run: duckdb.table('../data/airports.parquet') -> {
index: *
sample: 5000 // sample only 5000 rows
} -> {
group_by: fieldName
nest: values is {
group_by: fieldValue, weight
order_by: weight desc
limit: 10
}
order_by: fieldName
}
>>>markdown
<img src="./imdb_schema.png" style="width: 230px; float: right; padding: 3px 3px 3px 3px">
## A More Complex Example
The rest of this page uses the model below. The data is an excerpt from IMDB. The Malloy schema for this model is shown on the right. The core source is movies, joined to principals (the people who worked on each movie) and people (the data about the individuals themselves).
We use the measure `total_ratings` to determine a movie's popularity. An individual's popularity is determined by the sum of the ratings of all the movies that person has worked on.
>>>malloy
source: movies is duckdb.table('../imdb/data/titles.parquet') extend {
join_many: principals is duckdb.table('../imdb/data/principals.parquet') extend {
join_one: people is duckdb.table('../imdb/data/names.parquet')
on nconst = people.nconst
} on tconst = principals.tconst
measure: total_ratings is numVotes.sum()
}
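>>>markdown
As a quick sanity check of the model, we can rank movies by `total_ratings`. This is a minimal sketch; it assumes the titles table has a `primaryTitle` column for the movie title:
>>>malloy
#(docs) size=small limit=200
run: movies -> {
  // primaryTitle is assumed to be the title column in titles.parquet
  group_by: primaryTitle
  aggregate: total_ratings
  limit: 10
}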
>>>markdown
## Weights can be any measure
Often a row count works nicely as a weight, but sometimes there is something better. For movies, the sum of the number of votes will surface not only the most interesting movies but also the most interesting people. For example, the most interesting people in the dataset:
>>>malloy
#(docs) size=small limit=200
run: movies -> {
group_by: principals.people.primaryName
aggregate: total_ratings
}
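>>>markdown
The same measure can also be used as the index weight itself, so the highest-weighted entries come from the most-voted-on movies. A minimal sketch, using the `by` clause shown in full in the next section:
>>>malloy
#(docs) size=small limit=200
run: movies -> {
  // weight each person's name by total_ratings instead of the default row count
  index: principals.people.primaryName by total_ratings
}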
>>>markdown
## Index the entire graph
Indexing can work across an entire network of joins and can be selective.
>>>malloy
#(docs) size=small limit=200
run: movies -> {
index:
*
genres.*
principals.category, principals.job
principals.characters.*
principals.people.primaryName
by total_ratings
sample: 5000
} -> {
select: *
order_by: weight desc
}
>>>markdown
## Search Index Naming Convention
By convention, indexes in sources are named `search_index`.
>>>malloy
source: movies2 is duckdb.table('../imdb/data/titles.parquet') extend {
join_many: principals is duckdb.table('../imdb/data/principals.parquet') extend {
join_one: people is duckdb.table('../imdb/data/names.parquet')
on nconst = people.nconst
} on tconst = principals.tconst
measure: total_ratings is numVotes.sum()
view: search_index is {
index:
*
genres.*
principals.category, principals.job
principals.characters.*
principals.people.primaryName
by total_ratings
}
}
>>>markdown
So, to look for 'Brad':
>>>malloy
#(docs) size=small limit=200
run: movies2 -> search_index -> {
select: *
where: fieldValue ~ 'Brad%'
order_by: weight desc
}
>>>markdown
And to look for 'Bat':
>>>malloy
#(docs) size=small limit=200
run: movies2 -> search_index -> {
select: *
where: fieldValue ~ 'Bat%'
order_by: weight desc
}
>>>markdown