1BRC in SQL with Snowflake (1-1.5 s, but the 1TRC below 20 seconds) #188
waldekkot
started this conversation in
Show and tell
Replies: 2 comments 2 replies
-
How many shards this was running on? |
Beta Was this translation helpful? Give feedback.
2 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
-
Small contribution to the fun 1BRC - here: using Snowflake and all in SQL (the data generation, the contest's aggregation query and output). The aggregation query and output inspired by the DuckDB (#39) contest entry 😄
Data generation does not use files (actually does not use tables or databases either 😉 ). This makes the whole experimentation much easier - all you need is a browser and a Snowflake account (there is no-strings-attached / no-credit-card-required Snowflake trial available at: https://signup.snowflake.com/ ). Takes 1-2 minutes to get up and running. The lowest (Standard) edition of Snowflake is sufficient for the contest. Pick up your favorite cloud provider and the cloud region where you want to have the Snowflake service available. After activation and logging in, create a new SQL worksheet (big blue Plus button top right corner in the Snowflake UI) and paste the SQL script below for the 1BRC fun. Run all (the same big blue Play button). No other tools or anything else needed. Timings, etc. are in the Query History (left menu, Activity -> Query History) - filter out the important queries by query tag ("BRC:").
Excerpts (whole script at the bottom):
My 1BRC results (Snowflake on AWS Ireland, zero tuning or configuration, i.e. a vanilla trial account):
1 Billion Rows:
Those are the best timing-wise (not so much common-sense wise), running on the largest (4-6XL) Snowflake engines (difference between the largest engines was less than 2 seconds - the task to do is too easy simply).
Note: since Snowflake is a distributed system running on the public cloud infrastructure, there is some variability (jitter), especially impacting the timing of small (like the =1 billion) queries, so your mileage might differ by a second or two (or rather 5-10% of execution time), in both directions, up and down. The queries in Snowflake have to be compiled, scheduled to run on (sometimes) hundreds of the cloud provider's nodes (VMs), etc. Hence the variability (although the cloud infrastructure is very homogeneous)... For larger, analytical queries, this does not matter much.
This is particularly visible, if you scale up to the humongous Snowflake engines (=virtual warehouses), like 5XL or 6XL. Time to provision them might sometimes take even 2-3 minutes, it is dynamic and depends on what cloud provider & cloud region you selected and how busy it currently is. For virtual warehouses smaller than 4XL, you should rarely experience more than 1 second of provisioning or queuing time (P90 <1 second). So, the jitter will be negligible.
So, if the data is materialized into a Snowflake tables, things become more interesting... For 1 bln rows, the smaller virtual warehouse (and cheaper than the humongous ones) - XLARGE actually gives better results (less scheduling overhead):
1 Billion Rows:
With Snowpark-optimized engine (having more memory), on X2LARGE, this 1 bln query goes below 1 second often... 😵💫
Actually - and probably this is the most interesting aspect of this contribution to the whole entertainment: the 1 Billion is NOT a very challenging task for a modern DBMS 😉 ... Try 10 and 100 Billion or why stop here: go the 1TRC (1 Trillion Rows Challenge ®) 😉
My 1 Trillion (1000 billion) Rows (Snowflake 6XL engine) timing:
I can bet the 1QRC will be fine too (Q = quadrillion, a million of a billion, 10^15 rows). Just might take a few hours 😄 (the scaling here is significantly sub-linear...)
The code has no tricks and is rather simple (reading what the gurus do with the 1BRC Java code is fascinating - super cool!).
Important: the data generation is a one-time thing - you can refer to its results by the query ID (UUID). BTW: notice, in the SQL code there is no materialization of the data into any table (no database/schema/table creation statements - yet it all works). This is Snowflake's Persisted Query Results buffer in action (more in the docs for the curious ones: https://docs.snowflake.com/en/user-guide/querying-persisted-results). So, if you want to tweak the query code or run the query on a differently sized Snowflake engine, just modify the line 80 marked HERE (
SET WH_SIZE = 'X3LARGE'
) and run the rest of the script below it (select lines from 80 'HERE' including till the end and press the big blue Play button) to re-run the calculation query.There are actually 3 parameters to play with:
In terms of further tuning - there are (and it is like that 'by design' in Snowflake) only a few tuning parameters/mechanisms to play with in Snowflake platform. I seriously doubt they would matter for such a simple query as the 1BRC one. I wish the challenge, besides the 1 Trillion+ rows, also tested more complex aggregations and - more importantly - running many queries concurrently, e.g. simulating many BI/ML. So, not just a single query at a time. For a modern, distributed DBMS, this is where it should shine.
A few "for educational purposes" notes on the data generation piece (this is a common pattern when working with synthetic data in Snowflake - also easy, as there is no need to bother with files, wasting laptop storage, etc.):
stations_qid
' session variable, used in the code later (https://docs.snowflake.com/en/sql-reference/session-variables)number_of_measurements
' - another session variable which defines how many measurements will be generated<block of procedural SQL>
$$ (https://docs.snowflake.com/en/sql-reference/sql/execute-immediate)m:station_name
set gen_qid = LAST_QUERY_ID(-2);
SELECT * FROM TABLE(RESULT_SCAN( UUID ))
All in all, it might not be the best SQL code as it has been written by only averagely-intelligent person (=me) and in maybe 30 minutes or so (plus copying-and-tweaking the DuckDB original calculation query to work on Snowflake 😉 ). But in the end this is maybe 50 lines of code, not a few hundred. The beauty of declarative language like SQL. Explanations are longer than the SQL code itself 😄
Enjoy !
Cheers,
Waldek from ❄️ (email me if you spot a bug: [email protected])
The whole script - copy/paste it into a SQL worksheet in UI of your Snowflake, then run:
Beta Was this translation helpful? Give feedback.
All reactions