Please ask new questions as a Q&A in our Discussions.
PuffinDB is a serverless HTAP cloud data platform powered by Arrow × DuckDB × Iceberg.
If you are using DuckDB client-side with any client application, adding the PuffinDB extension will let you:
- Distribute queries across thousands of serverless functions and a Monostore
- Read from and write to hundreds of applications using any Airbyte connector
- Collaborate on the same Iceberg tables with other users
- Write back to an Iceberg table with ACID transactional integrity
- Execute cross-database joins (Cf. Edge-Driven Data Integration)
- Translate between 19 SQL dialects
- Invoke remote query generators
- Invoke curl commands
- Accelerate and | or schedule the downloading of large tables to your client
- Cache tables and run computations at the edge (Amazon CloudFront × Lambda@Edge)
- Log queries on your data lake
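For illustration, here is a minimal sketch of what client-side usage could look like from Python, assuming the extension is published under the name `puffindb` and can be installed and loaded like any other DuckDB extension (the actual name and installation procedure may differ):

```python
import duckdb

# Open an in-process DuckDB connection on the client
con = duckdb.connect()

# Hypothetical: install and load the PuffinDB extension
# (extension name and distribution channel are assumptions)
con.sql("INSTALL puffindb")
con.sql("LOAD puffindb")

# Once loaded, queries can be distributed to the cloud by PuffinDB,
# while small local queries keep running in-process as usual
con.sql("SELECT 42 AS answer").show()
```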
Many excellent distributed SQL engines are available today. Why do we need yet another one?
- True serverless architecture
- Future-proof architecture
- Designed for virtual private cloud deployment
- Designed for small to large datasets
- Designed for real-time analytics
- Designed for interactive analytics
- Designed for transformation and analytics
- Designed for analytics and transactions
- Designed for next-generation query engines
- Designed for next-generation file formats
- Designed for lakehouses
- Designed for data mesh integration
- Designed for all users
- Designed for extensibility
- Designed for embeddability
- Optimized for machine-generated queries
- Scalable across large user bases
Cloud Data is what comes after Big Data.
Edge-Driven Data Integration (EDDI) is an inversion of control proposed by the PuffinDB project. Its main idea is that data integration should be driven at the edge by dynamic user-driven integration scenarios, rather than on the cloud with static data integration pipelines, yet without sacrificing solid architecture design and proper data governance (read more).
While running DuckDB on your personal computer will work great in many instances, running it on the cloud can bring several benefits:
- No need to download large datasets from the cloud to your local computer.
- Ability to work with larger datasets by taking advantage of fleets of AWS Lambda functions and | or large Amazon EC2 instances.
- Enforcement of column and | or row-level access control policies.
- Low-latency analytics on whatever amount of data you are willing to cache on your client.
- Ability to handle datasets that are too large for your client.
- Concurrent users editing the same cloud-side table.
Yes, you can use PuffinDB with just an Object Store like Amazon S3. But you should still take a look at Iceberg, for the following reasons:
A Lakehouse like Apache Iceberg, Apache Hudi, or Delta Lake offers many critical features:
- Schema evolution supports add, drop, update, or rename, and has no side-effects.
- Hidden partitioning prevents user mistakes that cause silently incorrect results or extremely slow queries.
- Partition layout evolution can update the layout of a table as data volume or query patterns change.
- Time travel enables reproducible queries that use exactly the same table snapshot.
- Table version rollback allows users to quickly correct problems by resetting tables to a good state.
- Advanced filtering prunes data files with partition and column-level stats, using table metadata.
- Serializable isolation makes table changes atomic and ensures that readers never see partial or uncommitted changes.
- Multiple concurrent writers use optimistic concurrency and transaction retry.
Credits: feature descriptions courtesy of Apache Iceberg.
Yes, it will (or so we've read). Nevertheless, this support is likely to be limited to DuckDB running on a single host (client, function, or VM). PuffinDB is designed to support DuckDB running on multiple functions in parallel, or multiple functions and a VM coordinating their work. Consequently, while DuckDB can directly request a table scan through Iceberg's Java API when running in single-host mode, this scan must be performed outside of DuckDB when running in multi-host mode. Furthermore, PuffinDB supports the optional execution of some read queries on Amazon Athena, whenever such delegated execution would be faster or more cost-effective.
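As a point of reference, single-host reads can already be sketched with DuckDB's own iceberg extension and its iceberg_scan table function (the S3 path below is a placeholder, and credentials must be configured separately); in multi-host mode, PuffinDB performs the equivalent table scan outside of DuckDB:

```python
import duckdb

con = duckdb.connect()

# DuckDB's iceberg extension, used here in single-host mode
con.sql("INSTALL iceberg")
con.sql("LOAD iceberg")

# Placeholder table path; the scan resolves Iceberg metadata and reads the data files
con.sql("""
    SELECT count(*)
    FROM iceberg_scan('s3://my-bucket/warehouse/my_table')
""").show()
```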
Amazon Athena can be used to read | write any Iceberg table, but many read queries will be faster and more cost-effective with DuckDB.
If you just make read queries on Iceberg tables, you do not. But you will need Amazon Athena for write queries on these tables.
Alternatively, you will be able to use Icecap once it becomes available.
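For write queries, a minimal sketch of delegating a statement to Amazon Athena from Python could look like the following (database, table, and bucket names are placeholders, and error handling is omitted):

```python
import boto3

athena = boto3.client("athena")

# Placeholder names: adjust the database, table, and output bucket to your setup
response = athena.start_query_execution(
    QueryString="INSERT INTO my_iceberg_table VALUES (1, 'hello')",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)

# Check execution status (simplified; production code should poll with backoff)
status = athena.get_query_execution(QueryExecutionId=response["QueryExecutionId"])
print(status["QueryExecution"]["Status"]["State"])
```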
No. You will only be billed for AWS Lambda functions, thanks to the fact that the Iceberg Java API used for table scans is packaged as a standalone AWS Lambda function running without Apache Spark. This ensures the lowest possible cost and a much lower marginal table scan latency (under 500 ms). You will only get billed for Amazon Athena when making write queries on Iceberg tables. Finally, write queries made directly on Object Store objects do not use Apache Iceberg, hence do not require Amazon Athena.
Initially, AWS. Support for Microsoft Azure and Google Cloud will be added in future releases.
Apache Iceberg, Apache Hudi, and Delta Lake.
So that you can pick the one that will work best for you:
- Hard: Bun and Python modules deeply integrated within your own tool or application
- Easy: AWS Lambda functions deployed within your own cloud platform
- Easier: AWS CloudFormation template deployed within your own VPC
- Easiest: AWS Marketplace product added to your own cloud environment
Modern public clouds like Amazon Web Services, Microsoft Azure, and Google Cloud are true marvels of engineering, and have reached a level of sophistication and scale that is far beyond the reach of any corporate IT organization. And as time goes by, the very concept of a private cloud makes less and less sense. We have zero interest in spending any resources to deliver a sub-par user experience that will be of interest to fewer and fewer organizations.
Be on the right side of history...
PuffinDB is deployed on AWS Lambda functions and one Amazon EC2 instance. Support for AWS Fargate will be added later on.
Fair question. Whenever PuffinDB is deployed on Amazon EC2 instances, it still uses fleets of AWS Lambda functions for loading objects from the Object Store and for processing some parts of distributed queries, while using the EC2 instances for reduction purposes. The goal here is to be as serverless as possible, while remaining pragmatic (AWS Fargate tasks remain much less powerful than the largest EC2 instances). Eventually, we expect Fargate tasks to become powerful enough that we will no longer need to rely on EC2 instances for the vast majority of applications and workloads.
No. PuffinDB has a clientless architecture and can be used from any application embedding the DuckDB engine, using a simple extension.
PuffinDB embeds a PRQL to SQL translator and a Malloy to SQL translator. Many such translators are available today, for many different applications and syntaxes, but PuffinDB selected these two, because they address two complementary needs: PRQL is great for data preparation, while Malloy is unparalleled for data analytics. Both have been embedded to stress the platform in different ways, and to provide ready-to-use "applications" that can be used to convincingly showcase the platform's capabilities. Nevertheless, they are packaged as optional modules that can be omitted if so desired. Later on, they will be dynamically injected using the Query Proxy.
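To give a feel for the PRQL side, here is a small standalone example using the prql-python bindings; this is not PuffinDB's embedded translator, just an illustration of the kind of translation it performs (the table and column names are placeholders):

```python
import prql_python as prql

prql_query = """
from invoices
filter total > 100
take 10
"""

# Compile PRQL into SQL, which can then be executed by DuckDB
print(prql.compile(prql_query))
```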
- Parse the SQL query using the native DuckDB parser and extract table filter predicates.
- Scan the table(s) using the low-latency Lakehouse Catalog API running on an AWS Lambda function.
- Replace reference(s) to table(s) with references to the partitions returned by the table scan(s).
- Stringify the SQL query using the native DuckDB stringifier.
- Execute the query using DuckDB running on an AWS Lambda function.
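The sketch below illustrates that flow in Python, with heavy simplifications: the Lambda function names are hypothetical, the payload shapes are assumptions, and the table reference is rewritten with a plain string substitution instead of DuckDB's native parser and stringifier:

```python
import json
import boto3

lambda_client = boto3.client("lambda")

def invoke(function_name: str, payload: dict) -> dict:
    """Invoke a Lambda function synchronously and decode its JSON response."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        Payload=json.dumps(payload).encode("utf-8"),
    )
    return json.loads(response["Payload"].read())

query = "SELECT count(*) FROM my_table WHERE country = 'FR'"

# 1. Ask the (hypothetical) catalog function for the partitions matching the predicate
scan = invoke("puffindb-catalog", {"table": "my_table", "filter": "country = 'FR'"})
partitions = scan["data_files"]  # e.g. a list of Parquet object URLs

# 2. Rewrite the table reference to point at the scanned partitions
#    (PuffinDB does this on the parsed query tree, not on the raw string)
files = ", ".join(f"'{url}'" for url in partitions)
rewritten = query.replace("my_table", f"read_parquet([{files}])")

# 3. Hand the rewritten query to a (hypothetical) engine function for execution
result = invoke("puffindb-engine", {"sql": rewritten})
print(result)
```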
Whenever you execute a query on a table (be it backed by objects on the Object Store or a table in the Lakehouse) and are planning to make multiple queries on the same subset of the table, its partitions can be cached within the memory of the AWS Lambda function(s) used to execute the query. During subsequent queries on the same partitions, the function has a 99% probability of enjoying a hot start, giving it instant access to previously-cached partitions. This helps reduce latency and cost, since downloading partitions from the Object Store to the function is usually the longest step in the end-to-end query process (at least for relatively simple queries). The only drawback of such an optimization is that it slows the initial query down, because it requires an extra copy of the data in memory, but this is a small price to pay if many queries are to be made on the same table partitions.
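A minimal sketch of this warm-start caching pattern inside a Lambda handler is shown below; the bucket, keys, and event shape are placeholders, and PuffinDB's actual cache is managed by the extension itself:

```python
import boto3

s3 = boto3.client("s3")

# Module-level cache: survives across invocations for as long as the
# Lambda execution environment stays warm
_partition_cache: dict[str, bytes] = {}

def get_partition(bucket: str, key: str) -> bytes:
    """Return a partition's bytes, downloading from the Object Store only once per warm environment."""
    cache_key = f"s3://{bucket}/{key}"
    if cache_key not in _partition_cache:
        obj = s3.get_object(Bucket=bucket, Key=key)
        _partition_cache[cache_key] = obj["Body"].read()
    return _partition_cache[cache_key]

def handler(event, context):
    # Hypothetical event shape: a bucket and a list of partition keys to process
    data = [get_partition(event["bucket"], key) for key in event["keys"]]
    return {"bytes_loaded": sum(len(d) for d in data)}
```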
Whenever you execute a read query, its result can be cached on the Object Store (Amazon S3), and this cache is automatically referenced in the query logs table. Recent entries in this table are cached in the main AWS Lambda function for fast lookup. Whenever the same query is requested again, its cached result can be returned instead of re-executing the query. This helps reduce latency and cost, while freeing limited resources (most AWS accounts are limited to 3,000 concurrent Lambda functions) for other queries. Cached query results can be served through a Content Delivery Network (Amazon CloudFront) using Object Store Origins to further reduce latency and remove any potential bottleneck on the Object Store.
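The underlying idea can be sketched as follows, keying the cache on a hash of the query text; the bucket name is a placeholder and the `execute` callable stands in for whatever actually runs the query:

```python
import hashlib
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
CACHE_BUCKET = "my-query-cache-bucket"  # placeholder

def cached_result_key(sql: str) -> str:
    """Derive a deterministic Object Store key from the query text."""
    return "query-cache/" + hashlib.sha256(sql.encode("utf-8")).hexdigest()

def get_or_execute(sql: str, execute) -> bytes:
    """Return a cached result if present, otherwise execute the query and cache the result."""
    key = cached_result_key(sql)
    try:
        return s3.get_object(Bucket=CACHE_BUCKET, Key=key)["Body"].read()
    except ClientError:
        result = execute(sql)  # e.g. run the query on DuckDB and serialize the result
        s3.put_object(Bucket=CACHE_BUCKET, Key=key, Body=result)
        return result
```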
These reference materials are a solid starting point.
PuffinDB uses the following languages:
- C++ for most parts of the DuckDB extension
- TypeScript for most of the middleware code executed by the Engine serverless function
- SMT-LIB for the formal models of the distributed query planner
- Rust for the routines of the distributed query planner
- Java for the connectors to databases
- Python for the connectors to applications
- HCL for the Terraform templates
STOIC is in the business of developing and selling a progressive data platform allowing any data citizen to interact with datasets of any size, from kilobytes to petabytes. In that context, we need to perform read | write SQL queries on large datasets, with low latency (2s or less) and low cost (one or two orders of magnitude lower than conventional solutions). And we need the required query engine to properly support the top-three cloud platforms and the top-three table formats. Furthermore, we want it powered by Arrow and DuckDB, because there are no better technologies available today. We could not find such an engine distributed under a liberal open source license, so we decided to build one. And because this component is a means to an end for us, yet could benefit countless other projects and organizations, we decided to develop it as an open source project.
STOIC is a progressive data platform allowing any data citizen to interact with datasets of any size, through an intuitive spreadsheet-like user experience. STOIC is built on top of PuffinDB, with a modern cloud-native architecture and a unique hierarchical caching stack. Its user interface is designed for a very wide range of personas, including business analysts, data architects, and software engineers. It combines a table editor, sheet editor, and notebook editor into a fully integrated and coherent user experience. It supports data pipelines, data editing, workflow automation, and insight-driven data visualization. For more information, please check this presentation.
Yes.
Vendors and users alike should consider sponsoring the PuffinDB project, for several reasons outlined on our sponsors page.
A first version with a naive distributed query engine (filter pushdown to serverless functions) will be released in Q2 (Cf. roadmap).
Please send an email to info at puffindb dot io.