Please ask new questions as a Q&A in our Discussions.
PuffinDB is a serverless HTAP cloud data platform powered by Arrow × DuckDB × Iceberg.
If you are using DuckDB client-side with any client application, adding the PuffinDB extension will let you:
- Distribute queries across thousands of serverless functions and a Monostore
- Read from and write to hundreds of applications using any Airbyte connector
- Collaborate on the same Iceberg tables with other users
- Write back to an Iceberg table with ACID transactional integrity
- Execute cross-database joins (Cf. Edge-Driven Data Integration)
- Translate between 19 SQL dialects
- Invoke remote query generators
- Invoke curl commands
- Accelerate and | or schedule the downloading of large tables to your client
- Cache tables and run computations at the edge (Amazon CloudFront × Lambda@Edge)
- Log queries on your data lake
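For illustration, here is a minimal sketch of what client-side usage could look like from Python, assuming the extension is published under the name `puffindb` and can be installed and loaded like any other DuckDB extension (the actual name and installation procedure may differ):

```python
import duckdb

# Open an in-process DuckDB connection on the client
con = duckdb.connect()

# Hypothetical: install and load the PuffinDB extension
# (extension name and distribution channel are assumptions)
con.sql("INSTALL puffindb")
con.sql("LOAD puffindb")

# Once loaded, queries can be distributed to the cloud by PuffinDB,
# while small local queries keep running in-process as usual
con.sql("SELECT 42 AS answer").show()
```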
Many excellent distributed SQL engines are available today. Why do we need yet another one?
- True serverless architecture
- Future-proof architecture
- Designed for virtual private cloud deployment
- Designed for small to large datasets
- Designed for real-time analytics
- Designed for interactive analytics
- Designed for transformation and analytics
- Designed for analytics and transactions
- Designed for next-generation query engines
- Designed for next-generation file formats
- Designed for lakehouses
- Designed for data mesh integration
- Designed for all users
- Designed for extensibility
- Designed for embeddability
- Optimized for machine-generated queries
- Scalable across large user bases
Cloud Data is what comes after Big Data.
Edge-Driven Data Integration (EDDI) is an inversion of control proposed by the PuffinDB project. Its main idea is that data integration should be driven at the edge by dynamic user-driven integration scenarios, rather than on the cloud with static data integration pipelines, yet without sacrificing solid architecture design and proper data governance (read more).
While running DuckDB on your personal computer will work great in many instances, running it on the cloud can bring several benefits:
- No need to download large datasets from the cloud to your local computer.
- Ability to work with larger datasets by taking advantage of fleets of AWS Lambda functions and | or large Amazon EC2 instances.
- Enforcement of column and | or row-level access control policies.
- Low-latency analytics on whatever amount of data you are willing to cache on your client.
- Ability to handle datasets that are too large for your client.
- Concurrent users editing the same cloud-side table.
Yes, you can use PuffinDB with just an Object Store like Amazon S3. But you should still take a look at Iceberg, for the following reasons:
A Lakehouse like Apache Iceberg, Apache Hudi, or Delta Lake offers many critical features:
- Schema evolution supports add, drop, update, or rename, and has no side-effects.
- Hidden partitioning prevents user mistakes that cause silently incorrect results or extremely slow queries.
- Partition layout evolution can update the layout of a table as data volume or query patterns change.
- Time travel enables reproducible queries that use exactly the same table snapshot.
- Table version rollback allows users to quickly correct problems by resetting tables to a good state.
- Advanced filtering prunes data files with partition and column-level stats, using table metadata.
- Serializable isolation makes table changes atomic and ensures that readers never see partial or uncommitted changes.
- Multiple concurrent writers use optimistic concurrency and transaction retry.
Credits: feature descriptions courtesy of Apache Iceberg.
Yes, it will (or so we've read). Nevertheless, this support is likely to be limited to DuckDB running on a single host (client, function, or VM). PuffinDB is designed to support DuckDB running on multiple functions in parallel, or multiple functions and a VM coordinating their work. Consequently, while DuckDB can directly request a table scan through Iceberg's Java API when running in single-host mode, this scan must be performed outside of DuckDB when running in multi-host mode. Furthermore, PuffinDB supports the optional execution of some read queries on Amazon Athena, whenever such delegated execution would be faster or more cost-effective.
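As a point of reference, single-host reads can already be sketched with DuckDB's own iceberg extension and its iceberg_scan table function (the S3 path below is a placeholder, and credentials must be configured separately); in multi-host mode, PuffinDB performs the equivalent table scan outside of DuckDB:

```python
import duckdb

con = duckdb.connect()

# DuckDB's iceberg extension, used here in single-host mode
con.sql("INSTALL iceberg")
con.sql("LOAD iceberg")

# Placeholder table path; the scan resolves Iceberg metadata and reads the data files
con.sql("""
    SELECT count(*)
    FROM iceberg_scan('s3://my-bucket/warehouse/my_table')
""").show()
```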
Amazon Athena can be used to read | write any Iceberg table, but many read queries will be faster and more cost-effective with DuckDB.
If you just make read queries on Iceberg tables, you do not. But you will need Amazon Athena for write queries on these tables.
Alternatively, you will be able to use Icecap once it becomes available.
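For write queries, a minimal sketch of delegating a statement to Amazon Athena from Python could look like the following (database, table, and bucket names are placeholders, and error handling is omitted):

```python
import boto3

athena = boto3.client("athena")

# Placeholder names: adjust the database, table, and output bucket to your setup
response = athena.start_query_execution(
    QueryString="INSERT INTO my_iceberg_table VALUES (1, 'hello')",
    QueryExecutionContext={"Database": "my_database"},
    ResultConfiguration={"OutputLocation": "s3://my-bucket/athena-results/"},
)

# Check execution status (simplified; production code should poll with backoff)
status = athena.get_query_execution(QueryExecutionId=response["QueryExecutionId"])
print(status["QueryExecution"]["Status"]["State"])
```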
No. You will only be billed for AWS Lambda functions, thanks to the fact that the Iceberg Java API used for table scans is packaged as a standalone AWS Lambda function running without Apache Spark. This ensures the lowest possible cost and a much lower marginal table scan latency (under 500 ms). You will only get billed for Amazon Athena when making write queries on Iceberg tables. Finally, write queries made directly on Object Store objects do not use Apache Iceberg, hence do not require Amazon Athena.
Initially, AWS. Support for Microsoft Azure and Google Cloud will be added in future releases.
Apache Iceberg, Apache Hudi, and Delta Lake.
So that you can pick the one that will work best for you:
- Hard: Bun and Python modules deeply integrated within your own tool or application
- Easy: AWS Lambda functions deployed within your own cloud platform
- Easier: AWS CloudFormation template deployed within your own VPC
- Easiest: AWS Marketplace product added to your own cloud environment
Modern public clouds like Amazon Web Services, Microsoft Azure, and Google Cloud are true marvels of engineering, and have reached a level of sophistication and scale that is far beyond the reach of any corporate IT organization. And as time goes by, the very concept of a private cloud makes less and less sense. We have zero interest in spending any resources to deliver a sub-par user experience that will be of interest to fewer and fewer organizations.
Be on the right side of history...
PuffinDB is deployed on AWS Lambda functions and one Amazon EC2 instance. Support for AWS Fargate will be added later on.
Fair question. Whenever PuffinDB is deployed on Amazon EC2 instances, it still uses fleets of AWS Lambda functions for loading objects from the Object Store and for processing some parts of distributed queries, while using the EC2 instances for reduction purposes. The goal here is to be as serverless as possible, while remaining pragmatic (AWS Fargate tasks remain much less powerful than the largest EC2 instances). Eventually, we expect Fargate tasks to become powerful enough that we will no longer need to rely on EC2 instances for the vast majority of applications and workloads.
No. PuffinDB has a clientless architecture and can be used from any application embedding the DuckDB engine, using a simple extension.
PuffinDB embeds a PRQL to SQL translator and a Malloy to SQL translator. Many such translators are available today, for many different applications and syntaxes, but PuffinDB selected these two, because they address two complementary needs: PRQL is great for data preparation, while Malloy is unparalleled for data analytics. Both have been embedded to stress the platform in different ways, and to provide ready-to-use "applications" that can be used to convincingly showcase the platform's capabilities. Nevertheless, they are packaged as optional modules that can be omitted if so desired. Later on, they will be dynamically injected using the Query Proxy.
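To give a feel for the PRQL side, here is a small standalone example using the prql-python bindings; this is not PuffinDB's embedded translator, just an illustration of the kind of translation it performs (the table and column names are placeholders):

```python
import prql_python as prql

prql_query = """
from invoices
filter total > 100
take 10
"""

# Compile PRQL into SQL, which can then be executed by DuckDB
print(prql.compile(prql_query))
```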
- Parse the SQL query using the native DuckDB parser and extract table filter predicates.
- Scan the table(s) using the low-latency Lakehouse Catalog API running on an AWS Lambda function.
- Replace reference(s) to table(s) with references to the partitions returned by the table scan(s).
- Stringify the SQL query using the native DuckDB stringifier.
- Execute the query using DuckDB running on an AWS Lambda function.
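The sketch below illustrates that flow in Python, with heavy simplifications: the Lambda function names are hypothetical, the payload shapes are assumptions, and the table reference is rewritten with a plain string substitution instead of DuckDB's native parser and stringifier:

```python
import json
import boto3

lambda_client = boto3.client("lambda")

def invoke(function_name: str, payload: dict) -> dict:
    """Invoke a Lambda function synchronously and decode its JSON response."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        Payload=json.dumps(payload).encode("utf-8"),
    )
    return json.loads(response["Payload"].read())

query = "SELECT count(*) FROM my_table WHERE country = 'FR'"

# 1. Ask the (hypothetical) catalog function for the partitions matching the predicate
scan = invoke("puffindb-catalog", {"table": "my_table", "filter": "country = 'FR'"})
partitions = scan["data_files"]  # e.g. a list of Parquet object URLs

# 2. Rewrite the table reference to point at the scanned partitions
#    (PuffinDB does this on the parsed query tree, not on the raw string)
files = ", ".join(f"'{url}'" for url in partitions)
rewritten = query.replace("my_table", f"read_parquet([{files}])")

# 3. Hand the rewritten query to a (hypothetical) engine function for execution
result = invoke("puffindb-engine", {"sql": rewritten})
print(result)
```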
Whenever you execute a query on a table (be it backed by objects on the Object Store or a table in the Lakehouse) and are planning to make multiple queries on the same subset of the table, its partitions can be cached within the memory of the AWS Lambda function(s) used to execute the query. During subsequent queries on the same partitions, the function has a 99% probability of enjoying a hot start, giving it instant access to previously-cached partitions. This helps reduce latency and cost, since downloading partitions from the Object Store to the function is usually the longest step in the end-to-end query process (at least for relatively simple queries). The only drawback of such an optimization is that it slows the initial query down, because it requires an extra copy of the data in memory, but this is a small price to pay if many queries are to be made on the same table partitions.
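A minimal sketch of this warm-start caching pattern inside a Lambda handler is shown below; the bucket, keys, and event shape are placeholders, and PuffinDB's actual cache is managed by the extension itself:

```python
import boto3

s3 = boto3.client("s3")

# Module-level cache: survives across invocations for as long as the
# Lambda execution environment stays warm
_partition_cache: dict[str, bytes] = {}

def get_partition(bucket: str, key: str) -> bytes:
    """Return a partition's bytes, downloading from the Object Store only once per warm environment."""
    cache_key = f"s3://{bucket}/{key}"
    if cache_key not in _partition_cache:
        obj = s3.get_object(Bucket=bucket, Key=key)
        _partition_cache[cache_key] = obj["Body"].read()
    return _partition_cache[cache_key]

def handler(event, context):
    # Hypothetical event shape: a bucket and a list of partition keys to process
    data = [get_partition(event["bucket"], key) for key in event["keys"]]
    return {"bytes_loaded": sum(len(d) for d in data)}
```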
Whenever you execute a read query, its result can be cached on the Object Store (Amazon S3), and this cache is automatically referenced in the query logs table. Recent entries in this table are cached in the main AWS Lambda function for fast lookup. Whenever the same query is requested again, its cached result can be returned instead of re-executing the query. This helps reduce latency and cost, while freeing limited resources (most AWS accounts are limited to 3,000 concurrent Lambda functions) for other queries. Cached query results can be served through a Content Delivery Network (Amazon CloudFront) using Object Store Origins to further reduce latency and remove any potential bottleneck on the Object Store.
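The underlying idea can be sketched as follows, keying the cache on a hash of the query text; the bucket name is a placeholder and the `execute` callable stands in for whatever actually runs the query:

```python
import hashlib
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
CACHE_BUCKET = "my-query-cache-bucket"  # placeholder

def cached_result_key(sql: str) -> str:
    """Derive a deterministic Object Store key from the query text."""
    return "query-cache/" + hashlib.sha256(sql.encode("utf-8")).hexdigest()

def get_or_execute(sql: str, execute) -> bytes:
    """Return a cached result if present, otherwise execute the query and cache the result."""
    key = cached_result_key(sql)
    try:
        return s3.get_object(Bucket=CACHE_BUCKET, Key=key)["Body"].read()
    except ClientError:
        result = execute(sql)  # e.g. run the query on DuckDB and serialize the result
        s3.put_object(Bucket=CACHE_BUCKET, Key=key, Body=result)
        return result
```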
These reference materials are a solid starting point.
PuffinDB uses the following languages:
- C++ for most parts of the DuckDB extension
- TypeScript for most of the middleware code executed by the Engine serverless function
- SMT-LIB for the formal models of the distributed query planner
- Rust for the routines of the distributed query planner
- Java for the connectors to databases
- Python for the connectors to applications
- HCL for the Terraform templates
STOIC is in the business of developing and selling a progressive data platform allowing any data citizen to interact with datasets of any size, from kilobytes to petabytes. In that context, we need to perform read | write SQL queries on large datasets, with low latency (2s or less) and low cost (one or two orders of magnitude lower than conventional solutions). And we need the required query engine to properly support the top-three cloud platforms and the top-three table formats. Furthermore, we want it powered by Arrow and DuckDB, because there are no better technologies available today. We could not find such an engine distributed under a liberal open source license, so we decided to build one. And because this component is a means to an end for us, yet could benefit countless other projects and organizations, we decided to develop it as an open source project.
STOIC is a progressive data platform allowing any data citizen to interact with datasets of any size, through an intuitive spreadsheet-like user experience. STOIC is built on top of PuffinDB, with a modern cloud-native architecture and a unique hierarchical caching stack. Its user interface is designed for a very wide range of personas, including business analysts, data architects, and software engineers. It combines a table editor, sheet editor, and notebook editor into a fully integrated and coherent user experience. It supports data pipelines, data editing, workflow automation, and insight-driven data visualization. For more information, please check this presentation.
Yes.
Vendors and users alike should consider sponsoring the PuffinDB project, for several reasons outlined on our sponsors page.
A first version with a naive distributed query engine (filter pushdown to serverless functions) will be released in Q2 (Cf. roadmap).
Please send an email to info at puffindb dot io.