DuckDB: Is the Hype Train Behind It Real?
May 21, 2024
Gentrit Mehmeti
Overview
Ever heard of a species that can get by in just about any environment the planet offers, thriving in water, on land, or both? That's the duck. When it comes to surviving, it can do it all. DuckDB is the same: it lives inside your process. You just install it with a quick pip install duckdb and it's there, ready to live alongside your application.
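If you want to see just how little ceremony is involved, here is a minimal sketch: install the package, import it, and query, with nothing else to set up.

```python
# Install first with: pip install duckdb
import duckdb

# No server to start, no config files: the database lives in this process.
print(duckdb.sql("SELECT 42 AS answer").fetchall())  # [(42,)]
```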
Architecture and Deployment
DuckDB is an embedded, in-process analytical database management system. It is designed for fast query performance on analytical workloads, particularly on a single node or machine. It has no external dependencies, neither for compilation nor at run-time. For releases, the entire source tree of DuckDB is compiled into two files, a header and an implementation file, a so-called "amalgamation". This greatly simplifies deployment and integration into other build processes. All that is required to build DuckDB is a working C++11 compiler.
DuckDB is designed to support analytical query workloads, also known as OLAP. These workloads are characterized by complex, relatively long-running queries that process significant portions of the stored dataset, for example, aggregations over entire tables or joins between several large tables. Changes to the data are expected to be rather large-scale as well, with several rows being appended, or large portions of tables being changed or added at the same time.
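To make "analytical workload" concrete, here is the shape of query DuckDB is built for; the sales table and its values below are invented for illustration.

```python
import duckdb

con = duckdb.connect()  # purely in-memory database
con.execute("CREATE TABLE sales (region TEXT, product TEXT, amount DOUBLE)")
con.execute("""
    INSERT INTO sales VALUES
        ('EU', 'widget', 10.0), ('EU', 'gadget', 20.0), ('US', 'widget', 5.0)
""")

# An OLAP-style query: scan the whole table, aggregate per group.
print(con.sql("""
    SELECT region, count(*) AS orders, sum(amount) AS revenue
    FROM sales
    GROUP BY region
    ORDER BY revenue DESC
""").fetchall())
```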
In-process
As mentioned before, DuckDB can live anywhere: there is no DBMS server software to install, update, or maintain. DuckDB does not run as a separate process but is completely embedded within a host process. For the analytical use cases that DuckDB targets, this has the additional advantage of high-speed data transfer to and from the database. In some cases, DuckDB can process foreign data without copying it at all. For example, the DuckDB Python package can run queries directly on Pandas data without ever importing or copying it.
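Here is what that looks like in practice: the Python package resolves the local variable df by name (a so-called replacement scan) and scans the DataFrame in place. The DataFrame contents are made up for illustration.

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"city": ["Prishtina", "Tirana", "Skopje"],
                   "pop_k": [227, 557, 526]})

# DuckDB finds `df` in the surrounding Python scope and queries it directly,
# without importing or copying the data into the database first.
print(duckdb.sql("SELECT city FROM df WHERE pop_k > 500").df())
```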
Storage
DuckDB uses a columnar storage format, where data is organized and stored in columns rather than rows. This allows for efficient compression and the ability to read only the necessary columns during query execution. (For a deeper comparison, see "Columnar database vs row database" on fivetran.com.)
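A quick sketch of why that matters: when a query touches two columns of a wide Parquet file, DuckDB reads only those two columns from disk. The file name here is hypothetical.

```python
import duckdb

# Only `user_id` and `amount` are read from the (hypothetical) wide file;
# every other column is skipped entirely thanks to the columnar layout.
duckdb.sql("""
    SELECT user_id, sum(amount) AS total
    FROM 'events.parquet'
    GROUP BY user_id
""").show()
```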
Very Fast
DuckDB contains a columnar-vectorized query execution engine: queries are still interpreted, but a large batch of values (a "vector") is processed in one operation. This greatly reduces the interpretation overhead found in traditional systems such as PostgreSQL, MySQL, or SQLite, which process each row sequentially. Vectorized query execution leads to far better performance on OLAP queries. (See "Difference Between Vector Processor and Scalar Processor" on geeksforgeeks.org.)
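If you'd rather feel the difference than take it on faith, here is a rough, unscientific sketch that runs the same aggregation in SQLite and in DuckDB. Exact timings will vary by machine, but the vectorized engine usually wins comfortably on this kind of query.

```python
import sqlite3
import time
import duckdb

N = 1_000_000
rows = [(i % 100, float(i)) for i in range(N)]  # loading is not what we time

# SQLite: row-at-a-time, tuple-oriented execution.
sq = sqlite3.connect(":memory:")
sq.execute("CREATE TABLE t (g INTEGER, x REAL)")
sq.executemany("INSERT INTO t VALUES (?, ?)", rows)
t0 = time.perf_counter()
sq.execute("SELECT g, sum(x) FROM t GROUP BY g").fetchall()
print(f"sqlite: {time.perf_counter() - t0:.3f}s")

# DuckDB: vectorized execution over batches of column values.
dd = duckdb.connect()
dd.execute("CREATE TABLE t (g INTEGER, x DOUBLE)")
dd.executemany("INSERT INTO t VALUES (?, ?)", rows)
t0 = time.perf_counter()
dd.execute("SELECT g, sum(x) FROM t GROUP BY g").fetchall()
print(f"duckdb: {time.perf_counter() - t0:.3f}s")
```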
MVCC
DuckDB supports multiple writer threads using a combination of MVCC (Multi-Version Concurrency Control) and optimistic concurrency control. As long as there are no write conflicts, multiple concurrent writes will succeed. Appends will never conflict, even on the same table. Multiple threads can also simultaneously update separate tables or separate subsets of the same table. Optimistic concurrency control comes into play when two threads attempt to edit (update or delete) the same row of data at the same time. In that situation, the second thread to attempt the edit will fail with a conflict error.
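You can provoke that conflict error yourself from Python. The sketch below opens two cursors (independent connections to the same in-process database) and has them update the same row; the table and values are invented.

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE accounts (id INT, balance INT)")
con.execute("INSERT INTO accounts VALUES (1, 100)")

c1, c2 = con.cursor(), con.cursor()  # two independent connections
c1.execute("BEGIN TRANSACTION")
c2.execute("BEGIN TRANSACTION")

c1.execute("UPDATE accounts SET balance = 90 WHERE id = 1")
try:
    # Second writer touching the same uncommitted row: optimistic
    # concurrency control rejects it with a conflict error.
    c2.execute("UPDATE accounts SET balance = 80 WHERE id = 1")
except duckdb.Error as e:
    print("conflict:", e)
c1.execute("COMMIT")
```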
Multi-Database Support
DuckDB has a pluggable storage and transactional layer. This flexible layer allows new storage back-ends to be created by DuckDB extensions. These storage back-ends can support all database operations in the same way that DuckDB supports them, including inserting data and even modifying schemas. The MySQL, Postgres, and SQLite extensions implement this new pluggable storage and transactional layer, allowing DuckDB to connect to those systems and operate on them in the same way that it operates on its native storage engine. These extensions enable several useful features. For example, using these extensions you can:
Export data from SQLite to JSON (sketched after this list)
Read data from Parquet into Postgres
Move data from MySQL to Postgres
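As a sketch of the first bullet, here is SQLite-to-JSON through DuckDB, using the sqlite extension's ATTACH support; the file and table names are hypothetical.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL sqlite; LOAD sqlite;")

# Attach an existing SQLite file; its tables become queryable from DuckDB.
con.execute("ATTACH 'legacy.db' AS legacy (TYPE sqlite)")

# Export one of its tables straight to a JSON file.
con.execute("COPY (SELECT * FROM legacy.users) TO 'users.json'")
```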
In the end?
DuckDB is great, but is the hype train behind it worth a look? That depends on your use cases. DuckDB is free and open source, and if you don't want to force dataframe-level frameworks and libraries on business analysts who have spent time and effort learning SQL and just want to run fast queries on your data, then DuckDB is for you. Let's explore some pros, cons, and use cases of DuckDB.
DuckDB pros, cons, and use cases
Pros
Cost saving
DuckDB is open source, free to use, and fast to get up and running with; you can run it in your Python process or even on the web through WebAssembly.
Feature-rich
Supports multiple input and output formats (see the sketch after this list):
CSV
Parquet
JSON
NDJSON
Iceberg
DuckDB's own native format
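To illustrate (file names hypothetical), each of these formats is reachable through the same one-line API:

```python
import duckdb

duckdb.sql("SELECT * FROM 'events.csv'")                   # CSV
duckdb.sql("SELECT * FROM 'events.parquet'")               # Parquet
duckdb.sql("SELECT * FROM read_json_auto('events.json')")  # JSON / NDJSON

# Converting between formats is a single COPY statement:
duckdb.execute("COPY (SELECT * FROM 'events.csv') TO 'events.parquet'")
```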
Flexible
Can read from (see the sketch after this list):
HTTP(S)
Other databases (PostgreSQL, SQLite)
S3
GCS
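A short sketch of remote reads via the httpfs extension; the URL, bucket, and region below are placeholders.

```python
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs; LOAD httpfs;")

# Over HTTPS (placeholder URL):
con.sql("SELECT count(*) FROM 'https://example.com/data/events.parquet'")

# From S3, after pointing DuckDB at the right region (placeholder bucket):
con.execute("SET s3_region = 'eu-central-1'")
con.sql("SELECT * FROM 's3://my-bucket/events.parquet' LIMIT 10")
```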
Portable
DuckDB CLI
Python client
R client
Rust client
Java client, and more
PostgreSQL dialect and unique SQL features
DuckDB uses the PostgreSQL dialect and adds its own unique SQL features on top.
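Two small examples of those conveniences, on a throwaway table: GROUP BY ALL groups by every non-aggregated column for you, and SELECT * EXCLUDE drops columns without listing all the rest.

```python
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE trips (city TEXT, km DOUBLE, fare DOUBLE)")
con.execute("INSERT INTO trips VALUES ('Prishtina', 3.2, 4.5), ('Tirana', 5.0, 6.0)")

# GROUP BY ALL: no need to repeat the grouping columns.
con.sql("SELECT city, sum(fare) AS total FROM trips GROUP BY ALL").show()

# EXCLUDE: everything except the listed columns.
con.sql("SELECT * EXCLUDE (km) FROM trips").show()
```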
Extensible
User-defined functions (see the sketch after this list)
Defining types
Geospatial Extension
Full-text search
JSON
Parquet
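For instance, the Python client can register a plain Python function as a scalar SQL function, with the SQL types inferred from the type annotations; the function here is invented for illustration.

```python
import duckdb

def shout(s: str) -> str:
    """Ordinary Python function, about to become a SQL function."""
    return s.upper() + "!"

con = duckdb.connect()
con.create_function("shout", shout)  # types inferred from the annotations
print(con.sql("SELECT shout('duckdb') AS greeting").fetchall())  # [('DUCKDB!',)]
```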
Cons
Small community
DuckDB has a rather small community, since it's a relatively new tool.
Scalability
Embedded and single-node focused; it parallelizes queries across the cores of one machine but does not scale out across a cluster.
Use cases
DuckDB is intended to be used for:
Embedded analytics
Data Science
Data processing
ETL pipelines and more
When to use it
Processing and storing tabular datasets, e.g., from CSV or Parquet files
Interactive data analysis, e.g., joining and aggregating multiple large tables
Concurrent large changes to multiple large tables, e.g., appending rows, or adding, removing, or updating columns
Transferring large result sets to the client
When not to use it
High-volume transactional use cases
Large client/server installations for centralized enterprise data warehousing
Writing to a single database from multiple concurrent processes
Multiple concurrent processes reading from a single writable database
Comparisons worth having a look at:
https://db-engines.com/en/system/DuckDB%3BSpark+SQL%3BTrino
https://db-engines.com/en/system/DuckDB%3BRedis%3BTrino
https://www.confessionsofadataguy.com/duckdb-vs-polars-for-data-engineering/