DuckDB: Is the Hype Train Behind It Real?

May 21, 2024

Gentrit Mehmeti

Overview

Ever heard of a species that can do just fine in almost any environment the planet offers, something equally at home in the water, on land, or both? That’s the duck; it can do it all when it comes to surviving. DuckDB is the same: it lives inside your process. You just install it with a pip install and it’s there, ready to live with you in your applications.
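
Getting started really is that simple. A minimal sketch with the Python client:

    import duckdb  # pip install duckdb

    # DuckDB runs inside this Python process; there is no server to start.
    duckdb.sql("SELECT 42 AS answer").show()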

Architecture and Deployment

DuckDB is an embedded, in-process analytical database management system. It is designed for fast query performance on analytical workloads, particularly on a single node or machine. It has no external dependencies, neither for compilation nor during run-time. For releases, the entire source tree of DuckDB is compiled into two files, a header and an implementation file, a so-called “amalgamation”. This greatly simplifies deployment and integration in other build processes. All that is required to build DuckDB is a working C++11 compiler.

DuckDB is designed to support analytical query workloads, also known as OLAP. These workloads are characterized by complex, relatively long-running queries that process significant portions of the stored dataset, for example, aggregations over entire tables or joins between several large tables. Changes to the data are expected to be rather large-scale as well, with several rows being appended, or large portions of tables being changed or added at the same time.
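
To make that concrete, here is a small sketch of such a workload; the orders and customers tables are hypothetical stand-ins for your own schema:

    import duckdb

    con = duckdb.connect("analytics.db")  # a persistent database file

    # A typical OLAP query: join two large tables and aggregate over a
    # significant portion of the stored data.
    con.sql("""
        SELECT c.country, SUM(o.amount) AS revenue
        FROM orders AS o
        JOIN customers AS c ON c.id = o.customer_id
        GROUP BY c.country
        ORDER BY revenue DESC
    """).show()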

In-process

As we mentioned before, DuckDB can live anywhere: there is no DBMS server software to install, update, and maintain. DuckDB does not run as a separate process but is completely embedded within a host process. For the analytical use cases that DuckDB targets, this has the additional advantage of high-speed data transfer to and from the database. In some cases, DuckDB can process foreign data without copying. For example, the DuckDB Python package can run queries directly on Pandas data without ever importing or copying any data.
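
A minimal sketch of that Pandas integration (the DataFrame contents are just illustrative):

    import duckdb
    import pandas as pd

    df = pd.DataFrame({"city": ["Prishtina", "Tirana"], "population": [227000, 557000]})

    # DuckDB scans the DataFrame in place; nothing is imported or
    # copied into the database first.
    duckdb.sql("SELECT city FROM df WHERE population > 300000").show()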

Storage

DuckDB uses a columnar storage format, where data is organized and stored in columns rather than rows. This columnar storage allows for efficient compression and the ability to read only the necessary columns during query execution. See: Columnar database vs row database (fivetran.com).
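
The effect is easy to see with a columnar file format like Parquet; a sketch, assuming a hypothetical events.parquet with more columns than the two we ask for:

    import duckdb

    # Only the user_id and ts columns are read from disk; the other
    # columns in the file are skipped entirely.
    duckdb.sql("SELECT user_id, ts FROM read_parquet('events.parquet')").show()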

Very Fast

DuckDB contains a columnar-vectorized query execution engine: queries are still interpreted, but a large batch of values (a “vector”) is processed in one operation. This greatly reduces the overhead present in traditional systems such as PostgreSQL, MySQL, or SQLite, which process each row sequentially. Vectorized query execution leads to far better performance on OLAP queries. See: Difference Between Vector Processor and Scalar Processor (geeksforgeeks.org).
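
You can peek at the plan the vectorized engine runs with EXPLAIN; a small sketch:

    import duckdb

    # Every operator in the printed plan processes batches ("vectors")
    # of values rather than one row at a time.
    duckdb.sql("EXPLAIN SELECT SUM(x) FROM range(1000000) AS t(x)").show()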

MVCC

DuckDB supports multiple writer threads using a combination of MVCC (Multi-Version Concurrency Control) and optimistic concurrency control. As long as there are no write conflicts, multiple concurrent writes will succeed. Appends will never conflict, even on the same table. Multiple threads can also simultaneously update separate tables or separate subsets of the same table. Optimistic concurrency control comes into play when two threads attempt to edit (update or delete) the same row of data at the same time. In that situation, the second thread to attempt the edit will fail with a conflict error.
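
A sketch of conflict-free concurrent appends from several Python threads; per the DuckDB Python API, each thread should work through its own cursor created from the shared connection:

    import duckdb
    import threading

    con = duckdb.connect("concurrent.db")
    con.execute("CREATE TABLE IF NOT EXISTS events (thread_id INTEGER, n INTEGER)")

    def append_rows(thread_id: int) -> None:
        cur = con.cursor()  # one cursor per thread
        for n in range(1000):
            # Appends never conflict, even on the same table.
            cur.execute("INSERT INTO events VALUES (?, ?)", [thread_id, n])

    threads = [threading.Thread(target=append_rows, args=(i,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

    print(con.sql("SELECT count(*) FROM events").fetchone())  # (4000,) on a fresh file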

Multi-Database Support

DuckDB has a pluggable storage and transactional layer. This flexible layer allows new storage back-ends to be created by DuckDB extensions. These storage back-ends can support all database operations in the same way that DuckDB supports them, including inserting data and even modifying schemas. The MySQL, Postgres, and SQLite extensions implement this new pluggable storage and transactional layer, allowing DuckDB to connect to those systems and operate on them in the same way that it operates on its native storage engine. These extensions enable several useful features. For example, using these extensions you can:

  • Export data from SQLite to JSON (see the sketch after this list)

  • Read data from Parquet into Postgres

  • Move data from MySQL to Postgres
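
A minimal sketch of the first item, assuming a hypothetical cities.db SQLite file that contains a cities table:

    import duckdb

    con = duckdb.connect()

    # Attach the SQLite file. If the sqlite extension is not auto-loaded
    # in your version, run INSTALL sqlite; LOAD sqlite; first.
    con.execute("ATTACH 'cities.db' AS src (TYPE sqlite)")

    # Export a table from SQLite straight to JSON.
    con.execute("COPY (SELECT * FROM src.cities) TO 'cities.json' (FORMAT json)")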

In the End

DuckDB is great, but is the hype train behind it worth having a look at? That depends on what your use cases are. DuckDB is free and open source; if you don’t want to force dataframe-level frameworks and libraries on business analysts who have spent time and effort learning SQL and just want to run fast queries on your data, then DuckDB is for you. Let’s explore some pros, cons, and use cases of DuckDB.

DuckDB pros, cons, and use cases
Pros

Cost saving

  • DuckDB is open source, free to use, and fast to get up and running; you can run it in your Python process or even on the web through WebAssembly

Feature-rich

Supports multiple input and output formats (see the sketch after this list):

  • CSV

  • Parquet

  • JSON

  • NDJSON

  • Iceberg

  • DuckDB’s native format
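
A short sketch of reading a few of these formats from Python (file names are hypothetical):

    import duckdb

    duckdb.sql("SELECT * FROM read_csv('data.csv')").show()
    duckdb.sql("SELECT * FROM read_parquet('data.parquet')").show()
    duckdb.sql("SELECT * FROM read_json('data.ndjson')").show()  # JSON / NDJSON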

Flexible

Can read from (see the sketch after this list):

  • HTTP(S)

  • Other databases (PostgreSQL, SQLite)

  • S3

  • GCS
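
A sketch of remote reads via the httpfs extension (the URLs are hypothetical, and S3/GCS additionally need credentials configured):

    import duckdb

    con = duckdb.connect()
    con.execute("INSTALL httpfs")
    con.execute("LOAD httpfs")  # enables HTTP(S), S3, and GCS paths

    con.sql("SELECT * FROM read_parquet('https://example.com/data.parquet')").show()
    con.sql("SELECT * FROM read_parquet('s3://my-bucket/data.parquet')").show()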

Portable

  • DuckDB CLI

  • Python client

  • R client

  • Rust client

  • Java client, and more

PostgreSQL Dialect and Unique SQL Features

  • DuckDB uses the PostgreSQL dialect and incorporates new, unique SQL features of its own (see the sketch below)
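
Two examples of those conveniences, EXCLUDE and GROUP BY ALL, in a self-contained sketch:

    import duckdb

    con = duckdb.connect()
    con.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER)")
    con.execute("INSERT INTO t VALUES (1, 2, 3), (4, 5, 6)")

    con.sql("SELECT * EXCLUDE (c) FROM t").show()           # star minus a column
    con.sql("SELECT a, SUM(b) FROM t GROUP BY ALL").show()  # grouping keys inferred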

Extensible

  • User-defined functions (see the sketch after this list)

  • User-defined types

  • Geospatial Extension

  • Full-text search

  • JSON

  • Parquet
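
A sketch of the first item, registering a plain Python function as a scalar SQL function (the shout function is made up for illustration):

    import duckdb
    from duckdb.typing import VARCHAR

    con = duckdb.connect()

    def shout(s: str) -> str:
        return s.upper() + "!"

    # Register the function, then call it from SQL.
    con.create_function("shout", shout, [VARCHAR], VARCHAR)
    print(con.sql("SELECT shout('duckdb')").fetchone())  # ('DUCKDB!',)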

Cons

Small community

  • DuckDB has a rather small community, since it is a relatively new tool

Scalability

  • Embedded and single-node focused; it cannot scale out across multiple machines the way distributed engines can

Use cases

DuckDB is intended to be used for:

  • Embedded analytics

  • Data Science

  • Data processing

  • ETL pipelines and more

When to use it

  • Processing and storing tabular datasets, e.g., from CSV or Parquet files

  • Interactive data analysis, e.g., joining and aggregating multiple large tables

  • Concurrent large changes to multiple large tables, e.g., appending rows, adding/removing/updating columns

  • Transferring large result sets to the client

When not to use it

  • High-volume transactional use cases

  • Large client/server installations for centralized enterprise data warehousing

  • Writing to a single database from multiple concurrent processes

  • Multiple concurrent processes reading from a single writable database

Comparisons worth having a look at:

https://db-engines.com/en/system/DuckDB%3BSpark+SQL%3BTrino

https://db-engines.com/en/system/DuckDB%3BRedis%3BTrino

https://www.confessionsofadataguy.com/duckdb-vs-polars-for-data-engineering/