Recently, I started using DuckDB for some quick analysis of large data files that I had received. In doing so, I needed to figure out how to do things within DuckDB. This post is a collection of "tips" for doing things that I commonly need to do. You’re unlikely to find advanced tips here but if you’re also just getting started, this post might save you some amount of searching.
While these tips may apply to any language from which you execute queries, I ran all of these from the official DuckDB command line application.
In this post, we will look at common things to do with:
- Importing data into an in-memory and persistent database
- Querying aggregates of data
- Data Type Casting and Simple Text Operations
I’m assuming that you are already able to use DuckDB, or you can refer to this post for a basic guide to make it work on Windows and with Ruby.
Background to the Sample Data
Throughout this post, I will use some small sample data that I created myself. It is similar to the data that I originally used for exploring the features of DuckDB but has been changed enough that it doesn’t point to the real information that was involved.
The data is available from the duckdb_small_data repository on my GitHub page if you want to follow along by running the commands yourself and see how these individual tips might work in your case.
The data represents a hypothetical system that sends back a bunch of CSV files. Each CSV file has the below structure:
- Timestamp: A 14-digit numeric value that represents the timestamp (YYYYmmddHHMMSS) at which the data was captured.
- Sensor ID: The ID for the particular sensor. Note that the field name has a space in it rather than being simple like sensor_id.
- Value: A simple numeric value that represents the value captured by the sensor.
- Description: A small text description to have some data that is not numeric for our demonstrations.
In addition, each file is named like data-20250121-132000.csv: the filename starts with the word data, has an 8-digit date component and a 6-digit time component separated by a -, and ends with the .csv extension. The file name represents the time at which the data was exported from the system.
I have created a total of six files - we will bulk import five files and leave the last one to be loaded independently. The first five files are in the bulk folder and the individual file is in single, so our tree looks like the structure below.
├───bulk
│ data-20250119-131500.csv
│ data-20250119-132000.csv
│ data-20250120-131500.csv
│ data-20250120-132500.csv
│ data-20250121-132000.csv
│
└───single
data-20250122-131500.csv
Let’s get started.
Importing Data
We have CSV files, so we will use the CSV import capability directly. At this stage, we will use the command line client though most of the queries would work the same way if done programmatically.
Let’s first start duckdb - for now, while we explore, we will do it in memory.
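Starting the CLI with no arguments gives you a transient in-memory database (the banner below is what my install prints - yours may differ):
$ duckdb
v1.1.3 19864453f7
Enter ".help" for usage hints.
D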
Structure of the CSV
We have standard-looking CSVs and we want to see how DuckDB interprets their structure and whether it can import them. For this, we simply SELECT from a READ_CSV over one of the files. We don’t have a lot of records, but we add a LIMIT 10 anyway to restrict how many records are shown.
D SELECT * FROM READ_CSV('single/data-20250122-131500.csv') LIMIT 10;
┌────────────────┬───────────┬───────┬──────────────┐
│ Timestamp │ Sensor ID │ Value │ Description │
│ int64 │ int64 │ int64 │ varchar │
├────────────────┼───────────┼───────┼──────────────┤
│ 20250122130103 │ 108923 │ 80 │ Light OFF │
│ 20250122130144 │ 108911 │ 41 │ Valve Open │
│ 20250122130218 │ 108921 │ 43 │ NA │
│ 20250122130229 │ 108922 │ 54 │ Valve Open │
│ 20250122130403 │ 108901 │ 19 │ Valve Closed │
│ 20250122130524 │ 108922 │ 41 │ NA │
│ 20250122130528 │ 108911 │ 81 │ Light OK │
│ 20250122130652 │ 108914 │ 75 │ Fire OK │
│ 20250122130742 │ 108913 │ 39 │ Valve Closed │
│ 20250122130758 │ 108913 │ 41 │ Light OK │
├────────────────┴───────────┴───────┴──────────────┤
│ 10 rows 4 columns │
└───────────────────────────────────────────────────┘
Great! We notice that DuckDB seems to have mapped our data acceptably to data types based on what it found in the CSV.
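If you just want to see the inferred column names and types without any rows, you can also run a DESCRIBE over the same query - a quick sketch (the output is a small table of column names and types rather than the data itself):
D DESCRIBE SELECT * FROM READ_CSV('single/data-20250122-131500.csv');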
CREATE TABLE as SELECT (CTAS)
We know the structure is mapped fine, so we can now create a table to load this data into. The simplest way to create a table based on a CSV file (or a collection of files) is to use CREATE TABLE AS SELECT - this creates a table based on the schema of the result set of the SELECT. So, we can do this now with two small changes - drop the LIMIT clause from our original query and give a name for the table after CREATE TABLE.
CREATE TABLE data AS
SELECT * FROM READ_CSV('single/data-20250122-131500.csv');
Once this completes, you can probe the database by doing:
D CREATE TABLE data as
· SELECT * FROM READ_CSV('single/data-20250122-131500.csv');
Run Time (s): real 0.015 user 0.015625 sys 0.000000
D .tables
data
D .schema data
CREATE TABLE IF NOT EXISTS "data"("Timestamp" BIGINT, "Sensor ID" BIGINT, "Value" BIGINT, Description VARCHAR);
D select count(*) from data;
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 19 │
└──────────────┘
Run Time (s): real 0.004 user 0.000000 sys 0.000000
D select * from data limit 5;
┌────────────────┬───────────┬───────┬──────────────┐
│ Timestamp │ Sensor ID │ Value │ Description │
│ int64 │ int64 │ int64 │ varchar │
├────────────────┼───────────┼───────┼──────────────┤
│ 20250122130103 │ 108923 │ 80 │ Light OFF │
│ 20250122130144 │ 108911 │ 41 │ Valve Open │
│ 20250122130218 │ 108921 │ 43 │ NA │
│ 20250122130229 │ 108922 │ 54 │ Valve Open │
│ 20250122130403 │ 108901 │ 19 │ Valve Closed │
└────────────────┴───────────┴───────┴──────────────┘
Run Time (s): real 0.004 user 0.015625 sys 0.000000
We did a few things:
- .tables - in the command line application, this shows the names of all tables in the database.
- .schema data - given the name of a table, this shows the schema of the table. You can see that this matches what we had seen earlier.
Other than that, we just ran two queries:
- One to show the count/ number of records we had imported
- Another to show a list of a few records - again, this matches what we had seen earlier.
This is good progress. We loaded a single CSV file into an in-memory database and it’s ready to use.
A Persistent Database
The previous database that we created was an in-memory database so everything stays in memory as it is processed. If we want a database that we can access again, we need to create a persistent database on the file system. We exit the earlier client (by doing .quit
from the command line) and then start duckdb again by passing a file name for the database.
$ duckdb sensor_data.ddb
v1.1.3 19864453f7
Enter ".help" for usage hints.
D
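Everything we create in this session is now written to sensor_data.ddb, so if you .quit and later start the client again with the same file name, your tables will still be there. A quick sketch (assuming the data table from the next section has already been created):
$ duckdb sensor_data.ddb
D .tables
data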
Importing Multiple CSV Files
You can use READ_CSV to load multiple CSV files together. We have our files in the bulk folder and can use wildcards to point to multiple files. For simplicity, we will use bulk/data*.csv as the pattern to identify the files.
D CREATE TABLE data AS
· SELECT * FROM
· READ_CSV ('bulk/data*.csv');
D .tables
data
D .schema data
CREATE TABLE IF NOT EXISTS "data"("Timestamp" BIGINT, "Sensor ID" BIGINT, "Value" BIGINT, Description VARCHAR);
D select count(*) from data;
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 68 │
└──────────────┘
D select * from data limit 5;
┌────────────────┬───────────┬───────┬──────────────┐
│ Timestamp │ Sensor ID │ Value │ Description │
│ int64 │ int64 │ int64 │ varchar │
├────────────────┼───────────┼───────┼──────────────┤
│ 20250119130123 │ 108901 │ 34 │ Valve Open │
│ 20250119130124 │ 108902 │ 64 │ NA │
│ 20250119130129 │ 108901 │ 45 │ Valve Open │
│ 20250119130203 │ 108901 │ 41 │ Valve Closed │
│ 20250119130224 │ 108902 │ 60 │ NA │
└────────────────┴───────────┴───────┴──────────────┘
D
It imported 68 records from the 5 files that we had in the bulk
folder.
Common Options when Importing CSV files
We know that we loaded 68 records in all. It would be good to know how many were imported from each file. We can’t do that with our current schema - we need the name of the source file on each record that was imported. So, let’s DROP TABLE data; and execute the command again, but this time with filename = true added to the READ_CSV call.
D drop table data;
D CREATE TABLE data AS
· SELECT * FROM
· READ_CSV ('bulk/data*.csv', filename = true);
D .schema data
CREATE TABLE IF NOT EXISTS "data"("Timestamp" BIGINT, "Sensor ID" BIGINT, "Value" BIGINT, Description VARCHAR, filename VARCHAR);
D select * from data limit 5;
┌────────────────┬───────────┬───────┬──────────────┬───────────────────────────────┐
│ Timestamp │ Sensor ID │ Value │ Description │ filename │
│ int64 │ int64 │ int64 │ varchar │ varchar │
├────────────────┼───────────┼───────┼──────────────┼───────────────────────────────┤
│ 20250119130123 │ 108901 │ 34 │ Valve Open │ bulk\data-20250119-131500.csv │
│ 20250119130124 │ 108902 │ 64 │ NA │ bulk\data-20250119-131500.csv │
│ 20250119130129 │ 108901 │ 45 │ Valve Open │ bulk\data-20250119-131500.csv │
│ 20250119130203 │ 108901 │ 41 │ Valve Closed │ bulk\data-20250119-131500.csv │
│ 20250119130224 │ 108902 │ 60 │ NA │ bulk\data-20250119-131500.csv │
└────────────────┴───────────┴───────┴──────────────┴───────────────────────────────┘
We can again check the count of records and we can also craft the SQL to check record count grouped by filename.
D select count(*) from data;
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 68 │
└──────────────┘
D select filename AS file, count(*) from data GROUP BY file;
┌───────────────────────────────┬──────────────┐
│ file │ count_star() │
│ varchar │ int64 │
├───────────────────────────────┼──────────────┤
│ bulk\data-20250120-131500.csv │ 18 │
│ bulk\data-20250121-132000.csv │ 12 │
│ bulk\data-20250119-132000.csv │ 12 │
│ bulk\data-20250119-131500.csv │ 15 │
│ bulk\data-20250120-132500.csv │ 11 │
└───────────────────────────────┴──────────────┘
Changing Directories (Command Line Client)
Notice that each file name starts with bulk\ since that’s part of the path we used for the import. If you want to avoid that, in the command line client, you can do a .cd bulk, which changes the working directory to bulk, and then do the import.
D drop table data;
D .cd bulk
D CREATE TABLE data AS
· SELECT * FROM
· READ_CSV ('data*.csv', filename = true);
D select filename AS file, count(*) from data GROUP BY file;
┌──────────────────────────┬──────────────┐
│ file │ count_star() │
│ varchar │ int64 │
├──────────────────────────┼──────────────┤
│ data-20250121-132000.csv │ 12 │
│ data-20250119-132000.csv │ 12 │
│ data-20250120-132500.csv │ 11 │
│ data-20250120-131500.csv │ 18 │
│ data-20250119-131500.csv │ 15 │
└──────────────────────────┴──────────────┘
We changed the READ_CSV to read data*.csv since we changed the path to the bulk directory. We can get back to the previous directory now by doing .cd .. in the command line client.
Loading data from another CSV file
We have already created the table schema by doing a CTAS operation. If we now need to import another file into the table, we do an INSERT ... SELECT as shown below:
- We .cd into the single directory.
- We use .system dir to execute the dir command from DOS to see the files in the directory - there is only one.
- We execute the query to INSERT .. SELECT .. FROM read_csv (...)
D .cd single
D .system dir
Volume in drive D is D_DRIVE
Volume Serial Number is C2DE-A69A
Directory of D:\mohit\Desktop\data\single
2025-02-16 06:40 PM <DIR> .
2025-02-16 06:40 PM <DIR> ..
2025-02-16 06:24 PM 698 data-20250122-131500.csv
1 File(s) 698 bytes
2 Dir(s) 440,976,990,208 bytes free
D
D INSERT INTO data
· SELECT * FROM
· read_csv ('data-20250122-131500.csv', filename = true);
D
D select count(*) from data;
┌──────────────┐
│ count_star() │
│ int64 │
├──────────────┤
│ 87 │
└──────────────┘
D
D select filename AS file, count(*) from data GROUP BY file ORDER by file ASC;
┌──────────────────────────┬──────────────┐
│ file │ count_star() │
│ varchar │ int64 │
├──────────────────────────┼──────────────┤
│ data-20250119-131500.csv │ 15 │
│ data-20250119-132000.csv │ 12 │
│ data-20250120-131500.csv │ 18 │
│ data-20250120-132500.csv │ 11 │
│ data-20250121-132000.csv │ 12 │
│ data-20250122-131500.csv │ 19 │
└──────────────────────────┴──────────────┘
We can see that the additional records have been added to the table, and when we list the files (ordered by file ASC), the new file shows up in the list.
Other CSV Options
There is a long list of CSV import options but the most common ones to start with will be:
- header = true : this tells DuckDB that your CSV file has a header line
- delim : the default delimiter/separator for CSV is , but if your data is separated by something else, e.g., |, you can set delim = '|' when doing a read_csv (see the short sketch after this list)
- filename : set this to true to add a column to each record with the name of the file it was imported from. We did this earlier.
- ignore_errors : ignore parsing errors
- skip : allows you to skip a certain number of lines at the start. This can be helpful in cases where you have some header information at the start about the contents of the CSV file.
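As a quick sketch on our sample file (run from the top-level data folder), several of these options can be combined in a single READ_CSV call - the explicit header and delim values here simply spell out the defaults, so the result matches the earlier import:
D SELECT * FROM
· READ_CSV ('single/data-20250122-131500.csv',
·           header = true, delim = ',', filename = true)
· LIMIT 3;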
If you think you need some special support, it’s probably already there - so, do refer to the CSV import page once you start to need more advanced options. In particular, you may find parameters like normalize_names, encoding, quote, nullstr, names, dateformat, timestampformat, compression and types helpful. Importing and cleaning data is half the battle - so, don’t overlook this once you start getting serious about using DuckDB. Maybe we’ll look at more of these options with examples in a future post.
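For instance, our sample data uses the literal text NA in the Description column, which gets imported as a plain string. Here is a small sketch (run from the top-level data folder; nothing later depends on it) that uses nullstr to turn those into proper NULLs and count them:
D SELECT count(*) FROM
· READ_CSV ('single/data-20250122-131500.csv', nullstr = 'NA')
· WHERE Description IS NULL;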
Aggregating and Querying Data
At this stage, you might want to do some aggregations since you are trying to analyse the data that you received. Some of the common queries are as below.
(1) Let’s get the number of distinct sensors - see how we had to use the name in double quotes as "Sensor ID"
since the name has a space in it.
D select count(distinct ("Sensor ID")) from data;
┌─────────────────────────────┐
│ count(DISTINCT "Sensor ID") │
│ int64 │
├─────────────────────────────┤
│ 13 │
└─────────────────────────────┘
(2) We previously used GROUP BY and, of course, you can use it to count how many data items came from each sensor.
D select "Sensor ID" as sensor, count(*) from data GROUP by sensor ORDER by sensor ASC;
┌────────┬──────────────┐
│ sensor │ count_star() │
│ int64 │ int64 │
├────────┼──────────────┤
│ 108901 │ 18 │
│ 108902 │ 14 │
│ 108903 │ 2 │
│ 108905 │ 1 │
│ 108911 │ 17 │
│ 108912 │ 8 │
│ 108913 │ 13 │
│ 108914 │ 1 │
│ 108921 │ 3 │
│ 108922 │ 3 │
│ 108923 │ 1 │
│ 108950 │ 3 │
│ 108988 │ 3 │
├────────┴──────────────┤
│ 13 rows 2 columns │
└───────────────────────┘
(3) Simple Statistics - min, max, average and median
D select min(value), max(value), avg (value), median (value) from data;
┌──────────────┬──────────────┬────────────────────┬─────────────────┐
│ min("value") │ max("value") │ avg("value") │ median("value") │
│ int64 │ int64 │ double │ double │
├──────────────┼──────────────┼────────────────────┼─────────────────┤
│ 19 │ 81 │ 48.333333333333336 │ 41.0 │
└──────────────┴──────────────┴────────────────────┴─────────────────┘
(4) Max N or Min N Values - with or without DISTINCT. This is extremely useful to get the bottom N (or top N) values. You can select min (value, 3) to get the 3 lowest values (which may include duplicates) or min (distinct (value), 3) to get the 3 lowest distinct values. The same applies to max.
D select min (value, 3), min (distinct(value), 3), max (value, 3), max (distinct(value), 3) from data;
┌─────────────────┬──────────────────────────┬─────────────────┬──────────────────────────┐
│ min("value", 3) │ min(DISTINCT "value", 3) │ max("value", 3) │ max(DISTINCT "value", 3) │
│ int64[] │ int64[] │ int64[] │ int64[] │
├─────────────────┼──────────────────────────┼─────────────────┼──────────────────────────┤
│ [19, 19, 19] │ [19, 24, 29] │ [81, 81, 81] │ [81, 80, 76] │
└─────────────────┴──────────────────────────┴─────────────────┴──────────────────────────┘
(5) You can also use the histogram aggregate, which returns a map of each value to its count. This is similar to a SELECT .. count(*) .. GROUP BY query (see the sketch after the output below).
D select histogram(value) from data;
┌─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ histogram("value") │
│ map(bigint, ubigint) │
├─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ {19=4, 24=4, 29=1, 34=2, 35=5, 39=8, 41=20, 43=4, 45=5, 51=5, 54=4, 56=3, 60=3, 63=3, 64=2, 75=5, 76=1, 80=3, 81=5} │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
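For reference, the equivalent GROUP BY form would look something like the query below (output omitted - it returns the same counts, one row per value instead of a single map):
D select value, count(*) from data GROUP BY value ORDER BY value;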
(6) We will finish with one last example that returns a LIST corresponding to the lowest N values. To get the {Sensor ID, timestamp, value} for each of the lowest 3 values, we can use arg_min(("Sensor ID", timestamp, value), value, 3) on our data.
D select arg_min(("Sensor ID", timestamp, value), value, 3) from data;
┌────────────────────────────────────────────────────────────────────────────────────────────┐
│ arg_min(main."row"("Sensor ID", "timestamp", "value"), "value", 3) │
│ struct(bigint, bigint, bigint)[] │
├────────────────────────────────────────────────────────────────────────────────────────────┤
│ [(108905, 20250121131003, 19), (108901, 20250120131913, 19), (108901, 20250119131603, 19)] │
└────────────────────────────────────────────────────────────────────────────────────────────┘
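The mirror function, arg_max, should work the same way for the highest values - a quick sketch (not run above) to get the rows for the 3 highest values:
D select arg_max(("Sensor ID", timestamp, value), value, 3) from data;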
As is to be expected, there is a lot more and you should check the aggregate functions documentation for more details as you start to get deeper and need more statistics from the data that you have imported.
Data Type Casting and Text Operations
Casting data (changing it from one type to another) comes in very handy, as does text processing. We won’t look at a lot of examples here, but I needed to do some processing on the filenames and the timestamps in my data.
Extracting a Portion of Text
Extracting text is a very common requirement. We don’t have many text fields in our data, so we will use the filename
field. Let’s assume that we want to extract the date portion of our filename. We use substring
in this case since we need the 8 characters starting from position 6 (note: the 1st character is at position 1).
D select filename from data limit 5;
┌──────────────────────────┐
│ filename │
│ varchar │
├──────────────────────────┤
│ data-20250119-131500.csv │
│ data-20250119-131500.csv │
│ data-20250119-131500.csv │
│ data-20250119-131500.csv │
│ data-20250119-131500.csv │
└──────────────────────────┘
D select substring(filename, 6, 8) from data limit 5;
┌──────────────────────────────────┐
│ main."substring"(filename, 6, 8) │
│ varchar │
├──────────────────────────────────┤
│ 20250119 │
│ 20250119 │
│ 20250119 │
│ 20250119 │
│ 20250119 │
└──────────────────────────────────┘
You can also use [start:end]
on any string to extract a portion of it. So, the above query could be written using filename[6:13]
instead of substring.
D select filename[6:13] from data limit 5;
┌────────────────┐
│ filename[6:13] │
│ varchar │
├────────────────┤
│ 20250119 │
│ 20250119 │
│ 20250119 │
│ 20250119 │
│ 20250119 │
└────────────────┘
Concatenating Text
If we wanted to create a 14-character combination of the date and time, we could use 2 substrings and join them using || (which returns NULL if any part is NULL) or concat (which skips NULL values).
D select substring(filename, 6, 8) || substring (filename, 15, 6) from data limit 5;
┌─────────────────────────────────────────────────────────────────────────┐
│ (main."substring"(filename, 6, 8) || main."substring"(filename, 15, 6)) │
│ varchar │
├─────────────────────────────────────────────────────────────────────────┤
│ 20250119131500 │
│ 20250119131500 │
│ 20250119131500 │
│ 20250119131500 │
│ 20250119131500 │
└─────────────────────────────────────────────────────────────────────────┘
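Since none of these fields contain NULLs, the difference between the two doesn’t show up here. A minimal sketch with literal values makes it visible - the || version should come back as NULL, while concat should return just 'a':
D select 'a' || NULL AS with_pipes, concat('a', NULL) AS with_concat;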
There are many more functions for text processing that I did not need to use yet - so, I encourage you to read the text function documentation for more details.
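As a small taste, here is a quick sketch of two other common string functions (upper and replace), applied to the Description column - nothing later depends on these:
D select upper(Description), replace(Description, ' ', '_') from data limit 3;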
Casting Data Types
You need to change types to be able to operate on data using methods meant for a different type, e.g., you might need to convert a string to a number to perform mathematical operations, or convert a number to a string to extract the leftmost characters or concatenate it with other information.
In our data, we have "Sensor ID", which is a 6-digit number. Let’s assume that the first 2 digits give us the sensor batch and the last 4 digits give us the actual sensor ID, and we want to print the values as 10-8901 rather than 108901. This is tough to do with numbers but very easy to do with strings, using what we saw above.
D select distinct (concat (
· CAST ("Sensor ID" as VARCHAR)[1:2],
· '-',
· CAST ("Sensor ID" as VARCHAR)[3:6]
· ) ) AS "Sensor"
· FROM data LIMIT 10;
┌─────────┐
│ Sensor │
│ varchar │
├─────────┤
│ 10-8902 │
│ 10-8913 │
│ 10-8922 │
│ 10-8911 │
│ 10-8914 │
│ 10-8912 │
│ 10-8905 │
│ 10-8988 │
│ 10-8903 │
│ 10-8950 │
├─────────┤
│ 10 rows │
└─────────┘
By casting the number into a string, we can easily manipulate the data. We will use more casting as part of Date and Time operations in Part 2 of this post. For casting, the table on this page in the documentation is really important since it shows which data types you can convert from/to.
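Going the other way (string to number) is just as easy - for example, a quick sketch that pulls the date portion of the filename back out as a BIGINT using the slicing we saw earlier:
D select CAST(filename[6:13] AS BIGINT) AS file_date from data limit 3;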
Conclusion
We will stop Part 1 here - we have picked up a number of things to do with importing, aggregating and manipulating the data that we have. In Part 2, we will go on to look at date and time functions and list useful commands in the DuckDB CLI. We will also link to useful resources from that post.