Begin transmission…
Replicating Postgres into DuckDB
Earlier, we posted about using streaming logical replication in Postgres. We also wrote about exporting Postgres table data to DuckDB using the postgres scanner extension. In this post, I want to discuss a data pipeline that moves data from Postgres tables to an embedded DuckDB instance using logical replication.
There are two main challenges in building such a pipeline. First, we need to read the WAL logs to detect changes to the source tables. Second, we must generate DuckDB-compatible SQL statements to replay the change set in an embedded DuckDB instance. We have covered the first issue here and here. In this post, I want to tackle the second issue.
DuckDB-compatible SQL statements from WAL
There are many tools available to solve the first problem. We picked the wal2json extension to receive the change updates. Any new transaction on the source database emits a WAL record. Its JSON representation looks like the one shown below.
{
  "commit_lsn": 957398296,
  "records": [
    {
      "action": "I",
      "xid": 1058,
      "lsn": "0/3910B898",
      "nextlsn": "",
      "timestamp": "2023-08-22 14:44:04.043586-03",
      "schema": "public",
      "table": "t",
      "columns": [
        {
          "name": "id",
          "type": "numeric",
          "value": 0.5
        }
      ]
    }
  ]
}
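For context, a change feed like this comes from a logical replication slot that uses the wal2json output plugin. Below is a minimal sketch of creating such a slot and peeking at its pending changes; the slot name duckdb_sync is only an example, and the raw wal2json output may be shaped slightly differently from the aggregated record above depending on the options you pass.

-- Create a logical replication slot backed by the wal2json output plugin.
SELECT * FROM pg_create_logical_replication_slot('duckdb_sync', 'wal2json');

-- Peek at pending changes without consuming them from the slot.
SELECT data
FROM pg_logical_slot_peek_changes('duckdb_sync', NULL, NULL,
                                  'format-version', '2',
                                  'include-timestamp', 'true');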
The WAL record contains enough information to parse it and convert it into a DuckDB SQL query. For example, in this particular instance, we can tell that the source table t in the schema public has received an INSERT statement similar to INSERT INTO public.t (id) VALUES (0.5);
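wal2json emits similar records for updates and deletes (action "U" and "D"). As a hedged sketch, assuming the table's primary key (id here) is available as its replica identity, the statements we would replay in DuckDB look like this:

-- Update (action "U"): the old key values identify the row to change.
UPDATE public.t SET id = 0.7 WHERE id = 0.5;

-- Delete (action "D"): the old key values identify the row to remove.
DELETE FROM public.t WHERE id = 0.5;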
However, before we can replay any of these statements, we must create the required table in the destination database. To do so, we generate a SQL DDL statement that creates the table.
To know the types of the columns in the source table, we use a query similar to the one shown below. It inspects the source table and reads its metadata.
WITH primary_key_info AS (
  SELECT tc.constraint_schema,
         tc.table_name,
         ccu.column_name
  FROM information_schema.table_constraints tc
  JOIN information_schema.constraint_column_usage AS ccu USING (constraint_schema, constraint_name)
  WHERE constraint_type = 'PRIMARY KEY'
),
array_type_info AS (
  SELECT c.table_name,
         c.column_name,
         pg_catalog.format_type(t.oid, NULL) AS full_data_type
  FROM information_schema.columns AS c
  JOIN pg_catalog.pg_type AS t ON c.udt_name = t.typname
  WHERE c.data_type = 'ARRAY'
)
SELECT c.column_name,
       CASE
         WHEN c.data_type = 'ARRAY' THEN ati.full_data_type
         ELSE c.data_type
       END AS data_type,
       c.is_nullable = 'YES' AS is_nullable,
       pki.column_name IS NOT NULL AS is_primary
FROM information_schema.columns AS c
LEFT JOIN primary_key_info pki ON c.table_schema = pki.constraint_schema
  AND pki.table_name = c.table_name
  AND pki.column_name = c.column_name
LEFT JOIN array_type_info ati ON c.table_name = ati.table_name
  AND c.column_name = ati.column_name
WHERE c.table_name = 't'; -- <=== YOUR TABLE NAME
The query yields the column names, their types, and whether each column is nullable or part of the primary key. This information allows us to generate a CREATE TABLE statement, for instance: CREATE TABLE t (id boolean[]);
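Note that Postgres type names generally need to be mapped to their DuckDB equivalents when emitting the DDL (numeric, for example, typically maps to a DECIMAL or DOUBLE column). As a purely hypothetical illustration, not the pipeline's exact output, a source table with a non-nullable integer primary key and a nullable text array column would map to DuckDB DDL like this:

-- Hypothetical mapping: NOT NULL and PRIMARY KEY come from the metadata query,
-- and the Postgres array type is expressed as a DuckDB list type.
CREATE TABLE IF NOT EXISTS t (
  id INTEGER NOT NULL PRIMARY KEY,
  tags VARCHAR[]
);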
Thinking through navigation flows in the Studio
by Jim Kosem
When designing a user experience (UX), we designers often get deep into the weeds of the user interface (UI). We create screens, share them, and talk about them in meetings, then keep expanding and honing them until we feel we've arrived at decent, presentable, and testable solutions to users' technical problems and needs.
This week we took a stab at rethinking the navigation, applying what we've learned through two rounds of user testing and looking at how we can reduce the number of steps and screens the user needs to go through, making the flow both faster and less error-prone. We realised that managing tables is one thing, but managing tables and their deployments is more complicated still, so we need a new way to navigate between them. This required taking a completely new look at the UI and reworking how we think about its organisation.
Other updates this week
We’re in the process of exposing an HTTP API for Basin, which will let you use it programmatically from your own logic rather than through CLI commands only. It’s still going through internal testing, so stay tuned for more updates.
End transmission…
Want to dive deeper, ask questions, or just nerd out with us? Jump into our Discord for weekly research office hours. Or, for hands-on technical support, you can join our weekly developer office hours.
And if you’d like to discuss any of these topics in more detail, comment on the issue over in GitHub!