This is just a quick post so that I don’t forget these tricks in case I have to do similar things in the future. Also, just in case AI becomes prohibitively expensive, it is nice to just read dumb text :-)
Since DuckDB uses columnar storage, it compresses really well.
Forcing DuckDB to read all json keys
By default DuckDB doesn’t read all keys; it only samples a subset of rows to infer the schema. If you really want to convert all JSON keys to columns, do the following:
duckdb example.db -c "create table example_data as select * from read_ndjson('*.jsonl', sample_size=-1, union_by_name=true, map_inference_threshold=-1);"
The above ensures every row is scanned (`sample_size=-1`) and no MAP types are inferred (`map_inference_threshold=-1`), so keys stay as separate columns. In addition, it merges files with different keys, filling non-existing columns with NULL.
Create a table with unique entries only
Faster than doing this with an external application / script.
CREATE TABLE example_data_unique AS
SELECT
    id,
    MAX(COLUMNS(* EXCLUDE (id)))
FROM example_data
GROUP BY id;
Built-in User Interface
So, although very well documented, I didn’t figure this out until a few days later:
duckdb -ui example_data.db
The above will spawn a built-in user interface, which is pretty sweet, as shown in the screenshot below. Basically it contains all the essentials to explore the data.
