Advanced SQL
Aggregations
The simplest form of aggregation is the one Datasette does to implement its own faceting feature.
select
party,
count(*)
from
executive_terms
where
type = 'prez'
group by
party
The group by
creates groups of rows, then the aggregation functions calculate a value across that entire group.
The most common aggregation functions are:
count(*)
- count the number of rows in each groupmax(column)
- maximum value for a columnmin(column)
- minimum value for a columnsum(column)
- sum up the values in the column
Here’s an example of sum()
and count()
from What’s in the RedPajama-Data-1T LLM training set:
select
top_folders,
sum(size_gb) as total_gb,
count(*) as num_files
from raw
group by top_folders
order by sum(size_gb) desc
Change the total_gb
line to this to round it to the nearest integer:
cast (sum(size_gb) as integer) as total_gb,
Subqueries
SQLite has excellent support for subqueries. You can use them in where X in
clauses:
select html_url from releases where repo in (
select id from repos where full_name in (
select repo from plugin_repos
)
)
order by created_at desc
Run that on datasette.io. Sometimes I find these to be more readable than joins!
You can also use them directly in select
clauses:
select
full_name,
(
select
html_url
from
releases
where
releases.repo = repos.id
order by
created_at desc
limit
1
) as latest_release
from
repos
CTEs
CTE is a terrible name for an incredibly powerful feature. It stands for Common Table Expressions. Think of it as a way of creating an alias to a temporary table for the duration of a query.
with presidents as (
select
executives.name
from
executive_terms
join executives
on executive_terms.executive_id = executives.id
where
executive_terms.type = 'prez'
),
vice_presidents as (
select
executives.name
from
executive_terms
join executives
on executive_terms.executive_id = executives.id
where
executive_terms.type = 'viceprez'
)
select
distinct name
from
presidents
where name in vice_presidents
JSON
SQLite has excellent JSON functionality built in. Store JSON in a text
column and you can query it using json_extract()
- you can also build JSON values in select
queries.
Returning related rows in a single SQL query using JSON shows some advanced tricks you can do with this.
select
legislators.id,
legislators.name,
json_group_array(json_object(
'type', legislator_terms.type,
'state', legislator_terms.state,
'start', legislator_terms.start,
'end', legislator_terms.end,
'party', legislator_terms.party
)) as terms,
count(*) as num_terms
from
legislators join legislator_terms on legislator_terms.legislator_id = legislators.id
group by legislators.id
order by
id
limit
10
Paul Ford said about SQLite’s JSON support:
The JSON interface is like, “we save the text and when you retrieve it we parse the JSON at several hundred MB/s and let you do path queries against it please stop overthinking it, this is filing cabinet.”
Window functions
I wanted to run a query that would return the following:
The repository name
The date of the most recent release from that repository (the releases table is a many-to-one against repos)
The total number of releases
The three most recent releases (as a JSON array of objects)
with cte as (
select
repos.full_name,
releases.created_at,
releases.id as rel_id,
releases.name as rel_name,
releases.created_at as rel_created_at,
rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
from repos
left join releases on releases.repo = repos.id
)
select
full_name,
max(created_at) as max_created_at,
count(rel_id) as releases_count,
json_group_array(
json_object(
'id', rel_id,
'name', rel_name,
'created_at', rel_created_at
)
) filter (where rel_id is not null and rel_rank <= 3) as recent_releases
from cte
group by full_name
order by releases_count desc
Running this smaller query helps show what’s going on with that rel_rank
column:
with cte as (
select
repos.full_name,
releases.created_at,
releases.id as rel_id,
releases.name as rel_name,
releases.created_at as rel_created_at,
rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
from repos
left join releases on releases.repo = repos.id
)
select * from cte