Advanced SQL

Aggregations

The simplest form of aggregation is the one Datasette does to implement its own faceting feature.

select
  party,
  count(*)
from
  executive_terms
where
  type = 'prez'
group by
  party

Try that query here.

The group by creates groups of rows, then the aggregation functions calculate a value across that entire group.

The most common aggregation functions are:

  • count(*) - count the number of rows in each group

  • max(column) - maximum value for a column

  • min(column) - minimum value for a column

  • sum(column) - sum up the values in the column

Here’s an example of sum() and count() from What’s in the RedPajama-Data-1T LLM training set:

select
  top_folders,
  sum(size_gb) as total_gb,
  count(*) as num_files
from raw
group by top_folders
order by sum(size_gb) desc

Run that in Datasette Lite.

Change the total_gb line to this to round it to the nearest integer:

  cast (sum(size_gb) as integer) as total_gb,

Subqueries

SQLite has excellent support for subqueries. You can use them in where X in clauses:

select html_url from releases where repo in (
  select id from repos where full_name in (
    select repo from plugin_repos
  )
)
order by created_at desc

Run that on datasette.io. Sometimes I find these to be more readable than joins!

You can also use them directly in select clauses:

select
  full_name,
  (
    select
      html_url
    from
      releases
    where
      releases.repo = repos.id
    order by
      created_at desc
    limit
      1
  ) as latest_release
from
  repos

Run that here.

CTEs

CTE is a terrible name for an incredibly powerful feature. It stands for Common Table Expressions. Think of it as a way of creating an alias to a temporary table for the duration of a query.

with presidents as (
  select
    executives.name
  from
    executive_terms
    join executives
      on executive_terms.executive_id = executives.id
  where
    executive_terms.type = 'prez'
),
vice_presidents as (
  select
    executives.name
  from
    executive_terms
    join executives
      on executive_terms.executive_id = executives.id
  where
    executive_terms.type = 'viceprez'
)
select
  distinct name
from
  presidents
where name in vice_presidents

Try this CTE query here.

JSON

SQLite has excellent JSON functionality built in. Store JSON in a text column and you can query it using json_extract() - you can also build JSON values in select queries.

Returning related rows in a single SQL query using JSON shows some advanced tricks you can do with this.

select
  legislators.id,
  legislators.name,
  json_group_array(json_object(
    'type', legislator_terms.type,
    'state', legislator_terms.state,
    'start', legislator_terms.start,
    'end', legislator_terms.end,
    'party', legislator_terms.party
   )) as terms,
   count(*) as num_terms
from
  legislators join legislator_terms on legislator_terms.legislator_id = legislators.id
  group by legislators.id
order by
  id
limit
  10

Run that query.

Paul Ford said about SQLite’s JSON support:

The JSON interface is like, “we save the text and when you retrieve it we parse the JSON at several hundred MB/s and let you do path queries against it please stop overthinking it, this is filing cabinet.”

Window functions

I wanted to run a query that would return the following:

  • The repository name

  • The date of the most recent release from that repository (the releases table is a many-to-one against repos)

  • The total number of releases

  • The three most recent releases (as a JSON array of objects)

with cte as (
  select
    repos.full_name,
    releases.created_at,
    releases.id as rel_id,
    releases.name as rel_name,
    releases.created_at as rel_created_at,
    rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
  from repos
    left join releases on releases.repo = repos.id
)
select
  full_name,
  max(created_at) as max_created_at,
  count(rel_id) as releases_count,
  json_group_array(
    json_object(
      'id', rel_id,
      'name', rel_name,
      'created_at', rel_created_at
    )
  ) filter (where rel_id is not null and rel_rank <= 3) as recent_releases
from cte
group by full_name
order by releases_count desc

Run that query here.

Running this smaller query helps show what’s going on with that rel_rank column:

with cte as (
  select
    repos.full_name,
    releases.created_at,
    releases.id as rel_id,
    releases.name as rel_name,
    releases.created_at as rel_created_at,
    rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
  from repos
    left join releases on releases.repo = repos.id
)
select * from cte