Data analysis with SQLite and Python, PyCon 2023

On YouTube: Data analysis with SQLite and Python - 2hr45m

What you’ll need

python3 and pip

For the first part of this tutorial, you’ll need a Python 3 interpreter with the sqlite3 standard library module available.
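One quick way to check that sqlite3 is available, and to see which version of SQLite it bundles:

python3 -c "import sqlite3; print(sqlite3.sqlite_version)"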

You can run this on your own computer, or use a browser-based environment.

https://pyodide.org/en/stable/console.html will work for a purely browser-based (WebAssembly) environment.

For the second part, you’ll also need the ability to pip install Python packages.

Python 3 on your own laptop (maybe in a fresh virtual environment) is a good option here. You could also use GitHub Codespaces, Google Colab, JupyterHub, or another online notebook environment of your choice.

Optional: GitHub Codespaces

I’ll be working through the tutorial using GitHub Codespaces, using https://github.com/github/codespaces-jupyter

Animated demo of Codespaces Jupyter

Introduction to SQLite

Why SQLite?

  • It’s the database you already have - sqlite3 has been built into Python since 2006

  • It’s screamingly fast, and surprisingly powerful

  • Amazing compatibility: bindings for every language, files work on every platform, fantastic track record for backwards compatibility, so it’s safe to trust your data to a SQLite file

  • Databases are just files on disk. You can create and discard them without any ceremony.

  • It handles text (including JSON), integers, floating point numbers, and binary blobs. Which means it can store anything.

  • It can handle databases up to around 281TB in size(!)

  • It has some interesting characteristics, for example Many Small Queries Are Efficient In SQLite

First steps with Python

Let’s download a database to play with - we’ll use the database that powers the https://datasette.io/ website:

wget https://datasette.io/content.db

To access it from Python:

import sqlite3

db = sqlite3.connect("content.db")

print(db.execute("select sqlite_version()").fetchall())
# [('3.39.0',)]

# Show rows from the plugin_repos table
for row in db.execute("SELECT * FROM plugin_repos LIMIT 10"):
    print(row)

# Each row is a tuple. We can change that like this:
db.row_factory = sqlite3.Row

for row in db.execute("SELECT * FROM plugin_repos LIMIT 10"):
    print(row)

# This outputs <sqlite3.Row object at 0x7f5d3d8a3760>
# We can use dict() to turn those into dictionaries instead
for row in db.execute("SELECT * FROM plugin_repos LIMIT 10"):
    print(dict(row))

Creating a table

Let’s create a table:

db.execute("""
create table peps (
  id integer primary key,
  title text,
  author text,
  status text,
  type text,
  created text,
  body text
);
""")

Inserting some data

Here’s a function I wrote that can parse a PEP:

def parse_pep(s):
    intro, body = s.split("\n\n", 1)
    pep = {}
    current_key = None
    current_value = None
    for line in intro.split("\n"):
        # If the line starts with whitespace, it's a continuation of the previous value
        if line.startswith(" ") or line.startswith("\t"):
            if current_key is not None:
                current_value += " " + line.strip()
                pep[current_key] = current_value.strip()
        else:
            # Split the line into key and value
            parts = line.split(": ", 1)
            if len(parts) == 2:
                key, value = parts
                # Update the current key and value
                current_key = key
                current_value = value
                # Add the key-value pair to the pep dictionary
                pep[current_key] = current_value.strip()
    pep["Body"] = body.strip()
    return pep

Let’s fetch and parse the Zen of Python:

import urllib.request

zen = urllib.request.urlopen(
    "https://raw.githubusercontent.com/python/peps/main/pep-0020.txt"
).read().decode("utf-8")

pep = parse_pep(zen)

And insert that into our database:

db.execute("""
    insert into peps (
        id, title, author, status, type, created, body
    ) values (
        ?, ?, ?, ?, ?, ?, ?
    )
""", (
    pep["PEP"],
    pep["Title"],
    pep["Author"],
    pep["Status"],
    pep["Type"],
    pep["Created"],
    pep["Body"],
))

Since this is a dictionary already, we can use alternative syntax like this:

db.execute("delete from peps where id = 20")
db.execute("""
    insert into peps (
        id, title, author, status, type, created, body
    ) values (
        :PEP, :Title, :Author, :Status, :Type, :Created, :Body
    )
""", pep)

To confirm that it was correctly inserted:

print(db.execute("select * from peps").fetchall())

UPDATE and DELETE

To update a record:

with db:
    db.execute("""
    update peps set author = ?
    where id = ?
    """, ["Tim Peters", 20])

This will run in a transaction.

To delete a record:

with db:
    db.execute("""
    delete from peps
    where id = ?
    """, [20])

Or to delete everything:

with db:
    db.execute("delete from peps")

SQLite column types

SQLite's create table is simpler than in many other databases, because there are fewer column types to worry about. There are just four:

  • integer

  • real

  • text

  • blob

Unlike many other databases, length limits are neither required nor enforced - so don’t worry about varchar(255), just use text.

Tables automatically get an ID column called rowid - an incrementing integer. This will be the primary key if you don’t specify one.

If you specify integer primary key it will be auto-incrementing and will actually map to that underlying rowid.

You can set id text primary key for a text primary key - this will not auto-increment, so you will have to set it to a unique value for each row yourself. You could do this with UUIDs generated using uuid.uuid4(), for example.
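Here’s a minimal sketch of that pattern - the documents table and its columns are invented for illustration:

import sqlite3
import uuid

db = sqlite3.connect("example.db")
db.execute("create table documents (id text primary key, title text)")
with db:
    db.execute(
        "insert into documents (id, title) values (?, ?)",
        # Generate a new unique ID for this row
        (str(uuid.uuid4()), "My first document"),
    )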

SQLite is loosely typed by default: you can insert any type into any column, even if it conflicts with the column type!

A lot of people find this very uncomfortable.
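Here’s a quick sketch of that looseness in action, using a throwaway in-memory database - a string goes into an integer column without complaint:

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table loose (id integer primary key, age integer)")
# SQLite stores the string anyway, despite the integer column type
db.execute("insert into loose (age) values (?)", ("not a number",))
print(db.execute("select age, typeof(age) from loose").fetchall())
# [('not a number', 'text')]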

As of SQLite 3.37.0 (2021-11-27) you can set strict mode on a table to opt out of this loose typing:

create table peps (
  id integer primary key,
  title text,
  author text,
  body text
) strict
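With strict in place the same kind of insert fails. A quick sketch, assuming your SQLite is 3.37.0 or newer (the strict_demo table is invented for illustration):

import sqlite3

db = sqlite3.connect(":memory:")
db.execute("create table strict_demo (id integer primary key, age integer) strict")
try:
    db.execute("insert into strict_demo (age) values (?)", ("not a number",))
except sqlite3.Error as ex:
    # cannot store TEXT value in INTEGER column strict_demo.age
    print(ex)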

Transactions

Here’s an example of the impact transactions have on file-based databases:

>>> import sqlite3
>>> db = sqlite3.connect("/tmp/data.db")
>>> db.execute("create table foo (id integer primary key, name text)")
<sqlite3.Cursor object at 0x102ec5c40>

In another window:

% sqlite3 /tmp/data.db .dump
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE foo (id integer primary key, name text);
COMMIT;

Back in the Python window:

>>> db.execute('insert into foo (name) values (?)', ['text'])
<sqlite3.Cursor object at 0x102ec5bc0>

In the other window:

% sqlite3 /tmp/data.db .dump
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE foo (id integer primary key, name text);
COMMIT;

Back in the Python window, commit the transaction:

>>> db.commit()

And now:

% sqlite3 /tmp/data.db .dump
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE foo (id integer primary key, name text);
INSERT INTO foo VALUES(1,'text');
COMMIT;

A nicer pattern is to do this:

>>> with db:
...     db.execute('insert into foo (name) values (?)', ['text'])

The with db: wraps everything inside that block in a transaction - committing it when the block exits successfully, or rolling it back if an exception is raised.
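As a small sketch of what that means in practice, an exception raised inside the block causes the pending insert to be rolled back:

try:
    with db:
        db.execute('insert into foo (name) values (?)', ['never committed'])
        raise RuntimeError("something went wrong")
except RuntimeError:
    pass

print(db.execute(
    "select count(*) from foo where name = 'never committed'"
).fetchall())
# [(0,)]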

Exploring data with Datasette

Datasette is “an open source multi-tool for exploring and publishing data”.

Installing Datasette locally

pip install datasette

Or if you prefer pipx:

pipx install datasette

Or Homebrew (on macOS):

brew install datasette

More installation options.

In Codespaces you should also install the datasette-codespaces plugin:

datasette install datasette-codespaces

Try a database: legislators.db

wget https://congress-legislators.datasettes.com/legislators.db

This is a database of US legislators, presidents and vice presidents.

You can explore it online at https://congress-legislators.datasettes.com/legislators

Open it in Datasette like this:

datasette legislators.db

We’ll follow this tutorial to explore Datasette’s features: Exploring a database with Datasette

Install some plugins

Datasette has over a hundred plugins: https://datasette.io/plugins

You can pip install them, but it’s better to use datasette install, which ensures they end up in the same virtual environment as Datasette itself. This is especially useful if you installed Datasette with pipx or Homebrew.

datasette install datasette-cluster-map

Now restart Datasette and visit the “offices” table to see the result.

You can review what plugins are installed with:

datasette plugins

Or by visiting the /-/plugins page in Datasette.

Plugins can be uninstalled with:

datasette uninstall datasette-cluster-map

Learning SQL with Datasette

The “✎ View and edit SQL” link is a quick way to start learning basic SQL queries.

We’ll follow this tutorial next: Learn SQL with Datasette

sqlite-utils

sqlite-utils provides “CLI tool and Python utility functions for manipulating SQLite databases”.

You can install it the same way as Datasette:

pip install sqlite-utils

Or with pipx:

pipx install sqlite-utils

Or with Homebrew:

brew install sqlite-utils

It works as both a CLI tool and a Python library.

Using the command-line tools to clean data

We’ll follow this tutorial next: Cleaning data with sqlite-utils and Datasette
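To give a flavour of the CLI before we start, here’s a hypothetical example (the file and table names are made up) that loads a CSV file into a new database and then lists the tables with their row counts:

sqlite-utils insert cleaning.db manufacturers manufacturers.csv --csv
sqlite-utils tables cleaning.db --counts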

Using sqlite-utils as a Python library, to import all the PEPs

Let’s take our PEPs example from earlier and implement it again, but better, using sqlite-utils.

I’ll do this in a notebook.

!git clone https://github.com/python/peps /tmp/peps

We now have ALL of the PEPs in /tmp/peps

import pathlib

files = list(pathlib.Path("/tmp/peps").glob("pep-*.txt"))

And parse them with our function from earlier:

def parse_pep(s):
    intro, body = s.split("\n\n", 1)
    pep = {}
    current_key = None
    current_value = None
    for line in intro.split("\n"):
        # If the line starts with whitespace, it's a continuation of the previous value
        if line.startswith(" ") or line.startswith("\t"):
            if current_key is not None:
                current_value += " " + line.strip()
                pep[current_key] = current_value.strip()
        else:
            # Split the line into key and value
            parts = line.split(": ", 1)
            if len(parts) == 2:
                key, value = parts
                # Update the current key and value
                current_key = key
                current_value = value
                # Add the key-value pair to the pep dictionary
                pep[current_key] = current_value.strip()
    pep["Body"] = body.strip()
    return pep

peps = []
for file in files:
    peps.append(parse_pep(file.read_text()))

We now have a list of dictionaries. Let’s load them into SQLite:

%pip install sqlite-utils
import sqlite_utils
db = sqlite_utils.Database("/tmp/peps.db")
db["peps"].insert_all(peps, pk="PEP", alter=True, replace=True)

I got this error:

OperationalError: table peps has no column named PEP-Delegate

To fix that, add alter=True (which adds any missing columns automatically) and replace=True (which replaces existing rows that have a matching primary key):

db["peps"].insert_all(peps, pk="PEP", alter=True, replace=True)
print(db["peps"].count)
# Outputs 429 
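sqlite-utils can do more with the table once it’s loaded. As a sketch (not part of the original steps), this enables SQLite full-text search on the Title and Body columns and runs a search:

db["peps"].enable_fts(["Title", "Body"], create_triggers=True)
for row in db["peps"].search("decorators"):
    print(row["PEP"], row["Title"])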

Publishing a database to Vercel

First, install both Vercel and the datasette-publish-vercel plugin.

https://vercel.com/docs/cli has documentation for installing the Vercel CLI.

On macOS:

brew install vercel-cli

Or use one of these:

npm i -g vercel

Or:

pnpm i -g vercel

Now run this command to login:

vercel login

Install the plugin:

datasette install datasette-publish-vercel

And deploy the database:

datasette publish vercel /tmp/peps.db --project python-peps

Other publishing options

Datasette can publish to several hosting providers: Google Cloud Run and Heroku are supported out of the box, and plugins add further targets such as Vercel and Fly.

Further deployment options are described in the documentation.

Datasette Lite

It’s Datasette… running entirely in your web browser with WebAssembly and Pyodide!

https://lite.datasette.io/

Loading SQLite, CSV and JSON data
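Datasette Lite can load data from URLs passed in the query string. These examples are sketches based on the Datasette Lite documentation - the data URLs need to be served with CORS enabled:

https://lite.datasette.io/?url=https://datasette.io/content.db
https://lite.datasette.io/?csv=https://example.com/some-data.csv
https://lite.datasette.io/?json=https://example.com/some-data.json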

Installing plugins

Add ?install=name-of-plugin to the URL to pip install that plugin into your browser’s Python environment!

This only works with a subset of plugins.
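For example, something like this should load the datasette-copyable plugin (assuming it remains one of the supported pure-Python plugins):

https://lite.datasette.io/?install=datasette-copyable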

Further reading

Advanced SQL

Aggregations

The simplest form of aggregation is the one Datasette does to implement its own faceting feature.

select
  party,
  count(*)
from
  executive_terms
where
  type = 'prez'
group by
  party

Try that query here.

The group by creates groups of rows, then the aggregation functions calculate a value across that entire group.

The most common aggregation functions are:

  • count(*) - count the number of rows in each group

  • max(column) - maximum value for a column

  • min(column) - minimum value for a column

  • sum(column) - sum up the values in the column

Here’s an example of sum() and count() from What’s in the RedPajama-Data-1T LLM training set:

select
  top_folders,
  sum(size_gb) as total_gb,
  count(*) as num_files
from raw
group by top_folders
order by sum(size_gb) desc

Run that in Datasette Lite.

Change the total_gb line to this to truncate it to a whole number of gigabytes:

  cast (sum(size_gb) as integer) as total_gb,

Subqueries

SQLite has excellent support for subqueries. You can use them in where X in clauses:

select html_url from releases where repo in (
  select id from repos where full_name in (
    select repo from plugin_repos
  )
)
order by created_at desc

Run that on datasette.io. Sometimes I find these to be more readable than joins!

You can also use them directly in select clauses:

select
  full_name,
  (
    select
      html_url
    from
      releases
    where
      releases.repo = repos.id
    order by
      created_at desc
    limit
      1
  ) as latest_release
from
  repos

Run that here.

CTEs

CTE is a terrible name for an incredibly powerful feature. It stands for Common Table Expressions. Think of it as a way of creating an alias to a temporary table for the duration of a query.

with presidents as (
  select
    executives.name
  from
    executive_terms
    join executives
      on executive_terms.executive_id = executives.id
  where
    executive_terms.type = 'prez'
),
vice_presidents as (
  select
    executives.name
  from
    executive_terms
    join executives
      on executive_terms.executive_id = executives.id
  where
    executive_terms.type = 'viceprez'
)
select
  distinct name
from
  presidents
where name in vice_presidents

Try this CTE query here.

JSON

SQLite has excellent JSON functionality built in. Store JSON in a text column and you can query it using json_extract() - you can also build JSON values in select queries.

Returning related rows in a single SQL query using JSON shows some advanced tricks you can do with this.
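As a small illustration of json_extract() - the JSON literal here is invented - paths start with $ and can index into nested arrays:

select json_extract('{"name": "Cleo", "tags": ["dog", "good"]}', '$.name');
-- Cleo
select json_extract('{"name": "Cleo", "tags": ["dog", "good"]}', '$.tags[1]');
-- good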

select
  legislators.id,
  legislators.name,
  json_group_array(json_object(
    'type', legislator_terms.type,
    'state', legislator_terms.state,
    'start', legislator_terms.start,
    'end', legislator_terms.end,
    'party', legislator_terms.party
   )) as terms,
   count(*) as num_terms
from
  legislators join legislator_terms on legislator_terms.legislator_id = legislators.id
  group by legislators.id
order by
  id
limit
  10

Run that query.

Paul Ford said about SQLite’s JSON support:

The JSON interface is like, “we save the text and when you retrieve it we parse the JSON at several hundred MB/s and let you do path queries against it please stop overthinking it, this is filing cabinet.”

Window functions

I wanted to run a query that would return the following:

  • The repository name

  • The date of the most recent release from that repository (the releases table is a many-to-one against repos)

  • The total number of releases

  • The three most recent releases (as a JSON array of objects)

with cte as (
  select
    repos.full_name,
    releases.created_at,
    releases.id as rel_id,
    releases.name as rel_name,
    releases.created_at as rel_created_at,
    rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
  from repos
    left join releases on releases.repo = repos.id
)
select
  full_name,
  max(created_at) as max_created_at,
  count(rel_id) as releases_count,
  json_group_array(
    json_object(
      'id', rel_id,
      'name', rel_name,
      'created_at', rel_created_at
    )
  ) filter (where rel_id is not null and rel_rank <= 3) as recent_releases
from cte
group by full_name
order by releases_count desc

Run that query here.

Running this smaller query helps show what’s going on with that rel_rank column:

with cte as (
  select
    repos.full_name,
    releases.created_at,
    releases.id as rel_id,
    releases.name as rel_name,
    releases.created_at as rel_created_at,
    rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
  from repos
    left join releases on releases.repo = repos.id
)
select * from cte

Baked Data

The Baked Data architectural pattern describes this approach, which is key to taking full advantage of SQLite and Datasette.

I like to build my databases in GitHub Actions.

Niche Museums and TILs

Both of these sites have Atom feeds that are defined using a Datasette canned query, in conjunction with the datasette-atom plugin.

Generating a newsletter with an Observable notebook

I wrote about this in Semi-automating a Substack newsletter with an Observable notebook.

More demos and further reading

Fun demos

I post a lot of Datasette projects on my blog. Some of my recent favourites:

SpatiaLite