Data analysis with SQLite and Python, PyCon 2023
On YouTube: Data analysis with SQLite and Python - 2hr45m
What you’ll need
python3 and pip
For the first part of this tutorial, you’ll need a Python 3 interpreter with the sqlite3 standard library module available.
You can run this on your own computer, or use a browser-based environment.
https://pyodide.org/en/stable/console.html will work for a purely browser-based (WebAssembly) environment.
For the second part, you’ll also need the ability to pip install Python packages.
Python 3 on your own laptop (maybe in a fresh virtual environment) is a good option here. You could also use GitHub Codespaces or Google Colab or Jupyter Hub or your online notebook solution of choice.
Optional: GitHub Codespaces
I’ll be working through the tutorial using GitHub Codespaces, using https://github.com/github/codespaces-jupyter
Introduction to SQLite
Why SQLite?
It’s the database you already have - sqlite3 has been built into Python since 2006
It’s screamingly fast, and surprisingly powerful
Amazing compatibility: bindings for every language, files work on every platform, fantastic track record for backwards compatibility, so it’s safe to trust your data to a SQLite file
Databases are just files on disk. You can create and discard them without any ceremony.
It handles text (including JSON), integers, floating point numbers, and binary blobs. Which means it can store anything.
It can handle up to 2.8TB of data(!)
It has some interesting characteristics, for example Many Small Queries Are Efficient In SQLite
First steps with Python
Let’s download a database to play with - we’ll use the database that powers the https://datasette.io/ website:
wget https://datasette.io/content.db
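If you don’t have wget available, you can download the file from Python instead - a minimal sketch using urllib.request from the standard library:

import urllib.request

# Download content.db into the current directory
urllib.request.urlretrieve("https://datasette.io/content.db", "content.db")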
To access it from Python:
import sqlite3
db = sqlite3.connect("content.db")
print(db.execute("select sqlite_version()").fetchall())
# [('3.39.0',)]
# Show rows from the plugin_repos table
for row in db.execute("SELECT * FROM plugin_repos LIMIT 10"):
    print(row)

# Each row is a tuple. We can change that like this:
db.row_factory = sqlite3.Row
for row in db.execute("SELECT * FROM plugin_repos LIMIT 10"):
    print(row)
# This outputs <sqlite3.Row object at 0x7f5d3d8a3760>

# We can use dict() to turn those into dictionaries instead
for row in db.execute("SELECT * FROM plugin_repos LIMIT 10"):
    print(dict(row))
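sqlite3.Row objects also support lookups by column name, and .keys() lists the available columns - a quick sketch:

row = db.execute("SELECT * FROM plugin_repos LIMIT 1").fetchone()
print(row.keys())  # The column names for the plugin_repos table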
Creating a table
Let’s create a table:
db.execute("""
create table peps (
id integer primary key,
title text,
author text,
status text,
type text,
created text,
body text
);
""")
Inserting some data
Here’s a function I wrote that can parse a PEP:
def parse_pep(s):
    intro, body = s.split("\n\n", 1)
    pep = {}
    current_key = None
    current_value = None
    for line in intro.split("\n"):
        # If the line starts with whitespace, it's a continuation of the previous value
        if line.startswith(" ") or line.startswith("\t"):
            if current_key is not None:
                current_value += " " + line.strip()
                pep[current_key] = current_value.strip()
        else:
            # Split the line into key and value
            parts = line.split(": ", 1)
            if len(parts) == 2:
                key, value = parts
                # Update the current key and value
                current_key = key
                current_value = value
                # Add the key-value pair to the pep dictionary
                pep[current_key] = current_value.strip()
    pep["Body"] = body.strip()
    return pep
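Here’s the shape of the dictionary it returns, using a tiny made-up example rather than a real PEP:

example = "PEP: 9999\nTitle: An example\nAuthor: Some Person\n\nThe body goes here."
print(parse_pep(example))
# {'PEP': '9999', 'Title': 'An example', 'Author': 'Some Person', 'Body': 'The body goes here.'}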
Let’s fetch and parse the Zen of Python:
import urllib.request

zen = urllib.request.urlopen(
    "https://raw.githubusercontent.com/python/peps/main/pep-0020.txt"
).read().decode("utf-8")

pep = parse_pep(zen)
And insert that into our database:
db.execute("""
insert into peps (
id, title, author, status, type, created, body
) values (
?, ?, ?, ?, ?, ?, ?
)
""", (
pep["PEP"],
pep["Title"],
pep["Author"],
pep["Status"],
pep["Type"],
pep["Created"],
pep["Body"],
))
Since pep is already a dictionary, we can use named parameters instead:
db.execute("delete from peps where id = 20")
db.execute("""
insert into peps (
id, title, author, status, type, created, body
) values (
:PEP, :Title, :Author, :Status, :Type, :Created, :Body
)
""", pep)
To confirm that it was correctly inserted:
print(db.execute("select * from peps").fetchall())
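If you have several of these dictionaries, the sqlite3 module’s executemany() method will run the same insert once for each one - a sketch using insert or replace so it doesn’t conflict with the row we just added:

# Imagine a longer list of parsed PEP dictionaries here
db.executemany("""
insert or replace into peps (
    id, title, author, status, type, created, body
) values (
    :PEP, :Title, :Author, :Status, :Type, :Created, :Body
)
""", [pep])
db.commit()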
UPDATE and DELETE
To update a record:
with db:
    db.execute("""
        update peps set author = ?
        where id = ?
    """, ["Tim Peters", 20])
This will run in a transaction.
To delete a record:
with db:
    db.execute("""
        delete from peps
        where id = ?
    """, [20])
Or to delete everything:
delete from peps
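From Python that would look like this, using the same transaction pattern as before:

with db:
    db.execute("delete from peps")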
SQLite column types
SQLite’s create table is easier than in many other databases, because there are fewer column types to worry about. There are just four:
integer
real
text
blob
Unlike other databases, length limits are neither required nor enforced - so don’t worry about varchar(255), just use text.
Tables automatically get an ID column called rowid - an incrementing integer. This will be the primary key if you don’t specify one.
If you specify integer primary key it will be auto-incrementing and will actually map to that underlying rowid.
You can set id text primary key for a text primary key - this will not auto-increment; you will have to set it to a unique value for each row yourself. You could do this with UUIDs generated using uuid.uuid4(), for example.
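Here’s a minimal sketch of that pattern, using a hypothetical documents table:

import uuid

db.execute("create table if not exists documents (id text primary key, title text)")
db.execute(
    "insert into documents (id, title) values (?, ?)",
    (str(uuid.uuid4()), "An example document"),
)
db.commit()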
SQLite is loosely typed by default: you can insert any type into any column, even if it conflicts with the column type!
A lot of people find this very uncomfortable.
As of SQLite 3.37.0 (2021-11-27) you can set strict mode on a table to opt out of this loose typing:
create table peps (
id integer primary key,
title text,
author text,
body text
) strict
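Here’s a sketch of the difference that makes, using a scratch in-memory database (your Python needs to be using SQLite 3.37.0 or later for the strict table to work):

import sqlite3

test = sqlite3.connect(":memory:")
test.execute("create table loose (id integer primary key, age integer)")
test.execute("create table tight (id integer primary key, age integer) strict")

# Loose typing: a text value goes into an integer column without complaint
test.execute("insert into loose (age) values (?)", ["not a number"])

# Strict typing: the same insert raises an error
try:
    test.execute("insert into tight (age) values (?)", ["not a number"])
except sqlite3.Error as error:
    print(error)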
Transactions
Here’s an example of the impact transactions have on file-based databases:
>>> import sqlite3
>>> db = sqlite3.connect("/tmp/data.db")
>>> db.execute("create table foo (id integer primary key, name text)")
<sqlite3.Cursor object at 0x102ec5c40>
In another window:
% sqlite3 data.db .dump
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE foo (id integer primary key, name text);
COMMIT;
>>> db.execute('insert into foo (name) values (?)', ['text'])
<sqlite3.Cursor object at 0x102ec5bc0>
In the other window:
% sqlite3 data.db .dump
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE foo (id integer primary key, name text);
COMMIT;
>>> db.commit()
And now:
% sqlite3 data.db .dump
PRAGMA foreign_keys=OFF;
BEGIN TRANSACTION;
CREATE TABLE foo (id integer primary key, name text);
INSERT INTO foo VALUES(1,'text');
COMMIT;
A nicer pattern is to do this:
>>> with db:
...     db.execute('insert into foo (name) values (?)', ['text'])
The with db: block wraps everything inside it in a transaction.
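It also means that if an exception is raised inside the block the transaction is rolled back instead of committed - a quick sketch:

try:
    with db:
        db.execute('insert into foo (name) values (?)', ['will be rolled back'])
        raise ValueError("something went wrong")
except ValueError:
    pass

# The insert was rolled back, so this still shows only the committed rows:
print(db.execute('select * from foo').fetchall())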
Exploring data with Datasette
Datasette is “an open source multi-tool for exploring and publishing data”.
Installing Datasette locally
pip install datasette
Or if you prefer pipx:
pipx install datasette
Or Homebrew (on macOS):
brew install datasette
In Codespaces you should also install the datasette-codespaces plugin:
datasette install datasette-codespaces
Try a database: legislators.db
wget https://congress-legislators.datasettes.com/legislators.db
This is a database of US legislators, presidents and vice presidents.
You can explore it online at https://congress-legislators.datasettes.com/legislators
Open it in Datasette like this:
datasette legislators.db
We’ll follow this tutorial to explore Datasette’s features: Exploring a database with Datasette
Install some plugins
Datasette has over a hundred plugins: https://datasette.io/plugins
You can pip install them, but it’s better to use datasette install as that ensures they will go into the correct virtual environment - especially useful if you used pipx or Homebrew to install Datasette itself.
datasette install datasette-cluster-map
Now restart Datasette and visit the “offices” table to see the result.
You can review what plugins are installed with:
datasette plugins
Or by visiting the /-/plugins page in Datasette.
Plugins can be uninstalled with:
datasette uninstall datasette-cluster-map
Learning SQL with Datasette
The “✎ View and edit SQL” link is a quick way to start learning basic SQL queries.
We’ll follow this tutorial next: Learn SQL with Datasette
sqlite-utils
sqlite-utils provides “CLI tool and Python utility functions for manipulating SQLite databases”.
You can install it the same way as Datasette:
pip install sqlite-utils
Or with pipx:
pipx install sqlite-utils
Or with Homebrew:
brew install sqlite-utils
It works as both a CLI tool and a Python library.
Using the command-line tools to clean data
We’ll follow this tutorial next: Cleaning data with sqlite-utils and Datasette
Using sqlite-utils as a Python library, to import all the PEPs
Let’s take our PEPs example from earlier and implement it again, but better, using sqlite-utils.
I’ll do this in a notebook.
!git clone https://github.com/python/peps /tmp/peps
We now have ALL of the PEPs in /tmp/peps
import pathlib
files = list(pathlib.Path("/tmp/peps").glob("pep-*.txt"))
And parse them with our function from earlier:
def parse_pep(s):
    intro, body = s.split("\n\n", 1)
    pep = {}
    current_key = None
    current_value = None
    for line in intro.split("\n"):
        # If the line starts with whitespace, it's a continuation of the previous value
        if line.startswith(" ") or line.startswith("\t"):
            if current_key is not None:
                current_value += " " + line.strip()
                pep[current_key] = current_value.strip()
        else:
            # Split the line into key and value
            parts = line.split(": ", 1)
            if len(parts) == 2:
                key, value = parts
                # Update the current key and value
                current_key = key
                current_value = value
                # Add the key-value pair to the pep dictionary
                pep[current_key] = current_value.strip()
    pep["Body"] = body.strip()
    return pep
peps = []
for file in files:
    peps.append(parse_pep(file.read_text()))
We now have a list of dictionaries. Let’s load them into SQLite:
%pip install sqlite-utils
import sqlite_utils
db = sqlite_utils.Database("/tmp/peps.db")
db["peps"].insert_all(peps, pk="PEP", alter=True, replace=True)
I got this error:
OperationalError: table peps has no column named PEP-Delegate
To fix that:
db["peps"].insert_all(peps, pk="PEP", alter=True, replace=True)
print(db["peps"].count)
# Outputs 429
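You can inspect the schema sqlite-utils built - including the extra columns added by alter=True - using the table’s .schema property:

print(db["peps"].schema)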
Enabling full-text search
SQLite has surprisingly good full-text search built in.
sqlite-utils can help you enable it:
db["peps"].enable_fts(["Title", "Body"])
Datasette will detect this and add a search box to the top of the table page.
To run searches in relevance order you’ll need to execute a custom SQL query:
select
PEP,
peps.Title,
Version,
Author,
Status,
Type,
Created,
peps.Body,
peps_fts.rank
from
peps
join
peps_fts on peps.rowid = peps_fts.rowid
where
peps_fts match :search
order by
peps_fts.rank
limit
20
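From Python, sqlite-utils also has a table.search() method that runs the full-text query for you and yields rows in relevance order - a minimal sketch:

for row in db["peps"].search("namespaces"):
    print(row["PEP"], row["Title"])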
Publishing a database to Vercel
First, install both the Vercel CLI and the datasette-publish-vercel plugin.
https://vercel.com/docs/cli has documentation for installing the Vercel CLI.
On macOS:
brew install vercel-cli
Or use one of these:
npm i -g vercel
Or:
pnpm i -g vercel
Now run this command to login:
vercel login
Install the plugin:
datasette install datasette-publish-vercel
And deploy the database:
datasette publish vercel /tmp/peps.db --project python-peps
Other publishing options
Datasette can publish to the following providers:
Heroku (datasette publish heroku)
Google Cloud Run (datasette publish cloudrun)
Vercel (with datasette-publish-vercel)
Fly (with datasette-publish-fly)
Further deployment options are described in the documentation.
Datasette Lite
It’s Datasette… running entirely in your web browser with WebAssembly and Pyodide!
Loading SQLite, CSV and JSON data
SQLite: https://lite.datasette.io/?url=https://github.com/simonw/scrape-hmb-traffic/blob/main/hmb.db&install=datasette-copyable - the full link also includes a custom SQL query that joins the two commute directions by time - see Measuring traffic during the Half Moon Bay Pumpkin Festival
JSON: https://lite.datasette.io/?json=https://gist.github.com/simonw/73d15c0dd1025d1196829740bacf4464
Installing plugins
Add ?install=name-of-plugin to the URL to pip install that plugin into your browser’s environment!
This only works with a subset of plugins.
Further reading
Advanced SQL
Aggregations
The simplest form of aggregation is the one Datasette does to implement its own faceting feature.
select
party,
count(*)
from
executive_terms
where
type = 'prez'
group by
party
The group by clause creates groups of rows, then the aggregation functions calculate a value across each group.
The most common aggregation functions are:
count(*) - count the number of rows in each group
max(column) - maximum value for a column
min(column) - minimum value for a column
sum(column) - sum up the values in the column
Here’s an example of sum() and count() from What’s in the RedPajama-Data-1T LLM training set:
select
top_folders,
sum(size_gb) as total_gb,
count(*) as num_files
from raw
group by top_folders
order by sum(size_gb) desc
Change the total_gb line to this to round it down to a whole number of gigabytes:
cast (sum(size_gb) as integer) as total_gb,
Subqueries
SQLite has excellent support for subqueries. You can use them in where X in clauses:
select html_url from releases where repo in (
select id from repos where full_name in (
select repo from plugin_repos
)
)
order by created_at desc
Run that on datasette.io. Sometimes I find these to be more readable than joins!
You can also use them directly in select clauses:
select
full_name,
(
select
html_url
from
releases
where
releases.repo = repos.id
order by
created_at desc
limit
1
) as latest_release
from
repos
CTEs
CTE is a terrible name for an incredibly powerful feature. It stands for Common Table Expressions. Think of it as a way of creating an alias to a temporary table for the duration of a query.
with presidents as (
select
executives.name
from
executive_terms
join executives
on executive_terms.executive_id = executives.id
where
executive_terms.type = 'prez'
),
vice_presidents as (
select
executives.name
from
executive_terms
join executives
on executive_terms.executive_id = executives.id
where
executive_terms.type = 'viceprez'
)
select
distinct name
from
presidents
where name in vice_presidents
JSON
SQLite has excellent JSON functionality built in. Store JSON in a text column and you can query it using json_extract() - you can also build JSON values in select queries.
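Here’s a quick sketch of json_extract() from Python, using an in-memory database and a made-up row (this assumes your SQLite build includes the JSON functions, which recent versions do by default):

import json
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("create table docs (id integer primary key, data text)")
demo.execute(
    "insert into docs (data) values (?)",
    [json.dumps({"name": "example", "tags": ["sqlite", "json"]})],
)
print(demo.execute(
    "select json_extract(data, '$.name'), json_extract(data, '$.tags[1]') from docs"
).fetchall())
# [('example', 'json')]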
Returning related rows in a single SQL query using JSON shows some advanced tricks you can do with this.
select
legislators.id,
legislators.name,
json_group_array(json_object(
'type', legislator_terms.type,
'state', legislator_terms.state,
'start', legislator_terms.start,
'end', legislator_terms.end,
'party', legislator_terms.party
)) as terms,
count(*) as num_terms
from
legislators join legislator_terms on legislator_terms.legislator_id = legislators.id
group by legislators.id
order by
id
limit
10
Paul Ford said about SQLite’s JSON support:
The JSON interface is like, “we save the text and when you retrieve it we parse the JSON at several hundred MB/s and let you do path queries against it please stop overthinking it, this is filing cabinet.”
Window functions
I wanted to run a query that would return the following:
The repository name
The date of the most recent release from that repository (the releases table is a many-to-one against repos)
The total number of releases
The three most recent releases (as a JSON array of objects)
with cte as (
select
repos.full_name,
releases.created_at,
releases.id as rel_id,
releases.name as rel_name,
releases.created_at as rel_created_at,
rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
from repos
left join releases on releases.repo = repos.id
)
select
full_name,
max(created_at) as max_created_at,
count(rel_id) as releases_count,
json_group_array(
json_object(
'id', rel_id,
'name', rel_name,
'created_at', rel_created_at
)
) filter (where rel_id is not null and rel_rank <= 3) as recent_releases
from cte
group by full_name
order by releases_count desc
Running this smaller query helps show what’s going on with that rel_rank column:
with cte as (
select
repos.full_name,
releases.created_at,
releases.id as rel_id,
releases.name as rel_name,
releases.created_at as rel_created_at,
rank() over (partition by repos.id order by releases.created_at desc) as rel_rank
from repos
left join releases on releases.repo = repos.id
)
select * from cte
Baked Data
The Baked Data architectural pattern describes this approach, which is key to taking full advantage of SQLite and Datasette.
I like to build my databases in GitHub Actions.
Niche Museums and TILs
https://www.niche-museums.com/ is published from the https://github.com/simonw/museums repository - one big YAML file for the content.
https://til.simonwillison.net/ is published from https://github.com/simonw/til - separate Markdown files for each item.
Both of these sites have Atom feeds that are defined using a Datasette canned query, in conjunction with the datasette-atom plugin.
More demos and further reading
Fun demos
I post a lot of Datasette projects on my blog. Some of my recent favourites:
Exploring the training data behind Stable Diffusion describes how I built laion-aesthetic.datasette.io to explore 12m training images used for Stable Diffusion.
Analyzing ScotRail audio announcements with Datasette—from prototype to production explains the scotrail.datasette.io project.
I built a ChatGPT plugin to answer questions about data hosted in Datasette is one of my earlier explorations at the intersection of LLM AI and Datasette.
SpatiaLite
Building a location to time zone API with SpatiaLite shows how to use SpatiaLite and Datasette to create an API for looking up time zones based on latitude/longitude points.
Drawing shapes on a map to query a SpatiaLite database introduces the datasette-leaflet-freedraw plugin and the calands.datasettes.com demo.