Simon Willison’s Weblog

Weeknotes: Getting ready for NICAR

27th February 2024

Next week is NICAR 2024 in Baltimore—the annual data journalism conference hosted by Investigative Reporters and Editors. I’m running a workshop on Datasette, and I plan to spend most of my time in the hallway track talking to people about Datasette, Datasette Cloud and how the Datasette ecosystem can best help support their work.

I’ve been working with Alex Garcia to get Datasette Cloud ready for the conference. We have a few new features that we’re putting the final touches on, in addition to ensuring features like Datasette Enrichments and Datasette Comments are in good shape for the event.

Releases

  • llm-mistral 0.3—2024-02-26
    LLM plugin providing access to Mistral models using the Mistral API

Mistral released Mistral Large this morning, so I rushed out a new release of my llm-mistral plugin to add support for it.

pipx install llm
llm install llm-mistral --upgrade
llm keys set mistral
# <Paste in your Mistral API key>
llm -m mistral-large 'Prompt goes here'

The plugin now hits the Mistral API endpoint that lists models (via a cache), which means future model releases should be supported automatically without needing a new plugin release.

  • dclient 0.3—2024-02-25
    A client CLI utility for Datasette instances

dclient provides a tool for interacting with a remote Datasette instance. You can use it to run queries:

dclient query https://datasette.io/content \
  "select * from news limit 3"

You can set aliases for your Datasette instances:

dclient alias add simon https://simon.datasette.cloud/data

And for Datasette 1.0 alpha instances with the write API (as seen on Datasette Cloud) you can insert data into a new or an existing table:

dclient auth add simon
# <Paste in your API token>
dclient insert simon my_new_table data.csv --create

The 0.3 release adds improved support for streaming data into a table. You can run a command like this:

tail -f log.ndjson | dclient insert simon my_table \
  --nl - --interval 5 --batch-size 20

The --interval 5 option is new: it means that records will be written to the API if 5 seconds have passed since the last write. The --batch-size 20 option means that records will be written in batches of 20, and a batch will be sent as soon as it is full or the interval has passed.
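The flush rule described above can be sketched in a few lines of Python. This is an illustration of the "batch is full or the interval has passed" idea only, not dclient's actual implementation: the function name stream_in_batches, the send callback and the injectable clock are all hypothetical, and this simplified version only checks the interval when a new record arrives.

```python
import time
from typing import Callable, Iterable, List


def stream_in_batches(
    records: Iterable[dict],
    send: Callable[[List[dict]], None],
    batch_size: int = 20,
    interval: float = 5.0,
    clock: Callable[[], float] = time.monotonic,
) -> None:
    """Send records in batches (hypothetical helper, not part of dclient).

    A batch is flushed when it reaches batch_size, or when at least
    `interval` seconds have passed since the previous write.
    """
    batch: List[dict] = []
    last_write = clock()
    for record in records:
        batch.append(record)
        batch_full = len(batch) >= batch_size
        if batch_full or clock() - last_write >= interval:
            send(batch)
            batch = []  # start a fresh list so sent batches are not mutated
            last_write = clock()
    if batch:
        # Flush whatever is left once the input stream ends
        send(batch)
```

Passing a fake clock makes the timing behaviour easy to exercise without waiting for real seconds to pass.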

I wrote about the new Datasette Events mechanism in the 1.0a8 release notes. This new plugin was originally built for Datasette Cloud—it forwards analytical events from an instance to a central analytics instance. Using Datasette Cloud for analytics for Datasette Cloud is a pleasing exercise in dogfooding.

A tiny cosmetic bug fix.

  • datasette 1.0a11—2024-02-19
    An open source multi-tool for exploring and publishing data

I’m increasing the frequency of the Datasette 1.0 alphas. This one has a minor permissions fix (the ability to replace a row using the insert API now requires the update-row permission) and a small cosmetic fix which I’m really pleased with: the menus displayed by the column action menu now align correctly with their cog icon!

Screenshot: clicking on a cog icon now shows a menu directly below that icon, with a little grey arrow in the right place to align with the icon that was clicked.

This is a pretty significant release: it adds finely grained permission support such that Datasette’s core create-table, alter-table and drop-table permissions are now respected by the plugin.

The alter-table permission was introduced in Datasette 1.0a9 a couple of weeks ago.

When testing permissions it’s useful to have a really convenient way to sign in to Datasette using different accounts. This plugin provides that, but only if you start Datasette with custom plugin configuration or by using this new 1.0 alpha shortcut setting option:

datasette -s plugins.datasette-unsafe-actor-debug.enabled 1

An experiment in bundling plugins. pipx install datasette-studio gets you an installation of Datasette under a separate alias—datasette-studio—which comes preconfigured with a set of useful plugins.

The really fun thing about this one is that the entire package is defined by a pyproject.toml file, with no additional Python code needed. Here’s a truncated copy of that TOML:

[project]
name = "datasette-studio"
version = "0.1a0"
description = "Datasette pre-configured with useful plugins"
requires-python = ">=3.8"
dependencies = [
    "datasette>=1.0a10",
    "datasette-edit-schema",
    "datasette-write-ui",
    "datasette-configure-fts",
    "datasette-write",
]

[project.entry-points.console_scripts]
datasette-studio = "datasette.cli:cli"

I think it’s pretty neat that a full application can be defined like this in terms of 5 dependencies and a custom console_scripts entry point.

Datasette Studio is still very experimental, but I think it’s pointing in a promising direction.

This resolves a dreaded “database locked” error I was seeing occasionally in Datasette Cloud.

Short version: SQLite, when running in WAL mode, is almost immune to those errors... provided you remember to run all write operations in short, well-defined transactions.

I’d forgotten to do that in this plugin and it was causing problems.

After shipping this release I decided to make it much harder to make this mistake in the future, so I released Datasette 1.0a10 which now automatically wraps calls to database.execute_write_fn() in a transaction even if you forget to do so yourself.
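To illustrate the general pattern using only Python's standard sqlite3 module, here is a minimal sketch of enabling WAL mode and running each write inside a short transaction. It is not Datasette's implementation of database.execute_write_fn(); the helper names open_wal_connection, write_in_transaction and wal_demo are hypothetical.

```python
import os
import sqlite3
import tempfile
from typing import Any, Callable


def open_wal_connection(path: str) -> sqlite3.Connection:
    """Open a file-backed database and switch it to WAL journal mode."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def write_in_transaction(
    conn: sqlite3.Connection, fn: Callable[[sqlite3.Connection], Any]
) -> Any:
    """Run fn(conn) in a transaction (hypothetical helper).

    Using the connection as a context manager commits if fn returns
    and rolls back if fn raises, so the write lock is held only briefly.
    """
    with conn:
        return fn(conn)


def wal_demo() -> list:
    """Show that a failed write is rolled back and a good one is kept."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = open_wal_connection(os.path.join(tmp, "demo.db"))
        try:
            conn.execute("CREATE TABLE log (message TEXT)")
            write_in_transaction(
                conn, lambda c: c.execute("INSERT INTO log VALUES ('ok')")
            )

            def failing(c: sqlite3.Connection) -> None:
                c.execute("INSERT INTO log VALUES ('bad')")
                raise RuntimeError("boom")

            try:
                write_in_transaction(conn, failing)
            except RuntimeError:
                pass  # the 'bad' insert has been rolled back
            return [row[0] for row in conn.execute("SELECT message FROM log")]
        finally:
            conn.close()
```

Keeping the transaction boundary in one helper means individual write functions cannot accidentally leave a transaction open.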

Blog entries

My first full blog post of the year to end up on Hacker News, where it sparked a lively conversation with 489 comments!

TILs

Yet another experiment with audit tables in SQLite. This one uses a terrifying nested sequence of json_patch() calls to assemble a JSON document describing the change made to the table.
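As a rough idea of what nested json_patch() calls in a trigger can look like, here is a small sketch run through Python's sqlite3 module. The items and items_audit tables, their columns and the trigger are made up for illustration and are not taken from the TIL; the sketch assumes a SQLite build that includes the JSON functions, and it records only the previous values of columns that changed.

```python
import json
import sqlite3

# Hypothetical schema: each changed column contributes a one-key JSON object,
# and nested json_patch() calls merge those objects into a single document.
# Caveat: json_patch() treats null values in the patch as deletions, so this
# simple version would drop a column whose previous value was NULL.
SCHEMA = """
CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT, price REAL);
CREATE TABLE items_audit (
    id INTEGER PRIMARY KEY,
    item_id INTEGER,
    previous_values TEXT
);
CREATE TRIGGER items_audit_update AFTER UPDATE ON items
BEGIN
    INSERT INTO items_audit (item_id, previous_values)
    VALUES (
        OLD.id,
        json_patch(
            json_patch(
                '{}',
                CASE WHEN OLD.name IS NOT NEW.name
                     THEN json_object('name', OLD.name) ELSE '{}' END
            ),
            CASE WHEN OLD.price IS NOT NEW.price
                 THEN json_object('price', OLD.price) ELSE '{}' END
        )
    );
END;
"""


def audit_demo() -> list:
    """Apply two updates and return the audit documents they produced."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO items (id, name, price) VALUES (1, 'Widget', 9.5)")
    conn.execute("UPDATE items SET price = 12.0 WHERE id = 1")
    conn.execute("UPDATE items SET name = 'Gadget', price = 15.0 WHERE id = 1")
    rows = conn.execute(
        "SELECT previous_values FROM items_audit ORDER BY id"
    ).fetchall()
    conn.close()
    return [json.loads(row[0]) for row in rows]
```

Each additional audited column adds one more level of nesting, which is where the "terrifying" part comes from.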

Val Town is a very neat attempt at solving another of my favourite problems: how to execute user-provided code safely in a sandbox. It turns out to be the perfect mechanism for running simple scheduled functions such as code that reads data and writes it to Datasette Cloud using the write API.

FIPS is the Federal Information Processing Standard, and systems that obey it refuse to run Datasette due to its use of MD5 hash functions. I figured out how to get that to work anyway, since Datasette’s MD5 usage is purely cosmetic, not cryptographic.
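One way to mark an MD5 call as non-cryptographic is the usedforsecurity keyword that Python's hashlib accepts from Python 3.9 onwards. The sketch below shows that approach; it is not necessarily the exact fix described in the TIL, and the helper name cosmetic_md5_hex is hypothetical.

```python
import hashlib


def cosmetic_md5_hex(text: str) -> str:
    """Return an MD5 hex digest for non-security uses (hypothetical helper)."""
    data = text.encode("utf-8")
    try:
        # Python 3.9+: declare that this hash is not used for security,
        # which lets restricted (for example FIPS-enabled) builds allow it
        return hashlib.md5(data, usedforsecurity=False).hexdigest()
    except TypeError:
        # Older Python versions do not accept the keyword argument
        return hashlib.md5(data).hexdigest()
```

A digest like this is suitable for things such as deriving a stable color from a name, not for anything where collision resistance matters.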

This actually showed up on Hacker News without me noticing until a few days later, where many people told me that I should rewire my existing Ethernet cables rather than resorting to more exotic solutions.

I guess this is another super lightweight form of RAG: you can use the rg context options (include X lines before/after each match) to assemble just enough context to get useful answers to questions about code.