Tim Trailor
Essay

Lessons as code: turning postmortems into pre-flight checks

A file I read at the start of every session: twenty-three numbered patterns of how I have broken my own system, and the pre-flight skill that checks proposed work against them. The pattern is the most portable thing on this site.

I keep a file called lessons.md. It has twenty-three numbered patterns in it as of this writing. Each pattern is a specific way I have broken my own system and the rule I now apply to prevent a recurrence. Every pattern in the file has happened to me at least twice. The file is read at the start of every session. Before any non-trivial change, a skill called /lessons-check compares the proposed change against the patterns and flags any match.

This is the most broadly useful piece of infrastructure on this site. You do not need any of the rest of my setup to adopt it. The pattern is: treat postmortems as artefacts that get reused, not as summaries that get filed and forgotten. This post is the worked version of what that looks like and why it compounds.

The file itself

lessons.md is a structured markdown file at memory/topics/lessons.md. Every entry has the same shape:

## Pattern N: <one-line name>
**What happened**: Sequence of events, immediate cause.
**Incidents**: Dated references to the specific times this happened.
**Prevention**: The rule now applied. If technical enforcement exists, link it.

An example (Pattern 1 in my file, paraphrased here for readability):

## Pattern 1: "Fix creates new problem"
**What happened**: Every automated helper I built to make my system more
reliable ended up making it less reliable. Uninterruptible power supply
(UPS) watchdog killed prints on USB glitches. Auto-speed adjuster killed
prints with mid-flight parameter changes. Enhanced Power Loss Recovery
(PLR) chain triggered SAVE_CONFIG during pauses. Printer daemon's
auto-recovery path sent FIRMWARE_RESTART mid-print.
**Incidents**: 2026-03-12 (watchdog), 2026-03-01 (PLR race), 2026-03-05
(SAVE_CONFIG during print), 2026-03-11 (FIRMWARE_RESTART mid-print)
**Prevention**: Before creating any automated process that touches a
physical system, answer five questions:
  1. What commands can this code send?
  2. Does it check state before EVERY action?
  3. What happens if the network drops mid-execution?
  4. What happens if the target is already in error state?
  5. Can I stop it with a single command?
If any answer is "I don't know", do not ship it.

The specific wording matters. Each element of the entry does a different job.

What happened captures the sequence for someone reading the entry cold, including me six months from now. The details have to be specific enough that the entry is not generic. “A daemon caused problems” is not an entry; “the UPS watchdog paused the printer twice in five minutes on 2026-03-12 during an 18-hour print because CyberPower’s USB HID link returns false ‘on battery’ readings at sub-minute resolution” is.

Incidents cite the dates and the specific failures. This is the evidence. It converts the entry from an abstract claim to a record of concrete experience. A claim without dates is a belief; a claim with dates is a finding.

Prevention is the rule applied going forward. The rule has to be concrete enough that it can be checked mechanically or by a human against a specific proposed change. “Be careful with daemons” is not a rule; “before any automated process that touches a physical system, answer five specific questions” is.

Severity tiers

The file is organised into two tiers, read differently.

Tier 1 patterns caused real damage. Destroyed prints. Lost work. Broken production infrastructure. Data corruption. A security exposure. Tier 1 patterns are read at every session start. There are nine of them in my file as of this writing. The list is deliberately short; the rule is that a pattern earns Tier 1 only when it has caused concrete harm, not just when it was annoying.

Tier 2 patterns are behavioural. They did not cause damage, but they wasted time, produced rework, or introduced operational smell. They are read when relevant (at the start of autonomous mode, before infrastructure changes, when touching shared state). There are fourteen Tier 2 patterns in my file.

The distinction is useful because reading all twenty-three patterns at every session start would be expensive and would dilute the attention given to the ones that actually matter. By tiering, I can load the critical ones into every session and surface the others contextually.
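The tier split is mechanical enough to sketch. Here is a minimal loader, assuming a hypothetical layout where entries sit under `# Tier 1` / `# Tier 2` headings; the markers in my actual file may differ, but the shape is the same:

```python
import re

def load_patterns(text):
    """Parse a lessons.md-style file into pattern dicts with tier tags.

    Assumes entries are grouped under '# Tier 1' / '# Tier 2' headings
    and named '## Pattern N: <name>' -- a hypothetical layout.
    """
    tier, current, patterns = None, None, []
    for line in text.splitlines():
        if m := re.match(r"# Tier (\d+)\b", line):
            tier = int(m.group(1))
        elif m := re.match(r"## Pattern (\d+): (.+)", line):
            current = {"number": int(m.group(1)), "name": m.group(2),
                       "tier": tier, "body": []}
            patterns.append(current)
        elif current is not None:
            current["body"].append(line)
    return patterns

def session_start_context(patterns):
    """Tier 1 only: the short list read at every session start."""
    return [p for p in patterns if p["tier"] == 1]

def contextual(patterns):
    """Tier 2: surfaced before autonomous mode or infrastructure work."""
    return [p for p in patterns if p["tier"] == 2]
```

Loading only `session_start_context()` every session and `contextual()` on demand is the whole trick: the critical nine are always in context, the rest cost nothing until they are relevant.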

Technical enforcement or text

The most important distinction in the file is between patterns that have technical enforcement and patterns that rely on text rules.

Technical enforcement means the pattern cannot recur because the system itself refuses to allow it. Examples from my file:

  • Pattern 3 (silent hook failures) is enforced by a scenarios test suite at tim-claude-controlplane/scenarios/test_behavioral.py. Thirteen integration tests feed real Claude Code JSON payloads into each hook and assert the correct allow or deny behaviour. A hook that silently passes every command, which is what happened for months in early 2026, would now fail the scenarios suite and block deployment.
  • Pattern 5 (printer safety) is enforced by Klipper macros that refuse SAVE_CONFIG during a print, a PreToolUse hook that blocks FIRMWARE_RESTART, and SSH access controls. The full architecture is in a companion post.
  • Pattern 9 (shared OAuth token file with multiple writers) is enforced by a single LaunchAgent (com.timtrailor.token-refresh) designated as the sole writer. Every other script reads the token and refreshes in memory, but cannot write back.
  • Pattern 12 (plutil -extract overwriting files in-place) is enforced by a settings hook that blocks the dangerous flag combination, plus chflags uchg on the plist files.
  • Pattern 13 (launchctl operations) is enforced by the same protected_path_hook, which blocks launchctl bootstrap, bootout, kickstart, load, and unload without explicit approval.
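To make "technical enforcement" concrete, here is a sketch of what a Pattern 12 guard could look like as a PreToolUse hook. It assumes the documented Claude Code hook contract (the tool call arrives as JSON on stdin; exit code 2 blocks the call and feeds stderr back to the model); the real hook does more, this shows only the shape:

```python
#!/usr/bin/env python3
"""Sketch of a PreToolUse guard for Pattern 12 (plutil -extract in place).

Illustrative only: assumes Claude Code's hook contract, where the tool
call arrives as JSON on stdin and exit code 2 blocks the call, feeding
stderr back to the model.
"""
import json
import sys

def check_command(command):
    """Return (exit_code, reason) for a proposed shell command."""
    # Without '-o -', plutil -extract rewrites the plist file in place.
    if "plutil" in command and "-extract" in command and "-o -" not in command:
        return 2, ("Blocked (Pattern 12): plutil -extract without '-o -' "
                   "overwrites the plist in place. Add '-o -' to write "
                   "the extracted value to stdout instead.")
    return 0, ""

if __name__ == "__main__":
    event = json.load(sys.stdin)
    command = event.get("tool_input", {}).get("command", "")
    code, reason = check_command(command)
    if reason:
        print(reason, file=sys.stderr)
    sys.exit(code)
```

Pulling the check out into a function is what makes the Pattern 3 style of verification possible: a scenarios test feeds payloads through check_command and asserts the exit code, so a hook that silently allows everything fails the suite.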

Text rules are patterns without technical enforcement. They exist as instructions in lessons.md or elsewhere and rely on the agent reading them, remembering them, and applying them under pressure. Examples:

  • Pattern 2 (safety guards added after the incident, not before) is a text rule. There is no mechanism to enforce “add the guard before the happy path”. The rule is a discipline, not a check.
  • Pattern 6 (asking me for things Claude can do itself) is a text rule. The agent is told to try SSH, tools, and memory before asking. There is no hard technical block.
  • Pattern 11 (change first, verify after) is a text rule. The discipline is to gather data, verify the assumption, then change; there is no system-level enforcement.

The distinction is load-bearing because text rules fail under pressure. Every Tier 1 incident in my file started with a text rule that was in effect and ignored. The rule was in the file. The agent had read it. The agent, optimising for something else, routed around it. Every time a pattern recurs despite a text rule, the pattern is marked for promotion to technical enforcement.

My rough count is that eight of the twenty-three patterns now have technical enforcement. Those eight have not recurred since enforcement was added. The other fifteen either have not had the opportunity to recur (they are contextual) or are in the queue for enforcement when I can identify the right mechanism.

The /lessons-check skill

The skill is short. About thirty lines of instruction. It loads lessons.md, takes a proposed change or plan as input, and flags any pattern that matches.

The match criterion is explicit in the skill definition. It is not “does this plan involve daemons?”; that would flag nearly everything. It is “does this plan share structural features with one of the incident descriptions?”. The structural features are things like: does the plan introduce a new automated process acting on a physical system (Pattern 1)? Does it change a shared token or credentials file (Pattern 9)? Does it use plutil -extract without -o - (Pattern 12)? Does it bootstrap or bootout a LaunchAgent (Pattern 13)?

When the skill matches, the output is a structured warning with the pattern number, the reason it matched, and the prevention rule. The human reading the output (me) decides whether the match is a real concern or a false positive. Most matches are concerns. A handful are false positives and I note them for the skill’s prompt to ignore on future runs.

The skill runs explicitly. I invoke it in plan mode before starting any non-trivial change. The overhead is a few seconds. The catch rate (proposed changes where the skill identified a real risk I had not thought of) is meaningfully above zero. Catching even a small fraction of the Tier 1 patterns before they recurred would have been worth the cost; the actual catch rate is enough that I run the skill now by default on anything resembling plan mode.

How patterns get added

A pattern earns its place in lessons.md only after it has happened at least twice. One-time incidents get documented in the session log and fixed, but they do not become patterns. The reason is that the cost of adding a pattern is real: every future session has to read it, every future plan has to be checked against it, and every false positive against it wastes attention. One-time incidents are noise; twice-occurring incidents are signal.

The second occurrence is the promotion trigger. When the same category of mistake happens a second time, I write the pattern up. The write-up captures both incidents’ dates. The entry includes whatever technical enforcement is feasible. If no enforcement is feasible today, the entry notes “no technical enforcement” explicitly, so future me remembers to keep looking for a mechanism.
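The two-incident rule is easy to enforce in a helper. In practice I do this step by hand in an editor, but a hypothetical promote() that refuses single-incident write-ups would look like this:

```python
# Hypothetical helper -- in practice the promotion is done by hand.
ENTRY_TEMPLATE = """\
## Pattern {n}: "{name}"
**What happened**: {summary}
**Incidents**: {incidents}
**Prevention**: {prevention}
"""

def promote(path, n, name, summary, incidents,
            prevention="No technical enforcement yet; keep looking."):
    """Append a pattern entry to a lessons.md-style file.

    Refuses a single-incident write-up: one occurrence is noise,
    recurrence is the promotion trigger.
    """
    if len(incidents) < 2:
        raise ValueError("one-time incident: log it and fix it, "
                         "but do not promote it to a pattern")
    entry = ENTRY_TEMPLATE.format(n=n, name=name, summary=summary,
                                  incidents=", ".join(sorted(incidents)),
                                  prevention=prevention)
    with open(path, "a") as f:
        f.write("\n" + entry)
```

The default prevention string is the explicit "no technical enforcement" marker: an entry that ships without a mechanism should say so, so future me keeps looking for one.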

The /dream consolidation pass described in the memory post helps with the promotion by surfacing candidate repetitions. The agent notices that a session has a similar failure to one logged two weeks ago, flags it for my review, and I either promote it to a pattern or dismiss it as coincidence. Most of the time I promote. The model of the system I am building is that repeated errors are almost always patterns, and the cost of writing one up is smaller than the cost of the next recurrence.

The meta-loop

The specific thing lessons.md does, that I think is worth the attention, is close a loop. A problem happens. A fix is applied. Without lessons.md, that is where the loop ends. The fix is in the code; the context is in the commit message; the next session starts fresh.

With lessons.md, the loop continues. The pattern is extracted. The prevention rule is written. The next session reads the pattern at its start. The pattern is in the context of every future piece of work. If a technical mechanism exists that can enforce the rule, the mechanism is added. The next recurrence does not happen because it cannot.

The compounding effect is real. My system has twenty-three patterns documented, eight enforced at the system level. Each of those eight is a class of failure that once recurred on a regular cadence and now does not. The cumulative time saved is larger than the time spent writing the patterns. The cumulative harm prevented is larger still.

Hamel Husain’s recent essays about building on top of language models make the point that typically three issues cause sixty per cent of problems. He wrote that about a different class of system (machine learning pipelines with customer-facing errors), but the underlying claim generalises. If three patterns cause most of the problems, writing those patterns down, enforcing them, and re-reading them at the start of every session is strictly cheaper than continuing to hit the same three problems.

How to start one

A starter kit for someone who wants to build this for their own system:

  1. Pick a file and a location. lessons.md in a place the agent auto-loads. Short lines, numbered patterns, structured entries.
  2. Write the first three patterns. The first three are the ones where you already know the answer. “The last time I broke production, this is what happened, this is the rule I now follow.” Three patterns is enough to start.
  3. Read the file at the start of every session. Put it in your auto-load. Re-read the Tier 1 patterns specifically, not just scanning.
  4. When an incident happens a second time, write it up. Not the first time. The second time. One-time incidents are noise; repetitions are signal.
  5. For Tier 1 patterns, find a technical mechanism. A hook, a guard, a test, a permission, anything that makes the pattern impossible rather than merely documented. Text rules are a starting point, not the ending point.
  6. Build the pre-flight skill last. It is useful but not essential. The file itself, re-read at every session, is doing most of the work. The skill is the automation of something you are already doing by hand.

The investment is small. The returns compound.

What I would tell a past me

I built the technical safety layers (the printer hooks, the credential guards, the protected-path enforcement) before I wrote lessons.md. That order was wrong. Writing the patterns down first would have made the hooks I built later more targeted, because each hook would have been explicitly addressing a specific pattern. Instead, I built hooks for what seemed dangerous in the moment and wrote the patterns later. Some of the hooks were well-targeted; a few were overkill; a few addressed problems that did not recur.

If I were starting again, I would write lessons.md first. Every time an incident happened, I would add an entry. When the second occurrence of a pattern came, I would promote it to a Tier 1 entry and design the specific technical enforcement that pattern required. The hooks would follow the patterns, not precede them.

The file is now the artefact I would most want to keep if I had to throw away everything else on the Mac Mini. The code is recoverable; the configuration is recoverable; the memory is mostly recoverable. The twenty-three documented patterns of how this specific system breaks are the expensive part.


lessons.md lives at memory/topics/lessons.md on the Mac Mini. The /lessons-check skill is at ~/.claude/skills/lessons-check/SKILL.md. Both are small files; both will be in the eventual control-plane repository. The pattern is portable before any of the code is.