Tim Trailor
Essay

Lessons as code: turning postmortems into pre-flight checks

A file I read at the start of every session: twenty-three numbered patterns of how I have broken my own system, and the pre-flight skill that checks proposed work against them. The pattern is the most portable thing on this site.

I keep a file called lessons.md. It has twenty-three numbered patterns in it as of this writing. Each pattern is a specific way I have broken my own system and the rule I now apply to prevent a recurrence. Every pattern in the file has happened to me at least twice. The file is read at the start of every session. Before any non-trivial change, a skill called /lessons-check compares the proposed change against the patterns and flags any match.

This is the most broadly useful piece of infrastructure on this site. You do not need any of the rest of my setup to adopt it. The pattern is: treat postmortems as artefacts that get reused, not as summaries that get filed and forgotten. This post is the worked version of what that looks like and why it compounds.

The file itself

lessons.md is a structured markdown file at memory/topics/lessons.md. Every entry has the same shape:

## Pattern N: <one-line name>
**What happened**: Sequence of events, immediate cause.
**Incidents**: Dated references to the specific times this happened.
**Prevention**: The rule now applied. If technical enforcement exists, link it.

An example (Pattern 1 in my file, paraphrased here for readability):

## Pattern 1: "Fix creates new problem"
**What happened**: Every automated helper I built to make my system more
reliable ended up making it less reliable. Uninterruptible power supply
(UPS) watchdog killed prints on USB glitches. Auto-speed adjuster killed
prints with mid-flight parameter changes. Enhanced Power Loss Recovery
(PLR) chain triggered SAVE_CONFIG during pauses. Printer daemon's
auto-recovery path sent FIRMWARE_RESTART mid-print.
**Incidents**: 2026-03-12 (watchdog), 2026-03-01 (PLR race), 2026-03-05
(SAVE_CONFIG during print), 2026-03-11 (FIRMWARE_RESTART mid-print)
**Prevention**: Before creating any automated process that touches a
physical system, answer five questions:
  1. What commands can this code send?
  2. Does it check state before EVERY action?
  3. What happens if the network drops mid-execution?
  4. What happens if the target is already in error state?
  5. Can I stop it with a single command?
If any answer is "I don't know", do not ship it.

The specific wording matters. Each element of the entry does a different job.

What happened captures the sequence for someone reading the entry cold, including me six months from now. The details have to be specific enough that the entry is not generic. “A daemon caused problems” is not an entry; “the UPS watchdog paused the printer twice in five minutes on 2026-03-12 during an 18-hour print because CyberPower’s USB HID link returns false ‘on battery’ readings at sub-minute resolution” is.

Incidents cite the dates and the specific failures. This is the evidence. It converts the entry from an abstract claim to a record of concrete experience. A claim without dates is a belief; a claim with dates is a finding.

Prevention is the rule applied going forward. The rule has to be concrete enough that it can be checked mechanically or by a human against a specific proposed change. “Be careful with daemons” is not a rule; “before any automated process that touches a physical system, answer five specific questions” is.

Severity tiers

The file is organised into two tiers, read differently.

Tier 1 patterns caused real damage. Destroyed prints. Lost work. Broken production infrastructure. Data corruption. A security exposure. Tier 1 patterns are read at every session start. There are nine of them in my file as of this writing. The list is deliberately short; the rule is that a pattern earns Tier 1 only when it has caused concrete harm, not just when it was annoying.

Tier 2 patterns are behavioural. They did not cause damage, but they wasted time, produced rework, or introduced operational smell. They are read when relevant (at the start of autonomous mode, before infrastructure changes, when touching shared state). There are fourteen Tier 2 patterns in my file.

The distinction is useful because reading all twenty-three patterns at every session start would be expensive and would dilute the attention given to the ones that actually matter. By tiering, I can load the critical ones into every session and surface the others contextually.
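The tier split is mechanical enough to sketch. Here is a minimal loader, assuming a hypothetical layout where entries sit under `# Tier 1` / `# Tier 2` headings; the markers in my actual file may differ, but the shape is the same:

```python
import re

def load_patterns(text):
    """Parse a lessons.md-style file into pattern dicts with tier tags.

    Assumes entries are grouped under '# Tier 1' / '# Tier 2' headings
    and named '## Pattern N: <name>' -- a hypothetical layout.
    """
    tier, current, patterns = None, None, []
    for line in text.splitlines():
        if m := re.match(r"# Tier (\d+)\b", line):
            tier = int(m.group(1))
        elif m := re.match(r"## Pattern (\d+): (.+)", line):
            current = {"number": int(m.group(1)), "name": m.group(2),
                       "tier": tier, "body": []}
            patterns.append(current)
        elif current is not None:
            current["body"].append(line)
    return patterns

def session_start_context(patterns):
    """Tier 1 only: the short list read at every session start."""
    return [p for p in patterns if p["tier"] == 1]

def contextual(patterns):
    """Tier 2: surfaced before autonomous mode or infrastructure work."""
    return [p for p in patterns if p["tier"] == 2]
```

Loading only `session_start_context()` every session and `contextual()` on demand is the whole trick: the critical nine are always in context, the rest cost nothing until they are relevant.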

Technical enforcement or text

The most important distinction in the file is between patterns that have technical enforcement and patterns that rely on text rules.

Technical enforcement means the pattern cannot recur because the system itself refuses to allow it. Examples from my file:

  • Pattern 3 (silent hook failures) is enforced by a scenarios test suite at tim-claude-controlplane/scenarios/test_behavioral.py. Thirteen integration tests feed real Claude Code JSON payloads into each hook and assert the correct allow or deny behaviour. A hook that silently passes every command, which is what happened for months in early 2026, would now fail the scenarios suite and block deployment.
  • Pattern 5 (printer safety) is enforced by Klipper macros that refuse SAVE_CONFIG during a print, a PreToolUse hook that blocks FIRMWARE_RESTART, and SSH access controls. The full architecture is in a companion post.
  • Pattern 9 (shared OAuth token file with multiple writers) is enforced by a single LaunchAgent (com.timtrailor.token-refresh) designated as the sole writer. Every other script reads the token and refreshes in memory, but cannot write back.
  • Pattern 12 (plutil -extract overwriting files in-place) is enforced by a settings hook that blocks the dangerous flag combination, plus chflags uchg on the plist files.
  • Pattern 13 (launchctl operations) is enforced by the same protected_path_hook, which blocks launchctl bootstrap, bootout, kickstart, load, and unload without explicit approval.
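To make "technical enforcement" concrete, here is a sketch of what a Pattern 12 guard could look like as a PreToolUse hook. It assumes the documented Claude Code hook contract (the tool call arrives as JSON on stdin; exit code 2 blocks the call and feeds stderr back to the model); the real hook does more, this shows only the shape:

```python
#!/usr/bin/env python3
"""Sketch of a PreToolUse guard for Pattern 12 (plutil -extract in place).

Illustrative only: assumes Claude Code's hook contract, where the tool
call arrives as JSON on stdin and exit code 2 blocks the call, feeding
stderr back to the model.
"""
import json
import sys

def check_command(command):
    """Return (exit_code, reason) for a proposed shell command."""
    # Without '-o -', plutil -extract rewrites the plist file in place.
    if "plutil" in command and "-extract" in command and "-o -" not in command:
        return 2, ("Blocked (Pattern 12): plutil -extract without '-o -' "
                   "overwrites the plist in place. Add '-o -' to write "
                   "the extracted value to stdout instead.")
    return 0, ""

if __name__ == "__main__":
    event = json.load(sys.stdin)
    command = event.get("tool_input", {}).get("command", "")
    code, reason = check_command(command)
    if reason:
        print(reason, file=sys.stderr)
    sys.exit(code)
```

Pulling the check out into a function is what makes the Pattern 3 style of verification possible: a scenarios test feeds payloads through check_command and asserts the exit code, so a hook that silently allows everything fails the suite.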

Text rules are patterns without technical enforcement. They exist as instructions in lessons.md or elsewhere and rely on the agent reading them, remembering them, and applying them under pressure. Examples:

  • Pattern 2 (safety guards added after the incident, not before) is a text rule. There is no mechanism to enforce “add the guard before the happy path”. The rule is a discipline, not a check.
  • Pattern 6 (asking me for things Claude can do itself) is a text rule. The agent is told to try SSH, tools, and memory before asking. There is no hard technical block.
  • Pattern 11 (change first, verify after) is a text rule. The discipline is to gather data, verify the assumption, then change; there is no system-level enforcement.

The distinction is load-bearing because text rules fail under pressure. Every Tier 1 incident in my file started with a text rule that was in effect and ignored. The rule was in the file. The agent had read it. The agent, optimising for something else, routed around it. Every time a pattern recurs despite a text rule, the pattern is marked for promotion to technical enforcement.

My rough count is that eight of the twenty-three patterns now have technical enforcement. Those eight have not recurred since enforcement was added. The other fifteen either have not had the opportunity to recur (they are contextual) or are in the queue for enforcement when I can identify the right mechanism.

The /lessons-check skill

The skill is short. About thirty lines of instruction. It loads lessons.md, takes a proposed change or plan as input, and flags any pattern that matches.

The match criterion is explicit in the skill definition. It is not “does this plan involve daemons?”; that would flag nearly everything. It is “does this plan share structural features with one of the incident descriptions?”. The structural features are things like: does the plan introduce a new automated process acting on a physical system (Pattern 1)? Does it change a shared token or credentials file (Pattern 9)? Does it use plutil -extract without -o - (Pattern 12)? Does it bootstrap or bootout a LaunchAgent (Pattern 13)?

When the skill matches, the output is a structured warning with the pattern number, the reason it matched, and the prevention rule. The human reading the output (me) decides whether the match is a real concern or a false positive. Most matches are concerns. A handful are false positives and I note them for the skill’s prompt to ignore on future runs.

The skill runs explicitly. I invoke it in plan mode before starting any non-trivial change. The overhead is a few seconds. The catch rate (proposed changes where the skill identified a real risk I had not thought of) is meaningfully above zero. Catching even a small fraction of the Tier 1 patterns before they recurred would have been worth the cost; the actual catch rate is enough that I run the skill now by default on anything resembling plan mode.

How patterns get added

A pattern earns its place in lessons.md only after it has happened at least twice. One-time incidents get documented in the session log and fixed, but they do not become patterns. The reason is that the cost of adding a pattern is real: every future session has to read it, every future plan has to be checked against it, and every false positive against it wastes attention. One-time incidents are noise; twice-occurring incidents are signal.

The second occurrence is the promotion trigger. When the same category of mistake happens a second time, I write the pattern up. The write-up captures both incidents’ dates. The entry includes whatever technical enforcement is feasible. If no enforcement is feasible today, the entry notes “no technical enforcement” explicitly, so future me remembers to keep looking for a mechanism.
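The two-incident rule is easy to enforce in a helper. In practice I do this step by hand in an editor, but a hypothetical promote() that refuses single-incident write-ups would look like this:

```python
# Hypothetical helper -- in practice the promotion is done by hand.
ENTRY_TEMPLATE = """\
## Pattern {n}: "{name}"
**What happened**: {summary}
**Incidents**: {incidents}
**Prevention**: {prevention}
"""

def promote(path, n, name, summary, incidents,
            prevention="No technical enforcement yet; keep looking."):
    """Append a pattern entry to a lessons.md-style file.

    Refuses a single-incident write-up: one occurrence is noise,
    recurrence is the promotion trigger.
    """
    if len(incidents) < 2:
        raise ValueError("one-time incident: log it and fix it, "
                         "but do not promote it to a pattern")
    entry = ENTRY_TEMPLATE.format(n=n, name=name, summary=summary,
                                  incidents=", ".join(sorted(incidents)),
                                  prevention=prevention)
    with open(path, "a") as f:
        f.write("\n" + entry)
```

The default prevention string is the explicit "no technical enforcement" marker: an entry that ships without a mechanism should say so, so future me keeps looking for one.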

The /dream consolidation pass described in the memory post helps with the promotion by surfacing candidate repetitions. The agent notices that a session has a similar failure to one logged two weeks ago, flags it for my review, and I either promote it to a pattern or dismiss it as coincidence. Most of the time I promote. The model of the system I am building is that repeated errors are almost always patterns, and the cost of writing one up is smaller than the cost of the next recurrence.

The meta-loop

The specific thing lessons.md does, that I think is worth the attention, is close a loop. A problem happens. A fix is applied. Without lessons.md, that is where the loop ends. The fix is in the code; the context is in the commit message; the next session starts fresh.

With lessons.md, the loop continues. The pattern is extracted. The prevention rule is written. The next session reads the pattern at its start. The pattern is in the context of every future piece of work. If a technical mechanism exists that can enforce the rule, the mechanism is added. The next recurrence does not happen because it cannot.

The compounding effect is real. My system has twenty-three patterns documented, eight enforced at the system level. Each of those eight is a class of failure that once recurred on a regular cadence and now does not. The cumulative time saved is larger than the time spent writing the patterns. The cumulative harm prevented is larger still.

Hamel Husain’s recent essays about building on top of language models make the point that typically three issues cause sixty per cent of problems. He wrote that about a different class of system (machine learning pipelines with customer-facing errors), but the underlying claim generalises. If three patterns cause most of the problems, writing those patterns down, enforcing them, and re-reading them at the start of every session is strictly cheaper than continuing to hit the same three problems.

How to start one

A starter kit for someone who wants to build this for their own system:

  1. Pick a file and a location. lessons.md in a place the agent auto-loads. Short lines, numbered patterns, structured entries.
  2. Write the first three patterns. The first three are the ones where you already know the answer. “The last time I broke production, this is what happened, this is the rule I now follow.” Three patterns is enough to start.
  3. Read the file at the start of every session. Put it in your auto-load. Re-read the Tier 1 patterns specifically, not just scanning.
  4. When an incident happens a second time, write it up. Not the first time. The second time. One-time incidents are noise; repetitions are signal.
  5. For Tier 1 patterns, find a technical mechanism. A hook, a guard, a test, a permission, anything that makes the pattern impossible rather than merely documented. Text rules are a starting point, not the ending point.
  6. Build the pre-flight skill last. It is useful but not essential. The file itself, re-read at every session, is doing most of the work. The skill is the automation of something you are already doing by hand.

The investment is small. The returns compound.

What I would tell a past me

I built the technical safety layers (the printer hooks, the credential guards, the protected-path enforcement) before I wrote lessons.md. That order was wrong. Writing the patterns down first would have made the hooks I built later more targeted, because each hook would have been explicitly addressing a specific pattern. Instead, I built hooks for what seemed dangerous in the moment and wrote the patterns later. Some of the hooks were well-targeted; a few were overkill; a few addressed problems that did not recur.

If I were starting again, I would write lessons.md first. Every time an incident happened, I would add an entry. When the second occurrence of a pattern came, I would promote it to a Tier 1 entry and design the specific technical enforcement that pattern required. The hooks would follow the patterns, not precede them.

The file is now the artefact I would most want to keep if I had to throw away everything else on the Mac Mini. The code is recoverable; the configuration is recoverable; the memory is mostly recoverable. The twenty-three documented patterns of how this specific system breaks are the expensive part.


lessons.md lives at memory/topics/lessons.md on the Mac Mini. The /lessons-check skill is at ~/.claude/skills/lessons-check/SKILL.md. Both are small files; both will be in the eventual control-plane repository. The pattern is portable before any of the code is.