Tim Trailor

Five things I built to help my AI agent that I had to remove

I built five of these. Four of the five are gone; the fifth survives only as an observer, stripped of its authority to act. The system is more reliable now than it was when those features existed, and the total code footprint is smaller.

This post is about why I removed them. Not as anecdote. As a pattern I now treat as a rule. For agent systems running near anything that matters, the default for recovery automation should be observe and alert, never act. Intervention authority requires a human in the loop.

Below is each feature: what it was built to help with, what it actually did, and the incident that convinced me to act. The fifth, at the end, is the one I almost kept because it sounded so sensible.

Setup context

I run three 3D printers from a Mac Mini server via a Claude Code environment I use as my primary interface for infrastructure, memory, automation, and occasional chaos. Two of the printers run Klipper (a Sovol SV08 Max and a Snapmaker U1). The third is a Bambu A1. Each can take 10 to 30 hours to complete a print. Filament, time, and the integrity of the part are all at stake every hour.

The rest of this post is about the Sovol SV08 Max specifically, because it is the printer with the longest history, the most expensive prints, and therefore the most incident data. Every removal below was triggered by a real failed print with a date and a cost.

1. The UPS watchdog

What it was built for. I have a CyberPower CP1600 UPS connected to the Mac Mini over USB. The Mac Mini queries it via pmset -g batt and learns its battery state. The idea was sensible on paper: if the UPS reports “On Battery Power”, the Mac Mini is now on battery, which means a power cut has happened, which means the printer might also be affected, which means I want to pause the current print cleanly before the whole house goes dark. The UPS watchdog was a daemon that polled pmset every 5 seconds and triggered a planned pause on three consecutive “battery” readings.

What it actually did. CyberPower’s USB HID link is not reliable at sub-minute resolution. pmset -g batt frequently returned “On Battery Power” at 100% charge for 10 to 30 seconds at a stretch, despite no real power event. Three consecutive 5-second polls comfortably fit inside one of those glitches. On 2026-03-12 the watchdog paused a Zephyros print twice in 5 minutes at 60% complete, 18 hours into a 32-hour job. On 2026-03-07 it had destroyed a different 10.9-hour print by triggering its pause macro at a point where the pause macro itself had a bug (it turned the hotend off during the macro, the part cooled unevenly mid-layer, and the print was ruined).
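A minimal reconstruction of the trigger logic (hypothetical; the real ups_watchdog.py is deleted) makes the failure arithmetic obvious: three consecutive 5-second polls span only 15 seconds, which fits comfortably inside a 10 to 30 second reporting glitch.

```python
# Hypothetical sketch of the watchdog's debounce; not the deleted ups_watchdog.py.
POLL_INTERVAL_S = 5
CONSECUTIVE_REQUIRED = 3   # 3 x 5 s = a 15 s debounce window

def should_pause(readings):
    """readings: successive poll results, True = 'On Battery Power'.
    Returns True if any run of consecutive battery readings reaches
    the trigger threshold."""
    run = 0
    for on_battery in readings:
        run = run + 1 if on_battery else 0
        if run >= CONSECUTIVE_REQUIRED:
            return True
    return False

# A 20-second USB HID glitch at 100% charge looks like four
# 'battery' polls in a row -- it sails through the 15 s window.
glitch = [False, True, True, True, True, False]
print(should_pause(glitch))  # the glitch triggers a pause
```

Any debounce window shorter than the glitch duration turns a reporting hiccup into an intervention, which is exactly what happened twice in five minutes on 2026-03-12.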

Why I removed it. The watchdog caused more print failures than real power cuts ever would. Real power cuts during a print, in my house, over the year preceding the watchdog: zero. Watchdog-triggered print failures, in the first three weeks the watchdog ran: three or four. The risk the watchdog was built to mitigate was smaller than the risk introduced by the watchdog itself.

What I run instead. Two things. The Mac Mini’s native macOS UPS integration still triggers a clean shutdown at 15% battery or 5 minutes of estimated runtime, configured once via sudo pmset -u haltlevel 15 and sudo pmset -u haltremain 5. That is a genuine last-ditch guard for the Mac itself, not the printer. And the Sovol has Klipper’s Power Loss Recovery (PLR) built in: a patched gcode_move.py writes the Z height and last command line on every Z-changing G1 move, and POWER_RESUME restores after a real hard cut. PLR is enough. The watchdog was over-reach.

ups_watchdog.py was deleted from the Mac Mini on 2026-03-12. The LaunchAgent plist was unloaded and removed. The decision is documented in a memory topic file with a deliberately permanent warning: do not recreate this, not even with “fixes”, not even if a future session thinks it would help.

2. Auto-speed adjustment

What it was built for. A print’s optimal speed factor varies by layer geometry. A simple single-perimeter layer can print at 150% of slicer speed without issues; a dense small-feature layer with many direction changes cannot. Manually watching the print and adjusting M220 is tedious. The auto-speed adjuster was a daemon that polled the current layer number from Moonraker, computed the layer’s geometric complexity from a pre-processed gcode profile, and sent M220 at layer transitions to scale speed.
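A sketch of the kind of per-layer scoring involved (illustrative only; gcode_profile.py’s real heuristics are not shown here): score a layer by sharp direction changes per millimetre of travel, then map the score to an M220 percentage.

```python
# Illustrative per-layer complexity scoring; not the real gcode_profile.py logic.
import math

def layer_complexity(points):
    """points: (x, y) waypoints of one layer's travel path.
    Score = sharp direction changes (> 30 degrees) per mm of travel."""
    turns, length = 0, 0.0
    prev_heading = None
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        if seg == 0:
            continue
        length += seg
        heading = math.atan2(y1 - y0, x1 - x0)
        if prev_heading is not None:
            delta = abs(heading - prev_heading)
            delta = min(delta, 2 * math.pi - delta)  # wrap around +/- pi
            if delta > math.radians(30):
                turns += 1
        prev_heading = heading
    return turns / length if length else 0.0

def speed_percent(complexity):
    """Map complexity to an M220 percentage: simple layers up to 150%,
    dense small-feature layers down to 80%. Constants are arbitrary."""
    return int(max(80, min(150, 150 - 700 * complexity)))

# A long perimeter scores low; a tight zigzag scores high.
straight = [(0, 0), (100, 0), (100, 100)]
zigzag = [(i, i % 2) for i in range(20)]
print(speed_percent(layer_complexity(straight)),
      speed_percent(layer_complexity(zigzag)))
```

The scoring itself was never the problem, as the next paragraph explains; the problem was where the resulting M220 landed.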

What it actually did. It worked perfectly in isolation. It did not work in combination with Klipper’s motion planner, which had already committed look-ahead plans based on the previous speed factor. When M220 was applied to a layer mid-flight, the planner’s acceleration and velocity assumptions were invalidated for any move that had not yet been dispatched but was already in the queue. On three separate prints between late February and mid-March 2026, I lost prints to layer-transition artefacts (ringing, blobbing, dropped layer adhesion) that were unambiguously caused by mid-flight M220.

Why I removed it. I patched it three times. Each patch reduced the failure rate but none eliminated it. The patches themselves became complex: delay the M220 by N milliseconds, only apply on odd layer numbers, skip the first 5 layers, enforce bounds. Each patch was a sticking plaster on a fundamentally wrong abstraction. The right place to do speed adjustment is inside the slicer’s output (per-region speeds are already baked into the gcode) or inside Klipper’s motion planner (which has the full queue state). Outside both, with best-effort polling, it is not solvable.

What I run instead. Four slicer-defined speed profiles (Standard, Silent, Sport, Ludicrous) which bake the per-region speeds into the gcode itself. Manual M220 if I want to nudge during a print, but no automation ever touches M220 for me. The gcode_profile.py tool still exists because it is useful for predicting per-layer timing for the ETA display, but its auto-speed output mode is disabled at the source. This was lessons.md Pattern 4: “Fixes that don’t stick.” A fix that has failed twice must include technical enforcement or removal, not a stronger text rule. I chose removal.

3. The enhanced Power Loss Recovery chain

What it was built for. Sovol’s native PLR is minimal. It saves the Z height and last G1 command, and POWER_RESUME re-homes and resumes. It does not save speed factor, flow factor, fan speeds, or the gcode file pointer. If you had M220 at 120% before a power cut, recovery resumes at 100%. If you had adjusted fans, they start fresh. The “enhanced” PLR was a plr_enhanced.cfg that layered on top: a SAVE_AND_PAUSE macro that captured everything, a PLR_AUTO_SAVE delayed gcode that wrote state to saved_variables.cfg every 60 seconds, a POWER_RESUME that restored it all.

What it actually did. Under normal operation, fine. Under abnormal operation, it became the source of the abnormality. On 2026-03-01 the LOG_Z macro (called from the patched gcode_move.py on every Z-changing G1, hundreds of times a minute) raced with PLR_AUTO_SAVE on writes to saved_variables.cfg. Both were trying to update power_resume_z. The file corrupted. On the next print start, Klipper failed to load the variables and refused to begin. I fixed that by making LOG_Z a no-op and designating PLR_AUTO_SAVE as the single writer. Then on 2026-03-05 came a more subtle failure: under certain exit paths from SAVE_AND_PAUSE, the enhanced PLR chain caused a SAVE_CONFIG to run, 12 hours into a 20-hour print. SAVE_CONFIG writes the config to disk and restarts Klipper. The stepper motors de-energised. The print head dropped onto the part.

Why I partly removed it. I kept SAVE_AND_PAUSE (I needed manual planned-pause functionality) and POWER_RESUME (actual recovery). I removed PLR_AUTO_SAVE as a delayed gcode running in Klipper, and moved the functionality into a Mac-side Python daemon (plr_autosave.py) that calls Moonraker’s API over HTTP. This gets the state out of the printer’s own config-mutation path. If the Python daemon crashes, the printer is unaffected. If saved_variables.cfg corrupts, the Python daemon rebuilds it from scratch. The SAVE_CONFIG-during-print path is now physically impossible from the PLR chain.
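A sketch of the Mac-side shape (the real plr_autosave.py is not published; the hostname and file path here are placeholders, the endpoint is Moonraker’s standard object-query API): poll Moonraker over HTTP, then write the snapshot atomically so a crash mid-write can never leave a half-written state file.

```python
# Sketch of a Mac-side PLR autosave; not the real plr_autosave.py.
import json
import os
import tempfile
import urllib.request

MOONRAKER = "http://sovol.local:7125"   # placeholder hostname
STATE_FILE = "/tmp/plr_state.json"      # placeholder path

def query_printer_state():
    """Fetch the state PLR needs via Moonraker's object-query API."""
    url = MOONRAKER + "/printer/objects/query?gcode_move&print_stats&fan"
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.load(resp)["result"]["status"]

def atomic_write(path, data):
    """Write to a temp file in the same directory, then rename.
    os.replace() is atomic on POSIX, so readers (and crashes) never
    see a partially written file -- the failure mode that corrupted
    saved_variables.cfg cannot happen here."""
    d = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=d)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

The design point is the division of authority: the daemon only reads printer state and writes a file on the Mac. It has no path that mutates the printer’s own configuration.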

What remains. The Klipper-side PLR is now three macros only: SAVE_AND_PAUSE (explicit, pauses and captures state, never touches SAVE_CONFIG), POWER_RESUME (explicit, called by me after a real power cut), and clear_plr (clean up stale state). Everything else moved to the Mac. The Klipper SAVE_CONFIG macro itself now refuses to run if print_stats.state is “printing” or “paused” (the actual 2026-03-05 fix at the macro layer).

4. The printer daemon’s auto-recovery path

What it was built for. Klipper can enter an error state for many reasons (MCU shutdown, thermistor out of range, lost heartbeat). Some of those are recoverable with a simple FIRMWARE_RESTART. If the printer is idle, the daemon’s auto-recovery path would attempt the restart, log it, and notify me. The idea was: reduce manual intervention for benign errors.

What it actually did. On 2026-03-11 a filament jam triggered the extruder MCU to shut down (thermistor went out of range briefly during the jam). The daemon’s auto-recovery path activated. It did not check print_stats.state. A print was paused at that moment (the jam had triggered a pause, not a cancel). The daemon sent FIRMWARE_RESTART. The steppers de-energised. The print head dropped. The 12-hour print was destroyed.

Why I fixed it rather than removing it. I considered removing it entirely. The more I looked at it, the less necessary the “recovery” was: I was at the printer at the time, or Moonraker would have alerted me, or the next print would have triggered a clean restart anyway. Automated recovery was saving me roughly one click a week at the cost of catastrophic failure modes.

In the end I fixed it rather than removing it, because the fix was small and well-bounded: check print_stats.state before every action; if the state is “printing” or “paused”, block the recovery action and alert instead. The fix was 12 lines. The daemon’s job changed from “attempt recovery” to “notice error and alert”. This change is documented as lessons.md Pattern 5: Tim corrected the same category of mistake (restart during a print) four times across different automations before the pattern was strong enough to become a hard macro-level block.
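The shape of that 12-line fix, sketched (print_stats and its states are Klipper’s; the function boundary is mine):

```python
# Sketch of the daemon's state gate; identifiers besides print_stats are mine.
UNSAFE_STATES = {"printing", "paused"}

def recovery_action(error, print_state):
    """Decide what the daemon may do about a Klipper error.
    Returns ('restart', ...) only when no print can be harmed;
    otherwise it may only alert."""
    if print_state in UNSAFE_STATES:
        return ("alert", f"{error}: print in progress, refusing FIRMWARE_RESTART")
    return ("restart", error)

print(recovery_action("MCU shutdown", "paused")[0])   # alert -- the 2026-03-11 case
print(recovery_action("MCU shutdown", "standby")[0])  # restart -- printer is idle
```

The gate is checked immediately before the action, not at daemon startup, because the state can change between noticing the error and acting on it.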

The principle. Observation is safe. Notification is safe. Action is not. Any daemon that has authority to act on a physical system must be treated as a safety-critical component and architected accordingly: explicit state gates, sub-10-line happy paths, incident-reviewed fail-safes, no optimism.

5. The one I almost kept: the clog detector’s automatic response

What it was built for. Klipper on the SV08 Max has a CHECK_NOZZLE_CLOG macro. It reads the filament buffer’s weight, extrusion rate, and filament motion sensor to detect clogs. If it fires, it pauses the print and alerts. This is clearly useful. What was less clearly useful: I had added a second stage where, on clog detection, the system would try to clear the clog automatically with a cold-pull-style purge and then resume. The cold pull would heat to 240, extrude 10mm, retract 40mm, then repeat.

What it actually did. Once. It worked once. Most of the time the automatic clear produced a worse state than the original clog: partially-extruded filament fused onto the hotend, a manual disassembly job, the print lost anyway. When CHECK_NOZZLE_CLOG was the thing waking me up at night, the automatic response was a gamble I would not take awake. It turned a two-minute “get up and check the printer” into a fifty-fifty “wake up to a destroyed hotend or a saved print”.

Why I almost kept it. The optics of removal were uncomfortable. Removing an automation that “works 50% of the time” feels like giving up. The autonomy reflex in agent systems says more automation equals better. In my head I was arguing with an imaginary reader who would point out that 50% is better than 0%.

Why I removed it anyway. Because the failure mode was asymmetric. When the automatic clear worked, it saved me two minutes. When it failed, it cost me four to six hours of disassembly plus the failed print. With that asymmetry, a 50% success rate works out to a sharply negative expected value. The clog detection itself stays. The automatic response does not.
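The arithmetic, with the numbers from above (time in minutes; the 50% rate is my observed estimate, and this deliberately ignores the lost print itself, which only makes it worse):

```python
p_success = 0.5
saved_minutes = 2        # a skipped trip to the printer when the clear works
lost_minutes = 5 * 60    # midpoint of 4-6 hours of hotend disassembly

ev = p_success * saved_minutes - (1 - p_success) * lost_minutes
print(ev)  # -149.0 minutes: each automatic clear costs ~2.5 hours in expectation
```

For the automation to break even at these stakes, the success rate would need to be above 99%, not 50%.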

This is the point worth dwelling on. The decision is not “automation good, no automation bad.” The decision is: what is the expected value, given asymmetric downside? For recovery automation near physical systems, downside is almost always asymmetric. Keep the observer. Keep the alert. Remove the reflex.

The pattern I now apply

For any new automation that touches a physical system, five questions. I run them before writing any code. If any answer is “I do not know”, the automation does not get built.

  1. What commands can this code send to the external system? List every one.
  2. Does it check the system’s state before every action?
  3. What happens if the network drops mid-execution?
  4. What happens if the system is already in an error state?
  5. Can I stop it with a single command?

These live in lessons.md as Pattern 1. They exist because every automation I have built that failed these questions, failed in production. The five questions are the cost of admission.
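One way to make the checklist mechanical rather than aspirational (entirely my sketch; the real lessons.md entry is prose, and every identifier below is hypothetical): require a filled-in manifest before an automation may be registered, and treat "I do not know" as a hard failure.

```python
# Hypothetical pre-flight manifest for the five questions; not from lessons.md.
REQUIRED = [
    "commands_sent",   # 1. every command this code can send, listed
    "checks_state",    # 2. state check before every action?
    "network_drop",    # 3. behaviour if the network drops mid-execution
    "error_state",     # 4. behaviour if the system is already erroring
    "kill_switch",     # 5. the single command that stops it
]

def preflight(manifest):
    """Refuse to register an automation with any unanswered question."""
    missing = [q for q in REQUIRED
               if not manifest.get(q) or manifest[q] == "I do not know"]
    if missing:
        raise ValueError(f"do not build: unanswered {missing}")
    return True

preflight({
    "commands_sent": ["FIRMWARE_RESTART"],
    "checks_state": "print_stats.state gate before every action",
    "network_drop": "action is skipped; alert only",
    "error_state": "alert, never act",
    "kill_switch": "launchctl unload of the daemon's plist",
})
```

The point of encoding it is not the code; it is that a missing answer stops the build instead of becoming a postmortem.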

What the system looks like now

Lighter. The LaunchAgent list on the Mac Mini has shrunk from a peak of 19 to 15. The Klipper config is smaller: roughly 200 lines of removed macros, roughly 40 lines of replacement. The Mac-side sv08_tools/ directory has fewer scripts. The printer is more reliable than it was at any point in the last year. I have not lost a print to automation-induced failure since the last of these removals in mid-March.

The meta-point is the one I did not expect when I started building: for personal-scale AI systems operating near physical or otherwise irreversible consequences, the direction of leverage is usually subtraction. The automations worth keeping are the ones that observe and alert. The automations worth removing are the ones that act.

If you are building something similar, the test is: for each of your daemons, when the worst-case failure mode fires, what is the blast radius? If the answer is “a print ruined”, “a config file corrupted”, or “a filesystem modified”, that daemon needs either a hard state-gate in front of it or removal. A helper that exists but has no authority to act is almost always worth keeping. A helper that has authority to act and fires on best-effort signals is almost always worth removing.

The project of building these systems is not about getting to more automation. It is about getting to the right line between observation and action, and keeping that line cleanly defended.


The control plane repository containing the hooks, Klipper macros, pytest scenarios, and full lessons database will be published separately. Specific dates above are drawn from the system’s own memory log, not reconstructed from notes.