From 13d82d69e3d3b98039724c89dc92c2d9bac0833d Mon Sep 17 00:00:00 2001
From: Noah White
Date: Fri, 30 Jan 2026 05:43:37 +0000
Subject: [PATCH 1/3] Add Tailscale device cleanup runbook and update CLAUDE.md

Add runbook documenting the manual cleanup process for removing stale
Tailscale devices before instance recreation. This prevents naming
conflicts where new instances receive suffixed hostnames (e.g.,
ghost-dev-01-1 instead of ghost-dev-01).

Also adds warnings to CLAUDE.md in both the Alloy and Tailscale sysext
update sections, reminding operators to clean up Tailscale devices
before any change that triggers instance recreation.
---
 CLAUDE.md                                 |   7 +
 docs/runbooks/tailscale-device-cleanup.md | 175 ++++++++++++++++++++++
 2 files changed, 182 insertions(+)
 create mode 100644 docs/runbooks/tailscale-device-cleanup.md

diff --git a/CLAUDE.md b/CLAUDE.md
index b2c3e60..74ab547 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -276,6 +276,9 @@ The Grafana Alloy systemd-sysext image is built automatically by the OpenTofu
 to destroy and recreate the instance, as the Ignition config is immutable
 and only applied on first boot. This is the expected idempotent behavior.
 
+**Important:** Before recreating an instance, remove the old device from Tailscale admin
+to prevent naming conflicts. See `docs/runbooks/tailscale-device-cleanup.md` for details.
+
 ### Updating Tailscale Sysext Version
 
 Tailscale is installed via systemd-sysext from the [Flatcar sysext-bakery](https://flatcar.github.io/sysext-bakery/tailscale/).
@@ -310,6 +313,10 @@ are handled via systemd-sysupdate instead.
 A separate `tailscale-auth.service` runs on first boot to authenticate using the auth key
 and enable Tailscale SSH.
 
+**Important:** Changing the Tailscale version will recreate the instance. Before applying,
+remove the old device from Tailscale admin to prevent naming conflicts (e.g., the new
+instance being named `ghost-dev-01-1`). See `docs/runbooks/tailscale-device-cleanup.md`.
+
 ### Debugging deployment failures
 1. Check GitHub Actions logs
 2. SSH to instance and check container logs
diff --git a/docs/runbooks/tailscale-device-cleanup.md b/docs/runbooks/tailscale-device-cleanup.md
new file mode 100644
index 0000000..5ad6817
--- /dev/null
+++ b/docs/runbooks/tailscale-device-cleanup.md
@@ -0,0 +1,175 @@
+# Runbook: Tailscale Device Cleanup Before Instance Recreation
+
+## Overview
+
+This runbook documents the process of cleaning up stale Tailscale devices before recreating a Ghost instance. This prevents naming conflicts that cause the new instance to receive a suffixed hostname (e.g., `ghost-dev-01-1` instead of `ghost-dev-01`).
+
+## Background
+
+When an instance is destroyed and recreated, Tailscale keeps the old device registration in its inventory. When the new instance authenticates with the same hostname, Tailscale appends a suffix (e.g., `-1`) to avoid conflicts.
+
+**Key behavior:** The `-1` suffix persists even after the old device is removed, since Tailscale considers the machine name "taken" at authentication time.
+
+### Impact
+
+- **SSH Access**: Works, but requires using the suffixed name (`tailscale ssh core@ghost-dev-01-1`)
+- **Monitoring**: The `tailscale-monitor.service` may fail on first boot if it finds the stale device first (uses prefix matching with `head -n 1`)
+- **Documentation/Scripts**: Any hardcoded references to `ghost-dev-01` will fail
+
+## When to Use This Runbook
+
+**Before** any operation that recreates a Ghost instance:
+
+- Changes to `ghost.bu` (Butane/Ignition configuration)
+- Sysext version updates (Tailscale, Alloy, docker-compose)
+- Instance type changes
+- Any OpenTofu change where the plan shows:
+  ```
+  # module.ghost_instance.vultr_instance.ghost must be replaced
+  ```
+
+## Prerequisites
+
+- Admin access to Tailscale admin console
+- Or: Tailscale API access with write permissions
+
+## Procedure
+
+### Option 1: Manual Cleanup via Admin Console (Recommended)
+
+1. **Open Tailscale Admin Console**
+   - Navigate to: https://login.tailscale.com/admin/machines
+
+2. **Find the Device**
+   - Search for the device name (e.g., `ghost-dev-01`)
+   - Or filter by tag if using ACL tags
+
+3. **Remove the Device**
+   - Click on the device
+   - Click the "..." menu (three dots)
+   - Select "Remove device"
+   - Confirm removal
+
+4. **Verify Removal**
+   - Refresh the machines page
+   - Confirm the device is no longer listed
+
+5. **Proceed with Infrastructure Changes**
+   - Now run `tofu apply` to recreate the instance
+   - The new instance will register with the correct hostname
+
+### Option 2: CLI Cleanup
+
+```bash
+# List devices to find the device ID
+tailscale status
+
+# If you have API access, you can also use:
+# curl -s -H "Authorization: Bearer $TAILSCALE_API_KEY" \
+#   "https://api.tailscale.com/api/v2/tailnet/-/devices" | jq '.devices[] | {id, name, hostname}'
+
+# Remove via the admin console (CLI removal requires API key with write access)
+```
+
+### Option 3: API Cleanup (Automation)
+
+For future automation, devices can be removed via API:
+
+```bash
+# Get the device ID first
+DEVICE_ID=$(curl -s -H "Authorization: Bearer $TAILSCALE_API_KEY" \
+  "https://api.tailscale.com/api/v2/tailnet/-/devices" | \
+  jq -r '.devices[] | select(.hostname == "ghost-dev-01") | .id')
+
+# Delete the device
+curl -X DELETE -H "Authorization: Bearer $TAILSCALE_API_KEY" \
+  "https://api.tailscale.com/api/v2/device/$DEVICE_ID"
+```
+
+**Note:** This requires an API key with device write permissions. The current deployment uses auth keys for device registration, which is a different permission scope.
+
+## Post-Recreation Verification
+
+After the new instance is created:
+
+1. **Check Device Name**
+   ```bash
+   # From the Tailscale admin console or via API
+   # Device should be listed as "ghost-dev-01" (no suffix)
+   ```
+
+2. **Verify SSH Access**
+   ```bash
+   tailscale ssh core@ghost-dev-01
+   ```
+
+3. **Check Monitor Service**
+   ```bash
+   # On the instance
+   systemctl status tailscale-monitor.timer
+   systemctl status tailscale-monitor.service
+   journalctl -u tailscale-monitor.service
+   ```
+
+## Troubleshooting
+
+### Instance Registered with Suffixed Name
+
+**Symptom:** New instance shows as `ghost-dev-01-1` in Tailscale
+
+**Cause:** Old device was not removed before instance recreation
+
+**Resolution Options:**
+
+1. **Rename via CLI** (preserves current state):
+   ```bash
+   # On the instance
+   sudo tailscale set --hostname=ghost-dev-01
+   ```
+   Then remove the old `ghost-dev-01` device from admin console.
+
+2. **Recreate Instance** (clean approach):
+   - Remove both devices from Tailscale admin
+   - Run `tofu apply` with a change that forces recreation (e.g., add a comment to `ghost.bu`)
+
+### Monitor Service Failed on First Boot
+
+**Symptom:** `tailscale-monitor.service` shows failed status after instance creation
+
+**Cause:** The monitor script found the stale device first during prefix matching
+
+**Resolution:**
+1. Remove the stale device from Tailscale admin
+2. The timer will automatically retry the monitor service
+3. Or manually restart: `sudo systemctl restart tailscale-monitor.service`
+
+### Cannot Find Device in Admin Console
+
+**Symptom:** Old device not visible in Tailscale admin
+
+**Possible Causes:**
+- Device was already removed
+- Auth key has expired and device was auto-removed
+- Looking at wrong tailnet
+
+**Resolution:** Proceed with instance recreation - no cleanup needed
+
+## Related Documentation
+
+- Tailscale Machine Names: https://tailscale.com/kb/1098/machine-names
+- Ghost Instance Configuration: `opentofu/modules/vultr/instance/userdata/ghost.bu`
+- Tailscale Monitor Script: Located on block storage at `/var/mnt/storage/sbin/tailscale_monitor/`
+- Tailscale Sysext Update Process: See CLAUDE.md "Updating Tailscale Sysext Version"
+
+## Future Improvements
+
+Consider automating device cleanup as part of the deployment workflow:
+
+1. **Pre-destroy hook**: Remove device from Tailscale before `tofu apply`
+2. **CI integration**: Add a step to deployment workflow that cleans up stale devices
+3. **Idempotent naming**: Use instance ID or other unique identifier in hostname
+
+These improvements would require:
+- Tailscale API key with device write permissions stored in secrets
+- Updates to `deploy-dev.yml` workflow
+- Potentially a custom OpenTofu provider or local-exec provisioner

From d4a57027f3ded3b19f3a05fb90f558d67991fd06 Mon Sep 17 00:00:00 2001
From: Noah White
Date: Fri, 30 Jan 2026 05:54:00 +0000
Subject: [PATCH 2/3] Add known issue note for alloy.service not auto-starting

Document that alloy.service may not start automatically after instance
recreation due to a timing issue with Ignition and systemd-sysext.
Include manual fix command.
---
 CLAUDE.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/CLAUDE.md b/CLAUDE.md
index 74ab547..25f39d3 100644
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -279,6 +279,14 @@ and only applied on first boot. This is the expected idempotent behavior.
 **Important:** Before recreating an instance, remove the old device from Tailscale admin
 to prevent naming conflicts. See `docs/runbooks/tailscale-device-cleanup.md` for details.
 
+**Known issue:** After instance recreation, `alloy.service` may not start automatically
+despite being configured as `enabled: true` in ghost.bu. This appears to be a timing issue
+where Ignition tries to enable the service before systemd-sysext merges the extension.
+If Alloy is not running after recreation, manually enable it:
+```bash
+sudo systemctl enable --now alloy.service
+```
+
 ### Updating Tailscale Sysext Version
 
 Tailscale is installed via systemd-sysext from the [Flatcar sysext-bakery](https://flatcar.github.io/sysext-bakery/tailscale/).

From 7e044fdaedd9f8c29d727d09de5be1f3af7e12e6 Mon Sep 17 00:00:00 2001
From: Noah White
Date: Fri, 30 Jan 2026 05:57:42 +0000
Subject: [PATCH 3/3] Use read -s for secure API key input in runbook

Avoid exposing sensitive TAILSCALE_API_KEY in command line history.
Use zsh read -s to securely read the key, export it, and unset when done.
---
 docs/runbooks/tailscale-device-cleanup.md | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/docs/runbooks/tailscale-device-cleanup.md b/docs/runbooks/tailscale-device-cleanup.md
index 5ad6817..973b801 100644
--- a/docs/runbooks/tailscale-device-cleanup.md
+++ b/docs/runbooks/tailscale-device-cleanup.md
@@ -75,7 +75,12 @@ tailscale status
 
 For future automation, devices can be removed via API:
 
-```bash
+```zsh
+# Securely read the API key (zsh) - input will be hidden
+read -s "TAILSCALE_API_KEY?Enter Tailscale API key: "
+echo # newline after hidden input
+export TAILSCALE_API_KEY
+
 # Get the device ID first
 DEVICE_ID=$(curl -s -H "Authorization: Bearer $TAILSCALE_API_KEY" \
   "https://api.tailscale.com/api/v2/tailnet/-/devices" | \
@@ -84,6 +89,9 @@ DEVICE_ID=$(curl -s -H "Authorization: Bearer $TAILSCALE_API_KEY" \
 # Delete the device
 curl -X DELETE -H "Authorization: Bearer $TAILSCALE_API_KEY" \
   "https://api.tailscale.com/api/v2/device/$DEVICE_ID"
+
+# Clear the variable when done
+unset TAILSCALE_API_KEY
 ```
 
 **Note:** This requires an API key with device write permissions. The current deployment uses auth keys for device registration, which is a different permission scope.
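
In the Option 3 snippet, an empty `DEVICE_ID` (no hostname match) or a multi-line one (several matches, as happens when both `ghost-dev-01` and a stale registration exist) still reaches the `DELETE` call. A minimal sketch of a guard follows, assuming a hypothetical helper named `delete_device` that is not part of this patch series; it reuses the `TAILSCALE_API_KEY` variable and the delete endpoint already shown in the runbook.

```bash
# Hypothetical helper (not part of the patch): only call the delete
# endpoint when exactly one device ID was resolved.
delete_device() {
  local device_id="$1"

  # An empty ID would turn the URL into ".../api/v2/device/".
  if [ -z "$device_id" ]; then
    echo "no device matched; nothing to delete" >&2
    return 1
  fi

  # jq -r prints one ID per line, so embedded whitespace means that
  # more than one device matched the hostname filter.
  case "$device_id" in
    *[[:space:]]*)
      echo "multiple devices matched; refusing to delete" >&2
      return 2
      ;;
  esac

  # -f makes curl exit non-zero on an HTTP error status.
  curl -fsS -X DELETE -H "Authorization: Bearer $TAILSCALE_API_KEY" \
    "https://api.tailscale.com/api/v2/device/$device_id"
}
```

It would be called as `delete_device "$DEVICE_ID"` in place of the bare `curl -X DELETE` line. Returning distinct exit codes for "no match" and "several matches" lets a calling script treat the first as "nothing to clean up" and the second as something an operator should inspect.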