Make CI tolerant of homelab-runner network flakes #186

Open
opened 2026-05-15 05:45:19 +00:00 by chihlasm · 0 comments
Owner

The Gitea Actions homelab runners have hit two transient external-network timeouts in the last week, both blocking otherwise-green PRs:

  • frontend job: npm install died with EIDLETIMEOUT: Idle timeout reached for host registry.npmjs.org:443 after ~5 min. Cache restore also timed out (getCacheEntry failed: Request timeout).
  • mirror job: fatal: unable to access 'https://github.com/***/': SSL connection timeout after ~5 min, blocking the GitHub mirror push.

Neither is a code issue, but each requires a manual re-run, and the failed runs noisily mark the PR's CI status as red. Buyers / reviewers / future-us reading the PR history can't tell at a glance which red runs were real failures vs. flakes.

Possible fixes

  • Retry with backoff: wrap network-touching steps (npm ci, mirror push, actions/cache restore) in a retry. Two attempts with a 30s delay would likely have caught both.
  • Caching shield: a local npm registry mirror (verdaccio or similar) on the homelab would remove the dependency on registry.npmjs.org for the common case.
  • Mirror push through a self-hosted GitHub-API proxy if the SSL flake is reproducible, though a simple retry is probably enough.
  • At minimum: document the known-flake pattern somewhere CI-related so we don't re-investigate every time.
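
As a rough illustration of the retry-with-backoff option, here is a minimal Python sketch. The function name `run_with_retry`, its parameters, and the doubling backoff factor are assumptions made for illustration; only the "2 attempts, 30s delay" defaults come from the proposal above. How it would be wired into the workflow steps is left open.

```python
import subprocess
import sys
import time


def run_with_retry(cmd, attempts=2, delay_s=30.0, backoff=2.0, sleep=time.sleep):
    """Run cmd until it exits 0 or attempts run out; return the last exit code.

    Hypothetical helper: the defaults mirror the "2 attempts with a 30s delay"
    idea, and the backoff multiplier is an illustrative assumption.
    """
    returncode = 1
    for attempt in range(1, attempts + 1):
        # Output is not captured, so it streams straight into the CI log.
        returncode = subprocess.run(cmd).returncode
        if returncode == 0:
            return 0
        if attempt < attempts:
            print(
                f"attempt {attempt}/{attempts} failed (exit {returncode}); "
                f"retrying in {delay_s:.0f}s",
                file=sys.stderr,
            )
            sleep(delay_s)
            delay_s *= backoff  # wait longer before each further attempt
    return returncode
```

A step could then call something like `run_with_retry(["npm", "ci"])` and exit with the returned code. Note that this retries every non-zero exit, including real failures, so a genuine breakage would simply fail twice and take a little longer to report.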

Acceptance

A single transient network timeout no longer red-marks a PR. Either a retry hides it, or the failure mode is documented and obviously distinguishable from a real failure.
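
For the "documented and obviously distinguishable" half of the acceptance criterion, one possible approach is to scan a failed job's log for the known flake signatures and label the run accordingly. The sketch below is an assumption-laden illustration: `KNOWN_FLAKE_PATTERNS` and `classify_failure` are hypothetical names, and the patterns are taken only from the two incidents quoted in this issue, so they would need extending as new flakes are seen.

```python
import re

# Signatures copied from the two incidents described in this issue.
# The labels and the dict itself are illustrative, not an existing convention.
KNOWN_FLAKE_PATTERNS = {
    "npm registry idle timeout": re.compile(r"EIDLETIMEOUT"),
    "actions/cache restore timeout": re.compile(
        r"getCacheEntry failed: Request timeout"
    ),
    "git SSL connection timeout": re.compile(
        r"unable to access '.*': SSL connection timeout"
    ),
}


def classify_failure(log_text):
    """Return the label of every known flake signature found in log_text.

    An empty list means no known flake matched, so the failure should be
    treated as real until someone has investigated it.
    """
    return [
        label
        for label, pattern in KNOWN_FLAKE_PATTERNS.items()
        if pattern.search(log_text)
    ]
```

A non-empty result could be surfaced in the job summary or a PR comment so that a red run caused by a known network flake is recognizable at a glance.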
