Runbooks
Operational runbooks
First-response playbooks for the most common production issues.
Webhook failure rate spike
Trigger: delivery success falls below 95% for 10+ minutes.
Checks
- Verify downstream endpoint availability and TLS validity.
- Inspect the response status and body recorded in the delivery logs.
- Confirm webhook secret rotations are synced downstream.
Recovery
- Replay failed deliveries after endpoint recovery.
- If a secret was compromised, rotate it and notify the consumer.
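The trigger condition for this runbook can be sketched in a few lines. This is a minimal illustration, not the alerting system's actual rule: the MinuteBucket shape and the spike_triggered helper are hypothetical, and it reads "below 95% for 10+ minutes" as "each of the last 10 one-minute buckets is below 95%". An aggregate rate over the whole window would be an equally plausible reading.

```python
from dataclasses import dataclass


@dataclass
class MinuteBucket:
    """Delivery counts for one minute (hypothetical shape)."""
    delivered: int  # deliveries that succeeded in this minute
    attempted: int  # all delivery attempts in this minute


def success_rate(bucket):
    """Fraction of attempts that succeeded; a minute with no traffic counts as healthy."""
    if bucket.attempted == 0:
        return 1.0  # assumption: no attempts means no evidence of failure
    return bucket.delivered / bucket.attempted


def spike_triggered(buckets, threshold=0.95, window_minutes=10):
    """Return True when every one of the most recent buckets is below threshold.

    `buckets` is ordered oldest first, one entry per minute.
    """
    if len(buckets) < window_minutes:
        return False  # not enough history to call the failure sustained
    recent = buckets[-window_minutes:]
    return all(success_rate(b) < threshold for b in recent)
```

A single healthy minute inside the window resets the condition under this reading, which keeps short blips from paging anyone.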
Missing referrer payouts
Trigger: calls are attributed but the referrer share is zero.
Checks
- Confirm attribution header or API key is present on paid requests.
- Confirm the deployment is configured with a non-zero referrer share in basis points (bps).
- Check whether the affected rows are marked settled or distribute_failed.
Recovery
- Fix attribution source and run a fresh paid-call test.
- Retry failed distributions and validate final split amounts.
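When validating final split amounts, it can help to recompute the expected referrer share by hand. The sketch below assumes amounts are stored as integers in minor units and that the share is rounded down; the split_payment helper and both assumptions are illustrative and may not match how the platform actually computes splits.

```python
BPS_DENOMINATOR = 10_000  # 1 basis point = 0.01%, so 10,000 bps = 100%


def split_payment(amount_minor_units: int, referrer_bps: int) -> tuple[int, int]:
    """Return (referrer_share, remainder) for one paid call.

    Assumes integer minor units and floor rounding (hypothetical policy).
    """
    if not 0 <= referrer_bps <= BPS_DENOMINATOR:
        raise ValueError("referrer_bps must be between 0 and 10000")
    referrer_share = amount_minor_units * referrer_bps // BPS_DENOMINATOR
    remainder = amount_minor_units - referrer_share
    return referrer_share, remainder


# 250 bps (2.5%) of 1,000,000 minor units
print(split_payment(1_000_000, 250))  # (25000, 975000)
# Zero bps always yields a zero referrer share
print(split_payment(1_000_000, 0))    # (0, 1000000)
```

Under floor rounding, a very small payment can also produce a zero share even with non-zero bps (for example, 50 bps of 100 minor units), so it is worth checking the call amount as well as the configured bps.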
Attribution mismatch report
Trigger: partner-reported volume does not match the dashboard.
Checks
- Compare partner logs to call IDs and timestamps.
- Validate wallet/handle mapping used by integration.
- Check downstream filtering and delayed event ingestion.
Recovery
- Replay affected deliveries for downstream correction.
- Share export snapshot for reconciliation with partner.
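The comparison of partner logs against call IDs can be sketched as a set difference. The reconcile function and its output keys are hypothetical; it assumes both sides can export a list of call IDs for the same time range, and it does not compare timestamps.

```python
def reconcile(partner_call_ids, dashboard_call_ids):
    """Compare two collections of call IDs and report where they disagree."""
    partner = set(partner_call_ids)
    dashboard = set(dashboard_call_ids)
    return {
        # Calls the partner saw that the dashboard never recorded
        "missing_from_dashboard": sorted(partner - dashboard),
        # Calls on the dashboard that the partner did not log,
        # e.g. because of downstream filtering or delayed ingestion
        "missing_from_partner": sorted(dashboard - partner),
        # Number of calls both sides agree on
        "matched": len(partner & dashboard),
    }


report = reconcile(["c1", "c2", "c3"], ["c2", "c3", "c4"])
print(report["missing_from_dashboard"])  # ['c1']
print(report["missing_from_partner"])    # ['c4']
```

IDs listed under missing_from_partner are natural candidates for replay, while IDs under missing_from_dashboard point back to the attribution or wallet/handle mapping checks.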
Escalation threshold
Escalate to incident mode when payout-impacting failures persist for 30 minutes or longer, or when the delivery backlog continues to grow after a replay.
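The escalation rule can be written as a small decision function. This is a sketch under stated assumptions: should_escalate and its inputs are hypothetical, and "continues to grow" is interpreted as the backlog size strictly increasing across every sample taken after the replay started.

```python
def should_escalate(payout_failure_minutes, backlog_after_replay):
    """Decide whether to move to incident mode.

    payout_failure_minutes: how long payout-impacting failures have persisted.
    backlog_after_replay: backlog sizes sampled after the replay began, oldest first.
    """
    persistent_failure = payout_failure_minutes >= 30

    # At least two samples are needed to say anything about a trend.
    samples = list(backlog_after_replay)
    backlog_growing = len(samples) >= 2 and all(
        later > earlier for earlier, later in zip(samples, samples[1:])
    )

    return persistent_failure or backlog_growing
```

For example, should_escalate(5, [120, 180, 260]) returns True because the backlog keeps growing, even though the failure has lasted only five minutes.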