Self-healing message failure management
When messages fail in your event-driven systems, Jetsam Recover salvages them — collecting from dead-letter queues, diagnosing root causes with AI, and intelligently retrying when the underlying problem resolves. Without waking anyone up.
Single container · Go + SQLite · Deploy in minutes
The Problem
It's 3:14 AM. Your phone screams. A message queue is backing up. Thousands of events are piling into a dead-letter queue, and the threshold alert just fired.
You open your laptop. Begin the ritual. SSH into the box. Pull a sample of failed messages. Squint at stack traces. Google the error. Realize it's a downstream service that deployed a bad schema twenty minutes ago. The messages themselves are fine.
You write an ad-hoc replay script. Run it. Some messages fail again because the downstream hasn't actually rolled back yet. You wait. You retry. You wait more. By 5:30 AM, you've manually nursed 2,847 messages back through the pipeline. You file a ticket to "improve DLQ handling." You go back to sleep for forty-five minutes before your alarm goes off.
Next Tuesday, it happens again.
Nearly three-quarters of message failures resolve themselves when the underlying condition clears — a service restarts, a rate limit resets, a deployment rolls back. You're being paged for problems that fix themselves. You just need someone to retry at the right moment.
Two hours and change, if the story above is typical. That's how long it takes a human to SSH in, sample the DLQ, diagnose the root cause, write a replay script, test it, wait for conditions to clear, and replay. For a problem that an automated system could resolve in seconds, once it knows what to look for.
You're not short on observability. You can see the failures. Dashboards light up, alerts fire, Slack channels scroll. What you're short on is automated recovery. The gap isn't detection. It's the part that comes after.
“Every DLQ is an admission that your system gave up.”
Jetsam picks up where your system left off.
How It Works
Jetsam Recover watches your dead-letter queues, diagnoses why messages failed, and retries them when conditions clear. Automatically. Intelligently. While you sleep.
Jetsam connects to your dead-letter queues across NATS, RabbitMQ, Kafka, SQS, and Azure Service Bus. Every failed message is captured with full context — headers, metadata, timestamps, and the error that caused the failure. Nothing is lost, nothing expires silently.
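In spirit, the captured record looks something like this Go sketch. The struct and helper below are illustrative only, not Jetsam's actual schema or API:

```go
package main

import (
	"fmt"
	"time"
)

// FailedMessage sketches the context captured per dead-letter message.
// Field names are illustrative, not Jetsam's real data model.
type FailedMessage struct {
	Broker     string            // e.g. "rabbitmq", "sqs"
	Queue      string            // source dead-letter queue
	Headers    map[string]string // original message headers
	Payload    []byte            // message body, stored verbatim
	Error      string            // the error that caused the failure
	CapturedAt time.Time         // when the message was pulled from the DLQ
}

// capture wraps a raw DLQ payload with the context needed for later
// diagnosis and replay (an illustrative helper, not Jetsam's API).
func capture(broker, queue string, payload []byte, errMsg string) FailedMessage {
	return FailedMessage{
		Broker:     broker,
		Queue:      queue,
		Payload:    payload,
		Error:      errMsg,
		CapturedAt: time.Now(),
	}
}

func main() {
	m := capture("rabbitmq", "orders.dlq",
		[]byte(`{"order_id":42}`),
		"schema validation failed: unknown field")
	fmt.Printf("%s/%s: %s\n", m.Broker, m.Queue, m.Error)
}
```

The point is that the full envelope travels with the message, so a replay weeks later still has everything diagnosis needed on day one.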
Corpus-based classification identifies failure patterns through semantic understanding, not brittle regex. AI agents diagnose root causes, cluster related failures, and categorize each as dependency, consumer bug, schema issue, or transient. The system ships with hundreds of known failure patterns and learns yours.
When the underlying problem resolves — the service recovers, the schema is fixed, the rate limit resets — Jetsam replays failed messages with graduated retry: test one, then 10, then 50, then 100, then the rest. No retry storms. Business-critical messages go first. You wake up to a summary, not a crisis.
Features
Not another monitoring dashboard. Not another alerting rule. Jetsam Recover closes the gap between detecting failures and actually fixing them.
Errors are classified by semantic similarity to a learned corpus, not brittle regex patterns. New failure modes are recognized automatically. A new error format, a different service, an unexpected payload — the system adapts without new rules.
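To make "similarity to a corpus, not regex" concrete, here is a deliberately tiny Go sketch that matches an error against known examples by token overlap (Jaccard similarity). Jetsam's actual classifier uses semantic embeddings; the corpus entries and categories below are invented for illustration:

```go
package main

import (
	"fmt"
	"strings"
)

// corpusEntry pairs a known failure message with its category.
// These entries are illustrative examples, not Jetsam's shipped corpus.
type corpusEntry struct {
	Example  string
	Category string // "dependency", "consumer_bug", "schema", "transient"
}

var corpus = []corpusEntry{
	{"connection refused to downstream service", "dependency"},
	{"nil pointer dereference in consumer handler", "consumer_bug"},
	{"unknown field in message payload schema", "schema"},
	{"rate limit exceeded retry later", "transient"},
}

// tokens lowercases and splits an error string into a word set.
func tokens(s string) map[string]bool {
	set := map[string]bool{}
	for _, w := range strings.Fields(strings.ToLower(s)) {
		set[w] = true
	}
	return set
}

// jaccard measures overlap between two token sets: |A∩B| / |A∪B|.
func jaccard(a, b map[string]bool) float64 {
	inter := 0
	for w := range a {
		if b[w] {
			inter++
		}
	}
	union := len(a) + len(b) - inter
	if union == 0 {
		return 0
	}
	return float64(inter) / float64(union)
}

// classify returns the corpus category most similar to the error.
func classify(errMsg string) string {
	best, bestScore := "unknown", 0.0
	t := tokens(errMsg)
	for _, e := range corpus {
		if s := jaccard(t, tokens(e.Example)); s > bestScore {
			best, bestScore = e.Category, s
		}
	}
	return best
}

func main() {
	// A never-before-seen error string still lands in the right bucket.
	fmt.Println(classify("dial tcp: connection refused to payments service"))
}
```

Notice that the input string matches no corpus entry exactly; similarity scoring is what lets a new error format land in an existing category without a new rule.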
AI agents that investigate failures the way a senior SRE would. They correlate errors across queues, check downstream health, identify root causes, and determine the right moment to retry — not on a timer, but when conditions actually improve.
A failed $50,000 wire transfer and a failed analytics event aren't equal. Jetsam understands business context and recovers high-value messages first. Payment events before marketing emails. Order confirmations before internal telemetry.
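Value-first replay is, at its core, an ordering problem. This Go sketch shows the idea with assumed queue names and priority values; Jetsam's real prioritization is driven by business context, not a hard-coded integer:

```go
package main

import (
	"fmt"
	"sort"
)

// replayItem is an illustrative record; the priority numbers and
// queue names below are assumptions, not Jetsam's actual model.
type replayItem struct {
	Queue    string
	Priority int // higher = more business-critical
}

// orderForReplay sorts pending messages so high-value traffic
// (payments, orders) is replayed before low-value telemetry.
func orderForReplay(items []replayItem) []replayItem {
	sorted := append([]replayItem(nil), items...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].Priority > sorted[j].Priority
	})
	return sorted
}

func main() {
	items := []replayItem{
		{"analytics.events", 1},
		{"payments.wire-transfers", 100},
		{"orders.confirmations", 50},
	}
	for _, it := range orderForReplay(items) {
		fmt.Println(it.Queue)
	}
}
```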
One container. One Go binary. Embedded SQLite — no external databases to provision. Launches in seconds, runs anywhere containers run. No Kubernetes operator, no Helm chart sprawl. Just docker run and you're operational.
No retry storms. Jetsam tests one message first. If it succeeds, it ramps: 10, then 50, then 100, then the rest. If the first message fails, it waits and tries again later. Your recovering downstream service isn't immediately slammed with 10,000 replayed messages.
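The ramp described above (1, then 10, 50, 100, then the rest) can be sketched in a few lines of Go. This models only the batch sizing, not the wait-and-retry behavior when the probe message fails, and is a sketch rather than Jetsam's internals:

```go
package main

import "fmt"

// rampBatches splits n pending messages into graduated batch sizes:
// 1, 10, 50, 100, then everything that remains. The leading batch of
// one acts as the probe message before the ramp continues.
func rampBatches(n int) []int {
	steps := []int{1, 10, 50, 100}
	var batches []int
	remaining := n
	for _, s := range steps {
		if remaining <= 0 {
			break
		}
		if s > remaining {
			s = remaining
		}
		batches = append(batches, s)
		remaining -= s
	}
	if remaining > 0 {
		batches = append(batches, remaining) // the rest, in one final wave
	}
	return batches
}

func main() {
	// The 2,847 stranded messages from the story above would replay as:
	fmt.Println(rampBatches(2847)) // prints [1 10 50 100 2686]
}
```

The downstream service sees a single message, then a trickle, then the flood, and only once each smaller wave has succeeded.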
Ships with a curated registry of hundreds of known failure patterns across dependency failures, consumer bugs, data/schema issues, and transient errors. Jetsam starts smart on day one and gets smarter as it learns the specific ways your systems break.
Integrations
Jetsam connects to every major message broker out of the box. One deployment covers your entire event-driven architecture.
NATS JetStream
RabbitMQ
Apache Kafka
AWS SQS
Azure Service Bus
Early Access
We built Jetsam Recover because we were tired of being the person who gets paged at 3 AM to do something a machine should handle. If that sounds familiar, join the waitlist.
Turn 3AM Pages into 9AM Summaries