RabbitMQ Retries — The Full Story
RabbitMQ is one of the most widely used message brokers today. A large portion of nanit’s inter-service communication goes through RabbitMQ, which led us on a journey of finding the best way to retry processing a message upon failure.
Surprisingly, RabbitMQ itself does not implement any retry mechanism natively. In this blog post I explore 4 different ways to implement retries on RabbitMQ. For each option we will go through:
- The RabbitMQ topology diagram
- The flow of retrying
- Example Ruby code to replicate the topology, plus a subscriber which retries processing a message
- The output of running the ruby code
- A summary of advantages and disadvantages
The example code for each of the scenarios can be found in our GitHub repository.
I strongly suggest you run the code examples and play with them as you read through the post.
Before we go into the details, let’s get a better understanding of what nanit’s RabbitMQ topology looks like:
- Users API is a publisher which publishes to a direct exchange — `nanit.users`
- We use direct exchanges with the naming convention `nanit.object_name` — in this case `nanit.users`
- Each service (mailman/subscriptions) creates a queue named `service_name.object_name.routing_key` and binds it to the corresponding exchange with the appropriate routing key. In the above case, both the subscriptions and mailman services are registered to the user-created event, but only mailman is registered to the user-deleted event.
- The service then consumes the created queue.
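The naming convention above can be captured in two tiny helpers (the method names are ours, not from nanit’s codebase):

```ruby
# Hypothetical helpers illustrating the naming convention described above:
# exchanges are named after the object, queues after the consuming service,
# the object and the routing key.
def exchange_name(object)
  "nanit.#{object}"
end

def queue_name(service, object, routing_key)
  "#{service}.#{object}.#{routing_key}"
end
```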
Dead Letter Exchanges
Another subject worth mentioning before we dive into the specifics is the Dead Letter Exchange (DLX). A Dead Letter Exchange is just a regular RabbitMQ exchange. If exchange `ex1` is set as the DLX of a queue `q1`, a message is forwarded from `q1` to `ex1` if:
- The message was rejected on `q1` with `requeue=false`
- The message’s TTL has expired on `q1`
- `q1`’s queue length limit has been exceeded
We are going to use Dead Letter Exchanges throughout the tutorial quite a lot.
Now that we know what our topology looks like and what a dead letter exchange is, we can assess a few options for implementing retries.
Option 1: Reject + Requeue
Topology:
Nothing fancy here — we didn’t have to create any additional exchanges or queues.
Flow:
- A message arrives at a mailman consumer
- The consumer fails processing the message and rejects it with the requeue flag set to true. The message is put back at the head of the queue.
- The message arrives at a consumer again, this time with the `redelivered` flag set to true.
- To avoid entering a retry loop on that message, the consumer should only requeue if the message was not redelivered
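The decision rule above boils down to a one-line predicate (a minimal sketch; the method name is ours):

```ruby
# Option 1 decision rule: reject with requeue=true only on the first
# failure. `redelivered` is the flag RabbitMQ sets on a message that
# has already been delivered once before.
def requeue_on_failure?(redelivered)
  !redelivered
end
```

In a Bunny subscriber this could be wired up as something like `channel.reject(delivery_info.delivery_tag, requeue_on_failure?(delivery_info.redelivered?))`.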
Output:
$> OPTION=1 make run-example
14:11:48 received message: hello | redelivered: false
first try, rejecting with requeue=true
14:11:48 received message: hello | redelivered: true
already retried, rejecting with retry=false
14:11:53 Bye
This method allows us only a single retry per message, with no delay at all.
Option 2: Reject + DLX topology
Topology:
We added two exchanges and one queue here.
We set `nanit.users.retry1` as the dead letter exchange of the queue `mailman.users.created`, so when a message is rejected from that queue it is immediately passed to `nanit.users.retry1`.
`nanit.users.wait_queue`, the wait queue, is where messages are held between retries. It has a TTL set via `x-message-ttl`, and when that TTL expires the message is forwarded to `nanit.users.retry2`, which is set as its dead letter exchange.
Flow:
- A message arrives at a mailman consumer from the `mailman.users.created` queue
- The consumer fails processing and rejects the message
- The message is forwarded to the queue’s Dead Letter Exchange, `nanit.users.retry1`. We replace the original message routing key `created` with the name of the queue the message originated from — `mailman.users.created`. This will be explained later on.
- A single queue, `nanit.users.wait_queue`, is bound to the DLX by all routing keys, so the message is passed on to that queue
- The wait queue has a TTL set up via the `x-message-ttl` argument. Once the TTL expires, the message is passed to the second exchange, `nanit.users.retry2`, which is set as `nanit.users.wait_queue`’s Dead Letter Exchange.
- The original queue, `mailman.users.created`, is bound to the exchange `nanit.users.retry2` by a routing key matching its own name, so the message only arrives at the exact queue it was rejected from and not at all queues bound by the `created` key *(see note)*
*Note: We must replace the original message routing key `created` with the name of the queue the message was rejected on, `mailman.users.created`. If we had left the `created` routing key as is, both the `mailman` and `subscriptions` services would have re-processed the message, while only `mailman` failed processing it. To have the message’s routing key replaced upon dead-lettering, we set the `x-dead-letter-routing-key` argument on the queue. Once set, the message’s routing key is replaced by the value defined there when it is forwarded to the Dead Letter Exchange.
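Putting the pieces together, the Option 2 topology implies the following queue arguments (a sketch using the names from the post; with the Bunny client these hashes would be passed as the `:arguments` option when declaring each queue):

```ruby
# Arguments for the work queue mailman.users.created:
def work_queue_arguments
  {
    # Rejected messages go to the first retry exchange...
    "x-dead-letter-exchange"    => "nanit.users.retry1",
    # ...with the routing key replaced by the queue's own name, so only
    # this queue eventually gets the message back (see note above).
    "x-dead-letter-routing-key" => "mailman.users.created"
  }
end

# Arguments for nanit.users.wait_queue:
def wait_queue_arguments(delay_ms)
  {
    # Messages sit in the wait queue for a constant delay...
    "x-message-ttl"          => delay_ms,
    # ...and are then dead-lettered to the second retry exchange.
    "x-dead-letter-exchange" => "nanit.users.retry2"
  }
end
```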
Output:
$> OPTION=2 make run-example
14:12:50 received message: hello | retry_count: 0
rejecting (retry via DLX)
14:12:55 received message: hello | retry_count: 1
rejecting (retry via DLX)
14:13:00 received message: hello | retry_count: 2
rejecting (retry via DLX)
14:13:05 received message: hello | retry_count: 3
max retries reached - acking
14:13:11 Bye
This topology allows us to define maximum retry attempts with a constant delay between retries.
To get the current retry count for a message we can use the `count` field of the `x-death` header, which is incremented by RabbitMQ each time a message is dead-lettered.
The retry delay is constant since it is defined on the wait queue and not on a per-message basis.
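Reading the retry count out of `x-death` can be sketched like this (the helper name is ours; `x-death` is an array with one entry per queue/reason combination, and we read the count recorded for the queue the message was rejected on):

```ruby
# Extract the retry count for a given queue from the x-death header
# that RabbitMQ maintains on dead-lettered messages. Returns 0 when
# the message has never been dead-lettered.
def retry_count(headers, queue)
  deaths = (headers || {})["x-death"] || []
  entry  = deaths.find { |d| d["queue"] == queue }
  entry ? entry["count"].to_i : 0
end
```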
Option 3: Republishing to a Retry Exchange
Topology:
The topology here is pretty similar to the previous one, except that `nanit.users.retry1` is not set as a dead letter exchange, since we republish the failed message rather than rejecting it.
Flow:
- A message arrives at a mailman consumer from the `mailman.users.created` queue
- The consumer fails processing the message, acknowledges it and publishes it to the `nanit.users.retry1` exchange. Since we do not reject the message, RabbitMQ won’t maintain the `x-death` header for us, and we have to track the retry count ourselves. We can easily do so by incrementing (or initializing) a custom header on the message — `x-retries`, for example. We also have to take care of the TTL: since it is no longer set on the wait queue, we have to publish the message with a per-message TTL. We can do so by setting the `expiration` field to a multiple of the base retry delay and the current number of retries. This way, the retry delay increases with the number of retries. The last thing to note is that we publish the message with a routing key matching the queue it arrived from — in this case, `mailman.users.created`.
- The message is delivered via the `nanit.users.retry1` exchange to `nanit.users.wait_queue`. This time, the queue has no default TTL, since we’re specifying the TTL per message.
- When the TTL on the message expires, it is passed from the wait queue to its DLX, `nanit.users.retry2`, using the key it originally arrived with — `mailman.users.created`.
- The original queue, `mailman.users.created`, is bound to the DLX `nanit.users.retry2` by a routing key matching its own name, so the message only arrives at the exact queue it was rejected from and not at all queues bound by the `created` key.
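The per-message TTL calculation can be sketched as follows (helper names and the base delay constant are ours; note that RabbitMQ’s per-message `expiration` property must be a *string* of milliseconds):

```ruby
# Base delay between retries, in milliseconds. Grows linearly with the
# retry count: 3s, 6s, 9s, ...
BASE_RETRY_DELAY_MS = 3_000

# Headers for the republished message: increment (or initialize) the
# custom x-retries counter, since RabbitMQ won't track it for us here.
def next_retry_headers(current_retries)
  { "x-retries" => current_retries + 1 }
end

# Per-message TTL as a string of milliseconds, proportional to the
# number of retries already performed.
def next_expiration(current_retries)
  (BASE_RETRY_DELAY_MS * (current_retries + 1)).to_s
end
```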
Output:
$> OPTION=3 make run-example
14:14:32 received message: hello | retry_count: 0
publishing to retry exchange with 3s delay
14:14:35 received message: hello | retry_count: 1
publishing to retry exchange with 6s delay
14:14:41 received message: hello | retry_count: 2
publishing to retry exchange with 9s delay
14:14:50 received message: hello | retry_count: 3
max retries reached - throwing message
14:14:53 Bye
This implementation allows us both to specify a retry limit and to have an increasing retry delay per message. The retry number is tracked in the `x-retries` header, and the message `expiration` is always calculated from the retry count and a base expiration we set.
Option 4: Delayed Exchange
The final option, and the one we at nanit actually use, is a delayed exchange, which is a RabbitMQ plugin. It allows us to easily define a TTL per message without setting up an additional wait queue and Dead Letter Exchanges.
Topology:
The topology is pretty simple — we have a single retry exchange, which is a delayed exchange. When the consumer fails processing a message, it publishes it to this exchange with an increasing delay, as long as we’re below the retry limit. This setup achieves the same goals as option 3 with a simpler topology and flow.
Flow:
- The mailman consumer receives a message from RabbitMQ and fails processing it
- It then ACKs the original message and publishes it to the delayed exchange with an incremented `x-retries` header, a calculated `x-delay` header to delay the message before it is forwarded on, and a routing key matching the name of the queue the message originated from (`mailman.users.created`).
- When the TTL (delay) expires, the delayed exchange forwards the message back to the queue `mailman.users.created`, which is bound to it by a routing key matching its name.
- mailman consumes the message again
Output:
$> OPTION=4 make run-example
14:15:43 received message: hello | retry_count: 0
publishing to retry (delayed) exchange with 3s delay
14:15:46 received message: hello | retry_count: 1
publishing to retry (delayed) exchange with 6s delay
14:15:52 received message: hello | retry_count: 2
publishing to retry (delayed) exchange with 9s delay
14:16:01 received message: hello | retry_count: 3
max retries reached - throwing message
14:16:04 Bye
Using Retries Smartly
While having a retry mechanism is always a good idea, we have to remember it has its price: it means more messages are passing through RabbitMQ and as a result more messages are being consumed by our consumers. In the end it translates to higher CPU/memory/network usage. This is the reason it is important to differentiate between failures in order to decide if a message should even be considered for a retry or ignored immediately.
A malformed message is an example of a message we should not attempt to retry, since nothing is going to make it processable the next time it arrives at the consumer.
An example of a message worth retrying would be one handled by a service that calls a third-party API and receives a 503 Service Unavailable response. It is reasonable to believe the third-party API will become available again, and the message may then become processable.
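This distinction can be encoded as a simple predicate over the error raised during processing (a sketch; the error classes chosen here are illustrative):

```ruby
require "json"
require "timeout"

# Hypothetical error a client might raise on a 503 from a third party.
class UpstreamUnavailable < StandardError; end

# Decide whether a failed message is worth retrying: transient upstream
# failures are, while a malformed payload never will be.
def retryable?(error)
  case error
  when JSON::ParserError                   then false # malformed message: drop immediately
  when UpstreamUnavailable, Timeout::Error then true  # likely transient: retry
  else false
  end
end
```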
Summary
I hope this guide gave you some insights as to how we at nanit use RabbitMQ.
You are invited to check out our open source RabbitMQ on Kubernetes setup.