RabbitMQ Retries — The Full Story

Erez Rabih
Nanit Engineering
Jun 18, 2018 · 8 min read


RabbitMQ is one of the most widely used message brokers today. A large portion of nanit’s inter-service communication goes through RabbitMQ, which led us on a journey of finding the best way to retry processing a message upon failure.
Surprisingly, RabbitMQ itself does not implement any retry mechanism natively. In this blog post I explore four different ways to implement retries on RabbitMQ. For each option we will go through:

  1. The RabbitMQ topology diagram
  2. The flow of retrying
  3. Example Ruby code that replicates the topology, along with a subscriber that retries processing a message
  4. The output of running the Ruby code
  5. A summary of advantages and disadvantages

The example code for each of the scenarios can be found in our GitHub repository.
I strongly suggest you run the code examples and play with them as you read through the post.

Before we go on to the details, let's get a better understanding of what nanit's RabbitMQ topology looks like:

  • Users API is a publisher which publishes to a direct exchange — nanit.users
  • We use direct exchanges with the naming convention nanit.object_name, in this case nanit.users
  • Each service (mailman/subscriptions) creates a queue named service_name.object_name.routing_key and binds it to the corresponding exchange with the appropriate routing key. In the above case both the subscriptions and mailman services are registered to the user created event, but only mailman is registered to the user deleted event.
  • The service then consumes the queue it created (a short sketch follows this list).
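
To make this concrete, here is a minimal sketch of the topology above using the Bunny gem. It is not the exact code from our repository; the connection settings are placeholders and only a few of the bindings are shown:

require "bunny"

conn = Bunny.new # assumes a local broker with default credentials
conn.start
channel = conn.create_channel

# Direct exchange the Users API publishes to
users_exchange = channel.direct("nanit.users", durable: true)

# Each service declares its own queue and binds it with the routing keys it cares about
channel.queue("mailman.users.created", durable: true).bind(users_exchange, routing_key: "created")
channel.queue("mailman.users.deleted", durable: true).bind(users_exchange, routing_key: "deleted")
channel.queue("subscriptions.users.created", durable: true).bind(users_exchange, routing_key: "created")

# The publisher simply publishes to the exchange with the event's routing key
users_exchange.publish("hello", routing_key: "created")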

Dead Letter Exchanges

Another subject worth mentioning before we dive into the specifics is the Dead Letter Exchange (DLX). A Dead Letter Exchange is just a regular RabbitMQ exchange. If exchange ex1 is set as the DLX of queue q1, a message is forwarded from q1 to ex1 when (a short declaration sketch follows the list):

  1. The message was rejected on q1 with requeue=false
  2. The message's TTL has expired on q1
  3. q1's queue length limit has been exceeded
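
As a minimal sketch, this is how a DLX can be wired up with the Bunny gem; the TTL and queue-length values here are arbitrary examples, not values we actually use:

require "bunny"

conn = Bunny.new
conn.start
channel = conn.create_channel

ex1 = channel.direct("ex1", durable: true)

# Messages rejected with requeue=false, expired, or overflowing q1 are forwarded to ex1
q1 = channel.queue("q1", durable: true, arguments: {
  "x-dead-letter-exchange" => "ex1",
  "x-message-ttl"          => 5000, # example TTL: dead-letter after 5 seconds
  "x-max-length"           => 1000  # example length limit: overflow is dead-lettered too
})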

We are going to use Dead Letter Exchanges quite a lot throughout this post.

Now that we know what our topology looks like and what a Dead Letter Exchange is, we can assess a few options for implementing retries.

Option 1: Reject + Requeue

[Topology diagram: Option 1 — Reject + Requeue]

Topology:

Nothing fancy here — we didn't have to create any additional exchanges or queues.

Flow:

  1. A message arrives at a mailman consumer
  2. The consumer fails to process the message and rejects it with the requeue flag set to true. The message is put back at the head of the queue.
  3. The message arrives at a consumer again, this time with the redelivered flag set to true.
  4. To avoid entering a retry loop on that message, the consumer should only requeue if the message was not redelivered (a consumer sketch follows this list).
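
Here is a minimal consumer sketch for this flow with the Bunny gem; process is a hypothetical processing method, not code from our repository:

require "bunny"

conn = Bunny.new
conn.start
channel = conn.create_channel
queue = channel.queue("mailman.users.created", durable: true)

queue.subscribe(manual_ack: true, block: true) do |delivery_info, _properties, payload|
  begin
    process(payload) # hypothetical processing method
    channel.ack(delivery_info.delivery_tag)
  rescue
    if delivery_info.redelivered?
      puts "already retried, rejecting with requeue=false"
      channel.reject(delivery_info.delivery_tag, false)
    else
      puts "first try, rejecting with requeue=true"
      channel.reject(delivery_info.delivery_tag, true)
    end
  end
end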

Output:

$> OPTION=1 make run-example
14:11:48 received message: hello | redelivered: false
first try, rejecting with requeue=true
14:11:48 received message: hello | redelivered: true
already retried, rejecting with retry=false
14:11:53 Bye

This method allows us only one retry per message, with no delay at all.

Option 2: Reject + DLX Topology

Topology:

Here we added two exchanges and one queue.

We set nanit.users.retry1 as the dead letter exchange of the queue mailman.users.created so when a message is rejected from that queue it is immediately passed to nanit.users.retry1.

nanit.users.wait_queue, the wait queue, is where messages are held between retries. It has a TTL set via x-message-ttl, and when that TTL expires the message is forwarded to nanit.users.retry2, which is set as its dead letter exchange.
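
Here is a minimal sketch of these declarations with the Bunny gem. The 5-second TTL is an assumption based on the example output below, and we only bind the wait queue for the single routing key used in this example:

require "bunny"

conn = Bunny.new
conn.start
channel = conn.create_channel

users_exchange = channel.direct("nanit.users", durable: true)
retry1         = channel.direct("nanit.users.retry1", durable: true)
retry2         = channel.direct("nanit.users.retry2", durable: true)

# Work queue: rejected messages are dead-lettered to retry1, with the routing key
# replaced by the queue's own name (see the note below)
work_queue = channel.queue("mailman.users.created", durable: true, arguments: {
  "x-dead-letter-exchange"    => "nanit.users.retry1",
  "x-dead-letter-routing-key" => "mailman.users.created"
})
work_queue.bind(users_exchange, routing_key: "created")
work_queue.bind(retry2, routing_key: "mailman.users.created")

# Wait queue: holds messages for the retry delay, then dead-letters them to retry2
wait_queue = channel.queue("nanit.users.wait_queue", durable: true, arguments: {
  "x-message-ttl"          => 5000, # assumed constant retry delay of 5 seconds
  "x-dead-letter-exchange" => "nanit.users.retry2"
})
wait_queue.bind(retry1, routing_key: "mailman.users.created")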

Flow:

  1. A message arrives at a mailman consumer from the mailman.users.created queue
  2. The consumer fails to process the message and rejects it
  3. The message is forwarded to the queue's Dead Letter Exchange, nanit.users.retry1. We replace the original message routing key created with the name of the queue the message originated from — mailman.users.created. This will be explained later on.
  4. A single queue, nanit.users.wait_queue, is bound to the DLX by all routing keys, so the message is passed on to that queue
  5. The wait queue has a TTL set up via the x-message-ttl argument. Once the TTL expires, the message is passed to the second exchange, nanit.users.retry2, which is set as nanit.users.wait_queue's Dead Letter Exchange.
  6. The original queue, mailman.users.created, is bound to the exchange nanit.users.retry2 by a routing key matching its own name, so the message only arrives at the exact queue it was rejected from and not at all queues bound by the created key *(see note)

*Note: We must replace the original message routing key created with the name of the queue the message was rejected on — mailman.users.created. If we had left the created routing key as is, both the mailman and subscriptions services would have re-processed the message, while only mailman failed to process it. To have the message's routing key replaced upon dead-lettering we set the
x-dead-letter-routing-key argument on the queue. Once set, the message's routing key is replaced by the value of that argument when it is forwarded to the Dead Letter Exchange.

Output:

$> OPTION=2 make run-example
14:12:50 received message: hello | retry_count: 0
rejecting (retry via DLX)
14:12:55 received message: hello | retry_count: 1
rejecting (retry via DLX)
14:13:00 received message: hello | retry_count: 2
rejecting (retry via DLX)
14:13:05 received message: hello | retry_count: 3
max retries reached - acking
14:13:11 Bye

This topology allows us to define a maximum number of retry attempts with a constant delay between retries.
To get the current retry count for a message we can use the count field of the x-death header, which RabbitMQ increments each time a message is dead-lettered.
The retry delay is constant since it is defined on the wait queue and not on a per-message basis.
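
As a sketch of the consumer side, here is one way to read that retry count with Bunny, assuming the option 2 topology above is already declared; process is a hypothetical processing method:

require "bunny"

MAX_RETRIES = 3

conn = Bunny.new
conn.start
channel = conn.create_channel
queue = channel.queue("mailman.users.created", durable: true, arguments: {
  "x-dead-letter-exchange"    => "nanit.users.retry1",
  "x-dead-letter-routing-key" => "mailman.users.created"
})

queue.subscribe(manual_ack: true, block: true) do |delivery_info, properties, payload|
  # Each x-death entry counts dead-letterings per (queue, reason) pair;
  # the "rejected" entry for our own queue is the number of retries so far
  x_death = (properties.headers || {})["x-death"] || []
  entry = x_death.find { |d| d["queue"] == "mailman.users.created" && d["reason"] == "rejected" }
  retry_count = entry ? entry["count"] : 0

  begin
    process(payload) # hypothetical processing method
    channel.ack(delivery_info.delivery_tag)
  rescue
    if retry_count >= MAX_RETRIES
      puts "max retries reached - acking"
      channel.ack(delivery_info.delivery_tag)
    else
      # requeue=false sends the message to the queue's DLX (nanit.users.retry1)
      channel.reject(delivery_info.delivery_tag, false)
    end
  end
end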

Option 3: Republishing to a Retry Exchange

[Topology diagram: Republish topology]

Topology:

The topology here is pretty similar to the previous one, except that nanit.users.retry1 is not set as a dead letter exchange, since we republish the failed message rather than rejecting it.

Flow:

  1. A message arrives at a mailman consumer from the mailman.users.created queue
  2. The consumer fails to process the message, acknowledges it and publishes it to the nanit.users.retry1 exchange (see the sketch after this list). Since we do not reject the message, RabbitMQ won't maintain the x-death header for us, and we have to track the retry count ourselves. We can easily do so by incrementing (or initializing) a custom header on the message — x-retries for example. We also have to take care of the TTL: since it is no longer set on the wait queue, we have to publish the message with a per-message TTL. We can do so by setting the expiration field to a multiple of the base retry delay and the current number of retries. This way, the retry delay increases with the number of retries. The last thing to note is that we publish the message with a routing key matching the queue it arrived from. In this case it would be mailman.users.created.
  3. The message is delivered via the nanit.users.retry1 exchange to nanit.users.wait_queue. This time, the queue has no default TTL since we're specifying the TTL per message.
  4. When the TTL on the message expires, it is passed from the wait queue to its DLX, nanit.users.retry2, using the key it originally arrived with — mailman.users.created.
  5. The original queue, mailman.users.created, is bound to the DLX nanit.users.retry2 by a routing key matching its own name, so the message only arrives at the exact queue it was rejected from and not at all queues bound by the created key.
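
Here is a minimal sketch of such a consumer with Bunny, assuming the wait queue and retry exchanges from the diagram are already declared; BASE_DELAY_MS and process are placeholders for illustration:

require "bunny"

MAX_RETRIES   = 3
BASE_DELAY_MS = 3000 # assumed base retry delay

conn = Bunny.new
conn.start
channel = conn.create_channel
retry_exchange = channel.direct("nanit.users.retry1", durable: true)
queue = channel.queue("mailman.users.created", durable: true)

queue.subscribe(manual_ack: true, block: true) do |delivery_info, properties, payload|
  retries = (properties.headers || {})["x-retries"].to_i
  begin
    process(payload) # hypothetical processing method
    channel.ack(delivery_info.delivery_tag)
  rescue
    channel.ack(delivery_info.delivery_tag) # ack and republish instead of rejecting
    if retries >= MAX_RETRIES
      puts "max retries reached - throwing message"
    else
      delay_ms = BASE_DELAY_MS * (retries + 1) # 3s, 6s, 9s...
      puts "publishing to retry exchange with #{delay_ms / 1000}s delay"
      retry_exchange.publish(payload,
        routing_key: "mailman.users.created",        # target the exact queue it came from
        headers:     { "x-retries" => retries + 1 }, # track retries ourselves
        expiration:  delay_ms.to_s)                  # per-message TTL in milliseconds
    end
  end
end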

Output:

$> OPTION=3 make run-example
14:14:32 received message: hello | retry_count: 0
publishing to retry exchange with 3s delay
14:14:35 received message: hello | retry_count: 1
publishing to retry exchange with 6s delay
14:14:41 received message: hello | retry_count: 2
publishing to retry exchange with 9s delay
14:14:50 received message: hello | retry_count: 3
max retries reached - throwing message
14:14:53 Bye

This implementation allows us both to specify a retry limit and to have an increasing retry delay per message. The retry count is tracked in the x-retries header, and the message expiration is always calculated from the retry count and a base delay we set.

Option 4: Delayed Exchange

The final option, and the one we at nanit actually use, is a delayed exchange, which is provided by a RabbitMQ plugin. It allows us to easily define a TTL per message without setting up an additional wait queue and Dead Letter Exchanges.

[Topology diagram: Option 4 — delayed exchange]

Topology:

The topology is pretty simple — we have a single retry exchange, which is a delayed exchange. When the consumer fails to process a message, it publishes it to this exchange with an increasing delay, as long as we're below the retry limit. This setup achieves the same goals as option 3 with a simpler topology and flow.
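
A minimal declaration sketch with Bunny follows; the exchange name nanit.users.retry is an assumption, and the rabbitmq_delayed_message_exchange plugin must be enabled on the broker:

require "bunny"

conn = Bunny.new
conn.start
channel = conn.create_channel

# A delayed exchange that routes like a direct exchange once the per-message delay expires
retry_exchange = channel.exchange("nanit.users.retry", # assumed name for the delayed exchange
  type:      "x-delayed-message",
  durable:   true,
  arguments: { "x-delayed-type" => "direct" })

# The work queue is bound to it by its own name, so retried messages come back only to it
queue = channel.queue("mailman.users.created", durable: true)
queue.bind(retry_exchange, routing_key: "mailman.users.created")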

Flow:

  1. The mailman consumer receives a message from RabbitMQ and fails to process it
  2. It then acks the original message and publishes it to the delayed exchange with an incremented x-retries header, a calculated x-delay header to delay the message before it is forwarded on, and a routing key matching the name of the queue the message originated from (mailman.users.created).
  3. When the TTL (delay) expires, the delayed exchange forwards the message back to the queue mailman.users.created, which is bound to it via a routing key matching its name.
  4. mailman consumes the message again (a consumer sketch follows this list)
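
Here is a minimal consumer sketch for this flow with Bunny, reusing the assumed nanit.users.retry delayed exchange; process is a hypothetical processing method:

require "bunny"

MAX_RETRIES   = 3
BASE_DELAY_MS = 3000 # assumed base retry delay

conn = Bunny.new
conn.start
channel = conn.create_channel
delayed_exchange = channel.exchange("nanit.users.retry",
  type:      "x-delayed-message",
  durable:   true,
  arguments: { "x-delayed-type" => "direct" })
queue = channel.queue("mailman.users.created", durable: true)
queue.bind(delayed_exchange, routing_key: "mailman.users.created")

queue.subscribe(manual_ack: true, block: true) do |delivery_info, properties, payload|
  retries = (properties.headers || {})["x-retries"].to_i
  begin
    process(payload) # hypothetical processing method
    channel.ack(delivery_info.delivery_tag)
  rescue
    channel.ack(delivery_info.delivery_tag) # ack and republish with a delay
    if retries >= MAX_RETRIES
      puts "max retries reached - throwing message"
    else
      delay_ms = BASE_DELAY_MS * (retries + 1)
      puts "publishing to retry (delayed) exchange with #{delay_ms / 1000}s delay"
      delayed_exchange.publish(payload,
        routing_key: "mailman.users.created",
        headers:     {
          "x-retries" => retries + 1,
          "x-delay"   => delay_ms # the plugin delays routing by this many milliseconds
        })
    end
  end
end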

Output:

$> OPTION=4 make run-example
14:15:43 received message: hello | retry_count: 0
publishing to retry (delayed) exchange with 3s delay
14:15:46 received message: hello | retry_count: 1
publishing to retry (delayed) exchange with 6s delay
14:15:52 received message: hello | retry_count: 2
publishing to retry (delayed) exchange with 9s delay
14:16:01 received message: hello | retry_count: 3
max retries reached - throwing message
14:16:04 Bye

Using Retries Smartly

While having a retry mechanism is always a good idea, we have to remember it comes at a price: more messages pass through RabbitMQ and, as a result, more messages are consumed by our consumers. In the end this translates to higher CPU/memory/network usage. This is why it is important to differentiate between failures, in order to decide whether a message should even be considered for a retry or dropped immediately.
A malformed message is an example of a message we should not attempt to retry, since nothing is going to make it processable the next time it arrives at the consumer.
An example of a message worth retrying is one whose processing depends on a third-party API that returned a 503 Service Unavailable response. It is reasonable to believe that the third-party API will become available again, at which point the message should be processable.
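
As a sketch only, this is roughly what that differentiation can look like in a consumer; MalformedMessageError, TemporarilyUnavailableError, process and retry_message are all hypothetical names:

def handle_message(channel, delivery_info, payload)
  process(payload) # hypothetical processing method
  channel.ack(delivery_info.delivery_tag)
rescue MalformedMessageError
  # Nothing will make this message processable later - drop it without retrying
  channel.ack(delivery_info.delivery_tag)
rescue TemporarilyUnavailableError
  # e.g. a 503 from a third-party API - likely to succeed later, so retry
  retry_message(channel, delivery_info, payload) # any of the retry flows above
end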

Summary

I hope this guide gave you some insight into how we at nanit use RabbitMQ.
You are invited to check out our open source RabbitMQ on Kubernetes setup.
