apache zookeeper – Kafka behavior during partition leader shutdown and how to reduce downtime during partition leader re-election

I’m conducting tests to analyze how a Kafka cluster (with ZooKeeper) behaves during intentional broker shutdowns in a Kubernetes environment. The cluster has 3 brokers, and the test topic has 1 partition with a replication factor of 3.

Testing scenario

While continuously producing and consuming messages through the topic, we intentionally shut down the broker that is the current partition leader. We log any messages that fail to be delivered during this transition. As observed, the cluster takes approximately 1 minute and 10 seconds on average to elect a new partition leader. Consequently, all messages sent during this period fail to be delivered, except for those sent in the last 10 seconds before the new leader takes over. With a message rate of 1 per second, we typically have around 60 undelivered messages.

Question

What processes is Kafka executing during this 1-minute downtime, and how can we potentially reduce this duration?

The Kafka documentation hasn’t been very illuminating on this subject, and I’m struggling to understand the internal mechanics during this downtime.

Attempted Understanding and Questions

From my understanding, the Kafka cluster controller should initiate a new leader election within the zookeeper.session.timeout.ms timeframe (defaulting to 18 seconds). However, this doesn’t seem to align with our observations, as there’s an additional unaccounted-for 40-second delay until the leader election actually takes place.
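To spell out where the roughly 40-second figure comes from, here is the arithmetic using only the numbers from the test above. The variable names are just for illustration.

```shell
# All values are in seconds and come from the observations described above.
observed_total_s=70      # time until a new leader shows up (about 1 min 10 s)
delivered_tail_s=10      # messages sent in the last 10 s were still delivered
session_timeout_s=18     # zookeeper.session.timeout.ms default, as assumed above

# Window in which messages actually failed (matches ~60 lost messages at 1 msg/s).
failed_window_s=$(( observed_total_s - delivered_tail_s ))

# Part of that window not explained by the ZooKeeper session timeout.
unaccounted_s=$(( failed_window_s - session_timeout_s ))

echo "failed window: ${failed_window_s}s, unaccounted: ${unaccounted_s}s"
```

So even if the full session timeout elapses before the controller reacts, about 40 seconds of the failure window remain unexplained.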

I should also describe how I detect the new partition leader, since the method might not be correct:

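# Capture the topic description while the original leader is still up.
initial_output=$(kafka-topics --bootstrap-server kafka-headless:9092 --describe --topic test_topic)

...shut down the partition leader...

# Poll once per second until the description differs from the initial one.
# Note: any difference in the output (for example in Isr) ends the loop,
# not only a change of leader.
while : ; do
  current_output=$(kafka-topics --bootstrap-server kafka-headless:9092 --describe --topic test_topic)
  if [[ "$initial_output" != "$current_output" ]]; then
    break
  else
    sleep 1
  fi
done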
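One weakness of the loop above is that it compares the whole describe output, so it cannot tell a leader change from an Isr-only change. Here is a minimal sketch of a narrower check, assuming the output contains `Leader: <id>` fields as it does on my cluster. The `extract_leaders` helper and the sample lines are hypothetical, made up for illustration rather than captured from a real run.

```shell
# Print the leader id of each partition, one per line, from
# `kafka-topics --describe` output supplied on stdin.
extract_leaders() {
  grep -o 'Leader: *[^[:space:]]*' | awk '{print $2}'
}

# Illustrative lines shaped like --describe output (not real captures).
sample_before=$'Topic: test_topic\tPartition: 0\tLeader: 1\tReplicas: 1,2,0\tIsr: 1,2,0'
sample_isr_only=$'Topic: test_topic\tPartition: 0\tLeader: 1\tReplicas: 1,2,0\tIsr: 1,2'
sample_after=$'Topic: test_topic\tPartition: 0\tLeader: 2\tReplicas: 1,2,0\tIsr: 2,0'

leader_before=$(printf '%s\n' "$sample_before" | extract_leaders)
leader_isr_only=$(printf '%s\n' "$sample_isr_only" | extract_leaders)
leader_after=$(printf '%s\n' "$sample_after" | extract_leaders)

# An Isr-only change leaves the extracted leader unchanged...
[[ "$leader_before" == "$leader_isr_only" ]] && echo "Isr change only: leader still $leader_before"
# ...while a real election changes it.
[[ "$leader_before" != "$leader_after" ]] && echo "leader changed: $leader_before -> $leader_after"
```

In the polling loop, the same helper could be applied to `$initial_output` and `$current_output` so that only the extracted leader ids are compared.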

Additionally, we observed that increasing the retry count on the producer client seemed to resolve undelivered messages, with even retries=1 bringing undelivered messages down to zero. This is confusing because, theoretically, the retries should only extend the message timeout to 20 seconds, which doesn’t cover the entire downtime until a new leader is elected. It’s as if the retries coincided with the zookeeper.session.timeout.ms window.

We noticed similar downtime in tests with different topic configurations, such as 10 partitions with a replication factor of 2.

Any insights into the leader election process during these conditions and advice on configuration adjustments to minimize downtime would be greatly appreciated.
