One node in the cluster suddenly became unavailable.
I am writing to report an issue with one of the nodes in our cluster. The node intermittently becomes unreachable and then recovers on its own after approximately 1 minute. This has happened several times: most recently on April 3rd, 2023, and before that on March 18th, 2023 and March 2nd, 2023.
According to the metrics exposed by rabbitmq_prometheus, monitoring data from the faulty node was interrupted, as was the data from the other nodes.
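
One way to check whether the metric interruptions line up across nodes is to compare the scrape timestamps per node and list the gaps. The following is only a minimal sketch: the node names, the 15-second scrape interval, and the timestamp lists are hypothetical illustration data, not values taken from this cluster, and the timestamps are assumed to have already been exported from the monitoring system.

```python
from typing import Dict, List, Tuple

Gap = Tuple[float, float]


def find_gaps(timestamps: List[float], max_interval: float) -> List[Gap]:
    """Return (start, end) pairs where two consecutive samples are
    more than max_interval seconds apart."""
    ordered = sorted(timestamps)
    return [(a, b) for a, b in zip(ordered, ordered[1:]) if b - a > max_interval]


def gaps_by_node(samples: Dict[str, List[float]], max_interval: float) -> Dict[str, List[Gap]]:
    """Apply find_gaps to the sample timestamps of every node."""
    return {node: find_gaps(ts, max_interval) for node, ts in samples.items()}


if __name__ == "__main__":
    # Hypothetical scrape times in seconds, assuming a 15 s scrape interval.
    samples = {
        "rabbit-1": [0, 15, 30, 45, 60, 75, 90, 105, 120],
        "rabbit-2": [0, 15, 30, 105, 120],  # samples missing between 30 s and 105 s
    }
    # Treat anything longer than two scrape intervals as an interruption.
    for node, gaps in gaps_by_node(samples, max_interval=30).items():
        print(node, gaps)
```

If the gaps start and end at the same time on every node, the interruption is more likely to sit on the network or the scraper side than on a single RabbitMQ node; this is a heuristic, not a diagnosis.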

Our cluster consists of 3 nodes running on CentOS 7 on Azure with a machine configuration of 8-core CPU, 32GB RAM, and 256GB SSD. The RabbitMQ version is 3.8.12 and the Erlang version is 22.3.
No other applications are deployed on the machines that run RabbitMQ. According to the metrics exposed by node-exporter, CPU usage was not high before or after the failure, the load stayed stable at 5 or less, and memory usage showed no significant fluctuations. Disk I/O did not appear to be a bottleneck.
I have put the rabbitmqctl environment information and logs into a file. I do not understand what caused this situation, and I would be grateful for an explanation of the cause of the failure or for suggestions on how to prevent it from happening again.
Thank you for your attention, and we look forward to your reply.
Sincerely,
Wong
