Description of problem:
RabbitMQ is unstable and has to be restarted at least twice a week to keep it working. The restarts are becoming more frequent and result in outages in the cloud.

Initial investigation pointed to low start/stop timeout values, and we recommended increasing the stop and start operation timeouts to help the restart operation from the Pacemaker resource agent's point of view:

pcs resource update rabbitmq-server op stop timeout=360
pcs resource update rabbitmq-server op start timeout=300

After the above changes things were fine for 3-4 days, but then the issues started again, e.g. "Timed out waiting for a reply to message ID cb6b17389dbe4f059eb294c8fd79ad6c" for operations such as heat stack deletion, or errors like:

Failed Actions:
* rabbitmq-server_monitor_10000 on pcmk-os2ctrl03 'not running' (7): call=3097, status=complete, exitreason='none',
    last-rc-change='Mon Jul 24 22:00:13 2017', queued=0ms, exec=0ms

Version-Release number of selected component (if applicable):
rabbitmq-server-3.3.5-31.el7ost.noarch

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
We tried to collect some performance metrics, as in the past the customer had high disk I/O during DB backups, which caused slow disk response and led to the RabbitMQ issues. However, nothing pointing to I/O was found this time. Galera also looks unstable sometimes on both clusters.
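For reference, a minimal sketch of applying and verifying the recommended timeout changes from one of the controllers (assuming the standard pcs CLI; the cleanup step is an assumption, added to clear the failed monitor action so Pacemaker retries cleanly):

```shell
# Increase the Pacemaker operation timeouts for the rabbitmq-server resource
pcs resource update rabbitmq-server op stop timeout=360
pcs resource update rabbitmq-server op start timeout=300

# Verify the new operation timeouts took effect
pcs resource show rabbitmq-server

# Clear the recorded 'not running' monitor failure on pcmk-os2ctrl03
pcs resource cleanup rabbitmq-server

# Confirm the resource is running again cluster-wide
pcs status
```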
So here is the status: the customer had I/O issues due to an enormously large Ceilometer DB, which caused heavy I/O. No critical RabbitMQ issues were found by inspecting the RabbitMQ logs. We proposed increasing timeouts here and there and purging the DB, and everything started working again. I believe this should be closed as NOTABUG. Feel free to reopen it if necessary.
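To keep the Ceilometer DB from growing back to that size, a sketch of capping sample retention with the metering time-to-live option and the expirer tool (the config path and the 30-day value are assumptions for illustration; check the release's documentation for the exact option name on this OSP version):

```shell
# Keep roughly 30 days of metering samples (value is in seconds);
# assumes /etc/ceilometer/ceilometer.conf is the active config file
openstack-config --set /etc/ceilometer/ceilometer.conf \
    database metering_time_to_live 2592000

# Purge samples older than the TTL from the metering database;
# typically run periodically from cron on one controller
ceilometer-expirer
```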