Bug 1476220 - RabbitMQ is very unstable
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Assigned To: Peter Lemenkov
QA Contact: Udi Shkalim
Reported: 2017-07-28 06:29 EDT by Anil Dhingra
Modified: 2017-08-21 09:36 EDT
CC: 6 users

Last Closed: 2017-08-21 09:36:25 EDT
Type: Bug


External Trackers
Github rabbitmq/rabbitmq-server/issues/368 (last updated 2017-08-07 09:55 EDT)
Github rabbitmq/rabbitmq-server/issues/914 (last updated 2017-08-07 10:01 EDT)
Github rabbitmq/rabbitmq-server/pull/951 (last updated 2017-08-07 10:04 EDT)

Description Anil Dhingra 2017-07-28 06:29:53 EDT
Description of problem:
RabbitMQ is unstable and has to be restarted at least twice a week to keep it working. The restarts are becoming more frequent and result in outages in the cloud.

Initial investigation pointed to low start/stop timeout values, and we recommended increasing the stop and start operation timeouts to help the restart operation from the Pacemaker resource agent's point of view:

pcs resource update rabbitmq-server op stop timeout=360
pcs resource update rabbitmq-server op start timeout=300
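
For reference, the effective operation timeouts can be confirmed afterwards; this is the pcs 0.9.x form shipped with this release era:

pcs resource show rabbitmq-server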

After the above changes things were fine for 3-4 days, but then the issues started again, e.g. "Timed out waiting for a reply to message ID cb6b17389dbe4f059eb294c8fd79ad6c" on operations such as heat stack deletion, or errors like:
Failed Actions:
* rabbitmq-server_monitor_10000 on pcmk-os2ctrl03 'not running' (7): call=3097, status=complete, exitreason='none',
    last-rc-change='Mon Jul 24 22:00:13 2017', queued=0ms, exec=0ms
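
When the monitor operation reports 'not running' like this, a useful first-pass check from any controller is the following (standard Pacemaker and RabbitMQ tooling; the queue columns listed are illustrative):

pcs status resources
rabbitmqctl cluster_status
rabbitmqctl list_queues name messages consumers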

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:
We tried to collect some performance metrics: in the past the customer was seeing high disk I/O during DB backups, which caused slow disk response and led to the RabbitMQ issues, but nothing pointing to I/O was found this time.
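
For reference, the I/O sampling was done with standard sysstat tooling along these lines (intervals are illustrative):

iostat -xz 5     # extended per-device I/O statistics every 5 seconds, idle devices omitted
vmstat 5         # memory/swap/block-I/O overview every 5 seconds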

Galera also looks unstable at times on both clusters.
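
A basic Galera health check against the local mysqld on each controller (standard wsrep status variables):

mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';"
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';"        # expect Primary
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"   # expect Synced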
Comment 10 Peter Lemenkov 2017-08-21 09:36:25 EDT
So here is the status. 

The customer had I/O issues due to an extremely large Ceilometer DB, which generated enormous I/O. No critical RabbitMQ issues were found when inspecting the RabbitMQ logs. We proposed increasing timeouts here and there and purging the DB, and everything started working again.
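
For the record, a typical way to keep the Ceilometer DB bounded in this release era is a sample TTL plus the expirer, roughly as follows (a sketch, not the exact steps used here; in Juno the option is time_to_live under [database] in /etc/ceilometer/ceilometer.conf, renamed metering_time_to_live in later releases):

# /etc/ceilometer/ceilometer.conf -- TTL value is illustrative (30 days)
[database]
time_to_live = 2592000

# then expire old samples, e.g. periodically from cron:
ceilometer-expirer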

I believe this should be closed as NOTABUG. Feel free to reopen it if necessary.
