Bug 1476220 - RabbitMQ is very unstable
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Assigned To: Peter Lemenkov
QA Contact: Udi Shkalim
Reported: 2017-07-28 06:29 EDT by Anil Dhingra
Modified: 2017-08-21 09:36 EDT
CC: 6 users

Last Closed: 2017-08-21 09:36:25 EDT
Type: Bug


External Trackers
Github rabbitmq/rabbitmq-server/issues/368 (last updated 2017-08-07 09:55 EDT)
Github rabbitmq/rabbitmq-server/issues/914 (last updated 2017-08-07 10:01 EDT)
Github rabbitmq/rabbitmq-server/pull/951 (last updated 2017-08-07 10:04 EDT)

Description Anil Dhingra 2017-07-28 06:29:53 EDT
Description of problem:
RabbitMQ is unstable and has to be restarted at least twice a week to keep it working. The restarts are becoming more frequent and result in outages in the cloud.

Initial investigation pointed to low start/stop timeout values, and we recommended increasing the stop and start operation timeouts to help the restart operation from the Pacemaker resource agent's point of view:

pcs resource update rabbitmq-server op stop timeout=360
pcs resource update rabbitmq-server op start timeout=300
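
For reference, the effective operation timeouts can be confirmed afterwards; this is the pcs 0.9.x form shipped with this release era:

pcs resource show rabbitmq-server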

After the above changes things were fine for 3-4 days, but then the issues started again, e.g. "Timed out waiting for a reply to message ID cb6b17389dbe4f059eb294c8fd79ad6c" on operations such as heat stack deletion, or errors like:
Failed Actions:
* rabbitmq-server_monitor_10000 on pcmk-os2ctrl03 'not running' (7): call=3097, status=complete, exitreason='none',
    last-rc-change='Mon Jul 24 22:00:13 2017', queued=0ms, exec=0ms
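
When the monitor operation reports 'not running' like this, a useful first-pass check from any controller is the following (standard Pacemaker and RabbitMQ tooling; the queue columns listed are illustrative):

pcs status resources
rabbitmqctl cluster_status
rabbitmqctl list_queues name messages consumers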

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:

Actual results:

Expected results:

Additional info:
We tried to collect some performance metrics: in the past the customer was seeing high disk I/O during DB backups, which caused slow disk response and led to the RabbitMQ issues, but nothing pointing to I/O was found this time.
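
For reference, the I/O sampling was done with standard sysstat tooling along these lines (intervals are illustrative):

iostat -xz 5     # extended per-device I/O statistics every 5 seconds, idle devices omitted
vmstat 5         # memory/swap/block-I/O overview every 5 seconds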

Galera also looks unstable at times on both clusters.
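
A basic Galera health check against the local mysqld on each controller (standard wsrep status variables):

mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_size';"
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_cluster_status';"        # expect Primary
mysql -e "SHOW GLOBAL STATUS LIKE 'wsrep_local_state_comment';"   # expect Synced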
Comment 10 Peter Lemenkov 2017-08-21 09:36:25 EDT
So here is the status. 

The customer had I/O issues due to an extremely large Ceilometer DB, which generated enormous I/O. No critical RabbitMQ issues were found when inspecting the RabbitMQ logs. We proposed increasing timeouts here and there and purging the DB, and everything started working again.
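
For the record, a typical way to keep the Ceilometer DB bounded in this release era is a sample TTL plus the expirer, roughly as follows (a sketch, not the exact steps used here; in Juno the option is time_to_live under [database] in /etc/ceilometer/ceilometer.conf, renamed metering_time_to_live in later releases):

# /etc/ceilometer/ceilometer.conf -- TTL value is illustrative (30 days)
[database]
time_to_live = 2592000

# then expire old samples, e.g. periodically from cron:
ceilometer-expirer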

I believe this should be closed as NOTABUG. Feel free to reopen it if necessary.
