Description of problem:
RabbitMQ is unstable and has to be restarted at least twice a week to keep it working. The restarts are becoming more frequent and result in outages in the cloud.

Initial investigation pointed to low start/stop timeout values, and we recommended increasing the stop and start operation timeouts to help the restart operation from the Pacemaker resource agent's point of view:

pcs resource update rabbitmq-server op stop timeout=360
pcs resource update rabbitmq-server op start timeout=300

After the above changes things were fine for 3-4 days, but then the issues started again, e.g. "Timed out waiting for a reply to message ID cb6b17389dbe4f059eb294c8fd79ad6c" for operations such as heat stack deletion, or errors like:

Failed Actions:
* rabbitmq-server_monitor_10000 on pcmk-os2ctrl03 'not running' (7): call=3097, status=complete, exitreason='none',
    last-rc-change='Mon Jul 24 22:00:13 2017', queued=0ms, exec=0ms

Version-Release number of selected component (if applicable):
rabbitmq-server-3.3.5-31.el7ost.noarch

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
We tried to collect some performance metrics, as in the past the customer had high disk I/O during DB backups, which caused slow disk response and led to the RabbitMQ issues. However, nothing pointing to I/O was found this time. Galera also looks unstable sometimes on both clusters.
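For reference, a minimal sketch of applying and verifying the recommended timeout changes from one of the controllers (assuming the standard pcs CLI; the cleanup step is an assumption, added to clear the failed monitor action so Pacemaker retries cleanly):

```shell
# Increase the Pacemaker operation timeouts for the rabbitmq-server resource
pcs resource update rabbitmq-server op stop timeout=360
pcs resource update rabbitmq-server op start timeout=300

# Verify the new operation timeouts took effect
pcs resource show rabbitmq-server

# Clear the recorded 'not running' monitor failure on pcmk-os2ctrl03
pcs resource cleanup rabbitmq-server

# Confirm the resource is running again cluster-wide
pcs status
```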
So here is the status: the customer had I/O issues due to an enormously large Ceilometer DB, which caused heavy I/O. No critical RabbitMQ issues were found by inspecting the RabbitMQ logs. We proposed increasing timeouts here and there and purging the DB, and everything started working again. I believe this should be closed as NOTABUG. Feel free to reopen it if necessary.
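To keep the Ceilometer DB from growing back to that size, a sketch of capping sample retention with the metering time-to-live option and the expirer tool (the config path and the 30-day value are assumptions for illustration; check the release's documentation for the exact option name on this OSP version):

```shell
# Keep roughly 30 days of metering samples (value is in seconds);
# assumes /etc/ceilometer/ceilometer.conf is the active config file
openstack-config --set /etc/ceilometer/ceilometer.conf \
    database metering_time_to_live 2592000

# Purge samples older than the TTL from the metering database;
# typically run periodically from cron on one controller
ceilometer-expirer
```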