Bug 1337704

Summary: "rabbitmqctl list_queues ***" command hangs for more than 10 mins after a controller server is hard powered off
Product: Red Hat OpenStack
Reporter: Zhen Qin <zhenqin>
Component: rabbitmq-server
Assignee: Peter Lemenkov <plemenko>
Status: CLOSED CURRENTRELEASE
QA Contact: Udi Shkalim <ushkalim>
Severity: medium
Priority: medium
Version: 8.0 (Liberty)
CC: aathomas, apevec, charcrou, fdinitto, jdonohue, jeckersb, lhh, plemenko, srevivo, zhenqin
Keywords: ZStream
Target Release: 8.0 (Liberty)
Hardware: Unspecified
OS: Linux
Last Closed: 2016-10-14 13:45:53 UTC
Type: Bug
Bug Blocks: 1194008, 1295530
Attachments:
output of rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().' (flags: none)

Description Zhen Qin 2016-05-19 21:17:17 UTC
Description of problem:
Context: there are three rabbitmq servers running in one cluster. All queues are mirrored for HA.

We noticed that if a server running a RabbitMQ node is hard killed (i.e., hard powered off), OpenStack services are not impacted (I can still boot new instances, check service status, etc.), but the "rabbitmqctl list_queues name ***" command hangs for more than 10 minutes before it returns results.

However, other rabbitmqctl commands work fine (e.g., rabbitmqctl cluster_status, rabbitmqctl status).

This could mislead people into thinking that the RabbitMQ server is not functioning properly if that command is used for monitoring RabbitMQ status. If the hanging Erlang shell is killed during this period, the Erlang node writes an erl_crash.dump file.
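
For the monitoring case, one way to keep a check from blocking for minutes is to put an external time limit on the command. The following is only a sketch: it assumes GNU coreutils `timeout` is available on the controller, and `run_bounded` is a hypothetical helper name, not part of RabbitMQ. It does not address the underlying hang; it only turns a stall into an explicit status.

```shell
#!/bin/sh
# run_bounded LIMIT CMD [ARGS...]
# Hypothetical helper: runs CMD under GNU coreutils `timeout` and prints
# a one-word status that a monitoring script could act on.
run_bounded() {
    limit="$1"
    shift
    # Discard the command's own output; only the status matters here.
    timeout "$limit" "$@" >/dev/null 2>&1
    rc=$?
    case "$rc" in
        0)   echo "OK" ;;
        124) echo "TIMEOUT" ;;        # timeout(1) exits with 124 when the limit is hit
        *)   echo "FAILED rc=$rc" ;;
    esac
    return "$rc"
}
```

It could be used as `run_bounded 30 rabbitmqctl list_queues name`; the 30-second limit is an arbitrary example value. A "TIMEOUT" result then means the command stalled, which is different from the broker being down.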

We would like to check whether Red Hat has a patch available to fix this hang. Thanks! Please let us know if you need any more info.

Version-Release number of selected component (if applicable):
rabbitmq version: rabbitmq-server-3.3.5-22.el7ost.noarch
erlang version: erlang-R16B-03.10min.9.el7ost.x86_64

How reproducible:


Steps to Reproduce:
1. Find out which RabbitMQ nodes host master queues:
> rabbitmqctl list_queues name pid | grep rabbit@node1 | wc -l
> rabbitmqctl list_queues name pid | grep rabbit@node2 | wc -l
> rabbitmqctl list_queues name pid | grep rabbit@node3 | wc -l
2. Hard kill a server whose RabbitMQ node hosts some master queues, so that slave-to-master queue promotion is guaranteed to happen.
3. On another controller node, check the RabbitMQ queue info:
watch -n 5 rabbitmqctl list_queues name
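
The three per-node counts in step 1 can be produced in a single pass. This is a sketch under an assumption about the output format: each data line of `rabbitmqctl list_queues name pid` is taken to be the queue name, a tab, and a pid printed like `<rabbit@node1.1.234.0>`. The function name `count_queues_per_node` is hypothetical.

```shell
#!/bin/sh
# Reads `rabbitmqctl list_queues name pid` output on stdin and prints
# "<node> <count>" for every node that hosts at least one queue process.
# Assumption: the pid column looks like <rabbit@node1.1.234.0>.
count_queues_per_node() {
    awk -F'\t' '
    # Skip banner/footer lines; keep lines whose last field is a pid.
    NF >= 2 && $NF ~ /^<.*\.[0-9]+\.[0-9]+\.[0-9]+>$/ {
        node = $NF
        sub(/^</, "", node)                          # drop the leading "<"
        sub(/\.[0-9]+\.[0-9]+\.[0-9]+>$/, "", node)  # drop the numeric pid suffix
        count[node]++
    }
    END { for (n in count) printf "%s %d\n", n, count[n] }' | sort
}
```

On a live cluster it would be used as `rabbitmqctl list_queues name pid | count_queues_per_node`. Any node with a non-zero count is a candidate for the hard kill in step 2.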

Actual results:
rabbitmqctl list_queues name returns results only after more than 10 minutes.

Expected results:
rabbitmqctl list_queues name returns results within a few seconds.

Additional info:

Comment 2 Peter Lemenkov 2016-05-20 12:05:37 UTC
(In reply to Zhen Qin from comment #0)

Could you please provide the output of 

rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'


This looks like a known issue reported upstream as https://github.com/rabbitmq/rabbitmq-server/issues/581

Comment 3 Zhen Qin 2016-05-20 20:05:37 UTC
Hi Peter,

The output (rabbit_diagnostics_output.txt) is attached within this ticket. Thanks!


Comment 4 Zhen Qin 2016-05-20 20:07:12 UTC
Created attachment 1160037 [details]
output of rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

Comment 5 Charles Crouch 2016-06-29 23:40:33 UTC
Peter, any updates on this issue? Thanks very much

Comment 6 Peter Lemenkov 2016-07-08 14:18:44 UTC
(In reply to Zhen Qin from comment #4)
> Created attachment 1160037 [details]
> output of rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

Status report - this looks like a known upstream issue:

https://github.com/rabbitmq/rabbitmq-server/issues/714

The bad news is that the fix requires RabbitMQ 3.6.x. Although it is relatively easy to backport that particular change, some other changes would need to be backported as well.

Wild guess - is it possible to upgrade to RabbitMQ 3.6.3 from the upcoming RHOS 9?

Comment 7 Charles Crouch 2016-07-08 15:38:13 UTC
Peter
If RabbitMQ 3.6.3 fixes the issue will RHEL-OSP8 be upgraded to that version?
Thanks
Charles

Comment 8 Peter Lemenkov 2016-07-11 12:35:57 UTC
(In reply to Charles Crouch from comment #7)
> Peter
> If RabbitMQ 3.6.3 fixes the issue will RHEL-OSP8 be upgraded to that version?
> Thanks
> Charles

I don't think we'll push a major upgrade to an already released version. So if you cannot upgrade using out-of-repository packages, we will need to backport the changes to the older version.

Comment 9 Peter Lemenkov 2016-07-12 15:45:03 UTC
(In reply to Charles Crouch from comment #7)
> Peter
> If RabbitMQ 3.6.3 fixes the issue will RHEL-OSP8 be upgraded to that version?
> Thanks
> Charles

Heads up. I'm working on backporting the necessary changes from the master branch down to 3.3.5. It will take a while.

I'll post an ETA soon.

Comment 10 Fabio Massimo Di Nitto 2016-10-14 13:45:53 UTC
After a long engineering evaluation, we concluded that backporting the fix is too invasive and carries a high risk of destabilizing other areas.

We strongly recommend upgrading the environment to OSP9 (or higher), which already contains the fix.

Comment 11 Peter Lemenkov 2016-11-08 13:41:11 UTC
We didn't fix the issue that triggers this behaviour, but we backported a couple of patches that add timeouts to various rabbitmqctl commands. This can be a reasonably good workaround, so consider upgrading to at least rabbitmq-server-3.3.5-25.el7ost.
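
To check whether a controller already has at least that build, the installed version-release can be compared against 3.3.5-25. The sketch below is an approximation, not RPM's own comparison algorithm: it keeps only the version and the leading integer of the release and orders them with GNU `sort -V`. The function name `at_least_3_3_5_25` is hypothetical.

```shell
#!/bin/sh
# at_least_3_3_5_25 VERSION-RELEASE
# Returns 0 if the given string (e.g. "3.3.5-22.el7ost") is at least 3.3.5-25.
# Approximation: relies on GNU `sort -V`, not on RPM's comparison rules.
at_least_3_3_5_25() {
    # Normalize "3.3.5-22.el7ost" to "3.3.5.22" (version plus leading release integer).
    have=$(printf '%s\n' "$1" | sed -E 's/^([0-9.]+)-([0-9]+).*$/\1.\2/')
    want="3.3.5.25"
    # If the required version sorts first (or both are equal), the installed one is new enough.
    lowest=$(printf '%s\n%s\n' "$want" "$have" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}
```

On a controller, the input could come from `rpm -q --qf '%{VERSION}-%{RELEASE}\n' rabbitmq-server`, for example `at_least_3_3_5_25 "$(rpm -q --qf '%{VERSION}-%{RELEASE}\n' rabbitmq-server)" && echo "timeout patches expected"`.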