Bug 1337704 - "rabbitmqctl list_queues ***" command hangs for more than 10 mins after a controller server is hard powered off
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 8.0 (Liberty)
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 8.0 (Liberty)
Assignee: Peter Lemenkov
QA Contact: Udi Shkalim
URL:
Whiteboard:
Depends On:
Blocks: 1194008 1295530
Reported: 2016-05-19 21:17 UTC by Zhen Qin
Modified: 2019-11-14 08:09 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-10-14 13:45:53 UTC
Target Upstream Version:
Embargoed:


Attachments
output of rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().' (231.24 KB, text/plain)
2016-05-20 20:07 UTC, Zhen Qin


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | rabbitmq rabbitmq-server issues 714 | 0 | 'None' | closed | Deadlock while syncing mirrored queues | 2020-09-05 07:15:29 UTC

Description Zhen Qin 2016-05-19 21:17:17 UTC
Description of problem:
Context: there are three rabbitmq servers running in one cluster. All queues are mirrored for HA.

We noticed that if a server running a RabbitMQ node is hard killed (i.e., hard powered off), OpenStack services are not impacted (I can still boot new instances, check service status, etc.), but the "rabbitmqctl list_queues name ***" command hangs for more than 10 minutes before it returns results.
 
However, other rabbitmqctl functions work fine (e.g., rabbitmqctl cluster_status, rabbitmqctl status).

This could mislead people into thinking that the RabbitMQ server is not functioning properly if that command is used for monitoring RabbitMQ status. If the Erlang shell is killed during the hang, the Erlang node writes an erl_crash.dump file.

We would like to check whether Red Hat has a patch available to fix this hang. Thanks! Please let us know if you need any more info.

Version-Release number of selected component (if applicable):
rabbitmq version: rabbitmq-server-3.3.5-22.el7ost.noarch
erlang version: erlang-R16B-03.10min.9.el7ost.x86_64

How reproducible:


Steps to Reproduce:
1. Find out which RabbitMQ node hosts at least some master queues:
> rabbitmqctl list_queues name pid | grep rabbit@node1 | wc -l
> rabbitmqctl list_queues name pid | grep rabbit@node2 | wc -l
> rabbitmqctl list_queues name pid | grep rabbit@node3 | wc -l
2. Hard kill a server whose RabbitMQ node hosts some master queues, to guarantee that slave-to-master queue promotion happens.
3. On another controller node, check the RabbitMQ queue info:
watch -n 5 rabbitmqctl list_queues name
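
The per-node counts in step 1 can also be taken with a small helper. This is only a sketch: the function name count_masters_on is hypothetical, and it assumes that rabbitmqctl prints each queue pid in the form <rabbit@node1.3.100.0>, so a queue counts as hosted on a node when its line contains "<NODE.".

```shell
# Hypothetical helper (not part of rabbitmqctl): count the queues whose
# master pid lives on the given node.
# Reads the output of `rabbitmqctl list_queues name pid` on stdin.
# Assumption: pids are printed as <rabbit@node1.3.100.0>.
count_masters_on() {
    node="$1"
    # -F: match the node name literally; -c: print the number of matching lines.
    # grep exits 1 when nothing matches (it still prints 0), so mask that status.
    grep -c -F "<${node}." || true
}

# Example use on a live cluster (not executed here):
#   for n in rabbit@node1 rabbit@node2 rabbit@node3; do
#       printf '%s: ' "$n"
#       rabbitmqctl list_queues name pid | count_masters_on "$n"
#   done
```

Anchoring the match on "<" and the trailing "." keeps rabbit@node1 from also matching rabbit@node10, which a plain grep on the node name would do.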

Actual results:
rabbitmqctl list_queues name returns results only after more than 10 minutes.

Expected results:
rabbitmqctl list_queues name returns results within a few seconds.

Additional info:

Comment 2 Peter Lemenkov 2016-05-20 12:05:37 UTC
(In reply to Zhen Qin from comment #0)

Could you please provide the output of 

rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'


This looks like a known issue reported upstream as https://github.com/rabbitmq/rabbitmq-server/issues/581

Comment 3 Zhen Qin 2016-05-20 20:05:37 UTC
Hi Peter,

The output (rabbit_diagnostics_output.txt) is attached within this ticket. Thanks!

(In reply to Peter Lemenkov from comment #2)
> (In reply to Zhen Qin from comment #0)
> 
> Could you please provide the output of 
> 
> rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
> 
> 
> This looks like a known issue reported upstream as
> https://github.com/rabbitmq/rabbitmq-server/issues/581

Comment 4 Zhen Qin 2016-05-20 20:07:12 UTC
Created attachment 1160037 [details]
output of rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

Comment 5 Charles Crouch 2016-06-29 23:40:33 UTC
Peter, any updates on this issue? Thanks very much

Comment 6 Peter Lemenkov 2016-07-08 14:18:44 UTC
(In reply to Zhen Qin from comment #4)
> Created attachment 1160037 [details]
> output of rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

Status report - this looks like a known upstream issue:

https://github.com/rabbitmq/rabbitmq-server/issues/714

The bad news is that the fix requires RabbitMQ 3.6.x. Although it's relatively easy to backport that particular change, some other changes need to be backported as well.

Wild guess - is it possible to upgrade to RabbitMQ 3.6.3 from the upcoming RHOS 9?

Comment 7 Charles Crouch 2016-07-08 15:38:13 UTC
Peter
If RabbitMQ 3.6.3 fixes the issue will RHEL-OSP8 be upgraded to that version?
Thanks
Charles

Comment 8 Peter Lemenkov 2016-07-11 12:35:57 UTC
(In reply to Charles Crouch from comment #7)
> Peter
> If RabbitMQ 3.6.3 fixes the issue will RHEL-OSP8 be upgraded to that version?
> Thanks
> Charles

I don't think we'll push a major upgrade to an already released version. So if you cannot upgrade using out-of-repository packages, we need to backport the changes to the previous version.

Comment 9 Peter Lemenkov 2016-07-12 15:45:03 UTC
(In reply to Charles Crouch from comment #7)
> Peter
> If RabbitMQ 3.6.3 fixes the issue will RHEL-OSP8 be upgraded to that version?
> Thanks
> Charles

Heads up. I'm working on backporting the necessary changes from the master branch down to 3.3.5. It will take a while.

I'll post an ETA soon.

Comment 10 Fabio Massimo Di Nitto 2016-10-14 13:45:53 UTC
After a long engineering evaluation, we concluded that backporting the fix is too invasive and very risky (it could destabilize other areas).

We strongly recommend upgrading the environment to OSP9 (or higher), which already contains the fix.

Comment 11 Peter Lemenkov 2016-11-08 13:41:11 UTC
We didn't fix the issue which triggered this behaviour, but we backported a couple of patches which add timeouts to various rabbitmqctl commands. This can be quite a good workaround, so consider upgrading to at least rabbitmq-server-3.3.5-25.el7ost.
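
For monitoring scripts that still run against an unpatched package, a similar bound can be put on the caller's side. The sketch below is an illustration and not the backported patch: check_bounded is a hypothetical helper built on the coreutils timeout command, which exits with status 124 when the limit is reached, and the 30-second limit in the usage comment is an arbitrary choice.

```shell
# Hypothetical wrapper: run a command with an upper bound on wall-clock time
# and report the outcome as one word on stdout.
# Usage: check_bounded SECONDS COMMAND [ARGS...]
check_bounded() {
    limit="$1"
    shift
    rc=0
    # coreutils `timeout` kills the command after $limit seconds and exits 124.
    timeout "$limit" "$@" >/dev/null 2>&1 || rc=$?
    case "$rc" in
        0)   echo "OK" ;;
        124) echo "TIMEOUT" ;;
        *)   echo "FAILED" ;;
    esac
    return "$rc"
}

# Example use in a monitoring check (not executed here):
#   check_bounded 30 rabbitmqctl list_queues name
```

Reporting TIMEOUT separately from FAILED lets a monitoring check tell a hung list_queues call apart from a node that is actually down.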

