Bug 1337704
Summary: | "rabbitmqctl list_queues ***" command hangs for more than 10 mins after a controller server is hard powered off | |
---|---|---|---
Product: | Red Hat OpenStack | Reporter: | Zhen Qin <zhenqin>
Component: | rabbitmq-server | Assignee: | Peter Lemenkov <plemenko>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Udi Shkalim <ushkalim>
Severity: | medium | Docs Contact: |
Priority: | medium | |
Version: | 8.0 (Liberty) | CC: | aathomas, apevec, charcrou, fdinitto, jdonohue, jeckersb, lhh, plemenko, srevivo, zhenqin
Target Milestone: | --- | Keywords: | ZStream
Target Release: | 8.0 (Liberty) | |
Hardware: | Unspecified | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2016-10-14 13:45:53 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1194008, 1295530 | |
Attachments: | | |
(In reply to Zhen Qin from comment #0)

Could you please provide the output of

rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

This looks like a known issue reported upstream as https://github.com/rabbitmq/rabbitmq-server/issues/581

Hi Peter,
The output (rabbit_diagnostics_output.txt) is attached to this ticket. Thanks!

(In reply to Peter Lemenkov from comment #2)
> (In reply to Zhen Qin from comment #0)
> > Could you please provide the output of
> > rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'
> >
> This looks like a known issue reported upstream as
> https://github.com/rabbitmq/rabbitmq-server/issues/581

Created attachment 1160037
output of rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

Peter, any updates on this issue? Thanks very much.

(In reply to Zhen Qin from comment #4)
> Created attachment 1160037
> output of rabbitmqctl eval 'rabbit_diagnostics:maybe_stuck().'

Status report: this looks like a known upstream issue: https://github.com/rabbitmq/rabbitmq-server/issues/714

The bad news is that the fix requires RabbitMQ 3.6.x. Although it's relatively easy to backport the particular change, some other changes also need to be backported.

Wild guess: is it possible to upgrade to RabbitMQ 3.6.3 from the upcoming RHOS 9?

Peter,
If RabbitMQ 3.6.3 fixes the issue, will RHEL-OSP8 be upgraded to that version?
Thanks,
Charles

(In reply to Charles Crouch from comment #7)
> Peter
> If RabbitMQ 3.6.3 fixes the issue will RHEL-OSP8 be upgraded to that version?
> Thanks
> Charles

I don't think we'll push a major upgrade to the released version. So if you cannot upgrade it with out-of-repository packages, then we need to backport the changes to the previous version.

(In reply to Charles Crouch from comment #7)
> Peter
> If RabbitMQ 3.6.3 fixes the issue will RHEL-OSP8 be upgraded to that version?
> Thanks
> Charles

Heads up. I'm working on backporting the necessary changes from the master branch down to 3.3.5. It will take a while. I'll post an ETA soon.

After a long engineering evaluation, we concluded that backporting the fix is too invasive and very risky (it could destabilize other areas). We strongly recommend upgrading the environment to OSP9 (or higher), which already contains the fix.

We didn't fix the issue which triggered this behaviour, but we backported a couple of patches which add timeouts to various rabbitmqctl commands. This can be quite a good workaround. So consider upgrading to at least rabbitmq-server-3.3.5-25.el7ost.
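For monitoring checks that cannot move to the patched package right away, the same idea (bound the runtime of the command instead of waiting on the hang) can be approximated from the caller's side. This is only a minimal sketch, assuming the GNU coreutils `timeout` utility is installed on the controller. The function name `run_with_deadline` and the 30-second limit in the usage line are hypothetical and are not part of rabbitmqctl or of the backported patches.

```shell
# Hypothetical helper: run a command under a time limit so that a monitoring
# check fails fast instead of blocking while the command hangs.
# Relies on coreutils `timeout`, which exits with status 124 when the limit
# is reached.
run_with_deadline() {
    rwd_limit=$1
    shift
    timeout "$rwd_limit" "$@"
    rwd_status=$?
    if [ "$rwd_status" -eq 124 ]; then
        echo "timed out after $rwd_limit: $*" >&2
    fi
    return "$rwd_status"
}
```

A check could then call `run_with_deadline 30 rabbitmqctl list_queues name` and treat exit status 124 as "list_queues is hanging" rather than "rabbitmq is down", since cluster_status and status were reported to keep working during the hang.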
Description of problem:

Context: there are three rabbitmq servers running in one cluster. All queues are mirrored for HA.

We noticed that if a server running rabbitmq is hard killed (i.e., hard powered off), the openstack services are not impacted (I can still boot new instances, check service status, etc.), but the "rabbitmqctl list_queues name ***" command hangs for more than 10 minutes before it returns results. Other rabbitmq functions work fine (e.g., rabbitmqctl cluster_status, rabbitmqctl status). This could mislead people into thinking that the rabbitmq server is not functioning properly if that command is used for monitoring rabbitmq status. If the erlang shell is killed during the hanging period, the erlang node writes an erl_crash.dump file.

We would like to check whether Red Hat has a patch available to fix this hang. Thanks! Please let us know if you need any more info.

Version-Release number of selected component (if applicable):
rabbitmq version: rabbitmq-server-3.3.5-22.el7ost.noarch
erlang version: erlang-R16B-03.10min.9.el7ost.x86_64

How reproducible:

Steps to Reproduce:
1. Find out which rabbitmq node has at least some master queues:
   > rabbitmqctl list_queues name pid | grep rabbit@node1 | wc -l
   > rabbitmqctl list_queues name pid | grep rabbit@node2 | wc -l
   > rabbitmqctl list_queues name pid | grep rabbit@node3 | wc -l
2. Hard kill the server hosting a rabbitmq node that has some master queues, to guarantee that slave-to-master queue promotion happens.
3. On another controller node, check the rabbitmq queue info:
   watch -n 5 rabbitmqctl list_queues name

Actual results:
rabbitmqctl list_queues name returns results after more than 10 minutes.

Expected results:
rabbitmqctl list_queues name returns results in a few seconds.

Additional info:
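The per-node counts from step 1 of the reproduction steps can also be collected in a single pass instead of one grep per node. This is a rough sketch, not part of the original report: the function name `count_queues_per_node` is hypothetical, and it assumes each queue line of `rabbitmqctl list_queues name pid` ends with a pid of the form `<rabbit@node1.1.234.0>`, where the node name is everything before the last three numeric fields.

```shell
# Hypothetical helper: read `rabbitmqctl list_queues name pid` output on stdin
# and print "<node> <queue count>" for each node that owns a queue process.
# Assumes the pid column looks like <rabbit@node1.1.234.0>; banner lines such
# as "Listing queues ..." do not match the pattern and are skipped.
count_queues_per_node() {
    awk '$NF ~ /^<.+\.[0-9]+\.[0-9]+\.[0-9]+>$/ {
        node = $NF
        sub(/^</, "", node)                          # drop the leading "<"
        sub(/\.[0-9]+\.[0-9]+\.[0-9]+>$/, "", node)  # drop the numeric pid tail
        count[node]++
    }
    END { for (n in count) print n, count[n] }' | sort
}
```

Usage would be `rabbitmqctl list_queues name pid | count_queues_per_node`. Any node with a non-zero count is a candidate for the hard power-off in step 2.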