Bug 1387474 - Queue master process terminates in rabbit_mirror_queue_master:stop_all_slaves on promotion
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 5.0 (RHEL 7)
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: async
Target Release: 5.0 (RHEL 7)
Assignee: Peter Lemenkov
QA Contact: Asaf Hirshberg
URL:
Whiteboard:
Depends On: 1319334
Blocks:
Reported: 2016-10-21 02:02 UTC by James Biao
Modified: 2020-02-14 18:04 UTC
CC List: 9 users

Fixed In Version: rabbitmq-server-3.3.5-25.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1391186 1391188 1391190
Environment:
Last Closed: 2017-01-19 13:33:13 UTC
Target Upstream Version:
Embargoed:


Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | rabbitmq rabbitmq-server issues 812 | 0 | None | None | None | 2016-10-21 11:23:05 UTC
Red Hat Bugzilla | 1348276 | 0 | unspecified | CLOSED | Queue master process terminates in rabbit_mirror_queue_master:stop_all_slaves on promotion | 2021-02-22 00:41:40 UTC
Red Hat Product Errata | RHBA-2017:0167 | 0 | normal | SHIPPED_LIVE | Red Hat Enterprise Linux OpenStack Platform 5 Bug Fix and Enhancement Advisory | 2017-01-19 18:21:46 UTC

Description James Biao 2016-10-21 02:02:20 UTC
Description of problem:

Request to backport the fix for bug 1348276 to OSP 5.

Upstream bug: https://github.com/rabbitmq/rabbitmq-server/issues/812

Version-Release number of selected component (if applicable):
rabbitmq-server-3.3.5-18.el7ost.noarch

How reproducible:


Steps to Reproduce:


Additional info:

=ERROR REPORT==== 17-Oct-2016::04:32:57 ===
** Generic server <0.800.0> terminating
** Last message in was {maybe_expire,2}
** When Server state == {q,
                         {amqqueue,
                          {resource,<<"/">>,queue,
 ......
                          
** Reason for termination ==
** {timeout_value,
       [{rabbit_mirror_queue_master,'-stop_all_slaves/2-lc$^1/1-1-',3,
            [{file,"src/rabbit_mirror_queue_master.erl"},{line,202}]},
        {rabbit_mirror_queue_master,stop_all_slaves,2,
            [{file,"src/rabbit_mirror_queue_master.erl"},{line,202}]},
        {rabbit_mirror_queue_master,delete_and_terminate,2,
            [{file,"src/rabbit_mirror_queue_master.erl"},{line,190}]},
        {rabbit_amqqueue_process,'-terminate_delete/3-fun-1-',6,
            [{file,"src/rabbit_amqqueue_process.erl"},{line,159}]},
        {rabbit_amqqueue_process,terminate_shutdown,2,
            [{file,"src/rabbit_amqqueue_process.erl"},{line,320}]},
        {gen_server2,terminate,3,[{file,"src/gen_server2.erl"},{line,1119}]},
        {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,249}]}]}
** In 'terminate' callback with reason ==
** normal

Comment 1 Fabio Massimo Di Nitto 2016-10-21 05:31:57 UTC
Please provide all sosreports from the environment.

Before backporting any fix we need to make sure that there are no other causes that are triggering this problem.

Comment 2 Peter Lemenkov 2016-10-21 11:27:29 UTC
Fix applied in rabbitmq-server-3.3.5-23.el7ost. As soon as we verify that this is indeed caused by GH issue no. 812, we'll propose this build as a fix.

Comment 3 Peter Lemenkov 2016-10-21 12:51:30 UTC
Interestingly, I found another issue while inspecting the sosreport logs, namely this one:

https://github.com/rabbitmq/rabbitmq-server/issues/255.

This log message points to that issue:

=================================
=SUPERVISOR REPORT==== 18-Oct-2016::17:12:52 ===
     Supervisor: {local,rabbit_amqqueue_sup}
     Context:    child_terminated
     Reason:     {{case_clause,{empty,{[],[]}}},
                  [{rabbit_queue_consumers,subtract_acks,4,
                       [{file,"src/rabbit_queue_consumers.erl"},{line,274}]},
                   {rabbit_queue_consumers,subtract_acks,3,
                       [{file,"src/rabbit_queue_consumers.erl"},{line,252}]},
                   {rabbit_amqqueue_process,subtract_acks,4,
                       [{file,"src/rabbit_amqqueue_process.erl"},{line,660}]},
                   {rabbit_amqqueue_process,handle_cast,2,
                       [{file,"src/rabbit_amqqueue_process.erl"},{line,1082}]},
                   {gen_server2,handle_msg,2,
                       [{file,"src/gen_server2.erl"},{line,1022}]},
                   {proc_lib,init_p_do_apply,3,
                       [{file,"proc_lib.erl"},{line,239}]}]}
     Offender:   [{pid,<0.26511.0>},
                  {name,rabbit_amqqueue},
                  {mfargs,{rabbit_amqqueue_process,start_link,undefined}},
                  {restart_type,temporary},
                  {shutdown,4294967295},
                  {child_type,worker}]
=================================

Comment 4 Peter Lemenkov 2016-10-21 12:53:08 UTC
Patch available.

Comment 6 James Biao 2016-10-24 00:13:20 UTC
The customer has a further question. The rabbitmq cluster became completely unresponsive several hours after rabbit node-002 started logging "rabbit_mirror_queue_master:stop_all_slaves". Can this issue cause the cluster to stop responding to messages?

Regarding the second bug: unknown acks (e.g. after a network partition heals) should be handled gracefully. Is there any plan to backport this?

Provided sosreports from 2 sites that were having the issue. The second site went into a partition and saw the same error.

Comment 7 Fabio Massimo Di Nitto 2016-10-24 03:55:46 UTC
(In reply to James Biao from comment #6)
> The customer has a further question. The rabbitmq cluster became completely
> unresponsive several hours after rabbit node-002 started logging
> "rabbit_mirror_queue_master:stop_all_slaves". Can this issue cause the
> cluster to stop responding to messages?
> 
> Regarding the second bug: unknown acks (e.g. after a network partition
> heals) should be handled gracefully. Is there any plan to backport this?
> 
> Provided sosreports from 2 sites that were having the issue. The second
> site went into a partition and saw the same error.

James, if the customer is observing two bugs, please file 2 separate bugzillas. Here we will track only this specific one.

Comment 8 James Biao 2016-10-24 06:43:20 UTC
(In reply to Fabio Massimo Di Nitto from comment #7)

Sure. I'll open a new bz. Let's focus on the original issue in this one.

Comment 9 Peter Lemenkov 2016-10-24 12:12:03 UTC
Ok, this issue (see comment 1) is addressed in rabbitmq-server-3.3.5-23.el7ost. For details regarding the other issue mentioned here (GH#255), see bug 1387988.

Comment 10 Peter Lemenkov 2016-10-25 14:38:53 UTC
(In reply to James Biao from comment #6)
> The customer has a further question. The rabbitmq cluster became completely
> unresponsive several hours after rabbit node-002 started logging
> "rabbit_mirror_queue_master:stop_all_slaves". Can this issue cause the
> cluster to stop responding to messages?

The short answer is yes.

Comment 16 Peter Lemenkov 2016-11-02 18:04:58 UTC
Please try this package - rabbitmq-server-3.3.5-25.el7ost
It should fully address this issue.

Comment 28 errata-xmlrpc 2017-01-19 13:33:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0167.html

