Bug 1387474

Summary: Queue master process terminates in rabbit_mirror_queue_master:stop_all_slaves on promotion
Product: Red Hat OpenStack
Reporter: James Biao <jbiao>
Component: rabbitmq-server
Assignee: Peter Lemenkov <plemenko>
Status: CLOSED ERRATA
QA Contact: Asaf Hirshberg <ahirshbe>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 5.0 (RHEL 7)
CC: apevec, dmaley, fdinitto, jbiao, jeckersb, jthomas, lhh, plemenko, srevivo
Target Milestone: async
Keywords: ZStream
Target Release: 5.0 (RHEL 7)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: rabbitmq-server-3.3.5-25.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1391186 1391188 1391190
Environment:
Last Closed: 2017-01-19 13:33:13 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1319334
Bug Blocks:

Description James Biao 2016-10-21 02:02:20 UTC
Description of problem:

Request to backport the fix from bug 1348276 to OSP 5.

Upstream bug: https://github.com/rabbitmq/rabbitmq-server/issues/812

Version-Release number of selected component (if applicable):
rabbitmq-server-3.3.5-18.el7ost.noarch

How reproducible:


Steps to Reproduce:


Additional info:

=ERROR REPORT==== 17-Oct-2016::04:32:57 ===
** Generic server <0.800.0> terminating
** Last message in was {maybe_expire,2}
** When Server state == {q,
                         {amqqueue,
                          {resource,<<"/">>,queue,
 ......
                          
** Reason for termination ==
** {timeout_value,
       [{rabbit_mirror_queue_master,'-stop_all_slaves/2-lc$^1/1-1-',3,
            [{file,"src/rabbit_mirror_queue_master.erl"},{line,202}]},
        {rabbit_mirror_queue_master,stop_all_slaves,2,
            [{file,"src/rabbit_mirror_queue_master.erl"},{line,202}]},
        {rabbit_mirror_queue_master,delete_and_terminate,2,
            [{file,"src/rabbit_mirror_queue_master.erl"},{line,190}]},
        {rabbit_amqqueue_process,'-terminate_delete/3-fun-1-',6,
            [{file,"src/rabbit_amqqueue_process.erl"},{line,159}]},
        {rabbit_amqqueue_process,terminate_shutdown,2,
            [{file,"src/rabbit_amqqueue_process.erl"},{line,320}]},
        {gen_server2,terminate,3,[{file,"src/gen_server2.erl"},{line,1119}]},
        {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,249}]}]}
** In 'terminate' callback with reason ==
** normal
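
To check other log files for the same crash, the signature above (a `timeout_value` termination reason with a `rabbit_mirror_queue_master` `stop_all_slaves` frame) can be searched for mechanically. The following is a minimal Python sketch, not part of RabbitMQ or any Red Hat tooling. The function name `find_stop_all_slaves_crashes` and the approach of splitting the log on `=... REPORT====` header lines are assumptions made for illustration.

```python
import re

# Report headers in the log look like:
#   =ERROR REPORT==== 17-Oct-2016::04:32:57 ===
REPORT_HEADER = re.compile(
    r"^=(?P<kind>[A-Z ]+) REPORT==== (?P<stamp>\S+) ===\s*$",
    re.MULTILINE,
)

# Matches the stack frame {rabbit_mirror_queue_master,stop_all_slaves,2,...}
STOP_ALL_SLAVES_FRAME = re.compile(r"rabbit_mirror_queue_master,\s*stop_all_slaves")


def find_stop_all_slaves_crashes(log_text):
    """Return the timestamps of reports that contain the crash signature.

    Hypothetical helper: a report counts as a hit when its body mentions
    both 'timeout_value' and a stop_all_slaves stack frame.
    """
    headers = list(REPORT_HEADER.finditer(log_text))
    hits = []
    for i, header in enumerate(headers):
        # A report body runs until the next report header (or end of text).
        end = headers[i + 1].start() if i + 1 < len(headers) else len(log_text)
        body = log_text[header.end():end]
        if "timeout_value" in body and STOP_ALL_SLAVES_FRAME.search(body):
            hits.append(header.group("stamp"))
    return hits
```

Reading a node's log file into a string and passing it to `find_stop_all_slaves_crashes` yields one timestamp per matching report; an empty list means the signature was not seen in that file.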

Comment 1 Fabio Massimo Di Nitto 2016-10-21 05:31:57 UTC
Please provide all sosreports from the environment.

Before backporting any fix, we need to make sure that no other causes are triggering this problem.

Comment 2 Peter Lemenkov 2016-10-21 11:27:29 UTC
Fix applied in rabbitmq-server-3.3.5-23.el7ost. As soon as we verify that this is indeed caused by GH issue no. 812, we'll propose this build as a fix.

Comment 3 Peter Lemenkov 2016-10-21 12:51:30 UTC
Interestingly, I found another issue while inspecting the sosreport logs, namely this one:

https://github.com/rabbitmq/rabbitmq-server/issues/255

This log message points to that issue:

=================================
=SUPERVISOR REPORT==== 18-Oct-2016::17:12:52 ===
     Supervisor: {local,rabbit_amqqueue_sup}
     Context:    child_terminated
     Reason:     {{case_clause,{empty,{[],[]}}},
                  [{rabbit_queue_consumers,subtract_acks,4,
                       [{file,"src/rabbit_queue_consumers.erl"},{line,274}]},
                   {rabbit_queue_consumers,subtract_acks,3,
                       [{file,"src/rabbit_queue_consumers.erl"},{line,252}]},
                   {rabbit_amqqueue_process,subtract_acks,4,
                       [{file,"src/rabbit_amqqueue_process.erl"},{line,660}]},
                   {rabbit_amqqueue_process,handle_cast,2,
                       [{file,"src/rabbit_amqqueue_process.erl"},{line,1082}]},
                   {gen_server2,handle_msg,2,
                       [{file,"src/gen_server2.erl"},{line,1022}]},
                   {proc_lib,init_p_do_apply,3,
                       [{file,"proc_lib.erl"},{line,239}]}]}
     Offender:   [{pid,<0.26511.0>},
                  {name,rabbit_amqqueue},
                  {mfargs,{rabbit_amqqueue_process,start_link,undefined}},
                  {restart_type,temporary},
                  {shutdown,4294967295},
                  {child_type,worker}]
=================================

Comment 4 Peter Lemenkov 2016-10-21 12:53:08 UTC
Patch available.

Comment 6 James Biao 2016-10-24 00:13:20 UTC
The customer has a further question. The RabbitMQ cluster became completely unresponsive several hours after rabbit node-002 started to log "rabbit_mirror_queue_master:stop_all_slaves". Can this issue cause the cluster to stop responding to messages?

Regarding the second bug: unknown acks (e.g. after a network partition heals) should be handled gracefully. Is there any plan to backport this?

I have provided sosreports from two sites that were having the issue. The second site went into a network partition and saw the same error.

Comment 7 Fabio Massimo Di Nitto 2016-10-24 03:55:46 UTC
(In reply to James Biao from comment #6)
> The customer has a further question. The RabbitMQ cluster became completely
> unresponsive several hours after rabbit node-002 started to log
> "rabbit_mirror_queue_master:stop_all_slaves". Can this issue cause the
> cluster to stop responding to messages?
> 
> Regarding the second bug: unknown acks (e.g. after a network partition
> heals) should be handled gracefully. Is there any plan to backport this?
> 
> I have provided sosreports from two sites that were having the issue. The
> second site went into a network partition and saw the same error.

James, if the customer is observing two bugs, please file two separate Bugzillas. We will track only this specific one here.

Comment 8 James Biao 2016-10-24 06:43:20 UTC
(In reply to Fabio Massimo Di Nitto from comment #7)

Sure. I'll open a new bz. Let's focus on the original issue in this one.

Comment 9 Peter Lemenkov 2016-10-24 12:12:03 UTC
OK, this issue (see comment 1) is addressed in rabbitmq-server-3.3.5-23.el7ost. For details on the other issue mentioned here (GH#255), see bug 1387988.

Comment 10 Peter Lemenkov 2016-10-25 14:38:53 UTC
(In reply to James Biao from comment #6)
> The customer has a further question. The RabbitMQ cluster became completely
> unresponsive several hours after rabbit node-002 started to log
> "rabbit_mirror_queue_master:stop_all_slaves". Can this issue cause the
> cluster to stop responding to messages?

The short answer is yes.

Comment 16 Peter Lemenkov 2016-11-02 18:04:58 UTC
Please try this package: rabbitmq-server-3.3.5-25.el7ost.
It should fully address this issue.
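
To check whether a host is already at or past this build, the release number in the installed package string can be compared with 25. The following is a rough Python sketch that only understands package strings of the form seen in this report (`rabbitmq-server-3.3.5-<release>.el7ost`, optionally followed by `.noarch`). It is not a general RPM version comparison, and the names `release_number` and `at_or_after_fixed_build` are hypothetical.

```python
import re

# Release number of the build named in this comment:
# rabbitmq-server-3.3.5-25.el7ost
FIXED_RELEASE = 25

# Assumption: only the 3.3.5 el7ost package naming from this report is handled.
PACKAGE_PATTERN = re.compile(
    r"^rabbitmq-server-3\.3\.5-(\d+)\.el7ost(?:\.noarch)?$"
)


def release_number(package):
    """Extract the integer release (e.g. 18) from a package string."""
    match = PACKAGE_PATTERN.match(package.strip())
    if match is None:
        raise ValueError("unrecognised package string: %r" % package)
    return int(match.group(1))


def at_or_after_fixed_build(package):
    """True if the package release is 25 or later."""
    return release_number(package) >= FIXED_RELEASE
```

The package string would typically come from the output of `rpm -q rabbitmq-server`. For example, the originally reported `rabbitmq-server-3.3.5-18.el7ost.noarch` is below the fixed build under this check.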

Comment 28 errata-xmlrpc 2017-01-19 13:33:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0167.html