1387474 – Queue master process terminates in rabbit_mirror_queue_master:stop_all_slaves on promotion

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat OpenStack
Classification:	Red Hat
Component:	rabbitmq-server
Sub Component:
Version:	5.0 (RHEL 7)
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	async
Target Release:	5.0 (RHEL 7)
Assignee:	Peter Lemenkov
QA Contact:	Asaf Hirshberg
Docs Contact:
URL:
Whiteboard:
Depends On:	1319334
Blocks:

Reported:	2016-10-21 02:02 UTC by James Biao
Modified:	2020-02-14 18:04 UTC
CC List:	9 users
Fixed In Version:	rabbitmq-server-3.3.5-25.el7ost
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1391186 1391188 1391190
Environment:
Last Closed:	2017-01-19 13:33:13 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	rabbitmq rabbitmq-server issues 812	None	None	None	2016-10-21 11:23:05 UTC
Red Hat Bugzilla	1348276	unspecified	CLOSED	Queue master process terminates in rabbit_mirror_queue_master:stop_all_slaves on promotion	2021-02-22 00:41:40 UTC
Red Hat Product Errata	RHBA-2017:0167	normal	SHIPPED_LIVE	Red Hat Enterprise Linux OpenStack Platform 5 Bug Fix and Enhancement Advisory	2017-01-19 18:21:46 UTC

Description James Biao 2016-10-21 02:02:20 UTC

Description of problem:

Request to backport the fix for bug 1348276 to OSP 5.

Upstream bug: https://github.com/rabbitmq/rabbitmq-server/issues/812

Version-Release number of selected component (if applicable):
rabbitmq-server-3.3.5-18.el7ost.noarch

How reproducible:


Steps to Reproduce:


Additional info:

=ERROR REPORT==== 17-Oct-2016::04:32:57 ===
** Generic server <0.800.0> terminating
** Last message in was {maybe_expire,2}
** When Server state == {q,
                         {amqqueue,
                          {resource,<<"/">>,queue,
 ......
                          
** Reason for termination ==
** {timeout_value,
       [{rabbit_mirror_queue_master,'-stop_all_slaves/2-lc$^1/1-1-',3,
            [{file,"src/rabbit_mirror_queue_master.erl"},{line,202}]},
        {rabbit_mirror_queue_master,stop_all_slaves,2,
            [{file,"src/rabbit_mirror_queue_master.erl"},{line,202}]},
        {rabbit_mirror_queue_master,delete_and_terminate,2,
            [{file,"src/rabbit_mirror_queue_master.erl"},{line,190}]},
        {rabbit_amqqueue_process,'-terminate_delete/3-fun-1-',6,
            [{file,"src/rabbit_amqqueue_process.erl"},{line,159}]},
        {rabbit_amqqueue_process,terminate_shutdown,2,
            [{file,"src/rabbit_amqqueue_process.erl"},{line,320}]},
        {gen_server2,terminate,3,[{file,"src/gen_server2.erl"},{line,1119}]},
        {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,249}]}]}
** In 'terminate' callback with reason ==
** normal

Comment 1 Fabio Massimo Di Nitto 2016-10-21 05:31:57 UTC

Please provide all sosreports from the environment.

Before backporting any fix we need to make sure that there are no other causes that are triggering this problem.

Comment 2 Peter Lemenkov 2016-10-21 11:27:29 UTC

Fix applied in rabbitmq-server-3.3.5-23.el7ost. As soon as we verify that this is indeed caused by GH issue no. 812, we'll propose this build as a fix.

Comment 3 Peter Lemenkov 2016-10-21 12:51:30 UTC

Interestingly, I found another issue while inspecting the sosreport logs, namely this one:

https://github.com/rabbitmq/rabbitmq-server/issues/255

This log message points to that issue:

=================================
=SUPERVISOR REPORT==== 18-Oct-2016::17:12:52 ===
     Supervisor: {local,rabbit_amqqueue_sup}
     Context:    child_terminated
     Reason:     {{case_clause,{empty,{[],[]}}},
                  [{rabbit_queue_consumers,subtract_acks,4,
                       [{file,"src/rabbit_queue_consumers.erl"},{line,274}]},
                   {rabbit_queue_consumers,subtract_acks,3,
                       [{file,"src/rabbit_queue_consumers.erl"},{line,252}]},
                   {rabbit_amqqueue_process,subtract_acks,4,
                       [{file,"src/rabbit_amqqueue_process.erl"},{line,660}]},
                   {rabbit_amqqueue_process,handle_cast,2,
                       [{file,"src/rabbit_amqqueue_process.erl"},{line,1082}]},
                   {gen_server2,handle_msg,2,
                       [{file,"src/gen_server2.erl"},{line,1022}]},
                   {proc_lib,init_p_do_apply,3,
                       [{file,"proc_lib.erl"},{line,239}]}]}
     Offender:   [{pid,<0.26511.0>},
                  {name,rabbit_amqqueue},
                  {mfargs,{rabbit_amqqueue_process,start_link,undefined}},
                  {restart_type,temporary},
                  {shutdown,4294967295},
                  {child_type,worker}]
=================================

Comment 4 Peter Lemenkov 2016-10-21 12:53:08 UTC

Patch available.

Comment 6 James Biao 2016-10-24 00:13:20 UTC

The customer has a further question. The RabbitMQ cluster became completely unresponsive several hours after rabbit node-002 started to log "rabbit_mirror_queue_master:stop_all_slaves". Can this issue cause the cluster to stop responding to messages?

Regarding the second bug: unknown acks (e.g. after a network partition heals) should be handled gracefully. Is there any plan to backport this?

I have provided sosreports from the 2 sites that were having the issue. The second site got into a partition and saw the same error.

Comment 7 Fabio Massimo Di Nitto 2016-10-24 03:55:46 UTC

(In reply to James Biao from comment #6)
> The customer has a further question. The RabbitMQ cluster became completely
> unresponsive several hours after rabbit node-002 started to log
> "rabbit_mirror_queue_master:stop_all_slaves". Can this issue cause the
> cluster to stop responding to messages?
> 
> Regarding the second bug: unknown acks (e.g. after a network partition
> heals) should be handled gracefully. Is there any plan to backport this?
> 
> I have provided sosreports from the 2 sites that were having the issue. The
> second site got into a partition and saw the same error.

James, if the customer is observing two bugs, please file 2 separate bugzillas. Here we will track only this specific one.

Comment 8 James Biao 2016-10-24 06:43:20 UTC

(In reply to Fabio Massimo Di Nitto from comment #7)

Sure. I'll open a new bz. Let's focus on the original issue in this one.

Comment 9 Peter Lemenkov 2016-10-24 12:12:03 UTC

Ok, this issue (see comment 1) is addressed in rabbitmq-server-3.3.5-23.el7ost. For details regarding another issue mentioned here (GH#255) see bug 1387988.

Comment 10 Peter Lemenkov 2016-10-25 14:38:53 UTC

(In reply to James Biao from comment #6)
> The customer has a further question. The RabbitMQ cluster became completely
> unresponsive several hours after rabbit node-002 started to log
> "rabbit_mirror_queue_master:stop_all_slaves". Can this issue cause the
> cluster to stop responding to messages?

The short answer is yes.

Comment 16 Peter Lemenkov 2016-11-02 18:04:58 UTC

Please try this package: rabbitmq-server-3.3.5-25.el7ost.
It should fully address this issue.
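
For anyone checking whether a node already carries the fixed build, here is a minimal sketch in Python. It assumes the package string has the name-version-release.dist[.arch] shape seen in this report (for example the output of `rpm -q rabbitmq-server`). The helper names `parse_nvr` and `has_fix` are hypothetical, and the comparison is a simplified numeric one, not rpm's own version-comparison algorithm.

```python
import re

# Assumed layout: <name>-<version>-<release>.<dist>[.<arch>]
# e.g. rabbitmq-server-3.3.5-18.el7ost.noarch
NVR_PATTERN = re.compile(
    r"^(?P<name>.+)-(?P<version>\d+(?:\.\d+)*)-(?P<release>\d+)"
    r"\.(?P<dist>[^.]+)(?:\.(?P<arch>[^.]+))?$"
)


def parse_nvr(package):
    """Split a package string into (name, version tuple, release number)."""
    match = NVR_PATTERN.match(package)
    if match is None:
        raise ValueError("unrecognised package string: %r" % package)
    version = tuple(int(part) for part in match.group("version").split("."))
    return match.group("name"), version, int(match.group("release"))


def has_fix(installed, fixed="rabbitmq-server-3.3.5-25.el7ost"):
    """Return True if `installed` is at least the build named in `fixed`.

    Simplified: compares numeric version parts, then the leading release
    number. The dist tag and architecture are ignored.
    """
    name_i, version_i, release_i = parse_nvr(installed)
    name_f, version_f, release_f = parse_nvr(fixed)
    if name_i != name_f:
        raise ValueError("different packages: %s vs %s" % (name_i, name_f))
    return (version_i, release_i) >= (version_f, release_f)


if __name__ == "__main__":
    # Build reported in the description, then the build proposed here.
    for pkg in ("rabbitmq-server-3.3.5-18.el7ost.noarch",
                "rabbitmq-server-3.3.5-25.el7ost"):
        print(pkg, "has fix:", has_fix(pkg))
```

The default for `fixed` is the build named in this comment. Pass a different string to check against another build.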

Comment 28 errata-xmlrpc 2017-01-19 13:33:13 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2017-0167.html
