Description of problem:
Request to backport Bug 1348276 to OSP 5.
Upstream bug: https://github.com/rabbitmq/rabbitmq-server/issues/812

Version-Release number of selected component (if applicable):
rabbitmq-server-3.3.5-18.el7ost.noarch

How reproducible:

Steps to Reproduce:

Additional info:

=ERROR REPORT==== 17-Oct-2016::04:32:57 ===
** Generic server <0.800.0> terminating
** Last message in was {maybe_expire,2}
** When Server state == {q,
       {amqqueue,
           {resource,<<"/">>,queue,
......
** Reason for termination ==
** {timeout_value,
       [{rabbit_mirror_queue_master,'-stop_all_slaves/2-lc$^1/1-1-',3,
            [{file,"src/rabbit_mirror_queue_master.erl"},{line,202}]},
        {rabbit_mirror_queue_master,stop_all_slaves,2,
            [{file,"src/rabbit_mirror_queue_master.erl"},{line,202}]},
        {rabbit_mirror_queue_master,delete_and_terminate,2,
            [{file,"src/rabbit_mirror_queue_master.erl"},{line,190}]},
        {rabbit_amqqueue_process,'-terminate_delete/3-fun-1-',6,
            [{file,"src/rabbit_amqqueue_process.erl"},{line,159}]},
        {rabbit_amqqueue_process,terminate_shutdown,2,
            [{file,"src/rabbit_amqqueue_process.erl"},{line,320}]},
        {gen_server2,terminate,3,[{file,"src/gen_server2.erl"},{line,1119}]},
        {proc_lib,wake_up,3,[{file,"proc_lib.erl"},{line,249}]}]}
** In 'terminate' callback with reason ==
** normal
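For context on the termination reason: in Erlang, a `receive ... after T` expression whose timeout `T` is negative (or otherwise not a valid timeout value) exits the process with reason `timeout_value`, which matches the crash reason above. The `'-stop_all_slaves/2-lc$^1/1-1-'` frame indicates the error was raised inside a list comprehension in `stop_all_slaves/2`. A minimal illustration in the `erl` shell (this only demonstrates the failure mode, it is not the upstream patch):

    1> Wait = fun(T) -> receive stop -> ok after T -> timed_out end end.
    2> Wait(100).      %% a non-negative integer timeout is valid
    timed_out
    3> Wait(-1).       %% a negative timeout raises timeout_value
    ** exception error: bad receive timeout value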
Please provide all sosreports from the environment. Before backporting any fix we need to make sure that there are no other causes that are triggering this problem.
Fix applied in rabbitmq-server-3.3.5-23.el7ost. As soon as we verify that this is indeed caused by GH issue no. 812, we'll propose this build as a fix.
Interestingly, I've found another issue after inspecting the SOS logs, namely this one: https://github.com/rabbitmq/rabbitmq-server/issues/255. This log message points to that issue:

=================================
=SUPERVISOR REPORT==== 18-Oct-2016::17:12:52 ===
     Supervisor: {local,rabbit_amqqueue_sup}
     Context:    child_terminated
     Reason:     {{case_clause,{empty,{[],[]}}},
                  [{rabbit_queue_consumers,subtract_acks,4,
                       [{file,"src/rabbit_queue_consumers.erl"},{line,274}]},
                   {rabbit_queue_consumers,subtract_acks,3,
                       [{file,"src/rabbit_queue_consumers.erl"},{line,252}]},
                   {rabbit_amqqueue_process,subtract_acks,4,
                       [{file,"src/rabbit_amqqueue_process.erl"},{line,660}]},
                   {rabbit_amqqueue_process,handle_cast,2,
                       [{file,"src/rabbit_amqqueue_process.erl"},{line,1082}]},
                   {gen_server2,handle_msg,2,
                       [{file,"src/gen_server2.erl"},{line,1022}]},
                   {proc_lib,init_p_do_apply,3,
                       [{file,"proc_lib.erl"},{line,239}]}]}
     Offender:   [{pid,<0.26511.0>},
                  {name,rabbit_amqqueue},
                  {mfargs,{rabbit_amqqueue_process,start_link,undefined}},
                  {restart_type,temporary},
                  {shutdown,4294967295},
                  {child_type,worker}]
=================================
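For context on the `{case_clause,{empty,{[],[]}}}` reason: the OTP `queue` module represents an empty queue as `{[],[]}`, and `queue:out/1` on an empty queue returns `{empty,{[],[]}}`. A `case` expression that only matches the non-empty result dies with exactly this term, which is consistent with `subtract_acks` receiving an ack it has no record of (e.g. after a partition heals). A minimal `erl` shell illustration (this only demonstrates the failure mode, not the actual `subtract_acks` code):

    1> Q = queue:new().
    2> queue:out(Q).
    {empty,{[],[]}}
    3> case queue:out(Q) of
    3>     {{value, V}, Q2} -> {V, Q2}
    3> end.
    ** exception error: no case clause matching {empty,{[],[]}}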
Patch available.
Customer has a further question. It was observed that the rabbitmq cluster became completely unresponsive several hours after rabbit node-002 started to log "rabbit_mirror_queue_master:stop_all_slaves". Can this issue cause the cluster to stop responding to messages?

Regarding the second bug: unknown acks (e.g. after a network partition heals) should be handled gracefully; is there any plan to backport this?

Sosreports from the 2 sites that were hitting the issue have been provided. The second site got into a partition and saw the same error.
(In reply to James Biao from comment #6)
> Customer has a further question. It is observed that the rabbitmq cluster
> was non responsive at all several hours after rabbit node-002 started to log
> "rabbit_mirror_queue_master:stop_all_slaves". Can this issue cause the
> cluster not responding to the messages?
>
> Regarding the second bug, Unknown acks (e.g. after network partition heals)
> should be handled gracefully, is there any plan to backport this?
>
> Provided sosreports from 2 sites that was having the issue. The second site
> got into partition and saw the same error.

James, if the customer is observing two bugs, please file 2 separate bugzillas. Here we will track only this specific one.
(In reply to Fabio Massimo Di Nitto from comment #7)

Sure, I'll open a new bz. Let's focus on the original issue in this one.
Ok, this issue (see comment 1) is addressed in rabbitmq-server-3.3.5-23.el7ost. For details regarding another issue mentioned here (GH#255) see bug 1387988.
(In reply to James Biao from comment #6)
> Customer has a further question. It is observed that the rabbitmq cluster
> was non responsive at all several hours after rabbit node-002 started to log
> "rabbit_mirror_queue_master:stop_all_slaves". Can this issue cause the
> cluster not responding to the messages?

Short answer is yes.
Please try this package: rabbitmq-server-3.3.5-25.el7ost. It should fully address this issue.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2017-0167.html