Bug 1283291

Summary: [Unconfirmed] rabbitmq-3.3.5 failed to start after fence/reboot
Product: Red Hat OpenStack Reporter: Robin Cernin <rcernin>
Component: rabbitmq-serverAssignee: Peter Lemenkov <plemenko>
Status: CLOSED NOTABUG QA Contact: yeylon <yeylon>
Severity: medium Docs Contact:
Priority: unspecified    
Version: 7.0 (Kilo)CC: apevec, jeckersb, lhh, srevivo, yeylon
Target Milestone: ---   
Target Release: 8.0 (Liberty)   
Hardware: Unspecified   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2016-02-01 15:28:49 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Robin Cernin 2015-11-18 15:47:40 UTC
Description of problem:

Version-Release number of selected component (if applicable):

RabbitMQ 3.3.5 on Erlang R16B03

How reproducible:
rabbitmq-server failed to start after fence/reboot of node1 in cluster.

Here is stack trace during boot visible in /var/log/messages:

rabbitmq-server: RabbitMQ 3.3.5. Copyright (C) 2007-2014 GoPivotal, Inc.
rabbitmq-server: ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
rabbitmq-server: ##  ##
rabbitmq-server: ##########  Logs: /var/log/rabbitmq/rabbit
rabbitmq-server: ######  ##        /var/log/rabbitmq/rabbit
rabbitmq-server: ##########
rabbitmq-server: Starting broker...
rabbitmq-server: BOOT FAILED
rabbitmq-server: ===========
rabbitmq-server: Error description:
rabbitmq-server: {error,mnesia_not_running}
rabbitmq-server: Log files (may contain more information):
rabbitmq-server: /var/log/rabbitmq/rabbit
rabbitmq-server: /var/log/rabbitmq/rabbit
rabbitmq-server: Stack trace:
rabbitmq-server: [{rabbit_mnesia,ensure_mnesia_running,0,
rabbitmq-server: [{file,"src/rabbit_mnesia.erl"},{line,485}]},
rabbitmq-server: {rabbit_mnesia,init,0,[{file,"src/rabbit_mnesia.erl"},{line,95}]},
rabbitmq-server: {rabbit,'-run_boot_step/1-lc$^1/1-1-',1,
rabbitmq-server: [{file,"src/rabbit.erl"},{line,509}]},
rabbitmq-server: {rabbit,run_boot_step,1,[{file,"src/rabbit.erl"},{line,508}]},
rabbitmq-server: {rabbit,'-start/2-lc$^0/1-0-',1,[{file,"src/rabbit.erl"},{line,475}]},
rabbitmq-server: {rabbit,start,2,[{file,"src/rabbit.erl"},{line,475}]},
rabbitmq-server: {application_master,start_it_old,4,
rabbitmq-server: [{file,"application_master.erl"},{line,269}]}]
rabbitmq-server: BOOT FAILED
rabbitmq-server: ===========
rabbitmq-server: Error description:
rabbitmq-server: {could_not_start,rabbit,
rabbitmq-server: {bad_return,
rabbitmq-server: {{rabbit,start,[normal,[]]},
rabbitmq-server: {'EXIT',
rabbitmq-server: {rabbit,failure_during_boot,{error,mnesia_not_running}}}}}}
rabbitmq-server: Log files (may contain more information):
rabbitmq-server: /var/log/rabbitmq/rabbit
rabbitmq-server: /var/log/rabbitmq/rabbit
rabbitmq-server: {"init terminating in do_boot",{rabbit,failure_during_boot,{could_not_start,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{rabbit,failure_during_boot,{error,mnesia_not_running}}}}}}}}
~              

It look like to me there was problem of getting the membership information of mnesia group members. 

In the /var/log/rabbitmq/rabbit:

Mnesia('rabbit@node1-clu'): ** ERROR ** mnesia_late_loader got unexpected message: {'$gen_cast',
                                                                                              {'$gm',
                                                                                               60,
                                                                                               check_neighbours}}


Here is some more report regarding the check_neighbours:

** Generic server mnesia_snmp_sup terminating 
** Last message in was {'$gen_cast',{'$gm',60,check_neighbours}}
** When Server state == {state,
                            {local,mnesia_snmp_sup},
                            simple_one_for_one,
                            [{child,undefined,mnesia_snmp_sup,
                                 {mnesia_snmp_hook,start,[]},
                                 transient,3000,worker,
                                 [mnesia_snmp_sup,mnesia_snmp_hook,
                                  supervisor]}],
                            undefined,0,86400000,[],mnesia_snmp_sup,[]}

** Reason for termination == 
** {function_clause,
       [{supervisor,handle_cast,
            [{'$gm',60,check_neighbours},
             {state,
                 {local,mnesia_snmp_sup},
                 simple_one_for_one,
                 [{child,undefined,mnesia_snmp_sup,
                      {mnesia_snmp_hook,start,[]},
                      transient,3000,worker,
                      [mnesia_snmp_sup,mnesia_snmp_hook,supervisor]}],
                 undefined,0,86400000,[],mnesia_snmp_sup,[]}],
            [{file,"supervisor.erl"},{line,532}]},
        {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,604}]},
        {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}

Error in process <0.408.0> on node 'rabbit@node1-clu' with exit value: {badarg,[{mnesia_tm,return_abort,3,[{file,"mnesia_tm.erl"},{line,950}]},{mnesia_loader,init_receiver,5,[{file,"mnesia_loader.erl"},{line,269}]},{mnesia_loader,do_get_network_copy,5,[{file,"mnesia_loader.erl"},{line,206}]},{mnesia_controller...
~                   

To me it seems that it wanted to restart mnesia, however ended up in the loop and reached_max_restart_intensity:

     Supervisor: {local,mnesia_kernel_sup}
     Context:    shutdown
     Reason:     reached_max_restart_intensity
     Offender:   [{pid,<0.342.0>},
                  {name,mnesia_snmp_sup},
                  {mfargs,{mnesia_snmp_sup,start,[]}},
                  {restart_type,permanent},
                  {shutdown,infinity},
                  {child_type,supervisor}]


And here is the crasher:

  crasher:
    initial call: supervisor:mnesia_snmp_sup/1
    pid: <0.342.0>
    registered_name: mnesia_snmp_sup
    exception exit: {function_clause,
                        [{supervisor,handle_cast,
                             [{'$gm',60,check_neighbours},
                              {state,
                                  {local,mnesia_snmp_sup},
                                  simple_one_for_one,
                                  [{child,undefined,mnesia_snmp_sup,
                                       {mnesia_snmp_hook,start,[]},
                                       transient,3000,worker,
                                       [mnesia_snmp_sup,mnesia_snmp_hook,
                                        supervisor]}],
                                  undefined,0,86400000,[],mnesia_snmp_sup,[]}],
                             [{file,"supervisor.erl"},{line,532}]},
                         {gen_server,handle_msg,5,
                             [{file,"gen_server.erl"},{line,604}]},
                         {proc_lib,init_p_do_apply,3,
                             [{file,"proc_lib.erl"},{line,239}]}]}
      in function  gen_server:terminate/6 (gen_server.erl, line 744)
    ancestors: [mnesia_kernel_sup,mnesia_sup,<0.56.0>]
    messages: []
    links: [<0.59.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 610
    stack_size: 27
    reductions: 318
  neighbours:

I am noticing here that the neighbours: is empty here.

Actual results:

rabbitmq-server failed to start.

Expected results:

rabbitmq-server started with return code 0

Additional info:

rabbitmq is clustered with pacemaker in total of 3 nodes.

not sure if there was a bug in 3.3.5 which could have fix this or is it a new one which is present also in older versions.

Regards,
Robin Černín

Comment 5 Peter Lemenkov 2015-11-25 12:11:48 UTC
Ok, here is what's going on there.

The problem is that RabbitMQ core got stuck in a semi-stopped state. Please next try run:

rabbitmqctl stop_app
rabbitmqctl start_app

on that node. Or just restart the entire node again.

Comment 6 Robin Cernin 2016-02-01 15:28:49 UTC
Thank you, closing this BZ as it is resolved now.

Thank you
Kind Regards,
Robin Černín