Bug 1283291 - [Unconfirmed] rabbitmq-3.3.5 failed to start after fence/reboot
Status: CLOSED NOTABUG
Product: Red Hat OpenStack
Classification: Red Hat
Component: rabbitmq-server
Version: 7.0 (Kilo)
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: 8.0 (Liberty)
Assigned To: Peter Lemenkov
QA Contact: yeylon@redhat.com
Depends On:
Blocks:
Reported: 2015-11-18 10:47 EST by Robin Cernin
Modified: 2016-04-18 03:11 EDT
5 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-02-01 10:28:49 EST
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---


Attachments (Terms of Use)

  None
Description Robin Cernin 2015-11-18 10:47:40 EST
Description of problem:

Version-Release number of selected component (if applicable):

RabbitMQ 3.3.5 on Erlang R16B03

How reproducible:
rabbitmq-server failed to start after a fence/reboot of node1 in the cluster.
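For context, the fence/reboot can be approximated as follows (a sketch only: the node name is taken from this report, and the exact fencing agent/method used in this environment is an assumption):

# Sketch: fence node1 through pacemaker; any hard reboot of node1 should
# trigger the same situation
pcs stonith fence node1-clu

# Once node1 comes back up, the broker fails during boot, as shown below
tail -f /var/log/messages | grep rabbitmq-server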

Here is the stack trace during boot, as seen in /var/log/messages:

rabbitmq-server: RabbitMQ 3.3.5. Copyright (C) 2007-2014 GoPivotal, Inc.
rabbitmq-server: ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
rabbitmq-server: ##  ##
rabbitmq-server: ##########  Logs: /var/log/rabbitmq/rabbit@node1-clu.log
rabbitmq-server: ######  ##        /var/log/rabbitmq/rabbit@node1-clu-sasl.log
rabbitmq-server: ##########
rabbitmq-server: Starting broker...
rabbitmq-server: BOOT FAILED
rabbitmq-server: ===========
rabbitmq-server: Error description:
rabbitmq-server: {error,mnesia_not_running}
rabbitmq-server: Log files (may contain more information):
rabbitmq-server: /var/log/rabbitmq/rabbit@node1-clu.log
rabbitmq-server: /var/log/rabbitmq/rabbit@node1-clu-sasl.log
rabbitmq-server: Stack trace:
rabbitmq-server: [{rabbit_mnesia,ensure_mnesia_running,0,
rabbitmq-server: [{file,"src/rabbit_mnesia.erl"},{line,485}]},
rabbitmq-server: {rabbit_mnesia,init,0,[{file,"src/rabbit_mnesia.erl"},{line,95}]},
rabbitmq-server: {rabbit,'-run_boot_step/1-lc$^1/1-1-',1,
rabbitmq-server: [{file,"src/rabbit.erl"},{line,509}]},
rabbitmq-server: {rabbit,run_boot_step,1,[{file,"src/rabbit.erl"},{line,508}]},
rabbitmq-server: {rabbit,'-start/2-lc$^0/1-0-',1,[{file,"src/rabbit.erl"},{line,475}]},
rabbitmq-server: {rabbit,start,2,[{file,"src/rabbit.erl"},{line,475}]},
rabbitmq-server: {application_master,start_it_old,4,
rabbitmq-server: [{file,"application_master.erl"},{line,269}]}]
rabbitmq-server: BOOT FAILED
rabbitmq-server: ===========
rabbitmq-server: Error description:
rabbitmq-server: {could_not_start,rabbit,
rabbitmq-server: {bad_return,
rabbitmq-server: {{rabbit,start,[normal,[]]},
rabbitmq-server: {'EXIT',
rabbitmq-server: {rabbit,failure_during_boot,{error,mnesia_not_running}}}}}}
rabbitmq-server: Log files (may contain more information):
rabbitmq-server: /var/log/rabbitmq/rabbit@node1-clu.log
rabbitmq-server: /var/log/rabbitmq/rabbit@node1-clu-sasl.log
rabbitmq-server: {"init terminating in do_boot",{rabbit,failure_during_boot,{could_not_start,rabbit,{bad_return,{{rabbit,start,[normal,[]]},{'EXIT',{rabbit,failure_during_boot,{error,mnesia_not_running}}}}}}}}

It looks to me like there was a problem getting the membership information of the mnesia group members.

In the /var/log/rabbitmq/rabbit@node1-clu.log:

Mnesia('rabbit@node1-clu'): ** ERROR ** mnesia_late_loader got unexpected message: {'$gen_cast',
                                                                                              {'$gm',
                                                                                               60,
                                                                                               check_neighbours}}
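
If this really is a membership problem, the cluster view can be cross-checked from a surviving node; a minimal sketch, assuming rabbitmqctl can still talk to the local Erlang node:

# How a surviving node sees the cluster membership
rabbitmqctl cluster_status

# Ask mnesia directly which db nodes it knows about and which are running
rabbitmqctl eval 'mnesia:system_info(db_nodes).'
rabbitmqctl eval 'mnesia:system_info(running_db_nodes).'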


Here are some more entries from the log regarding check_neighbours:

** Generic server mnesia_snmp_sup terminating 
** Last message in was {'$gen_cast',{'$gm',60,check_neighbours}}
** When Server state == {state,
                            {local,mnesia_snmp_sup},
                            simple_one_for_one,
                            [{child,undefined,mnesia_snmp_sup,
                                 {mnesia_snmp_hook,start,[]},
                                 transient,3000,worker,
                                 [mnesia_snmp_sup,mnesia_snmp_hook,
                                  supervisor]}],
                            undefined,0,86400000,[],mnesia_snmp_sup,[]}

** Reason for termination == 
** {function_clause,
       [{supervisor,handle_cast,
            [{'$gm',60,check_neighbours},
             {state,
                 {local,mnesia_snmp_sup},
                 simple_one_for_one,
                 [{child,undefined,mnesia_snmp_sup,
                      {mnesia_snmp_hook,start,[]},
                      transient,3000,worker,
                      [mnesia_snmp_sup,mnesia_snmp_hook,supervisor]}],
                 undefined,0,86400000,[],mnesia_snmp_sup,[]}],
            [{file,"supervisor.erl"},{line,532}]},
        {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,604}]},
        {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,239}]}]}

Error in process <0.408.0> on node 'rabbit@node1-clu' with exit value: {badarg,[{mnesia_tm,return_abort,3,[{file,"mnesia_tm.erl"},{line,950}]},{mnesia_loader,init_receiver,5,[{file,"mnesia_loader.erl"},{line,269}]},{mnesia_loader,do_get_network_copy,5,[{file,"mnesia_loader.erl"},{line,206}]},{mnesia_controller...

To me it seems that it tried to restart mnesia, but ended up in a loop and hit reached_max_restart_intensity:

     Supervisor: {local,mnesia_kernel_sup}
     Context:    shutdown
     Reason:     reached_max_restart_intensity
     Offender:   [{pid,<0.342.0>},
                  {name,mnesia_snmp_sup},
                  {mfargs,{mnesia_snmp_sup,start,[]}},
                  {restart_type,permanent},
                  {shutdown,infinity},
                  {child_type,supervisor}]
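
reached_max_restart_intensity means that mnesia_kernel_sup restarted mnesia_snmp_sup more often than its restart intensity allows and then shut itself down, taking mnesia with it. As a sketch (not from the original report), mnesia's state on the node can be confirmed afterwards, assuming the Erlang VM is still up:

# Check whether mnesia is still running inside the broker's Erlang VM
rabbitmqctl eval 'mnesia:system_info(is_running).'

# List the OTP applications that actually reached the started state
rabbitmqctl eval 'application:which_applications().'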


And here is the crasher:

  crasher:
    initial call: supervisor:mnesia_snmp_sup/1
    pid: <0.342.0>
    registered_name: mnesia_snmp_sup
    exception exit: {function_clause,
                        [{supervisor,handle_cast,
                             [{'$gm',60,check_neighbours},
                              {state,
                                  {local,mnesia_snmp_sup},
                                  simple_one_for_one,
                                  [{child,undefined,mnesia_snmp_sup,
                                       {mnesia_snmp_hook,start,[]},
                                       transient,3000,worker,
                                       [mnesia_snmp_sup,mnesia_snmp_hook,
                                        supervisor]}],
                                  undefined,0,86400000,[],mnesia_snmp_sup,[]}],
                             [{file,"supervisor.erl"},{line,532}]},
                         {gen_server,handle_msg,5,
                             [{file,"gen_server.erl"},{line,604}]},
                         {proc_lib,init_p_do_apply,3,
                             [{file,"proc_lib.erl"},{line,239}]}]}
      in function  gen_server:terminate/6 (gen_server.erl, line 744)
    ancestors: [mnesia_kernel_sup,mnesia_sup,<0.56.0>]
    messages: []
    links: [<0.59.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 610
    stack_size: 27
    reductions: 318
  neighbours:

I notice that the neighbours: list is empty here.

Actual results:

rabbitmq-server failed to start.

Expected results:

rabbitmq-server started with return code 0

Additional info:

RabbitMQ is clustered with pacemaker, 3 nodes in total.
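
Since rabbitmq runs under pacemaker here, the resource state across the three nodes can be checked and the failed instance cleaned up; a sketch, where the resource name "rabbitmq" is an assumption (the actual resource name in this deployment is not shown in the report):

# Overall cluster and resource state across the three nodes
pcs status

# Clear the failure history for the resource so pacemaker retries it
# (resource name "rabbitmq" is an assumption)
pcs resource cleanup rabbitmq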

Not sure whether this is a known bug in 3.3.5 that already has a fix, or a new one that is also present in older versions.

Regards,
Robin Černín
Comment 5 Peter Lemenkov 2015-11-25 07:11:48 EST
Ok, here is what's going on there.

The problem is that the RabbitMQ core got stuck in a semi-stopped state. Please try running:

rabbitmqctl stop_app
rabbitmqctl start_app

on that node. Or just restart the entire node again.
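
For completeness, a hedged sketch of the "restart the entire node" alternative: restarting the node's cluster stack (rather than rebooting the host) also restarts the pacemaker-managed rabbitmq:

# Run on the affected node; pacemaker stops and later restarts its managed
# resources, including rabbitmq, as part of this
pcs cluster stop
pcs cluster start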
Comment 6 Robin Cernin 2016-02-01 10:28:49 EST
Thank you, closing this BZ as it is resolved now.

Thank you
Kind Regards,
Robin Černín
