Bug 1329103

Summary: after backup and restore of rabbitmq on 3 node OSP6 controller cluster, rabbitmq got stuck at startup
Product: Red Hat OpenStack
Reporter: Martin Schuppert <mschuppe>
Component: rabbitmq-server
Assignee: Peter Lemenkov <plemenko>
Status: CLOSED NOTABUG
QA Contact: Udi Shkalim <ushkalim>
Severity: medium
Priority: medium
Version: 6.0 (Juno)
CC: apevec, jeckersb, lhh, srevivo
Target Milestone: ---
Keywords: ZStream
Target Release: 6.0 (Juno)
Hardware: Unspecified
OS: Unspecified
Doc Type: Bug Fix
Last Closed: 2016-08-11 12:02:26 UTC
Type: Bug

Description Martin Schuppert 2016-04-21 08:28:07 UTC
Description of problem:

While working on a backup/restore of the environment (OSP6), after a restore test rabbitmq hung on 2 controllers and its processes had to be killed to recover (rabbitmq-server-3.3.5-22.el7ost):

Apr 11 22:54:34 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[13608]: INFO: Joining existing cluster with [ rabbit@lb-backend-controller-3  ] nodes.
Apr 11 22:54:34 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[13620]: INFO: Waiting for server to start
Apr 11 22:55:09 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[17226]: INFO: rabbitmq-server start failed: 2
Apr 11 22:55:09 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[17232]: INFO: node failed to join, wiping data directory and trying again
Apr 11 22:55:10 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[17369]: INFO: RabbitMQ server is not running
Apr 11 22:55:10 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[17380]: INFO: Joining existing cluster with [ rabbit@lb-backend-controller-3  ] nodes.
Apr 11 22:55:10 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[17388]: INFO: Waiting for server to start
Apr 11 22:55:13 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[18377]: INFO: Attempting to join cluster with target node rabbit@lb-backend-controller-3
Apr 11 22:55:13 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[18473]: INFO: Join process incomplete, shutting down.
Apr 11 22:55:13 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[18479]: INFO: node failed to join even after reseting local data. Check SELINUX policy
Apr 11 22:55:13 controller-2.nokia.ncio.localdomain lrmd[3269]: notice: rabbitmq-server_start_0:13295:stderr [ Error: process_not_running ]
Apr 11 22:55:13 controller-2.nokia.ncio.localdomain lrmd[3269]: notice: rabbitmq-server_start_0:13295:stderr [ Error: mnesia_not_running ]
Apr 11 22:55:13 controller-2.nokia.ncio.localdomain crmd[3272]: notice: Operation rabbitmq-server_start_0: unknown error (node=pcmk-controller-2, call=1831, rc=1, cib-update=1248, confirmed=true)
Apr 11 22:55:13 controller-2.nokia.ncio.localdomain crmd[3272]: notice: pcmk-controller-2-rabbitmq-server_start_0:1831 [ Error: process_not_running\nError: mnesia_not_running\n ]
Apr 11 22:55:16 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[19085]: INFO: RabbitMQ server is not running
Apr 11 22:55:16 controller-2.nokia.ncio.localdomain crmd[3272]: notice: Operation rabbitmq-server_stop_0: ok (node=pcmk-controller-2, call=1832, rc=0, cib-update=1249, confirmed=true)
Apr 11 22:55:42 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[21285]: INFO: RabbitMQ server is not running
Apr 11 22:55:42 controller-2.nokia.ncio.localdomain crmd[3272]: notice: Operation rabbitmq-server_monitor_0: not running (node=pcmk-controller-2, call=1839, rc=7, cib-update=1255, confirmed=true)
Apr 11 22:59:35 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[38927]: INFO: RabbitMQ server is not running
Apr 11 22:59:36 controller-2.nokia.ncio.localdomain crmd[35676]: notice: Operation rabbitmq-server_monitor_0: not running (node=pcmk-controller-2, call=69, rc=7, cib-update=67, confirmed=true)

SELinux is probably not the root cause: they back up and restore with tar using the option to preserve SELinux contexts, and the restore failed only 1 out of 5 times. We cannot check, however, as the environment has already been recovered.

From rabbit:
=INFO REPORT==== 11-Apr-2016::22:55:06 ===
Error description:
   {could_not_start,rabbit,
       {bad_return,
           {{rabbit,start,[normal,[]]},
            {'EXIT',
                {rabbit,failure_during_boot,
                    {error,
                        {timeout_waiting_for_tables,
                            [rabbit_user,rabbit_user_permission,rabbit_vhost,
                             rabbit_durable_route,rabbit_durable_exchange,
                             rabbit_runtime_parameters,
                             rabbit_durable_queue]}}}}}}}


=CRASH REPORT==== 11-Apr-2016::22:55:06 ===
  crasher:
    initial call: application_master:init/4
    pid: <0.110.0>
    registered_name: []
    exception exit: {bad_return,
                     {{rabbit,start,[normal,[]]},
                      {'EXIT',
                       {rabbit,failure_during_boot,
                        {error,
                         {timeout_waiting_for_tables,
                          [rabbit_user,rabbit_user_permission,rabbit_vhost,
                           rabbit_durable_route,rabbit_durable_exchange,
                           rabbit_runtime_parameters,
                           rabbit_durable_queue]}}}}}}
      in function  application_master:init/4 (application_master.erl, line 133)
    ancestors: [<0.109.0>]
    messages: [{'EXIT',<0.111.0>,normal}]
    links: [<0.109.0>,<0.7.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 4185
    stack_size: 27
    reductions: 424
  neighbours:

What is the correct way to back up and restore RabbitMQ?

Their procedure is:

~~~
With full-restore, yes, we stop every service including rabbitmq before restoring the files. But in selective restore (only one or a couple of services) we only stop it on the node that is being restored. So basically we do these steps in selective restore:
1. pcs cluster standby node
2. wait until all resources stop on that node
3. restore directories and files
4. unstandby
5. wait until all resources start on the node
6. move on to the next node

In full restore:
1. Stop every service (pcs resource disable on every service on controller, systemctl stop on compute)
2. wait until every service stops
3. Run every plugin, which will restore its own files and directories (every service has its own plugin)
4. Start every service with pcs or with systemctl
5. wait until every service starts

But in this case we restored only rabbitmq, using full-restore.
~~~ 
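
For reference, the selective (node-by-node) procedure quoted above can be written out as an ordered plan. This is only a sketch: the `selective_restore_plan` helper is hypothetical, the node names are assumed from the pcmk-controller-N pattern in the log excerpt, and nothing is executed. The wait and restore steps are left as comments because the customer's tooling for them is not described here.

```python
def selective_restore_plan(nodes):
    """Build the ordered list of steps for a node-by-node (selective) restore.

    Nothing is executed; the steps are returned as strings for review.
    """
    plan = []
    for node in nodes:
        plan.append(f"pcs cluster standby {node}")
        plan.append(f"# wait until all resources have stopped on {node}")
        plan.append(f"# restore directories and files on {node}")
        plan.append(f"pcs cluster unstandby {node}")
        plan.append(f"# wait until all resources have started on {node}")
    return plan


# Hypothetical node names, assumed from the log excerpt.
for step in selective_restore_plan(["pcmk-controller-1", "pcmk-controller-2"]):
    print(step)
```

Each node is fully restored and back online before the next one is put into standby, which matches step 6 of the quoted procedure.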


Version-Release number of selected component (if applicable):
rabbitmq-server-3.3.5-22.el7ost

How reproducible:
once so far

Steps to Reproduce:
1. see full restore steps from above

Actual results:
rabbitmq process needed to be killed on 2 controllers to be able to restart

Expected results:
rabbitmq comes up without manual intervention

Comment 1 Martin Schuppert 2016-04-21 08:28:49 UTC
Just a few brief thoughts at a glance:

- It generally doesn't make a whole lot of sense to back up and restore
  rabbitmq.  The message data is all transient, the only thing you are
  really backing up probably is users, which can be recreated easily
  enough if something disastrous happens.

- The given error looks like either the backup/restore corrupted the
  mnesia database, or maybe the rabbitmq nodes were not started in the
  correct order (last to shut down must be first to start up).  I'd need
  to look into this more to be sure.
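
The ordering constraint in the second point (last to shut down must be first to start up) can be illustrated with a short sketch. The `start_order` helper and the node names are hypothetical, and it assumes the order in which the nodes were stopped was recorded by hand.

```python
def start_order(stop_order):
    """Return the order in which to start nodes, given the order they were stopped.

    The node that was stopped last is started first.
    """
    return list(reversed(stop_order))


# Hypothetical node names, listed in the order they were shut down.
stopped = ["rabbit@controller-1", "rabbit@controller-2", "rabbit@controller-3"]
print(start_order(stopped))
```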

Comment 2 Martin Schuppert 2016-04-21 08:39:59 UTC
>Have you stopped the nodes manually in an order and started them starting with the last shut down?

Pacemaker stopped rabbitmq when we issued "pcs resource disable rabbitmq-server", and pacemaker also started rabbitmq when we enabled the rabbitmq-server resource again. (Actually, we restore the pacemaker cib.xml, which does not contain the role=Stopped lines, so when pacemaker is started with that cib.xml it starts every service.)

I don't think there is currently a way to make pacemaker start the rabbit nodes in the reverse of the order in which they were stopped.

Should only /etc/rabbitmq be backed up?

Comment 4 Peter Lemenkov 2016-04-24 14:26:40 UTC
(In reply to Martin Schuppert from comment #2)
> >Have you stopped the nodes manually in an order and started them starting with the last shut down?
> 
> Pacemaker stopped rabbitmq when we issued "pcs resource disable
> rabbitmq-server", and pacemaker also started rabbitmq when we enabled
> the rabbitmq-server resource again. (Actually, we restore the pacemaker
> cib.xml, which does not contain the role=Stopped lines, so when
> pacemaker is started with that cib.xml it starts every service.)
> 
> I don't think there is currently a way to make pacemaker start the rabbit
> nodes in the reverse of the order in which they were stopped.
> 
> Should only /etc/rabbitmq be backed up?

Yes, please back up only the /etc/rabbitmq directory. You should also restore the rabbitmq internal users (if any).
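
A minimal sketch of archiving only the configuration directory, using Python's standard `tarfile` module. The `backup_config` helper and the archive location are assumptions for illustration, and the sketch makes no attempt to preserve SELinux labels or to export internal users.

```python
import tarfile
from pathlib import Path


def backup_config(config_dir, archive_path):
    """Write a gzip-compressed tar archive containing only config_dir."""
    config_dir = Path(config_dir)
    with tarfile.open(str(archive_path), "w:gz") as tar:
        # arcname stores paths relative to the directory name, not the full path.
        tar.add(str(config_dir), arcname=config_dir.name)
    return archive_path


# Example call (assumed archive location, adjust to the local backup layout):
# backup_config("/etc/rabbitmq", "/tmp/rabbitmq-etc.tar.gz")
```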