| Summary: | after backup and restore of rabbitmq on 3 node OSP6 controller cluster, rabbitmq got stuck at startup | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Martin Schuppert <mschuppe> |
| Component: | rabbitmq-server | Assignee: | Peter Lemenkov <plemenko> |
| Status: | CLOSED NOTABUG | QA Contact: | Udi Shkalim <ushkalim> |
| Severity: | medium | Docs Contact: | |
| Priority: | medium | | |
| Version: | 6.0 (Juno) | CC: | apevec, jeckersb, lhh, srevivo |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | 6.0 (Juno) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-08-11 12:02:26 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Just a few brief thoughts at a glance:

- It generally doesn't make a whole lot of sense to back up/restore rabbitmq. The message data is all transient; the only thing you are really backing up is probably users, which can be recreated easily enough if something disastrous happens.
- The given error looks like either the backup/restore corrupted the mnesia database, or the rabbitmq nodes were not started in the correct order (the last node to shut down must be the first to start up). I'd need to look into this more to be sure.

> Have you stopped the nodes manually in an order and started them starting with the last shut down?
Pacemaker stopped rabbitmq when we issued "pcs resource disable rabbitmq-server", and pacemaker started rabbitmq again when we re-enabled the rabbitmq-server resource. (Actually we restore the pacemaker cib.xml, which does not contain the role=Stopped lines, so when pacemaker is started with that cib.xml it will start every service.)
I don't think there is currently a way to make pacemaker start the rabbit nodes in the opposite order to the one they were stopped in.
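For illustration only, here is a hypothetical sketch of what such an ordered restart would look like if it were scripted outside pacemaker. The node names are placeholders and the commands are only printed, not executed, since rabbitmqctl and ssh access exist only on the controllers:

```shell
# Dry-run sketch: print an ordered stop/start plan instead of executing it.
# Node names are placeholders, not the nodes from this bug report.
restart_plan() {
  # Stop order is first-to-last; the last node to shut down
  # must be the first to start back up.
  local stop_order="controller-1 controller-2 controller-3"
  local start_order
  start_order=$(printf '%s\n' $stop_order | tac | tr '\n' ' ')
  for n in $stop_order;  do echo "ssh $n 'rabbitmqctl stop_app'";  done
  for n in $start_order; do echo "ssh $n 'rabbitmqctl start_app'"; done
}
restart_plan
```

The point of the sketch is only the ordering: the first start command targets the last node that was stopped.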
Should only /etc/rabbitmq be backed up?
(In reply to Martin Schuppert from comment #2)

> > Have you stopped the nodes manually in an order and started them starting with the last shut down?
>
> Pacemaker stopped the rabbitmq when we issued the "pcs resource disable
> rabbitmq-server". Also the pacemaker started the rabbitmq when we enabled
> the rabbitmq-server resource again. (Actually we restore the pacemaker
> cib.xml, which hasn't got the role=Stopped lines in it, so when pacemaker
> is started with that cib.xml it will start every service.)
>
> Don't think there is a way right now that pacemaker start the rabbit nodes
> in the opposite order they have been stopped.
>
> Should only /etc/rabbitmq be backed up?

Yes, please back up only the /etc/rabbitmq directory. You should restore the rabbitmq internal users (if any) separately.
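Following that advice, a minimal backup might look like the sketch below. It is an illustration only: a scratch directory stands in for /etc/rabbitmq so the commands can be tried safely anywhere, and the rabbitmqctl steps are shown as comments because internal users cannot be restored from a file backup alone:

```shell
# Hedged sketch of the suggested minimal backup: /etc/rabbitmq plus a record
# of internal users. A scratch directory stands in for /etc/rabbitmq here;
# on a real controller use the actual path and consider tar's --selinux
# option to preserve SELinux contexts.
src=$(mktemp -d)                          # stands in for /etc/rabbitmq
printf '%s\n' '[{rabbit, []}].' > "$src/rabbitmq.config"
backup=$(mktemp -u)-rabbitmq-etc.tar.gz
tar -czf "$backup" -C "$src" .

# Internal users are not captured by a file backup; record them so they can
# be recreated after the restore, e.g.:
#   rabbitmqctl list_users > rabbitmq-users.txt
#   rabbitmqctl add_user <name> <password>
#   rabbitmqctl set_permissions -p / <name> '.*' '.*' '.*'
tar -tzf "$backup"
```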
Description of problem:

While working on a backup/restore of the environment (OSP6): after a restore test, rabbitmq hung on 2 controllers and processes had to be killed to recover (rabbitmq-server-3.3.5-22.el7ost):

Apr 11 22:54:34 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[13608]: INFO: Joining existing cluster with [ rabbit@lb-backend-controller-3 ] nodes.
Apr 11 22:54:34 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[13620]: INFO: Waiting for server to start
Apr 11 22:55:09 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[17226]: INFO: rabbitmq-server start failed: 2
Apr 11 22:55:09 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[17232]: INFO: node failed to join, wiping data directory and trying again
Apr 11 22:55:10 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[17369]: INFO: RabbitMQ server is not running
Apr 11 22:55:10 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[17380]: INFO: Joining existing cluster with [ rabbit@lb-backend-controller-3 ] nodes.
Apr 11 22:55:10 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[17388]: INFO: Waiting for server to start
Apr 11 22:55:13 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[18377]: INFO: Attempting to join cluster with target node rabbit@lb-backend-controller-3
Apr 11 22:55:13 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[18473]: INFO: Join process incomplete, shutting down.
Apr 11 22:55:13 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[18479]: INFO: node failed to join even after reseting local data.
Check SELINUX policy

Apr 11 22:55:13 controller-2.nokia.ncio.localdomain lrmd[3269]: notice: rabbitmq-server_start_0:13295:stderr [ Error: process_not_running ]
Apr 11 22:55:13 controller-2.nokia.ncio.localdomain lrmd[3269]: notice: rabbitmq-server_start_0:13295:stderr [ Error: mnesia_not_running ]
Apr 11 22:55:13 controller-2.nokia.ncio.localdomain crmd[3272]: notice: Operation rabbitmq-server_start_0: unknown error (node=pcmk-controller-2, call=1831, rc=1, cib-update=1248, confirmed=true)
Apr 11 22:55:13 controller-2.nokia.ncio.localdomain crmd[3272]: notice: pcmk-controller-2-rabbitmq-server_start_0:1831 [ Error: process_not_running\nError: mnesia_not_running\n ]
Apr 11 22:55:16 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[19085]: INFO: RabbitMQ server is not running
Apr 11 22:55:16 controller-2.nokia.ncio.localdomain crmd[3272]: notice: Operation rabbitmq-server_stop_0: ok (node=pcmk-controller-2, call=1832, rc=0, cib-update=1249, confirmed=true)
Apr 11 22:55:42 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[21285]: INFO: RabbitMQ server is not running
Apr 11 22:55:42 controller-2.nokia.ncio.localdomain crmd[3272]: notice: Operation rabbitmq-server_monitor_0: not running (node=pcmk-controller-2, call=1839, rc=7, cib-update=1255, confirmed=true)
Apr 11 22:59:35 controller-2.nokia.ncio.localdomain rabbitmq-cluster(rabbitmq-server)[38927]: INFO: RabbitMQ server is not running
Apr 11 22:59:36 controller-2.nokia.ncio.localdomain crmd[35676]: notice: Operation rabbitmq-server_monitor_0: not running (node=pcmk-controller-2, call=69, rc=7, cib-update=67, confirmed=true)

SELinux is probably not the root cause, since they back up/restore to a tar archive using the option to preserve SELinux contexts and the restore failed only 1 time out of 5; we cannot verify, though, because the environment has already been recovered.
From rabbit:

=INFO REPORT==== 11-Apr-2016::22:55:06 ===
Error description:
   {could_not_start,rabbit,
    {bad_return,
     {{rabbit,start,[normal,[]]},
      {'EXIT',
       {rabbit,failure_during_boot,
        {error,
         {timeout_waiting_for_tables,
          [rabbit_user,rabbit_user_permission,rabbit_vhost,
           rabbit_durable_route,rabbit_durable_exchange,
           rabbit_runtime_parameters,
           rabbit_durable_queue]}}}}}}}

=CRASH REPORT==== 11-Apr-2016::22:55:06 ===
  crasher:
    initial call: application_master:init/4
    pid: <0.110.0>
    registered_name: []
    exception exit: {bad_return,
                     {{rabbit,start,[normal,[]]},
                      {'EXIT',
                       {rabbit,failure_during_boot,
                        {error,
                         {timeout_waiting_for_tables,
                          [rabbit_user,rabbit_user_permission,rabbit_vhost,
                           rabbit_durable_route,rabbit_durable_exchange,
                           rabbit_runtime_parameters,
                           rabbit_durable_queue]}}}}}}
      in function application_master:init/4 (application_master.erl, line 133)
    ancestors: [<0.109.0>]
    messages: [{'EXIT',<0.111.0>,normal}]
    links: [<0.109.0>,<0.7.0>]
    dictionary: []
    trap_exit: true
    status: running
    heap_size: 4185
    stack_size: 27
    reductions: 424
  neighbours:

What is the correct way to back up and restore RabbitMQ? Their procedure is:

~~~
With full-restore, yes, we stop every service including rabbitmq before restoring the files. But in a selective restore (only one or a couple of services) we only stop it on the node that is being restored. So basically we do these steps in a selective restore:

1. pcs cluster standby node
2. wait until all resources stop on that node
3. restore directories and files
4. unstandby
5. wait until all resources start on the node
6. move on to the next node

In a full restore:

1. Stop every service (pcs resource disable on every service on the controllers, systemctl stop on the computes)
2. wait until every service stops
3. Run every plugin, which restores its own files and directories (every service has its own plugin)
4. Start every service with pcs or with systemctl
5. wait until every service starts

But in this case we only restored rabbitmq, with a full restore.
~~~

Version-Release number of selected component (if applicable):
rabbitmq-server-3.3.5-22.el7ost

How reproducible:
Once so far.

Steps to Reproduce:
1. See the full-restore steps above.

Actual results:
The rabbitmq process had to be killed on 2 controllers before it could be restarted.

Expected results:
rabbit comes up without manual action being needed.
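The selective per-node restore flow quoted above can be sketched as a dry run. This is a hypothetical illustration: it only prints the pcs commands (pcs exists only on the cluster nodes), the node names are placeholders, and the per-service restore step is left as a comment:

```shell
# Dry-run sketch of the selective per-node restore flow: standby the node,
# restore files, then unstandby. Commands are printed, not executed.
selective_restore() {
  node=$1
  echo "pcs cluster standby $node"
  echo "# wait until all resources stop on $node"
  echo "# restore directories and files on $node"
  echo "pcs cluster unstandby $node"
  echo "# wait until all resources start on $node"
}
for n in controller-1 controller-2 controller-3; do
  selective_restore "$n"
done
```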