Bug 1343905
Summary: | RabbitMQ fails to rejoin the cluster during overcloud update | |
---|---|---|---
Product: | Red Hat Enterprise Linux 7 | Reporter: | Jiri Stransky <jstransk>
Component: | resource-agents | Assignee: | Peter Lemenkov <plemenko>
Status: | CLOSED ERRATA | QA Contact: | Leonid Natapov <lnatapov>
Severity: | urgent | Docs Contact: |
Priority: | urgent | |
Version: | 7.2 | CC: | abeekhof, agk, cluster-maint, dmaley, fdinitto, jeckersb, jstransk, mandreou, mburns, mcornea, michele, mkolaja, mnovacek, nkrishna, oalbrigt, plemenko, rhel-osp-director-maint, rscarazz, sasha, sclewis, ushkalim
Target Milestone: | pre-dev-freeze | Keywords: | ZStream
Target Release: | --- | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | resource-agents-3.9.5-80.el7 | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | |
: | 1351446 (view as bug list) | Environment: |
Last Closed: | 2016-11-04 00:03:42 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Bug Depends On: | | |
Bug Blocks: | 1333976, 1333977, 1343027, 1344405, 1351446, 1351784 | |
Attachments: | | |
Description
Jiri Stransky
2016-06-08 09:50:53 UTC
The interesting part of the pacemaker journal:

čen 08 09:26:58 overcloud-controller-0.localdomain lrmd[21781]: notice: rabbitmq_start_0:23020:stderr [ User defined signal 2 ]
čen 08 09:26:58 overcloud-controller-0.localdomain lrmd[21781]: notice: rabbitmq_start_0:23020:stderr [ User defined signal 2 ]
čen 08 09:26:58 overcloud-controller-0.localdomain lrmd[21781]: notice: rabbitmq_start_0:23020:stderr [ Error: {aborted,{no_exists,[rabbit_user, ]
čen 08 09:26:58 overcloud-controller-0.localdomain lrmd[21781]: notice: rabbitmq_start_0:23020:stderr [ [{{internal_user,'$1','_','_'}, ]
čen 08 09:26:58 overcloud-controller-0.localdomain lrmd[21781]: notice: rabbitmq_start_0:23020:stderr [ [{'/=','$1',<<"guest">>}], ]
čen 08 09:26:58 overcloud-controller-0.localdomain lrmd[21781]: notice: rabbitmq_start_0:23020:stderr [ ['$_']}]]}} ]
čen 08 09:26:58 overcloud-controller-0.localdomain lrmd[21781]: notice: rabbitmq_start_0:23020:stderr [ User defined signal 2 ]
čen 08 09:26:58 overcloud-controller-0.localdomain lrmd[21781]: notice: rabbitmq_start_0:23020:stderr [ User defined signal 2 ]
čen 08 09:26:58 overcloud-controller-0.localdomain crmd[21784]: notice: Operation rabbitmq_start_0: unknown error (node=overcloud-controller-0, call=224, rc=1, cib-update=64, confirmed=true)
čen 08 09:26:58 overcloud-controller-0.localdomain crmd[21784]: notice: overcloud-controller-0-rabbitmq_start_0:224 [ User defined signal 2\nUser defined signal 2\nError: {aborted,{no_exists,[rabbit_user,\n [{{internal_user,'$1','_','_'},\n [{'/=','$1',<<"guest">>}],\n ['$_']}]]}}\nUser defined signal 2\nUser defined signal 2\n ]

Created attachment 1165902 [details]
pacemaker journal during the update
Created attachment 1165903 [details]
pcs status
I thought it might be a limitation of the RabbitMQ loopback user list, so I tried the patch that addresses this upstream [1], but I still get the same behavior: RabbitMQ is not able to rejoin the cluster.

[1] https://github.com/openstack/tripleo-heat-templates/commit/0e6e864b3b7d125e7eba8c71724d3a458ffef7dd

Could this be a bug in the resource agent, perhaps? Looking at /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster, I see that rmq_stop contains the line:

Users = mnesia:dirty_select(rabbit_user, [{ {internal_user, '\\\$1', '_', '_'}, [{'/=', '\\\$1', <<\"guest\">>}], ['\\\$_'] } ])

which corresponds almost exactly to the error message. However, when I look at the upstream code for this agent:

https://github.com/ClusterLabs/resource-agents/blob/902bd05738aaaf3c1de76afb6526ba7ff6fc2312/heartbeat/rabbitmq-cluster#L352

... the line is not there. Is it possible that this is a bug that has already been addressed upstream and we're just missing a backport? (A quick way to check this is sketched below.) Moving to the resource-agents component for further investigation.

Package versions:
resource-agents-3.9.5-54.el7_2.10.x86_64
rabbitmq-server-3.6.2-3.el7ost.noarch
pacemaker-remote-1.1.13-10.el7_2.2.x86_64
pacemaker-cluster-libs-1.1.13-10.el7_2.2.x86_64
pacemaker-1.1.13-10.el7_2.2.x86_64
pacemaker-libs-1.1.13-10.el7_2.2.x86_64
pacemaker-cli-1.1.13-10.el7_2.2.x86_64
corosync-2.3.4-7.el7_2.1.x86_64
corosynclib-2.3.4-7.el7_2.1.x86_64

Hey, FYI I am also seeing this issue. I am trying the upgrades workflow, and for controller upgrades I am reworking the scripts at https://review.openstack.org/#/c/321027/ for Mitaka to Newton upgrades. As you know :) during the controller upgrade we first run controller_pacemaker_1.sh to bring the cluster down. That happens fine, but then during service startup I see rabbit failing to start on controllers 1 and 2 (but running on controller 0). On controller-2, for example, the logs look like:

Jun 10 14:56:58 localhost rabbitmq-cluster(rabbitmq)[23557]: INFO: RabbitMQ server is not running
Jun 10 14:56:58 localhost lrmd[10099]: notice: rabbitmq_stop_0:23557:stderr [ Error: {aborted,{no_exists,[rabbit_user, ]
Jun 10 14:56:58 localhost lrmd[10099]: notice: rabbitmq_stop_0:23557:stderr [ [{{internal_user,'$1','_','_'}, ]
Jun 10 14:56:58 localhost lrmd[10099]: notice: rabbitmq_stop_0:23557:stderr [ [{'/=','$1',<<"guest">>}], ]
Jun 10 14:56:58 localhost lrmd[10099]: notice: rabbitmq_stop_0:23557:stderr [ ['$_']}]]}} ]
Jun 10 14:56:58 localhost crmd[10102]: notice: Operation rabbitmq_stop_0: ok (node=overcloud-controller-2, call=247, rc=0, cib-update=80, confirmed=true)
Jun 10 14:57:02 localhost redis(redis)[24917]: INFO: start: /usr/bin/redis-server --daemonize yes --unixsocket '/var/run/redis/redis.sock' --pidfile '/var/run/redis/redis-server.pid'
Jun 10 14:57:02 localhost su: (to redis) root on none
Jun 10 14:57:02 localhost systemd: Created slice user-984.slice.
Jun 10 14:57:02 localhost systemd: Starting user-984.slice.
Jun 10 14:57:02 localhost systemd: Started Session c5359 of user redis.
Jun 10 14:57:02 localhost systemd: Starting Session c5359 of user redis.
Jun 10 14:57:02 localhost redis(redis)[24917]: INFO: Postponing setting master score for overcloud-controller-2 until last known master instance [overcloud-controller-0] is promoted

Initially I thought it was something related to my scripts, so I tested current master and saw the same issue.

thanks, marios

Ok, I think I've got a reproducer. Expect a fix and a test build this Wednesday.
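A minimal sketch of such a check, assuming the agent is installed at the OCF path mentioned above and comparing it against the upstream copy of the same file pinned at the commit linked earlier (the grep/diff commands are illustrative, not taken from the original report):

# Does the packaged agent still carry the user-cleanup query seen in the error?
grep -n "dirty_select(rabbit_user" /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster

# How does the packaged agent differ from the upstream version of the same file?
curl -s https://raw.githubusercontent.com/ClusterLabs/resource-agents/902bd05738aaaf3c1de76afb6526ba7ff6fc2312/heartbeat/rabbitmq-cluster \
  | diff -u /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster - | less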
Got a bugfix: https://github.com/lemenkov/resource-agents/commit/f00a952bd5e133cad30689d9edcc98f5d33a71a9

Will do a test build shortly.

Jiri, please retest with this resource-agents build.

It still fails, but with a different error message now; I will attach the current pacemaker journal output. It's worth noting what I did to test:

* Deployed with what we get from repositories set up with rhos-release, which is resource-agents-3.9.5-54.el7_2.10.x86_64
* Updated the deployment with the new resource-agents I got from Brew here: https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=499744
* Then I tried running the update. Stop went fine, but the rejoin didn't.

Is there a reason to believe that if I deployed with the version from Brew ^^ from the very start, instead of updating to it, the rejoin would work? If so, I can also try installing the package before Pacemaker is started.

Created attachment 1169119 [details]
pacemaker journal June 17
Did another try where I updated resource-agents even before starting Pacemaker; the error seems to be the same. This issue might need some further investigation. Let me know if you want to log into the environment.

Created attachment 1169140 [details]
pacemaker journal June 17 2nd try
(In reply to Jiri Stransky from comment #17)
> Did another try where I updated resource-agents even before starting
> Pacemaker; the error seems to be the same. This issue might need some
> further investigation. Let me know if you want to log into the environment.

I'd love to have a login. Could you please send it to me off-list? "petro" @ IRC.

Status update. The scriptlets in the start/stop sections of the resource agent's script cannot be executed when the cluster isn't ready yet or when only one node remains (cluster shutdown). That's not a problem, because these scriptlets will be executed on a different node (or were already executed on a different node). So we should check whether the cluster is ready and run the scriptlets only when possible. We may also restart the internal DB, re-read the data, and stop the internal DB again. I've got a patch - will make a build soon.

Here is another patch: https://github.com/lemenkov/resource-agents/commit/74b3cff4fce5483d126b16131db53f8bd5804c82

New build with the latest patch.

Unfortunately it's still failing. I left the environment intact for further investigation; it can be accessed the same way as before.

Failed Actions:
* rabbitmq_start_0 on overcloud-controller-0 'unknown error' (1): call=217, status=complete, exitreason='none', last-rc-change='Wed Jun 22 10:31:55 2016', queued=0ms, exec=13412ms

The error message is different from what it was previously:

čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]: notice: rabbitmq_start_0:1806:stderr [ User defined signal 2 ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]: notice: rabbitmq_start_0:1806:stderr [ User defined signal 2 ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]: notice: rabbitmq_start_0:1806:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-0': nodedown ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]: notice: rabbitmq_start_0:1806:stderr [ ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]: notice: rabbitmq_start_0:1806:stderr [ DIAGNOSTICS ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]: notice: rabbitmq_start_0:1806:stderr [ =========== ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]: notice: rabbitmq_start_0:1806:stderr [ ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]: notice: rabbitmq_start_0:1806:stderr [ attempted to contact: ['rabbit@overcloud-controller-0'] ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]: notice: rabbitmq_start_0:1806:stderr [ ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]: notice: rabbitmq_start_0:1806:stderr [ rabbit@overcloud-controller-0: ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]: notice: rabbitmq_start_0:1806:stderr [ * connected to epmd (port 4369) on overcloud-controller-0 ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]: notice: rabbitmq_start_0:1806:stderr [ * epmd reports: node 'rabbit' not running at all ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]: notice: rabbitmq_start_0:1806:stderr [ no other nodes on overcloud-controller-0 ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]: notice: rabbitmq_start_0:1806:stderr [ * suggestion: start the node ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]: notice: rabbitmq_start_0:1806:stderr [ ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]: notice: rabbitmq_start_0:1806:stderr [ current node details: ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:
notice: rabbitmq_start_0:1806:stderr [ - node name: 'rabbitmq-cli-57@overcloud-controller-0' ] čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]: notice: rabbitmq_start_0:1806:stderr [ - home dir: /var/lib/rabbitmq ] čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]: notice: rabbitmq_start_0:1806:stderr [ - cookie hash: jhpOzW5+oUaNAkMXDesQyQ== ] čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]: notice: rabbitmq_start_0:1806:stderr [ ] Posted a pull request upstream: https://github.com/ClusterLabs/resource-agents/pull/831 Tested by applying the patch https://github.com/jistr/resource-agents/commit/f727570225fe78debdfb285446b2a6fbf8b75dad.patch on top of resource-agents-3.9.5-54.el7_2.10.x86_64 and RabbitMQ nodes were able to leave and rejoin the cluster fine in testing with OSPd 9 minor update. Also tested separately with simply `pcs cluster stop` and `pcs cluster start` on one of the controller nodes, and again the node rejoined fine. New build with the latest patch. Retested with RPMs only (resource-agents-3.9.5-76.el7) without additional modifications, worked fine. Environment: resource-agents-3.9.5-76.el7.x86_64 Clone Set: rabbitmq-clone [rabbitmq] Started: [ overcloud-controller-0 ] Stopped: [ overcloud-controller-1 overcloud-controller-2 ] Unable to start rabbitmq on controllers 1 and 2 (tried to cleanup/restart). Jul 6 13:40:58 overcloud-controller-1 lrmd[25757]: notice: rabbitmq_stop_0:33861:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-1': nodedown ] Jul 6 13:40:58 overcloud-controller-1 lrmd[25757]: notice: rabbitmq_stop_0:33861:stderr [ ] Jul 6 13:40:58 overcloud-controller-1 lrmd[25757]: notice: rabbitmq_stop_0:33861:stderr [ DIAGNOSTICS ] Jul 6 13:40:58 overcloud-controller-1 lrmd[25757]: notice: rabbitmq_stop_0:33861:stderr [ =========== ] Jul 6 13:40:58 overcloud-controller-1 lrmd[25757]: notice: rabbitmq_stop_0:33861:stderr [ ] Jul 6 13:40:58 overcloud-controller-1 lrmd[25757]: notice: rabbitmq_stop_0:33861:stderr [ attempted to contact: ['rabbit@overcloud-controller-1'] ] Jul 6 13:40:58 overcloud-controller-1 lrmd[25757]: notice: rabbitmq_stop_0:33861:stderr [ ] Jul 6 13:40:58 overcloud-controller-1 lrmd[25757]: notice: rabbitmq_stop_0:33861:stderr [ rabbit@overcloud-controller-1: ] Jul 6 13:40:58 overcloud-controller-1 lrmd[25757]: notice: rabbitmq_stop_0:33861:stderr [ * connected to epmd (port 4369) on overcloud-controller-1 ] Jul 6 13:40:58 overcloud-controller-1 lrmd[25757]: notice: rabbitmq_stop_0:33861:stderr [ * epmd reports: node 'rabbit' not running at all ] Jul 6 13:40:58 overcloud-controller-1 lrmd[25757]: notice: rabbitmq_stop_0:33861:stderr [ no other nodes on overcloud-controller-1 ] Jul 6 13:40:58 overcloud-controller-1 lrmd[25757]: notice: rabbitmq_stop_0:33861:stderr [ * suggestion: start the node ] Jul 6 13:40:58 overcloud-controller-1 lrmd[25757]: notice: rabbitmq_stop_0:33861:stderr [ ] Jul 6 13:40:58 overcloud-controller-1 lrmd[25757]: notice: rabbitmq_stop_0:33861:stderr [ current node details: ] Jul 6 13:40:58 overcloud-controller-1 lrmd[25757]: notice: rabbitmq_stop_0:33861:stderr [ - node name: 'rabbitmq-cli-06@overcloud-controller-1' ] Jul 6 13:40:58 overcloud-controller-1 lrmd[25757]: notice: rabbitmq_stop_0:33861:stderr [ - home dir: /var/lib/rabbitmq ] Jul 6 13:40:58 overcloud-controller-1 lrmd[25757]: notice: rabbitmq_stop_0:33861:stderr [ - cookie hash: imq2MuvlmJpQO2jdNBEsEA== ] Jul 6 13:40:58 overcloud-controller-1 lrmd[25757]: notice: rabbitmq_stop_0:33861:stderr [ ] Jul 6 
13:40:58 overcloud-controller-1 crmd[25760]: notice: Operation rabbitmq_stop_0: ok (node=overcloud-controller-1, call=1231, rc=0, cib-update=765, confirmed=true) Hi, I hit this today too - in the same way reported by Sasha at comment #40. Providing some more info here for debug - bandini said there may be an issue with the way the fix was applied/sequencing: 1. Deploy 8 under/overcloud, upgrade undercloud, aodh migration on overcloud, keystone migration on overcloud, upgrade-init on overcloud (repo switch). Applied the os-collect-config fixup for bug 1350489: for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo yum -y update os-collect-config; echo '';"; done Install the latest resource-agents for this bug: for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo yum -y install http://download.eng.bos.redhat.com/brewroot/packages/resource-agents/3.9.5/76.el7/x86_64/resource-agents-3.9.5-76.el7.x86_64.rpm; echo '';"; done Confirm the agents were installed ok: [stack@instack ~]$ for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; sudo rpm -qa | grep resource-agents; echo '';"; done overcloud-compute-0.localdomain resource-agents-3.9.5-76.el7.x86_64 overcloud-controller-0.localdomain resource-agents-3.9.5-76.el7.x86_64 overcloud-controller-1.localdomain resource-agents-3.9.5-76.el7.x86_64 overcloud-controller-2.localdomain resource-agents-3.9.5-76.el7.x86_64 Run the controller upgrade: openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org' -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml Heat declares "UPDATE_COMPLETE", but rabbit is still down on controllers 1 and 2: 26- Clone Set: rabbitmq-clone [rabbitmq] 27- Started: [ overcloud-controller-0 ] 28: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] A pcs resource cleanup or pcs resource restart rabbit doesn't help. The errors I see are like: [stack@instack ~]$ for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo journalctl -n 100000000 | grep -ni "error:.*rabbit" ; echo '';"; done overcloud-compute-0.localdomain overcloud-controller-0.localdomain 219715:Jul 08 12:34:53 overcloud-controller-0.localdomain lrmd[5703]: notice: rabbitmq_stop_0:15791:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-0': nodedown ] overcloud-controller-1.localdomain 243554:Jul 08 13:09:35 overcloud-controller-1.localdomain rabbitmq-cluster(rabbitmq)[5423]: ERROR: Failed to forget node rabbit@overcloud-controller-1 via rabbit@overcloud-controller-0. 
243613:Jul 08 13:09:39 overcloud-controller-1.localdomain lrmd[26882]: notice: rabbitmq_start_0:3843:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-1': nodedown ] 243631:Jul 08 13:09:39 overcloud-controller-1.localdomain lrmd[26882]: notice: rabbitmq_start_0:3843:stderr [ Error: {offline_node_no_offline_flag,"You are trying to remove a node from an offline node. That is dangerous, but can be done with the --offline flag. Please consult the manual for rabbitmqctl for more information."} ] 243633:Jul 08 13:09:39 overcloud-controller-1.localdomain lrmd[26882]: notice: rabbitmq_start_0:3843:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-1': nodedown ] 243652:Jul 08 13:09:39 overcloud-controller-1.localdomain crmd[26885]: notice: overcloud-controller-1-rabbitmq_start_0:231 [ init terminating in do_boot ()\r\nError: unable to connect to node 'rabbit@overcloud-controller-1': nodedown\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@overcloud-controller-1']\n\nrabbit@overcloud-controller-1:\n * connected to epmd (port 4369) on overcloud-controller-1\n * epmd reports: node 'rabbit' not running at all\n no other nodes on overcloud-controller-1\n * suggestion: start the node\n\ncurrent node deta 243664:Jul 08 13:09:40 overcloud-controller-1.localdomain lrmd[26882]: notice: rabbitmq_stop_0:6518:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-1': nodedown ] overcloud-controller-2.localdomain 221752:Jul 08 13:09:54 overcloud-controller-2.localdomain rabbitmq-cluster(rabbitmq)[8132]: ERROR: Failed to forget node rabbit@overcloud-controller-2 via rabbit@overcloud-controller-0. 221801:Jul 08 13:09:59 overcloud-controller-2.localdomain lrmd[505]: notice: rabbitmq_start_0:7169:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-2': nodedown ] 221821:Jul 08 13:09:59 overcloud-controller-2.localdomain lrmd[505]: notice: rabbitmq_start_0:7169:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-2': nodedown ] 221840:Jul 08 13:09:59 overcloud-controller-2.localdomain crmd[509]: notice: overcloud-controller-2-rabbitmq_start_0:227 [ Error: process_not_running\nError: unable to connect to node 'rabbit@overcloud-controller-2': nodedown\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@overcloud-controller-2']\n\nrabbit@overcloud-controller-2:\n * connected to epmd (port 4369) on overcloud-controller-2\n * epmd reports: node 'rabbit' not running at all\n no other nodes on overcloud-controller-2\n * suggestion: start the node\n\ncurrent node details:\n 221856:Jul 08 13:10:00 overcloud-controller-2.localdomain lrmd[505]: notice: rabbitmq_stop_0:9364:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-2': nodedown ] I am wondering if a simple upgrade of the resource-agents is enough in this case. Normally, given that pacemaker forks()/execs() it is okay, but in case the newer RA changes metadata or if it changed bevaviour that relies on a previous action we would need a restart of the resource for the changes to be effective. From a very quick glance, it seems that Jiri's patch does not fall into any of the above categories so a simple upgrade should be fine. Peter, it'd be nice if you could confirm here? Marios, we could easily confirm this (if you 100% reproduce this issue that is) by restarting the rabbitmq resource after the resource-agents package upgrade, wait for the cluster to settle fully and then do the upgrade. 
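A minimal sketch of that confirmation step, assuming the resource and clone names shown in the pcs output above (the pcs and crm_resource invocations are illustrative, not commands taken from the report):

# On any controller, after resource-agents has been updated on all nodes:
sudo pcs resource restart rabbitmq-clone   # re-execute the rabbitmq resource with the new agent
sudo crm_resource --wait                   # block until the cluster has settled
sudo pcs status | grep -A 2 'rabbitmq-clone'
# Only then start the overcloud update/upgrade from the undercloud.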
FailedQA: FailingQA based on comment #40 and comment #41. (In reply to Michele Baldessari from comment #42) > I am wondering if a simple upgrade of the resource-agents is enough in this > case. > Normally, given that pacemaker forks()/execs() it is okay, but in case the > newer > RA changes metadata or if it changed bevaviour that relies on a previous > action > we would need a restart of the resource for the changes to be effective. > > From a very quick glance, it seems that Jiri's patch does not fall into any > of > the above categories so a simple upgrade should be fine. Peter, it'd be nice > if you could confirm here? > > Marios, we could easily confirm this (if you 100% reproduce this issue that > is) by restarting the rabbitmq resource after the resource-agents package > upgrade, > wait for the cluster to settle fully and then do the upgrade. Hi Bandini, only just managed to revisit here. SO, I ran it again today as you suggested. I updated the resource-agents, did a pcs resource restart rabbitmq, waiting for things to settle and ran the upgrade again. Unfortunately same results: heat declares UPDATE_COMPLETE, however, rabbit is down on controller 1 and 2 (For clarity, I updated the resource agents like in comment #40 and confirmed my nodes had resource-agents-3.9.5-76.el7.x86_64): Every 2.0s: pcs status | grep -ni 'stop' -C 2 Tue Jul 12 14:35:04 2016 26- Clone Set: rabbitmq-clone [rabbitmq] 27- Started: [ overcloud-controller-0 ] 28: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 29- Clone Set: memcached-clone [memcached] 30- Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] WRT errors around that time (heat stack update was complete at "2016-07-12 14:32:23 [0]: SIGNAL_COMPLETE ") - I see errors at 14:27 for example that may be relevant from 1 & 2. I also captured the SOS reports but BZ won't let me attach them (23,25 and 28 mb from the controllers, there is a 20mb limit apparently). If useful we can work out how to get them to you tomorrow, but for now: [stack@instack ~]$ for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo journalctl -n 100000000 | grep -ni "error:.*rabbit" ; echo '';"; done overcloud-compute-0.localdomain overcloud-controller-0.localdomain 234269:Jul 12 13:46:22 overcloud-controller-0.localdomain lrmd[6277]: notice: rabbitmq_stop_0:9695:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-0': nodedown ] 241259:Jul 12 14:02:34 overcloud-controller-0.localdomain lrmd[6277]: notice: rabbitmq_stop_0:27855:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-0': nodedown ] overcloud-controller-1.localdomain 237069:Jul 12 13:46:42 overcloud-controller-1.localdomain rabbitmq-cluster(rabbitmq)[25102]: ERROR: Failed to forget node rabbit@overcloud-controller-1 via rabbit@overcloud-controller-0. 237121:Jul 12 13:46:45 overcloud-controller-1.localdomain lrmd[2066]: notice: rabbitmq_start_0:24156:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-1': nodedown ] 247718:Jul 12 14:26:59 overcloud-controller-1.localdomain rabbitmq-cluster(rabbitmq)[3232]: ERROR: Failed to forget node rabbit@overcloud-controller-1 via rabbit@overcloud-controller-0. 
247778:Jul 12 14:27:05 overcloud-controller-1.localdomain lrmd[27846]: notice: rabbitmq_start_0:1572:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-1': nodedown ] 247796:Jul 12 14:27:05 overcloud-controller-1.localdomain lrmd[27846]: notice: rabbitmq_start_0:1572:stderr [ Error: {offline_node_no_offline_flag,"You are trying to remove a node from an offline node. That is dangerous, but can be done with the --offline flag. Please consult the manual for rabbitmqctl for more information."} ] 247798:Jul 12 14:27:05 overcloud-controller-1.localdomain lrmd[27846]: notice: rabbitmq_start_0:1572:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-1': nodedown ] 247817:Jul 12 14:27:05 overcloud-controller-1.localdomain crmd[27849]: notice: overcloud-controller-1-rabbitmq_start_0:231 [ init terminating in do_boot ()\r\nError: unable to connect to node 'rabbit@overcloud-controller-1': nodedown\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@overcloud-controller-1']\n\nrabbit@overcloud-controller-1:\n * connected to epmd (port 4369) on overcloud-controller-1\n * epmd reports: node 'rabbit' not running at all\n no other nodes on overcloud-controller-1\n * suggestion: start the node\n\ncurrent node deta 247866:Jul 12 14:27:06 overcloud-controller-1.localdomain lrmd[27846]: notice: rabbitmq_stop_0:4394:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-1': nodedown ] overcloud-controller-2.localdomain 260877:Jul 12 13:46:51 overcloud-controller-2.localdomain rabbitmq-cluster(rabbitmq)[15735]: ERROR: Failed to forget node rabbit@overcloud-controller-2 via rabbit@overcloud-controller-0. 260883:Jul 12 13:46:51 overcloud-controller-2.localdomain rabbitmq-cluster(rabbitmq)[15788]: ERROR: Failed to forget node rabbit@overcloud-controller-2 via rabbit@overcloud-controller-1. 260930:Jul 12 13:46:54 overcloud-controller-2.localdomain lrmd[9409]: notice: rabbitmq_start_0:15255:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-2': nodedown ] 272524:Jul 12 14:27:11 overcloud-controller-2.localdomain rabbitmq-cluster(rabbitmq)[32639]: ERROR: Failed to forget node rabbit@overcloud-controller-2 via rabbit@overcloud-controller-0. 272568:Jul 12 14:27:16 overcloud-controller-2.localdomain lrmd[25006]: notice: rabbitmq_start_0:31614:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-2': nodedown ] 272588:Jul 12 14:27:16 overcloud-controller-2.localdomain lrmd[25006]: notice: rabbitmq_start_0:31614:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-2': nodedown ] 272607:Jul 12 14:27:16 overcloud-controller-2.localdomain crmd[25009]: notice: overcloud-controller-2-rabbitmq_start_0:227 [ Error: process_not_running\nError: unable to connect to node 'rabbit@overcloud-controller-2': nodedown\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@overcloud-controller-2']\n\nrabbit@overcloud-controller-2:\n * connected to epmd (port 4369) on overcloud-controller-2\n * epmd reports: node 'rabbit' not running at all\n no other nodes on overcloud-controller-2\n * suggestion: start the node\n\ncurrent node details:\n 272619:Jul 12 14:27:17 overcloud-controller-2.localdomain lrmd[25006]: notice: rabbitmq_stop_0:1481:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-2': nodedown ] (In reply to marios from comment #44) marios, could you please post RabbitMQ logs itself? This would be very helpful. 
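One way to pull the RabbitMQ logs off the controllers is sketched below; it reuses the node-listing loop from the earlier comments and assumes the default /var/log/rabbitmq location (an illustrative sketch, not the procedure actually used here):

for i in $(nova list | grep ctlplane | awk -F' ' '{ print $12 }' | awk -F'=' '{ print $2 }'); do
  ssh heat-admin@$i "hostname; sudo tar czf /tmp/rabbitmq-logs.tar.gz /var/log/rabbitmq"
  scp heat-admin@$i:/tmp/rabbitmq-logs.tar.gz "rabbitmq-logs-$i.tar.gz"
done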
(In reply to Peter Lemenkov from comment #45) > (In reply to marios from comment #44) > > marios, could you please post RabbitMQ logs itself? This would be very > helpful. Hey Peter see private message with details for the sos reports thanks update: o/ Peter/bandini... tested this today with resource-agents-3.9.5-77.el7.x86_64 (previously been using resource-agents-3.9.5-76.el7.x86_64.rpm). The heat stack does got to upgrade complete but the cluster is in a bad state. I had to resort to "pcs cluster stop --all" and then start again. After this I also ran the compute upgrade and converge to completion without issues. I think we are close, but this step in particular is still a problem for me. I will cycle back tomorrow if there are any updates, thanks, marios copy/pasta notes for clarity: ----------------------------- *=* 15:20:37 *=*=*= " confirm repos on overcloud nodes:" [stack@instack ~]$ for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo ls -l /etc/yum.repos.d/; echo '';"; done overcloud-compute-0.localdomain total 16 -rw-r--r--. 1 root root 358 Mar 3 16:36 redhat.repo -rw-r--r--. 1 root root 2127 Jul 18 13:43 rhos-release-9-director.repo -rw-r--r--. 1 root root 2307 Jul 18 13:43 rhos-release-9.repo -rw-r--r--. 1 root root 1237 Jul 18 13:43 rhos-release-rhel-7.2.repo overcloud-controller-0.localdomain total 16 -rw-r--r--. 1 root root 358 Mar 3 16:36 redhat.repo -rw-r--r--. 1 root root 2127 Jul 18 13:43 rhos-release-9-director.repo -rw-r--r--. 1 root root 2307 Jul 18 13:43 rhos-release-9.repo -rw-r--r--. 1 root root 1237 Jul 18 13:43 rhos-release-rhel-7.2.repo overcloud-controller-1.localdomain total 16 -rw-r--r--. 1 root root 358 Mar 3 16:36 redhat.repo -rw-r--r--. 1 root root 2127 Jul 18 13:44 rhos-release-9-director.repo -rw-r--r--. 1 root root 2307 Jul 18 13:44 rhos-release-9.repo -rw-r--r--. 1 root root 1237 Jul 18 13:44 rhos-release-rhel-7.2.repo overcloud-controller-2.localdomain total 16 -rw-r--r--. 1 root root 358 Mar 3 16:36 redhat.repo -rw-r--r--. 1 root root 2127 Jul 18 13:43 rhos-release-9-director.repo -rw-r--r--. 1 root root 2307 Jul 18 13:44 rhos-release-9.repo -rw-r--r--. 1 root root 1237 Jul 18 13:43 rhos-release-rhel-7.2.repo [stack@instack ~]$ *=* 15:23:36 *=*=*= "Applied the os-collect-config fixup for bug 1350489:" for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo yum -y update os-collect-config; echo '';"; done *=* 15:27:39 *=*=*= " " [stack@instack ~]$ for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo rpm -qa | grep os-collect-config; echo '';"; done overcloud-compute-0.localdomain os-collect-config-0.1.37-5.el7ost.noarch overcloud-controller-0.localdomain os-collect-config-0.1.37-5.el7ost.noarch overcloud-controller-1.localdomain os-collect-config-0.1.37-5.el7ost.noarch overcloud-controller-2.localdomain os-collect-config-0.1.37-5.el7ost.noarch [stack@instack ~]$ try http://download.eng.bos.redhat.com/brewroot/packages/resource-agents/3.9.5/77.el7/x86_64/resource-agents-3.9.5-77.el7.x86_64.rpm today is the newest... 
05 july http://download.eng.bos.redhat.com/brewroot/packages/resource-agents/3.9.5/ I have previously been using this in the for below http://download.eng.bos.redhat.com/brewroot/packages/resource-agents/3.9.5/76.el7/x86_64/resource-agents-3.9.5-76.el7.x86_64.rpm *=* 15:36:53 *=*=*= "Install the latest resource-agents for this bug:https://bugzilla.redhat.com/show_bug.cgi?id=1343905" for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo yum -y install http://download.eng.bos.redhat.com/brewroot/packages/resource-agents/3.9.5/77.el7/x86_64/resource-agents-3.9.5-77.el7.x86_64.rpm; echo '';"; done *=* 15:47:06 *=*=*= "confirm resource agents installed:" [stack@instack ~]$ for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo rpm -qa | grep resource-agents; echo '';"; done overcloud-compute-0.localdomain resource-agents-3.9.5-77.el7.x86_64 overcloud-controller-0.localdomain resource-agents-3.9.5-77.el7.x86_64 overcloud-controller-1.localdomain resource-agents-3.9.5-77.el7.x86_64 overcloud-controller-2.localdomain resource-agents-3.9.5-77.el7.x86_64 *=* 15:48:23 *=*=*= "Run the controller upgrade: " openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org' -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml Jul 19 12:51:54 overcloud-controller-0.localdomain heat-engine[1425]: 2016-07-19 12:51:54.479 1539 ERROR sqlalchemy.pool.QueuePool Jul 19 12:51:54 overcloud-controller-0.localdomain systemd[1]: openstack-heat-engine.service: main process exited, code=exited, status=1/FAILURE Jul 19 12:51:54 overcloud-controller-0.localdomain heat-engine[1425]: 2016-07-19 12:51:54.799 1542 WARNING oslo_messaging.server [-] start/stop/wait must be called in the same thread Jul 19 12:52:05 overcloud-controller-0.localdomain crmd[2185]: notice: Initiating action 306: stop openstack-cinder-scheduler_stop_0 on overcloud-controller-2 Jul 19 12:52:06 overcloud-controller-0.localdomain haproxy[7586]: Server swift_proxy_server/overcloud-controller-2 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue. *=* 16:11:11 *=*=*= " " Jul 19 13:00:16 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Object Storage (swift) - Object Server. Jul 19 13:00:16 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Object Storage (swift) - Proxy Server. Jul 19 13:00:14 overcloud-controller-0.localdomain cib[2180]: notice: Disconnecting from Corosync Jul 19 13:00:14 overcloud-controller-0.localdomain cib[2180]: notice: Disconnecting from Corosync Jul 19 13:00:14 overcloud-controller-0.localdomain pacemakerd[2179]: notice: Shutdown complete Jul 19 13:00:14 overcloud-controller-0.localdomain systemd[1]: Stopped Pacemaker High Availability Cluster Manager. 
Jul 19 13:00:15 overcloud-controller-0.localdomain systemd[1]: Stopping Corosync Cluster Engine... Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]: [MAIN ] Node was shut down by a signal Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[11312]: Signaling Corosync Cluster Engine (corosync) to terminate: [ OK ] Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]: [SERV ] Unloading all Corosync service engines. Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]: [QB ] withdrawing server sockets Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]: [SERV ] Service engine unloaded: corosync vote quorum service v1.0 Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]: [QB ] withdrawing server sockets Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]: [SERV ] Service engine unloaded: corosync configuration map access Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]: [QB ] withdrawing server sockets Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]: [SERV ] Service engine unloaded: corosync configuration service Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]: [QB ] withdrawing server sockets Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]: [SERV ] Service engine unloaded: corosync cluster closed process group service v1.01 Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]: [QB ] withdrawing server sockets Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]: [SERV ] Service engine unloaded: corosync cluster quorum service v0.1 Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]: [SERV ] Service engine unloaded: corosync profile loading service Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]: [MAIN ] Corosync Cluster Engine exiting normally Jul 19 13:15:41 overcloud-controller-0.localdomain yum[12207]: Updated: python-zaqarclient-1.0.0-1.el7ost.noarch Jul 19 13:39:10 overcloud-controller-0.localdomain systemd[1]: Starting Cluster Controlled openstack-aodh-listener... 
2016-07-19 13:42:24 [AllNodesExtraConfig]: UPDATE_COMPLETE state changed Stack overcloud UPDATE_COMPLETE *=* 16:47:00 *=*=*= "update complete but lots of stuff still down " Clone Set: rabbitmq-clone [rabbitmq] 27- Started: [ overcloud-controller-0 ] 28: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 29- Clone Set: memcached-clone [memcached] 30- Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] 31- Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler] 32- Started: [ overcloud-controller-0 ] 33: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 34- Clone Set: neutron-l3-agent-clone [neutron-l3-agent] 35: Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] 36- Clone Set: openstack-heat-engine-clone [openstack-heat-engine] 37- Started: [ overcloud-controller-0 ] 38: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 39- Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api] 40- Started: [ overcloud-controller-0 ] 41: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 42- Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent] 43: Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] 44- Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup] 45- Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] -- 48- Clone Set: openstack-heat-api-clone [openstack-heat-api] 49- Started: [ overcloud-controller-0 ] 50: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 51- Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler] 52- Started: [ overcloud-controller-0 ] 53: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 54- Clone Set: openstack-nova-api-clone [openstack-nova-api] 55- Started: [ overcloud-controller-0 ] 56: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 57- Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch] 58- Started: [ overcloud-controller-0 ] 59: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 60- Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector] 61- Started: [ overcloud-controller-0 ] 62: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 63- Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth] 64- Started: [ overcloud-controller-0 ] 65: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 66- Clone Set: openstack-glance-registry-clone [openstack-glance-registry] 67- Started: [ overcloud-controller-0 ] 68: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 69- Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification] 70- Started: [ overcloud-controller-0 ] 71: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 72- Clone Set: openstack-cinder-api-clone [openstack-cinder-api] 73- Started: [ overcloud-controller-0 ] 74: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 75- Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent] 76: Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] 77- Clone Set: openstack-glance-api-clone [openstack-glance-api] 78- Started: [ overcloud-controller-0 ] 79: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 80- Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent] 81: Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] 82- Clone Set: openstack-nova-novncproxy-clone 
[openstack-nova-novncproxy] 83- Started: [ overcloud-controller-0 ] 84: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 85- Clone Set: delay-clone [delay] 86- Started: [ overcloud-controller-0 ] 87: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 88- Clone Set: neutron-server-clone [neutron-server] 89: Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] 90- Clone Set: httpd-clone [httpd] 91- Started: [ overcloud-controller-0 ] 92: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 93- Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central] 94- Started: [ overcloud-controller-0 ] 95: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 96- Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn] 97- Started: [ overcloud-controller-0 ] 98: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 99- openstack-cinder-volume (systemd:openstack-cinder-volume): Started overcloud-controller-0 100- Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor] 101- Started: [ overcloud-controller-0 ] 102: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 103- Clone Set: openstack-aodh-listener-clone [openstack-aodh-listener] 104- Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] -- 109- Clone Set: openstack-core-clone [openstack-core] 110- Started: [ overcloud-controller-0 ] 111: Stopped: [ overcloud-controller-1 overcloud-controller-2 ] 112- 113-Failed Actions: s *=* 17:06:15 *=*=*= " pcs resource cleanup" svces still down *=* 17:35:47 *=*=*= "pcs cluster stop --all" *=* 17:56:34 *=*=*= "pcs cluster start --all" *=* 18:04:21 *=*=*= " no stopped services... try compute upgrade then converge..." *=* 18:04:55 *=*=*= "[stack@instack ~]$ upgrade-non-controller.sh --upgrade overcloud-compute-0 " *=* 18:22:04 *=*=*= "pingtest " after controller and compute upgrade | server1 | | OS::Nova::Server | CREATE_FAILED | 2016-07-19T15:22:28 | tenant-stack | *=* 18:44:37 *=*=*= "upgrade converge" openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org' -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml *=* 18:55:42 *=*=*= " " Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler] (unmanaged) openstack-nova-scheduler (systemd:openstack-nova-scheduler): Started overcloud-controller-0 (unmanaged) Every 2.0s: pcs status | grep -ni stop -C 2 Tue Jul 19 15:58:02 2016 161- openstack-core (ocf::heartbeat:Dummy): Started overcloud-controller-1 (unmanaged) 162- Clone Set: openstack-gnocchi-metricd-clone [openstack-gnocchi-metricd] (unmanaged) 163: Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] 164- Clone Set: openstack-sahara-api-clone [openstack-sahara-api] (unmanaged) 165: Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] 166- Clone Set: openstack-sahara-engine-clone [openstack-sahara-engine] (unmanaged) 167: Stopped: [ overcloud-controller-0 overcloud-controller-1 
overcloud-controller-2 ] 168- Clone Set: openstack-gnocchi-statsd-clone [openstack-gnocchi-statsd] (unmanaged) 169: Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ] 170- 171-PCSD Status: *=* 19:02:56 *=*=*= " managed" *=* 19:20:38 *=*=*= " "update complete! \o/ no services down Every 2.0s: pcs status | grep -ni stop -C 2 Tue Jul 19 16:21:00 2016 2016-07-19 16:19:41 [BlockStorageNodesPostDeployment]: CREATE_COMPLETE state changed Stack overcloud UPDATE_COMPLETE *=* 19:38:15 *=*=*= " " tripleo.sh -- Overcloud pingtest SUCCEEDED (after converge) Thanks for the update Marios. FYI I'm trying to reproduce this as well but for now I'm having some other unrelated issues to work through first. (In reply to John Eckersberg from comment #49) > Thanks for the update Marios. FYI I'm trying to reproduce this as well but > for now I'm having some other unrelated issues to work through first. ack, fwiw, through keystone migrate at https://bugzilla.redhat.com/show_bug.cgi?id=1348831#c16 (and then as in comment #48 above) OK, I've managed to reproduce this. The end result after the upgrade: Clone Set: rabbitmq-clone [rabbitmq] Started: [ overcloud-controller-0 ] Stopped: [ overcloud-controller-1 overcloud-controller-2 ] And even though 0 is "started", the RabbitMQ app is not actually running: # rabbitmqctl status Status of node 'rabbit@overcloud-controller-0' ... [{pid,29875}, {running_applications,[{ranch,"Socket acceptor pool for TCP protocols.", "1.2.1"}, {xmerl,"XML parser","1.3.10"}, {sasl,"SASL CXC 138 11","2.7"}, {stdlib,"ERTS CXC 138 10","2.8"}, {kernel,"ERTS CXC 138 10","4.2"}]}, So let's see how we got there. 0 sees 1 go down: =INFO REPORT==== 21-Jul-2016::17:06:24 === rabbit on node 'rabbit@overcloud-controller-1' down 2 also sees 1 go down: =INFO REPORT==== 21-Jul-2016::17:06:24 === rabbit on node 'rabbit@overcloud-controller-1' down And 1 willingly goes down: =INFO REPORT==== 21-Jul-2016::17:06:24 === Stopping RabbitMQ So far so good. Four seconds later, 0 sees 2 go down: =INFO REPORT==== 21-Jul-2016::17:06:28 === rabbit on node 'rabbit@overcloud-controller-2' down And 2 went down willingly as well: =INFO REPORT==== 21-Jul-2016::17:06:27 === Stopping RabbitMQ Finally, 0 shuts itself down willingly: =INFO REPORT==== 21-Jul-2016::17:06:29 === Stopping RabbitMQ Cool. So at this point everything shut down cleanly. Packages are updated... node 1: Jul 21 17:09:30 Updated: rabbitmq-server-3.6.3-4.el7ost.noarch node 0: Jul 21 17:09:33 Updated: rabbitmq-server-3.6.3-4.el7ost.noarch node 2: Jul 21 17:09:33 Updated: rabbitmq-server-3.6.3-4.el7ost.noarch Now things start coming back. Node 0 is first to start: =INFO REPORT==== 21-Jul-2016::17:14:52 === Starting RabbitMQ 3.6.3 on Erlang 18.3.4.1 Worth noting that the mnesia database is already wiped: =INFO REPORT==== 21-Jul-2016::17:14:53 === Database directory at /var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-0 is empty. Initialising from scratch... Started all the way: =INFO REPORT==== 21-Jul-2016::17:14:53 === Server startup complete; 0 plugins started. Next node to start is 1. 
First thing of note is that the mnesia database was *not* wiped on the filesystem as it was on node 0: =INFO REPORT==== 21-Jul-2016::17:14:59 === upgrades: Mnesia dir backed up to "/var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-1-upgrade-backup" =INFO REPORT==== 21-Jul-2016::17:15:00 === Starting RabbitMQ 3.6.3 on Erlang 18.3.4.1 Aside, note that the config seems to be invalid, will file a new bug for this: =WARNING REPORT==== 21-Jul-2016::17:15:01 === Transport option binary unknown or invalid. =WARNING REPORT==== 21-Jul-2016::17:15:01 === Transport option {packet,raw} unknown or invalid. =WARNING REPORT==== 21-Jul-2016::17:15:01 === Transport option {reuseaddr,true} unknown or invalid. Node 1 clusters correctly with 0 and is fully started: =INFO REPORT==== 21-Jul-2016::17:15:01 === rabbit on node 'rabbit@overcloud-controller-0' up =INFO REPORT==== 21-Jul-2016::17:15:01 === Server startup complete; 0 plugins started. ... and then 3 seconds later the resource agent stops the application and tries to re-cluster it but it fails since it's already clustered: =INFO REPORT==== 21-Jul-2016::17:15:04 === Stopping RabbitMQ =INFO REPORT==== 21-Jul-2016::17:15:04 === stopped TCP Listener on 172.16.2.9:5672 =INFO REPORT==== 21-Jul-2016::17:15:04 === Stopped RabbitMQ application =INFO REPORT==== 21-Jul-2016::17:15:05 === Already member of cluster: ['rabbit@overcloud-controller-1', 'rabbit@overcloud-controller-0'] Doh. Meanwhile, node 0 sees node 1 come up then drop off again. At this point node 0 thinks it's in the minority so it pauses itself: =INFO REPORT==== 21-Jul-2016::17:15:07 === node 'rabbit@overcloud-controller-1' down: connection_closed =WARNING REPORT==== 21-Jul-2016::17:15:07 === Cluster minority/secondary status detected - awaiting recovery =INFO REPORT==== 21-Jul-2016::17:15:07 === Stopping RabbitMQ ... =INFO REPORT==== 21-Jul-2016::17:15:07 === Stopped RabbitMQ application At the same time (:11), nodes 0 and 1 both start again, first 0: =INFO REPORT==== 21-Jul-2016::17:15:11 === Starting RabbitMQ 3.6.3 on Erlang 18.3.4.1 ... =INFO REPORT==== 21-Jul-2016::17:15:11 === rabbit on node 'rabbit@overcloud-controller-1' up ... =INFO REPORT==== 21-Jul-2016::17:15:11 === Server startup complete; 0 plugins started. And 1: =INFO REPORT==== 21-Jul-2016::17:15:11 === Starting RabbitMQ 3.6.3 on Erlang 18.3.4.1 ... =INFO REPORT==== 21-Jul-2016::17:15:12 === rabbit on node 'rabbit@overcloud-controller-0' up =INFO REPORT==== 21-Jul-2016::17:15:12 === Server startup complete; 0 plugins started. But we end up with the same problem again. Node 1 tries to recluster but fails: =INFO REPORT==== 21-Jul-2016::17:15:15 === Stopping RabbitMQ =INFO REPORT==== 21-Jul-2016::17:15:15 === stopped TCP Listener on 172.16.2.9:5672 =INFO REPORT==== 21-Jul-2016::17:15:15 === Stopped RabbitMQ application =INFO REPORT==== 21-Jul-2016::17:15:16 === Already member of cluster: ['rabbit@overcloud-controller-1', 'rabbit@overcloud-controller-0'] After that the node 1 is marked stopped and remains in this state. Node 0 stops due to the minority detection again: =INFO REPORT==== 21-Jul-2016::17:15:18 === node 'rabbit@overcloud-controller-1' down: connection_closed =WARNING REPORT==== 21-Jul-2016::17:15:18 === Cluster minority/secondary status detected - awaiting recovery =INFO REPORT==== 21-Jul-2016::17:15:18 === Stopping RabbitMQ ... =INFO REPORT==== 21-Jul-2016::17:15:18 === Stopped RabbitMQ application So now we're at our end state. Node 1 failed to cluster twice and is now stopped. 
Meanwhile node 0 is paused because it thinks it's in the minority, waiting for node 1 to come back (but it never does). After a bit, node 2 comes to life but fails to start since it cannot be the first node to start (it did not stop last):

=INFO REPORT==== 21-Jul-2016::17:15:23 ===
upgrades: Mnesia dir backed up to "/var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-2-upgrade-backup"

=ERROR REPORT==== 21-Jul-2016::17:15:23 ===
Cluster upgrade needed but other disc nodes shut down after this one. Please first start the last disc node to shut down. Note: if several disc nodes were shut down simultaneously they may all show this message. In which case, remove the lock file on one of them and start that node. The lock file on this node is: /var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-2/nodes_running_at_shutdown

Looks like the resource agent kills it, wipes the mnesia directory and tries again:

=INFO REPORT==== 21-Jul-2016::17:15:29 ===
Starting RabbitMQ 3.6.3 on Erlang 18.3.4.1
...
=INFO REPORT==== 21-Jul-2016::17:15:30 ===
Database directory at /var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-2 is empty. Initialising from scratch...
...
=INFO REPORT==== 21-Jul-2016::17:15:30 ===
Server startup complete; 0 plugins started.

... and then subsequently kills it. Why, I'm not sure:

=INFO REPORT==== 21-Jul-2016::17:15:34 ===
Stopping RabbitMQ
...
=INFO REPORT==== 21-Jul-2016::17:15:34 ===
Stopped RabbitMQ application

That brings us to the final state when the upgrade is "complete".

What Eck wrote above matches my observation too. We might have two different but related issues here.

The issue with *minor updates* was that nodes weren't getting forgotten correctly, and I still believe that one is fixed by the old patch, so we still need that patch. (=> It would be a valid alternative to have two different BZs here to verify.)

The second issue, which happens only on *major upgrades*, is that the first reclustering attempt (without cleaning/forgetting) seems to bring down the rabbitmq app on the bootstrap node for some reason, and then when the second clustering attempt is made (the one where nodes are being forgotten and data cleared) there's no running RabbitMQ app to talk to. Interestingly, this seems to happen only during the upgrade. When we tested on OSP 9 only last week, a full cluster stop/start (like the one that happens on major upgrade) went fine.

I wrote down my manual recovery approach into a patch that seems to fix the second issue for me: https://github.com/jistr/resource-agents/commit/9fc04d5c97f62e2425ca6c7986bf274e060a3f6c (the general shape of such a manual re-join is sketched below, after this exchange). Two things worth noting:

1) I only tested this patch to re-form the cluster after the failure already happened. I'm still about to verify this patch in an actual upgrade run, to see if it is able to *prevent* the failure.

2) I'm not familiar enough with RabbitMQ to fully judge whether the approach I took with the patch is a decent solution or too big of a hack. It's possible that a cleaner solution to our troubles exists. (?)

(In reply to Jiri Stransky from comment #52)
> 1) I only tested this patch to re-form the cluster after the failure already
> happened. I'm still about to verify this patch in an actual upgrade run, to
> see if it is able to *prevent* the failure.

Yes, the patch prevented the RabbitMQ startup failure during major upgrade for me. In case this isn't an acceptable fix, perhaps we could at least use it as a workaround in the meantime, to allow us to validate later stages of the upgrade with a clean pacemaker cluster.

The patch needs to be applied in addition to the previous patch that fixed minor updates, which is here: https://github.com/ClusterLabs/resource-agents/commit/279bae7ec9a571a4d52b0d876850e27772eb0933 ^ this one is already in resource-agents-3.9.5-76.el7

Hi, FYI, I just tested jistr's latest fixup and it also looks good on my env. I got through the controller upgrade with the heat update complete and no services down (especially not rabbit!). For clarity, I first installed the latest resource-agents on the overcloud nodes and then applied the fix that jistr points to at comment #53.

The patch looks reasonable to me, especially since we've confirmed that it fixes the problem in testing. I'm not very confident that it will fix all outstanding clustering issues (I think there's probably some sort of race condition still), but it's an improvement, and given the time constraints I think we should go ahead with that patch.

(In reply to John Eckersberg from comment #51)
> OK, I've managed to reproduce this. The end result after the upgrade:
>
> Clone Set: rabbitmq-clone [rabbitmq]
> Started: [ overcloud-controller-0 ]
> Stopped: [ overcloud-controller-1 overcloud-controller-2 ]

Do you have sos reports for the controllers? I'd like to poke around in them to see which cluster actions were involved. I.e., did pacemaker tell rabbit to stop at '=INFO REPORT==== 21-Jul-2016::17:15:04 ===' or was that the agent? It looks like no, but I'd like to be 100% sure.

Upstream pull request with the workaround-fix-patch discussed above: https://github.com/ClusterLabs/resource-agents/pull/838

Just thought a bit more about the strange nature of this bug:

OSP 8 stop -> OSP 8 start -- worked fine
OSP 9 stop -> OSP 9 start -- worked fine
OSP 8 stop -> OSP 9 start -- troubles (I mean without the patch)

Perhaps it might be some bug/peculiarity in RabbitMQ itself, happening just during the RabbitMQ upgrade. Still, it might be OK to work around it in resource-agents unless we find a better solution.
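For reference, a manual re-join of a node that failed to cluster generally looks like the sketch below. These are standard rabbitmqctl commands and the node names are the ones from this environment; this only illustrates the recovery idea mentioned above and is not an excerpt from the patch:

# On the node that failed to cluster (e.g. overcloud-controller-1):
rabbitmqctl stop_app
rabbitmqctl reset                                        # wipe the local Mnesia state
rabbitmqctl join_cluster rabbit@overcloud-controller-0   # join via the surviving node
rabbitmqctl start_app
rabbitmqctl cluster_status                               # verify all three controllers are listed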
Just confirming that the cluster appears to be sending the right instructions to the resource agent:

# zgrep -e crmd:.*match_graph_event.*rabbit -e crmd:.*Initiating.*rabbit ./sosreport-overcloud-controller-2.localdomain-20160725135019/var/log/cluster/corosync.log-20160722.gz
Jul 21 17:13:10 [27957] overcloud-controller-2.localdomain crmd: notice: te_rsc_command: Initiating action 104: monitor rabbitmq:0_monitor_0 on overcloud-controller-2 (local)
Jul 21 17:13:10 [27957] overcloud-controller-2.localdomain crmd: notice: te_rsc_command: Initiating action 16: monitor rabbitmq:0_monitor_0 on overcloud-controller-0
Jul 21 17:13:10 [27957] overcloud-controller-2.localdomain crmd: notice: te_rsc_command: Initiating action 60: monitor rabbitmq:0_monitor_0 on overcloud-controller-1
Jul 21 17:13:11 [27957] overcloud-controller-2.localdomain crmd: info: match_graph_event: Action rabbitmq_monitor_0 (16) confirmed on overcloud-controller-0 (rc=7)
Jul 21 17:13:11 [27957] overcloud-controller-2.localdomain crmd: info: match_graph_event: Action rabbitmq_monitor_0 (104) confirmed on overcloud-controller-2 (rc=7)
Jul 21 17:13:12 [27957] overcloud-controller-2.localdomain crmd: info: match_graph_event: Action rabbitmq_monitor_0 (60) confirmed on overcloud-controller-1 (rc=7)
Jul 21 17:14:47 [27957] overcloud-controller-2.localdomain crmd: notice: te_rsc_command: Initiating action 108: start rabbitmq_start_0 on overcloud-controller-0
Jul 21 17:14:56 [27957] overcloud-controller-2.localdomain crmd: info: match_graph_event: Action rabbitmq_start_0 (108) confirmed on overcloud-controller-0 (rc=0)
Jul 21 17:14:56 [27957] overcloud-controller-2.localdomain crmd: notice: te_rsc_command: Initiating action 110: monitor rabbitmq_monitor_10000 on overcloud-controller-0
Jul 21 17:14:56 [27957] overcloud-controller-2.localdomain crmd: notice: te_rsc_command: Initiating action 111: start rabbitmq:1_start_0 on overcloud-controller-1
Jul 21 17:14:58 [27957] overcloud-controller-2.localdomain crmd: info: match_graph_event: Action rabbitmq_monitor_10000 (110) confirmed on overcloud-controller-0 (rc=0)
Jul 21 17:15:17 [27957] overcloud-controller-2.localdomain crmd: info: match_graph_event: Action rabbitmq_start_0 (111) confirmed on overcloud-controller-1 (rc=1)
Jul 21 17:15:17 [27957] overcloud-controller-2.localdomain crmd: info: match_graph_event: Action rabbitmq_start_0 (111) confirmed on overcloud-controller-1 (rc=1)
Jul 21 17:15:17 [27957] overcloud-controller-2.localdomain crmd: notice: te_rsc_command: Initiating action 14: stop rabbitmq_stop_0 on overcloud-controller-1
Jul 21 17:15:20 [27957] overcloud-controller-2.localdomain crmd: info: match_graph_event: Action rabbitmq_stop_0 (14) confirmed on overcloud-controller-1 (rc=0)
Jul 21 17:15:20 [27957] overcloud-controller-2.localdomain crmd: notice: te_rsc_command: Initiating action 111: start rabbitmq:1_start_0 on overcloud-controller-2 (local)
Jul 21 17:15:36 [27957] overcloud-controller-2.localdomain crmd: info: match_graph_event: Action rabbitmq_start_0 (111) confirmed on overcloud-controller-2 (rc=1)
Jul 21 17:15:36 [27957] overcloud-controller-2.localdomain crmd: info: match_graph_event: Action rabbitmq_start_0 (111) confirmed on overcloud-controller-2 (rc=1)
Jul 21 17:15:37 [27957] overcloud-controller-2.localdomain crmd: notice: te_rsc_command: Initiating action 6: stop rabbitmq_stop_0 on overcloud-controller-2 (local)
Jul 21 17:15:39 [27957] overcloud-controller-2.localdomain crmd: info: match_graph_event: Action rabbitmq_stop_0 (6) confirmed on overcloud-controller-2 (rc=0)

Three things...
1. Having understood the patch and its context, I would suggest that if this is the correct logic, it should happen in `rmq_start` prior to `rmq_forget_cluster_node_remotely` being called.
That the app may be started as a side-effect of `rmq_forget_cluster_node_remotely` being called is non-obvious and could result in calls to `rmq_notify` changing the state of the service, which is not looked on favorably.
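A minimal sketch of what that could look like, assuming the agent's existing $RMQ_CTL command and the rmq_start/rmq_forget_cluster_node_remotely functions named above; the helper below and the unconditional remote start_app call are illustrative assumptions, not the shipped rabbitmq-cluster code:

    # Illustrative sketch only -- not the shipped resource agent. The idea is
    # to make the "peer's rabbit app must be running before we ask it to
    # forget nodes" dependency an explicit step in rmq_start, instead of a
    # side effect hidden inside rmq_forget_cluster_node_remotely.
    RMQ_CTL="rabbitmqctl"

    # Hypothetical helper, called from rmq_start right before
    # rmq_forget_cluster_node_remotely; $1 is the surviving peer node that
    # will execute the forget_cluster_node request.
    rmq_prepare_forget() {
        peer="$1"
        # Assumed here (worth verifying) that start_app is safe to issue
        # even if the app is already running on the peer.
        $RMQ_CTL -n "$peer" start_app >/dev/null 2>&1
    }

Whether remotely starting the peer's app is the right behavior at all is exactly the question raised above; the sketch only shows where such a step would be visible rather than implicit.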
2. The RMQ_CRM_ATTR_COOKIE logic seems faulty to me.
Presumably, the purpose is to ensure that the last node to stop is the first one to start. Is this why the attribute is not deleted in `rmq_stop`?
However, I cannot figure out how this is supposed to work, as:
a. we perform an initial monitor of all resources before starting them
b. the value of RMQ_CRM_ATTR_COOKIE is deleted during a monitor if the resource is not running
c. the start operation doesn't seem to have a check that prevents the "wrong" node from starting
Instead, if the cluster starts the "wrong" node first, that node will just wipe the mnesia db and start again.
We should probably have a conversation outside of bugzilla about how this works or should work.
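If the intent really is "the last node to stop must be the first to start", the kind of start-time check that point (c) says is missing might look roughly like the guard below. This is purely hypothetical: it assumes the cookie attribute records which node stopped last, which is only the presumption stated above, and the attribute name is a placeholder rather than the agent's real one.

    # Hypothetical start guard -- NOT the agent's current behavior.
    RMQ_CRM_ATTR_COOKIE="rmq-node-attr-cookie"   # placeholder attribute name

    rmq_may_bootstrap() {
        local_node=$(crm_node -n)
        last_stopper=$(crm_attribute --name "$RMQ_CRM_ATTR_COOKIE" --query --quiet 2>/dev/null)

        # No recorded value: nothing to enforce, allow the start.
        [ -z "$last_stopper" ] && return 0

        # Refuse to bootstrap from the "wrong" node instead of silently
        # wiping the local mnesia directory and re-initialising.
        [ "$last_stopper" = "$local_node" ]
    }

rmq_start could then bail out (or wait) when rmq_may_bootstrap fails, rather than wiping the mnesia db and starting again.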
3. This particular case...
> ... and then 3 seconds later the resource agent stops the application
> and tries to re-cluster it but it fails since it's already clustered:
This seems to be the root cause IMHO.
Rabbit (now?) does more than the agent expects.
If we look at `rmq_join_existing`, it calls `rmq_init_and_wait` followed by `$RMQ_CTL stop_app` followed by `$RMQ_CTL join_cluster $node` in a loop.
This is where I would be looking to patch the agent by finding a way to make the stop conditional.
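A rough sketch of that conditional-stop idea, reusing the names mentioned above ($RMQ_CTL, rmq_init_and_wait, rmq_join_existing); rmq_node_in_cluster is a made-up helper and this is not the literal patch that eventually landed upstream:

    # Sketch only: skip the stop_app/join_cluster dance if the freshly
    # started node already came back as a member of the running cluster.
    rmq_node_in_cluster() {
        # Crude check: does the local cluster_status output mention the
        # target node at all?
        $RMQ_CTL cluster_status 2>/dev/null | grep -q "$1"
    }

    rmq_join_existing() {
        join_list="$1"

        rmq_init_and_wait || return 1

        for node in $join_list; do
            if rmq_node_in_cluster "$node"; then
                # Already clustered -- the unconditional stop/join is what
                # was failing (join_cluster refuses on an already-clustered
                # node), so just leave it alone.
                return 0
            fi

            $RMQ_CTL stop_app
            if $RMQ_CTL join_cluster "$node"; then
                $RMQ_CTL start_app
                return 0
            fi
        done

        return 1
    }

Per the follow-up comments, the patch that was eventually merged takes this general direction (check the cluster status before doing the stop), though its exact implementation may differ from the sketch.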
(In reply to Andrew Beekhof from comment #61)
> Three things...
>
> 1. Having understood the patch and its context, I would suggest that if
> this is the correct logic, it should happen in `rmq_start` prior to
> `rmq_forget_cluster_node_remotely` being called.
>
> That the app may be started as a side-effect of
> `rmq_forget_cluster_node_remotely` being called is non-obvious and could
> result in calls to `rmq_notify` changing the state of the service, which is
> not looked on favorably.

Yeah, now that I look more closely at this patch, I see that given two nodes A and B, when the resource agent runs on node B to join A, it will cause a side-effect on node A by remotely starting the rabbitmq app on node A. This probably is not what we want to happen. That violates a clean separation of control between the nodes.

> 2. The RMQ_CRM_ATTR_COOKIE logic seems faulty to me.
>
> Presumably, the purpose is to ensure that the last node to stop is the first
> one to start. Is this why the attribute is not deleted in `rmq_stop`?
>
> However, I cannot figure out how this is supposed to work, as:
>
> a. we perform an initial monitor of all resources before starting them
> b. the value of RMQ_CRM_ATTR_COOKIE is deleted during a monitor if the
> resource is not running
> c. the start operation doesn't seem to have a check that prevents the
> "wrong" node from starting
>
> Instead, if the cluster starts the "wrong" node first, that node will just
> wipe the mnesia db and start again.
>
> We should probably have a conversation outside of bugzilla about how this
> works or should work.

I have no idea what the intention is, but I will take your word as to the actual implemented behavior :)

+1 that this really needs extra discussion. I think for the immediate need it's not a huge deal.

> 3. This particular case...
>
> > ... and then 3 seconds later the resource agent stops the application
> > and tries to re-cluster it but it fails since it's already clustered:
>
> This seems to be the root cause IMHO.
> Rabbit (now?) does more than the agent expects.
>
> If we look at `rmq_join_existing`, it calls `rmq_init_and_wait` followed by
> `$RMQ_CTL stop_app` followed by `$RMQ_CTL join_cluster $node` in a loop.
>
> This is where I would be looking to patch the agent by finding a way to make
> the stop conditional.

Yep, I'm going to try to patch the agent to check the cluster status before doing the stop. If it starts and the cluster is still fine... just leave it alone!

Proposed patch - https://github.com/ClusterLabs/resource-agents/pull/839

Seems to fix this in my testing.

(In reply to John Eckersberg from comment #62)
> +1 that this really needs extra discussion. I think for the immediate
> need it's not a huge deal.

Agreed. Not something we want to address now, but it would be good to do in the next cycle.

> > 3. This particular case...
> >
> > > ... and then 3 seconds later the resource agent stops the application
> > > and tries to re-cluster it but it fails since it's already clustered:
> >
> > This seems to be the root cause IMHO.
> > Rabbit (now?) does more than the agent expects.
> >
> > If we look at `rmq_join_existing`, it calls `rmq_init_and_wait` followed by
> > `$RMQ_CTL stop_app` followed by `$RMQ_CTL join_cluster $node` in a loop.
> >
> > This is where I would be looking to patch the agent by finding a way to make
> > the stop conditional.
>
> Yep, I'm going to try to patch the agent to check the cluster status
> before doing the stop. If it starts and the cluster is still
> fine... just leave it alone!

Perfect

https://github.com/ClusterLabs/resource-agents/commit/1621dbb60454840d469f3a0e317a97d94510f7ab

Merged upstream.

I used eck's fix today instead of the previous workaround and had a successful 8->9 major upgrade with it :)

*** Bug 1361750 has been marked as a duplicate of this bug. ***

Checked that the patch is inside resource-agents-3.9.5-80.el7.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2174.html