Bug 1343905

Summary: RabbitMQ fails to rejoin the cluster during overcloud update
Product: Red Hat Enterprise Linux 7
Component: resource-agents
Version: 7.2
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Jiri Stransky <jstransk>
Assignee: Peter Lemenkov <plemenko>
QA Contact: Leonid Natapov <lnatapov>
CC: abeekhof, agk, cluster-maint, dmaley, fdinitto, jeckersb, jstransk, mandreou, mburns, mcornea, michele, mkolaja, mnovacek, nkrishna, oalbrigt, plemenko, rhel-osp-director-maint, rscarazz, sasha, sclewis, ushkalim
Keywords: ZStream
Target Milestone: pre-dev-freeze
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Fixed In Version: resource-agents-3.9.5-80.el7
Cloned to: 1351446 (view as bug list)
Bug Blocks: 1333976, 1333977, 1343027, 1344405, 1351446, 1351784
Last Closed: 2016-11-04 00:03:42 UTC
Attachments:
  pacemaker journal during the update
  pcs status
  pacemaker journal June 17
  pacemaker journal June 17 2nd try

Description Jiri Stransky 2016-06-08 09:50:53 UTC
Description of problem:

When testing an overcloud minor update on OSP 9, RabbitMQ failed to come back up after the yum update, even though the rabbitmq package itself wasn't updated. This could be a RabbitMQ issue, a TripleO issue, or a Pacemaker resource agent issue.

Version-Release number of selected component (if applicable):

OC:
rabbitmq-server-3.6.2-3.el7ost.noarch

UC:
python-tripleoclient-2.0.0-1.0.2.el7ost.noarch
openstack-tripleo-heat-templates-kilo-2.0.0-8.el7ost.noarch
openstack-tripleo-0.0.8-0.2.d81bd6dgit.el7ost.noarch
openstack-tripleo-puppet-elements-2.0.0-2.el7ost.noarch
openstack-tripleo-heat-templates-liberty-2.0.0-8.el7ost.noarch
openstack-tripleo-heat-templates-2.0.0-8.el7ost.noarch
openstack-tripleo-common-2.0.0-5.el7ost.noarch
openstack-tripleo-image-elements-0.9.9-3.el7ost.noarch

How reproducible:

2 of 2 tries


Steps to Reproduce:

# on OSP 9 poodle

openstack overcloud deploy \
    --libvirt-type qemu \
    --ntp-server clock.redhat.com \
    --control-scale 3 \
    --templates

# update repos on overcloud via rhos-release

openstack overcloud update stack overcloud -i --templates

Comment 1 Jiri Stransky 2016-06-08 10:07:03 UTC
The interesting part of pacemaker journal:

čen 08 09:26:58 overcloud-controller-0.localdomain lrmd[21781]:   notice: rabbitmq_start_0:23020:stderr [ User defined signal 2 ]
čen 08 09:26:58 overcloud-controller-0.localdomain lrmd[21781]:   notice: rabbitmq_start_0:23020:stderr [ User defined signal 2 ]
čen 08 09:26:58 overcloud-controller-0.localdomain lrmd[21781]:   notice: rabbitmq_start_0:23020:stderr [ Error: {aborted,{no_exists,[rabbit_user, ]
čen 08 09:26:58 overcloud-controller-0.localdomain lrmd[21781]:   notice: rabbitmq_start_0:23020:stderr [                             [{{internal_user,'$1','_','_'}, ]
čen 08 09:26:58 overcloud-controller-0.localdomain lrmd[21781]:   notice: rabbitmq_start_0:23020:stderr [                               [{'/=','$1',<<"guest">>}], ]
čen 08 09:26:58 overcloud-controller-0.localdomain lrmd[21781]:   notice: rabbitmq_start_0:23020:stderr [                               ['$_']}]]}} ]
čen 08 09:26:58 overcloud-controller-0.localdomain lrmd[21781]:   notice: rabbitmq_start_0:23020:stderr [ User defined signal 2 ]
čen 08 09:26:58 overcloud-controller-0.localdomain lrmd[21781]:   notice: rabbitmq_start_0:23020:stderr [ User defined signal 2 ]
čen 08 09:26:58 overcloud-controller-0.localdomain crmd[21784]:   notice: Operation rabbitmq_start_0: unknown error (node=overcloud-controller-0, call=224, rc=1, cib-update=64, confirmed=true)
čen 08 09:26:58 overcloud-controller-0.localdomain crmd[21784]:   notice: overcloud-controller-0-rabbitmq_start_0:224 [ User defined signal 2\nUser defined signal 2\nError: {aborted,{no_exists,[rabbit_user,\n   
                         [{{internal_user,'$1','_','_'},\n                              [{'/=','$1',<<"guest">>}],\n                              ['$_']}]]}}\nUser defined signal 2\nUser defined signal 2\n ]

Comment 2 Jiri Stransky 2016-06-08 10:07:38 UTC
Created attachment 1165902 [details]
pacemaker journal during the update

Comment 3 Jiri Stransky 2016-06-08 10:08:54 UTC
Created attachment 1165903 [details]
pcs status

Comment 4 Jiri Stransky 2016-06-09 10:12:34 UTC
I thought it might be a limitation of the RabbitMQ loopback user list, so I tried with the patch that addresses this upstream [1], but I still get the same behavior: RabbitMQ is not able to rejoin the cluster.

[1] https://github.com/openstack/tripleo-heat-templates/commit/0e6e864b3b7d125e7eba8c71724d3a458ffef7dd
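
(As a quick sanity check, the effective loopback restriction can be read back from the broker; this is only a hedged example and assumes the setting is exposed through the standard rabbit application environment:)

    # an empty list means no users are restricted to loopback-only connections
    sudo rabbitmqctl eval 'application:get_env(rabbit, loopback_users).'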

Comment 5 Jiri Stransky 2016-06-09 12:44:08 UTC
Could this be a bug in the resource agent, perhaps? Looking at /usr/lib/ocf/resource.d/heartbeat/rabbitmq-cluster, I see that rmq_stop contains the line:

Users = mnesia:dirty_select(rabbit_user, [{ {internal_user, '\\\$1', '_', '_'}, [{'/=', '\\\$1', <<\"guest\">>}], ['\\\$_'] } ])

which corresponds almost exactly to the error message. However, when I look at the upstream code for this agent:

https://github.com/ClusterLabs/resource-agents/blob/902bd05738aaaf3c1de76afb6526ba7ff6fc2312/heartbeat/rabbitmq-cluster#L352

... the line is not there. Is it possible that this is a bug that has already been addressed upstream and we're just missing a backport? Moving to the resource-agents component for further investigation.
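
(For context: the agent feeds that Erlang snippet to `rabbitmqctl eval`, roughly as sketched below - a hedged reconstruction, not the exact downstream code. The {aborted,{no_exists,[rabbit_user,...]}} error simply means the rabbit_user mnesia table does not exist at the moment the query runs, e.g. on a node whose database has just been reset.)

    # hedged sketch of the non-guest users dump attempted in the stop path
    rabbitmqctl eval \
      "mnesia:dirty_select(rabbit_user, [{ {internal_user, '\$1', '_', '_'}, [{'/=', '\$1', <<\"guest\">>}], ['\$_'] } ])."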

Comment 6 Jiri Stransky 2016-06-09 12:46:55 UTC
Package versions:

resource-agents-3.9.5-54.el7_2.10.x86_64

rabbitmq-server-3.6.2-3.el7ost.noarch

pacemaker-remote-1.1.13-10.el7_2.2.x86_64
pacemaker-cluster-libs-1.1.13-10.el7_2.2.x86_64
pacemaker-1.1.13-10.el7_2.2.x86_64
pacemaker-libs-1.1.13-10.el7_2.2.x86_64
pacemaker-cli-1.1.13-10.el7_2.2.x86_64

corosync-2.3.4-7.el7_2.1.x86_64
corosynclib-2.3.4-7.el7_2.1.x86_64

Comment 9 Marios Andreou 2016-06-10 15:28:27 UTC
Hey, FYI I am also seeing this issue. I am trying the upgrades workflow and for controller upgrades I am reworking the scripts at https://review.openstack.org/#/c/321027/ for Mitaka to Newton upgrades. 

As you know :) during the controller upgrade we first run controller_pacemaker_1.sh to bring the cluster down. That happens fine, but then during service startup I see rabbit failing to start on controllers 1 and 2 (while it is running on controller 0). On controller-2, for example, the logs look like this:

Jun 10 14:56:58 localhost rabbitmq-cluster(rabbitmq)[23557]: INFO: RabbitMQ server is not running
Jun 10 14:56:58 localhost lrmd[10099]:  notice: rabbitmq_stop_0:23557:stderr [ Error: {aborted,{no_exists,[rabbit_user, ]
Jun 10 14:56:58 localhost lrmd[10099]:  notice: rabbitmq_stop_0:23557:stderr [                             [{{internal_user,'$1','_','_'}, ]
Jun 10 14:56:58 localhost lrmd[10099]:  notice: rabbitmq_stop_0:23557:stderr [                               [{'/=','$1',<<"guest">>}], ]
Jun 10 14:56:58 localhost lrmd[10099]:  notice: rabbitmq_stop_0:23557:stderr [                               ['$_']}]]}} ]
Jun 10 14:56:58 localhost crmd[10102]:  notice: Operation rabbitmq_stop_0: ok (node=overcloud-controller-2, call=247, rc=0, cib-update=80, confirmed=true)
Jun 10 14:57:02 localhost redis(redis)[24917]: INFO: start: /usr/bin/redis-server --daemonize yes --unixsocket '/var/run/redis/redis.sock' --pidfile '/var/run/redis/redis-server.pid'
Jun 10 14:57:02 localhost su: (to redis) root on none
Jun 10 14:57:02 localhost systemd: Created slice user-984.slice.
Jun 10 14:57:02 localhost systemd: Starting user-984.slice.
Jun 10 14:57:02 localhost systemd: Started Session c5359 of user redis.
Jun 10 14:57:02 localhost systemd: Starting Session c5359 of user redis.
Jun 10 14:57:02 localhost redis(redis)[24917]: INFO: Postponing setting master score for overcloud-controller-2 until last known master instance [overcloud-controller-0] is promoted

Initially I thought it was something related to my scripts, so I tested current master and saw the same issue.

thanks, marios

Comment 10 Peter Lemenkov 2016-06-14 15:11:25 UTC
OK, I think I've got a reproducer. Expect a fix and a test build by this Wednesday.

Comment 11 Peter Lemenkov 2016-06-16 15:13:03 UTC
Got a bugfix:

https://github.com/lemenkov/resource-agents/commit/f00a952bd5e133cad30689d9edcc98f5d33a71a9

Will do a test build shortly.

Comment 14 Peter Lemenkov 2016-06-17 10:30:05 UTC
Jiri, please retest with this resource-agents build.

Comment 15 Jiri Stransky 2016-06-17 12:35:40 UTC
It still fails, but with a different error message now; I will attach the current pacemaker journal output.

It's worth noting that I did the following to test:

* Deployed with what we get from repositories set up with rhos-release, which is resource-agents-3.9.5-54.el7_2.10.x86_64

* Updated the deployment with the new resource-agents I got from Brew, here:
  https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=499744

* Then I tried running the update. The stop went fine, but the rejoin didn't.

Is there a reason to believe that if I deployed with the version from Brew ^^ from the very start, instead of updating to it, the rejoin would work? If so, I can also try installing the package before Pacemaker is started.

Comment 16 Jiri Stransky 2016-06-17 12:36:15 UTC
Created attachment 1169119 [details]
pacemaker journal June 17

Comment 17 Jiri Stransky 2016-06-17 14:11:52 UTC
Did another try where I updated the resource-agents even before starting Pacemaker; the error appears to be the same. This issue might need some further investigation. Let me know if you want to log into the environment.

Comment 18 Jiri Stransky 2016-06-17 14:12:25 UTC
Created attachment 1169140 [details]
pacemaker journal June 17 2nd try

Comment 19 Peter Lemenkov 2016-06-18 09:54:23 UTC
(In reply to Jiri Stransky from comment #17)
> Did another try where I updated the resource-agents even before starting
> Pacemaker; the error appears to be the same. This issue might need some
> further investigation. Let me know if you want to log into the environment.

I'd love to have a login. Could you please send it to me off-list? I'm "petro" on IRC.

Comment 21 Peter Lemenkov 2016-06-21 11:04:40 UTC
Status update. The scriptlets in the start/stop sections of the resource agent script cannot be executed when the cluster isn't ready yet, or when only one node remains (cluster shutdown). That's not a problem, because these scriptlets will be executed on a different node (or were already executed on a different node).

So we should check whether the cluster is ready and run the scriptlets only when possible.

We may also restart the internal DB, re-read the data, and stop the internal DB again.

I've got a patch - will make a build soon.
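
(A hedged sketch of the kind of guard described above - the helper name is illustrative, not the actual patch:)

    # only run the mnesia scriptlets if the local broker actually answers;
    # otherwise skip them and let a running node handle the bookkeeping
    if rabbitmqctl -q status >/dev/null 2>&1; then
        run_user_and_policy_scriptlets   # hypothetical helper standing in for the eval calls
    fi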

Comment 23 Oyvind Albrigtsen 2016-06-21 14:40:54 UTC
New build with the latest patch.

Comment 25 Jiri Stransky 2016-06-22 10:39:58 UTC
Unfortunately it's still failing. I left the environment intact for further investigation; it can be accessed the same way as before.

Failed Actions:
* rabbitmq_start_0 on overcloud-controller-0 'unknown error' (1): call=217, status=complete, exitreason='none',
    last-rc-change='Wed Jun 22 10:31:55 2016', queued=0ms, exec=13412ms

The error message is different from what it was previously:

čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [ User defined signal 2 ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [ User defined signal 2 ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-0': nodedown ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [  ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [ DIAGNOSTICS ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [ =========== ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [  ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [ attempted to contact: ['rabbit@overcloud-controller-0'] ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [  ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [ rabbit@overcloud-controller-0: ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [   * connected to epmd (port 4369) on overcloud-controller-0 ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [   * epmd reports: node 'rabbit' not running at all ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [                   no other nodes on overcloud-controller-0 ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [   * suggestion: start the node ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [  ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [ current node details: ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [ - node name: 'rabbitmq-cli-57@overcloud-controller-0' ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [ - home dir: /var/lib/rabbitmq ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [ - cookie hash: jhpOzW5+oUaNAkMXDesQyQ== ]
čen 22 10:32:08 overcloud-controller-0.localdomain lrmd[1077]:   notice: rabbitmq_start_0:1806:stderr [  ]

Comment 30 Jiri Stransky 2016-06-23 16:36:58 UTC
Posted a pull request upstream:

https://github.com/ClusterLabs/resource-agents/pull/831

Tested by applying the patch

https://github.com/jistr/resource-agents/commit/f727570225fe78debdfb285446b2a6fbf8b75dad.patch

on top of resource-agents-3.9.5-54.el7_2.10.x86_64, and the RabbitMQ nodes were able to leave and rejoin the cluster fine when testing an OSPd 9 minor update. Also tested separately with a simple `pcs cluster stop` and `pcs cluster start` on one of the controller nodes, and again the node rejoined fine.
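
(For anyone reproducing that test, a hedged outline - it assumes the commit only touches heartbeat/rabbitmq-cluster and that the GitHub ".patch" form applies cleanly to the installed agent:)

    # apply the referenced commit on top of the packaged agent
    curl -L https://github.com/jistr/resource-agents/commit/f727570225fe78debdfb285446b2a6fbf8b75dad.patch \
      | sudo patch -p1 -d /usr/lib/ocf/resource.d

    # simple rejoin check on one controller node
    sudo pcs cluster stop overcloud-controller-1
    sudo pcs cluster start overcloud-controller-1
    sudo pcs status | grep -A 3 'rabbitmq-clone'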

Comment 34 Oyvind Albrigtsen 2016-06-24 09:21:26 UTC
New build with the latest patch.

Comment 36 Jiri Stransky 2016-06-24 14:23:06 UTC
Retested with the RPMs only (resource-agents-3.9.5-76.el7), without additional modifications; it worked fine.

Comment 40 Alexander Chuzhoy 2016-07-06 14:31:10 UTC
Environment:
resource-agents-3.9.5-76.el7.x86_64

 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ overcloud-controller-0 ]
     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]

Unable to start rabbitmq on controllers 1 and 2 (tried to cleanup/restart).


Jul  6 13:40:58 overcloud-controller-1 lrmd[25757]:  notice: rabbitmq_stop_0:33861:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-1': nodedown ]
Jul  6 13:40:58 overcloud-controller-1 lrmd[25757]:  notice: rabbitmq_stop_0:33861:stderr [  ]
Jul  6 13:40:58 overcloud-controller-1 lrmd[25757]:  notice: rabbitmq_stop_0:33861:stderr [ DIAGNOSTICS ]
Jul  6 13:40:58 overcloud-controller-1 lrmd[25757]:  notice: rabbitmq_stop_0:33861:stderr [ =========== ]
Jul  6 13:40:58 overcloud-controller-1 lrmd[25757]:  notice: rabbitmq_stop_0:33861:stderr [  ]
Jul  6 13:40:58 overcloud-controller-1 lrmd[25757]:  notice: rabbitmq_stop_0:33861:stderr [ attempted to contact: ['rabbit@overcloud-controller-1'] ]
Jul  6 13:40:58 overcloud-controller-1 lrmd[25757]:  notice: rabbitmq_stop_0:33861:stderr [  ]
Jul  6 13:40:58 overcloud-controller-1 lrmd[25757]:  notice: rabbitmq_stop_0:33861:stderr [ rabbit@overcloud-controller-1: ]
Jul  6 13:40:58 overcloud-controller-1 lrmd[25757]:  notice: rabbitmq_stop_0:33861:stderr [   * connected to epmd (port 4369) on overcloud-controller-1 ]
Jul  6 13:40:58 overcloud-controller-1 lrmd[25757]:  notice: rabbitmq_stop_0:33861:stderr [   * epmd reports: node 'rabbit' not running at all ]
Jul  6 13:40:58 overcloud-controller-1 lrmd[25757]:  notice: rabbitmq_stop_0:33861:stderr [                   no other nodes on overcloud-controller-1 ]
Jul  6 13:40:58 overcloud-controller-1 lrmd[25757]:  notice: rabbitmq_stop_0:33861:stderr [   * suggestion: start the node ]
Jul  6 13:40:58 overcloud-controller-1 lrmd[25757]:  notice: rabbitmq_stop_0:33861:stderr [  ]
Jul  6 13:40:58 overcloud-controller-1 lrmd[25757]:  notice: rabbitmq_stop_0:33861:stderr [ current node details: ]
Jul  6 13:40:58 overcloud-controller-1 lrmd[25757]:  notice: rabbitmq_stop_0:33861:stderr [ - node name: 'rabbitmq-cli-06@overcloud-controller-1' ]
Jul  6 13:40:58 overcloud-controller-1 lrmd[25757]:  notice: rabbitmq_stop_0:33861:stderr [ - home dir: /var/lib/rabbitmq ]
Jul  6 13:40:58 overcloud-controller-1 lrmd[25757]:  notice: rabbitmq_stop_0:33861:stderr [ - cookie hash: imq2MuvlmJpQO2jdNBEsEA== ]
Jul  6 13:40:58 overcloud-controller-1 lrmd[25757]:  notice: rabbitmq_stop_0:33861:stderr [  ]
Jul  6 13:40:58 overcloud-controller-1 crmd[25760]:  notice: Operation rabbitmq_stop_0: ok (node=overcloud-controller-1, call=1231, rc=0, cib-update=765, confirmed=true)
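
(The cleanup/restart attempts mentioned above typically amount to something like the following - a hedged sketch of the usual commands:)

    sudo pcs resource cleanup rabbitmq
    sudo pcs resource restart rabbitmq-clone
    sudo pcs status | grep -A 3 'rabbitmq-clone'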

Comment 41 Marios Andreou 2016-07-08 17:07:26 UTC
Hi, I hit this today too, in the same way reported by Sasha in comment #40. Providing some more info here for debugging - bandini said there may be an issue with the way the fix was applied, or with the sequencing:

1. Deploy an OSP 8 undercloud/overcloud, upgrade the undercloud, aodh migration on the overcloud, keystone migration on the overcloud, upgrade-init on the overcloud (repo switch).

Applied the os-collect-config fixup for bug 1350489:

    for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo yum -y update os-collect-config;  echo '';"; done

Install the latest resource-agents for this bug:

    for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo yum -y install http://download.eng.bos.redhat.com/brewroot/packages/resource-agents/3.9.5/76.el7/x86_64/resource-agents-3.9.5-76.el7.x86_64.rpm; echo '';"; done

Confirm the agents were installed ok:

    [stack@instack ~]$ for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; sudo rpm -qa | grep resource-agents; echo '';"; done
    overcloud-compute-0.localdomain resource-agents-3.9.5-76.el7.x86_64
    overcloud-controller-0.localdomain resource-agents-3.9.5-76.el7.x86_64
    overcloud-controller-1.localdomain resource-agents-3.9.5-76.el7.x86_64
    overcloud-controller-2.localdomain resource-agents-3.9.5-76.el7.x86_64

Run the controller upgrade:

    openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml  --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org' -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml 

Heat declares "UPDATE_COMPLETE", but rabbit is still down on controllers 1 and 2:

    26- Clone Set: rabbitmq-clone [rabbitmq]
    27-     Started: [ overcloud-controller-0 ]
    28:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]

Neither a `pcs resource cleanup` nor a `pcs resource restart` of rabbit helps. The errors I see look like this:

[stack@instack ~]$ for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo journalctl -n 100000000 | grep -ni "error:.*rabbit"  ; echo '';"; done
overcloud-compute-0.localdomain


overcloud-controller-0.localdomain

219715:Jul 08 12:34:53 overcloud-controller-0.localdomain lrmd[5703]:   notice: rabbitmq_stop_0:15791:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-0': nodedown ]

overcloud-controller-1.localdomain

243554:Jul 08 13:09:35 overcloud-controller-1.localdomain rabbitmq-cluster(rabbitmq)[5423]: ERROR: Failed to forget node rabbit@overcloud-controller-1 via rabbit@overcloud-controller-0.
243613:Jul 08 13:09:39 overcloud-controller-1.localdomain lrmd[26882]:   notice: rabbitmq_start_0:3843:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-1': nodedown ]
243631:Jul 08 13:09:39 overcloud-controller-1.localdomain lrmd[26882]:   notice: rabbitmq_start_0:3843:stderr [ Error: {offline_node_no_offline_flag,"You are trying to remove a node from an offline node. That is dangerous, but can be done with the --offline flag. Please consult the manual for rabbitmqctl for more information."} ]
243633:Jul 08 13:09:39 overcloud-controller-1.localdomain lrmd[26882]:   notice: rabbitmq_start_0:3843:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-1': nodedown ]
243652:Jul 08 13:09:39 overcloud-controller-1.localdomain crmd[26885]:   notice: overcloud-controller-1-rabbitmq_start_0:231 [ init terminating in do_boot ()\r\nError: unable to connect to node 'rabbit@overcloud-controller-1': nodedown\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@overcloud-controller-1']\n\nrabbit@overcloud-controller-1:\n  * connected to epmd (port 4369) on overcloud-controller-1\n  * epmd reports: node 'rabbit' not running at all\n                  no other nodes on overcloud-controller-1\n  * suggestion: start the node\n\ncurrent node deta
243664:Jul 08 13:09:40 overcloud-controller-1.localdomain lrmd[26882]:   notice: rabbitmq_stop_0:6518:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-1': nodedown ]

overcloud-controller-2.localdomain

221752:Jul 08 13:09:54 overcloud-controller-2.localdomain rabbitmq-cluster(rabbitmq)[8132]: ERROR: Failed to forget node rabbit@overcloud-controller-2 via rabbit@overcloud-controller-0.
221801:Jul 08 13:09:59 overcloud-controller-2.localdomain lrmd[505]:   notice: rabbitmq_start_0:7169:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-2': nodedown ]
221821:Jul 08 13:09:59 overcloud-controller-2.localdomain lrmd[505]:   notice: rabbitmq_start_0:7169:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-2': nodedown ]
221840:Jul 08 13:09:59 overcloud-controller-2.localdomain crmd[509]:   notice: overcloud-controller-2-rabbitmq_start_0:227 [ Error: process_not_running\nError: unable to connect to node 'rabbit@overcloud-controller-2': nodedown\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@overcloud-controller-2']\n\nrabbit@overcloud-controller-2:\n  * connected to epmd (port 4369) on overcloud-controller-2\n  * epmd reports: node 'rabbit' not running at all\n                  no other nodes on overcloud-controller-2\n  * suggestion: start the node\n\ncurrent node details:\n
221856:Jul 08 13:10:00 overcloud-controller-2.localdomain lrmd[505]:   notice: rabbitmq_stop_0:9364:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-2': nodedown ]

Comment 42 Michele Baldessari 2016-07-08 17:24:34 UTC
I am wondering if a simple upgrade of the resource-agents package is enough in
this case. Normally, given that Pacemaker forks()/execs() the agent, that is
okay; but if the newer RA changes metadata, or changes behaviour that relies on
a previous action, we would need a restart of the resource for the changes to
take effect.

From a very quick glance, it seems that Jiri's patch does not fall into either
of the above categories, so a simple upgrade should be fine. Peter, it'd be
nice if you could confirm this.

Marios, we could easily confirm this (if you reproduce the issue 100% of the
time, that is) by restarting the rabbitmq resource after the resource-agents
package upgrade, waiting for the cluster to settle fully, and then doing the
upgrade.
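
(The confirmation suggested above would look roughly like this - a hedged sketch, run on one of the controllers before kicking off the upgrade:)

    sudo pcs resource restart rabbitmq-clone     # pick up the updated agent
    sudo crm_resource --wait                     # wait for the cluster to settle
    sudo pcs status | grep -A 3 'rabbitmq-clone'
    # ...then run the overcloud upgrade as usual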

Comment 43 Alexander Chuzhoy 2016-07-12 13:51:32 UTC
FailedQA:
FailingQA based on comment #40 and comment #41.

Comment 44 Marios Andreou 2016-07-12 16:31:53 UTC
(In reply to Michele Baldessari from comment #42)
> I am wondering if a simple upgrade of the resource-agents package is enough in
> this case. Normally, given that Pacemaker forks()/execs() the agent, that is
> okay; but if the newer RA changes metadata, or changes behaviour that relies on
> a previous action, we would need a restart of the resource for the changes to
> take effect.
> 
> From a very quick glance, it seems that Jiri's patch does not fall into either
> of the above categories, so a simple upgrade should be fine. Peter, it'd be
> nice if you could confirm this.
> 
> Marios, we could easily confirm this (if you reproduce the issue 100% of the
> time, that is) by restarting the rabbitmq resource after the resource-agents
> package upgrade, waiting for the cluster to settle fully, and then doing the
> upgrade.

Hi Bandini, I only just managed to revisit this. So, I ran it again today as you suggested: I updated the resource-agents, did a `pcs resource restart rabbitmq`, waited for things to settle, and ran the upgrade again. Unfortunately, same result: heat declares UPDATE_COMPLETE, however rabbit is down on controllers 1 and 2. (For clarity, I updated the resource agents as in comment #40 and confirmed my nodes had resource-agents-3.9.5-76.el7.x86_64):


    Every 2.0s: pcs status | grep -ni 'stop' -C 2                                     Tue Jul 12 14:35:04 2016

    26- Clone Set: rabbitmq-clone [rabbitmq]
    27-     Started: [ overcloud-controller-0 ]
    28:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
    29- Clone Set: memcached-clone [memcached]
    30-     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]


WRT errors around that time (the heat stack update was complete at "2016-07-12 14:32:23 [0]: SIGNAL_COMPLETE"): I see errors at 14:27, for example, that may be relevant, from controllers 1 & 2. I also captured the sos reports, but BZ won't let me attach them (23, 25 and 28 MB from the controllers; there is a 20 MB limit apparently). If useful, we can work out how to get them to you tomorrow, but for now:


[stack@instack ~]$ for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo journalctl -n 100000000 | grep -ni "error:.*rabbit" ; echo '';"; done
overcloud-compute-0.localdomain


overcloud-controller-0.localdomain

234269:Jul 12 13:46:22 overcloud-controller-0.localdomain lrmd[6277]:   notice: rabbitmq_stop_0:9695:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-0': nodedown ]
241259:Jul 12 14:02:34 overcloud-controller-0.localdomain lrmd[6277]:   notice: rabbitmq_stop_0:27855:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-0': nodedown ]

overcloud-controller-1.localdomain

237069:Jul 12 13:46:42 overcloud-controller-1.localdomain rabbitmq-cluster(rabbitmq)[25102]: ERROR: Failed to forget node rabbit@overcloud-controller-1 via rabbit@overcloud-controller-0.
237121:Jul 12 13:46:45 overcloud-controller-1.localdomain lrmd[2066]:   notice: rabbitmq_start_0:24156:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-1': nodedown ]
247718:Jul 12 14:26:59 overcloud-controller-1.localdomain rabbitmq-cluster(rabbitmq)[3232]: ERROR: Failed to forget node rabbit@overcloud-controller-1 via rabbit@overcloud-controller-0.
247778:Jul 12 14:27:05 overcloud-controller-1.localdomain lrmd[27846]:   notice: rabbitmq_start_0:1572:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-1': nodedown ]
247796:Jul 12 14:27:05 overcloud-controller-1.localdomain lrmd[27846]:   notice: rabbitmq_start_0:1572:stderr [ Error: {offline_node_no_offline_flag,"You are trying to remove a node from an offline node. That is dangerous, but can be done with the --offline flag. Please consult the manual for rabbitmqctl for more information."} ]
247798:Jul 12 14:27:05 overcloud-controller-1.localdomain lrmd[27846]:   notice: rabbitmq_start_0:1572:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-1': nodedown ]
247817:Jul 12 14:27:05 overcloud-controller-1.localdomain crmd[27849]:   notice: overcloud-controller-1-rabbitmq_start_0:231 [ init terminating in do_boot ()\r\nError: unable to connect to node 'rabbit@overcloud-controller-1': nodedown\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@overcloud-controller-1']\n\nrabbit@overcloud-controller-1:\n  * connected to epmd (port 4369) on overcloud-controller-1\n  * epmd reports: node 'rabbit' not running at all\n                  no other nodes on overcloud-controller-1\n  * suggestion: start the node\n\ncurrent node deta
247866:Jul 12 14:27:06 overcloud-controller-1.localdomain lrmd[27846]:   notice: rabbitmq_stop_0:4394:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-1': nodedown ]

overcloud-controller-2.localdomain

260877:Jul 12 13:46:51 overcloud-controller-2.localdomain rabbitmq-cluster(rabbitmq)[15735]: ERROR: Failed to forget node rabbit@overcloud-controller-2 via rabbit@overcloud-controller-0.
260883:Jul 12 13:46:51 overcloud-controller-2.localdomain rabbitmq-cluster(rabbitmq)[15788]: ERROR: Failed to forget node rabbit@overcloud-controller-2 via rabbit@overcloud-controller-1.
260930:Jul 12 13:46:54 overcloud-controller-2.localdomain lrmd[9409]:   notice: rabbitmq_start_0:15255:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-2': nodedown ]
272524:Jul 12 14:27:11 overcloud-controller-2.localdomain rabbitmq-cluster(rabbitmq)[32639]: ERROR: Failed to forget node rabbit@overcloud-controller-2 via rabbit@overcloud-controller-0.
272568:Jul 12 14:27:16 overcloud-controller-2.localdomain lrmd[25006]:   notice: rabbitmq_start_0:31614:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-2': nodedown ]
272588:Jul 12 14:27:16 overcloud-controller-2.localdomain lrmd[25006]:   notice: rabbitmq_start_0:31614:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-2': nodedown ]
272607:Jul 12 14:27:16 overcloud-controller-2.localdomain crmd[25009]:   notice: overcloud-controller-2-rabbitmq_start_0:227 [ Error: process_not_running\nError: unable to connect to node 'rabbit@overcloud-controller-2': nodedown\n\nDIAGNOSTICS\n===========\n\nattempted to contact: ['rabbit@overcloud-controller-2']\n\nrabbit@overcloud-controller-2:\n  * connected to epmd (port 4369) on overcloud-controller-2\n  * epmd reports: node 'rabbit' not running at all\n                  no other nodes on overcloud-controller-2\n  * suggestion: start the node\n\ncurrent node details:\n
272619:Jul 12 14:27:17 overcloud-controller-2.localdomain lrmd[25006]:   notice: rabbitmq_stop_0:1481:stderr [ Error: unable to connect to node 'rabbit@overcloud-controller-2': nodedown ]

Comment 45 Peter Lemenkov 2016-07-13 12:55:33 UTC
(In reply to marios from comment #44)

Marios, could you please post the RabbitMQ logs themselves? This would be very helpful.

Comment 47 Marios Andreou 2016-07-13 16:06:48 UTC
(In reply to Peter Lemenkov from comment #45)
> (In reply to marios from comment #44)
> 
> Marios, could you please post the RabbitMQ logs themselves? This would be very
> helpful.

Hey Peter, see the private message with details for the sos reports.

thanks

Comment 48 Marios Andreou 2016-07-19 17:05:04 UTC
update: o/ Peter/bandini... I tested this today with resource-agents-3.9.5-77.el7.x86_64 (I had previously been using resource-agents-3.9.5-76.el7.x86_64.rpm). The heat stack does get to upgrade complete, but the cluster is in a bad state; I had to resort to "pcs cluster stop --all" and then start it again. After this I also ran the compute upgrade and converge to completion without issues. I think we are close, but this step in particular is still a problem for me. I will cycle back tomorrow if there are any updates.

thanks, marios


copy/pasta notes for clarity:
-----------------------------

       
*=* 15:20:37 *=*=*= " confirm repos on overcloud nodes:" 

[stack@instack ~]$ for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo ls -l /etc/yum.repos.d/; echo '';"; done
overcloud-compute-0.localdomain

total 16
-rw-r--r--. 1 root root  358 Mar  3 16:36 redhat.repo
-rw-r--r--. 1 root root 2127 Jul 18 13:43 rhos-release-9-director.repo
-rw-r--r--. 1 root root 2307 Jul 18 13:43 rhos-release-9.repo
-rw-r--r--. 1 root root 1237 Jul 18 13:43 rhos-release-rhel-7.2.repo

overcloud-controller-0.localdomain

total 16
-rw-r--r--. 1 root root  358 Mar  3 16:36 redhat.repo
-rw-r--r--. 1 root root 2127 Jul 18 13:43 rhos-release-9-director.repo
-rw-r--r--. 1 root root 2307 Jul 18 13:43 rhos-release-9.repo
-rw-r--r--. 1 root root 1237 Jul 18 13:43 rhos-release-rhel-7.2.repo

overcloud-controller-1.localdomain

total 16
-rw-r--r--. 1 root root  358 Mar  3 16:36 redhat.repo
-rw-r--r--. 1 root root 2127 Jul 18 13:44 rhos-release-9-director.repo
-rw-r--r--. 1 root root 2307 Jul 18 13:44 rhos-release-9.repo
-rw-r--r--. 1 root root 1237 Jul 18 13:44 rhos-release-rhel-7.2.repo

overcloud-controller-2.localdomain

total 16
-rw-r--r--. 1 root root  358 Mar  3 16:36 redhat.repo
-rw-r--r--. 1 root root 2127 Jul 18 13:43 rhos-release-9-director.repo
-rw-r--r--. 1 root root 2307 Jul 18 13:44 rhos-release-9.repo
-rw-r--r--. 1 root root 1237 Jul 18 13:43 rhos-release-rhel-7.2.repo

[stack@instack ~]$ 

*=* 15:23:36 *=*=*= "Applied the os-collect-config fixup for bug 1350489:"

    for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo yum -y update os-collect-config;  echo '';"; done    

*=* 15:27:39 *=*=*= " " 
[stack@instack ~]$     for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo rpm -qa | grep os-collect-config;  echo '';"; done
overcloud-compute-0.localdomain
os-collect-config-0.1.37-5.el7ost.noarch
overcloud-controller-0.localdomain
os-collect-config-0.1.37-5.el7ost.noarch
overcloud-controller-1.localdomain
os-collect-config-0.1.37-5.el7ost.noarch
overcloud-controller-2.localdomain
os-collect-config-0.1.37-5.el7ost.noarch

[stack@instack ~]$ 
    
    Try http://download.eng.bos.redhat.com/brewroot/packages/resource-agents/3.9.5/77.el7/x86_64/resource-agents-3.9.5-77.el7.x86_64.rpm - as of today this is the newest (05 July build) under http://download.eng.bos.redhat.com/brewroot/packages/resource-agents/3.9.5/

I had previously been using this one in the for-loop below: http://download.eng.bos.redhat.com/brewroot/packages/resource-agents/3.9.5/76.el7/x86_64/resource-agents-3.9.5-76.el7.x86_64.rpm

*=* 15:36:53 *=*=*= "Install the latest resource-agents for this bug:https://bugzilla.redhat.com/show_bug.cgi?id=1343905"

    for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo yum -y install http://download.eng.bos.redhat.com/brewroot/packages/resource-agents/3.9.5/77.el7/x86_64/resource-agents-3.9.5-77.el7.x86_64.rpm; echo '';"; done
    
*=* 15:47:06 *=*=*= "confirm resource agents installed:" 
[stack@instack ~]$     for i in $(nova list|grep ctlplane|awk -F' ' '{ print $12 }'|awk -F'=' '{ print $2 }'); do ssh heat-admin@$i "hostname; echo ''; sudo rpm -qa | grep resource-agents;  echo '';"; done
overcloud-compute-0.localdomain resource-agents-3.9.5-77.el7.x86_64
overcloud-controller-0.localdomain resource-agents-3.9.5-77.el7.x86_64
overcloud-controller-1.localdomain resource-agents-3.9.5-77.el7.x86_64
overcloud-controller-2.localdomain resource-agents-3.9.5-77.el7.x86_64

*=* 15:48:23 *=*=*= "Run the controller upgrade: " 

    openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml  --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org' -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker.yaml 


Jul 19 12:51:54 overcloud-controller-0.localdomain heat-engine[1425]: 2016-07-19 12:51:54.479 1539 ERROR sqlalchemy.pool.QueuePool
Jul 19 12:51:54 overcloud-controller-0.localdomain systemd[1]: openstack-heat-engine.service: main process exited, code=exited, status=1/FAILURE
Jul 19 12:51:54 overcloud-controller-0.localdomain heat-engine[1425]: 2016-07-19 12:51:54.799 1542 WARNING oslo_messaging.server [-] start/stop/wait must be called in the same thread

Jul 19 12:52:05 overcloud-controller-0.localdomain crmd[2185]:   notice: Initiating action 306: stop openstack-cinder-scheduler_stop_0 on overcloud-controller-2
Jul 19 12:52:06 overcloud-controller-0.localdomain haproxy[7586]: Server swift_proxy_server/overcloud-controller-2 is DOWN, reason: Layer4 connection problem, info: "Connection refused", check duration: 0ms. 1 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 remaining in queue.
*=* 16:11:11 *=*=*= " " 
Jul 19 13:00:16 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Object Storage (swift) - Object Server.
Jul 19 13:00:16 overcloud-controller-0.localdomain systemd[1]: Stopped OpenStack Object Storage (swift) - Proxy Server.
Jul 19 13:00:14 overcloud-controller-0.localdomain cib[2180]:   notice: Disconnecting from Corosync
Jul 19 13:00:14 overcloud-controller-0.localdomain cib[2180]:   notice: Disconnecting from Corosync
Jul 19 13:00:14 overcloud-controller-0.localdomain pacemakerd[2179]:   notice: Shutdown complete
Jul 19 13:00:14 overcloud-controller-0.localdomain systemd[1]: Stopped Pacemaker High Availability Cluster Manager.
Jul 19 13:00:15 overcloud-controller-0.localdomain systemd[1]: Stopping Corosync Cluster Engine...
Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]:  [MAIN  ] Node was shut down by a signal
Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[11312]: Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]:  [SERV  ] Unloading all Corosync service engines.
Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]:  [QB    ] withdrawing server sockets
Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]:  [SERV  ] Service engine unloaded: corosync vote quorum service v1.0
Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]:  [QB    ] withdrawing server sockets
Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]:  [SERV  ] Service engine unloaded: corosync configuration map access
Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]:  [QB    ] withdrawing server sockets
Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]:  [SERV  ] Service engine unloaded: corosync configuration service
Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]:  [QB    ] withdrawing server sockets
Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]:  [SERV  ] Service engine unloaded: corosync cluster closed process group service v1.01
Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]:  [QB    ] withdrawing server sockets
Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]:  [SERV  ] Service engine unloaded: corosync cluster quorum service v0.1
Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]:  [SERV  ] Service engine unloaded: corosync profile loading service
Jul 19 13:00:15 overcloud-controller-0.localdomain corosync[2164]:  [MAIN  ] Corosync Cluster Engine exiting normally
Jul 19 13:15:41 overcloud-controller-0.localdomain yum[12207]: Updated: python-zaqarclient-1.0.0-1.el7ost.noarch
Jul 19 13:39:10 overcloud-controller-0.localdomain systemd[1]: Starting Cluster Controlled openstack-aodh-listener...
2016-07-19 13:42:24 [AllNodesExtraConfig]: UPDATE_COMPLETE state changed
Stack overcloud UPDATE_COMPLETE
*=* 16:47:00 *=*=*= "update complete but lots of stuff still down  " 
 Clone Set: rabbitmq-clone [rabbitmq]
27-     Started: [ overcloud-controller-0 ]
28:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
29- Clone Set: memcached-clone [memcached]
30-     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
31- Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler]
32-     Started: [ overcloud-controller-0 ]
33:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
34- Clone Set: neutron-l3-agent-clone [neutron-l3-agent]
35:     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
36- Clone Set: openstack-heat-engine-clone [openstack-heat-engine]
37-     Started: [ overcloud-controller-0 ]
38:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
39- Clone Set: openstack-ceilometer-api-clone [openstack-ceilometer-api]
40-     Started: [ overcloud-controller-0 ]
41:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
42- Clone Set: neutron-metadata-agent-clone [neutron-metadata-agent]
43:     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
44- Clone Set: neutron-ovs-cleanup-clone [neutron-ovs-cleanup]
45-     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
48- Clone Set: openstack-heat-api-clone [openstack-heat-api]
49-     Started: [ overcloud-controller-0 ]
50:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
51- Clone Set: openstack-cinder-scheduler-clone [openstack-cinder-scheduler]
52-     Started: [ overcloud-controller-0 ]
53:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
54- Clone Set: openstack-nova-api-clone [openstack-nova-api]
55-     Started: [ overcloud-controller-0 ]
56:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
57- Clone Set: openstack-heat-api-cloudwatch-clone [openstack-heat-api-cloudwatch]
58-     Started: [ overcloud-controller-0 ]
59:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
60- Clone Set: openstack-ceilometer-collector-clone [openstack-ceilometer-collector]
61-     Started: [ overcloud-controller-0 ]
62:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
63- Clone Set: openstack-nova-consoleauth-clone [openstack-nova-consoleauth]
64-     Started: [ overcloud-controller-0 ]
65:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
66- Clone Set: openstack-glance-registry-clone [openstack-glance-registry]
67-     Started: [ overcloud-controller-0 ]
68:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
69- Clone Set: openstack-ceilometer-notification-clone [openstack-ceilometer-notification]
70-     Started: [ overcloud-controller-0 ]
71:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
72- Clone Set: openstack-cinder-api-clone [openstack-cinder-api]
73-     Started: [ overcloud-controller-0 ]
74:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
75- Clone Set: neutron-dhcp-agent-clone [neutron-dhcp-agent]
76:     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
77- Clone Set: openstack-glance-api-clone [openstack-glance-api]
78-     Started: [ overcloud-controller-0 ]
79:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
80- Clone Set: neutron-openvswitch-agent-clone [neutron-openvswitch-agent]
81:     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
82- Clone Set: openstack-nova-novncproxy-clone [openstack-nova-novncproxy]
83-     Started: [ overcloud-controller-0 ]
84:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
85- Clone Set: delay-clone [delay]
86-     Started: [ overcloud-controller-0 ]
87:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
88- Clone Set: neutron-server-clone [neutron-server]
89:     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
90- Clone Set: httpd-clone [httpd]
91-     Started: [ overcloud-controller-0 ]
92:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
93- Clone Set: openstack-ceilometer-central-clone [openstack-ceilometer-central]
94-     Started: [ overcloud-controller-0 ]
95:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
96- Clone Set: openstack-heat-api-cfn-clone [openstack-heat-api-cfn]
97-     Started: [ overcloud-controller-0 ]
98:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
99- openstack-cinder-volume     (systemd:openstack-cinder-volume):      Started overcloud-controller-0
100- Clone Set: openstack-nova-conductor-clone [openstack-nova-conductor]
101-     Started: [ overcloud-controller-0 ]
102:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
103- Clone Set: openstack-aodh-listener-clone [openstack-aodh-listener]
104-     Started: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
--
109- Clone Set: openstack-core-clone [openstack-core]
110-     Started: [ overcloud-controller-0 ]
111:     Stopped: [ overcloud-controller-1 overcloud-controller-2 ]
112-
113-Failed Actions:
s

*=* 17:06:15 *=*=*= " pcs resource cleanup" svces still down 
*=* 17:35:47 *=*=*= "pcs cluster stop --all" 
*=* 17:56:34 *=*=*= "pcs cluster start --all" 
*=* 18:04:21 *=*=*= " no stopped services... try compute upgrade then converge..." 
*=* 18:04:55 *=*=*= "[stack@instack ~]$ upgrade-non-controller.sh --upgrade overcloud-compute-0 " 
*=* 18:22:04 *=*=*= "pingtest "  after controller and compute upgrade
| server1               |                                                                                     | OS::Nova::Server             | CREATE_FAILED   | 2016-07-19T15:22:28 | tenant-stack |
*=* 18:44:37 *=*=*= "upgrade converge" 
    openstack overcloud deploy --templates /usr/share/openstack-tripleo-heat-templates -e /usr/share/openstack-tripleo-heat-templates/overcloud-resource-registry-puppet.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/puppet-pacemaker.yaml  --control-scale 3 --compute-scale 1 --libvirt-type qemu -e /usr/share/openstack-tripleo-heat-templates/environments/network-isolation.yaml -e /usr/share/openstack-tripleo-heat-templates/environments/net-single-nic-with-vlans.yaml -e network_env.yaml --ntp-server '0.fedora.pool.ntp.org' -e /usr/share/openstack-tripleo-heat-templates/environments/major-upgrade-pacemaker-converge.yaml 
*=* 18:55:42 *=*=*= " " 
 Clone Set: openstack-nova-scheduler-clone [openstack-nova-scheduler] (unmanaged)
     openstack-nova-scheduler   (systemd:openstack-nova-scheduler):     Started overcloud-controller-0 (unmanaged)
Every 2.0s: pcs status | grep -ni stop -C 2                                       Tue Jul 19 15:58:02 2016
161-     openstack-core (ocf::heartbeat:Dummy): Started overcloud-controller-1 (unmanaged)
162- Clone Set: openstack-gnocchi-metricd-clone [openstack-gnocchi-metricd] (unmanaged)
163:     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
164- Clone Set: openstack-sahara-api-clone [openstack-sahara-api] (unmanaged)
165:     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
166- Clone Set: openstack-sahara-engine-clone [openstack-sahara-engine] (unmanaged)
167:     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
168- Clone Set: openstack-gnocchi-statsd-clone [openstack-gnocchi-statsd] (unmanaged)
169:     Stopped: [ overcloud-controller-0 overcloud-controller-1 overcloud-controller-2 ]
170-
171-PCSD Status:
*=* 19:02:56 *=*=*= " managed" 
*=* 19:20:38 *=*=*= " "update complete! \o/ no services down Every 2.0s: pcs status | grep -ni stop -C 2                                       Tue Jul 19 16:21:00 2016

 2016-07-19 16:19:41 [BlockStorageNodesPostDeployment]: CREATE_COMPLETE state changed
Stack overcloud UPDATE_COMPLETE

*=* 19:38:15 *=*=*= " " tripleo.sh -- Overcloud pingtest SUCCEEDED (after converge)

Comment 49 John Eckersberg 2016-07-19 17:09:45 UTC
Thanks for the update Marios.  FYI I'm trying to reproduce this as well but for now I'm having some other unrelated issues to work through first.

Comment 50 Marios Andreou 2016-07-19 17:13:49 UTC
(In reply to John Eckersberg from comment #49)
> Thanks for the update Marios.  FYI I'm trying to reproduce this as well but
> for now I'm having some other unrelated issues to work through first.

ack; fwiw, I went through the keystone migration as described at https://bugzilla.redhat.com/show_bug.cgi?id=1348831#c16 (and then proceeded as in comment #48 above)

Comment 51 John Eckersberg 2016-07-21 20:18:09 UTC
OK, I've managed to reproduce this.  The end result after the upgrade:

  Clone Set: rabbitmq-clone [rabbitmq]
      Started: [ overcloud-controller-0 ]
      Stopped: [ overcloud-controller-1 overcloud-controller-2 ]

And even though 0 is "started", the RabbitMQ app is not actually running:

  # rabbitmqctl status
  Status of node 'rabbit@overcloud-controller-0' ...
  [{pid,29875},
   {running_applications,[{ranch,"Socket acceptor pool for TCP protocols.",
                                 "1.2.1"},
                          {xmerl,"XML parser","1.3.10"},
                          {sasl,"SASL  CXC 138 11","2.7"},
                          {stdlib,"ERTS  CXC 138 10","2.8"},
                          {kernel,"ERTS  CXC 138 10","4.2"}]},
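
(A quick, hedged way to tell "Erlang node up" apart from "rabbit application running" is to look for the rabbit entry in running_applications:)

  # a healthy broker lists {rabbit,"RabbitMQ","<version>"} among running_applications
  rabbitmqctl status | grep -c '{rabbit,'
  rabbitmqctl cluster_status    # running_nodes should match the configured disc nodes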

So let's see how we got there.

0 sees 1 go down:

  =INFO REPORT==== 21-Jul-2016::17:06:24 ===
  rabbit on node 'rabbit@overcloud-controller-1' down

2 also sees 1 go down:

  =INFO REPORT==== 21-Jul-2016::17:06:24 ===
  rabbit on node 'rabbit@overcloud-controller-1' down

And 1 willingly goes down:

  =INFO REPORT==== 21-Jul-2016::17:06:24 ===
  Stopping RabbitMQ

So far so good.  Four seconds later, 0 sees 2 go down:

  =INFO REPORT==== 21-Jul-2016::17:06:28 ===
  rabbit on node 'rabbit@overcloud-controller-2' down

And 2 went down willingly as well:

  =INFO REPORT==== 21-Jul-2016::17:06:27 ===
  Stopping RabbitMQ

Finally, 0 shuts itself down willingly:

  =INFO REPORT==== 21-Jul-2016::17:06:29 ===
  Stopping RabbitMQ

Cool.  So at this point everything shut down cleanly.

Packages are updated...

node 1:
  Jul 21 17:09:30 Updated: rabbitmq-server-3.6.3-4.el7ost.noarch

node 0:
  Jul 21 17:09:33 Updated: rabbitmq-server-3.6.3-4.el7ost.noarch

node 2:
  Jul 21 17:09:33 Updated: rabbitmq-server-3.6.3-4.el7ost.noarch

Now things start coming back.  Node 0 is first to start:

  =INFO REPORT==== 21-Jul-2016::17:14:52 ===
  Starting RabbitMQ 3.6.3 on Erlang 18.3.4.1

Worth noting that the mnesia database is already wiped:

  =INFO REPORT==== 21-Jul-2016::17:14:53 ===
  Database directory at /var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-0 is empty. Initialising from scratch...

Started all the way:

  =INFO REPORT==== 21-Jul-2016::17:14:53 ===
  Server startup complete; 0 plugins started.

Next node to start is 1.  First thing of note is that the mnesia
database was *not* wiped on the filesystem as it was on node 0:

  =INFO REPORT==== 21-Jul-2016::17:14:59 ===
  upgrades: Mnesia dir backed up to "/var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-1-upgrade-backup"

  =INFO REPORT==== 21-Jul-2016::17:15:00 ===
  Starting RabbitMQ 3.6.3 on Erlang 18.3.4.1

As an aside, note that the config seems to be invalid; I will file a new bug
for this:

  =WARNING REPORT==== 21-Jul-2016::17:15:01 ===
  Transport option binary unknown or invalid.

  =WARNING REPORT==== 21-Jul-2016::17:15:01 ===
  Transport option {packet,raw} unknown or invalid.

  =WARNING REPORT==== 21-Jul-2016::17:15:01 ===
  Transport option {reuseaddr,true} unknown or invalid.

Node 1 clusters correctly with 0 and is fully started:

  =INFO REPORT==== 21-Jul-2016::17:15:01 ===
  rabbit on node 'rabbit@overcloud-controller-0' up

  =INFO REPORT==== 21-Jul-2016::17:15:01 ===
  Server startup complete; 0 plugins started.

... and then 3 seconds later the resource agent stops the application
and tries to re-cluster it but it fails since it's already clustered:

  =INFO REPORT==== 21-Jul-2016::17:15:04 ===
  Stopping RabbitMQ

  =INFO REPORT==== 21-Jul-2016::17:15:04 ===
  stopped TCP Listener on 172.16.2.9:5672

  =INFO REPORT==== 21-Jul-2016::17:15:04 ===
  Stopped RabbitMQ application

  =INFO REPORT==== 21-Jul-2016::17:15:05 ===
  Already member of cluster: ['rabbit@overcloud-controller-1',
                              'rabbit@overcloud-controller-0']

Doh.
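
(In rabbitmqctl terms the failing sequence is roughly the following - a hedged illustration of the "already clustered" refusal described above, not the literal agent code:)

  rabbitmqctl stop_app
  rabbitmqctl join_cluster rabbit@overcloud-controller-0   # refused: node 1 is already a member
  # the agent treats the refusal as a start failure, so the app is left stopped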

Meanwhile, node 0 sees node 1 come up then drop off again.  At this
point node 0 thinks it's in the minority so it pauses itself:

  =INFO REPORT==== 21-Jul-2016::17:15:07 ===
  node 'rabbit@overcloud-controller-1' down: connection_closed

  =WARNING REPORT==== 21-Jul-2016::17:15:07 ===
  Cluster minority/secondary status detected - awaiting recovery

  =INFO REPORT==== 21-Jul-2016::17:15:07 ===
  Stopping RabbitMQ

  ...

  =INFO REPORT==== 21-Jul-2016::17:15:07 ===
  Stopped RabbitMQ application

At the same time (:11), nodes 0 and 1 both start again, first 0:

  =INFO REPORT==== 21-Jul-2016::17:15:11 ===
  Starting RabbitMQ 3.6.3 on Erlang 18.3.4.1

  ...

  =INFO REPORT==== 21-Jul-2016::17:15:11 ===
  rabbit on node 'rabbit@overcloud-controller-1' up

  ...

  =INFO REPORT==== 21-Jul-2016::17:15:11 ===
  Server startup complete; 0 plugins started.

And 1:

  =INFO REPORT==== 21-Jul-2016::17:15:11 ===
  Starting RabbitMQ 3.6.3 on Erlang 18.3.4.1

  ...

  =INFO REPORT==== 21-Jul-2016::17:15:12 ===
  rabbit on node 'rabbit@overcloud-controller-0' up

  =INFO REPORT==== 21-Jul-2016::17:15:12 ===
  Server startup complete; 0 plugins started.


But we end up with the same problem again.  Node 1 tries to recluster
but fails:

  =INFO REPORT==== 21-Jul-2016::17:15:15 ===
  Stopping RabbitMQ

  =INFO REPORT==== 21-Jul-2016::17:15:15 ===
  stopped TCP Listener on 172.16.2.9:5672

  =INFO REPORT==== 21-Jul-2016::17:15:15 ===
  Stopped RabbitMQ application

  =INFO REPORT==== 21-Jul-2016::17:15:16 ===
  Already member of cluster: ['rabbit@overcloud-controller-1',
                              'rabbit@overcloud-controller-0']

After that, node 1 is marked stopped and remains in this state.

Node 0 stops due to the minority detection again:

  =INFO REPORT==== 21-Jul-2016::17:15:18 ===
  node 'rabbit@overcloud-controller-1' down: connection_closed

  =WARNING REPORT==== 21-Jul-2016::17:15:18 ===
  Cluster minority/secondary status detected - awaiting recovery

  =INFO REPORT==== 21-Jul-2016::17:15:18 ===
  Stopping RabbitMQ

  ...

  =INFO REPORT==== 21-Jul-2016::17:15:18 ===
  Stopped RabbitMQ application

So now we're at our end state.  Node 1 failed to cluster twice and is
now stopped.  Meanwhile node 0 is paused because it thinks it's the
minority, waiting for node 1 to come back (but it never does).
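
(For reference, that "minority ... awaiting recovery" pause is RabbitMQ's partition handling kicking in; a hedged way to confirm the configured mode on the paused node:)

  # expected to show pause_minority here (an assumption based on the log message above)
  rabbitmqctl eval 'application:get_env(rabbit, cluster_partition_handling).'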


After a bit, node 2 comes to life but fails to start since it cannot
be the first node to start (it did not stop last):

  =INFO REPORT==== 21-Jul-2016::17:15:23 ===
  upgrades: Mnesia dir backed up to "/var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-2-upgrade-backup"

  =ERROR REPORT==== 21-Jul-2016::17:15:23 ===
  Cluster upgrade needed but other disc nodes shut down after this one.
  Please first start the last disc node to shut down.

  Note: if several disc nodes were shut down simultaneously they may all
  show this message. In which case, remove the lock file on one of them and
  start that node. The lock file on this node is:

   /var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-2/nodes_running_at_shutdown 

Looks like the resource agent kills it and wipes the mnesia directory and tries again:

  =INFO REPORT==== 21-Jul-2016::17:15:29 ===
  Starting RabbitMQ 3.6.3 on Erlang 18.3.4.1

  ...

  =INFO REPORT==== 21-Jul-2016::17:15:30 ===
  Database directory at /var/lib/rabbitmq/mnesia/rabbit@overcloud-controller-2 is empty. Initialising from scratch...

  ...

  =INFO REPORT==== 21-Jul-2016::17:15:30 ===
  Server startup complete; 0 plugins started.

... and then subsequently kills it.  Why, I'm not sure:

  =INFO REPORT==== 21-Jul-2016::17:15:34 ===
  Stopping RabbitMQ

  ...

  =INFO REPORT==== 21-Jul-2016::17:15:34 ===
  Stopped RabbitMQ application

That brings us to the final state when the upgrade is "complete".

Comment 52 Jiri Stransky 2016-07-22 13:50:26 UTC
What Eck wrote above matches my observations too. We might have two different but related issues here.

The issue with *minor updates* was that nodes weren't getting forgotten correctly, and I still believe that one is fixed by the old patch, so we still need that patch. (It would be a valid alternative to have two different BZs here to verify.)

The 2nd issue, which happens only on *major upgrades*, is that the 1st reclustering attempt (without cleaning/forgetting) seems to bring down the rabbitmq app on the bootstrap node for some reason, and then when the 2nd clustering attempt is made (the one where nodes are forgotten and data is cleared) there's no running RabbitMQ app to talk to. Interestingly, this seems to happen only during the upgrade. When we tested on OSP 9 only last week, a full cluster stop/start (like the one that happens on a major upgrade) went fine.


I turned my manual recovery approach into a patch that seems to fix the 2nd issue for me:

https://github.com/jistr/resource-agents/commit/9fc04d5c97f62e2425ca6c7986bf274e060a3f6c
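
The core of it, roughly (a simplified sketch, not the literal diff; the loop and variable names here are just for illustration), is to make sure that the node we ask to forget the old incarnations actually has the rabbit app running, so there is something to talk to:

  # simplified sketch of the workaround, not the actual patch
  for running_node in $join_list; do
          # the bootstrap node's app may have stopped itself after the failed
          # first clustering attempt, so start it before forgetting anything
          $RMQ_CTL -n "$running_node" start_app || true
          $RMQ_CTL -n "$running_node" forget_cluster_node "$node_to_forget"
  done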


Two things worth noting:

1) I only tested this patch to re-form the cluster after the failure already happened. I still need to verify it in an actual upgrade run, to see whether it is able to *prevent* the failure.

2) I'm not familiar enough with RabbitMQ to fully judge whether the approach I took with the patch is a decent solution or too big of a hack. It's possible that a cleaner solution to our troubles exists.

Comment 53 Jiri Stransky 2016-07-22 15:16:54 UTC
(In reply to Jiri Stransky from comment #52)

> 1) I only tested this patch to re-form the cluster after the failure already
> happened. I still need to verify it in an actual upgrade run, to see
> whether it is able to *prevent* the failure.

Yes, the patch prevented the RabbitMQ startup failure during the major upgrade for me. In case this isn't an acceptable fix, perhaps we could at least use it as a workaround in the meantime, to allow us to validate later stages of the upgrade with a clean Pacemaker cluster.

The patch needs to be applied in addition to the previous patch that fixed minor updates, which is here:

https://github.com/ClusterLabs/resource-agents/commit/279bae7ec9a571a4d52b0d876850e27772eb0933

^ this one is already in resource-agents-3.9.5-76.el7
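
For anyone verifying on the overcloud nodes, the installed version can be checked with e.g.:

  rpm -q resource-agents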

Comment 55 Marios Andreou 2016-07-22 17:02:08 UTC
Hi, FYI, I just tested jistr's latest fix and it also looks good on my env. I got through the controller upgrade with the Heat update complete and no services down (especially not the rabbit!).

For clarity: I first installed the latest resource-agents on the overcloud nodes and then applied the fix that jistr points to in comment #53.

Comment 56 John Eckersberg 2016-07-22 18:21:58 UTC
The patch looks reasonable to me, especially since we've confirmed that it fixes the problem in testing.  I'm not very confident that it will fix all outstanding clustering issues (I think there's probably some sort of race condition still), but it's an improvement, and given the time constraints I think we should go ahead with it.

Comment 57 Andrew Beekhof 2016-07-25 04:42:58 UTC
(In reply to John Eckersberg from comment #51)
> OK, I've managed to reproduce this.  The end result after the upgrade:
> 
>   Clone Set: rabbitmq-clone [rabbitmq]
>       Started: [ overcloud-controller-0 ]
>       Stopped: [ overcloud-controller-1 overcloud-controller-2 ]

Do you have sos reports for the controllers?

I'd like to poke around in them to see which cluster actions were involved.
I.e. did pacemaker tell rabbit to stop at '=INFO REPORT==== 21-Jul-2016::17:15:04 ===' or was that the agent? It looks like it didn't, but I'd like to be 100% sure.

Comment 58 Jiri Stransky 2016-07-25 13:29:51 UTC
Upstream pull request with the workaround-fix-patch discussed above:
https://github.com/ClusterLabs/resource-agents/pull/838


Just thought a bit more about the strange nature of this bug:

OSP 8 stop -> OSP 8 start  --  worked fine
OSP 9 stop -> OSP 9 start  --  worked fine
OSP 8 stop -> OSP 9 start  --  troubles (I mean without the patch)

Perhaps it's some bug/peculiarity in RabbitMQ itself, happening only during a RabbitMQ upgrade. Still, it might be OK to work around it in resource-agents unless we find a better solution.

Comment 60 Andrew Beekhof 2016-07-26 03:26:56 UTC
Just confirming that the cluster appears to be sending the right instructions to the resource agent:

# zgrep -e crmd:.*match_graph_event.*rabbit -e crmd:.*Initiating.*rabbit ./sosreport-overcloud-controller-2.localdomain-20160725135019/var/log/cluster/corosync.log-20160722.gz 
Jul 21 17:13:10 [27957] overcloud-controller-2.localdomain       crmd:   notice: te_rsc_command:	Initiating action 104: monitor rabbitmq:0_monitor_0 on overcloud-controller-2 (local)
Jul 21 17:13:10 [27957] overcloud-controller-2.localdomain       crmd:   notice: te_rsc_command:	Initiating action 16: monitor rabbitmq:0_monitor_0 on overcloud-controller-0
Jul 21 17:13:10 [27957] overcloud-controller-2.localdomain       crmd:   notice: te_rsc_command:	Initiating action 60: monitor rabbitmq:0_monitor_0 on overcloud-controller-1
Jul 21 17:13:11 [27957] overcloud-controller-2.localdomain       crmd:     info: match_graph_event:	Action rabbitmq_monitor_0 (16) confirmed on overcloud-controller-0 (rc=7)
Jul 21 17:13:11 [27957] overcloud-controller-2.localdomain       crmd:     info: match_graph_event:	Action rabbitmq_monitor_0 (104) confirmed on overcloud-controller-2 (rc=7)
Jul 21 17:13:12 [27957] overcloud-controller-2.localdomain       crmd:     info: match_graph_event:	Action rabbitmq_monitor_0 (60) confirmed on overcloud-controller-1 (rc=7)
Jul 21 17:14:47 [27957] overcloud-controller-2.localdomain       crmd:   notice: te_rsc_command:	Initiating action 108: start rabbitmq_start_0 on overcloud-controller-0
Jul 21 17:14:56 [27957] overcloud-controller-2.localdomain       crmd:     info: match_graph_event:	Action rabbitmq_start_0 (108) confirmed on overcloud-controller-0 (rc=0)
Jul 21 17:14:56 [27957] overcloud-controller-2.localdomain       crmd:   notice: te_rsc_command:	Initiating action 110: monitor rabbitmq_monitor_10000 on overcloud-controller-0
Jul 21 17:14:56 [27957] overcloud-controller-2.localdomain       crmd:   notice: te_rsc_command:	Initiating action 111: start rabbitmq:1_start_0 on overcloud-controller-1
Jul 21 17:14:58 [27957] overcloud-controller-2.localdomain       crmd:     info: match_graph_event:	Action rabbitmq_monitor_10000 (110) confirmed on overcloud-controller-0 (rc=0)
Jul 21 17:15:17 [27957] overcloud-controller-2.localdomain       crmd:     info: match_graph_event:	Action rabbitmq_start_0 (111) confirmed on overcloud-controller-1 (rc=1)
Jul 21 17:15:17 [27957] overcloud-controller-2.localdomain       crmd:     info: match_graph_event:	Action rabbitmq_start_0 (111) confirmed on overcloud-controller-1 (rc=1)
Jul 21 17:15:17 [27957] overcloud-controller-2.localdomain       crmd:   notice: te_rsc_command:	Initiating action 14: stop rabbitmq_stop_0 on overcloud-controller-1
Jul 21 17:15:20 [27957] overcloud-controller-2.localdomain       crmd:     info: match_graph_event:	Action rabbitmq_stop_0 (14) confirmed on overcloud-controller-1 (rc=0)
Jul 21 17:15:20 [27957] overcloud-controller-2.localdomain       crmd:   notice: te_rsc_command:	Initiating action 111: start rabbitmq:1_start_0 on overcloud-controller-2 (local)
Jul 21 17:15:36 [27957] overcloud-controller-2.localdomain       crmd:     info: match_graph_event:	Action rabbitmq_start_0 (111) confirmed on overcloud-controller-2 (rc=1)
Jul 21 17:15:36 [27957] overcloud-controller-2.localdomain       crmd:     info: match_graph_event:	Action rabbitmq_start_0 (111) confirmed on overcloud-controller-2 (rc=1)
Jul 21 17:15:37 [27957] overcloud-controller-2.localdomain       crmd:   notice: te_rsc_command:	Initiating action 6: stop rabbitmq_stop_0 on overcloud-controller-2 (local)
Jul 21 17:15:39 [27957] overcloud-controller-2.localdomain       crmd:     info: match_graph_event:	Action rabbitmq_stop_0 (6) confirmed on overcloud-controller-2 (rc=0)

Comment 61 Andrew Beekhof 2016-07-26 04:36:45 UTC
Three things...

1. Having understood the patch and its context, I would suggest that if this is the correct logic, it should happen in `rmq_start` prior to `rmq_forget_cluster_node_remotely` being called.

That the app may be started as a side-effect of `rmq_forget_cluster_node_remotely` being called is non-obvious, and could result in calls to `rmq_notify` changing the state of the service, which is not looked on favorably.

2. The RMQ_CRM_ATTR_COOKIE logic seems faulty to me.

Presumably, the purpose is to ensure that the last node to stop is the first one to start.  Is this why the attribute is not deleted in `rmq_stop` ?

However I cannot figure out how this is supposed to work, as:

a. we perform an initial monitor of all resources before starting them
b. the value of RMQ_CRM_ATTR_COOKIE is deleted during a monitor if the resource is not running
c. the start operation doesn't seem to have a check that prevents the "wrong" node from starting

Instead, if the cluster starts the "wrong" node first, that node will just wipe the mnesia db and start again.
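
To spell out what I mean in (b) and (c) (attribute name and crm_attribute options approximate, from memory):

  # monitor, when rabbit is found not running on this node -- wipes the cookie:
  crm_attribute -N "$NODENAME" -l reboot --name "$RMQ_CRM_ATTR_COOKIE" -D

  # but start never does anything like:
  #   crm_attribute -N "$NODENAME" -l reboot --name "$RMQ_CRM_ATTR_COOKIE" -G \
  #     || exit $OCF_ERR_GENERIC   # "I wasn't the last node to stop, don't go first"
  # so nothing stops the "wrong" node from starting first and wiping mnesia.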

We should probably have a conversation outside of bugzilla about how this works or should work.

3. This particular case...

> ... and then 3 seconds later the resource agent stops the application
> and tries to re-cluster it but it fails since it's already clustered:

This seems to be the root cause IMHO.
Rabbit (now?) does more than the agent expects.

If we look at `rmq_join_existing`, it calls `rmq_init_and_wait` followed by

This is where I would be looking to patch the agent by finding a way to make the stop conditional.
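
Paraphrasing the relevant part of the agent from memory (not the exact code), the shape is roughly:

  rmq_join_existing() {
          local join_list="$1"

          # brings up mnesia and the app locally; with 3.6.x this can already
          # leave us re-joined to our old cluster
          rmq_init_and_wait || return $OCF_ERR_GENERIC

          # ...and then the app is stopped unconditionally before the join
          $RMQ_CTL stop_app

          for node in $join_list; do
                  if $RMQ_CTL join_cluster $node; then
                          $RMQ_CTL start_app
                          return $OCF_SUCCESS
                  fi
          done
          return $OCF_ERR_GENERIC
  }

The unconditional stop_app is the piece I would make conditional.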

Comment 62 John Eckersberg 2016-07-26 15:21:56 UTC
(In reply to Andrew Beekhof from comment #61)
> Three things...
> 
> 1. Having understood the patch and its context, I would suggest that if
> this is the correct logic, it should happen in `rmq_start` prior to
> `rmq_forget_cluster_node_remotely` being called.
> 
> That the app may be started as a side-effect of
> `rmq_forget_cluster_node_remotely` being called is non-obvious and could
> result in calls to `rmq_notify` changing the state of the service which is
> not looked on favorably.

Yeah, now that I look more closely at this patch, I see that given two
nodes A and B, when the resource agent runs on node B to join A, it
will cause a side-effect on node A by remotely starting the rabbitmq
app on node A.  This probably is not what we want to happen.  That
violates a clean separation of control between the nodes.

> 2. The RMQ_CRM_ATTR_COOKIE logic seems faulty to me.
> 
> Presumably, the purpose is to ensure that the last node to stop is the first
> one to start.  Is this why the attribute is not deleted in `rmq_stop` ?
> 
> However I cannot figure out how this is supposed to work, as:
> 
> a. we perform an initial monitor of all resources before starting them
> b. the value of RMQ_CRM_ATTR_COOKIE is deleted during a monitor if the
> resource is not running
> c. the start operation doesn't seem to have a check that prevents the
> "wrong" node from starting
> 
> Instead, if the cluster starts the "wrong" node first, that node will just
> wipe the mnesia db and start again.
> 
> We should probably have a conversation outside of bugzilla about how this
> works or should work.

I have no idea what the intention is but I will take your word as to
the actual implemented behavior :)

+1 that this really needs extra discussion.  I think for the immediate
need it's not a huge deal.

> 3. This particular case...
> 
> > ... and then 3 seconds later the resource agent stops the application
> > and tries to re-cluster it but it fails since it's already clustered:
> 
> This seems to be the root cause IMHO.
> Rabbit (now?) does more than the agent expects.
> 
> If we look at `rmq_join_existing`, it calls `rmq_init_and_wait` followed by
> `$RMQ_CTL stop_app` followed by `$RMQ_CTL join_cluster $node` in a loop.
> 
> This is where I would be looking to patch the agent by finding a way to make
> the stop conditional.

Yep, I'm going to try to patch the agent to check the cluster status
before doing the stop.  If it starts and the cluster is still
fine... just leave it alone!
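
Something along these lines (just a sketch of the idea, not the final patch; $seed_node stands in for one of the join_list nodes and the membership check is approximate):

  # after rmq_init_and_wait succeeds in rmq_join_existing:
  if $RMQ_CTL eval 'rabbit:is_running().' | grep -q true \
     && $RMQ_CTL cluster_status | grep -q "$seed_node"; then
          # RabbitMQ already rejoined the existing cluster on its own,
          # so skip the stop_app/join_cluster dance and report success
          return $OCF_SUCCESS
  fi

  # otherwise fall through to the existing unconditional stop_app + join
  $RMQ_CTL stop_app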

Comment 63 John Eckersberg 2016-07-26 17:52:43 UTC
Proposed patch - https://github.com/ClusterLabs/resource-agents/pull/839

Seems to fix this in my testing.

Comment 64 Andrew Beekhof 2016-07-26 22:24:17 UTC
(In reply to John Eckersberg from comment #62)

> +1 that this really needs extra discussion.  I think for the immediate
> need it's not a huge deal.

Agreed.  Not something we want to address now, but it would be good to do in the next cycle.

> 
> > 3. This particular case...
> > 
> > > ... and then 3 seconds later the resource agent stops the application
> > > and tries to re-cluster it but it fails since it's already clustered:
> > 
> > This seems to be the root cause IMHO.
> > Rabbit (now?) does more than the agent expects.
> > 
> > If we look at `rmq_join_existing`, it calls `rmq_init_and_wait` followed by
> > `$RMQ_CTL stop_app` followed by `$RMQ_CTL join_cluster $node` in a loop.
> > 
> > This is where I would be looking to patch the agent by finding a way to make
> > the stop conditional.
> 
> Yep, I'm going to try to patch the agent to check the cluster status
> before doing the stop.  If it starts and the cluster is still
> fine... just leave it alone!

Perfect

Comment 66 Andrew Beekhof 2016-07-27 04:19:57 UTC
New build:
   https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11436738

Comment 67 Jiri Stransky 2016-07-27 13:04:54 UTC
I used eck's fix today instead of the previous workaround and had a successful 8->9 major upgrade with it :)

Comment 68 Jiri Stransky 2016-08-02 14:03:24 UTC
*** Bug 1361750 has been marked as a duplicate of this bug. ***

Comment 70 Leonid Natapov 2016-08-24 13:14:15 UTC
Checked that the patch is included in resource-agents-3.9.5-80.el7.

Comment 74 errata-xmlrpc 2016-11-04 00:03:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-2174.html