Description of problem:
-----------------------
After the minor update of RHOS-11 to 2018-01-04.2 and an OC reboot, rabbitmq fails to start.

Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Thu Jan 11 09:34:54 2018
Last change: Thu Jan 11 08:21:04 2018 by root via crm_attribute on controller-1

3 nodes configured
19 resources configured

Online: [ controller-0 controller-1 controller-2 ]

Full list of resources:

 Master/Slave Set: galera-master [galera]
     Masters: [ controller-0 controller-1 controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ controller-2 ]
     Stopped: [ controller-0 controller-1 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ controller-1 ]
     Slaves: [ controller-0 controller-2 ]
 ip-192.168.24.6                    (ocf::heartbeat:IPaddr2):    Started controller-1
 ip-2620.52.0.13b8.5054.ff.fe3e.1   (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-fd00.fd00.fd00.2000..18         (ocf::heartbeat:IPaddr2):    Started controller-1
 ip-fd00.fd00.fd00.2000..14         (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-fd00.fd00.fd00.3000..16         (ocf::heartbeat:IPaddr2):    Started controller-2
 ip-fd00.fd00.fd00.4000..13         (ocf::heartbeat:IPaddr2):    Started controller-2
 Clone Set: haproxy-clone [haproxy]
     Started: [ controller-0 controller-1 controller-2 ]
 openstack-cinder-volume            (systemd:openstack-cinder-volume):    Started controller-1

Failed Actions:
* rabbitmq_start_0 on controller-0 'unknown error' (1): call=58, status=complete, exitreason='none',
    last-rc-change='Thu Jan 11 08:24:11 2018', queued=0ms, exec=11777ms
* rabbitmq_start_0 on controller-1 'unknown error' (1): call=58, status=complete, exitreason='none',
    last-rc-change='Thu Jan 11 08:19:49 2018', queued=0ms, exec=11604ms

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

In /var/log/rabbitmq/startup_log the following message is present:
-------------------------------------------------------------------
ERROR: epmd error for host controller-1: address (cannot connect to host/port)

From /var/log/messages:
-----------------------
Jan 11 08:20:05 controller-1 rabbitmq-cluster(rabbitmq)[12899]: INFO: RabbitMQ server is not running
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [ Error: unable to connect to node 'rabbit@controller-1': nodedown ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [  ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [ DIAGNOSTICS ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [ =========== ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [  ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [ attempted to contact: ['rabbit@controller-1'] ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [  ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [ rabbit@controller-1: ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [   * unable to connect to epmd (port 4369) on controller-1: address (cannot connect to host/port) ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [  ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [  ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [ current node details: ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [ - node name: 'rabbitmq-cli-52@controller-1' ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [ - home dir: /var/lib/rabbitmq ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [ - cookie hash: 8QhhprZ31nfwL74O/NJpIg== ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [  ]
Jan 11 08:20:05 controller-1 crmd[3153]:  notice: Result of stop operation for rabbitmq on controller-1: 0 (ok)

Checking for port 4369 on controller-1:
---------------------------------------
ss -anp | grep 4369
tcp    LISTEN     0      128    127.0.0.1:4369                  *:*     users:(("epmd",pid=4780,fd=3))
tcp    LISTEN     0      128    fd00:fd00:fd00:2000::10:4369    :::*    users:(("epmd",pid=4780,fd=5))
tcp    LISTEN     0      128    ::1:4369                        :::*    users:(("epmd",pid=4780,fd=4))

Checking connection on port 4369 on controller-2:
-------------------------------------------------
ss -anp | grep 4369
tcp    LISTEN     0      128    127.0.0.1:4369                  *:*     users:(("epmd",pid=4647,fd=3))
tcp    LISTEN     0      128    fd00:fd00:fd00:2000::17:4369    :::*    users:(("epmd",pid=4647,fd=5))
tcp    LISTEN     0      128    ::1:4369                        :::*    users:(("epmd",pid=4647,fd=4))
tcp    TIME-WAIT  0      0      fd00:fd00:fd00:2000::17:4369    fd00:fd00:fd00:2000::11:57541
tcp    TIME-WAIT  0      0      fd00:fd00:fd00:2000::17:4369    fd00:fd00:fd00:2000::11:39988
tcp    TIME-WAIT  0      0      fd00:fd00:fd00:2000::17:4369    fd00:fd00:fd00:2000::11:38978
tcp    TIME-WAIT  0      0      fd00:fd00:fd00:2000::17:4369    fd00:fd00:fd00:2000::11:42160
tcp    TIME-WAIT  0      0      fd00:fd00:fd00:2000::17:4369    fd00:fd00:fd00:2000::11:50466
tcp    TIME-WAIT  0      0      fd00:fd00:fd00:2000::17:4369    fd00:fd00:fd00:2000::11:39234
tcp    TIME-WAIT  0      0      fd00:fd00:fd00:2000::17:4369    fd00:fd00:fd00:2000::11:37077
tcp    TIME-WAIT  0      0      fd00:fd00:fd00:2000::17:4369    fd00:fd00:fd00:2000::11:54520
tcp    TIME-WAIT  0      0      fd00:fd00:fd00:2000::17:4369    fd00:fd00:fd00:2000::11:55957
tcp    TIME-WAIT  0      0      fd00:fd00:fd00:2000::17:4369    fd00:fd00:fd00:2000::11:47225

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
openstack-tripleo-heat-templates-6.2.1-2.el7ost.noarch
rabbitmq-server-3.6.5-5.el7ost.noarch
puppet-rabbitmq-5.6.0-3.03b8592git.el7ost.noarch

Steps to Reproduce:
-------------------
1. Deploy RHOS-11 z2
2. Set up the 2018-01-04.2 repos on the UC and OC
3. Update the UC
4. Update the OC
5. Reboot the OC nodes. For nodes running pacemaker the procedure is (see the sketch at the end of this comment):
   - pcs cluster stop --request-timeout=300
   - reboot
   - pcs cluster start --wait=300

Actual results:
---------------
RMQ fails to start

Expected results:
-----------------
RMQ is started successfully

Additional info:
----------------
Virtual setup: 3 controllers + 2 computes + 3 ceph
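
For reference, step 5 amounts to running something like the following loop from the undercloud. This is only a sketch: the heat-admin ssh user, the node names, and the wait logic are assumptions; only the pcs/reboot commands themselves are from the actual procedure.

for node in controller-0 controller-1 controller-2; do
    ssh heat-admin@$node 'sudo pcs cluster stop --request-timeout=300'
    ssh heat-admin@$node 'sudo reboot' || true    # the ssh connection drops when the node goes down
    sleep 60                                      # give the node time to actually go down
    until ssh -o ConnectTimeout=5 heat-admin@$node true; do sleep 10; done
    ssh heat-admin@$node 'sudo pcs cluster start --wait=300'
done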
This is probably the same as bug 1522896, except 1522896 is filed against OSP12.
OK, this looks related to bug 1461190. Erlang can't reach epmd, because epmd is listening on ipv6, but the erlang resolver returns ipv4 addresses:

[root@controller-1 rabbitmq]# erl -sname foo -proto_dist inet6_tcp
Erlang/OTP 18 [erts-7.3.1.3] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]

Eshell V7.3.1.3  (abort with ^G)
(foo@controller-1)1> inet:gethostbyname("controller-1").
{ok,{hostent,"controller-1",[],inet,4,
     [{192,168,24,6},{192,168,24,13},{172,17,2,11}]}}

The reason seems to be that OSP11 is missing https://github.com/voxpupuli/puppet-rabbitmq/pull/552. I've verified that the patch is included in OSP12 as per https://bugzilla.redhat.com/show_bug.cgi?id=1484547#c9, so this is *not* the same as bug 1522896.
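
For anyone hitting this before the fix lands: the stock RabbitMQ 3.6 IPv6 recipe is to point the Erlang resolver at an inetrc that prefers IPv6 and pass -proto_dist inet6_tcp to both the broker and the CLI. The sketch below uses the file path and contents from that generic recipe; it is not necessarily byte-for-byte what the puppet-rabbitmq patch writes.

# /etc/rabbitmq/rabbitmq-env.conf
RABBITMQ_SERVER_ADDITIONAL_ERL_ARGS="-kernel inetrc '/etc/rabbitmq/erl_inetrc' -proto_dist inet6_tcp"
RABBITMQ_CTL_ERL_ARGS="-kernel inetrc '/etc/rabbitmq/erl_inetrc' -proto_dist inet6_tcp"

# /etc/rabbitmq/erl_inetrc  (Erlang inetrc terms; tells the resolver to return IPv6 addresses)
{inet6, true}.

With something like that in place, the inet:gethostbyname("controller-1") call above should return the fd00: addresses that epmd is actually bound to, instead of the IPv4 ones.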
This also requires the updated erlang from bug 1536064
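
A quick way to confirm the controllers actually picked up the rebuilt erlang, reusing the ansible pattern from this bug (the grep filter is only a guess at the package naming):

[stack@undercloud-0 ~]$ ansible controller -b -mshell -a'rpm -qa | grep ^erlang'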
This has been fixed in RDO stable/ocata for a while now: https://trunk.rdoproject.org/centos7-ocata/current/puppet-rabbitmq-5.6.1-0.20180115161315.5ac45de.el7.centos.noarch.rpm
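
For completeness, the puppet-rabbitmq build actually installed on a given cloud can be checked with the same ansible pattern used elsewhere in this bug (the overcloud host group name is an assumption about the inventory):

[stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a'rpm -q puppet-rabbitmq'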
Pending stable/ocata import - https://trello.com/c/YWYdFmLe/703-osp11-import-rdo-ocata-promotion-2018-03-08
According to our records, this should be resolved by puppet-tripleo-6.5.10-3.el7ost. This build is available now.
Verified. Tested on minor update: OSP11 2018-01-04.2 -> OSP11 2018-05-23.1

Package version:
[stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a'rpm -qa |grep puppet-tripleo'
compute-0 | SUCCESS | rc=0 >>
puppet-tripleo-6.5.10-3.el7ost.noarch

compute-1 | SUCCESS | rc=0 >>
puppet-tripleo-6.5.10-3.el7ost.noarch

controller-1 | SUCCESS | rc=0 >>
puppet-tripleo-6.5.10-3.el7ost.noarch

controller-0 | SUCCESS | rc=0 >>
puppet-tripleo-6.5.10-3.el7ost.noarch

controller-2 | SUCCESS | rc=0 >>
puppet-tripleo-6.5.10-3.el7ost.noarch

Reproducing the reboot process for nodes running pacemaker: pcs cluster stop; reboot; pcs cluster start

Checking cluster status after the reboots: all OK

Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.18-11.el7_5.2-2b07d5c5a9) - partition with quorum
Last updated: Sun Jun  3 10:01:28 2018
Last change: Sun Jun  3 09:57:53 2018 by hacluster via crmd on controller-2

3 nodes configured
22 resources configured

Online: [ controller-0 controller-1 controller-2 ]

Full list of resources:

 Master/Slave Set: galera-master [galera]
     Masters: [ controller-0 controller-1 controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ controller-0 controller-1 controller-2 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ controller-0 ]
     Slaves: [ controller-1 controller-2 ]
 ip-192.168.24.14                    (ocf::heartbeat:IPaddr2):    Started controller-0
 ip-10.0.0.101                       (ocf::heartbeat:IPaddr2):    Started controller-1
 ip-172.17.1.16                      (ocf::heartbeat:IPaddr2):    Started controller-0
 ip-172.17.1.10                      (ocf::heartbeat:IPaddr2):    Started controller-0
 ip-172.17.3.10                      (ocf::heartbeat:IPaddr2):    Started controller-1
 ip-172.17.4.18                      (ocf::heartbeat:IPaddr2):    Started controller-1
 Clone Set: haproxy-clone [haproxy]
     Started: [ controller-0 controller-1 controller-2 ]
 openstack-cinder-volume             (systemd:openstack-cinder-volume):    Started controller-0
 stonith-fence_ipmilan-52540050eb11  (stonith:fence_ipmilan):    Started controller-0
 stonith-fence_ipmilan-525400cbbc07  (stonith:fence_ipmilan):    Started controller-1
 stonith-fence_ipmilan-525400f5c568  (stonith:fence_ipmilan):    Started controller-1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

Checking epmd is listening on ipv4:
[stack@undercloud-0 ~]$ ansible controller -b -mshell -a'ss -anlp | grep 4369'
 [WARNING]: Found both group and host with same name: undercloud

controller-2 | SUCCESS | rc=0 >>
tcp    LISTEN     0      128    172.17.1.17:4369    *:*     users:(("epmd",pid=6925,fd=5))
tcp    LISTEN     0      128    127.0.0.1:4369      *:*     users:(("epmd",pid=6925,fd=3))
tcp    LISTEN     0      128    ::1:4369            :::*    users:(("epmd",pid=6925,fd=4))

controller-0 | SUCCESS | rc=0 >>
nl     UNCONN     0      0      16:-4369            0:0
tcp    LISTEN     0      128    172.17.1.13:4369    *:*     users:(("epmd",pid=4431,fd=5))
tcp    LISTEN     0      128    127.0.0.1:4369      *:*     users:(("epmd",pid=4431,fd=3))
tcp    LISTEN     0      128    ::1:4369            :::*    users:(("epmd",pid=4431,fd=4))

controller-1 | SUCCESS | rc=0 >>
nl     UNCONN     0      0      16:-4369            0:0
tcp    LISTEN     0      128    172.17.1.22:4369    *:*     users:(("epmd",pid=4426,fd=5))
tcp    LISTEN     0      128    127.0.0.1:4369      *:*     users:(("epmd",pid=4426,fd=3))
tcp    LISTEN     0      128    ::1:4369            :::*    users:(("epmd",pid=4426,fd=4))

Checking rabbitmq startup logs: all OK
[stack@undercloud-0 ~]$ ansible controller -b -mshell -a'cat /var/log/rabbitmq/startup_log'
 [WARNING]: Found both group and host with same name: undercloud

controller-0 | SUCCESS | rc=0 >>

              RabbitMQ 3.6.5. Copyright (C) 2007-2016 Pivotal Software, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /var/log/rabbitmq/rabbit
  ######  ##        /var/log/rabbitmq/rabbit
  ##########
              Starting broker...
 completed with 0 plugins.

controller-1 | SUCCESS | rc=0 >>

              RabbitMQ 3.6.5. Copyright (C) 2007-2016 Pivotal Software, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /var/log/rabbitmq/rabbit
  ######  ##        /var/log/rabbitmq/rabbit
  ##########
              Starting broker...
 completed with 0 plugins.

              RabbitMQ 3.6.5. Copyright (C) 2007-2016 Pivotal Software, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /var/log/rabbitmq/rabbit
  ######  ##        /var/log/rabbitmq/rabbit
  ##########
              Starting broker...
 completed with 0 plugins.

controller-2 | SUCCESS | rc=0 >>

              RabbitMQ 3.6.5. Copyright (C) 2007-2016 Pivotal Software, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /var/log/rabbitmq/rabbit
  ######  ##        /var/log/rabbitmq/rabbit
  ##########
              Starting broker...
 completed with 0 plugins.

              RabbitMQ 3.6.5. Copyright (C) 2007-2016 Pivotal Software, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /var/log/rabbitmq/rabbit
  ######  ##        /var/log/rabbitmq/rabbit
  ##########
              Starting broker...
 completed with 0 plugins.
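
An additional check that was not captured above: rabbit cluster membership can be confirmed with rabbitmqctl on each controller, for example via the same ansible pattern (output not shown here):

[stack@undercloud-0 ~]$ ansible controller -b -mshell -a'rabbitmqctl cluster_status'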
OSP11 EOL'd with a newer version of this package.