Bug 1533406

Summary: [UPDATES] RMQ fails to start after minor update and reboot
Product: Red Hat OpenStack
Component: puppet-tripleo
Version: 11.0 (Ocata)
Status: CLOSED CURRENTRELEASE
Severity: urgent
Priority: urgent
Reporter: Yurii Prokulevych <yprokule>
Assignee: John Eckersberg <jeckersb>
QA Contact: pkomarov
CC: apevec, augol, chjones, jeckersb, jjoyce, jschluet, lhh, mbultel, pkomarov, slinaber, srevivo, tvignaud
Keywords: TestOnly, Triaged, ZStream
Target Milestone: zstream
Target Release: 11.0 (Ocata)
Hardware: Unspecified
OS: Unspecified
Fixed In Version: puppet-tripleo-6.5.10-2.el7ost
Doc Type: If docs needed, set a value
Clones: 1555317, 1557513 (view as bug list)
Last Closed: 2018-06-08 12:21:36 UTC
Type: Bug
Bug Depends On: 1536064
Bug Blocks: 1555317, 1557513, 1557519, 1557522, 1647474, 1647587, 1647593, 1654041, 1654042

Description Yurii Prokulevych 2018-01-11 09:43:23 UTC
Description of problem:
-----------------------
After a minor update of RHOS-11 to 2018-01-04.2 and an overcloud reboot, RabbitMQ fails to start.

Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-2 (version 1.1.16-12.el7_4.5-94ff4df) - partition with quorum
Last updated: Thu Jan 11 09:34:54 2018
Last change: Thu Jan 11 08:21:04 2018 by root via crm_attribute on controller-1

3 nodes configured
19 resources configured

Online: [ controller-0 controller-1 controller-2 ]

Full list of resources:

 Master/Slave Set: galera-master [galera]
     Masters: [ controller-0 controller-1 controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ controller-2 ]
     Stopped: [ controller-0 controller-1 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ controller-1 ]
     Slaves: [ controller-0 controller-2 ]
 ip-192.168.24.6        (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-2620.52.0.13b8.5054.ff.fe3e.1       (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-fd00.fd00.fd00.2000..18     (ocf::heartbeat:IPaddr2):       Started controller-1
 ip-fd00.fd00.fd00.2000..14     (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-fd00.fd00.fd00.3000..16     (ocf::heartbeat:IPaddr2):       Started controller-2
 ip-fd00.fd00.fd00.4000..13     (ocf::heartbeat:IPaddr2):       Started controller-2
 Clone Set: haproxy-clone [haproxy]
     Started: [ controller-0 controller-1 controller-2 ]
 openstack-cinder-volume        (systemd:openstack-cinder-volume):      Started controller-1

Failed Actions:
* rabbitmq_start_0 on controller-0 'unknown error' (1): call=58, status=complete, exitreason='none',
    last-rc-change='Thu Jan 11 08:24:11 2018', queued=0ms, exec=11777ms
* rabbitmq_start_0 on controller-1 'unknown error' (1): call=58, status=complete, exitreason='none',
    last-rc-change='Thu Jan 11 08:19:49 2018', queued=0ms, exec=11604ms


Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled

In /var/log/rabbitmq/startup_log the following message is present:
---------------------------------------------------------
ERROR: epmd error for host controller-1: address (cannot connect to host/port)

From /var/log/messages:
-----------------------
Jan 11 08:20:05 controller-1 rabbitmq-cluster(rabbitmq)[12899]: INFO: RabbitMQ server is not running
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [ Error: unable to connect to node 'rabbit@controller-1': nodedown ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [  ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [ DIAGNOSTICS ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [ =========== ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [  ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [ attempted to contact: ['rabbit@controller-1'] ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [  ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [ rabbit@controller-1: ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [   * unable to connect to epmd (port 4369) on controller-1: address (cannot connect to host/port) ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [  ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [  ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [ current node details: ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [ - node name: 'rabbitmq-cli-52@controller-1' ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [ - home dir: /var/lib/rabbitmq ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [ - cookie hash: 8QhhprZ31nfwL74O/NJpIg== ]
Jan 11 08:20:05 controller-1 lrmd[3150]:  notice: rabbitmq_stop_0:12899:stderr [  ]
Jan 11 08:20:05 controller-1 crmd[3153]:  notice: Result of stop operation for rabbitmq on controller-1: 0 (ok)

Checking for port 4369 on controller-1:
---------------------------------------
ss -anp | grep 4369
tcp    LISTEN     0      128    127.0.0.1:4369                  *:*                   users:(("epmd",pid=4780,fd=3))
tcp    LISTEN     0      128     fd00:fd00:fd00:2000::10:4369                 :::*                   users:(("epmd",pid=4780,fd=5))
tcp    LISTEN     0      128     ::1:4369                 :::*                   users:(("epmd",pid=4780,fd=4))
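The listening sockets above can be probed directly to confirm which addresses epmd actually accepts connections on. A minimal bash sketch, assuming the loopback targets below purely for illustration (on the failing node, the address the Erlang resolver actually returns is the one worth probing):

```shell
# Probe epmd's port (4369) on a given address using bash's /dev/tcp device.
# A successful connect means epmd is reachable on that address.
probe_epmd() {
  local target="$1"
  if timeout 2 bash -c "exec 3<>/dev/tcp/${target}/4369" 2>/dev/null; then
    echo "epmd reachable on ${target}"
  else
    echo "epmd NOT reachable on ${target}"
  fi
}

# Illustrative targets; substitute the node's service addresses as needed.
probe_epmd 127.0.0.1
probe_epmd localhost
```

On the broken controllers this kind of probe succeeds over IPv6 loopback but fails on the IPv4 address the resolver hands back, matching the "cannot connect to host/port" error above.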

Checking connection on port 4369 on controller-2:
-------------------------------------------------
ss -anp | grep  4369
tcp    LISTEN     0      128    127.0.0.1:4369                  *:*                   users:(("epmd",pid=4647,fd=3))
tcp    LISTEN     0      128     fd00:fd00:fd00:2000::17:4369                 :::*                   users:(("epmd",pid=4647,fd=5))
tcp    LISTEN     0      128     ::1:4369                 :::*                   users:(("epmd",pid=4647,fd=4))
tcp    TIME-WAIT  0      0       fd00:fd00:fd00:2000::17:4369                fd00:fd00:fd00:2000::11:57541              
tcp    TIME-WAIT  0      0       fd00:fd00:fd00:2000::17:4369                fd00:fd00:fd00:2000::11:39988              
tcp    TIME-WAIT  0      0       fd00:fd00:fd00:2000::17:4369                fd00:fd00:fd00:2000::11:38978              
tcp    TIME-WAIT  0      0       fd00:fd00:fd00:2000::17:4369                fd00:fd00:fd00:2000::11:42160              
tcp    TIME-WAIT  0      0       fd00:fd00:fd00:2000::17:4369                fd00:fd00:fd00:2000::11:50466              
tcp    TIME-WAIT  0      0       fd00:fd00:fd00:2000::17:4369                fd00:fd00:fd00:2000::11:39234              
tcp    TIME-WAIT  0      0       fd00:fd00:fd00:2000::17:4369                fd00:fd00:fd00:2000::11:37077              
tcp    TIME-WAIT  0      0       fd00:fd00:fd00:2000::17:4369                fd00:fd00:fd00:2000::11:54520              
tcp    TIME-WAIT  0      0       fd00:fd00:fd00:2000::17:4369                fd00:fd00:fd00:2000::11:55957              
tcp    TIME-WAIT  0      0       fd00:fd00:fd00:2000::17:4369                fd00:fd00:fd00:2000::11:47225


Version-Release number of selected component (if applicable):
-------------------------------------------------------------
openstack-tripleo-heat-templates-6.2.1-2.el7ost.noarch
rabbitmq-server-3.6.5-5.el7ost.noarch
puppet-rabbitmq-5.6.0-3.03b8592git.el7ost.noarch


Steps to Reproduce:
-------------------
1. Deploy RHOS-11 z2
2. Setup 2018-01-04.2 repos on uc and oc
3. Update uc
4. Update oc
5. Start rebooting oc nodes. For nodes running pacemaker the procedure is:
    - pcs cluster stop --request-timeout=300
    - reboot
    - pcs cluster start --wait=300
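The per-node sequence in step 5 can be sketched as a small helper. This is a dry-run sketch only: the heat-admin ssh user and remote sudo invocation are assumptions (heat-admin is the usual TripleO overcloud user), while the pcs commands are the ones listed above.

```shell
# Print the command sequence used to restart one pacemaker node (dry run).
# heat-admin is assumed as the overcloud ssh user; adjust for your deployment.
pcs_reboot_sequence() {
  local node="$1"
  cat <<EOF
ssh heat-admin@${node} sudo pcs cluster stop --request-timeout=300
ssh heat-admin@${node} sudo reboot
ssh heat-admin@${node} sudo pcs cluster start --wait=300
EOF
}

# Example: show the sequence for one controller.
pcs_reboot_sequence controller-0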

Actual results:
---------------
RMQ fails to start

Expected results:
-----------------
RMQ is started successfully


Additional info:
----------------
Virtual setup: 3 controllers + 2 computes + 3 ceph nodes

Comment 2 John Eckersberg 2018-01-11 13:39:41 UTC
This is probably the same as bug 1522896, except 1522896 is filed against OSP12.

Comment 3 John Eckersberg 2018-01-11 16:47:42 UTC
OK, this looks related to bug 1461190.

Erlang can't reach epmd because epmd is listening on IPv6, but the Erlang resolver returns IPv4 addresses:

[root@controller-1 rabbitmq]# erl -sname foo -proto_dist inet6_tcp     
Erlang/OTP 18 [erts-7.3.1.3] [source] [64-bit] [smp:4:4] [async-threads:10] [hipe] [kernel-poll:false]                                        

Eshell V7.3.1.3  (abort with ^G)                                       
(foo@controller-1)1> inet:gethostbyname("controller-1").               
{ok,{hostent,"controller-1",[],inet,4,                                 
             [{192,168,24,6},{192,168,24,13},{172,17,2,11}]}}          
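The same address-family mismatch can be observed outside Erlang with getent, which roughly mirrors the system resolver's per-family answers. localhost is used here purely as an illustrative target; on the affected controller you would query its own hostname.

```shell
# Compare IPv4 vs IPv6 resolver answers for a host (illustrative target).
# On the failing controller the v4 query returns addresses, while epmd
# only listens on v6 service addresses, hence the connection failure.
getent ahostsv4 localhost
getent ahostsv6 localhost
```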

It seems the reason is that OSP11 is missing https://github.com/voxpupuli/puppet-rabbitmq/pull/552.  I've verified that the patch is included in OSP12 (see https://bugzilla.redhat.com/show_bug.cgi?id=1484547#c9), so this is *not* the same as bug 1522896.

Comment 4 John Eckersberg 2018-01-18 15:09:28 UTC
This also requires the updated Erlang from bug 1536064.

Comment 5 John Eckersberg 2018-03-16 15:42:58 UTC
This has been fixed in RDO stable/ocata for a while now: https://trunk.rdoproject.org/centos7-ocata/current/puppet-rabbitmq-5.6.1-0.20180115161315.5ac45de.el7.centos.noarch.rpm

Comment 6 John Eckersberg 2018-03-16 15:54:49 UTC
Pending stable/ocata import - https://trello.com/c/YWYdFmLe/703-osp11-import-rdo-ocata-promotion-2018-03-08

Comment 7 Lon Hohberger 2018-05-21 10:36:34 UTC
According to our records, this should be resolved by puppet-tripleo-6.5.10-3.el7ost.  This build is available now.

Comment 8 pkomarov 2018-06-03 10:05:11 UTC
Verified.

Tested on minor update:
OSP11 2018-01-04.2 -> OSP11 2018-05-23.1

Package versions:
[stack@undercloud-0 ~]$ ansible overcloud -b -mshell -a'rpm -qa |grep puppet-tripleo'

compute-0 | SUCCESS | rc=0 >>
puppet-tripleo-6.5.10-3.el7ost.noarch

compute-1 | SUCCESS | rc=0 >>
puppet-tripleo-6.5.10-3.el7ost.noarch

controller-1 | SUCCESS | rc=0 >>
puppet-tripleo-6.5.10-3.el7ost.noarch

controller-0 | SUCCESS | rc=0 >>
puppet-tripleo-6.5.10-3.el7ost.noarch

controller-2 | SUCCESS | rc=0 >>
puppet-tripleo-6.5.10-3.el7ost.noarch

Reproducing the reboot process:
For each node running pacemaker: pcs cluster stop; reboot; pcs cluster start

Check cluster status after disruption: all OK:

Cluster name: tripleo_cluster
Stack: corosync
Current DC: controller-1 (version 1.1.18-11.el7_5.2-2b07d5c5a9) - partition with quorum
Last updated: Sun Jun  3 10:01:28 2018
Last change: Sun Jun  3 09:57:53 2018 by hacluster via crmd on controller-2

3 nodes configured
22 resources configured

Online: [ controller-0 controller-1 controller-2 ]

Full list of resources:

 Master/Slave Set: galera-master [galera]
     Masters: [ controller-0 controller-1 controller-2 ]
 Clone Set: rabbitmq-clone [rabbitmq]
     Started: [ controller-0 controller-1 controller-2 ]
 Master/Slave Set: redis-master [redis]
     Masters: [ controller-0 ]
     Slaves: [ controller-1 controller-2 ]
 ip-192.168.24.14	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-10.0.0.101	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-172.17.1.16	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-172.17.1.10	(ocf::heartbeat:IPaddr2):	Started controller-0
 ip-172.17.3.10	(ocf::heartbeat:IPaddr2):	Started controller-1
 ip-172.17.4.18	(ocf::heartbeat:IPaddr2):	Started controller-1
 Clone Set: haproxy-clone [haproxy]
     Started: [ controller-0 controller-1 controller-2 ]
 openstack-cinder-volume	(systemd:openstack-cinder-volume):	Started controller-0
 stonith-fence_ipmilan-52540050eb11	(stonith:fence_ipmilan):	Started controller-0
 stonith-fence_ipmilan-525400cbbc07	(stonith:fence_ipmilan):	Started controller-1
 stonith-fence_ipmilan-525400f5c568	(stonith:fence_ipmilan):	Started controller-1

Daemon Status:
  corosync: active/enabled
  pacemaker: active/enabled
  pcsd: active/enabled


Checking epmd is listening on IPv4:

[stack@undercloud-0 ~]$ ansible controller -b -mshell -a'ss -anlp | grep 4369'
 [WARNING]: Found both group and host with same name: undercloud

controller-2 | SUCCESS | rc=0 >>
tcp    LISTEN     0      128    172.17.1.17:4369                  *:*                   users:(("epmd",pid=6925,fd=5))
tcp    LISTEN     0      128    127.0.0.1:4369                  *:*                   users:(("epmd",pid=6925,fd=3))
tcp    LISTEN     0      128     ::1:4369                 :::*                   users:(("epmd",pid=6925,fd=4))

controller-0 | SUCCESS | rc=0 >>
nl     UNCONN     0      0        16:-4369                 0:0                  
tcp    LISTEN     0      128    172.17.1.13:4369                  *:*                   users:(("epmd",pid=4431,fd=5))
tcp    LISTEN     0      128    127.0.0.1:4369                  *:*                   users:(("epmd",pid=4431,fd=3))
tcp    LISTEN     0      128     ::1:4369                 :::*                   users:(("epmd",pid=4431,fd=4))

controller-1 | SUCCESS | rc=0 >>
nl     UNCONN     0      0        16:-4369                 0:0                  
tcp    LISTEN     0      128    172.17.1.22:4369                  *:*                   users:(("epmd",pid=4426,fd=5))
tcp    LISTEN     0      128    127.0.0.1:4369                  *:*                   users:(("epmd",pid=4426,fd=3))
tcp    LISTEN     0      128     ::1:4369                 :::*                   users:(("epmd",pid=4426,fd=4))


Checking rabbitmq startup logs: all OK:

[stack@undercloud-0 ~]$ ansible controller -b -mshell -a'cat /var/log/rabbitmq/startup_log'
 [WARNING]: Found both group and host with same name: undercloud

controller-0 | SUCCESS | rc=0 >>

              RabbitMQ 3.6.5. Copyright (C) 2007-2016 Pivotal Software, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /var/log/rabbitmq/rabbit
  ######  ##        /var/log/rabbitmq/rabbit
  ##########
              Starting broker...
 completed with 0 plugins.

controller-1 | SUCCESS | rc=0 >>

              RabbitMQ 3.6.5. Copyright (C) 2007-2016 Pivotal Software, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /var/log/rabbitmq/rabbit
  ######  ##        /var/log/rabbitmq/rabbit
  ##########
              Starting broker...
 completed with 0 plugins.

              RabbitMQ 3.6.5. Copyright (C) 2007-2016 Pivotal Software, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /var/log/rabbitmq/rabbit
  ######  ##        /var/log/rabbitmq/rabbit
  ##########
              Starting broker...
 completed with 0 plugins.

controller-2 | SUCCESS | rc=0 >>

              RabbitMQ 3.6.5. Copyright (C) 2007-2016 Pivotal Software, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /var/log/rabbitmq/rabbit
  ######  ##        /var/log/rabbitmq/rabbit
  ##########
              Starting broker...
 completed with 0 plugins.

              RabbitMQ 3.6.5. Copyright (C) 2007-2016 Pivotal Software, Inc.
  ##  ##      Licensed under the MPL.  See http://www.rabbitmq.com/
  ##  ##
  ##########  Logs: /var/log/rabbitmq/rabbit
  ######  ##        /var/log/rabbitmq/rabbit
  ##########
              Starting broker...
 completed with 0 plugins.

Comment 9 Chris Jones 2018-06-08 12:21:36 UTC
OSP11 EOL'd with a newer version of this package.