Description of problem: This is a spinoff from https://bugzilla.redhat.com/show_bug.cgi?id=1441635, where we only track the IPv6 VIP misconfiguration. The problem is that the VIPs are on the same network as the "normal" controller interfaces, which causes routing problems when a VIP moves.

Consider controller ctrl-r00-00. It has IPv6 address fd00:fd00:fd00:2000::20/64 on vlan200, the internalapi network. The internalapi VIP has IPv6 address fd00:fd00:fd00:2000::5, also on vlan200. When the VIP resource is active on ctrl-r00-00, the vlan200 interface carries both addresses, and there is a route to fd00:fd00:fd00:2000::/64 on dev vlan200.

What happens when galera starts and replication connects to the other cluster members? The configuration item used is:

wsrep_cluster_address=gcomm://ctrl-r00-00,ctrl-r01-01,ctrl-r02-02

(this is an attribute on the galera resource). Next, ctrl-r00-00 does a name lookup for ctrl-r01-01:

[root@ctrl-r00-00 ~]# getent hosts ctrl-r01-01
fd00:fd00:fd00:2000::21 ctrl-r01-01.redhat.local ctrl-r01-01

So we need to make a connection to fd00:fd00:fd00:2000::21. Remember we have a route for that network on dev vlan200, but two valid IPv6 addresses on that network. In short, the kernel chooses the VIP address as the source address for the connection. And that's a problem: when the VIP moves to a different host, any outbound packets become unroutable.

In this circumstance, ctrl-r01-01 and ctrl-r02-02 see the galera membership of ctrl-r00-00 as originating from the VIP, not from the address normally associated with ctrl-r00-00. So when the VIP moves, galera logs that a member with the VIP address went away. This is the suspicious behavior we observed previously.

The same thing happens with RabbitMQ: some of the inter-cluster connections use the VIP as a source address.
When the VIP moves, RabbitMQ sees other cluster members disappear, and things generally fall apart from there.
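The subnet overlap at the heart of the problem can be illustrated with a short sketch (addresses are the ones from this report; the check itself is purely illustrative and is not part of any TripleO code — note the eventual fix also needed an address label, not just the /128, to steer source selection):

```python
import ipaddress

# Addresses from the report: controller NIC address and internalapi VIP, both /64.
nic = ipaddress.ip_interface("fd00:fd00:fd00:2000::20/64")
vip = ipaddress.ip_interface("fd00:fd00:fd00:2000::5/64")
peer = ipaddress.ip_address("fd00:fd00:fd00:2000::21")  # ctrl-r01-01

# Both addresses sit on the same on-link /64, so for traffic to the peer
# either one is a candidate source address under RFC 6724 selection --
# the kernel may legitimately pick the VIP.
assert nic.network == vip.network
assert peer in nic.network and peer in vip.network

# With the fix, the VIP is configured as /128: its "network" contains only
# the VIP itself, so the VIP no longer brings its own on-link subnet route.
vip_fixed = ipaddress.ip_interface("fd00:fd00:fd00:2000::5/128")
assert peer not in vip_fixed.network
print("VIP /64 overlaps the NIC subnet; /128 does not")
```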
We need fixes for both puppet-tripleo and puppet-pacemaker. Do we need a separate puppet-pacemaker bugzilla in addition to this one for puppet-tripleo?
Usually releng is able to build one erratum from multiple packages with one BZ (at least in my experience). In this case we still need to sketch out how to ship this fix via a) minor updates and b) major upgrades. (Hopefully only a) is needed, but we need to sketch out a full plan first.) I expect we will also need tripleo-heat-templates fixes for a) and/or b).
Ok, so here is a short status update. In order to fix this issue we have two scenarios:

- New deployments

For new deployments we need three patches (two for puppet-pacemaker and one for puppet-tripleo):

A.1) puppet-pacemaker: https://review.openstack.org/460232 - Add support for ipv6_addrlabel with IPaddr2 RA
A.2) puppet-pacemaker: https://review.openstack.org/462073 - Fix a typo in ipv6 addrlabel
A.3) puppet-tripleo: https://review.openstack.org/460028 - IPv6 VIP addresses need to be /128

Since puppet-pacemaker has no stable branches, A.1 and A.2 are the only reviews we need there. For puppet-tripleo the master review at A.3 has merged; the backport is here:

A.4) puppet-tripleo: https://review.openstack.org/462479 - IPv6 VIP addresses need to be /128

- Existing deployments

B.1) tripleo-heat-templates: https://review.openstack.org/#/c/460724/ - Initial VIP ipv6 minor update code

B.1 is the master review and has merged. The backport to ocata is here:

B.2) tripleo-heat-templates: https://review.openstack.org/462480 - Initial VIP ipv6 minor update code

I will link only the backports for tht and puppet-tripleo, and the master reviews for puppet-pacemaker. Once the backports have merged, I will go over this with mburns and see whether we need to split off a separate bz for puppet-pacemaker.
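For reference, the shape of the fixed configuration on a controller looks roughly like the sketch below (resource, VLAN, and address values are taken from this report; the exact resource parameters and the addrlabel wiring are managed by puppet-pacemaker/puppet-tripleo, so treat this only as an illustration of the mechanism, not the literal commands the tooling runs):

```
# VIP created with a /128 netmask so it does not add an on-link /64 route
pcs resource create ip-fd00.fd00.fd00.2000..5 ocf:heartbeat:IPaddr2 \
    ip=fd00:fd00:fd00:2000::5 cidr_netmask=128 nic=vlan200

# RFC 6724 source selection prefers a source whose label matches the
# destination's; giving the VIP a non-default label (the iproute2
# mechanism behind the ipv6_addrlabel support added in A.1) keeps the
# kernel from choosing it as the source for ordinary outbound traffic.
ip addrlabel add prefix fd00:fd00:fd00:2000::5/128 label 99
```

With both pieces in place, galera and RabbitMQ connections originate from the controller's own /64 address, so cluster membership survives a VIP failover.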
FTR: The pin update for puppet-pacemaker in newton/ocata, which is needed to include the ipv6 fixes in puppet-pacemaker, has happened here: https://review.rdoproject.org/r/#/c/6519/
All four reviews (see comment#4) have now merged (the two puppet-pacemaker ones on master, and two on stable/ocata for tht and puppet-tripleo). Moving to POST.
Bug verification steps taken:

Check that we are on rhos-11 GA:

[stack@undercloud ~]$ cat core_puddle_version
2017-05-09.2
[stack@undercloud ~]$ grep -v '\#' /etc/yum.repos.d/rhos-release-11.repo|grep -m 1 baseurl
baseurl=http://download.lab.bos.redhat.com/rcm-guest/puddles/OpenStack/11.0-RHEL-7/2017-05-09.2/RH7-RHOS-11.0/$basearch/os
[stack@undercloud ~]$ curl -s http://download-node-02.eng.bos.redhat.com/rcm-guest/puddles/OpenStack/11.0-RHEL-7/|grep GA|cut -f 3,4 -d'='
"[DIR]"> <a href="GA/">GA/</a> 09-May-2017 10:15 -

Check that we have /64 ipv6 vips on the overcloud controllers:

[root@controller-0 ~]# pcs status |grep `hostname -s`|grep 2620
ip-2620.52.0.23ae..16 (ocf::heartbeat:IPaddr2): Started controller-0
[root@controller-0 ~]# ip a show vlan189
14: vlan189: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 1e:24:3e:8c:73:86 brd ff:ff:ff:ff:ff:ff
    inet6 2620:52:0:23ae::16/64 scope global

After updating from GA -> z2, check that we are on z2:

[stack@puma33 ~]$ cat core_puddle_version
2017-08-30.3
[stack@puma33 ~]$ grep -v '\#' /etc/yum.repos.d/rhos-release-11.repo|grep -m 1 baseurl
baseurl=http://download.lab.bos.redhat.com/rcm-guest/puddles/OpenStack/11.0-RHEL-7/2017-08-30.3/RH7-RHOS-11.0/$basearch/os
[stack@puma33 ~]$ curl -s http://download-node-02.eng.bos.redhat.com/rcm-guest/puddles/OpenStack/11.0-RHEL-7/|grep z2|cut -f 3,4 -d'='
"[DIR]"> <a href="z2/">z2/</a> 30-Aug-2017 23:29

Check that we now have /128 ipv6 vips on the overcloud controllers:

[root@controller-0 ~]# pcs status |grep `hostname -s`|grep 2620
ip-2620.52.0.23b4..15 (ocf::heartbeat:IPaddr2): Started controller-0
ip-2620.52.0.23ae..16 (ocf::heartbeat:IPaddr2): Started controller-0
[root@controller-0 ~]# ip a show vlan189
11: vlan189: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 02:54:64:bd:07:fc brd ff:ff:ff:ff:ff:ff
    inet6 2620:52:0:23ae::16/128 scope global
[root@controller-0 ~]# ip a show vlan195
12: vlan195: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether be:38:68:39:09:73 brd ff:ff:ff:ff:ff:ff
    inet6 2620:52:0:23b4::15/128 scope global
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2721