Description of problem: This is a spinoff from https://bugzilla.redhat.com/show_bug.cgi?id=1441635, where we only track the IPv6 VIP misconfiguration. The problem is that the VIPs are on the same network as the "normal" controller interfaces, which causes routing problems when a VIP moves.

Consider controller ctrl-r00-00. It has IPv6 address fd00:fd00:fd00:2000::20/64 on vlan200, the internalapi network. The internalapi VIP has IPv6 address fd00:fd00:fd00:2000::5, also on vlan200. When the VIP resource is active on ctrl-r00-00, the vlan200 interface carries both addresses, and there is a route to fd00:fd00:fd00:2000::/64 on dev vlan200.

What happens when galera starts and replication connects to the other cluster members? The configuration item used is:

wsrep_cluster_address=gcomm://ctrl-r00-00,ctrl-r01-01,ctrl-r02-02

(this is an attribute on the galera resource). Next, ctrl-r00-00 does a name lookup for ctrl-r01-01:

[root@ctrl-r00-00 ~]# getent hosts ctrl-r01-01
fd00:fd00:fd00:2000::21 ctrl-r01-01.redhat.local ctrl-r01-01

So we need to make a connection to fd00:fd00:fd00:2000::21. Remember we have a route for that network on dev vlan200, but two valid IPv6 addresses on that network. In short, the kernel chooses the VIP address as the source address for the connection. And that's a problem: when the VIP moves to a different host, any outbound packets become unroutable.

In this circumstance, ctrl-r01-01 and ctrl-r02-02 see the galera membership of ctrl-r00-00 as originating from the VIP, not from the address normally associated with ctrl-r00-00. So when the VIP moves, galera logs that a member with the VIP address went away. This is the suspicious behavior we observed previously.

The same thing happens with RabbitMQ: some of the inter-cluster connections use the VIP as a source address.
When the VIP moves, RabbitMQ sees other cluster members disappear, and things generally fall apart from there.
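The subnet overlap at the heart of the problem can be illustrated with a short sketch (addresses are the ones from this report; the check itself is purely illustrative and is not part of any TripleO code — note the eventual fix also needed an address label, not just the /128, to steer source selection):

```python
import ipaddress

# Addresses from the report: controller NIC address and internalapi VIP, both /64.
nic = ipaddress.ip_interface("fd00:fd00:fd00:2000::20/64")
vip = ipaddress.ip_interface("fd00:fd00:fd00:2000::5/64")
peer = ipaddress.ip_address("fd00:fd00:fd00:2000::21")  # ctrl-r01-01

# Both addresses sit on the same on-link /64, so for traffic to the peer
# either one is a candidate source address under RFC 6724 selection --
# the kernel may legitimately pick the VIP.
assert nic.network == vip.network
assert peer in nic.network and peer in vip.network

# With the fix, the VIP is configured as /128: its "network" contains only
# the VIP itself, so the VIP no longer brings its own on-link subnet route.
vip_fixed = ipaddress.ip_interface("fd00:fd00:fd00:2000::5/128")
assert peer not in vip_fixed.network
print("VIP /64 overlaps the NIC subnet; /128 does not")
```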
We need fixes for both puppet-tripleo and puppet-pacemaker. Do we need a separate puppet-pacemaker bugzilla in addition to this one for puppet-tripleo?
Usually releng is able to build one erratum from multiple packages with one BZ (at least in my experience). In this case we still need to sketch out how to ship this fix via a) minor updates and b) major upgrades. (Hopefully only a) is needed, but we need to sketch out a full plan first.) I expect we will also need tripleo-heat-templates fixes for a) and/or b).
Ok, so here is a short status update. In order to fix this issue we have two scenarios:

- New deployments

For new deployments we need three patches (two for puppet-pacemaker and one for puppet-tripleo):

A.1) puppet-pacemaker: https://review.openstack.org/460232 - Add support for ipv6_addrlabel with IPaddr2 RA
A.2) puppet-pacemaker: https://review.openstack.org/462073 - Fix a typo in ipv6 addrlabel
A.3) puppet-tripleo: https://review.openstack.org/460028 - IPv6 VIP addresses need to be /128

Since puppet-pacemaker has no stable branches, A.1 and A.2 are the only reviews we need there. For puppet-tripleo the master review at A.3 has merged; the backport is here:

A.4) puppet-tripleo: https://review.openstack.org/462479 - IPv6 VIP addresses need to be /128

- Existing deployments

B.1) tripleo-heat-templates: https://review.openstack.org/#/c/460724/ - Initial VIP ipv6 minor update code

B.1 is the master review and has merged. The backport to ocata is here:

B.2) tripleo-heat-templates: https://review.openstack.org/462480 - Initial VIP ipv6 minor update code

I will link only the backports for tht and puppet-tripleo, and the master reviews for puppet-pacemaker. Once the backports have merged, I will go over this with mburns and see whether we need to split off a separate bz for puppet-pacemaker.
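For reference, the shape of the fixed configuration on a controller looks roughly like the sketch below (resource, VLAN, and address values are taken from this report; the exact resource parameters and the addrlabel wiring are managed by puppet-pacemaker/puppet-tripleo, so treat this only as an illustration of the mechanism, not the literal commands the tooling runs):

```
# VIP created with a /128 netmask so it does not add an on-link /64 route
pcs resource create ip-fd00.fd00.fd00.2000..5 ocf:heartbeat:IPaddr2 \
    ip=fd00:fd00:fd00:2000::5 cidr_netmask=128 nic=vlan200

# RFC 6724 source selection prefers a source whose label matches the
# destination's; giving the VIP a non-default label (the iproute2
# mechanism behind the ipv6_addrlabel support added in A.1) keeps the
# kernel from choosing it as the source for ordinary outbound traffic.
ip addrlabel add prefix fd00:fd00:fd00:2000::5/128 label 99
```

With both pieces in place, galera and RabbitMQ connections originate from the controller's own /64 address, so cluster membership survives a VIP failover.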
FTR: The pin update for puppet-pacemaker in newton/ocata, which is needed to include the ipv6 fixes in puppet-pacemaker, has happened here: https://review.rdoproject.org/r/#/c/6519/
All four reviews (see comment#4) have now merged (the two puppet-pacemaker ones on master, and two on stable/ocata for tht and puppet-tripleo). Moving to POST.
Bug verification steps taken:

Check that we are on rhos-11 GA:

[stack@undercloud ~]$ cat core_puddle_version
2017-05-09.2
[stack@undercloud ~]$ grep -v '\#' /etc/yum.repos.d/rhos-release-11.repo|grep -m 1 baseurl
baseurl=http://download.lab.bos.redhat.com/rcm-guest/puddles/OpenStack/11.0-RHEL-7/2017-05-09.2/RH7-RHOS-11.0/$basearch/os
[stack@undercloud ~]$ curl -s http://download-node-02.eng.bos.redhat.com/rcm-guest/puddles/OpenStack/11.0-RHEL-7/|grep GA|cut -f 3,4 -d'='
"[DIR]"> <a href="GA/">GA/</a> 09-May-2017 10:15 -

Check that we have /64 ipv6 vips on the overcloud controllers:

[root@controller-0 ~]# pcs status |grep `hostname -s`|grep 2620
ip-2620.52.0.23ae..16 (ocf::heartbeat:IPaddr2): Started controller-0
[root@controller-0 ~]# ip a show vlan189
14: vlan189: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 1e:24:3e:8c:73:86 brd ff:ff:ff:ff:ff:ff
    inet6 2620:52:0:23ae::16/64 scope global

After updating from GA -> z2, check that we are on z2:

[stack@puma33 ~]$ cat core_puddle_version
2017-08-30.3
[stack@puma33 ~]$ grep -v '\#' /etc/yum.repos.d/rhos-release-11.repo|grep -m 1 baseurl
baseurl=http://download.lab.bos.redhat.com/rcm-guest/puddles/OpenStack/11.0-RHEL-7/2017-08-30.3/RH7-RHOS-11.0/$basearch/os
[stack@puma33 ~]$ curl -s http://download-node-02.eng.bos.redhat.com/rcm-guest/puddles/OpenStack/11.0-RHEL-7/|grep z2|cut -f 3,4 -d'='
"[DIR]"> <a href="z2/">z2/</a> 30-Aug-2017 23:29

Check that we now have /128 ipv6 vips on the overcloud controllers:

[root@controller-0 ~]# pcs status |grep `hostname -s`|grep 2620
ip-2620.52.0.23b4..15 (ocf::heartbeat:IPaddr2): Started controller-0
ip-2620.52.0.23ae..16 (ocf::heartbeat:IPaddr2): Started controller-0
[root@controller-0 ~]# ip a show vlan189
11: vlan189: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 02:54:64:bd:07:fc brd ff:ff:ff:ff:ff:ff
    inet6 2620:52:0:23ae::16/128 scope global
[root@controller-0 ~]# ip a show vlan195
12: vlan195: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether be:38:68:39:09:73 brd ff:ff:ff:ff:ff:ff
    inet6 2620:52:0:23b4::15/128 scope global
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2721