Bug 1626357

Summary: Undercloud - changing os-net-config conf kills undercloud_[admin, public]_host IPs
Product: Red Hat OpenStack
Reporter: Harald Jensås <hjensas>
Component: openstack-tripleo-heat-templates
Assignee: Harald Jensås <hjensas>
Status: CLOSED ERRATA
QA Contact: mlammon
Severity: high
Priority: high
Version: 14.0 (Rocky)
CC: aschultz, bfournie, bjacot, dbecker, hjensas, itbrown, mburns, morazi, racedoro, sasha, sclewis
Target Milestone: beta
Keywords: Reopened, Triaged
Target Release: 14.0 (Rocky)
Flags: hjensas: needinfo-
Hardware: Unspecified
OS: Unspecified
Fixed In Version: openstack-tripleo-heat-templates-9.0.0-0.20180919080943.0rc1.0rc1.el7ost
Last Closed: 2019-01-18 13:09:27 UTC
Type: Bug
Bug Blocks: 1625520

Description Harald Jensås 2018-09-07 07:08:03 UTC
Description of problem:
In the containerized undercloud, re-running the installer removes the undercloud_admin_host and undercloud_public_host IP addresses if the os-net-config configuration has changed.

os-net-config restarts the br-ctlplane interface, which removes the undercloud_admin_host and undercloud_public_host IP addresses set up by keepalived. The install/update operation then fails later on because services cannot connect to an IP address that is no longer there.

Version-Release number of selected component (if applicable):
Upstream current-dev used when I found the bug.

How reproducible:
100%

Steps to Reproduce:
1. Deploy undercloud
2. Change the undercloud_nameservers address in undercloud.conf

sed -i 's/undercloud_nameservers = <old-address>/undercloud_nameservers = <new-address>/g' /home/stack/undercloud.conf

3. Re-run undercloud install

openstack undercloud install

Additional reproducer:
----------------------

1. Deploy undercloud with routed networks enabled
2. Add more subnets to prepare the undercloud for scale-out to additional routed network leafs
3. Re-run the undercloud installer

  Because additional routes for the ctlplane network traffic are added, os-net-config re-runs as well, and the restart of br-ctlplane kills the VIPs.


Actual results:
Undercloud update fails.

Expected results:
Undercloud update should succeed.

Additional info:


1. The os-net-config config.json is updated with the new DNS server.

Every 5.0s: diff -aur /etc/os-net-config/config.json /tmp/os-net-config.json.orig                                                                                                          Fri Sep  7 08:51:26 2018

--- /etc/os-net-config/config.json      2018-09-07 08:45:39.054174371 +0200
+++ /tmp/os-net-config.json.orig        2018-09-07 08:17:38.597808977 +0200
@@ -1 +1 @@
-{"network_config": [{"addresses": [{"ip_netmask": "172.20.0.200/26"}], "dns_servers": ["192.168.122.1"], "members": [{"mtu": 1500, "name": "eth1", "primary": true, "type": "interface"}], "name": "br-ctlplane", "ovs_extra": ["br-set-external-id br-ctlplane bridge-id br-ctlplane"], "routes": [], "type": "ovs_bridge", "use_dhcp": false}]}
+{"network_config": [{"addresses": [{"ip_netmask": "172.20.0.200/26"}], "dns_servers": ["172.20.0.254"], "members": [{"mtu": 1500, "name": "eth1", "primary": true, "type": "interface"}], "name": "br-ctlplane", "ovs_extra": ["br-set-external-id br-ctlplane bridge-id br-ctlplane"], "routes": [], "type": "ovs_bridge", "use_dhcp": false}]}
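
A single-line JSON diff like the one above is hard to read. As a rough sketch (not part of os-net-config or TripleO), the Python below loads two os-net-config documents and reports which keys differ for each interface. The function name `changed_keys` and the constants `ORIG` and `LIVE` are hypothetical, and the sample data is a trimmed copy of the two files shown in the diff.

```python
import json


def changed_keys(cfg_a, cfg_b):
    """Return {interface name: sorted list of keys whose values differ}."""
    a_by_name = {entry["name"]: entry for entry in cfg_a["network_config"]}
    b_by_name = {entry["name"]: entry for entry in cfg_b["network_config"]}
    result = {}
    for name in sorted(set(a_by_name) | set(b_by_name)):
        a_entry = a_by_name.get(name, {})
        b_entry = b_by_name.get(name, {})
        keys = sorted(
            key for key in set(a_entry) | set(b_entry)
            if a_entry.get(key) != b_entry.get(key)
        )
        if keys:
            result[name] = keys
    return result


# Trimmed copies of /tmp/os-net-config.json.orig and /etc/os-net-config/config.json
# as shown in the diff (members and ovs_extra omitted for brevity).
ORIG = json.loads(
    '{"network_config": [{"addresses": [{"ip_netmask": "172.20.0.200/26"}],'
    ' "dns_servers": ["172.20.0.254"], "name": "br-ctlplane", "routes": [],'
    ' "type": "ovs_bridge", "use_dhcp": false}]}'
)
LIVE = json.loads(
    '{"network_config": [{"addresses": [{"ip_netmask": "172.20.0.200/26"}],'
    ' "dns_servers": ["192.168.122.1"], "name": "br-ctlplane", "routes": [],'
    ' "type": "ovs_bridge", "use_dhcp": false}]}'
)

print(changed_keys(ORIG, LIVE))  # {'br-ctlplane': ['dns_servers']}
```

Only dns_servers differs, yet the whole br-ctlplane bridge is restarted when the new configuration is applied.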

2. After os-net-config applies the configuration, the keepalived VIPs are gone:

Every 2.0s: ip addr show br-ctlplane                                                                                                                                                       Fri Sep  7 08:51:08 2018

47: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 52:54:00:7a:f6:c5 brd ff:ff:ff:ff:ff:ff
    inet 172.20.0.200/26 brd 172.20.0.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe7a:f6c5/64 scope link
       valid_lft forever preferred_lft forever
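
The check above can also be scripted. The following is a minimal Python sketch, assuming the `ip addr show` text has already been captured. It extracts the IPv4 addresses and reports which expected VIPs are missing. The helper names `ipv4_addresses` and `missing_vips` are hypothetical. The VIP 172.20.0.201 is taken from the database error in the nova-compute log quoted in this report.

```python
import re


def ipv4_addresses(ip_addr_output):
    """Extract IPv4 addresses (without prefix length) from `ip addr show` text."""
    # "inet " with a trailing space does not match the "inet6" lines.
    return re.findall(
        r"^\s*inet (\d+\.\d+\.\d+\.\d+)/\d+", ip_addr_output, flags=re.MULTILINE
    )


def missing_vips(ip_addr_output, expected_vips):
    """Return the expected VIPs that are not configured on the interface."""
    present = set(ipv4_addresses(ip_addr_output))
    return [vip for vip in expected_vips if vip not in present]


# Output copied from the `ip addr show br-ctlplane` listing above.
SAMPLE = """\
47: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 52:54:00:7a:f6:c5 brd ff:ff:ff:ff:ff:ff
    inet 172.20.0.200/26 brd 172.20.0.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe7a:f6c5/64 scope link
       valid_lft forever preferred_lft forever
"""

print(ipv4_addresses(SAMPLE))                  # ['172.20.0.200']
print(missing_vips(SAMPLE, ["172.20.0.201"]))  # ['172.20.0.201']
```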

3. The upgrade is stuck on starting the containers:

TASK [Start containers for step 3] **********************************************

4. Logs show that services are failing to connect to the database via the keepalived VIPs:

/var/log/containers/nova/nova-compute.log:2018-09-07 08:52:47.462 6 ERROR oslo_service.periodic_task RemoteError: Remote error: DBConnectionError (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '172.20.0.201' ([Errno 113] EHOSTUNREACH)") (Background on this error at: http://sqlalche.me/e/e3q8)
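
To find which addresses the services can no longer reach, log lines like the one above can be scanned for the MySQL connection error. This is only an illustrative Python sketch: the regular expression is written against the single log line quoted here, and the name `unreachable_db_hosts` is hypothetical.

```python
import re

# Matches: Can't connect to MySQL server on '<host>' ([Errno <n>] <NAME>)
PATTERN = re.compile(
    r"Can't connect to MySQL server on '([^']+)' \(\[Errno (\d+)\] (\w+)\)"
)


def unreachable_db_hosts(lines):
    """Return (host, errno, errno_name) for each line reporting a failed DB connection."""
    hits = []
    for line in lines:
        match = PATTERN.search(line)
        if match:
            hits.append((match.group(1), int(match.group(2)), match.group(3)))
    return hits


# The log line quoted above, without the file name prefix.
SAMPLE_LINE = """2018-09-07 08:52:47.462 6 ERROR oslo_service.periodic_task RemoteError: Remote error: DBConnectionError (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '172.20.0.201' ([Errno 113] EHOSTUNREACH)") (Background on this error at: http://sqlalche.me/e/e3q8)"""

print(unreachable_db_hosts([SAMPLE_LINE]))  # [('172.20.0.201', 113, 'EHOSTUNREACH')]
```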

Comment 2 Harald Jensås 2018-09-10 17:25:27 UTC
Recent versions of keepalived support 'dynamic_interfaces', which looks like it would solve this problem. We would have to package keepalived 2.0.x in RDO? The option is described as follows:


 # Allow configuration to include interfaces that don't exist at startup.
 # This allows keepalived to work with interfaces that may be deleted and restored
 # and also allows virtual and static routes and rules on VMAC interfaces.
   dynamic_interfaces

I built keepalived-2.0.6-1.el7.x86_64.rpm on CentOS 7 using the SRPM[1] from Fedora Rawhide. (The RPM builds with only a small tweak.)

Enabling dynamic_interfaces and using version 2.0.6 of keepalived in the keepalived container fixes this issue.


Suggest we package keepalived 2.0.x and place this in the OSP repositories.



[1] https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/development/rawhide/Everything/source/tree/Packages/k/keepalived-2.0.6-1.fc29.src.rpm

Comment 3 Michele Baldessari 2018-09-17 18:18:22 UTC
*** Bug 1498639 has been marked as a duplicate of this bug. ***

Comment 6 Harald Jensås 2018-09-19 08:10:16 UTC
I proposed the following change: https://review.openstack.org/603587

This implements a workaround similar to the one used in the pre-containerized undercloud, ensuring keepalived is restarted when the undercloud installer is run.

This change fixes the problem described in this bug, where some undercloud configuration changes cause the re-run to fail. However, it does not fix the issue described in https://bugzilla.redhat.com/show_bug.cgi?id=1498639; fixing that would require a new version of keepalived.

Comment 18 errata-xmlrpc 2019-01-11 11:52:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045

Comment 19 Itzik Brown 2019-01-15 13:08:36 UTC
Hi,

It happens to me with OSP14 puddle: 2019-01-08.1

Comment 21 Bob Fournier 2019-01-18 13:09:27 UTC
Per Comment 18, this bug should not be reopened; please open a new bug. In the new bug, please describe the changes that were made, i.e. what undercloud_nameservers was before and after the change, and include a link to the sosreport.