Bug 1626357 - Undercloud - changing os-net-config conf kills undercloud_[admin, public]_host IPs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: openstack-tripleo-heat-templates
Version: 14.0 (Rocky)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: beta
Target Release: 14.0 (Rocky)
Assignee: Harald Jensås
QA Contact: mlammon
URL:
Whiteboard:
Depends On:
Blocks: 1625520
 
Reported: 2018-09-07 07:08 UTC by Harald Jensås
Modified: 2019-01-18 13:09 UTC (History)

Fixed In Version: openstack-tripleo-heat-templates-9.0.0-0.20180919080943.0rc1.0rc1.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-01-18 13:09:27 UTC
Target Upstream Version:
Embargoed:
hjensas: needinfo-




Links
System ID Private Priority Status Summary Last Updated
Launchpad 1791238 0 None None None 2018-09-07 07:08:02 UTC
OpenStack gerrit 603587 0 None MERGED Undercloud - Restart keepalived on update 2020-07-02 02:05:20 UTC
OpenStack gerrit 605604 0 None MERGED Undercloud - Restart keepalived on update 2020-07-02 02:05:20 UTC
Red Hat Product Errata RHEA-2019:0045 0 None None None 2019-01-11 11:53:12 UTC

Description Harald Jensås 2018-09-07 07:08:03 UTC
Description of problem:
In the containerized undercloud, re-running the installer removes the undercloud_admin_host and undercloud_public_host IP addresses if the os-net-config configuration has changed.

os-net-config restarts the br-ctlplane interface, which removes the undercloud_admin_host and undercloud_public_host IP addresses that keepalived set up. The install/update operation fails later on because services cannot connect to an IP address that is no longer there.

Version-Release number of selected component (if applicable):
Upstream current-dev used when I found the bug.

How reproducible:
100%

Steps to Reproduce:
1. Deploy undercloud
2. Change the undercloud_nameservers address in undercloud.conf

sed -i 's/undercloud_nameservers = <old-address>/undercloud_nameservers = <new-address>/g' /home/stack/undercloud.conf

3. Re-run undercloud install

openstack undercloud install
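
Step 2 can also be scripted without sed. A minimal sketch, assuming undercloud.conf holds a single `undercloud_nameservers = <value>` line; the function name `set_nameservers` is illustrative and not part of any TripleO tooling:

```python
import re


def set_nameservers(conf_text: str, new_value: str) -> str:
    """Return conf_text with the undercloud_nameservers value replaced.

    Only the value after '=' is rewritten; every other line is kept as-is.
    """
    # [ \t]* (not \s*) so the match never runs past the end of the line.
    pattern = re.compile(
        r"^([ \t]*undercloud_nameservers[ \t]*=[ \t]*).*$", re.MULTILINE
    )
    return pattern.sub(lambda m: m.group(1) + new_value, conf_text)
```

Reading /home/stack/undercloud.conf, passing its text through the function and writing the result back reproduces the edit in step 2.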

Additional reproducer:
----------------------

1. Deploy undercloud with routed networks enabled
2. Add more subnets to prepare the undercloud for scale-out to additional routed network leaves
3. Re-run the undercloud installer

  Because additional routes for the ctlplane network traffic are added, os-net-config re-runs as well, and the restart of br-ctlplane kills the VIPs.


Actual results:
Undercloud update fails.

Expected results:
Undercloud update should succeed.

Additional info:


1. The os-net-config config.json is updated with the new DNS server.

Every 5.0s: diff -aur /etc/os-net-config/config.json /tmp/os-net-config.json.orig                                                                                                          Fri Sep  7 08:51:26 2018

--- /etc/os-net-config/config.json      2018-09-07 08:45:39.054174371 +0200
+++ /tmp/os-net-config.json.orig        2018-09-07 08:17:38.597808977 +0200
@@ -1 +1 @@
-{"network_config": [{"addresses": [{"ip_netmask": "172.20.0.200/26"}], "dns_servers": ["192.168.122.1"], "members": [{"mtu": 1500, "name": "eth1", "primary": true, "type": "interface"}], "name": "br-ctlplane", "ovs_extra": ["br-set-external-id br-ctlplane bridge-id br-ctlplane"], "routes": [], "type": "ovs_bridge", "use_dhcp": false}]}
+{"network_config": [{"addresses": [{"ip_netmask": "172.20.0.200/26"}], "dns_servers": ["172.20.0.254"], "members": [{"mtu": 1500, "name": "eth1", "primary": true, "type": "interface"}], "name": "br-ctlplane", "ovs_extra": ["br-set-external-id br-ctlplane bridge-id br-ctlplane"], "routes": [], "type": "ovs_bridge", "use_dhcp": false}]}
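
To see at a glance which os-net-config keys changed between the two files, the JSON can be compared per interface. A minimal sketch; `changed_keys` is a hypothetical helper, and it assumes both documents have the `network_config` list layout shown in the diff above:

```python
import json


def changed_keys(old_json: str, new_json: str) -> dict:
    """Map each interface name to the sorted list of keys whose values differ."""
    old = {i["name"]: i for i in json.loads(old_json)["network_config"]}
    new = {i["name"]: i for i in json.loads(new_json)["network_config"]}
    result = {}
    for name in sorted(set(old) | set(new)):
        # An interface present in only one file compares against an empty dict.
        a, b = old.get(name, {}), new.get(name, {})
        keys = sorted(k for k in set(a) | set(b) if a.get(k) != b.get(k))
        if keys:
            result[name] = keys
    return result
```

For the two versions in the diff above this reports only `dns_servers` on `br-ctlplane`, which matches the reproducer: a nameserver change alone is what triggers the bridge restart.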

2. After os-net-config applies the config, the keepalived VIPs are gone:

Every 2.0s: ip addr show br-ctlplane                                                                                                                                                       Fri Sep  7 08:51:08 2018

47: br-ctlplane: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ether 52:54:00:7a:f6:c5 brd ff:ff:ff:ff:ff:ff
    inet 172.20.0.200/26 brd 172.20.0.255 scope global br-ctlplane
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe7a:f6c5/64 scope link
       valid_lft forever preferred_lft forever
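
The missing VIP can be detected mechanically from the same output. A minimal sketch, assuming the text of `ip addr show br-ctlplane` is passed in as a string; `missing_vips` is a hypothetical helper, and 172.20.0.201 is the address from the connection error in this report:

```python
import ipaddress
import re
from typing import List


def missing_vips(ip_addr_output: str, expected_vips: List[str]) -> List[str]:
    """Return the expected addresses that do not appear in `ip addr show` output."""
    # Collect every "inet <addr>/<prefix>" and "inet6 <addr>/<prefix>" entry.
    present = {
        ipaddress.ip_address(m.group(1))
        for m in re.finditer(
            r"^\s*inet6?\s+([0-9A-Fa-f:.]+)/\d+", ip_addr_output, re.MULTILINE
        )
    }
    return [v for v in expected_vips if ipaddress.ip_address(v) not in present]
```

With the output above and `["172.20.0.201"]` as the expected list, the function returns that address as missing; 172.20.0.200 would not be reported because it is still configured on the bridge.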

3. The upgrade is stuck on starting the containers:

TASK [Start containers for step 3] **********************************************

4. Logs show that services fail to connect to the database via the keepalived VIPs:

/var/log/containers/nova/nova-compute.log:2018-09-07 08:52:47.462 6 ERROR oslo_service.periodic_task RemoteError: Remote error: DBConnectionError (pymysql.err.OperationalError) (2003, "Can't connect to MySQL server on '172.20.0.201' ([Errno 113] EHOSTUNREACH)") (Background on this error at: http://sqlalche.me/e/e3q8)
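
The same failure can be probed directly, without waiting for a service to log it. A minimal sketch using only the Python standard library; `can_connect` is a hypothetical helper, and port 3306 is the MySQL default rather than a value taken from this report:

```python
import socket


def can_connect(host: str, port: int = 3306, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers EHOSTUNREACH, connection refused and timeouts alike.
        return False
```

On an affected undercloud, `can_connect("172.20.0.201")` would be expected to return False for as long as the VIP is missing from br-ctlplane.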

Comment 2 Harald Jensås 2018-09-10 17:25:27 UTC
Recent versions of keepalived support 'dynamic_interfaces', which looks like it would solve this problem. Would we have to package keepalived 2.0.x in RDO?


 # Allow configuration to include interfaces that don't exist at startup.
 # This allows keepalived to work with interfaces that may be deleted and restored
 # and also allows virtual and static routes and rules on VMAC interfaces.
   dynamic_interfaces

I built keepalived-2.0.6-1.el7.x86_64.rpm on CentOS 7 using the SRPM[1] from Fedora Rawhide. (The RPM builds with only a small tweak.)

Enabling dynamic_interfaces and using 2.0.6 version of keepalived in the keepalived container fixes this issue.


Suggest we package keepalived 2.0.x and place this in the OSP repositories.



[1] https://sjc.edge.kernel.org/fedora-buffet/fedora/linux/development/rawhide/Everything/source/tree/Packages/k/keepalived-2.0.6-1.fc29.src.rpm

Comment 3 Michele Baldessari 2018-09-17 18:18:22 UTC
*** Bug 1498639 has been marked as a duplicate of this bug. ***

Comment 6 Harald Jensås 2018-09-19 08:10:16 UTC
I proposed the following change: https://review.openstack.org/603587

This implements a workaround similar to the one used in the pre-containerized undercloud, ensuring keepalived is restarted when the undercloud installer is run.

This change fixes the problem described in this bug, which caused some undercloud config changes to fail. However, it does not fix the issue described in https://bugzilla.redhat.com/show_bug.cgi?id=1498639; to fix that, we would need a new version of keepalived.

Comment 18 errata-xmlrpc 2019-01-11 11:52:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:0045

Comment 19 Itzik Brown 2019-01-15 13:08:36 UTC
Hi,

It happens to me with OSP14 puddle: 2019-01-08.1

Comment 21 Bob Fournier 2019-01-18 13:09:27 UTC
Per Comment 18, this bug should not be reopened; please open a new bug. In the new bug, please describe the changes that were made, i.e. what undercloud_nameservers was set to before and after the change. Please also include a link to the sosreport.

