Bug 1559151

Summary: OSP11 -> OSP12 upgrade: after rebooting controller nodes post upgrade at boot time interfaces set under ovs bridges have no network connectivity
Product: Red Hat OpenStack
Reporter: Marius Cornea <mcornea>
Component: openstack-tripleo-heat-templates
Assignee: Marios Andreou <mandreou>
Status: CLOSED ERRATA
QA Contact: Yurii Prokulevych <yprokule>
Severity: urgent
Docs Contact:
Priority: urgent
Version: 12.0 (Pike)
CC: bfournie, dbecker, jamsmith, jlibosva, jschluet, mandreou, mburns, mcornea, morazi, rhel-osp-director-maint, tvignaud
Target Milestone: z3
Keywords: Regression, Triaged, ZStream
Target Release: 12.0 (Pike)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: openstack-tripleo-heat-templates-7.0.9-14.el7ost
Doc Type: Bug Fix
Doc Text:
Connectivity problems that occurred after OSP11-to-OSP12 upgrades have been resolved by the removal of an obsolete network configuration file. The file was /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json. Its presence on post-upgrade systems caused connectivity problems after a reboot on any overcloud node. Interfaces set under OVS bridges had no connectivity. For example, controller nodes were unable to rejoin the pacemaker cluster. The upgrade process now removes the file and prevents the connectivity problems.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-08-20 12:59:48 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Marius Cornea 2018-03-21 19:48:57 UTC
Description of problem:

OSP11 -> OSP12 upgrade: after rebooting controller nodes post-upgrade, interfaces set under OVS bridges have no network connectivity at boot time. As a result, the controller node cannot join the pacemaker cluster. The IP addresses appear to be set correctly on the interfaces, but the interfaces nevertheless have no connectivity to the networks they are attached to:

[root@controller-2 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:28:6d:7f brd ff:ff:ff:ff:ff:ff
    inet 192.168.24.9/24 brd 192.168.24.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe28:6d7f/64 scope link 
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UP qlen 1000
    link/ether 52:54:00:be:ee:c6 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:febe:eec6/64 scope link 
       valid_lft forever preferred_lft forever
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UP qlen 1000
    link/ether 52:54:00:2e:2f:e3 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:fe2e:2fe3/64 scope link 
       valid_lft forever preferred_lft forever
5: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 5a:4d:b9:46:a3:d7 brd ff:ff:ff:ff:ff:ff
6: br-int: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 56:29:f1:a0:b6:49 brd ff:ff:ff:ff:ff:ff
7: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 52:54:00:2e:2f:e3 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.101/24 brd 10.0.0.255 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe2e:2fe3/64 scope link 
       valid_lft forever preferred_lft forever
8: vxlan_sys_4789: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65470 qdisc noqueue master ovs-system state UNKNOWN qlen 1000
    link/ether a6:e2:bb:8f:3e:d3 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::a4e2:bbff:fe8f:3ed3/64 scope link 
       valid_lft forever preferred_lft forever
9: br-tun: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 92:23:0b:70:d2:4b brd ff:ff:ff:ff:ff:ff
10: vlan40: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether ca:54:f7:5a:9a:ee brd ff:ff:ff:ff:ff:ff
    inet 172.17.4.15/24 brd 172.17.4.255 scope global vlan40
       valid_lft forever preferred_lft forever
    inet6 fe80::c854:f7ff:fe5a:9aee/64 scope link 
       valid_lft forever preferred_lft forever
11: vlan20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 06:d1:9f:a2:26:1d brd ff:ff:ff:ff:ff:ff
    inet 172.17.1.10/24 brd 172.17.1.255 scope global vlan20
       valid_lft forever preferred_lft forever
    inet6 fe80::4d1:9fff:fea2:261d/64 scope link 
       valid_lft forever preferred_lft forever
12: vlan30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 76:69:58:29:fc:23 brd ff:ff:ff:ff:ff:ff
    inet 172.17.3.16/24 brd 172.17.3.255 scope global vlan30
       valid_lft forever preferred_lft forever
    inet6 fe80::7469:58ff:fe29:fc23/64 scope link 
       valid_lft forever preferred_lft forever
13: vlan50: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether e6:1a:31:a2:c6:8f brd ff:ff:ff:ff:ff:ff
    inet 172.17.2.14/24 brd 172.17.2.255 scope global vlan50
       valid_lft forever preferred_lft forever
    inet6 fe80::e41a:31ff:fea2:c68f/64 scope link 
       valid_lft forever preferred_lft forever
14: br-isolated: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 52:54:00:be:ee:c6 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:febe:eec6/64 scope link 
       valid_lft forever preferred_lft forever
15: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN 
    link/ether 02:42:67:61:56:17 brd ff:ff:ff:ff:ff:ff
    inet 172.31.0.1/24 scope global docker0
       valid_lft forever preferred_lft forever
[root@controller-2 ~]# ip r
default via 10.0.0.1 dev br-ex 
10.0.0.0/24 dev br-ex proto kernel scope link src 10.0.0.101 
169.254.169.254 via 192.168.24.1 dev eth0 
172.17.1.0/24 dev vlan20 proto kernel scope link src 172.17.1.10 
172.17.2.0/24 dev vlan50 proto kernel scope link src 172.17.2.14 
172.17.3.0/24 dev vlan30 proto kernel scope link src 172.17.3.16 
172.17.4.0/24 dev vlan40 proto kernel scope link src 172.17.4.15 
172.31.0.0/24 dev docker0 proto kernel scope link src 172.31.0.1 
192.168.24.0/24 dev eth0 proto kernel scope link src 192.168.24.9 
[root@controller-2 ~]# ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
^C
--- 10.0.0.1 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2002ms

Attaching sosreports.


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Deploy OSP11 with 3 controllers + 1 compute
2. Upgrade to OSP12
3. Reboot one of the controller nodes
4. SSH to the controller node and check pcs status

Actual results:
The node is isolated from the rest of the cluster nodes.

Expected results:
The node has connectivity on the existing interfaces after reboot.

Additional info:

It looks like /etc/os-net-config/config.json is empty:

[root@controller-2 ~]# wc /etc/os-net-config/config.json 
0 0 0 /etc/os-net-config/config.json


In /var/log/openvswitch/ovs-vswitchd.log we can notice:

2018-03-21T18:59:54.389Z|00112|rconn|WARN|br-isolated<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2018-03-21T18:59:56.589Z|00113|bridge|INFO|bridge br-int: deleted interface int-br-ex on port 1
2018-03-21T18:59:56.595Z|00114|bridge|INFO|bridge br-ex: deleted interface phy-br-ex on port 2
2018-03-21T18:59:56.607Z|00115|bridge|INFO|bridge br-int: deleted interface int-br-isolated on port 2
2018-03-21T18:59:56.613Z|00116|bridge|INFO|bridge br-isolated: deleted interface phy-br-isolated on port 6
2018-03-21T19:00:02.392Z|00117|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connected
2018-03-21T19:00:02.393Z|00118|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected
2018-03-21T19:00:02.393Z|00119|rconn|INFO|br-tun<->tcp:127.0.0.1:6633: connected
2018-03-21T19:00:02.394Z|00120|rconn|INFO|br-isolated<->tcp:127.0.0.1:6633: connected
2018-03-21T19:00:02.396Z|00121|fail_open|WARN|No longer in fail-open mode
2018-03-21T19:00:02.396Z|00122|fail_open|WARN|No longer in fail-open mode
2018-03-21T19:00:13.156Z|00123|connmgr|INFO|br-int<->tcp:127.0.0.1:6633: 4 flow_mods 10 s ago (4 adds)
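
To pull the relevant records out of a longer ovs-vswitchd.log, something like the following may help. It is only a sketch: it assumes the pipe-separated layout seen in the excerpt above (timestamp|sequence|module|level|message), and the function name deleted_interfaces is made up for illustration.

```python
import re

# Sample lines copied from the ovs-vswitchd.log excerpt in this report.
SAMPLE_LOG = """\
2018-03-21T18:59:54.389Z|00112|rconn|WARN|br-isolated<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2018-03-21T18:59:56.589Z|00113|bridge|INFO|bridge br-int: deleted interface int-br-ex on port 1
2018-03-21T18:59:56.595Z|00114|bridge|INFO|bridge br-ex: deleted interface phy-br-ex on port 2
2018-03-21T19:00:02.393Z|00118|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected
"""

# Matches messages such as "bridge br-ex: deleted interface phy-br-ex on port 2".
DELETED_RE = re.compile(
    r"bridge (?P<bridge>\S+): deleted interface (?P<iface>\S+) on port (?P<port>\d+)"
)


def deleted_interfaces(log_text):
    """Return (timestamp, bridge, interface) for each 'deleted interface' record."""
    found = []
    for line in log_text.splitlines():
        # Assumed layout: timestamp|sequence|module|level|message
        parts = line.split("|", 4)
        if len(parts) != 5:
            continue
        timestamp, _seq, module, _level, message = parts
        if module != "bridge":
            continue
        match = DELETED_RE.search(message)
        if match:
            found.append((timestamp, match.group("bridge"), match.group("iface")))
    return found


if __name__ == "__main__":
    for timestamp, bridge, iface in deleted_interfaces(SAMPLE_LOG):
        print(timestamp, bridge, iface)
```

Running it against the full log file (read with open(...).read()) lists which patch ports were dropped from which bridge and when.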

Comment 2 Marios Andreou 2018-03-22 14:05:23 UTC
The only thing that quickly comes to mind, given 'openvswitch' and 'os-net-config', is this: https://review.openstack.org/#/c/510577/1/puppet/services/tripleo-packages.yaml which sets --no-activate on the os-net-config run during the upgrade for BZ 1491628. I am not sure yet whether it is related, though, or what happens with os-net-config after a reboot.

Comment 3 Jakub Libosvar 2018-03-22 14:25:44 UTC
The br-ex bridge is missing the flows again so it doesn't switch packets.

Comment 4 Jakub Libosvar 2018-03-22 14:39:22 UTC
The symptom is the same as in bug 1473763: the br-ex ifcfg file doesn't contain the fix provided by that bug: https://review.openstack.org/#/c/496707/

$ cat etc/sysconfig/network-scripts/ifcfg-br-ex
# This file is autogenerated by os-net-config
DEVICE=br-ex
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSBridge
BOOTPROTO=static
IPADDR=10.0.0.101
NETMASK=255.255.255.0
OVS_EXTRA="set bridge br-ex other-config:hwaddr=52:54:00:2e:2f:e3 -- set bridge br-ex fail_mode=standalone"

^^ see that " -- del-controller br-ex" is missing in the ifcfg-br-ex file
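
A quick way to run the same check on other nodes or bridges is sketched below. It assumes the fixed file carries " -- del-controller <bridge>" as one of the " -- "-separated commands in OVS_EXTRA, as quoted above; parse_ifcfg and has_del_controller are hypothetical helper names, not part of os-net-config.

```python
import shlex

# Trimmed copy of the ifcfg-br-ex content shown in this comment.
SAMPLE_IFCFG = '''\
# This file is autogenerated by os-net-config
DEVICE=br-ex
TYPE=OVSBridge
OVS_EXTRA="set bridge br-ex other-config:hwaddr=52:54:00:2e:2f:e3 -- set bridge br-ex fail_mode=standalone"
'''


def parse_ifcfg(text):
    """Parse KEY=value lines of an ifcfg file into a dict, stripping shell quotes."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        tokens = shlex.split(raw)
        values[key] = tokens[0] if tokens else ""
    return values


def has_del_controller(ifcfg_text):
    """Return True if OVS_EXTRA contains 'del-controller <DEVICE>' as a command."""
    cfg = parse_ifcfg(ifcfg_text)
    bridge = cfg.get("DEVICE", "")
    # OVS_EXTRA chains ovs-vsctl commands separated by " -- ".
    commands = [cmd.strip() for cmd in cfg.get("OVS_EXTRA", "").split(" -- ")]
    return f"del-controller {bridge}" in commands


if __name__ == "__main__":
    print("del-controller present:", has_del_controller(SAMPLE_IFCFG))
```

For the file pasted above the check reports that the command is absent, which matches the observation in this comment.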

Comment 5 Jakub Libosvar 2018-03-22 15:45:02 UTC
So the question is whether os-net-config was updated before the upgrade as stated at https://bugzilla.redhat.com/show_bug.cgi?id=1491628#c18

Comment 6 Marius Cornea 2018-03-22 16:14:31 UTC
(In reply to Jakub Libosvar from comment #5)
> So the question is whether os-net-config was updated before the upgrade as
> stated at https://bugzilla.redhat.com/show_bug.cgi?id=1491628#c18

According to the logs, os-net-config didn't get updated before the upgrade because the 'test -s /etc/os-net-config/config.json' command failed:

Mar 20 22:14:16 localhost os-collect-config: TASK [Check that os-net-config has configuration] ******************************
Mar 20 22:14:16 localhost os-collect-config: fatal: [localhost]: FAILED! => {"changed": true, "cmd": "test -s /etc/os-net-config/config.json", "delta": "0:00:00.004004", "end": "2018-03-21 02:09:41.930566", "msg": "non-zero return code", 
"rc": 1, "start": "2018-03-21 02:09:41.926562", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
Mar 20 22:14:16 localhost os-collect-config: ...ignoring
Mar 20 22:14:16 localhost os-collect-config: TASK [Upgrade os-net-config] ***************************************************
Mar 20 22:14:16 localhost os-collect-config: skipping: [localhost]
Mar 20 22:14:16 localhost os-collect-config: TASK [take new os-net-config parameters into account now] **********************
Mar 20 22:14:16 localhost os-collect-config: skipping: [localhost]
Mar 20 22:14:16 localhost os-collect-config: TASK [Update all packages] *****************************************************
Mar 20 22:14:16 localhost os-collect-config: changed: [localhost]
Mar 20 22:14:16 localhost os-collect-config: PLAY RECAP *********************************************************************

Comment 7 Marius Cornea 2018-03-22 18:41:35 UTC
After some investigation: /etc/os-net-config/config.json is empty as a result of the OSP11 deployment (issue described in bug 1514949).

The upgrade relies on the config.json file being populated; otherwise os-net-config doesn't get updated before the rest of the packages:

https://github.com/openstack/tripleo-heat-templates/blob/stable/pike/puppet/services/tripleo-packages.yaml#L61
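
The gating can be modelled roughly as follows. This is a simplified sketch, not the actual tripleo-packages.yaml tasks: the step names are copied from the os-collect-config log in comment 6, while has_config and plan_upgrade_steps are hypothetical names used only here.

```python
import os
import tempfile


def has_config(path):
    """Mimic `test -s path`: true only if the file exists and is non-empty."""
    try:
        return os.path.getsize(path) > 0
    except OSError:
        return False


def plan_upgrade_steps(config_path):
    """Return the upgrade steps that would run, in order, for this config file."""
    steps = []
    if has_config(config_path):
        # Only reached when config.json is populated.
        steps.append("Upgrade os-net-config")
        steps.append("take new os-net-config parameters into account now")
    # Runs either way, so an empty config.json silently skips the steps above.
    steps.append("Update all packages")
    return steps


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "config.json")
        open(path, "w").close()  # empty file, as found on controller-2
        print(plan_upgrade_steps(path))
```

With an empty file only "Update all packages" is planned, which is the sequence seen in the log: the check fails, is ignored, and both os-net-config tasks are skipped.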

Comment 8 Marius Cornea 2018-03-22 18:51:54 UTC
Is there any chance we can implement a workaround automatically in the upgrade procedure to cover this use case and:

1/ populate /etc/os-net-config/config.json before starting upgrade
2/ remove /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json from overcloud nodes per https://bugzilla.redhat.com/show_bug.cgi?id=1514949#c8

This would make the upgrade experience seamless for operators who are in this position.

Comment 9 Mike Burns 2018-03-22 20:03:49 UTC
(In reply to Marius Cornea from comment #8)
> Is there any chance we can implement a workaround automatically in the
> upgrade procedure to cover this use case and:
> 
> 1/ populate /etc/os-net-config/config.json before starting upgrade
> 2/ remove
> /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json from
> overcloud nodes per https://bugzilla.redhat.com/show_bug.cgi?id=1514949#c8
> 
> This would make the upgrade experience seamless for operators who are in
> this position.

Should the update process on 11 populate the file?

Comment 10 Bob Fournier 2018-03-22 23:51:49 UTC
This fix - https://review.openstack.org/#/c/555452/ should resolve the issue with the empty config.json when backported to queens.

Comment 13 Marios Andreou 2018-03-23 16:49:34 UTC
(In reply to Marius Cornea from comment #8)
> Is there any chance we can implement a workaround automatically in the
> upgrade procedure to cover this use case and:
> 
> 1/ populate /etc/os-net-config/config.json before starting upgrade
> 2/ remove
> /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json from
> overcloud nodes per https://bugzilla.redhat.com/show_bug.cgi?id=1514949#c8
> 
> This would make the upgrade experience seamless for operators who are in
> this position.

(adding a note here as briefly discussed with morazi/mcornea on irc):

2/ is easy enough and here [0] is a candidate for a 'general' place we could add a validation to. We could either add something to validate if that is present (/libexec config.json) or even just delete it unconditionally.

1/ is a bit harder, because the population of the config.json comes from the heat stack. I am still not clear why we didn't have the config.json populated in this particular case. Since this was an upgrade from 11..12, the pike templates, e.g. at [1], are already defining $network_config and passing it into an invocation of os-net-config. So that should have already happened during the upgrade init heat stack update. Or was it not happening then either, because we still had the /libexec file around?

In which case, perhaps we need to add the removal of that os-net-config element to the UpgradeInit, I mean [2][3].

[0] https://github.com/openstack/tripleo-heat-templates/blob/0299096401e87a1f1996f981b8387f51640dc22e/puppet/services/tripleo-packages.yaml#L64
[1]  https://github.com/openstack/tripleo-heat-templates/blob/fb82009d62c903f92f0cf287c01b5406c5d7da37/network/config/single-nic-vlans/controller.yaml#L87-L89
[2] https://github.com/openstack/tripleo-heat-templates/blob/fb82009d62c903f92f0cf287c01b5406c5d7da37/environments/major-upgrade-composable-steps.yaml#L7
[3] https://github.com/openstack/tripleo-heat-templates/blob/fb82009d62c903f92f0cf287c01b5406c5d7da37/environments/major-upgrade-composable-steps-docker.yaml#L7

Comment 14 Bob Fournier 2018-03-23 17:33:20 UTC
> 1/ is a bit harder, because the population of the config.json comes from the
> heat stack. I am still not clear why we didn't have the config.json
> populated in this particular case. Since this was upgrade from 11..12 .. the
> pike templates e.g. at [1] are already defining $network_config and passing
> it into an invocation of os-net-config. SO that should have already happened
> during the upgrade init heat stack update, or, it wasn't happening then
> either, because we still had the /libexec around?

Marios - I'm still unclear on all of the mechanics related to config.json particularly on upgrade but one thing to note is that config.json IS populated properly by os-net-config, but then it gets overwritten when os-apply-config invokes os-net-config using the empty file from /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json.

This is the behavior described here https://bugzilla.redhat.com/show_bug.cgi?id=1514949#c2 and https://bugzilla.redhat.com/show_bug.cgi?id=1514949#c4 and also seen by Dan and Marius in this setup; correct me if I'm wrong.

Comment 15 Bob Fournier 2018-03-23 20:40:56 UTC
Also, /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json only exists on versions prior to OSP-12; in OSP-12 it was changed to element_config.json (https://review.openstack.org/#/c/470014/).

Comment 16 Marios Andreou 2018-03-26 13:18:03 UTC
(In reply to Bob Fournier from comment #15)
> Also /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json
> only exists on versions prior to OSP-12, in OSP-12 it was changed to
> element_config.json (https://review.openstack.org/#/c/470014/).

Ack, thanks for the clarifications Bob. I'll post something momentarily to remove /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json at the start of the upgrade (as per the discussion here and at https://bugzilla.redhat.com/show_bug.cgi?id=1514949#c5, it sounds like that should solve the problem).

Comment 17 Bob Fournier 2018-03-26 14:27:52 UTC
Thanks Marios. After talking this over with Dan Prince, this should also work - https://review.openstack.org/#/c/556539/.  This will remove the files used by os-apply-config (old style nic config file) when run-os-net-config.sh is invoked (new style nic config) to avoid the overwrite problem on upgrade.

Comment 18 Marios Andreou 2018-03-26 14:46:03 UTC
(In reply to Bob Fournier from comment #17)
> Thanks Marios. After talking this over with Dan Prince, this should also
> work - https://review.openstack.org/#/c/556539/.  This will remove the files
> used by os-apply-config (old style nic config file) when
> run-os-net-config.sh is invoked (new style nic config) to avoid the
> overwrite problem on upgrade.

Ack, thanks, let's add that to the trackers. I also just posted https://review.openstack.org/556533, which removes the libexec config with the upgrade init command (we already have the mechanism for it; this just adds the command there). I don't see the harm in doing both. Let's update the trackers?
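
For illustration only, the effect of removing the obsolete template can be sketched like this. It is not the content of either review; the path is the one named in this bug, and remove_obsolete_template is a hypothetical helper that behaves like `rm -f` on that single file.

```python
import os
import tempfile

# Path of the obsolete os-apply-config template, as named in this bug.
OBSOLETE_TEMPLATE = (
    "/usr/libexec/os-apply-config/templates/etc/os-net-config/config.json"
)


def remove_obsolete_template(path=OBSOLETE_TEMPLATE):
    """Delete the template if present. Return True if a file was removed."""
    try:
        os.remove(path)
    except FileNotFoundError:
        # Already gone (for example on a fresh OSP12 node): nothing to do.
        return False
    return True


if __name__ == "__main__":
    # Demonstrate on a scratch file rather than touching the real path.
    with tempfile.TemporaryDirectory() as tmp:
        scratch = os.path.join(tmp, "config.json")
        open(scratch, "w").close()
        print(remove_obsolete_template(scratch))  # file existed and was removed
        print(remove_obsolete_template(scratch))  # second call is a no-op
```

Treating a missing file as success keeps the step safe to run from both places discussed here, since whichever runs second finds nothing left to delete.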

Comment 20 Marios Andreou 2018-04-27 13:29:58 UTC
Information for build openstack-tripleo-heat-templates-7.0.9-14.el7ost https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=679218

Comment 26 Yurii Prokulevych 2018-07-13 14:24:57 UTC
Verified. OC nodes managed to form the cluster after upgrade and reboot.

Comment 29 errata-xmlrpc 2018-08-20 12:59:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2331