Description of problem:
OSP11 -> OSP12 upgrade: after rebooting a controller node post upgrade, the interfaces set under OVS bridges come up at boot time with no network connectivity. As a result the controller node is not able to join the pacemaker cluster. The IP addresses appear to be set correctly on the interfaces, but they nevertheless have no connectivity to the networks they are attached to:

[root@controller-2 ~]# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 52:54:00:28:6d:7f brd ff:ff:ff:ff:ff:ff
    inet 192.168.24.9/24 brd 192.168.24.255 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe28:6d7f/64 scope link
       valid_lft forever preferred_lft forever
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UP qlen 1000
    link/ether 52:54:00:be:ee:c6 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:febe:eec6/64 scope link
       valid_lft forever preferred_lft forever
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast master ovs-system state UP qlen 1000
    link/ether 52:54:00:2e:2f:e3 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:fe2e:2fe3/64 scope link
       valid_lft forever preferred_lft forever
5: ovs-system: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 5a:4d:b9:46:a3:d7 brd ff:ff:ff:ff:ff:ff
6: br-int: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 56:29:f1:a0:b6:49 brd ff:ff:ff:ff:ff:ff
7: br-ex: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 52:54:00:2e:2f:e3 brd ff:ff:ff:ff:ff:ff
    inet 10.0.0.101/24 brd 10.0.0.255 scope global br-ex
       valid_lft forever preferred_lft forever
    inet6 fe80::5054:ff:fe2e:2fe3/64 scope link
       valid_lft forever preferred_lft forever
8: vxlan_sys_4789: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 65470 qdisc noqueue master ovs-system state UNKNOWN qlen 1000
    link/ether a6:e2:bb:8f:3e:d3 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::a4e2:bbff:fe8f:3ed3/64 scope link
       valid_lft forever preferred_lft forever
9: br-tun: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN qlen 1000
    link/ether 92:23:0b:70:d2:4b brd ff:ff:ff:ff:ff:ff
10: vlan40: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether ca:54:f7:5a:9a:ee brd ff:ff:ff:ff:ff:ff
    inet 172.17.4.15/24 brd 172.17.4.255 scope global vlan40
       valid_lft forever preferred_lft forever
    inet6 fe80::c854:f7ff:fe5a:9aee/64 scope link
       valid_lft forever preferred_lft forever
11: vlan20: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 06:d1:9f:a2:26:1d brd ff:ff:ff:ff:ff:ff
    inet 172.17.1.10/24 brd 172.17.1.255 scope global vlan20
       valid_lft forever preferred_lft forever
    inet6 fe80::4d1:9fff:fea2:261d/64 scope link
       valid_lft forever preferred_lft forever
12: vlan30: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 76:69:58:29:fc:23 brd ff:ff:ff:ff:ff:ff
    inet 172.17.3.16/24 brd 172.17.3.255 scope global vlan30
       valid_lft forever preferred_lft forever
    inet6 fe80::7469:58ff:fe29:fc23/64 scope link
       valid_lft forever preferred_lft forever
13: vlan50: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether e6:1a:31:a2:c6:8f brd ff:ff:ff:ff:ff:ff
    inet 172.17.2.14/24 brd 172.17.2.255 scope global vlan50
       valid_lft forever preferred_lft forever
    inet6 fe80::e41a:31ff:fea2:c68f/64 scope link
       valid_lft forever preferred_lft forever
14: br-isolated: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN qlen 1000
    link/ether 52:54:00:be:ee:c6 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::5054:ff:febe:eec6/64 scope link
       valid_lft forever preferred_lft forever
15: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN
    link/ether 02:42:67:61:56:17 brd ff:ff:ff:ff:ff:ff
    inet 172.31.0.1/24 scope global docker0
       valid_lft forever preferred_lft forever

[root@controller-2 ~]# ip r
default via 10.0.0.1 dev br-ex
10.0.0.0/24 dev br-ex proto kernel scope link src 10.0.0.101
169.254.169.254 via 192.168.24.1 dev eth0
172.17.1.0/24 dev vlan20 proto kernel scope link src 172.17.1.10
172.17.2.0/24 dev vlan50 proto kernel scope link src 172.17.2.14
172.17.3.0/24 dev vlan30 proto kernel scope link src 172.17.3.16
172.17.4.0/24 dev vlan40 proto kernel scope link src 172.17.4.15
172.31.0.0/24 dev docker0 proto kernel scope link src 172.31.0.1
192.168.24.0/24 dev eth0 proto kernel scope link src 192.168.24.9

[root@controller-2 ~]# ping 10.0.0.1
PING 10.0.0.1 (10.0.0.1) 56(84) bytes of data.
^C
--- 10.0.0.1 ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 2002ms

Attaching sosreports.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Deploy OSP11 with 3 controllers + 1 compute
2. Upgrade to OSP12
3. Reboot one of the controller nodes
4. SSH to the controller node and check pcs status

Actual results:
The node is isolated from the rest of the cluster nodes.

Expected results:
The node has connectivity on the existing interfaces after reboot.

Additional info:
It looks like /etc/os-net-config/config.json is empty:

[root@controller-2 ~]# wc /etc/os-net-config/config.json
0 0 0 /etc/os-net-config/config.json

In /var/log/openvswitch/ovs-vswitchd.log we can notice:

2018-03-21T18:59:54.389Z|00112|rconn|WARN|br-isolated<->tcp:127.0.0.1:6633: connection failed (Connection refused)
2018-03-21T18:59:56.589Z|00113|bridge|INFO|bridge br-int: deleted interface int-br-ex on port 1
2018-03-21T18:59:56.595Z|00114|bridge|INFO|bridge br-ex: deleted interface phy-br-ex on port 2
2018-03-21T18:59:56.607Z|00115|bridge|INFO|bridge br-int: deleted interface int-br-isolated on port 2
2018-03-21T18:59:56.613Z|00116|bridge|INFO|bridge br-isolated: deleted interface phy-br-isolated on port 6
2018-03-21T19:00:02.392Z|00117|rconn|INFO|br-int<->tcp:127.0.0.1:6633: connected
2018-03-21T19:00:02.393Z|00118|rconn|INFO|br-ex<->tcp:127.0.0.1:6633: connected
2018-03-21T19:00:02.393Z|00119|rconn|INFO|br-tun<->tcp:127.0.0.1:6633: connected
2018-03-21T19:00:02.394Z|00120|rconn|INFO|br-isolated<->tcp:127.0.0.1:6633: connected
2018-03-21T19:00:02.396Z|00121|fail_open|WARN|No longer in fail-open mode
2018-03-21T19:00:02.396Z|00122|fail_open|WARN|No longer in fail-open mode
2018-03-21T19:00:13.156Z|00123|connmgr|INFO|br-int<->tcp:127.0.0.1:6633: 4 flow_mods 10 s ago (4 adds)
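A quick way to confirm this state on an affected node is something like the following (a sketch; ovs-vsctl and ovs-ofctl are the standard Open vSwitch CLIs, the paths are the ones shown above):

  wc -c /etc/os-net-config/config.json   # 0 bytes confirms the empty os-net-config config
  ovs-vsctl list-ports br-ex             # verify eth2 is still attached to the bridge
  ovs-ofctl dump-flows br-ex             # an empty flow table would explain the loss of connectivity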
The only thing that quickly comes to mind, given 'openvswitch' and 'os-net-config', is https://review.openstack.org/#/c/510577/1/puppet/services/tripleo-packages.yaml, which sets --no-activate on the os-net-config run during the upgrade for BZ 1491628. I am not sure yet whether it is related, or what happens with os-net-config after reboot.
The br-ex bridge is missing the flows again, so it doesn't switch packets.
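If that is the case, a one-off way to restore switching on the bridge until the configuration is fixed would be to re-add the standalone NORMAL flow (a sketch, assuming standalone fail_mode is the desired state, as in the ifcfg file in the next comment):

  ovs-ofctl add-flow br-ex "priority=0,actions=NORMAL"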
The symptom is the same as in bug 1473763: the br-ex ifcfg file doesn't contain the fix provided by that bug (https://review.openstack.org/#/c/496707/).

$ cat etc/sysconfig/network-scripts/ifcfg-br-ex
# This file is autogenerated by os-net-config
DEVICE=br-ex
ONBOOT=yes
HOTPLUG=no
NM_CONTROLLED=no
PEERDNS=no
DEVICETYPE=ovs
TYPE=OVSBridge
BOOTPROTO=static
IPADDR=10.0.0.101
NETMASK=255.255.255.0
OVS_EXTRA="set bridge br-ex other-config:hwaddr=52:54:00:2e:2f:e3 -- set bridge br-ex fail_mode=standalone"

^^ note that " -- del-controller br-ex" is missing from the ifcfg-br-ex file.
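For comparison, with the fix from https://review.openstack.org/#/c/496707/ applied, the OVS_EXTRA line would be expected to carry the extra del-controller clause, roughly:

  OVS_EXTRA="set bridge br-ex other-config:hwaddr=52:54:00:2e:2f:e3 -- set bridge br-ex fail_mode=standalone -- del-controller br-ex"

(the exact ordering of the clauses may differ; the point is that the " -- del-controller br-ex" part is absent here.)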
So the question is whether os-net-config was updated before the upgrade as stated at https://bugzilla.redhat.com/show_bug.cgi?id=1491628#c18
(In reply to Jakub Libosvar from comment #5)
> So the question is whether os-net-config was updated before the upgrade as
> stated at https://bugzilla.redhat.com/show_bug.cgi?id=1491628#c18

According to the logs, os-net-config did not get updated before the upgrade, because the 'test -s /etc/os-net-config/config.json' command failed:

Mar 20 22:14:16 localhost os-collect-config: TASK [Check that os-net-config has configuration] ******************************
Mar 20 22:14:16 localhost os-collect-config: fatal: [localhost]: FAILED! => {"changed": true, "cmd": "test -s /etc/os-net-config/config.json", "delta": "0:00:00.004004", "end": "2018-03-21 02:09:41.930566", "msg": "non-zero return code", "rc": 1, "start": "2018-03-21 02:09:41.926562", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
Mar 20 22:14:16 localhost os-collect-config: ...ignoring
Mar 20 22:14:16 localhost os-collect-config: TASK [Upgrade os-net-config] ***************************************************
Mar 20 22:14:16 localhost os-collect-config: skipping: [localhost]
Mar 20 22:14:16 localhost os-collect-config: TASK [take new os-net-config parameters into account now] **********************
Mar 20 22:14:16 localhost os-collect-config: skipping: [localhost]
Mar 20 22:14:16 localhost os-collect-config: TASK [Update all packages] *****************************************************
Mar 20 22:14:16 localhost os-collect-config: changed: [localhost]
Mar 20 22:14:16 localhost os-collect-config: PLAY RECAP *********************************************************************
After some investigation: /etc/os-net-config/config.json is empty as a result of the OSP11 deployment (the issue described in bug 1514949). The upgrade relies on the config.json file being populated; otherwise os-net-config doesn't get updated before the rest of the packages: https://github.com/openstack/tripleo-heat-templates/blob/stable/pike/puppet/services/tripleo-packages.yaml#L61
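In shell terms that gate amounts to roughly this (a sketch based on the tasks visible in the os-collect-config log above and the --no-activate flag mentioned earlier, not the literal playbook):

  if test -s /etc/os-net-config/config.json; then
      yum -y update os-net-config                                      # update os-net-config first
      os-net-config --no-activate -c /etc/os-net-config/config.json -v # re-render the ifcfg files with the new version
  fi
  # with an empty config.json the test fails, both steps are skipped, and the later
  # blanket "Update all packages" step pulls in the new os-net-config without the
  # ifcfg regeneration, which is presumably why ifcfg-br-ex never picks up the
  # del-controller fix noted above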
Is there any chance we can implement a workaround automatically in the upgrade procedure to cover this use case and:

1/ populate /etc/os-net-config/config.json before starting upgrade
2/ remove /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json from overcloud nodes per https://bugzilla.redhat.com/show_bug.cgi?id=1514949#c8

This would make the upgrade experience seamless for operators who are in this position.
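A rough sketch of what 2/ above would look like as a manual step on each overcloud node before the upgrade (hypothetical command, assuming the stale empty template from the OSP11 image is what needs to go):

  sudo rm -f /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json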
(In reply to Marius Cornea from comment #8)
> Is there any chance we can implement a workaround automatically in the
> upgrade procedure to cover this use case and:
>
> 1/ populate /etc/os-net-config/config.json before starting upgrade
> 2/ remove /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json
> from overcloud nodes per https://bugzilla.redhat.com/show_bug.cgi?id=1514949#c8
>
> This would make the upgrade experience seamless for operators who are in
> this position.

Should the update process on 11 populate the file?
This fix - https://review.openstack.org/#/c/555452/ should resolve the issue with the empty config.json when backported to queens.
(In reply to Marius Cornea from comment #8)
> Is there any chance we can implement a workaround automatically in the
> upgrade procedure to cover this use case and:
>
> 1/ populate /etc/os-net-config/config.json before starting upgrade
> 2/ remove /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json
> from overcloud nodes per https://bugzilla.redhat.com/show_bug.cgi?id=1514949#c8
>
> This would make the upgrade experience seamless for operators who are in
> this position.

(adding a note here as briefly discussed with morazi/mcornea on irc)

2/ is easy enough, and [0] is a candidate for a 'general' place where we could add a validation. We could either add something to validate whether that file (the /libexec config.json) is present, or even just delete it unconditionally.

1/ is a bit harder, because the population of the config.json comes from the heat stack. I am still not clear why the config.json wasn't populated in this particular case. Since this was an upgrade from 11 to 12, the Pike templates, e.g. at [1], already define $network_config and pass it into an invocation of os-net-config. So that should have already happened during the upgrade init heat stack update, or it wasn't happening then either because we still had the /libexec template around. In that case, perhaps we need to add the removal of that os-net-config element file to the UpgradeInit, i.e. [2][3] (see the sketch after the links).

[0] https://github.com/openstack/tripleo-heat-templates/blob/0299096401e87a1f1996f981b8387f51640dc22e/puppet/services/tripleo-packages.yaml#L64
[1] https://github.com/openstack/tripleo-heat-templates/blob/fb82009d62c903f92f0cf287c01b5406c5d7da37/network/config/single-nic-vlans/controller.yaml#L87-L89
[2] https://github.com/openstack/tripleo-heat-templates/blob/fb82009d62c903f92f0cf287c01b5406c5d7da37/environments/major-upgrade-composable-steps.yaml#L7
[3] https://github.com/openstack/tripleo-heat-templates/blob/fb82009d62c903f92f0cf287c01b5406c5d7da37/environments/major-upgrade-composable-steps-docker.yaml#L7
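A sketch of the kind of command the upgrade init mechanism in [2]/[3] would need to carry (hypothetical wording, not the actual patch; the path is the one discussed in this bug):

  rm -f /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json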
> 1/ is a bit harder, because the population of the config.json comes from the heat
> stack. I am still not clear why we didn't have the config.json populated in this
> particular case. Since this was upgrade from 11..12, the pike templates e.g. at [1]
> are already defining $network_config and passing it into an invocation of
> os-net-config. So that should have already happened during the upgrade init heat
> stack update, or, it wasn't happening then either, because we still had the
> /libexec around?

Marios - I'm still unclear on all of the mechanics related to config.json, particularly on upgrade, but one thing to note is that config.json IS populated properly by os-net-config; it then gets overwritten when os-apply-config invokes os-net-config using the empty file from /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json. This is the behavior described at https://bugzilla.redhat.com/show_bug.cgi?id=1514949#c2 and https://bugzilla.redhat.com/show_bug.cgi?id=1514949#c4, and also seen by Dan and Marius in this setup - correct me if I'm wrong.
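A quick way to confirm that overwrite pattern on a node (a hypothetical check, using the paths discussed in this bug):

  ls -l /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json   # empty template shipped on pre-OSP-12 images
  wc -c /etc/os-net-config/config.json                                         # 0 bytes after os-apply-config has run

If the template exists and is empty, the behavior described above applies: the populated /etc/os-net-config/config.json gets clobbered on the next os-apply-config run.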
Also, /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json only exists on versions prior to OSP-12; in OSP-12 it was changed to element_config.json (https://review.openstack.org/#/c/470014/).
(In reply to Bob Fournier from comment #15)
> Also /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json
> only exists on versions prior to OSP-12, in OSP-12 it was changed to
> element_config.json (https://review.openstack.org/#/c/470014/).

ack, thanks for the clarifications Bob. I'll post something momentarily to remove /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json at the start of the upgrade (as per the discussion here and at https://bugzilla.redhat.com/show_bug.cgi?id=1514949#c5, that sounds like it should solve the problem).
Thanks Marios. After talking this over with Dan Prince, this should also work: https://review.openstack.org/#/c/556539/. It will remove the files used by os-apply-config (the old style nic config files) when run-os-net-config.sh (the new style nic config) is invoked, to avoid the overwrite problem on upgrade.
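Conceptually the cleanup is along these lines (a sketch of the approach, assuming the script simply removes the old-style os-apply-config template discussed above before applying the new config; see the review for the actual change):

  # inside run-os-net-config.sh, before os-net-config is applied (sketch)
  rm -f /usr/libexec/os-apply-config/templates/etc/os-net-config/config.json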
(In reply to Bob Fournier from comment #17)
> Thanks Marios. After talking this over with Dan Prince, this should also
> work: https://review.openstack.org/#/c/556539/. It will remove the files
> used by os-apply-config (the old style nic config files) when
> run-os-net-config.sh (the new style nic config) is invoked, to avoid the
> overwrite problem on upgrade.

ack, thanks - let's add that to the trackers. I also just posted https://review.openstack.org/556533, which removes the libexec config with the upgrade init command (we already have the mechanism for it; this just adds the command there). I don't see the harm in doing both - let's update the trackers?
Information for build openstack-tripleo-heat-templates-7.0.9-14.el7ost https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=679218
Verified. OC nodes managed to form the cluster after upgrade and reboot.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2331