Bug 1971715 - [OCP 4.7] "configure-ovs.sh" leaves static ip in old interface
Summary: [OCP 4.7] "configure-ovs.sh" leaves static ip in old interface
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.9.0
Assignee: Jaime Caamaño Ruiz
QA Contact: Ross Brattain
Docs Contact: Padraig O'Grady
URL:
Whiteboard:
Duplicates: 1967656 1970013 1978481 1992704 1999756
Depends On:
Blocks: 1975171 1976110
 
Reported: 2021-06-14 16:18 UTC by Mario Abajo
Modified: 2022-10-24 02:39 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Network configuration specific to ovn-kubernetes might be lost after cluster node reboot. Consequence: After cluster node reboot, on network configurations making use of link aggregation, cluster network connectivity is lost. Fix: Correctly persist ovn-kubernetes specific network configuration across reboots of the cluster nodes. Result: Network configuration specific to ovn-kubernetes persists across reboots and connectivity is not lost.
Clone Of:
Clones: 1975171
Environment:
Last Closed: 2021-10-18 17:33:59 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift machine-config-operator pull 2626 | 0 | None | closed | Bug 1971715: configure-ovs: fix nondeterministic master in slave profiles | 2021-08-09 13:07:26 UTC
Github | openshift machine-config-operator pull 2643 | 0 | None | closed | Bug 1971715: configure-ovs: fix bond ifcfg backed configuration | 2021-08-09 13:07:24 UTC
Github | openshift machine-config-operator pull 2706 | 0 | None | None | None | 2021-08-27 12:59:39 UTC
Red Hat Product Errata | RHSA-2021:3759 | 0 | None | None | None | 2021-10-18 17:34:23 UTC

Description Mario Abajo 2021-06-14 16:18:17 UTC
Description of problem:
In a new cluster installation (UPI, masters on vSphere 7.1), after the first reboot the nodes (in this particular case, the masters) have connection issues because of duplicate IP configuration. Both the initially configured interface and br-ex have the same IP address, and there is even a duplicate default route.

Version-Release number of selected component (if applicable):
OCP 4.7.13
Vsphere 7.1

How reproducible:
Not tested in lab

Steps to Reproduce:
1. Deploy OCP 4.7 UPI cluster with static ip address and bonding
2. Force a reboot of the nodes, e.g.: apply a new machine-config for setting up chrony
3. Nodes rebooted become notReady and network config is duplicated

Actual results:
System deploys initially but the next reboot will leave nodes in a bad state.
As a workaround, if the node is reachable via ssh, manually removing the ipv4 configuration from the NetworkManager connection profiles of the initially configured interfaces solves the issue.

Expected results:
IP configuration from the "old" interface should be removed so "br-ex" interface is the only one with the ip configuration
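
A minimal sketch for spotting the duplicated address described above. It assumes the one-line-per-address output of `ip -o -4 addr show` as input; the function name `find_duplicate_ipv4` is hypothetical and not part of configure-ovs.sh.

```shell
# Print each IPv4 address that is configured on more than one interface,
# followed by a comma-separated list of the interfaces that carry it.
# Input: the output of `ip -o -4 addr show` on stdin.
find_duplicate_ipv4() {
    awk '
        {
            split($4, parts, "/")        # $4 is address/prefix, e.g. 10.0.0.5/24
            addr = parts[1]
            if (count[addr]++)           # seen before: append this interface
                ifaces[addr] = ifaces[addr] "," $2
            else                         # first sighting of this address
                ifaces[addr] = $2
        }
        END {
            for (addr in count)
                if (count[addr] > 1)
                    print addr, ifaces[addr]
        }
    ' | sort
}
```

On an affected node, `ip -o -4 addr show | find_duplicate_ipv4` would be expected to list the static address together with both the bond interface and br-ex.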

Additional info:
More information in following posts

Comment 4 Jaime Caamaño Ruiz 2021-06-18 09:32:24 UTC
(In reply to Mario Abajo from comment #0)
> Expected results:
> IP configuration from the "old" interface should be removed so "br-ex"
> interface is the only one with the ip configuration

Hello Mario

My guess is that the configuration is preserved to be able to revert to it in different situations, but it is assigned less priority so that the new configuration takes precedence. After configure-ovs does its thing there are two connection profiles for the same bond interface, the previously existing 'pub0' and a new one, 'ovs-if-phys0', that has priority 100 and should be activated instead of 'pub0'.

What I believe is happening is that the auto-connect of the bond slave, ens192, causes a nondeterministic master profile to be activated before the priority of those profiles is ever considered, so in the end the 'pub0' connection profile is activated instead of the intended 'ovs-if-phys0'.

I think these commands before reboot for the scenario above would prevent the issue from happening:

nmcli conn modify ens192 connection.autoconnect no
nmcli conn modify ovs-if-phys0 connection.autoconnect-slaves 1
nmcli conn modify pub0 connection.autoconnect-slaves 1

The intention is that the slaves profile activation is deferred to activation of the master profile and that either of the two master profiles is activated by priority.
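
To illustrate the intended "activate by priority" outcome, here is a small sketch that picks the profile with the highest autoconnect priority from a list of candidates. It assumes input lines of the form `NAME:PRIORITY`, which is what `nmcli -t -f NAME,AUTOCONNECT-PRIORITY conn show` is expected to print, and that the caller has already narrowed the list to the profiles competing for the same interface. The function name is hypothetical, and profile names containing a colon are not handled.

```shell
# Print the profile name with the highest autoconnect priority.
# Input: lines of the form NAME:PRIORITY on stdin.
highest_priority_profile() {
    awk -F: '
        # Keep the first line, then replace it whenever a larger priority shows up.
        NR == 1 || $2 + 0 > best { best = $2 + 0; name = $1 }
        END { if (NR > 0) print name }
    '
}
```

For the two bond profiles in this report, `printf 'pub0:0\novs-if-phys0:100\n' | highest_priority_profile` selects ovs-if-phys0, which is the profile that should end up active.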

Let me know if you can try this out.

Comment 5 Jaime Caamaño Ruiz 2021-06-18 13:10:06 UTC
What I described above is not working reliably for me in my own tests and I don't really know why. It may have to do with the multi-stack master/slave ovs setup and the dependency that it generates among the connection profiles that are being activated simultaneously. I will let you try it out anyway if you can.

Given that, I think the best option is to set the connection.master property on bond slave connection profile to the master connection profile uuid instead of the master device name.

uuid=$(nmcli -g connection.uuid conn show ovs-if-phys0)
nmcli conn modify ens192 connection.master $uuid

This way the auto-connect of the bond slave ens192 activates the master ovs-if-phys0 profile instead of being undetermined between ovs-if-phys0 and pub0. It will require a bit more effort to handle the revert in situations that require it.
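
The two commands above can be wrapped in a small helper that refuses to modify the slave profile when the master UUID cannot be read. This is only a sketch under that assumption, not the change that was merged; the function name `pin_slave_to_master` is hypothetical.

```shell
# Point a bond slave profile at a specific master profile by UUID
# rather than by device name.
# $1: slave connection profile (e.g. ens192)
# $2: master connection profile (e.g. ovs-if-phys0)
pin_slave_to_master() {
    master_uuid="$(nmcli -g connection.uuid conn show "$2")" || return 1
    if [ -z "$master_uuid" ]; then
        echo "no UUID found for profile: $2" >&2
        return 1
    fi
    nmcli conn modify "$1" connection.master "$master_uuid"
}
```

Usage for the scenario in this bug would be `pin_slave_to_master ens192 ovs-if-phys0`, repeated for each slave of the bond.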

Comment 25 Tim Rozet 2021-06-23 17:03:40 UTC
*** Bug 1967656 has been marked as a duplicate of this bug. ***

Comment 31 Ross Brattain 2021-06-25 19:57:50 UTC
Ah so for vSphere UPI we are using /etc/sysconfig/network-scripts/ifcfg-bond0 so there is no file in /etc/NetworkManager/systemConnectionsMerged to match the uuid.

The egrep output is empty, so basename fails, xargs exits with 123, and we trigger the handle_exit_error

    conn_file="$(egrep -l uuid=$conn_uuid ${NM_CONN_PATH}/* | xargs basename)"

I think all our bonding testing is with /etc/sysconfig/network-scripts/ifcfg-*

All our non-bonding regular vSphere UPI testing also uses ifcfg files.
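
A sketch of a lookup that tolerates the "no matching file" case instead of failing, for illustration only. It is not the fix from the linked pull requests; the function name `find_conn_file` and its argument order are hypothetical.

```shell
# Print the basename of the first file under directory $2 that contains the
# line "uuid=$1". Print nothing, and still return 0, when no file matches
# (for example when the connection is backed by an ifcfg file).
find_conn_file() {
    # -l: list matching files, -s: stay quiet about unreadable entries,
    # -F -x: match the whole line literally.
    match="$(grep -l -s -F -x "uuid=$1" "$2"/* | head -n 1)" || true
    if [ -n "$match" ]; then
        basename "$match"
    fi
}
```

A caller can then test the result for emptiness and fall back to another strategy, such as cloning the connection, rather than exiting on the pipeline status.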

Comment 38 Ross Brattain 2021-06-30 06:49:47 UTC
https://github.com/openshift/machine-config-operator/pull/2644  looks good for ifcfg case with reboots.

I'll try to test the NetworkManager/system-connections case.

Comment 44 Jaime Caamaño Ruiz 2021-07-05 15:33:39 UTC
*** Bug 1978481 has been marked as a duplicate of this bug. ***

Comment 45 Federico Rossi 2021-07-06 22:20:33 UTC
This bugfix covers the ifcfg case, but in the case of BZ 1978481, where nmconnection files are generated directly by NMState, will it make bonding (static-ip) work as well?
Looking at the code in the pull request it should, just making sure that's the case, I will be able to retest it as needed when the fix lands on a 4.8.z.

Comment 46 Jaime Caamaño Ruiz 2021-07-07 08:41:58 UTC
It covers the case of network configuration backed by nmconnection files. Just in case, could you test it beforehand? Any early feedback is welcome.

Comment 47 Federico Rossi 2021-07-07 12:59:39 UTC
Yes I can. I am currently using this image https://mirror.openshift.com/pub/openshift-v4/amd64/dependencies/rhcos/pre-release/4.8.0-rc.1/rhcos-4.8.0-rc.1-x86_64-live.x86_64.iso for the install. If you have another one that contains the updated machine config operator, let me know and I can test it. Or let me know how I can get the updated MCO with the patched configure-ovs.sh script into the install ISO. Thanks.

Comment 51 Ross Brattain 2021-07-13 07:00:12 UTC
Tested https://github.com/openshift/machine-config-operator/pull/2644 on 4.8 with kernel boot arg bonding and node reboots, no issues.

ip=172.31.248.134::172.31.248.1:255.255.254.0:compute-0:bond0:none nameserver=10.3.192.12 bond=bond0:ens192,ens224,ens256:mode=active-backup,miimon=100

kernel bond= arg seems to create nmconnection files

bond0.nmconnection
ens192.nmconnection
ens224.nmconnection
ens256.nmconnection

bond= creating nmconnection files was unexpected as it is different from the process described in the https://access.redhat.com/solutions/4762021 solution.
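
For reference, the `bond=` value used above can be split into its parts as follows. This sketch assumes the `name:slaves:options` layout seen in the kernel arguments of this comment and ignores any further colon-separated fields; the function name `parse_bond_arg` is hypothetical.

```shell
# Split the value of a "bond=" kernel argument into its parts.
# $1: value after "bond=", e.g. bond0:ens192,ens224:mode=active-backup,miimon=100
# Prints three lines: name=..., slaves=... (space separated), options=...
parse_bond_arg() {
    bond_name="${1%%:*}"
    bond_slaves=""
    bond_options=""
    case "$1" in
        *:*)
            rest="${1#*:}"                      # everything after the bond name
            bond_slaves="${rest%%:*}"
            case "$rest" in
                *:*)
                    bond_options="${rest#*:}"   # everything after the slaves
                    bond_options="${bond_options%%:*}"
                    ;;
            esac
            ;;
    esac
    echo "name=${bond_name}"
    echo "slaves=$(echo "$bond_slaves" | tr ',' ' ')"
    echo "options=${bond_options}"
}
```

Each name on the `slaves=` line corresponds to one of the per-interface nmconnection files listed above, and `name=` to the bond's own file.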


Also tested with ifcfg bonding files and node reboots, no issues.

I'm going to try to test with ifcfg teaming configs as described in BZ 1977426

Comment 52 Ross Brattain 2021-07-14 07:08:01 UTC

Tested with teamd to verify BZ 1977426.  The first couple of reboots seemed okay, but twice with teamd one of the compute nodes was NotReady and didn't respond to external ssh for several minutes.  Trying to reproduce and gather logs.

Comment 53 Ross Brattain 2021-07-15 16:05:25 UTC
Reboots with teamd are working, nodes change from NotReady to Ready after a few minutes.

Verified on 4.8.0-0.ci.test-2021-07-14-151240-ci-ln-40pmlik-latest  with https://github.com/openshift/machine-config-operator/pull/2643

Comment 57 Ross Brattain 2021-08-02 05:13:42 UTC
Fails on a RHEL7.9 worker because the cloned connection file in /etc/NetworkManager/systemConnectionsMerged doesn't have the '.nmconnection' suffix.

# In RHEL7 files in /{etc,run}/NetworkManager/system-connections end without the suffix '.nmconnection', whereas in RHCOS they end with the suffix.


 ipv4.method:                            manual
 + echo 'Static IP addressing detected on default gateway connection: ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c'
 Static IP addressing detected on default gateway connection: ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c
 + egrep -l '--include=*.nmconnection' uuid=ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c /etc/NetworkManager/systemConnectionsMerged/bond0 /etc/NetworkManager/systemConnectionsMerged/br-ex /etc/NetworkManager/systemConnectionsMerged/ens192 /etc/NetworkManager/systemConnectionsMerged/ens192-slave-ovs-clone /etc/NetworkManager/systemConnectionsMerged/ens224 /etc/NetworkManager/systemConnectionsMerged/ens224-slave-ovs-clone /etc/NetworkManager/systemConnectionsMerged/ovs-if-phys0 /etc/NetworkManager/systemConnectionsMerged/ovs-port-br-ex /etc/NetworkManager/systemConnectionsMerged/ovs-port-phys0
 + echo 'WARN: unable to find NM configuration file for conn: ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c. Attempting to clone conn'
 WARN: unable to find NM configuration file for conn: ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c. Attempting to clone conn
 + old_conn_file=/etc/NetworkManager/systemConnectionsMerged/ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c-clone.nmconnection
 + nmcli conn clone ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c-clone
 bond0 (ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c) cloned as ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c-clone (619ad654-5dc9-4ee9-993e-1afd9d3f15f1).
 + cloned=true
 + '[' '!' -f /etc/NetworkManager/systemConnectionsMerged/ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c-clone.nmconnection ']'
 + echo 'ERROR: unable to locate cloned conn file: /etc/NetworkManager/systemConnectionsMerged/ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c-clone.nmconnection'
 ERROR: unable to locate cloned conn file: /etc/NetworkManager/systemConnectionsMerged/ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c-clone.nmconnection
 + exit 1
 + handle_exit_error
 + e=1
 + '[' 1 -eq 0 ']'
 + set +e
 + nmcli c show
 NAME                                        UUID                                  TYPE        DEVICE
 br-ex                                       8f9e3156-d7d2-4bfc-97df-76f05eabecbd  ovs-bridge  br-ex
 ens192-slave-ovs-clone                      fc19aa5d-db67-4e4f-86f0-edba82ca1cc5  ethernet    ens192
 ens224-slave-ovs-clone                      287e1eb8-a0a7-4272-8616-070764df617a  ethernet    ens224
 ovs-if-phys0                                40cb3785-78e7-4527-8923-d9c2362972ea  bond        bond0
 ovs-port-br-ex                              e378e5fe-be29-456e-b02f-069a225a4a9f  ovs-port    br-ex
 ovs-port-phys0                              31e04fbe-e992-437a-8012-01ef7f210808  ovs-port    bond0
 ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c-clone  619ad654-5dc9-4ee9-993e-1afd9d3f15f1  bond        --
 bond0                                       ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c  bond        --
 ens192                                      6190085c-0b4e-3feb-aa91-9101c8234c06  ethernet    --
 ens224                                      825b4e27-863b-36aa-9e91-bfe010649596  ethernet    --
 + nmcli conn up ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c
 Connection successfully activated (master waiting for slaves) (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/10)
 + exit 1

etc/NetworkManager/systemConnectionsMerged/
etc/NetworkManager/systemConnectionsMerged/bond0
etc/NetworkManager/systemConnectionsMerged/ens192
etc/NetworkManager/systemConnectionsMerged/ens224
etc/NetworkManager/systemConnectionsMerged/ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c-clone
etc/NetworkManager/systemConnectionsMerged/ens224-slave-ovs-clone
etc/NetworkManager/systemConnectionsMerged/ens192-slave-ovs-clone
etc/NetworkManager/systemConnectionsMerged/ovs-if-phys0
etc/NetworkManager/systemConnectionsMerged/ovs-port-br-ex
etc/NetworkManager/systemConnectionsMerged/ovs-port-phys0
etc/NetworkManager/systemConnectionsMerged/br-ex
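
The failure in the log above comes from looking only for the suffixed file name. A sketch of a lookup that accepts both naming styles follows; it is an illustration under that assumption, not the code from pull 2706, and the function name `resolve_conn_file` is hypothetical.

```shell
# Print the path of the keyfile for connection $2 under directory $1,
# trying the RHCOS-style name (with ".nmconnection") first and the
# suffix-less RHEL 7 name second. Returns 1 when neither file exists.
resolve_conn_file() {
    for candidate in "$1/$2.nmconnection" "$1/$2"; do
        if [ -f "$candidate" ]; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}
```

With the directory listing above, looking up the cloned connection by name would resolve to the suffix-less file instead of reporting it as missing.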

Comment 61 Ross Brattain 2021-08-11 20:36:35 UTC
With 4.9 and openshift/machine-config-operator/pull/2706 

Verified on vSphere:
RHCOS kernel args bond0 with static-ip
RHEL7.9 networkmanager bond0 with static-ip
RHCOS ifcfg teaming with static-ip   (for BZ 1934443)
RHEL7.9 networkmanager with bond0 DHCP
RHCOS baremetal-IPI networkmanager DHCP bond0 (for BZ 1979391)

Comment 62 Jaime Caamaño Ruiz 2021-08-17 10:05:55 UTC
*** Bug 1992704 has been marked as a duplicate of this bug. ***

Comment 65 Jaime Caamaño Ruiz 2021-08-27 10:13:03 UTC
*** Bug 1970013 has been marked as a duplicate of this bug. ***

Comment 66 Ross Brattain 2021-08-30 14:55:51 UTC
Verified on 4.9.0-0.nightly-2021-08-30-070917

Verified on vSphere, RHCOS kernel args bond0 with static-ip

Comment 69 Jaime Caamaño Ruiz 2021-09-21 16:11:53 UTC
*** Bug 1999756 has been marked as a duplicate of this bug. ***

Comment 72 errata-xmlrpc 2021-10-18 17:33:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

