Description of problem:
In a new cluster installation (UPI, masters on vSphere 7.1), after the first reboot the nodes (in this particular case, the masters) have connection issues because of a duplicate IP configuration. Both the initially configured interface and br-ex have the same IP address, and there is even a duplicate default route.

Version-Release number of selected component (if applicable):
OCP 4.7.13
vSphere 7.1

How reproducible:
Not tested in lab

Steps to Reproduce:
1. Deploy an OCP 4.7 UPI cluster with static IP addresses and bonding
2. Force a reboot of the nodes, e.g. apply a new machine config for setting up chrony
3. Rebooted nodes become NotReady and the network config is duplicated

Actual results:
The system deploys initially, but the next reboot leaves the nodes in a bad state. As a workaround, if ssh access to the node is available, manually removing the IPv4 configuration from the NetworkManager profile of the initially configured interface solves the issue (a sketch of the commands is below).

Expected results:
The IP configuration from the "old" interface should be removed so that the "br-ex" interface is the only one with the IP configuration.

Additional info:
More information in following posts
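For reference, a minimal sketch of that manual workaround, assuming the stale profile is named 'pub0' (the profile name will differ per environment); it strips the IPv4 configuration from the old profile so only br-ex keeps the address:

  # 'pub0' stands in for the initially configured connection profile on the node
  nmcli conn modify pub0 ipv4.method disabled ipv4.addresses "" ipv4.gateway ""
  # re-activate the profile (or reboot) so it no longer carries the address
  nmcli conn up pub0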
(In reply to Mario Abajo from comment #0)
> Expected results:
> IP configuration from the "old" interface should be removed so "br-ex"
> interface is the only one with the ip configuration

Hello Mario,

My guess is that the configuration is preserved to be able to revert to it in certain situations, but it is assigned a lower priority so that the new configuration takes precedence. After configure-ovs does its thing there are two connection profiles for the same bond interface: the previously existing 'pub0' and a new one, 'ovs-if-phys0', that has priority 100 and should be activated instead of 'pub0'.

What I believe is happening is that the auto-connect of the bond slave, ens192, causes a nondeterministic master profile to be activated before the priority of those profiles is ever considered, so in the end the 'pub0' connection profile is activated instead of the intended 'ovs-if-phys0'.

I think running these commands before reboot in the scenario above would prevent the issue from happening:

  nmcli conn modify ens192 connection.autoconnect no
  nmcli conn modify ovs-if-phys0 connection.autoconnect-slaves 1
  nmcli conn modify pub0 connection.autoconnect-slaves 1

The intention is that slave profile activation is deferred to the activation of the master profile, and that one of the two master profiles is then activated by priority. Let me know if you can try this out.
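If it helps while testing, here is a quick sketch for inspecting the relevant autoconnect settings on the two competing master profiles (profile names as in the scenario above):

  # print autoconnect, autoconnect-priority and autoconnect-slaves for each master profile
  for c in pub0 ovs-if-phys0; do
    echo "== $c =="
    nmcli -g connection.autoconnect,connection.autoconnect-priority,connection.autoconnect-slaves conn show "$c"
  done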
What I described above is not working reliably for me in my own tests and I don't really know why. It may have to do with the multi-stack master/slave OVS setup and the dependency it creates among the connection profiles that are being activated simultaneously. I will let you try it out anyway if you can.

Given that, I think the best option is to set the connection.master property on the bond slave connection profile to the master connection profile uuid instead of the master device name:

  uuid=$(nmcli -g connection.uuid conn show ovs-if-phys0)
  nmcli conn modify ens192 connection.master $uuid

This way the auto-connect of the bond slave ens192 activates the master 'ovs-if-phys0' profile instead of being undetermined between 'ovs-if-phys0' and 'pub0'. It will require a bit more effort to handle the revert in situations that require it; a rough sketch of such a revert is below.
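For the revert, something along these lines should work; this is only a sketch, assuming the original master profile is 'pub0' and the slave should be pointed back at it:

  # look up the uuid of the original master profile and rebind the slave to it
  old_uuid=$(nmcli -g connection.uuid conn show pub0)
  nmcli conn modify ens192 connection.master "$old_uuid"
  nmcli conn up pub0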
*** Bug 1967656 has been marked as a duplicate of this bug. ***
Ah, so for vSphere UPI we are using /etc/sysconfig/network-scripts/ifcfg-bond0, so there is no file in /etc/NetworkManager/systemConnectionsMerged to match the uuid. The egrep output is empty, so basename gets no operands, xargs exits with 123 and we trigger handle_exit_error:

  conn_file="$(egrep -l uuid=$conn_uuid ${NM_CONN_PATH}/* | xargs basename)"

I think all our bonding testing is with /etc/sysconfig/network-scripts/ifcfg-*. All our non-bonding regular vSphere UPI testing also uses ifcfg files.
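For illustration only (not the actual MCO patch), a more defensive version of that lookup would tolerate an empty match and warn instead of aborting:

  # sketch: avoid the xargs/basename failure when no file matches the uuid
  conn_file="$(grep -l "uuid=$conn_uuid" "${NM_CONN_PATH}"/* 2>/dev/null | head -n1)"
  if [ -n "$conn_file" ]; then
    conn_file="$(basename "$conn_file")"
  else
    echo "WARN: no NM configuration file found for conn: $conn_uuid"
  fi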
https://github.com/openshift/machine-config-operator/pull/2644 looks good for the ifcfg case with reboots. I'll try to test the NetworkManager/system-connections case.
*** Bug 1978481 has been marked as a duplicate of this bug. ***
This bugfix covers the ifcfg case, but in the case of BZ 1978481, where nmconnection files are generated directly by NMState, will it make bonding (static-ip) work as well? Looking at the code in the pull request it should; I am just making sure that's the case. I will be able to retest it as needed when the fix lands on a 4.8.z.
It covers the case of network configuration backed by nmconnection files. Just in case, could you test it beforehand? Any early feedback is welcome.
Yes I can. I am currently using this image for the install:

https://mirror.openshift.com/pub/openshift-v4/amd64/dependencies/rhcos/pre-release/4.8.0-rc.1/rhcos-4.8.0-rc.1-x86_64-live.x86_64.iso

If you have another one that contains the updated machine-config-operator, let me know and I can test it. Or let me know how I can get the updated MCO with the patched configure-ovs.sh script into the install ISO. Thanks.
Tested https://github.com/openshift/machine-config-operator/pull/2644 on 4.8 with kernel boot arg bonding and node reboots, no issues.

  ip=172.31.248.134::172.31.248.1:255.255.254.0:compute-0:bond0:none
  nameserver=10.3.192.12
  bond=bond0:ens192,ens224,ens256:mode=active-backup,miimon=100

The kernel bond= arg seems to create nmconnection files:

  bond0.nmconnection
  ens192.nmconnection
  ens224.nmconnection
  ens256.nmconnection

bond= creating nmconnection files was unexpected, as it is different from the process described in the https://access.redhat.com/solutions/4762021 solution.

Also tested with ifcfg bonding files and node reboots, no issues. I'm going to try to test with ifcfg teaming configs as described in BZ 1977426.
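In case anyone wants to double-check the same thing on their nodes, a quick way to see which profiles were generated and which one owns the default route (a sketch; paths may differ by RHCOS version):

  # list the generated profiles and confirm which connection owns the default route
  ls -l /etc/NetworkManager/system-connections/
  nmcli -f NAME,UUID,TYPE,DEVICE conn show
  ip route show default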
Tested with teamd to verify BZ 1977426. The first couple of reboots seemed okay, but twice with teamd one of the compute nodes was NotReady and didn't respond to external ssh for several minutes. Trying to reproduce and gather logs.
Reboots with teamd are working; nodes change from NotReady to Ready after a few minutes. Verified on 4.8.0-0.ci.test-2021-07-14-151240-ci-ln-40pmlik-latest with https://github.com/openshift/machine-config-operator/pull/2643
Fails on a RHEL7.9 worker because the cloned connection file in /etc/NetworkManager/systemConnectionsMerged doesn't have the '.nmconnection' suffix.

# In RHEL7, files in /{etc,run}/NetworkManager/system-connections end without the suffix '.nmconnection', whereas in RHCOS they end with the suffix.

ipv4.method: manual
+ echo 'Static IP addressing detected on default gateway connection: ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c'
Static IP addressing detected on default gateway connection: ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c
+ egrep -l '--include=*.nmconnection' uuid=ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c /etc/NetworkManager/systemConnectionsMerged/bond0 /etc/NetworkManager/systemConnectionsMerged/br-ex /etc/NetworkManager/systemConnectionsMerged/ens192 /etc/NetworkManager/systemConnectionsMerged/ens192-slave-ovs-clone /etc/NetworkManager/systemConnectionsMerged/ens224 /etc/NetworkManager/systemConnectionsMerged/ens224-slave-ovs-clone /etc/NetworkManager/systemConnectionsMerged/ovs-if-phys0 /etc/NetworkManager/systemConnectionsMerged/ovs-port-br-ex /etc/NetworkManager/systemConnectionsMerged/ovs-port-phys0
+ echo 'WARN: unable to find NM configuration file for conn: ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c. Attempting to clone conn'
WARN: unable to find NM configuration file for conn: ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c. Attempting to clone conn
+ old_conn_file=/etc/NetworkManager/systemConnectionsMerged/ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c-clone.nmconnection
+ nmcli conn clone ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c-clone
bond0 (ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c) cloned as ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c-clone (619ad654-5dc9-4ee9-993e-1afd9d3f15f1).
+ cloned=true
+ '[' '!' -f /etc/NetworkManager/systemConnectionsMerged/ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c-clone.nmconnection ']'
+ echo 'ERROR: unable to locate cloned conn file: /etc/NetworkManager/systemConnectionsMerged/ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c-clone.nmconnection'
ERROR: unable to locate cloned conn file: /etc/NetworkManager/systemConnectionsMerged/ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c-clone.nmconnection
+ exit 1
+ handle_exit_error
+ e=1
+ '[' 1 -eq 0 ']'
+ set +e
+ nmcli c show
NAME                                        UUID                                  TYPE        DEVICE
br-ex                                       8f9e3156-d7d2-4bfc-97df-76f05eabecbd  ovs-bridge  br-ex
ens192-slave-ovs-clone                      fc19aa5d-db67-4e4f-86f0-edba82ca1cc5  ethernet    ens192
ens224-slave-ovs-clone                      287e1eb8-a0a7-4272-8616-070764df617a  ethernet    ens224
ovs-if-phys0                                40cb3785-78e7-4527-8923-d9c2362972ea  bond        bond0
ovs-port-br-ex                              e378e5fe-be29-456e-b02f-069a225a4a9f  ovs-port    br-ex
ovs-port-phys0                              31e04fbe-e992-437a-8012-01ef7f210808  ovs-port    bond0
ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c-clone  619ad654-5dc9-4ee9-993e-1afd9d3f15f1  bond        --
bond0                                       ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c  bond        --
ens192                                      6190085c-0b4e-3feb-aa91-9101c8234c06  ethernet    --
ens224                                      825b4e27-863b-36aa-9e91-bfe010649596  ethernet    --
+ nmcli conn up ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c
Connection successfully activated (master waiting for slaves) (D-Bus active path: /org/freedesktop/NetworkManager/ActiveConnection/10)
+ exit 1

etc/NetworkManager/systemConnectionsMerged/
etc/NetworkManager/systemConnectionsMerged/bond0
etc/NetworkManager/systemConnectionsMerged/ens192
etc/NetworkManager/systemConnectionsMerged/ens224
etc/NetworkManager/systemConnectionsMerged/ba5b0e8a-9c72-3fe5-b1bd-cfbeabe5c45c-clone
etc/NetworkManager/systemConnectionsMerged/ens224-slave-ovs-clone
etc/NetworkManager/systemConnectionsMerged/ens192-slave-ovs-clone
etc/NetworkManager/systemConnectionsMerged/ovs-if-phys0
etc/NetworkManager/systemConnectionsMerged/ovs-port-br-ex
etc/NetworkManager/systemConnectionsMerged/ovs-port-phys0
etc/NetworkManager/systemConnectionsMerged/br-ex
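A sketch of how the script could derive the right file name on both layouts; this illustrates the idea only, not the eventual fix in the PR, and 'conn_name' is a hypothetical variable holding the connection name:

  # RHCOS: profile files end in '.nmconnection'; RHEL7: no suffix
  if [ -f "${NM_CONN_PATH}/${conn_name}.nmconnection" ]; then
    old_conn_file="${NM_CONN_PATH}/${conn_name}.nmconnection"
  else
    old_conn_file="${NM_CONN_PATH}/${conn_name}"
  fi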
With 4.9 and openshift/machine-config-operator/pull/2706, verified on vSphere:
- RHCOS kernel args bond0 with static-ip
- RHEL7.9 NetworkManager bond0 with static-ip
- RHCOS ifcfg teaming with static-ip (for BZ 1934443)
- RHEL7.9 NetworkManager with bond0 DHCP
- RHCOS baremetal-IPI NetworkManager DHCP bond0 (for BZ 1979391)
*** Bug 1992704 has been marked as a duplicate of this bug. ***
*** Bug 1970013 has been marked as a duplicate of this bug. ***
Verified on 4.9.0-0.nightly-2021-08-30-070917.
Verified on vSphere, RHCOS kernel args bond0 with static-ip.
*** Bug 1999756 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:3759