Bug 2089763 - [4.9.z backport] [BM][IPI] Installation with bonds fail - DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.9.z
Assignee: Jaime Caamaño Ruiz
QA Contact: Ross Brattain
URL:
Whiteboard:
Duplicates: 2097315 (view as bug list)
Depends On: 2089757
Blocks: 2094765
 
Reported: 2022-05-24 11:37 UTC by Jaime Caamaño Ruiz
Modified: 2022-06-23 08:08 UTC
CC List: 22 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 2089757
Clones: 2094765 (view as bug list)
Environment:
Last Closed: 2022-06-16 17:49:04 UTC
Target Upstream Version:
Embargoed:


Attachments
ovs-configuration logs, success after reboot (24.69 KB, application/gzip)
2022-06-13 01:52 UTC, Ross Brattain


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 3160 0 None Merged [release-4.9] Bug 2089763: configure-ovs: avoid restarting NetworkManager 2022-06-16 15:25:05 UTC
Github openshift machine-config-operator pull 3183 0 None Merged [release-4.9] Bug 2089763: configure-ovs: persist profiles after auto-connect has been set 2022-06-16 15:25:04 UTC
Github openshift machine-config-operator pull 3188 0 None open [release-4.9] Bug 2089763: configure-ovs: clone connection to avoid selinux problems 2022-06-16 17:01:17 UTC

Comment 5 Ross Brattain 2022-06-13 01:51:18 UTC
Tested openshift/machine-config-operator/pull/3183 with 4.9.0-0.ci.test-2022-06-10-143517-ci-ln-c23tgs2-latest and nodes function after reboot.



## /etc/NetworkManager/system-connections/bond0.nmconnection

[connection]
id=bond0
type=bond
interface-name=bond0
autoconnect=true
connection.autoconnect-slaves=1
autoconnect-priority=99

[bond]
mode=802.3ad
miimon=100

[ipv4]
method=auto
dhcp-timeout=2147483647

[ipv6]
method=disabled

## /etc/NetworkManager/system-connections/br-ex.nmconnection

[connection]
id=br-ex
uuid=4f222278-31c0-4b1c-bffe-56f55bd40db0
type=ovs-bridge
autoconnect=false
interface-name=br-ex
permissions=

[ethernet]
mac-address-blacklist=
mtu=1500

[ovs-bridge]

[ipv4]
dns-search=
method=auto

[ipv6]
addr-gen-mode=stable-privacy
dns-search=
method=auto

[proxy]

## /etc/NetworkManager/system-connections/enp5s0.nmconnection

[connection]
id=enp5s0
type=ethernet
interface-name=enp5s0
master=bond0
slave-type=bond
autoconnect=true
autoconnect-priority=99
## /etc/NetworkManager/system-connections/enp5s0-slave-ovs-clone.nmconnection

[connection]
id=enp5s0-slave-ovs-clone
uuid=f3c73ec7-7e9b-4b13-b73e-1e563f4cbbf0
type=ethernet
autoconnect-priority=100
interface-name=enp5s0
master=5422ff16-12ff-4b59-b276-45836cb1956d
permissions=
slave-type=bond
timestamp=1655083495

[ethernet]
mac-address-blacklist=

## /etc/NetworkManager/system-connections/enp6s0.nmconnection

[connection]
id=enp6s0
type=ethernet
interface-name=enp6s0
master=bond0
slave-type=bond
autoconnect=true
autoconnect-priority=99
## /etc/NetworkManager/system-connections/enp6s0-slave-ovs-clone.nmconnection

[connection]
id=enp6s0-slave-ovs-clone
uuid=8dd64e5f-2ffe-4724-9cca-14cf263258c3
type=ethernet
autoconnect-priority=100
interface-name=enp6s0
master=5422ff16-12ff-4b59-b276-45836cb1956d
permissions=
slave-type=bond
timestamp=1655083495

[ethernet]
mac-address-blacklist=
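The `*-slave-ovs-clone` profiles above outrank the original bond-slave profiles because NetworkManager auto-activates, for a given interface, the candidate profile with the highest `autoconnect-priority` (here 100 beats 99). A minimal sketch of that selection rule — not configure-ovs itself, just an illustration using keyfile snippets modeled on the configs above:

```python
import configparser

# Hypothetical keyfile snippets for the same interface (enp5s0):
# the original bond-slave profile and the configure-ovs clone.
original = """
[connection]
id=enp5s0
interface-name=enp5s0
autoconnect-priority=99
"""

clone = """
[connection]
id=enp5s0-slave-ovs-clone
interface-name=enp5s0
autoconnect-priority=100
"""

def priority(keyfile_text):
    # NM keyfiles are INI-style; a missing priority defaults to 0.
    cp = configparser.ConfigParser()
    cp.read_string(keyfile_text)
    return int(cp["connection"].get("autoconnect-priority", "0"))

# NetworkManager activates the matching profile with the highest priority.
winner = configparser.ConfigParser()
winner.read_string(max([original, clone], key=priority))
print(winner["connection"]["id"])  # -> enp5s0-slave-ovs-clone
```

This is why PR 3183's fix — persisting the cloned profiles only after auto-connect is set — matters: if the clone is not persisted with its higher priority, the original slave profile wins after a reboot and the bond never attaches to br-ex.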





Jun 13 01:25:01 master-0-2 configure-ovs.sh[2914]: + ip route show
Jun 13 01:25:01 master-0-2 configure-ovs.sh[2914]: default via 192.168.123.1 dev br-ex proto dhcp metric 49
Jun 13 01:25:01 master-0-2 configure-ovs.sh[2914]: 172.22.0.0/24 dev enp4s0 proto kernel scope link src 172.22.0.88 metric 100
Jun 13 01:25:01 master-0-2 configure-ovs.sh[2914]: 192.168.123.0/24 dev br-ex proto kernel scope link src 192.168.123.134 metric 49
Jun 13 01:25:01 master-0-2 configure-ovs.sh[2914]: + ip -6 route show
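The log above shows the post-reboot success condition: the default route is owned by br-ex. A small sketch (parsing the `ip route show` output copied from the log, not run on a node) of how to extract which device holds the default route:

```python
# `ip route show` output as logged by configure-ovs.sh above.
route_output = """\
default via 192.168.123.1 dev br-ex proto dhcp metric 49
172.22.0.0/24 dev enp4s0 proto kernel scope link src 172.22.0.88 metric 100
192.168.123.0/24 dev br-ex proto kernel scope link src 192.168.123.134 metric 49
"""

def default_route_device(output):
    # The device follows the "dev" keyword on the "default" line.
    for line in output.splitlines():
        fields = line.split()
        if fields and fields[0] == "default" and "dev" in fields:
            return fields[fields.index("dev") + 1]
    return None

print(default_route_device(route_output))  # -> br-ex
```

If this returned a physical interface (or the bond) instead of br-ex, the ovnkube-node DaemonSet rollout would stall as described in the bug summary.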

Comment 6 Ross Brattain 2022-06-13 01:52:24 UTC
Created attachment 1889267 [details]
ovs-configuration logs, success after reboot

Comment 11 Jaime Caamaño Ruiz 2022-06-15 14:59:02 UTC
*** Bug 2097315 has been marked as a duplicate of this bug. ***

Comment 13 Lalatendu Mohanty 2022-06-16 16:48:29 UTC
Does this bug in any way impact upgrades from 4.10 to 4.11?

Comment 15 Lalatendu Mohanty 2022-06-16 17:13:34 UTC
We're asking the following questions to evaluate whether or not this bug warrants changing update recommendations from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Which 4.y.z to 4.y'.z' updates increase vulnerability? Which types of clusters?

    reasoning: This allows us to populate from, to, and matchingRules in conditional update recommendations for "the $SOURCE_RELEASE to $TARGET_RELEASE update is not recommended for clusters like $THIS".
    example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet. Check your vulnerability with oc ... or the following PromQL count (...) > 0.
    example: All customers upgrading from 4.y.z to 4.y+1.z fail. Check your vulnerability with oc adm upgrade to show your current cluster version.

What is the impact? Is it serious enough to warrant removing update recommendations?

    reasoning: This allows us to populate name and message in conditional update recommendations for "...because if you update, $THESE_CONDITIONS may cause $THESE_UNFORTUNATE_SYMPTOMS".
    example: Around 2 minute disruption in edge routing for 10% of clusters. Check with oc ....
    example: Up to 90 seconds of API downtime. Check with curl ....
    example: etcd loses quorum and you have to restore from backup. Check with ssh ....

How involved is remediation?

    reasoning: This allows administrators who are already vulnerable, or who chose to waive conditional-update risks, to recover their cluster. And even moderately serious impacts might be acceptable if they are easy to mitigate.
    example: Issue resolves itself after five minutes.
    example: Admin can run a single: oc ....
    example: Admin must SSH to hosts, restore from backups, or other non standard admin activities.

Is this a regression?

    reasoning: Updating between two vulnerable releases may not increase exposure (unless rebooting during the update increases vulnerability, etc.). We only qualify update recommendations if the update increases exposure.
    example: No, it has always been like this we just never noticed.
    example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1.

Comment 16 Jaime Caamaño Ruiz 2022-06-16 17:49:04 UTC
To keep the process clear, I am going to close this BZ.

PR 3160 got shipped in 4.9.38. This introduced a 4.9 specific problem with networking being broken after a reboot.
PR 3183 will be shipped in 4.9.39 unless it is tombstoned. This fixed the first problem, but broke static IP configuration.

This final problem is reported in https://bugzilla.redhat.com/show_bug.cgi?id=2095264, where we will continue the work on it.

I will move the impact statement there as well.

Comment 17 W. Trevor King 2022-06-22 23:19:09 UTC
Update graph-data response is being discussed in bug 2098099, e.g. here [1].  So dropping it from this bug.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=2098099#c6

