Summary: | RHEL7 worker nodes may go to NotReady,SchedulingDisabled while upgrading from 4.7.17-x86_64 to 4.7.0-0.nightly-2021-06-20-093308 | |||
---|---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | sunzhaohua <zhsun> | |
Component: | Networking | Assignee: | Tim Rozet <trozet> | |
Networking sub component: | ovn-kubernetes | QA Contact: | Anurag saxena <anusaxen> | |
Status: | CLOSED DUPLICATE | Docs Contact: | ||
Severity: | urgent | |||
Priority: | unspecified | CC: | anbhat, aos-bugs, mmahmoud, trozet, vlaad, wking, zzhao | |
Version: | 4.7 | Keywords: | TestBlocker, Upgrades | |
Target Milestone: | --- | |||
Target Release: | 4.8.0 | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1976232 | Environment: | |
Last Closed: | 2021-06-25 14:49:08 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: |
Description
sunzhaohua
2021-06-23 11:03:55 UTC
Hi, Tim, not sure if this is related to mco configure-ovs?

We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
* example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
* example: Up to 2 minute disruption in edge routing
* example: Up to 90 seconds of API downtime
* example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* example: Issue resolves itself after five minutes
* example: Admin uses oc to fix things
* example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* example: No, it has always been like this we just never noticed
* example: Yes, from 4.y.z to 4.y+1.z Or 4.y.z to 4.y.z+1

The problem here is that even the first run of the ovs-configuration script failed, because it could not copy the files from the overlay filesystem's upper directory to its lower directory. This is because the files were named "br-ex", without the ".nmconnection" file extension that newer versions of NetworkManager use. When NM created the connection for ovs-if-phys0, it also placed it in network-scripts as an ifcfg file rather than as an NM keyfile in the merged directory.

As a result, a subsequent reboot (caused by the upgrade or otherwise) brought the node back up with only a stale ovs-if-phys0 file in network-scripts, and none of the other connections present in the keyfile directory. ovs-config ran again and recreated the missing keyfiles, but the file that already existed pointed to an older connection that was never copied.

This issue of older RHEL NM using a different NM connection file syntax and plugins dir was resolved in 4.8 by:
https://github.com/openshift/machine-config-operator/commit/5af273d4c986bc7018882afdfeab6d3479469bb6

*** This bug has been marked as a duplicate of bug 1917282 ***

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
All customers using 4.6 or 4.7 with OVN and using UPI with RHEL 7 nodes.

What is the impact? Is it serious enough to warrant blocking edges?
The ovs-configuration systemd service is likely to fail in this scenario. The network configuration will be in a working state even though the script failed, but on a subsequent reboot networking will not come up correctly.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
To correct the issue manually on a node:
1. After the node is in the failed state, run "/usr/local/bin/configure-ovs.sh OpenShiftSDN". This will clean off any previously configured OVN-related NetworkManager keyfiles.
2. Check /etc/sysconfig/network-scripts/ for ifcfg-ovs-if-phys0; if it exists, remove it.
3. Execute "/usr/local/bin/configure-ovs.sh OVNKubernetes"
4. The script will fail to copy files at the end, but it should leave the node in a working networking state with OVN connections active. For example:
   sh-4.4# nmcli conn show --active
   NAME            UUID                                  TYPE           DEVICE
   ovs-if-br-ex    eef7e9bd-0523-4bb8-ab13-1461c9e83b60  ovs-interface  br-ex
   br-ex           323205c3-5758-4071-94f1-3462a3271540  ovs-bridge     br-ex
   ovs-if-phys0    e8a030a2-959f-4e79-aad6-c83ff93b2e64  ethernet       enp0s4
   ovs-port-br-ex  6ea2117d-58f3-462b-aa58-1d21cde9527f  ovs-port       br-ex
   ovs-port-phys0  9f6b0f6b-f528-411e-90d9-105886c233e4  ovs-port       enp0s4
5. Now copy all of the NM keyfiles to the persistent underlay filesystem:
   cp /etc/NetworkManager/system-connections-merged/{br-ex,ovs-if-br-ex,ovs-port-br-ex,ovs-if-phys0,ovs-port-phys0} /etc/NetworkManager/system-connections/
   This may result in an error like:
   cp: cannot stat '/etc/NetworkManager/system-connections-merged/ovs-if-phys0': No such file or directory
   This is safe to ignore, as ovs-if-phys0 is most likely present under the /etc/sysconfig/network-scripts directory.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
No, it has always been present in 4.6 and 4.7. The fix already exists in 4.8.

> Is this a regression...
> No, it has always been present in 4.6 and 4.7...
So not a blocker, because life doesn't get worse for users if they update.
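
For anyone applying the manual remediation from the impact statement above to several nodes, the five steps could be sketched as a small script. This is only an illustration under assumptions: the names DRY_RUN, run, and remediate_node are hypothetical and are not part of configure-ovs.sh, and only the paths and commands quoted in this bug are used. By default it prints the commands instead of executing them.

```shell
#!/usr/bin/env bash
# Sketch of the manual remediation steps described in this bug.
# Hypothetical helper names: DRY_RUN, run, remediate_node.
# With DRY_RUN=1 (the default) commands are printed, not executed.

DRY_RUN="${DRY_RUN:-1}"
MERGED_DIR=/etc/NetworkManager/system-connections-merged
PERSISTENT_DIR=/etc/NetworkManager/system-connections
STALE_IFCFG=/etc/sysconfig/network-scripts/ifcfg-ovs-if-phys0

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

remediate_node() {
    # Step 1: clean off previously configured OVN-related NM keyfiles.
    run /usr/local/bin/configure-ovs.sh OpenShiftSDN

    # Step 2: remove the stale ifcfg file if it exists (rm -f ignores a missing file).
    run rm -f "$STALE_IFCFG"

    # Step 3: recreate the OVN connections. The script is expected to fail
    # when copying files at the end, so a non-zero exit is tolerated here.
    run /usr/local/bin/configure-ovs.sh OVNKubernetes || true

    # Step 4: show the active connections so they can be checked by eye.
    run nmcli conn show --active

    # Step 5: copy the keyfiles to the persistent directory. A missing
    # ovs-if-phys0 keyfile is expected, as it most likely lives in network-scripts.
    for name in br-ex ovs-if-br-ex ovs-port-br-ex ovs-if-phys0 ovs-port-phys0; do
        run cp "$MERGED_DIR/$name" "$PERSISTENT_DIR/" \
            || echo "skipped $name (not found in $MERGED_DIR)"
    done
}
```

Sourcing the file and calling remediate_node prints the planned commands; calling it with DRY_RUN=0 as root on the affected node would execute them. Checking the nmcli output from step 4 against the example above before relying on the copied keyfiles is still advisable.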