Description of problem:
On vSphere, set up a 4.7.17 cluster with RHEL7.9 workers, FIPS on, OVN, and etcd encryption on, then upgrade to 4.7.0-0.nightly-2021-06-20-093308. A RHEL7.9 worker node goes NotReady,SchedulingDisabled.

Version-Release number of selected component (if applicable):
4.7.17

How reproducible:
Always

Steps to Reproduce:
1. On vSphere, set up a 4.7.17 cluster with RHEL7.9 workers, FIPS on, OVN, and etcd encryption on.
2. Upgrade the cluster:
   ./oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-06-20-093308 --force=true --allow-explicit-upgrade=true

Actual results:
The RHEL worker node goes NotReady,SchedulingDisabled. Checking the node's status in vSphere shows the NotReady node has no usable IP address (only an IPv6 link-local address):

Power Status      Powered On
Guest OS          Red Hat Enterprise Linux 7 (64-bit)
VMware Tools      Running, version:11269 (Guest Managed)
Encryption        Not encrypted
DNS Name (1)      zhsun221636-chs52-rhel-0
IP Addresses (1)  fe80::bb:7cff:fec2:32ff

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.17    True        True          171m    Unable to apply 4.7.0-0.nightly-2021-06-20-093308: the cluster operator monitoring has not yet successfully rolled out

$ oc get node
NAME                             STATUS                        ROLES    AGE   VERSION
zhsun221636-chs52-master-0       Ready                         master   26h   v1.20.0+87cc9a4
zhsun221636-chs52-master-1       Ready                         master   26h   v1.20.0+87cc9a4
zhsun221636-chs52-master-2       Ready                         master   26h   v1.20.0+87cc9a4
zhsun221636-chs52-rhel-0         NotReady,SchedulingDisabled   worker   25h   v1.20.0+87cc9a4
zhsun221636-chs52-rhel-1         Ready                         worker   25h   v1.20.0+87cc9a4
zhsun221636-chs52-worker-j8lhl   Ready                         worker   26h   v1.20.0+87cc9a4
zhsun221636-chs52-worker-zssvx   Ready                         worker   26h   v1.20.0+2817867

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-2021-06-20-093308   True        False         False      96m
baremetal                                  4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h7m
cloud-credential                           4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h11m
cluster-autoscaler                         4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h6m
config-operator                            4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h7m
console                                    4.7.0-0.nightly-2021-06-20-093308   True        False         False      99m
csi-snapshot-controller                    4.7.0-0.nightly-2021-06-20-093308   True        False         False      99m
dns                                        4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h6m
etcd                                       4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h5m
image-registry                             4.7.0-0.nightly-2021-06-20-093308   True        False         False      4h8m
ingress                                    4.7.0-0.nightly-2021-06-20-093308   True        False         False      4h56m
insights                                   4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h
kube-apiserver                             4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h3m
kube-controller-manager                    4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h5m
kube-scheduler                             4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h3m
kube-storage-version-migrator              4.7.0-0.nightly-2021-06-20-093308   True        False         False      103m
machine-api                                4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h2m
machine-approver                           4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h6m
machine-config                             4.7.17                              False       True          True       125m
marketplace                                4.7.0-0.nightly-2021-06-20-093308   True        False         False      99m
monitoring                                 4.7.0-0.nightly-2021-06-20-093308   False       True          True       106m
network                                    4.7.0-0.nightly-2021-06-20-093308   True        True          True       5h7m
node-tuning                                4.7.0-0.nightly-2021-06-20-093308   True        False         False      149m
openshift-apiserver                        4.7.0-0.nightly-2021-06-20-093308   True        False         False      97m
openshift-controller-manager               4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h5m
openshift-samples                          4.7.0-0.nightly-2021-06-20-093308   True        False         False      149m
operator-lifecycle-manager                 4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h6m
operator-lifecycle-manager-catalog         4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h6m
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-2021-06-20-093308   True        False         False      99m
service-ca                                 4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h7m
storage                                    4.7.0-0.nightly-2021-06-20-093308   True        False         False      108m

$ oc edit co network
status:
  conditions:
  - lastTransitionTime: "2021-06-23T09:23:14Z"
    message: |-
      DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2021-06-23T09:13:03Z
      DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2021-06-23T09:13:02Z
      DaemonSet "openshift-network-diagnostics/network-check-target" rollout is not making progress - last change 2021-06-23T09:13:58Z
    reason: RolloutHung
    status: "True"
    type: Degraded
  - lastTransitionTime: "2021-06-23T05:38:12Z"
    status: "False"
    type: ManagementStateDegraded
  - lastTransitionTime: "2021-06-23T05:38:12Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2021-06-23T08:58:29Z"
    message: |-
      DaemonSet "openshift-multus/multus" is not available (awaiting 1 nodes)
      DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
      DaemonSet "openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
      DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    reason: Deploying
    status: "True"
    type: Progressing

$ oc edit co machine-config
status:
  conditions:
  - lastTransitionTime: "2021-06-23T08:32:56Z"
    message: Working towards 4.7.0-0.nightly-2021-06-20-093308
    status: "True"
    type: Progressing
  - lastTransitionTime: "2021-06-23T09:07:23Z"
    message: 'Unable to apply 4.7.0-0.nightly-2021-06-20-093308: timed out waiting
      for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon
      is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)'
    reason: MachineConfigDaemonFailed
    status: "True"
    type: Degraded

Expected results:
The upgrade succeeds without errors.

Additional info:
Must-gather always times out; if necessary, I can set up a cluster for debugging.
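Until a debug cluster is available, here is a minimal on-node triage sketch for the NotReady worker. It assumes VM console access (the node only has a link-local address, so SSH may not reach it) and uses the standard MCO unit and NetworkManager path names; verify these on the node before relying on them:

# From the VM console of the NotReady RHEL worker:
systemctl status ovs-configuration.service             # did configure-ovs fail at boot?
journalctl -b -u ovs-configuration.service --no-pager  # full log of the last run
nmcli conn show --active                               # which NM connections actually came up
ls /etc/NetworkManager/system-connections-merged/      # runtime (overlay) keyfiles
ls /etc/NetworkManager/system-connections/             # persistent keyfiles
ls /etc/sysconfig/network-scripts/                     # legacy ifcfg files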
Hi Tim, not sure if this is related to the MCO configure-ovs script?
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z. The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way. Sample answers are provided to give more context, and the ImpactStatementRequested label has been added to this bug. When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label. The expectation is that the assignee answers these questions.

Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?
* example: Customers upgrading from 4.y.z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact? Is it serious enough to warrant blocking edges?
* example: Up to 2 minute disruption in edge routing
* example: Up to 90 seconds of API downtime
* example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* example: Issue resolves itself after five minutes
* example: Admin uses oc to fix things
* example: Admin must SSH to hosts, restore from backups, or other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* example: No, it has always been like this, we just never noticed
* example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1
The problem here is that even the first run of the ovs-configuration script failed, because it could not copy the files from the overlayfs upper directory to the lower directory. This is because the files were named "br-ex" etc. without the ".nmconnection" file extension used by newer versions of NetworkManager. NM also decided, when it created the connection for ovs-if-phys0, to place it in network-scripts as an ifcfg file rather than as an NM keyfile in the merged directory. As a result, a subsequent reboot (caused by the upgrade or otherwise) brought the node back up with only a stale ovs-if-phys0 file in network-scripts and none of the other connections present in the keyfile directory. configure-ovs tried to run again and recreated the missing keyfiles, but the file that did exist pointed to an older connection that was never copied.

This issue of older RHEL NM using a different connection-file syntax and plugins directory was resolved in 4.8 by: https://github.com/openshift/machine-config-operator/commit/5af273d4c986bc7018882afdfeab6d3479469bb6
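To make that sequence concrete, this is roughly what the node looks like after the first configure-ovs run and before the reboot. This is an illustrative reconstruction from the description above, not captured output:

# Overlay merged view. The overlay's upper layer does not persist across
# reboots, which is why the script tries to copy these down. Older NM wrote
# the keyfiles without the ".nmconnection" suffix, so that copy failed:
sh-4.4# ls /etc/NetworkManager/system-connections-merged/
br-ex  ovs-if-br-ex  ovs-port-br-ex  ovs-port-phys0

# ovs-if-phys0 was written as a legacy ifcfg file instead of a keyfile:
sh-4.4# ls /etc/sysconfig/network-scripts/ | grep ovs
ifcfg-ovs-if-phys0

# The persistent keyfile directory is empty, so after a reboot only the
# stale ifcfg-ovs-if-phys0 survives:
sh-4.4# ls /etc/NetworkManager/system-connections/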
*** This bug has been marked as a duplicate of bug 1917282 ***
Who is impacted? If we have to block upgrade edges based on this issue, which edges would need blocking?

All customers using 4.6 or 4.7 with OVN and using UPI with RHEL 7 nodes.

What is the impact? Is it serious enough to warrant blocking edges?

The ovs-configuration systemd service is likely to fail in this scenario. Network configuration will initially remain in a working state even though the script failed, but on a subsequent reboot networking will not come up correctly.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?

To correct the issue manually on a node (a consolidated script sketch follows at the end of this comment):

1. After the node is in the failed state, run "/usr/local/bin/configure-ovs.sh OpenShiftSDN". This will clean off any previously configured OVN-related NetworkManager keyfiles.
2. Check /etc/sysconfig/network-scripts/ for ifcfg-ovs-if-phys0; if it exists, remove it.
3. Execute "/usr/local/bin/configure-ovs.sh OVNKubernetes".
4. The script will fail to copy files at the end, but it should leave the node in a working networking state with the OVN connections active. For example:

sh-4.4# nmcli conn show --active
NAME            UUID                                  TYPE           DEVICE
ovs-if-br-ex    eef7e9bd-0523-4bb8-ab13-1461c9e83b60  ovs-interface  br-ex
br-ex           323205c3-5758-4071-94f1-3462a3271540  ovs-bridge     br-ex
ovs-if-phys0    e8a030a2-959f-4e79-aad6-c83ff93b2e64  ethernet       enp0s4
ovs-port-br-ex  6ea2117d-58f3-462b-aa58-1d21cde9527f  ovs-port       br-ex
ovs-port-phys0  9f6b0f6b-f528-411e-90d9-105886c233e4  ovs-port       enp0s4

5. Now copy all of the NM keyfiles to the persistent underlay filesystem:

cp /etc/NetworkManager/system-connections-merged/{br-ex,ovs-if-br-ex,ovs-port-br-ex,ovs-if-phys0,ovs-port-phys0} /etc/NetworkManager/system-connections/

This may produce an error like:

cp: cannot stat '/etc/NetworkManager/system-connections-merged/ovs-if-phys0': No such file or directory

This is safe to ignore, as ovs-if-phys0 is most likely present under the /etc/sysconfig/network-scripts directory instead.

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?

No, it has always been present in 4.6 and 4.7. The fix already exists in 4.8.
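For convenience, steps 1-5 above can be collapsed into one small script. This is only a sketch assembled verbatim from those steps, not an official remediation; run it from a console on the affected node and sanity-check each step's output:

#!/bin/bash
# Sketch of the manual remediation steps above -- not an official tool.
set -x

# Step 1: clean off any previously configured OVN-related NM keyfiles.
/usr/local/bin/configure-ovs.sh OpenShiftSDN

# Step 2: remove the stale legacy ifcfg file if older NM left one behind.
rm -f /etc/sysconfig/network-scripts/ifcfg-ovs-if-phys0

# Step 3: reconfigure for OVN. The copy at the end of the script may fail
# (that failure is this bug), but networking should come up with the OVN
# connections active.
/usr/local/bin/configure-ovs.sh OVNKubernetes

# Step 4: verify the OVN connections are active before persisting anything.
nmcli conn show --active

# Step 5: persist the keyfiles to the underlay filesystem. A "cannot stat ...
# ovs-if-phys0" error is safe to ignore; that connection usually lives under
# /etc/sysconfig/network-scripts instead.
cp /etc/NetworkManager/system-connections-merged/{br-ex,ovs-if-br-ex,ovs-port-br-ex,ovs-if-phys0,ovs-port-phys0} \
   /etc/NetworkManager/system-connections/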
> Is this a regression...
> No, it has always been present in 4.6 and 4.7...

So not a blocker, because life doesn't get worse for users if they update.