Bug 1975262 - RHEL7 worker nodes may go to NotReady,SchedulingDisabled while upgrading from 4.7.17-x86_64 to 4.7.0-0.nightly-2021-06-20-093308
Summary: RHEL7 worker nodes may go to NotReady,SchedulingDisabled while upgrading from 4.7.17-x86_64 to 4.7.0-0.nightly-2021-06-20-093308
Keywords:
Status: CLOSED DUPLICATE of bug 1917282
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.8.0
Assignee: Tim Rozet
QA Contact: Anurag saxena
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-23 11:03 UTC by sunzhaohua
Modified: 2021-06-25 15:59 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1976232 (view as bug list)
Environment:
Last Closed: 2021-06-25 14:49:08 UTC
Target Upstream Version:
Embargoed:



Description sunzhaohua 2021-06-23 11:03:55 UTC
Description of problem:
On vSphere, set up a 4.7.17 cluster with RHEL 7.9 workers, FIPS on, OVN, and etcd encryption on, then upgrade to 4.7.0-0.nightly-2021-06-20-093308. After the upgrade, a RHEL 7.9 worker node goes to NotReady,SchedulingDisabled.


Version-Release number of selected component (if applicable):
4.7.17

How reproducible:
always

Steps to Reproduce:
1. On vSphere, set up a 4.7.17 cluster with RHEL 7.9 workers, FIPS on, OVN, and etcd encryption on
2. Upgrade the cluster:
./oc adm upgrade --to-image=registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-06-20-093308 --force=true --allow-explicit-upgrade=true
3.
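
While the upgrade runs, the failure can be observed by watching the cluster version, node, and operator status (not part of the original reproduction steps; these are the same commands whose output is shown below):

oc get clusterversion
oc get nodes -w
oc get co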

Actual results:
The RHEL worker node goes to NotReady,SchedulingDisabled. Checking the node in vSphere, the NotReady node has no IPv4 address, only a link-local IPv6 address:

Power Status: Powered On
Guest OS: Red Hat Enterprise Linux 7 (64-bit)
VMware Tools: Running, version:11269 (Guest Managed)
Encryption: Not encrypted
DNS Name (1): zhsun221636-chs52-rhel-0
IP Addresses (1): fe80::bb:7cff:fec2:32ff

$ oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.17    True        True          171m    Unable to apply 4.7.0-0.nightly-2021-06-20-093308: the cluster operator monitoring has not yet successfully rolled out

$ oc get node
NAME                             STATUS                        ROLES    AGE   VERSION
zhsun221636-chs52-master-0       Ready                         master   26h   v1.20.0+87cc9a4
zhsun221636-chs52-master-1       Ready                         master   26h   v1.20.0+87cc9a4
zhsun221636-chs52-master-2       Ready                         master   26h   v1.20.0+87cc9a4
zhsun221636-chs52-rhel-0         NotReady,SchedulingDisabled   worker   25h   v1.20.0+87cc9a4
zhsun221636-chs52-rhel-1         Ready                         worker   25h   v1.20.0+87cc9a4
zhsun221636-chs52-worker-j8lhl   Ready                         worker   26h   v1.20.0+87cc9a4
zhsun221636-chs52-worker-zssvx   Ready                         worker   26h   v1.20.0+2817867

$ oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.7.0-0.nightly-2021-06-20-093308   True        False         False      96m
baremetal                                  4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h7m
cloud-credential                           4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h11m
cluster-autoscaler                         4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h6m
config-operator                            4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h7m
console                                    4.7.0-0.nightly-2021-06-20-093308   True        False         False      99m
csi-snapshot-controller                    4.7.0-0.nightly-2021-06-20-093308   True        False         False      99m
dns                                        4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h6m
etcd                                       4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h5m
image-registry                             4.7.0-0.nightly-2021-06-20-093308   True        False         False      4h8m
ingress                                    4.7.0-0.nightly-2021-06-20-093308   True        False         False      4h56m
insights                                   4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h
kube-apiserver                             4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h3m
kube-controller-manager                    4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h5m
kube-scheduler                             4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h3m
kube-storage-version-migrator              4.7.0-0.nightly-2021-06-20-093308   True        False         False      103m
machine-api                                4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h2m
machine-approver                           4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h6m
machine-config                             4.7.17                              False       True          True       125m
marketplace                                4.7.0-0.nightly-2021-06-20-093308   True        False         False      99m
monitoring                                 4.7.0-0.nightly-2021-06-20-093308   False       True          True       106m
network                                    4.7.0-0.nightly-2021-06-20-093308   True        True          True       5h7m
node-tuning                                4.7.0-0.nightly-2021-06-20-093308   True        False         False      149m
openshift-apiserver                        4.7.0-0.nightly-2021-06-20-093308   True        False         False      97m
openshift-controller-manager               4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h5m
openshift-samples                          4.7.0-0.nightly-2021-06-20-093308   True        False         False      149m
operator-lifecycle-manager                 4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h6m
operator-lifecycle-manager-catalog         4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h6m
operator-lifecycle-manager-packageserver   4.7.0-0.nightly-2021-06-20-093308   True        False         False      99m
service-ca                                 4.7.0-0.nightly-2021-06-20-093308   True        False         False      5h7m
storage                                    4.7.0-0.nightly-2021-06-20-093308   True        False         False      108m

$ oc edit co network
status:
  conditions:
  - lastTransitionTime: "2021-06-23T09:23:14Z"
    message: |-
      DaemonSet "openshift-multus/multus" rollout is not making progress - last change 2021-06-23T09:13:03Z
      DaemonSet "openshift-ovn-kubernetes/ovnkube-node" rollout is not making progress - last change 2021-06-23T09:13:02Z
      DaemonSet "openshift-network-diagnostics/network-check-target" rollout is not making progress - last change 2021-06-23T09:13:58Z
    reason: RolloutHung
    status: "True"
    type: Degraded
  - lastTransitionTime: "2021-06-23T05:38:12Z"
    status: "False"
    type: ManagementStateDegraded
  - lastTransitionTime: "2021-06-23T05:38:12Z"
    status: "True"
    type: Upgradeable
  - lastTransitionTime: "2021-06-23T08:58:29Z"
    message: |-
      DaemonSet "openshift-multus/multus" is not available (awaiting 1 nodes)
      DaemonSet "openshift-multus/network-metrics-daemon" is not available (awaiting 1 nodes)
      DaemonSet "openshift-ovn-kubernetes/ovnkube-node" is not available (awaiting 1 nodes)
      DaemonSet "openshift-network-diagnostics/network-check-target" is not available (awaiting 1 nodes)
    reason: Deploying
    status: "True"
    type: Progressing

$ oc edit co machine-config
status:
  conditions:
  - lastTransitionTime: "2021-06-23T08:32:56Z"
    message: Working towards 4.7.0-0.nightly-2021-06-20-093308
    status: "True"
    type: Progressing
  - lastTransitionTime: "2021-06-23T09:07:23Z"
    message: 'Unable to apply 4.7.0-0.nightly-2021-06-20-093308: timed out waiting
      for the condition during waitForDaemonsetRollout: Daemonset machine-config-daemon
      is not ready. status: (desired: 7, updated: 7, ready: 6, unavailable: 1)'
    reason: MachineConfigDaemonFailed
    status: "True"
    type: Degraded

Expected results:
The upgrade succeeds without errors.

Additional info:
Must-gather always times out; if necessary, I can set up a cluster for debugging.

Comment 3 zhaozhanqi 2021-06-24 02:17:04 UTC
Hi Tim, not sure if this is related to the MCO configure-ovs script?

Comment 4 W. Trevor King 2021-06-24 03:14:06 UTC
We're asking the following questions to evaluate whether or not this bug warrants blocking an upgrade edge from either the previous X.Y or X.Y.Z.  The ultimate goal is to avoid delivering an update which introduces new risk or reduces cluster functionality in any way.  Sample answers are provided to give more context and the ImpactStatementRequested label has been added to this bug.  When responding, please remove ImpactStatementRequested and set the ImpactStatementProposed label.  The expectation is that the assignee answers these questions.

Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
* example: Customers upgrading from 4.y.Z to 4.y+1.z running on GCP with thousands of namespaces, approximately 5% of the subscribed fleet
* example: All customers upgrading from 4.y.z to 4.y+1.z fail approximately 10% of the time

What is the impact?  Is it serious enough to warrant blocking edges?
* example: Up to 2 minute disruption in edge routing
* example: Up to 90 seconds of API downtime
* example: etcd loses quorum and you have to restore from backup

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
* example: Issue resolves itself after five minutes
* example: Admin uses oc to fix things
* example: Admin must SSH to hosts, restore from backups, or perform other non-standard admin activities

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
* example: No, it has always been like this; we just never noticed
* example: Yes, from 4.y.z to 4.y+1.z, or from 4.y.z to 4.y.z+1

Comment 12 Tim Rozet 2021-06-25 14:49:08 UTC
The problem here is that even the first run of the ovs-configuration script failed because it could not copy the connection files from the overlayfs upper directory to the lower directory. The files were named simply "br-ex", without the ".nmconnection" file extension used by newer versions of NetworkManager. NetworkManager also placed the connection it created for ovs-if-phys0 in network-scripts as an ifcfg file rather than as an NM keyfile in the merged directory. As a result, a subsequent reboot (caused by the upgrade or otherwise) brought the node back up with only a stale ovs-if-phys0 file in network-scripts and none of the other connections present in the keyfile directory. ovs-configuration ran again and recreated the missing keyfiles, but the file that did exist pointed to an older connection that was never copied.

This issue, where the older RHEL NetworkManager uses a different connection file naming scheme and plugins directory, was resolved in 4.8 by:
https://github.com/openshift/machine-config-operator/commit/5af273d4c986bc7018882afdfeab6d3479469bb6
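
As an illustration of the naming difference only (a sketch, not the actual code in that commit), a copy step that tolerates both keyfile naming conventions might look roughly like this:

# Sketch (not the actual MCO/configure-ovs code): copy each OVN-related connection
# whether or not the local NetworkManager writes keyfiles with the .nmconnection extension.
src=/etc/NetworkManager/system-connections-merged
dst=/etc/NetworkManager/system-connections
for conn in br-ex ovs-if-br-ex ovs-port-br-ex ovs-if-phys0 ovs-port-phys0; do
  if [ -f "$src/$conn.nmconnection" ]; then
    cp "$src/$conn.nmconnection" "$dst/"
  elif [ -f "$src/$conn" ]; then
    cp "$src/$conn" "$dst/"
  fi
done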

Comment 13 Tim Rozet 2021-06-25 14:59:34 UTC

*** This bug has been marked as a duplicate of bug 1917282 ***

Comment 14 Tim Rozet 2021-06-25 15:52:04 UTC
Who is impacted?  If we have to block upgrade edges based on this issue, which edges would need blocking?
All customers using 4.6 or 4.7 with OVN and using UPI with RHEL 7 nodes.

What is the impact?  Is it serious enough to warrant blocking edges?
The ovs-configuration systemd service is likely to fail in this scenario. The network configuration remains in a working state even though the script failed, but on a subsequent reboot networking will not come up correctly.

How involved is remediation (even moderately serious impacts might be acceptable if they are easy to mitigate)?
To correct the issue manually on a node (a consolidated sketch of these steps follows the list):
1. After the node is in the failed state, run "/usr/local/bin/configure-ovs.sh OpenShiftSDN". This cleans off any previously configured OVN-related NetworkManager keyfiles.
2. Check /etc/sysconfig/network-scripts/ for ifcfg-ovs-if-phys0; if it exists, remove it.
3. Execute "/usr/local/bin/configure-ovs.sh OVNKubernetes"
4. The script will fail to copy files at the end, but it should leave the node in a working networking state with OVN connections active. For example:
sh-4.4# nmcli conn show --active
NAME            UUID                                  TYPE           DEVICE 
ovs-if-br-ex    eef7e9bd-0523-4bb8-ab13-1461c9e83b60  ovs-interface  br-ex  
br-ex           323205c3-5758-4071-94f1-3462a3271540  ovs-bridge     br-ex  
ovs-if-phys0    e8a030a2-959f-4e79-aad6-c83ff93b2e64  ethernet       enp0s4 
ovs-port-br-ex  6ea2117d-58f3-462b-aa58-1d21cde9527f  ovs-port       br-ex  
ovs-port-phys0  9f6b0f6b-f528-411e-90d9-105886c233e4  ovs-port       enp0s4 

5. Now copy all of the NM keyfiles to the persistent underlay filesystem:
cp /etc/NetworkManager/system-connections-merged/{br-ex,ovs-if-br-ex,ovs-port-br-ex,ovs-if-phys0,ovs-port-phys0} /etc/NetworkManager/system-connections/

This may result in an error like:
cp: cannot stat '/etc/NetworkManager/system-connections-merged/ovs-if-phys0': No such file or directory

This is safe to ignore, as ovs-if-phys0 is most likely present as an ifcfg file under the /etc/sysconfig/network-scripts directory.
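
Taken together, the manual steps above amount to roughly the following (a sketch based only on this comment; paths and connection names may differ on a given node):

# Run on the affected node after it is already in the failed state.
/usr/local/bin/configure-ovs.sh OpenShiftSDN               # step 1: clean previous OVN-related keyfiles
rm -f /etc/sysconfig/network-scripts/ifcfg-ovs-if-phys0    # step 2: remove the stale ifcfg file, if any
/usr/local/bin/configure-ovs.sh OVNKubernetes              # step 3: recreate the OVN connections
nmcli conn show --active                                   # step 4: verify br-ex and the ovs-* connections are active
# step 5: persist the keyfiles; a "No such file" error for ovs-if-phys0 is expected and safe to ignore
cp /etc/NetworkManager/system-connections-merged/{br-ex,ovs-if-br-ex,ovs-port-br-ex,ovs-if-phys0,ovs-port-phys0} \
   /etc/NetworkManager/system-connections/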

Is this a regression (if all previous versions were also vulnerable, updating to the new, vulnerable version does not increase exposure)?
No, it has always been present in 4.6 and 4.7. The fix already exists in 4.8.

Comment 15 W. Trevor King 2021-06-25 15:59:01 UTC
> Is this a regression...
> No, it has always been present in 4.6 and 4.7...

So not a blocker, because life doesn't get worse for users if they update.

