gathered data url: shorturl.at/anF36

Hello,

The cluster started at OCP 4.5.3 with RHEL 7.8 worker nodes. It was upgraded to OCP 4.5.17 and then to OCP 4.6.17 without upgrading the worker nodes. On the upgrade to OCP 4.7.0, the process stalled at 29/31 ClusterOperators, which is when we began to suspect the worker nodes.

[mchebbi@fedora 02880221]$ omg get co | awk '!/4.7.0.*True.*False.* False/{print}'
NAME                            VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                  4.7.0     False       True          True       4d
console                         4.7.0     False       True          True       2d
dns                             4.6.17    True        False         True       4d
ingress                         4.7.0     False       True          True       3d
kube-storage-version-migrator   4.7.0     False       False         False      2d
machine-config                  4.6.17    True        False         False      3d
monitoring                      4.7.0     False       True          True       3m10s
network                         4.7.0     False       True          True       3d

[mchebbi@fedora 02880221]$ omg get nodes
NAME                                                          STATUS   ROLES    AGE   VERSION
armstrong-master1.armstrong.scale-ocp.tuc.stglabs.ibm.com     Ready    master   19d   v1.19.0+e405995
armstrong-master2.armstrong.scale-ocp.tuc.stglabs.ibm.com     Ready    master   19d   v1.19.0+e405995
armstrong-master3.armstrong.scale-ocp.tuc.stglabs.ibm.com     Ready    master   19d   v1.19.0+e405995
armstrong-compute4.armstrong.scale-ocp.tuc.stglabs.ibm.com    Ready    worker   17d   v1.20.0+ba45583
armstrong-compute9.armstrong.scale-ocp.tuc.stglabs.ibm.com    Ready    worker   17d   v1.20.0+ba45583
armstrong-compute7.armstrong.scale-ocp.tuc.stglabs.ibm.com    Ready    worker   17d   v1.20.0+ba45583
armstrong-compute5.armstrong.scale-ocp.tuc.stglabs.ibm.com    Ready    worker   17d   v1.20.0+ba45583
armstrong-compute8.armstrong.scale-ocp.tuc.stglabs.ibm.com    Ready    worker   17d   v1.20.0+ba45583
armstrong-compute1.armstrong.scale-ocp.tuc.stglabs.ibm.com    Ready    worker   17d   v1.20.0+ba45583
armstrong-compute6.armstrong.scale-ocp.tuc.stglabs.ibm.com    Ready    worker   17d   v1.20.0+ba45583
armstrong-compute2.armstrong.scale-ocp.tuc.stglabs.ibm.com    Ready    worker   17d   v1.20.0+ba45583

[mchebbi@fedora 02880221]$ omg get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version             True        True          1m14s   Unable to apply 4.7.0: the cluster operator kube-storage-version-migrator has not yet successfully rolled out

[mchebbi@fedora 02880221]$ omg get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
worker   rendered-worker-6c9fded112206b678702e275af3d315d   False     True       False      8              0                   0                     0                      19d
master   rendered-master-8859d06803f06500f6cc324d4a0da142   True      False      False      3              3                   3                     0                      19d

====================================================================================================

kind: ClusterVersion
status:
  availableUpdates: null
  conditions:
  - lastTransitionTime: '2020-05-11T22:25:12Z'
    message: Done applying 4.6.17
    status: 'True'
    type: Available
  - lastTransitionTime: '2021-03-01T16:10:17Z'
    message: Cluster operator kube-storage-version-migrator is not available
    reason: ClusterOperatorNotAvailable
    status: 'True'
    type: Failing
  - lastTransitionTime: '2021-02-23T19:32:51Z'
    message: 'Unable to apply 4.7.0: the cluster operator kube-storage-version-migrator
      has not yet successfully rolled out'
    reason: ClusterOperatorNotAvailable
    status: 'True'
    type: Progressing
  - lastTransitionTime: '2021-02-28T17:45:18Z'
    status: 'True'
    type: RetrievedUpdates

====================================================================================================

[mchebbi@fedora 02880221]$ omg logs machine-config-controller-78f848949d-j56gg -n openshift-machine-config-operator
2021-02-26T05:47:08.621885708Z E0226 05:47:08.621811 1 render_controller.go:460] Error updating MachineConfigPool worker: Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io "worker": the object has been modified; please apply your changes to the latest version and try again
2021-02-26T05:47:08.621885708Z I0226 05:47:08.621847 1 render_controller.go:377] Error syncing machineconfigpool worker: Operation cannot be fulfilled on machineconfigpools.machineconfiguration.openshift.io "worker": the object has been modified; please apply your changes to the latest version and try again
2021-02-26T05:48:14.006754288Z I0226 05:48:14.006654 1 node_controller.go:419] Pool worker: node armstrong-compute8.armstrong.scale-ocp.tuc.stglabs.ibm.com: Reporting unready: node armstrong-compute8.armstrong.scale-ocp.tuc.stglabs.ibm.com is reporting NotReady=False
2021-02-26T05:48:14.250008768Z I0226 05:48:14.249945 1 node_controller.go:419] Pool worker: node armstrong-compute8.armstrong.scale-ocp.tuc.stglabs.ibm.com: Reporting ready

====================================================================================================

$ omg logs machine-config-daemon-5678p -c machine-config-daemon -n openshift-machine-config-operator
2021-03-01T09:12:41.157100204-07:00 E0301 16:12:41.157048 5387 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
2021-03-01T09:13:02.864606563-07:00 I0301 16:13:02.864505 5387 trace.go:205] Trace[1494593888]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (01-Mar-2021 16:12:32.863) (total time: 30000ms):
2021-03-01T09:13:02.864606563-07:00 Trace[1494593888]: [30.000930122s] [30.000930122s] END
2021-03-01T09:13:02.864606563-07:00 E0301 16:13:02.864549 5387 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout

====================================================================================================

$ omg logs machine-config-daemon-gr9lp -c machine-config-daemon -n openshift-machine-config-operator
2021-03-01T09:12:56.649503782-07:00 Trace[17984558]: [30.001042548s] [30.001042548s] END
2021-03-01T09:12:56.649503782-07:00 E0301 16:12:56.649465 4835 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
2021-03-01T09:13:20.774154502-07:00 I0301 16:13:20.774071 4835 trace.go:205] Trace[1944204424]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (01-Mar-2021 16:12:50.773) (total time: 30000ms):
2021-03-01T09:13:20.774154502-07:00 Trace[1944204424]: [30.000942709s] [30.000942709s] END
2021-03-01T09:13:20.774154502-07:00 E0301 16:13:20.774135 4835 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout

====================================================================================================

$ omg logs machine-config-daemon-hl7vr -c machine-config-daemon -n openshift-machine-config-operator
2021-03-01T09:12:08.300418327-07:00 E0301 16:12:08.300343 5627 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
2021-03-01T09:13:12.202071179-07:00 I0301 16:13:12.202004 5627 trace.go:205] Trace[286360012]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (01-Mar-2021 16:12:42.200) (total time: 30001ms):
2021-03-01T09:13:12.202071179-07:00 Trace[286360012]: [30.001272833s] [30.001272833s] END
2021-03-01T09:13:12.202071179-07:00 E0301 16:13:12.202050 5627 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout

====================================================================================================

$ omg logs machine-config-daemon-nxt4k -c machine-config-daemon -n openshift-machine-config-operator
2021-03-01T09:11:40.319647125-07:00 E0301 16:11:40.319581 5154 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
2021-03-01T09:12:44.233967824-07:00 I0301 16:12:44.233895 5154 trace.go:205] Trace[286360012]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (01-Mar-2021 16:12:14.232) (total time: 30001ms):
2021-03-01T09:12:44.233967824-07:00 Trace[286360012]: [30.001194859s] [30.001194859s] END
2021-03-01T09:12:44.233967824-07:00 E0301 16:12:44.233941 5154 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
2021-03-01T09:13:05.901916947-07:00 I0301 16:13:05.901818 5154 trace.go:205] Trace[1494593888]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (01-Mar-2021 16:12:35.900) (total time: 30000ms):
2021-03-01T09:13:05.901916947-07:00 Trace[1494593888]: [30.000897633s] [30.000897633s] END
2021-03-01T09:13:05.901916947-07:00 E0301 16:13:05.901862 5154 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout

====================================================================================================

$ omg logs machine-config-daemon-rqdqp -c machine-config-daemon -n openshift-machine-config-operator
2021-03-01T09:11:37.138756188-07:00 E0301 16:11:37.138682 5394 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
2021-03-01T09:12:41.012896234-07:00 I0301 16:12:41.012793 5394 trace.go:205] Trace[286360012]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (01-Mar-2021 16:12:11.010) (total time: 30001ms):
2021-03-01T09:12:41.012896234-07:00 Trace[286360012]: [30.001861719s] [30.001861719s] END
2021-03-01T09:12:41.012896234-07:00 E0301 16:12:41.012871 5394 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
2021-03-01T09:13:02.721063984-07:00 I0301 16:13:02.720995 5394 trace.go:205] Trace[1494593888]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (01-Mar-2021 16:12:32.720) (total time: 30000ms):
2021-03-01T09:13:02.721063984-07:00 Trace[1494593888]: [30.000916976s] [30.000916976s] END
2021-03-01T09:13:02.721063984-07:00 E0301 16:13:02.721039 5394 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout

====================================================================================================

$ omg logs machine-config-daemon-x8jct -c machine-config-daemon -n openshift-machine-config-operator
2021-03-01T09:11:38.902182585-07:00 E0301 16:11:38.902133 5498 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
2021-03-01T09:12:02.998255159-07:00 I0301 16:12:02.998171 5498 trace.go:205] Trace[1944204424]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (01-Mar-2021 16:11:32.997) (total time: 30000ms):
2021-03-01T09:12:02.998255159-07:00 Trace[1944204424]: [30.000821765s] [30.000821765s] END
2021-03-01T09:12:02.998255159-07:00 E0301 16:12:02.998217 5498 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
2021-03-01T09:13:06.896690723-07:00 I0301 16:13:06.896620 5498 trace.go:205] Trace[286360012]: "Reflector ListAndWatch" name:k8s.io/client-go/informers/factory.go:134 (01-Mar-2021 16:12:36.895) (total time: 30001ms):
2021-03-01T09:13:06.896690723-07:00 Trace[286360012]: [30.001096678s] [30.001096678s] END
2021-03-01T09:13:06.896690723-07:00 E0301 16:13:06.896666 5498 reflector.go:127] k8s.io/client-go/informers/factory.go:134: Failed to watch *v1.Node: failed to list *v1.Node: Get "https://172.30.0.1:443/api/v1/nodes?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout
2021-03-01T09:13:28.581359027-07:00 I0301 16:13:28.581300 5498 trace.go:205] Trace[1494593888]: "Reflector ListAndWatch" name:github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101 (01-Mar-2021 16:12:58.579) (total time: 30001ms):
2021-03-01T09:13:28.581359027-07:00 Trace[1494593888]: [30.001700434s] [30.001700434s] END
2021-03-01T09:13:28.581424994-07:00 E0301 16:13:28.581341 5498 reflector.go:127] github.com/openshift/machine-config-operator/pkg/generated/informers/externalversions/factory.go:101: Failed to watch *v1.MachineConfig: failed to list *v1.MachineConfig: Get "https://172.30.0.1:443/apis/machineconfiguration.openshift.io/v1/machineconfigs?limit=500&resourceVersion=0": dial tcp 172.30.0.1:443: i/o timeout

====================================================================================================

[mchebbi@fedora 02880221]$ omg get pods -n openshift-kube-storage-version-migrator
NAME                       READY   STATUS    RESTARTS   AGE
migrator-9d6c8f546-qxb2t   0/1     Pending   0          3d

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: '2021-02-26T06:10:11Z'
    status: 'True'
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: '2021-02-26T06:10:11Z'
    message: 'containers with unready status: [migrator]'
    reason: ContainersNotReady
    status: 'False'
    type: Ready

[mchebbi@fedora 02880221]$ omg logs kube-storage-version-migrator-operator-84db77494d-ps9kc -n openshift-kube-storage-version-migrator-operator
2021-02-27T05:08:26.570320753Z I0227 05:08:26.570216 1 status_controller.go:172] clusteroperator/kube-storage-version-migrator diff {"status":{"conditions":[{"lastTransitionTime":"2020-06-16T17:16:14Z","message":"All is well","reason":"AsExpected","status":"False","type":"Degraded"},{"lastTransitionTime":"2021-02-27T05:08:26Z","message":"All is well","reason":"AsExpected","status":"False","type":"Progressing"},{"lastTransitionTime":"2021-02-25T19:42:07Z","message":"Available: deployment/migrator.openshift-kube-storage-version-migrator: no replicas are available","reason":"_NoMigratorPod","status":"False","type":"Available"},{"lastTransitionTime":"2020-05-09T22:44:57Z","reason":"NoData","status":"Unknown","type":"Upgradeable"}]}}
2021-02-27T05:08:26.579152789Z I0227 05:08:26.579060 1 event.go:282] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-kube-storage-version-migrator-operator", Name:"kube-storage-version-migrator-operator", UID:"97c0127b-e40f-405e-b140-2a4a2f2f585d", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'OperatorStatusChanged' Status for clusteroperator/kube-storage-version-migrator changed: Degraded message changed from "TargetDegraded: \"kube-storage-version-migrator/namespace.yaml\" (string): etcdserver: leader changed\nTargetDegraded: " to "All is well",Progressing changed from True to False ("All is well")
====================================================================================================

I have asked the customer to delete the kube-storage-version-migrator-operator pod and its ReplicaSets. This triggers the update to progress again, but it gets stuck at the same place:

[root@armstrong-inf ~]# date; oc get clusterversion
Wed Mar 3 09:58:59 MST 2021
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.17    True        True          7d21h   Working towards 4.7.0: 200 of 668 done (29% complete)

[root@armstrong-inf ~]# date; oc get clusterversion
Wed Mar 3 10:05:03 MST 2021
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.17    True        True          7d21h   Unable to apply 4.7.0: the cluster operator kube-storage-version-migrator has not yet successfully rolled out
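For the record, the workaround above amounted to something like the following (a sketch; the operator pod name is the one from the must-gather, and its hash will differ after each restart):

~~~
# Sketch of the workaround: deleting the ReplicaSets and the operator pod
# makes the Deployment controller recreate them, which restarts the
# operator's sync loop. Deleting a ReplicaSet already removes its pod, so
# the explicit pod delete is belt-and-braces.
oc -n openshift-kube-storage-version-migrator-operator get deploy,rs,pods
oc -n openshift-kube-storage-version-migrator-operator delete rs --all
oc -n openshift-kube-storage-version-migrator-operator delete pod kube-storage-version-migrator-operator-84db77494d-ps9kc
~~~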
The kube-storage-version-migrator operator is the smallest of your problems. It's at the far end of the root-cause chain.

    dial tcp 172.30.0.1:443: i/o timeout

This suggests that networking is broken.
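For context, 172.30.0.1:443 is the ClusterIP of the default `kubernetes` service, i.e. the API server as reached through the service network, so those machine-config-daemon timeouts mean clients on the affected nodes cannot reach the API server via its service IP (which depends on the SDN's per-node programming). A minimal connectivity check (a sketch; the node name is an example taken from the node list above, and any HTTP response, even 403, proves the path works, while a hang reproduces the i/o timeout):

~~~
# Sketch: test whether the API server's service IP is reachable from a node
# whose pods log the i/o timeout. "oc debug node/..." runs a host pod there.
oc debug node/armstrong-compute1.armstrong.scale-ocp.tuc.stglabs.ibm.com \
  -- chroot /host curl -k -m 5 https://172.30.0.1:443/version
~~~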
(In reply to Stefan Schimanski from comment #1)
> Kube-storage-migration-operator is the smallest of your problems. It's on
> the far end of the root cause chain.
>
> dial tcp 172.30.0.1:443: i/o timeout
>
> This suggests that networking is broken.

Thanks, Stefan, for your feedback. Could you tell me how to fix this issue? Thanks in advance for your help.
Just adding a bit more info about Moez's case.

openshift-sdn:
~~~
NAME                   READY   STATUS    RESTARTS   AGE
ovs-455qw              1/1     Running   0          5d
ovs-5dntd              0/1     Running   1544       6d
ovs-8h26z              0/1     Running   1544       5d
ovs-crxx4              0/1     Running   1546       6d
ovs-lft66              0/1     Running   1545       6d
ovs-lwbl9              0/1     Running   1545       7d
ovs-nd5bw              0/1     Running   1544       6d
ovs-nwz7k              0/1     Running   1545       6d
ovs-s5rcz              1/1     Running   0          5d
ovs-trvqd              1/1     Running   0          5d
ovs-v2png              0/1     Running   1544       7d
sdn-2ddbk              2/2     Running   0          7d
sdn-72w2m              1/2     Running   1289       7d
sdn-c2d6h              1/2     Running   1289       7d
sdn-controller-kw9w2   1/1     Running   0          7d
sdn-controller-qj6hd   1/1     Running   0          7d
sdn-controller-rzt4f   1/1     Running   0          7d
sdn-g9phv              1/2     Running   1288       7d
sdn-gsptx              1/2     Running   1289       7d
sdn-h2m76              2/2     Running   0          7d
sdn-m5pht              1/2     Running   1288       7d
sdn-r7bsz              2/2     Running   0          7d
sdn-sg5ml              1/2     Running   1288       7d
sdn-vbj7l              1/2     Running   1288       7d
sdn-wsmvs              1/2     Running   1289       7d
~~~

SDN pods failure message:
"""
2021-03-03T10:08:36.955546539-07:00 I0303 17:08:36.955476 14072 healthcheck.go:42] waiting for OVS to start: dial unix /var/run/openvswitch/db.sock: connect: no such file or directory
2021-03-03T10:08:36.955546539-07:00 F0303 17:08:36.955499 14072 cmd.go:111] Failed to start sdn: node SDN setup failed: timed out waiting for the condition
"""

OVS pods error:
"""
id: openvswitch: no such user
"""

This issue appears to be the one reported in ticket [0], which was linked to a systemd bug ([1]). I have asked the CU to restart one of the failing nodes to see if that solves the issue.

[0] https://bugzilla.redhat.com/show_bug.cgi?id=1887040
[1] https://bugzilla.redhat.com/show_bug.cgi?id=1888017
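For completeness, a sketch of the check behind that request (the node name is an example from the node list above): if systemd-sysusers hit the bug in [1], the openvswitch account will be missing from the host, and a reboot should let it be recreated.

~~~
# Sketch: confirm the "id: openvswitch: no such user" symptom on a failing node.
oc debug node/armstrong-compute5.armstrong.scale-ocp.tuc.stglabs.ibm.com \
  -- chroot /host getent passwd openvswitch
# No output / non-zero exit means the account is missing. Drain the node,
# reboot it from its console, then uncordon and re-check the ovs/sdn pods:
oc adm drain armstrong-compute5.armstrong.scale-ocp.tuc.stglabs.ibm.com --ignore-daemonsets
oc adm uncordon armstrong-compute5.armstrong.scale-ocp.tuc.stglabs.ibm.com
~~~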
*** This bug has been marked as a duplicate of bug 1907353 ***