Description of problem:
One of our upgrade CI builds failed. Its upgrade path is original_build=4.2.0-0.nightly-2020-12-21-150827, target_build=4.3.0-0.nightly-2020-12-21-145308,4.4.0-0.nightly-2020-12-21-142921,4.5.0-0.nightly-2020-12-21-141644,4.6.0-0.nightly-2020-12-21-163117,4.7.0-0.nightly-2020-12-21-131655. The cluster is a disconnected UPI on bare metal with RHCOS and RHEL 7.9 workers (FIPS off). The CI build failed when upgrading to 4.3.0-0.nightly-2020-12-21-145308, but the must-gather log for that upgrade is lost. We rebuilt the CI job; this time the upgrade failed at 4.6.0-0.nightly-2020-12-21-163117, and one RHEL worker node was stuck in Ready,SchedulingDisabled. Checking that RHEL worker node shows:

01-05 20:18:44 Annotations: machineconfiguration.openshift.io/currentConfig: rendered-worker-a3464aade61c26dd5dbc13ea8e918edf
01-05 20:18:44              machineconfiguration.openshift.io/desiredConfig: rendered-worker-e11043037b5133d8413f70c41dc97cec
01-05 20:18:44              machineconfiguration.openshift.io/reason: Failed to find /dev/disk/by-label/root
01-05 20:18:44              machineconfiguration.openshift.io/ssh: accessed
01-05 20:18:44              machineconfiguration.openshift.io/state: Degraded
01-05 20:18:44              volumes.kubernetes.io/controller-managed-attach-detach: true

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
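The annotation combination above is enough to classify the node's state on its own: `state: Degraded` together with `currentConfig != desiredConfig` means the MCD failed partway through applying the new rendered config. A minimal sketch of that check, with the annotation values copied from the output above (the script itself is illustrative, not part of any OpenShift tooling):

```shell
#!/bin/sh
# Annotation values copied from the degraded worker above.
current="rendered-worker-a3464aade61c26dd5dbc13ea8e918edf"
desired="rendered-worker-e11043037b5133d8413f70c41dc97cec"
state="Degraded"
reason="Failed to find /dev/disk/by-label/root"

# Degraded while still on the old rendered config => the update never applied.
verdict=ok
if [ "$state" = "Degraded" ] && [ "$current" != "$desired" ]; then
    verdict="stuck mid-update: $reason"
fi
echo "$verdict"
```

On a live cluster the same values would come from `oc get node <name> -o yaml` under `.metadata.annotations`.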
I faced the same issue during an upgrade from 4.6.9 to the 4.6.10 nightly candidate 4.6.0-0.nightly-2021-01-05-203053. The worker MachineConfigPool is in a degraded state, node ip-10-0-58-74.us-east-2.compute.internal is reporting "Failed to find /dev/disk/by-label/root", and the node is in SchedulingDisabled state.

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-01-05-203053   True        False         137m    Error while reconciling 4.6.0-0.nightly-2021-01-05-203053: an unknown error has occurred: MultipleErrors

$ oc describe clusterversion
Name:         version
Namespace:
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterVersion
Metadata:
  Creation Timestamp:  2021-01-06T03:09:58Z
  Generation:          2
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:channel:
        f:clusterID:
        f:upstream:
    Manager:      cluster-bootstrap
    Operation:    Update
    Time:         2021-01-06T03:09:58Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:desiredUpdate:
          .:
          f:force:
          f:image:
          f:version:
    Manager:      oc
    Operation:    Update
    Time:         2021-01-06T06:38:35Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:availableUpdates:
        f:conditions:
        f:desired:
          .:
          f:image:
          f:version:
        f:history:
        f:observedGeneration:
        f:versionHash:
    Manager:         cluster-version-operator
    Operation:       Update
    Time:            2021-01-06T09:51:35Z
  Resource Version:  223284
  Self Link:         /apis/config.openshift.io/v1/clusterversions/version
  UID:               339b9843-799e-4339-b01c-a21846b4ded9
Spec:
  Channel:     stable-4.6
  Cluster ID:  298dc0fa-613c-47ce-9a6b-8be820fe6779
  Desired Update:
    Force:    true
    Image:    registry.ci.openshift.org/ocp/release:4.6.0-0.nightly-2021-01-05-203053
    Version:
  Upstream:  https://api.openshift.com/api/upgrades_info/v1/graph
Status:
  Available Updates:  <nil>
  Conditions:
    Last Transition Time:  2021-01-06T03:39:44Z
    Message:               Done applying 4.6.0-0.nightly-2021-01-05-203053
    Status:                True
    Type:                  Available
    Last Transition Time:  2021-01-06T09:51:35Z
    Message:               Multiple errors are preventing progress:
* Cluster operator ingress is reporting a failure: Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-6668c6f5b9-cw7tw" cannot be scheduled: 0/5 nodes are available: 2 node(s) were unschedulable, 3 node(s) didn't match node selector. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)
* Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating openshift-state-metrics failed: reconciling openshift-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/openshift-state-metrics: got 1 unavailable replicas
    Reason:                MultipleErrors
    Status:                True
    Type:                  Failing
    Last Transition Time:  2021-01-06T07:37:05Z
    Message:               Error while reconciling 4.6.0-0.nightly-2021-01-05-203053: an unknown error has occurred: MultipleErrors
    Reason:                MultipleErrors
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2021-01-06T06:39:05Z
    Message:               Unable to retrieve available updates: currently reconciling cluster version 4.6.0-0.nightly-2021-01-05-203053 not found in the "stable-4.6" channel
    Reason:                VersionNotFound
    Status:                False
    Type:                  RetrievedUpdates
  Desired:
    Image:    registry.ci.openshift.org/ocp/release:4.6.0-0.nightly-2021-01-05-203053
    Version:  4.6.0-0.nightly-2021-01-05-203053
  History:
    Completion Time:    2021-01-06T07:37:05Z
    Image:              registry.ci.openshift.org/ocp/release:4.6.0-0.nightly-2021-01-05-203053
    Started Time:       2021-01-06T06:38:49Z
    State:              Completed
    Verified:           false
    Version:            4.6.0-0.nightly-2021-01-05-203053
    Completion Time:    2021-01-06T03:39:44Z
    Image:              quay.io/openshift-release-dev/ocp-release@sha256:43d5c84169a4b3ff307c29d7374f6d69a707de15e9fa90ad352b432f77c0cead
    Started Time:       2021-01-06T03:09:58Z
    State:              Completed
    Verified:           false
    Version:            4.6.9
  Observed Generation:  2
  Version Hash:         KSVUyyU6E5g=
Events:                 <none>

$ oc get nodes -o wide
NAME                                        STATUS                     ROLES    AGE     VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-54-171.us-east-2.compute.internal   Ready                      master   6h42m   v1.19.0+9c69bdc   10.0.54.171   <none>        Red Hat Enterprise Linux CoreOS 46.82.202101042340-0 (Ootpa)   4.18.0-193.37.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
ip-10-0-56-210.us-east-2.compute.internal   Ready                      master   6h42m   v1.19.0+9c69bdc   10.0.56.210   <none>        Red Hat Enterprise Linux CoreOS 46.82.202101042340-0 (Ootpa)   4.18.0-193.37.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
ip-10-0-58-74.us-east-2.compute.internal    Ready,SchedulingDisabled   worker   5h46m   v1.19.0+9c69bdc   10.0.58.74    <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.11.1.el7.x86_64    cri-o://1.19.0-118.rhaos4.6.gitf51f94a.el7
ip-10-0-60-221.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   5h46m   v1.19.0+9c69bdc   10.0.60.221   <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.11.1.el7.x86_64    cri-o://1.19.0-118.rhaos4.6.gitf51f94a.el7
ip-10-0-72-181.us-east-2.compute.internal   Ready                      master   6h43m   v1.19.0+9c69bdc   10.0.72.181   <none>        Red Hat Enterprise Linux CoreOS 46.82.202101042340-0 (Ootpa)   4.18.0-193.37.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8

$ oc describe node ip-10-0-58-74.us-east-2.compute.internal
Name:               ip-10-0-58-74.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m4.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-58-74.us-east-2.compute.internal
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m4.xlarge
                    node.openshift.io/os_id=rhel
                    topology.ebs.csi.aws.com/zone=us-east-2a
                    topology.hostpath.csi/node=ip-10-0-58-74.us-east-2.compute.internal
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2a
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0f8fe5977642b21d8","hostpath.csi.k8s.io":"ip-10-0-58-74.us-east-2.compute.internal"}
                    k8s.ovn.org/l3-gateway-config: {"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-58-74.us-east-2.compute.internal","mac-address":"02:bf:45:e0:d0:28","ip-addresse...
                    k8s.ovn.org/node-chassis-id: eb6efba3-f5c1-444c-ae52-3cdf3591adbd
                    k8s.ovn.org/node-join-subnets: {"default":"100.64.7.0/29"}
                    k8s.ovn.org/node-local-nat-ip: {"default":["169.254.10.73"]}
                    k8s.ovn.org/node-mgmt-port-mac-address: 4e:af:cf:1b:b9:cf
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.58.74/20"}
                    k8s.ovn.org/node-subnets: {"default":"10.131.2.0/23"}
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-98a7d5a6de4b081492fc1ecf58664a0b
                    machineconfiguration.openshift.io/reason: Failed to find /dev/disk/by-label/root
                    machineconfiguration.openshift.io/ssh: accessed
                    machineconfiguration.openshift.io/state: Degraded
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 06 Jan 2021 09:38:54 +0530
Taints:             node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  ip-10-0-58-74.us-east-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Wed, 06 Jan 2021 15:25:00 +0530
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 06 Jan 2021 15:21:09 +0530   Wed, 06 Jan 2021 09:38:54 +0530   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 06 Jan 2021 15:21:09 +0530   Wed, 06 Jan 2021 09:38:54 +0530   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 06 Jan 2021 15:21:09 +0530   Wed, 06 Jan 2021 09:38:54 +0530   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 06 Jan 2021 15:21:09 +0530   Wed, 06 Jan 2021 09:39:54 +0530   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.0.58.74
  Hostname:     ip-10-0-58-74.us-east-2.compute.internal
  InternalDNS:  ip-10-0-58-74.us-east-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         4
  ephemeral-storage:           31444972Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      16264968Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         3500m
  ephemeral-storage:           27905944324
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      15113992Ki
  pods:                        250
System Info:
  Machine ID:                 0477675021744e3099cacfbb87bf5f86
  System UUID:                EC233C74-91ED-516D-95DB-C1E02EECF941
  Boot ID:                    976425f5-4477-496c-b8a9-dc6d6f6b2e3b
  Kernel Version:             3.10.0-1160.11.1.el7.x86_64
  OS Image:                   Red Hat Enterprise Linux Server 7.9 (Maipo)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.19.0-118.rhaos4.6.gitf51f94a.el7
  Kubelet Version:            v1.19.0+9c69bdc
  Kube-Proxy Version:         v1.19.0+9c69bdc
ProviderID:                   aws:///us-east-2a/i-0f8fe5977642b21d8
Non-terminated Pods:          (13 in total)
  Namespace                               Name                            CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                            ------------  ----------  ---------------  -------------  ---
  node-upgrade                            hello-daemonset-ml6cb           0 (0%)        0 (0%)      0 (0%)           0 (0%)         4h43m
  openshift-cluster-csi-drivers           aws-ebs-csi-driver-node-9q2zw   30m (0%)      0 (0%)      150Mi (1%)       0 (0%)         172m
  openshift-cluster-node-tuning-operator  tuned-b4csb                     10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         172m
  openshift-dns                           dns-default-2r6vv               65m (1%)      0 (0%)      110Mi (0%)       512Mi (3%)     157m
  openshift-image-registry                node-ca-h87mb                   10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         171m
  openshift-logging                       fluentd-qqbpz                   100m (2%)     0 (0%)      736Mi (4%)       736Mi (4%)     3h39m
  openshift-machine-config-operator       machine-config-daemon-5kjgf     40m (1%)      0 (0%)      100Mi (0%)       0 (0%)         155m
  openshift-monitoring                    node-exporter-rd2k8             9m (0%)       0 (0%)      210Mi (1%)       0 (0%)         172m
  openshift-multus                        multus-54xpj                    10m (0%)      0 (0%)      150Mi (1%)       0 (0%)         161m
  openshift-multus                        network-metrics-daemon-n7ts8    20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         166m
  openshift-ovn-kubernetes                ovnkube-node-vfbsk              30m (0%)      0 (0%)      620Mi (4%)       0 (0%)         166m
  openshift-ovn-kubernetes                ovs-node-gqnnb                  100m (2%)     0 (0%)      300Mi (2%)       0 (0%)         164m
  ui-upgrade                              hello-daemonset-zmhts           0 (0%)        0 (0%)      0 (0%)           0 (0%)         4h38m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         424m (12%)    0 (0%)
  memory                      2556Mi (17%)  1248Mi (8%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:
  Type    Reason                   Age                    From                                               Message
  ----    ------                   ----                   ----                                               -------
  Normal  Starting                 5h46m                  kubelet, ip-10-0-58-74.us-east-2.compute.internal  Starting kubelet.
  Normal  NodeHasSufficientMemory  5h46m (x2 over 5h46m)  kubelet, ip-10-0-58-74.us-east-2.compute.internal  Node ip-10-0-58-74.us-east-2.compute.internal status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    5h46m (x2 over 5h46m)  kubelet, ip-10-0-58-74.us-east-2.compute.internal  Node ip-10-0-58-74.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     5h46m (x2 over 5h46m)  kubelet, ip-10-0-58-74.us-east-2.compute.internal  Node ip-10-0-58-74.us-east-2.compute.internal status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  5h46m                  kubelet, ip-10-0-58-74.us-east-2.compute.internal  Updated Node Allocatable limit across pods
  Normal  NodeReady                5h45m                  kubelet, ip-10-0-58-74.us-east-2.compute.internal  Node ip-10-0-58-74.us-east-2.compute.internal status is now: NodeReady
  Normal  NodeNotSchedulable       152m                   kubelet, ip-10-0-58-74.us-east-2.compute.internal  Node ip-10-0-58-74.us-east-2.compute.internal status is now: NodeNotSchedulable

$ oc describe node ip-10-0-60-221.us-east-2.compute.internal
Name:
                    ip-10-0-60-221.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m4.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-60-221.us-east-2.compute.internal
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m4.xlarge
                    node.openshift.io/os_id=rhel
                    topology.ebs.csi.aws.com/zone=us-east-2a
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2a
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0c99db9b70ca3690e"}
                    k8s.ovn.org/l3-gateway-config: {"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-60-221.us-east-2.compute.internal","mac-address":"02:87:6e:17:88:0a","ip-address...
                    k8s.ovn.org/node-chassis-id: 23feec3d-9307-4f5b-af93-7946ad6ea9dc
                    k8s.ovn.org/node-join-subnets: {"default":"100.64.6.0/29"}
                    k8s.ovn.org/node-local-nat-ip: {"default":["169.254.7.183"]}
                    k8s.ovn.org/node-mgmt-port-mac-address: 1a:5b:9f:89:67:21
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.60.221/20"}
                    k8s.ovn.org/node-subnets: {"default":"10.130.2.0/23"}
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
                    machineconfiguration.openshift.io/ssh: accessed
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 06 Jan 2021 09:38:53 +0530
Taints:             node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  ip-10-0-60-221.us-east-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Wed, 06 Jan 2021 15:25:09 +0530
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 06 Jan 2021 15:25:02 +0530   Wed, 06 Jan 2021 09:38:53 +0530   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 06 Jan 2021 15:25:02 +0530   Wed, 06 Jan 2021 09:38:53 +0530   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 06 Jan 2021 15:25:02 +0530   Wed, 06 Jan 2021 09:38:53 +0530   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 06 Jan 2021 15:25:02 +0530   Wed, 06 Jan 2021 09:39:53 +0530   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.0.60.221
  Hostname:     ip-10-0-60-221.us-east-2.compute.internal
  InternalDNS:  ip-10-0-60-221.us-east-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         4
  ephemeral-storage:           31444972Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      16264968Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         3500m
  ephemeral-storage:           27905944324
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      15113992Ki
  pods:                        250
System Info:
  Machine ID:                 5baffef7ed054ce59608b61344d680d2
  System UUID:                EC272FAF-0060-757A-EFD9-5C0EE0E83F3A
  Boot ID:                    6d17251e-ed98-477e-b164-2314cb3b7487
  Kernel Version:             3.10.0-1160.11.1.el7.x86_64
  OS Image:                   Red Hat Enterprise Linux Server 7.9 (Maipo)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.19.0-118.rhaos4.6.gitf51f94a.el7
  Kubelet Version:            v1.19.0+9c69bdc
  Kube-Proxy Version:         v1.19.0+9c69bdc
ProviderID:                   aws:///us-east-2a/i-0c99db9b70ca3690e
Non-terminated Pods:          (14 in total)
  Namespace                               Name                              CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                              ------------  ----------  ---------------  -------------  ---
  node-upgrade                            hello-daemonset-cxw82             0 (0%)        0 (0%)      0 (0%)           0 (0%)         4h43m
  openshift-cluster-csi-drivers           aws-ebs-csi-driver-node-879qv     30m (0%)      0 (0%)      150Mi (1%)       0 (0%)         171m
  openshift-cluster-node-tuning-operator  tuned-qpgqd                       10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         171m
  openshift-dns                           dns-default-w5jg9                 65m (1%)      0 (0%)      110Mi (0%)       512Mi (3%)     156m
  openshift-image-registry                node-ca-xvglk                     10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         173m
  openshift-ingress                       router-default-6668c6f5b9-ngnkp   100m (2%)     0 (0%)      256Mi (1%)       0 (0%)         173m
  openshift-logging                       fluentd-4hqfx                     100m (2%)     0 (0%)      736Mi (4%)       736Mi (4%)     3h39m
  openshift-machine-config-operator       machine-config-daemon-h79vt       40m (1%)      0 (0%)      100Mi (0%)       0 (0%)         154m
  openshift-monitoring                    node-exporter-zsmmp               9m (0%)       0 (0%)      210Mi (1%)       0 (0%)         172m
  openshift-multus                        multus-p9msd                      10m (0%)      0 (0%)      150Mi (1%)       0 (0%)         165m
  openshift-multus                        network-metrics-daemon-bxm2v      20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         166m
  openshift-ovn-kubernetes                ovnkube-node-2jvgg                30m (0%)      0 (0%)      620Mi (4%)       0 (0%)         165m
  openshift-ovn-kubernetes                ovs-node-2m2fw                    100m (2%)     0 (0%)      300Mi (2%)       0 (0%)         166m
  ui-upgrade                              hello-daemonset-6dmfl             0 (0%)        0 (0%)      0 (0%)           0 (0%)         4h39m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         524m (14%)    0 (0%)
  memory                      2812Mi (19%)  1248Mi (8%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:
  Type    Reason                   Age                    From                                                Message
  ----    ------                   ----                   ----                                                -------
  Normal  Starting                 5h46m                  kubelet, ip-10-0-60-221.us-east-2.compute.internal  Starting kubelet.
  Normal  NodeHasSufficientMemory  5h46m (x2 over 5h46m)  kubelet, ip-10-0-60-221.us-east-2.compute.internal  Node ip-10-0-60-221.us-east-2.compute.internal status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    5h46m (x2 over 5h46m)  kubelet, ip-10-0-60-221.us-east-2.compute.internal  Node ip-10-0-60-221.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     5h46m (x2 over 5h46m)  kubelet, ip-10-0-60-221.us-east-2.compute.internal  Node ip-10-0-60-221.us-east-2.compute.internal status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  5h46m                  kubelet, ip-10-0-60-221.us-east-2.compute.internal  Updated Node Allocatable limit across pods
  Normal  NodeReady                5h45m                  kubelet, ip-10-0-60-221.us-east-2.compute.internal  Node ip-10-0-60-221.us-east-2.compute.internal status is now: NodeReady
  Normal  NodeNotSchedulable       135m                   kubelet, ip-10-0-60-221.us-east-2.compute.internal  Node ip-10-0-60-221.us-east-2.compute.internal status is now: NodeNotSchedulable

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-7c5ea40d13541de6e0e34d97f04f3c75   True      False      False      3              3                   3                     0                      6h40m
worker   rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f   False     True       True       2              0                   0                     1                      6h40m

$ oc describe mcp worker
Name:         worker
Namespace:
Labels:       machineconfiguration.openshift.io/mco-built-in=
              pools.operator.machineconfiguration.openshift.io/worker=
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
Metadata:
  Creation Timestamp:  2021-01-06T03:15:17Z
  Generation:          4
  Managed Fields:
    API Version:  machineconfiguration.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:machineconfiguration.openshift.io/mco-built-in:
          f:pools.operator.machineconfiguration.openshift.io/worker:
      f:spec:
        .:
        f:configuration:
        f:machineConfigSelector:
          .:
          f:matchLabels:
            .:
            f:machineconfiguration.openshift.io/role:
        f:nodeSelector:
          .:
          f:matchLabels:
            .:
            f:node-role.kubernetes.io/worker:
        f:paused:
    Manager:      machine-config-operator
    Operation:    Update
    Time:         2021-01-06T03:15:17Z
    API Version:  machineconfiguration.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:configuration:
          f:name:
          f:source:
      f:status:
        .:
        f:conditions:
        f:configuration:
          .:
          f:name:
          f:source:
        f:degradedMachineCount:
        f:machineCount:
        f:observedGeneration:
        f:readyMachineCount:
        f:unavailableMachineCount:
        f:updatedMachineCount:
    Manager:         machine-config-controller
    Operation:       Update
    Time:            2021-01-06T07:39:27Z
  Resource Version:  172967
  Self Link:         /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
  UID:               5fec402e-46bb-4c50-aecd-8711b08ca381
Spec:
  Configuration:
    Name:  rendered-worker-98a7d5a6de4b081492fc1ecf58664a0b
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-worker
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-fips
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-generated-registries
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-ssh
  Machine Config Selector:
    Match Labels:
      machineconfiguration.openshift.io/role:  worker
  Node Selector:
    Match Labels:
      node-role.kubernetes.io/worker:
  Paused:  false
Status:
  Conditions:
    Last Transition Time:  2021-01-06T03:16:46Z
    Message:
    Reason:
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2021-01-06T07:22:41Z
    Message:
    Reason:
    Status:                False
    Type:                  Updated
    Last Transition Time:  2021-01-06T07:22:41Z
    Message:               All nodes are updating to rendered-worker-98a7d5a6de4b081492fc1ecf58664a0b
    Reason:
    Status:                True
    Type:                  Updating
    Last Transition Time:  2021-01-06T07:24:50Z
    Message:               Node ip-10-0-58-74.us-east-2.compute.internal is reporting: "Failed to find /dev/disk/by-label/root"
    Reason:                1 nodes are reporting degraded status on sync
    Status:                True
    Type:                  NodeDegraded
    Last Transition Time:  2021-01-06T07:24:50Z
    Message:
    Reason:
    Status:                True
    Type:                  Degraded
  Configuration:
    Name:  rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-worker
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-fips
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-generated-registries
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-ssh
  Degraded Machine Count:     1
  Machine Count:              2
  Observed Generation:        4
  Ready Machine Count:        0
  Unavailable Machine Count:  2
  Updated Machine Count:      0
Events:
  Type    Reason            Age    From                                    Message
  ----    ------            ----   ----                                    -------
  Normal  SetDesiredConfig  6h15m  machineconfigcontroller-nodecontroller  Targeted node ip-10-0-79-48.us-east-2.compute.internal to config rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
  Normal  SetDesiredConfig  6h13m  machineconfigcontroller-nodecontroller  Targeted node ip-10-0-49-14.us-east-2.compute.internal to config rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
  Normal  SetDesiredConfig  6h7m   machineconfigcontroller-nodecontroller  Targeted node ip-10-0-61-81.us-east-2.compute.internal to config rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
  Normal  SetDesiredConfig  153m   machineconfigcontroller-nodecontroller  Targeted node ip-10-0-58-74.us-east-2.compute.internal to config rendered-worker-98a7d5a6de4b081492fc1ecf58664a0b
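The pool's degraded count of 1 matches what the per-node annotations show: only ip-10-0-58-74 was targeted at the new rendered config and is degraded, while ip-10-0-60-221 still sits at the old config with state Done. A small sketch that reconstructs that view from the annotation values quoted above (the `classify` helper is illustrative, not an OpenShift tool; on a live cluster the same data would come from the node annotations):

```shell
#!/bin/sh
# classify NAME CURRENT DESIRED STATE
# Reports whether a node is on its desired rendered config or mid-transition.
classify() {
    if [ "$2" = "$3" ]; then
        echo "$1: up to date with $2 (state=$4)"
    else
        echo "$1: updating $2 -> $3 (state=$4)"
    fi
}

# Values copied from the two node describes above.
classify ip-10-0-58-74  rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f rendered-worker-98a7d5a6de4b081492fc1ecf58664a0b Degraded
classify ip-10-0-60-221 rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f Done
```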
This looks like a hardware failure, as some files could not be fetched either:

```
2021-01-05T06:50:14.677517929-05:00 I0105 11:50:14.677495 28223 update.go:1220] Removed stale systemd dropin "/etc/systemd/system/ovs-vswitchd.service.d/10-ovs-vswitchd-restart.conf"
2021-01-05T06:50:14.677517929-05:00 I0105 11:50:14.677510 28223 update.go:1282] /etc/systemd/system/multi-user.target.wants/ovs-vswitchd.service was not present. No need to remove
2021-01-05T06:50:14.677570805-05:00 W0105 11:50:14.677537 28223 update.go:1247] unable to delete /etc/systemd/system/ovs-vswitchd.service: remove /etc/systemd/system/ovs-vswitchd.service: no such file or directory
2021-01-05T06:50:14.677570805-05:00 I0105 11:50:14.677547 28223 update.go:1249] Removed stale systemd unit "/etc/systemd/system/ovs-vswitchd.service"
2021-01-05T06:50:14.677617303-05:00 I0105 11:50:14.677604 28223 update.go:1220] Removed stale systemd dropin "/etc/systemd/system/ovsdb-server.service.d/10-ovsdb-restart.conf"
2021-01-05T06:50:14.677637757-05:00 I0105 11:50:14.677624 28223 update.go:1282] /etc/systemd/system/multi-user.target.wants/ovsdb-server.service was not present. No need to remove
2021-01-05T06:50:14.677659570-05:00 W0105 11:50:14.677647 28223 update.go:1247] unable to delete /etc/systemd/system/ovsdb-server.service: remove /etc/systemd/system/ovsdb-server.service: no such file or directory
2021-01-05T06:50:14.677668010-05:00 I0105 11:50:14.677657 28223 update.go:1249] Removed stale systemd unit "/etc/systemd/system/ovsdb-server.service"
2021-01-05T06:50:14.677741683-05:00 I0105 11:50:14.677715 28223 update.go:1220] Removed stale systemd dropin "/etc/systemd/system/zincati.service.d/mco-disabled.conf"
2021-01-05T06:50:14.677750052-05:00 I0105 11:50:14.677742 28223 update.go:1282] /etc/systemd/system/multi-user.target.wants/zincati.service was not present. No need to remove
2021-01-05T06:50:14.677777386-05:00 W0105 11:50:14.677766 28223 update.go:1247] unable to delete /etc/systemd/system/zincati.service: remove /etc/systemd/system/zincati.service: no such file or directory
2021-01-05T06:50:14.677785656-05:00 I0105 11:50:14.677776 28223 update.go:1249] Removed stale systemd unit "/etc/systemd/system/zincati.service"
2021-01-05T06:50:14.677795291-05:00 E0105 11:50:14.677789 28223 writer.go:135] Marking Degraded due to: Failed to find /dev/disk/by-label/root
2021-01-05T06:50:14.681392397-05:00 E0105 11:50:14.680646 28223 token_source.go:152] Unable to rotate token: failed to read token file "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
```

These logs are from quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-a5e086d7c7605d24b00926fd569dbebfe9580e547b2e3f1d48a719bfd19a5049/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-hqx4f/machine-config-daemon/machine-config-daemon/logs in the must-gather.

Reassigning to MCO.
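The "Marking Degraded due to:" line is the quickest thing to search for when triaging one of these must-gathers. A sketch of that extraction (the directory layout below is a tiny stand-in built inside the script so it runs anywhere; the log line is the one from this bug, and the real path would be the must-gather tree quoted above):

```shell
#!/bin/sh
# Build a minimal fake must-gather tree so the extraction can be demonstrated.
mg=$(mktemp -d)
mkdir -p "$mg/namespaces/openshift-machine-config-operator/pods/mcd/logs"
cat > "$mg/namespaces/openshift-machine-config-operator/pods/mcd/logs/current.log" <<'EOF'
E0105 11:50:14.677789 28223 writer.go:135] Marking Degraded due to: Failed to find /dev/disk/by-label/root
EOF

# Pull out every distinct degraded reason the MCD logged.
reasons=$(grep -rh 'Marking Degraded due to' "$mg" | sed 's/.*due to: //' | sort -u)
echo "$reasons"
rm -rf "$mg"
```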
From a discussion over Slack, we think this could be a regression from https://github.com/openshift/machine-config-operator/pull/2251. We have seen a regression in 4.7 on RHEL 7 worker nodes with a different error message: https://bugzilla.redhat.com/show_bug.cgi?id=1909943
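If that PR is indeed the trigger, the mechanism would be a root-device lookup keyed on the filesystem label `root`: RHCOS labels its root filesystem, so `/dev/disk/by-label/root` exists there, while a standard RHEL 7 install carries no such label, which matches the error on only the RHEL workers. A hedged illustration of that failure mode (the helper function and scratch directory are ours, not the MCO's actual code):

```shell
#!/bin/sh
# find_root_by_label DIR: succeed iff DIR/root exists, mimicking a lookup
# under /dev/disk/by-label. Illustrative only.
find_root_by_label() {
    [ -e "$1/root" ] && echo "$1/root"
}

labels=$(mktemp -d)

# Plain RHEL 7: no "root" filesystem label, so the lookup fails.
rhel_result=$(find_root_by_label "$labels" || echo "Failed to find $labels/root")

# RHCOS: the root filesystem is labeled "root", so the lookup succeeds.
touch "$labels/root"
rhcos_result=$(find_root_by_label "$labels")

echo "RHEL 7: $rhel_result"
echo "RHCOS:  $rhcos_result"
rm -rf "$labels"
```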
I hit this today while trying to upgrade 4.6.6 to a 4.6 scratch build generated from the 4.6 branch plus https://github.com/openshift/machine-config-operator/pull/2321
Verified on 4.7.0-0.nightly-2021-01-12-150634. Upgraded 4.6.10 to 4.7.0-0.nightly-2021-01-12-150634 with a RHEL7 worker.

Needed to work around these BZs, which also affect RHEL7 compute nodes, for the verification:

https://bugzilla.redhat.com/show_bug.cgi?id=1913582
Workaround: edit /etc/os-release and set VERSION_ID="7"

https://bugzilla.redhat.com/show_bug.cgi?id=1913536
Workaround:
  rm /etc/systemd/system/multi-user.target.wants/openvswitch.service
  systemctl enable openvswitch.service
Moving back to ON_QA state. Looking at the original BZ filed against 4.6, the problem is that the upgrade succeeds but the MCP is degraded. I need to re-verify that the MCP is not in a degraded state after the upgrade.
A clarifying note: the proposed fix will fix the "Failed to find /dev/disk/by-label/root" error on RHEL nodes. This error does NOT block upgrades (workers are not considered in the success/fail criteria of an upgrade; only the control plane is), so this fix will NOT fix Qin's original issue of a failing upgrade.
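This is exactly the combination visible earlier in the bug: clusterversion history shows the update as Completed while the worker pool independently reports Degraded. A tiny sketch of that coexistence, with the status values copied from the outputs above (the check itself is illustrative, not how the CVO evaluates anything):

```shell
#!/bin/sh
# Status values as reported earlier in this bug.
cvo_history_state="Completed"   # clusterversion .status.history[0].state
worker_pool_degraded="True"     # worker MachineConfigPool Degraded condition
degraded_machine_count=1

# The CVO's success criteria do not include worker pools, so both can hold at once.
msg=""
if [ "$cvo_history_state" = "Completed" ] && [ "$worker_pool_degraded" = "True" ]; then
    msg="upgrade reported success, but $degraded_machine_count worker machine(s) are degraded"
fi
echo "$msg"
```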
Updated verification. There were two scenarios that needed to be tested to verify this fix. The first is an upgrade from 4.6.10 -> 4.7.0-0.nightly-2021-01-13-054018. This is an upgrade from a clean 4.6.10 with no degraded MCP. Verified that the MCP does not go degraded when upgrading to 4.7 and the upgrade is successful. The second test is from 4.6.6 -> 4.6.10 -> 4.6.11 -> 4.7.0-0.nightly-2021-01-13-054018. This is an upgrade through various 4.6.z versions to introduce the degraded MCP. Verified that upgrading to 4.7 is successful and the degraded MCP is fixed with no intervention from the user.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:5633
Removing UpgradeBlocker from this older bug to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again. [1]: https://github.com/openshift/enhancements/pull/475