Bug 1913154
| Summary: | Upgrading to 4.6.10 nightly failed with RHEL worker nodes: Failed to find /dev/disk/by-label/root | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Qin Ping <piqin> |
| Component: | Machine Config Operator | Assignee: | MCO Team <team-mco> |
| Machine Config Operator sub component: | Machine Config Operator | QA Contact: | Rio Liu <rioliu> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aos-bugs, jerzhang, markmc, mifiedle, mkrejci, rioliu, sbhavsar, schoudha, sferguso, sjenning, skumari, smilner, vrutkovs, wking |
| Version: | 4.6 | Keywords: | Upgrades |
| Target Milestone: | --- | | |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-02-24 15:50:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1914892 | | |
Description
Qin Ping
2021-01-06 07:02:12 UTC
I faced the same issue during an upgrade from 4.6.9 to the 4.6.10 nightly candidate 4.6.0-0.nightly-2021-01-05-203053.
The worker Machine Config Pool is in a degraded state (Node ip-10-0-58-74.us-east-2.compute.internal is reporting: "Failed to find /dev/disk/by-label/root") and the node is in SchedulingDisabled state.
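To see the failure directly on the affected worker, the label lookup can be reproduced from a debug shell. A minimal sketch using standard `oc debug` (the node name is the degraded worker shown below; the exact output will differ per node, and this check is only illustrative, not part of the original report):
```
# Open a debug shell on the degraded RHEL worker and list the filesystem
# labels; RHCOS nodes expose /dev/disk/by-label/root, while plain RHEL 7
# workers typically do not.
oc debug node/ip-10-0-58-74.us-east-2.compute.internal -- chroot /host \
  sh -c 'ls -l /dev/disk/by-label/ && findmnt -n -o SOURCE,FSTYPE /'
```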
$ oc get clusterversion
NAME VERSION AVAILABLE PROGRESSING SINCE STATUS
version 4.6.0-0.nightly-2021-01-05-203053 True False 137m Error while reconciling 4.6.0-0.nightly-2021-01-05-203053: an unknown error has occurred: MultipleErrors
$ oc describe clusterversion
Name: version
Namespace:
Labels: <none>
Annotations: <none>
API Version: config.openshift.io/v1
Kind: ClusterVersion
Metadata:
Creation Timestamp: 2021-01-06T03:09:58Z
Generation: 2
Managed Fields:
API Version: config.openshift.io/v1
Fields Type: FieldsV1
fieldsV1:
f:spec:
.:
f:channel:
f:clusterID:
f:upstream:
Manager: cluster-bootstrap
Operation: Update
Time: 2021-01-06T03:09:58Z
API Version: config.openshift.io/v1
Fields Type: FieldsV1
fieldsV1:
f:spec:
f:desiredUpdate:
.:
f:force:
f:image:
f:version:
Manager: oc
Operation: Update
Time: 2021-01-06T06:38:35Z
API Version: config.openshift.io/v1
Fields Type: FieldsV1
fieldsV1:
f:status:
.:
f:availableUpdates:
f:conditions:
f:desired:
.:
f:image:
f:version:
f:history:
f:observedGeneration:
f:versionHash:
Manager: cluster-version-operator
Operation: Update
Time: 2021-01-06T09:51:35Z
Resource Version: 223284
Self Link: /apis/config.openshift.io/v1/clusterversions/version
UID: 339b9843-799e-4339-b01c-a21846b4ded9
Spec:
Channel: stable-4.6
Cluster ID: 298dc0fa-613c-47ce-9a6b-8be820fe6779
Desired Update:
Force: true
Image: registry.ci.openshift.org/ocp/release:4.6.0-0.nightly-2021-01-05-203053
Version:
Upstream: https://api.openshift.com/api/upgrades_info/v1/graph
Status:
Available Updates: <nil>
Conditions:
Last Transition Time: 2021-01-06T03:39:44Z
Message: Done applying 4.6.0-0.nightly-2021-01-05-203053
Status: True
Type: Available
Last Transition Time: 2021-01-06T09:51:35Z
Message: Multiple errors are preventing progress:
* Cluster operator ingress is reporting a failure: Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-6668c6f5b9-cw7tw" cannot be scheduled: 0/5 nodes are available: 2 node(s) were unschedulable, 3 node(s) didn't match node selector. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)
* Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating openshift-state-metrics failed: reconciling openshift-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/openshift-state-metrics: got 1 unavailable replicas
Reason: MultipleErrors
Status: True
Type: Failing
Last Transition Time: 2021-01-06T07:37:05Z
Message: Error while reconciling 4.6.0-0.nightly-2021-01-05-203053: an unknown error has occurred: MultipleErrors
Reason: MultipleErrors
Status: False
Type: Progressing
Last Transition Time: 2021-01-06T06:39:05Z
Message: Unable to retrieve available updates: currently reconciling cluster version 4.6.0-0.nightly-2021-01-05-203053 not found in the "stable-4.6" channel
Reason: VersionNotFound
Status: False
Type: RetrievedUpdates
Desired:
Image: registry.ci.openshift.org/ocp/release:4.6.0-0.nightly-2021-01-05-203053
Version: 4.6.0-0.nightly-2021-01-05-203053
History:
Completion Time: 2021-01-06T07:37:05Z
Image: registry.ci.openshift.org/ocp/release:4.6.0-0.nightly-2021-01-05-203053
Started Time: 2021-01-06T06:38:49Z
State: Completed
Verified: false
Version: 4.6.0-0.nightly-2021-01-05-203053
Completion Time: 2021-01-06T03:39:44Z
Image: quay.io/openshift-release-dev/ocp-release@sha256:43d5c84169a4b3ff307c29d7374f6d69a707de15e9fa90ad352b432f77c0cead
Started Time: 2021-01-06T03:09:58Z
State: Completed
Verified: false
Version: 4.6.9
Observed Generation: 2
Version Hash: KSVUyyU6E5g=
Events: <none>
$ oc get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
ip-10-0-54-171.us-east-2.compute.internal Ready master 6h42m v1.19.0+9c69bdc 10.0.54.171 <none> Red Hat Enterprise Linux CoreOS 46.82.202101042340-0 (Ootpa) 4.18.0-193.37.1.el8_2.x86_64 cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
ip-10-0-56-210.us-east-2.compute.internal Ready master 6h42m v1.19.0+9c69bdc 10.0.56.210 <none> Red Hat Enterprise Linux CoreOS 46.82.202101042340-0 (Ootpa) 4.18.0-193.37.1.el8_2.x86_64 cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
ip-10-0-58-74.us-east-2.compute.internal Ready,SchedulingDisabled worker 5h46m v1.19.0+9c69bdc 10.0.58.74 <none> Red Hat Enterprise Linux Server 7.9 (Maipo) 3.10.0-1160.11.1.el7.x86_64 cri-o://1.19.0-118.rhaos4.6.gitf51f94a.el7
ip-10-0-60-221.us-east-2.compute.internal Ready,SchedulingDisabled worker 5h46m v1.19.0+9c69bdc 10.0.60.221 <none> Red Hat Enterprise Linux Server 7.9 (Maipo) 3.10.0-1160.11.1.el7.x86_64 cri-o://1.19.0-118.rhaos4.6.gitf51f94a.el7
ip-10-0-72-181.us-east-2.compute.internal Ready master 6h43m v1.19.0+9c69bdc 10.0.72.181 <none> Red Hat Enterprise Linux CoreOS 46.82.202101042340-0 (Ootpa) 4.18.0-193.37.1.el8_2.x86_64 cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
$ oc describe node ip-10-0-58-74.us-east-2.compute.internal
Name: ip-10-0-58-74.us-east-2.compute.internal
Roles: worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=m4.xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-east-2
failure-domain.beta.kubernetes.io/zone=us-east-2a
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-10-0-58-74.us-east-2.compute.internal
kubernetes.io/os=linux
node-role.kubernetes.io/worker=
node.kubernetes.io/instance-type=m4.xlarge
node.openshift.io/os_id=rhel
topology.ebs.csi.aws.com/zone=us-east-2a
topology.hostpath.csi/node=ip-10-0-58-74.us-east-2.compute.internal
topology.kubernetes.io/region=us-east-2
topology.kubernetes.io/zone=us-east-2a
Annotations: csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0f8fe5977642b21d8","hostpath.csi.k8s.io":"ip-10-0-58-74.us-east-2.compute.internal"}
k8s.ovn.org/l3-gateway-config:
{"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-58-74.us-east-2.compute.internal","mac-address":"02:bf:45:e0:d0:28","ip-addresse...
k8s.ovn.org/node-chassis-id: eb6efba3-f5c1-444c-ae52-3cdf3591adbd
k8s.ovn.org/node-join-subnets: {"default":"100.64.7.0/29"}
k8s.ovn.org/node-local-nat-ip: {"default":["169.254.10.73"]}
k8s.ovn.org/node-mgmt-port-mac-address: 4e:af:cf:1b:b9:cf
k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.58.74/20"}
k8s.ovn.org/node-subnets: {"default":"10.131.2.0/23"}
machineconfiguration.openshift.io/currentConfig: rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
machineconfiguration.openshift.io/desiredConfig: rendered-worker-98a7d5a6de4b081492fc1ecf58664a0b
machineconfiguration.openshift.io/reason: Failed to find /dev/disk/by-label/root
machineconfiguration.openshift.io/ssh: accessed
machineconfiguration.openshift.io/state: Degraded
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 06 Jan 2021 09:38:54 +0530
Taints: node.kubernetes.io/unschedulable:NoSchedule
Unschedulable: true
Lease:
HolderIdentity: ip-10-0-58-74.us-east-2.compute.internal
AcquireTime: <unset>
RenewTime: Wed, 06 Jan 2021 15:25:00 +0530
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 06 Jan 2021 15:21:09 +0530 Wed, 06 Jan 2021 09:38:54 +0530 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 06 Jan 2021 15:21:09 +0530 Wed, 06 Jan 2021 09:38:54 +0530 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 06 Jan 2021 15:21:09 +0530 Wed, 06 Jan 2021 09:38:54 +0530 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 06 Jan 2021 15:21:09 +0530 Wed, 06 Jan 2021 09:39:54 +0530 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.0.58.74
Hostname: ip-10-0-58-74.us-east-2.compute.internal
InternalDNS: ip-10-0-58-74.us-east-2.compute.internal
Capacity:
attachable-volumes-aws-ebs: 39
cpu: 4
ephemeral-storage: 31444972Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16264968Ki
pods: 250
Allocatable:
attachable-volumes-aws-ebs: 39
cpu: 3500m
ephemeral-storage: 27905944324
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 15113992Ki
pods: 250
System Info:
Machine ID: 0477675021744e3099cacfbb87bf5f86
System UUID: EC233C74-91ED-516D-95DB-C1E02EECF941
Boot ID: 976425f5-4477-496c-b8a9-dc6d6f6b2e3b
Kernel Version: 3.10.0-1160.11.1.el7.x86_64
OS Image: Red Hat Enterprise Linux Server 7.9 (Maipo)
Operating System: linux
Architecture: amd64
Container Runtime Version: cri-o://1.19.0-118.rhaos4.6.gitf51f94a.el7
Kubelet Version: v1.19.0+9c69bdc
Kube-Proxy Version: v1.19.0+9c69bdc
ProviderID: aws:///us-east-2a/i-0f8fe5977642b21d8
Non-terminated Pods: (13 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
node-upgrade hello-daemonset-ml6cb 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4h43m
openshift-cluster-csi-drivers aws-ebs-csi-driver-node-9q2zw 30m (0%) 0 (0%) 150Mi (1%) 0 (0%) 172m
openshift-cluster-node-tuning-operator tuned-b4csb 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 172m
openshift-dns dns-default-2r6vv 65m (1%) 0 (0%) 110Mi (0%) 512Mi (3%) 157m
openshift-image-registry node-ca-h87mb 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 171m
openshift-logging fluentd-qqbpz 100m (2%) 0 (0%) 736Mi (4%) 736Mi (4%) 3h39m
openshift-machine-config-operator machine-config-daemon-5kjgf 40m (1%) 0 (0%) 100Mi (0%) 0 (0%) 155m
openshift-monitoring node-exporter-rd2k8 9m (0%) 0 (0%) 210Mi (1%) 0 (0%) 172m
openshift-multus multus-54xpj 10m (0%) 0 (0%) 150Mi (1%) 0 (0%) 161m
openshift-multus network-metrics-daemon-n7ts8 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 166m
openshift-ovn-kubernetes ovnkube-node-vfbsk 30m (0%) 0 (0%) 620Mi (4%) 0 (0%) 166m
openshift-ovn-kubernetes ovs-node-gqnnb 100m (2%) 0 (0%) 300Mi (2%) 0 (0%) 164m
ui-upgrade hello-daemonset-zmhts 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4h38m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 424m (12%) 0 (0%)
memory 2556Mi (17%) 1248Mi (8%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 5h46m kubelet, ip-10-0-58-74.us-east-2.compute.internal Starting kubelet.
Normal NodeHasSufficientMemory 5h46m (x2 over 5h46m) kubelet, ip-10-0-58-74.us-east-2.compute.internal Node ip-10-0-58-74.us-east-2.compute.internal status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 5h46m (x2 over 5h46m) kubelet, ip-10-0-58-74.us-east-2.compute.internal Node ip-10-0-58-74.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 5h46m (x2 over 5h46m) kubelet, ip-10-0-58-74.us-east-2.compute.internal Node ip-10-0-58-74.us-east-2.compute.internal status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 5h46m kubelet, ip-10-0-58-74.us-east-2.compute.internal Updated Node Allocatable limit across pods
Normal NodeReady 5h45m kubelet, ip-10-0-58-74.us-east-2.compute.internal Node ip-10-0-58-74.us-east-2.compute.internal status is now: NodeReady
Normal NodeNotSchedulable 152m kubelet, ip-10-0-58-74.us-east-2.compute.internal Node ip-10-0-58-74.us-east-2.compute.internal status is now: NodeNotSchedulable
$ oc describe node ip-10-0-60-221.us-east-2.compute.internal
Name: ip-10-0-60-221.us-east-2.compute.internal
Roles: worker
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/instance-type=m4.xlarge
beta.kubernetes.io/os=linux
failure-domain.beta.kubernetes.io/region=us-east-2
failure-domain.beta.kubernetes.io/zone=us-east-2a
kubernetes.io/arch=amd64
kubernetes.io/hostname=ip-10-0-60-221.us-east-2.compute.internal
kubernetes.io/os=linux
node-role.kubernetes.io/worker=
node.kubernetes.io/instance-type=m4.xlarge
node.openshift.io/os_id=rhel
topology.ebs.csi.aws.com/zone=us-east-2a
topology.kubernetes.io/region=us-east-2
topology.kubernetes.io/zone=us-east-2a
Annotations: csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0c99db9b70ca3690e"}
k8s.ovn.org/l3-gateway-config:
{"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-60-221.us-east-2.compute.internal","mac-address":"02:87:6e:17:88:0a","ip-address...
k8s.ovn.org/node-chassis-id: 23feec3d-9307-4f5b-af93-7946ad6ea9dc
k8s.ovn.org/node-join-subnets: {"default":"100.64.6.0/29"}
k8s.ovn.org/node-local-nat-ip: {"default":["169.254.7.183"]}
k8s.ovn.org/node-mgmt-port-mac-address: 1a:5b:9f:89:67:21
k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.60.221/20"}
k8s.ovn.org/node-subnets: {"default":"10.130.2.0/23"}
machineconfiguration.openshift.io/currentConfig: rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
machineconfiguration.openshift.io/desiredConfig: rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
machineconfiguration.openshift.io/ssh: accessed
machineconfiguration.openshift.io/state: Done
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Wed, 06 Jan 2021 09:38:53 +0530
Taints: node.kubernetes.io/unschedulable:NoSchedule
Unschedulable: true
Lease:
HolderIdentity: ip-10-0-60-221.us-east-2.compute.internal
AcquireTime: <unset>
RenewTime: Wed, 06 Jan 2021 15:25:09 +0530
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 06 Jan 2021 15:25:02 +0530 Wed, 06 Jan 2021 09:38:53 +0530 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 06 Jan 2021 15:25:02 +0530 Wed, 06 Jan 2021 09:38:53 +0530 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 06 Jan 2021 15:25:02 +0530 Wed, 06 Jan 2021 09:38:53 +0530 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 06 Jan 2021 15:25:02 +0530 Wed, 06 Jan 2021 09:39:53 +0530 KubeletReady kubelet is posting ready status
Addresses:
InternalIP: 10.0.60.221
Hostname: ip-10-0-60-221.us-east-2.compute.internal
InternalDNS: ip-10-0-60-221.us-east-2.compute.internal
Capacity:
attachable-volumes-aws-ebs: 39
cpu: 4
ephemeral-storage: 31444972Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 16264968Ki
pods: 250
Allocatable:
attachable-volumes-aws-ebs: 39
cpu: 3500m
ephemeral-storage: 27905944324
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 15113992Ki
pods: 250
System Info:
Machine ID: 5baffef7ed054ce59608b61344d680d2
System UUID: EC272FAF-0060-757A-EFD9-5C0EE0E83F3A
Boot ID: 6d17251e-ed98-477e-b164-2314cb3b7487
Kernel Version: 3.10.0-1160.11.1.el7.x86_64
OS Image: Red Hat Enterprise Linux Server 7.9 (Maipo)
Operating System: linux
Architecture: amd64
Container Runtime Version: cri-o://1.19.0-118.rhaos4.6.gitf51f94a.el7
Kubelet Version: v1.19.0+9c69bdc
Kube-Proxy Version: v1.19.0+9c69bdc
ProviderID: aws:///us-east-2a/i-0c99db9b70ca3690e
Non-terminated Pods: (14 in total)
Namespace Name CPU Requests CPU Limits Memory Requests Memory Limits AGE
--------- ---- ------------ ---------- --------------- ------------- ---
node-upgrade hello-daemonset-cxw82 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4h43m
openshift-cluster-csi-drivers aws-ebs-csi-driver-node-879qv 30m (0%) 0 (0%) 150Mi (1%) 0 (0%) 171m
openshift-cluster-node-tuning-operator tuned-qpgqd 10m (0%) 0 (0%) 50Mi (0%) 0 (0%) 171m
openshift-dns dns-default-w5jg9 65m (1%) 0 (0%) 110Mi (0%) 512Mi (3%) 156m
openshift-image-registry node-ca-xvglk 10m (0%) 0 (0%) 10Mi (0%) 0 (0%) 173m
openshift-ingress router-default-6668c6f5b9-ngnkp 100m (2%) 0 (0%) 256Mi (1%) 0 (0%) 173m
openshift-logging fluentd-4hqfx 100m (2%) 0 (0%) 736Mi (4%) 736Mi (4%) 3h39m
openshift-machine-config-operator machine-config-daemon-h79vt 40m (1%) 0 (0%) 100Mi (0%) 0 (0%) 154m
openshift-monitoring node-exporter-zsmmp 9m (0%) 0 (0%) 210Mi (1%) 0 (0%) 172m
openshift-multus multus-p9msd 10m (0%) 0 (0%) 150Mi (1%) 0 (0%) 165m
openshift-multus network-metrics-daemon-bxm2v 20m (0%) 0 (0%) 120Mi (0%) 0 (0%) 166m
openshift-ovn-kubernetes ovnkube-node-2jvgg 30m (0%) 0 (0%) 620Mi (4%) 0 (0%) 165m
openshift-ovn-kubernetes ovs-node-2m2fw 100m (2%) 0 (0%) 300Mi (2%) 0 (0%) 166m
ui-upgrade hello-daemonset-6dmfl 0 (0%) 0 (0%) 0 (0%) 0 (0%) 4h39m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 524m (14%) 0 (0%)
memory 2812Mi (19%) 1248Mi (8%)
ephemeral-storage 0 (0%) 0 (0%)
hugepages-1Gi 0 (0%) 0 (0%)
hugepages-2Mi 0 (0%) 0 (0%)
attachable-volumes-aws-ebs 0 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Starting 5h46m kubelet, ip-10-0-60-221.us-east-2.compute.internal Starting kubelet.
Normal NodeHasSufficientMemory 5h46m (x2 over 5h46m) kubelet, ip-10-0-60-221.us-east-2.compute.internal Node ip-10-0-60-221.us-east-2.compute.internal status is now: NodeHasSufficientMemory
Normal NodeHasNoDiskPressure 5h46m (x2 over 5h46m) kubelet, ip-10-0-60-221.us-east-2.compute.internal Node ip-10-0-60-221.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
Normal NodeHasSufficientPID 5h46m (x2 over 5h46m) kubelet, ip-10-0-60-221.us-east-2.compute.internal Node ip-10-0-60-221.us-east-2.compute.internal status is now: NodeHasSufficientPID
Normal NodeAllocatableEnforced 5h46m kubelet, ip-10-0-60-221.us-east-2.compute.internal Updated Node Allocatable limit across pods
Normal NodeReady 5h45m kubelet, ip-10-0-60-221.us-east-2.compute.internal Node ip-10-0-60-221.us-east-2.compute.internal status is now: NodeReady
Normal NodeNotSchedulable 135m kubelet, ip-10-0-60-221.us-east-2.compute.internal Node ip-10-0-60-221.us-east-2.compute.internal status is now: NodeNotSchedulable
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-7c5ea40d13541de6e0e34d97f04f3c75 True False False 3 3 3 0 6h40m
worker rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f False True True 2 0 0 1 6h40m
$ oc describe mcp worker
Name: worker
Namespace:
Labels: machineconfiguration.openshift.io/mco-built-in=
pools.operator.machineconfiguration.openshift.io/worker=
Annotations: <none>
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfigPool
Metadata:
Creation Timestamp: 2021-01-06T03:15:17Z
Generation: 4
Managed Fields:
API Version: machineconfiguration.openshift.io/v1
Fields Type: FieldsV1
fieldsV1:
f:metadata:
f:labels:
.:
f:machineconfiguration.openshift.io/mco-built-in:
f:pools.operator.machineconfiguration.openshift.io/worker:
f:spec:
.:
f:configuration:
f:machineConfigSelector:
.:
f:matchLabels:
.:
f:machineconfiguration.openshift.io/role:
f:nodeSelector:
.:
f:matchLabels:
.:
f:node-role.kubernetes.io/worker:
f:paused:
Manager: machine-config-operator
Operation: Update
Time: 2021-01-06T03:15:17Z
API Version: machineconfiguration.openshift.io/v1
Fields Type: FieldsV1
fieldsV1:
f:spec:
f:configuration:
f:name:
f:source:
f:status:
.:
f:conditions:
f:configuration:
.:
f:name:
f:source:
f:degradedMachineCount:
f:machineCount:
f:observedGeneration:
f:readyMachineCount:
f:unavailableMachineCount:
f:updatedMachineCount:
Manager: machine-config-controller
Operation: Update
Time: 2021-01-06T07:39:27Z
Resource Version: 172967
Self Link: /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
UID: 5fec402e-46bb-4c50-aecd-8711b08ca381
Spec:
Configuration:
Name: rendered-worker-98a7d5a6de4b081492fc1ecf58664a0b
Source:
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 00-worker
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-worker-container-runtime
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-worker-kubelet
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-worker-fips
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-worker-generated-registries
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-worker-ssh
Machine Config Selector:
Match Labels:
machineconfiguration.openshift.io/role: worker
Node Selector:
Match Labels:
node-role.kubernetes.io/worker:
Paused: false
Status:
Conditions:
Last Transition Time: 2021-01-06T03:16:46Z
Message:
Reason:
Status: False
Type: RenderDegraded
Last Transition Time: 2021-01-06T07:22:41Z
Message:
Reason:
Status: False
Type: Updated
Last Transition Time: 2021-01-06T07:22:41Z
Message: All nodes are updating to rendered-worker-98a7d5a6de4b081492fc1ecf58664a0b
Reason:
Status: True
Type: Updating
Last Transition Time: 2021-01-06T07:24:50Z
Message: Node ip-10-0-58-74.us-east-2.compute.internal is reporting: "Failed to find /dev/disk/by-label/root"
Reason: 1 nodes are reporting degraded status on sync
Status: True
Type: NodeDegraded
Last Transition Time: 2021-01-06T07:24:50Z
Message:
Reason:
Status: True
Type: Degraded
Configuration:
Name: rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
Source:
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 00-worker
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-worker-container-runtime
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 01-worker-kubelet
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-worker-fips
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-worker-generated-registries
API Version: machineconfiguration.openshift.io/v1
Kind: MachineConfig
Name: 99-worker-ssh
Degraded Machine Count: 1
Machine Count: 2
Observed Generation: 4
Ready Machine Count: 0
Unavailable Machine Count: 2
Updated Machine Count: 0
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SetDesiredConfig 6h15m machineconfigcontroller-nodecontroller Targeted node ip-10-0-79-48.us-east-2.compute.internal to config rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
Normal SetDesiredConfig 6h13m machineconfigcontroller-nodecontroller Targeted node ip-10-0-49-14.us-east-2.compute.internal to config rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
Normal SetDesiredConfig 6h7m machineconfigcontroller-nodecontroller Targeted node ip-10-0-61-81.us-east-2.compute.internal to config rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
Normal SetDesiredConfig 153m machineconfigcontroller-nodecontroller Targeted node ip-10-0-58-74.us-east-2.compute.internal to config rendered-worker-98a7d5a6de4b081492fc1ecf58664a0b
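The machineconfiguration.openshift.io/reason annotation above only carries the final message; the machine-config-daemon pod on that node has the full context (the must-gather excerpt in the following comment shows the same "Marking Degraded" line). A minimal sketch for pulling it from the live cluster, assuming the pod name listed in the node description above:
```
# Locate the machine-config-daemon pod scheduled on the degraded node and
# grep its log for the moment the MCD marked the node Degraded.
oc -n openshift-machine-config-operator get pods -o wide | grep ip-10-0-58-74
oc -n openshift-machine-config-operator logs machine-config-daemon-5kjgf \
  -c machine-config-daemon | grep -i 'Marking Degraded'
```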
It looks like a hardware failure, as some files could not be read either:
```
2021-01-05T06:50:14.677517929-05:00 I0105 11:50:14.677495 28223 update.go:1220] Removed stale systemd dropin "/etc/systemd/system/ovs-vswitchd.service.d/10-ovs-vswitchd-restart.conf"
2021-01-05T06:50:14.677517929-05:00 I0105 11:50:14.677510 28223 update.go:1282] /etc/systemd/system/multi-user.target.wants/ovs-vswitchd.service was not present. No need to remove
2021-01-05T06:50:14.677570805-05:00 W0105 11:50:14.677537 28223 update.go:1247] unable to delete /etc/systemd/system/ovs-vswitchd.service: remove /etc/systemd/system/ovs-vswitchd.service: no such file or directory
2021-01-05T06:50:14.677570805-05:00 I0105 11:50:14.677547 28223 update.go:1249] Removed stale systemd unit "/etc/systemd/system/ovs-vswitchd.service"
2021-01-05T06:50:14.677617303-05:00 I0105 11:50:14.677604 28223 update.go:1220] Removed stale systemd dropin "/etc/systemd/system/ovsdb-server.service.d/10-ovsdb-restart.conf"
2021-01-05T06:50:14.677637757-05:00 I0105 11:50:14.677624 28223 update.go:1282] /etc/systemd/system/multi-user.target.wants/ovsdb-server.service was not present. No need to remove
2021-01-05T06:50:14.677659570-05:00 W0105 11:50:14.677647 28223 update.go:1247] unable to delete /etc/systemd/system/ovsdb-server.service: remove /etc/systemd/system/ovsdb-server.service: no such file or directory
2021-01-05T06:50:14.677668010-05:00 I0105 11:50:14.677657 28223 update.go:1249] Removed stale systemd unit "/etc/systemd/system/ovsdb-server.service"
2021-01-05T06:50:14.677741683-05:00 I0105 11:50:14.677715 28223 update.go:1220] Removed stale systemd dropin "/etc/systemd/system/zincati.service.d/mco-disabled.conf"
2021-01-05T06:50:14.677750052-05:00 I0105 11:50:14.677742 28223 update.go:1282] /etc/systemd/system/multi-user.target.wants/zincati.service was not present. No need to remove
2021-01-05T06:50:14.677777386-05:00 W0105 11:50:14.677766 28223 update.go:1247] unable to delete /etc/systemd/system/zincati.service: remove /etc/systemd/system/zincati.service: no such file or directory
2021-01-05T06:50:14.677785656-05:00 I0105 11:50:14.677776 28223 update.go:1249] Removed stale systemd unit "/etc/systemd/system/zincati.service"
2021-01-05T06:50:14.677795291-05:00 E0105 11:50:14.677789 28223 writer.go:135] Marking Degraded due to: Failed to find /dev/disk/by-label/root
2021-01-05T06:50:14.681392397-05:00 E0105 11:50:14.680646 28223 token_source.go:152] Unable to rotate token: failed to read token file "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
```
from quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-a5e086d7c7605d24b00926fd569dbebfe9580e547b2e3f1d48a719bfd19a5049/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-hqx4f/machine-config-daemon/machine-config-daemon/logs in the must-gather.

Reassigning to MCO.

From discussion over Slack, we think this could be a regression from https://github.com/openshift/machine-config-operator/pull/2251. We have seen a regression in 4.7 on RHEL 7 worker nodes with a different error message: https://bugzilla.redhat.com/show_bug.cgi?id=1909943

I hit this today trying to upgrade 4.6.6 to a 4.6 scratch build generated from the 4.6 branch + https://github.com/openshift/machine-config-operator/pull/2321.
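For context on why only the RHEL 7 workers report this: RHCOS labels its root filesystem `root`, so /dev/disk/by-label/root resolves there, while plain RHEL 7 workers typically carry no such label. A minimal shell sketch of the kind of fallback the daemon needs (illustrative only; the actual fix lives in the MCO's Go code and may differ):
```
# Prefer the by-label symlink (present on RHCOS); if it is missing, fall
# back to the device currently mounted at /, which also works on RHEL 7.
if [ -e /dev/disk/by-label/root ]; then
    rootdev=$(readlink -f /dev/disk/by-label/root)
else
    rootdev=$(findmnt -n -o SOURCE /)
fi
echo "root device: ${rootdev}"
```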
Verified on 4.7.0-0.nightly-2021-01-12-150634. Upgraded 4.6.10 to 4.7.0-0.nightly-2021-01-12-150634 with RHEL7 workers. Needed to work around these BZs, which also affect RHEL7 compute nodes, for the verification:

https://bugzilla.redhat.com/show_bug.cgi?id=1913582
  Workaround: edit /etc/os-release and set VERSION_ID="7"

https://bugzilla.redhat.com/show_bug.cgi?id=1913536
  Workaround: rm /etc/systemd/system/multi-user.target.wants/openvswitch.service
              systemctl enable openvswitch.service

Moving back to ON_QA state. Looking at the original BZ filed in 4.6, the problem is that the upgrade succeeds but the MCP is degraded. I need to re-verify that the MCP is not in a degraded state after the upgrade.

A clarifying note: the proposed fix will fix "Failed to find /dev/disk/by-label/root" on RHEL nodes. This error does NOT block upgrades (workers are not considered in the success/fail criteria of upgrades, only the control plane is), so this fix will NOT fix Qin's original issue of a failing upgrade.

Updated verification. There were two scenarios that needed to be tested to verify this fix. The first is an upgrade from 4.6.10 -> 4.7.0-0.nightly-2021-01-13-054018, i.e. an upgrade from a clean 4.6.10 with no degraded MCP. Verified that the MCP does not go degraded when upgrading to 4.7 and that the upgrade is successful. The second test is 4.6.6 -> 4.6.10 -> 4.6.11 -> 4.7.0-0.nightly-2021-01-13-054018, an upgrade through several 4.6.z versions to introduce the degraded MCP. Verified that upgrading to 4.7 is successful and the degraded MCP is repaired with no intervention from the user.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475