Bug 1913154 - Upgrading to 4.6.10 nightly failed with RHEL worker nodes: Failed to find /dev/disk/by-label/root
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: MCO Team
QA Contact: Rio Liu
URL:
Whiteboard:
Depends On:
Blocks: 1914892
 
Reported: 2021-01-06 07:02 UTC by Qin Ping
Modified: 2021-11-03 06:24 UTC
CC: 14 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-02-24 15:50:15 UTC
Target Upstream Version:
Embargoed:




Links
GitHub openshift/machine-config-operator pull 2325 (closed): Bug 1913154: update.go: only set BFQ scheduler for masters (last updated 2021-02-17 14:27:26 UTC)
Red Hat Product Errata RHSA-2020:5633 (last updated 2021-02-24 15:50:37 UTC)

Description Qin Ping 2021-01-06 07:02:12 UTC
Description of problem:
One of our upgrade CI builds failed. Its upgrade path was original_build=4.2.0-0.nightly-2020-12-21-150827, target_build=4.3.0-0.nightly-2020-12-21-145308,4.4.0-0.nightly-2020-12-21-142921,4.5.0-0.nightly-2020-12-21-141644,4.6.0-0.nightly-2020-12-21-163117,4.7.0-0.nightly-2020-12-21-131655, and the cluster is a disconnected UPI-on-baremetal cluster with RHCOS and RHEL 7.9 workers (FIPS off). That CI build failed when upgrading to 4.3.0-0.nightly-2020-12-21-145308, but the must-gather log for that upgrade was lost.

We rebuilt this CI job; this time the upgrade failed at 4.6.0-0.nightly-2020-12-21-163117, with one RHEL worker node in Ready,SchedulingDisabled.
Checking the RHEL worker node shows:
01-05 20:18:44  Annotations:        machineconfiguration.openshift.io/currentConfig: rendered-worker-a3464aade61c26dd5dbc13ea8e918edf
01-05 20:18:44                      machineconfiguration.openshift.io/desiredConfig: rendered-worker-e11043037b5133d8413f70c41dc97cec
01-05 20:18:44                      machineconfiguration.openshift.io/reason: Failed to find /dev/disk/by-label/root
01-05 20:18:44                      machineconfiguration.openshift.io/ssh: accessed
01-05 20:18:44                      machineconfiguration.openshift.io/state: Degraded
01-05 20:18:44                      volumes.kubernetes.io/controller-managed-attach-detach: true
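
A quick way to confirm why the daemon cannot resolve the root device is to compare the by-label symlinks on an RHCOS node and a RHEL worker (illustrative commands, not taken from the CI logs; RHCOS labels its root filesystem "root", while a RHEL 7 worker installed outside the RHCOS pipeline typically has no such label):

$ ls -l /dev/disk/by-label/       # RHCOS: contains "root -> ../../<root partition>"; RHEL 7 worker: no "root" entry
$ findmnt -n -o SOURCE,LABEL /    # shows the actual root device and its label (empty on the RHEL worker)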


Comment 2 Sunil Choudhary 2021-01-06 10:02:21 UTC
I faced the same issue during an upgrade from 4.6.9 to the 4.6.10 nightly candidate 4.6.0-0.nightly-2021-01-05-203053.

Machine Config Pool worker is in a degraded state, with node ip-10-0-58-74.us-east-2.compute.internal reporting: "Failed to find /dev/disk/by-label/root", and the node is SchedulingDisabled.


$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2021-01-05-203053   True        False         137m    Error while reconciling 4.6.0-0.nightly-2021-01-05-203053: an unknown error has occurred: MultipleErrors

$ oc describe clusterversion
Name:         version
Namespace:    
Labels:       <none>
Annotations:  <none>
API Version:  config.openshift.io/v1
Kind:         ClusterVersion
Metadata:
  Creation Timestamp:  2021-01-06T03:09:58Z
  Generation:          2
  Managed Fields:
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        .:
        f:channel:
        f:clusterID:
        f:upstream:
    Manager:      cluster-bootstrap
    Operation:    Update
    Time:         2021-01-06T03:09:58Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:desiredUpdate:
          .:
          f:force:
          f:image:
          f:version:
    Manager:      oc
    Operation:    Update
    Time:         2021-01-06T06:38:35Z
    API Version:  config.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:availableUpdates:
        f:conditions:
        f:desired:
          .:
          f:image:
          f:version:
        f:history:
        f:observedGeneration:
        f:versionHash:
    Manager:         cluster-version-operator
    Operation:       Update
    Time:            2021-01-06T09:51:35Z
  Resource Version:  223284
  Self Link:         /apis/config.openshift.io/v1/clusterversions/version
  UID:               339b9843-799e-4339-b01c-a21846b4ded9
Spec:
  Channel:     stable-4.6
  Cluster ID:  298dc0fa-613c-47ce-9a6b-8be820fe6779
  Desired Update:
    Force:    true
    Image:    registry.ci.openshift.org/ocp/release:4.6.0-0.nightly-2021-01-05-203053
    Version:  
  Upstream:   https://api.openshift.com/api/upgrades_info/v1/graph
Status:
  Available Updates:  <nil>
  Conditions:
    Last Transition Time:  2021-01-06T03:39:44Z
    Message:               Done applying 4.6.0-0.nightly-2021-01-05-203053
    Status:                True
    Type:                  Available
    Last Transition Time:  2021-01-06T09:51:35Z
    Message:               Multiple errors are preventing progress:
* Cluster operator ingress is reporting a failure: Some ingresscontrollers are degraded: ingresscontroller "default" is degraded: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-6668c6f5b9-cw7tw" cannot be scheduled: 0/5 nodes are available: 2 node(s) were unschedulable, 3 node(s) didn't match node selector. Make sure you have sufficient worker nodes.), DeploymentReplicasAllAvailable=False (DeploymentReplicasNotAvailable: 1/2 of replicas are available)
* Cluster operator monitoring is reporting a failure: Failed to rollout the stack. Error: running task Updating openshift-state-metrics failed: reconciling openshift-state-metrics Deployment failed: updating Deployment object failed: waiting for DeploymentRollout of openshift-monitoring/openshift-state-metrics: got 1 unavailable replicas
    Reason:                MultipleErrors
    Status:                True
    Type:                  Failing
    Last Transition Time:  2021-01-06T07:37:05Z
    Message:               Error while reconciling 4.6.0-0.nightly-2021-01-05-203053: an unknown error has occurred: MultipleErrors
    Reason:                MultipleErrors
    Status:                False
    Type:                  Progressing
    Last Transition Time:  2021-01-06T06:39:05Z
    Message:               Unable to retrieve available updates: currently reconciling cluster version 4.6.0-0.nightly-2021-01-05-203053 not found in the "stable-4.6" channel
    Reason:                VersionNotFound
    Status:                False
    Type:                  RetrievedUpdates
  Desired:
    Image:    registry.ci.openshift.org/ocp/release:4.6.0-0.nightly-2021-01-05-203053
    Version:  4.6.0-0.nightly-2021-01-05-203053
  History:
    Completion Time:    2021-01-06T07:37:05Z
    Image:              registry.ci.openshift.org/ocp/release:4.6.0-0.nightly-2021-01-05-203053
    Started Time:       2021-01-06T06:38:49Z
    State:              Completed
    Verified:           false
    Version:            4.6.0-0.nightly-2021-01-05-203053
    Completion Time:    2021-01-06T03:39:44Z
    Image:              quay.io/openshift-release-dev/ocp-release@sha256:43d5c84169a4b3ff307c29d7374f6d69a707de15e9fa90ad352b432f77c0cead
    Started Time:       2021-01-06T03:09:58Z
    State:              Completed
    Verified:           false
    Version:            4.6.9
  Observed Generation:  2
  Version Hash:         KSVUyyU6E5g=
Events:                 <none>


$ oc get nodes -o wide
NAME                                        STATUS                     ROLES    AGE     VERSION           INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                                                       KERNEL-VERSION                 CONTAINER-RUNTIME
ip-10-0-54-171.us-east-2.compute.internal   Ready                      master   6h42m   v1.19.0+9c69bdc   10.0.54.171   <none>        Red Hat Enterprise Linux CoreOS 46.82.202101042340-0 (Ootpa)   4.18.0-193.37.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
ip-10-0-56-210.us-east-2.compute.internal   Ready                      master   6h42m   v1.19.0+9c69bdc   10.0.56.210   <none>        Red Hat Enterprise Linux CoreOS 46.82.202101042340-0 (Ootpa)   4.18.0-193.37.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8
ip-10-0-58-74.us-east-2.compute.internal    Ready,SchedulingDisabled   worker   5h46m   v1.19.0+9c69bdc   10.0.58.74    <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.11.1.el7.x86_64    cri-o://1.19.0-118.rhaos4.6.gitf51f94a.el7
ip-10-0-60-221.us-east-2.compute.internal   Ready,SchedulingDisabled   worker   5h46m   v1.19.0+9c69bdc   10.0.60.221   <none>        Red Hat Enterprise Linux Server 7.9 (Maipo)                    3.10.0-1160.11.1.el7.x86_64    cri-o://1.19.0-118.rhaos4.6.gitf51f94a.el7
ip-10-0-72-181.us-east-2.compute.internal   Ready                      master   6h43m   v1.19.0+9c69bdc   10.0.72.181   <none>        Red Hat Enterprise Linux CoreOS 46.82.202101042340-0 (Ootpa)   4.18.0-193.37.1.el8_2.x86_64   cri-o://1.19.1-2.rhaos4.6.git2af9ecf.el8


$ oc describe node ip-10-0-58-74.us-east-2.compute.internal
Name:               ip-10-0-58-74.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m4.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-58-74.us-east-2.compute.internal
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m4.xlarge
                    node.openshift.io/os_id=rhel
                    topology.ebs.csi.aws.com/zone=us-east-2a
                    topology.hostpath.csi/node=ip-10-0-58-74.us-east-2.compute.internal
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2a
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0f8fe5977642b21d8","hostpath.csi.k8s.io":"ip-10-0-58-74.us-east-2.compute.internal"}
                    k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-58-74.us-east-2.compute.internal","mac-address":"02:bf:45:e0:d0:28","ip-addresse...
                    k8s.ovn.org/node-chassis-id: eb6efba3-f5c1-444c-ae52-3cdf3591adbd
                    k8s.ovn.org/node-join-subnets: {"default":"100.64.7.0/29"}
                    k8s.ovn.org/node-local-nat-ip: {"default":["169.254.10.73"]}
                    k8s.ovn.org/node-mgmt-port-mac-address: 4e:af:cf:1b:b9:cf
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.58.74/20"}
                    k8s.ovn.org/node-subnets: {"default":"10.131.2.0/23"}
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-98a7d5a6de4b081492fc1ecf58664a0b
                    machineconfiguration.openshift.io/reason: Failed to find /dev/disk/by-label/root
                    machineconfiguration.openshift.io/ssh: accessed
                    machineconfiguration.openshift.io/state: Degraded
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 06 Jan 2021 09:38:54 +0530
Taints:             node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  ip-10-0-58-74.us-east-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Wed, 06 Jan 2021 15:25:00 +0530
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 06 Jan 2021 15:21:09 +0530   Wed, 06 Jan 2021 09:38:54 +0530   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 06 Jan 2021 15:21:09 +0530   Wed, 06 Jan 2021 09:38:54 +0530   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 06 Jan 2021 15:21:09 +0530   Wed, 06 Jan 2021 09:38:54 +0530   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 06 Jan 2021 15:21:09 +0530   Wed, 06 Jan 2021 09:39:54 +0530   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.0.58.74
  Hostname:     ip-10-0-58-74.us-east-2.compute.internal
  InternalDNS:  ip-10-0-58-74.us-east-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         4
  ephemeral-storage:           31444972Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      16264968Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         3500m
  ephemeral-storage:           27905944324
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      15113992Ki
  pods:                        250
System Info:
  Machine ID:                             0477675021744e3099cacfbb87bf5f86
  System UUID:                            EC233C74-91ED-516D-95DB-C1E02EECF941
  Boot ID:                                976425f5-4477-496c-b8a9-dc6d6f6b2e3b
  Kernel Version:                         3.10.0-1160.11.1.el7.x86_64
  OS Image:                               Red Hat Enterprise Linux Server 7.9 (Maipo)
  Operating System:                       linux
  Architecture:                           amd64
  Container Runtime Version:              cri-o://1.19.0-118.rhaos4.6.gitf51f94a.el7
  Kubelet Version:                        v1.19.0+9c69bdc
  Kube-Proxy Version:                     v1.19.0+9c69bdc
ProviderID:                               aws:///us-east-2a/i-0f8fe5977642b21d8
Non-terminated Pods:                      (13 in total)
  Namespace                               Name                             CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                             ------------  ----------  ---------------  -------------  ---
  node-upgrade                            hello-daemonset-ml6cb            0 (0%)        0 (0%)      0 (0%)           0 (0%)         4h43m
  openshift-cluster-csi-drivers           aws-ebs-csi-driver-node-9q2zw    30m (0%)      0 (0%)      150Mi (1%)       0 (0%)         172m
  openshift-cluster-node-tuning-operator  tuned-b4csb                      10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         172m
  openshift-dns                           dns-default-2r6vv                65m (1%)      0 (0%)      110Mi (0%)       512Mi (3%)     157m
  openshift-image-registry                node-ca-h87mb                    10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         171m
  openshift-logging                       fluentd-qqbpz                    100m (2%)     0 (0%)      736Mi (4%)       736Mi (4%)     3h39m
  openshift-machine-config-operator       machine-config-daemon-5kjgf      40m (1%)      0 (0%)      100Mi (0%)       0 (0%)         155m
  openshift-monitoring                    node-exporter-rd2k8              9m (0%)       0 (0%)      210Mi (1%)       0 (0%)         172m
  openshift-multus                        multus-54xpj                     10m (0%)      0 (0%)      150Mi (1%)       0 (0%)         161m
  openshift-multus                        network-metrics-daemon-n7ts8     20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         166m
  openshift-ovn-kubernetes                ovnkube-node-vfbsk               30m (0%)      0 (0%)      620Mi (4%)       0 (0%)         166m
  openshift-ovn-kubernetes                ovs-node-gqnnb                   100m (2%)     0 (0%)      300Mi (2%)       0 (0%)         164m
  ui-upgrade                              hello-daemonset-zmhts            0 (0%)        0 (0%)      0 (0%)           0 (0%)         4h38m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         424m (12%)    0 (0%)
  memory                      2556Mi (17%)  1248Mi (8%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:
  Type    Reason                   Age                    From                                               Message
  ----    ------                   ----                   ----                                               -------
  Normal  Starting                 5h46m                  kubelet, ip-10-0-58-74.us-east-2.compute.internal  Starting kubelet.
  Normal  NodeHasSufficientMemory  5h46m (x2 over 5h46m)  kubelet, ip-10-0-58-74.us-east-2.compute.internal  Node ip-10-0-58-74.us-east-2.compute.internal status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    5h46m (x2 over 5h46m)  kubelet, ip-10-0-58-74.us-east-2.compute.internal  Node ip-10-0-58-74.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     5h46m (x2 over 5h46m)  kubelet, ip-10-0-58-74.us-east-2.compute.internal  Node ip-10-0-58-74.us-east-2.compute.internal status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  5h46m                  kubelet, ip-10-0-58-74.us-east-2.compute.internal  Updated Node Allocatable limit across pods
  Normal  NodeReady                5h45m                  kubelet, ip-10-0-58-74.us-east-2.compute.internal  Node ip-10-0-58-74.us-east-2.compute.internal status is now: NodeReady
  Normal  NodeNotSchedulable       152m                   kubelet, ip-10-0-58-74.us-east-2.compute.internal  Node ip-10-0-58-74.us-east-2.compute.internal status is now: NodeNotSchedulable


$ oc describe node ip-10-0-60-221.us-east-2.compute.internal
Name:               ip-10-0-60-221.us-east-2.compute.internal
Roles:              worker
Labels:             beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/instance-type=m4.xlarge
                    beta.kubernetes.io/os=linux
                    failure-domain.beta.kubernetes.io/region=us-east-2
                    failure-domain.beta.kubernetes.io/zone=us-east-2a
                    kubernetes.io/arch=amd64
                    kubernetes.io/hostname=ip-10-0-60-221.us-east-2.compute.internal
                    kubernetes.io/os=linux
                    node-role.kubernetes.io/worker=
                    node.kubernetes.io/instance-type=m4.xlarge
                    node.openshift.io/os_id=rhel
                    topology.ebs.csi.aws.com/zone=us-east-2a
                    topology.kubernetes.io/region=us-east-2
                    topology.kubernetes.io/zone=us-east-2a
Annotations:        csi.volume.kubernetes.io/nodeid: {"ebs.csi.aws.com":"i-0c99db9b70ca3690e"}
                    k8s.ovn.org/l3-gateway-config:
                      {"default":{"mode":"shared","interface-id":"br-ex_ip-10-0-60-221.us-east-2.compute.internal","mac-address":"02:87:6e:17:88:0a","ip-address...
                    k8s.ovn.org/node-chassis-id: 23feec3d-9307-4f5b-af93-7946ad6ea9dc
                    k8s.ovn.org/node-join-subnets: {"default":"100.64.6.0/29"}
                    k8s.ovn.org/node-local-nat-ip: {"default":["169.254.7.183"]}
                    k8s.ovn.org/node-mgmt-port-mac-address: 1a:5b:9f:89:67:21
                    k8s.ovn.org/node-primary-ifaddr: {"ipv4":"10.0.60.221/20"}
                    k8s.ovn.org/node-subnets: {"default":"10.130.2.0/23"}
                    machineconfiguration.openshift.io/currentConfig: rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
                    machineconfiguration.openshift.io/desiredConfig: rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
                    machineconfiguration.openshift.io/ssh: accessed
                    machineconfiguration.openshift.io/state: Done
                    volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp:  Wed, 06 Jan 2021 09:38:53 +0530
Taints:             node.kubernetes.io/unschedulable:NoSchedule
Unschedulable:      true
Lease:
  HolderIdentity:  ip-10-0-60-221.us-east-2.compute.internal
  AcquireTime:     <unset>
  RenewTime:       Wed, 06 Jan 2021 15:25:09 +0530
Conditions:
  Type             Status  LastHeartbeatTime                 LastTransitionTime                Reason                       Message
  ----             ------  -----------------                 ------------------                ------                       -------
  MemoryPressure   False   Wed, 06 Jan 2021 15:25:02 +0530   Wed, 06 Jan 2021 09:38:53 +0530   KubeletHasSufficientMemory   kubelet has sufficient memory available
  DiskPressure     False   Wed, 06 Jan 2021 15:25:02 +0530   Wed, 06 Jan 2021 09:38:53 +0530   KubeletHasNoDiskPressure     kubelet has no disk pressure
  PIDPressure      False   Wed, 06 Jan 2021 15:25:02 +0530   Wed, 06 Jan 2021 09:38:53 +0530   KubeletHasSufficientPID      kubelet has sufficient PID available
  Ready            True    Wed, 06 Jan 2021 15:25:02 +0530   Wed, 06 Jan 2021 09:39:53 +0530   KubeletReady                 kubelet is posting ready status
Addresses:
  InternalIP:   10.0.60.221
  Hostname:     ip-10-0-60-221.us-east-2.compute.internal
  InternalDNS:  ip-10-0-60-221.us-east-2.compute.internal
Capacity:
  attachable-volumes-aws-ebs:  39
  cpu:                         4
  ephemeral-storage:           31444972Ki
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      16264968Ki
  pods:                        250
Allocatable:
  attachable-volumes-aws-ebs:  39
  cpu:                         3500m
  ephemeral-storage:           27905944324
  hugepages-1Gi:               0
  hugepages-2Mi:               0
  memory:                      15113992Ki
  pods:                        250
System Info:
  Machine ID:                             5baffef7ed054ce59608b61344d680d2
  System UUID:                            EC272FAF-0060-757A-EFD9-5C0EE0E83F3A
  Boot ID:                                6d17251e-ed98-477e-b164-2314cb3b7487
  Kernel Version:                         3.10.0-1160.11.1.el7.x86_64
  OS Image:                               Red Hat Enterprise Linux Server 7.9 (Maipo)
  Operating System:                       linux
  Architecture:                           amd64
  Container Runtime Version:              cri-o://1.19.0-118.rhaos4.6.gitf51f94a.el7
  Kubelet Version:                        v1.19.0+9c69bdc
  Kube-Proxy Version:                     v1.19.0+9c69bdc
ProviderID:                               aws:///us-east-2a/i-0c99db9b70ca3690e
Non-terminated Pods:                      (14 in total)
  Namespace                               Name                               CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------                               ----                               ------------  ----------  ---------------  -------------  ---
  node-upgrade                            hello-daemonset-cxw82              0 (0%)        0 (0%)      0 (0%)           0 (0%)         4h43m
  openshift-cluster-csi-drivers           aws-ebs-csi-driver-node-879qv      30m (0%)      0 (0%)      150Mi (1%)       0 (0%)         171m
  openshift-cluster-node-tuning-operator  tuned-qpgqd                        10m (0%)      0 (0%)      50Mi (0%)        0 (0%)         171m
  openshift-dns                           dns-default-w5jg9                  65m (1%)      0 (0%)      110Mi (0%)       512Mi (3%)     156m
  openshift-image-registry                node-ca-xvglk                      10m (0%)      0 (0%)      10Mi (0%)        0 (0%)         173m
  openshift-ingress                       router-default-6668c6f5b9-ngnkp    100m (2%)     0 (0%)      256Mi (1%)       0 (0%)         173m
  openshift-logging                       fluentd-4hqfx                      100m (2%)     0 (0%)      736Mi (4%)       736Mi (4%)     3h39m
  openshift-machine-config-operator       machine-config-daemon-h79vt        40m (1%)      0 (0%)      100Mi (0%)       0 (0%)         154m
  openshift-monitoring                    node-exporter-zsmmp                9m (0%)       0 (0%)      210Mi (1%)       0 (0%)         172m
  openshift-multus                        multus-p9msd                       10m (0%)      0 (0%)      150Mi (1%)       0 (0%)         165m
  openshift-multus                        network-metrics-daemon-bxm2v       20m (0%)      0 (0%)      120Mi (0%)       0 (0%)         166m
  openshift-ovn-kubernetes                ovnkube-node-2jvgg                 30m (0%)      0 (0%)      620Mi (4%)       0 (0%)         165m
  openshift-ovn-kubernetes                ovs-node-2m2fw                     100m (2%)     0 (0%)      300Mi (2%)       0 (0%)         166m
  ui-upgrade                              hello-daemonset-6dmfl              0 (0%)        0 (0%)      0 (0%)           0 (0%)         4h39m
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource                    Requests      Limits
  --------                    --------      ------
  cpu                         524m (14%)    0 (0%)
  memory                      2812Mi (19%)  1248Mi (8%)
  ephemeral-storage           0 (0%)        0 (0%)
  hugepages-1Gi               0 (0%)        0 (0%)
  hugepages-2Mi               0 (0%)        0 (0%)
  attachable-volumes-aws-ebs  0             0
Events:
  Type    Reason                   Age                    From                                                Message
  ----    ------                   ----                   ----                                                -------
  Normal  Starting                 5h46m                  kubelet, ip-10-0-60-221.us-east-2.compute.internal  Starting kubelet.
  Normal  NodeHasSufficientMemory  5h46m (x2 over 5h46m)  kubelet, ip-10-0-60-221.us-east-2.compute.internal  Node ip-10-0-60-221.us-east-2.compute.internal status is now: NodeHasSufficientMemory
  Normal  NodeHasNoDiskPressure    5h46m (x2 over 5h46m)  kubelet, ip-10-0-60-221.us-east-2.compute.internal  Node ip-10-0-60-221.us-east-2.compute.internal status is now: NodeHasNoDiskPressure
  Normal  NodeHasSufficientPID     5h46m (x2 over 5h46m)  kubelet, ip-10-0-60-221.us-east-2.compute.internal  Node ip-10-0-60-221.us-east-2.compute.internal status is now: NodeHasSufficientPID
  Normal  NodeAllocatableEnforced  5h46m                  kubelet, ip-10-0-60-221.us-east-2.compute.internal  Updated Node Allocatable limit across pods
  Normal  NodeReady                5h45m                  kubelet, ip-10-0-60-221.us-east-2.compute.internal  Node ip-10-0-60-221.us-east-2.compute.internal status is now: NodeReady
  Normal  NodeNotSchedulable       135m                   kubelet, ip-10-0-60-221.us-east-2.compute.internal  Node ip-10-0-60-221.us-east-2.compute.internal status is now: NodeNotSchedulable



$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master   rendered-master-7c5ea40d13541de6e0e34d97f04f3c75   True      False      False      3              3                   3                     0                      6h40m
worker   rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f   False     True       True       2              0                   0                     1                      6h40m


$ oc describe mcp worker
Name:         worker
Namespace:    
Labels:       machineconfiguration.openshift.io/mco-built-in=
              pools.operator.machineconfiguration.openshift.io/worker=
Annotations:  <none>
API Version:  machineconfiguration.openshift.io/v1
Kind:         MachineConfigPool
Metadata:
  Creation Timestamp:  2021-01-06T03:15:17Z
  Generation:          4
  Managed Fields:
    API Version:  machineconfiguration.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .:
          f:machineconfiguration.openshift.io/mco-built-in:
          f:pools.operator.machineconfiguration.openshift.io/worker:
      f:spec:
        .:
        f:configuration:
        f:machineConfigSelector:
          .:
          f:matchLabels:
            .:
            f:machineconfiguration.openshift.io/role:
        f:nodeSelector:
          .:
          f:matchLabels:
            .:
            f:node-role.kubernetes.io/worker:
        f:paused:
    Manager:      machine-config-operator
    Operation:    Update
    Time:         2021-01-06T03:15:17Z
    API Version:  machineconfiguration.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:spec:
        f:configuration:
          f:name:
          f:source:
      f:status:
        .:
        f:conditions:
        f:configuration:
          .:
          f:name:
          f:source:
        f:degradedMachineCount:
        f:machineCount:
        f:observedGeneration:
        f:readyMachineCount:
        f:unavailableMachineCount:
        f:updatedMachineCount:
    Manager:         machine-config-controller
    Operation:       Update
    Time:            2021-01-06T07:39:27Z
  Resource Version:  172967
  Self Link:         /apis/machineconfiguration.openshift.io/v1/machineconfigpools/worker
  UID:               5fec402e-46bb-4c50-aecd-8711b08ca381
Spec:
  Configuration:
    Name:  rendered-worker-98a7d5a6de4b081492fc1ecf58664a0b
    Source:
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         00-worker
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-container-runtime
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         01-worker-kubelet
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-fips
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-generated-registries
      API Version:  machineconfiguration.openshift.io/v1
      Kind:         MachineConfig
      Name:         99-worker-ssh
  Machine Config Selector:
    Match Labels:
      machineconfiguration.openshift.io/role:  worker
  Node Selector:
    Match Labels:
      node-role.kubernetes.io/worker:  
  Paused:                              false
Status:
  Conditions:
    Last Transition Time:  2021-01-06T03:16:46Z
    Message:               
    Reason:                
    Status:                False
    Type:                  RenderDegraded
    Last Transition Time:  2021-01-06T07:22:41Z
    Message:               
    Reason:                
    Status:                False
    Type:                  Updated
    Last Transition Time:  2021-01-06T07:22:41Z
    Message:               All nodes are updating to rendered-worker-98a7d5a6de4b081492fc1ecf58664a0b
    Reason:                
    Status:                True
    Type:                  Updating
    Last Transition Time:  2021-01-06T07:24:50Z
    Message:               Node ip-10-0-58-74.us-east-2.compute.internal is reporting: "Failed to find /dev/disk/by-label/root"
    Reason:                1 nodes are reporting degraded status on sync
    Status:                True
    Type:                  NodeDegraded
    Last Transition Time:  2021-01-06T07:24:50Z
    Message:               
    Reason:                
    Status:                True
    Type:                  Degraded
  Configuration:
    Name:  rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
    Source:
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   00-worker
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   01-worker-container-runtime
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   01-worker-kubelet
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   99-worker-fips
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   99-worker-generated-registries
      API Version:            machineconfiguration.openshift.io/v1
      Kind:                   MachineConfig
      Name:                   99-worker-ssh
  Degraded Machine Count:     1
  Machine Count:              2
  Observed Generation:        4
  Ready Machine Count:        0
  Unavailable Machine Count:  2
  Updated Machine Count:      0
Events:
  Type    Reason            Age    From                                    Message
  ----    ------            ----   ----                                    -------
  Normal  SetDesiredConfig  6h15m  machineconfigcontroller-nodecontroller  Targeted node ip-10-0-79-48.us-east-2.compute.internal to config rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
  Normal  SetDesiredConfig  6h13m  machineconfigcontroller-nodecontroller  Targeted node ip-10-0-49-14.us-east-2.compute.internal to config rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
  Normal  SetDesiredConfig  6h7m   machineconfigcontroller-nodecontroller  Targeted node ip-10-0-61-81.us-east-2.compute.internal to config rendered-worker-d3fda1858c3d1a5b2545c70f14801d7f
  Normal  SetDesiredConfig  153m   machineconfigcontroller-nodecontroller  Targeted node ip-10-0-58-74.us-east-2.compute.internal to config rendered-worker-98a7d5a6de4b081492fc1ecf58664a0b

Comment 3 Vadim Rutkovsky 2021-01-06 11:59:02 UTC
It looks like a hardware failure, as some files cannot be fetched either:
```
2021-01-05T06:50:14.677517929-05:00 I0105 11:50:14.677495   28223 update.go:1220] Removed stale systemd dropin "/etc/systemd/system/ovs-vswitchd.service.d/10-ovs-vswitchd-restart.conf"
2021-01-05T06:50:14.677517929-05:00 I0105 11:50:14.677510   28223 update.go:1282] /etc/systemd/system/multi-user.target.wants/ovs-vswitchd.service was not present. No need to remove
2021-01-05T06:50:14.677570805-05:00 W0105 11:50:14.677537   28223 update.go:1247] unable to delete /etc/systemd/system/ovs-vswitchd.service: remove /etc/systemd/system/ovs-vswitchd.service: no such file or directory
2021-01-05T06:50:14.677570805-05:00 I0105 11:50:14.677547   28223 update.go:1249] Removed stale systemd unit "/etc/systemd/system/ovs-vswitchd.service"
2021-01-05T06:50:14.677617303-05:00 I0105 11:50:14.677604   28223 update.go:1220] Removed stale systemd dropin "/etc/systemd/system/ovsdb-server.service.d/10-ovsdb-restart.conf"
2021-01-05T06:50:14.677637757-05:00 I0105 11:50:14.677624   28223 update.go:1282] /etc/systemd/system/multi-user.target.wants/ovsdb-server.service was not present. No need to remove
2021-01-05T06:50:14.677659570-05:00 W0105 11:50:14.677647   28223 update.go:1247] unable to delete /etc/systemd/system/ovsdb-server.service: remove /etc/systemd/system/ovsdb-server.service: no such file or directory
2021-01-05T06:50:14.677668010-05:00 I0105 11:50:14.677657   28223 update.go:1249] Removed stale systemd unit "/etc/systemd/system/ovsdb-server.service"
2021-01-05T06:50:14.677741683-05:00 I0105 11:50:14.677715   28223 update.go:1220] Removed stale systemd dropin "/etc/systemd/system/zincati.service.d/mco-disabled.conf"
2021-01-05T06:50:14.677750052-05:00 I0105 11:50:14.677742   28223 update.go:1282] /etc/systemd/system/multi-user.target.wants/zincati.service was not present. No need to remove
2021-01-05T06:50:14.677777386-05:00 W0105 11:50:14.677766   28223 update.go:1247] unable to delete /etc/systemd/system/zincati.service: remove /etc/systemd/system/zincati.service: no such file or directory
2021-01-05T06:50:14.677785656-05:00 I0105 11:50:14.677776   28223 update.go:1249] Removed stale systemd unit "/etc/systemd/system/zincati.service"
2021-01-05T06:50:14.677795291-05:00 E0105 11:50:14.677789   28223 writer.go:135] Marking Degraded due to: Failed to find /dev/disk/by-label/root
2021-01-05T06:50:14.681392397-05:00 E0105 11:50:14.680646   28223 token_source.go:152] Unable to rotate token: failed to read token file "/var/run/secrets/kubernetes.io/serviceaccount/token": open /var/run/secrets/kubernetes.io/serviceaccount/token: no such file or directory
```
from quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-a5e086d7c7605d24b00926fd569dbebfe9580e547b2e3f1d48a719bfd19a5049/namespaces/openshift-machine-config-operator/pods/machine-config-daemon-hqx4f/machine-config-daemon/machine-config-daemon/logs in the must-gather.
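
For anyone triaging a similar must-gather, the degraded marker can be located with something like (illustrative; run from inside the release-image directory of the must-gather):

$ grep -r "Marking Degraded" namespaces/openshift-machine-config-operator/pods/*/machine-config-daemon/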

Reassigning to MCO

Comment 7 Sinny Kumari 2021-01-06 14:19:00 UTC
From discussion over Slack, we think this could be a regression from https://github.com/openshift/machine-config-operator/pull/2251.
We have already seen a regression from it in 4.7 on RHEL 7 worker nodes, with a different error message: https://bugzilla.redhat.com/show_bug.cgi?id=1909943
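
If that is right, the failing step effectively does the following (a shell sketch of the disk-tuning logic as we understand PR #2251, not the literal Go code from update.go):
```
# Resolve the root device by label, then set the BFQ I/O scheduler on it.
# RHCOS labels its root filesystem "root", so the udev symlink exists; a
# RHEL 7 worker generally has no such label, so the lookup fails and the
# MCD marks the node Degraded with "Failed to find /dev/disk/by-label/root".
root_link=/dev/disk/by-label/root
[ -e "$root_link" ] || { echo "Failed to find $root_link" >&2; exit 1; }
root_part="$(readlink -f "$root_link")"                     # e.g. /dev/vda4
disk_sysfs="$(readlink -f "/sys/class/block/$(basename "$root_part")/..")"
echo bfq > "$disk_sysfs/queue/scheduler"                    # RHEL 7's 3.10 kernel would not offer bfq anyway
```
That would also match the eventual fix (PR #2325, "update.go: only set BFQ scheduler for masters"), which restricts the tuning to the RHCOS control-plane nodes.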

Comment 9 Mike Fiedler 2021-01-07 19:29:39 UTC
I hit this today trying to upgrade 4.6.6 to a 4.6 scratch build generated from the 4.6 branch plus https://github.com/openshift/machine-config-operator/pull/2321

Comment 11 Michael Nguyen 2021-01-13 00:28:30 UTC
Verified on 4.7.0-0.nightly-2021-01-12-150634. Upgraded 4.6.10 to 4.7.0-0.nightly-2021-01-12-150634 with a RHEL 7 worker.

For the verification I needed to work around these BZs, which also affect RHEL 7 compute nodes:

https://bugzilla.redhat.com/show_bug.cgi?id=1913582
Workaround: Edit /etc/os-release and set VERSION_ID="7"

https://bugzilla.redhat.com/show_bug.cgi?id=1913536 
Workaround: rm /etc/systemd/system/multi-user.target.wants/openvswitch.service
            systemctl enable openvswitch.service

Comment 12 Michael Nguyen 2021-01-14 23:28:26 UTC
Moving back to ON_QA state. Looking at the original BZ filed against 4.6, the problem is that the upgrade succeeds but the MCP is degraded. I need to re-verify that the MCP is not in a degraded state after the upgrade.

Comment 13 Yu Qi Zhang 2021-01-15 00:06:44 UTC
A clarifying note: the proposed fix will fix the "Failed to find /dev/disk/by-label/root" error on RHEL nodes.

This error does NOT block upgrades (workers are not considered in the success/fail criteria of an upgrade; only the control plane is), so this fix will NOT fix Qin's original issue of a failing upgrade.
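
The cluster version and the worker pool therefore have to be checked separately after an upgrade, e.g. (illustrative):

$ oc get clusterversion    # can report the target version as Completed/Available
$ oc get mcp worker -o jsonpath='{.status.conditions[?(@.type=="Degraded")].status}{"\n"}'    # prints "True" while a worker fails to sync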

Comment 14 Michael Nguyen 2021-01-15 20:46:14 UTC
Updated verification.  There were two scenarios that needed to be tested to verify this fix.

The first is an upgrade from 4.6.10 -> 4.7.0-0.nightly-2021-01-13-054018, i.e. from a clean 4.6.10 with no degraded MCP. Verified that the MCP does not go degraded when upgrading to 4.7 and that the upgrade succeeds.

The second test is 4.6.6 -> 4.6.10 -> 4.6.11 -> 4.7.0-0.nightly-2021-01-13-054018, i.e. an upgrade through several 4.6.z versions to introduce the degraded MCP. Verified that the upgrade to 4.7 succeeds and that the degraded MCP recovers with no intervention from the user.
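
For reference, the recovery after the final hop can be confirmed with something like (illustrative; the os_id label is visible in the node descriptions above):

$ oc get mcp worker -o jsonpath='{.status.degradedMachineCount}{"\n"}'    # back to 0
$ oc get nodes -l node.openshift.io/os_id=rhel                            # RHEL workers Ready and schedulable again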

Comment 29 errata-xmlrpc 2021-02-24 15:50:15 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.7.0 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:5633

Comment 30 W. Trevor King 2021-04-05 17:47:26 UTC
Removing UpgradeBlocker from this older bug to remove it from the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475

