Bug 1997478

Summary: Storage operator is not available after rebooting cluster instances
Product: OpenShift Container Platform Reporter: Yunfei Jiang <yunjiang>
Component: Node    Assignee: Elana Hashman <ehashman>
Node sub component: Kubelet    QA Contact: Yunfei Jiang <yunjiang>
Status: CLOSED ERRATA Docs Contact:
Severity: urgent    
Priority: urgent CC: aos-bugs, bgilbert, ccoleman, dhellmann, ehashman, fbertina, jima, miabbott, nstielau, pehunt, prkumar, rphillips, smilner, tsze, wking, xtian, yliu1
Version: 4.9    Flags: ehashman: needinfo-
Target Milestone: ---   
Target Release: 4.10.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version:    Doc Type: Bug Fix
Doc Text:
Cause: A refactor of the kubelet lifecycle caused an incorrect "Running" status, which previously never left the kubelet, to be synced out.
Consequence: Some terminated pods had their status reset to Running.
Fix: All status sources within the kubelet are now merged so that terminated pods maintain a terminal status.
Result: Terminated pods remain terminal; the final pod status synced to the API server reflects their terminal status.
Story Points: ---
Clone Of:
Cloned As: 2011050, 2011960 (view as bug list)    Environment:
Last Closed: 2022-03-12 04:37:30 UTC Type: Bug
Bug Depends On:    
Bug Blocks: 2011050, 2011960    
Attachments: kubelet_log (flags: none)

Description Yunfei Jiang 2021-08-25 10:41:12 UTC
After restarting all instances of the cluster, some pods are in an error state and the storage operator is not available:


> oc get pod -A | grep -iv -E "Running|Completed"
NAMESPACE                                          NAME                                                                 READY   STATUS                       RESTARTS       AGE
openshift-cluster-csi-drivers                      aws-ebs-csi-driver-controller-67d7785f4-qstw7                        0/11    CreateContainerConfigError   11             94m
openshift-cluster-csi-drivers                      aws-ebs-csi-driver-controller-67d7785f4-xpwsb                        0/11    CreateContainerConfigError   15 (90m ago)   94m
openshift-dns-operator                             dns-operator-7669b65fb7-kg2qk                                        0/2     Error                        2              96m
openshift-machine-config-operator                  machine-config-operator-68949bfd47-vxqcn                             0/1     Error                        1              96m
openshift-machine-config-operator                  machine-config-server-77wv6                                          0/1     CreateContainerConfigError   1              93m

Both aws-ebs-csi-driver-controller pods are stuck in CreateContainerConfigError, which is why the storage operator is unavailable. Events from one of the pods (aws-ebs-csi-driver-controller-67d7785f4-xpwsb) are shown below.
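They can be collected with a describe call along these lines (shown here as an example):

> oc -n openshift-cluster-csi-drivers describe pod aws-ebs-csi-driver-controller-67d7785f4-xpwsb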

Events:
  Warning  Preempting   91m                  kubelet            Preempted in order to admit critical pod
  Normal   Pulled       91m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:b850e05e7b99e5f45c8ae2b321e42826d9ea9b92865500ea49b36bb1ddf3c2a0" already present on machine
  Warning  Failed       91m                  kubelet            Error: cannot find volume "aws-credentials" to mount into container "csi-driver"
  Normal   Pulled       91m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79ed2aa8d4c6bb63c813bead31c2b7da0f082506315e9b93f6a3bfbc1c44d940" already present on machine
  Warning  Failed       91m                  kubelet            Error: cannot find volume "metrics-serving-cert" to mount into container "driver-kube-rbac-proxy"
  Normal   Pulled       91m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:9143971d7c9550881adbf9604752a43b95ca3d38f091546c0ea342633cc03e7c" already present on machine
  Warning  Failed       91m                  kubelet            Error: cannot find volume "socket-dir" to mount into container "csi-provisioner"
  Normal   Pulled       91m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79ed2aa8d4c6bb63c813bead31c2b7da0f082506315e9b93f6a3bfbc1c44d940" already present on machine
  Warning  Failed       91m                  kubelet            Error: cannot find volume "metrics-serving-cert" to mount into container "provisioner-kube-rbac-proxy"
  Normal   Pulled       91m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:430df9b440e4f9a18aaeab1f520d6dbbb4788d1d0bb72b97c7b1ee379a88eb86" already present on machine
  Warning  Failed       91m                  kubelet            Error: cannot find volume "socket-dir" to mount into container "csi-attacher"
  Normal   Pulled       91m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:bf8f0e9aecca2f1a313a4c7f4332e0a10cb41f4ac094d633dc4ec6f5d49d3282" already present on machine
  Warning  Failed       91m                  kubelet            Error: cannot find volume "metrics-serving-cert" to mount into container "attacher-kube-rbac-proxy"
  Normal   Pulled       91m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:8c329d1fd0553f6b5a8650b1e477ad3cbfa3236a1b6e5aa74b6583ab82922955" already present on machine
  Warning  Failed       91m                  kubelet            Error: cannot find volume "socket-dir" to mount into container "csi-resizer"
  Normal   Pulled       91m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79ed2aa8d4c6bb63c813bead31c2b7da0f082506315e9b93f6a3bfbc1c44d940" already present on machine
  Warning  Failed       91m                  kubelet            Error: cannot find volume "metrics-serving-cert" to mount into container "resizer-kube-rbac-proxy"
  Normal   Pulled       91m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:6d3fd86f1a1cf4ec3328b0cbf4a90b799ce8e43c8c4d54f0023619bfbdbd4f46" already present on machine
  Warning  Failed       91m                  kubelet            Error: cannot find volume "kube-api-access-xdwpq" to mount into container "csi-snapshotter"
  Normal   Pulled       91m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79ed2aa8d4c6bb63c813bead31c2b7da0f082506315e9b93f6a3bfbc1c44d940" already present on machine
  Normal   Pulled       91m                  kubelet            Container image "quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:79ed2aa8d4c6bb63c813bead31c2b7da0f082506315e9b93f6a3bfbc1c44d940" already present on machine
  Warning  Failed       91m (x2 over 91m)    kubelet            (combined from similar events): Error: cannot find volume "kube-api-access-xdwpq" to mount into container "csi-liveness-probe"

> oc get co
NAME                                       VERSION                             AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
authentication                             4.9.0-0.nightly-2021-08-25-010624   True        False         False      52m
baremetal                                  4.9.0-0.nightly-2021-08-25-010624   True        False         False      69m
cloud-controller-manager                   4.9.0-0.nightly-2021-08-25-010624   True        False         False      72m
cloud-credential                           4.9.0-0.nightly-2021-08-25-010624   True        False         False      72m
cluster-autoscaler                         4.9.0-0.nightly-2021-08-25-010624   True        False         False      69m
config-operator                            4.9.0-0.nightly-2021-08-25-010624   True        False         False      71m
console                                    4.9.0-0.nightly-2021-08-25-010624   True        False         False      14m
csi-snapshot-controller                    4.9.0-0.nightly-2021-08-25-010624   True        False         False      70m
dns                                        4.9.0-0.nightly-2021-08-25-010624   True        False         False      69m
etcd                                       4.9.0-0.nightly-2021-08-25-010624   True        False         False      69m
image-registry                             4.9.0-0.nightly-2021-08-25-010624   True        False         False      65m
ingress                                    4.9.0-0.nightly-2021-08-25-010624   True        False         False      14m
insights                                   4.9.0-0.nightly-2021-08-25-010624   True        False         False      64m
kube-apiserver                             4.9.0-0.nightly-2021-08-25-010624   True        False         False      67m
kube-controller-manager                    4.9.0-0.nightly-2021-08-25-010624   True        False         False      68m
kube-scheduler                             4.9.0-0.nightly-2021-08-25-010624   True        False         False      69m
kube-storage-version-migrator              4.9.0-0.nightly-2021-08-25-010624   True        False         False      70m
machine-api                                4.9.0-0.nightly-2021-08-25-010624   True        False         False      66m
machine-approver                           4.9.0-0.nightly-2021-08-25-010624   True        False         False      70m
machine-config                             4.9.0-0.nightly-2021-08-25-010624   True        False         False      68m
marketplace                                4.9.0-0.nightly-2021-08-25-010624   True        False         False      69m
monitoring                                 4.9.0-0.nightly-2021-08-25-010624   True        False         False      63m
network                                    4.9.0-0.nightly-2021-08-25-010624   True        False         False      71m
node-tuning                                4.9.0-0.nightly-2021-08-25-010624   True        False         False      70m
openshift-apiserver                        4.9.0-0.nightly-2021-08-25-010624   True        False         False      66m
openshift-controller-manager               4.9.0-0.nightly-2021-08-25-010624   True        False         False      66m
openshift-samples                          4.9.0-0.nightly-2021-08-25-010624   True        False         False      66m
operator-lifecycle-manager                 4.9.0-0.nightly-2021-08-25-010624   True        False         False      70m
operator-lifecycle-manager-catalog         4.9.0-0.nightly-2021-08-25-010624   True        False         False      70m
operator-lifecycle-manager-packageserver   4.9.0-0.nightly-2021-08-25-010624   True        False         False      66m
service-ca                                 4.9.0-0.nightly-2021-08-25-010624   True        False         False      70m
storage                                    4.9.0-0.nightly-2021-08-25-010624   False       True          False      12m     AWSEBSCSIDriverOperatorCRAvailable: AWSEBSDriverControllerServiceControllerAvailable: Waiting for Deployment

./oc describe storage
Name:         cluster
Namespace:
Labels:       <none>
Annotations:  include.release.openshift.io/ibm-cloud-managed: true
              include.release.openshift.io/self-managed-high-availability: true
              include.release.openshift.io/single-node-developer: true
              release.openshift.io/create-only: true
API Version:  operator.openshift.io/v1
Kind:         Storage
Metadata:
  Creation Timestamp:  2021-08-25T07:40:37Z
  Generation:          1
  Managed Fields:
    API Version:  operator.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .:
          f:include.release.openshift.io/ibm-cloud-managed:
          f:include.release.openshift.io/self-managed-high-availability:
          f:include.release.openshift.io/single-node-developer:
          f:release.openshift.io/create-only:
        f:ownerReferences:
          .:
          k:{"uid":"cb70338d-f4dd-421e-8ad2-d924fb0796f4"}:
      f:spec:
        .:
        f:logLevel:
        f:managementState:
        f:operatorLogLevel:
    Manager:      cluster-version-operator
    Operation:    Update
    Time:         2021-08-25T07:40:37Z
    API Version:  operator.openshift.io/v1
    Fields Type:  FieldsV1
    fieldsV1:
      f:status:
        .:
        f:conditions:
        f:generations:
        f:readyReplicas:
    Manager:      cluster-storage-operator
    Operation:    Update
    Subresource:  status
    Time:         2021-08-25T07:42:56Z
  Owner References:
    API Version:     config.openshift.io/v1
    Kind:            ClusterVersion
    Name:            version
    UID:             cb70338d-f4dd-421e-8ad2-d924fb0796f4
  Resource Version:  41493
  UID:               21a2f7b8-96d3-4d8b-a0f3-9510e679e3f0
Spec:
  Log Level:           Normal
  Management State:    Managed
  Operator Log Level:  Normal
Status:
  Conditions:
    Last Transition Time:  2021-08-25T07:42:51Z
    Status:                False
    Type:                  ManagementStateDegraded
    Last Transition Time:  2021-08-25T07:42:51Z
    Status:                False
    Type:                  ConfigObservationDegraded
    Last Transition Time:  2021-08-25T07:42:51Z
    Status:                True
    Type:                  DefaultStorageClassControllerAvailable
    Last Transition Time:  2021-08-25T07:42:51Z
    Status:                False
    Type:                  DefaultStorageClassControllerProgressing
    Last Transition Time:  2021-08-25T07:42:51Z
    Reason:                AsExpected
    Status:                False
    Type:                  DefaultStorageClassControllerDegraded
    Last Transition Time:  2021-08-25T07:42:51Z
    Status:                True
    Type:                  SnapshotCRDControllerUpgradeable
    Last Transition Time:  2021-08-25T07:42:52Z
    Reason:                AsExpected
    Status:                False
    Type:                  VSphereProblemDetectorStarterDegraded
    Last Transition Time:  2021-08-25T07:42:52Z
    Reason:                AsExpected
    Status:                False
    Type:                  SnapshotCRDControllerDegraded
    Last Transition Time:  2021-08-25T07:42:55Z
    Reason:                AsExpected
    Status:                False
    Type:                  CSIDriverStarterDegraded
    Last Transition Time:  2021-08-25T08:40:55Z
    Message:               AWSEBSDriverControllerServiceControllerAvailable: Waiting for Deployment
    Reason:                AWSEBSDriverControllerServiceController_Deploying
    Status:                False
    Type:                  AWSEBSCSIDriverOperatorCRAvailable
    Last Transition Time:  2021-08-25T08:40:55Z
    Message:               AWSEBSDriverControllerServiceControllerProgressing: Waiting for Deployment to deploy pods
    Reason:                AWSEBSDriverControllerServiceController_Deploying
    Status:                True
    Type:                  AWSEBSCSIDriverOperatorCRProgressing
    Last Transition Time:  2021-08-25T07:42:56Z
    Message:               All is well
    Reason:                AsExpected
    Status:                False
    Type:                  AWSEBSCSIDriverOperatorCRDegraded
    Last Transition Time:  2021-08-25T07:43:02Z
    Status:                False
    Type:                  AWSEBSProgressing
    Last Transition Time:  2021-08-25T07:42:56Z
    Reason:                AsExpected
    Status:                False
    Type:                  AWSEBSCSIDriverOperatorDegraded
    Last Transition Time:  2021-08-25T07:42:56Z
    Reason:                AsExpected
    Status:                False
    Type:                  AWSEBSCSIDriverOperatorDeploymentDegraded
    Last Transition Time:  2021-08-25T07:42:58Z
    Reason:                AsExpected
    Status:                False
    Type:                  AWSEBSCSIDriverOperatorStaticControllerDegraded
  Generations:
    Group:            apps
    Hash:
    Last Generation:  1
    Name:             aws-ebs-csi-driver-operator
    Namespace:        openshift-cluster-csi-drivers
    Resource:         deployments
    Group:            operator.openshift.io
    Hash:
    Last Generation:  1
    Name:             ebs.csi.aws.com
    Namespace:
    Resource:         clustercsidrivers
  Ready Replicas:     0
Events:               <none>




How reproducible:
3/3

Version:
4.9.0-0.nightly-2021-08-25-010624

Platform:
AWS

How to reproduce it (as minimally and precisely as possible)?
1. Install a cluster on AWS.
2. Reboot all instances (one way to do this is sketched below).
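
One possible way to do step 2 with the AWS CLI, assuming the instances carry the installer's kubernetes.io/cluster/<infra-id>=owned ownership tag (the INFRA_ID value below is hypothetical, not recorded in this bug):

# Reboot every EC2 instance that carries the cluster's ownership tag (assumed tagging scheme).
INFRA_ID=mycluster-abc12   # hypothetical infra name from install time
aws ec2 describe-instances \
  --filters "Name=tag:kubernetes.io/cluster/${INFRA_ID},Values=owned" \
            "Name=instance-state-name,Values=running" \
  --query 'Reservations[].Instances[].InstanceId' --output text |
  xargs aws ec2 reboot-instances --instance-ids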

Comment 3 Wei Duan 2021-08-25 12:03:57 UTC
Created attachment 1817432 [details]
kubelet_log

Comment 4 Fabio Bertinatto 2021-08-25 13:11:55 UTC
This looks similar to bug 1996184 that I'm investigating.

Comment 5 Fabio Bertinatto 2021-08-25 14:45:34 UTC
*** Bug 1996184 has been marked as a duplicate of this bug. ***

Comment 6 Fabio Bertinatto 2021-08-25 14:54:47 UTC
The pod won't start because some volumes can't be mounted into the containers. 

Taking the "metrics-volume-cert" volume as an example: this volume is backed by the "aws-ebs-csi-driver-controller-metrics-serving-cert" secret and from the must-gather output we can see that it exists.

However, the kubelet hasn't been able to fetch this secret from the API server:

Aug 25 07:43:15 ip-10-0-61-148 hyperkube[1368]: E0825 07:43:15.536690    1368 secret.go:195] Couldn't get secret openshift-cluster-csi-drivers/aws-ebs-csi-driver-controller-metrics-serving-cert: secret "aws-ebs-csi-driver-controller-metrics-serving-cert" not found
Aug 25 07:43:15 ip-10-0-61-148 hyperkube[1368]: E0825 07:43:15.536754    1368 nestedpendingoperations.go:301] Operation for "{volumeName:kubernetes.io/secret/6a24646f-5aa1-496d-9f68-59c8ec17f864-metrics-serving-cert podName:6a24646f-5aa1-496d-9f68-59c8ec17f864 nodeName:}" failed. No retries permitted until 2021-08-25 07:43:23.536737498 +0000 UTC m=+101.211117505 (durationBeforeRetry 8s). Error: MountVolume.SetUp failed for volume "metrics-serving-cert" (UniqueName: "kubernetes.io/secret/6a24646f-5aa1-496d-9f68-59c8ec17f864-metrics-serving-cert") pod "aws-ebs-csi-driver-controller-67d7785f4-qstw7" (UID: "6a24646f-5aa1-496d-9f68-59c8ec17f864") : secret "aws-ebs-csi-driver-controller-metrics-serving-cert" not found
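
For reference, the secret's presence can also be confirmed directly against the cluster with a simple lookup, for example:

> oc -n openshift-cluster-csi-drivers get secret aws-ebs-csi-driver-controller-metrics-serving-cert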

Assigning to Node team to help triage this.

Comment 10 Yunfei Jiang 2021-09-28 03:03:44 UTC
Elana, any progress on this issue? thanks

Comment 25 Elana Hashman 2021-09-29 22:10:31 UTC
I believe I have a minimal repro on a nightly AWS cluster:

ehashman@red-dot:~$ cat tmp/manifest.yml 

kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  terminationGracePeriodSeconds: 10
  containers:
  - name: busybox
    image: k8s.gcr.io/e2e-test-images/nginx:1.14-2
    command: ["sh", "-c", "fallocate -l 110G file; sleep 100000"]

ehashman@red-dot:~$ oc apply -f tmp/manifest.yml 
pod/mypod created

ehashman@red-dot:~$ oc get pod
NAME     READY   STATUS              RESTARTS   AGE
mypod    0/1     ContainerCreating   0          3s

ehashman@red-dot:~$ oc get pod
NAME     READY   STATUS    RESTARTS   AGE
mypod    1/1     Running   0          7s

ehashman@red-dot:~$ oc get pod
NAME     READY   STATUS    RESTARTS   AGE
mypod    1/1     Evicted   0          12s

ehashman@red-dot:~$ oc get pod
NAME     READY   STATUS   RESTARTS   AGE
mypod    0/1     Error    0          23s


Kubelet log from the node for the pod, after waiting for the node to become responsive (note: the pod name in my test run was "mypod2"):

Sep 29 21:44:59 ip-10-0-255-253 hyperkube[1388]: I0929 21:44:59.062780    1388 eviction_manager.go:350] "Eviction manager: must evict pod(s) to reclaim" resourceName="ephemeral-storage"
Sep 29 21:44:59 ip-10-0-255-253 hyperkube[1388]: I0929 21:44:59.063965    1388 eviction_manager.go:368] "Eviction manager: pods ranked for eviction" pods=[openshift-network-diagnostics/network-check-target-sxv6g openshift-console/downloads-86dbddddf-2wwrv openshift-monitoring/prometheus-k8s-0 openshift-monitoring/grafana-54d77c7589-s8lkc openshift-monitoring/telemeter-client-799587c847-gp2p8 openshift-monitoring/alertmanager-main-0 openshift-monitoring/thanos-querier-5ccd9c4865-ztgv8 openshift-ingress/router-default-7658c77c46-mg2jx openshift-monitoring/node-exporter-m89qf openshift-image-registry/node-ca-zdp5x openshift-monitoring/prometheus-adapter-bd96f9697-msc6w openshift-ingress-canary/ingress-canary-p5qlq openshift-sdn/sdn-9jcbh openshift-cluster-node-tuning-operator/tuned-hlhz9 openshift-multus/multus-additional-cni-plugins-ndfs6 openshift-multus/network-metrics-daemon-c2qqk openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-8g2f5 openshift-dns/dns-default-rzl29 openshift-machine-config-operator/machine-config-daemon-d4msj openshift-dns/node-resolver-tjg5h openshift-multus/multus-sphhl default/mypod2]
Sep 29 21:44:59 ip-10-0-255-253 hyperkube[1388]: I0929 21:44:59.108229    1388 kubelet.go:2092] "SyncLoop DELETE" source="api" pods=[openshift-network-diagnostics/network-check-target-sxv6g]
Sep 29 21:44:59 ip-10-0-255-253 hyperkube[1388]: I0929 21:44:59.111131    1388 kubelet.go:2086] "SyncLoop REMOVE" source="api" pods=[openshift-network-diagnostics/network-check-target-sxv6g]
Sep 29 21:44:59 ip-10-0-255-253 hyperkube[1388]: I0929 21:44:59.172057    1388 kubelet.go:2076] "SyncLoop ADD" source="api" pods=[openshift-network-diagnostics/network-check-target-xqqn8]
Sep 29 21:44:59 ip-10-0-255-253 hyperkube[1388]: I0929 21:44:59.172547    1388 eviction_manager.go:168] "Failed to admit pod to node" pod="openshift-network-diagnostics/network-check-target-xqqn8" nodeCondition=[DiskPressure]
Sep 29 21:44:59 ip-10-0-255-253 hyperkube[1388]: I0929 21:44:59.480474    1388 eviction_manager.go:576] "Eviction manager: pod is evicted successfully" pod="openshift-network-diagnostics/network-check-target-sxv6g"
Sep 29 21:44:59 ip-10-0-255-253 hyperkube[1388]: I0929 21:44:59.480519    1388 eviction_manager.go:199] "Eviction manager: pods evicted, waiting for pod to be cleaned up" pods=[openshift-network-diagnostics/network-check-target-sxv6g]
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:00.481736    1388 eviction_manager.go:411] "Eviction manager: pods successfully cleaned up" pods=[openshift-network-diagnostics/network-check-target-sxv6g]
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:00.491421    1388 eviction_manager.go:339] "Eviction manager: attempting to reclaim" resourceName="ephemeral-storage"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:00.491471    1388 container_gc.go:85] "Attempting to delete unused containers"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.504009    1388 kubelet_pods.go:1484] "Pod attempted illegal phase transition" pod="openshift-network-diagnostics/network-check-target-xqqn8" originalStatusPhase=Failed apiStatusPhase=Pending apiStatus="&PodStatus{Phase:Pending,Conditions:[]PodCondition{},Message:,Reason:,HostIP:,PodIP:,StartTime:<nil>,ContainerStatuses:[]ContainerStatus{ContainerStatus{Name:network-check-target-container,State:ContainerState{Waiting:&ContainerStateWaiting{Reason:ContainerCreating,Message:,},Running:nil,Terminated:nil,},LastTerminationState:ContainerState{Waiting:nil,Running:nil,Terminated:nil,},Ready:false,RestartCount:0,Image:quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:da24fe6e3f9bf5bf398ea11d384c9ef4d5e964e827408f5e7eedffa3e67e2b26,ImageID:,ContainerID:,Started:nil,},},QOSClass:Burstable,InitContainerStatuses:[]ContainerStatus{},NominatedNodeName:,PodIPs:[]PodIP{},EphemeralContainerStatuses:[]ContainerStatus{},}"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:00.611130    1388 eviction_manager.go:350] "Eviction manager: must evict pod(s) to reclaim" resourceName="ephemeral-storage"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:00.611933    1388 eviction_manager.go:368] "Eviction manager: pods ranked for eviction" pods=[openshift-console/downloads-86dbddddf-2wwrv openshift-monitoring/prometheus-k8s-0 openshift-monitoring/grafana-54d77c7589-s8lkc openshift-monitoring/alertmanager-main-0 openshift-monitoring/thanos-querier-5ccd9c4865-ztgv8 openshift-monitoring/telemeter-client-799587c847-gp2p8 openshift-ingress/router-default-7658c77c46-mg2jx openshift-monitoring/node-exporter-m89qf openshift-image-registry/node-ca-zdp5x openshift-monitoring/prometheus-adapter-bd96f9697-msc6w openshift-ingress-canary/ingress-canary-p5qlq openshift-sdn/sdn-9jcbh openshift-cluster-node-tuning-operator/tuned-hlhz9 openshift-multus/multus-additional-cni-plugins-ndfs6 openshift-machine-config-operator/machine-config-daemon-d4msj openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-8g2f5 openshift-dns/dns-default-rzl29 openshift-multus/network-metrics-daemon-c2qqk openshift-multus/multus-sphhl openshift-dns/node-resolver-tjg5h default/mypod2]
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.611989    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-console/downloads-86dbddddf-2wwrv"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612041    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-monitoring/prometheus-k8s-0"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612068    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-monitoring/grafana-54d77c7589-s8lkc"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612105    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-monitoring/alertmanager-main-0"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612143    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-monitoring/thanos-querier-5ccd9c4865-ztgv8"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612174    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-monitoring/telemeter-client-799587c847-gp2p8"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612197    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-ingress/router-default-7658c77c46-mg2jx"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612226    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-monitoring/node-exporter-m89qf"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612251    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-image-registry/node-ca-zdp5x"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612274    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-monitoring/prometheus-adapter-bd96f9697-msc6w"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612298    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-ingress-canary/ingress-canary-p5qlq"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612322    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-sdn/sdn-9jcbh"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612345    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-cluster-node-tuning-operator/tuned-hlhz9"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612365    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-multus/multus-additional-cni-plugins-ndfs6"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612391    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-machine-config-operator/machine-config-daemon-d4msj"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612421    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-cluster-csi-drivers/aws-ebs-csi-driver-node-8g2f5"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612447    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-dns/dns-default-rzl29"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612473    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-multus/network-metrics-daemon-c2qqk"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612494    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-multus/multus-sphhl"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:00.612517    1388 eviction_manager.go:561] "Eviction manager: cannot evict a critical pod" pod="openshift-dns/node-resolver-tjg5h"
Sep 29 21:45:00 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:00.612864    1388 kuberuntime_container.go:720] "Killing container with a grace period" pod="default/mypod2" podUID=a66ad821-0748-46c7-b111-591e1cd86071 containerName="busybox" containerID="cri-o://15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c" gracePeriod=10
Sep 29 21:45:10 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:10.612977    1388 eviction_manager.go:574] "Eviction manager: pod failed to evict" err="timeout waiting to kill pod" pod="default/mypod2"
Sep 29 21:45:10 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:10.613019    1388 eviction_manager.go:199] "Eviction manager: pods evicted, waiting for pod to be cleaned up" pods=[default/mypod2]
Sep 29 21:45:11 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:11.539069    1388 generic.go:296] "Generic (PLEG): container finished" podID=a66ad821-0748-46c7-b111-591e1cd86071 containerID="15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c" exitCode=137
Sep 29 21:45:11 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:11.539137    1388 kubelet.go:2114] "SyncLoop (PLEG): event for pod" pod="default/mypod2" event=&{ID:a66ad821-0748-46c7-b111-591e1cd86071 Type:ContainerDied Data:15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c}
Sep 29 21:45:11 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:11.539187    1388 kubelet.go:2114] "SyncLoop (PLEG): event for pod" pod="default/mypod2" event=&{ID:a66ad821-0748-46c7-b111-591e1cd86071 Type:ContainerDied Data:fd70cc494c8ac718f881389f1c0e76741640f2641a18cbd25c2ba87d59b28bdb}
Sep 29 21:45:11 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:11.539231    1388 scope.go:110] "RemoveContainer" containerID="15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c"
Sep 29 21:45:11 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:11.567278    1388 scope.go:110] "RemoveContainer" containerID="15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c"
Sep 29 21:45:11 ip-10-0-255-253 hyperkube[1388]: E0929 21:45:11.567810    1388 remote_runtime.go:334] "ContainerStatus from runtime service failed" err="rpc error: code = NotFound desc = could not find container \"15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c\": container with ID starting with 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c not found: ID does not exist" containerID="15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c"
Sep 29 21:45:11 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:11.567858    1388 pod_container_deletor.go:52] "DeleteContainer returned error" containerID={Type:cri-o ID:15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c} err="failed to get container status \"15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c\": rpc error: code = NotFound desc = could not find container \"15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c\": container with ID starting with 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c not found: ID does not exist"
Sep 29 21:45:11 ip-10-0-255-253 hyperkube[1388]: I0929 21:45:11.613937    1388 eviction_manager.go:411] "Eviction manager: pods successfully cleaned up" pods=[default/mypod2]


The pod still shows up in the API in an error state.

Events:
  Type     Reason               Age   From               Message
  ----     ------               ----  ----               -------
  Normal   Scheduled            23m   default-scheduler  Successfully assigned default/mypod2 to ip-10-0-255-253.us-east-2.compute.internal
  Normal   AddedInterface       23m   multus             Add eth0 [10.128.2.18/23] from openshift-sdn
  Normal   Pulling              23m   kubelet            Pulling image "k8s.gcr.io/e2e-test-images/nginx:1.14-2"
  Normal   Pulled               23m   kubelet            Successfully pulled image "k8s.gcr.io/e2e-test-images/nginx:1.14-2" in 2.639942044s
  Normal   Created              23m   kubelet            Created container busybox
  Normal   Started              23m   kubelet            Started container busybox
  Warning  Evicted              23m   kubelet            The node was low on resource: ephemeral-storage.
  Normal   Killing              23m   kubelet            Stopping container busybox
  Warning  ExceededGracePeriod  23m   kubelet            Container runtime did not kill the pod within specified grace period.

status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2021-09-29T21:44:49Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2021-09-29T21:45:11Z"
    message: 'containers with unready status: [busybox]'
    reason: ContainersNotReady
    status: "False"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2021-09-29T21:45:11Z"
    message: 'containers with unready status: [busybox]'
    reason: ContainersNotReady
    status: "False"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2021-09-29T21:44:49Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: cri-o://15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c
    image: k8s.gcr.io/e2e-test-images/nginx:1.14-2
    imageID: k8s.gcr.io/e2e-test-images/nginx@sha256:13616070e3f29de4417eee434a8ef472221c9e51b3d037b5a6b46cef08eb7443
    lastState: {}
    name: busybox
    ready: false
    restartCount: 0
    started: false
    state:
      terminated:
        containerID: cri-o://15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c
        exitCode: 137
        finishedAt: "2021-09-29T21:45:10Z"
        reason: Error
        startedAt: "2021-09-29T21:44:55Z"
  hostIP: 10.0.255.253
  phase: Running
  podIP: 10.128.2.18
  podIPs:
  - ip: 10.128.2.18
  qosClass: BestEffort
  startTime: "2021-09-29T21:44:49Z"

Comment 26 Elana Hashman 2021-09-29 22:21:46 UTC
CRI-O seems to think it removed the container successfully:

sh-4.4# journalctl -u crio | grep 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c
Sep 29 21:44:55 ip-10-0-255-253 crio[1356]: time="2021-09-29 21:44:55.223939634Z" level=info msg="Created container 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c: default/mypod2/busybox" id=7281076f-39d3-4427-bf76-5d26d4536403 name=/runtime.v1alpha2.RuntimeService/CreateContainer
Sep 29 21:44:55 ip-10-0-255-253 crio[1356]: time="2021-09-29 21:44:55.224653116Z" level=info msg="Starting container: 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c" id=478d040e-7c90-4746-9968-a6ceed5ce962 name=/runtime.v1alpha2.RuntimeService/StartContainer
Sep 29 21:44:55 ip-10-0-255-253 crio[1356]: time="2021-09-29 21:44:55.243331022Z" level=info msg="Started container" PID=53822 containerID=15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c description=default/mypod2/busybox id=478d040e-7c90-4746-9968-a6ceed5ce962 name=/runtime.v1alpha2.RuntimeService/StartContainer sandboxID=fd70cc494c8ac718f881389f1c0e76741640f2641a18cbd25c2ba87d59b28bdb
Sep 29 21:45:00 ip-10-0-255-253 crio[1356]: time="2021-09-29 21:45:00.613225315Z" level=info msg="Stopping container: 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c (timeout: 10s)" id=f99ed228-260d-4914-96d5-e0911fb8646f name=/runtime.v1alpha2.RuntimeService/StopContainer
Sep 29 21:45:10 ip-10-0-255-253 crio[1356]: time="2021-09-29 21:45:10.638622250Z" level=warning msg="Stopping container 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c with stop signal timed out: timeout reached after 10 seconds waiting for container process to exit" id=f99ed228-260d-4914-96d5-e0911fb8646f name=/runtime.v1alpha2.RuntimeService/StopContainer
Sep 29 21:45:10 ip-10-0-255-253 crio[1356]: time="2021-09-29 21:45:10.850271824Z" level=info msg="Stopped container 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c: default/mypod2/busybox" id=f99ed228-260d-4914-96d5-e0911fb8646f name=/runtime.v1alpha2.RuntimeService/StopContainer
Sep 29 21:45:11 ip-10-0-255-253 crio[1356]: time="2021-09-29 21:45:11.540558265Z" level=info msg="Removing container: 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c" id=6920d056-d456-4df2-8a46-1b845f75ce76 name=/runtime.v1alpha2.RuntimeService/RemoveContainer
Sep 29 21:45:11 ip-10-0-255-253 crio[1356]: time="2021-09-29 21:45:11.566991841Z" level=info msg="Removed container 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c: default/mypod2/busybox" id=6920d056-d456-4df2-8a46-1b845f75ce76 name=/runtime.v1alpha2.RuntimeService/RemoveContainer

So I think we can conclude we've narrowed this down to the kubelet.
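
For completeness, the kubelet side of the same container's lifecycle can be pulled the same way on the node (assuming the standard kubelet.service unit on RHCOS):

sh-4.4# journalctl -u kubelet | grep 15df7dfd7f777a6f4886821b1c40084a28ead229ce501d708c0b00351aa5f62c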

Comment 28 Elana Hashman 2021-10-01 22:12:04 UTC
I believe I have a working fix in the linked WIP GitHub PR, https://github.com/kubernetes/kubernetes/pull/105411, although it only addresses the kubelet side. kubectl will still display a status of "ContainerStatusUnknown" once the kubelet gets a chance to sync the final containerStatuses; we may want to post a follow-up PR for that as well in case it causes confusion.
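
As a rough check (where exactly the reason surfaces is an assumption, not confirmed by the PR), the reason recorded for the reaped container could be inspected with something like:

> oc get pod mypod -o jsonpath='{.status.containerStatuses[0].state.terminated.reason}{"\n"}'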

Comment 29 yliu1 2021-10-05 13:38:18 UTC
*** Bug 1998193 has been marked as a duplicate of this bug. ***

Comment 34 Micah Abbott 2021-10-05 19:03:31 UTC
Noting publicly that the Node team wants this fix for the bootstrap use case as well, which means we won't be able to mark this fully delivered/fixed until it lands in an RHCOS boot image bump to `openshift/installer`.

I've added this BZ as a dependency to our pending 4.9 boot image bump BZ https://bugzilla.redhat.com/show_bug.cgi?id=2009653

Comment 36 Benjamin Gilbert 2021-10-06 01:46:13 UTC
Since this BZ has been retargeted at 4.10, I'm switching its bootimage bump dependency to the 4.10 tracker.

I'm also moving it back to POST, since the CoreOS team process requires bootimage dependencies to be held in POST until the bootimage bump lands.  If that clashes with your process too much, let us know and we can clone this to a tracker bug.

Comment 37 Benjamin Gilbert 2021-10-07 19:49:22 UTC
After some additional discussion on the CoreOS team, I've created a separate bug 2011960 to track this fix landing in an RHCOS bootimage, and I'm reverting the changes to this BZ that were prompted by trying to fit it into the bootimage bump process.  Thanks for your patience, all.

Comment 38 Yunfei Jiang 2021-10-08 10:22:59 UTC
Waiting for the new version of RHCOS (bug 2011960) to be ready; will then verify with the latest RHCOS build.

Comment 39 Angus Thomas 2021-10-08 13:55:13 UTC
*** Bug 2008745 has been marked as a duplicate of this bug. ***

Comment 40 Praveen Kumar 2021-10-12 13:43:23 UTC
Tested the rc.7 bits; no longer hitting an issue like https://bugzilla.redhat.com/show_bug.cgi?id=2008745.

Comment 41 Yunfei Jiang 2021-10-13 08:21:23 UTC
Verified. PASS.

OCP version: 4.10.0-0.nightly-2021-10-12-203611

Comment 42 Yunfei Jiang 2021-10-13 08:30:30 UTC
rhcos-49.84.202107010027-0 is still being used in 4.10; waiting for the new version of RHCOS (bug 2011960) to verify again.

Comment 43 Yunfei Jiang 2021-10-26 01:37:39 UTC
Verified. PASS.
OCP version: 4.10.0-0.nightly-2021-10-23-225921
rhcos: rhcos-410.84.202110140201-0-x86_64
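
The same checks used in the description are a convenient way to re-run the verification after rebooting all instances:

> oc get pod -A | grep -iv -E "Running|Completed"
> oc get co storage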

Comment 46 errata-xmlrpc 2022-03-12 04:37:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056