Bug 1814291

Summary: Pods stuck in terminating after e2e run
Product: OpenShift Container Platform
Component: Node
Version: 4.4
Target Release: 4.5.0
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: Ben Parees <bparees>
Assignee: Ryan Phillips <rphillips>
QA Contact: MinLi <minmli>
CC: aos-bugs, ccoleman, hekumar, jokerman, rphillips, wking
Last Closed: 2020-08-04 18:05:50 UTC
Type: Bug
Bug Blocks: 1814393

Description Ben Parees 2020-03-17 15:00:43 UTC
After an e2e run, some pods are stuck in Terminating; it is not clear why:

$ oc get pods -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      k8s.v1.cni.cncf.io/networks-status: ""
      openshift.io/scc: anyuid
    creationTimestamp: "2020-03-11T06:30:12Z"
    deletionGracePeriodSeconds: 30
    deletionTimestamp: "2020-03-14T04:05:45Z"
    generateName: pvc-volume-tester-
    name: pvc-volume-tester-dkp9s
    namespace: e2e-csi-mock-volumes-7010
    resourceVersion: "2073172"
    selfLink: /api/v1/namespaces/e2e-csi-mock-volumes-7010/pods/pvc-volume-tester-dkp9s
    uid: 6aefd56f-c478-4295-b8c8-779113edc0d2
  spec:
    containers:
    - image: k8s.gcr.io/pause:3.1
      imagePullPolicy: IfNotPresent
      name: volume-tester
      resources: {}
      securityContext:
        capabilities:
          drop:
          - MKNOD
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /mnt/test
        name: my-volume
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: default-token-gd6q6
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    imagePullSecrets:
    - name: default-dockercfg-fzcxt
    nodeName: ip-10-0-132-57.us-east-2.compute.internal
    priority: 0
    restartPolicy: Never
    schedulerName: default-scheduler
    securityContext:
      seLinuxOptions:
        level: s0:c122,c49
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: my-volume
      persistentVolumeClaim:
        claimName: pvc-zmb4r
    - name: default-token-gd6q6
      secret:
        defaultMode: 420
        secretName: default-token-gd6q6
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2020-03-11T06:30:12Z"
      reason: PodCompleted
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2020-03-14T04:05:16Z"
      reason: PodCompleted
      status: "False"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2020-03-14T04:05:16Z"
      reason: PodCompleted
      status: "False"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2020-03-11T06:30:12Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: cri-o://47d13e4a73ef34dbaeaa1dc15f06651e2cae3d767988e0e1f86ccc01b754eee8
      image: k8s.gcr.io/pause:3.1
      imageID: k8s.gcr.io/pause@sha256:59eec8837a4d942cc19a52b8c09ea75121acc38114a2c68b98983ce9356b8610
      lastState: {}
      name: volume-tester
      ready: false
      restartCount: 0
      started: false
      state:
        terminated:
          containerID: cri-o://47d13e4a73ef34dbaeaa1dc15f06651e2cae3d767988e0e1f86ccc01b754eee8
          exitCode: 0
          finishedAt: "2020-03-14T04:05:15Z"
          reason: Completed
          startedAt: "2020-03-11T06:30:23Z"
    hostIP: 10.0.132.57
    phase: Succeeded
    podIP: 10.129.3.125
    podIPs:
    - ip: 10.129.3.125
    qosClass: BestEffort
    startTime: "2020-03-11T06:30:12Z"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""



$ oc describe pod pvc-volume-tester-dkp9s
Name:                      pvc-volume-tester-dkp9s
Namespace:                 e2e-csi-mock-volumes-7010
Priority:                  0
Node:                      ip-10-0-132-57.us-east-2.compute.internal/10.0.132.57
Start Time:                Wed, 11 Mar 2020 02:30:12 -0400
Labels:                    <none>
Annotations:               k8s.v1.cni.cncf.io/networks-status: 
                           openshift.io/scc: anyuid
Status:                    Terminating (lasts 3d10h)
Termination Grace Period:  30s
IP:                        10.129.3.125
IPs:
  IP:  10.129.3.125
Containers:
  volume-tester:
    Container ID:   cri-o://47d13e4a73ef34dbaeaa1dc15f06651e2cae3d767988e0e1f86ccc01b754eee8
    Image:          k8s.gcr.io/pause:3.1
    Image ID:       k8s.gcr.io/pause@sha256:59eec8837a4d942cc19a52b8c09ea75121acc38114a2c68b98983ce9356b8610
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 11 Mar 2020 02:30:23 -0400
      Finished:     Sat, 14 Mar 2020 00:05:15 -0400
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /mnt/test from my-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-gd6q6 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  my-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-zmb4r
    ReadOnly:   false
  default-token-gd6q6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-gd6q6
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>


$ oc version
Client Version: v4.2.0-alpha.0-249-gc276ecb
Server Version: 4.4.0-0.nightly-2020-03-15-215151
Kubernetes Version: v1.17.1


Putting this on the Node team for now, since I'd expect the node to clean up this pod; but if it's an issue with the storage test itself, feel free to reassign.

As long as it is not a test-config/behavior-specific bug, this is urgent for 4.4: we must ensure pods actually terminate.
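For anyone triaging a similar hang: pods in this state still carry a `metadata.deletionTimestamp`, so they can be listed from `oc get pods --all-namespaces -o json` output. A minimal sketch, assuming `jq` is available; the filter is demonstrated here on a small inline sample rather than live cluster data:

```shell
# In a live cluster, feed real data to the same filter:
#   oc get pods --all-namespaces -o json | jq -r '<filter below>'
# Inline sample: one pod marked for deletion, one healthy pod.
sample='{
  "items": [
    {"metadata": {"namespace": "e2e-csi-mock-volumes-7010",
                  "name": "pvc-volume-tester-dkp9s",
                  "deletionTimestamp": "2020-03-14T04:05:45Z"}},
    {"metadata": {"namespace": "default",
                  "name": "healthy-pod"}}
  ]
}'
# Select pods that have been marked for deletion but still exist,
# i.e. candidates for "stuck in Terminating".
echo "$sample" | jq -r '
  .items[]
  | select(.metadata.deletionTimestamp != null)
  | "\(.metadata.namespace)/\(.metadata.name) deleted-at=\(.metadata.deletionTimestamp)"'
# Prints:
# e2e-csi-mock-volumes-7010/pvc-volume-tester-dkp9s deleted-at=2020-03-14T04:05:45Z
```

The same filter works unchanged on real `oc get pods -o json` output, since the sample mirrors the Pod list schema shown above.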

Comment 1 Ryan Phillips 2020-03-17 18:09:53 UTC
Known upstream issue: https://github.com/kubernetes/kubernetes/issues/51835

Comment 2 Clayton Coleman 2020-03-17 18:24:54 UTC
This is an urgent issue, moving back to 4.4.0. This may not be deferred without architect sign-off.

We should look at what we can do to improve debugging.  We can make e2e runs fail if there are still terminating pods.
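One hedged sketch of such a gate (an illustration, not the actual change that landed): after teardown, count pods that still carry a `deletionTimestamp` and fail the run if any remain. Shown with an inline sample in place of live `oc get pods -A -o json` output; `check_terminating` is a hypothetical helper name:

```shell
# Hypothetical post-run gate for e2e jobs.
# In CI this would read live data:
#   oc get pods --all-namespaces -o json | { read json; check_terminating "$json"; }
check_terminating() {
  # Returns nonzero if any pod in the given Pod list JSON is still Terminating.
  local stuck
  stuck=$(echo "$1" | jq '[.items[] | select(.metadata.deletionTimestamp != null)] | length')
  if [ "$stuck" -gt 0 ]; then
    echo "ERROR: $stuck pod(s) still terminating after e2e run"
    return 1
  fi
}

# Inline sample with one stuck pod; the gate reports it and returns nonzero.
pods_json='{"items":[{"metadata":{"name":"pvc-volume-tester-dkp9s","deletionTimestamp":"2020-03-14T04:05:45Z"}}]}'
check_terminating "$pods_json" || echo "gate would fail the run"
```

In a real job the nonzero return would propagate to the job's exit status, turning a silent leak like this bug into a hard test failure.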

Comment 5 MinLi 2020-04-08 09:24:25 UTC
@Ben Parees, may I verify this bz using the steps in https://bugzilla.redhat.com/show_bug.cgi?id=1808123#c11,
or how else should I verify it?

Comment 6 Ben Parees 2020-04-08 13:19:55 UTC
Yes, I think those steps are OK, but it might be best to confirm with Hemant, as he was fixing up the e2e test.

Comment 7 MinLi 2020-04-10 11:40:13 UTC
Verified on version 4.5.0-0.nightly-2020-04-09-231931.

Ran the commands below several times:
openshift-tests run openshift/conformance/parallel --dry-run | grep Feature:VolumeSnapshotDataSource > tests
openshift-tests run openshift/conformance/parallel -f tests

During the test, we can see:
# oc get volumesnapshots --all-namespaces
NAMESPACE              NAME             READYTOUSE   SOURCEPVC   SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                                                SNAPSHOTCONTENT                                    CREATIONTIME   AGE
e2e-provisioning-277   snapshot-h6rff   false        pvc-wjqxr                                         e2e-provisioning-277-csi-hostpath-e2e-provisioning-277-vsc   snapcontent-fc906e79-ffb8-4d8b-8fff-b9dfd6d350b1                  14s
e2e-provisioning-718   snapshot-g5xjd   true         pvc-lr9bl                           1Mi           e2e-provisioning-718-csi-hostpath-e2e-provisioning-718-vsc   snapcontent-770b62b5-c31e-462d-a021-b62ae3359015   20s            20s


But when the test is finished:
# oc get volumesnapshots --all-namespaces
No resources found

Comment 8 W. Trevor King 2020-04-30 01:56:25 UTC
VERIFIED, no longer needs info.

Comment 16 errata-xmlrpc 2020-08-04 18:05:50 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.5 image release advisory), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409