Bug 1814393 - [4.4] Pods stuck in terminating after e2e run
Summary: [4.4] Pods stuck in terminating after e2e run
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ---
Target Release: 4.4.z
Assignee: Ryan Phillips
QA Contact: MinLi
URL:
Whiteboard:
Depends On: 1814291
Blocks:
 
Reported: 2020-03-17 18:43 UTC by Ryan Phillips
Modified: 2020-06-17 22:26 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1814291
Environment:
Last Closed: 2020-06-17 22:26:03 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift origin pull 24710 0 None closed [release-4.4] Bug 1814393: UPSTREAM: 88141: Don't try to create VolumeSpec immediately after underlying PVC is being del... 2021-01-19 09:41:43 UTC
Red Hat Product Errata RHBA-2020:2445 0 None None None 2020-06-17 22:26:28 UTC

Description Ryan Phillips 2020-03-17 18:43:10 UTC
+++ This bug was initially created as a clone of Bug #1814291 +++

After an e2e run, some pods are stuck in Terminating, and it is not clear why:

$ oc get pods -o yaml
apiVersion: v1
items:
- apiVersion: v1
  kind: Pod
  metadata:
    annotations:
      k8s.v1.cni.cncf.io/networks-status: ""
      openshift.io/scc: anyuid
    creationTimestamp: "2020-03-11T06:30:12Z"
    deletionGracePeriodSeconds: 30
    deletionTimestamp: "2020-03-14T04:05:45Z"
    generateName: pvc-volume-tester-
    name: pvc-volume-tester-dkp9s
    namespace: e2e-csi-mock-volumes-7010
    resourceVersion: "2073172"
    selfLink: /api/v1/namespaces/e2e-csi-mock-volumes-7010/pods/pvc-volume-tester-dkp9s
    uid: 6aefd56f-c478-4295-b8c8-779113edc0d2
  spec:
    containers:
    - image: k8s.gcr.io/pause:3.1
      imagePullPolicy: IfNotPresent
      name: volume-tester
      resources: {}
      securityContext:
        capabilities:
          drop:
          - MKNOD
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      volumeMounts:
      - mountPath: /mnt/test
        name: my-volume
      - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
        name: default-token-gd6q6
        readOnly: true
    dnsPolicy: ClusterFirst
    enableServiceLinks: true
    imagePullSecrets:
    - name: default-dockercfg-fzcxt
    nodeName: ip-10-0-132-57.us-east-2.compute.internal
    priority: 0
    restartPolicy: Never
    schedulerName: default-scheduler
    securityContext:
      seLinuxOptions:
        level: s0:c122,c49
    serviceAccount: default
    serviceAccountName: default
    terminationGracePeriodSeconds: 30
    tolerations:
    - effect: NoExecute
      key: node.kubernetes.io/not-ready
      operator: Exists
      tolerationSeconds: 300
    - effect: NoExecute
      key: node.kubernetes.io/unreachable
      operator: Exists
      tolerationSeconds: 300
    volumes:
    - name: my-volume
      persistentVolumeClaim:
        claimName: pvc-zmb4r
    - name: default-token-gd6q6
      secret:
        defaultMode: 420
        secretName: default-token-gd6q6
  status:
    conditions:
    - lastProbeTime: null
      lastTransitionTime: "2020-03-11T06:30:12Z"
      reason: PodCompleted
      status: "True"
      type: Initialized
    - lastProbeTime: null
      lastTransitionTime: "2020-03-14T04:05:16Z"
      reason: PodCompleted
      status: "False"
      type: Ready
    - lastProbeTime: null
      lastTransitionTime: "2020-03-14T04:05:16Z"
      reason: PodCompleted
      status: "False"
      type: ContainersReady
    - lastProbeTime: null
      lastTransitionTime: "2020-03-11T06:30:12Z"
      status: "True"
      type: PodScheduled
    containerStatuses:
    - containerID: cri-o://47d13e4a73ef34dbaeaa1dc15f06651e2cae3d767988e0e1f86ccc01b754eee8
      image: k8s.gcr.io/pause:3.1
      imageID: k8s.gcr.io/pause@sha256:59eec8837a4d942cc19a52b8c09ea75121acc38114a2c68b98983ce9356b8610
      lastState: {}
      name: volume-tester
      ready: false
      restartCount: 0
      started: false
      state:
        terminated:
          containerID: cri-o://47d13e4a73ef34dbaeaa1dc15f06651e2cae3d767988e0e1f86ccc01b754eee8
          exitCode: 0
          finishedAt: "2020-03-14T04:05:15Z"
          reason: Completed
          startedAt: "2020-03-11T06:30:23Z"
    hostIP: 10.0.132.57
    phase: Succeeded
    podIP: 10.129.3.125
    podIPs:
    - ip: 10.129.3.125
    qosClass: BestEffort
    startTime: "2020-03-11T06:30:12Z"
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""



$ oc describe pod pvc-volume-tester-dkp9s
Name:                      pvc-volume-tester-dkp9s
Namespace:                 e2e-csi-mock-volumes-7010
Priority:                  0
Node:                      ip-10-0-132-57.us-east-2.compute.internal/10.0.132.57
Start Time:                Wed, 11 Mar 2020 02:30:12 -0400
Labels:                    <none>
Annotations:               k8s.v1.cni.cncf.io/networks-status: 
                           openshift.io/scc: anyuid
Status:                    Terminating (lasts 3d10h)
Termination Grace Period:  30s
IP:                        10.129.3.125
IPs:
  IP:  10.129.3.125
Containers:
  volume-tester:
    Container ID:   cri-o://47d13e4a73ef34dbaeaa1dc15f06651e2cae3d767988e0e1f86ccc01b754eee8
    Image:          k8s.gcr.io/pause:3.1
    Image ID:       k8s.gcr.io/pause@sha256:59eec8837a4d942cc19a52b8c09ea75121acc38114a2c68b98983ce9356b8610
    Port:           <none>
    Host Port:      <none>
    State:          Terminated
      Reason:       Completed
      Exit Code:    0
      Started:      Wed, 11 Mar 2020 02:30:23 -0400
      Finished:     Sat, 14 Mar 2020 00:05:15 -0400
    Ready:          False
    Restart Count:  0
    Environment:    <none>
    Mounts:
      /mnt/test from my-volume (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-gd6q6 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  my-volume:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  pvc-zmb4r
    ReadOnly:   false
  default-token-gd6q6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-gd6q6
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:          <none>


$ oc version
Client Version: v4.2.0-alpha.0-249-gc276ecb
Server Version: 4.4.0-0.nightly-2020-03-15-215151
Kubernetes Version: v1.17.1


Putting this on the Node team for now, since I'd expect the node to clean up this pod; but if it's an issue with the storage test itself, feel free to reassign.

As long as it is not a test-configuration- or behavior-specific bug, this is an urgent bug for 4.4, as we must ensure pods actually terminate.

--- Additional comment from Ryan Phillips on 2020-03-17 18:09:53 UTC ---

Known upstream issue https://github.com/kubernetes/kubernetes/issues/51835

--- Additional comment from Clayton Coleman on 2020-03-17 18:24:54 UTC ---

This is an urgent issue, moving back to 4.4.0. This may not be deferred without architect sign-off.

We should look at what we can do to improve debugging.  We can make e2e runs fail if there are still terminating pods.
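Failing the run on leftover Terminating pods is easy to do mechanically: any pod whose metadata carries a deletionTimestamp is mid-termination. A minimal sketch of such a gate (the helper name and the awk patterns keyed to the 4-space metadata indent of `oc get pods -o yaml` List output are illustrative, not part of any actual fix):

```shell
# Hypothetical helper, not from the bug: given `oc get pods -o yaml` output
# on stdin, print the names of pods that carry a deletionTimestamp (i.e. are
# Terminating). Relies on the 4-space metadata indent and alphabetical key
# order of List output, where deletionTimestamp precedes name.
list_terminating() {
  awk '
    /^- apiVersion:/          { pending = 0 }              # new item in the List
    /^    deletionTimestamp:/ { pending = 1 }              # pod is being deleted
    pending && /^    name: /  { print $2; pending = 0 }    # metadata.name of a Terminating pod
  '
}

# e.g. an e2e gate: fail if anything is still Terminating after the run
# [ -z "$(oc get pods --all-namespaces -o yaml | list_terminating)" ]
```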

Comment 1 Ryan Phillips 2020-03-19 15:28:25 UTC
Root cause has been found here: https://bugzilla.redhat.com/show_bug.cgi?id=1814282#c9

Closing this BZ as a duplicate of bug 1814282.

*** This bug has been marked as a duplicate of bug 1814282 ***

Comment 5 MinLi 2020-06-10 08:43:39 UTC
Verified on version 4.4.0-0.nightly-2020-06-08-083627.

Ran the commands below several times:
openshift-tests run openshift/conformance/parallel --dry-run | grep Feature:VolumeSnapshotDataSource > tests
openshift-tests run openshift/conformance/parallel -f tests

During the test, we can see:
# oc get volumesnapshots --all-namespaces
NAMESPACE               NAME             READYTOUSE   SOURCEPVC   SOURCESNAPSHOTCONTENT   RESTORESIZE   SNAPSHOTCLASS                                                  SNAPSHOTCONTENT                                    CREATIONTIME   AGE
e2e-provisioning-8741   snapshot-2qp26   true         pvc-vgnlq                           1Mi           e2e-provisioning-8741-csi-hostpath-e2e-provisioning-8741-vsc   snapcontent-39212bc3-7ba7-4fe0-bfc1-6c2a4176e12d   9s             10s


But once the test finishes:
# oc get volumesnapshots --all-namespaces
No resources found
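Checking for leftovers by hand is racy, since cleanup is asynchronous. A generic retry helper (the name `wait_until_empty` and the timings are illustrative assumptions, not commands from this bug) makes the "resources eventually disappear" check scriptable:

```shell
# Illustrative helper, not from the bug: rerun a command until its output is
# empty (no matching resources remain) or the try budget is exhausted.
# Returns 0 on success, 1 on timeout.
wait_until_empty() {
  tries=$1; shift
  while [ "$tries" -gt 0 ]; do
    # Empty output means the resources are gone.
    [ -z "$("$@" 2>/dev/null)" ] && return 0
    sleep 1
    tries=$((tries - 1))
  done
  return 1
}

# e.g.: wait_until_empty 60 oc get volumesnapshots --all-namespaces --no-headers
```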

Comment 7 errata-xmlrpc 2020-06-17 22:26:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2445

