Bug 1450461
| Summary: | Unable to delete pods stuck in terminating state | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Justin Pierce <jupierce> |
| Component: | Node | Assignee: | Derek Carr <decarr> |
| Status: | CLOSED ERRATA | QA Contact: | DeShuai Ma <dma> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.6.0 | CC: | anli, aos-bugs, decarr, dma, eparis, hgomes, jhonce, jokerman, jupierce, mkargaki, mmccomas, mwoodson, pweil, qcai, rpuccini, sjenning, smunilla |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1449277 | Environment: | |
| Last Closed: | 2018-04-09 21:13:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Justin Pierce 2017-05-12 15:54:49 UTC
Ignore the clone of / depends on 1449277; those issues are apparently unrelated.

Example undeletable pod:

```
[root@dev-preview-stg-master-defb2 ~]# oc describe pod pull-05121420z-dw-1-8pjp0
Name:           pull-05121420z-dw-1-8pjp0
Namespace:      ops-health-monitoring
Security Policy:        restricted
Node:           ip-172-31-9-166.ec2.internal/
Labels:         app=pull-05121420z-dw
                deployment=pull-05121420z-dw-1
                deploymentconfig=pull-05121420z-dw
Annotations:    kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"ops-health-monitoring","name":"pull-05121420z-dw-1","uid":"1a5fecfb-37...
                openshift.io/deployment-config.latest-version=1
                openshift.io/deployment-config.name=pull-05121420z-dw
                openshift.io/deployment.name=pull-05121420z-dw-1
                openshift.io/generated-by=OpenShiftNewApp
                openshift.io/scc=restricted
Status:         Terminating (expires Fri, 12 May 2017 14:25:45 +0000)
Termination Grace Period:       30s
IP:
Controllers:    ReplicationController/pull-05121420z-dw-1
Containers:
  pull-05121420z-dw:
    Image:      openshift/hello-openshift@sha256:7ce9d7b0c83a3abef41e0db590c5aa39fb05793315c60fd907f2c609997caf11
    Ports:      8080/TCP, 8888/TCP
    Environment:        <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-ds8ay (ro)
Conditions:
  Type          Status
  PodScheduled  True
Volumes:
  default-token-ds8ay:
    Type:       Secret (a volume populated by a Secret)
    SecretName: default-token-ds8ay
    Optional:   false
QoS Class:      BestEffort
Node-Selectors: type=compute
Tolerations:    <none>
Events:
  FirstSeen  LastSeen  Count  From                                   SubObjectPath                       Type    Reason     Message
  ---------  --------  -----  ----                                   -------------                       ------  ------     -------
  1h         1h        1      default-scheduler                                                          Normal  Scheduled  Successfully assigned pull-05121420z-dw-1-8pjp0 to ip-172-31-9-166.ec2.internal
  1h         1h        1      kubelet, ip-172-31-9-166.ec2.internal  spec.containers{pull-05121420z-dw}  Normal  Pulling    pulling image "openshift/hello-openshift@sha256:7ce9d7b0c83a3abef41e0db590c5aa39fb05793315c60fd907f2c609997caf11"
  1h         1h        1      kubelet, ip-172-31-9-166.ec2.internal  spec.containers{pull-05121420z-dw}  Normal  Pulled     Successfully pulled image "openshift/hello-openshift@sha256:7ce9d7b0c83a3abef41e0db590c5aa39fb05793315c60fd907f2c609997caf11"
  1h         1h        1      kubelet, ip-172-31-9-166.ec2.internal  spec.containers{pull-05121420z-dw}  Normal  Created    Created container with docker id 4f6ed2bdc6f2; Security:[seccomp=unconfined]
  1h         1h        1      kubelet, ip-172-31-9-166.ec2.internal  spec.containers{pull-05121420z-dw}  Normal  Started    Started container with docker id 4f6ed2bdc6f2
  1h         1h        1      kubelet, ip-172-31-9-166.ec2.internal  spec.containers{pull-05121420z-dw}  Normal  Killing    Killing container with docker id 4f6ed2bdc6f2: Need to kill pod.
```

```yaml
[root@dev-preview-stg-master-defb2 ~]# oc get pod pull-05121420z-dw-1-8pjp0 -o=yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    kubernetes.io/created-by: |
      {"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"ops-health-monitoring","name":"pull-05121420z-dw-1","uid":"1a5fecfb-371e-11e7-8d75-0eaa067b1713","apiVersion":"v1","resourceVersion":"113712485"}}
    openshift.io/deployment-config.latest-version: "1"
    openshift.io/deployment-config.name: pull-05121420z-dw
    openshift.io/deployment.name: pull-05121420z-dw-1
    openshift.io/generated-by: OpenShiftNewApp
    openshift.io/scc: restricted
  creationTimestamp: 2017-05-12T14:20:15Z
  deletionGracePeriodSeconds: 30
  deletionTimestamp: 2017-05-12T14:25:45Z
  generateName: pull-05121420z-dw-1-
  labels:
    app: pull-05121420z-dw
    deployment: pull-05121420z-dw-1
    deploymentconfig: pull-05121420z-dw
  name: pull-05121420z-dw-1-8pjp0
  namespace: ops-health-monitoring
  ownerReferences:
  - apiVersion: v1
    blockOwnerDeletion: true
    controller: true
    kind: ReplicationController
    name: pull-05121420z-dw-1
    uid: 1a5fecfb-371e-11e7-8d75-0eaa067b1713
  resourceVersion: "113713858"
  selfLink: /api/v1/namespaces/ops-health-monitoring/pods/pull-05121420z-dw-1-8pjp0
  uid: 1e6994b4-371e-11e7-8d75-0eaa067b1713
spec:
  containers:
  - image: openshift/hello-openshift@sha256:7ce9d7b0c83a3abef41e0db590c5aa39fb05793315c60fd907f2c609997caf11
    imagePullPolicy: Always
    name: pull-05121420z-dw
    ports:
    - containerPort: 8080
      protocol: TCP
    - containerPort: 8888
      protocol: TCP
    resources: {}
    securityContext:
      capabilities:
        drop:
        - KILL
        - MKNOD
        - SETGID
        - SETUID
        - SYS_CHROOT
      privileged: false
      runAsUser: 1056510000
      seLinuxOptions:
        level: s0:c238,c52
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: default-token-ds8ay
      readOnly: true
  dnsPolicy: ClusterFirst
  imagePullSecrets:
  - name: default-dockercfg-ji2kp
  nodeName: ip-172-31-9-166.ec2.internal
  nodeSelector:
    type: compute
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext:
    fsGroup: 1056510000
    seLinuxOptions:
      level: s0:c238,c52
  serviceAccount: default
  serviceAccountName: default
  terminationGracePeriodSeconds: 30
  volumes:
  - name: default-token-ds8ay
    secret:
      defaultMode: 420
      secretName: default-token-ds8ay
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: 2017-05-12T14:20:15Z
    status: "True"
    type: PodScheduled
  phase: Pending
  qosClass: BestEffort
```

One quirk worth noting: setting `blockOwnerDeletion` to false, like so:

```yaml
ownerReferences:
- apiVersion: v1
  blockOwnerDeletion: false
  controller: true
  kind: ReplicationController
  name: pull-05121520z-0k-1
  uid: 7c068bb1-3726-11e7-8d75-0eaa067b1713
```

will not cause the pod to be cleaned up. Removing the field entirely, like so:

```yaml
ownerReferences:
- apiVersion: v1
  controller: true
  kind: ReplicationController
  name: pull-05121520z-0k-1
  uid: 7c068bb1-3726-11e7-8d75-0eaa067b1713
```

will cause the pod to be cleaned up. This upstream PR, I *think*, should stop the problem from impacting our kubelet.
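The patches above work because `oc patch` applies merge-patch semantics server-side, where a `null` value deletes the key. A minimal local sketch of that behavior (the `merge_patch` helper is illustrative, not part of OpenShift; real patching happens on the API server):

```python
import json

# Pod metadata as returned by the API, abbreviated; values from the bug above
pod = {
    "metadata": {
        "name": "pull-05121420z-dw-1-8pjp0",
        "ownerReferences": [
            {
                "apiVersion": "v1",
                "blockOwnerDeletion": True,
                "controller": True,
                "kind": "ReplicationController",
                "name": "pull-05121420z-dw-1",
                "uid": "1a5fecfb-371e-11e7-8d75-0eaa067b1713",
            }
        ],
    }
}

def merge_patch(obj, patch):
    """Simplified JSON-merge-patch semantics: null deletes a key,
    dicts merge recursively, anything else replaces the value."""
    for key, value in patch.items():
        if value is None:
            obj.pop(key, None)
        elif isinstance(value, dict) and isinstance(obj.get(key), dict):
            merge_patch(obj[key], value)
        else:
            obj[key] = value
    return obj

# The exact patch string passed to `oc patch` in the script below
patch = json.loads('{"metadata":{"ownerReferences":null}}')
merge_patch(pod, patch)
assert "ownerReferences" not in pod["metadata"]  # field removed, rest intact
```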
See: https://github.com/kubernetes/kubernetes/pull/45747

For reference, I ran the following script on dev-preview-stg:

```python
#!/usr/bin/env python
import json
import subprocess

oc = subprocess.Popen(['oc', 'get', 'pod', '--all-namespaces', '-o', 'json'],
                      stdout=subprocess.PIPE)
stdout = oc.communicate()[0]
oc.wait()

def deleteOwnerReferences(pod, ns):
    print("Deleting reference for: %s %s" % (pod, ns))
    patch = subprocess.Popen(['oc', 'patch', '-n', ns, 'pod', pod,
                              '-p', '{"metadata":{"ownerReferences":null}}'])
    patch.wait()

def clearPods(pods):
    for pod, ns in pods:
        deleteOwnerReferences(pod, ns)

terminating = []
monitoring_terminating = []
pods = json.loads(stdout)
for pod in pods["items"]:
    if "deletionTimestamp" in pod["metadata"]:
        name = pod["metadata"]["name"]
        ns = pod["metadata"]["namespace"]
        if ns == "ops-health-monitoring":
            monitoring_terminating.append((name, ns))
        else:
            terminating.append((name, ns))

# Keep the first 10 pods (sorted by name) for later evaluation
monitoring_terminating.sort(key=lambda tup: tup[0])
monitoring_terminating = monitoring_terminating[10:]
clearPods(monitoring_terminating)
#clearPods(terminating)
print(len(monitoring_terminating))
#print(len(terminating))
```

This cleans up most of the terminating pods (I intentionally left 10 pods in ops-health-monitoring so I had things to debug; it seems to be creating more at a rapid rate). We still had about 20 other pods stuck terminating for a different reason. See https://bugzilla.redhat.com/show_bug.cgi?id=1450554 for a BZ about at least some of those stuck terminating pods.

This specific bug ONLY affects 3.5 nodes and 3.6 masters.

Reproduce steps:
1. Install OCP 3.5 with dedicated nodes.
2. Create applications: `oc new-app cakephp-mysql-example`
3. Enable OCP repos including openshift-3.6.74.
4. Upgrade the control plane with upgrade_control_plane.yml.
5. Upgrade the nodes. The upgrade_nodes.yml playbook hangs and there are terminating pods.

When upgrading to openshift-3.6.101, the upgrade succeeded without this issue, so the bug is moved to VERIFIED.
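The selection logic in the script above treats any pod with `metadata.deletionTimestamp` set as terminating. A stricter variant would only flag pods whose deletion deadline has already passed; a small sketch of that check (the `is_stuck_terminating` helper and the `slack` window are my own illustration, not from the bug):

```python
from datetime import datetime, timedelta

def is_stuck_terminating(pod, now, slack=timedelta(minutes=5)):
    """A pod looks 'stuck' if its deletionTimestamp (plus some slack)
    has passed but the pod object still exists in the API."""
    ts = pod["metadata"].get("deletionTimestamp")
    if ts is None:
        return False  # not being deleted at all
    deadline = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
    return now > deadline + slack

# deletionTimestamp taken from the pod YAML in this bug
pod = {"metadata": {"deletionTimestamp": "2017-05-12T14:25:45Z"}}
now = datetime(2017, 5, 12, 15, 54)  # roughly when the bug was filed
print(is_stuck_terminating(pod, now))  # True: the pod outlived its deadline
```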
```
# oc version
oc v3.6.101
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://openshift-225.lab.eng.nay.redhat.com:8443
openshift v3.6.101
kubernetes v1.6.1+5115d708d7
```

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716