Bug 1450461
| Summary: | Unable to delete pods stuck in terminating state | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Justin Pierce <jupierce> |
| Component: | Node | Assignee: | Derek Carr <decarr> |
| Status: | CLOSED ERRATA | QA Contact: | DeShuai Ma <dma> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.6.0 | CC: | anli, aos-bugs, decarr, dma, eparis, hgomes, jhonce, jokerman, jupierce, mkargaki, mmccomas, mwoodson, pweil, qcai, rpuccini, sjenning, smunilla |
| Target Milestone: | --- | Keywords: | Reopened |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 1449277 | Environment: | |
| Last Closed: | 2018-04-09 21:13:32 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Description
Justin Pierce
2017-05-12 15:54:49 UTC
Ignore the Clone Of / Depends On reference to 1449277 -- those issues are apparently unrelated. Example of an undeletable pod:
[root@dev-preview-stg-master-defb2 ~]# oc describe pod pull-05121420z-dw-1-8pjp0
Name: pull-05121420z-dw-1-8pjp0
Namespace: ops-health-monitoring
Security Policy: restricted
Node: ip-172-31-9-166.ec2.internal/
Labels: app=pull-05121420z-dw
deployment=pull-05121420z-dw-1
deploymentconfig=pull-05121420z-dw
Annotations: kubernetes.io/created-by={"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"ops-health-monitoring","name":"pull-05121420z-dw-1","uid":"1a5fecfb-37...
openshift.io/deployment-config.latest-version=1
openshift.io/deployment-config.name=pull-05121420z-dw
openshift.io/deployment.name=pull-05121420z-dw-1
openshift.io/generated-by=OpenShiftNewApp
openshift.io/scc=restricted
Status: Terminating (expires Fri, 12 May 2017 14:25:45 +0000)
Termination Grace Period: 30s
IP:
Controllers: ReplicationController/pull-05121420z-dw-1
Containers:
pull-05121420z-dw:
Image: openshift/hello-openshift@sha256:7ce9d7b0c83a3abef41e0db590c5aa39fb05793315c60fd907f2c609997caf11
Ports: 8080/TCP, 8888/TCP
Environment: <none>
Mounts:
/var/run/secrets/kubernetes.io/serviceaccount from default-token-ds8ay (ro)
Conditions:
Type Status
PodScheduled True
Volumes:
default-token-ds8ay:
Type: Secret (a volume populated by a Secret)
SecretName: default-token-ds8ay
Optional: false
QoS Class: BestEffort
Node-Selectors: type=compute
Tolerations: <none>
Events:
FirstSeen LastSeen Count From SubObjectPath Type Reason Message
--------- -------- ----- ---- ------------- -------- ------ -------
1h 1h 1 default-scheduler Normal Scheduled Successfully assigned pull-05121420z-dw-1-8pjp0 to ip-172-31-9-166.ec2.internal
1h 1h 1 kubelet, ip-172-31-9-166.ec2.internal spec.containers{pull-05121420z-dw} Normal Pulling pulling image "openshift/hello-openshift@sha256:7ce9d7b0c83a3abef41e0db590c5aa39fb05793315c60fd907f2c609997caf11"
1h 1h 1 kubelet, ip-172-31-9-166.ec2.internal spec.containers{pull-05121420z-dw} Normal Pulled Successfully pulled image "openshift/hello-openshift@sha256:7ce9d7b0c83a3abef41e0db590c5aa39fb05793315c60fd907f2c609997caf11"
1h 1h 1 kubelet, ip-172-31-9-166.ec2.internal spec.containers{pull-05121420z-dw} Normal Created Created container with docker id 4f6ed2bdc6f2; Security:[seccomp=unconfined]
1h 1h 1 kubelet, ip-172-31-9-166.ec2.internal spec.containers{pull-05121420z-dw} Normal Started Started container with docker id 4f6ed2bdc6f2
1h 1h 1 kubelet, ip-172-31-9-166.ec2.internal spec.containers{pull-05121420z-dw} Normal Killing Killing container with docker id 4f6ed2bdc6f2: Need to kill pod.
[root@dev-preview-stg-master-defb2 ~]# oc get pod pull-05121420z-dw-1-8pjp0 -o=yaml
apiVersion: v1
kind: Pod
metadata:
annotations:
kubernetes.io/created-by: |
{"kind":"SerializedReference","apiVersion":"v1","reference":{"kind":"ReplicationController","namespace":"ops-health-monitoring","name":"pull-05121420z-dw-1","uid":"1a5fecfb-371e-11e7-8d75-0eaa067b1713","apiVersion":"v1","resourceVersion":"113712485"}}
openshift.io/deployment-config.latest-version: "1"
openshift.io/deployment-config.name: pull-05121420z-dw
openshift.io/deployment.name: pull-05121420z-dw-1
openshift.io/generated-by: OpenShiftNewApp
openshift.io/scc: restricted
creationTimestamp: 2017-05-12T14:20:15Z
deletionGracePeriodSeconds: 30
deletionTimestamp: 2017-05-12T14:25:45Z
generateName: pull-05121420z-dw-1-
labels:
app: pull-05121420z-dw
deployment: pull-05121420z-dw-1
deploymentconfig: pull-05121420z-dw
name: pull-05121420z-dw-1-8pjp0
namespace: ops-health-monitoring
ownerReferences:
- apiVersion: v1
blockOwnerDeletion: true
controller: true
kind: ReplicationController
name: pull-05121420z-dw-1
uid: 1a5fecfb-371e-11e7-8d75-0eaa067b1713
resourceVersion: "113713858"
selfLink: /api/v1/namespaces/ops-health-monitoring/pods/pull-05121420z-dw-1-8pjp0
uid: 1e6994b4-371e-11e7-8d75-0eaa067b1713
spec:
containers:
- image: openshift/hello-openshift@sha256:7ce9d7b0c83a3abef41e0db590c5aa39fb05793315c60fd907f2c609997caf11
imagePullPolicy: Always
name: pull-05121420z-dw
ports:
- containerPort: 8080
protocol: TCP
- containerPort: 8888
protocol: TCP
resources: {}
securityContext:
capabilities:
drop:
- KILL
- MKNOD
- SETGID
- SETUID
- SYS_CHROOT
privileged: false
runAsUser: 1056510000
seLinuxOptions:
level: s0:c238,c52
terminationMessagePath: /dev/termination-log
terminationMessagePolicy: File
volumeMounts:
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
name: default-token-ds8ay
readOnly: true
dnsPolicy: ClusterFirst
imagePullSecrets:
- name: default-dockercfg-ji2kp
nodeName: ip-172-31-9-166.ec2.internal
nodeSelector:
type: compute
restartPolicy: Always
schedulerName: default-scheduler
securityContext:
fsGroup: 1056510000
seLinuxOptions:
level: s0:c238,c52
serviceAccount: default
serviceAccountName: default
terminationGracePeriodSeconds: 30
volumes:
- name: default-token-ds8ay
secret:
defaultMode: 420
secretName: default-token-ds8ay
status:
conditions:
- lastProbeTime: null
lastTransitionTime: 2017-05-12T14:20:15Z
status: "True"
type: PodScheduled
phase: Pending
qosClass: BestEffort

Some quirky info that others likely already understand: setting blockOwnerDeletion to false, like so:
ownerReferences:
- apiVersion: v1
blockOwnerDeletion: false
controller: true
kind: ReplicationController
name: pull-05121520z-0k-1
uid: 7c068bb1-3726-11e7-8d75-0eaa067b1713
does not cause the pod to be cleaned up. Removing the ownerReferences entry entirely, like so:
ownerReferences:
- apiVersion: v1
controller: true
kind: ReplicationController
name: pull-05121520z-0k-1
uid: 7c068bb1-3726-11e7-8d75-0eaa067b1713
does cause the pod to be cleaned up.
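For reference, the same edit can be applied without hand-editing the pod by patching ownerReferences to null; this is the same patch the script further below issues for each pod (namespace and pod names here are placeholders):

oc patch -n <namespace> pod <pod-name> -p '{"metadata":{"ownerReferences":null}}'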
This upstream PR should, I *think*, stop the problem from impacting our kubelet: https://github.com/kubernetes/kubernetes/pull/45747

Just for reference, I ran the following script on dev-preview-stg:
#!/usr/bin/env python
import json
import subprocess

# Grab every pod in the cluster as JSON.
oc = subprocess.Popen(['oc', 'get', 'pod', '--all-namespaces', '-o', 'json'], stdout=subprocess.PIPE)
stdout = oc.communicate()[0]
oc.wait()

def deleteOwnerReferences(pod, ns):
    print("Deleting reference for: %s %s" % (pod, ns))
    patch = subprocess.Popen(['oc', 'patch', '-n', ns, 'pod', pod, '-p', '{"metadata":{"ownerReferences":null}}'])
    patch.wait()

def clearPods(pods):
    for pod, ns in pods:
        deleteOwnerReferences(pod, ns)

terminating = []
monitoring_terminating = []
pods = json.loads(stdout)
for pod in pods["items"]:
    # A deletionTimestamp means the pod is terminating.
    if "deletionTimestamp" in pod["metadata"]:
        name = pod["metadata"]["name"]
        ns = pod["metadata"]["namespace"]
        if ns == "ops-health-monitoring":
            monitoring_terminating.append((name, ns))
        else:
            terminating.append((name, ns))

# Save 10 pods for later evaluation
monitoring_terminating.sort(key=lambda tup: tup[0])
monitoring_terminating = monitoring_terminating[10:]
clearPods(monitoring_terminating)
#clearPods(terminating)
print(len(monitoring_terminating))
#print(len(terminating))
This cleaned up most of the terminating pods. (I intentionally left 10 pods in ops-health-monitoring so I had something to debug; the project seems to be creating more at a rapid rate.)
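A quick way to count how many pods remain stuck (pods with a deletionTimestamp show a STATUS of Terminating in oc get output, as in the describe output above) is something like:

oc get pods --all-namespaces | grep -c Terminating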
We still had about 20 other pods stuck terminating; those are stuck for a different reason. See https://bugzilla.redhat.com/show_bug.cgi?id=1450554 for a BZ covering at least some of those stuck terminating pods.
This specific bug ONLY affects 3.5 nodes with 3.6 masters.

Reproduce steps:
1. Install ocp-3.5 with dedicated nodes.
2. Create applications: oc new-app cakephp-mysql-example
3. Enable the OCP repos that include openshift-3.6.74.
4. Upgrade the control plane with upgrade_control_plane.yml.
5. Upgrade the nodes. The upgrade_nodes.yml playbook hangs and there are terminating pods.

When upgrading to openshift-3.6.101, the upgrade succeeded without this issue, so moving the bug to verified.

# oc version
oc v3.6.101
kubernetes v1.6.1+5115d708d7
features: Basic-Auth GSSAPI Kerberos SPNEGO
Server https://openshift-225.lab.eng.nay.redhat.com:8443
openshift v3.6.101
kubernetes v1.6.1+5115d708d7

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1716