Bug 1566150 - Fluentd pod won't terminate, even after pid of container was killed
Summary: Fluentd pod won't terminate, even after pid of container was killed
Keywords:
Status: CLOSED DUPLICATE of bug 1560428
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 3.9.0
Hardware: All
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Jhon Honce
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-04-11 15:53 UTC by Peter Portante
Modified: 2018-04-19 19:16 UTC (History)
8 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-04-19 19:16:19 UTC
Target Upstream Version:


Attachments


Links
System ID Priority Status Summary Last Updated
Red Hat Bugzilla 1561375 None CLOSED pods in terminating status for more than 50 mins, sometimes 2019-10-09 10:08:06 UTC

Internal Links: 1561375

Description Peter Portante 2018-04-11 15:53:09 UTC
We have a fluentd pod that won't terminate.  We issued an "oc delete pod logging-fluentd-dn9xt", and the pod just stays in the "Terminating" state.

This is on OCP 3.9 and one of the starter clusters.
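
For reference, a minimal sketch of the commands one might use to confirm the stuck state and, as a last resort, force removal of the pod object (assumes the "logging" namespace shown in the node logs below; note the force delete only removes the API object and does not clean up the container on the node):

oc get pod logging-fluentd-dn9xt -n logging -o wide      # confirm the phase and which node the pod runs on
oc describe pod logging-fluentd-dn9xt -n logging         # check events for termination errors
oc delete pod logging-fluentd-dn9xt -n logging --grace-period=0 --force   # last resort: drops the API object only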

Comment 1 Peter Portante 2018-04-11 15:54:54 UTC
We also issued a "kill -KILL <pid>" on the compute node where the fluentd pod was running, and while that worked to kill the process, it did not allow the pod to finish its termination sequence.
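
For context, this is roughly how the container is mapped to a host PID on the node before sending the signal (a sketch, assuming the docker runtime used on OCP 3.9 nodes; the container-id and pid are placeholders):

docker ps | grep fluentd                                  # find the fluentd container ID
docker inspect --format '{{.State.Pid}}' <container-id>   # host PID of the container's main process
kill -KILL <pid>                                          # kills the process, but the pod stays in Terminating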

Comment 3 Jan Safranek 2018-04-12 13:44:47 UTC
Peter, is there any indication that this bug is caused by storage?

Next time, it would be useful if you could check the node logs, find anything related to the pod in question, and post it here - that will help route the problem to the correct person.


The reason why the pod can't be killed:
Apr 12 03:30:51 ip-172-31-30-246 atomic-openshift-node: E0412 03:30:51.109764   26646 pod_workers.go:186] Error syncing pod db2852c8-3c68-11e8-a010-02d8407159d1 ("logging-fluentd-dn9xt_logging(db2852c8-3c68-11e8-a010-02d8407159d1)"), skipping: error killing pod: [failed to "KillContainer" for "fluentd-elasticsearch" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container 437efd7ceb2ea951a55f9bfc59525b214b3bfcc4e9113552e97cdae734a41030: Cannot kill container 437efd7ceb2ea951a55f9bfc59525b214b3bfcc4e9113552e97cdae734a41030: rpc error: code = 14 desc = grpc: the connection is unavailable"

This repeats every ~30 seconds. There are *many* messages like this in the log; by my count, 48 pods are affected:

egrep -o 'failed to "KillContainer" for "[^"]*"' /var/log/messages | sort | uniq
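
A rough way to turn that into a count of distinct affected pods, assuming the 'Error syncing pod <uid>' format shown above:

egrep -o 'Error syncing pod [0-9a-f-]+' /var/log/messages | sort -u | wc -l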

I was not able to find the place in the logs where it started:
- Apr 10 02:44:45 - node successfully mounted volumes for pod logging-fluentd-dn9xt
- Apr 12 03:30:51 - the first 'failed to "KillPodSandbox" for' logging-fluentd-dn9xt

There are no log entries for the pod in between, which is odd. Notably, messages-20180411.gz does not contain any messages about the pod at all!

Anyway, I don't think storage is responsible for this bug; there seems to be something wrong between the kubelet and docker.
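
If storage is ruled out, a few sanity checks on the node can help narrow the "grpc: the connection is unavailable" error down to the docker daemon losing contact with its containerd child (a sketch; process and unit names assume the RHEL 7 docker 1.13 packaging):

systemctl status docker                                    # is the docker unit itself still healthy?
ps aux | grep -i [c]ontainerd                              # is the docker-containerd child process still running?
docker info > /dev/null && echo "docker daemon responding" # can the daemon still answer API calls?
journalctl -u docker --since "2018-04-12 03:00" | grep -i containerd | tail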

Comment 4 Karthik Prabhakar 2018-04-15 00:13:48 UTC
I'm seeing this error too with OpenShift 3.9 (on RHEL 7.5), but with application pods instead of fluentd. There are no storage volumes mounted in the application, so this points to a kubelet-to-docker issue.


Apr 14 23:23:52 nd1.cnx.tigera.io atomic-openshift-node[88869]: I0414 23:23:52.128444   88869 kuberuntime_container.go:581] Killing container "docker://416313dedcc5cc2c714d261dedcfcb150c308574d53d272808661e464bc7c106" with 30 second grace period
Apr 14 23:23:52 nd1.cnx.tigera.io atomic-openshift-node[88869]: E0414 23:23:52.398297   88869 docker_sandbox.go:233] Failed to stop sandbox "6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21": Error response from daemon: Cannot stop container 6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21: Cannot kill container 6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21: rpc error: code = 14 desc = grpc: the connection is unavailable
Apr 14 23:23:52 nd1.cnx.tigera.io atomic-openshift-node[88869]: E0414 23:23:52.572890   88869 remote_runtime.go:115] StopPodSandbox "6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21" from runtime service failed: rpc error: code = Unknown desc = Error response from daemon: Cannot stop container 6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21: Cannot kill container 6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21: rpc error: code = 14 desc = grpc: the connection is unavailable
Apr 14 23:23:52 nd1.cnx.tigera.io atomic-openshift-node[88869]: E0414 23:23:52.572936   88869 kuberuntime_manager.go:800] Failed to stop sandbox {"docker" "6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21"}
Apr 14 23:23:52 nd1.cnx.tigera.io atomic-openshift-node[88869]: E0414 23:23:52.573005   88869 kubelet.go:1522] error killing pod: [failed to "KillContainer" for "backend" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container f37f487d8593810fce154b46582d1c06da420bac360eb62e1b25a61954a7df28: Cannot kill container f37f487d8593810fce154b46582d1c06da420bac360eb62e1b25a61954a7df28: rpc error: code = 14 desc = grpc: the connection is unavailable"
Apr 14 23:23:52 nd1.cnx.tigera.io atomic-openshift-node[88869]: , failed to "KillPodSandbox" for "0675b2e5-4011-11e8-9200-06e00806d9c2" with KillPodSandboxError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container 6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21: Cannot kill container 6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21: rpc error: code = 14 desc = grpc: the connection is unavailable"
Apr 14 23:23:52 nd1.cnx.tigera.io atomic-openshift-node[88869]: ]
Apr 14 23:23:52 nd1.cnx.tigera.io atomic-openshift-node[88869]: E0414 23:23:52.573023   88869 pod_workers.go:186] Error syncing pod 0675b2e5-4011-11e8-9200-06e00806d9c2 ("backend-qgz9c_stars(0675b2e5-4011-11e8-9200-06e00806d9c2)"), skipping: error killing pod: [failed to "KillContainer" for "backend" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container f37f487d8593810fce154b46582d1c06da420bac360eb62e1b25a61954a7df28: Cannot kill container f37f487d8593810fce154b46582d1c06da420bac360eb62e1b25a61954a7df28: rpc error: code = 14 desc = grpc: the connection is unavailable"
Apr 14 23:23:52 nd1.cnx.tigera.io atomic-openshift-node[88869]: , failed to "KillPodSandbox" for "0675b2e5-4011-11e8-9200-06e00806d9c2" with KillPodSandboxError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container 6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21: Cannot kill container 6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21: rpc error: code = 14 desc = grpc: the connection is unavailable"


I also see the following in the logs; not sure if it's related:

Apr 14 23:24:00 nd1.cnx.tigera.io atomic-openshift-node[88869]: E0414 23:24:00.126665   88869 remote_runtime.go:229] StopContainer "af17a95e87e8f1929726a7b7068b2a0299f2f610579e2beffc143e368e3e2b59" from runtime service failed: rpc error: code = Unknown desc = Error response from daemon: Cannot stop container af17a95e87e8f1929726a7b7068b2a0299f2f610579e2beffc143e368e3e2b59: Cannot kill container af17a95e87e8f1929726a7b7068b2a0299f2f610579e2beffc143e368e3e2b59: rpc error: code = 14 desc = grpc: the connection is unavailable
Apr 14 23:24:00 nd1.cnx.tigera.io atomic-openshift-node[88869]: E0414 23:24:00.126704   88869 kuberuntime_container.go:603] Container "docker://af17a95e87e8f1929726a7b7068b2a0299f2f610579e2beffc143e368e3e2b59" termination failed with gracePeriod 30: rpc error: code = Unknown desc = Error response from daemon: Cannot stop container af17a95e87e8f1929726a7b7068b2a0299f2f610579e2beffc143e368e3e2b59: Cannot kill container af17a95e87e8f1929726a7b7068b2a0299f2f610579e2beffc143e368e3e2b59: rpc error: code = 14 desc = grpc: the connection is unavailable
Apr 14 23:24:00 nd1.cnx.tigera.io atomic-openshift-node[88869]: E0414 23:24:00.150983   88869 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"backend2-x86rz.15256f68e43c21ad", GenerateName:"", Namespace:"stars", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"stars", Name:"backend2-x86rz", UID:"acf9db13-4011-11e8-9200-06e00806d9c2", APIVersion:"v1", ResourceVersion:"1849671", FieldPath:"spec.containers{backend2}"}, Reason:"Killing", Message:"Killing container with id docker://backend2:Need to kill Pod", Source:v1.EventSource{Component:"kubelet", Host:"nd1.cnx.tigera.io"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63659343345, loc:(*time.Location)(0xf584ba0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbeacc144078ddf96, ext:89034287503739, loc:(*time.Location)(0xf584ba0)}}, Count:92, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "backend2-x86rz.15256f68e43c21ad" is forbidden: unable to create new content in namespace stars because it is being terminated.' (will not retry!)

Comment 5 Seth Jennings 2018-04-16 14:25:57 UTC
It seems like the kubelet is doing the right thing in trying to kill the container.

Sending to the Containers component for investigation into the rather opaque "Cannot stop container" error we are getting from docker.

Comment 6 Seth Jennings 2018-04-19 19:16:19 UTC

*** This bug has been marked as a duplicate of bug 1560428 ***

