Bug 1566150
| Summary: | Fluentd pod won't terminate, even after pid of container was killed | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Peter Portante <pportant> |
| Component: | Containers | Assignee: | Jhon Honce <jhonce> |
| Status: | CLOSED DUPLICATE | QA Contact: | DeShuai Ma <dma> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.9.0 | CC: | aos-bugs, aos-storage-staff, jokerman, jsafrane, kp, mmccomas, sjenning, wmeng |
| Target Milestone: | --- | Keywords: | OpsBlocker |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-04-19 19:16:19 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Peter Portante
2018-04-11 15:53:09 UTC
We also issued a "kill -KILL <pid>" on the compute node where the fluentd pod was running, and while that killed the process, it did not allow the pod to finish its termination sequence.

Peter, are there any indications that this bug is caused by storage?
Next time, it would be useful if you could check the node logs, find anything related to the pod in question, and post it here - that will help route the problem to the correct person.
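As a minimal sketch of the kind of collection meant here (the atomic-openshift-node unit and /var/log/messages paths are the ones that show up later in this bug; adjust for your node):

# kubelet-side logs mentioning the pod
journalctl -u atomic-openshift-node --no-pager | grep logging-fluentd-dn9xt
# zgrep also reads the rotated, compressed archives alongside the live file
zgrep -h logging-fluentd-dn9xt /var/log/messages /var/log/messages-*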
The reason why the pod can't be killed:
Apr 12 03:30:51 ip-172-31-30-246 atomic-openshift-node: E0412 03:30:51.109764 26646 pod_workers.go:186] Error syncing pod db2852c8-3c68-11e8-a010-02d8407159d1 ("logging-fluentd-dn9xt_logging(db2852c8-3c68-11e8-a010-02d8407159d1)"), skipping: error killing pod: [failed to "KillContainer" for "fluentd-elasticsearch" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container 437efd7ceb2ea951a55f9bfc59525b214b3bfcc4e9113552e97cdae734a41030: Cannot kill container 437efd7ceb2ea951a55f9bfc59525b214b3bfcc4e9113552e97cdae734a41030: rpc error: code = 14 desc = grpc: the connection is unavailable"
This repeats every ~30 seconds. There are *many* messages like this in the log; I counted 48 affected pods:
egrep -o 'failed to "KillContainer" for "[^"]*"' /var/log/messages | sort | uniq
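The command above only lists the distinct container names. A hypothetical variant that counts affected pods by their UID instead, assuming the same /var/log/messages location:

# count distinct pod UIDs on the "error killing pod" lines shown above
grep 'error killing pod' /var/log/messages | egrep -o 'Error syncing pod [0-9a-f-]*' | sort -u | wc -l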
I was not able to find the point in the logs where it started:
- Apr 10 02:44:45 - node successfully mounted volumes for pod logging-fluentd-dn9xt
- Apr 12 03:30:51 - the first 'failed to "KillPodSandbox" for' logging-fluentd-dn9xt
There is no log entry for the pod in between, which is odd; notably, messages-20180411.gz does not contain any entries about the pod at all!
Anyway, I don't think storage is responsible for this bug; there seems to be something wrong between the kubelet and docker.
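Not captured in this report, but since "grpc: the connection is unavailable" is docker complaining about its own backend, a minimal set of standard checks on the affected node would confirm whether the daemon side (dockerd/containerd) is wedged; nothing cluster-specific is assumed here:

systemctl status docker
docker info            # will hang or error out if the daemon is unresponsive
ps -ef | grep -E 'dockerd|containerd' | grep -v grep   # is containerd still running under dockerd?
journalctl -u docker --since "1 hour ago" --no-pager | tail -n 50   # recent daemon-side errors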
I'm seeing this error too with OpenShift 3.9 (on RHEL 7.5), but with application pods instead of fluentd. There are no storage volumes mounted on the application, so this also points to a kubelet-to-docker issue.
Apr 14 23:23:52 nd1.cnx.tigera.io atomic-openshift-node[88869]: I0414 23:23:52.128444 88869 kuberuntime_container.go:581] Killing container "docker://416313dedcc5cc2c714d261dedcfcb150c308574d53d272808661e464bc7c106" with 30 second grace period
Apr 14 23:23:52 nd1.cnx.tigera.io atomic-openshift-node[88869]: E0414 23:23:52.398297 88869 docker_sandbox.go:233] Failed to stop sandbox "6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21": Error response from daemon: Cannot stop container 6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21: Cannot kill container 6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21: rpc error: code = 14 desc = grpc: the connection is unavailable
Apr 14 23:23:52 nd1.cnx.tigera.io atomic-openshift-node[88869]: E0414 23:23:52.572890 88869 remote_runtime.go:115] StopPodSandbox "6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21" from runtime service failed: rpc error: code = Unknown desc = Error response from daemon: Cannot stop container 6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21: Cannot kill container 6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21: rpc error: code = 14 desc = grpc: the connection is unavailable
Apr 14 23:23:52 nd1.cnx.tigera.io atomic-openshift-node[88869]: E0414 23:23:52.572936 88869 kuberuntime_manager.go:800] Failed to stop sandbox {"docker" "6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21"}
Apr 14 23:23:52 nd1.cnx.tigera.io atomic-openshift-node[88869]: E0414 23:23:52.573005 88869 kubelet.go:1522] error killing pod: [failed to "KillContainer" for "backend" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container f37f487d8593810fce154b46582d1c06da420bac360eb62e1b25a61954a7df28: Cannot kill container f37f487d8593810fce154b46582d1c06da420bac360eb62e1b25a61954a7df28: rpc error: code = 14 desc = grpc: the connection is unavailable"
Apr 14 23:23:52 nd1.cnx.tigera.io atomic-openshift-node[88869]: , failed to "KillPodSandbox" for "0675b2e5-4011-11e8-9200-06e00806d9c2" with KillPodSandboxError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container 6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21: Cannot kill container 6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21: rpc error: code = 14 desc = grpc: the connection is unavailable"
Apr 14 23:23:52 nd1.cnx.tigera.io atomic-openshift-node[88869]: ]
Apr 14 23:23:52 nd1.cnx.tigera.io atomic-openshift-node[88869]: E0414 23:23:52.573023 88869 pod_workers.go:186] Error syncing pod 0675b2e5-4011-11e8-9200-06e00806d9c2 ("backend-qgz9c_stars(0675b2e5-4011-11e8-9200-06e00806d9c2)"), skipping: error killing pod: [failed to "KillContainer" for "backend" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container f37f487d8593810fce154b46582d1c06da420bac360eb62e1b25a61954a7df28: Cannot kill container f37f487d8593810fce154b46582d1c06da420bac360eb62e1b25a61954a7df28: rpc error: code = 14 desc = grpc: the connection is unavailable"
Apr 14 23:23:52 nd1.cnx.tigera.io atomic-openshift-node[88869]: , failed to "KillPodSandbox" for "0675b2e5-4011-11e8-9200-06e00806d9c2" with KillPodSandboxError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container 6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21: Cannot kill container 6f1bfea85afebf2780095b01fea73b65983fc89324bbd51f8032e62becedba21: rpc error: code = 14 desc = grpc: the connection is unavailable"
I also see the following in the logs; not sure if it's related:
Apr 14 23:24:00 nd1.cnx.tigera.io atomic-openshift-node[88869]: E0414 23:24:00.126665 88869 remote_runtime.go:229] StopContainer "af17a95e87e8f1929726a7b7068b2a0299f2f610579e2beffc143e368e3e2b59" from runtime service failed: rpc error: code = Unknown desc = Error response from daemon: Cannot stop container af17a95e87e8f1929726a7b7068b2a0299f2f610579e2beffc143e368e3e2b59: Cannot kill container af17a95e87e8f1929726a7b7068b2a0299f2f610579e2beffc143e368e3e2b59: rpc error: code = 14 desc = grpc: the connection is unavailable
Apr 14 23:24:00 nd1.cnx.tigera.io atomic-openshift-node[88869]: E0414 23:24:00.126704 88869 kuberuntime_container.go:603] Container "docker://af17a95e87e8f1929726a7b7068b2a0299f2f610579e2beffc143e368e3e2b59" termination failed with gracePeriod 30: rpc error: code = Unknown desc = Error response from daemon: Cannot stop container af17a95e87e8f1929726a7b7068b2a0299f2f610579e2beffc143e368e3e2b59: Cannot kill container af17a95e87e8f1929726a7b7068b2a0299f2f610579e2beffc143e368e3e2b59: rpc error: code = 14 desc = grpc: the connection is unavailable
Apr 14 23:24:00 nd1.cnx.tigera.io atomic-openshift-node[88869]: E0414 23:24:00.150983 88869 event.go:200] Server rejected event '&v1.Event{TypeMeta:v1.TypeMeta{Kind:"", APIVersion:""}, ObjectMeta:v1.ObjectMeta{Name:"backend2-x86rz.15256f68e43c21ad", GenerateName:"", Namespace:"stars", SelfLink:"", UID:"", ResourceVersion:"", Generation:0, CreationTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, DeletionTimestamp:(*v1.Time)(nil), DeletionGracePeriodSeconds:(*int64)(nil), Labels:map[string]string(nil), Annotations:map[string]string(nil), OwnerReferences:[]v1.OwnerReference(nil), Initializers:(*v1.Initializers)(nil), Finalizers:[]string(nil), ClusterName:""}, InvolvedObject:v1.ObjectReference{Kind:"Pod", Namespace:"stars", Name:"backend2-x86rz", UID:"acf9db13-4011-11e8-9200-06e00806d9c2", APIVersion:"v1", ResourceVersion:"1849671", FieldPath:"spec.containers{backend2}"}, Reason:"Killing", Message:"Killing container with id docker://backend2:Need to kill Pod", Source:v1.EventSource{Component:"kubelet", Host:"nd1.cnx.tigera.io"}, FirstTimestamp:v1.Time{Time:time.Time{wall:0x0, ext:63659343345, loc:(*time.Location)(0xf584ba0)}}, LastTimestamp:v1.Time{Time:time.Time{wall:0xbeacc144078ddf96, ext:89034287503739, loc:(*time.Location)(0xf584ba0)}}, Count:92, Type:"Normal", EventTime:v1.MicroTime{Time:time.Time{wall:0x0, ext:0, loc:(*time.Location)(nil)}}, Series:(*v1.EventSeries)(nil), Action:"", Related:(*v1.ObjectReference)(nil), ReportingController:"", ReportingInstance:""}': 'events "backend2-x86rz.15256f68e43c21ad" is forbidden: unable to create new content in namespace stars because it is being terminated.' (will not retry!)
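That last event error only means the kubelet can't record events in the "stars" namespace because the namespace is already being deleted; a quick, hypothetical cross-check would be:

oc get namespace stars          # expect it to be stuck in Terminating
oc get pods -n stars -o wide    # pods that cannot be killed will keep the namespace from going away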
It seems like the kubelet is doing the right thing in trying to kill the container. Sending to the Containers component for investigation into the rather opaque "Cannot stop container" error we are getting from docker.

*** This bug has been marked as a duplicate of bug 1560428 ***