Bug 1560428 - pods are stuck in containercreating with error grpc: the connection is unavailable
Summary: pods are stuck in containercreating with error grpc: the connection is unavailable
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 3.9.0
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 3.10.z
Assignee: Urvashi Mohnani
QA Contact: Anping Li
URL:
Whiteboard:
Duplicates: 1566150 (view as bug list)
Depends On:
Blocks: 1186913 1629733 1643806
Reported: 2018-03-26 07:13 UTC by Anping Li
Modified: 2022-03-13 14:48 UTC
CC List: 35 users

Fixed In Version: docker-1.13.1-74.git6e3bb8e.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1629733 (view as bug list)
Environment:
Last Closed: 2018-07-30 19:11:31 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1601874 0 unspecified CLOSED While creating, deleting pvs in loop and running gluster volume heal in gluster pods 2 of the 4 gluster pods are in 0/1 ... 2021-02-22 00:41:40 UTC
Red Hat Bugzilla 1605158 0 unspecified CLOSED [Tracker] While creating, deleting pvs in loop and running gluster volume heal in gluster pods 2 of the 4 gluster pods a... 2021-02-22 00:41:40 UTC
Red Hat Knowledge Base (Solution) 3601001 0 None None None 2018-09-06 08:19:33 UTC
Red Hat Product Errata RHBA-2018:1816 0 None None None 2018-07-30 19:12:11 UTC

Internal Links: 1601874 1605158 1656963

Description Anping Li 2018-03-26 07:13:07 UTC
Description of problem:
The logging-es cluster status is Red on the free-stg environment; records couldn't be sent to ES.

Version-Release number of selected component (if applicable):
Logging-3.9.14

How reproducible:
always

Steps to Reproduce:
1. Check the ES cluster status

Actual results:
oc exec -c elasticsearch $cur_pod_name -- curl -s -XGET --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/${cur_action} 
{
  "cluster_name" : "logging-es",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 75,
  "active_shards" : 75,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 79,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 48.701298701298704
}

Actual Result:
The ES cluster status is red

Expected Result:
The ES cluster status is Green
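
For reference, a quick way to see which shards are unassigned and why (a sketch, reusing the same admin certs as the health check above; _cat/shards and its unassigned.reason column are standard Elasticsearch APIs):

oc exec -c elasticsearch $cur_pod_name -- curl -s --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key "https://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason"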

Comment 1 Jeff Cantrill 2018-04-19 12:39:58 UTC
Pods have been in 'terminating' state for a day or longer. Describing one of the pods shows:

 Warning  FailedKillPod  27m (x2358 over 22h)  kubelet, ip-172-31-75-47.us-east-2.compute.internal  error killing pod: [failed to "KillContainer" for "elasticsearch" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container 1d167db53ddfec21f7a037f1b8c7825d48f136e69fff9060e75428a24dca6270: Cannot kill container 1d167db53ddfec21f7a037f1b8c7825d48f136e69fff9060e75428a24dca6270: rpc error: code = 14 desc = grpc: the connection is unavailable"
, failed to "KillContainer" for "proxy" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container c7119f78ea9b9110061d42f3b8af741c5485675741ded70d7172419ddea5b76b: Cannot kill container c7119f78ea9b9110061d42f3b8af741c5485675741ded70d7172419ddea5b76b: rpc error: code = 14 desc = grpc: the connection is unavailable"
, failed to "KillPodSandbox" for "fec87b9a-2e43-11e8-8282-02306c0cdc4b" with KillPodSandboxError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container a207cbe5ec369a35ac8106e91197ef795cd45f52b502c407a7c679bcda5db53d: Cannot kill container a207cbe5ec369a35ac8106e91197ef795cd45f52b502c407a7c679bcda5db53d: rpc error: code = 14 desc = grpc: the connection is unavailable"


# oc version
oc v3.9.14
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://172.31.78.254:443
openshift v3.9.14
kubernetes v1.9.1+a0ce1bc657


What can I do from here?
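
For reference, one way to list the stuck pods and the nodes they are scheduled on before digging into the node itself (a sketch; the namespace and grep patterns are assumptions):

$ oc get pods -n logging -o wide | grep Terminating             # stuck pods plus the node they run on
$ oc describe pod <pod-name> -n logging | grep FailedKillPod    # the kill errors quoted above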

Comment 3 Seth Jennings 2018-04-19 17:43:37 UTC
$ oc get dc
NAME                              REVISION   DESIRED   CURRENT   TRIGGERED BY
logging-curator                   16         1         1         config
logging-es-data-master-dm4qr1as   15         0         0         
logging-es-data-master-qhgbgiw5   13         0         0         
logging-es-data-master-szybvfio   15         0         0         
logging-es-kopf                   0          1         0         config,image(logging-es-kopf:latest)
logging-kibana                    19         1         1         config

There are 3 logging-es-data-master DCs present

$ oc get pod -o wide
NAME                                       READY     STATUS        RESTARTS   AGE       IP            NODE
logging-es-data-master-dm4qr1as-13-7xhkt   1/2       Terminating   0          27d       10.131.2.6    ip-172-31-75-47.us-east-2.compute.internal
logging-es-data-master-qhgbgiw5-12-ds5l4   1/2       Terminating   0          27d       10.131.2.3    ip-172-31-75-47.us-east-2.compute.internal

It seems that two of the DCs each scheduled a pod to the same infra node.

On that node:

# docker ps | grep logging-elasticsearch
1d167db53ddf        registry.reg-aws.openshift.com:443/openshift3/logging-elasticsearch@sha256:d182af4499b5f0ff13707f3b677681d2484f15300164d95b91e6aab04bb57dd6   "sh /opt/app-root/..."   3 weeks ago         Up 3 weeks                              k8s_elasticsearch_logging-es-data-master-dm4qr1as-13-7xhkt_logging_fec87b9a-2e43-11e8-8282-02306c0cdc4b_0
9c055465f682        registry.reg-aws.openshift.com:443/openshift3/logging-elasticsearch@sha256:d182af4499b5f0ff13707f3b677681d2484f15300164d95b91e6aab04bb57dd6   "sh /opt/app-root/..."   3 weeks ago         Up 3 weeks                              k8s_elasticsearch_logging-es-data-master-qhgbgiw5-12-ds5l4_logging_6e158150-2e43-11e8-8282-02306c0cdc4b_0

# docker ps | grep oauth-proxy
c7119f78ea9b        registry.reg-aws.openshift.com:443/openshift3/oauth-proxy@sha256:514540e0d34c8c120545e19c3122da5106e45df609987b8b9092a1cabadd1b21             "/bin/oauth-proxy ..."   3 weeks ago         Up 3 weeks                              k8s_proxy_logging-es-data-master-dm4qr1as-13-7xhkt_logging_fec87b9a-2e43-11e8-8282-02306c0cdc4b_0
ed77c1546bf9        registry.reg-aws.openshift.com:443/openshift3/oauth-proxy@sha256:514540e0d34c8c120545e19c3122da5106e45df609987b8b9092a1cabadd1b21             "/bin/oauth-proxy ..."   3 weeks ago         Up 3 weeks                              k8s_proxy_logging-es-data-master-qhgbgiw5-12-ds5l4_logging_6e158150-2e43-11e8-8282-02306c0cdc4b_0

Both containers for the two pods are still running.

The "the connection is unavailable" seems to be an error from the docker daemon unable to connect to containerd.

From this moby issue https://github.com/moby/moby/issues/31849#issuecomment-320236354 :

"Unfortunately, the docker: Error response from daemon: grpc: the connection is unavailable. error is quite generic. Basically it means: the docker daemon (dockerd) was unable to make a gRPC connection with the containerd daemon (docker-containerd)"

If I do a "ps -fe | grep containerd" I see lots of docker-containerd-shim-current processes but not containerd itself.  It seems to have terminated somehow.

I think the Containers team should figure out how containerd could have died and why it doesn't come back automatically.
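
For reference, a sketch of that check, assuming the process names shipped by the RHEL docker-1.13.1 packages (docker-containerd-current for the daemon, docker-containerd-shim-current for the per-container shims):

# ps -fe | grep -c '[d]ocker-containerd-shim-current'   # one shim per running container; many expected
# ps -fe | grep '[d]ocker-containerd-current'           # the containerd daemon itself; no output here means it is gone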

Comment 4 Jeff Cantrill 2018-04-19 17:47:30 UTC
The same-infra-node scheduling issue should be resolved by https://bugzilla.redhat.com/show_bug.cgi?id=1564944

Comment 5 Seth Jennings 2018-04-19 19:16:19 UTC
*** Bug 1566150 has been marked as a duplicate of this bug. ***

Comment 7 Peter Portante 2018-04-21 00:38:50 UTC
This is another reason fluentd pods should not be run as daemonset containers, but as system containers or RHEL systemd services, see https://bugzilla.redhat.com/show_bug.cgi?id=1561510.

Comment 8 Urvashi Mohnani 2018-04-22 17:46:16 UTC
Can you please attach the containerd logs?

Comment 11 Urvashi Mohnani 2018-04-30 17:50:24 UTC
The setup doesn't have a separate unit/service for containerd. It looks like the docker-containerd process is started as a child process of docker, so there are no separate logs for containerd. Looking through the docker logs, I found the point where containerd terminated and tried to start up again, but failed to do so:

Mar 28 02:50:07 ip-172-31-75-47.us-east-2.compute.internal dockerd-current[23018]: time="2018-03-28T02:50:07.183654004Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"
Mar 28 02:50:07 ip-172-31-75-47.us-east-2.compute.internal dockerd-current[23018]: time="2018-03-28T02:50:07.187331604Z" level=error msg="Error running exec in container: rpc error: code = 13 desc = transport is closing"
Mar 28 02:50:07 ip-172-31-75-47.us-east-2.compute.internal dockerd-current[23018]: time="2018-03-28T02:50:07.192837541Z" level=info msg="libcontainerd: new containerd process, pid: 26160"
Mar 28 02:50:07 ip-172-31-75-47.us-east-2.compute.internal dockerd-current[23018]: time="2018-03-28T02:50:07.420057417Z" level=error msg="Error running exec in container: rpc error: code = 14 desc = grpc: the connection is unavailable"
Mar 28 02:50:07 ip-172-31-75-47.us-east-2.compute.internal dockerd-current[23018]: time="2018-03-28T02:50:07.985250984Z" level=info msg="libcontainerd: new containerd process, pid: 26172"
Mar 28 02:50:08 ip-172-31-75-47.us-east-2.compute.internal dockerd-current[23018]: time="2018-03-28T02:50:08.186364873Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 12 desc = unknown service types.API"

The "unknown service types.API" error went on forever till a Go panic occurred. After this one attempt to restart containerd, there was no other attempt to start containerd.

Due to the lack of containerd logs, there is no telling what caused containerd to terminate. We are currently working on changing the setup so that containerd runs as a separate service with a higher log level, to get more information.

For now, I restarted the docker service on the node where this error occurred, and containerd started up successfully. The pods that were stuck in Terminating state successfully terminated.

Keeping this bug open until we modify the setup, and in case this problem occurs again in the near future.
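
A sketch of the workaround described above (the namespace is from this report; run on the affected node):

# systemctl restart docker                      # restarts dockerd and its child containerd process
# ps -fe | grep '[d]ocker-containerd-current'   # confirm containerd is back
# oc get pods -n logging -o wide                # the stuck Terminating pods should now clear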

Comment 12 Karthik Prabhakar 2018-05-06 17:08:03 UTC
As mentioned in another Bugzilla comment (https://bugzilla.redhat.com/show_bug.cgi?id=1566150#c4), this issue is not limited to fluentd; it occurs with various other application pods as well.

Restarting docker fixes the problem (temporarily, until it occurs again), but does not address the root cause.

Comment 15 giriraj rajawat 2018-06-01 06:44:32 UTC
Is this going to be backported to 3.9?

Comment 16 Antonio Murdaca 2018-06-01 14:37:52 UTC
What's the docker version here? Please provide it.

Comment 17 Karthik Prabhakar 2018-06-01 17:51:52 UTC
I'm running docker-1.13.1

Comment 18 Karthik Prabhakar 2018-06-01 17:53:58 UTC
And is there a fix? Perhaps there's a private comment that's not visible to the public.

I have torn down that cluster, but I did see the problem recur on occasion (sometimes after a few weeks).

Comment 19 Antonio Murdaca 2018-06-04 07:48:06 UTC
(In reply to Karthik Prabhakar from comment #17)
> I'm running docker-1.13.1

please provide the output of "rpm -q docker"

I've pushed a potential fix in the containerd project which should address this.

Comment 22 rehan 2018-06-12 10:32:51 UTC
(In reply to Antonio Murdaca from comment #19)
> (In reply to Karthik Prabhakar from comment #17)
> > I'm running docker-1.13.1
> 
> please provide the output of "rpm -q docker"
> 
> I've pushed a potential fix in the containerd project which should address this.


#  rpm -q docker
docker-1.13.1-63.git94f4240.el7.x86_64

Comment 23 Anping Li 2018-06-13 08:12:33 UTC
@rehat, what is the root cause? Could you point me to the key steps to reproduce this issue?

Comment 24 Anping Li 2018-06-25 02:57:17 UTC
I couldn't reproduce it. According to our testing in free-int/stg/production, everything works well, so moving the bug to VERIFIED.

Comment 25 rehan 2018-06-26 10:30:08 UTC
(In reply to Anping Li from comment #23)
> @rehat, what is the root cause? Could you point me to the key steps to
> reproduce this issue?

Once this happens, the kubelet on the node is no longer able to manage pods. A node restart (eventually) fixes things, but this is obviously not a good workaround, since the problem recurs after a little while.

The cluster (or at least the affected nodes) is effectively unusable, which makes this quite annoying.

Comment 29 giriraj rajawat 2018-07-05 12:43:07 UTC
Priority raised to urgent

Below are some cases where this error occurs:

$ docker run containers.xxx.com/mukeshk2/cae-test:nginx-8
/usr/bin/docker-current: Error response from daemon: grpc: the connection is unavailable.

$ docker exec -it 851fd402154a sh
rpc error: code = 14 desc = grpc: the connection is unavailable

Do let us know what more information is required.
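
One way to confirm that such failures line up with the daemon losing its containerd connection (a sketch; assumes the docker systemd unit name):

# journalctl -u docker --since "1 hour ago" | grep -i 'connection is unavailable' | tail -5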

Comment 30 Urvashi Mohnani 2018-07-06 14:08:06 UTC
The logs don't give much helpful information, so we have added https://github.com/projectatomic/docker/commit/b702ba9aff6526466684373dbd5e334cd1dd9592, which should print more information in the logs if/when containerd goes down and fails to come back. We are still trying to figure out the root cause.
In the meantime, the temporary workaround is to restart the docker service whenever this error response comes from the daemon.

Comment 31 giriraj rajawat 2018-07-10 05:21:50 UTC
Urvashi Mohnani, do let us know if you need more information from the client side on this.

Comment 34 Urvashi Mohnani 2018-07-17 20:33:05 UTC
Daein, did the oom-killer kill containerd as well? Could you please paste the docker journal logs too?
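
A sketch of how those logs could be collected on the affected node (checking for oom-killer activity via the kernel ring buffer is an assumption about where such kills would show up):

# journalctl -u docker --no-pager > docker-journal.log              # full docker journal for attachment
# journalctl -k --no-pager | grep -i 'killed process' | tail -20    # any oom-killer victims, containerd included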

Comment 38 errata-xmlrpc 2018-07-30 19:11:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

Comment 46 Urvashi Mohnani 2018-08-16 19:57:36 UTC
To backport this to 3.9, use docker version docker-1.13.1-74.git6e3bb8e.el7 (https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=734404) to pick up the fix for this issue.

Comment 48 Urvashi Mohnani 2018-08-20 19:13:55 UTC
@Krishnanram docker-1.13.1-74.git6e3bb8e.el7 shipped 4 days ago, so just doing a "yum update docker" should get the fix in. It should be available for both OCP 3.10 and 3.9.
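
A sketch of picking up the fixed build on a node (version string taken from this BZ; the restart is only needed if the package update did not already restart the service):

# yum update docker
# rpm -q docker              # expect docker-1.13.1-74.git6e3bb8e.el7 or later
# systemctl restart docker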

Comment 51 hgomes 2018-09-17 15:22:07 UTC
Cloned this BZ, since it appears that this was not properly fixed in docker-1.13.1-74.git6e3bb8e.el7:

https://bugzilla.redhat.com/show_bug.cgi?id=1629733

Comment 54 Amana 2018-09-21 14:51:37 UTC
Urvashi,

By the way, a cloned BZ [1] was created with further details. I asked the customer for docker logs.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1629733

Comment 55 Giuliano Quites 2018-10-02 18:59:06 UTC
Hi, 

Does this bugzilla apply to version 3.9.41?

This error occurs only for new pods on the hosts.

Oct 02 13:43:02 hostname.br. dockerd-current[2644]: time="2018-10-02T13:43:02.911030794-03:00" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 12 desc = unknown service types.API"

OpenShift version: 3.9.41.

Comment 56 seferovic 2018-10-19 13:09:52 UTC
Hi, we are running v3.9.40 in production and are encountering this behavior...

Oct 19 15:07:46 p03 dockerd-current[62681]: time="2018-10-19T15:07:46.550361913+02:00" level=error msg="Error running exec in container: rpc error: code = 14 desc = grpc: the connection is unavailable"
Oct 19 15:07:46 p03 dockerd-current[62681]: time="2018-10-19T15:07:46.812958866+02:00" level=error msg="Error running exec in container: rpc error: code = 14 desc = grpc: the connection is unavailable"
Oct 19 15:07:47 p03 dockerd-current[62681]: time="2018-10-19T15:07:47.376688094+02:00" level=error msg="Error running exec in container: rpc error: code = 14 desc = grpc: the connection is unavailable"
Oct 19 15:07:47 p03 dockerd-current[62681]: time="2018-10-19T15:07:47.558795460+02:00" level=error msg="Error running exec in container: rpc error: code = 14 desc = grpc: the connection is unavailable"
Oct 19 15:07:47 v dockerd-current[62681]: time="2018-10-19T15:07:47.872017100+02:00" level=error msg="Error running exec in container: rpc error: code = 14 desc = grpc: the connection is unavailable"

rpm -qa docker
docker-1.13.1-68.gitdded712.el7.x86_64
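
For comparison, docker-1.13.1-68 predates the fixed build named in this BZ (docker-1.13.1-74.git6e3bb8e.el7). A quick way to check whether the newer build is available from the configured repos (a sketch):

# yum --showduplicates list available docker | tail -5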

