Description of problem:
The logging-es cluster status is Red on the Free stg env; records cannot be sent to ES.

Version-Release number of selected component (if applicable):
Logging-3.9.14

How reproducible:
Always

Steps to Reproduce:
1. Check the ES cluster status

Actual results:
oc exec -c elasticsearch $cur_pod_name -- curl -s -XGET --cacert /etc/elasticsearch/secret/admin-ca --cert /etc/elasticsearch/secret/admin-cert --key /etc/elasticsearch/secret/admin-key https://localhost:9200/${cur_action}
{
  "cluster_name" : "logging-es",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 75,
  "active_shards" : 75,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 79,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 48.701298701298704
}

The ES cluster status is red.

Expected results:
The ES cluster status is Green.
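For anyone triaging a red cluster like this, the unassigned shards can be listed with a _cat query against the same endpoint. This is a minimal sketch reusing the admin certs and pod variable from the command above; the unassigned.reason column may differ across ES versions:

oc exec -c elasticsearch $cur_pod_name -- curl -s -XGET \
  --cacert /etc/elasticsearch/secret/admin-ca \
  --cert /etc/elasticsearch/secret/admin-cert \
  --key /etc/elasticsearch/secret/admin-key \
  "https://localhost:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED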
Pods have been in 'Terminating' state for a day or longer. Describing one of the pods shows:

Warning  FailedKillPod  27m (x2358 over 22h)  kubelet, ip-172-31-75-47.us-east-2.compute.internal  error killing pod: [failed to "KillContainer" for "elasticsearch" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container 1d167db53ddfec21f7a037f1b8c7825d48f136e69fff9060e75428a24dca6270: Cannot kill container 1d167db53ddfec21f7a037f1b8c7825d48f136e69fff9060e75428a24dca6270: rpc error: code = 14 desc = grpc: the connection is unavailable",
failed to "KillContainer" for "proxy" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container c7119f78ea9b9110061d42f3b8af741c5485675741ded70d7172419ddea5b76b: Cannot kill container c7119f78ea9b9110061d42f3b8af741c5485675741ded70d7172419ddea5b76b: rpc error: code = 14 desc = grpc: the connection is unavailable",
failed to "KillPodSandbox" for "fec87b9a-2e43-11e8-8282-02306c0cdc4b" with KillPodSandboxError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container a207cbe5ec369a35ac8106e91197ef795cd45f52b502c407a7c679bcda5db53d: Cannot kill container a207cbe5ec369a35ac8106e91197ef795cd45f52b502c407a7c679bcda5db53d: rpc error: code = 14 desc = grpc: the connection is unavailable"]

# oc version
oc v3.9.14
kubernetes v1.9.1+a0ce1bc657
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://172.31.78.254:443
openshift v3.9.14
kubernetes v1.9.1+a0ce1bc657

What can I do from here?
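One blunt option while the node is being investigated — a workaround, not a fix — is to remove the stuck pod records from the API server. Note the pod name below is a placeholder, and the underlying containers keep running on the node until docker/containerd recovers:

$ oc delete pod <stuck-pod-name> -n logging --grace-period=0 --force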
$ oc get dc
NAME                              REVISION   DESIRED   CURRENT   TRIGGERED BY
logging-curator                   16         1         1         config
logging-es-data-master-dm4qr1as   15         0         0
logging-es-data-master-qhgbgiw5   13         0         0
logging-es-data-master-szybvfio   15         0         0
logging-es-kopf                   0          1         0         config,image(logging-es-kopf:latest)
logging-kibana                    19         1         1         config

There are 3 logging-es-data-master DCs present.

$ oc get pod -o wide
NAME                                       READY     STATUS        RESTARTS   AGE       IP           NODE
logging-es-data-master-dm4qr1as-13-7xhkt   1/2       Terminating   0          27d       10.131.2.6   ip-172-31-75-47.us-east-2.compute.internal
logging-es-data-master-qhgbgiw5-12-ds5l4   1/2       Terminating   0          27d       10.131.2.3   ip-172-31-75-47.us-east-2.compute.internal

It seems that two of the DCs each scheduled a pod to the same infra node.

On that node:

# docker ps | grep logging-elasticsearch
1d167db53ddf  registry.reg-aws.openshift.com:443/openshift3/logging-elasticsearch@sha256:d182af4499b5f0ff13707f3b677681d2484f15300164d95b91e6aab04bb57dd6  "sh /opt/app-root/..."  3 weeks ago  Up 3 weeks  k8s_elasticsearch_logging-es-data-master-dm4qr1as-13-7xhkt_logging_fec87b9a-2e43-11e8-8282-02306c0cdc4b_0
9c055465f682  registry.reg-aws.openshift.com:443/openshift3/logging-elasticsearch@sha256:d182af4499b5f0ff13707f3b677681d2484f15300164d95b91e6aab04bb57dd6  "sh /opt/app-root/..."  3 weeks ago  Up 3 weeks  k8s_elasticsearch_logging-es-data-master-qhgbgiw5-12-ds5l4_logging_6e158150-2e43-11e8-8282-02306c0cdc4b_0

# docker ps | grep oauth-proxy
c7119f78ea9b  registry.reg-aws.openshift.com:443/openshift3/oauth-proxy@sha256:514540e0d34c8c120545e19c3122da5106e45df609987b8b9092a1cabadd1b21  "/bin/oauth-proxy ..."  3 weeks ago  Up 3 weeks  k8s_proxy_logging-es-data-master-dm4qr1as-13-7xhkt_logging_fec87b9a-2e43-11e8-8282-02306c0cdc4b_0
ed77c1546bf9  registry.reg-aws.openshift.com:443/openshift3/oauth-proxy@sha256:514540e0d34c8c120545e19c3122da5106e45df609987b8b9092a1cabadd1b21  "/bin/oauth-proxy ..."  3 weeks ago  Up 3 weeks  k8s_proxy_logging-es-data-master-qhgbgiw5-12-ds5l4_logging_6e158150-2e43-11e8-8282-02306c0cdc4b_0

Both containers for each of the two pods are still running.

The "the connection is unavailable" message appears to be an error from the docker daemon being unable to connect to containerd. From this moby issue https://github.com/moby/moby/issues/31849#issuecomment-320236354:

"Unfortunately, the docker: Error response from daemon: grpc: the connection is unavailable. error is quite generic. Basically it means: the docker daemon (dockerd) was unable to make a gRPC connection with the containerd daemon (docker-containerd)"

If I do a "ps -fe | grep containerd" I see lots of docker-containerd-shim-current processes but not containerd itself. It seems to have terminated somehow. I think the Containers team should figure out how containerd died and why it doesn't come back automatically.
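For anyone else hitting this, the state described above can be confirmed on the affected node with something like the following (a sketch: the containerd parent is gone, the per-container shims survive, and dockerd itself still reports as running):

# ps -fe | grep -v grep | grep containerd | grep -v shim   # no output: the containerd parent process is gone
# pgrep -c -f docker-containerd-shim                       # many per-container shim processes remain
# systemctl status docker                                  # dockerd itself still shows active (running)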
The scheduling of multiple ES pods onto the same infra node should be resolved by https://bugzilla.redhat.com/show_bug.cgi?id=1564944
*** Bug 1566150 has been marked as a duplicate of this bug. ***
This is another reason fluentd pods should not be run as daemonset containers, but as system containers or RHEL systemd services; see https://bugzilla.redhat.com/show_bug.cgi?id=1561510.
Can you please attach the containerd logs?
The setup doesn't have a separate unit/service for containerd. It looks like the docker-containerd process is started as a child process by docker, so there are no separate logs for containerd. Looking through the docker logs, I found the point where containerd terminated and tried to start up again, but failed to do so:

Mar 28 02:50:07 ip-172-31-75-47.us-east-2.compute.internal dockerd-current[23018]: time="2018-03-28T02:50:07.183654004Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 13 desc = transport is closing"
Mar 28 02:50:07 ip-172-31-75-47.us-east-2.compute.internal dockerd-current[23018]: time="2018-03-28T02:50:07.187331604Z" level=error msg="Error running exec in container: rpc error: code = 13 desc = transport is closing"
Mar 28 02:50:07 ip-172-31-75-47.us-east-2.compute.internal dockerd-current[23018]: time="2018-03-28T02:50:07.192837541Z" level=info msg="libcontainerd: new containerd process, pid: 26160"
Mar 28 02:50:07 ip-172-31-75-47.us-east-2.compute.internal dockerd-current[23018]: time="2018-03-28T02:50:07.420057417Z" level=error msg="Error running exec in container: rpc error: code = 14 desc = grpc: the connection is unavailable"
Mar 28 02:50:07 ip-172-31-75-47.us-east-2.compute.internal dockerd-current[23018]: time="2018-03-28T02:50:07.985250984Z" level=info msg="libcontainerd: new containerd process, pid: 26172"
Mar 28 02:50:08 ip-172-31-75-47.us-east-2.compute.internal dockerd-current[23018]: time="2018-03-28T02:50:08.186364873Z" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 12 desc = unknown service types.API"

The "unknown service types.API" error repeated until a Go panic occurred. After this one attempt to restart containerd, there was no further attempt to start it. Due to the lack of containerd logs, there is no telling what caused containerd to terminate.

We are currently working on changing the setup so that containerd is a separate service with a higher log level for more information. For now, I restarted the docker service on the node where this error occurred, and containerd started up successfully. The pods that were stuck in the Terminating state terminated successfully. Keeping this bug open until we modify the setup, and in case this problem occurs again in the near future.
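For reference, the workaround applied here amounts to the following on the affected node. Be aware that restarting docker bounces the containers on that node, so schedule it accordingly:

# systemctl restart docker
# ps -fe | grep -v grep | grep containerd | grep -v shim        # the containerd parent should be back
# journalctl -u docker --since "5 minutes ago" | grep -i grpc   # should show no new "connection is unavailable" errors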
As mentioned in another bugzilla comment (https://bugzilla.redhat.com/show_bug.cgi?id=1566150#c4) this issue is not limited to fluentd, but occurs on various other application pods. Restarting docker fixes the problem (temporarily, until it occurs again), but does not address the root cause.
Is this going to be backported to 3.9?
What's the docker version here? Please provide it.
I'm running docker-1.13.1
And is there a fix? Perhaps there's a private comment that's not publicly visible. I have torn down that cluster, but did see the problem recur on occasion (sometimes after a few weeks).
(In reply to Karthik Prabhakar from comment #17)
> I'm running docker-1.13.1

please provide the output of "rpm -q docker"

I've pushed a potential fix in the containerd project which should fix this.
(In reply to Antonio Murdaca from comment #19)
> (In reply to Karthik Prabhakar from comment #17)
> > I'm running docker-1.13.1
>
> please provide the output of "rpm -q docker"
>
> I've pushed a potential fix in the containerd project which should fix this.

# rpm -q docker
docker-1.13.1-63.git94f4240.el7.x86_64
@rehat, what is the root cause? Could you point me to the key steps to reproduce this issue?
I couldn't reproduce it. According to our testing in free-int/stg/production, everything works well, so moving the bug to verified.
(In reply to Anping Li from comment #23)
> @rehat, what is the root cause? Could you point me to the key steps to
> reproduce this issue?

Once this happens, the kubelet on the node is no longer able to manage pods. A node restart (eventually) fixes things, but this is obviously not a good workaround, since the problem repeats again after a little while. The cluster (at least the affected nodes) is effectively unusable, which makes this quite annoying.
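When a node gets into this state, a possible recovery sequence is sketched below. This is only a suggestion: eviction of the already-stuck pods may itself hang and need a forced delete, and rebooting the node remains the fallback if restarting docker does not bring containerd back:

$ oc adm drain ip-172-31-75-47.us-east-2.compute.internal --ignore-daemonsets --delete-local-data
(on the node) # systemctl restart docker   # or reboot the node if this does not recover containerd
$ oc adm uncordon ip-172-31-75-47.us-east-2.compute.internal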
Priority raised to urgent.

Below are some use cases wherein this error occurs:

$ docker run containers.xxx.com/mukeshk2/cae-test:nginx-8
/usr/bin/docker-current: Error response from daemon: grpc: the connection is unavailable.

$ docker exec -it 851fd402154a sh
rpc error: code = 14 desc = grpc: the connection is unavailable

Do let us know what more information is required.
The logs don't give much helpful information, so we have added https://github.com/projectatomic/docker/commit/b702ba9aff6526466684373dbd5e334cd1dd9592, which should print more information in the logs if/when containerd goes down and fails to come back; we are still trying to figure out the root cause. In the meantime, the temporary workaround is restarting the docker service whenever there is an error response from the daemon.
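Until the root cause is found, the relevant messages can be watched for with something along these lines (the grep patterns are just the error strings already quoted in this bug):

# journalctl -u docker -f | grep -E "containerd|grpc: the connection is unavailable|unknown service types.API"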
Urvashi Mohnani, do let us know if you need more information from the client side on this.
Daein, did the oom-killer kill containerd as well? Could you paste the docker journal logs as well, please?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816
To backport this to 3.9, use docker-1.13.1-74.git6e3bb8e.el7 (https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=734404) on 3.9 to pick up the fix for this issue.
@Krishnanram docker-1.13.1-74.git6e3bb8e.el7 shipped 4 days ago, so a "yum update docker" should pull in the fix. It should be available for both OCP 3.10 and 3.9.
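To confirm a node actually picked up the fixed build, something like this should do (sketch):

# yum update -y docker
# rpm -q docker          # should now report docker-1.13.1-74.git6e3bb8e.el7 or newer
# systemctl restart docker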
Cloned this BZ, since it appears that this was not properly fixed in docker-1.13.1-74.git6e3bb8e.el7: https://bugzilla.redhat.com/show_bug.cgi?id=1629733
Urvashi, BTW a cloned BZ [1] was created with further details. I asked the customer for docker logs.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1629733
Hi, does this bugzilla apply to version 3.9.41? We see this error only for new pods on the hosts:

Oct 02 13:43:02 hostname.br. dockerd-current[2644]: time="2018-10-02T13:43:02.911030794-03:00" level=error msg="libcontainerd: failed to receive event from containerd: rpc error: code = 12 desc = unknown service types.API"

OpenShift version: 3.9.41
Hi, we are running v3.9.40 in production and are encountering this behavior:

Oct 19 15:07:46 p03 dockerd-current[62681]: time="2018-10-19T15:07:46.550361913+02:00" level=error msg="Error running exec in container: rpc error: code = 14 desc = grpc: the connection is unavailable"
Oct 19 15:07:46 p03 dockerd-current[62681]: time="2018-10-19T15:07:46.812958866+02:00" level=error msg="Error running exec in container: rpc error: code = 14 desc = grpc: the connection is unavailable"
Oct 19 15:07:47 p03 dockerd-current[62681]: time="2018-10-19T15:07:47.376688094+02:00" level=error msg="Error running exec in container: rpc error: code = 14 desc = grpc: the connection is unavailable"
Oct 19 15:07:47 p03 dockerd-current[62681]: time="2018-10-19T15:07:47.558795460+02:00" level=error msg="Error running exec in container: rpc error: code = 14 desc = grpc: the connection is unavailable"
Oct 19 15:07:47 v dockerd-current[62681]: time="2018-10-19T15:07:47.872017100+02:00" level=error msg="Error running exec in container: rpc error: code = 14 desc = grpc: the connection is unavailable"

$ rpm -qa docker
docker-1.13.1-68.gitdded712.el7.x86_64