A fluentd pod on starter-us-west-1 was stuck for almost two days, unable to send any logs to Elasticsearch. The pod was marked as healthy and in a Running state, but the fluentd on-disk buffer queues were full and a day old (see below). oc describe showed the following warning on the pod:

Warning NetworkNotReady 58m (x3 over 58m) kubelet, ip-172-31-20-183.us-west-1.compute.internal network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized]

The full output of oc describe is below. It looks like the networking problem should have caused the pod to be killed and recreated. Once I killed the pod and it was recreated, fluentd resumed sending logs from that node.

----

[root@starter-us-west-1-master-2514b ~]# oc rsh pods/logging-fluentd-smg4n
sh-4.2# ls -ltrh /var/lib/fluentd/
total 264M
-rw-r--r--. 1 root root 8.0M Mar 22 12:38 buffer-output-es-config.output_tag.q567ff8e24c2da87b.log
-rw-r--r--. 1 root root 8.0M Mar 22 12:38 buffer-output-es-config.output_tag.q567ff94dd7c1e7ed.log
-rw-r--r--. 1 root root 8.0M Mar 22 12:38 buffer-output-es-config.output_tag.q567ff954c1f2346a.log
-rw-r--r--. 1 root root 8.0M Mar 22 12:48 buffer-output-es-config.output_tag.q567ff95bb7dbce07.log
-rw-r--r--. 1 root root 8.0M Mar 22 13:13 buffer-output-es-config.output_tag.q567ffb9af8126979.log
-rw-r--r--. 1 root root 8.0M Mar 22 13:37 buffer-output-es-config.output_tag.q568001373d8eb187.log
-rw-r--r--. 1 root root 8.0M Mar 22 13:58 buffer-output-es-config.output_tag.q5680068c30a1df20.log
-rw-r--r--. 1 root root 8.0M Mar 22 14:03 buffer-output-es-config.output_tag.q56800b268629ce19.log
-rw-r--r--. 1 root root 8.0M Mar 22 14:05 buffer-output-es-config.output_tag.q56800c64c4e3dc53.log
-rw-r--r--. 1 root root 8.0M Mar 22 14:05 buffer-output-es-config.output_tag.q56800cc8dc88864e.log
-rw-r--r--. 1 root root 8.0M Mar 22 14:05 buffer-output-es-config.output_tag.q56800ccff2d3f1fd.log
-rw-r--r--. 1 root root 8.0M Mar 22 14:08 buffer-output-es-config.output_tag.q56800cd798c5c88a.log
-rw-r--r--. 1 root root 8.0M Mar 22 14:33 buffer-output-es-config.output_tag.q56800d91bbd9978b.log
-rw-r--r--. 1 root root 8.0M Mar 22 14:57 buffer-output-es-config.output_tag.q5680130d5f5f184e.log
-rw-r--r--. 1 root root 8.0M Mar 22 15:19 buffer-output-es-config.output_tag.q56801875c2302b74.log
-rw-r--r--. 1 root root 8.0M Mar 22 15:30 buffer-output-es-config.output_tag.q56801d63148c21d4.log
-rw-r--r--. 1 root root 8.0M Mar 22 15:32 buffer-output-es-config.output_tag.q56801fd70e4df957.log
-rw-r--r--. 1 root root 8.0M Mar 22 15:32 buffer-output-es-config.output_tag.q5680204d73a7619f.log
-rw-r--r--. 1 root root 8.0M Mar 22 15:32 buffer-output-es-config.output_tag.q568020557c199319.log
-rw-r--r--. 1 root root 8.0M Mar 22 15:33 buffer-output-es-config.output_tag.q5680205c03e5bff9.log
-rw-r--r--. 1 root root 8.0M Mar 22 15:55 buffer-output-es-config.output_tag.q56802062b6be19bd.log
-rw-r--r--. 1 root root 8.0M Mar 22 16:19 buffer-output-es-config.output_tag.q568025479eaad120.log
-rw-r--r--. 1 root root 8.0M Mar 22 16:42 buffer-output-es-config.output_tag.q56802ab8c7ac5b5f.log
-rw-r--r--. 1 root root 8.0M Mar 22 16:57 buffer-output-es-config.output_tag.q56802fd041109de2.log
-rw-r--r--. 1 root root 8.0M Mar 22 16:59 buffer-output-es-config.output_tag.q5680334e4c6b910e.log
-rw-r--r--. 1 root root 8.0M Mar 22 16:59 buffer-output-es-config.output_tag.q568033b958f892ec.log
-rw-r--r--. 1 root root 8.0M Mar 22 16:59 buffer-output-es-config.output_tag.q568033c613d699db.log
-rw-r--r--. 1 root root 8.0M Mar 22 17:00 buffer-output-es-config.output_tag.q568033cccd186bf3.log
-rw-r--r--. 1 root root 8.0M Mar 22 17:16 buffer-output-es-config.output_tag.q568033d4983ff734.log
-rw-r--r--. 1 root root 8.0M Mar 22 17:41 buffer-output-es-config.output_tag.q56803764dbcf542f.log
-rw-r--r--. 1 root root 8.0M Mar 22 18:05 buffer-output-es-config.output_tag.q56803cf983f21f4f.log
-rw-r--r--. 1 root root 8.0M Mar 22 18:26 buffer-output-es-config.output_tag.q5680427561824f30.log
-rw-r--r--. 1 root root 8.0M Mar 22 18:27 buffer-output-es-config.output_tag.b5680471e87e39b9a.log
sh-4.2# ps auxww
USER       PID %CPU %MEM     VSZ    RSS TTY   STAT START TIME  COMMAND
root         1  0.8  0.2 7413820 178516 ?     Ssl  Mar22 25:00 /usr/bin/ruby /usr/bin/fluentd --no-supervisor
root     13486  0.0  0.0   11772   1696 ?     Ss   00:43  0:00 /bin/sh
root     13493  0.0  0.0   47448   1664 ?     R+   00:43  0:00 ps auxww
sh-4.2# kill -TERM 1
sh-4.2# ps auxww
USER       PID %CPU %MEM     VSZ    RSS TTY   STAT START TIME  COMMAND
root         1  0.8  0.2 7424068 178516 ?     Ssl  Mar22 25:00 /usr/bin/ruby /usr/bin/fluentd --no-supervisor
root     13486  0.0  0.0   11772   1760 ?     Ss   00:43  0:00 /bin/sh
root     13502  0.0  0.0   47448   1664 ?     R+   00:43  0:00 ps auxww
sh-4.2# exit
exit
command terminated with exit code 127

[root@starter-us-west-1-master-2514b ~]# oc describe pods/logging-fluentd-smg4n
Name:           logging-fluentd-smg4n
Namespace:      logging
Node:           ip-172-31-20-183.us-west-1.compute.internal/172.31.20.183
Start Time:     Thu, 22 Mar 2018 00:29:07 +0000
Labels:         component=fluentd
                controller-revision-hash=248860309
                logging-infra=fluentd
                pod-template-generation=5
                provider=openshift
Annotations:    openshift.io/scc=privileged
Status:         Running
IP:             10.128.4.15
Controlled By:  DaemonSet/logging-fluentd
Containers:
  fluentd-elasticsearch:
    Container ID:  docker://f6330f4793b25e08e6ef597dd3b82f1bd521432d4735aa5af2b558b488d8842d
    Image:         registry.reg-aws.openshift.com:443/openshift3/logging-fluentd:v3.9.7
    Image ID:      docker-pullable://registry.reg-aws.openshift.com:443/openshift3/logging-fluentd@sha256:2387ab82fb5f670de4062c372ff857fff127829c3dbebafe360f950981b268be
    Port:          <none>
    State:         Running
      Started:     Thu, 22 Mar 2018 00:29:14 +0000
    Ready:         True
    Restart Count: 0
    Limits:
      memory:  512Mi
    Requests:
      cpu:     100m
      memory:  512Mi
    Environment:
      K8S_HOST_URL:           https://kubernetes.default.svc.cluster.local
      ES_HOST:                logging-es
      ES_PORT:                9200
      ES_CLIENT_CERT:         /etc/fluent/keys/cert
      ES_CLIENT_KEY:          /etc/fluent/keys/key
      ES_CA:                  /etc/fluent/keys/ca
      OPS_HOST:               logging-es
      OPS_PORT:               9200
      OPS_CLIENT_CERT:        /etc/fluent/keys/cert
      OPS_CLIENT_KEY:         /etc/fluent/keys/key
      OPS_CA:                 /etc/fluent/keys/ca
      JOURNAL_SOURCE:
      JOURNAL_READ_FROM_HEAD:
      BUFFER_QUEUE_LIMIT:     32
      BUFFER_SIZE_LIMIT:      8m
      FLUENTD_CPU_LIMIT:      node allocatable (limits.cpu)
      FLUENTD_MEMORY_LIMIT:   536870912 (limits.memory)
      FILE_BUFFER_LIMIT:      256Mi
    Mounts:
      /etc/docker from dockerdaemoncfg (ro)
      /etc/docker-hostname from dockerhostname (ro)
      /etc/fluent/configs.d/user from config (ro)
      /etc/fluent/keys from certs (ro)
      /etc/localtime from localtime (ro)
      /etc/origin/node from originnodecfg (ro)
      /etc/sysconfig/docker from dockercfg (ro)
      /run/log/journal from runlogjournal (rw)
      /var/lib/docker/containers from varlibdockercontainers (ro)
      /var/lib/fluentd from filebufferstorage (rw)
      /var/log from varlog (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from aggregated-logging-fluentd-token-2lmll (ro)
Conditions:
  Type          Status
  Initialized   True
  Ready         True
  PodScheduled  True
Volumes:
  runlogjournal:
    Type:          HostPath (bare host directory volume)
    Path:          /run/log/journal
    HostPathType:
  varlog:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:
  varlibdockercontainers:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/docker/containers
    HostPathType:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      logging-fluentd
    Optional:  false
  certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  logging-fluentd
    Optional:    false
  dockerhostname:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/hostname
    HostPathType:
  localtime:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/localtime
    HostPathType:
  dockercfg:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/sysconfig/docker
    HostPathType:
  originnodecfg:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/origin/node
    HostPathType:
  dockerdaemoncfg:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/docker
    HostPathType:
  filebufferstorage:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/fluentd
    HostPathType:
  aggregated-logging-fluentd-token-2lmll:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  aggregated-logging-fluentd-token-2lmll
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  logging-infra-fluentd=true
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason                 Age                From                                                  Message
  ----     ------                 ----               ----                                                  -------
  Normal   SuccessfulMountVolume  58m                kubelet, ip-172-31-20-183.us-west-1.compute.internal  MountVolume.SetUp succeeded for volume "originnodecfg"
  Normal   SuccessfulMountVolume  58m                kubelet, ip-172-31-20-183.us-west-1.compute.internal  MountVolume.SetUp succeeded for volume "filebufferstorage"
  Normal   SuccessfulMountVolume  58m                kubelet, ip-172-31-20-183.us-west-1.compute.internal  MountVolume.SetUp succeeded for volume "varlibdockercontainers"
  Normal   SuccessfulMountVolume  58m                kubelet, ip-172-31-20-183.us-west-1.compute.internal  MountVolume.SetUp succeeded for volume "dockercfg"
  Normal   SuccessfulMountVolume  58m                kubelet, ip-172-31-20-183.us-west-1.compute.internal  MountVolume.SetUp succeeded for volume "varlog"
  Normal   SuccessfulMountVolume  58m                kubelet, ip-172-31-20-183.us-west-1.compute.internal  MountVolume.SetUp succeeded for volume "runlogjournal"
  Normal   SuccessfulMountVolume  58m                kubelet, ip-172-31-20-183.us-west-1.compute.internal  MountVolume.SetUp succeeded for volume "dockerdaemoncfg"
  Normal   SuccessfulMountVolume  58m                kubelet, ip-172-31-20-183.us-west-1.compute.internal  MountVolume.SetUp succeeded for volume "localtime"
  Normal   SuccessfulMountVolume  58m                kubelet, ip-172-31-20-183.us-west-1.compute.internal  MountVolume.SetUp succeeded for volume "dockerhostname"
  Warning  NetworkNotReady        58m (x3 over 58m)  kubelet, ip-172-31-20-183.us-west-1.compute.internal  network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized]
  Normal   SuccessfulMountVolume  58m (x3 over 58m)  kubelet, ip-172-31-20-183.us-west-1.compute.internal  (combined from similar events): MountVolume.SetUp succeeded for volume "certs"
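For what it's worth, sending TERM to PID 1 from inside the container (see the rsh session above) did nothing; only deleting the pod worked. A rough sketch of a workaround check, assuming the logging namespace and component=fluentd label shown in the describe output (the 60-minute staleness threshold is an arbitrary choice of mine):

#!/bin/sh
# Flag fluentd pods that have buffer chunks on disk but haven't written any
# recently, then delete them so the DaemonSet controller recreates them.
# Assumes find(1) is available in the fluentd image.
for pod in $(oc get pods -n logging -l component=fluentd -o name); do
  name=${pod#*/}
  total=$(oc exec -n logging "$name" -- find /var/lib/fluentd -name 'buffer-*' 2>/dev/null | wc -l)
  recent=$(oc exec -n logging "$name" -- find /var/lib/fluentd -name 'buffer-*' -mmin -60 2>/dev/null | wc -l)
  # Stale = chunks exist but none were modified in the last 60 minutes.
  if [ "$total" -gt 0 ] && [ "$recent" -eq 0 ]; then
    echo "$name: buffer queue appears stuck, recreating"
    oc delete pod -n logging "$name"
  fi
done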
@ben It would be good to know what else we should have pulled from this node at the time to help debug this, since I'm guessing the issue is no longer happening.
When was that oc describe executed? Had the pod only been up for 58m? Those look like startup messages. The "network is not ready" warning is expected while a node is starting... but these pods carry not-ready/unreachable tolerations, so I wonder whether the pod can be started before the networking is ready; if that happens, it will never get networking, since the CNI hooks only run while the pod is being set up. (Investigating that.)
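A sketch of how to check that timing theory next time, assuming the pod and node names from the describe output above:

# Pod start time, straight from the pod status.
oc get pod logging-fluentd-smg4n -n logging -o jsonpath='{.status.startTime}{"\n"}'
# Node events land in the default namespace; grep is the lowest-common-denominator
# filter (event field selectors may not be available on this oc client version).
oc get events -n default | grep ip-172-31-20-183.us-west-1.compute.internal

If the NetworkNotReady events postdate the pod's startTime, that would support the "pod started before CNI was ready" theory.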
Can I get the full configuration for the fluentd top-level object, please? Either the deployment config or the daemonset for it.
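For whoever pulls it, something like this should capture everything; the object names are taken from the describe output above (it shows Controlled By: DaemonSet/logging-fluentd and a logging-fluentd ConfigMap):

# Dump the daemonset plus the fluentd config it mounts.
oc get daemonset logging-fluentd -n logging -o yaml > logging-fluentd-ds.yaml
oc get configmap logging-fluentd -n logging -o yaml > logging-fluentd-cm.yaml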
Another case where the pod has lost its network, but on a different starter cluster:

[root@starter-ca-central-1-master-692e9 ~]# oc logs logging-fluentd-dn9xt
. . .
2018-04-11 11:51:43 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-11 13:27:43 +0000 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\"})!" plugin_id="object:3fd0bd039058"
2018-04-11 11:51:43 +0000 [warn]: suppressed same stacktrace
2018-04-11 11:51:57 +0000 [warn]: emit transaction failed: error_class=SocketError error="getaddrinfo: Name or service not known" location="/usr/share/ruby/net/http.rb:878:in `initialize'" tag="kubernetes.var.log.containers.logging-fluentd-dn9xt_logging_fluentd-elasticsearch-437efd7ceb2ea951a55f9bfc59525b214b3bfcc4e9113552e97cdae734a41030.log"
2018-04-11 11:51:57 +0000 [warn]: suppressed same stacktrace
2018-04-11 11:52:17 +0000 [warn]: emit transaction failed: error_class=SocketError error="getaddrinfo: Name or service not known" location="/usr/share/ruby/net/http.rb:878:in `initialize'" tag="kubernetes.var.log.containers.logging-fluentd-dn9xt_logging_fluentd-elasticsearch-437efd7ceb2ea951a55f9bfc59525b214b3bfcc4e9113552e97cdae734a41030.log"
2018-04-11 11:52:17 +0000 [warn]: suppressed same stacktrace
2018-04-11 11:52:37 +0000 [warn]: emit transaction failed: error_class=SocketError error="getaddrinfo: Name or service not known" location="/usr/share/ruby/net/http.rb:878:in `initialize'" tag="kubernetes.var.log.containers.jenkins-5-hbnm5_k8s-jenkins_jenkins-064ad99fdc66162528c4cb750bfafb3cd6a6d1d067ce686e4cc24b51dcfb3245.log"
2018-04-11 11:52:37 +0000 [warn]: suppressed same stacktrace

[root@starter-ca-central-1-master-692e9 ~]# oc describe pod logging-fluentd-dn9xt
Name:           logging-fluentd-dn9xt
Namespace:      logging
Node:           ip-172-31-30-246.ca-central-1.compute.internal/172.31.30.246
Start Time:     Tue, 10 Apr 2018 02:44:05 +0000
Labels:         component=fluentd
                controller-revision-hash=248860309
                logging-infra=fluentd
                pod-template-generation=19
                provider=openshift
Annotations:    openshift.io/scc=privileged
Status:         Terminating (expires Wed, 11 Apr 2018 04:23:24 +0000)
Termination Grace Period:  30s
IP:             10.130.39.84
Controlled By:  DaemonSet/logging-fluentd
Containers:
  fluentd-elasticsearch:
    Container ID:  docker://437efd7ceb2ea951a55f9bfc59525b214b3bfcc4e9113552e97cdae734a41030
    Image:         registry.reg-aws.openshift.com:443/openshift3/logging-fluentd:v3.9.7
    Image ID:      docker-pullable://registry.reg-aws.openshift.com:443/openshift3/logging-fluentd@sha256:2387ab82fb5f670de4062c372ff857fff127829c3dbebafe360f950981b268be
    Port:          <none>
    State:         Running
      Started:     Tue, 10 Apr 2018 02:44:58 +0000
    Ready:         True
    Restart Count: 0
    Limits:
      memory:  512Mi
    Requests:
      cpu:     100m
      memory:  512Mi
    Environment:
      K8S_HOST_URL:           https://kubernetes.default.svc.cluster.local
      ES_HOST:                logging-es
      ES_PORT:                9200
      ES_CLIENT_CERT:         /etc/fluent/keys/cert
      ES_CLIENT_KEY:          /etc/fluent/keys/key
      ES_CA:                  /etc/fluent/keys/ca
      OPS_HOST:               logging-es
      OPS_PORT:               9200
      OPS_CLIENT_CERT:        /etc/fluent/keys/cert
      OPS_CLIENT_KEY:         /etc/fluent/keys/key
      OPS_CA:                 /etc/fluent/keys/ca
      JOURNAL_SOURCE:
      JOURNAL_READ_FROM_HEAD:
      BUFFER_QUEUE_LIMIT:     32
      BUFFER_SIZE_LIMIT:      8m
      FLUENTD_CPU_LIMIT:      node allocatable (limits.cpu)
      FLUENTD_MEMORY_LIMIT:   536870912 (limits.memory)
      FILE_BUFFER_LIMIT:      256Mi
    Mounts:
      /etc/docker from dockerdaemoncfg (ro)
      /etc/docker-hostname from dockerhostname (ro)
      /etc/fluent/configs.d/user from config (ro)
      /etc/fluent/keys from certs (ro)
      /etc/localtime from localtime (ro)
      /etc/origin/node from originnodecfg (ro)
      /etc/sysconfig/docker from dockercfg (ro)
      /run/log/journal from runlogjournal (rw)
      /var/lib/docker/containers from varlibdockercontainers (ro)
      /var/lib/fluentd from filebufferstorage (rw)
      /var/log from varlog (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from aggregated-logging-fluentd-token-642wp (ro)
Conditions:
  Type          Status
  Initialized   True
  Ready         True
  PodScheduled  True
Volumes:
  runlogjournal:
    Type:          HostPath (bare host directory volume)
    Path:          /run/log/journal
    HostPathType:
  varlog:
    Type:          HostPath (bare host directory volume)
    Path:          /var/log
    HostPathType:
  varlibdockercontainers:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/docker/containers
    HostPathType:
  config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      logging-fluentd
    Optional:  false
  certs:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  logging-fluentd
    Optional:    false
  dockerhostname:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/hostname
    HostPathType:
  localtime:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/localtime
    HostPathType:
  dockercfg:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/sysconfig/docker
    HostPathType:
  originnodecfg:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/origin/node
    HostPathType:
  dockerdaemoncfg:
    Type:          HostPath (bare host directory volume)
    Path:          /etc/docker
    HostPathType:
  filebufferstorage:
    Type:          HostPath (bare host directory volume)
    Path:          /var/lib/fluentd
    HostPathType:
  aggregated-logging-fluentd-token-642wp:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  aggregated-logging-fluentd-token-642wp
    Optional:    false
QoS Class:       Burstable
Node-Selectors:  logging-infra-fluentd=true
Tolerations:     node.kubernetes.io/disk-pressure:NoSchedule
                 node.kubernetes.io/memory-pressure:NoSchedule
                 node.kubernetes.io/not-ready:NoExecute
                 node.kubernetes.io/unreachable:NoExecute
Events:
  Type     Reason         Age                   From                                                      Message
  ----     ------         ----                  ----                                                      -------
  Warning  FailedKillPod  23m (x15157 over 3d)  kubelet, ip-172-31-30-246.ca-central-1.compute.internal   error killing pod: [failed to "KillContainer" for "fluentd-elasticsearch" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container 437efd7ceb2ea951a55f9bfc59525b214b3bfcc4e9113552e97cdae734a41030: Cannot kill container 437efd7ceb2ea951a55f9bfc59525b214b3bfcc4e9113552e97cdae734a41030: rpc error: code = 14 desc = grpc: the connection is unavailable", failed to "KillPodSandbox" for "db2852c8-3c68-11e8-a010-02d8407159d1" with KillPodSandboxError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container 1c2db3f7ef8c9d480703b5e49ce1fd1f7a80abe659c91b64a9bca40d94466c5a: Cannot kill container 1c2db3f7ef8c9d480703b5e49ce1fd1f7a80abe659c91b64a9bca40d94466c5a: rpc error: code = 14 desc = grpc: the connection is unavailable"]
  Normal   Killing        3m (x15224 over 3d)   kubelet, ip-172-31-30-246.ca-central-1.compute.internal   Killing container with id docker://fluentd-elasticsearch:Need to kill Pod
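The "grpc: the connection is unavailable" errors in those FailedKillPod events point at the docker daemon on the node rather than at fluentd itself. A quick node-side sanity check, run as root on ip-172-31-30-246 (docker-containerd-current being the containerd shipped with the RHEL docker packages):

# Is the daemon up and answering?
systemctl status docker --no-pager
pgrep -af docker-containerd-current
docker version
# Recent daemon-side errors around the failed kills.
journalctl -u docker --since -1h --no-pager | tail -n 100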
Can't reproduce on a test setup. Is this problem persisting?
(In reply to Ivan Chavero from comment #6)
> Can't reproduce on a test setup.
> Is this problem persisting?

@ivan Not sure - but the next time this happens, what information should we gather?
Hello Rich,

This info should be useful:

- oc logs from the pod
- check if docker-containerd-current is running
- check if there's enough space on the filesystem
- create and destroy a container manually
Also grab the docker logs from journalctl. A rough collection script along these lines is sketched below.
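A minimal sketch tying those checks together; the namespace, label, and test image come from earlier comments in this bug, while the pod-name argument, output paths, and time window are my assumptions. Steps 2-5 have to run on the stuck pod's node:

#!/bin/sh
# Gather the diagnostics suggested above for a stuck fluentd pod.
POD=${1:?usage: $0 <logging-fluentd-pod-name>}
OUT=/tmp/fluentd-debug-$POD
mkdir -p "$OUT"

# 1. oc logs from the pod
oc logs -n logging "$POD" > "$OUT/oc-logs.txt" 2>&1

# 2. is docker-containerd-current running? (child of dockerd, not its own unit)
pgrep -af docker-containerd-current > "$OUT/containerd.txt" || echo "not running" >> "$OUT/containerd.txt"

# 3. enough space on the filesystem?
df -h /var/lib/docker /var/lib/fluentd > "$OUT/df.txt"

# 4. create and destroy a container manually (reusing the fluentd image
#    already present on these nodes, per the pod spec above)
docker run --rm registry.reg-aws.openshift.com:443/openshift3/logging-fluentd:v3.9.7 true \
  > "$OUT/docker-run.txt" 2>&1

# 5. docker logs from journalctl
journalctl -u docker --since -1h --no-pager > "$OUT/journal-docker.txt"

echo "collected in $OUT"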
I'm closing this bug; if the problem persists, feel free to reopen it.