Bug 1560134
| Summary: | Fluentd pod started but unable to send logs to Elasticsearch | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Peter Portante <pportant> |
| Component: | Networking | Assignee: | Casey Callendrello <cdc> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED INSUFFICIENT_DATA | Docs Contact: | |
| Severity: | low | | |
| Priority: | low | CC: | aos-bugs, bbennett, jforrest, pportant, rmeggins |
| Version: | 3.9.0 | | |
| Target Milestone: | --- | | |
| Target Release: | 3.10.0 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-05-31 18:06:03 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
@ben it would be good to know what else should have been pulled from this node at the time to help debug this, since I'm guessing this issue is no longer happening.

When was that oc describe executed? Was the pod only up for 58m? Those look like startup messages. The "network is not ready" message is expected while a node is starting... but they have tolerations, and I wonder if the pod is started before the networking is ready; if that happens, it will never get networking, since the CNI hooks run when the pod is getting set up. (Investigating that.)

Can I get the full configuration for the fluentd top-level object, please? Either the deployment config or the daemonset for it. (A command sketch for pulling this follows the describe output below.)

Another case where the pod has lost its network, but on a different starter cluster:
[root@starter-ca-central-1-master-692e9 ~]# oc logs logging-fluentd-dn9xt
.
.
.
2018-04-11 11:51:43 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2018-04-11 13:27:43 +0000 error_class="Fluent::ElasticsearchOutput::ConnectionFailure" error="Can not reach Elasticsearch cluster ({:host=>\"logging-es\", :port=>9200, :scheme=>\"https\"})!" plugin_id="object:3fd0bd039058"
2018-04-11 11:51:43 +0000 [warn]: suppressed same stacktrace
2018-04-11 11:51:57 +0000 [warn]: emit transaction failed: error_class=SocketError error="getaddrinfo: Name or service not known" location="/usr/share/ruby/net/http.rb:878:in `initialize'" tag="kubernetes.var.log.containers.logging-fluentd-dn9xt_logging_fluentd-elasticsearch-437efd7ceb2ea951a55f9bfc59525b214b3bfcc4e9113552e97cdae734a41030.log"
2018-04-11 11:51:57 +0000 [warn]: suppressed same stacktrace
2018-04-11 11:52:17 +0000 [warn]: emit transaction failed: error_class=SocketError error="getaddrinfo: Name or service not known" location="/usr/share/ruby/net/http.rb:878:in `initialize'" tag="kubernetes.var.log.containers.logging-fluentd-dn9xt_logging_fluentd-elasticsearch-437efd7ceb2ea951a55f9bfc59525b214b3bfcc4e9113552e97cdae734a41030.log"
2018-04-11 11:52:17 +0000 [warn]: suppressed same stacktrace
2018-04-11 11:52:37 +0000 [warn]: emit transaction failed: error_class=SocketError error="getaddrinfo: Name or service not known" location="/usr/share/ruby/net/http.rb:878:in `initialize'" tag="kubernetes.var.log.containers.jenkins-5-hbnm5_k8s-jenkins_jenkins-064ad99fdc66162528c4cb750bfafb3cd6a6d1d067ce686e4cc24b51dcfb3245.log"
2018-04-11 11:52:37 +0000 [warn]: suppressed same stacktrace
[root@starter-ca-central-1-master-692e9 ~]# oc describe pod logging-fluentd-dn9xt
Name: logging-fluentd-dn9xt
Namespace: logging
Node: ip-172-31-30-246.ca-central-1.compute.internal/172.31.30.246
Start Time: Tue, 10 Apr 2018 02:44:05 +0000
Labels: component=fluentd
controller-revision-hash=248860309
logging-infra=fluentd
pod-template-generation=19
provider=openshift
Annotations: openshift.io/scc=privileged
Status: Terminating (expires Wed, 11 Apr 2018 04:23:24 +0000)
Termination Grace Period: 30s
IP: 10.130.39.84
Controlled By: DaemonSet/logging-fluentd
Containers:
fluentd-elasticsearch:
Container ID: docker://437efd7ceb2ea951a55f9bfc59525b214b3bfcc4e9113552e97cdae734a41030
Image: registry.reg-aws.openshift.com:443/openshift3/logging-fluentd:v3.9.7
Image ID: docker-pullable://registry.reg-aws.openshift.com:443/openshift3/logging-fluentd@sha256:2387ab82fb5f670de4062c372ff857fff127829c3dbebafe360f950981b268be
Port: <none>
State: Running
Started: Tue, 10 Apr 2018 02:44:58 +0000
Ready: True
Restart Count: 0
Limits:
memory: 512Mi
Requests:
cpu: 100m
memory: 512Mi
Environment:
K8S_HOST_URL: https://kubernetes.default.svc.cluster.local
ES_HOST: logging-es
ES_PORT: 9200
ES_CLIENT_CERT: /etc/fluent/keys/cert
ES_CLIENT_KEY: /etc/fluent/keys/key
ES_CA: /etc/fluent/keys/ca
OPS_HOST: logging-es
OPS_PORT: 9200
OPS_CLIENT_CERT: /etc/fluent/keys/cert
OPS_CLIENT_KEY: /etc/fluent/keys/key
OPS_CA: /etc/fluent/keys/ca
JOURNAL_SOURCE:
JOURNAL_READ_FROM_HEAD:
BUFFER_QUEUE_LIMIT: 32
BUFFER_SIZE_LIMIT: 8m
FLUENTD_CPU_LIMIT: node allocatable (limits.cpu)
FLUENTD_MEMORY_LIMIT: 536870912 (limits.memory)
FILE_BUFFER_LIMIT: 256Mi
Mounts:
/etc/docker from dockerdaemoncfg (ro)
/etc/docker-hostname from dockerhostname (ro)
/etc/fluent/configs.d/user from config (ro)
/etc/fluent/keys from certs (ro)
/etc/localtime from localtime (ro)
/etc/origin/node from originnodecfg (ro)
/etc/sysconfig/docker from dockercfg (ro)
/run/log/journal from runlogjournal (rw)
/var/lib/docker/containers from varlibdockercontainers (ro)
/var/lib/fluentd from filebufferstorage (rw)
/var/log from varlog (rw)
/var/run/secrets/kubernetes.io/serviceaccount from aggregated-logging-fluentd-token-642wp (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
runlogjournal:
Type: HostPath (bare host directory volume)
Path: /run/log/journal
HostPathType:
varlog:
Type: HostPath (bare host directory volume)
Path: /var/log
HostPathType:
varlibdockercontainers:
Type: HostPath (bare host directory volume)
Path: /var/lib/docker/containers
HostPathType:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: logging-fluentd
Optional: false
certs:
Type: Secret (a volume populated by a Secret)
SecretName: logging-fluentd
Optional: false
dockerhostname:
Type: HostPath (bare host directory volume)
Path: /etc/hostname
HostPathType:
localtime:
Type: HostPath (bare host directory volume)
Path: /etc/localtime
HostPathType:
dockercfg:
Type: HostPath (bare host directory volume)
Path: /etc/sysconfig/docker
HostPathType:
originnodecfg:
Type: HostPath (bare host directory volume)
Path: /etc/origin/node
HostPathType:
dockerdaemoncfg:
Type: HostPath (bare host directory volume)
Path: /etc/docker
HostPathType:
filebufferstorage:
Type: HostPath (bare host directory volume)
Path: /var/lib/fluentd
HostPathType:
aggregated-logging-fluentd-token-642wp:
Type: Secret (a volume populated by a Secret)
SecretName: aggregated-logging-fluentd-token-642wp
Optional: false
QoS Class: Burstable
Node-Selectors: logging-infra-fluentd=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedKillPod 23m (x15157 over 3d) kubelet, ip-172-31-30-246.ca-central-1.compute.internal error killing pod: [failed to "KillContainer" for "fluentd-elasticsearch" with KillContainerError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container 437efd7ceb2ea951a55f9bfc59525b214b3bfcc4e9113552e97cdae734a41030: Cannot kill container 437efd7ceb2ea951a55f9bfc59525b214b3bfcc4e9113552e97cdae734a41030: rpc error: code = 14 desc = grpc: the connection is unavailable"
, failed to "KillPodSandbox" for "db2852c8-3c68-11e8-a010-02d8407159d1" with KillPodSandboxError: "rpc error: code = Unknown desc = Error response from daemon: Cannot stop container 1c2db3f7ef8c9d480703b5e49ce1fd1f7a80abe659c91b64a9bca40d94466c5a: Cannot kill container 1c2db3f7ef8c9d480703b5e49ce1fd1f7a80abe659c91b64a9bca40d94466c5a: rpc error: code = 14 desc = grpc: the connection is unavailable"
]
Normal Killing 3m (x15224 over 3d) kubelet, ip-172-31-30-246.ca-central-1.compute.internal Killing container with id docker://fluentd-elasticsearch:Need to kill Pod
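For reference, the full top-level object requested earlier can be pulled with a standard oc command. The describe output shows the pod is "Controlled By: DaemonSet/logging-fluentd", so it is a daemonset rather than a deployment config; assuming the logging namespace used throughout this bug, a minimal sketch:

oc get daemonset/logging-fluentd -n logging -o yaml

That dumps the complete pod template, tolerations included — the tolerations being what allow the pod onto a not-ready node in the first place.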
Can't reproduce on a test setup. Is this problem persisting?

(In reply to Ivan Chavero from comment #6)
> Can't reproduce on a test setup.
> Is this problem persisting?

@ivan Not sure - but the next time this happens, what information should we gather?

Hello Rich,

This info should be useful:
- oc logs from the pod
- check if docker-containerd-current is running
- check if there's enough space on the filesystem
- create and destroy a container manually
- also, docker logs from journalctl

(These checks are sketched as commands below.)

I'm closing this bug; if the problem persists, feel free to reopen it.
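A sketch of those checks as commands, using the dn9xt pod from the comments above as the example; the docker-containerd-current process name and the /var/lib/docker path assume the RHEL docker packaging of this era:

oc logs logging-fluentd-dn9xt -n logging     # logs from the pod
pgrep -af docker-containerd-current          # is docker-containerd-current running?
df -h /var/lib/docker /var/lib/fluentd       # enough space on the filesystem?
docker run --rm busybox true                 # create and destroy a container manually
journalctl -u docker --no-pager -n 200       # docker logs from journalctl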
A fluentd pod on starter-us-west-1 was stuck for almost two days without being able to send any logs to Elasticsearch. The pod was marked as healthy and in a running state, but the fluentd on-disk queues were full and a day old (see below). The pod had the following error, found with oc describe:

Warning NetworkNotReady 58m (x3 over 58m) kubelet, ip-172-31-20-183.us-west-1.compute.internal network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized]

The full output of oc describe is below. It looks like the networking problem should have killed the pod so that it could be recreated. Once I killed the pod and it was recreated, fluentd started working fine on that node, sending logs again.

----

[root@starter-us-west-1-master-2514b ~]# oc rsh pods/logging-fluentd-smg4n
sh-4.2# ls -ltrh /var/lib/fluentd/
total 264M
-rw-r--r--. 1 root root 8.0M Mar 22 12:38 buffer-output-es-config.output_tag.q567ff8e24c2da87b.log
-rw-r--r--. 1 root root 8.0M Mar 22 12:38 buffer-output-es-config.output_tag.q567ff94dd7c1e7ed.log
-rw-r--r--. 1 root root 8.0M Mar 22 12:38 buffer-output-es-config.output_tag.q567ff954c1f2346a.log
-rw-r--r--. 1 root root 8.0M Mar 22 12:48 buffer-output-es-config.output_tag.q567ff95bb7dbce07.log
-rw-r--r--. 1 root root 8.0M Mar 22 13:13 buffer-output-es-config.output_tag.q567ffb9af8126979.log
-rw-r--r--. 1 root root 8.0M Mar 22 13:37 buffer-output-es-config.output_tag.q568001373d8eb187.log
-rw-r--r--. 1 root root 8.0M Mar 22 13:58 buffer-output-es-config.output_tag.q5680068c30a1df20.log
-rw-r--r--. 1 root root 8.0M Mar 22 14:03 buffer-output-es-config.output_tag.q56800b268629ce19.log
-rw-r--r--. 1 root root 8.0M Mar 22 14:05 buffer-output-es-config.output_tag.q56800c64c4e3dc53.log
-rw-r--r--. 1 root root 8.0M Mar 22 14:05 buffer-output-es-config.output_tag.q56800cc8dc88864e.log
-rw-r--r--. 1 root root 8.0M Mar 22 14:05 buffer-output-es-config.output_tag.q56800ccff2d3f1fd.log
-rw-r--r--. 1 root root 8.0M Mar 22 14:08 buffer-output-es-config.output_tag.q56800cd798c5c88a.log
-rw-r--r--. 1 root root 8.0M Mar 22 14:33 buffer-output-es-config.output_tag.q56800d91bbd9978b.log
-rw-r--r--. 1 root root 8.0M Mar 22 14:57 buffer-output-es-config.output_tag.q5680130d5f5f184e.log
-rw-r--r--. 1 root root 8.0M Mar 22 15:19 buffer-output-es-config.output_tag.q56801875c2302b74.log
-rw-r--r--. 1 root root 8.0M Mar 22 15:30 buffer-output-es-config.output_tag.q56801d63148c21d4.log
-rw-r--r--. 1 root root 8.0M Mar 22 15:32 buffer-output-es-config.output_tag.q56801fd70e4df957.log
-rw-r--r--. 1 root root 8.0M Mar 22 15:32 buffer-output-es-config.output_tag.q5680204d73a7619f.log
-rw-r--r--. 1 root root 8.0M Mar 22 15:32 buffer-output-es-config.output_tag.q568020557c199319.log
-rw-r--r--. 1 root root 8.0M Mar 22 15:33 buffer-output-es-config.output_tag.q5680205c03e5bff9.log
-rw-r--r--. 1 root root 8.0M Mar 22 15:55 buffer-output-es-config.output_tag.q56802062b6be19bd.log
-rw-r--r--. 1 root root 8.0M Mar 22 16:19 buffer-output-es-config.output_tag.q568025479eaad120.log
-rw-r--r--. 1 root root 8.0M Mar 22 16:42 buffer-output-es-config.output_tag.q56802ab8c7ac5b5f.log
-rw-r--r--. 1 root root 8.0M Mar 22 16:57 buffer-output-es-config.output_tag.q56802fd041109de2.log
-rw-r--r--. 1 root root 8.0M Mar 22 16:59 buffer-output-es-config.output_tag.q5680334e4c6b910e.log
-rw-r--r--. 1 root root 8.0M Mar 22 16:59 buffer-output-es-config.output_tag.q568033b958f892ec.log
-rw-r--r--. 1 root root 8.0M Mar 22 16:59 buffer-output-es-config.output_tag.q568033c613d699db.log
-rw-r--r--. 1 root root 8.0M Mar 22 17:00 buffer-output-es-config.output_tag.q568033cccd186bf3.log
-rw-r--r--. 1 root root 8.0M Mar 22 17:16 buffer-output-es-config.output_tag.q568033d4983ff734.log
-rw-r--r--. 1 root root 8.0M Mar 22 17:41 buffer-output-es-config.output_tag.q56803764dbcf542f.log
-rw-r--r--. 1 root root 8.0M Mar 22 18:05 buffer-output-es-config.output_tag.q56803cf983f21f4f.log
-rw-r--r--. 1 root root 8.0M Mar 22 18:26 buffer-output-es-config.output_tag.q5680427561824f30.log
-rw-r--r--. 1 root root 8.0M Mar 22 18:27 buffer-output-es-config.output_tag.b5680471e87e39b9a.log
sh-4.2# ps auxww
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.8 0.2 7413820 178516 ? Ssl Mar22 25:00 /usr/bin/ruby /usr/bin/fluentd --no-supervisor
root 13486 0.0 0.0 11772 1696 ? Ss 00:43 0:00 /bin/sh
root 13493 0.0 0.0 47448 1664 ? R+ 00:43 0:00 ps auxww
sh-4.2# kill -TERM 1
sh-4.2# ps auxww
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.8 0.2 7424068 178516 ? Ssl Mar22 25:00 /usr/bin/ruby /usr/bin/fluentd --no-supervisor
root 13486 0.0 0.0 11772 1760 ? Ss 00:43 0:00 /bin/sh
root 13502 0.0 0.0 47448 1664 ? R+ 00:43 0:00 ps auxww
sh-4.2# exit
exit
command terminated with exit code 127
[root@starter-us-west-1-master-2514b ~]# oc describe pods/logging-fluentd-smg4n
Name: logging-fluentd-smg4n
Namespace: logging
Node: ip-172-31-20-183.us-west-1.compute.internal/172.31.20.183
Start Time: Thu, 22 Mar 2018 00:29:07 +0000
Labels: component=fluentd
controller-revision-hash=248860309
logging-infra=fluentd
pod-template-generation=5
provider=openshift
Annotations: openshift.io/scc=privileged
Status: Running
IP: 10.128.4.15
Controlled By: DaemonSet/logging-fluentd
Containers:
fluentd-elasticsearch:
Container ID: docker://f6330f4793b25e08e6ef597dd3b82f1bd521432d4735aa5af2b558b488d8842d
Image: registry.reg-aws.openshift.com:443/openshift3/logging-fluentd:v3.9.7
Image ID: docker-pullable://registry.reg-aws.openshift.com:443/openshift3/logging-fluentd@sha256:2387ab82fb5f670de4062c372ff857fff127829c3dbebafe360f950981b268be
Port: <none>
State: Running
Started: Thu, 22 Mar 2018 00:29:14 +0000
Ready: True
Restart Count: 0
Limits:
memory: 512Mi
Requests:
cpu: 100m
memory: 512Mi
Environment:
K8S_HOST_URL: https://kubernetes.default.svc.cluster.local
ES_HOST: logging-es
ES_PORT: 9200
ES_CLIENT_CERT: /etc/fluent/keys/cert
ES_CLIENT_KEY: /etc/fluent/keys/key
ES_CA: /etc/fluent/keys/ca
OPS_HOST: logging-es
OPS_PORT: 9200
OPS_CLIENT_CERT: /etc/fluent/keys/cert
OPS_CLIENT_KEY: /etc/fluent/keys/key
OPS_CA: /etc/fluent/keys/ca
JOURNAL_SOURCE:
JOURNAL_READ_FROM_HEAD:
BUFFER_QUEUE_LIMIT: 32
BUFFER_SIZE_LIMIT: 8m
FLUENTD_CPU_LIMIT: node allocatable (limits.cpu)
FLUENTD_MEMORY_LIMIT: 536870912 (limits.memory)
FILE_BUFFER_LIMIT: 256Mi
Mounts:
/etc/docker from dockerdaemoncfg (ro)
/etc/docker-hostname from dockerhostname (ro)
/etc/fluent/configs.d/user from config (ro)
/etc/fluent/keys from certs (ro)
/etc/localtime from localtime (ro)
/etc/origin/node from originnodecfg (ro)
/etc/sysconfig/docker from dockercfg (ro)
/run/log/journal from runlogjournal (rw)
/var/lib/docker/containers from varlibdockercontainers (ro)
/var/lib/fluentd from filebufferstorage (rw)
/var/log from varlog (rw)
/var/run/secrets/kubernetes.io/serviceaccount from aggregated-logging-fluentd-token-2lmll (ro)
Conditions:
Type Status
Initialized True
Ready True
PodScheduled True
Volumes:
runlogjournal:
Type: HostPath (bare host directory volume)
Path: /run/log/journal
HostPathType:
varlog:
Type: HostPath (bare host directory volume)
Path: /var/log
HostPathType:
varlibdockercontainers:
Type: HostPath (bare host directory volume)
Path: /var/lib/docker/containers
HostPathType:
config:
Type: ConfigMap (a volume populated by a ConfigMap)
Name: logging-fluentd
Optional: false
certs:
Type: Secret (a volume populated by a Secret)
SecretName: logging-fluentd
Optional: false
dockerhostname:
Type: HostPath (bare host directory volume)
Path: /etc/hostname
HostPathType:
localtime:
Type: HostPath (bare host directory volume)
Path: /etc/localtime
HostPathType:
dockercfg:
Type: HostPath (bare host directory volume)
Path: /etc/sysconfig/docker
HostPathType:
originnodecfg:
Type: HostPath (bare host directory volume)
Path: /etc/origin/node
HostPathType:
dockerdaemoncfg:
Type: HostPath (bare host directory volume)
Path: /etc/docker
HostPathType:
filebufferstorage:
Type: HostPath (bare host directory volume)
Path: /var/lib/fluentd
HostPathType:
aggregated-logging-fluentd-token-2lmll:
Type: Secret (a volume populated by a Secret)
SecretName: aggregated-logging-fluentd-token-2lmll
Optional: false
QoS Class: Burstable
Node-Selectors: logging-infra-fluentd=true
Tolerations: node.kubernetes.io/disk-pressure:NoSchedule
node.kubernetes.io/memory-pressure:NoSchedule
node.kubernetes.io/not-ready:NoExecute
node.kubernetes.io/unreachable:NoExecute
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal SuccessfulMountVolume 58m kubelet, ip-172-31-20-183.us-west-1.compute.internal MountVolume.SetUp succeeded for volume "originnodecfg"
Normal SuccessfulMountVolume 58m kubelet, ip-172-31-20-183.us-west-1.compute.internal MountVolume.SetUp succeeded for volume "filebufferstorage"
Normal SuccessfulMountVolume 58m kubelet, ip-172-31-20-183.us-west-1.compute.internal MountVolume.SetUp succeeded for volume "varlibdockercontainers"
Normal SuccessfulMountVolume 58m kubelet, ip-172-31-20-183.us-west-1.compute.internal MountVolume.SetUp succeeded for volume "dockercfg"
Normal SuccessfulMountVolume 58m kubelet, ip-172-31-20-183.us-west-1.compute.internal MountVolume.SetUp succeeded for volume "varlog"
Normal SuccessfulMountVolume 58m kubelet, ip-172-31-20-183.us-west-1.compute.internal MountVolume.SetUp succeeded for volume "runlogjournal"
Normal SuccessfulMountVolume 58m kubelet, ip-172-31-20-183.us-west-1.compute.internal MountVolume.SetUp succeeded for volume "dockerdaemoncfg"
Normal SuccessfulMountVolume 58m kubelet, ip-172-31-20-183.us-west-1.compute.internal MountVolume.SetUp succeeded for volume "localtime"
Normal SuccessfulMountVolume 58m kubelet, ip-172-31-20-183.us-west-1.compute.internal MountVolume.SetUp succeeded for volume "dockerhostname"
Warning NetworkNotReady 58m (x3 over 58m) kubelet, ip-172-31-20-183.us-west-1.compute.internal network is not ready: [runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:docker: network plugin is not ready: cni config uninitialized]
Normal SuccessfulMountVolume 58m (x3 over 58m) kubelet, ip-172-31-20-183.us-west-1.compute.internal (combined from similar events): MountVolume.SetUp succeeded for volume "certs"
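The description's point that the networking problem should have killed the pod suggests one possible mitigation: a liveness probe that exercises the pod's path to Elasticsearch, so the kubelet restarts fluentd when its networking silently dies. A sketch using oc set probe; the thresholds and the curl invocation (reusing the certificate paths from the pod environment above) are illustrative assumptions, not the shipped logging configuration:

oc set probe daemonset/logging-fluentd -n logging --liveness \
  --initial-delay-seconds=120 --period-seconds=60 --failure-threshold=5 \
  -- curl -s --max-time 10 --cacert /etc/fluent/keys/ca \
     --cert /etc/fluent/keys/cert --key /etc/fluent/keys/key \
     https://logging-es:9200/

The thresholds need to be generous: a tight failure threshold would also restart fluentd during a legitimate Elasticsearch outage (like the buffer-flush failures shown earlier), so the probe should tolerate short ES downtime rather than treat every failed flush as a dead pod.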