Bug 1889114
| Field | Value | Field | Value |
|---|---|---|---|
| Summary: | Fluentd shows error - error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure error="buffer is full." | | |
| Product: | OpenShift Container Platform | Reporter: | Vladislav Walek <vwalek> |
| Component: | Logging | Assignee: | Jeff Cantrill <jcantril> |
| Status: | CLOSED NOTABUG | QA Contact: | Anping Li <anli> |
| Severity: | high | Docs Contact: | |
| Priority: | low | | |
| Version: | 4.5 | CC: | aos-bugs, cruhm, hgomes, jburrell, jcantril, ocasalsa, periklis, rsandu, stwalter |
| Target Milestone: | --- | Flags: | vwalek: needinfo- |
| Target Release: | 4.7.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | logging-core | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-01-20 21:45:58 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Vladislav Walek
2020-10-18 00:11:14 UTC
The configuration is set to:

~~~
BUFFER_QUEUE_LIMIT: 64
BUFFER_SIZE_LIMIT: 32m
FILE_BUFFER_LIMIT: 2048Mi
~~~

I am unable to open the link from #c2, but I see the following in the logs in the customer case:

~~~
[2020-10-17T17:17:11,835][WARN ][r.suppressed ] [elasticsearch-cdm-r3u7lpil-2] path: /_bulk, params: {}
com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0x92
 at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@545c4788; line: 1, column: 73]
~~~

and

~~~
Populate config from /opt/app-root/src/sgconfig/
Will update 'security/config' with /opt/app-root/src/sgconfig/config.yml
FAIL: Configuration for 'config' failed because of ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]
~~~

The error message in #c0 is an indicator that fluent is unable to push any additional logs, and this could be because the Elasticsearch disk is at its thresholds, where it will no longer accept data. The attachment does not provide any information about the state of the log store, but my suspicion is that the cronjobs which clean up the disk are failing and thus are not removing indices. I encourage you to provide the output of [1], which would include disk info. There are also several 4.5 fixes for the delete cronjobs that are either being released for 4.5 or have just been released.

To resolve the latter, the post at [2] suggests the following to allow you to clean up the disk (note I have not verified):

* oc exec -c elasticsearch $POD -- es_util --query=_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }' -XPUT
* oc exec -c elasticsearch $POD -- es_util --query=_all/_settings -d '{"index.blocks.read_only_allow_delete": null}' -XPUT

If the cronjobs do not fire or are unable to complete, you may need to manually clean up old indices with something like:

* oc exec -c elasticsearch $POD -- es_util --query=$INDEX -XDELETE

Once logs are being ingested again, I would suggest re-enabling the thresholds:

* oc exec -c elasticsearch $POD -- es_util --query=_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": true } }' -XPUT

[1] https://github.com/openshift/cluster-logging-operator/tree/master/must-gather#cluster-logging-must-gather
[2] https://selleo.com/til/posts/esrgfyxjee-how-to-fix-elasticsearch-forbidden12index-read-only

Hi Jeff,

Appreciate your reply. I'll share this with the customer. As per your request, the must-gather for the logging component has been made available to SupportShell, for case '02777138', file: 0020-noprod-logging-logs.tar.gz
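For reference, the manual cleanup described in #c3 can be strung together as a small script. This is a minimal, unverified sketch: the namespace, the component=elasticsearch label selector, the _cat column names, and the example index name are assumptions, and the DELETE should only be run against indices that have been confirmed safe to drop.

~~~
#!/bin/bash
# Unverified sketch of the manual cleanup from #c3.
# Assumes the openshift-logging namespace and the component=elasticsearch label.
set -euo pipefail

NS=openshift-logging
POD=$(oc -n "$NS" get pods -l component=elasticsearch \
        -o jsonpath='{.items[0].metadata.name}')

# 1. Temporarily disable the disk watermark checks and clear the read-only block.
oc -n "$NS" exec -c elasticsearch "$POD" -- es_util \
   --query=_cluster/settings -XPUT \
   -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }'
oc -n "$NS" exec -c elasticsearch "$POD" -- es_util \
   --query=_all/_settings -XPUT \
   -d '{"index.blocks.read_only_allow_delete": null}'

# 2. List indices oldest first (creation.date is epoch millis) to pick deletion candidates.
oc -n "$NS" exec -c elasticsearch "$POD" -- es_util \
   --query="_cat/indices?h=index,creation.date,store.size&s=creation.date"

# 3. Delete a specific old index once confirmed safe (example name only).
INDEX=app-000001
oc -n "$NS" exec -c elasticsearch "$POD" -- es_util --query="$INDEX" -XDELETE

# 4. Re-enable the watermark checks once ingestion recovers.
oc -n "$NS" exec -c elasticsearch "$POD" -- es_util \
   --query=_cluster/settings -XPUT \
   -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": true } }'
~~~

Whether deletion is actually needed depends on confirming the disk pressure described above; the must-gather in [1] remains the best way to see the cluster's disk state.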
Hi,

At this moment we are finding these events constantly in a couple of the fluentd logs. A fresh logging dump was provided: logging-20201020_201938.tar.gz (105.715 KB)

~~~
2020-10-20 16:27:20 +0000 [warn]: [retry_clo_default_output_es] failed to flush the buffer. retry_time=13 next_retry_seconds=2020-10-20 16:32:10 +0000 chunk="5b176d95b836b4199b55ac7f38c39796" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): [500] {\"error\":{\"root_cause\":[{\"type\":\"json_parse_exception\",\"reason\":\"Invalid UTF-8 start byte 0x92\\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@6d9021a8; line: 1, column: 73]\"}],\"type\":\"json_parse_exception\",\"reason\":\"Invalid UTF-8 start byte 0x92\\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@6d9021a8; line: 1, column: 73]\"},\"status\":500}"
~~~

The allocation is the following:

~~~
$ oc exec -c elasticsearch elasticsearch-cdm-r3u7lpil-1-54b5d455b8-qgww2 -- es_util --query=_cat/allocation?v
shards disk.indices disk.used disk.avail disk.total disk.percent host          ip            node
    62      168.5gb   170.4gb    123.7gb    294.2gb           57 10.194.12.30  10.194.12.30  elasticsearch-cdm-r3u7lpil-3
    61      171.6gb   173.5gb    120.7gb    294.2gb           58 10.194.10.243 10.194.10.243 elasticsearch-cdm-r3u7lpil-1
    61      169.2gb   171.2gb      123gb    294.2gb           58 10.194.14.18  10.194.14.18  elasticsearch-cdm-r3u7lpil-2
[esanabria@sbst2002lx 20102020-1]$
~~~

And all the indices were unblocked:

* oc exec -c elasticsearch $POD -- es_util --query=_all/_settings -d '{"index.blocks.read_only_allow_delete": null}' -XPUT

All the fluentd filesystems are normal:

~~~
$ for i in $(oc get pods -l component=fluentd | awk '/fluentd/ { print $1 }') ; do oc exec $i -- du -sh /var/lib/fluentd ; done
104K    /var/lib/fluentd
32K     /var/lib/fluentd
32M     /var/lib/fluentd
24K     /var/lib/fluentd
36K     /var/lib/fluentd
132K    /var/lib/fluentd
16K     /var/lib/fluentd
32K     /var/lib/fluentd
8.0K    /var/lib/fluentd
3.1M    /var/lib/fluentd
30M     /var/lib/fluentd
38M     /var/lib/fluentd
8.0M    /var/lib/fluentd
452K    /var/lib/fluentd
24K     /var/lib/fluentd
128K    /var/lib/fluentd
~~~

But the events are still present in several of the fluentd pods. Do you know why fluentd doesn't connect with Elasticsearch?
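The du loop above only shows buffer size on disk. A slightly extended loop (a rough sketch, assuming the default /var/lib/fluentd file buffer path used above and a container named "fluentd") can also show how many chunk files are queued and how often each collector has recently logged a flush failure:

~~~
#!/bin/bash
# Rough sketch: per fluentd pod, report buffer size on disk, queued chunk files,
# and the count of recent "failed to flush the buffer" warnings.
NS=openshift-logging
for pod in $(oc -n "$NS" get pods -l component=fluentd \
               -o jsonpath='{.items[*].metadata.name}'); do
  size=$(oc -n "$NS" exec "$pod" -- du -sh /var/lib/fluentd | awk '{print $1}')
  chunks=$(oc -n "$NS" exec "$pod" -- sh -c 'find /var/lib/fluentd -type f | wc -l')
  fails=$(oc -n "$NS" logs "$pod" -c fluentd --tail=1000 | grep -c 'failed to flush the buffer')
  echo "$pod buffer=$size chunk_files=$chunks recent_flush_failures=$fails"
done
~~~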
Setting UpcomingSprint as unable to resolve before EOD.

(In reply to hgomes from comment #5)
> Hi,
>
> At this moment we are finding these events constantly in a couple of the fluentd logs. A fresh logging dump was provided: logging-20201020_201938.tar.gz (105.715 KB)
>
> ~~~
> 2020-10-20 16:27:20 +0000 [warn]: [retry_clo_default_output_es] failed to flush the buffer. retry_time=13 next_retry_seconds=2020-10-20 16:32:10 +0000 chunk="5b176d95b836b4199b55ac7f38c39796" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): [500] {\"error\":{\"root_cause\":[{\"type\":\"json_parse_exception\",\"reason\":\"Invalid UTF-8 start byte 0x92\\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@6d9021a8; line: 1, column: 73]\"}],\"type\":\"json_parse_exception\",\"reason\":\"Invalid UTF-8 start byte 0x92\\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@6d9021a8; line: 1, column: 73]\"},\"status\":500}"
> ~~~

This is an error we have seen before, which we believed to be related to a full disk on the hosts. I don't believe we can fully attribute it to a full disk. It may be a difference in encoding between how the log is created, what fluent expects, and what ES requires. Log entries like this get dumped back into a retry queue, and it may be that we don't properly evaluate whether they can ever be submitted to ES.

> But the events are still present in several of the fluentd pods. Do you know why fluentd doesn't connect with Elasticsearch?

It's not that fluent cannot connect, but that it is unable to push logs because of the UTF-8 error. I'll have to investigate what is happening. Is there any way to get the following information from the customer (a sketch of how it could be collected follows this list):

* What is the encoding of the container on the node where you see this failure?
* What is the encoding of the node where you see this failure?
* A few sample log messages from one of the containers that produces this message?
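A sketch of how the requested information could be gathered. The pod name and the container log path under /var/log/containers are placeholders to substitute for the failing collector and container:

~~~
#!/bin/bash
# Sketch of collecting the encoding details requested above.
# Pod name and the suspect container log path are placeholders.
NS=openshift-logging
FLUENTD_POD=<collector-pod-showing-the-utf8-error>
NODE=$(oc -n "$NS" get pod "$FLUENTD_POD" -o jsonpath='{.spec.nodeName}')

# 1. Encoding/locale inside the collector container
oc -n "$NS" exec "$FLUENTD_POD" -- locale

# 2. Encoding/locale of the node itself
oc debug node/"$NODE" -- chroot /host locale

# 3. A few raw log lines from the container that produces the bad bytes
oc debug node/"$NODE" -- chroot /host \
   sh -c 'tail -n 5 /var/log/containers/<suspect-container>*.log'
~~~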
Hello,

I also have a customer hitting a similar error. Looking at the fluentd and elasticsearch logs we see the following.

Fluentd is not able to send logs to ES:

~~~
$ tail -5 fluentd-6gqjj-fluentd.log
2020-11-30 09:47:03 +0000 [warn]: suppressed same stacktrace
2020-11-30 09:47:49 +0000 [warn]: [clo_default_output_es] failed to flush the buffer. retry_time=2280 next_retry_seconds=2020-11-30 09:53:02 +0000 chunk="5b4bb0d8a6b8922f83e385cf89497dd4" error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure error="buffer is full."
2020-11-30 09:47:49 +0000 [warn]: suppressed same stacktrace
2020-11-30 09:47:49 +0000 [warn]: [clo_default_output_es] failed to flush the buffer. retry_time=2281 next_retry_seconds=2020-11-30 09:52:47 +0000 chunk="5b4bb0d9a343d016d7c4e256584e76d9" error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure error="buffer is full."
2020-11-30 09:47:49 +0000 [warn]: suppressed same stacktrace
[supportshell.prod.useraccess-us-west-2.redhat.com] [13:37:12+0000]
~~~

Checking the ES pods, they have some failures starting like this:

~~~
$ cat elasticsearch-cdm-hjoanotz-1-74f9bc7565-8mxgf-elasticsearch.log
...
Will update 'security/config' with /opt/app-root/src/sgconfig/config.yml
FAIL: Configuration for 'config' failed because of ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]
Will update 'security/roles' with /opt/app-root/src/sgconfig/roles.yml
FAIL: Configuration for 'roles' failed because of ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]
Will update 'security/rolesmapping' with /opt/app-root/src/sgconfig/roles_mapping.yml
FAIL: Configuration for 'rolesmapping' failed because of ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]
Will update 'security/internalusers' with /opt/app-root/src/sgconfig/internal_users.yml
FAIL: Configuration for 'internalusers' failed because of ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]
Will update 'security/actiongroups' with /opt/app-root/src/sgconfig/action_groups.yml
...
./init.sh: line 35: show_failures: command not found
~~~

I attached a link to a full must-gather the customer uploaded in the above comment.

(In reply to Courtney Ruhm from comment #9)
> Fluentd is not able to send logs to ES:
>
> ~~~
> $ tail -5 fluentd-6gqjj-fluentd.log
> 2020-11-30 09:47:03 +0000 [warn]: suppressed same stacktrace
> 2020-11-30 09:47:49 +0000 [warn]: [clo_default_output_es] failed to flush the buffer. retry_time=2280 next_retry_seconds=2020-11-30 09:53:02 +0000 chunk="5b4bb0d8a6b8922f83e385cf89497dd4" error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure error="buffer is full."
> ~~~

This matches the reported issue but is a side-effect of not being able to push logs. I would also note that this is different from the UTF-8 issue.

> Checking the ES pods, they have some failures starting like this:
>
> ~~~
> Will update 'security/config' with /opt/app-root/src/sgconfig/config.yml
> FAIL: Configuration for 'config' failed because of ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]
> Will update 'security/roles' with /opt/app-root/src/sgconfig/roles.yml
> FAIL: Configuration for 'roles' failed because of ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]
> Will update 'security/rolesmapping' with /opt/app-root/src/sgconfig/roles_mapping.yml
> FAIL: Configuration for 'rolesmapping' failed because of ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]
> Will update 'security/internalusers' with /opt/app-root/src/sgconfig/internal_users.yml
> FAIL: Configuration for 'internalusers' failed because of ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]
> Will update 'security/actiongroups' with /opt/app-root/src/sgconfig/action_groups.yml
> ~~~

These issues are again related to the disk being full and the cluster being locked out from deleting. We have another bz here.

Possible solution to the utf-8 issue: https://github.com/uken/fluent-plugin-elasticsearch/issues/77#issuecomment-63049158

Hi, in Courtney's case we have found the disk is not full:

~~~
# oc get pods -l component=elasticsearch
NAME                                            READY   STATUS    RESTARTS   AGE
elasticsearch-cdm-hjoanotz-1-74f9bc7565-8mxgf   2/2     Running   0          5d18h
elasticsearch-cdm-hjoanotz-2-7577c969cb-mg8td   2/2     Running   0          5d18h
elasticsearch-cdm-hjoanotz-3-86d99c96d7-b97cz   2/2     Running   0          5d18h
# oc rsh -c elasticsearch elasticsearch-cdm-hjoanotz-1-74f9bc7565-8mxgf
sh-4.2$ curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'
curl: (52) Empty reply from server
sh-4.2$ curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }'
curl: (52) Empty reply from server
sh-4.2$ hostname
elasticsearch-cdm-hjoanotz-1-74f9bc7565-8mxgf
sh-4.2$ df -hP
Filesystem      Size  Used Avail Use% Mounted on
overlay         149G   68G   82G  46% /
tmpfs            64M     0   64M   0% /dev
tmpfs            24G     0   24G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
tmpfs            24G   43M   24G   1% /etc/passwd
/dev/sdb        100G   41G   60G  41% /elasticsearch/persistent
/dev/sda3       149G   68G   82G  46% /etc/hosts
tmpfs            24G   28K   24G   1% /etc/openshift/elasticsearch/secret
tmpfs            24G   28K   24G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs            24G     0   24G   0% /proc/acpi
tmpfs            24G     0   24G   0% /proc/scsi
tmpfs            24G     0   24G   0% /sys/firmware
sh-4.2$
~~~

Or should we be checking the node disk space?

> We have another bz here

Should we open a new bz for this other case?

Hello,

Any update for our customer on this bug? Thank you

(In reply to Steven Walter from comment #12)
> Hi, in Courtney's case we have found the disk is not full:

I will correct my previous statement based on some newer findings related to the rollover and delete cronjobs. The rollover process is not transactional but is a two-step process behind the scenes. We believe there is an issue related to both steps not succeeding, which results in the situation described in #c9.

> > We have another bz here
> Should we open a new bz for this other case?

We are working the fix against https://bugzilla.redhat.com/show_bug.cgi?id=1893992, which is one of several issues related to the index management jobs.

Hello,

So is there any known current workaround for this issue, or are we just going to need to wait for the fix for this other bug? Thank you

(In reply to Steven Walter from comment #12)
> # oc rsh -c elasticsearch elasticsearch-cdm-hjoanotz-1-74f9bc7565-8mxgf
> sh-4.2$ curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'
> curl: (52) Empty reply from server
> sh-4.2$ curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }'
> curl: (52) Empty reply from server
> sh-4.2$ hostname
> elasticsearch-cdm-hjoanotz-1-74f9bc7565-8mxgf

These are not the same commands as recommended in #c3, as communication to the server is over https and not plain http. I suggest using the commands as listed and/or hopping on the pod as done here and using "es_util". It is not clear to me what state the ES cluster is in; #c9 makes me think it is not properly seeded for the ACLs. When we resolve the storage issue, it will resolve the issue at the collector. Might I recommend attaching a 'must-gather' to this issue so we can get a holistic picture of the environment: https://github.com/openshift/cluster-logging-operator/tree/master/must-gather
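A sketch of how that logging must-gather is typically collected; the image lookup below mirrors the approach described in the linked README, but verify the exact command there for your version before running it:

~~~
# Collect the logging-specific must-gather using the cluster-logging-operator
# image currently deployed (lookup approach assumed from the linked README).
oc adm must-gather --image=$(oc -n openshift-logging get deployment.apps/cluster-logging-operator \
     -o jsonpath='{.spec.template.spec.containers[?(@.name == "cluster-logging-operator")].image}')

# Then compress the resulting output directory for attachment to the case.
tar -czf logging-must-gather.tar.gz must-gather.local.*
~~~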
Hello Jeff,

Here's the logging must-gather you requested. Can't attach it directly to the case as it's too large. Let me know if this link works or not:

https://attachments.access.redhat.com/hydra/rest/cases/02804669/attachments/f4fd0d76-8e31-43bc-8195-4edecfb193b2?usePresignedUrl=true

Thank you

(In reply to Courtney Ruhm from comment #17)
> Hello Jeff,
>
> Here's the logging must-gather you requested. Can't attach it directly to the case as it's too large. Let me know if this link works or not:
>
> https://attachments.access.redhat.com/hydra/rest/cases/02804669/attachments/f4fd0d76-8e31-43bc-8195-4edecfb193b2?usePresignedUrl=true

I am unable to access this info.

I believe this issue is related to https://bugzilla.redhat.com/show_bug.cgi?id=1890838

Closing as the customer cases are closed. This is a warning message that is likely only resolvable by correcting the overall health of the logging stack.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days.