Bug 1889114

Summary: Fluentd shows error - error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure error="buffer is full."

Product: OpenShift Container Platform
Component: Logging
Version: 4.5
Target Release: 4.7.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED NOTABUG
Severity: high
Priority: low
Reporter: Vladislav Walek <vwalek>
Assignee: Jeff Cantrill <jcantril>
QA Contact: Anping Li <anli>
CC: aos-bugs, cruhm, hgomes, jburrell, jcantril, ocasalsa, periklis, rsandu, stwalter
Flags: vwalek: needinfo-
Whiteboard: logging-core
Type: Bug
Last Closed: 2021-01-20 21:45:58 UTC

Description Vladislav Walek 2020-10-18 00:11:14 UTC
Description of problem:

Fluentd is reporting the following error:

2020-10-17 17:14:28 +0000 [warn]: [retry_clo_default_output_es] failed to flush the buffer. retry_time=10 next_retry_seconds=2020-10-17 17:19:14 +0000 chunk="5b1808bf54dbe2f8656d00dc0ad491fb" error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure error="buffer is full."


Version-Release number of selected component (if applicable):
OpenShift Container Platform 4.5

How reproducible:
n/a

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 Vladislav Walek 2020-10-18 00:11:32 UTC
The configuration is set to:

      BUFFER_QUEUE_LIMIT:              64
      BUFFER_SIZE_LIMIT:               32m
      FILE_BUFFER_LIMIT:               2048Mi

Comment 3 Jeff Cantrill 2020-10-19 14:05:04 UTC
I am unable to open the link from #c2 but I see the following in the logs in the customer case:

[2020-10-17T17:17:11,835][WARN ][r.suppressed             ] [elasticsearch-cdm-r3u7lpil-2] path: /_bulk, params: {}
com.fasterxml.jackson.core.JsonParseException: Invalid UTF-8 start byte 0x92
 at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@545c4788; line: 1, column: 73]

and 

Populate config from /opt/app-root/src/sgconfig/
Will update 'security/config' with /opt/app-root/src/sgconfig/config.yml 
   FAIL: Configuration for 'config' failed because of ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]

The error message in #c0 is an indicator that fluentd is unable to push any additional logs, and this could be because the Elasticsearch disk is at its thresholds, where it will no longer accept data.  The attachment does not provide any information about the state of the log store, but my suspicion is that the cronjobs which clean up the disk are failing and thus are not removing indices.  I encourage you to provide the output of [1], which would include disk info. There are also several fixes for the delete cronjobs that are either about to be released for 4.5 or have just been released.

To resolve the latter issue, post [2] suggests the following to allow you to clean up the disk (note: I have not verified this):

* oc exec -c elasticsearch $POD -- es_util --query=_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }' -XPUT
* oc exec -c elasticsearch $POD -- es_util --query=__all/_settings -d '{"index.blocks.read_only_allow_delete": null}' -XPUT

If the cronjobs do not fire or are unable to complete, you may need to manually clean up old indices with something like:

* oc exec -c elasticsearch $POD -- es_util --query=$INDEX -XDELETE

and once logs are being ingested again, I would suggest re-enabling the thresholds:

* oc exec -c elasticsearch $POD -- es_util --query=_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": true } }' -XPUT
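
If it helps to confirm the suspicion above, something like the following (not verified; assumes the default openshift-logging namespace and the $POD variable from the commands above) should show the disk usage and the state of the index-management cronjobs:

* oc -n openshift-logging exec -c elasticsearch $POD -- es_util --query=_cat/allocation?v   (per-node disk usage; disk.percent near the watermarks explains the read-only block)
* oc -n openshift-logging exec -c elasticsearch $POD -- es_util --query=_cat/indices?v   (index sizes, to see whether old indices are piling up)
* oc -n openshift-logging get cronjobs   (the rollover/delete cronjobs)
* oc -n openshift-logging get jobs   (their recent runs and whether they completed)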

[1] https://github.com/openshift/cluster-logging-operator/tree/master/must-gather#cluster-logging-must-gather
[2] https://selleo.com/til/posts/esrgfyxjee-how-to-fix-elasticsearch-forbidden12index-read-only

Comment 4 hgomes 2020-10-19 15:25:11 UTC
Hi Jeff,

Appreciate your reply. I'll share this with the customer.
As per your request, the must-gather for the logging component has been made available to SupportShell, for case '02777138', file: 0020-noprod-logging-logs.tar.gz

Comment 5 hgomes 2020-10-21 12:10:46 UTC
Hi,

At this moment we see these events constantly in the logs of a couple of the fluentd pods. A fresh logging dump was provided: logging-20201020_201938.tar.gz (105.715 KB)

~~~
2020-10-20 16:27:20 +0000 [warn]: [retry_clo_default_output_es] failed to flush the buffer. retry_time=13 next_retry_seconds=2020-10-20 16:32:10 +0000 chunk="5b176d95b836b4199b55ac7f38c39796" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error="could not push logs to Elasticsearch cluster ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200, :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): [500] {\"error\":{\"root_cause\":[{\"type\":\"json_parse_exception\",\"reason\":\"Invalid UTF-8 start byte 0x92\\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@6d9021a8; line: 1, column: 73]\"}],\"type\":\"json_parse_exception\",\"reason\":\"Invalid UTF-8 start byte 0x92\\n at [Source: org.elasticsearch.transport.netty4.ByteBufStreamInput@6d9021a8; line: 1, column: 73]\"},\"status\":500}"
~~~

The allocation is the following:

$ oc exec -c elasticsearch elasticsearch-cdm-r3u7lpil-1-54b5d455b8-qgww2 -- es_util --query=_cat/allocation?v
shards disk.indices disk.used disk.avail disk.total disk.percent host          ip            node
    62      168.5gb   170.4gb    123.7gb    294.2gb           57 10.194.12.30  10.194.12.30  elasticsearch-cdm-r3u7lpil-3
    61      171.6gb   173.5gb    120.7gb    294.2gb           58 10.194.10.243 10.194.10.243 elasticsearch-cdm-r3u7lpil-1
    61      169.2gb   171.2gb      123gb    294.2gb           58 10.194.14.18  10.194.14.18  elasticsearch-cdm-r3u7lpil-2

And all the indices were unblocked:

* oc exec -c elasticsearch $POD -- es_util --query=__all/_settings -d '{"index.blocks.read_only_allow_delete": null}' -XPUT

All the fluentd filesystems are normal:

$ for i in $(oc get pods -l component=fluentd | awk '/fluentd/ { print $1 }') ; do oc exec $i -- du -sh /var/lib/fluentd  ; done
104K    /var/lib/fluentd
32K     /var/lib/fluentd
32M     /var/lib/fluentd
24K     /var/lib/fluentd
36K     /var/lib/fluentd
132K    /var/lib/fluentd
16K     /var/lib/fluentd
32K     /var/lib/fluentd
8.0K    /var/lib/fluentd
3.1M    /var/lib/fluentd
30M     /var/lib/fluentd
38M     /var/lib/fluentd
8.0M    /var/lib/fluentd
452K    /var/lib/fluentd
24K     /var/lib/fluentd
128K    /var/lib/fluentd


But the events are still present in several of the fluentd pods. Do you know why fluentd doesn't connect with Elasticsearch?

Comment 6 Jeff Cantrill 2020-10-23 15:19:57 UTC
Setting UpcomingSprint as unable to resolve before EOD

Comment 7 Jeff Cantrill 2020-10-26 18:14:37 UTC
(In reply to hgomes from comment #5)
> Hi,
> 
> At this moment in a couple of logs of fluentd we found this events
> constantly. A fresh logging dump was provided >
> logging-20201020_201938.tar.gz (105.715 KB)
> 
> ~~~
> 2020-10-20 16:27:20 +0000 [warn]: [retry_clo_default_output_es] failed to
> flush the buffer. retry_time=13 next_retry_seconds=2020-10-20 16:32:10 +0000
> chunk="5b176d95b836b4199b55ac7f38c39796" error_class=Fluent::Plugin::
> ElasticsearchOutput::RecoverableRequestFailure error="could not push
> logs to Elasticsearch cluster
> ({:host=>\"elasticsearch.openshift-logging.svc.cluster.local\", :port=>9200,
> :scheme=>\"https\", :user=>\"fluentd\", :password=>\"obfuscated\"}): [500]
> {\"error\":{\"root_cause\":[{\"type\":\"json_parse_exception\",\"reason\":
> \"Invalid UTF-8 start byte 0x92\\n at [Source:
> org.elasticsearch.transport.netty4.ByteBufStreamInput@6d9021a8; line: 1,
> column: 73]\"}],\"type\":\"json_parse_exception\",\"reason\":\"Invalid UTF-8
> start byte 0x92\\n at [Source:
> org.elasticsearch.transport.netty4.ByteBufStreamInput@6d9021a8; line: 1,
> column: 73]\"},\"status\":500}"
> ~~~

This is an error we have seen before, which we believed to be related to a full disk on the hosts.  I don't believe we can fully attribute it to a full disk.  It may be a difference in encoding between how the log is created, what fluentd expects, and what ES requires. Log entries like this get dumped back into a retry queue, and it may be that we don't properly evaluate whether they can ever be submitted to ES.

> 
> But the events the events are present in several of pods of fluentd. Do you
> know why the fluentd doesn't connect with elasticsearch?

It's not that fluentd cannot connect, but that it is unable to push logs because of the UTF-8 error.  I'll have to investigate what is happening.  Is there any way to get the following information from the customer:
* What is the encoding of the container on the node where you see this failure?
* What is the encoding of the node where you see this failure?
* A few sample log messages from one of the containers that produces this message?
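
If it helps to collect that, something like the following should work (an unverified sketch; $NODE is the affected node, $POD/<pod> an affected application pod, and the grep trick relies on GNU grep running in a UTF-8 locale):

* oc debug node/$NODE -- chroot /host locale   (node encoding)
* oc exec $POD -- env | grep -i -e '^lang' -e '^lc_'   (container locale settings, if any are set)
* oc debug node/$NODE -- chroot /host sh -c "grep -naxv '.*' /var/log/containers/<pod>*.log | head"   (prints sample lines from the on-node container log that are not valid UTF-8)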

Comment 9 Courtney Ruhm 2020-11-30 17:45:48 UTC
Hello,

I also have a customer hitting a similar error. Looking at the fluentd logs and Elasticsearch logs, we see the following.

Fluentd is not able to send logs to ES:

~~~
$ tail -5 fluentd-6gqjj-fluentd.log
  2020-11-30 09:47:03 +0000 [warn]: suppressed same stacktrace
2020-11-30 09:47:49 +0000 [warn]: [clo_default_output_es] failed to flush the buffer. retry_time=2280 next_retry_seconds=2020-11-30 09:53:02 +0000 chunk="5b4bb0d8a6b8922f83e385cf89497dd4" error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure error="buffer is full."
  2020-11-30 09:47:49 +0000 [warn]: suppressed same stacktrace
2020-11-30 09:47:49 +0000 [warn]: [clo_default_output_es] failed to flush the buffer. retry_time=2281 next_retry_seconds=2020-11-30 09:52:47 +0000 chunk="5b4bb0d9a343d016d7c4e256584e76d9" error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure error="buffer is full."
  2020-11-30 09:47:49 +0000 [warn]: suppressed same stacktrace
~~~

- Checking the ES pods, they have some failures starting like this:
~~~
$ cat elasticsearch-cdm-hjoanotz-1-74f9bc7565-8mxgf-elasticsearch.log
...
Will update 'security/config' with /opt/app-root/src/sgconfig/config.yml
   FAIL: Configuration for 'config' failed because of ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]
Will update 'security/roles' with /opt/app-root/src/sgconfig/roles.yml
   FAIL: Configuration for 'roles' failed because of ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]
Will update 'security/rolesmapping' with /opt/app-root/src/sgconfig/roles_mapping.yml
   FAIL: Configuration for 'rolesmapping' failed because of ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]
Will update 'security/internalusers' with /opt/app-root/src/sgconfig/internal_users.yml
   FAIL: Configuration for 'internalusers' failed because of ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow delete (api)];]
Will update 'security/actiongroups' with /opt/app-root/src/sgconfig/action_groups.yml
...
./init.sh: line 35: show_failures: command not found
~~~

I attached a link to a full must-gather the customer uploaded in the above comment.

Comment 10 Jeff Cantrill 2020-11-30 20:18:19 UTC
(In reply to Courtney Ruhm from comment #9)
> Hello,
> 
> I also have a customer reaching a similar error. Looking at the fluentd logs
> and elasticsearch logs we see the following
> 
>  Fluentd is not able to send logs to the ES:
> 
> ~~~
> $ tail -5 fluentd-6gqjj-fluentd.log
>   2020-11-30 09:47:03 +0000 [warn]: suppressed same stacktrace
> 2020-11-30 09:47:49 +0000 [warn]: [clo_default_output_es] failed to flush
> the buffer. retry_time=2280 next_retry_seconds=2020-11-30 09:53:02 +0000
> chunk="5b4bb0d8a6b8922f83e385cf89497dd4"
> error_class=Fluent::Plugin::ElasticsearchOutput::RetryStreamEmitFailure
> error="buffer is full."

This matches the reported issue but is a side-effect of not being able to push logs.  I would also note this is different from the UTF-8 issue.

> - Checking the ES pods, they have some failures starting like this:
> ~~~
> $ cat elasticsearch-cdm-hjoanotz-1-74f9bc7565-8mxgf-elasticsearch.log
> ...
> Will update 'security/config' with /opt/app-root/src/sgconfig/config.yml
>    FAIL: Configuration for 'config' failed because of
> ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow
> delete (api)];]
> Will update 'security/roles' with /opt/app-root/src/sgconfig/roles.yml
>    FAIL: Configuration for 'roles' failed because of
> ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow
> delete (api)];]
> Will update 'security/rolesmapping' with
> /opt/app-root/src/sgconfig/roles_mapping.yml
>    FAIL: Configuration for 'rolesmapping' failed because of
> ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow
> delete (api)];]
> Will update 'security/internalusers' with
> /opt/app-root/src/sgconfig/internal_users.yml
>    FAIL: Configuration for 'internalusers' failed because of
> ClusterBlockException[blocked by: [FORBIDDEN/12/index read-only / allow
> delete (api)];]
> Will update 'security/actiongroups' with
> /opt/app-root/src/sgconfig/action_groups.yml
> ...


These issues are again related to the disk being full and deletes being locked out.  We have another bz here

Comment 11 Jeff Cantrill 2020-11-30 20:23:21 UTC
Possible solution to the UTF-8 issue: https://github.com/uken/fluent-plugin-elasticsearch/issues/77#issuecomment-63049158

Comment 12 Steven Walter 2020-12-01 16:19:23 UTC
Hi, in Courtney's case we have found the disk is not full:

# oc get pods -l component=elasticsearch
NAME                                            READY   STATUS    RESTARTS   AGE
elasticsearch-cdm-hjoanotz-1-74f9bc7565-8mxgf   2/2     Running   0          5d18h
elasticsearch-cdm-hjoanotz-2-7577c969cb-mg8td   2/2     Running   0          5d18h
elasticsearch-cdm-hjoanotz-3-86d99c96d7-b97cz   2/2     Running   0          5d18h
# oc rsh -c elasticsearch elasticsearch-cdm-hjoanotz-1-74f9bc7565-8mxgf
sh-4.2$ curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_all/_settings -d '{"index.blocks.read_only_allow_delete": null}'
curl: (52) Empty reply from server
sh-4.2$ curl -XPUT -H "Content-Type: application/json" http://localhost:9200/_cluster/settings -d '{ "transient": { "cluster.routing.allocation.disk.threshold_enabled": false } }'
curl: (52) Empty reply from server
sh-4.2$ hostname
elasticsearch-cdm-hjoanotz-1-74f9bc7565-8mxgf
sh-4.2$ df -hP
Filesystem      Size  Used Avail Use% Mounted on
overlay         149G   68G   82G  46% /
tmpfs            64M     0   64M   0% /dev
tmpfs            24G     0   24G   0% /sys/fs/cgroup
shm              64M     0   64M   0% /dev/shm
tmpfs            24G   43M   24G   1% /etc/passwd
/dev/sdb        100G   41G   60G  41% /elasticsearch/persistent
/dev/sda3       149G   68G   82G  46% /etc/hosts
tmpfs            24G   28K   24G   1% /etc/openshift/elasticsearch/secret
tmpfs            24G   28K   24G   1% /run/secrets/kubernetes.io/serviceaccount
tmpfs            24G     0   24G   0% /proc/acpi
tmpfs            24G     0   24G   0% /proc/scsi
tmpfs            24G     0   24G   0% /sys/firmware
sh-4.2$

Or should we be checking the node disk space?

> We have another bz here
Should we open a new bz for this other case?

Comment 13 Courtney Ruhm 2020-12-01 21:54:45 UTC
Hello,

Any update for our customer on this bug?

Thank you

Comment 14 Jeff Cantrill 2020-12-02 16:59:42 UTC
(In reply to Steven Walter from comment #12)
> Hi, in Courtney's case we have found the disk is not full:

I will correct my previous statement based on some newer findings related to the rollover and delete cronjobs.  The rollover process is not transactional but is a two-step process behind the scenes.  We believe there is an issue where both steps do not always succeed, which results in the situation described in #c9.
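
As a rough check of whether the two steps are out of sync (an unverified sketch, reusing the $POD convention from #c3), the aliases and indices can be compared:

* oc exec -c elasticsearch $POD -- es_util --query=_cat/aliases?v   (which index each write alias currently points to)
* oc exec -c elasticsearch $POD -- es_util --query=_cat/indices?v   (whether new indices are being created by rollover and old ones removed by the delete job)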


> 
> > We have another bz here
> Should we open a new bz for this other case?

We are working on the fix against https://bugzilla.redhat.com/show_bug.cgi?id=1893992, which is one of several issues related to the index management jobs.

Comment 15 Courtney Ruhm 2020-12-02 18:49:25 UTC
Hello,

So is there any known workaround for this issue, or do we just need to wait for the fix for the other bug?

Thank you

Comment 16 Jeff Cantrill 2020-12-02 19:53:07 UTC
(In reply to Steven Walter from comment #12)
> # oc rsh -c elasticsearch elasticsearch-cdm-hjoanotz-1-74f9bc7565-8mxgf
> sh-4.2$ curl -XPUT -H "Content-Type: application/json"
> http://localhost:9200/_all/_settings -d
> '{"index.blocks.read_only_allow_delete": null}'
> curl: (52) Empty reply from server
> sh-4.2$ curl -XPUT -H "Content-Type: application/json"
> http://localhost:9200/_cluster/settings -d '{ "transient": {
> "cluster.routing.allocation.disk.threshold_enabled": false } }'
> curl: (52) Empty reply from server
> sh-4.2$ hostname
> elasticsearch-cdm-hjoanotz-1-74f9bc7565-8mxgf

These are not the same commands as recommended in #c3, as communication to the server is over https and not plain http.  I suggest using the commands as listed and/or hopping onto the pod as done here and using "es_util".
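
For example, to confirm the cluster is reachable over https before re-running the settings changes from #c3 (a sketch only; es_util wraps curl with the proper certs from inside the elasticsearch container):

* oc exec -c elasticsearch $POD -- es_util --query=_cluster/health?pretty   (overall cluster status)
* oc exec -c elasticsearch $POD -- es_util --query=_cluster/settings?pretty   (shows whether the transient disk-threshold setting took effect)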

It is not clear to me what state the ES cluster is in, given #c9 makes me think it is not properly seeded for the ACLs.  Resolving the storage issue will resolve the issue at the collector.  Might I recommend attaching a 'must-gather' to this issue so we can get a holistic picture of the environment:

https://github.com/openshift/cluster-logging-operator/tree/master/must-gather

Comment 17 Courtney Ruhm 2020-12-03 16:18:25 UTC
Hello Jeff,

Here's the logging must-gather you requested. Can't attach it directly to the case as it's too large. Let me know if this link works or not.

https://attachments.access.redhat.com/hydra/rest/cases/02804669/attachments/f4fd0d76-8e31-43bc-8195-4edecfb193b2?usePresignedUrl=true

Thank you

Comment 19 Jeff Cantrill 2020-12-15 21:12:53 UTC
(In reply to Courtney Ruhm from comment #17)
> Hello Jeff,
> 
> Here's the logging must-gather you requested. Can't attach it directly to
> the case as it's too large. Let me know if this link works or not.
> 
> https://attachments.access.redhat.com/hydra/rest/cases/02804669/attachments/
> f4fd0d76-8e31-43bc-8195-4edecfb193b2?usePresignedUrl=true
> 

I am unable to access this info

Comment 20 Jeff Cantrill 2020-12-16 22:10:29 UTC
I believe this issue is related to https://bugzilla.redhat.com/show_bug.cgi?id=1890838

Comment 21 Jeff Cantrill 2021-01-20 21:45:58 UTC
Closing as the customer cases are closed.  This is a warning message that is likely only resolvable by correcting the overall health of the logging stack.

Comment 22 Red Hat Bugzilla 2023-09-15 00:49:50 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days