Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1669223

Summary: How to get information on root cause of "400 - Rejected by Elasticsearch" errors
Product: OpenShift Container Platform
Component: Logging
Version: 3.7.1
Target Release: 3.7.z
Hardware: Unspecified
OS: Unspecified
Severity: medium
Priority: high
Status: CLOSED WONTFIX
Reporter: Pedro Amoedo <pamoedom>
Assignee: Jeff Cantrill <jcantril>
QA Contact: Anping Li <anli>
CC: aos-bugs, pamoedom, rmeggins
Type: Bug
Last Closed: 2019-01-25 16:24:41 UTC

Attachments:
elasticsearch and fluentd configmaps

Description Pedro Amoedo 2019-01-24 16:13:15 UTC
Created attachment 1523210
elasticsearch and fluentd configmaps

Description of problem:

I have a customer whose users complain that some error logs are not being transferred to Elasticsearch. With fluentd already in debug mode, the relevant information provided by the elasticsearch logs looks like the following example:

2019-01-23 14:23:56 +0100 [warn]: dump an error event: error_class=Fluent::ElasticsearchErrorHandler::ElasticsearchError error="400 - Rejected by Elasticsearch" location=nil tag="output_tag" time=1548249831 record={"caller"=>"transactionsBusiness.go:269", "component"=>"<obfuscated>", "correlationId"=>"d080dfb3-1e83-4dbf-8afa-2682855dad50", "dateTime"=>"2019-01-23T13:23:51.630668Z", "error"=>{"errorCategory"=>"validation", "errorCode"=>"monthlyValueLimitExceeded", "errorDateTime"=>"2019-01-23T13:23:51.623Z", "errorDescription"=>"validation_monthlyValueLimitExceeded", "errorParameters"=>[{"key"=>"onUs", "value"=>"true"}, {"key"=>"onUsRoundAmount", "value"=>"0"}, {"key"=>"onUsWholesaleRate", "value"=>"1"}, {"key"=>"onUsHedgingRate", "value"=>"0.00"}, {"key"=>"onUsTransactionReferenceReceiver", "value"=>"1c609b2a-7a57-41ef-9b1a-b101f2749bb8"}, {"key"=>"onUsAddOnCodeValueReceiver", "value"=>"TRANSFER_AMOUNT"}, {"key"=>"onUsAddOnCodeDescriptionReceiver", "value"=>"RECEIVER_AMOUNT_MONTHLY"}, {"key"=>"onUsReceiverStatus", "value"=>"rejected"}, {"key"=>"onUsReceiverSubStatus", "value"=>"Non refundable"}, {"key"=>"sessionID", "value"=>"1c609b2a-7a57-41ef-9b1a-b101f2749bb8"}]}, "level"=>"err", "msg"=>"error has occurred with receiver partner", "requestId"=>"0137d1e9-200d-4890-9c93-c6f9ae7ac3d9", "sessionId"=>"ba56685f-880d-4ca6-91f4-93c71da919c5", "useCase"=>"<obfuscated>", "version"=>"1.5.0", "docker"=>{"container_id"=>"10b41b8f0154ae2a7429fd93d030a328babab5716f568a0260cc183f8ea73405"}, "kubernetes"=>{"container_name"=>"<obfuscated>", "namespace_name"=>"<obfuscated>", "pod_name"=>"<obfuscated>", "pod_id"=>"b24d1a9f-15a3-11e9-8598-fa163ea64709", "labels"=>{"app"=>"<obfuscated>", "component"=>"<obfuscated>", "country"=>"<obfuscated>", "deployment"=>"<obfuscated>", "deploymentConfig"=>"<obfuscated>", "deploymentconfig"=>"i<obfuscated>", "version"=>"1-5-0-2"}, "host"=>"<obfuscated>", "master_url"=>"https://kubernetes.default.svc.cluster.local", 
"namespace_id"=>"5c6a879d-f892-11e8-8d58-fa163ec49403", "namespace_labels"=>{"billing_account"=>"<obfuscated>", "project_type"=>"free"}}, "message"=>"{\"caller\":\"transactionsBusiness.go:269\",\"component\":\"<obfuscated>\",\"correlationId\":\"d080dfb3-1e83-4dbf-8afa-2682855dad50\",\"dateTime\":\"2019-01-23T13:23:51.630668Z\",\"error\":{\"errorCategory\":\"validation\",\"errorCode\":\"monthlyValueLimitExceeded\",\"errorDateTime\":\"2019-01-23T13:23:51.623Z\",\"errorDescription\":\"validation_monthlyValueLimitExceeded\",\"errorParameters\":[{\"key\":\"onUs\",\"value\":\"true\"},{\"key\":\"onUsRoundAmount\",\"value\":\"0\"},{\"key\":\"onUsWholesaleRate\",\"value\":\"1\"},{\"key\":\"onUsHedgingRate\",\"value\":\"0.00\"},{\"key\":\"onUsTransactionReferenceReceiver\",\"value\":\"1c609b2a-7a57-41ef-9b1a-b101f2749bb8\"},{\"key\":\"onUsAddOnCodeValueReceiver\",\"value\":\"TRANSFER_AMOUNT\"},{\"key\":\"onUsAddOnCodeDescriptionReceiver\",\"value\":\"RECEIVER_AMOUNT_MONTHLY\"},{\"key\":\"onUsReceiverStatus\",\"value\":\"rejected\"},{\"key\":\"onUsReceiverSubStatus\",\"value\":\"Non refundable\"},{\"key\":\"sessionID\",\"value\":\"1c609b2a-7a57-41ef-9b1a-b101f2749bb8\"}]},\"level\":\"error\",\"msg\":\"error has occurred with receiver partner\",\"requestId\":\"0137d1e9-200d-4890-9c93-c6f9ae7ac3d9\",\"sessionId\":\"ba56685f-880d-4ca6-91f4-93c71da919c5\",\"useCase\":\"<obfuscated>\",\"version\":\"1.5.0\"}\n", "hostname"=>"<obfuscated>", "pipeline_metadata"=>{"collector"=>{"ipaddr4"=>"<obfuscated>", "ipaddr6"=>"<obfuscated>", "inputname"=>"fluent-plugin-systemd", "name"=>"fluentd", "received_at"=>"2019-01-23T13:23:52.119442+00:00", "version"=>"0.12.42 1.6.0"}}, "@timestamp"=>"2019-01-23T13:23:51.631841+00:00", "viaq_index_name"=>"project.<obfuscated>.5c6a879d-f892-11e8-8d58-fa163ec49403.2019.01.23", "viaq_msg_id"=>"Y2MzYTNmYWMtMjc4My00ZTYxLTk3YmQtOTJjZDljMDVkOGI3"}
2019-01-23 14:23:56 +0100 [debug]: Indexed (op = create), 97 successes, 1 bad_argument

Version-Release number of selected component (if applicable):

OCP 3.7.72

GEMS:
[info]: gem 'fluent-mixin-config-placeholders' version '0.4.0'
[info]: gem 'fluent-plugin-docker_metadata_filter' version '0.1.1'
[info]: gem 'fluent-plugin-elasticsearch' version '1.17.0'
[info]: gem 'fluent-plugin-flatten-hash' version '0.4.0'
[info]: gem 'fluent-plugin-kubernetes_metadata_filter' version '1.0.1'
[info]: gem 'fluent-plugin-record-modifier' version '0.6.2'
[info]: gem 'fluent-plugin-remote-syslog' version '1.1'
[info]: gem 'fluent-plugin-rewrite-tag-filter' version '1.5.6'
[info]: gem 'fluent-plugin-secure-forward' version '0.4.5'
[info]: gem 'fluent-plugin-systemd' version '0.0.9'
[info]: gem 'fluent-plugin-viaq_data_model' version '0.0.14'
[info]: gem 'fluentd' version '0.12.42'

Logging configmaps: please find attached (PFA).

NOTE1: The customer also had logging-elasticsearch-ops running in the past, but it is currently disabled.

NOTE2: As part of our tests, the customer also raised the "FILE_BUFFER_LIMIT" variable to 512Mi, which currently results in an effective "buffer_queue_limit" of 64.

How reproducible:
n/a

Steps to Reproduce:
n/a

Actual results:

The Elasticsearch error ("400 - Rejected by Elasticsearch") remains undetailed, even with fluentd in debug mode.

Expected results:

A more detailed error, perhaps something like this:

[warn]: #0 dump an error event: error_class=Fluent::Plugin::ElasticsearchErrorHandler::ElasticsearchError error="400 - Rejected by Elasticsearch [error type]: illegal_argument_exception [reason]: 'Limit of total fields [1000] in index [xxxxxx] has been exceeded'"

Additional info:

For reference, a third-party patch I found that modifies fluent-plugin-elasticsearch to provide more error detail; I am not sure whether it is applicable here, but it may serve as a reference:
https://github.com/uken/fluent-plugin-elasticsearch/pull/469

Comment 3 Jeff Cantrill 2019-01-25 16:24:41 UTC
This is most likely caused by the merging of JSON logs, which is enabled by default [1].  In our pending 4.0 release, we disabled this feature by default because of various issues, of which this is one.  The problem is that your applications are likely logging a JSON message payload that is merged into the payload fluentd submits to Elasticsearch.  The addition of your application fields to fluentd's payload exceeds the maximum allowed number of fields, and ES rejects the messages.  The only resolution is to disable this feature so that, when new indices are created and logs are added to them, you do not exceed this threshold.

We expose disabling via an ENV var in later releases, but for 3.7:

1. Create a configmap consisting of these files (named, e.g., fluentd-overrides) [2]
2. Edit [3] to include 'merge_json_log false'
3. Edit the DaemonSet to mount the configmap

add section to 'volumes':

      - name: config-overrides
        configMap:
          name: fluentd-overrides 

add section to 'volumeMounts':
        - name: config-overrides
          mountPath: /etc/fluent/configs.d/openshift
          readOnly: true

Note: The changes to the DaemonSet will need to be reapplied after any upgrades
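The steps above can be sketched as oc commands. This is a hedged sketch, not a verified procedure: it assumes the default 3.7 object names (project "logging", DaemonSet "logging-fluentd", pod label "component=fluentd") and that the files from [2] were downloaded to a local ./openshift/ directory with [3] already edited; adjust for your cluster.

```shell
# Assumes the config files from [2] are in ./openshift/ and that
# filter-k8s-meta.conf was edited to include 'merge_json_log false'.
oc project logging

# Step 1: create the configmap from the edited files
oc create configmap fluentd-overrides --from-file=./openshift/

# Steps 2-3: mount it over the shipped config directory in the DaemonSet
oc set volume daemonset/logging-fluentd --add \
    --name=config-overrides \
    --type=configmap \
    --configmap-name=fluentd-overrides \
    --mount-path=/etc/fluent/configs.d/openshift \
    --read-only

# Restart the fluentd pods so they pick up the new configuration
oc delete pods -l component=fluentd
```

As noted above, these DaemonSet changes are not preserved across upgrades and would need to be reapplied.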

[1] https://github.com/fabric8io/fluent-plugin-kubernetes_metadata_filter/blob/v1.0.1/lib/fluent/plugin/filter_kubernetes_metadata.rb#L51
[2] https://github.com/openshift/origin-aggregated-logging/tree/release-3.7/fluentd/configs.d/openshift
[3] https://github.com/openshift/origin-aggregated-logging/blob/release-3.7/fluentd/configs.d/openshift/filter-k8s-meta.conf

We recommend upgrading to 3.11, which is the long-term support release.  Closing this issue as WONTFIX.

Comment 5 Pedro Amoedo 2019-01-25 16:34:34 UTC
(In reply to Jeff Cantrill from comment #3)

Thanks Jeff, I will pass those instructions to the customer to verify the solution.

Comment 8 Pedro Amoedo 2019-01-28 12:45:32 UTC
(In reply to Pedro Amoedo from comment #5)

Hi Jeff, the customer has confirmed the workaround, but deactivating the JSON merging has introduced another problem with indexed fields. Here are their exact words:

---------------
Thanks for this information and feedback.
The change has been applied on the different clusters.
Everything has seemed OK for the past 3 hours, and we no longer see the "400 - Rejected by Elasticsearch" error.
The root cause seems to have been found.

However, there is a drawback to deactivating "merge json logs".
From what we have seen, there are no more indexes and fields in Kibana based on application data.
Let me explain with an example.
The applications fill the "message" field with data like this (an example to illustrate):
{"caller":"casaPartnerDAO.go:167","component":"CASA_SENDER","country":"CI","dateTime":"2019-01-28T09:17:04.219272Z","level":"info","msg":"receive error","requestId":"0aae2aee-7e90-489d-9e66-b1d8ad67e824","sessionId":"3d92fcd6-9be5-4605-9289-7ed051b20aa9","subject":"transfers.3d92fcd6-9be5-4605-9289-7ed051b20aa9.error","useCase":"ASyncIRTConfirmRequest","version":"1.5.1-RC10"}

When "merge json log" is active:
applications can filter and perform searches in Kibana based on sessionId or any other field inside "message".

When "merge json log" is not active:
we can only see the Kubernetes fields in Kibana to search for information.

So, sorry for insisting, but it is now quite difficult for users to perform queries based only on Kubernetes fields.
Is it possible to raise the maximum allowed number of fields in Elasticsearch rather than deactivating this feature?
---------------

Any suggestion/workaround about this?

Thanks in advance.
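For reference, the field cap in the customer's question corresponds to the index.mapping.total_fields.limit setting on Elasticsearch versions that enforce it (5.x and later). A hedged sketch with a hypothetical host and index name; as the following comments explain, raising the limit does not address the underlying mapping conflicts:

```shell
# Raise the total-fields limit on an existing index (hypothetical
# host and index name; only on ES versions that support this setting)
curl -XPUT 'http://localhost:9200/project.example.2019.01.23/_settings' \
    -H 'Content-Type: application/json' \
    -d '{"index.mapping.total_fields.limit": 2000}'
```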

Comment 9 Jeff Cantrill 2019-01-28 17:03:42 UTC
This [1] is the reason we recommend disabling it.  You are correct that it removes the possibility of simplified queries for user applications, but those are the trade-offs described.  There is no alternative at the moment, but we plan to re-enable it in a future release once we can address the issues identified.

[1] https://github.com/openshift/origin-aggregated-logging/issues/1492

Comment 10 Rich Megginson 2019-01-28 17:09:24 UTC
> So, sorry for insisting, but it is now quite difficult for users to perform queries based only on Kubernetes fields.
> Is it possible to raise the maximum allowed number of fields in Elasticsearch rather than deactivating this feature?

I don't think it is just that there is a maximum number of fields; it is that we have no way to control the format of the fields coming out of the application.  That is, Elasticsearch does not like it if you write a field named "mydata" like this:

{..., "mydata":"some string value", ....}
then later write "mydata" with a different value like
{..., "mydata":42.5, ....}
then later
{..., "mydata":{"anotherfield":"some string value"}, ....}

Elasticsearch treats this as a "schema" violation, and returns error 400.

I think this is the source of the error 400 reported by fluentd but we won't know for sure until we get the corroborating fluentd and elasticsearch logs.
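The sequence above can be reproduced against a scratch index. A toy sketch, assuming a locally reachable Elasticsearch; the host, index, and field names are hypothetical, and the exact endpoint and error body vary by Elasticsearch version:

```shell
# First write: dynamic mapping infers "mydata" as a string field
curl -XPOST 'http://localhost:9200/demo/doc' \
    -H 'Content-Type: application/json' \
    -d '{"mydata": "some string value"}'

# Later write with an object value for the same field: Elasticsearch
# rejects it with HTTP 400 -- the "schema" violation described above
curl -XPOST 'http://localhost:9200/demo/doc' \
    -H 'Content-Type: application/json' \
    -d '{"mydata": {"anotherfield": "some string value"}}'
```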

Comment 12 Pedro Amoedo 2019-01-28 17:46:43 UTC
(In reply to Rich Megginson from comment #10)

Thanks for the graphic explanation, Rich. Please find attached (PFA) the requested logs.