Bug 1866019 - elasticsearch-rollover and elasticsearch-delete pods in Error states
Summary: elasticsearch-rollover and elasticsearch-delete pods in Error states
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.6.0
Assignee: Jeff Cantrill
QA Contact: Qiaoling Tang
URL:
Whiteboard: logging-exploration osd-45-logging
Duplicates: 1866963
Depends On:
Blocks: 1868675 1896578
 
Reported: 2020-08-04 16:41 UTC by Steven Walter
Modified: 2024-06-13 22:51 UTC
CC List: 34 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-03-08 13:55:35 UTC
Target Upstream Version:
Embargoed:
tcort: needinfo-


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift elasticsearch-operator pull 454 0 None closed Bug 1866019: Delete indices in batches 2021-02-17 13:08:15 UTC
Github openshift elasticsearch-operator pull 470 0 None closed Bug 1866019: Remove space from joining deleting indices call 2021-02-17 13:08:16 UTC
Github openshift elasticsearch-operator pull 477 0 None closed Bug 1866019: Add more error checking to delete script 2021-02-17 13:08:16 UTC
Red Hat Knowledge Base (Solution) 5410091 0 None None None 2020-09-16 14:12:45 UTC
Red Hat Product Errata RHBA-2020:4198 0 None None None 2020-10-27 15:10:14 UTC

Description Steven Walter 2020-08-04 16:41:40 UTC
Description of problem:
On a newly installed cluster-logging instance on 4.5 using the log retention/rollover setup, the -rollover- and -delete- pods enter an Error state.

Version-Release number of selected component (if applicable):

RHOCP 4.5.2
EFK   4.5.0-202007012112.p0


How reproducible:
Unconfirmed


Actual results:

+ the log of "elasticsearch-delete-audit-1234567-xxxxx"

Traceback (most recent call last):
  File "<string>", line 4, in <module>
TypeError: 'int' object has no attribute '__getitem__'


+ the log of "elasticsearch-delete-infra-1234567-xxxxx"

deleting indices: infra-xxxxxx
{
  "error" : {
    "root_cause" : [
      {
        "type" : "security_exception",
        "reason" : "Unexpected exception indices:admin/delete"
      }
    ],
    "type" : "security_exception",
    "reason" : "Unexpected exception indices:admin/delete"
  },
  "status" : 500
}

+ the log of "elasticsearch-rollover-infra-1234567-xxxxx"
{
  "error" : {
    "root_cause" : [
      {
        "type" : "security_exception",
        "reason" : "Unexpected exception indices:admin/rollover"
      }
    ],
    "type" : "security_exception",
    "reason" : "Unexpected exception indices:admin/rollover"
  },
  "status" : 500
}

Comment 6 Gurenko Alex 2020-08-05 11:36:14 UTC
I have the same problem on my test environment. Occasionally a delete pod fails with the same issue. I can provide access if that would help.

Comment 7 Jeff Cantrill 2020-08-05 13:25:14 UTC
Do the pods continually fail, or is it only while Elasticsearch is starting, which would be expected? We leave at least one failed pod for review, so there may always be an error pod in the list.

Comment 9 Jeff Cantrill 2020-08-07 19:09:07 UTC
*** Bug 1866963 has been marked as a duplicate of this bug. ***

Comment 10 tcort 2020-08-07 19:17:10 UTC
I also have this problem. I see pods fail intermittently throughout the day. Currently I have two delete pod failures, about 5 hours apart, and a rollover pod failure 15 minutes after the 2nd delete pod failure.

The cronjobs for these pods have them running at what looks to be 15-minute intervals (*/15 * * * *), which means the failure rate isn't particularly high, but any of these failures causes concern for the customer.

Comment 11 Jeff Cantrill 2020-08-07 20:23:09 UTC
(In reply to tcort from comment #10)
> I also have this problem. I see pods fail intermittently throughout the
> day. Currently I have two delete pod failures, about 5 hours apart, and a
> rollover pod failure 15 minutes after the 2nd delete pod failure.
> 
> The cronjobs for these pods have them running at what looks to be 15-minute
> intervals (*/15 * * * *), which means the failure rate isn't particularly
> high, but any of these failures causes concern for the customer.

I have typically only seen this issue when first deploying a logging cluster, where the cronjobs are created before the Elasticsearch pods are accepting traffic. Please attach any logs that may be useful to help resolve the issue. It's worth trying to resolve why these failures exist, but if you have more successful jobs than failed ones and the data is being curated (e.g. index rollover and deletion), the priority of this issue is low.

Comment 12 Jeff Cantrill 2020-08-10 20:32:57 UTC
Please attach the artifacts from [1] along with the logs for the failed cronjobs.

[1] https://github.com/openshift/origin-aggregated-logging/blob/release-4.5/hack/logging-dump.sh

Comment 16 Jeff Cantrill 2020-08-12 16:05:07 UTC
This is likely tangential, but looking at the artifacts, it looks like you could benefit from the latest images [1], which were released 10-Aug. I am unable to reproduce locally, but I don't have the same volume of logs. I'll investigate further to see if I can alter the script to provide more info.


[1] https://bugzilla.redhat.com/show_bug.cgi?id=1860135

Comment 17 Jeff Cantrill 2020-08-12 18:20:46 UTC
One way to trigger the __getitem__ issue is a response like:
{"error":{"root_cause":[{"type":"index_not_found_exception","reason":"no such index","index_uuid":"_na_","resource.type":"index_or_alias","resource.id":"test","index":"test"}

Comment 18 Jeff Cantrill 2020-08-12 18:35:11 UTC
Added 566 indices and came across:

{"error":{"root_cause":[{"type":"too_long_frame_exception","reason":"An HTTP line is larger than 4096 bytes."}],"type":"too_long_frame_exception","reason":"An HTTP line is larger than 4096 bytes."},"status":400

Comment 19 Jeff Cantrill 2020-08-12 19:10:50 UTC
(In reply to Hideshi Fukumoto from comment #15)
> Hi,
> Did you get any news/information from the attached logging dump?
> My customer strongly needs the fix or a workaround to mitigate this issue,
> because he plans to start the cluster service on September 1st.
> So, he has to complete his testing by August 20th at least.

I would disagree that this is a high-severity issue, but there is something here to fix. We need to change how we execute the deletion of indices, even though this looks like it may be related to an expired token. The delete can be worked around by running the following for older indices:

oc exec -c elasticsearch $pod -- es_util --query=<index_name> -XDELETE

You may need to refer to [1] to determine which indices, but this should give you an idea of the ones that would be eligible to be deleted.

Also, evaluating the response code more robustly may allow us to provide better log information.

[1] https://github.com/openshift/elasticsearch-operator/blob/release-4.5/pkg/indexmanagement/scripts.go#L40-L57
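
Expanding on the workaround above, a hedged example (the namespace, label selector, and index names are illustrative; adjust for your cluster):

pod=$(oc -n openshift-logging get pods -l component=elasticsearch -o jsonpath='{.items[0].metadata.name}')
oc -n openshift-logging exec -c elasticsearch "$pod" -- es_util --query=_cat/indices?v   # review what exists
for index in app-000001 infra-000002; do     # the indices you judged eligible per [1]
  oc -n openshift-logging exec -c elasticsearch "$pod" -- es_util --query="$index" -XDELETE
done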

Comment 24 Qiaoling Tang 2020-08-18 02:14:33 UTC
Testing is blocked by https://bugzilla.redhat.com/show_bug.cgi?id=1869117.

Comment 26 Qiaoling Tang 2020-08-24 08:49:01 UTC
Verified with elasticsearch-operator.4.6.0-202008202200.p0

Comment 29 Qiaoling Tang 2020-08-31 08:42:05 UTC
Hit the issue again when testing with elasticsearch-operator.4.6.0-202008291540.p0:

$ oc get pod
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-6f94cf9f9-tgm9f        1/1     Running     0          7h52m
elasticsearch-cdm-b0g2npxl-1-78b8769956-lzf9x   2/2     Running     0          66m
elasticsearch-cdm-b0g2npxl-2-d9fcbd799-l5jj6    2/2     Running     0          66m
elasticsearch-cdm-b0g2npxl-3-8578bd869d-lhwrc   2/2     Running     0          66m
elasticsearch-delete-app-1598861700-mx89d       0/1     Completed   0          18m
elasticsearch-delete-app-1598862600-phr9k       0/1     Error       0          3m19s
elasticsearch-delete-audit-1598862600-9plcp     0/1     Completed   0          3m19s
elasticsearch-delete-infra-1598862600-ptd5v     0/1     Completed   0          3m19s
elasticsearch-rollover-app-1598862600-xctn2     0/1     Completed   0          3m19s
elasticsearch-rollover-audit-1598862600-rczfr   0/1     Completed   0          3m18s
elasticsearch-rollover-infra-1598862600-55fnn   0/1     Completed   0          3m18s
fluentd-76c7q                                   1/1     Running     0          66m
fluentd-7k7cj                                   1/1     Running     0          66m
fluentd-brj6w                                   1/1     Running     0          66m
fluentd-m9vcb                                   1/1     Running     0          66m
fluentd-vj72r                                   1/1     Running     0          66m
fluentd-xtzxt                                   1/1     Running     0          66m
kibana-d85d4bb76-hf5hq                          2/2     Running     0          66m

$ oc logs elasticsearch-delete-app-1598862600-phr9k 
Traceback (most recent call last):
  File "<string>", line 4, in <module>
TypeError: 'int' object has no attribute '__getitem__'

No logs found in elasticsearch-rollover-app-1598862600-xctn2.

The ES cluster is Green, and has enough disk space.

There are lots of logs in the EO:
{"level":"info","ts":1598862589.0276558,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:80","msg":"reconciling index management","mapping":"app"}
{"level":"info","ts":1598862589.0835047,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:80","msg":"reconciling index management","mapping":"infra"}
{"level":"info","ts":1598862589.13504,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:80","msg":"reconciling index management","mapping":"audit"}
{"level":"info","ts":1598862619.0736885,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:80","msg":"reconciling index management","mapping":"app"}
{"level":"info","ts":1598862619.122707,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:80","msg":"reconciling index management","mapping":"infra"}
{"level":"info","ts":1598862619.16467,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:80","msg":"reconciling index management","mapping":"audit"}
{"level":"info","ts":1598862620.095908,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:80","msg":"reconciling index management","mapping":"app"}
{"level":"info","ts":1598862620.1660204,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:80","msg":"reconciling index management","mapping":"infra"}
{"level":"info","ts":1598862620.226318,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:80","msg":"reconciling index management","mapping":"audit"}

Comment 31 Jeff Cantrill 2020-09-12 01:53:01 UTC
Moving to UpcomingSprint while awaiting PRs to merge, etc.

Comment 35 Qiaoling Tang 2020-09-21 03:08:19 UTC
Hit the issue again:

$ oc get pod
NAME                                            READY   STATUS      RESTARTS   AGE
cluster-logging-operator-d487668d9-9x9nj        1/1     Running     0          74m
elasticsearch-cdm-r8jw2yat-1-76f5c5b5b8-lfcr6   2/2     Running     0          73m
elasticsearch-cdm-r8jw2yat-2-867ff46bd7-q6r2r   2/2     Running     0          73m
elasticsearch-cdm-r8jw2yat-3-78bb547c58-2l975   2/2     Running     0          73m
elasticsearch-delete-app-1600655400-mcnlr       0/1     Completed   0          23m
elasticsearch-delete-app-1600656300-2tjdh       0/1     Error       0          8m22s
elasticsearch-delete-audit-1600656300-7f595     0/1     Completed   0          8m22s
elasticsearch-delete-infra-1600656300-mst6c     0/1     Completed   0          8m22s
elasticsearch-rollover-app-1600656300-smzzj     0/1     Completed   0          8m22s
elasticsearch-rollover-audit-1600656300-79bxd   0/1     Completed   0          8m22s
elasticsearch-rollover-infra-1600656300-h478s   0/1     Completed   0          8m22s
fluentd-48g94                                   1/1     Running     0          73m
fluentd-6bgzc                                   1/1     Running     0          73m
fluentd-h528z                                   1/1     Running     0          73m
fluentd-llc65                                   1/1     Running     0          73m
fluentd-mwng7                                   1/1     Running     0          73m
fluentd-wcfhn                                   1/1     Running     0          73m
kibana-66688f69db-wk47w                         2/2     Running     0          72m

$ oc logs elasticsearch-delete-app-1600656300-2tjdh 
{"error":{"root_cause":[{"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"}],"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"},"status":500}
Error while attemping to determine the active write alias: {"error":{"root_cause":[{"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"}],"type":"security_exception","reason":"Unexpected exception indices:admin/aliases/get"},"status":500}

$ oc get csv
NAME                                           DISPLAY                  VERSION                 REPLACES   PHASE
clusterlogging.4.6.0-202009192030.p0           Cluster Logging          4.6.0-202009192030.p0              Succeeded
elasticsearch-operator.4.6.0-202009192030.p0   Elasticsearch Operator   4.6.0-202009192030.p0              Succeeded


Found some errors in the ES nodes:

elasticsearch-cdm-r8jw2yat-2-867ff46bd7-q6r2r(the master node):
[2020-09-21T02:45:04,100][INFO ][o.e.c.m.MetaDataCreateIndexService] [elasticsearch-cdm-r8jw2yat-2] [infra-000005] creating index, cause [rollover_index], templates [com.redhat.viaq-openshift-operations.template.json, common.settings.operations.template.json, ocp-gen-infra], shards [3]/[1], mappings [_doc]
[2020-09-21T02:45:04,342][INFO ][o.e.c.m.MetaDataMappingService] [elasticsearch-cdm-r8jw2yat-2] [infra-000005/0uS9S_4kSxuGBcrUrmJktQ] update_mapping [_doc]
[2020-09-21T02:45:04,506][INFO ][o.e.c.m.MetaDataMappingService] [elasticsearch-cdm-r8jw2yat-2] [infra-000005/0uS9S_4kSxuGBcrUrmJktQ] update_mapping [_doc]
[2020-09-21T02:45:04,509][INFO ][o.e.c.m.MetaDataMappingService] [elasticsearch-cdm-r8jw2yat-2] [infra-000005/0uS9S_4kSxuGBcrUrmJktQ] update_mapping [_doc]
[2020-09-21T02:45:04,512][INFO ][o.e.c.m.MetaDataMappingService] [elasticsearch-cdm-r8jw2yat-2] [infra-000005/0uS9S_4kSxuGBcrUrmJktQ] update_mapping [_doc]
[2020-09-21T02:45:04,580][INFO ][o.e.c.m.MetaDataMappingService] [elasticsearch-cdm-r8jw2yat-2] [infra-000005/0uS9S_4kSxuGBcrUrmJktQ] update_mapping [_doc]
[2020-09-21T02:45:04,618][INFO ][o.e.c.m.MetaDataMappingService] [elasticsearch-cdm-r8jw2yat-2] [infra-000005/0uS9S_4kSxuGBcrUrmJktQ] update_mapping [_doc]
[2020-09-21T02:45:04,651][INFO ][o.e.c.m.MetaDataMappingService] [elasticsearch-cdm-r8jw2yat-2] [infra-000005/0uS9S_4kSxuGBcrUrmJktQ] update_mapping [_doc]
[2020-09-21T02:45:04,698][INFO ][o.e.c.m.MetaDataMappingService] [elasticsearch-cdm-r8jw2yat-2] [infra-000005/0uS9S_4kSxuGBcrUrmJktQ] update_mapping [_doc]
[2020-09-21T02:45:04,784][INFO ][o.e.c.m.MetaDataCreateIndexService] [elasticsearch-cdm-r8jw2yat-2] [app-000002] creating index, cause [rollover_index], templates [com.redhat.viaq-openshift-project.template.json, common.settings.project.template.json, ocp-gen-app], shards [3]/[1], mappings [_doc]
[2020-09-21T02:45:04,843][ERROR][c.a.o.s.f.OpenDistroSecurityFilter] [elasticsearch-cdm-r8jw2yat-2] Unexpected exception ElasticsearchSecurityException[_opendistro_security_dls_query does not match (SG 900D)]
org.elasticsearch.ElasticsearchSecurityException: _opendistro_security_dls_query does not match (SG 900D)
	at com.amazon.opendistroforelasticsearch.security.privileges.DlsFlsEvaluator.evaluate(DlsFlsEvaluator.java:133) ~[opendistro_security-0.10.1.0-redhat-1.jar:0.10.1.0-redhat-1]
	at com.amazon.opendistroforelasticsearch.security.privileges.PrivilegesEvaluator.evaluate(PrivilegesEvaluator.java:441) ~[opendistro_security-0.10.1.0-redhat-1.jar:0.10.1.0-redhat-1]
	at com.amazon.opendistroforelasticsearch.security.filter.OpenDistroSecurityFilter.apply0(OpenDistroSecurityFilter.java:253) [opendistro_security-0.10.1.0-redhat-1.jar:0.10.1.0-redhat-1]
	at com.amazon.opendistroforelasticsearch.security.filter.OpenDistroSecurityFilter.apply(OpenDistroSecurityFilter.java:119) [opendistro_security-0.10.1.0-redhat-1.jar:0.10.1.0-redhat-1]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:139) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:89) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:80) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at com.amazon.opendistroforelasticsearch.security.ssl.transport.OpenDistroSecuritySSLRequestHandler.messageReceivedDecorate(OpenDistroSecuritySSLRequestHandler.java:194) [opendistro_security_ssl-0.10.0.3-redhat-1.jar:0.10.0.3-redhat-1]
	at com.amazon.opendistroforelasticsearch.security.transport.OpenDistroSecurityRequestHandler.messageReceivedDecorate(OpenDistroSecurityRequestHandler.java:284) [opendistro_security-0.10.1.0-redhat-1.jar:0.10.1.0-redhat-1]
	at com.amazon.opendistroforelasticsearch.security.ssl.transport.OpenDistroSecuritySSLRequestHandler.messageReceived(OpenDistroSecuritySSLRequestHandler.java:166) [opendistro_security_ssl-0.10.0.3-redhat-1.jar:0.10.0.3-redhat-1]
	at com.amazon.opendistroforelasticsearch.security.OpenDistroSecurityPlugin$7$1.messageReceived(OpenDistroSecurityPlugin.java:646) [opendistro_security-0.10.1.0-redhat-1.jar:0.10.1.0-redhat-1]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1087) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1046) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:932) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:763) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:53) [transport-netty4-client-6.8.1.redhat-5.jar:6.8.1.redhat-5]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) [netty-codec-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1436) [netty-handler-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1203) [netty-handler-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1247) [netty-handler-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502) [netty-codec-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441) [netty-codec-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278) [netty-codec-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) [netty-common-4.1.32.Final.jar:4.1.32.Final]
	at java.lang.Thread.run(Thread.java:834) [?:?]
[2020-09-21T02:45:04,971][INFO ][o.e.c.m.MetaDataIndexTemplateService] [elasticsearch-cdm-r8jw2yat-2] adding template [ocp-gen-app] for index patterns [app*]
[2020-09-21T02:45:05,044][INFO ][o.e.c.m.MetaDataIndexTemplateService] [elasticsearch-cdm-r8jw2yat-2] adding template [ocp-gen-infra] for index patterns [infra*]
[2020-09-21T02:45:05,102][INFO ][o.e.c.m.MetaDataIndexTemplateService] [elasticsearch-cdm-r8jw2yat-2] adding template [ocp-gen-audit] for index patterns [audit*]
[2020-09-21T02:45:05,177][INFO ][o.e.c.m.MetaDataMappingService] [elasticsearch-cdm-r8jw2yat-2] [infra-000005/0uS9S_4kSxuGBcrUrmJktQ] update_mapping [_doc]
[2020-09-21T02:45:05,480][INFO ][o.e.c.m.MetaDataMappingService] [elasticsearch-cdm-r8jw2yat-2] [infra-000005/0uS9S_4kSxuGBcrUrmJktQ] update_mapping [_doc]
[2020-09-21T02:45:05,504][INFO ][o.e.c.r.a.AllocationService] [elasticsearch-cdm-r8jw2yat-2] Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[app-000002][0]] ...]).
[2020-09-21T02:45:06,457][INFO ][o.e.c.m.MetaDataMappingService] [elasticsearch-cdm-r8jw2yat-2] [app-000002/CTpGx2IWRC60uMz8_o0DRA] update_mapping [_doc]


elasticsearch-cdm-r8jw2yat-3-78bb547c58-2l975:
[2020-09-21T01:41:41,060][INFO ][c.a.o.s.c.IndexBaseConfigurationRepository] [elasticsearch-cdm-r8jw2yat-3] Node 'elasticsearch-cdm-r8jw2yat-3' initialized
[2020-09-21T02:45:04,859][WARN ][r.suppressed             ] [elasticsearch-cdm-r8jw2yat-3] path: /app-*/_alias/app-write, params: {name=app-write, index=app-*}
org.elasticsearch.transport.RemoteTransportException: [elasticsearch-cdm-r8jw2yat-2][10.128.2.16:9300][indices:admin/aliases/get]
Caused by: org.elasticsearch.ElasticsearchSecurityException: Unexpected exception indices:admin/aliases/get
	at com.amazon.opendistroforelasticsearch.security.filter.OpenDistroSecurityFilter.apply0(OpenDistroSecurityFilter.java:274) ~[?:?]
	at com.amazon.opendistroforelasticsearch.security.filter.OpenDistroSecurityFilter.apply(OpenDistroSecurityFilter.java:119) ~[?:?]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165) ~[elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:139) ~[elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:89) ~[elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.action.support.HandledTransportAction$TransportHandler.messageReceived(HandledTransportAction.java:80) ~[elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at com.amazon.opendistroforelasticsearch.security.ssl.transport.OpenDistroSecuritySSLRequestHandler.messageReceivedDecorate(OpenDistroSecuritySSLRequestHandler.java:194) ~[?:?]
	at com.amazon.opendistroforelasticsearch.security.transport.OpenDistroSecurityRequestHandler.messageReceivedDecorate(OpenDistroSecurityRequestHandler.java:284) ~[?:?]
	at com.amazon.opendistroforelasticsearch.security.ssl.transport.OpenDistroSecuritySSLRequestHandler.messageReceived(OpenDistroSecuritySSLRequestHandler.java:166) ~[?:?]
	at com.amazon.opendistroforelasticsearch.security.OpenDistroSecurityPlugin$7$1.messageReceived(OpenDistroSecurityPlugin.java:646) ~[?:?]
	at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1087) ~[elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1046) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:932) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:763) [elasticsearch-6.8.1.redhat-6.jar:6.8.1.redhat-6]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:53) [transport-netty4-client-6.8.1.redhat-5.jar:6.8.1.redhat-5]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) [netty-codec-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1436) [netty-handler-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1203) [netty-handler-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1247) [netty-handler-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502) [netty-codec-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441) [netty-codec-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278) [netty-codec-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) [netty-common-4.1.32.Final.jar:4.1.32.Final]
	at java.lang.Thread.run(Thread.java:834) [?:?]

Comment 36 Jeff Cantrill 2020-09-21 17:38:33 UTC
Moving this back to ON_QA as I am unable to verify #c35, which looks to be a completely unrelated issue. I did the following:

* Deployed latest 4.6 CLO
* Deployed latest 4.6 EO
* Updated CL instance to be unmanaged
* Edited the infra policy to poll every minute, roll over every 2m, and delete every 4 min.

I consistently see both the rollover and delete crons complete successfully. Inspecting the indices, I see new ones being created and old ones removed.

Comment 37 Qiaoling Tang 2020-09-25 09:18:21 UTC
The cluster logging instance has been running for 24 hours and all the jobs completed successfully, so I'm moving this BZ to VERIFIED.

I'll try to reproduce the issue in comment 35.

Comment 41 errata-xmlrpc 2020-10-27 15:09:34 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.1 extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4198

Comment 47 Jeff Cantrill 2020-11-16 17:19:55 UTC
(In reply to Andre Costa from comment #46)
> Hello,
> 
> Can we re-open this ticket or have some information on the fix? 
> 
> We still have customers saying that on 4.5.z this is still happening.

The fix made the scripts more robust and added more log statements. It may be a transient error. First, we keep one copy of failed pods, and these pods are on a 15m cycle. If the ES cluster is not functional during that window, you will see failures which are transient.

Please create a new ticket with a must-gather and all the details you can acquire so we can investigate further.
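
If it helps triage before filing, one hedged way to tell the single retained failed pod apart from an ongoing problem (namespace and job name are illustrative):

oc -n openshift-logging get cronjobs                                    # rollover/delete schedules, e.g. */15 * * * *
oc -n openshift-logging get jobs --sort-by=.status.startTime            # see how many recent runs actually failed
oc -n openshift-logging logs job/elasticsearch-delete-app-1600656300    # read the retained failure (name from comment 35)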

Comment 48 hgomes 2020-11-17 18:57:58 UTC
BZ opened to backport the fix to 4.5:
https://bugzilla.redhat.com/show_bug.cgi?id=1896578

Comment 55 Red Hat Bugzilla 2023-09-18 00:21:54 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.

