Description of problem:
On the node where the Elasticsearch pod is running, the Elasticsearch PID constantly uses 190% of CPU (2 cores), even after a pod restart and a node reboot.

Where are you experiencing the behavior? What environment?

Here are some lines from the pod's log:

[2016-10-12 12:09:06,770][ERROR][io.fabric8.elasticsearch.plugin.acl.DynamicACLFilter] [Glenn Talbot] Exception encountered when seeding initial ACL
org.elasticsearch.action.NoShardAvailableActionException: [.searchguard.logging-es-bxgpgroo-4-50rl3][4] null
	at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.perform(TransportShardSingleOperationAction.java:175)
	at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.start(TransportShardSingleOperationAction.java:155)
	at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:89)
	at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:55)
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167)
	at io.fabric8.elasticsearch.plugin.KibanaUserReindexAction.apply(KibanaUserReindexAction.java:78)
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)
	at com.floragunn.searchguard.filter.SearchGuardActionFilter.apply0(SearchGuardActionFilter.java:145)
	at com.floragunn.searchguard.filter.SearchGuardActionFilter.apply(SearchGuardActionFilter.java:90)
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)
	at com.floragunn.searchguard.filter.AbstractActionFilter.apply(AbstractActionFilter.java:105)
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)
	at com.floragunn.searchguard.filter.AbstractActionFilter.apply(AbstractActionFilter.java:105)
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)
	at com.floragunn.searchguard.filter.AbstractActionFilter.apply(AbstractActionFilter.java:105)
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)
	at org.elasticsearch.action.support.ActionFilter$Simple.apply(ActionFilter.java:64)
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)
	at io.fabric8.elasticsearch.plugin.ActionForbiddenActionFilter.apply(ActionForbiddenActionFilter.java:48)
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:82)
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:55)
	at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:90)
	at org.elasticsearch.client.support.AbstractClient.get(AbstractClient.java:188)
	at io.fabric8.elasticsearch.plugin.acl.DynamicACLFilter.loadAcl(DynamicACLFilter.java:257)
	at io.fabric8.elasticsearch.plugin.acl.DynamicACLFilter.seedInitialACL(DynamicACLFilter.java:309)
	at io.fabric8.elasticsearch.plugin.acl.DynamicACLFilter.onSearchGuardACLActionRequest(DynamicACLFilter.java:122)
	at io.fabric8.elasticsearch.plugin.acl.DefaultACLNotifierService$1.run(DefaultACLNotifierService.java:64)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
[2016-10-12 12:09:07,501][WARN ][index.engine             ] [Glenn Talbot] [.operations.2016.10.10][4] failed to sync translog
[2016-10-12 12:09:07,517][WARN ][indices.cluster          ] [Glenn Talbot] [[.operations.2016.10.10][4]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [.operations.2016.10.10][4] failed to recover shard
	at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:290)
	at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
	at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:72)
	at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:260)

Version-Release number of selected component (if applicable):

How reproducible:
On customer side

Steps to Reproduce:
1. Mentioned in the description
2.
3.

Actual results:

Expected results:

Additional info:
Can you attach the entire log for the ES pod, from the beginning? My half-baked guess is that ES is choking on something in its storage (persistent volume) and failing to even get to the point where the entry for the initial ACL can be created.
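For example (a sketch only; substitute your actual Elasticsearch pod name):

$ oc logs <your_es_pod> > es-pod.log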
Hello Luke, please find logs attached.
The problem appears to be that recovery never succeeded because of a corrupted transaction log file.

[2016-10-25 12:22:25,554][WARN ][index.engine             ] [Holly] [.operations.2016.10.10][0] failed to sync translog
[2016-10-25 12:22:25,670][WARN ][indices.cluster          ] [Holly] [[.operations.2016.10.10][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [.operations.2016.10.10][0] failed to recover shard
	at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:290)
	at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
	at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:72)
	at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:260)
	... 4 more
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: No version type match [99]
	at org.elasticsearch.index.VersionType.fromValue(VersionType.java:307)
	at org.elasticsearch.index.translog.Translog$Create.readFrom(Translog.java:376)
	at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
	... 5 more
[2016-10-25 12:22:25,675][WARN ][cluster.action.shard     ] [Holly] [.operations.2016.10.10][0] received shard failed for [.operations.2016.10.10][0], node[oCTdf5nATDCqLLb5y6d7Wg], [P], s[INITIALIZING], indexUUID [ya7_aagrRLCEpAgBwub1VQ], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[.operations.2016.10.10][0] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: ElasticsearchIllegalArgumentException[No version type match [99]]; ]]

This can happen for a variety of reasons, including running out of disk and unclean shutdowns. From https://github.com/elastic/elasticsearch/issues/12055 I gather that the solution is to delete the .recovery files from the persistent volume:

$ find /elasticsearch/persistent -name '*.recovery' -delete

Then remove the pod so another will start with cleaner data.
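For example (a sketch, assuming the Elasticsearch pod is managed by its logging deployment config, so the replication controller will recreate it automatically):

$ oc delete pod <your_es_pod>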
Some other questions: how many instances of ES are in this cluster? What is the output from a cluster health check?

$ oc exec <your_es_pod> -- curl \
    --key /etc/elasticsearch/keys/admin-key \
    --cert /etc/elasticsearch/keys/admin-cert \
    --cacert /etc/elasticsearch/keys/admin-ca \
    -XGET 'http://localhost:9200/_cluster/health?pretty=true'

I did see a decent amount of GC messages during your recovery, which could contribute to the high CPU usage.
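For reference, the health check returns JSON along these lines (illustrative values only; the field names come from the stock Elasticsearch 1.x _cluster/health API, and your numbers and cluster name will differ):

{
  "cluster_name" : "logging-es",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 10,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 25
}

The "status" and "unassigned_shards" fields are the ones to look at first.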
My mistake, I had said 'http' in comment 5; it should be 'https':

$ oc exec <your_es_pod> -- curl \
    --key /etc/elasticsearch/keys/admin-key \
    --cert /etc/elasticsearch/keys/admin-cert \
    --cacert /etc/elasticsearch/keys/admin-ca \
    -XGET 'https://localhost:9200/_cluster/health?pretty=true'
It looks like recovery has completed as much as it can, given that the number of pending tasks is 0 and the CPU usage has dropped back down (per comment #6).

The cluster is still in a 'red' state per how Elasticsearch determines its cluster state. Red indicates that there are missing primary shards for at least one index; the number of unassigned shards also supports this. There are two ways that the cluster can get to a non-red state. Yellow means that all indices have their primary shards allocated. Green means that all replica shards are also allocated; however, ES will not co-locate a replica with its primary shard. Given that there is only one ES node in the cluster (number_of_nodes), the best this cluster can reach is 'yellow', simply because it will never be able to schedule its replica shards.

To change this, either a second ES node can be added to the cluster (described in "changing the scale of elasticsearch" [1]) or the number of replicas can be decreased. Please note that increasing the cluster size requires stopping the current cluster members, so the cluster will go through recovery again when it starts back up.

Other than adding a cluster node, the number of replicas for the indices can be changed to 0, meaning only primary shards will be allocated. Since ES would not have allocated the replica shards with this cluster structure anyway, there is no loss of data, only gained heap space:

$ oc exec <your_es_pod> -- curl \
    --key /etc/elasticsearch/keys/admin-key \
    --cert /etc/elasticsearch/keys/admin-cert \
    --cacert /etc/elasticsearch/keys/admin-ca \
    -XPOST 'https://localhost:9200/*/_settings' -d '
{
    "index" : {
        "number_of_replicas" : 0
    }
}'

This should decrease the number of unassigned shards as the number of replicas is decreased. The remaining unassigned shards come from the recovery errors seen in the initial description. You can either wait for those indices to be deleted by Curator, or delete/close them before then. To figure out which indices those are, run the following (an alternative using the _cat API is sketched after the references below):

$ oc exec <your_es_pod> -- curl \
    --key /etc/elasticsearch/keys/admin-key \
    --cert /etc/elasticsearch/keys/admin-cert \
    --cacert /etc/elasticsearch/keys/admin-ca \
    -XGET 'https://localhost:9200/_cluster/health?pretty=true&level=indices' | grep -a2 'red'

Once you have a list of indices that are in the red state, you can close or delete them. When an index is closed, its data remains on disk but is not available to query; given that the index did not recover correctly, querying it was not possible anyway, so the end state is similar. The only reason to close rather than delete is if you want to attempt to recover the index again at a later point: a closed index can later be opened [2]; a deleted index is gone.

Delete:

$ oc exec <your_es_pod> -- curl \
    --key /etc/elasticsearch/keys/admin-key \
    --cert /etc/elasticsearch/keys/admin-cert \
    --cacert /etc/elasticsearch/keys/admin-ca \
    -XDELETE 'https://localhost:9200/<index_to_delete>'

Close:

$ oc exec <your_es_pod> -- curl \
    --key /etc/elasticsearch/keys/admin-key \
    --cert /etc/elasticsearch/keys/admin-cert \
    --cacert /etc/elasticsearch/keys/admin-ca \
    -XPOST 'https://localhost:9200/<index_to_close>/_close'

[1] https://access.redhat.com/documentation/en/openshift-enterprise/version-3.2/installation-and-configuration/#aggregated-elasticsearch
[2] https://www.elastic.co/guide/en/elasticsearch/reference/1.5/indices-open-close.html
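As mentioned above, the red indices can also be listed more compactly. This is only a sketch using Elasticsearch's generic _cat API, not anything specific to this installation:

$ oc exec <your_es_pod> -- curl \
    --key /etc/elasticsearch/keys/admin-key \
    --cert /etc/elasticsearch/keys/admin-cert \
    --cacert /etc/elasticsearch/keys/admin-ca \
    -XGET 'https://localhost:9200/_cat/indices?v' | grep red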
There is a mistake with the _settings curl. I misread the ES documentation; it should be XPUT, not XPOST.

$ oc exec <your_es_pod> -- curl \
    --key /etc/elasticsearch/keys/admin-key \
    --cert /etc/elasticsearch/keys/admin-cert \
    --cacert /etc/elasticsearch/keys/admin-ca \
    -XPUT 'https://localhost:9200/*/_settings' -d '
{
    "index" : {
        "number_of_replicas" : 0
    }
}'

--

Please note that the mount point for the keys is different in the example below because this was tested with a different Aggregated Logging installation than reported (3.3.1 instead of 3.2.0). However, the version of Elasticsearch is the same.

$ oc exec logging-es-ig5cinh5-1-6xe6a -- curl --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca -XPUT 'https://localhost:9200/*/_settings' -d '
{
    "index" : {
        "number_of_replicas" : 0
    }
}'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    81  100    21  100    60    137    393 --:--:-- --:--:-- --:--:--   394
{"acknowledged":true}
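If you want to confirm the change took effect, the settings can be read back (a sketch; adjust the key mount point to your installation as noted above):

$ oc exec <your_es_pod> -- curl \
    --key /etc/elasticsearch/keys/admin-key \
    --cert /etc/elasticsearch/keys/admin-cert \
    --cacert /etc/elasticsearch/keys/admin-ca \
    -XGET 'https://localhost:9200/*/_settings?pretty' | grep number_of_replicas

Each index should now report its number_of_replicas setting as 0.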