Description of problem:
On the node where the Elasticsearch pod is running, the Elasticsearch PID constantly uses 190% of CPU (2 cores), even after a pod restart and a node reboot.

Where are you experiencing the behavior? What environment?

Here are some lines from the pod's log:

[2016-10-12 12:09:06,770][ERROR][io.fabric8.elasticsearch.plugin.acl.DynamicACLFilter] [Glenn Talbot] Exception encountered when seeding initial ACL
org.elasticsearch.action.NoShardAvailableActionException: [.searchguard.logging-es-bxgpgroo-4-50rl3][4] null
	at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.perform(TransportShardSingleOperationAction.java:175)
	at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction$AsyncSingleAction.start(TransportShardSingleOperationAction.java:155)
	at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:89)
	at org.elasticsearch.action.support.single.shard.TransportShardSingleOperationAction.doExecute(TransportShardSingleOperationAction.java:55)
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:167)
	at io.fabric8.elasticsearch.plugin.KibanaUserReindexAction.apply(KibanaUserReindexAction.java:78)
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)
	at com.floragunn.searchguard.filter.SearchGuardActionFilter.apply0(SearchGuardActionFilter.java:145)
	at com.floragunn.searchguard.filter.SearchGuardActionFilter.apply(SearchGuardActionFilter.java:90)
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)
	at com.floragunn.searchguard.filter.AbstractActionFilter.apply(AbstractActionFilter.java:105)
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)
	at com.floragunn.searchguard.filter.AbstractActionFilter.apply(AbstractActionFilter.java:105)
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)
	at com.floragunn.searchguard.filter.AbstractActionFilter.apply(AbstractActionFilter.java:105)
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)
	at org.elasticsearch.action.support.ActionFilter$Simple.apply(ActionFilter.java:64)
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)
	at io.fabric8.elasticsearch.plugin.ActionForbiddenActionFilter.apply(ActionForbiddenActionFilter.java:48)
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:165)
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:82)
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:55)
	at org.elasticsearch.client.node.NodeClient.execute(NodeClient.java:90)
	at org.elasticsearch.client.support.AbstractClient.get(AbstractClient.java:188)
	at io.fabric8.elasticsearch.plugin.acl.DynamicACLFilter.loadAcl(DynamicACLFilter.java:257)
	at io.fabric8.elasticsearch.plugin.acl.DynamicACLFilter.seedInitialACL(DynamicACLFilter.java:309)
	at io.fabric8.elasticsearch.plugin.acl.DynamicACLFilter.onSearchGuardACLActionRequest(DynamicACLFilter.java:122)
	at io.fabric8.elasticsearch.plugin.acl.DefaultACLNotifierService$1.run(DefaultACLNotifierService.java:64)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
[2016-10-12 12:09:07,501][WARN ][index.engine             ] [Glenn Talbot] [.operations.2016.10.10][4] failed to sync translog
[2016-10-12 12:09:07,517][WARN ][indices.cluster          ] [Glenn Talbot] [[.operations.2016.10.10][4]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [.operations.2016.10.10][4] failed to recover shard
	at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:290)
	at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
	at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:72)
	at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:260)

Version-Release number of selected component (if applicable):

How reproducible:
On customer side

Steps to Reproduce:
1. Mentioned in the description
2.
3.

Actual results:

Expected results:

Additional info:
Can you attach the entire log for the ES pod, from the beginning? My half-baked guess is that ES is choking on something in its storage (persistent volume) and failing to even get to the point where the entry for the initial ACL can be created.
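For example (a sketch only; substitute your actual Elasticsearch pod name):

$ oc logs <your_es_pod> > es-pod.log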
Hello Luke, please find logs attached.
The problem appears to be that recovery never succeeded because of a corrupted transaction log file.

[2016-10-25 12:22:25,554][WARN ][index.engine             ] [Holly] [.operations.2016.10.10][0] failed to sync translog
[2016-10-25 12:22:25,670][WARN ][indices.cluster          ] [Holly] [[.operations.2016.10.10][0]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [.operations.2016.10.10][0] failed to recover shard
	at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:290)
	at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.translog.TranslogCorruptedException: translog corruption while reading from stream
	at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:72)
	at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:260)
	... 4 more
Caused by: org.elasticsearch.ElasticsearchIllegalArgumentException: No version type match [99]
	at org.elasticsearch.index.VersionType.fromValue(VersionType.java:307)
	at org.elasticsearch.index.translog.Translog$Create.readFrom(Translog.java:376)
	at org.elasticsearch.index.translog.ChecksummedTranslogStream.read(ChecksummedTranslogStream.java:68)
	... 5 more
[2016-10-25 12:22:25,675][WARN ][cluster.action.shard     ] [Holly] [.operations.2016.10.10][0] received shard failed for [.operations.2016.10.10][0], node[oCTdf5nATDCqLLb5y6d7Wg], [P], s[INITIALIZING], indexUUID [ya7_aagrRLCEpAgBwub1VQ], reason [shard failure [failed recovery][IndexShardGatewayRecoveryException[[.operations.2016.10.10][0] failed to recover shard]; nested: TranslogCorruptedException[translog corruption while reading from stream]; nested: ElasticsearchIllegalArgumentException[No version type match [99]]; ]]

This can happen for a variety of reasons, including running out of disk and unclean shutdowns. From https://github.com/elastic/elasticsearch/issues/12055 I gather that the solution is to delete the .recovery files from the persistent volume:

$ find /elasticsearch/persistent -name '*.recovery' -delete

Then remove the pod so another will start with cleaner data.
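For example (a sketch, assuming the Elasticsearch pod is managed by its logging deployment config, so the replication controller will recreate it automatically):

$ oc delete pod <your_es_pod>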
Some other questions: how many instances of ES are in this cluster? What is the output from a cluster health check?

$ oc exec <your_es_pod> -- curl \
    --key /etc/elasticsearch/keys/admin-key \
    --cert /etc/elasticsearch/keys/admin-cert \
    --cacert /etc/elasticsearch/keys/admin-ca \
    -XGET 'http://localhost:9200/_cluster/health?pretty=true'

I did see a decent amount of GC messages during your recovery, which could contribute to the high CPU usage.
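For reference, the health check returns JSON along these lines (illustrative values only; the field names come from the stock Elasticsearch 1.x _cluster/health API, and your numbers and cluster name will differ):

{
  "cluster_name" : "logging-es",
  "status" : "red",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 10,
  "active_shards" : 10,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 25
}

The "status" and "unassigned_shards" fields are the ones to look at first.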
My mistake, I had said 'http' in comment 5; it should be 'https':

$ oc exec <your_es_pod> -- curl \
    --key /etc/elasticsearch/keys/admin-key \
    --cert /etc/elasticsearch/keys/admin-cert \
    --cacert /etc/elasticsearch/keys/admin-ca \
    -XGET 'https://localhost:9200/_cluster/health?pretty=true'
It looks like recovery has completed as much as it can, given that the number of pending tasks is 0 and the CPU usage has dropped back down (per comment #6).

The cluster is still in a 'red' state per how Elasticsearch determines its cluster state. Red indicates that there are missing primary shards for at least one index; the number of unassigned shards also supports this. There are two ways that the cluster can get to a non-red state. Yellow means that all indices have their primary shards allocated. Green means that all replica shards are also allocated; however, ES will not co-locate a replica with its primary shard. Given that there is only one ES node in the cluster (number_of_nodes), the best this cluster can reach is 'yellow', simply because it will never be able to schedule its replica shards.

To change this, either a second ES node can be added to the cluster (described in "changing the scale of elasticsearch" [1]) or the number of replicas can be decreased. Please note that increasing the cluster size requires stopping the current cluster members, so the cluster will go through recovery again when it starts back up.

Other than adding a cluster node, the number of replicas for the indices can be changed to 0, meaning only primary shards will be allocated. Since ES would not have allocated the replica shards with this cluster structure anyway, there is no loss of data, only gained heap space:

$ oc exec <your_es_pod> -- curl \
    --key /etc/elasticsearch/keys/admin-key \
    --cert /etc/elasticsearch/keys/admin-cert \
    --cacert /etc/elasticsearch/keys/admin-ca \
    -XPOST 'https://localhost:9200/*/_settings' -d '
{
    "index" : {
        "number_of_replicas" : 0
    }
}'

This should decrease the number of unassigned shards as the number of replicas is decreased. The remaining unassigned shards come from the recovery errors seen in the initial description. You can either wait for those indices to be deleted by Curator, or delete/close them before then. To figure out which indices those are, run the following (an alternative using the _cat API is sketched after the references below):

$ oc exec <your_es_pod> -- curl \
    --key /etc/elasticsearch/keys/admin-key \
    --cert /etc/elasticsearch/keys/admin-cert \
    --cacert /etc/elasticsearch/keys/admin-ca \
    -XGET 'https://localhost:9200/_cluster/health?pretty=true&level=indices' | grep -a2 'red'

Once you have a list of indices that are in the red state, you can close or delete them. When an index is closed, its data remains on disk but is not available to query; given that the index did not recover correctly, querying it was not possible anyway, so the end state is similar. The only reason to close rather than delete is if you want to attempt to recover the index again at a later point: a closed index can later be opened [2]; a deleted index is gone.

Delete:

$ oc exec <your_es_pod> -- curl \
    --key /etc/elasticsearch/keys/admin-key \
    --cert /etc/elasticsearch/keys/admin-cert \
    --cacert /etc/elasticsearch/keys/admin-ca \
    -XDELETE 'https://localhost:9200/<index_to_delete>'

Close:

$ oc exec <your_es_pod> -- curl \
    --key /etc/elasticsearch/keys/admin-key \
    --cert /etc/elasticsearch/keys/admin-cert \
    --cacert /etc/elasticsearch/keys/admin-ca \
    -XPOST 'https://localhost:9200/<index_to_close>/_close'

[1] https://access.redhat.com/documentation/en/openshift-enterprise/version-3.2/installation-and-configuration/#aggregated-elasticsearch
[2] https://www.elastic.co/guide/en/elasticsearch/reference/1.5/indices-open-close.html
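As mentioned above, the red indices can also be listed more compactly. This is only a sketch using Elasticsearch's generic _cat API, not anything specific to this installation:

$ oc exec <your_es_pod> -- curl \
    --key /etc/elasticsearch/keys/admin-key \
    --cert /etc/elasticsearch/keys/admin-cert \
    --cacert /etc/elasticsearch/keys/admin-ca \
    -XGET 'https://localhost:9200/_cat/indices?v' | grep red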
There is a mistake with the _settings curl. I misread the ES documentation; it should be XPUT, not XPOST.

$ oc exec <your_es_pod> -- curl \
    --key /etc/elasticsearch/keys/admin-key \
    --cert /etc/elasticsearch/keys/admin-cert \
    --cacert /etc/elasticsearch/keys/admin-ca \
    -XPUT 'https://localhost:9200/*/_settings' -d '
{
    "index" : {
        "number_of_replicas" : 0
    }
}'

--

Please note that the mount point for the keys is different in the example below because this was tested with a different Aggregated Logging installation than reported (3.3.1 instead of 3.2.0). However, the version of Elasticsearch is the same.

$ oc exec logging-es-ig5cinh5-1-6xe6a -- curl --key /etc/elasticsearch/secret/admin-key --cert /etc/elasticsearch/secret/admin-cert --cacert /etc/elasticsearch/secret/admin-ca -XPUT 'https://localhost:9200/*/_settings' -d '
{
    "index" : {
        "number_of_replicas" : 0
    }
}'
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    81  100    21  100    60    137    393 --:--:-- --:--:-- --:--:--   394
{"acknowledged":true}
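If you want to confirm the change took effect, the settings can be read back (a sketch; adjust the key mount point to your installation as noted above):

$ oc exec <your_es_pod> -- curl \
    --key /etc/elasticsearch/keys/admin-key \
    --cert /etc/elasticsearch/keys/admin-cert \
    --cacert /etc/elasticsearch/keys/admin-ca \
    -XGET 'https://localhost:9200/*/_settings?pretty' | grep number_of_replicas

Each index should now report its number_of_replicas setting as 0.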