Bug 1613131

Summary: Provide a way to rebuild the .searchguard index without restarting the pod
Product: OpenShift Container Platform    Reporter: Anping Li <anli>
Component: Logging    Assignee: Jeff Cantrill <jcantril>
Status: CLOSED ERRATA    QA Contact: Anping Li <anli>
Severity: medium
Priority: unspecified
Version: 3.11.0    CC: anli, aos-bugs, rmeggins
Target Milestone: ---
Target Release: 3.11.0
Hardware: Unspecified
OS: Unspecified
Last Closed: 2018-10-11 07:24:00 UTC    Type: Bug

Description Anping Li 2018-08-07 05:16:35 UTC
Description of problem:
The Elasticsearch cluster ran out of disk space, and the .searchguard index went to UNASSIGNED status. After I deleted some indices there was plenty of free space, but .searchguard remained UNASSIGNED. Restarting the Elasticsearch pod made it start again.


Version-Release number of selected component (if applicable):
ose-logging-elasticsearch5/images/v3.11.0-0.11.0.0

How reproducible:
Always

Steps to Reproduce:
1. The .searchguard index is in UNASSIGNED status.
2. Wait for a while.

Actual results:
The .searchguard index remained UNASSIGNED until I restarted the Elasticsearch pod.

Expected Results:
The .searchguard index returns to STARTED status automatically.


Additional info:

Comment 1 Anton Sherkhonov 2018-08-07 12:34:57 UTC
How exactly did you make .searchguard UNASSIGNED? (I'm trying to understand whether there is an easy way to reproduce the issue)
Is it a single-node cluster?
What is the content of the following after step 2?
_cluster/settings
_nodes/stats
_cat/shards?h=index,shard,prirep,state,unassigned.reason | grep UNASSIGNED
_cluster/allocation/explain?pretty
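The queries above can be run against a logging Elasticsearch pod with the bundled es_util helper (the same helper used later in this bug). A sketch, assuming `$pod` holds the pod name and that the pod carries the `component=es` label used by the logging stack:

```shell
# Assumption: logging ES pods are labeled component=es in the logging namespace.
pod=$(oc get pods -l component=es -o jsonpath='{.items[0].metadata.name}')

# es_util wraps an authenticated curl against the local cluster.
oc exec -c elasticsearch $pod -- es_util --query='_cluster/settings?pretty'
oc exec -c elasticsearch $pod -- es_util --query='_nodes/stats?pretty'
oc exec -c elasticsearch $pod -- es_util \
    --query='_cat/shards?h=index,shard,prirep,state,unassigned.reason' | grep UNASSIGNED
oc exec -c elasticsearch $pod -- es_util --query='_cluster/allocation/explain?pretty'
```

These commands only read cluster state, so they are safe to run on a degraded cluster.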

Comment 2 Anping Li 2018-08-08 00:01:15 UTC
@Anton, I just found .searchguard UNASSIGNED; I am not confident I can reproduce it.
I didn't record the cluster/node status. When I observed the UNASSIGNED .searchguard, there were other indices (.operations) in UNASSIGNED too. After I freed space by deleting some large indices, the other indices recovered, except for .searchguard.
The .operations and project.* indices are rebuilt automatically when new documents come from fluentd. What is the document producer for .searchguard?

Comment 3 Jeff Cantrill 2018-08-08 20:16:17 UTC
The '.searchguard' index is created by the search-guard plugin when ES starts up. This is the index where the ACL documents are stored. I believe we modified the defaults to use 1 shard and no replicas.

If your cluster has no space and nodes go up and down (because of something like a new deployment), it is possible for this index to become unassigned. We need to better understand the conditions under which it got into this state. We frequently see unassigned indices during rollout of new versions on the online clusters, and it generally requires us to manually allocate or assign shards. We have allocate and move scripts [1] to work around some of these issues, but I'd be curious to know whether they are usable when this particular index is unavailable.

[1] https://github.com/openshift/origin-aggregated-logging/tree/master/elasticsearch/utils
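When shards stay UNASSIGNED after disk space has been freed, a common cause is that ES exhausted its allocation retry budget while the disk watermark was exceeded; Elasticsearch 5.x exposes a reroute option to retry those failed allocations without restarting the pod. A sketch, assuming `$pod` names a logging Elasticsearch pod:

```shell
# First ask why the shard is unassigned (ES 5.x allocation explain API).
oc exec -c elasticsearch $pod -- es_util --query='_cluster/allocation/explain?pretty'

# Retry allocations that failed their maximum number of attempts, e.g.
# while the disk high watermark was exceeded (ES 5.x reroute option).
oc exec -c elasticsearch $pod -- es_util \
    --query='_cluster/reroute?retry_failed=true' -XPOST
```

Whether this recovers .searchguard specifically depends on why allocation failed; the explain output should say.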

Comment 4 Anping Li 2018-08-09 06:14:20 UTC
@jeff, thanks for the explanation. It would be better to provide a way to rebuild the .searchguard index instead of restarting the pod. Either a manual or an automatic method is acceptable.

Comment 5 Jeff Cantrill 2018-08-17 19:14:46 UTC
The root cause of this issue is lack of disk space. Presumably alerts will fire when we run into that condition. Additionally, I have seen unassigned shards that I believe are related to an ES node not terminating gracefully. This sometimes occurs during upgrades and requires manual intervention. This investigation led me to find that some of our util scripts needed changes for 5.x.

One could correct this situation by first trying to run [1] and then possibly [2], which can lead to data loss.

If all else fails, we could delete the index:

oc exec -c elasticsearch $pod -- es_util --query=.searchguard -XDELETE

and then reseed:

oc exec -c elasticsearch $pod -- es_seed_acl

[1] https://github.com/openshift/origin-aggregated-logging/pull/1301/files#diff-832537f1c03423c10b34fb8158462c68R29
[2] https://github.com/openshift/origin-aggregated-logging/pull/1301/files#diff-ff1e3417d3a52fe1c1e0e03f70f05bc6R1
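Put together, the delete-and-reseed recovery from this comment can be followed by a check that the index actually came back. A sketch, assuming `$pod` names a logging Elasticsearch pod and that es_seed_acl regenerates the ACL documents from the deployed config (as Comment 6 confirms it rebuilds the index):

```shell
# Delete the broken .searchguard index. This loses only the ACL
# documents, which the reseed step below recreates.
oc exec -c elasticsearch $pod -- es_util --query=.searchguard -XDELETE

# Reseed the search-guard ACL documents.
oc exec -c elasticsearch $pod -- es_seed_acl

# Confirm the .searchguard shard is STARTED again.
oc exec -c elasticsearch $pod -- es_util \
    --query='_cat/shards/.searchguard?h=index,shard,prirep,state'
```

The final command uses the standard `_cat/shards/{index}` API, so the expected output is a single line ending in STARTED once recovery succeeds.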

Comment 6 Anping Li 2018-08-20 06:09:23 UTC
es_seed_acl can rebuild the .searchguard index, so moving to verified.

Comment 8 errata-xmlrpc 2018-10-11 07:24:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:2652