Description of problem:
Elasticsearch ran out of disk space and the .searchguard index went UNASSIGNED. After I deleted some indices there was plenty of free space again, but .searchguard stayed in UNASSIGNED status. Restarting the Elasticsearch pod brings it back to STARTED.

Version-Release number of selected component (if applicable):
ose-logging-elasticsearch5/images/v3.11.0-0.11.0.0

How reproducible:
Always

Steps to Reproduce:
1. The .searchguard index is UNASSIGNED.
2. Wait for a while.

Actual results:
The .searchguard index stays UNASSIGNED until the Elasticsearch pod is restarted.

Expected results:
The .searchguard index returns to STARTED status automatically.

Additional info:
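(For reference, disk usage and index state can be checked from inside the elasticsearch container; a sketch, assuming the es_util helper used in the comments below, with $pod standing in for the Elasticsearch pod name:)

  oc exec -c elasticsearch $pod -- es_util --query=_cat/allocation?v
  oc exec -c elasticsearch $pod -- es_util --query=_cat/indices?v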
How exactly did you make .searchguard UNASSIGNED? (I'm trying to understand whether there is an easy way to reproduce the issue.) Is it a single-node cluster?

What is the content of the following after step 2?

  _cluster/settings
  _nodes/stats
  _cat/shards?h=index,shard,prirep,state,unassigned.reason | grep UNASSIGNED
  _cluster/allocation/explain?pretty
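These could be gathered from inside the ES pod roughly as follows (a sketch, assuming the es_util wrapper in the elasticsearch container, with $pod standing in for the Elasticsearch pod name):

  oc exec -c elasticsearch $pod -- es_util --query=_cluster/settings?pretty
  oc exec -c elasticsearch $pod -- es_util --query=_nodes/stats?pretty
  oc exec -c elasticsearch $pod -- es_util --query="_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED
  oc exec -c elasticsearch $pod -- es_util --query=_cluster/allocation/explain?pretty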
@Anton, I just found .searchguard UNASSIGNED; I am not confident I can reproduce it, and I did not record the cluster/node status at the time. When I observed the UNASSIGNED .searchguard, other indices (.operations) were UNASSIGNED too. After I freed up space by deleting some large indices, the other indices recovered, except for .searchguard. The .operations and project* indices can be rebuilt automatically when new documents come in from fluentd. Who is the document producer for .searchguard?
The '.searchguard' index is created by the search-guard plugin when ES starts up. This is the index where the ACL documents are stored. I believe we modified the defaults so it has 1 shard and no replicas. If your cluster has no space and nodes go up and down because of something like a new deployment, it's possible this index could become unassigned. We need to better understand under what conditions it got into this state. We frequently see unassigned indices during rollout of new versions in the online clusters, and it generally requires us to manually allocate or assign shards. We have allocate and move scripts [1] to work around some of these issues, but I'd be curious to know whether they are usable if this particular index is unavailable.

[1] https://github.com/openshift/origin-aggregated-logging/tree/master/elasticsearch/utils
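If it helps narrow this down, the index settings (shard/replica counts) and the state of its shard could be checked with something like the following (a sketch, assuming the es_util helper in the elasticsearch container and $pod set to the Elasticsearch pod name):

  oc exec -c elasticsearch $pod -- es_util --query=.searchguard/_settings?pretty
  oc exec -c elasticsearch $pod -- es_util --query=_cat/shards/.searchguard?v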
@jeff, thanks for the explanation. It would be better to provide a way to rebuild the .searchguard index instead of restarting the pod; either a manual or an automatic approach is acceptable.
The root cause of this issue is the lack of disk space; presumably alerts will fire when we run into that condition. Additionally, I have seen unassigned shards which I believe are related to an ES node not terminating gracefully. This sometimes occurs during upgrades and requires manual intervention.

This investigation also led me to find that some of our util scripts needed to change for 5.x. One could correct this situation by first trying to run [1] and then possibly [2], which can lead to data loss. If all else fails, we could delete the index:

  oc exec -c elasticsearch $pod -- es_util --query=.searchguard -XDELETE

and then reseed it:

  oc exec -c elasticsearch $pod -- es_seed_acl

[1] https://github.com/openshift/origin-aggregated-logging/pull/1301/files#diff-832537f1c03423c10b34fb8158462c68R29
[2] https://github.com/openshift/origin-aggregated-logging/pull/1301/files#diff-ff1e3417d3a52fe1c1e0e03f70f05bc6R1
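For completeness, a rough sketch of the surrounding steps (same es_util/es_seed_acl helpers and $pod placeholder as above; the reroute retry only applies if the shard exhausted its allocation retries):

  # ask ES to retry allocations that previously failed, before resorting to delete/reseed
  oc exec -c elasticsearch $pod -- es_util --query="_cluster/reroute?retry_failed=true" -XPOST
  # after the delete and es_seed_acl, confirm the ACL index is back and healthy
  oc exec -c elasticsearch $pod -- es_util --query="_cluster/health/.searchguard?pretty"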
es_seed_acl can rebuild the .searchguard index, so moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2652