Description of problem:
Elasticsearch ran out of disk space and the .searchguard index went UNASSIGNED. After I deleted some indices there was plenty of free space again, but .searchguard stayed in UNASSIGNED status. Restarting the Elasticsearch pod brings it back to STARTED.

Version-Release number of selected component (if applicable):
ose-logging-elasticsearch5/images/v3.11.0-0.11.0.0

How reproducible:
Always

Steps to Reproduce:
1. The .searchguard index is UNASSIGNED.
2. Wait for a while.

Actual results:
The .searchguard index stays UNASSIGNED until the Elasticsearch pod is restarted.

Expected results:
The .searchguard index returns to STARTED status automatically.

Additional info:
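(For reference, disk usage and index state can be checked from inside the elasticsearch container; a sketch, assuming the es_util helper used in the comments below, with $pod standing in for the Elasticsearch pod name:)

  oc exec -c elasticsearch $pod -- es_util --query=_cat/allocation?v
  oc exec -c elasticsearch $pod -- es_util --query=_cat/indices?v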
How exactly did you make .searchguard UNASSIGNED? (I'm trying to understand whether there is an easy way to reproduce the issue.) Is it a single-node cluster?

What is the content of the following after step 2?

  _cluster/settings
  _nodes/stats
  _cat/shards?h=index,shard,prirep,state,unassigned.reason | grep UNASSIGNED
  _cluster/allocation/explain?pretty
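These could be gathered from inside the ES pod roughly as follows (a sketch, assuming the es_util wrapper in the elasticsearch container, with $pod standing in for the Elasticsearch pod name):

  oc exec -c elasticsearch $pod -- es_util --query=_cluster/settings?pretty
  oc exec -c elasticsearch $pod -- es_util --query=_nodes/stats?pretty
  oc exec -c elasticsearch $pod -- es_util --query="_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED
  oc exec -c elasticsearch $pod -- es_util --query=_cluster/allocation/explain?pretty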
@Anton, I just found .searchguard UNASSIGNED; I am not confident I can reproduce it, and I did not record the cluster/node status at the time. When I observed the UNASSIGNED .searchguard, other indices (.operations) were UNASSIGNED too. After I freed up space by deleting some large indices, the other indices recovered, except for .searchguard. The .operations and project* indices can be rebuilt automatically when new documents come in from fluentd. Who is the document producer for .searchguard?
The '.searchguard' index is created by the search-guard plugin when ES starts up. This is the index where the ACL documents are stored. I believe we modified the defaults so it has 1 shard and no replicas. If your cluster has no space and nodes go up and down because of something like a new deployment, it's possible this index could become unassigned. We need to better understand under what conditions it got into this state. We frequently see unassigned indices during rollout of new versions in the online clusters, and it generally requires us to manually allocate or assign shards. We have allocate and move scripts [1] to work around some of these issues, but I'd be curious to know whether they are usable if this particular index is unavailable.

[1] https://github.com/openshift/origin-aggregated-logging/tree/master/elasticsearch/utils
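If it helps narrow this down, the index settings (shard/replica counts) and the state of its shard could be checked with something like the following (a sketch, assuming the es_util helper in the elasticsearch container and $pod set to the Elasticsearch pod name):

  oc exec -c elasticsearch $pod -- es_util --query=.searchguard/_settings?pretty
  oc exec -c elasticsearch $pod -- es_util --query=_cat/shards/.searchguard?v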
@jeff, thanks for the explanation. It would be better to provide a way to rebuild the .searchguard index instead of restarting the pod; either a manual or an automatic approach is acceptable.
The root cause of this issue is the lack of disk space; presumably alerts will fire when we run into that condition. Additionally, I have seen unassigned shards which I believe are related to an ES node not terminating gracefully. This sometimes occurs during upgrades and requires manual intervention.

This investigation also led me to find that some of our util scripts needed to change for 5.x. One could correct this situation by first trying to run [1] and then possibly [2], which can lead to data loss. If all else fails, we could delete the index:

  oc exec -c elasticsearch $pod -- es_util --query=.searchguard -XDELETE

and then reseed it:

  oc exec -c elasticsearch $pod -- es_seed_acl

[1] https://github.com/openshift/origin-aggregated-logging/pull/1301/files#diff-832537f1c03423c10b34fb8158462c68R29
[2] https://github.com/openshift/origin-aggregated-logging/pull/1301/files#diff-ff1e3417d3a52fe1c1e0e03f70f05bc6R1
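For completeness, a rough sketch of the surrounding steps (same es_util/es_seed_acl helpers and $pod placeholder as above; the reroute retry only applies if the shard exhausted its allocation retries):

  # ask ES to retry allocations that previously failed, before resorting to delete/reseed
  oc exec -c elasticsearch $pod -- es_util --query="_cluster/reroute?retry_failed=true" -XPOST
  # after the delete and es_seed_acl, confirm the ACL index is back and healthy
  oc exec -c elasticsearch $pod -- es_util --query="_cluster/health/.searchguard?pretty"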
es_seed_acl can rebuild the .searchguard index, so moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:2652