Bug 1339888
| Summary: | Recover EFK after CorruptIndexException of ElasticSearch | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Jaspreet Kaur <jkaur> |
| Component: | Logging | Assignee: | Jeff Cantrill <jcantril> |
| Status: | CLOSED ERRATA | QA Contact: | Xia Zhao <xiazhao> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.2.0 | CC: | akokshar, aos-bugs, erich, erjones, ewolinet, jcantril, jkaur, lmeyer, pdwyer, rmeggins |
| Target Milestone: | --- | Keywords: | Performance |
| Target Release: | 3.2.1 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-05-18 09:26:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Jaspreet Kaur
2016-05-26 05:31:32 UTC
There is a solution that involves sshing in and nuking the affected index on the filesystem. I just don't know the specifics at this time. Eric, haven't you described this before somewhere? You can also just delete the storage entirely and start over, but that's probably not a very palatable solution. I think this condition is caused by an unclean shutdown of ES while it is mid-write, which leaves the index on the disk in a bad state. ES is apparently less than perfect at cleaning these up, and sometimes manual intervention is necessary. Possibly we can provide a tool to make this easy or even automatic.

That was a slightly different stack trace, but it was covered in this support case: https://access.redhat.com/support/cases/#/case/01569315

There wasn't a clear reason as to why this occurs, even from Elastic. It may be due to the disk running out of storage, the node losing power during a commit, or another disaster scenario. In that case there was a process for when recovery failed: the recommended solution was to delete the '.recovery' files.

If you `oc rsh` into the ES pod, you should be able to delete the files (based on the index in the exception). Example:
`/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/.operations.2015.12.17/3/translog/*.recovery`

Alternatively, if there is a PV, you should also be able to navigate via the share, e.g.
`/my_pvc_host_mount/logging-es/data/logging-es/nodes/0/indices/.operations.2015.12.17/3/translog/*.recovery`

There may be other .recovery files; the path above should resolve to the one that we see in the stack trace. If you cannot find any at that path, I would recommend doing a find on `*.recovery` to locate and delete the files.

Jaspreet, did comment 5 above help you? If that worked for you, we should probably ensure there's a kbase article about it.

The customer was not able to find any recovery files, hence we were not able to confirm the resolution.
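The "find on `*.recovery`" step suggested above could be sketched roughly as follows. This is only an illustration, assuming Python 3 is available wherever the storage is mounted (for example on the PV host; inside the pod a plain `find` may be the only option). The function names and the `dry_run` flag are hypothetical and not part of any OpenShift or Elasticsearch tooling.

```python
from pathlib import Path


def find_recovery_files(data_root):
    """Return every '*.recovery' file below data_root, sorted by path."""
    root = Path(data_root)
    # rglob walks all subdirectories, including dot-prefixed index
    # directories such as '.operations.2015.12.17'.
    return sorted(p for p in root.rglob("*.recovery") if p.is_file())


def delete_recovery_files(data_root, dry_run=True):
    """Return the '*.recovery' files found; delete them unless dry_run is set."""
    found = find_recovery_files(data_root)
    for path in found:
        if not dry_run:
            path.unlink()
    return found
```

Calling `delete_recovery_files("/my_pvc_host_mount/logging-es/data")` with the default `dry_run=True` only lists the candidates, so the paths can be compared against the one in the stack trace before anything is removed.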
Expanding further, given that we came across this again with another customer and were able to work through it.

We've seen this happen in cases where ES hits an Out of Memory Exception. When we get to this point during recovery, Elasticsearch should still be up and running despite this stack trace. It is just letting you know that it was unable to recover that index/shard and that the data may not be available to query.

What is happening is that a shard for the specific index is unable to recover due to the prior bad state. One thing that can be done is to issue a delete for that specific index [1]; this will remove the metadata and the files from disk for that index. Another thing that can be done is to close the index [2]; Elasticsearch will then not try to read the data from disk and will not load information for it into memory, but the data will still be around on disk. In this case the data cannot be queried either.

To be able to issue either a delete or a close, you would follow the guide here [3] to issue administrative commands to Elasticsearch, under chapter 22.8, Performing Elasticsearch Maintenance Operations.

[1] https://www.elastic.co/guide/en/elasticsearch/reference/1.5/indices-delete-index.html
[2] https://www.elastic.co/guide/en/elasticsearch/reference/1.5/indices-open-close.html
[3] https://access.redhat.com/documentation/en/openshift-enterprise/3.2/paged/installation-and-configuration/chapter-22-aggregating-container-logs

ping - any update on this?

Can we close this? There has been no action on this bug in months.

@Eric, the Trello card can't be opened: https://trello.com/c/bN2Ipf69/9-elasticsearch-recovery. Could you please paste the link to the Red Hat Knowledge Base (Solution) 2961981 for QE to review? Thanks!
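For reference, the delete and close operations described in [1] and [2] come down to two HTTP calls, `DELETE /<index>` and `POST /<index>/_close`. The sketch below only builds those requests and does not send them. The function name, the base URL, and the example index name are assumptions for illustration, and authentication (covered by the guide in [3]) is deliberately left out.

```python
import urllib.request
from urllib.parse import quote


def build_index_request(base_url, index, action):
    """Build (but do not send) the request that deletes or closes one index."""
    name = quote(index, safe="")
    if action == "delete":
        # DELETE /<index> removes the index metadata and its files from disk.
        return urllib.request.Request(f"{base_url}/{name}", method="DELETE")
    if action == "close":
        # POST /<index>/_close keeps the files on disk but stops
        # Elasticsearch from reading them or loading them into memory.
        return urllib.request.Request(f"{base_url}/{name}/_close", method="POST")
    raise ValueError(f"unsupported action: {action!r}")
```

Closing is the less destructive choice, since per [2] a closed index can be opened again later, whereas a delete removes the data for good.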
It's also linked above in the external trackers - https://access.redhat.com/site/solutions/2961981

On page https://access.redhat.com/solutions/2961981, in the "Resolution" part, the hyperlink under "guide here" is unavailable:

To be able to issue either a delete or a close, you would follow the guide here

I get this error when clicking the "guide here" hyperlink:

Access Denied
You do not have permission to access the page you requested.

@Eric is this something you can help out with?

(In reply to Jeff Cantrill from comment #20)
> @Eric is this something you can help out with?

https://access.redhat.com/documentation/en/openshift-enterprise/3.2/paged/installation-and-configuration/chapter-22-aggregating-container-logs seems to be a broken link.

(In reply to Eric Rich from comment #21)
> (In reply to Jeff Cantrill from comment #20)
> > @Eric is this something you can help out with?
>
> https://access.redhat.com/documentation/en/openshift-enterprise/3.2/paged/installation-and-configuration/chapter-22-aggregating-container-logs seems to be a broken link.

I updated the article to point at https://access.redhat.com/documentation/en-us/openshift_enterprise/3.2/html-single/installation_and_configuration/#install-config-aggregate-logging

The document review passed; the KCS article contains all the info in comment #9.

@jkaur Any hints on how to reproduce the original bug, to evaluate the solution documented in the KCS article? I want to give it a try, thanks.

Set to verified based on the document review result in comment #23.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHBA-2017:1235

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days