Description of problem:

The Elasticsearch logs are showing the errors below:

[2016-05-13 06:56:40,280][WARN ][indices.cluster] [Grey Gargoyle] [[logging.6f517df3-1130-11e6-b18c-0050568f7a94.2016.05.12][1]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [logging.6f517df3-1130-11e6-b18c-0050568f7a94.2016.05.12][1] failed to fetch index version after copying it over
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:157)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.index.CorruptIndexException: [logging.6f517df3-1130-11e6-b18c-0050568f7a94.2016.05.12][1] Preexisting corrupted index [corrupted_Lbq2oiP6StWh1dKpbLvNcg] caused by: CorruptIndexException[Invalid fieldsStream maxPointer (file truncated?): maxPointer=1205, length=0]
org.apache.lucene.index.CorruptIndexException: Invalid fieldsStream maxPointer (file truncated?): maxPointer=1205, length=0
    at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.<init>(CompressingStoredFieldsReader.java:135)
    at org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsReader(CompressingStoredFieldsFormat.java:113)
    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:133)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
    at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
    at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:239)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:109)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:421)
    at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:297)
    at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:272)
    at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:262)
    at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:171)
    at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:118)
    at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
    at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
    at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
    at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:568)
    at org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:565)
    at org.elasticsearch.index.shard.IndexShard$EngineRefresher$1.run(IndexShard.java:1095)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:547)
    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:528)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:120)
    ... 4 more
[2016-05-13 06:56:40,297][WARN ][indices.cluster] [Grey Gargoyle] [[swagger-search.8350d0f8-fbf6-11e5-8b2d-0050568f56a5.2016.05.12][3]] marking and sending shard failed due to [failed recovery]
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [swagger-search.8350d0f8-fbf6-11e5-8b2d-0050568f56a5.2016.05.12][3] failed to fetch index version after copying it over
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:157)
    at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.lucene.index.CorruptIndexException: [swagger-search.8350d0f8-fbf6-11e5-8b2d-0050568f56a5.2016.05.12][3] Preexisting corrupted index [corrupted_ZpMUZgkvShq8jXiHH15Jhw] caused by: CorruptIndexException[Invalid fieldsStream maxPointer (file truncated?): maxPointer=11850, length=8166]
org.apache.lucene.index.CorruptIndexException: Invalid fieldsStream maxPointer (file truncated?): maxPointer=11850, length=8166
    at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.<init>(CompressingStoredFieldsReader.java:135)
    at org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsReader(CompressingStoredFieldsFormat.java:113)
    at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:133)
    at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
    at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
    at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:239)
    at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:109)
    at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:421)
    at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:297)
    at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:272)
    at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:262)
    at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:171)
    at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:118)
    at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
    at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
    at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
    at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:568)
    at org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:565)
    at org.elasticsearch.index.shard.IndexShard$EngineRefresher$1.run(IndexShard.java:1095)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:547)
    at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:528)
    at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:120)
    ... 4 more

There is no good way to recover from this, and it is not clear what caused it.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
There is no way to recover EFK after a CorruptIndexException or after it fails for an unknown reason.

Expected results:
There should be an easy way to recover EFK from these different situations and errors.

Additional info:
There is a solution that involves SSHing in and nuking the affected index on the filesystem; I just don't know the specifics at this time. Eric, haven't you described this before somewhere?

You can also just delete the storage entirely and start over, but that's probably not a very palatable solution.

I think this condition is caused by an unclean shutdown of ES while it is mid-write, which leaves the index on disk in a bad state. ES is apparently less than perfect at cleaning these up, and sometimes manual intervention is necessary. Possibly we can provide a tool to make this easy, or even automatic.
That was a slightly different stack trace, but it was covered in this support case: https://access.redhat.com/support/cases/#/case/01569315

There wasn't a clear reason as to why this occurs, even from Elastic, so it may be due to the disk running out of storage, the node losing power during a commit, or another disaster scenario.

In that case, where recovery failed, the recommended solution was to delete the '.recovery' files. If you `oc rsh` into the ES pod, you should be able to delete the files (based on the index named in the exception).

Example:
/elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/.operations.2015.12.17/3/translog/*.recovery

Alternatively, if there is a PV, you should also be able to navigate via the share, e.g.:
/my_pvc_host_mount/logging-es/data/logging-es/nodes/0/indices/.operations.2015.12.17/3/translog/*.recovery

There may be other .recovery files; the path should resolve to the one for the shard we see in the stack trace. If you cannot find any at that path, I would recommend doing a find on `*.recovery` files to locate and delete them. (A quick command sketch follows.)
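As a rough illustration of that cleanup, assuming the ES pod is named logging-es-xxxxx and the data path matches the example above (both will differ per deployment, so adjust before running anything):

    # locate the ES pod
    oc get pods | grep logging-es

    # list any leftover .recovery files under the data directory
    oc rsh logging-es-xxxxx find /elasticsearch/persistent/logging-es/data -name '*.recovery'

    # once you are sure they belong to the corrupted shard from the stack trace, remove them
    oc rsh logging-es-xxxxx sh -c 'rm -v /elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/<index>/<shard>/translog/*.recovery'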
Jaspreet, did comment 5 above help you? If that worked for you, we should probably ensure there's a kbase article about it.
The customer was not able to find any .recovery files, so we were not able to confirm the resolution.
Expanding further, given that we came across this again with another customer and were able to work through it:

We've seen this happen in cases where ES hits an out-of-memory error. When we get to this point during recovery, Elasticsearch should still be up and running despite the stack trace; it is just letting you know that it was unable to recover that index/shard and that the data may not be available to query. What is happening is that a shard for the specific index is unable to recover due to the prior bad state.

One option is to issue a delete for that specific index [1]; this removes the metadata and the files on disk for that index. Another option is to close the index [2]; Elasticsearch will then no longer try to read the data from disk or load information for it into memory, but the data will remain on disk. In this case, too, the data will not be queryable.

To issue either a delete or a close, follow the guide here [3] for sending administrative commands to Elasticsearch, under chapter 22.8 "Performing Elasticsearch Maintenance Operations". (A sketch of both calls is included after the references.)

[1] https://www.elastic.co/guide/en/elasticsearch/reference/1.5/indices-delete-index.html
[2] https://www.elastic.co/guide/en/elasticsearch/reference/1.5/indices-open-close.html
[3] https://access.redhat.com/documentation/en/openshift-enterprise/3.2/paged/installation-and-configuration/chapter-22-aggregating-container-logs
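To make that concrete, here is a rough sketch of both calls issued from inside the ES pod. The pod name and the client-certificate paths under /etc/elasticsearch/secret/ are assumptions that vary by deployment and release, and the index name is simply the one from the stack trace above, so verify everything against the guide in [3] before running it:

    # delete the corrupted index entirely (removes its metadata and files on disk)
    oc exec logging-es-xxxxx -- curl -s \
        --cacert /etc/elasticsearch/secret/admin-ca \
        --cert /etc/elasticsearch/secret/admin-cert \
        --key /etc/elasticsearch/secret/admin-key \
        -XDELETE 'https://localhost:9200/logging.6f517df3-1130-11e6-b18c-0050568f7a94.2016.05.12'

    # or close it instead: the files stay on disk, but ES stops loading and serving them
    oc exec logging-es-xxxxx -- curl -s \
        --cacert /etc/elasticsearch/secret/admin-ca \
        --cert /etc/elasticsearch/secret/admin-cert \
        --key /etc/elasticsearch/secret/admin-key \
        -XPOST 'https://localhost:9200/logging.6f517df3-1130-11e6-b18c-0050568f7a94.2016.05.12/_close'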
ping - any update on this?
Can we close this? There has been no action on this bug in months.
@Eric, the Trello card can't be opened: https://trello.com/c/bN2Ipf69/9-elasticsearch-recovery. Could you please paste the link to Red Hat Knowledge Base Solution 2961981 for QE to review? Thanks!
It's also linked above in the external trackers - https://access.redhat.com/site/solutions/2961981
On page https://access.redhat.com/solutions/2961981, in the "Resolution" section, the hyperlink under "guide here" is unavailable:

    "To be able to issue either a delete or a close, you would follow the guide here"

Clicking the "guide here" hyperlink returns this error:

    Access Denied
    You do not have permission to access the page you requested.
@Eric, is this something you can help out with?
(In reply to Jeff Cantrill from comment #20)
> @Eric, is this something you can help out with?

https://access.redhat.com/documentation/en/openshift-enterprise/3.2/paged/installation-and-configuration/chapter-22-aggregating-container-logs seems to be a broken link.
(In reply to Eric Rich from comment #21)
> (In reply to Jeff Cantrill from comment #20)
> > @Eric, is this something you can help out with?
>
> https://access.redhat.com/documentation/en/openshift-enterprise/3.2/paged/installation-and-configuration/chapter-22-aggregating-container-logs seems to be a broken link.

I updated the article to point at https://access.redhat.com/documentation/en-us/openshift_enterprise/3.2/html-single/installation_and_configuration/#install-config-aggregate-logging
The document review passed; the KCS contains all the info from comment #9.

@jkaur Any hints on how to reproduce the original bug so I can evaluate the solution documented in the KCS? I want to give it a try, thanks.
Set to verified based on the document review result in comment #23
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1235
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 500 days