Bug 1339888 - Recover EFK after CorruptIndexException of ElasticSearch [NEEDINFO]
Summary: Recover EFK after CorruptIndexException of ElasticSearch
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.2.0
Hardware: Unspecified
OS: Unspecified
high
high
Target Milestone: ---
: 3.2.1
Assignee: Jeff Cantrill
QA Contact: Xia Zhao
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-05-26 05:31 UTC by Jaspreet Kaur
Modified: 2019-11-14 08:10 UTC (History)
10 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
undefined
Clone Of:
Environment:
Last Closed: 2017-05-18 09:26:44 UTC
Target Upstream Version:
xiazhao: needinfo? (jkaur)


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 2961981 0 None None None 2017-03-09 16:25:02 UTC
Red Hat Product Errata RHBA-2017:1235 0 normal SHIPPED_LIVE OpenShift Container Platform 3.5, 3.4, 3.3, and 3.1 bug fix update 2017-05-18 13:15:52 UTC

Description Jaspreet Kaur 2016-05-26 05:31:32 UTC
Description of problem: Elastic search logs is giving below errors :

016-05-13 06:56:40,280][WARN ][indices.cluster          ] [Grey Gargoyle] [[logging.6f517df3-1130-11e6-b18c-0050568f7a94.2016.05.12][1]] marking and sending shard failed due to [failed recovery]
23365	org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [logging.6f517df3-1130-11e6-b18c-0050568f7a94.2016.05.12][1] failed to fetch index version after copying it over
23366		at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:157)
23367		at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
23368		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
23369		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
23370		at java.lang.Thread.run(Thread.java:745)
23371	Caused by: org.apache.lucene.index.CorruptIndexException: [logging.6f517df3-1130-11e6-b18c-0050568f7a94.2016.05.12][1] Preexisting corrupted index [corrupted_Lbq2oiP6StWh1dKpbLvNcg] caused by: CorruptIndexException[Invalid fieldsStream maxPointer (file truncated?): maxPointer=1205, length=0]
23372	org.apache.lucene.index.CorruptIndexException: Invalid fieldsStream maxPointer (file truncated?): maxPointer=1205, length=0
23373		at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.<init>(CompressingStoredFieldsReader.java:135)
23374		at org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsReader(CompressingStoredFieldsFormat.java:113)
23375		at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:133)
23376		at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
23377		at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
23378		at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:239)
23379		at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:109)
23380		at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:421)
23381		at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:297)
23382		at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:272)
23383		at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:262)
23384		at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:171)
23385		at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:118)
23386		at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
23387		at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
23388		at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
23389		at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:568)
23390		at org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:565)
23391		at org.elasticsearch.index.shard.IndexShard$EngineRefresher$1.run(IndexShard.java:1095)
23392		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
23393		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
23394		at java.lang.Thread.run(Thread.java:745)
23395	
23396		at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:547)
23397		at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:528)
23398		at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:120)
23399		... 4 more
23400	[2016-05-13 06:56:40,297][WARN ][indices.cluster          ] [Grey Gargoyle] [[swagger-search.8350d0f8-fbf6-11e5-8b2d-0050568f56a5.2016.05.12][3]] marking and sending shard failed due to [failed recovery]
23401	org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [swagger-search.8350d0f8-fbf6-11e5-8b2d-0050568f56a5.2016.05.12][3] failed to fetch index version after copying it over
23402		at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:157)
23403		at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:112)
23404		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
23405		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
23406		at java.lang.Thread.run(Thread.java:745)
23407	Caused by: org.apache.lucene.index.CorruptIndexException: [swagger-search.8350d0f8-fbf6-11e5-8b2d-0050568f56a5.2016.05.12][3] Preexisting corrupted index [corrupted_ZpMUZgkvShq8jXiHH15Jhw] caused by: CorruptIndexException[Invalid fieldsStream maxPointer (file truncated?): maxPointer=11850, length=8166]
23408	org.apache.lucene.index.CorruptIndexException: Invalid fieldsStream maxPointer (file truncated?): maxPointer=11850, length=8166
23409		at org.apache.lucene.codecs.compressing.CompressingStoredFieldsReader.<init>(CompressingStoredFieldsReader.java:135)
23410		at org.apache.lucene.codecs.compressing.CompressingStoredFieldsFormat.fieldsReader(CompressingStoredFieldsFormat.java:113)
23411		at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:133)
23412		at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:108)
23413		at org.apache.lucene.index.ReadersAndUpdates.getReader(ReadersAndUpdates.java:145)
23414		at org.apache.lucene.index.ReadersAndUpdates.getReadOnlyClone(ReadersAndUpdates.java:239)
23415		at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:109)
23416		at org.apache.lucene.index.IndexWriter.getReader(IndexWriter.java:421)
23417		at org.apache.lucene.index.StandardDirectoryReader.doOpenFromWriter(StandardDirectoryReader.java:297)
23418		at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:272)
23419		at org.apache.lucene.index.StandardDirectoryReader.doOpenIfChanged(StandardDirectoryReader.java:262)
23420		at org.apache.lucene.index.DirectoryReader.openIfChanged(DirectoryReader.java:171)
23421		at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:118)
23422		at org.apache.lucene.search.SearcherManager.refreshIfNeeded(SearcherManager.java:58)
23423		at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:176)
23424		at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:253)
23425		at org.elasticsearch.index.engine.InternalEngine.refresh(InternalEngine.java:568)
23426		at org.elasticsearch.index.shard.IndexShard.refresh(IndexShard.java:565)
23427		at org.elasticsearch.index.shard.IndexShard$EngineRefresher$1.run(IndexShard.java:1095)
23428		at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
23429		at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
23430		at java.lang.Thread.run(Thread.java:745)
23431	
23432		at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:547)
23433		at org.elasticsearch.index.store.Store.failIfCorrupted(Store.java:528)
23434		at org.elasticsearch.index.gateway.local.LocalIndexShardGateway.recover(LocalIndexShardGateway.java:120)
23435		... 4 more

There is no good way for how to recover it and what caused the issue ?


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results: There is no way to recover efk after CorruptIndexException or failed with unknown reason


Expected results: There should be an easy way to recover efk from different situations or errors.


Additional info:

Comment 4 Luke Meyer 2016-06-02 02:07:59 UTC
There is a solution that involves sshing in and nuking the affected index on the filesystem. I just don't know the specifics at this time. Eric haven't you described this before somewhere?

You can also just delete the storage entirely and start over, but that's probably not a very palatable solution.

I think this condition is caused by an unclean shutdown of ES while it is mid-write, which leaves the index on the disk in a bad state. ES is apparently less than perfect at cleaning these up and sometimes manual intervention is necessary. Possibly we can provide a tool to make this easy or even automatic.

Comment 5 ewolinet 2016-06-02 02:36:30 UTC
That was a slightly different stack trace, but that was covered in this support case: https://access.redhat.com/support/cases/#/case/01569315

There wasn't a clear reason as to why this occurs, even from Elastic.  So it may be due to things from the disk running out of storage, the node receiving a power outage during a commit or another disaster scenario.

There was a process in the case where recovery failed, the recommended solution was to delete '.recovery' files:

If you `oc rsh` into the ES pod they should be able delete the files (based on the index in the exception) Example: /elasticsearch/persistent/logging-es/data/logging-es/nodes/0/indices/.operations.2015.12.17/3/translog/*.recovery 

Alternatively, if there is a PV you should also be able to navigate via the share: e.g. /my_pvc_host_mount/logging-es/data/logging-es/nodes/0/indices/.operations.2015.12.17/3/translog/*.recovery 

There may be other .recovery files, this should resolve to the path of the one that we see in the stack trace. If you cannot find any at that path, I would recommend doing a find on `*.recovery` files to find and delete the files.

Comment 7 Luke Meyer 2016-10-12 15:32:13 UTC
Jaspreet, did comment 5 above help you?

If that worked for you, we should probably ensure there's a kbase article about it.

Comment 8 Jaspreet Kaur 2016-10-13 05:05:39 UTC
Customer was not able to find recovery files hence we were not able to conclude the resolution.

Comment 9 ewolinet 2016-10-13 15:20:08 UTC
Expanding further given we came across this again with another customer and was able to work through it.

We've seen this happen in cases where ES hits an Out of Memory Exception. 

When we get to this point during recovery, Elasticsearch should still be up and running despite seeing this stack trace, it is just letting you know that it was unable to recover that index/shard and the data may not be available to query on.

What is happening is that a shard for the specific index is unable to recover due to the prior bad state. One thing that can be done is to issue a delete for that specific index [1], this will remove the metadata and the files from disk for that index. Another thing that can be done is to close the index [2], this will make it so Elasticsearch will not try to read the data from disk and will not load information for it into memory but it will still be around on disk. In this case the data will not be able to be queried on either.

To be able to issue either a delete or a close, you would follow the guide here [3] to issue administrative commands to Elasticsearch under chapter 22.8 Performing Elasticsearch Maintenance Operations.


[1] https://www.elastic.co/guide/en/elasticsearch/reference/1.5/indices-delete-index.html
[2] https://www.elastic.co/guide/en/elasticsearch/reference/1.5/indices-open-close.html
[3] https://access.redhat.com/documentation/en/openshift-enterprise/3.2/paged/installation-and-configuration/chapter-22-aggregating-container-logs

Comment 12 Rich Megginson 2016-11-08 21:46:17 UTC
ping - any update on this?

Comment 13 Rich Megginson 2016-12-06 14:41:44 UTC
Can we close this?  There has been no action on this bug in months.

Comment 17 Xia Zhao 2017-03-10 07:45:00 UTC
@Eric,

The trello card can't be opened: https://trello.com/c/bN2Ipf69/9-elasticsearch-recovery, could you please paste the link of this Red Hat Knowledge Base (Solution) 2961981 for QE to review? Thanks!

Comment 18 ewolinet 2017-03-10 15:41:30 UTC
It's also linked above in the external trackers - 
https://access.redhat.com/site/solutions/2961981

Comment 19 Xia Zhao 2017-03-13 06:48:20 UTC
On page https://access.redhat.com/solutions/2961981, part "Resolution", the hyper link under "guide here" is unavailable:

     To be able to issue either a delete or a close, you would follow the guide here 

Get this error when click into hyperlink "guide here":

     Access Denied

     You do not have permission to access the page you requested.

Comment 20 Jeff Cantrill 2017-03-13 13:41:10 UTC
@Eric is something you can help out with?

Comment 21 Eric Rich 2017-03-13 13:45:22 UTC
(In reply to Jeff Cantrill from comment #20)
> @Eric is something you can help out with?

https://access.redhat.com/documentation/en/openshift-enterprise/3.2/paged/installation-and-configuration/chapter-22-aggregating-container-logs sees to be a broken link.

Comment 22 Eric Rich 2017-03-13 13:47:36 UTC
(In reply to Eric Rich from comment #21)
> (In reply to Jeff Cantrill from comment #20)
> > @Eric is something you can help out with?
> 
> https://access.redhat.com/documentation/en/openshift-enterprise/3.2/paged/
> installation-and-configuration/chapter-22-aggregating-container-logs sees to
> be a broken link.

I updated the article to point at https://access.redhat.com/documentation/en-us/openshift_enterprise/3.2/html-single/installation_and_configuration/#install-config-aggregate-logging

Comment 23 Xia Zhao 2017-03-14 02:48:37 UTC
The document review was passed, KCS contains all info in comment #9

@jkaur Any hints on how to reproduce the original bug to evaluate the solution documented in KCS? I want to give it a try, thanks.

Comment 24 Xia Zhao 2017-03-20 09:29:05 UTC
Set to verified based on the document review result in comment #23

Comment 26 errata-xmlrpc 2017-05-18 09:26:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1235


Note You need to log in before you can comment on or make changes to this bug.