Bug 1670587

Summary: ES pod deployment timeout can corrupt logging indices
Product: OpenShift Container Platform
Reporter: Matthew Barnes <mbarnes>
Component: Logging
Assignee: Michael Burke <mburke>
Status: CLOSED CURRENTRELEASE
QA Contact: Anping Li <anli>
Severity: unspecified
Priority: unspecified
Version: 3.11.0
CC: aos-bugs, cvogel, lvlcek, mburke, rmeggins
Target Milestone: ---
Keywords: OpsBlocker
Target Release: 3.11.z
Hardware: Unspecified
OS: Unspecified
Whiteboard: groom
Doc Type: No Doc Update
Type: Bug
Last Closed: 2020-02-06 21:56:44 UTC

Description Matthew Barnes 2019-01-29 21:08:36 UTC
Description of problem:

ElasticSearch v5 stores indices on persistent volumes differently than earlier versions (using a hash value instead of the name of the index, I believe).

When ElasticSearch is upgraded to v5, the new pods are not considered ready until the Searchguard index becomes green.  Especially on large clusters this can take a VERY long time to complete, but the rollout strategy has a 30-minute default timeout before terminating the pods and rolling back to the previous version.

This can leave ElasticSearch indices partially converted to v5 format, which earlier ElasticSearch versions can't deal with.

Further attempts to upgrade ElasticSearch to v5 can corrupt the partially-upgraded indices even more, making them extremely difficult to recover.


Version-Release number of selected component (if applicable):

Upgrade from v3.9.41 -> v3.10.45 -> v3.11.43
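
One possible mitigation for the 30-minute rollout timeout described above is to raise the Recreate-strategy timeout on the Elasticsearch DeploymentConfigs before starting the upgrade. A minimal sketch only, assuming the 3.x logging namespace and the component=es label, both of which may differ on a given cluster:

  # Namespace and label selector are assumptions; adjust for your deployment.
  NS=logging

  # Show the current rollout timeout on each ES DeploymentConfig.
  oc get dc -n "$NS" -l component=es \
    -o custom-columns=NAME:.metadata.name,TIMEOUT:.spec.strategy.recreateParams.timeoutSeconds

  # Raise the Recreate-strategy timeout (here to 2 hours) so a slow recovery
  # does not trigger an automatic rollback partway through the upgrade.
  for dc in $(oc get dc -n "$NS" -l component=es -o name); do
    oc patch "$dc" -n "$NS" \
      -p '{"spec":{"strategy":{"recreateParams":{"timeoutSeconds":7200}}}}'
  done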

Comment 1 Jeff Cantrill 2019-01-31 15:14:42 UTC
(In reply to Matthew Barnes from comment #0)
> Description of problem:
> 
> ElasticSearch v5 stores indices on persistent volumes differently than
> earlier versions (using a hash value instead of the name of the index, I
> believe).
> 
> When ElasticSearch is upgraded to v5, the new pods are not considered ready
> until the Searchguard index becomes green.  Especially on large clusters
> this can take a VERY long time to complete, but the rollout strategy has a
> 30-minute default timeout before terminating the pods and rolling back to
> the previous version.

This is not accurate.  The state of the Searchguard index is not involved in determining readiness.  My assumption is that the pod(s) are rolled back because the storage from the previous deployment is not released by AWS and reattached to the new deployment before the rollback timeout is exceeded.  This was fixed by [1].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1655675
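
If that is the failure mode, the attach delay should be visible as events on the new pods while the rollout is in progress. A rough way to check, assuming the same namespace and label as elsewhere in this bug:

  # Sketch only -- namespace and label are assumptions.
  NS=logging

  # Volume attach/mount delays surface as FailedAttachVolume / FailedMount
  # events on the new Elasticsearch pods during the rollout.
  for pod in $(oc get pods -n "$NS" -l component=es -o name); do
    echo "== $pod"
    oc describe "$pod" -n "$NS" | grep -iE 'FailedAttachVolume|FailedMount' || true
  done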

Comment 3 Jeff Cantrill 2019-10-28 18:48:44 UTC
It's been a long time with this issue, but after talking to PM we may resolve it via documentation. We need to document that running [1] must return no results, and that customers must validate this prior to upgrading their ES clusters; otherwise their data may not be recoverable.

[1] https://github.com/jcantrill/cluster-logging-tools/blob/release-3.x/scripts/dots-in-field-names
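
For reference, the script has to be run against the cluster's Elasticsearch API. A minimal sketch of reaching that API and confirming overall index health before attempting the upgrade; the namespace, label, container name, and admin cert paths below are assumptions based on the 3.x aggregated logging deployment:

  # Namespace, label, container name, and cert paths are assumptions.
  NS=logging
  pod=$(oc get pods -n "$NS" -l component=es -o jsonpath='{.items[0].metadata.name}')

  # Query the ES API from inside the pod using the admin certs; only proceed
  # with the upgrade once the cluster is healthy and the referenced script
  # reports no dotted field names.
  oc exec -n "$NS" "$pod" -c elasticsearch -- \
    curl -s --cacert /etc/elasticsearch/secret/admin-ca \
         --cert /etc/elasticsearch/secret/admin-cert \
         --key /etc/elasticsearch/secret/admin-key \
         'https://localhost:9200/_cluster/health?pretty'

  oc exec -n "$NS" "$pod" -c elasticsearch -- \
    curl -s --cacert /etc/elasticsearch/secret/admin-ca \
         --cert /etc/elasticsearch/secret/admin-cert \
         --key /etc/elasticsearch/secret/admin-key \
         'https://localhost:9200/_cat/indices?v'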

Comment 4 Jeff Cantrill 2019-10-29 18:42:07 UTC
@lukas,

I'm looking to turn this into a doc issue.  I expect to reference the content of the script in #c3 and want to cite the relevant ES changes.  I found the mapping explosion limits [1], but I don't see a reference to dots in field names. Do you have a link?

[1] https://www.elastic.co/guide/en/elasticsearch/reference/5.6/breaking_50_mapping_changes.html#breaking_50_mapping_changes

Comment 5 Michael Burke 2020-01-14 17:34:28 UTC
Documentation PR:
https://github.com/openshift/openshift-docs/pull/17931

Comment 6 Jeff Cantrill 2020-01-14 17:38:30 UTC
Moving to ON_QA for validation, which may have already occurred given that QE has been involved in reviewing the docs.

Comment 7 Anping Li 2020-01-15 02:23:14 UTC
LGTM

Comment 9 Red Hat Bugzilla 2023-09-14 04:45:53 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days