Bug 1670587 - ES pod deployment timeout can corrupt logging indices [NEEDINFO]
Summary: ES pod deployment timeout can corrupt logging indices
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 3.11.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 3.11.z
Assignee: Michael Burke
QA Contact: Anping Li
URL:
Whiteboard: groom
Depends On:
Blocks:
Reported: 2019-01-29 21:08 UTC by Matthew Barnes
Modified: 2020-02-06 21:56 UTC (History)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-02-06 21:56:44 UTC
Target Upstream Version:
jcantril: needinfo? (lvlcek)



Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift openshift-docs pull 17931 | 0 | None | closed | Add 3.11 migration documentation | 2020-02-26 01:40:48 UTC

Description Matthew Barnes 2019-01-29 21:08:36 UTC
Description of problem:

ElasticSearch v5 stores indices on persistent volumes differently than earlier versions (using a hash value instead of the name of the index, I believe).

When ElasticSearch is upgraded to v5, the new pods are not considered ready until the Searchguard index becomes green.  Especially on large clusters this can take a VERY long time to complete, but the rollout strategy has a 30-minute default timeout before terminating the pods and rolling back to the previous version.

This can leave ElasticSearch indices partially converted to v5 format, which earlier ElasticSearch versions can't deal with.

Further attempts to upgrade ElasticSearch to v5 can further corrupt the partially-upgraded indices, making them extremely difficult to recover.
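
To make the timing problem above concrete, here is a minimal sketch. It compares an estimated time-to-ready against a rollout timeout and builds a JSON patch that would raise the timeout on a DeploymentConfig. The strategy type, the estimate, and the idea of raising the timeout are all assumptions for illustration; this report does not state which strategy the ES deployment uses, nor does it endorse this as a fix.

```python
import json
from typing import Any, Dict

# DeploymentConfig strategy type -> name of its parameter block in the spec.
PARAMS_KEY = {"Recreate": "recreateParams", "Rolling": "rollingParams"}


def rollout_would_time_out(estimated_ready_seconds: int, timeout_seconds: int) -> bool:
    """Return True when the pods are expected to need longer than the timeout."""
    return estimated_ready_seconds > timeout_seconds


def timeout_patch(strategy_type: str, timeout_seconds: int) -> Dict[str, Any]:
    """Build a patch body that sets timeoutSeconds for the given strategy type."""
    if strategy_type not in PARAMS_KEY:
        raise ValueError(f"unsupported strategy type: {strategy_type!r}")
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    params = {"timeoutSeconds": timeout_seconds}
    return {"spec": {"strategy": {PARAMS_KEY[strategy_type]: params}}}


if __name__ == "__main__":
    # Hypothetical numbers: the 30-minute timeout against a 2-hour estimate.
    print(rollout_would_time_out(2 * 3600, 30 * 60))
    # Assumes a Recreate strategy; check the actual DeploymentConfig first.
    print(json.dumps(timeout_patch("Recreate", 3 * 3600)))
```

The resulting JSON could be passed to `oc patch` with `-p`. The DeploymentConfig name depends on the installation and is not given in this report.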


Version-Release number of selected component (if applicable):

Upgrade from v3.9.41 -> v3.10.45 -> v3.11.43

Comment 1 Jeff Cantrill 2019-01-31 15:14:42 UTC
(In reply to Matthew Barnes from comment #0)
> Description of problem:
> 
> ElasticSearch v5 stores indices on persistent volumes differently than
> earlier versions (using a hash value instead of the name of the index, I
> believe).
> 
> When ElasticSearch is upgraded to v5, the new pods are not considered ready
> until the Searchguard index becomes green.  Especially on large clusters
> this can take a VERY long time to complete, but the rollout strategy has a
> 30-minute default timeout before terminating the pods and rolling back to
> the previous version.

This is not accurate.  The state of the Searchguard index is not involved in determining readiness.  My assumption is that the pod(s) are rolled back because the storage from the previous deployment is not released by AWS and attached to the new deployment before the rollback timeout is exceeded.  This was fixed by [1].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1655675

Comment 3 Jeff Cantrill 2019-10-28 18:48:44 UTC
It's been a long time with this issue, but after talking to PM we may resolve it via documentation. Customers need to verify that the following script [1] returns no results, and they must do this validation before upgrading their ES clusters; otherwise their data may not be recoverable.

[1] https://github.com/jcantrill/cluster-logging-tools/blob/release-3.x/scripts/dots-in-field-names
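
As an illustration of the kind of check described above, here is a minimal sketch that walks a mapping-shaped dictionary and reports field names containing a dot. It is not the content of the script at [1]. The function name, the sample index, type, and field names are all hypothetical. The input layout (index, then `mappings`, then type, then `properties`) is an assumption about how pre-v5 mappings are returned.

```python
from typing import Any, Dict, Iterator, List, Tuple


def fields_with_dots(mappings: Dict[str, Any]) -> Iterator[Tuple[str, str, str]]:
    """Yield (index, doc_type, field_path) for each field whose name contains a dot.

    Path segments are joined with "/" so that dots inside a name stay visible.
    """

    def walk(properties: Dict[str, Any], prefix: List[str]) -> Iterator[str]:
        for name, spec in properties.items():
            path = prefix + [name]
            if "." in name:
                yield "/".join(path)
            if isinstance(spec, dict):
                # Object fields nest their children under "properties".
                yield from walk(spec.get("properties", {}), path)

    for index, body in mappings.items():
        for doc_type, type_body in body.get("mappings", {}).items():
            for path in walk(type_body.get("properties", {}), []):
                yield index, doc_type, path


# Hypothetical sample input, shaped like a mapping response (assumed layout).
SAMPLE = {
    "project.demo.2019.01.29": {
        "mappings": {
            "logs": {
                "properties": {
                    "message": {"type": "string"},
                    "kubernetes": {
                        "properties": {
                            "labels": {
                                "properties": {"app.version": {"type": "string"}}
                            }
                        }
                    },
                }
            }
        }
    }
}

if __name__ == "__main__":
    for hit in fields_with_dots(SAMPLE):
        print(hit)
```

An empty result from a check like this would correspond to the "returns no results" condition mentioned above.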

Comment 4 Jeff Cantrill 2019-10-29 18:42:07 UTC
@lukas,

I'm looking to turn this into a doc issue.  I expect to reference the content of the script in #c3 and want to reference the relevant ES changes.  I found the mapping explosion section [1], but I don't see a reference to dots in field names. Do you have a link?

[1] https://www.elastic.co/guide/en/elasticsearch/reference/5.6/breaking_50_mapping_changes.html#breaking_50_mapping_changes

Comment 5 Michael Burke 2020-01-14 17:34:28 UTC
Documentation PR:
https://github.com/openshift/openshift-docs/pull/17931

Comment 6 Jeff Cantrill 2020-01-14 17:38:30 UTC
Moving to ON_QA for validation, which may have already occurred given that QE has been involved in reviewing the docs.

Comment 7 Anping Li 2020-01-15 02:23:14 UTC
LGTM
