1670587 – ES pod deployment timeout can corrupt logging indices

Summary: ES pod deployment timeout can corrupt logging indices

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Logging
Sub Component:
Version:	3.11.0
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Target Release:	3.11.z
Assignee:	Michael Burke
QA Contact:	Anping Li
Docs Contact:
URL:
Whiteboard:	groom
Depends On:
Blocks:

Reported:	2019-01-29 21:08 UTC by Matthew Barnes
Modified:	2023-09-14 04:45 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-02-06 21:56:44 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift openshift-docs pull 17931	0	None	closed	Add 3.11 migration documentation	2020-02-26 01:40:48 UTC

Description Matthew Barnes 2019-01-29 21:08:36 UTC

Description of problem:

ElasticSearch v5 stores indices on persistent volumes differently than earlier versions (using a hash value instead of the name of the index, I believe).

When ElasticSearch is upgraded to v5, the new pods are not considered ready until the Searchguard index becomes green.  Especially on large clusters this can take a VERY long time to complete, but the rollout strategy has a 30-minute default timeout before terminating the pods and rolling back to the previous version.

This can leave ElasticSearch indices partially converted to v5 format, which earlier ElasticSearch versions can't deal with.

Subsequent attempts to upgrade ElasticSearch to v5 can corrupt the partially-upgraded indices even more, making them extremely difficult to recover.
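
To make the timeout concrete, here is a minimal sketch of a patch body that raises the rollout timeout on a DeploymentConfig. It assumes the Elasticsearch DeploymentConfig uses the Recreate strategy and that the timeout lives in `spec.strategy.recreateParams.timeoutSeconds`. The helper name `build_timeout_patch` and the four-hour value are illustrative only, and whether a longer timeout actually avoids the rollback depends on why the rollout is slow.

```python
import json

# Map each DeploymentConfig strategy type to the field holding its parameters.
_PARAMS_KEY = {"Recreate": "recreateParams", "Rolling": "rollingParams"}


def build_timeout_patch(timeout_seconds, strategy="Recreate"):
    """Build a merge-patch body that sets the rollout timeout (hypothetical helper)."""
    if strategy not in _PARAMS_KEY:
        raise ValueError("unsupported strategy: %r" % (strategy,))
    timeout_seconds = int(timeout_seconds)
    if timeout_seconds <= 0:
        raise ValueError("timeout_seconds must be positive")
    return {
        "spec": {
            "strategy": {
                _PARAMS_KEY[strategy]: {"timeoutSeconds": timeout_seconds}
            }
        }
    }


# Illustrative value only: four hours instead of the 30 minutes described above.
print(json.dumps(build_timeout_patch(4 * 60 * 60)))
```

The printed JSON could be supplied to something like `oc patch dc/<es-dc-name> --type=merge -p '<json>'`, where `<es-dc-name>` is a placeholder for the actual DeploymentConfig name.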


Version-Release number of selected component (if applicable):

Upgrade from v3.9.41 -> v3.10.45 -> v3.11.43

Comment 1 Jeff Cantrill 2019-01-31 15:14:42 UTC

(In reply to Matthew Barnes from comment #0)
> Description of problem:
> 
> ElasticSearch v5 stores indices on persistent volumes differently than
> earlier versions (using a hash value instead of the name of the index, I
> believe).
> 
> When ElasticSearch is upgraded to v5, the new pods are not considered ready
> until the Searchguard index becomes green.  Especially on large clusters
> this can take a VERY long time to complete, but the rollout strategy has a
> 30-minute default timeout before terminating the pods and rolling back to
> the previous version.

This is not accurate.  The state of the Searchguard index is not involved in determining readiness.  My assumption is that the pod(s) are rolled back because the storage from the previous deployment is not released by AWS and attached to the new deployment before the rollout timeout is exceeded.  This was fixed by [1].

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1655675

Comment 3 Jeff Cantrill 2019-10-28 18:48:44 UTC

It's been a long time with this issue, but after talking to PM we may resolve it via documentation. We need to verify that running the script in [1] returns no results, and to document that customers must run this validation before upgrading their ES clusters; otherwise their data may not be recoverable.

[1] https://github.com/jcantrill/cluster-logging-tools/blob/release-3.x/scripts/dots-in-field-names
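
As an illustration of what such a check looks for, here is a minimal sketch that scans an index mapping for field names containing dots. This is not the content of the script in [1]. It assumes the input has the shape of a pre-5.x `GET /_mapping` response (index, then `mappings`, then type, then `properties`), and the function names and sample data are made up for the example.

```python
def _walk(properties, parents, index, doc_type, results):
    """Recursively collect fields whose own name contains a dot."""
    for name, spec in properties.items():
        path = parents + [name]
        if "." in name:
            results.append((index, doc_type, tuple(path)))
        nested = spec.get("properties") if isinstance(spec, dict) else None
        if isinstance(nested, dict):
            _walk(nested, path, index, doc_type, results)


def find_dotted_fields(mapping_response):
    """Return sorted (index, type, field path) tuples for dotted field names."""
    results = []
    for index, body in mapping_response.items():
        for doc_type, type_mapping in body.get("mappings", {}).items():
            _walk(type_mapping.get("properties", {}), [], index, doc_type, results)
    return sorted(results)


# Hypothetical mapping with one offending field under kubernetes.labels.
sample = {
    "project.app.2019.01.29": {
        "mappings": {
            "doc": {
                "properties": {
                    "message": {"type": "string"},
                    "kubernetes": {
                        "properties": {
                            "labels": {
                                "properties": {
                                    "app.version": {"type": "string"},
                                    "tier": {"type": "string"},
                                }
                            }
                        }
                    },
                }
            }
        }
    }
}

for hit in find_dotted_fields(sample):
    print(hit)
```

An empty result from a check like this is the condition described above: no dotted field names were found in the mappings that were scanned.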

Comment 4 Jeff Cantrill 2019-10-29 18:42:07 UTC

@lukas,

I'm looking to turn this into a doc issue.  I expect to reference the content of the script in comment 3 and want to reference the relevant ES changes.  I found the mapping explosion section [1], but I don't see a reference to dots in field names. Do you have a link?

[1] https://www.elastic.co/guide/en/elasticsearch/reference/5.6/breaking_50_mapping_changes.html#breaking_50_mapping_changes

Comment 5 Michael Burke 2020-01-14 17:34:28 UTC

Documentation PR:
https://github.com/openshift/openshift-docs/pull/17931

Comment 6 Jeff Cantrill 2020-01-14 17:38:30 UTC

Moving to ON_QA for validation, which may have already occurred given that QE has been involved in reviewing the docs.

Comment 7 Anping Li 2020-01-15 02:23:14 UTC

LGTM

Comment 8 Michael Burke 2020-02-06 21:56:44 UTC

Changes are live:
https://docs.openshift.com/container-platform/3.11/upgrading/automated_upgrades.html#upgrading-efk-logging-stack
https://access.redhat.com/documentation/en-us/openshift_container_platform/3.11/html/upgrading_clusters/install-config-upgrading-automated-upgrades#upgrading-efk-logging-stack

Comment 9 Red Hat Bugzilla 2023-09-14 04:45:53 UTC

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days
