Bug 1868300 - ElasticSearch operator upgrade created new PVCs
Summary: ElasticSearch operator upgrade created new PVCs
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Logging
Version: 4.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ---
Target Release: 4.6.0
Assignee: ewolinet
QA Contact: Qiaoling Tang
URL:
Whiteboard: logging-exploration osd-45-logging
Depends On:
Blocks: 1881957
 
Reported: 2020-08-12 09:11 UTC by Jonas Nordell
Modified: 2024-06-13 22:56 UTC
CC List: 34 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Cause: Due to corrupted etcd garbage collection (caused by setting the owner references of cluster-level objects to namespaced objects), the elasticsearch CR gets removed by garbage collection and then recreated (by CLO). Consequence: EO treated the recreated CR as a new cluster, which meant that new UUIDs were issued and new PVCs were created. Fix: EO now looks for PVCs with a particular label (which it applies to the PVCs it creates) and reuses those UUIDs if they exist and match the correct roles and node count. If they do not exist, EO treats this as a new cluster. The label on the PVCs is "logging-cluster: <the-name-of-the-cluster>". Result: EO reuses previously used PVCs in the case that a CR is deleted and then recreated.
Clone Of:
Cloned to: 1881957
Environment:
Last Closed: 2020-10-27 15:09:55 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Github openshift elasticsearch-operator pull 471 0 None closed Bug 1868300: Updating EO to adopt existing uuids from pvcs or deployment objects if nil in CR 2021-02-16 02:52:27 UTC
Red Hat Knowledge Base (Solution) 5323141 0 None None None 2020-08-26 19:17:48 UTC
Red Hat Product Errata RHBA-2020:4198 0 None None None 2020-10-27 15:10:11 UTC

Description Jonas Nordell 2020-08-12 09:11:10 UTC
Description of problem:

When the ElasticSearch operator was upgraded, it created new PVCs and tried to use them instead of the old ones.

This seems to be a regression of https://bugzilla.redhat.com/show_bug.cgi?id=1756794. Perhaps this should be closed as a duplicate of that bug?

Version-Release number of selected component (if applicable):

oc version
Client Version: 4.5.0-202007240519.p0-b66f2d3
Server Version: 4.5.4
Kubernetes Version: v1.18.3+012b3ec

Elastic operator: 4.5.0-202007240519.p0

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:

ElasticSearch is trying to use the new PVCs instead of the old ones

Expected results:

ElasticSearch should always use existing PVCs if there are any


Additional info:

Comment 2 ewolinet 2020-08-12 22:11:17 UTC
The only time it would create different PVCs is if the UUID that is specified for the node in the elasticsearch CR changed.

The name of a pvc is defined as:

"<cluster name>-<node name>"
cluster name is the same as your elasticsearch CR object (if it was created by CLO this is "elasticsearch")

node name is
(redundantly) "<cluster name>-<node roles>-<node UUID>-<node replica number>"

node roles consists of up to three letters to denote what role the node has defined for it:
c - client
d - data
m - master

The node UUID is generated if it is not provided in the CR; after that, EO checks against your CR's status to ensure the UUID isn't changed (you should see a status condition denoting this too)
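
For example, taking one of the PVC names reported later in this bug, the name decomposes as follows (a quick way to list them, assuming the default openshift-logging namespace):

~~~
oc get pvc -n openshift-logging
# e.g. elasticsearch-elasticsearch-cdm-7bg4p53i-1
#      cluster name: elasticsearch
#      node name:    elasticsearch-cdm-7bg4p53i-1
#                    = <cluster name>-<roles c,d,m>-<UUID 7bg4p53i>-<replica 1>
~~~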

Can you verify whether the UUID changed or the elasticsearch CR was recreated?
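
(For reference, a hedged way to check both from the CLI, assuming the default CR name and namespace used by CLO; the exact field names holding the UUID may vary by EO version:)

~~~
# Compare the UUIDs recorded in the CR with the suffixes in the PVC names
oc get elasticsearch elasticsearch -n openshift-logging -o yaml | grep -i uuid
# A recent creationTimestamp (relative to when logging was installed) suggests the CR was recreated
oc get elasticsearch elasticsearch -n openshift-logging -o jsonpath='{.metadata.creationTimestamp}'
~~~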

Comment 5 ewolinet 2020-08-13 15:58:25 UTC
To clarify: was the elasticsearch CR recreated automatically during the upgrade, or did the customer recreate their CR as part of this?
Are any of the old elasticsearch deployments still around? If not, it sounds like the old CR was deleted (if so, was that done automatically?)

What version did the customer upgrade from to get to 4.5.0?

Comment 7 Nicolas Nosenzo 2020-08-14 10:08:47 UTC
I'm currently seeing the same issue in a cluster upgrade (4.4.11 to 4.4.14 and to 4.5.4). The events:

~~~
2020-08-05T06:42:52Z   2020-08-05T06:42:52Z   1        elasticsearch-operator.4.4.0-202007240028.p0.16284c2107a3fde8             ClusterServiceVersion   <none>                     Normal    ComponentUnhealthy operator-lifecycle-manager                                                                
                     installing: deployment changed old hash=68dcb67f7c, new hash=6687c7ccb8
~~~


Then, some time after:
~~~
2020-08-05T07:20:09Z   2020-08-05T07:20:09Z   1        elasticsearch-elasticsearch-cdm-7bg4p53i-2.16284e29ddddcc46                        PersistentVolumeClaim   <none>                                                         Normal    ProvisioningSucceeded                        persistentvolume-controller                                                                                     Successfully provisioned volume pvc-02a6387c-71eb-4727-8af3-e55efc33b6b2 using kubernetes.io/vsphere-volume
2020-08-05T07:20:09Z   2020-08-05T07:20:09Z   1        elasticsearch-elasticsearch-cdm-7bg4p53i-3.16284e29deb5a07a                        PersistentVolumeClaim   <none>                                                         Normal    ProvisioningSucceeded                        persistentvolume-controller                                                                                     Successfully provisioned volume pvc-c824f73b-1f52-4222-ae41-5a9aaa417aad using kubernetes.io/vsphere-volume
2020-08-05T07:20:09Z   2020-08-05T07:20:09Z   1        elasticsearch-elasticsearch-cdm-7bg4p53i-1.16284e29e460ed2b                        PersistentVolumeClaim   <none>                                                         Normal    ProvisioningSucceeded                        persistentvolume-controller                                                                                     Successfully provisioned volume pvc-726fa5f9-cc24-4493-b0e3-8eeb5441030c using kubernetes.io/vsphere-volume
~~~

During one of the upgrade attempts the deployment objects were recreated, as were the PVCs (many times):

PODs:
elasticsearch-cdm-7bg4p53i-1-6bdbfd8684-kjhw2   2/2     Running   0          5h52m
elasticsearch-cdm-7bg4p53i-2-7fdb9b5668-k6z79   2/2     Running   0          5h50m
elasticsearch-cdm-7bg4p53i-3-67c698955c-crxl7   2/2     Running   0          5h49m

Deployments:
NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator       1/1     1            1           22d
elasticsearch-cdm-7bg4p53i-1   1/1     1            1           5h52m
elasticsearch-cdm-7bg4p53i-2   1/1     1            1           5h51m
elasticsearch-cdm-7bg4p53i-3   1/1     1            1           5h50m

PVCs:
NAME                                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
elasticsearch-elasticsearch-cdm-7bg4p53i-1   Bound    pvc-726fa5f9-cc24-4493-b0e3-8eeb5441030c   300Gi      RWO            thin           5h52m
elasticsearch-elasticsearch-cdm-7bg4p53i-2   Bound    pvc-02a6387c-71eb-4727-8af3-e55efc33b6b2   300Gi      RWO            thin           5h52m
elasticsearch-elasticsearch-cdm-7bg4p53i-3   Bound    pvc-c824f73b-1f52-4222-ae41-5a9aaa417aad   300Gi      RWO            thin           5h52m
elasticsearch-elasticsearch-cdm-fydsfsvg-1   Bound    pvc-1ae15210-e99c-4f47-901d-5333e3f17be4   300Gi      RWO            thin           6h25m
elasticsearch-elasticsearch-cdm-fydsfsvg-2   Bound    pvc-6fee0e0d-e6ad-4e11-b76c-6dfeaebbc1ce   300Gi      RWO            thin           6h25m
elasticsearch-elasticsearch-cdm-fydsfsvg-3   Bound    pvc-7a33b181-b195-4635-8693-bfe205018555   300Gi      RWO            thin           6h25m
elasticsearch-elasticsearch-cdm-tiyzfpzh-1   Bound    pvc-244bfab4-2ff2-4d1b-94d7-f6bb520382f0   300Gi      RWO            thin           22d
elasticsearch-elasticsearch-cdm-tiyzfpzh-2   Bound    pvc-e5e204f5-e88a-40ff-b68e-2c61d31e52b0   300Gi      RWO            thin           22d
elasticsearch-elasticsearch-cdm-tiyzfpzh-3   Bound    pvc-3c1c37f0-100b-47da-b93e-481a41ca98e6   300Gi      RWO            thin           22d


Note: I tried to reproduce this myself but could not. Regardless, it seems to be 100% reproducible in the customer's cluster.
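
(For anyone triaging a similar cluster, a hedged way to spot the successive PVC generations is to sort them by creation time; the UUID segment in the name changes with each recreation:)

~~~
oc get pvc -n openshift-logging --sort-by=.metadata.creationTimestamp
~~~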

Comment 8 ewolinet 2020-08-17 16:38:36 UTC
Do you notice whether the clusterlogging CR is being deleted and recreated? It is the owner of the elasticsearch CR (if you are using CLO), and that could be the root cause of the elasticsearch CR being deleted and recreated.
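
(A hedged way to check the ownership chain and whether either CR was recently recreated, assuming the default names created by CLO:)

~~~
# Expected owner of the elasticsearch CR is the clusterlogging CR when CLO manages it
oc get elasticsearch elasticsearch -n openshift-logging -o jsonpath='{.metadata.ownerReferences[*].kind}'
# Creation timestamps hint at whether the CRs have been deleted and recreated
oc get clusterlogging instance -n openshift-logging -o jsonpath='{.metadata.creationTimestamp}'
~~~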

> When you say what version did the customer upgrade from, do you mean OCP or logging?

What logging version did they upgrade from?

Comment 10 ewolinet 2020-08-19 15:30:12 UTC
@Jonas,

Can you attach the startingCSV and currentCSV?

I need to confirm what is configured as the owner of the ClusterLogging CR. My understanding is that it shouldn't be tied to the operator deployment, but if it is, that may be what is causing what we're seeing here.
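
(A hedged way to pull those values, assuming the default subscription namespaces mentioned in this bug; startingCSV lives in the Subscription spec and currentCSV in its status:)

~~~
oc get subscription -n openshift-logging -o yaml | grep -E 'startingCSV|currentCSV'
oc get subscription -n openshift-operators-redhat -o yaml | grep -E 'startingCSV|currentCSV'
~~~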

Comment 12 ewolinet 2020-08-20 15:20:35 UTC
Hi Jonas,

This is the subscription for EO; can you provide the one for CLO?
Can you also provide the output of "oc get clusterlogging instance -n openshift-logging -o yaml" from the customer, please?

Comment 14 Jeff Cantrill 2020-08-21 14:11:02 UTC
Moving to UpcomingSprint for future evaluation

Comment 21 ewolinet 2020-08-26 21:31:07 UTC
For the clusters that you said upgraded from 4.4.11 to 4.4.14 and then to 4.5.4, what mechanism did you use to upgrade from 4.4.14 -> 4.5.4?

I tried to recreate it but was unable to, following these steps:

1. Using the Operator Hub on a 4.4 cluster, install EO and then install CLO
2. Verify a single PVC
3. Upgrade EO to 4.5 by changing the subscription to 4.5 (for namespace openshift-operators-redhat)
4. Verify the operator rolled out
5. Upgrade CLO to 4.5 by changing the subscription to 4.5 (for namespace openshift-logging)
6. Verify the operator rolled out

Throughout all of those changes, nothing caused an additional PVC to be created. The only way I could cause that to happen was by manually deleting my elasticsearch CR and having CLO recreate it (which is not required for upgrades, nor recommended):
oc delete elasticsearch elasticsearch -n openshift-logging


Also, I'm unsure where the logs in https://bugzilla.redhat.com/show_bug.cgi?id=1868300#c19 were sourced from. The comment says "operator logs", but https://bugzilla.redhat.com/show_bug.cgi?id=1868300#c20 says they are ES pod logs (which does appear to be the case based on the message contents).

Comment 22 ewolinet 2020-08-27 18:51:24 UTC
I was able to recreate this a different way: if I manually edit the elasticsearch CR to drop the UUID and clear out the status field, the operator regenerates a UUID and therefore creates a new PVC.
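
(Roughly, that reproduction looks like the sketch below; the exact field carrying the per-node UUID depends on the EO version, so treat the field locations as assumptions:)

~~~
oc edit elasticsearch elasticsearch -n openshift-logging
# - remove the per-node UUID field under spec.nodes[]
# - clear out the status: block
# On the next reconcile, EO generates a fresh UUID and provisions a new PVC for it
~~~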

Based on this case, I will try to add some hardening to our operators.

Comment 23 Jonas Nordell 2020-08-31 07:15:24 UTC
@ewolinet 

As I mentioned on Slack, operator logs are extracted from ElasticSearch and not directly from the pods. Sorry about the confusion.

Is there a workaround where it would be possible to go back to using the original PVs without creating problems in the future?

Comment 26 Armin Kunaschik 2020-08-31 15:12:09 UTC
Please provide a workaround, or a fast fix in the very next release.
We upgraded twice, 4.4.16->4.5.6 and 4.5.6->4.5.7, and each time we lost our previous logs.

Comment 29 ewolinet 2020-09-01 14:32:45 UTC
Thank you for the logs. I see the following occur twice in the EO logs, which indicates to me that something is deleting and recreating the elasticsearch CR; that would be the root cause of what is happening here (and it means EO itself is working as expected).

time="2020-08-27T10:44:03Z" level=info msg="Flushing nodes for openshift-logging/elasticsearch"

I will look through the logging dump further to see if I can figure out what is deleting and recreating the CR.
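
(A hedged way to check your own EO logs for the same signature; the deployment name below is the usual default and may differ:)

~~~
oc logs deployment/elasticsearch-operator -n openshift-operators-redhat | grep "Flushing nodes"
~~~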

Comment 31 ewolinet 2020-09-01 14:43:54 UTC
Armin,

Please see https://access.redhat.com/solutions/5323141 regarding the kbase article for a workaround.

Comment 44 ewolinet 2020-09-09 22:03:51 UTC
Per the PR summary to address this:



As part of the recovery/adoption process, the PVCs to be picked back up are required to have the label "logging-cluster: <name-of-the-cr>". The PVC name is also validated against the name of the cluster it is based on.

Recovery/adoption will be triggered upon the processing of a CR that is missing UUIDs. It will only try to recover UUIDs for nodes that do not already have UUIDs defined.
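
(In practice, a pre-existing PVC can be made visible to the adoption path by labelling it; a hedged sketch using the default CR name "elasticsearch" and a PVC name from this bug; whether manual labelling is required for a given EO version is not confirmed here:)

~~~
oc label pvc elasticsearch-elasticsearch-cdm-7bg4p53i-1 logging-cluster=elasticsearch -n openshift-logging
~~~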

Further documentation will need to be developed and published covering how to recover data from another PVC. This PR does not seek to resolve that, but rather addresses cases where an elasticsearch CR may have been removed by accident and then recreated (without UUIDs).

Comment 46 Jeff Cantrill 2020-09-12 01:52:51 UTC
Moving to UpcomingSprint while awaiting PRs to merge, etc.

Comment 50 Qiaoling Tang 2020-09-22 06:32:53 UTC
I tried 5 times and was not able to reproduce this issue, so I am moving this BZ to VERIFIED.

Steps:
1. deploy logging 4.5 on a 4.5 cluster
2. upgrade logging to 4.6
3. upgrade the cluster to 4.6

CSV version:
elasticsearch-operator.4.6.0-202009211504.p0/elasticsearch-operator.4.6.0-202009192030.p0

Comment 63 Anping Li 2020-10-19 08:22:08 UTC
I can't reproduce this issue, but I found a similar issue where some resources are deleted during OCP upgrade if the clusterlogging is in Unmanaged state: https://bugzilla.redhat.com/show_bug.cgi?id=1888622.

Comment 65 errata-xmlrpc 2020-10-27 15:09:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.1 extras update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4198

