Description of problem:

When the Elasticsearch operator was upgraded, it created new PVCs and tried to use them instead of the old ones. This looks like a regression of https://bugzilla.redhat.com/show_bug.cgi?id=1756794; it might be closed as a duplicate of that bug.

Version-Release number of selected component (if applicable):

oc version
Client Version: 4.5.0-202007240519.p0-b66f2d3
Server Version: 4.5.4
Kubernetes Version: v1.18.3+012b3ec

Elasticsearch operator: 4.5.0-202007240519.p0

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:
Elasticsearch tries to use the new PVCs instead of the old ones.

Expected results:
Elasticsearch should always reuse existing PVCs when they are present.

Additional info:
The only time it would create different PVCs is if the UUID specified for the node in the elasticsearch CR changed.

The name of a PVC is defined as "<cluster name>-<node name>", where:
- cluster name is the same as your elasticsearch CR object (if it was created by CLO this is "elasticsearch")
- node name is (redundantly) "<cluster name>-<node roles>-<node UUID>-<node replica number>"
- node roles consists of up to three letters denoting the roles defined for the node: c - client, d - data, m - master
- node UUID is generated if it is not provided in the CR; after that, EO checks against your CR's status to ensure the UUID does not change (you should see a status condition denoting this too)

Can you verify that the UUID wasn't changed or that the elasticsearch CR wasn't recreated?
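For reference, a quick way to compare the UUID currently recorded in the CR against the PVC names (a sketch only; the exact field holding the UUID can vary between operator versions, so grepping the CR is the safest generic check):

~~~
# Show any UUID-related fields in the elasticsearch CR (spec and status)
oc get elasticsearch elasticsearch -n openshift-logging -o yaml | grep -i uuid

# List the PVCs; the UUID should appear as the "<node UUID>" segment
# of each name, e.g. elasticsearch-elasticsearch-cdm-<UUID>-1
oc get pvc -n openshift-logging
~~~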
To clarify, was the elasticsearch CR recreated automatically during the upgrade, or did the customer recreate their CR as part of this?

Are any of the old elasticsearch deployments still around? If not, it sounds like the old CR was deleted (if so, was this done automatically?).

What version did the customer upgrade from to get to 4.5.0?
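A couple of generic checks that could answer this (a sketch, assuming the default namespace openshift-logging and CR name "elasticsearch"):

~~~
# If the CR was deleted and recreated, its creationTimestamp will be newer
# than the original logging deployment
oc get elasticsearch elasticsearch -n openshift-logging \
  -o jsonpath='{.metadata.creationTimestamp}{"\n"}'

# Old deployments (still named with the previous node UUID) would show up here
oc get deployments -n openshift-logging
~~~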
I'm currently seeing the same issue in a cluster upgrade (4.4.11 to 4.4.14 and then to 4.5.4).

The events:
~~~
2020-08-05T06:42:52Z 2020-08-05T06:42:52Z 1 elasticsearch-operator.4.4.0-202007240028.p0.16284c2107a3fde8 ClusterServiceVersion <none> Normal ComponentUnhealthy operator-lifecycle-manager installing: deployment changed old hash=68dcb67f7c, new hash=6687c7ccb8
~~~

Then, some time after:
~~~
2020-08-05T07:20:09Z 2020-08-05T07:20:09Z 1 elasticsearch-elasticsearch-cdm-7bg4p53i-2.16284e29ddddcc46 PersistentVolumeClaim <none> Normal ProvisioningSucceeded persistentvolume-controller Successfully provisioned volume pvc-02a6387c-71eb-4727-8af3-e55efc33b6b2 using kubernetes.io/vsphere-volume
2020-08-05T07:20:09Z 2020-08-05T07:20:09Z 1 elasticsearch-elasticsearch-cdm-7bg4p53i-3.16284e29deb5a07a PersistentVolumeClaim <none> Normal ProvisioningSucceeded persistentvolume-controller Successfully provisioned volume pvc-c824f73b-1f52-4222-ae41-5a9aaa417aad using kubernetes.io/vsphere-volume
2020-08-05T07:20:09Z 2020-08-05T07:20:09Z 1 elasticsearch-elasticsearch-cdm-7bg4p53i-1.16284e29e460ed2b PersistentVolumeClaim <none> Normal ProvisioningSucceeded persistentvolume-controller Successfully provisioned volume pvc-726fa5f9-cc24-4493-b0e3-8eeb5441030c using kubernetes.io/vsphere-volume
~~~

During one of the upgrade attempts the deployment objects were recreated, as were the PVCs (multiple times).

Pods:
~~~
elasticsearch-cdm-7bg4p53i-1-6bdbfd8684-kjhw2   2/2   Running   0   5h52m
elasticsearch-cdm-7bg4p53i-2-7fdb9b5668-k6z79   2/2   Running   0   5h50m
elasticsearch-cdm-7bg4p53i-3-67c698955c-crxl7   2/2   Running   0   5h49m
~~~

Deployments:
~~~
NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
cluster-logging-operator       1/1     1            1           22d
elasticsearch-cdm-7bg4p53i-1   1/1     1            1           5h52m
elasticsearch-cdm-7bg4p53i-2   1/1     1            1           5h51m
elasticsearch-cdm-7bg4p53i-3   1/1     1            1           5h50m
~~~

PVCs:
~~~
NAME                                         STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
elasticsearch-elasticsearch-cdm-7bg4p53i-1   Bound    pvc-726fa5f9-cc24-4493-b0e3-8eeb5441030c   300Gi      RWO            thin           5h52m
elasticsearch-elasticsearch-cdm-7bg4p53i-2   Bound    pvc-02a6387c-71eb-4727-8af3-e55efc33b6b2   300Gi      RWO            thin           5h52m
elasticsearch-elasticsearch-cdm-7bg4p53i-3   Bound    pvc-c824f73b-1f52-4222-ae41-5a9aaa417aad   300Gi      RWO            thin           5h52m
elasticsearch-elasticsearch-cdm-fydsfsvg-1   Bound    pvc-1ae15210-e99c-4f47-901d-5333e3f17be4   300Gi      RWO            thin           6h25m
elasticsearch-elasticsearch-cdm-fydsfsvg-2   Bound    pvc-6fee0e0d-e6ad-4e11-b76c-6dfeaebbc1ce   300Gi      RWO            thin           6h25m
elasticsearch-elasticsearch-cdm-fydsfsvg-3   Bound    pvc-7a33b181-b195-4635-8693-bfe205018555   300Gi      RWO            thin           6h25m
elasticsearch-elasticsearch-cdm-tiyzfpzh-1   Bound    pvc-244bfab4-2ff2-4d1b-94d7-f6bb520382f0   300Gi      RWO            thin           22d
elasticsearch-elasticsearch-cdm-tiyzfpzh-2   Bound    pvc-e5e204f5-e88a-40ff-b68e-2c61d31e52b0   300Gi      RWO            thin           22d
elasticsearch-elasticsearch-cdm-tiyzfpzh-3   Bound    pvc-3c1c37f0-100b-47da-b93e-481a41ca98e6   300Gi      RWO            thin           22d
~~~

Note: I tried to reproduce the same but could not. It does, however, seem to be 100% reproducible in the CU cluster.
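From the PVC listing above, three generations of claims exist (suffixes tiyzfpzh, fydsfsvg, 7bg4p53i), which suggests the node UUID changed at least twice. A quick way to see which generation the CR currently references (a sketch, assuming the CLO-created CR name "elasticsearch" in openshift-logging):

~~~
# The node/deployment names recorded in the CR carry the UUID currently in use
# (grep for the node-name prefix seen in the listing)
oc get elasticsearch elasticsearch -n openshift-logging -o yaml | grep 'cdm-'
~~~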
Do you notice whether the clusterlogging CR is being deleted and recreated? It is the owner of the elasticsearch CR (if you are using CLO), and that could be the root cause of the elasticsearch CR being deleted and recreated.

> When you say "what version did the customer upgrade from", do you mean OCP or logging?

What logging version did they upgrade from?
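One way to confirm the ownership chain (a sketch, assuming the default CR names CLO creates):

~~~
# The elasticsearch CR should list the clusterlogging CR "instance" as its owner
oc get elasticsearch elasticsearch -n openshift-logging \
  -o jsonpath='{.metadata.ownerReferences}{"\n"}'

# A recently recreated clusterlogging CR would show a new creationTimestamp
oc get clusterlogging instance -n openshift-logging \
  -o jsonpath='{.metadata.creationTimestamp}{"\n"}'
~~~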
@Jonas, can you attach the startingCSV and currentCSV? I need to confirm what is configured to be the owner of the ClusterLogging CR. My understanding is that it shouldn't be tied to the operator deployment, but if it is, that may be what is causing what we're seeing here.
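These can be read from the subscription (a sketch; the subscription name "cluster-logging" is the usual default but may differ in the customer's cluster):

~~~
# startingCSV is set in the subscription spec, currentCSV in its status
oc get subscription cluster-logging -n openshift-logging \
  -o jsonpath='startingCSV: {.spec.startingCSV}{"\n"}currentCSV: {.status.currentCSV}{"\n"}'
~~~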
Hi Jonas, this is the subscription for EO; can you provide the one for CLO? Can you also provide the output of "oc get clusterlogging instance -n openshift-logging -o yaml" from the customer, please?
Moving to UpcomingSprint for future evaluation
For the cluster that you said upgraded from 4.4.11 to 4.4.14 and then to 4.5.4, what mechanism did you use to upgrade from 4.4.14 -> 4.5.4?

Trying to recreate this, I was unable to, using the following steps:
1. Using the operator hub on a 4.4 cluster, install EO and then install CLO
2. Verify a single PVC
3. Upgrade EO to 4.5 by changing the subscription to 4.5 (for namespace openshift-operators-redhat; channel change sketched below)
4. Verify the operator rolled out
5. Upgrade CLO to 4.5 by changing the subscription to 4.5 (for namespace openshift-logging)
6. Verify the operator rolled out

Throughout all of those changes, nothing caused additional PVCs to be created. The only way I could cause that to happen was by manually deleting my elasticsearch CR and having CLO recreate it (which is neither required for upgrades nor recommended):

oc delete elasticsearch elasticsearch -n openshift-logging

Also, I'm unsure where the logs in https://bugzilla.redhat.com/show_bug.cgi?id=1868300#c19 were sourced from. The comment says "operator logs", while https://bugzilla.redhat.com/show_bug.cgi?id=1868300#c20 says they are ES pod logs (based on the message contents, they do appear to be ES pod logs).
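A minimal sketch of the channel change used in steps 3 and 5, assuming the default subscription names "elasticsearch-operator" and "cluster-logging" (they may differ depending on how the operators were installed):

~~~
# EO: switch the subscription channel to 4.5
oc patch subscription elasticsearch-operator -n openshift-operators-redhat \
  --type merge -p '{"spec":{"channel":"4.5"}}'

# CLO: switch the subscription channel to 4.5
oc patch subscription cluster-logging -n openshift-logging \
  --type merge -p '{"spec":{"channel":"4.5"}}'
~~~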
I was able to recreate this a different way: if I manually edit the elasticsearch CR to drop the UUID and clear out the status field, the operator will regenerate a UUID and therefore create a new PVC.

Based on this case, I will try to add some hardening to our operators.
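For illustration only (this is destructive, do not do it on a production cluster), the reproduction amounts to something like the following; the UUID field name is from the CR instances I have seen and may vary by version:

~~~
# 1. Edit the CR: delete the generated UUID from each node entry
#    (named genUUID in the CRs I have seen) and remove the status block
oc edit elasticsearch elasticsearch -n openshift-logging

# 2. On the next reconcile the operator generates a fresh UUID and
#    creates new PVCs named with it
oc get pvc -n openshift-logging
~~~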
@ewolinet As I mentioned on Slack, the operator logs were extracted from Elasticsearch and not directly from the pods. Sorry about the confusion.

Is there a workaround by which it would be possible to go back to using the original PVs without creating problems in the future?
Please provide a workaround, or a fast fix in the very next release. We upgraded twice, 4.4.16->4.5.6 and 4.5.6->4.5.7, and each time we lost our previous logs.
Thank you for the logs. I see the following occur twice in the EO logs, which indicates to me that something is deleting and recreating the elasticsearch CR; that would be the root cause of this happening (and it means EO is working as expected):

time="2020-08-27T10:44:03Z" level=info msg="Flushing nodes for openshift-logging/elasticsearch"

I will look through the logging dump further to see if I can figure out what is deleting and recreating the CR.
Armin, please see https://access.redhat.com/solutions/5323141 for the kbase article describing a workaround.
Per the PR summary to address this:

As part of the recovery/adoption process, the PVCs to be picked back up will be required to have the label "logging-cluster: <name-of-the-cr>". The PVC name is also validated against the name of the cluster it is based on. Recovery/adoption is triggered when a CR that is missing UUIDs is processed, and UUIDs are only recovered for nodes that do not already have them defined.

Further documentation will need to be developed and published on how to recover data from another PVC. This PR does not seek to resolve that, but rather addresses cases where an elasticsearch CR may have been removed by accident and then recreated (without UUIDs).
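A sketch of how an existing PVC would be labeled for adoption under this scheme, using the default CR name "elasticsearch" and a PVC name from the earlier listing (the full recovery procedure is the subject of the follow-up documentation mentioned above):

~~~
# Label an existing PVC so the operator can consider it for adoption
oc label pvc elasticsearch-elasticsearch-cdm-tiyzfpzh-1 \
  logging-cluster=elasticsearch -n openshift-logging
~~~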
Moving to UpcomingSprint while awaiting PRs to merge, etc.
I tried 5 times and was not able to reproduce this issue, so I am moving this BZ to VERIFIED.

Steps:
1. Deploy logging 4.5 on a 4.5 cluster
2. Upgrade logging to 4.6
3. Upgrade the cluster to 4.6

CSV versions: elasticsearch-operator.4.6.0-202009211504.p0 / elasticsearch-operator.4.6.0-202009192030.p0
I can't reproduce this issue, but I found a similar one where some resources are deleted during an OCP upgrade if the clusterlogging CR is in the Unmanaged state: https://bugzilla.redhat.com/show_bug.cgi?id=1888622.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6.1 extras update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4198