Description of problem:

Auto upgrade of service mesh fails each time, and we had to uninstall service mesh, Kiali, and Jaeger, then reinstall them all.

# oc get csv
NAME                                        DISPLAY                  VERSION              REPLACES   PHASE
elasticsearch-operator.4.2.1-201910221723   Elasticsearch Operator   4.2.1-201910221723              Succeeded
elasticsearch-operator.4.2.4-201911050122   Elasticsearch Operator   4.2.4-201911050122              Failed
...

# oc describe csv elasticsearch-operator.4.2.4-201911050122
...
Status:
  Certs Last Updated:  <nil>
  Certs Rotate At:     <nil>
  Conditions:
    Last Transition Time:  2019-11-21T10:03:43Z
    Last Update Time:      2019-11-21T10:03:43Z
    Message:               installing: ComponentMissing: missing deployment with name=elasticsearch-operator
    Phase:                 Pending
    Reason:                NeedsReinstall
    Last Transition Time:  2019-11-21T10:03:43Z
    Last Update Time:      2019-11-21T10:03:43Z
    Message:               conflicting CRD owner in namespace
    Phase:                 Failed
    Reason:                OwnerConflict
    Last Transition Time:  2019-11-21T10:03:45Z
    Last Update Time:      2019-11-21T10:03:45Z
    Message:               installing: ComponentMissing: missing deployment with name=elasticsearch-operator
    Phase:                 Pending
    Reason:                NeedsReinstall
    Last Transition Time:  2019-11-21T10:03:46Z
    Last Update Time:      2019-11-21T10:03:46Z
    Message:               conflicting CRD owner in namespace
    Phase:                 Failed
    Reason:                OwnerConflict
...

Where are you experiencing the behavior? What environment?
AWS, lab

When does the behavior occur? Frequently? Repeatedly? At certain times?

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Install service mesh
2. At the time of auto upgrade it fails, and it can be recovered only by reinstalling
3.

Actual results:
Fails during auto upgrade

Expected results:
Should succeed without external intervention.

Additional info:
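For anyone triaging a similar failure, the stuck CSVs across all namespaces can be listed in one pass. This is only a minimal sketch, assuming a logged-in `oc` session and `jq` on the path:

$ oc get csv -A -o json \
    | jq -r '.items[]
        | select(.status.phase != "Succeeded")
        | [.metadata.namespace, .metadata.name, .status.phase, .status.reason]
        | @tsv'

Each output row is namespace, CSV name, phase, and reason, which makes it easy to spot OwnerConflict or NeedsReinstall entries like the ones above.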
Hi Jaspreet, I have a few questions:

1) What version of OpenShift was your cluster running?

2) Were there any other CSVs that went into the failed state during the auto upgrade? It looks like service mesh depends on a few operators, and if any of those operators fails to install for some reason, service mesh will fail to install. But if that's the case, I don't think it would be fair to just say **ServiceMesh** failed to install during auto upgrade in the bug report.

3) How reproducible was this? Could you provide more detailed steps on how to reproduce it? If this was a one-off thing (for example, the elasticsearch operator, which the service mesh operator depends on, had a one-off glitch for some reason in your cluster), we may not be able to classify this as a bug. However, if with the steps you provide the elasticsearch operator fails to install more than once, and only during upgrade, then we could investigate this further as a potential bug.
This may have the same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=1789920, which prevents garbage collection of copied CSVs, but we don't have quite enough information to confirm or disconfirm.

If your cluster has reproduced it, the "conflicting CRD owner in namespace" failures would be expected, since the copied CSV asserts itself as the owner of the same CRD that the new CSV wants to own. During a normal upgrade, the new version would specify that it replaces the previous version, so the existing CRD ownership is not considered a conflict. However, if the zombie CSV were two or more versions older than the newest CSV, it would result in a CRD ownership conflict.

You can query your cluster for CSVs that are in this state:

$ oc get -A -o json csv | jq '.items[] | select((.status.reason == "Copied" and .metadata.annotations["olm.operatorNamespace"] == .metadata.namespace))'

Any such CSVs can be safely deleted.

The 4.4.0 release will contain changes that prevent CSVs from entering this state and clean up any existing CSVs that are already in it. The fixes will also be backported to 4.3.z (https://bugzilla.redhat.com/show_bug.cgi?id=1797019) and 4.2.z (https://bugzilla.redhat.com/show_bug.cgi?id=1797021).

If you can reproduce your original issue but there are no CSVs matching the above query, please respond and we can consider more avenues of investigation.
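If you do find zombie copied CSVs, one way to remove them in bulk is to feed the same query into a delete loop. This is a sketch only, assuming a logged-in `oc` session with permission to delete CSVs in the affected namespaces; running the jq portion on its own first is a good way to review what would be removed:

$ oc get csv -A -o json \
    | jq -r '.items[]
        | select(.status.reason == "Copied"
                 and .metadata.annotations["olm.operatorNamespace"] == .metadata.namespace)
        | "\(.metadata.namespace) \(.metadata.name)"' \
    | while read ns name; do oc delete csv "$name" -n "$ns"; done

The loop prints one namespace/name pair per zombie CSV and deletes each in its own namespace; after that, the pending upgrade should be able to proceed on its own.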
*** This bug has been marked as a duplicate of bug 1789920 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days