Description of problem:

Auto upgrade of service mesh fails each time, and we had to uninstall service mesh, Kiali, and Jaeger, then reinstall them all.

# oc get csv
NAME                                        DISPLAY                  VERSION              REPLACES   PHASE
elasticsearch-operator.4.2.1-201910221723   Elasticsearch Operator   4.2.1-201910221723              Succeeded
elasticsearch-operator.4.2.4-201911050122   Elasticsearch Operator   4.2.4-201911050122              Failed
...

# oc describe csv elasticsearch-operator.4.2.4-201911050122
...
Status:
  Certs Last Updated:  <nil>
  Certs Rotate At:     <nil>
  Conditions:
    Last Transition Time:  2019-11-21T10:03:43Z
    Last Update Time:      2019-11-21T10:03:43Z
    Message:               installing: ComponentMissing: missing deployment with name=elasticsearch-operator
    Phase:                 Pending
    Reason:                NeedsReinstall
    Last Transition Time:  2019-11-21T10:03:43Z
    Last Update Time:      2019-11-21T10:03:43Z
    Message:               conflicting CRD owner in namespace
    Phase:                 Failed
    Reason:                OwnerConflict
    Last Transition Time:  2019-11-21T10:03:45Z
    Last Update Time:      2019-11-21T10:03:45Z
    Message:               installing: ComponentMissing: missing deployment with name=elasticsearch-operator
    Phase:                 Pending
    Reason:                NeedsReinstall
    Last Transition Time:  2019-11-21T10:03:46Z
    Last Update Time:      2019-11-21T10:03:46Z
    Message:               conflicting CRD owner in namespace
    Phase:                 Failed
    Reason:                OwnerConflict
...

Where are you experiencing the behavior? What environment?
AWS, lab

When does the behavior occur? Frequently? Repeatedly? At certain times?

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Install service mesh
2. At the time of auto upgrade it fails, and it can be recovered only by reinstalling
3.

Actual results:
Fails during auto upgrade

Expected results:
Should succeed without external intervention.

Additional info:
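For anyone triaging a similar failure, the stuck CSVs across all namespaces can be listed in one pass. This is only a minimal sketch, assuming a logged-in `oc` session and `jq` on the path:

$ oc get csv -A -o json \
    | jq -r '.items[]
        | select(.status.phase != "Succeeded")
        | [.metadata.namespace, .metadata.name, .status.phase, .status.reason]
        | @tsv'

Each output row is namespace, CSV name, phase, and reason, which makes it easy to spot OwnerConflict or NeedsReinstall entries like the ones above.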
Hi Jaspreet, I have a few questions:

1) What version of OpenShift was your cluster running?

2) Were there any other CSVs that went into the failed state during the auto upgrade? It looks like service mesh depends on a few operators, and if any of those operators fails to install for some reason, service mesh will fail to install. But if that's the case, I don't think it would be fair to just say **ServiceMesh** failed to install during auto upgrade in the bug report.

3) How reproducible was this? Could you provide more detailed steps on how to reproduce it? If this was a one-off thing (for example, the elasticsearch operator, which the service mesh operator depends on, had a one-off glitch for some reason in your cluster), we may not be able to classify this as a bug. However, if with the steps you provide the elasticsearch operator fails to install more than once, and only during upgrade, then we could investigate this further as a potential bug.
This may have the same root cause as https://bugzilla.redhat.com/show_bug.cgi?id=1789920, which prevents garbage collection of copied CSVs, but we don't have quite enough information to confirm or disconfirm.

If your cluster has reproduced it, the "conflicting CRD owner in namespace" failures would be expected, since the copied CSV asserts itself as the owner of the same CRD that the new CSV wants to own. During a normal upgrade, the new version would specify that it replaces the previous version, so the existing CRD ownership is not considered a conflict. However, if the zombie CSV were two or more versions older than the newest CSV, it would result in a CRD ownership conflict.

You can query your cluster for CSVs that are in this state:

$ oc get -A -o json csv | jq '.items[] | select((.status.reason == "Copied" and .metadata.annotations["olm.operatorNamespace"] == .metadata.namespace))'

Any such CSVs can be safely deleted.

The 4.4.0 release will contain changes that prevent CSVs from entering this state and clean up any existing CSVs that are already in it. The fixes will also be backported to 4.3.z (https://bugzilla.redhat.com/show_bug.cgi?id=1797019) and 4.2.z (https://bugzilla.redhat.com/show_bug.cgi?id=1797021).

If you can reproduce your original issue but there are no CSVs matching the above query, please respond and we can consider more avenues of investigation.
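If you do find zombie copied CSVs, one way to remove them in bulk is to feed the same query into a delete loop. This is a sketch only, assuming a logged-in `oc` session with permission to delete CSVs in the affected namespaces; running the jq portion on its own first is a good way to review what would be removed:

$ oc get csv -A -o json \
    | jq -r '.items[]
        | select(.status.reason == "Copied"
                 and .metadata.annotations["olm.operatorNamespace"] == .metadata.namespace)
        | "\(.metadata.namespace) \(.metadata.name)"' \
    | while read ns name; do oc delete csv "$name" -n "$ns"; done

The loop prints one namespace/name pair per zombie CSV and deletes each in its own namespace; after that, the pending upgrade should be able to proceed on its own.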
*** This bug has been marked as a duplicate of bug 1789920 ***
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days