Bug 1775518

Summary: Service mesh auto upgrade fails each time
Product: OpenShift Container Platform
Component: OLM
OLM sub component: OLM
Version: 4.2.0
Target Milestone: ---
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Status: CLOSED DUPLICATE
Severity: high
Priority: high
Reporter: Jaspreet Kaur <jkaur>
Assignee: Ben Luddy <bluddy>
QA Contact: Jian Zhang <jiazha>
CC: aos-bugs, bluddy, dsover, ecordell, eparis, jokerman, vdinh
Doc Type: If docs needed, set a value
Story Points: ---
Regression: ---
Type: Bug
Last Closed: 2020-02-06 17:21:31 UTC

Description Jaspreet Kaur 2019-11-22 07:20:06 UTC
Description of problem: 

Auto upgrade of service mesh fails each time, and we had to uninstall servicemesh, kiali, and jaeger, then reinstall them all. 

# oc get csv
NAME                                        DISPLAY                        VERSION              REPLACES                              PHASE
elasticsearch-operator.4.2.1-201910221723   Elasticsearch Operator         4.2.1-201910221723                                         Succeeded
elasticsearch-operator.4.2.4-201911050122   Elasticsearch Operator         4.2.4-201911050122                                         Failed
...
# oc describe csv elasticsearch-operator.4.2.4-201911050122
...
Status:
  Certs Last Updated:  <nil>
  Certs Rotate At:     <nil>
  Conditions:
    Last Transition Time:  2019-11-21T10:03:43Z
    Last Update Time:      2019-11-21T10:03:43Z
    Message:               installing: ComponentMissing: missing deployment with name=elasticsearch-operator
    Phase:                 Pending
    Reason:                NeedsReinstall
    Last Transition Time:  2019-11-21T10:03:43Z
    Last Update Time:      2019-11-21T10:03:43Z
    Message:               conflicting CRD owner in namespace
    Phase:                 Failed
    Reason:                OwnerConflict
    Last Transition Time:  2019-11-21T10:03:45Z
    Last Update Time:      2019-11-21T10:03:45Z
    Message:               installing: ComponentMissing: missing deployment with name=elasticsearch-operator
    Phase:                 Pending
    Reason:                NeedsReinstall
    Last Transition Time:  2019-11-21T10:03:46Z
    Last Update Time:      2019-11-21T10:03:46Z
    Message:               conflicting CRD owner in namespace
    Phase:                 Failed
    Reason:                OwnerConflict
...

Where are you experiencing the behaviour?  What environment?

AWS, lab

When does the behavior occur? Frequently?  Repeatedly?   At certain times?





Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Install service mesh
2. At the time of the auto upgrade it fails, and it can be recovered only by reinstalling
3.

Actual results: Fails during auto upgrade


Expected results: Should succeed without external intervention.


Additional info:

Comment 2 Anik 2019-11-27 15:28:02 UTC
Hi Jaspreet, 

I have a few questions:

1) What version of OpenShift was your cluster running?

2) Were there any other CSVs that went into the Failed state during the auto upgrade (one quick way to check is sketched below)? It looks like servicemesh depends on a few other operators, and if any of those operators fails to install for some reason, servicemesh will fail to install. If that's the case, I don't think it would be fair to say in the bug report only that **ServiceMesh** failed to install during the auto upgrade.

3) How reproducible was this? Could you provide more detailed steps to reproduce it? If this was a one-off, for example if the elasticsearch operator, which the servicemesh operator depends on, had a one-off glitch in your cluster for some reason, we may not be able to classify this as a bug. However, if the elasticsearch operator fails to install more than once with the steps you provide, and only during upgrade, then we could investigate this further as a potential bug.
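
For reference, a quick way to check whether any other CSVs are stuck in the Failed phase (just a sketch that greps the PHASE column of the listing, not an official procedure):

$ oc get csv -A | grep -i failed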

Comment 6 Ben Luddy 2020-02-03 14:53:13 UTC
This may have the same root cause as
https://bugzilla.redhat.com/show_bug.cgi?id=1789920, which prevents
garbage collection of copied CSVs, but we don't have quite enough
information to confirm or disconfirm. If your cluster has reproduced
it, the "conflicting CRD owner in namespace" failures would be
expected, since the copied CSV asserts itself the owner of the same
CRD that the new CSV wants to own.
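
As a diagnostic sketch (one possible way to see the conflict, not an
official procedure), you can list which CSVs claim to own which CRDs
and look for two CSVs claiming the same CRD in the affected namespace:

$ oc get csv -A -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{": "}{.spec.customresourcedefinitions.owned[*].name}{"\n"}{end}'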

During a normal upgrade, the new version would specify that it
replaces the previous version, so the existing CRD ownership is not
considered a conflict. However, if the zombie CSV were two or more
versions earlier than the newest CSV, it would result in a CRD
ownership conflict.
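
To see the replaces chain itself (again just a sketch, using a
custom-columns layout for readability):

$ oc get csv -A -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,REPLACES:.spec.replaces'

A stale CSV that the newest CSV does not list in its replaces field is
the kind of ownership gap described above.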

You can query your cluster for CSVs that are in this state:

$ oc get -A -o json csv | jq '.items[] | select((.status.reason == "Copied" and .metadata.annotations["olm.operatorNamespace"] == .metadata.namespace))'

Any such CSVs can be safely deleted.
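
For example (one possible cleanup; review the matches first and adapt
as needed), the same query can drive the deletion:

$ oc get -A -o json csv | jq -r '.items[] | select(.status.reason == "Copied" and .metadata.annotations["olm.operatorNamespace"] == .metadata.namespace) | [.metadata.namespace, .metadata.name] | @tsv' | while read ns name; do oc delete csv -n "$ns" "$name"; done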

The 4.4.0 release will contain changes that prevent CSVs from entering
this state and clean up any existing CSVs that are already in this
state. The fixes will also be backported to 4.3.z
(https://bugzilla.redhat.com/show_bug.cgi?id=1797019) and 4.2.z
(https://bugzilla.redhat.com/show_bug.cgi?id=1797021).

If you can reproduce your original issue, but there are no CSVs
matching the above query, please respond and we can consider more
avenues of investigation.

Comment 7 Ben Luddy 2020-02-06 17:21:31 UTC

*** This bug has been marked as a duplicate of bug 1789920 ***

Comment 8 Red Hat Bugzilla 2024-01-06 04:27:12 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days