Bug 1953715

Summary: CVO and OLM conflicts on managing packageserver CSV
Product: OpenShift Container Platform
Reporter: Vu Dinh <vdinh>
Component: OLM
Assignee: Kevin Rizza <krizza>
OLM sub component: OLM
QA Contact: Jian Zhang <jiazha>
Status: CLOSED WONTFIX
Docs Contact:
Severity: medium
Priority: medium
CC: anbhatta, bluddy, nelluri, tflannag, wking
Version: 4.8
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2021-12-01 21:55:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Vu Dinh 2021-04-26 17:29:30 UTC
Description of problem:
The OLM packageserver CSV is created and managed by the CVO as part of cluster bootstrap. OLM (olm-catalog) itself adds labels and annotations (adoption labels, operatorgroup labels, etc.) as part of its control process. As a result, OLM and the CVO end up in conflicting control loops: OLM adds the labels and the CVO restores the CSV to its original state (without the labels). This contention causes errors that show up in the olm-operator log.
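
To make the loop concrete, here is a minimal toy model of the contention described above. It is only a sketch: `cvo_reconcile`, `olm_reconcile`, `count_writes` and the dictionary layout are hypothetical stand-ins, not the real CVO or OLM code, and the label key is used purely for illustration.

```python
import copy

# Hypothetical, simplified manifest as the CVO would ship it (no OLM labels).
MANIFEST = {"name": "packageserver", "labels": {}}


def cvo_reconcile(live, manifest):
    """Restore the live object to the manifest state. Returns True if it wrote."""
    if live["labels"] != manifest["labels"]:
        live["labels"] = copy.deepcopy(manifest["labels"])
        return True
    return False


def olm_reconcile(live, wanted_labels):
    """Add any labels OLM expects but does not find. Returns True if it wrote."""
    missing = {k: v for k, v in wanted_labels.items() if live["labels"].get(k) != v}
    if missing:
        live["labels"].update(missing)
        return True
    return False


def count_writes(rounds, cvo_manages=True):
    """Run both reconcilers in turn and count how many writes they issue."""
    live = copy.deepcopy(MANIFEST)
    wanted = {"olm.operatorGroup": "olm-operators"}  # illustrative label only
    writes = 0
    for _ in range(rounds):
        writes += olm_reconcile(live, wanted)
        if cvo_manages:
            writes += cvo_reconcile(live, MANIFEST)
    return writes


print(count_writes(5))                     # both controllers keep writing
print(count_writes(5, cvo_manages=False))  # OLM writes once, then settles
```

With both reconcilers active, every round produces two writes and the object never settles. With the CVO out of the picture, OLM writes once and then has nothing left to do, which matches the observation in step 3 below.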

Version-Release number of selected component (if applicable):
4.8

How reproducible:
100%

Steps to Reproduce:
1. Launch a cluster and do not install any new operators
2. Look at the olm-catalog log; it contains error messages such as:
`
time="2021-04-23T19:57:31Z" level=info msg="checking packageserver"
time="2021-04-23T19:57:31Z" level=warning msg="error adding operatorgroup annotations" csv=packageserver error="Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com \"packageserver\": the object has been modified; please apply your changes to the latest version and try again" namespace=openshift-operator-lifecycle-manager operatorGroup=olm-operators
`

The olm-catalog controller continues to check packageserver over and over again:
`
time="2021-04-23T19:58:24Z" level=info msg="checking packageserver"
time="2021-04-23T19:58:24Z" level=info msg="checking packageserver"
time="2021-04-23T19:58:24Z" level=info msg="checking packageserver"
time="2021-04-23T19:58:24Z" level=info msg="checking packageserver"
`
3. If you mark the packageserver CSV as unmanaged by the CVO, this issue goes away.
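
The report does not say how the CSV was marked as unmanaged in step 3. One plausible way, assuming the standard ClusterVersion `spec.overrides` mechanism was used, is a JSON patch like the one built below. The helper name `unmanaged_override_patch` is hypothetical, and the group/kind values are assumptions based on the resource name that appears in the error message above.

```python
import json


def unmanaged_override_patch():
    """Build a JSON patch that tells the CVO to stop managing the packageserver CSV.

    Assumption: the override goes into ClusterVersion spec.overrides.
    Note that an "add" on /spec/overrides replaces any overrides already set.
    """
    override = {
        "kind": "ClusterServiceVersion",
        "group": "operators.coreos.com",
        "namespace": "openshift-operator-lifecycle-manager",
        "name": "packageserver",
        "unmanaged": True,
    }
    return [{"op": "add", "path": "/spec/overrides", "value": [override]}]


print(json.dumps(unmanaged_override_patch(), indent=2))
```

The printed document could then be passed to something like `oc patch clusterversion version --type json -p '<patch>'`. Overrides of this kind are meant for debugging only, so this is a way to confirm the diagnosis rather than a fix.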

Actual results:
Many log entries about failed updates to the packageserver CSV, and the controller keeps re-checking the packageserver CSV.
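
To put a number on "a lot", a small script can tally the relevant entries from a saved log. This is a rough sketch that assumes the key=value line format shown in the excerpts above; `parse_line` and `summarize` are hypothetical helpers, and the sample lines are copied from this report.

```python
import re
from collections import Counter

# key=value pairs where the value is either a double-quoted string
# (possibly containing escaped quotes) or a bare token without spaces.
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_line(line):
    """Split one log line into a dict of its key=value fields."""
    fields = {}
    for key, raw in PAIR.findall(line):
        if raw.startswith('"'):
            raw = raw[1:-1].replace('\\"', '"')
        fields[key] = raw
    return fields


def summarize(lines):
    """Count packageserver checks and packageserver CSV warnings."""
    counts = Counter()
    for line in lines:
        fields = parse_line(line)
        if fields.get("level") == "warning" and fields.get("csv") == "packageserver":
            counts["warnings"] += 1
        elif fields.get("msg") == "checking packageserver":
            counts["checks"] += 1
    return counts


SAMPLE = [
    r'time="2021-04-23T19:57:31Z" level=info msg="checking packageserver"',
    r'time="2021-04-23T19:57:31Z" level=warning msg="error adding operatorgroup annotations" csv=packageserver error="Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com \"packageserver\": the object has been modified; please apply your changes to the latest version and try again" namespace=openshift-operator-lifecycle-manager operatorGroup=olm-operators',
    r'time="2021-04-23T19:58:24Z" level=info msg="checking packageserver"',
]

print(summarize(SAMPLE))
```

On a healthy cluster the warning count should stop growing shortly after startup. Here it keeps growing for as long as the CVO manages the CSV.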

Expected results:
No excessive errors from checking the packageserver CSV. Some initial errors are expected at startup, but they should settle down rather than persist.


Additional info:

Comment 2 Ben Luddy 2021-05-07 13:58:57 UTC
Raising priority due to likely impact on clusteroperator Availability condition.

Comment 3 Ben Luddy 2021-05-07 14:25:08 UTC
Looking at https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26016/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1390634961685450752:

May 07 13:03:53.558 E clusteroperator/operator-lifecycle-manager-packageserver condition/Available status/False reason/ClusterServiceVersionNotSucceeded changed: ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver is in phase InstallReady with reason: ComponentUnhealthy, message: installing: deployment changed old hash=797b6f8d96, new hash=55bf776f77
May 07 13:03:53.558 - 19s   E clusteroperator/operator-lifecycle-manager-packageserver condition/Available status/False reason/ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver is in phase InstallReady with reason: ComponentUnhealthy, message: installing: deployment changed old hash=797b6f8d96, new hash=55bf776f77

From the audit logs:

{"timestamp":"2021-05-07T13:03:53.337237Z","verb":"update","username":"system:serviceaccount:openshift-cluster-version:default","name":"packageserver","namespace":"openshift-operator-lifecycle-manager"}
{"timestamp":"2021-05-07T13:03:53.362000Z","verb":"update","username":"system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount","name":"packageserver","namespace":"openshift-operator-lifecycle-manager"}
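
For anyone who wants to look for the same pattern in their own audit logs, the check can be sketched as follows. It assumes the flattened field names used in the two excerpted entries above (real audit events nest some of these fields), and `contended_updates` is a hypothetical helper.

```python
import json
from datetime import datetime


def parse_ts(ts):
    """Parse the microsecond-precision UTC timestamps used in the excerpt."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%fZ")


def contended_updates(lines, window_seconds=1.0):
    """Find back-to-back updates of the same object by different users.

    Returns (first_user, second_user, gap_in_seconds) tuples for consecutive
    update events that land within window_seconds of each other.
    """
    events = sorted((json.loads(line) for line in lines), key=lambda e: e["timestamp"])
    pairs = []
    for prev, cur in zip(events, events[1:]):
        same_object = (prev["namespace"], prev["name"]) == (cur["namespace"], cur["name"])
        both_updates = prev["verb"] == "update" and cur["verb"] == "update"
        gap = (parse_ts(cur["timestamp"]) - parse_ts(prev["timestamp"])).total_seconds()
        if same_object and both_updates and prev["username"] != cur["username"] and gap <= window_seconds:
            pairs.append((prev["username"], cur["username"], gap))
    return pairs


AUDIT = [
    '{"timestamp":"2021-05-07T13:03:53.337237Z","verb":"update","username":"system:serviceaccount:openshift-cluster-version:default","name":"packageserver","namespace":"openshift-operator-lifecycle-manager"}',
    '{"timestamp":"2021-05-07T13:03:53.362000Z","verb":"update","username":"system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount","name":"packageserver","namespace":"openshift-operator-lifecycle-manager"}',
]

for first, second, gap in contended_updates(AUDIT):
    print(f"{first} -> {second} after {gap:.3f}s")
```

For the two entries quoted here, the CVO service account and the olm-operator service account update the same CSV roughly 25 ms apart.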

Comment 6 Ben Luddy 2021-05-10 19:41:21 UTC
The ClusterOperator impact appears to be due to legitimate CSV updates and not due to this contention. The Available computation itself needs to be smarter to avoid flapping in this case; I opened https://bugzilla.redhat.com/show_bug.cgi?id=1959158 for that.

Comment 7 Haseeb Tariq 2021-05-14 23:10:34 UTC
This will likely be addressed in 4.9, as there is ongoing work to make packageserver work in a single-node configuration, which would require OLM rather than the CVO to create the packageserver CSV.
See: https://issues.redhat.com/browse/OLM-2078

Additionally the CVO should have a feature that lets us remove existing manifests via an annotation.
https://github.com/openshift/cluster-version-operator/pull/438

This can be used to remove the old CVO managed packageserver CSV.

Removing myself as the assignee as I am no longer actively working on this at the moment.

Comment 10 tflannag 2021-12-01 21:55:39 UTC
I'm inclined to close this BZ out as CLOSED - WONTFIX: the contention between the CVO and OLM is alleviated in OCP 4.9 and later minor versions, where the PackageServer CSV is managed by the OLM stack, and there's no clear backportable way to alleviate it in versions earlier than 4.9. See the comment from Haseeb (https://bugzilla.redhat.com/show_bug.cgi?id=1953715#c7) further up in this comment chain for a better explanation.