Bug 1953715 - CVO and OLM conflicts on managing packageserver CSV
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: OLM
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Kevin Rizza
QA Contact: Jian Zhang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2021-04-26 17:29 UTC by Vu Dinh
Modified: 2021-12-01 21:55 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-12-01 21:55:39 UTC
Target Upstream Version:
Embargoed:


Description Vu Dinh 2021-04-26 17:29:30 UTC
Description of problem:
The OLM packageserver CSV is created and managed by CVO as part of cluster bootstrap. OLM (olm-catalog) itself adds labels and annotations (adoption labels, operatorgroup labels, etc.) as part of its control process. As a result, OLM and CVO get into a conflicting control loop: OLM adds labels and CVO restores the CSV to its original state (without labels). This contention leads to errors on olm-operator that show up in the log.
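
To illustrate the loop, here is a toy model. It is a minimal sketch and not OLM or CVO code: the `Store` class, `cvo_sync`, `olm_sync`, and the `label`/`annotation` keys are all hypothetical. The only assumption taken from Kubernetes is optimistic concurrency, where an update carrying a stale resourceVersion is rejected with "the object has been modified".

```python
class Conflict(Exception):
    """Raised when an update carries a stale resourceVersion."""


class Store:
    """Hypothetical one-object API server with optimistic concurrency."""

    def __init__(self):
        self.metadata = {}          # labels and annotations, flattened
        self.resource_version = 1

    def get(self):
        return dict(self.metadata), self.resource_version

    def update(self, metadata, resource_version):
        if resource_version != self.resource_version:
            raise Conflict("the object has been modified")
        self.metadata = dict(metadata)
        self.resource_version += 1
        return self.resource_version


def cvo_sync(store):
    """Stand-in for CVO: restore the manifest state (no OLM metadata)."""
    metadata, rv = store.get()
    if metadata:
        store.update({}, rv)


def olm_sync(store, between_writes):
    """Stand-in for OLM: add a label, then an annotation, in two updates."""
    metadata, rv = store.get()
    if "label" not in metadata:
        metadata["label"] = "olm-operators"
        rv = store.update(metadata, rv)
    between_writes()                # another controller may act here
    if "annotation" not in metadata:
        metadata["annotation"] = "olm-operators"
        store.update(metadata, rv)  # raises Conflict if rv is stale


def simulate(rounds, cvo_managed):
    """Run OLM for a number of rounds and count rejected updates."""
    store = Store()
    conflicts = 0
    for _ in range(rounds):
        interleave = (lambda: cvo_sync(store)) if cvo_managed else (lambda: None)
        try:
            olm_sync(store, interleave)
        except Conflict:
            conflicts += 1
    return conflicts, store.get()[0]


if __name__ == "__main__":
    print(simulate(5, cvo_managed=True))
    print(simulate(5, cvo_managed=False))
```

In this model, every round conflicts while the second writer is active and the object never keeps the OLM metadata; with the second writer disabled, OLM converges after the first round and makes no further writes.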

Version-Release number of selected component (if applicable):
4.8

How reproducible:
100%

Steps to Reproduce:
1. Launch a cluster and do not install any new operators
2. Look at the olm-catalog log; it contains error messages:
`
time="2021-04-23T19:57:31Z" level=info msg="checking packageserver"
time="2021-04-23T19:57:31Z" level=warning msg="error adding operatorgroup annotations" csv=packageserver error="Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com \"packageserver\": the object has been modified; please apply your changes to the latest version and try again" namespace=openshift-operator-lifecycle-manager operatorGroup=olm-operators
`

The olm-catalog controller continues to check packageserver over and over again:
`
time="2021-04-23T19:58:24Z" level=info msg="checking packageserver"
time="2021-04-23T19:58:24Z" level=info msg="checking packageserver"
time="2021-04-23T19:58:24Z" level=info msg="checking packageserver"
time="2021-04-23T19:58:24Z" level=info msg="checking packageserver"
`
3. If you mark the packageserver CSV to be unmanaged by CVO, this issue goes away.
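
For step 3, one way to take the CSV out of CVO management is an entry in `spec.overrides` of the ClusterVersion object. The following is a sketch of building that JSON patch, under some assumptions: the kind, group, name, and namespace are taken from the error message above, the ClusterVersion object is assumed to have the default name `version`, and the `build_override_patch` helper is hypothetical. The `add` operation replaces any existing overrides list, so check for existing entries first.

```python
import json
import shlex


def build_override_patch(kind, group, namespace, name):
    """Return a JSON patch that marks one object as unmanaged by CVO."""
    override = {
        "kind": kind,
        "group": group,
        "namespace": namespace,
        "name": name,
        "unmanaged": True,
    }
    return [{"op": "add", "path": "/spec/overrides", "value": [override]}]


patch = build_override_patch(
    kind="ClusterServiceVersion",
    group="operators.coreos.com",
    namespace="openshift-operator-lifecycle-manager",
    name="packageserver",
)

# Print the command instead of running it, so it can be reviewed first.
command = ["oc", "patch", "clusterversion", "version",
           "--type", "json", "-p", json.dumps(patch)]
print(" ".join(shlex.quote(part) for part in command))
```

This is only suitable for debugging on a test cluster, since an override stops CVO from reconciling the object.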

Actual results:
The log contains many messages about failed updates to the packageserver CSV, and the controller keeps re-checking the packageserver CSV.
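
To put a number on "a lot", the log excerpts can be tallied with a small script. This is a sketch only: `parse_line` and `summarize` are hypothetical helpers, and it assumes every line is in the `key=value` / `key="quoted value"` form shown above. The sample lines are copied from this report.

```python
import shlex
from collections import Counter

# Lines copied from the log excerpts in this report.
LOG = r'''
time="2021-04-23T19:57:31Z" level=info msg="checking packageserver"
time="2021-04-23T19:57:31Z" level=warning msg="error adding operatorgroup annotations" csv=packageserver error="Operation cannot be fulfilled on clusterserviceversions.operators.coreos.com \"packageserver\": the object has been modified; please apply your changes to the latest version and try again" namespace=openshift-operator-lifecycle-manager operatorGroup=olm-operators
time="2021-04-23T19:58:24Z" level=info msg="checking packageserver"
time="2021-04-23T19:58:24Z" level=info msg="checking packageserver"
time="2021-04-23T19:58:24Z" level=info msg="checking packageserver"
time="2021-04-23T19:58:24Z" level=info msg="checking packageserver"
'''


def parse_line(line):
    """Split one key=value log line into a dict (quotes handled by shlex)."""
    return dict(token.split("=", 1) for token in shlex.split(line))


def summarize(text):
    """Count messages per (timestamp, msg) and collect update conflicts."""
    counts = Counter()
    conflicts = []
    for line in text.strip().splitlines():
        entry = parse_line(line)
        counts[(entry["time"], entry["msg"])] += 1
        if "the object has been modified" in entry.get("error", ""):
            conflicts.append(entry)
    return counts, conflicts


counts, conflicts = summarize(LOG)
for (timestamp, msg), n in sorted(counts.items()):
    print(timestamp, repr(msg), n)
print("conflicts:", [(c["csv"], c["operatorGroup"]) for c in conflicts])
```

Several identical messages within the same second, as at 19:58:24 here, are the sign of the repeated checking described above.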

Expected results:
No excessive errors when checking the packageserver CSV. Some initial errors are expected at startup, but they should settle down and not persist.


Additional info:

Comment 2 Ben Luddy 2021-05-07 13:58:57 UTC
Raising priority due to likely impact on clusteroperator Availability condition.

Comment 3 Ben Luddy 2021-05-07 14:25:08 UTC
Looking at https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26016/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1390634961685450752:

May 07 13:03:53.558 E clusteroperator/operator-lifecycle-manager-packageserver condition/Available status/False reason/ClusterServiceVersionNotSucceeded changed: ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver is in phase InstallReady with reason: ComponentUnhealthy, message: installing: deployment changed old hash=797b6f8d96, new hash=55bf776f77
May 07 13:03:53.558 - 19s   E clusteroperator/operator-lifecycle-manager-packageserver condition/Available status/False reason/ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver is in phase InstallReady with reason: ComponentUnhealthy, message: installing: deployment changed old hash=797b6f8d96, new hash=55bf776f77

From the audit logs:

{"timestamp":"2021-05-07T13:03:53.337237Z","verb":"update","username":"system:serviceaccount:openshift-cluster-version:default","name":"packageserver","namespace":"openshift-operator-lifecycle-manager"}
{"timestamp":"2021-05-07T13:03:53.362000Z","verb":"update","username":"system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount","name":"packageserver","namespace":"openshift-operator-lifecycle-manager"}
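
A sketch of how such pairs can be picked out of audit events: it looks for consecutive updates to the same object made by different users within a short window. The field layout follows the trimmed events as quoted above, not the full audit event schema; `parse_ts`, `competing_writes`, and the one-second window are assumptions made for illustration.

```python
import json
from datetime import datetime

# The two audit events quoted above.
AUDIT = '''
{"timestamp":"2021-05-07T13:03:53.337237Z","verb":"update","username":"system:serviceaccount:openshift-cluster-version:default","name":"packageserver","namespace":"openshift-operator-lifecycle-manager"}
{"timestamp":"2021-05-07T13:03:53.362000Z","verb":"update","username":"system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount","name":"packageserver","namespace":"openshift-operator-lifecycle-manager"}
'''


def parse_ts(value):
    """Parse the microsecond UTC timestamps used in the events above."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ")


def competing_writes(text, window_seconds=1.0):
    """Return (first_user, second_user, gap_seconds) for close update pairs."""
    events = [json.loads(line) for line in text.strip().splitlines()]
    events = [e for e in events if e["verb"] == "update"]
    events.sort(key=lambda e: parse_ts(e["timestamp"]))
    pairs = []
    for prev, cur in zip(events, events[1:]):
        same_object = (prev["namespace"], prev["name"]) == (cur["namespace"], cur["name"])
        gap = (parse_ts(cur["timestamp"]) - parse_ts(prev["timestamp"])).total_seconds()
        if same_object and prev["username"] != cur["username"] and gap <= window_seconds:
            pairs.append((prev["username"], cur["username"], gap))
    return pairs


for first, second, gap in competing_writes(AUDIT):
    print(f"{first} then {second}, {gap * 1000:.1f} ms apart")
```

A close pair like this only shows that two writers touched the object back to back; it does not by itself say whether either update was legitimate.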

Comment 6 Ben Luddy 2021-05-10 19:41:21 UTC
The ClusterOperator impact appears to be due to legitimate CSV updates and not due to this contention. The Available computation itself needs to be smarter to avoid flapping in this case; opened https://bugzilla.redhat.com/show_bug.cgi?id=1959158.

Comment 7 Haseeb Tariq 2021-05-14 23:10:34 UTC
This will likely be addressed in 4.9, as there is ongoing work to make packageserver work in a single-node configuration, which would require OLM to create the packageserver CSV instead of the CVO.
See: https://issues.redhat.com/browse/OLM-2078

Additionally the CVO should have a feature that lets us remove existing manifests via an annotation.
https://github.com/openshift/cluster-version-operator/pull/438

This can be used to remove the old CVO managed packageserver CSV.

Removing myself as the assignee as I am no longer actively working on this at the moment.

Comment 10 tflannag 2021-12-01 21:55:39 UTC
I'm inclined to close this BZ out as CLOSED - WONTFIX, as this contention between the CVO and OLM is alleviated in 4.9+ OCP minor versions, where the PackageServer CSV is now managed by the OLM stack, and there's no clear backportable way to alleviate this contention in versions earlier than 4.9. See the comment from Haseeb (https://bugzilla.redhat.com/show_bug.cgi?id=1953715#c7) further up in this comment chain for a better explanation.

