Bug 1969320 - CVO hotloops on clusterserviceversions packageserver
Summary: CVO hotloops on clusterserviceversions packageserver
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
low
low
Target Milestone: ---
: 4.7.z
Assignee: Vadim Rutkovsky
QA Contact: Yang Yang
URL:
Whiteboard:
Depends On: 1881522
Blocks:
TreeView+ depends on / blocked
 
Reported: 2021-06-08 08:05 UTC by Vadim Rutkovsky
Modified: 2021-09-21 22:33 UTC (History)
9 users (show)

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of: 1881522
Environment:
Last Closed: 2021-09-21 22:32:59 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github operator-framework operator-lifecycle-manager pull 2329 0 None None None 2021-08-17 17:08:05 UTC
Red Hat Product Errata RHBA-2021:3510 0 None None None 2021-09-21 22:33:17 UTC

Comment 2 Yang Yang 2021-06-29 11:59:46 UTC
Verifying with pre-merge build failed.

Build a payload with the linked PR:

 $ build operator-framework/operator-lifecycle-manager#2192
 job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1409806438900961280 succeeded

Install a cluster the payload

 $ oc get clusterversion
NAME      VERSION                                                  AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.ci.test-2021-06-29-094313-ci-ln-7kgwcsb-latest   True        False         39m     Cluster version is 4.7.0-0.ci.test-2021-06-29-094313-ci-ln-7kgwcsb-latest

Check the packageserver in audit log

 $ masters=$(oc get no -l node-role.kubernetes.io/master | sed '1d' | awk '{print $1}')
 $ oc adm node-logs $masters --path=kube-apiserver/audit.log --raw | zgrep -h '"verb":"update".*"resource":".*"packageserver"' 2>/dev/null | jq -r '.user.username + " " + (.objectRef | .resource + " " + .namespace + " " + .name + " " + .apiGroup) + " " + .stageTimestamp + " " + (.responseStatus | tostring)' | grep "cluster-version" | sort
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-29T11:28:10.934577Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-29T11:31:28.956997Z {"metadata":{},"code":200}

There are 2 records indicating that CVO is updating the packageserver

After a few minutes, check it again
# oc adm node-logs $masters --path=kube-apiserver/audit.log --raw | zgrep -h '"verb":"update".*"resource":".*"packageserver"' 2>/dev/null | jq -r '.user.username + " " + (.objectRef | .resource + " " + .namespace + " " + .name + " " + .apiGroup) + " " + .stageTimestamp + " " + (.responseStatus | tostring)' | grep "cluster-version" | sort

The records disappear.

After a few minutes, check it once more.

# oc adm node-logs $masters --path=kube-apiserver/audit.log --raw | zgrep -h '"verb":"update".*"resource":".*"packageserver"' 2>/dev/null | jq -r '.user.username + " " + (.objectRef | .resource + " " + .namespace + " " + .name + " " + .apiGroup) + " " + .stageTimestamp + " " + (.responseStatus | tostring)' | grep "cluster-version" | sort
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-29T11:47:59.213667Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-29T11:51:17.181699Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-29T11:54:35.202873Z {"metadata":{},"code":200}


CVO re-creates the packageserver constantly once more.

Comment 3 Vadim Rutkovsky 2021-06-29 18:50:43 UTC
Right, we're waiting for https://github.com/operator-framework/operator-lifecycle-manager/pull/2192 to be approved for cherry-picking (same fields missing in CVO logs of latest e2e-serial run in 4.7)

Comment 7 Yang Yang 2021-08-17 07:28:49 UTC
Verifying with 4.7.0-0.nightly-2021-08-16-204701 failed.

$ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-08-16-204701 | grep operator-lifecycle-manager
  operator-lifecycle-manager                     https://github.com/operator-framework/operator-lifecycle-manager            4bfcded86491aa59e42d5b4288a79fcc673afc91

operator-lifecycle-manager/ (master) $ git --no-pager log --first-parent --oneline -3 origin/release-4.7            [13:47:58]
4bfcded86 (origin/release-4.7) packageserver CSV: add missing properties (#2192)
b0453a0cb Fix for resolution error on channels with deprecated inner entries. (#2313)
dcbab9aff Merge pull request #2287 from dinhxuanvu/test-4.7

So the payload includes the olm fix.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-08-16-204701   True        False         6m38s   Cluster version is 4.7.0-0.nightly-2021-08-16-204701

# masters=$(oc get no -l node-role.kubernetes.io/master | sed '1d' | awk '{print $1}')

# oc adm node-logs $masters --path=kube-apiserver/audit.log --raw | zgrep -h '"verb":"update".*"resource":".*"packageserver"' 2>/dev/null | jq -r '.user.username + " " + (.objectRef | .resource + " " + .namespace + " " + .name + " " + .apiGroup) + " " + .stageTimestamp + " " + (.responseStatus | tostring)' | sort
system:serviceaccount:kube-system:deployment-controller deployments openshift-operator-lifecycle-manager packageserver apps 2021-08-17T06:11:14.015350Z {"metadata":{},"code":200}
system:serviceaccount:kube-system:deployment-controller deployments openshift-operator-lifecycle-manager packageserver apps 2021-08-17T06:11:14.028274Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:kube-system:deployment-controller deployments openshift-operator-lifecycle-manager packageserver apps 2021-08-17T06:11:14.048081Z {"metadata":{},"code":200}
system:serviceaccount:kube-system:deployment-controller deployments openshift-operator-lifecycle-manager packageserver apps 2021-08-17T06:11:14.374924Z {"metadata":{},"code":200}
system:serviceaccount:kube-system:deployment-controller deployments openshift-operator-lifecycle-manager packageserver apps 2021-08-17T06:11:14.389804Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:kube-system:deployment-controller deployments openshift-operator-lifecycle-manager packageserver apps 2021-08-17T06:11:14.410616Z {"metadata":{},"code":200}
system:serviceaccount:kube-system:deployment-controller deployments openshift-operator-lifecycle-manager packageserver apps 2021-08-17T06:11:14.420149Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:19:38.715374Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:43:56.864047Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:47:14.952895Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:50:32.963959Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:53:51.035193Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:57:09.090293Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T07:00:27.203482Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T07:03:45.506066Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T07:07:03.421687Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T07:10:21.476466Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:13:02.936977Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:13:03.053232Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:13:03.216338Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:13:04.001196Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:13:04.640870Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:16:00.965559Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:16:00.996226Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:16:01.014947Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:19:38.736421Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:19:38.755656Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:19:38.767784Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:19:38.767877Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T07:10:21.495879Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T07:10:21.512740Z {"metadata":{},"code":200}


CVO still loops on the clusterserviceversions packageserver.

Checking CVO log:

# grep "Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver" /tmp/cvo.log 
I0817 06:12:46.514363       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:16:00.823235       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:19:38.585097       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:23:54.677414       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:27:26.615236       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:30:44.640852       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:34:02.749232       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:37:20.793513       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:40:38.797883       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:43:56.846988       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:47:14.936912       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:50:32.935138       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:53:51.018298       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:57:09.073345       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 07:00:27.186033       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 07:03:45.491120       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 07:07:03.401386       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{

Re-opening it.

Comment 9 Vadim Rutkovsky 2021-08-17 17:09:41 UTC
```
I0817 06:12:46.514363       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
  	Object: map[string]interface{}{
  		"apiVersion": string("operators.coreos.com/v1alpha1"),
  		"kind":       string("ClusterServiceVersion"),
  		"metadata":   map[string]interface{}{"annotations": map[string]interface{}{"include.release.openshift.io/self-managed-high-availability": string("true"), "olm.operatorGroup": string("olm-operators"), "olm.operatorNamespace": string("openshift-operator-lifecycle-manager"), "olm.targetNamespaces": string("openshift-operator-lifecycle-manager")}, "creationTimestamp": string("2021-08-17T05:58:20Z"), "generation": int64(1), "labels": map[string]interface{}{"olm.api.4bca9f23e412d79d": string("provided"), "olm.clusteroperator.name": string("operator-lifecycle-manager-packageserver"), "olm.version": string("0.17.0")}, ...},
  		"spec": map[string]interface{}{
  			"apiservicedefinitions":     map[string]interface{}{"owned": []interface{}{map[string]interface{}{"containerPort": int64(5443), "deploymentName": string("packageserver"), "description": string("A PackageManifest is a resource generated from existing CatalogS"...), "displayName": string("PackageManifest"), ...}}},
- 			"cleanup":                   map[string]interface{}{"enabled": bool(false)},
```


Created https://github.com/operator-framework/operator-lifecycle-manager/pull/2329 to remove cleanup, which is not available in 4.7

Comment 10 Yang Yang 2021-08-18 07:39:04 UTC
Pre-merge verification gets passed.

1. Build a payload with operator-framework/operator-lifecycle-manager#2329 using cluster-bot
   
   build operator-framework/operator-lifecycle-manager#2329

2. Install a cluster with the payload built in step 1.

3. Check the audit log

# oc get clusterversion
NAME      VERSION                                                  AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.ci.test-2021-08-18-061332-ci-ln-6jdrj8k-latest   True        False         18m     Cluster version is 4.7.0-0.ci.test-2021-08-18-061332-ci-ln-6jdrj8k-latest

# masters=$(oc get no -l node-role.kubernetes.io/master | sed '1d' | awk '{print $1}')

# oc adm node-logs $masters --path=kube-apiserver/audit.log --raw | zgrep -h '"verb":"update".*"resource":".*"packageserver"' 2>/dev/null | jq -r '.user.username + " " + (.objectRef | .resource + " " + .namespace + " " + .name + " " + .apiGroup) + " " + .stageTimestamp + " " + (.responseStatus | tostring)' | sort
null

4. Check CVO log

# grep "Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver" /tmp/cvo1.log 
null

CVO no longer hotloops on packageserver.

Comment 11 Yang Yang 2021-08-18 07:49:54 UTC
It looks like PR[1] doesn't allow me to do the PR pre-merge verification. I'll test it again when it goes to ON_QA.

[1] https://github.com/operator-framework/operator-lifecycle-manager/pull/2329

Comment 12 ximhan 2021-08-20 08:00:10 UTC
OpenShift engineering has decided to NOT ship 4.8.6 on 8/23 due to the following issue.
https://bugzilla.redhat.com/show_bug.cgi?id=1995785
All the fixes part will be now included in 4.8.7 on 8/30.

Comment 14 ximhan 2021-08-21 06:45:09 UTC
OpenShift engineering has decided to NOT ship 4.7.25 on 8/24 due to the following issue.
https://bugzilla.redhat.com/show_bug.cgi?id=1995785
All the fixes part will be now included in 4.7.26 on 8/31.

Comment 17 Yang Yang 2021-09-10 03:12:48 UTC
Verifying with 4.7.0-0.nightly-2021-09-09-200524 passed.

# oc adm release info --commits registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-09-09-200524 | grep operator-lifecycle-manager
  operator-lifecycle-manager                     https://github.com/operator-framework/operator-lifecycle-manager            cfea59e1905c10386581a991b989c55d0d8ff393


$ git --no-pager log --first-parent --oneline -3 origin/release-4.7            [10:18:33]
ebe84506e (origin/release-4.7) installplans: retry crd updates on conflicts (#2293) (#2314)
cfea59e19 manifests/0000_50_olm_15-packageserver.clusterserviceversion.yaml: remove spec.cleanup (#2329)
cf5ef8521 Merge pull request #2351 from timflannagan/fix-build-root-dockerfile

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-09-09-200524   True        False         5m29s   Cluster version is 4.7.0-0.nightly-2021-09-09-200524

# oc adm node-logs $masters --path=kube-apiserver/audit.log --raw | zgrep -h '"verb":"update".*"resource":".*"packageserver"' 2>/dev/null | jq -r '.user.username + " " + (.objectRef | .resource + " " + .namespace + " " + .name + " " + .apiGroup) + " " + .stageTimestamp + " " + (.responseStatus | tostring)' | grep "cluster-version" | sort

null

# oc logs cluster-version-operator-7d4bdf9578-7ngp5 -n openshift-cluster-version | grep "Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver"

null


CVO no longer hotloops on packageserver. Moving it to verified state.

Comment 20 errata-xmlrpc 2021-09-21 22:32:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.31 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3510


Note You need to log in before you can comment on or make changes to this bug.