Bug 1969320

Summary: CVO hotloops on clusterserviceversions packageserver
Product: OpenShift Container Platform Reporter: Vadim Rutkovsky <vrutkovs>
Component: Cluster Version OperatorAssignee: Vadim Rutkovsky <vrutkovs>
Status: CLOSED ERRATA QA Contact: Yang Yang <yanyang>
Severity: low Docs Contact:
Priority: low    
Version: 4.7CC: aos-bugs, bleanhar, jack.ottofaro, jokerman, lmohanty, sttts, vrutkovs, wking, yanyang
Target Milestone: ---   
Target Release: 4.7.z   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1881522 Environment:
Last Closed: 2021-09-21 22:32:59 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1881522    
Bug Blocks:    

Comment 2 Yang Yang 2021-06-29 11:59:46 UTC
Verifying with pre-merge build failed.

Build a payload with the linked PR:

 $ build operator-framework/operator-lifecycle-manager#2192
 job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-gcp/1409806438900961280 succeeded

Install a cluster the payload

 $ oc get clusterversion
NAME      VERSION                                                  AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.ci.test-2021-06-29-094313-ci-ln-7kgwcsb-latest   True        False         39m     Cluster version is 4.7.0-0.ci.test-2021-06-29-094313-ci-ln-7kgwcsb-latest

Check the packageserver in audit log

 $ masters=$(oc get no -l node-role.kubernetes.io/master | sed '1d' | awk '{print $1}')
 $ oc adm node-logs $masters --path=kube-apiserver/audit.log --raw | zgrep -h '"verb":"update".*"resource":".*"packageserver"' 2>/dev/null | jq -r '.user.username + " " + (.objectRef | .resource + " " + .namespace + " " + .name + " " + .apiGroup) + " " + .stageTimestamp + " " + (.responseStatus | tostring)' | grep "cluster-version" | sort
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-29T11:28:10.934577Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-29T11:31:28.956997Z {"metadata":{},"code":200}

There are 2 records indicating that CVO is updating the packageserver

After a few minutes, check it again
# oc adm node-logs $masters --path=kube-apiserver/audit.log --raw | zgrep -h '"verb":"update".*"resource":".*"packageserver"' 2>/dev/null | jq -r '.user.username + " " + (.objectRef | .resource + " " + .namespace + " " + .name + " " + .apiGroup) + " " + .stageTimestamp + " " + (.responseStatus | tostring)' | grep "cluster-version" | sort

The records disappear.

After a few minutes, check it once more.

# oc adm node-logs $masters --path=kube-apiserver/audit.log --raw | zgrep -h '"verb":"update".*"resource":".*"packageserver"' 2>/dev/null | jq -r '.user.username + " " + (.objectRef | .resource + " " + .namespace + " " + .name + " " + .apiGroup) + " " + .stageTimestamp + " " + (.responseStatus | tostring)' | grep "cluster-version" | sort
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-29T11:47:59.213667Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-29T11:51:17.181699Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-29T11:54:35.202873Z {"metadata":{},"code":200}


CVO re-creates the packageserver constantly once more.

Comment 3 Vadim Rutkovsky 2021-06-29 18:50:43 UTC
Right, we're waiting for https://github.com/operator-framework/operator-lifecycle-manager/pull/2192 to be approved for cherry-picking (same fields missing in CVO logs of latest e2e-serial run in 4.7)

Comment 7 Yang Yang 2021-08-17 07:28:49 UTC
Verifying with 4.7.0-0.nightly-2021-08-16-204701 failed.

$ oc adm release info --commits registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-08-16-204701 | grep operator-lifecycle-manager
  operator-lifecycle-manager                     https://github.com/operator-framework/operator-lifecycle-manager            4bfcded86491aa59e42d5b4288a79fcc673afc91

operator-lifecycle-manager/ (master) $ git --no-pager log --first-parent --oneline -3 origin/release-4.7            [13:47:58]
4bfcded86 (origin/release-4.7) packageserver CSV: add missing properties (#2192)
b0453a0cb Fix for resolution error on channels with deprecated inner entries. (#2313)
dcbab9aff Merge pull request #2287 from dinhxuanvu/test-4.7

So the payload includes the olm fix.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-08-16-204701   True        False         6m38s   Cluster version is 4.7.0-0.nightly-2021-08-16-204701

# masters=$(oc get no -l node-role.kubernetes.io/master | sed '1d' | awk '{print $1}')

# oc adm node-logs $masters --path=kube-apiserver/audit.log --raw | zgrep -h '"verb":"update".*"resource":".*"packageserver"' 2>/dev/null | jq -r '.user.username + " " + (.objectRef | .resource + " " + .namespace + " " + .name + " " + .apiGroup) + " " + .stageTimestamp + " " + (.responseStatus | tostring)' | sort
system:serviceaccount:kube-system:deployment-controller deployments openshift-operator-lifecycle-manager packageserver apps 2021-08-17T06:11:14.015350Z {"metadata":{},"code":200}
system:serviceaccount:kube-system:deployment-controller deployments openshift-operator-lifecycle-manager packageserver apps 2021-08-17T06:11:14.028274Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:kube-system:deployment-controller deployments openshift-operator-lifecycle-manager packageserver apps 2021-08-17T06:11:14.048081Z {"metadata":{},"code":200}
system:serviceaccount:kube-system:deployment-controller deployments openshift-operator-lifecycle-manager packageserver apps 2021-08-17T06:11:14.374924Z {"metadata":{},"code":200}
system:serviceaccount:kube-system:deployment-controller deployments openshift-operator-lifecycle-manager packageserver apps 2021-08-17T06:11:14.389804Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:kube-system:deployment-controller deployments openshift-operator-lifecycle-manager packageserver apps 2021-08-17T06:11:14.410616Z {"metadata":{},"code":200}
system:serviceaccount:kube-system:deployment-controller deployments openshift-operator-lifecycle-manager packageserver apps 2021-08-17T06:11:14.420149Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:19:38.715374Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:43:56.864047Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:47:14.952895Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:50:32.963959Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:53:51.035193Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:57:09.090293Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T07:00:27.203482Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T07:03:45.506066Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T07:07:03.421687Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T07:10:21.476466Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:13:02.936977Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:13:03.053232Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:13:03.216338Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:13:04.001196Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:13:04.640870Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:16:00.965559Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:16:00.996226Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:16:01.014947Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:19:38.736421Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:19:38.755656Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:19:38.767784Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T06:19:38.767877Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T07:10:21.495879Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-08-17T07:10:21.512740Z {"metadata":{},"code":200}


CVO still loops on the clusterserviceversions packageserver.

Checking CVO log:

# grep "Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver" /tmp/cvo.log 
I0817 06:12:46.514363       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:16:00.823235       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:19:38.585097       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:23:54.677414       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:27:26.615236       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:30:44.640852       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:34:02.749232       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:37:20.793513       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:40:38.797883       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:43:56.846988       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:47:14.936912       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:50:32.935138       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:53:51.018298       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 06:57:09.073345       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 07:00:27.186033       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 07:03:45.491120       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
I0817 07:07:03.401386       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{

Re-opening it.

Comment 9 Vadim Rutkovsky 2021-08-17 17:09:41 UTC
```
I0817 06:12:46.514363       1 generic.go:81] Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver due to diff:   &unstructured.Unstructured{
  	Object: map[string]interface{}{
  		"apiVersion": string("operators.coreos.com/v1alpha1"),
  		"kind":       string("ClusterServiceVersion"),
  		"metadata":   map[string]interface{}{"annotations": map[string]interface{}{"include.release.openshift.io/self-managed-high-availability": string("true"), "olm.operatorGroup": string("olm-operators"), "olm.operatorNamespace": string("openshift-operator-lifecycle-manager"), "olm.targetNamespaces": string("openshift-operator-lifecycle-manager")}, "creationTimestamp": string("2021-08-17T05:58:20Z"), "generation": int64(1), "labels": map[string]interface{}{"olm.api.4bca9f23e412d79d": string("provided"), "olm.clusteroperator.name": string("operator-lifecycle-manager-packageserver"), "olm.version": string("0.17.0")}, ...},
  		"spec": map[string]interface{}{
  			"apiservicedefinitions":     map[string]interface{}{"owned": []interface{}{map[string]interface{}{"containerPort": int64(5443), "deploymentName": string("packageserver"), "description": string("A PackageManifest is a resource generated from existing CatalogS"...), "displayName": string("PackageManifest"), ...}}},
- 			"cleanup":                   map[string]interface{}{"enabled": bool(false)},
```


Created https://github.com/operator-framework/operator-lifecycle-manager/pull/2329 to remove cleanup, which is not available in 4.7

Comment 10 Yang Yang 2021-08-18 07:39:04 UTC
Pre-merge verification gets passed.

1. Build a payload with operator-framework/operator-lifecycle-manager#2329 using cluster-bot
   
   build operator-framework/operator-lifecycle-manager#2329

2. Install a cluster with the payload built in step 1.

3. Check the audit log

# oc get clusterversion
NAME      VERSION                                                  AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.ci.test-2021-08-18-061332-ci-ln-6jdrj8k-latest   True        False         18m     Cluster version is 4.7.0-0.ci.test-2021-08-18-061332-ci-ln-6jdrj8k-latest

# masters=$(oc get no -l node-role.kubernetes.io/master | sed '1d' | awk '{print $1}')

# oc adm node-logs $masters --path=kube-apiserver/audit.log --raw | zgrep -h '"verb":"update".*"resource":".*"packageserver"' 2>/dev/null | jq -r '.user.username + " " + (.objectRef | .resource + " " + .namespace + " " + .name + " " + .apiGroup) + " " + .stageTimestamp + " " + (.responseStatus | tostring)' | sort
null

4. Check CVO log

# grep "Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver" /tmp/cvo1.log 
null

CVO no longer hotloops on packageserver.

Comment 11 Yang Yang 2021-08-18 07:49:54 UTC
It looks like PR[1] doesn't allow me to do the PR pre-merge verification. I'll test it again when it goes to ON_QA.

[1] https://github.com/operator-framework/operator-lifecycle-manager/pull/2329

Comment 12 ximhan 2021-08-20 08:00:10 UTC
OpenShift engineering has decided to NOT ship 4.8.6 on 8/23 due to the following issue.
https://bugzilla.redhat.com/show_bug.cgi?id=1995785
All the fixes part will be now included in 4.8.7 on 8/30.

Comment 14 ximhan 2021-08-21 06:45:09 UTC
OpenShift engineering has decided to NOT ship 4.7.25 on 8/24 due to the following issue.
https://bugzilla.redhat.com/show_bug.cgi?id=1995785
All the fixes part will be now included in 4.7.26 on 8/31.

Comment 17 Yang Yang 2021-09-10 03:12:48 UTC
Verifying with 4.7.0-0.nightly-2021-09-09-200524 passed.

# oc adm release info --commits registry.ci.openshift.org/ocp/release:4.7.0-0.nightly-2021-09-09-200524 | grep operator-lifecycle-manager
  operator-lifecycle-manager                     https://github.com/operator-framework/operator-lifecycle-manager            cfea59e1905c10386581a991b989c55d0d8ff393


$ git --no-pager log --first-parent --oneline -3 origin/release-4.7            [10:18:33]
ebe84506e (origin/release-4.7) installplans: retry crd updates on conflicts (#2293) (#2314)
cfea59e19 manifests/0000_50_olm_15-packageserver.clusterserviceversion.yaml: remove spec.cleanup (#2329)
cf5ef8521 Merge pull request #2351 from timflannagan/fix-build-root-dockerfile

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2021-09-09-200524   True        False         5m29s   Cluster version is 4.7.0-0.nightly-2021-09-09-200524

# oc adm node-logs $masters --path=kube-apiserver/audit.log --raw | zgrep -h '"verb":"update".*"resource":".*"packageserver"' 2>/dev/null | jq -r '.user.username + " " + (.objectRef | .resource + " " + .namespace + " " + .name + " " + .apiGroup) + " " + .stageTimestamp + " " + (.responseStatus | tostring)' | grep "cluster-version" | sort

null

# oc logs cluster-version-operator-7d4bdf9578-7ngp5 -n openshift-cluster-version | grep "Updating ClusterServiceVersion openshift-operator-lifecycle-manager/packageserver"

null


CVO no longer hotloops on packageserver. Moving it to verified state.

Comment 20 errata-xmlrpc 2021-09-21 22:32:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.31 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3510