Bug 1881522 - CVO hotloops on clusterserviceversions packageserver
Summary: CVO hotloops on clusterserviceversions packageserver
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.8.0
Assignee: Vadim Rutkovsky
QA Contact: Yang Yang
URL:
Whiteboard:
Depends On:
Blocks: 1969320
 
Reported: 2020-09-22 15:21 UTC by Stefan Schimanski
Modified: 2021-07-27 22:34 UTC (History)
CC List: 8 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1969320 (view as bug list)
Environment:
Last Closed: 2021-07-27 22:33:30 UTC
Target Upstream Version:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Github openshift operator-framework-olm pull 84 0 None open Bug 1881522: packageserver CSV: add missing properties 2021-06-02 15:05:23 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:34:13 UTC

Description Stefan Schimanski 2020-09-22 15:21:15 UTC
metadata.managedFields[0].time shows that the CVO is updating this resource continuously:

{"count":91,"path":"/apis/operators.coreos.com/v1alpha1/namespaces/openshift-operator-lifecycle-manager/clusterserviceversions/packageserver"}

Looks like it races with OLM:

  managedFields:
  - apiVersion: operators.coreos.com/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          .: {}
          f:olm.clusteroperator.name: {}
          f:olm.version: {}
      f:spec:
        .: {}
        f:apiservicedefinitions:
          .: {}
          f:owned: {}
        f:description: {}
        f:displayName: {}
        f:install:
          .: {}
          f:spec:
            .: {}
            f:clusterPermissions: {}
          f:strategy: {}
        f:installModes: {}
        f:keywords: {}
        f:links: {}
        f:maintainers: {}
        f:maturity: {}
        f:minKubeVersion: {}
        f:provider:
          .: {}
          f:name: {}
        f:version: {}
    manager: cluster-version-operator
    operation: Update
    time: "2020-09-22T15:18:01Z"
  - apiVersion: operators.coreos.com/v1alpha1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:annotations:
          .: {}
          f:olm.operatorGroup: {}
          f:olm.operatorNamespace: {}
          f:olm.targetNamespaces: {}
        f:labels:
          f:olm.api.4bca9f23e412d79d: {}
      f:spec:
        f:customresourcedefinitions: {}
        f:install:
          f:spec:
            f:deployments: {}
      f:status:
        .: {}
        f:certsLastUpdated: {}
        f:certsRotateAt: {}
        f:conditions: {}
        f:lastTransitionTime: {}
        f:lastUpdateTime: {}
        f:message: {}
        f:phase: {}
        f:reason: {}
        f:requirementStatus: {}
    manager: olm
    operation: Update
    time: "2020-09-22T15:18:16Z"
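The two managedFields entries above show both writers asserting overlapping spec fields. A quick way to see who last wrote what is to pull the manager names and timestamps out with jq. The snippet below is a local sketch using sample JSON that mirrors the YAML above; on a cluster you would instead feed it `oc get csv packageserver -n openshift-operator-lifecycle-manager -o json`:

```shell
# Sample JSON with two managedFields entries, mirroring the YAML above
# (local stand-in for `oc get csv ... -o json`).
cat <<'EOF' > /tmp/csv.json
{"metadata":{"managedFields":[
  {"manager":"cluster-version-operator","operation":"Update","time":"2020-09-22T15:18:01Z"},
  {"manager":"olm","operation":"Update","time":"2020-09-22T15:18:16Z"}
]}}
EOF
# Print each field manager with its last write time; two managers taking
# turns rewriting the same object is the signature of this hotloop.
jq -r '.metadata.managedFields[] | .manager + " " + .time' /tmp/csv.json
```

Seconds-apart timestamps from different managers on every sync are a strong hint that neither writer's desired state matches the other's.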

Comment 1 W. Trevor King 2020-09-22 21:35:53 UTC
Stefan suggests possibly waiting until API-server support for server-side apply [1] goes GA and rerolling the CVO's apply logic to use that instead of client-side merging, which might help here.  And bug 1879184 might end up with a [Late] CI guard based on the audit logs.  But whatever is going on here is unlikely to be new in 4.6, so punting to 4.7.

[1]: https://kubernetes.io/blog/2020/04/01/kubernetes-1.18-feature-server-side-apply-beta-2/

Comment 3 Stefan Schimanski 2020-09-23 13:11:51 UTC
I don't think https://bugzilla.redhat.com/show_bug.cgi?id=1881522#c1 reflects what I meant. I meant that writing perfect client-side merge functions for all types is an endless game with plenty of room for mistakes and failure. Instead, the right solution is to triage these bugs, fix the manifests for now, and add an e2e test that uncovers such issues before new manifests merge.
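"Fix the manifests" here means making the CVO-rendered manifest specify the fields it asserts, so the rendered object and the in-cluster object agree and the CVO stops issuing corrective updates. As an illustrative, hypothetical fragment (not the actual content of the linked PR), the CSV manifest would explicitly carry fields such as the `maturity` and `minKubeVersion` entries seen in the CVO's managedFields above:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: ClusterServiceVersion
metadata:
  name: packageserver
  namespace: openshift-operator-lifecycle-manager
spec:
  # Hypothetical sketch: explicitly pinning fields the CVO already asserts
  # via managedFields, so the client-side merge result matches the live
  # object and no update is issued on each sync loop.
  maturity: alpha
  minKubeVersion: 1.11.0
```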

Comment 5 W. Trevor King 2020-09-23 16:58:05 UTC
> ... fix the manifests for now...

If that's what this bug is about, it should be assigned to the samples team, right?

> ... and add an e2e test that uncovers the issues before new manifests merge.

This is bug 1879184, right?

Punting back to 4.7, because I don't see any new-in-4.6 regressions here, and it's really late in the 4.6 cycle to make new 4.6 blockers unless we have a solid story around why this is a critical issue.

Comment 6 W. Trevor King 2020-10-02 23:11:22 UTC
It's end of sprint, and this is not going to get fixed in the next few hours.  Hopefully we will at least get the Late audit guard from bug 1879184 in next sprint, and then we'll see which team should fix this issue.

Comment 7 Jack Ottofaro 2020-10-23 19:03:28 UTC
Adding UpcomingSprint as we have reached the end of the current sprint and pushing this bug to the next sprint.

Comment 9 Yang Yang 2021-06-08 01:04:25 UTC
Reproducing with 4.8.0-fc.3

# masters=$(oc get no -l node-role.kubernetes.io/master | sed '1d' | awk '{print $1}')

# oc adm node-logs $masters --path=kube-apiserver/audit.log --raw | zgrep -h '"verb":"update".*"resource":".*"packageserver"' 2>/dev/null | jq -r '.user.username + " " + (.objectRef | .resource + " " + .namespace + " " + .name + " " + .apiGroup) + " " + .stageTimestamp + " " + (.responseStatus | tostring)' | sort
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T00:37:50.673758Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T00:46:41.857536Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T00:37:50.691220Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T00:37:50.706568Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T00:37:50.719282Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T00:37:50.719576Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T00:46:41.876512Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T00:46:41.894444Z {"metadata":{},"code":200}

CVO updates the packageserver and races with OLM.
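The audit-log pipeline above can be condensed into a per-writer count to quantify the hotloop. Below is a self-contained sketch over sample audit events (trimmed to the fields used); on a cluster the input would come from `oc adm node-logs $masters --path=kube-apiserver/audit.log --raw` as in the command above:

```shell
# Sample audit events, one JSON object per line (local stand-in for the
# raw kube-apiserver audit log).
cat <<'EOF' > /tmp/audit.jsonl
{"verb":"update","user":{"username":"system:serviceaccount:openshift-cluster-version:default"},"objectRef":{"resource":"clusterserviceversions","name":"packageserver"},"responseStatus":{"code":200}}
{"verb":"update","user":{"username":"system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount"},"objectRef":{"resource":"clusterserviceversions","name":"packageserver"},"responseStatus":{"reason":"Conflict","code":409}}
{"verb":"update","user":{"username":"system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount"},"objectRef":{"resource":"clusterserviceversions","name":"packageserver"},"responseStatus":{"code":200}}
EOF
# Count update attempts on the packageserver CSV per writer; a steadily
# growing cluster-version count indicates the CVO is hotlooping.
jq -r 'select(.verb == "update" and .objectRef.name == "packageserver")
       | .user.username' /tmp/audit.jsonl | sort | uniq -c
```

With the fix in place, the cluster-version service account should drop out of this count entirely, which is exactly what the verification in comment 13 checks.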

Comment 10 Yang Yang 2021-06-08 03:32:43 UTC
Attempting to verify it with 4.8.0-0.nightly-2021-06-07-180258

# oc adm release info --commits registry.ci.openshift.org/ocp/release:4.8.0-0.nightly-2021-06-07-180258 | grep -i olm
  operator-lifecycle-manager                     https://github.com/openshift/operator-framework-olm                         1adb4495ae3cec2189e74bd354af348ed5ec7b9b

$ git --no-pager log --first-parent --oneline -3 origin/release-4.8
1adb4495a (HEAD -> master, origin/release-4.9, origin/release-4.8, origin/master, origin/HEAD) Merge pull request #84 from vrutkovs/cvo-hotlooping
ca1f0b69c Merge pull request #83 from hasbro17/fix-ssa-error
0e9f3bffa Merge pull request #82 from joelanford/bz-1961472

The nightly build includes the fix.

# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-06-07-180258   True        False         2m      Cluster version is 4.8.0-0.nightly-2021-06-07-180258

# masters=$(oc get no -l node-role.kubernetes.io/master | sed '1d' | awk '{print $1}')

# oc adm node-logs $masters --path=kube-apiserver/audit.log --raw | zgrep -h '"verb":"update".*"resource":".*"packageserver"' 2>/dev/null | jq -r '.user.username + " " + (.objectRef | .resource + " " + .namespace + " " + .name + " " + .apiGroup) + " " + .stageTimestamp + " " + (.responseStatus | tostring)' | sort
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:39:54.025056Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:43:07.894875Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:46:50.647544Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:51:12.037391Z {"metadata":{},"code":200}
system:serviceaccount:openshift-cluster-version:default clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:54:24.973357Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:40:25.235770Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:40:25.577762Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:40:39.919881Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:40:40.303295Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:40:40.320423Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:40:40.803135Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:43:07.921885Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:43:07.944801Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:43:07.961416Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:46:50.679361Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:46:50.712323Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:46:50.726868Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:46:50.732949Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:51:12.058207Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:51:12.075421Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:51:12.086471Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:51:12.086550Z {"metadata":{},"status":"Failure","reason":"Conflict","code":409}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:54:24.994938Z {"metadata":{},"code":200}
system:serviceaccount:openshift-operator-lifecycle-manager:olm-operator-serviceaccount clusterserviceversions openshift-operator-lifecycle-manager packageserver operators.coreos.com 2021-06-08T02:54:25.017649Z {"metadata":{},"code":200}

The CVO still updates the packageserver constantly and races with OLM. Re-opening it.
The fix landed only in the OLM component, but this bug targets the CVO component. Is the Bugzilla component selected correctly?

Comment 11 Vadim Rutkovsky 2021-06-08 07:45:58 UTC
Right, this also needs https://github.com/openshift/cluster-version-operator/pull/561 to do a proper comparison of unspecified resource fields.

I'll move the bug to ON_QA again once it merges and a new nightly is available

Comment 12 Lalatendu Mohanty 2021-06-09 20:34:44 UTC
https://github.com/openshift/cluster-version-operator/pull/561 has merged, so moving this to ON_QA.

Comment 13 Yang Yang 2021-06-10 03:03:14 UTC
Verified with 4.8.0-0.nightly-2021-06-09-214128

# masters=$(oc get no -l node-role.kubernetes.io/master | sed '1d' | awk '{print $1}')

# oc adm node-logs $masters --path=kube-apiserver/audit.log --raw | zgrep -h '"verb":"update".*"resource":".*"packageserver"' 2>/dev/null | jq -r '.user.username + " " + (.objectRef | .resource + " " + .namespace + " " + .name + " " + .apiGroup) + " " + .stageTimestamp + " " + (.responseStatus | tostring)' | grep "cluster-version" | sort

null

The CVO no longer updates the packageserver constantly. Moving it to the verified state.

Comment 16 errata-xmlrpc 2021-07-27 22:33:30 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

