Bug 2039228
| Field | Value |
|---|---|
| Summary | CVO hotloops on CronJob openshift-operator-lifecycle-manager/collect-profiles |
| Product | OpenShift Container Platform |
| Component | Cluster Version Operator |
| Version | 4.9 |
| Status | CLOSED DEFERRED |
| Severity | medium |
| Priority | unspecified |
| Reporter | Yang Yang <yanyang> |
| Assignee | David Hurta <dhurta> |
| QA Contact | Yang Yang <yanyang> |
| CC | aos-bugs, dhurta, juzhao, lmohanty, sbiragda, wking |
| Hardware | Unspecified |
| OS | Unspecified |
| Type | Bug |
| Last Closed | 2023-03-09 01:10:52 UTC |
Description
Yang Yang 2022-01-11 10:12:20 UTC

Looking at 4.10.0-fc.1 CI [1], the CVO is fighting with the cronjob-controller:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1481748933687382016/artifacts/launch/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
$ zgrep -h '"resource":"cronjobs"' */*.log.gz | jq -r 'select((.verb | (. != "list" and . != "watch" and . != "get")) and .objectRef.name == "collect-profiles") | .stageTimestamp + " " + .verb + " " + .user.username'
...
2022-01-13T23:55:54.001749Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-13T23:56:51.102056Z update system:serviceaccount:openshift-cluster-version:default
2022-01-14T00:00:00.246540Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.276516Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.281524Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.286255Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.291128Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.297065Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.304868Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.333474Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.341517Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.056386Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.059691Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.074194Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.084415Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.089586Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.094423Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.097614Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.100821Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:02:07.062753Z update system:serviceaccount:openshift-cluster-version:default
2022-01-14T00:05:07.153452Z update system:serviceaccount:openshift-cluster-version:default
2022-01-14T00:05:53.875533Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:05:53.885704Z update system:serviceaccount:kube-system:cronjob-controller
```

I suspect this will be the lack of CVO-side defaulting, because we have no specific CronJob handling today. The manifest is from [2], where it dates back to 4.9 [3].

Back to the CI run [1]:

```
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1481748933687382016/artifacts/launch/gather-must-gather/artifacts/must-gather.tar | tar tz | grep namespaces/openshift-operator-lifecycle-manager/batch/cronjob
quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8a7b47d7347dc9b92454c75485b31bd17d8d20a25021c88cbd1b9985056af355/namespaces/openshift-operator-lifecycle-manager/batch/cronjobs.yaml
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1481748933687382016/artifacts/launch/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8a7b47d7347dc9b92454c75485b31bd17d8d20a25021c88cbd1b9985056af355/namespaces/openshift-operator-lifecycle-manager/batch/cronjobs.yaml
---
apiVersion: batch/v1
items:
- apiVersion: batch/v1
  kind: CronJob
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
    creationTimestamp: "2022-01-13T22:20:28Z"
    generation: 2
    name: collect-profiles
    namespace: openshift-operator-lifecycle-manager
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 43db5612-bebc-4627-b015-e817bac9f065
    resourceVersion: "74890"
    uid: 62b3c639-536c-401f-9548-06c1731d4ef5
  spec:
  ...
```

It's not clear to me why managedFields shows up in the CVO log, but not in the resource collected in the must-gather, nor in the one extracted straight from the cluster in comment 0. If we had that, it might confirm the "we're missing CVO-side defaulting" hypothesis.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1481748933687382016
[2]: https://github.com/openshift/operator-framework-olm/blob/78d2b0af32651aa4aed442858d4584204808d31e/manifests/0000_50_olm_07-collect-profiles.cronjob.yaml
[3]: https://github.com/openshift/operator-framework-olm/pull/112

---

The issue starts from 4.9:

```
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.12    True        False         137m    Cluster version is 4.9.12
# oc -n openshift-cluster-version get po
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-5b8c47f655-rlscv   1/1     Running   0          3h16m
# oc -n openshift-cluster-version logs cluster-version-operator-5b8c47f655-rlscv | grep "Updating .*due to diff"
I0119 01:36:56.822552 1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff: &unstructured.Unstructured{
I0119 01:37:08.795231 1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff: &unstructured.Unstructured{
I0119 01:40:15.737678 1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff: &unstructured.Unstructured{
I0119 01:40:27.617154 1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff: &unstructured.Unstructured{
I0119 01:43:34.559656 1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff: &unstructured.Unstructured{
..
# oc -n openshift-cluster-version logs cluster-version-operator-5b8c47f655-rlscv | grep "Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff" | wc -l
41
```

---

Comment 3 (shweta)

The issue is seen on OCP 4.10 rc-1 (IBM Power platform) as well:

```
[root@rdr-sh-410-rc1-mon01-bastion-0 ~]# oc version
Client Version: 4.10.0-rc.1
Server Version: 4.10.0-rc.1
Kubernetes Version: v1.23.3+b63be7f
[root@rdr-sh-410-rc1-mon01-bastion-0 ~]# oc get pod -n openshift-cluster-version
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-6567cdf878-hm68d   1/1     Running   0          8d
[root@rdr-sh-410-rc1-mon01-bastion-0 ~]# oc logs cluster-version-operator-6567cdf878-hm68d -n openshift-cluster-version > cvo_info.log
[root@rdr-sh-410-rc1-mon01-bastion-0 ~]# grep -o 'Updating .*due to diff' cvo_info.log | sort | uniq -c
      6 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
...
[root@rdr-sh-410-rc1-mon01-bastion-0 ~]# oc -n openshift-cluster-version logs cluster-version-operator-6567cdf878-hm68d | grep "Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff" | wc -l
7
```

---

Comment 5 (Yang Yang)

It seems the issue doesn't occur in 4.12:

```
# oc version
Client Version: 4.12.0-0.nightly-2022-09-28-204419
Kustomize Version: v4.5.4
Server Version: 4.12.0-0.nightly-2022-10-05-053337
Kubernetes Version: v1.25.0+3ef6ef3
# grep -o 'Updating .*due to diff' cvo.log | sort | uniq -c | grep CronJob
      1 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
```

---

(In reply to Yang Yang from comment #5)

Updating and correcting the above after a double-check a few minutes later:

```
# oc logs cluster-version-operator-7865bcc54c-d7gnx -n openshift-cluster-version | grep -o 'Updating .*due to diff' cvo.log | sort | uniq -c | grep CronJob
      4 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
```

It's still reproducible in 4.12.

---

OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira: https://issues.redhat.com/browse/OCPBUGS-9070
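The `grep -o ... | sort | uniq -c` counting used throughout the comments above can be wrapped into a small helper for triaging hotloops from a saved CVO log. This is an illustrative sketch, not part of the bug report; the function name is hypothetical, and the grep pattern is the one the commenters used against CVO `generic.go` log lines.

```shell
#!/bin/sh
# Sketch (hypothetical helper, assuming a CVO log saved to a file):
# count how many times the CVO re-applied each resource "due to diff".
# A resource whose count keeps climbing between runs is a hotloop
# candidate, like the collect-profiles CronJob in this bug.
count_cvo_hotloops() {
    # -o emits only the matching fragment, so identical resources
    # collapse under `sort | uniq -c`; `sort -rn` puts the worst first.
    grep -o 'Updating .*due to diff' "$1" | sort | uniq -c | sort -rn
}
```

Typical usage would be `oc -n openshift-cluster-version logs <cvo-pod> > cvo.log` followed by `count_cvo_hotloops cvo.log`; a healthy cluster shows each resource only once or twice shortly after sync.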