Created attachment 1850058 [details]
CVO log file

Description of problem:

In a freshly installed cluster, the CVO is hot-looping on CronJob openshift-operator-lifecycle-manager/collect-profiles and on Service openshift-monitoring/cluster-monitoring-operator:

# grep -o 'Updating .*due to diff' cvo2.log | sort | uniq -c
     18 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
     12 Updating Service openshift-monitoring/cluster-monitoring-operator due to diff

Looking at the CronJob hot-looping:

# grep -A60 'Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff' cvo2.log | tail -n61
I0110 06:32:44.489277       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff: &unstructured.Unstructured{
  Object: map[string]interface{}{
    "apiVersion": string("batch/v1"),
    "kind": string("CronJob"),
    "metadata": map[string]interface{}{"annotations": map[string]interface{}{"include.release.openshift.io/ibm-cloud-managed": string("true"), "include.release.openshift.io/self-managed-high-availability": string("true")}, "creationTimestamp": string("2022-01-10T04:35:19Z"), "generation": int64(1), "managedFields": []interface{}{map[string]interface{}{"apiVersion": string("batch/v1"), "fieldsType": string("FieldsV1"), "fieldsV1": map[string]interface{}{"f:metadata": map[string]interface{}{"f:annotations": map[string]interface{}{".": map[string]interface{}{}, "f:include.release.openshift.io/ibm-cloud-managed": map[string]interface{}{}, "f:include.release.openshift.io/self-managed-high-availability": map[string]interface{}{}}, "f:ownerReferences": map[string]interface{}{".": map[string]interface{}{}, `k:{"uid":"334d6c04-126d-4271-96ec-d303e93b7d1c"}`: map[string]interface{}{}}}, "f:spec": map[string]interface{}{"f:concurrencyPolicy": map[string]interface{}{}, "f:failedJobsHistoryLimit": map[string]interface{}{}, "f:jobTemplate": map[string]interface{}{"f:spec": map[string]interface{}{"f:template": map[string]interface{}{"f:spec": map[string]interface{}{"f:containers": map[string]interface{}{`k:{"name":"collect-profiles"}`: map[string]interface{}{".": map[string]interface{}{}, "f:args": map[string]interface{}{}, "f:command": map[string]interface{}{}, "f:image": map[string]interface{}{}, ...}}, "f:dnsPolicy": map[string]interface{}{}, "f:priorityClassName": map[string]interface{}{}, "f:restartPolicy": map[string]interface{}{}, ...}}}}, "f:schedule": map[string]interface{}{}, ...}}, "manager": string("cluster-version-operator"), ...}, map[string]interface{}{"apiVersion": string("batch/v1"), "fieldsType": string("FieldsV1"), "fieldsV1": map[string]interface{}{"f:status": map[string]interface{}{"f:lastScheduleTime": map[string]interface{}{}, "f:lastSuccessfulTime": map[string]interface{}{}}}, "manager": string("kube-controller-manager"), ...}}, ...},
    "spec": map[string]interface{}{
+     "concurrencyPolicy": string("Allow"),
+     "failedJobsHistoryLimit": int64(1),
      "jobTemplate": map[string]interface{}{
+       "metadata": map[string]interface{}{"creationTimestamp": nil},
        "spec": map[string]interface{}{
          "template": map[string]interface{}{
+           "metadata": map[string]interface{}{"creationTimestamp": nil},
            "spec": map[string]interface{}{
              "containers": []interface{}{
                map[string]interface{}{
                  ... // 4 identical entries
                  "name": string("collect-profiles"),
                  "resources": map[string]interface{}{"requests": map[string]interface{}{"cpu": string("10m"), "memory": string("80Mi")}},
+                 "terminationMessagePath": string("/dev/termination-log"),
+                 "terminationMessagePolicy": string("File"),
                  "volumeMounts": []interface{}{map[string]interface{}{"mountPath": string("/etc/config"), "name": string("config-volume")}, map[string]interface{}{"mountPath": string("/var/run/secrets/serving-cert"), "name": string("secret-volume")}},
                },
              },
+             "dnsPolicy": string("ClusterFirst"),
              "priorityClassName": string("openshift-user-critical"),
              "restartPolicy": string("Never"),
+             "schedulerName": string("default-scheduler"),
+             "securityContext": map[string]interface{}{},
+             "serviceAccount": string("collect-profiles"),
              "serviceAccountName": string("collect-profiles"),
+             "terminationGracePeriodSeconds": int64(30),
              "volumes": []interface{}{
                map[string]interface{}{
                  "configMap": map[string]interface{}{
+                   "defaultMode": int64(420),
                    "name": string("collect-profiles-config"),
                  },
                  "name": string("config-volume"),
                },
                map[string]interface{}{
                  "name": string("secret-volume"),
                  "secret": map[string]interface{}{
+                   "defaultMode": int64(420),
                    "secretName": string("pprof-cert"),
                  },
                },
              },
            },
          },
        },
      },
      "schedule": string("*/15 * * * *"),
+     "successfulJobsHistoryLimit": int64(3),
+     "suspend": bool(false),
    },
    "status": map[string]interface{}{"lastScheduleTime": string("2022-01-10T06:30:00Z"), "lastSuccessfulTime": string("2022-01-10T06:30:11Z")},
  },
}
I0110 06:32:44.499764       1 sync_worker.go:771] Done syncing for cronjob "openshift-operator-lifecycle-manager/collect-profiles" (574 of 765)
I0110 06:32:44.499814       1 sync_worker.go:759] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (575 of 765)

Extract the manifest:

# cat 0000_50_olm_07-collect-profiles.cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
  name: collect-profiles
  namespace: openshift-operator-lifecycle-manager
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: collect-profiles
          priorityClassName: openshift-user-critical
          containers:
          - name: collect-profiles
            image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2a8d116943a7c1eb32cd161a0de5cb173713724ff419a03abe0382a2d5d9c9a7
            imagePullPolicy: IfNotPresent
            command:
            - bin/collect-profiles
            args:
            - -n
            - openshift-operator-lifecycle-manager
            - --config-mount-path
            - /etc/config
            - --cert-mount-path
            - /var/run/secrets/serving-cert
            - olm-operator-heap-:https://olm-operator-metrics:8443/debug/pprof/heap
            - catalog-operator-heap-:https://catalog-operator-metrics:8443/debug/pprof/heap
            volumeMounts:
            - mountPath: /etc/config
              name: config-volume
            - mountPath: /var/run/secrets/serving-cert
              name: secret-volume
            resources:
              requests:
                cpu: 10m
                memory: 80Mi
          volumes:
          - name: config-volume
            configMap:
              name: collect-profiles-config
          - name: secret-volume
            secret:
              secretName: pprof-cert
          restartPolicy: Never

Looking at the in-cluster object:

# oc get cronjob.batch/collect-profiles -oyaml -n openshift-operator-lifecycle-manager
apiVersion: batch/v1
kind: CronJob
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
  creationTimestamp: "2022-01-10T04:35:19Z"
  generation: 1
  name: collect-profiles
  namespace: openshift-operator-lifecycle-manager
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 334d6c04-126d-4271-96ec-d303e93b7d1c
  resourceVersion: "450801"
  uid: d0b92cd3-3213-466c-921c-d4c4c77f7a6b
spec:
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - -n
            - openshift-operator-lifecycle-manager
            - --config-mount-path
            - /etc/config
            - --cert-mount-path
            - /var/run/secrets/serving-cert
            - olm-operator-heap-:https://olm-operator-metrics:8443/debug/pprof/heap
            - catalog-operator-heap-:https://catalog-operator-metrics:8443/debug/pprof/heap
            command:
            - bin/collect-profiles
            image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2a8d116943a7c1eb32cd161a0de5cb173713724ff419a03abe0382a2d5d9c9a7
            imagePullPolicy: IfNotPresent
            name: collect-profiles
            resources:
              requests:
                cpu: 10m
                memory: 80Mi
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /etc/config
              name: config-volume
            - mountPath: /var/run/secrets/serving-cert
              name: secret-volume
          dnsPolicy: ClusterFirst
          priorityClassName: openshift-user-critical
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          serviceAccount: collect-profiles
          serviceAccountName: collect-profiles
          terminationGracePeriodSeconds: 30
          volumes:
          - configMap:
              defaultMode: 420
              name: collect-profiles-config
            name: config-volume
          - name: secret-volume
            secret:
              defaultMode: 420
              secretName: pprof-cert
  schedule: '*/15 * * * *'
  successfulJobsHistoryLimit: 3
  suspend: false
status:
  lastScheduleTime: "2022-01-11T03:00:00Z"
  lastSuccessfulTime: "2022-01-11T03:00:07Z"

Version-Release number of the following components:
4.10.0-0.nightly-2022-01-09-195852

How reproducible:
1/1

Steps to Reproduce:
1. Install a 4.10 cluster
2. Grep 'Updating .*due to diff' in the CVO log to check for hot-looping

Actual results:
CVO hotloops on CronJob openshift-operator-lifecycle-manager/collect-profiles

Expected results:
CVO should not hotloop on it in a freshly installed cluster
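Every field flagged with "+" in the diff above is one the API server fills in on admission (concurrencyPolicy, the history limits, suspend, the pod-spec and volume defaults), so a straight comparison of the manifest spec against the live spec can never come back clean. A minimal Go sketch of that comparison, assuming the manifest and the in-cluster object above have been saved locally as manifest.yaml and live.yaml (file names are mine, and this is not the CVO's actual reconciliation code):

// compare-spec.go: a rough check that the manifest spec and the live,
// server-defaulted spec never compare equal, which is what keeps the
// CVO's diff non-empty. manifest.yaml and live.yaml are assumed to hold
// the two objects shown above.
package main

import (
	"fmt"
	"os"
	"reflect"

	"sigs.k8s.io/yaml"
)

func specOf(path string) map[string]interface{} {
	data, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	var obj map[string]interface{}
	if err := yaml.Unmarshal(data, &obj); err != nil {
		panic(err)
	}
	spec, _ := obj["spec"].(map[string]interface{})
	return spec
}

func main() {
	manifest := specOf("manifest.yaml")
	live := specOf("live.yaml")

	// Always false for collect-profiles: the manifest omits the defaulted fields.
	fmt.Println("specs equal:", reflect.DeepEqual(manifest, live))

	// Top-level spec keys that only exist on the live object, i.e. fields the
	// API server filled in.
	for key := range live {
		if _, ok := manifest[key]; !ok {
			fmt.Println("defaulted by the API server:", key)
		}
	}
}

The keys it prints (concurrencyPolicy, failedJobsHistoryLimit, successfulJobsHistoryLimit, suspend) are exactly the top-level "+" lines in the diff; the nested pod-spec and volume defaults would need a recursive walk.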
Looking at 4.10.0-fc.1 CI [1], the CVO is fighting with the cronjob-controller:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1481748933687382016/artifacts/launch/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
$ zgrep -h '"resource":"cronjobs"' */*.log.gz | jq -r 'select((.verb | (. != "list" and . != "watch" and . != "get")) and .objectRef.name == "collect-profiles") | .stageTimestamp + " " + .verb + " " + .user.username'
...
2022-01-13T23:55:54.001749Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-13T23:56:51.102056Z update system:serviceaccount:openshift-cluster-version:default
2022-01-14T00:00:00.246540Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.276516Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.281524Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.286255Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.291128Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.297065Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.304868Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.333474Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.341517Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.056386Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.059691Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.074194Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.084415Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.089586Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.094423Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.097614Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.100821Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:02:07.062753Z update system:serviceaccount:openshift-cluster-version:default
2022-01-14T00:05:07.153452Z update system:serviceaccount:openshift-cluster-version:default
2022-01-14T00:05:53.875533Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:05:53.885704Z update system:serviceaccount:kube-system:cronjob-controller

I suspect this is due to the lack of CVO-side defaulting, because we have no CronJob-specific handling today. The manifest is from [2], where it dates back to 4.9 [3].
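For illustration only (this is not the CVO's actual code, and any eventual fix may look quite different): a sketch of the kind of CronJob-specific defaulting the comment above is pointing at, filling in before comparison the fields the API server defaults and that show up as "+" lines in the diff from comment 0. Field and constant names come from the upstream k8s.io/api batch/v1 and core/v1 types.

// Hypothetical sketch of CVO-side CronJob defaulting; not the operator's
// real resource handling. It sets the same values the API server applies
// on admission, so a manifest-derived object would compare clean against
// the live collect-profiles CronJob.
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

func int32Ptr(i int32) *int32 { return &i }
func int64Ptr(i int64) *int64 { return &i }
func boolPtr(b bool) *bool    { return &b }

// defaultCronJob fills in fields the API server would default when unset.
func defaultCronJob(cj *batchv1.CronJob) {
	if cj.Spec.ConcurrencyPolicy == "" {
		cj.Spec.ConcurrencyPolicy = batchv1.AllowConcurrent // "Allow"
	}
	if cj.Spec.FailedJobsHistoryLimit == nil {
		cj.Spec.FailedJobsHistoryLimit = int32Ptr(1)
	}
	if cj.Spec.SuccessfulJobsHistoryLimit == nil {
		cj.Spec.SuccessfulJobsHistoryLimit = int32Ptr(3)
	}
	if cj.Spec.Suspend == nil {
		cj.Spec.Suspend = boolPtr(false)
	}

	pod := &cj.Spec.JobTemplate.Spec.Template.Spec
	if pod.DNSPolicy == "" {
		pod.DNSPolicy = corev1.DNSClusterFirst
	}
	if pod.SchedulerName == "" {
		pod.SchedulerName = corev1.DefaultSchedulerName // "default-scheduler"
	}
	if pod.TerminationGracePeriodSeconds == nil {
		pod.TerminationGracePeriodSeconds = int64Ptr(30)
	}
	if pod.SecurityContext == nil {
		pod.SecurityContext = &corev1.PodSecurityContext{}
	}
	if pod.DeprecatedServiceAccount == "" {
		pod.DeprecatedServiceAccount = pod.ServiceAccountName // legacy serviceAccount field
	}
	for i := range pod.Containers {
		c := &pod.Containers[i]
		if c.TerminationMessagePath == "" {
			c.TerminationMessagePath = corev1.TerminationMessagePathDefault // /dev/termination-log
		}
		if c.TerminationMessagePolicy == "" {
			c.TerminationMessagePolicy = corev1.TerminationMessageReadFile // File
		}
	}
	for i := range pod.Volumes {
		v := &pod.Volumes[i]
		if v.ConfigMap != nil && v.ConfigMap.DefaultMode == nil {
			v.ConfigMap.DefaultMode = int32Ptr(420) // 0644
		}
		if v.Secret != nil && v.Secret.DefaultMode == nil {
			v.Secret.DefaultMode = int32Ptr(420) // 0644
		}
	}
}

func main() {
	// Minimal stand-in for the rendered manifest.
	cj := batchv1.CronJob{}
	cj.Spec.Schedule = "*/15 * * * *"
	cj.Spec.JobTemplate.Spec.Template.Spec.ServiceAccountName = "collect-profiles"
	cj.Spec.JobTemplate.Spec.Template.Spec.RestartPolicy = corev1.RestartPolicyNever

	defaultCronJob(&cj)
	fmt.Printf("concurrencyPolicy=%s suspend=%v schedulerName=%s\n",
		cj.Spec.ConcurrencyPolicy, *cj.Spec.Suspend,
		cj.Spec.JobTemplate.Spec.Template.Spec.SchedulerName)
}

Whether the real fix lives in the CVO's comparison logic, in a defaulting step like this, or simply in spelling the defaults out in the manifest is up to the owners; the sketch only shows why the diff goes away once the defaulted values are present.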
Back to the CI run [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1481748933687382016/artifacts/launch/gather-must-gather/artifacts/must-gather.tar | tar tz | grep namespaces/openshift-operator-lifecycle-manager/batch/cronjob
quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8a7b47d7347dc9b92454c75485b31bd17d8d20a25021c88cbd1b9985056af355/namespaces/openshift-operator-lifecycle-manager/batch/cronjobs.yaml
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1481748933687382016/artifacts/launch/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8a7b47d7347dc9b92454c75485b31bd17d8d20a25021c88cbd1b9985056af355/namespaces/openshift-operator-lifecycle-manager/batch/cronjobs.yaml
---
apiVersion: batch/v1
items:
- apiVersion: batch/v1
  kind: CronJob
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
    creationTimestamp: "2022-01-13T22:20:28Z"
    generation: 2
    name: collect-profiles
    namespace: openshift-operator-lifecycle-manager
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 43db5612-bebc-4627-b015-e817bac9f065
    resourceVersion: "74890"
    uid: 62b3c639-536c-401f-9548-06c1731d4ef5
  spec:
    ...

It's not clear to me why managedFields shows up in the CVO log but not in the resource collected in the must-gather, or in the one extracted straight from the cluster in comment 0. If we had that, it might confirm the "we're missing CVO-side defaulting" hypothesis.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1481748933687382016
[2]: https://github.com/openshift/operator-framework-olm/blob/78d2b0af32651aa4aed442858d4584204808d31e/manifests/0000_50_olm_07-collect-profiles.cronjob.yaml
[3]: https://github.com/openshift/operator-framework-olm/pull/112
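One possible explanation for the managedFields question (not from the bug): recent oc/kubectl clients hide managedFields in `get -o yaml` output unless --show-managed-fields is passed, while the CVO reads the object through client-go, which returns them. A hedged client-go sketch for listing the field managers on the CronJob directly; the kubeconfig path and the standalone-program framing are assumptions, and this is not code from the CVO or from this bug:

// Hypothetical helper: print the field managers recorded on the
// collect-profiles CronJob, to see whether the CVO and the
// kube-controller-manager are both writing to it.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes an admin kubeconfig at the usual location.
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	cj, err := client.BatchV1().CronJobs("openshift-operator-lifecycle-manager").
		Get(context.Background(), "collect-profiles", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Each entry names a manager (e.g. cluster-version-operator,
	// kube-controller-manager), the operation, and the last write time.
	for _, mf := range cj.ManagedFields {
		fmt.Printf("%-30s %-8s %v\n", mf.Manager, mf.Operation, mf.Time)
	}
}

Run against a live cluster, it should list both cluster-version-operator and kube-controller-manager entries, matching the audit-log back-and-forth shown above.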
The issue starts from 4.9:

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.12    True        False         137m    Cluster version is 4.9.12

# oc -n openshift-cluster-version get po
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-5b8c47f655-rlscv   1/1     Running   0          3h16m

# oc -n openshift-cluster-version logs cluster-version-operator-5b8c47f655-rlscv | grep "Updating .*due to diff"
I0119 01:36:56.822552       1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff: &unstructured.Unstructured{
I0119 01:37:08.795231       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff: &unstructured.Unstructured{
I0119 01:40:15.737678       1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff: &unstructured.Unstructured{
I0119 01:40:27.617154       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff: &unstructured.Unstructured{
I0119 01:43:34.559656       1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff: &unstructured.Unstructured{
..

# oc -n openshift-cluster-version logs cluster-version-operator-5b8c47f655-rlscv | grep "Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff" | wc -l
41
Issue is seen on ocp 4.10 rc-1 as well.

[root@rdr-sh-410-rc1-mon01-bastion-0 ~]# oc version
Client Version: 4.10.0-rc.1
Server Version: 4.10.0-rc.1
Kubernetes Version: v1.23.3+b63be7f

[root@rdr-sh-410-rc1-mon01-bastion-0 ~]# oc get pod -n openshift-cluster-version
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-6567cdf878-hm68d   1/1     Running   0          8d

[root@rdr-sh-410-rc1-mon01-bastion-0 ~]# oc logs cluster-version-operator-6567cdf878-hm68d -n openshift-cluster-version > cvo_info.log

[root@rdr-sh-410-rc1-mon01-bastion-0 ~]# grep -o 'Updating .*due to diff' cvo_info.log | sort | uniq -c
      6 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
...

[root@rdr-sh-410-rc1-mon01-bastion-0 ~]# oc -n openshift-cluster-version logs cluster-version-operator-6567cdf878-hm68d | grep "Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff" | wc -l
7
(In reply to shweta from comment #3)
> Issue is seen on ocp 4.10 rc-1 (IBM Power platform) as well.
Seems like it doesn't occur in 4.12:

# oc version
Client Version: 4.12.0-0.nightly-2022-09-28-204419
Kustomize Version: v4.5.4
Server Version: 4.12.0-0.nightly-2022-10-05-053337
Kubernetes Version: v1.25.0+3ef6ef3

# grep -o 'Updating .*due to diff' cvo.log | sort | uniq -c | grep CronJob
      1 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
(In reply to Yang Yang from comment #5)
> Seems like it doesn't occur in 4.12:
> # oc version
> Client Version: 4.12.0-0.nightly-2022-09-28-204419
> Kustomize Version: v4.5.4
> Server Version: 4.12.0-0.nightly-2022-10-05-053337
> Kubernetes Version: v1.25.0+3ef6ef3
>
> # grep -o 'Updating .*due to diff' cvo.log | sort | uniq -c | grep CronJob
>       1 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff

Updating and correcting the above after a double-check:

# oc logs cluster-version-operator-7865bcc54c-d7gnx -n openshift-cluster-version | grep -o 'Updating .*due to diff' | sort | uniq -c | grep CronJob
      4 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff

It's still reproducible in 4.12.
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-9070