Created attachment 1850058 [details]
CVO log file

Description of problem:

In a freshly installed cluster, the CVO is hot-looping on CronJob openshift-operator-lifecycle-manager/collect-profiles and on Service openshift-monitoring/cluster-monitoring-operator:

# grep -o 'Updating .*due to diff' cvo2.log | sort | uniq -c
     18 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
     12 Updating Service openshift-monitoring/cluster-monitoring-operator due to diff

Looking at the CronJob hot-looping:

# grep -A60 'Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff' cvo2.log | tail -n61
I0110 06:32:44.489277       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff: &unstructured.Unstructured{
  Object: map[string]interface{}{
    "apiVersion": string("batch/v1"),
    "kind": string("CronJob"),
    "metadata": map[string]interface{}{"annotations": map[string]interface{}{"include.release.openshift.io/ibm-cloud-managed": string("true"), "include.release.openshift.io/self-managed-high-availability": string("true")}, "creationTimestamp": string("2022-01-10T04:35:19Z"), "generation": int64(1), "managedFields": []interface{}{map[string]interface{}{"apiVersion": string("batch/v1"), "fieldsType": string("FieldsV1"), "fieldsV1": map[string]interface{}{"f:metadata": map[string]interface{}{"f:annotations": map[string]interface{}{".": map[string]interface{}{}, "f:include.release.openshift.io/ibm-cloud-managed": map[string]interface{}{}, "f:include.release.openshift.io/self-managed-high-availability": map[string]interface{}{}}, "f:ownerReferences": map[string]interface{}{".": map[string]interface{}{}, `k:{"uid":"334d6c04-126d-4271-96ec-d303e93b7d1c"}`: map[string]interface{}{}}}, "f:spec": map[string]interface{}{"f:concurrencyPolicy": map[string]interface{}{}, "f:failedJobsHistoryLimit": map[string]interface{}{}, "f:jobTemplate": map[string]interface{}{"f:spec": map[string]interface{}{"f:template": map[string]interface{}{"f:spec": map[string]interface{}{"f:containers": map[string]interface{}{`k:{"name":"collect-profiles"}`: map[string]interface{}{".": map[string]interface{}{}, "f:args": map[string]interface{}{}, "f:command": map[string]interface{}{}, "f:image": map[string]interface{}{}, ...}}, "f:dnsPolicy": map[string]interface{}{}, "f:priorityClassName": map[string]interface{}{}, "f:restartPolicy": map[string]interface{}{}, ...}}}}, "f:schedule": map[string]interface{}{}, ...}}, "manager": string("cluster-version-operator"), ...}, map[string]interface{}{"apiVersion": string("batch/v1"), "fieldsType": string("FieldsV1"), "fieldsV1": map[string]interface{}{"f:status": map[string]interface{}{"f:lastScheduleTime": map[string]interface{}{}, "f:lastSuccessfulTime": map[string]interface{}{}}}, "manager": string("kube-controller-manager"), ...}}, ...},
    "spec": map[string]interface{}{
+     "concurrencyPolicy": string("Allow"),
+     "failedJobsHistoryLimit": int64(1),
      "jobTemplate": map[string]interface{}{
+       "metadata": map[string]interface{}{"creationTimestamp": nil},
        "spec": map[string]interface{}{
          "template": map[string]interface{}{
+           "metadata": map[string]interface{}{"creationTimestamp": nil},
            "spec": map[string]interface{}{
              "containers": []interface{}{
                map[string]interface{}{
                  ... // 4 identical entries
                  "name": string("collect-profiles"),
                  "resources": map[string]interface{}{"requests": map[string]interface{}{"cpu": string("10m"), "memory": string("80Mi")}},
+                 "terminationMessagePath": string("/dev/termination-log"),
+                 "terminationMessagePolicy": string("File"),
                  "volumeMounts": []interface{}{map[string]interface{}{"mountPath": string("/etc/config"), "name": string("config-volume")}, map[string]interface{}{"mountPath": string("/var/run/secrets/serving-cert"), "name": string("secret-volume")}},
                },
              },
+             "dnsPolicy": string("ClusterFirst"),
              "priorityClassName": string("openshift-user-critical"),
              "restartPolicy": string("Never"),
+             "schedulerName": string("default-scheduler"),
+             "securityContext": map[string]interface{}{},
+             "serviceAccount": string("collect-profiles"),
              "serviceAccountName": string("collect-profiles"),
+             "terminationGracePeriodSeconds": int64(30),
              "volumes": []interface{}{
                map[string]interface{}{
                  "configMap": map[string]interface{}{
+                   "defaultMode": int64(420),
                    "name": string("collect-profiles-config"),
                  },
                  "name": string("config-volume"),
                },
                map[string]interface{}{
                  "name": string("secret-volume"),
                  "secret": map[string]interface{}{
+                   "defaultMode": int64(420),
                    "secretName": string("pprof-cert"),
                  },
                },
              },
            },
          },
        },
      },
      "schedule": string("*/15 * * * *"),
+     "successfulJobsHistoryLimit": int64(3),
+     "suspend": bool(false),
    },
    "status": map[string]interface{}{"lastScheduleTime": string("2022-01-10T06:30:00Z"), "lastSuccessfulTime": string("2022-01-10T06:30:11Z")},
  },
}
I0110 06:32:44.499764       1 sync_worker.go:771] Done syncing for cronjob "openshift-operator-lifecycle-manager/collect-profiles" (574 of 765)
I0110 06:32:44.499814       1 sync_worker.go:759] Running sync for deployment "openshift-operator-lifecycle-manager/olm-operator" (575 of 765)

Extract the manifest:

# cat 0000_50_olm_07-collect-profiles.cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
  name: collect-profiles
  namespace: openshift-operator-lifecycle-manager
spec:
  schedule: "*/15 * * * *"
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: collect-profiles
          priorityClassName: openshift-user-critical
          containers:
          - name: collect-profiles
            image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2a8d116943a7c1eb32cd161a0de5cb173713724ff419a03abe0382a2d5d9c9a7
            imagePullPolicy: IfNotPresent
            command:
            - bin/collect-profiles
            args:
            - -n
            - openshift-operator-lifecycle-manager
            - --config-mount-path
            - /etc/config
            - --cert-mount-path
            - /var/run/secrets/serving-cert
            - olm-operator-heap-:https://olm-operator-metrics:8443/debug/pprof/heap
            - catalog-operator-heap-:https://catalog-operator-metrics:8443/debug/pprof/heap
            volumeMounts:
            - mountPath: /etc/config
              name: config-volume
            - mountPath: /var/run/secrets/serving-cert
              name: secret-volume
            resources:
              requests:
                cpu: 10m
                memory: 80Mi
          volumes:
          - name: config-volume
            configMap:
              name: collect-profiles-config
          - name: secret-volume
            secret:
              secretName: pprof-cert
          restartPolicy: Never

Looking at the in-cluster object:

# oc get cronjob.batch/collect-profiles -oyaml -n openshift-operator-lifecycle-manager
apiVersion: batch/v1
kind: CronJob
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
  creationTimestamp: "2022-01-10T04:35:19Z"
  generation: 1
  name: collect-profiles
  namespace: openshift-operator-lifecycle-manager
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 334d6c04-126d-4271-96ec-d303e93b7d1c
  resourceVersion: "450801"
  uid: d0b92cd3-3213-466c-921c-d4c4c77f7a6b
spec:
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - -n
            - openshift-operator-lifecycle-manager
            - --config-mount-path
            - /etc/config
            - --cert-mount-path
            - /var/run/secrets/serving-cert
            - olm-operator-heap-:https://olm-operator-metrics:8443/debug/pprof/heap
            - catalog-operator-heap-:https://catalog-operator-metrics:8443/debug/pprof/heap
            command:
            - bin/collect-profiles
            image: quay.io/openshift-release-dev/ocp-v4.0-art-dev@sha256:2a8d116943a7c1eb32cd161a0de5cb173713724ff419a03abe0382a2d5d9c9a7
            imagePullPolicy: IfNotPresent
            name: collect-profiles
            resources:
              requests:
                cpu: 10m
                memory: 80Mi
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /etc/config
              name: config-volume
            - mountPath: /var/run/secrets/serving-cert
              name: secret-volume
          dnsPolicy: ClusterFirst
          priorityClassName: openshift-user-critical
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          serviceAccount: collect-profiles
          serviceAccountName: collect-profiles
          terminationGracePeriodSeconds: 30
          volumes:
          - configMap:
              defaultMode: 420
              name: collect-profiles-config
            name: config-volume
          - name: secret-volume
            secret:
              defaultMode: 420
              secretName: pprof-cert
  schedule: '*/15 * * * *'
  successfulJobsHistoryLimit: 3
  suspend: false
status:
  lastScheduleTime: "2022-01-11T03:00:00Z"
  lastSuccessfulTime: "2022-01-11T03:00:07Z"

Version-Release number of the following components:
4.10.0-0.nightly-2022-01-09-195852

How reproducible:
1/1

Steps to Reproduce:
1. Install a 4.10 cluster
2. Grep 'Updating .*due to diff' in the CVO log to check for hot-looping

Actual results:
CVO hotloops on CronJob openshift-operator-lifecycle-manager/collect-profiles

Expected results:
CVO should not hotloop on it in a freshly installed cluster
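Every field flagged with "+" in the diff above is one the API server fills in on admission (concurrencyPolicy, the history limits, suspend, the pod-spec and volume defaults), so a straight comparison of the manifest spec against the live spec can never come back clean. A minimal Go sketch of that comparison, assuming the manifest and the in-cluster object above have been saved locally as manifest.yaml and live.yaml (file names are mine, and this is not the CVO's actual reconciliation code):

// compare-spec.go: a rough check that the manifest spec and the live,
// server-defaulted spec never compare equal, which is what keeps the
// CVO's diff non-empty. manifest.yaml and live.yaml are assumed to hold
// the two objects shown above.
package main

import (
	"fmt"
	"os"
	"reflect"

	"sigs.k8s.io/yaml"
)

func specOf(path string) map[string]interface{} {
	data, err := os.ReadFile(path)
	if err != nil {
		panic(err)
	}
	var obj map[string]interface{}
	if err := yaml.Unmarshal(data, &obj); err != nil {
		panic(err)
	}
	spec, _ := obj["spec"].(map[string]interface{})
	return spec
}

func main() {
	manifest := specOf("manifest.yaml")
	live := specOf("live.yaml")

	// Always false for collect-profiles: the manifest omits the defaulted fields.
	fmt.Println("specs equal:", reflect.DeepEqual(manifest, live))

	// Top-level spec keys that only exist on the live object, i.e. fields the
	// API server filled in.
	for key := range live {
		if _, ok := manifest[key]; !ok {
			fmt.Println("defaulted by the API server:", key)
		}
	}
}

The keys it prints (concurrencyPolicy, failedJobsHistoryLimit, successfulJobsHistoryLimit, suspend) are exactly the top-level "+" lines in the diff; the nested pod-spec and volume defaults would need a recursive walk.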
Looking at 4.10.0-fc.1 CI [1], the CVO is fighting with the cronjob-controller:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1481748933687382016/artifacts/launch/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
$ zgrep -h '"resource":"cronjobs"' */*.log.gz | jq -r 'select((.verb | (. != "list" and . != "watch" and . != "get")) and .objectRef.name == "collect-profiles") | .stageTimestamp + " " + .verb + " " + .user.username'
...
2022-01-13T23:55:54.001749Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-13T23:56:51.102056Z update system:serviceaccount:openshift-cluster-version:default
2022-01-14T00:00:00.246540Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.276516Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.281524Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.286255Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.291128Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.297065Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.304868Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.333474Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:00.341517Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.056386Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.059691Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.074194Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.084415Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.089586Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.094423Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.097614Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:00:09.100821Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:02:07.062753Z update system:serviceaccount:openshift-cluster-version:default
2022-01-14T00:05:07.153452Z update system:serviceaccount:openshift-cluster-version:default
2022-01-14T00:05:53.875533Z update system:serviceaccount:kube-system:cronjob-controller
2022-01-14T00:05:53.885704Z update system:serviceaccount:kube-system:cronjob-controller

I suspect this is due to the lack of CVO-side defaulting, because we have no CronJob-specific handling today. The manifest is from [2], where it dates back to 4.9 [3].
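For illustration only (this is not the CVO's actual code, and any eventual fix may look quite different): a sketch of the kind of CronJob-specific defaulting the comment above is pointing at, filling in before comparison the fields the API server defaults and that show up as "+" lines in the diff from comment 0. Field and constant names come from the upstream k8s.io/api batch/v1 and core/v1 types.

// Hypothetical sketch of CVO-side CronJob defaulting; not the operator's
// real resource handling. It sets the same values the API server applies
// on admission, so a manifest-derived object would compare clean against
// the live collect-profiles CronJob.
package main

import (
	"fmt"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
)

func int32Ptr(i int32) *int32 { return &i }
func int64Ptr(i int64) *int64 { return &i }
func boolPtr(b bool) *bool    { return &b }

// defaultCronJob fills in fields the API server would default when unset.
func defaultCronJob(cj *batchv1.CronJob) {
	if cj.Spec.ConcurrencyPolicy == "" {
		cj.Spec.ConcurrencyPolicy = batchv1.AllowConcurrent // "Allow"
	}
	if cj.Spec.FailedJobsHistoryLimit == nil {
		cj.Spec.FailedJobsHistoryLimit = int32Ptr(1)
	}
	if cj.Spec.SuccessfulJobsHistoryLimit == nil {
		cj.Spec.SuccessfulJobsHistoryLimit = int32Ptr(3)
	}
	if cj.Spec.Suspend == nil {
		cj.Spec.Suspend = boolPtr(false)
	}

	pod := &cj.Spec.JobTemplate.Spec.Template.Spec
	if pod.DNSPolicy == "" {
		pod.DNSPolicy = corev1.DNSClusterFirst
	}
	if pod.SchedulerName == "" {
		pod.SchedulerName = corev1.DefaultSchedulerName // "default-scheduler"
	}
	if pod.TerminationGracePeriodSeconds == nil {
		pod.TerminationGracePeriodSeconds = int64Ptr(30)
	}
	if pod.SecurityContext == nil {
		pod.SecurityContext = &corev1.PodSecurityContext{}
	}
	if pod.DeprecatedServiceAccount == "" {
		pod.DeprecatedServiceAccount = pod.ServiceAccountName // legacy serviceAccount field
	}
	for i := range pod.Containers {
		c := &pod.Containers[i]
		if c.TerminationMessagePath == "" {
			c.TerminationMessagePath = corev1.TerminationMessagePathDefault // /dev/termination-log
		}
		if c.TerminationMessagePolicy == "" {
			c.TerminationMessagePolicy = corev1.TerminationMessageReadFile // File
		}
	}
	for i := range pod.Volumes {
		v := &pod.Volumes[i]
		if v.ConfigMap != nil && v.ConfigMap.DefaultMode == nil {
			v.ConfigMap.DefaultMode = int32Ptr(420) // 0644
		}
		if v.Secret != nil && v.Secret.DefaultMode == nil {
			v.Secret.DefaultMode = int32Ptr(420) // 0644
		}
	}
}

func main() {
	// Minimal stand-in for the rendered manifest.
	cj := batchv1.CronJob{}
	cj.Spec.Schedule = "*/15 * * * *"
	cj.Spec.JobTemplate.Spec.Template.Spec.ServiceAccountName = "collect-profiles"
	cj.Spec.JobTemplate.Spec.Template.Spec.RestartPolicy = corev1.RestartPolicyNever

	defaultCronJob(&cj)
	fmt.Printf("concurrencyPolicy=%s suspend=%v schedulerName=%s\n",
		cj.Spec.ConcurrencyPolicy, *cj.Spec.Suspend,
		cj.Spec.JobTemplate.Spec.Template.Spec.SchedulerName)
}

Whether the real fix lives in the CVO's comparison logic, in a defaulting step like this, or simply in spelling the defaults out in the manifest is up to the owners; the sketch only shows why the diff goes away once the defaulted values are present.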
Back to the CI run [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1481748933687382016/artifacts/launch/gather-must-gather/artifacts/must-gather.tar | tar tz | grep namespaces/openshift-operator-lifecycle-manager/batch/cronjob
quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8a7b47d7347dc9b92454c75485b31bd17d8d20a25021c88cbd1b9985056af355/namespaces/openshift-operator-lifecycle-manager/batch/cronjobs.yaml
$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1481748933687382016/artifacts/launch/gather-must-gather/artifacts/must-gather.tar | tar xOz quay-io-openshift-release-dev-ocp-v4-0-art-dev-sha256-8a7b47d7347dc9b92454c75485b31bd17d8d20a25021c88cbd1b9985056af355/namespaces/openshift-operator-lifecycle-manager/batch/cronjobs.yaml
---
apiVersion: batch/v1
items:
- apiVersion: batch/v1
  kind: CronJob
  metadata:
    annotations:
      include.release.openshift.io/ibm-cloud-managed: "true"
      include.release.openshift.io/self-managed-high-availability: "true"
    creationTimestamp: "2022-01-13T22:20:28Z"
    generation: 2
    name: collect-profiles
    namespace: openshift-operator-lifecycle-manager
    ownerReferences:
    - apiVersion: config.openshift.io/v1
      kind: ClusterVersion
      name: version
      uid: 43db5612-bebc-4627-b015-e817bac9f065
    resourceVersion: "74890"
    uid: 62b3c639-536c-401f-9548-06c1731d4ef5
  spec:
    ...

It's not clear to me why managedFields shows up in the CVO log but not in the resource collected in the must-gather, or in the one extracted straight from the cluster in comment 0. If we had that, it might confirm the "we're missing CVO-side defaulting" hypothesis.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1481748933687382016
[2]: https://github.com/openshift/operator-framework-olm/blob/78d2b0af32651aa4aed442858d4584204808d31e/manifests/0000_50_olm_07-collect-profiles.cronjob.yaml
[3]: https://github.com/openshift/operator-framework-olm/pull/112
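One possible explanation for the managedFields question (not from the bug): recent oc/kubectl clients hide managedFields in `get -o yaml` output unless --show-managed-fields is passed, while the CVO reads the object through client-go, which returns them. A hedged client-go sketch for listing the field managers on the CronJob directly; the kubeconfig path and the standalone-program framing are assumptions, and this is not code from the CVO or from this bug:

// Hypothetical helper: print the field managers recorded on the
// collect-profiles CronJob, to see whether the CVO and the
// kube-controller-manager are both writing to it.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumes an admin kubeconfig at the usual location.
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	cfg, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	cj, err := client.BatchV1().CronJobs("openshift-operator-lifecycle-manager").
		Get(context.Background(), "collect-profiles", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}

	// Each entry names a manager (e.g. cluster-version-operator,
	// kube-controller-manager), the operation, and the last write time.
	for _, mf := range cj.ManagedFields {
		fmt.Printf("%-30s %-8s %v\n", mf.Manager, mf.Operation, mf.Time)
	}
}

Run against a live cluster, it should list both cluster-version-operator and kube-controller-manager entries, matching the audit-log back-and-forth shown above.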
The issue starts from 4.9:

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.12    True        False         137m    Cluster version is 4.9.12

# oc -n openshift-cluster-version get po
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-5b8c47f655-rlscv   1/1     Running   0          3h16m

# oc -n openshift-cluster-version logs cluster-version-operator-5b8c47f655-rlscv | grep "Updating .*due to diff"
I0119 01:36:56.822552       1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff: &unstructured.Unstructured{
I0119 01:37:08.795231       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff: &unstructured.Unstructured{
I0119 01:40:15.737678       1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff: &unstructured.Unstructured{
I0119 01:40:27.617154       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff: &unstructured.Unstructured{
I0119 01:43:34.559656       1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff: &unstructured.Unstructured{
..

# oc -n openshift-cluster-version logs cluster-version-operator-5b8c47f655-rlscv | grep "Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff" | wc -l
41
Issue is seen on ocp 4.10 rc-1 as well.

[root@rdr-sh-410-rc1-mon01-bastion-0 ~]# oc version
Client Version: 4.10.0-rc.1
Server Version: 4.10.0-rc.1
Kubernetes Version: v1.23.3+b63be7f

[root@rdr-sh-410-rc1-mon01-bastion-0 ~]# oc get pod -n openshift-cluster-version
NAME                                        READY   STATUS    RESTARTS   AGE
cluster-version-operator-6567cdf878-hm68d   1/1     Running   0          8d

[root@rdr-sh-410-rc1-mon01-bastion-0 ~]# oc logs cluster-version-operator-6567cdf878-hm68d -n openshift-cluster-version > cvo_info.log

[root@rdr-sh-410-rc1-mon01-bastion-0 ~]# grep -o 'Updating .*due to diff' cvo_info.log | sort | uniq -c
      6 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
...

[root@rdr-sh-410-rc1-mon01-bastion-0 ~]# oc -n openshift-cluster-version logs cluster-version-operator-6567cdf878-hm68d | grep "Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff" | wc -l
7
(In reply to shweta from comment #3)
> Issue is seen on ocp 4.10 rc-1 (IBM Power platform) as well.
Seems like it doesn't occur in 4.12:

# oc version
Client Version: 4.12.0-0.nightly-2022-09-28-204419
Kustomize Version: v4.5.4
Server Version: 4.12.0-0.nightly-2022-10-05-053337
Kubernetes Version: v1.25.0+3ef6ef3

# grep -o 'Updating .*due to diff' cvo.log | sort | uniq -c | grep CronJob
      1 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
(In reply to Yang Yang from comment #5)
> Seems like it doesn't occur in 4.12:
> # oc version
> Client Version: 4.12.0-0.nightly-2022-09-28-204419
> Kustomize Version: v4.5.4
> Server Version: 4.12.0-0.nightly-2022-10-05-053337
> Kubernetes Version: v1.25.0+3ef6ef3
>
> # grep -o 'Updating .*due to diff' cvo.log | sort | uniq -c | grep CronJob
>       1 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff

Updating and correcting the above after a double-check:

# oc logs cluster-version-operator-7865bcc54c-d7gnx -n openshift-cluster-version | grep -o 'Updating .*due to diff' | sort | uniq -c | grep CronJob
      4 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff

It's still reproducible in 4.12.
OpenShift has moved to Jira for its defect tracking! This bug can now be found in the OCPBUGS project in Jira. https://issues.redhat.com/browse/OCPBUGS-9070