Created attachment 1851782
CVO logs

Description of problem:

# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.15    True        False         79m     Cluster version is 4.9.15

# oc -n openshift-cluster-version get po
NAME                                       READY   STATUS    RESTARTS   AGE
cluster-version-operator-556f59b64-m8ctt   1/1     Running   0          108m

Checked for PrometheusRule hot-looping and found the message "Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff":

# oc -n openshift-cluster-version logs cluster-version-operator-556f59b64-m8ctt
...
I0119 02:43:03.043729 1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff:
  &unstructured.Unstructured{
    Object: map[string]interface{}{
      "apiVersion": string("monitoring.coreos.com/v1"),
      "kind": string("PrometheusRule"),
      "metadata": map[string]interface{}{"annotations": map[string]interface{}{"exclude.release.openshift.io/internal-openshift-hosted": string("true"), "include.release.openshift.io/self-managed-high-availability": string("true"), "include.release.openshift.io/single-node-developer": string("true")}, "creationTimestamp": string("2022-01-19T02:12:03Z"), "generation": int64(1), "labels": map[string]interface{}{"prometheus": string("k8s"), "role": string("alert-rules")}, ...},
      "spec": map[string]interface{}{
        "groups": []interface{}{
          ... // 2 identical elements
          map[string]interface{}{"name": string("machine-not-yet-deleted"), "rules": []interface{}{map[string]interface{}{"alert": string("MachineNotYetDeleted"), "annotations": map[string]interface{}{"message": string("machine {{ $labels.name }} has been in Deleting phase for more t"...)}, "expr": string("(mapi_machine_created_timestamp_seconds{phase=\"Deleting\"}) > 0\n"), "for": string("360m"), ...}}},
          map[string]interface{}{"name": string("machine-api-operator-metrics-collector-up"), "rules": []interface{}{map[string]interface{}{"alert": string("MachineAPIOperatorMetricsCollectionFailing"), "annotations": map[string]interface{}{"message": string("machine api operator metrics collection is failing. For more det"...)}, "expr": string("mapi_mao_collector_up == 0\n"), "for": string("5m"), ...}}},
          map[string]interface{}{
            "name": string("machine-health-check-unterminated-short-circuit"),
            "rules": []interface{}{
              map[string]interface{}{
                "alert": string("MachineHealthCheckUnterminatedShortCircuit"),
-               "annotation": map[string]interface{}{
-                 "message": string("machine health check {{ $labels.name }} has been disabled by short circuit for more than 30 minutes"),
-               },
                "expr": string("mapi_machinehealthcheck_short_circuit == 1\n"),
                "for": string("30m"),
                "labels": map[string]interface{}{"severity": string("warning")},
              },
            },
          },
        },
      },
    },
  }
...

# oc -n openshift-cluster-version logs cluster-version-operator-556f59b64-m8ctt | grep "Updating .*due to diff"
I0119 02:43:03.043729 1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff: &unstructured.Unstructured{
I0119 02:43:14.970795 1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff: &unstructured.Unstructured{
I0119 02:46:21.924131 1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff: &unstructured.Unstructured{
I0119 02:46:33.794857 1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff: &unstructured.Unstructured{
I0119 02:49:40.744937 1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff: &unstructured.Unstructured{
I0119 02:49:52.622493 1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff: &unstructured.Unstructured{
I0119 02:52:59.595694 1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff: &unstructured.Unstructured{
I0119 02:53:11.443600 1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff: &unstructured.Unstructured{
I0119 02:56:18.390534 1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff: &unstructured.Unstructured{
I0119 02:56:30.265420 1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff: &unstructured.Unstructured{
I0119 02:59:37.212084 1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff: &unstructured.Unstructured{
...

# oc -n openshift-cluster-version logs cluster-version-operator-556f59b64-m8ctt | grep "Updating .*due to diff" | wc -l
48

"Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff" is tracked in bug 2039228.

# oc -n openshift-cluster-version logs cluster-version-operator-556f59b64-m8ctt | grep "Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff" | wc -l
24

# oc -n openshift-cluster-version logs cluster-version-operator-556f59b64-m8ctt | grep "Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff" | wc -l
24

# oc -n openshift-machine-api get PrometheusRule machine-api-operator-prometheus-rules -oyaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-01-19T02:12:03Z"
  generation: 1
  labels:
    prometheus: k8s
    role: alert-rules
  name: machine-api-operator-prometheus-rules
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: d1a19063-6076-44a1-98cf-3c2ed403e91e
  resourceVersion: "2013"
  uid: 90cbd695-b002-4152-a9ee-69b862c11f4c
spec:
  groups:
  - name: machine-without-valid-node-ref
    rules:
    - alert: MachineWithoutValidNode
      annotations:
        message: machine {{ $labels.name }} does not have valid node reference
      expr: |
        (mapi_machine_created_timestamp_seconds unless on(node) kube_node_info) > 0
      for: 60m
      labels:
        severity: warning
  - name: machine-with-no-running-phase
    rules:
    - alert: MachineWithNoRunningPhase
      annotations:
        message: 'machine {{ $labels.name }} is in phase: {{ $labels.phase }}'
      expr: |
        (mapi_machine_created_timestamp_seconds{phase!~"Running|Deleting"}) > 0
      for: 60m
      labels:
        severity: warning
  - name: machine-not-yet-deleted
    rules:
    - alert: MachineNotYetDeleted
      annotations:
        message: machine {{ $labels.name }} has been in Deleting phase for more than 6 hours
      expr: |
        (mapi_machine_created_timestamp_seconds{phase="Deleting"}) > 0
      for: 360m
      labels:
        severity: warning
  - name: machine-api-operator-metrics-collector-up
    rules:
    - alert: MachineAPIOperatorMetricsCollectionFailing
      annotations:
        message: 'machine api operator metrics collection is failing. For more details: oc logs <machine-api-operator-pod-name> -n openshift-machine-api'
      expr: |
        mapi_mao_collector_up == 0
      for: 5m
      labels:
        severity: critical
  - name: machine-health-check-unterminated-short-circuit
    rules:
    - alert: MachineHealthCheckUnterminatedShortCircuit
      expr: |
        mapi_machinehealthcheck_short_circuit == 1
      for: 30m
      labels:
        severity: warning

Version-Release number of the following components:
4.9.15

How reproducible:
Happens with 4.9; no such issue with 4.10.

Steps to Reproduce:
1. Install a 4.9 cluster
2. Grep 'Updating .*due to diff' in the CVO log to check for hot-looping
3.

Actual results:
Found the message "Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff"

Expected results:
The CVO should not hot-loop on this resource in a freshly installed cluster

Additional info:
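
The grep and wc steps above can also be done in one pass that groups the matches per resource. The following is a minimal sketch, assuming the CVO log has been saved as text and that the message format is the one quoted in this report; the function name `count_hotloops` and the `SAMPLE_LOG` excerpt are only illustrative.

```python
import re
from collections import Counter

# Matches the message format quoted in this report:
#   "Updating <Kind> <namespace>/<name> due to diff"
UPDATE_RE = re.compile(r"Updating (\S+) (\S+) due to diff")


def count_hotloops(log_text):
    """Return a Counter mapping (kind, name) to the number of update messages."""
    counts = Counter()
    for line in log_text.splitlines():
        match = UPDATE_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Three lines copied from the grep output in this report.
SAMPLE_LOG = """\
I0119 02:43:03.043729 1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff: &unstructured.Unstructured{
I0119 02:43:14.970795 1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff: &unstructured.Unstructured{
I0119 02:46:21.924131 1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff: &unstructured.Unstructured{
"""

if __name__ == "__main__":
    # Print one line per resource, most frequently updated first.
    for (kind, name), n in count_hotloops(SAMPLE_LOG).most_common():
        print(f"{n:3d}  {kind} {name}")
```

A resource that shows up repeatedly at a regular interval, as both resources do in the grep output above, is a likely hot-loop candidate.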
This bug seems to be already fixed. The CVO was trying to reconcile an incorrectly written manifest. The diff shows the CVO trying to add a missing "annotation" field to the in-cluster resource (for the whole log line, see the previous comment [1]):

- "annotation": map[string]interface{}{
-   "message": string("machine health check {{ $labels.name }} has been disabled by short circuit for more than 30 minutes"),
- },

However, the manifest that the CVO is trying to reconcile has a typo: "annotation" should be "annotations" (for the custom resource definition, see [2]). The CVO kept trying to apply a change that the API server would never accept, which resulted in hot-looping on the CVO side.

This issue was already fixed by pull request [3] for bug [4] in the repository [5] that the manifest comes from. A CVO log file from a newer version (in my case 4.12.0-0.ci-2022-08-29-170215) doesn't contain this hot-looping.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2042231#c0
[2] https://github.com/openshift/api/blob/b21e86c742e740c2e2c55288a0e6b68cf3afee4d/monitoring/v1alpha1/0000_50_monitoring_01_alertingrules.crd.yaml
[3] https://github.com/openshift/machine-api-operator/pull/942
[4] https://bugzilla.redhat.com/show_bug.cgi?id=2010368
[5] https://github.com/openshift/machine-api-operator
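
A typo like this could in principle be caught before a manifest ships by checking each rule's keys against the expected field names. The following is a minimal sketch under that assumption. It is not part of the CVO or the machine-api-operator; `KNOWN_RULE_KEYS` only lists the fields visible in this report rather than the full CRD schema, and the rule is given as a plain dict instead of being parsed from YAML.

```python
import difflib

# Rule fields that appear in the PrometheusRule shown in this report.
# Illustrative subset only, not the full CRD schema.
KNOWN_RULE_KEYS = {"alert", "annotations", "expr", "for", "labels"}


def find_suspect_keys(rule):
    """Return (key, suggestion) pairs for keys that are not known rule fields.

    The suggestion is the closest known field name, or None if nothing is close.
    """
    suspects = []
    for key in rule:
        if key in KNOWN_RULE_KEYS:
            continue
        close = difflib.get_close_matches(key, sorted(KNOWN_RULE_KEYS), n=1)
        suspects.append((key, close[0] if close else None))
    return suspects


# The rule from the diff above, with the misspelled "annotation" key.
broken_rule = {
    "alert": "MachineHealthCheckUnterminatedShortCircuit",
    "annotation": {
        "message": "machine health check {{ $labels.name }} has been disabled "
                   "by short circuit for more than 30 minutes",
    },
    "expr": "mapi_machinehealthcheck_short_circuit == 1\n",
    "for": "30m",
    "labels": {"severity": "warning"},
}

if __name__ == "__main__":
    for key, suggestion in find_suspect_keys(broken_rule):
        print(f"unknown key {key!r}, did you mean {suggestion!r}?")
```

Validating against the real CRD schema would be more thorough; the close-match lookup here only serves to point at the likely intended field.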
Per https://bugzilla.redhat.com/show_bug.cgi?id=2042231#c2, closing as a duplicate of https://bugzilla.redhat.com/show_bug.cgi?id=2010368.

*** This bug has been marked as a duplicate of bug 2010368 ***