2042231 – CVO hotloops on Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules

Keywords:
Status:	CLOSED DUPLICATE of bug 2010368
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cluster Version Operator
Sub Component:
Version:	4.9
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	---
Assignee:	Jack Ottofaro
QA Contact:	Yang Yang
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-01-19 04:21 UTC by Junqi Zhao
Modified:	2022-08-31 12:54 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-08-31 12:54:32 UTC
Target Upstream Version:
Embargoed:

Attachments
CVO logs (7.96 MB, text/plain) 2022-01-19 04:21 UTC, Junqi Zhao	no flags

Description Junqi Zhao 2022-01-19 04:21:10 UTC

Created attachment 1851782
CVO logs

Description of problem:
# oc get clusterversion
NAME      VERSION   AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.9.15    True        False         79m     Cluster version is 4.9.15

# oc -n openshift-cluster-version get po
NAME                                       READY   STATUS    RESTARTS   AGE
cluster-version-operator-556f59b64-m8ctt   1/1     Running   0          108m

Checking the CVO log for hot-looping shows the message "Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff":
# oc -n openshift-cluster-version logs cluster-version-operator-556f59b64-m8ctt
...
I0119 02:43:03.043729       1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff:   &unstructured.Unstructured{
  	Object: map[string]interface{}{
  		"apiVersion": string("monitoring.coreos.com/v1"),
  		"kind":       string("PrometheusRule"),
  		"metadata":   map[string]interface{}{"annotations": map[string]interface{}{"exclude.release.openshift.io/internal-openshift-hosted": string("true"), "include.release.openshift.io/self-managed-high-availability": string("true"), "include.release.openshift.io/single-node-developer": string("true")}, "creationTimestamp": string("2022-01-19T02:12:03Z"), "generation": int64(1), "labels": map[string]interface{}{"prometheus": string("k8s"), "role": string("alert-rules")}, ...},
  		"spec": map[string]interface{}{
  			"groups": []interface{}{
  				... // 2 identical elements
  				map[string]interface{}{"name": string("machine-not-yet-deleted"), "rules": []interface{}{map[string]interface{}{"alert": string("MachineNotYetDeleted"), "annotations": map[string]interface{}{"message": string("machine {{ $labels.name }} has been in Deleting phase for more t"...)}, "expr": string("(mapi_machine_created_timestamp_seconds{phase=\"Deleting\"}) > 0\n"), "for": string("360m"), ...}}},
  				map[string]interface{}{"name": string("machine-api-operator-metrics-collector-up"), "rules": []interface{}{map[string]interface{}{"alert": string("MachineAPIOperatorMetricsCollectionFailing"), "annotations": map[string]interface{}{"message": string("machine api operator metrics collection is failing. For more det"...)}, "expr": string("mapi_mao_collector_up == 0\n"), "for": string("5m"), ...}}},
  				map[string]interface{}{
  					"name": string("machine-health-check-unterminated-short-circuit"),
  					"rules": []interface{}{
  						map[string]interface{}{
  							"alert": string("MachineHealthCheckUnterminatedShortCircuit"),
- 							"annotation": map[string]interface{}{
- 								"message": string("machine health check {{ $labels.name }} has been disabled by short circuit for more than 30 minutes"),
- 							},
  							"expr":   string("mapi_machinehealthcheck_short_circuit == 1\n"),
  							"for":    string("30m"),
  							"labels": map[string]interface{}{"severity": string("warning")},
  						},
  					},
  				},
  			},
  		},
  	},
  }
...

# oc -n openshift-cluster-version logs cluster-version-operator-556f59b64-m8ctt | grep "Updating .*due to diff"
I0119 02:43:03.043729       1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff:   &unstructured.Unstructured{
I0119 02:43:14.970795       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff:   &unstructured.Unstructured{
I0119 02:46:21.924131       1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff:   &unstructured.Unstructured{
I0119 02:46:33.794857       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff:   &unstructured.Unstructured{
I0119 02:49:40.744937       1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff:   &unstructured.Unstructured{
I0119 02:49:52.622493       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff:   &unstructured.Unstructured{
I0119 02:52:59.595694       1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff:   &unstructured.Unstructured{
I0119 02:53:11.443600       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff:   &unstructured.Unstructured{
I0119 02:56:18.390534       1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff:   &unstructured.Unstructured{
I0119 02:56:30.265420       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff:   &unstructured.Unstructured{
I0119 02:59:37.212084       1 generic.go:109] Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff:   &unstructured.Unstructured{
...
#  oc -n openshift-cluster-version logs cluster-version-operator-556f59b64-m8ctt | grep "Updating .*due to diff" | wc -l
48

"Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff" is tracked in bug 2039228
# oc -n openshift-cluster-version logs cluster-version-operator-556f59b64-m8ctt | grep "Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff" | wc -l
24
# oc -n openshift-cluster-version logs cluster-version-operator-556f59b64-m8ctt | grep "Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff" | wc -l
24

# oc -n openshift-machine-api get PrometheusRule machine-api-operator-prometheus-rules -oyaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  annotations:
    exclude.release.openshift.io/internal-openshift-hosted: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  creationTimestamp: "2022-01-19T02:12:03Z"
  generation: 1
  labels:
    prometheus: k8s
    role: alert-rules
  name: machine-api-operator-prometheus-rules
  namespace: openshift-machine-api
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: d1a19063-6076-44a1-98cf-3c2ed403e91e
  resourceVersion: "2013"
  uid: 90cbd695-b002-4152-a9ee-69b862c11f4c
spec:
  groups:
  - name: machine-without-valid-node-ref
    rules:
    - alert: MachineWithoutValidNode
      annotations:
        message: machine {{ $labels.name }} does not have valid node reference
      expr: |
        (mapi_machine_created_timestamp_seconds unless on(node) kube_node_info) > 0
      for: 60m
      labels:
        severity: warning
  - name: machine-with-no-running-phase
    rules:
    - alert: MachineWithNoRunningPhase
      annotations:
        message: 'machine {{ $labels.name }} is in phase: {{ $labels.phase }}'
      expr: |
        (mapi_machine_created_timestamp_seconds{phase!~"Running|Deleting"}) > 0
      for: 60m
      labels:
        severity: warning
  - name: machine-not-yet-deleted
    rules:
    - alert: MachineNotYetDeleted
      annotations:
        message: machine {{ $labels.name }} has been in Deleting phase for more than
          6 hours
      expr: |
        (mapi_machine_created_timestamp_seconds{phase="Deleting"}) > 0
      for: 360m
      labels:
        severity: warning
  - name: machine-api-operator-metrics-collector-up
    rules:
    - alert: MachineAPIOperatorMetricsCollectionFailing
      annotations:
        message: 'machine api operator metrics collection is failing. For more details:  oc
          logs <machine-api-operator-pod-name> -n openshift-machine-api'
      expr: |
        mapi_mao_collector_up == 0
      for: 5m
      labels:
        severity: critical
  - name: machine-health-check-unterminated-short-circuit
    rules:
    - alert: MachineHealthCheckUnterminatedShortCircuit
      expr: |
        mapi_machinehealthcheck_short_circuit == 1
      for: 30m
      labels:
        severity: warning

Version-Release number of the following components:
4.9.15

How reproducible:
Happens on 4.9; the issue does not occur on 4.10.

Steps to Reproduce:
1. Install a 4.9 cluster
2. Grep for 'Updating .*due to diff' in the CVO log to check for hot-looping

Actual results:
The CVO log repeatedly contains "Updating PrometheusRule openshift-machine-api/machine-api-operator-prometheus-rules due to diff".

Expected results:
CVO should not hot-loop on this resource in a freshly installed cluster.

Additional info:

Comment 2 David Hurta 2022-08-30 13:07:18 UTC

This bug seems to be already fixed.

The CVO was trying to reconcile an incorrectly written manifest.

We can see in the diff that the CVO tries to add the field "annotation", which is missing from the in-cluster resource (for the whole log line, see comment 0 [1]):

- 							"annotation": map[string]interface{}{
- 								"message": string("machine health check {{ $labels.name }} has been disabled by short circuit for more than 30 minutes"),
- 							},

However, the manifest that the CVO is trying to reconcile has a typo: "annotation" should be "annotations" (for the custom resource definition, see [2]). The CVO was trying to apply a change to a resource that would never be accepted by the API server, resulting in hot-looping on the CVO side.
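
The mechanism can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the CVO's actual reconciliation code: it assumes the misspelled field is silently dropped when the object is stored (which is consistent with the live object in the description, where the rule has no annotation field at all), and `KNOWN_RULE_FIELDS`, `prune_rule`, `reconcile` and `count_updates` are hypothetical names.

```python
import copy

# Hypothetical, heavily simplified stand-in for the fields a rule entry allows.
KNOWN_RULE_FIELDS = {"alert", "annotations", "expr", "for", "labels"}


def prune_rule(rule):
    """Drop fields the (assumed) schema does not know about."""
    return {k: copy.deepcopy(v) for k, v in rule.items() if k in KNOWN_RULE_FIELDS}


def reconcile(desired, live):
    """One sync pass: if live lacks anything the manifest sets, write the manifest.

    Returns the object as stored after the pass and whether an update was sent.
    """
    if all(live.get(k) == v for k, v in desired.items()):
        return live, False
    return prune_rule(desired), True


def count_updates(desired, passes=5):
    """Run several sync passes and count how many of them sent an update."""
    live = prune_rule(desired)  # state right after the initial create
    updates = 0
    for _ in range(passes):
        live, updated = reconcile(desired, live)
        updates += int(updated)
    return updates


message = ("machine health check {{ $labels.name }} has been disabled "
           "by short circuit for more than 30 minutes")
base = {
    "alert": "MachineHealthCheckUnterminatedShortCircuit",
    "expr": "mapi_machinehealthcheck_short_circuit == 1\n",
    "for": "30m",
    "labels": {"severity": "warning"},
}
typo_manifest = {**base, "annotation": {"message": message}}    # misspelled key
fixed_manifest = {**base, "annotations": {"message": message}}  # corrected key

print(count_updates(typo_manifest))   # an update on every pass
print(count_updates(fixed_manifest))  # no updates after the initial create
```

With the misspelled key, the stored object never matches the manifest, so every pass sends another update. With the corrected key, the stored object matches after the initial create and no further updates are sent.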

This issue was already fixed by pull request [3] for bug [4] in the original repository [5], from which the manifest comes. A CVO log file from a newer version (in my case 4.12.0-0.ci-2022-08-29-170215) doesn't contain this hot-looping.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=2042231#c0

[2] https://github.com/openshift/api/blob/b21e86c742e740c2e2c55288a0e6b68cf3afee4d/monitoring/v1alpha1/0000_50_monitoring_01_alertingrules.crd.yaml

[3] https://github.com/openshift/machine-api-operator/pull/942

[4] https://bugzilla.redhat.com/show_bug.cgi?id=2010368

[5] https://github.com/openshift/machine-api-operator

Comment 3 Jack Ottofaro 2022-08-31 12:54:32 UTC

Per https://bugzilla.redhat.com/show_bug.cgi?id=2042231#c2 closing as a dup of https://bugzilla.redhat.com/show_bug.cgi?id=2010368.

*** This bug has been marked as a duplicate of bug 2010368 ***
