2039119 – CVO hotloops on Service openshift-monitoring/cluster-monitoring-operator

Summary: CVO hotloops on Service openshift-monitoring/cluster-monitoring-operator

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Monitoring
Sub Component:
Version:	4.10
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.10.0
Assignee:	Jan Fajerski
QA Contact:	Junqi Zhao
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2022-01-11 02:33 UTC by Yang Yang
Modified:	2022-03-10 16:38 UTC (History)
CC List:	6 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2022-03-10 16:38:34 UTC
Target Upstream Version:
Embargoed:

Attachments	(Terms of Use)
CVO log file (9.71 MB, text/plain) 2022-01-11 02:33 UTC, Yang Yang	no flags	Details

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift cluster-monitoring-operator pull 1537	0	None	Merged	Bug 2039119: assets: let CVO manage the CMO Service resource	2022-01-18 20:21:26 UTC
Red Hat Product Errata	RHSA-2022:0056	0	None	None	None	2022-03-10 16:38:52 UTC

Description Yang Yang 2022-01-11 02:33:34 UTC

Created attachment 1849999 [details]
CVO log file

Description of problem:
In a freshly installed cluster, the CVO hot-loops on Service openshift-monitoring/cluster-monitoring-operator.

# grep -o 'Updating .*due to diff' cvo2.log | sort | uniq -c 
     18 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
     12 Updating Service openshift-monitoring/cluster-monitoring-operator due to diff
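The same tally can be produced without the shell pipeline. The following is a minimal Python sketch, assuming only the "Updating <Kind> <namespace>/<name> due to diff" message format shown in the log excerpts in this report; the function name count_hotloops is hypothetical and not part of any OpenShift tooling.

```python
import re
from collections import Counter

# Matches the CVO log message quoted in this report, e.g.
# "Updating Service openshift-monitoring/cluster-monitoring-operator due to diff"
UPDATE_RE = re.compile(r"Updating (\S+) (\S+) due to diff")


def count_hotloops(lines):
    """Count 'Updating ... due to diff' messages per (kind, namespace/name)."""
    counts = Counter()
    for line in lines:
        match = UPDATE_RE.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Two log lines copied from this report, plus one unrelated line.
sample = [
    "I0110 05:42:20.035301       1 core.go:78] Updating Service "
    "openshift-monitoring/cluster-monitoring-operator due to diff:   &v1.Service{",
    "I0119 01:08:11.628617       1 generic.go:109] Updating CronJob "
    "openshift-operator-lifecycle-manager/collect-profiles due to diff:   "
    "&unstructured.Unstructured{",
    "some unrelated log line",
]

for (kind, name), n in count_hotloops(sample).most_common():
    print(f"{n:7d} {kind} {name}")
```

To run it against a real log, pass an open file object (for example the attached cvo2.log) to count_hotloops instead of the sample list.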

Looking at the Service hotloop.

# grep -A139 'Updating Service openshift-monitoring/cluster-monitoring-operator due to diff' cvo2.log | tail -n140

I0110 05:42:20.035301       1 core.go:78] Updating Service openshift-monitoring/cluster-monitoring-operator due to diff:   &v1.Service{
  	TypeMeta: v1.TypeMeta{
- 		Kind:       "",
+ 		Kind:       "Service",
- 		APIVersion: "",
+ 		APIVersion: "v1",
  	},
  	ObjectMeta: v1.ObjectMeta{
  		... // 2 identical fields
  		Namespace:                  "openshift-monitoring",
  		SelfLink:                   "",
- 		UID:                        "bf847434-1d7b-403e-8799-3301762a9e4b",
+ 		UID:                        "",
- 		ResourceVersion:            "43235",
+ 		ResourceVersion:            "",
  		Generation:                 0,
- 		CreationTimestamp:          v1.Time{Time: s"2022-01-10 04:35:09 +0000 UTC"},
+ 		CreationTimestamp:          v1.Time{},
  		DeletionTimestamp:          nil,
  		DeletionGracePeriodSeconds: nil,
  		Labels: map[string]string{
  			"app":                    "cluster-monitoring-operator",
- 			"app.kubernetes.io/name": "cluster-monitoring-operator",
  		},
  		Annotations: map[string]string{
  			"include.release.openshift.io/ibm-cloud-managed":              "true",
  			"include.release.openshift.io/self-managed-high-availability": "true",
  			"include.release.openshift.io/single-node-developer":          "true",
- 			"service.alpha.openshift.io/serving-cert-signed-by":           "openshift-service-serving-signer@1641789443",
  			"service.beta.openshift.io/serving-cert-secret-name":          "cluster-monitoring-operator-tls",
- 			"service.beta.openshift.io/serving-cert-signed-by":            "openshift-service-serving-signer@1641789443",
  		},
  		OwnerReferences: {{APIVersion: "config.openshift.io/v1", Kind: "ClusterVersion", Name: "version", UID: "334d6c04-126d-4271-96ec-d303e93b7d1c", ...}},
  		Finalizers:      nil,
  		ClusterName:     "",
- 		ManagedFields: []v1.ManagedFieldsEntry{
- 			{
- 				Manager:    "cluster-version-operator",
- 				Operation:  "Update",
- 				APIVersion: "v1",
- 				Time:       s"2022-01-10 05:39:32 +0000 UTC",
- 				FieldsType: "FieldsV1",
- 				FieldsV1:   s`{"f:metadata":{"f:annotations":{".":{},"f:include.release.opensh`...,
- 			},
- 			{
- 				Manager:    "Go-http-client",
- 				Operation:  "Update",
- 				APIVersion: "v1",
- 				Time:       s"2022-01-10 05:39:35 +0000 UTC",
- 				FieldsType: "FieldsV1",
- 				FieldsV1:   s`{"f:metadata":{"f:annotations":{"f:service.alpha.openshift.io/se`...,
- 			},
- 		},
+ 		ManagedFields: nil,
  	},
  	Spec: v1.ServiceSpec{
  		Ports: []v1.ServicePort{
  			{
  				Name:        "https",
- 				Protocol:    "TCP",
+ 				Protocol:    "",
  				AppProtocol: nil,
  				Port:        8443,
  				... // 2 identical fields
  			},
  		},
  		Selector:                 {"app": "cluster-monitoring-operator"},
  		ClusterIP:                "None",
- 		ClusterIPs:               []string{"None"},
+ 		ClusterIPs:               nil,
- 		Type:                     "ClusterIP",
+ 		Type:                     "",
  		ExternalIPs:              nil,
- 		SessionAffinity:          "None",
+ 		SessionAffinity:          "",
  		LoadBalancerIP:           "",
  		LoadBalancerSourceRanges: nil,
  		... // 3 identical fields
  		PublishNotReadyAddresses:      false,
  		SessionAffinityConfig:         nil,
- 		IPFamilies:                    []v1.IPFamily{"IPv4"},
+ 		IPFamilies:                    nil,
- 		IPFamilyPolicy:                &"SingleStack",
+ 		IPFamilyPolicy:                nil,
  		AllocateLoadBalancerNodePorts: nil,
  		LoadBalancerClass:             nil,
- 		InternalTrafficPolicy:         &"Cluster",
+ 		InternalTrafficPolicy:         nil,
  	},
  	Status: {},
  }

Extract the service manifest

# cat 0000_50_cluster-monitoring-operator_03-service.yaml 
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.openshift.io/serving-cert-secret-name: cluster-monitoring-operator-tls
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  labels:
    app: cluster-monitoring-operator
  name: cluster-monitoring-operator
  namespace: openshift-monitoring
spec:
  clusterIP: None
  ports:
  - name: https
    port: 8443
    targetPort: https
  selector:
    app: cluster-monitoring-operator

Looking at the in-cluster object

# oc get service/cluster-monitoring-operator -oyaml -n openshift-monitoring
apiVersion: v1
kind: Service
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    service.alpha.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1641789443
    service.beta.openshift.io/serving-cert-secret-name: cluster-monitoring-operator-tls
    service.beta.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1641789443
  creationTimestamp: "2022-01-10T04:35:09Z"
  labels:
    app: cluster-monitoring-operator
    app.kubernetes.io/name: cluster-monitoring-operator
  name: cluster-monitoring-operator
  namespace: openshift-monitoring
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 334d6c04-126d-4271-96ec-d303e93b7d1c
  resourceVersion: "66862"
  uid: bf847434-1d7b-403e-8799-3301762a9e4b
spec:
  clusterIP: None
  clusterIPs:
  - None
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: https
    port: 8443
    protocol: TCP
    targetPort: https
  selector:
    app: cluster-monitoring-operator
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
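To make the metadata difference between the release manifest and the in-cluster object easier to see, here is a small Python sketch that lists label and annotation keys present on the live object but absent from the manifest. The dictionaries are copied by hand from the two YAML listings in this report, and the helper name extra_keys is hypothetical. This does not model how the CVO decides whether to update a resource; it only highlights keys that some other writer added.

```python
def extra_keys(manifest, live):
    """Return entries that exist in the live mapping but not in the manifest."""
    return {key: value for key, value in live.items() if key not in manifest}


# Labels and annotations from the release manifest.
manifest_labels = {"app": "cluster-monitoring-operator"}
manifest_annotations = {
    "service.beta.openshift.io/serving-cert-secret-name": "cluster-monitoring-operator-tls",
    "include.release.openshift.io/ibm-cloud-managed": "true",
    "include.release.openshift.io/self-managed-high-availability": "true",
    "include.release.openshift.io/single-node-developer": "true",
}

# Labels and annotations from the in-cluster object (oc get ... -oyaml).
live_labels = {
    "app": "cluster-monitoring-operator",
    "app.kubernetes.io/name": "cluster-monitoring-operator",
}
live_annotations = {
    "include.release.openshift.io/ibm-cloud-managed": "true",
    "include.release.openshift.io/self-managed-high-availability": "true",
    "include.release.openshift.io/single-node-developer": "true",
    "service.alpha.openshift.io/serving-cert-signed-by": "openshift-service-serving-signer@1641789443",
    "service.beta.openshift.io/serving-cert-secret-name": "cluster-monitoring-operator-tls",
    "service.beta.openshift.io/serving-cert-signed-by": "openshift-service-serving-signer@1641789443",
}

print("extra labels:", sorted(extra_keys(manifest_labels, live_labels)))
print("extra annotations:", sorted(extra_keys(manifest_annotations, live_annotations)))
```

The extra keys it reports correspond to the label and annotation lines marked with "-" in the CVO diff above.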

Version-Release number of the following components:
4.10.0-0.nightly-2022-01-09-195852

How reproducible:
1/1

Steps to Reproduce:
1. Install a 4.10 cluster
2. Grep the CVO log for 'Updating .*due to diff' to check for hot-looping

Actual results:
CVO hotloops on Service openshift-monitoring/cluster-monitoring-operator

Expected results:
CVO should not hotloop on it in a freshly installed cluster

Additional info:
Please attach logs from ansible-playbook with the -vvv flag

Comment 1 W. Trevor King 2022-01-15 07:00:35 UTC

Poking about a bit in my PR, the conflict seems to be coming from ownerReferences.  Which is surprising, since we aren't hotlooping on other resources where we maintain ownerReferences.  Looking for contention in 4.10.0-fc.1 CI [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1481748933687382016/artifacts/launch/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
$ zgrep -h '"resource":"services"' */*.log.gz | jq -r 'select((.verb | (. != "list" and . != "watch" and . != "get")) and .objectRef.name == "cluster-monitoring-operator") | .stageTimestamp + " " + .verb + " " + .user.username'
2022-01-13T22:58:58.692572Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:59:15.365385Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:02:12.384046Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:02:29.044822Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:14:45.575789Z update system:serviceaccount:openshift-cluster-version:default
2022-01-13T23:22:45.201847Z update system:serviceaccount:openshift-cluster-version:default
2022-01-13T23:32:24.490817Z update system:serviceaccount:openshift-cluster-version:default
2022-01-13T22:41:21.771855Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:41:38.521442Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:55:24.153928Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:55:41.039888Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:44:27.711076Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:44:44.480831Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:45:01.054004Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:51:43.891035Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:29:43.801000Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:34:03.912235Z update system:serviceaccount:openshift-cluster-version:default
2022-01-13T22:34:45.018509Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:35:24.874931Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:38:16.163942Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:38:33.049484Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:39:11.721002Z update system:serviceaccount:openshift-cluster-version:default
2022-01-13T22:44:37.741661Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:44:54.453722Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:45:26.056044Z update system:serviceaccount:openshift-cluster-version:default
2022-01-13T22:47:12.378804Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:47:29.356884Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:49:46.496567Z update system:serviceaccount:openshift-cluster-version:default
2022-01-13T22:53:16.162430Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:53:33.037243Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:02:45.718556Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:08:16.139168Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:08:32.841782Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:15:08.813635Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:29:58.298690Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:30:15.103338Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:50:27.421965Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:50:44.319910Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:51:53.871955Z update system:serviceaccount:openshift-cluster-version:default
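To summarize who is writing to the Service, the jq filter above can be turned into a per-user tally. The following is a Python sketch, assuming one JSON audit event per line with the fields the jq expression reads (verb, objectRef.name, user.username) plus objectRef.resource, which is what the zgrep pattern matches on. The function name writers_by_user is hypothetical, and the sample events are hand-built, abridged stand-ins modeled on the output above rather than real audit records.

```python
import json
from collections import Counter

# Read-only verbs excluded by the jq filter.
READ_VERBS = {"list", "watch", "get"}


def writers_by_user(event_lines, name="cluster-monitoring-operator"):
    """Count non-read requests against the named Service, grouped by username."""
    counts = Counter()
    for line in event_lines:
        event = json.loads(line)
        ref = event.get("objectRef") or {}
        if ref.get("resource") != "services" or ref.get("name") != name:
            continue
        if event.get("verb") in READ_VERBS:
            continue
        counts[event["user"]["username"]] += 1
    return counts


def make_event(verb, username, name="cluster-monitoring-operator"):
    """Build an abridged, illustrative audit event (not a complete record)."""
    return json.dumps({
        "verb": verb,
        "user": {"username": username},
        "objectRef": {"resource": "services", "name": name},
    })


CMO = "system:serviceaccount:openshift-monitoring:cluster-monitoring-operator"
CVO = "system:serviceaccount:openshift-cluster-version:default"

sample_events = [
    make_event("update", CMO),
    make_event("update", CMO),
    make_event("update", CVO),
    make_event("get", CVO),                      # read, ignored
    make_event("update", CMO, name="other-svc"),  # different Service, ignored
]

for username, n in writers_by_user(sample_events).most_common():
    print(n, username)
```

Two distinct service accounts appearing in the tally for the same object is the signature of the contention described here.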

So the CVO seems to be fighting with the monitoring operator.  Why does the monitoring operator care about this service?  Seems like they have both the manifest asking the CVO to manage it [2], and also an asset version they manage directly [3].  The internal asset seems old, but the manifest is relatively recent [4].  I'm going to punt this over to monitoring; perhaps Jan intended to drop the internal asset when he added the manifest?

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1481748933687382016
[2]: https://github.com/openshift/cluster-monitoring-operator/blob/7a908ea2b0947dbc3a9bdd8e3db9d8422d2ce67b/manifests/0000_50_cluster-monitoring-operator_03-service.yaml
[3]: https://github.com/openshift/cluster-monitoring-operator/blob/7a908ea2b0947dbc3a9bdd8e3db9d8422d2ce67b/assets/cluster-monitoring-operator/service.yaml
[4]: https://github.com/openshift/cluster-monitoring-operator/pull/1451

Comment 5 Junqi Zhao 2022-01-19 03:39:58 UTC

Checked with 4.10.0-0.nightly-2022-01-18-044014 and still saw the issue; will retest with a build that includes the fix.
# oc -n openshift-cluster-version logs cluster-version-operator-68db9d654-bbcgn | grep "Updating Service openshift-monitoring/cluster-monitoring-operator due to diff" | wc -l
45

Comment 6 Junqi Zhao 2022-01-19 03:49:43 UTC

# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-18-204237   True        False         161m    Cluster version is 4.10.0-0.nightly-2022-01-18-204237
# oc -n openshift-cluster-version get pod
NAME                                       READY   STATUS    RESTARTS   AGE
cluster-version-operator-9f9b99f94-78w74   1/1     Running   0          3h5m

Now only "Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff" appears (tracked in bug 2039228); there are no such messages for the CMO Service.
# oc -n openshift-cluster-version logs cluster-version-operator-9f9b99f94-78w74 | grep "Updating .*due to diff"
I0119 01:08:11.628617       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff:   &unstructured.Unstructured{
I0119 01:13:28.657914       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff:   &unstructured.Unstructured{
I0119 01:16:53.425872       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff:   &unstructured.Unstructured{
...

# oc -n openshift-cluster-version logs cluster-version-operator-9f9b99f94-78w74 | grep "Updating Service openshift-monitoring/cluster-monitoring-operator due to diff" | wc -l
0

Comment 10 errata-xmlrpc 2022-03-10 16:38:34 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056