Bug 2039119

Summary:          CVO hotloops on Service openshift-monitoring/cluster-monitoring-operator
Product:          OpenShift Container Platform
Component:        Monitoring
Version:          4.10
Status:           CLOSED ERRATA
Severity:         medium
Priority:         medium
Target Milestone: ---
Target Release:   4.10.0
Hardware:         Unspecified
OS:               Unspecified
Reporter:         Yang Yang <yanyang>
Assignee:         Jan Fajerski <jfajersk>
QA Contact:       Junqi Zhao <juzhao>
CC:               amuller, anpicker, aos-bugs, arajkuma, erooth, wking
Doc Type:         If docs needed, set a value
Type:             Bug
Last Closed:      2022-03-10 16:38:34 UTC
Attachments:      CVO log file (attachment 1849999)
Poking about a bit in my PR, the conflict seems to be coming from ownerReferences. Which is surprising, since we aren't hotlooping on other resources where we maintain ownerReferences. Looking for contention in 4.10.0-fc.1 CI [1]:

$ curl -s https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1481748933687382016/artifacts/launch/gather-audit-logs/artifacts/audit-logs.tar | tar -xz --strip-components=2
$ zgrep -h '"resource":"services"' */*.log.gz | jq -r 'select((.verb | (. != "list" and . != "watch" and . != "get")) and .objectRef.name == "cluster-monitoring-operator") | .stageTimestamp + " " + .verb + " " + .user.username'
2022-01-13T22:58:58.692572Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:59:15.365385Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:02:12.384046Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:02:29.044822Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:14:45.575789Z update system:serviceaccount:openshift-cluster-version:default
2022-01-13T23:22:45.201847Z update system:serviceaccount:openshift-cluster-version:default
2022-01-13T23:32:24.490817Z update system:serviceaccount:openshift-cluster-version:default
2022-01-13T22:41:21.771855Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:41:38.521442Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:55:24.153928Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:55:41.039888Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:44:27.711076Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:44:44.480831Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:45:01.054004Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:51:43.891035Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:29:43.801000Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:34:03.912235Z update system:serviceaccount:openshift-cluster-version:default
2022-01-13T22:34:45.018509Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:35:24.874931Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:38:16.163942Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:38:33.049484Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:39:11.721002Z update system:serviceaccount:openshift-cluster-version:default
2022-01-13T22:44:37.741661Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:44:54.453722Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:45:26.056044Z update system:serviceaccount:openshift-cluster-version:default
2022-01-13T22:47:12.378804Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:47:29.356884Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:49:46.496567Z update system:serviceaccount:openshift-cluster-version:default
2022-01-13T22:53:16.162430Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T22:53:33.037243Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:02:45.718556Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:08:16.139168Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:08:32.841782Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:15:08.813635Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:29:58.298690Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:30:15.103338Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:50:27.421965Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:50:44.319910Z update system:serviceaccount:openshift-monitoring:cluster-monitoring-operator
2022-01-13T23:51:53.871955Z update system:serviceaccount:openshift-cluster-version:default

So the CVO seems to be fighting with the monitoring operator. Why does the monitoring operator care about this service? Seems like they have both the manifest asking the CVO to manage it [2], and also an asset version they manage directly [3]. The internal asset seems old, but the manifest is relatively recent [4]. I'm going to punt this over to monitoring; perhaps Jan intended to drop the internal asset when he added the manifest? (Quick-check sketches for a per-actor tally of the audit query above, and for diffing the manifest against the internal asset, are appended after the comment stream below.)

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-aws-modern/1481748933687382016
[2]: https://github.com/openshift/cluster-monitoring-operator/blob/7a908ea2b0947dbc3a9bdd8e3db9d8422d2ce67b/manifests/0000_50_cluster-monitoring-operator_03-service.yaml
[3]: https://github.com/openshift/cluster-monitoring-operator/blob/7a908ea2b0947dbc3a9bdd8e3db9d8422d2ce67b/assets/cluster-monitoring-operator/service.yaml
[4]: https://github.com/openshift/cluster-monitoring-operator/pull/1451

Checked with 4.10.0-0.nightly-2022-01-18-044014: the issue is still present. Will re-test with a build that includes the fix.

# oc -n openshift-cluster-version logs cluster-version-operator-68db9d654-bbcgn | grep "Updating Service openshift-monitoring/cluster-monitoring-operator due to diff" | wc -l
45

# oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2022-01-18-204237   True        False         161m    Cluster version is 4.10.0-0.nightly-2022-01-18-204237

# oc -n openshift-cluster-version get pod
NAME                                       READY   STATUS    RESTARTS   AGE
cluster-version-operator-9f9b99f94-78w74   1/1     Running   0          3h5m

Only the "Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff" messages remain (that hotloop is tracked in bug 2039228); there is no such message for the cluster-monitoring-operator Service.

# oc -n openshift-cluster-version logs cluster-version-operator-9f9b99f94-78w74 | grep "Updating .*due to diff"
I0119 01:08:11.628617       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff: &unstructured.Unstructured{
I0119 01:13:28.657914       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff: &unstructured.Unstructured{
I0119 01:16:53.425872       1 generic.go:109] Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff: &unstructured.Unstructured{
...
# oc -n openshift-cluster-version logs cluster-version-operator-9f9b99f94-78w74 | grep "Updating Service openshift-monitoring/cluster-monitoring-operator due to diff" | wc -l
0

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056
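For anyone repeating this kind of triage, a small extension of the audit-log query quoted in the comment above may help. This is only a sketch (it was not run as part of the original investigation) and it assumes the audit logs have already been unpacked into the current directory with the curl/tar command shown in that comment. It reuses the same jq filter, but tallies the write verbs per actor, so a two-writer fight shows up as two busy service accounts on one object:

$ zgrep -h '"resource":"services"' */*.log.gz \
    | jq -r 'select((.verb | (. != "list" and . != "watch" and . != "get")) and .objectRef.name == "cluster-monitoring-operator") | .verb + " " + .user.username' \
    | sort | uniq -c | sort -rn    # count of non-read requests per service account

With the bug present this should show both system:serviceaccount:openshift-cluster-version:default and system:serviceaccount:openshift-monitoring:cluster-monitoring-operator updating the Service; once the fix lands, the monitoring operator's writes should largely disappear.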
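The dual-management theory (the CVO-applied manifest [2] versus the operator's internal asset [3]) can likewise be eyeballed by diffing the two files. Again just a sketch: the raw.githubusercontent.com URLs are simply the raw-file form of the blob links cited above, and the /tmp file names are arbitrary:

$ curl -s https://raw.githubusercontent.com/openshift/cluster-monitoring-operator/7a908ea2b0947dbc3a9bdd8e3db9d8422d2ce67b/manifests/0000_50_cluster-monitoring-operator_03-service.yaml > /tmp/cmo-manifest.yaml
$ curl -s https://raw.githubusercontent.com/openshift/cluster-monitoring-operator/7a908ea2b0947dbc3a9bdd8e3db9d8422d2ce67b/assets/cluster-monitoring-operator/service.yaml > /tmp/cmo-asset.yaml
$ diff -u /tmp/cmo-asset.yaml /tmp/cmo-manifest.yaml    # fields present in only one file are candidates for what the two controllers keep rewriting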
Created attachment 1849999 [details]
CVO log file

Description of problem:

In a freshly installed cluster, we can see hot-looping on Service openshift-monitoring/cluster-monitoring-operator.

# grep -o 'Updating .*due to diff' cvo2.log | sort | uniq -c
     18 Updating CronJob openshift-operator-lifecycle-manager/collect-profiles due to diff
     12 Updating Service openshift-monitoring/cluster-monitoring-operator due to diff

Looking at the Service hotloop:

# grep -A139 'Updating Service openshift-monitoring/cluster-monitoring-operator due to diff' cvo2.log | tail -n140
I0110 05:42:20.035301       1 core.go:78] Updating Service openshift-monitoring/cluster-monitoring-operator due to diff: &v1.Service{
  TypeMeta: v1.TypeMeta{
-   Kind: "",
+   Kind: "Service",
-   APIVersion: "",
+   APIVersion: "v1",
  },
  ObjectMeta: v1.ObjectMeta{
    ... // 2 identical fields
    Namespace: "openshift-monitoring",
    SelfLink: "",
-   UID: "bf847434-1d7b-403e-8799-3301762a9e4b",
+   UID: "",
-   ResourceVersion: "43235",
+   ResourceVersion: "",
    Generation: 0,
-   CreationTimestamp: v1.Time{Time: s"2022-01-10 04:35:09 +0000 UTC"},
+   CreationTimestamp: v1.Time{},
    DeletionTimestamp: nil,
    DeletionGracePeriodSeconds: nil,
    Labels: map[string]string{
      "app": "cluster-monitoring-operator",
-     "app.kubernetes.io/name": "cluster-monitoring-operator",
    },
    Annotations: map[string]string{
      "include.release.openshift.io/ibm-cloud-managed": "true",
      "include.release.openshift.io/self-managed-high-availability": "true",
      "include.release.openshift.io/single-node-developer": "true",
-     "service.alpha.openshift.io/serving-cert-signed-by": "openshift-service-serving-signer@1641789443",
      "service.beta.openshift.io/serving-cert-secret-name": "cluster-monitoring-operator-tls",
-     "service.beta.openshift.io/serving-cert-signed-by": "openshift-service-serving-signer@1641789443",
    },
    OwnerReferences: {{APIVersion: "config.openshift.io/v1", Kind: "ClusterVersion", Name: "version", UID: "334d6c04-126d-4271-96ec-d303e93b7d1c", ...}},
    Finalizers: nil,
    ClusterName: "",
-   ManagedFields: []v1.ManagedFieldsEntry{
-     {
-       Manager: "cluster-version-operator",
-       Operation: "Update",
-       APIVersion: "v1",
-       Time: s"2022-01-10 05:39:32 +0000 UTC",
-       FieldsType: "FieldsV1",
-       FieldsV1: s`{"f:metadata":{"f:annotations":{".":{},"f:include.release.opensh`...,
-     },
-     {
-       Manager: "Go-http-client",
-       Operation: "Update",
-       APIVersion: "v1",
-       Time: s"2022-01-10 05:39:35 +0000 UTC",
-       FieldsType: "FieldsV1",
-       FieldsV1: s`{"f:metadata":{"f:annotations":{"f:service.alpha.openshift.io/se`...,
-     },
-   },
+   ManagedFields: nil,
  },
  Spec: v1.ServiceSpec{
    Ports: []v1.ServicePort{
      {
        Name: "https",
-       Protocol: "TCP",
+       Protocol: "",
        AppProtocol: nil,
        Port: 8443,
        ... // 2 identical fields
      },
    },
    Selector: {"app": "cluster-monitoring-operator"},
    ClusterIP: "None",
-   ClusterIPs: []string{"None"},
+   ClusterIPs: nil,
-   Type: "ClusterIP",
+   Type: "",
    ExternalIPs: nil,
-   SessionAffinity: "None",
+   SessionAffinity: "",
    LoadBalancerIP: "",
    LoadBalancerSourceRanges: nil,
    ... // 3 identical fields
    PublishNotReadyAddresses: false,
    SessionAffinityConfig: nil,
-   IPFamilies: []v1.IPFamily{"IPv4"},
+   IPFamilies: nil,
-   IPFamilyPolicy: &"SingleStack",
+   IPFamilyPolicy: nil,
    AllocateLoadBalancerNodePorts: nil,
    LoadBalancerClass: nil,
-   InternalTrafficPolicy: &"Cluster",
+   InternalTrafficPolicy: nil,
  },
  Status: {},
}

Extract the service manifest:

# cat 0000_50_cluster-monitoring-operator_03-service.yaml
apiVersion: v1
kind: Service
metadata:
  annotations:
    service.beta.openshift.io/serving-cert-secret-name: cluster-monitoring-operator-tls
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
  labels:
    app: cluster-monitoring-operator
  name: cluster-monitoring-operator
  namespace: openshift-monitoring
spec:
  clusterIP: None
  ports:
  - name: https
    port: 8443
    targetPort: https
  selector:
    app: cluster-monitoring-operator

Looking at the in-cluster object:

# oc get service/cluster-monitoring-operator -oyaml -n openshift-monitoring
apiVersion: v1
kind: Service
metadata:
  annotations:
    include.release.openshift.io/ibm-cloud-managed: "true"
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    service.alpha.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1641789443
    service.beta.openshift.io/serving-cert-secret-name: cluster-monitoring-operator-tls
    service.beta.openshift.io/serving-cert-signed-by: openshift-service-serving-signer@1641789443
  creationTimestamp: "2022-01-10T04:35:09Z"
  labels:
    app: cluster-monitoring-operator
    app.kubernetes.io/name: cluster-monitoring-operator
  name: cluster-monitoring-operator
  namespace: openshift-monitoring
  ownerReferences:
  - apiVersion: config.openshift.io/v1
    kind: ClusterVersion
    name: version
    uid: 334d6c04-126d-4271-96ec-d303e93b7d1c
  resourceVersion: "66862"
  uid: bf847434-1d7b-403e-8799-3301762a9e4b
spec:
  clusterIP: None
  clusterIPs:
  - None
  internalTrafficPolicy: Cluster
  ipFamilies:
  - IPv4
  ipFamilyPolicy: SingleStack
  ports:
  - name: https
    port: 8443
    protocol: TCP
    targetPort: https
  selector:
    app: cluster-monitoring-operator
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}

Version-Release number of the following components:
4.10.0-0.nightly-2022-01-09-195852

How reproducible:
1/1

Steps to Reproduce:
1. Install a 4.10 cluster
2. Grep 'Updating .*due to diff' in the CVO log to check for hot-looping

Actual results:
CVO hotloops on Service openshift-monitoring/cluster-monitoring-operator

Expected results:
CVO should not hotloop on it in a freshly installed cluster

Additional info:
Please attach logs from ansible-playbook with the -vvv flag
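As additional context for the hotloop shown in the diff above, the churn can also be observed on the object itself rather than in the CVO log. This is only a rough sketch (not something from the original report): while two controllers keep rewriting the Service, its resourceVersion keeps climbing even though nothing about the workload actually changes (it had already moved from 43235 in the diff to 66862 by the time the in-cluster object was dumped), so simply polling it makes the churn visible:

$ while sleep 30; do
    oc -n openshift-monitoring get service cluster-monitoring-operator \
      -o jsonpath='{.metadata.resourceVersion}{"\n"}'   # keeps climbing while the hotloop is active
  done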