Recreating objects continuously because of buggy apply logic in the CVO can have severe consequences for performance (compare https://bugzilla.redhat.com/show_bug.cgi?id=1795412) and availability (compare https://bugzilla.redhat.com/show_bug.cgi?id=1863011). In https://github.com/openshift/cluster-version-operator/pull/258 a generic way to log these hotlooping resources was proposed, but it was turned down (https://github.com/openshift/cluster-version-operator/pull/258#issuecomment-544964000) and replaced with a CRD-specific solution. Now we have had another case (https://bugzilla.redhat.com/show_bug.cgi?id=1863011) and have spent more than a person-week of work triaging the issue across both the api-auth team and the monitoring team. Instead, the CVO should detect this and make developers aware of it easily. In the CVO logs this is completely invisible.
We have the same issue in cluster-monitoring-operator so https://bugzilla.redhat.com/show_bug.cgi?id=1856189 is related.
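Once the CVO does log its applies, a quick triage pipeline could surface hotlooping resources by counting repeated "Updating" entries. A sketch, using sample lines modeled on the log format seen later in this bug; in a real cluster you would pipe `oc -n openshift-cluster-version logs` instead of the sample file:

```shell
# Sketch: count how often the CVO rewrites the same resource. Sample lines
# mimic the "Updating <Kind> <namespace>/<name>" log format; in practice,
# pipe the live CVO log instead of this file.
cat > /tmp/cvo.log <<'EOF'
I0610 02:18:26.065199 1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 02:21:40.539565 1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 02:18:39.568938 1 generic.go:40] ConsoleLink /openshift-blog not found, creating
EOF
# A resource that is updated on every sync interval shows up with a high count.
grep -o 'Updating [A-Za-z]* [^ ]*' /tmp/cvo.log | sort | uniq -c | sort -rn
```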
The fix for bug 1795412 was removing overlapping manifests for the same resource [1]. Unfortunately, the fix from 2020-02-14 landed before we were cutting any 4.4 RCs:

$ oc adm release info quay.io/openshift-release-dev/ocp-release:4.4.0-rc.0-x86_64 | head -n3
Name:      4.4.0-rc.0
Digest:    sha256:746567f5dd9f5532a2f15180dadd99b4fed11f46e92500637fe6bfb61236a628
Created:   2020-03-11T17:57:04Z

so I can't easily extract manifests and show the overlap there. But looking for similar overlap in recent releases:

$ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.6.0-fc.6-x86_64
Extracted release payload from digest sha256:933f3d6f61ddec9f3b88a0932b47c438d7dfc15ff1873ab176284b66c9cff76e created at 2020-09-14T21:50:05Z
$ for X in manifests/*.yaml; do JSON="$(yaml2json < "${X}")"; if test "x${JSON:0:1}" = 'x['; then echo "${JSON}" | jq -c '.[]'; elif test "$(echo "${JSON}" | jq .items)" != 'null'; then echo "${JSON}" | jq -c '.items[]'; else echo "${JSON}" | jq -c .; fi; done | grep -v '^""$' | jq -r '.kind + " " + .apiVersion + " " + ((.metadata // {}).namespace // "-") + " " + ((.metadata // {}).name // "-")' | sort | uniq -c | sort -n | tail -n4
      1 Storage operator.openshift.io/v1 - cluster
      2 Namespace v1 - openshift-cluster-storage-operator
      2 Namespace v1 - openshift-machine-api
      2 PrometheusRule monitoring.coreos.com/v1 openshift-kube-controller-manager-operator kube-controller-manager-operator
$ grep -r openshift-cluster-storage-operator manifests | grep 'namespace\.yaml'
manifests/0000_50_cluster-csi-snapshot-controller-operator_00_namespace.yaml: name: openshift-cluster-storage-operator
manifests/0000_50_cluster-storage-operator_01_operator_namespace.yaml: name: openshift-cluster-storage-operator
$ grep -B10 -r openshift-machine-api manifests | grep 'kind: Namespace'
manifests/0000_50_cluster-autoscaler-operator_00_namespace.yaml-kind: Namespace
manifests/0000_30_machine-api-operator_00_namespace.yaml-kind: Namespace
$ grep -B10 -r 'name: kube-controller-manager-operator' manifests | grep 'kind: PrometheusRule'
manifests/0000_90_kube-controller-manager-operator_05_alert-pdb.yaml-kind: PrometheusRule
manifests/0000_90_kube-controller-manager-operator_05_alert-kcm-down.yaml-kind: PrometheusRule
$ diff -u0 <(yaml2json <manifests/0000_50_cluster-storage-operator_01_operator_namespace.yaml | jq -S .) <(yaml2json <manifests/0000_50_cluster-csi-snapshot-controller-operator_00_namespace.yaml | jq -S .)
--- /dev/fd/63	2020-09-15 21:57:52.836562697 -0700
+++ /dev/fd/62	2020-09-15 21:57:52.837562709 -0700
@@ -7,0 +8,3 @@
+    "labels": {
+      "openshift.io/cluster-monitoring": "true"
+    },
$ diff -u0 <(yaml2json <manifests/0000_30_machine-api-operator_00_namespace.yaml | jq -S .) <(yaml2json <manifests/0000_50_cluster-autoscaler-operator_00_namespace.yaml | jq -S .)
--- /dev/fd/63	2020-09-15 21:59:06.566442814 -0700
+++ /dev/fd/62	2020-09-15 21:59:06.569442850 -0700
@@ -5,3 +4,0 @@
-    "annotations": {
-      "openshift.io/node-selector": ""
-    },
$ diff -u manifests/0000_90_kube-controller-manager-operator_05_alert-pdb.yaml manifests/0000_90_kube-controller-manager-operator_05_alert-kcm-down.yaml
--- manifests/0000_90_kube-controller-manager-operator_05_alert-pdb.yaml	2020-09-12 05:33:59.000000000 -0700
+++ manifests/0000_90_kube-controller-manager-operator_05_alert-kcm-down.yaml	2020-09-12 05:33:59.000000000 -0700
@@ -9,19 +9,11 @@
 groups:
 - name: cluster-version
   rules:
-  - alert: PodDisruptionBudgetAtLimit
+  - alert: KubeControllerManagerDown
     annotations:
-      message: The pod disruption budget is preventing further disruption to pods because it is at the minimum allowed level.
+      message: KubeControllerManager has disappeared from Prometheus target discovery.
     expr: |
-      max by(namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_expected_pods == kube_poddisruptionbudget_status_desired_healthy)
-    for: 15m
-    labels:
-      severity: warning
-  - alert: PodDisruptionBudgetLimit
-    annotations:
-      message: The pod disruption budget is below the minimum number allowed pods.
-    expr: |
-      max by (namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_expected_pods < kube_poddisruptionbudget_status_desired_healthy)
+      absent(up{job="kube-controller-manager"} == 1)
     for: 15m
     labels:
       severity: critical

If that sort of issue is what you're worried about, I'd rather guard against it with an 'oc adm release new ...' check, so we die before even constructing the release image, instead of getting up and waiting for folks to go poking around in the CVO logs. That's for CVO vs. CVO. I'd also like to serve CVO metrics about out-of-update stomp rates, so we could detect CVO vs. other actors, but that's maybe a different RFE.

Moving to 'oc', medium/medium, and 4.7.0 as the target, because we haven't had the dup-manifest guards in 'oc adm release new ...' for many releases now. If we get to them before code freeze, we can pull this bug back to 4.6.

[1]: https://github.com/openshift/cluster-config-operator/pull/115
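The core of such a guard in 'oc adm release new ...' would be: no two manifests may declare the same resource tuple. A much-simplified sketch of the idea (toy single-document YAML only, keyed on kind/name; a real guard would also key on apiVersion and namespace and handle List items, as the jq pipeline above does):

```shell
# Sketch of a dup-manifest guard: report when two manifests declare the same
# kind/name. Toy YAML files stand in for an extracted release payload; the
# file names below copy the real duplicate-Namespace pair found above.
mkdir -p /tmp/manifests
cat > /tmp/manifests/0000_30_machine-api-operator_00_namespace.yaml <<'EOF'
kind: Namespace
apiVersion: v1
metadata:
  name: openshift-machine-api
EOF
cat > /tmp/manifests/0000_50_cluster-autoscaler-operator_00_namespace.yaml <<'EOF'
kind: Namespace
apiVersion: v1
metadata:
  name: openshift-machine-api
EOF
# Emit one "kind name" key per manifest; `uniq -d` prints only duplicates.
dups="$(for f in /tmp/manifests/*.yaml; do
  awk '$1=="kind:"{k=$2} $1=="name:"{n=$2; exit} END{print k, n}' "$f"
done | sort | uniq -d)"
if [ -n "$dups" ]; then
  echo "duplicate manifests detected: $dups"
  # a real guard would exit non-zero here, before assembling the release image
fi
```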
I've commented on the controller-manager overlap in [1], spun the storage namespace divergence out into bug 1879365, and I'll follow up with the machine-api folks tomorrow. [1]: https://bugzilla.redhat.com/show_bug.cgi?id=1856189#c41
This bug isn't ASSIGNED yet, so moving from Jack back to the updates team. Should have done that when I moved it to oc, sorry.
I've filed bug 1879777 for the machine-api namespace divergence.
This is certainly not an oc bug. It is about the CVO updating objects again and again.
To be clear: this is not an umbrella for overlapping resources like the CRD bug and https://bugzilla.redhat.com/show_bug.cgi?id=1879777. It's about providing a means to detect these issues easily, without spending hours of engineering time debugging the symptoms they cause. The fix is obvious in the CVO and an hour away from being implemented.
Sergiusz said the PrometheusRule overlap was unrelated to bug 1856189, so I've spun it out into a fresh bug 1881246. I think CVO vs. * contention can be detected via tooling that consumes audit logs, and that long-term we want metrics around how frequently the CVO is updating a managed object, and am happy to leave this bug open to track those efforts. I am not against folks adjusting CVO logging around this, but am not convinced that entries in the CVO logs are much more discoverable than entries in the audit logs, and audit logs both already exist, and would catch whoever the CVO was fighting with.
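Detecting CVO vs. * contention from audit logs could look something like the following sketch. The field names follow the Kubernetes audit v1 event schema, but the CVO username shown is an assumption, and the sample line stands in for a real audit log:

```shell
# Sketch: count update events per resource issued by the CVO's identity in
# an audit log (JSON lines). The username below is an assumed placeholder
# for whatever identity the CVO authenticates as in a given cluster.
cat > /tmp/audit.log <<'EOF'
{"verb":"update","user":{"username":"system:serviceaccount:openshift-cluster-version:default"},"objectRef":{"resource":"priorityclasses","name":"openshift-user-critical"}}
{"verb":"update","user":{"username":"system:serviceaccount:openshift-cluster-version:default"},"objectRef":{"resource":"priorityclasses","name":"openshift-user-critical"}}
{"verb":"get","user":{"username":"system:admin"},"objectRef":{"resource":"pods","name":"x"}}
EOF
grep '"username":"system:serviceaccount:openshift-cluster-version:default"' /tmp/audit.log |
  grep '"verb":"update"' |
  sed 's/.*"resource":"\([^"]*\)","name":"\([^"]*\)".*/\1 \2/' |
  sort | uniq -c | sort -rn
```

The same counting applied to usernames other than the CVO's, filtered to CVO-managed objects, is what would catch whoever the CVO was fighting with.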
Hopefully we will get a Late audit guard in the next sprint, to close out this issue.
Moving this to severity high as https://bugzilla.redhat.com/show_bug.cgi?id=1881481 is severity high.
Can't be assigned to me when it's NEW. Moving back to the team.
Reproducing with 4.8.0-fc.3

# oc -n openshift-cluster-version logs cluster-version-operator-577cf9ccc-bk89g | grep "Updating " | sort
null

The CVO does not log the resource updates.
Verifying with 4.8.0-0.nightly-2021-06-09-214128

# oc -n openshift-cluster-version logs cluster-version-operator-66b66cf4bf-blrdc | grep -i 'creating\|updating'
I0610 02:18:26.065199       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 02:18:39.527376       1 generic.go:40] MachineHealthCheck openshift-machine-api/machine-api-termination-handler not found, creating
I0610 02:18:39.568938       1 generic.go:40] ConsoleLink /openshift-blog not found, creating
I0610 02:18:39.586370       1 generic.go:40] ConsoleLink /openshift-learning-portal not found, creating
I0610 02:18:39.666343       1 generic.go:40] ConsoleQuickStart /add-healthchecks not found, creating
I0610 02:18:39.685663       1 generic.go:40] ConsoleQuickStart /explore-pipelines not found, creating
I0610 02:18:39.723735       1 generic.go:40] ConsoleQuickStart /install-serverless not found, creating
I0610 02:18:39.771458       1 generic.go:40] ConsoleQuickStart /manage-helm-repos not found, creating
I0610 02:18:39.823625       1 generic.go:40] ConsoleQuickStart /monitor-sampleapp not found, creating
I0610 02:18:39.879209       1 generic.go:40] ConsoleQuickStart /node-with-s2i not found, creating
I0610 02:18:39.920251       1 generic.go:40] ConsoleQuickStart /ocs-install-tour not found, creating
I0610 02:18:39.971311       1 generic.go:40] ConsoleQuickStart /quarkus-with-helm not found, creating
I0610 02:18:40.023886       1 generic.go:40] ConsoleQuickStart /quarkus-with-s2i not found, creating
I0610 02:18:40.071456       1 generic.go:40] ConsoleQuickStart /sample-application not found, creating
I0610 02:18:40.125975       1 generic.go:40] ConsoleQuickStart /spring-with-s2i not found, creating
I0610 02:21:40.539565       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 02:25:20.053447       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 02:29:45.418338       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 02:29:53.771669       1 generic.go:40] FlowSchema /openshift-oauth-apiserver-sar not found, creating
I0610 02:29:54.868009       1 generic.go:40] FlowSchema /openshift-oauth-apiserver not found, creating
I0610 02:29:55.892133       1 generic.go:40] FlowSchema /openshift-authentication-operator not found, creating
I0610 02:29:56.592455       1 generic.go:40] FlowSchema /openshift-oauth-server not found, creating
I0610 02:32:50.086724       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 02:36:06.305651       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 02:39:22.525605       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 02:42:38.762699       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 02:45:54.984293       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 02:49:11.243045       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 02:52:27.476287       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 02:55:43.783153       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 02:59:00.002695       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 03:02:16.402442       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 03:05:32.614572       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 03:08:48.832720       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 03:12:05.075710       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 03:15:21.296072       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 03:18:37.607295       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 03:21:53.820533       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 03:25:10.041747       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 03:28:26.278980       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 03:31:42.497175       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 03:34:58.733902       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 03:38:15.149093       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 03:41:31.369113       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{

The CVO logs when objects are created, and when they are updated due to a diff. But most objects do not get the namespace logged.
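A quick way to tell a benign resync from a hotloop in output like the above is the spacing of the entries. A sketch that computes the gap between successive updates of the PriorityClass (sample lines copied from the log in this bug; in practice, pipe a saved `oc logs` capture):

```shell
# Sketch: extract timestamps of repeated "Updating PriorityClass" entries
# and print the gap between successive updates, to distinguish the normal
# sync interval (minutes) from a hotloop (seconds).
cat > /tmp/cvo-verify.log <<'EOF'
I0610 02:18:26.065199 1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 02:21:40.539565 1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
I0610 02:25:20.053447 1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff: &unstructured.Unstructured{
EOF
# Field 2 is the HH:MM:SS.ffffff timestamp; convert to seconds and diff.
grep 'Updating PriorityClass' /tmp/cvo-verify.log |
  awk '{split(substr($2, 1, 8), t, ":"); s = t[1]*3600 + t[2]*60 + t[3];
        if (prev) print s - prev " seconds since previous update"; prev = s}'
```

Gaps of roughly three minutes, as here, are the ordinary reconcile interval rather than a tight loop; the repetition still shows the object is being stomped on every pass.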
Verifying with deployment deletion:

# oc -n openshift-operator-lifecycle-manager delete deployment catalog-operator
deployment.apps "catalog-operator" deleted
# oc -n openshift-cluster-version logs cluster-version-operator-66b66cf4bf-blrdc | grep catalog-operator | grep -i "creating\|updating"
I0610 06:58:11.413077       1 apps.go:21] Deployment openshift-operator-lifecycle-manager/catalog-operator not found, creating

Deployment creation is logged.

Vadim, I found a tiny issue and fixed it in https://github.com/openshift/cluster-version-operator/pull/588. Please help review. Thanks.

Please help confirm whether the namespace issue mentioned in comment#17 is expected. Thanks.
(In reply to Yang Yang from comment #18)
> Vadim, I found a tiny issue and get it fixed in
> https://github.com/openshift/cluster-version-operator/pull/588. Please help
> review. Thanks
>
> Please help confirm if the namespace issue mentioned in comment#17 is
> expected. Thanks

Good catch, thanks!
> > Please help confirm if the namespace issue mentioned in comment#17 is
> > expected. Thanks

Some objects don't have a dedicated handler in the CVO code, so we use the "unstructured" handler. Those resource types may or may not be namespaced, so we print the namespace in the log (it may, however, be empty). To keep the code clean I didn't add a special case for an empty namespace; it doesn't matter when it comes to logging hotlooping. This is why some log entries have `/<objectname>` without the namespace set.
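The key format Vadim describes can be shown with a trivial sketch (the `format_key` helper is hypothetical, just mirroring the `<namespace>/<name>` key the unstructured handler prints):

```shell
# Sketch: the log key is always "<namespace>/<name>", so cluster-scoped
# objects, whose namespace is empty, come out as "/<name>".
format_key() { printf '%s/%s\n' "$1" "$2"; }
format_key "openshift-machine-api" "machine-api-termination-handler"  # namespaced
format_key "" "openshift-user-critical"                               # cluster-scoped
```

This is harmless for the hotloop-detection use case, since the key is still unique per object.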
Thanks, Vadim. Based on comment#17 and comment#18, moving this to the verified state.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438