Bug 1879184 - CVO must detect or log resource hotloops
Summary: CVO must detect or log resource hotloops
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.8.0
Assignee: Vadim Rutkovsky
QA Contact: Yang Yang
URL:
Whiteboard:
Depends On:
Blocks: 1970272
 
Reported: 2020-09-15 15:55 UTC by Stefan Schimanski
Modified: 2021-07-27 22:33 UTC (History)
10 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:32:55 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-version-operator pull 561 0 None open Bug 1879184: Log object updates and show existing/required diff 2021-05-31 16:23:15 UTC
Github openshift cluster-version-operator pull 588 0 None open Bug 1879184: Fixing the log message in rbac.go for clusterrole 2021-06-10 17:15:52 UTC
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:33:19 UTC

Description Stefan Schimanski 2020-09-15 15:55:54 UTC
Continuously recreating objects because of buggy apply logic in the CVO can have severe consequences for performance (compare https://bugzilla.redhat.com/show_bug.cgi?id=1795412) and availability (compare https://bugzilla.redhat.com/show_bug.cgi?id=1863011).

In https://github.com/openshift/cluster-version-operator/pull/258 a generic way to log these hotlooping resources was proposed, but it was turned down (https://github.com/openshift/cluster-version-operator/pull/258#issuecomment-544964000) and replaced with a CRD-specific solution.

Now we have had another case (https://bugzilla.redhat.com/show_bug.cgi?id=1863011) and have spent more than a person-week of work triaging the issue across both the api-auth and monitoring teams. Instead, the CVO should detect this and make it easy for developers to notice. In the CVO logs this is completely invisible.

Comment 1 Sergiusz Urbaniak 2020-09-15 15:59:34 UTC
We have the same issue in cluster-monitoring-operator so https://bugzilla.redhat.com/show_bug.cgi?id=1856189 is related.

Comment 2 W. Trevor King 2020-09-16 05:04:27 UTC
The fix for bug 1795412 was removing overlapping manifests for the same resource [1].  Unfortunately, the fix from 2020-02-14 landed before we were cutting any 4.4 RCs:

  $ oc adm release info quay.io/openshift-release-dev/ocp-release:4.4.0-rc.0-x86_64 | head -n3
  Name:      4.4.0-rc.0
  Digest:    sha256:746567f5dd9f5532a2f15180dadd99b4fed11f46e92500637fe6bfb61236a628
  Created:   2020-03-11T17:57:04Z

so I can't easily extract manifests and show the overlap there.  But looking for similar overlap in recent releases:

  $ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.6.0-fc.6-x86_64
  Extracted release payload from digest sha256:933f3d6f61ddec9f3b88a0932b47c438d7dfc15ff1873ab176284b66c9cff76e created at 2020-09-14T21:50:05Z
  $ for X in manifests/*.yaml; do JSON="$(yaml2json < "${X}")"; if test "x${JSON:0:1}" = 'x['; then echo "${JSON}" | jq -c '.[]'; elif test "$(echo "${JSON}" | jq .items)" != 'null'; then echo "${JSON}" | jq -c '.items[]'; else echo "${JSON}" | jq -c .; fi; done | grep -v '^""$' | jq -r '.kind + " " + .apiVersion + " " + ((.metadata // {}).namespace // "-") + " " + ((.metadata // {}).name // "-")' | sort | uniq -c | sort -n | tail -n4
        1 Storage operator.openshift.io/v1 - cluster
        2 Namespace v1 - openshift-cluster-storage-operator
        2 Namespace v1 - openshift-machine-api
        2 PrometheusRule monitoring.coreos.com/v1 openshift-kube-controller-manager-operator kube-controller-manager-operator
  $ grep -r openshift-cluster-storage-operator manifests | grep 'namespace\.yaml'
  manifests/0000_50_cluster-csi-snapshot-controller-operator_00_namespace.yaml:  name: openshift-cluster-storage-operator
  manifests/0000_50_cluster-storage-operator_01_operator_namespace.yaml:  name: openshift-cluster-storage-operator
  $ grep -B10 -r openshift-machine-api manifests | grep 'kind: Namespace'
  manifests/0000_50_cluster-autoscaler-operator_00_namespace.yaml-kind: Namespace
  manifests/0000_30_machine-api-operator_00_namespace.yaml-kind: Namespace
  $ grep -B10 -r 'name: kube-controller-manager-operator' manifests | grep 'kind: PrometheusRule'
  manifests/0000_90_kube-controller-manager-operator_05_alert-pdb.yaml-kind: PrometheusRule
  manifests/0000_90_kube-controller-manager-operator_05_alert-kcm-down.yaml-kind: PrometheusRule
  $ diff -u0 <(yaml2json <manifests/0000_50_cluster-storage-operator_01_operator_namespace.yaml | jq -S .) <(yaml2json <manifests/0000_50_cluster-csi-snapshot-controller-operator_00_namespace.yaml | jq -S .)
  --- /dev/fd/63	2020-09-15 21:57:52.836562697 -0700
  +++ /dev/fd/62	2020-09-15 21:57:52.837562709 -0700
  @@ -7,0 +8,3 @@
  +    "labels": {
  +      "openshift.io/cluster-monitoring": "true"
  +    },
  $ diff -u0 <(yaml2json <manifests/0000_30_machine-api-operator_00_namespace.yaml | jq -S .) <(yaml2json <manifests/0000_50_cluster-autoscaler-operator_00_namespace.yaml | jq -S .)
  --- /dev/fd/63	2020-09-15 21:59:06.566442814 -0700
  +++ /dev/fd/62	2020-09-15 21:59:06.569442850 -0700
  @@ -5,3 +4,0 @@
  -    "annotations": {
  -      "openshift.io/node-selector": ""
  -    },
  $ diff -u manifests/0000_90_kube-controller-manager-operator_05_alert-pdb.yaml manifests/0000_90_kube-controller-manager-operator_05_alert-kcm-down.yaml
  --- manifests/0000_90_kube-controller-manager-operator_05_alert-pdb.yaml	2020-09-12 05:33:59.000000000 -0700
  +++ manifests/0000_90_kube-controller-manager-operator_05_alert-kcm-down.yaml	2020-09-12 05:33:59.000000000 -0700
  @@ -9,19 +9,11 @@
     groups:
       - name: cluster-version
         rules:
  -        - alert: PodDisruptionBudgetAtLimit
  +        - alert: KubeControllerManagerDown
             annotations:
  -            message: The pod disruption budget is preventing further disruption to pods because it is at the minimum allowed level.
  +            message: KubeControllerManager has disappeared from Prometheus target discovery.
             expr: |
  -            max by(namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_expected_pods == kube_poddisruptionbudget_status_desired_healthy)
  -          for: 15m
  -          labels:
  -            severity: warning
  -        - alert: PodDisruptionBudgetLimit
  -          annotations:
  -            message: The pod disruption budget is below the minimum number allowed pods.
  -          expr: |
  -            max by (namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_expected_pods < kube_poddisruptionbudget_status_desired_healthy)
  +            absent(up{job="kube-controller-manager"} == 1)
             for: 15m
             labels:
               severity: critical

If that sort of issue is what you're worried about, I'd rather guard against it with an 'oc adm release new ...' check, so we die before even constructing the release image, instead of bringing the cluster up and waiting for folks to go poking around in the CVO logs.  That's for CVO vs. CVO.  I'd also like to serve CVO metrics about out-of-update stomp rates, so we could detect CVO vs. other actors, but that's maybe a different RFE.

Moving to 'oc', medium/medium, and 4.7.0 as the target, because we haven't had the dup-manifest guards in 'oc adm release new ...' for many releases now.  If we get to them before code freeze, we can pull this bug back to 4.6.

[1]: https://github.com/openshift/cluster-config-operator/pull/115
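The dup-manifest guard proposed above amounts to keying every manifest by (apiVersion, kind, namespace, name) and failing the release build if any key repeats, which is what the shell pipeline earlier in this comment does by hand. A minimal sketch (the real check would parse the extracted manifest YAML; this only shows the duplicate detection):

```go
package main

import "fmt"

// manifestKey identifies a manifest object within a release payload.
type manifestKey struct {
	APIVersion, Kind, Namespace, Name string
}

// findDuplicates returns each key that appears more than once, reported a
// single time. An 'oc adm release new ...' guard could refuse to build the
// payload when this list is non-empty.
func findDuplicates(keys []manifestKey) []manifestKey {
	seen := map[manifestKey]int{}
	for _, k := range keys {
		seen[k]++
	}
	var dups []manifestKey
	for _, k := range keys {
		if seen[k] > 1 {
			seen[k] = 0 // report each duplicate key only once
			dups = append(dups, k)
		}
	}
	return dups
}

func main() {
	// The openshift-machine-api namespace overlap found above.
	keys := []manifestKey{
		{"v1", "Namespace", "", "openshift-machine-api"},
		{"v1", "Namespace", "", "openshift-machine-api"},
		{"operator.openshift.io/v1", "Storage", "", "cluster"},
	}
	for _, d := range findDuplicates(keys) {
		fmt.Printf("duplicate manifest: %s %s %s/%s\n", d.Kind, d.APIVersion, d.Namespace, d.Name)
	}
}
```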

Comment 3 W. Trevor King 2020-09-16 05:30:45 UTC
I've commented on the controller-manager overlap in [1], spun the storage namespace divergence out into bug 1879365, and I'll follow up with the machine-api folks tomorrow.

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1856189#c41

Comment 4 W. Trevor King 2020-09-16 23:17:28 UTC
This bug isn't ASSIGNED yet, so moving from Jack back to the updates team.  Should have done that when I moved it to oc, sorry.

Comment 5 W. Trevor King 2020-09-16 23:54:22 UTC
I've filed bug 1879777 for the machine-api namespace divergence.

Comment 6 Stefan Schimanski 2020-09-17 13:09:46 UTC
This is certainly not an oc bug. It is about the CVO updating objects again and again.

Comment 7 Stefan Schimanski 2020-09-17 13:13:10 UTC
To be clear: this is not an umbrella for overlapping-resource bugs like the CRD bug and https://bugzilla.redhat.com/show_bug.cgi?id=1879777. It's about a means to detect these easily, without spending hours of engineering time debugging the symptoms they cause. The fix in the CVO is obvious and an hour's work to implement.

Comment 8 W. Trevor King 2020-09-21 23:54:42 UTC
Sergiusz said the PrometheusRule overlap was unrelated to bug 1856189, so I've spun it out into a fresh bug 1881246.  I think CVO vs. * contention can be detected via tooling that consumes audit logs, and that long-term we want metrics around how frequently the CVO is updating a managed object, and am happy to leave this bug open to track those efforts.  I am not against folks adjusting CVO logging around this, but am not convinced that entries in the CVO logs are much more discoverable than entries in the audit logs, and audit logs both already exist, and would catch whoever the CVO was fighting with.
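The long-term metrics idea here boils down to a per-object update counter with kind/namespace/name labels; any object whose counter keeps climbing between reconcile cycles is hotlooping. A sketch under assumptions (the metric name is hypothetical; the CVO does not necessarily ship such a counter):

```go
package main

import "fmt"

// updateCounts tracks total rewrites per managed object. Exported as a
// Prometheus counter (e.g. a hypothetical
// cvo_applied_resource_updates_total{kind,namespace,name}), a rate() alert
// would surface hotloops without any log spelunking.
type updateCounts map[string]int

func (c updateCounts) record(kind, namespace, name string) {
	c[fmt.Sprintf("%s %s/%s", kind, namespace, name)]++
}

// over returns the objects updated more than threshold times.
func (c updateCounts) over(threshold int) []string {
	var hot []string
	for k, n := range c {
		if n > threshold {
			hot = append(hot, k)
		}
	}
	return hot
}

func main() {
	c := updateCounts{}
	for i := 0; i < 5; i++ {
		c.record("PriorityClass", "", "openshift-user-critical")
	}
	c.record("Deployment", "openshift-operator-lifecycle-manager", "catalog-operator")
	fmt.Println(c.over(3)) // only the repeatedly-stomped PriorityClass exceeds the threshold
}
```

Unlike audit logs, such a metric only sees the CVO's own writes, which is why the two approaches are complementary rather than redundant.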

Comment 9 W. Trevor King 2020-10-02 23:13:15 UTC
Hopefully we will get a Late audit guard in the next sprint, to close out this issue.

Comment 12 Lalatendu Mohanty 2021-01-05 17:44:52 UTC
Moving this to severity high as https://bugzilla.redhat.com/show_bug.cgi?id=1881481 is severity high.

Comment 14 W. Trevor King 2021-04-08 03:37:19 UTC
Can't be assigned to me when it's NEW.  Moving back to the team.

Comment 15 Yang Yang 2021-06-08 02:14:40 UTC
Reproducing with 4.8.0-fc.3

# oc -n openshift-cluster-version logs cluster-version-operator-577cf9ccc-bk89g | grep "Updating "| sort
null

CVO does not log the resources update.

Comment 17 Yang Yang 2021-06-10 06:33:26 UTC
Verifying with 4.8.0-0.nightly-2021-06-09-214128

# oc -n openshift-cluster-version logs cluster-version-operator-66b66cf4bf-blrdc | grep -i 'creating\|updating'
I0610 02:18:26.065199       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 02:18:39.527376       1 generic.go:40] MachineHealthCheck openshift-machine-api/machine-api-termination-handler not found, creating
I0610 02:18:39.568938       1 generic.go:40] ConsoleLink /openshift-blog not found, creating
I0610 02:18:39.586370       1 generic.go:40] ConsoleLink /openshift-learning-portal not found, creating
I0610 02:18:39.666343       1 generic.go:40] ConsoleQuickStart /add-healthchecks not found, creating
I0610 02:18:39.685663       1 generic.go:40] ConsoleQuickStart /explore-pipelines not found, creating
I0610 02:18:39.723735       1 generic.go:40] ConsoleQuickStart /install-serverless not found, creating
I0610 02:18:39.771458       1 generic.go:40] ConsoleQuickStart /manage-helm-repos not found, creating
I0610 02:18:39.823625       1 generic.go:40] ConsoleQuickStart /monitor-sampleapp not found, creating
I0610 02:18:39.879209       1 generic.go:40] ConsoleQuickStart /node-with-s2i not found, creating
I0610 02:18:39.920251       1 generic.go:40] ConsoleQuickStart /ocs-install-tour not found, creating
I0610 02:18:39.971311       1 generic.go:40] ConsoleQuickStart /quarkus-with-helm not found, creating
I0610 02:18:40.023886       1 generic.go:40] ConsoleQuickStart /quarkus-with-s2i not found, creating
I0610 02:18:40.071456       1 generic.go:40] ConsoleQuickStart /sample-application not found, creating
I0610 02:18:40.125975       1 generic.go:40] ConsoleQuickStart /spring-with-s2i not found, creating
I0610 02:21:40.539565       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 02:25:20.053447       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 02:29:45.418338       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 02:29:53.771669       1 generic.go:40] FlowSchema /openshift-oauth-apiserver-sar not found, creating
I0610 02:29:54.868009       1 generic.go:40] FlowSchema /openshift-oauth-apiserver not found, creating
I0610 02:29:55.892133       1 generic.go:40] FlowSchema /openshift-authentication-operator not found, creating
I0610 02:29:56.592455       1 generic.go:40] FlowSchema /openshift-oauth-server not found, creating
I0610 02:32:50.086724       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 02:36:06.305651       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 02:39:22.525605       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 02:42:38.762699       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 02:45:54.984293       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 02:49:11.243045       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 02:52:27.476287       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 02:55:43.783153       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 02:59:00.002695       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 03:02:16.402442       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 03:05:32.614572       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 03:08:48.832720       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 03:12:05.075710       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 03:15:21.296072       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 03:18:37.607295       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 03:21:53.820533       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 03:25:10.041747       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 03:28:26.278980       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 03:31:42.497175       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 03:34:58.733902       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 03:38:15.149093       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{
I0610 03:41:31.369113       1 generic.go:81] Updating PriorityClass /openshift-user-critical due to diff:   &unstructured.Unstructured{

The CVO logs when objects are created, and when they are updated due to a diff. But most objects do not get the namespace logged.

Comment 18 Yang Yang 2021-06-10 08:10:07 UTC
Verifying with deployment deletion

# oc -n openshift-operator-lifecycle-manager delete deployment catalog-operator
deployment.apps "catalog-operator" deleted

# oc -n openshift-cluster-version logs cluster-version-operator-66b66cf4bf-blrdc | grep catalog-operator | grep -i "creating\|updating"
I0610 06:58:11.413077       1 apps.go:21] Deployment openshift-operator-lifecycle-manager/catalog-operator not found, creating

Deployment creation is logged.

Vadim, I found a tiny issue and get it fixed in https://github.com/openshift/cluster-version-operator/pull/588. Please help review. Thanks

Please help confirm if the namespace issue mentioned in comment#17 is expected. Thanks

Comment 19 Vadim Rutkovsky 2021-06-10 17:11:14 UTC
(In reply to Yang Yang from comment #18)
> Vadim, I found a tiny issue and get it fixed in
> https://github.com/openshift/cluster-version-operator/pull/588. Please help
> review. Thanks
> 
> Please help confirm if the namespace issue mentioned in comment#17 is
> expected. Thanks

Good catch, thanks!

Comment 20 Vadim Rutkovsky 2021-06-10 17:14:34 UTC
> > Please help confirm if the namespace issue mentioned in comment#17 is
> > expected. Thanks

Some objects don't have a dedicated handler in the CVO code, so we use the "unstructured" handler. These resources may or may not be namespaced, so we print the namespace in the log (it may, however, be empty). In order to keep the code clean I didn't add a special case to handle an empty namespace - this doesn't matter when it comes to logging hotlooping.
This is why some logs have `/<objectname>` without the namespace set.
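The formatting behavior described here can be shown in a few lines. This is an illustrative sketch of the "<namespace>/<name>" convention, not the actual generic-handler code:

```go
package main

import "fmt"

// objectKey formats an object as "<namespace>/<name>", as a generic
// (unstructured) handler might for logging. Cluster-scoped objects have
// an empty namespace, so their key starts with a bare "/".
func objectKey(namespace, name string) string {
	return fmt.Sprintf("%s/%s", namespace, name)
}

func main() {
	// Namespaced object: normal-looking key.
	fmt.Println(objectKey("openshift-machine-api", "machine-api-termination-handler"))
	// Cluster-scoped object (e.g. a PriorityClass): leading slash, matching
	// the "/openshift-user-critical" entries in the verification logs above.
	fmt.Println(objectKey("", "openshift-user-critical"))
}
```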

Comment 22 Yang Yang 2021-06-11 01:55:38 UTC
Thanks Vadim.

Based on comment#17 and comment#18, moving it to verified state.

Comment 25 errata-xmlrpc 2021-07-27 22:32:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438

