Bug 1881246

Summary:	Overlapping, divergent PrometheusRule manifests
Product:	OpenShift Container Platform	Reporter:	W. Trevor King <wking>
Component:	kube-controller-manager	Assignee:	Maciej Szulik <maszulik>
Status:	CLOSED ERRATA	QA Contact:	RamaKasturi <knarra>
Severity:	low	Docs Contact:
Priority:	medium
Version:	4.6	CC:	aos-bugs, knarra, mfojtik
Target Milestone:	---
Target Release:	4.6.0
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2020-10-27 16:43:44 UTC	Type:	Bug

Description W. Trevor King 2020-09-21 23:50:54 UTC

Description of problem:

From [1]:

  $ oc adm release extract --to manifests quay.io/openshift-release-dev/ocp-release:4.6.0-fc.6-x86_64
  Extracted release payload from digest sha256:933f3d6f61ddec9f3b88a0932b47c438d7dfc15ff1873ab176284b66c9cff76e created at 2020-09-14T21:50:05Z
  $ diff -u manifests/0000_90_kube-controller-manager-operator_05_alert-pdb.yaml manifests/0000_90_kube-controller-manager-operator_05_alert-kcm-down.yaml
  --- manifests/0000_90_kube-controller-manager-operator_05_alert-pdb.yaml	2020-09-12 05:33:59.000000000 -0700
  +++ manifests/0000_90_kube-controller-manager-operator_05_alert-kcm-down.yaml	2020-09-12 05:33:59.000000000 -0700
  @@ -9,19 +9,11 @@
     groups:
       - name: cluster-version
         rules:
  -        - alert: PodDisruptionBudgetAtLimit
  +        - alert: KubeControllerManagerDown
             annotations:
  -            message: The pod disruption budget is preventing further disruption to pods because it is at the minimum allowed level.
  +            message: KubeControllerManager has disappeared from Prometheus target discovery.
             expr: |
  -            max by(namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_expected_pods == kube_poddisruptionbudget_status_desired_healthy)
  -          for: 15m
  -          labels:
  -            severity: warning
  -        - alert: PodDisruptionBudgetLimit
  -          annotations:
  -            message: The pod disruption budget is below the minimum number allowed pods.
  -          expr: |
  -            max by (namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_expected_pods < kube_poddisruptionbudget_status_desired_healthy)
  +            absent(up{job="kube-controller-manager"} == 1)
             for: 15m
             labels:
               severity: critical

I don't understand why [2,3] use the same kind/namespace/name with different spec.groups; maybe that's OK for PrometheusRule?  We've had the two separate files since [4], and the two separate YAML entries since [5].  Is the overlapping kind/namespace/name intentional?  Or can we collapse them into a single kind/namespace/name entry with multiple groups?

[1]: https://bugzilla.redhat.com/show_bug.cgi?id=1879184#c2
[2]: https://github.com/openshift/cluster-kube-controller-manager-operator/blob/9773980cbca12bfb0d5e719c13fb81b0de352efb/manifests/0000_90_kube-controller-manager-operator_05_alert-kcm-down.yaml
[3]: https://github.com/openshift/cluster-kube-controller-manager-operator/blob/9773980cbca12bfb0d5e719c13fb81b0de352efb/manifests/0000_90_kube-controller-manager-operator_05_alert-pdb.yaml
[4]: https://github.com/openshift/cluster-kube-controller-manager-operator/commit/326750ade37b48ae282074ee3cf05aef71ea5cd6
[5]: https://github.com/openshift/cluster-kube-controller-manager-operator/commit/f072caf44eb237f61c4de157bf8fe39f093f681b
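
For illustration only, here is a minimal Python sketch of how such overlaps could be spotted across an extracted manifests directory. It assumes the manifests have already been parsed into dicts (for example with a YAML library); the helper names resource_key and find_overlaps are hypothetical, and the two sample dicts are hand-written stand-ins for [2,3] whose name and namespace are assumed to match the merged manifest shown later in this bug.

```python
from collections import defaultdict


def resource_key(manifest):
    """Return the (kind, namespace, name) identity of a parsed manifest."""
    metadata = manifest.get("metadata", {})
    return (manifest.get("kind"), metadata.get("namespace"), metadata.get("name"))


def find_overlaps(manifests):
    """Map each identity claimed by more than one file to the files claiming it.

    `manifests` is an iterable of (filename, parsed_manifest_dict) pairs.
    """
    files_by_key = defaultdict(list)
    for filename, manifest in manifests:
        files_by_key[resource_key(manifest)].append(filename)
    return {key: sorted(files) for key, files in files_by_key.items() if len(files) > 1}


# Hypothetical stand-ins for the two manifests, trimmed to identity fields.
# The name and namespace values are assumptions, not copied from [2,3].
shared_metadata = {
    "name": "kube-controller-manager-operator",
    "namespace": "openshift-kube-controller-manager-operator",
}
sample = [
    ("0000_90_kube-controller-manager-operator_05_alert-pdb.yaml",
     {"kind": "PrometheusRule", "metadata": dict(shared_metadata)}),
    ("0000_90_kube-controller-manager-operator_05_alert-kcm-down.yaml",
     {"kind": "PrometheusRule", "metadata": dict(shared_metadata)}),
]

overlaps = find_overlaps(sample)
for key, files in overlaps.items():
    print(key, "is declared by", files)
```

Any key reported by find_overlaps is an object that more than one manifest file tries to own, which is the situation this bug describes.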

Comment 2 RamaKasturi 2020-09-24 12:55:27 UTC

Verified with the payload below. After running oc adm release extract, I see only a single file, and it contains all of the content from the PR.

[ramakasturinarra@dhcp35-60 openshift-client-linux-4.6.0-0.nightly-2020-09-24-015627]$ ./oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.6.0-0.nightly-2020-09-24-015627   True        False         5h18m   Cluster version is 4.6.0-0.nightly-2020-09-24-015627
[ramakasturinarra@dhcp35-60 openshift-client-linux-4.6.0-0.nightly-2020-09-24-015627]$ ./oc version
Client Version: 4.6.0-0.nightly-2020-09-24-015627
Server Version: 4.6.0-0.nightly-2020-09-24-015627
Kubernetes Version: v1.19.0+fff8183


[ramakasturinarra@dhcp35-60 manifests]$ cat 0000_90_kube-controller-manager-operator_05_alerts.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kube-controller-manager-operator
  namespace: openshift-kube-controller-manager-operator
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
    exclude.release.openshift.io/internal-openshift-hosted: "true"
spec:
  groups:
    - name: cluster-version
      rules:
        - alert: KubeControllerManagerDown
          annotations:
            message: KubeControllerManager has disappeared from Prometheus target discovery.
          expr: |
            absent(up{job="kube-controller-manager"} == 1)
          for: 15m
          labels:
            severity: critical
        - alert: PodDisruptionBudgetAtLimit
          annotations:
            message: The pod disruption budget is preventing further disruption to pods because it is at the minimum allowed level.
          expr: |
            max by(namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_expected_pods == kube_poddisruptionbudget_status_desired_healthy)
          for: 15m
          labels:
            severity: warning
        - alert: PodDisruptionBudgetLimit
          annotations:
            message: The pod disruption budget is below the minimum number allowed pods.
          expr: |
            max by (namespace, poddisruptionbudget) (kube_poddisruptionbudget_status_expected_pods < kube_poddisruptionbudget_status_desired_healthy)
          for: 15m
          labels:
            severity: critical


Based on the above, moving the bug to the verified state.
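
As a rough illustration of the collapse described in this bug, the Python sketch below merges the groups of several PrometheusRule specs, combining groups that share a name. It is not how the fix was actually produced; merge_rule_groups is a hypothetical helper, and the sample specs are trimmed to alert names only.

```python
def merge_rule_groups(specs):
    """Merge the `groups` of several PrometheusRule specs into one spec.

    Groups sharing a name are combined; rules keep first-seen order.
    """
    merged = {}  # group name -> list of rules (dicts preserve insertion order)
    for spec in specs:
        for group in spec.get("groups", []):
            merged.setdefault(group["name"], []).extend(group.get("rules", []))
    return {"groups": [{"name": name, "rules": rules} for name, rules in merged.items()]}


# Hypothetical, trimmed versions of the two original specs.
kcm_down_spec = {"groups": [{"name": "cluster-version", "rules": [
    {"alert": "KubeControllerManagerDown"},
]}]}
pdb_spec = {"groups": [{"name": "cluster-version", "rules": [
    {"alert": "PodDisruptionBudgetAtLimit"},
    {"alert": "PodDisruptionBudgetLimit"},
]}]}

combined = merge_rule_groups([kcm_down_spec, pdb_spec])
alerts = [rule["alert"] for rule in combined["groups"][0]["rules"]]
print(alerts)
```

With the inputs in this order, the result is a single cluster-version group holding all three alerts, in the same order as the verified manifest.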

Comment 5 errata-xmlrpc 2020-10-27 16:43:44 UTC

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images) and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196