Bug 2010354

Summary: OpenShift Alerting Rules Style-Guide Compliance
Product: OpenShift Container Platform
Reporter: Brad Ison <brad.ison>
Component: kube-scheduler
Assignee: Jan Chaloupka <jchaloup>
Status: CLOSED ERRATA
QA Contact: RamaKasturi <knarra>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.10
CC: aos-bugs, mfojtik
Target Milestone: ---
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-03-10 16:16:32 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1992536

Description Brad Ison 2021-10-04 13:52:58 UTC
Hello,

The OpenShift Monitoring Team has published a set of guidelines for
writing alerting rules in OpenShift, including a basic style guide.
You can find these here:

  https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md
  https://github.com/openshift/enhancements/blob/master/enhancements/monitoring/alerting-consistency.md#style-guide

A subset of these are now being enforced in OpenShift End-to-End
tests [1], with temporary exceptions for existing non-compliant rules.

This component was found to have the following issues:

* Alerts without summary and/or description annotations:

  - KubeSchedulerDown
  - SchedulerLegacyPolicySet

Alerts MUST include summary and description annotations.

Think of summary as the first line of a commit message, or an email
subject line. It should be brief but informative. The description is
the longer, more detailed explanation of the alert.

The enhancement document linked above has examples of alerts with
these annotations.
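
As a rough illustration of the annotation requirement, the check could be
sketched in Python as below. This is not the End-to-End test itself: the
helper name missing_annotations and the dict layout (mirroring one entry
of a rule group) are assumptions made for the example. The compliant
rule uses the values shown in the verification comment of this report;
the non-compliant rule is hypothetical.

```python
# Hypothetical helper, not part of the OpenShift test suite.
REQUIRED_ANNOTATIONS = ("summary", "description")


def missing_annotations(rule):
    """Return the required annotation names that are absent or empty."""
    annotations = rule.get("annotations") or {}
    return [
        name
        for name in REQUIRED_ANNOTATIONS
        if not str(annotations.get(name, "")).strip()
    ]


# Values taken from the verification comment in this report.
compliant = {
    "alert": "SchedulerLegacyPolicySet",
    "expr": "cluster_legacy_scheduler_policy > 0",
    "for": "1h",
    "labels": {"severity": "warning"},
    "annotations": {
        "summary": "Legacy scheduler policy API in use by the scheduler.",
        "description": (
            "The scheduler is currently configured to use a legacy "
            "scheduler policy API. Use of the policy API is deprecated "
            "and removed in 4.10."
        ),
    },
}

# Hypothetical rule with no annotations at all.
non_compliant = {
    "alert": "ExampleAlert",
    "expr": "vector(1)",
    "labels": {"severity": "warning"},
}

print(missing_annotations(compliant))      # nothing missing
print(missing_annotations(non_compliant))  # both annotations missing
```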


* Alerts found to not include a namespace label:

  - KubeSchedulerDown

Alerts SHOULD include a namespace label indicating the alert's source.

This requirement originally comes from our SRE team, as they use the
namespace label as the first means of routing alerts. Many alerts
already include a namespace label as a result of the PromQL
expressions used, others may require a static label.
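
A minimal sketch of the static-label case, assuming the rule is held as a
Python dict: the helper name with_static_namespace is made up for this
example, and the input rule is a guess at the pre-fix shape of
KubeSchedulerDown. The result of an absent(...) expression like this one
carries no namespace label of its own, which is why a static label is
the likely fix here.

```python
import copy


def with_static_namespace(rule, namespace):
    """Return a copy of the rule with a static namespace label.

    An existing namespace label is kept as-is.
    """
    updated = copy.deepcopy(rule)
    labels = updated.get("labels") or {}
    labels.setdefault("namespace", namespace)
    updated["labels"] = labels
    return updated


# Assumed shape of the rule before the fix (annotations omitted).
rule = {
    "alert": "KubeSchedulerDown",
    "expr": 'absent(up{job="scheduler"} == 1)',
    "for": "15m",
    "labels": {"severity": "critical"},
}

fixed = with_static_namespace(rule, "openshift-kube-scheduler")
print(fixed["labels"])
```

The original dict is left untouched, so the helper can be applied to a
whole list of rules without side effects.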

Example of a change to PromQL to include a namespace label:

  https://github.com/openshift/cluster-monitoring-operator/commit/52d1f05#diff-9024dcef0fd244c0267c46858da24fbd1f45633515fafae0f98781b20805ff1dL22-R22

Example of adding a static namespace label:

  https://github.com/openshift/cluster-monitoring-operator/commit/52d1f05#diff-352702e71122d34a1be04c0588356cd8cb8a10df547f1c3c39fec18fa75b1593R304

If you have questions about how best to modify your alerting rules
to include a namespace label, please reach out to the OpenShift
Monitoring Team in the #forum-monitoring channel on Slack, or on our
mailing list: team-monitoring

Thank you!

Repo: openshift/cluster-kube-scheduler-operator

[1]: https://github.com/openshift/origin/commit/097e7a6

Comment 3 RamaKasturi 2021-10-26 16:03:52 UTC
Verified with the build below. SchedulerLegacyPolicySet and KubeSchedulerDown now adhere to the OpenShift alerting rules style guide.

[knarra@knarra openshift-client-linux-4.10.0-0.nightly-2021-10-25-190146]$ ./oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.10.0-0.nightly-2021-10-25-190146   True        False         8h      Error while reconciling 4.10.0-0.nightly-2021-10-25-190146: the cluster operator monitoring has not yet successfully rolled out

Performed the steps below to verify the bug:
=============================================
1) oc create configmap -n openshift-config --from-file=policy.cfg scheduler-policy
2) oc patch Scheduler cluster --type=merge -p '{"spec":{"policy":{"name":"scheduler-policy"}}}'
3) Browse to the Prometheus URL
4) Check that SchedulerLegacyPolicySet has summary and description annotations
name: SchedulerLegacyPolicySet
expr: cluster_legacy_scheduler_policy > 0
for: 1h
labels:
  severity: warning
annotations:
  description: The scheduler is currently configured to use a legacy scheduler policy API. Use of the policy API is deprecated and removed in 4.10.
  summary: Legacy scheduler policy API in use by the scheduler.
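
The same check could also be scripted instead of reading the rule in the
Prometheus UI. The sketch below assumes the JSON body of a
GET /api/v1/rules request against the Prometheus URL has already been
fetched and parsed. The function names are made up for this example, and
the sample payload is trimmed to the fields used, with a hypothetical
group name.

```python
def find_alerting_rule(rules_response, alert_name):
    """Return the first alerting rule with the given name, or None."""
    groups = rules_response.get("data", {}).get("groups", [])
    for group in groups:
        for rule in group.get("rules", []):
            if rule.get("type") == "alerting" and rule.get("name") == alert_name:
                return rule
    return None


def style_guide_problems(rule, require_namespace=False):
    """List style-guide problems for one rule (empty list means none)."""
    problems = []
    annotations = rule.get("annotations") or {}
    for name in ("summary", "description"):
        if not annotations.get(name):
            problems.append("missing annotation: " + name)
    if require_namespace and not (rule.get("labels") or {}).get("namespace"):
        problems.append("missing label: namespace")
    return problems


# Trimmed sample response; the group name is hypothetical.
sample_response = {
    "status": "success",
    "data": {
        "groups": [
            {
                "name": "example-group",
                "rules": [
                    {
                        "type": "alerting",
                        "name": "SchedulerLegacyPolicySet",
                        "query": "cluster_legacy_scheduler_policy > 0",
                        "labels": {"severity": "warning"},
                        "annotations": {
                            "description": "The scheduler is currently configured to use a legacy scheduler policy API. Use of the policy API is deprecated and removed in 4.10.",
                            "summary": "Legacy scheduler policy API in use by the scheduler.",
                        },
                    }
                ],
            }
        ]
    },
}

found = find_alerting_rule(sample_response, "SchedulerLegacyPolicySet")
print(style_guide_problems(found))
```

Passing require_namespace=True also flags a rule that has no static
namespace label in its rule definition.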

For KubeSchedulerDown Scenario:
=============================
1) Log in to a master node via oc debug node/<master_node>
2) Run mv /etc/kubernetes/manifests/kube-scheduler-pod.yaml /home
3) Repeat steps 1 & 2 for all the master nodes in the cluster
4) Wait for 15 minutes
5) Browse to the Prometheus URL
6) Check that KubeSchedulerDown has summary and description annotations and a namespace label
name: KubeSchedulerDown
expr: absent(up{job="scheduler"} == 1)
for: 15m
labels:
  namespace: openshift-kube-scheduler
  severity: critical
annotations:
  description: KubeScheduler has disappeared from Prometheus target discovery.
  summary: Target disappeared from Prometheus target discovery.
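
A small sketch of why the scheduler manifest has to be moved on every
master node: the expression absent(up{job="scheduler"} == 1) only
returns a result when no scheduler target at all reports up == 1. The
function below is a simplified model of that condition, not PromQL
itself; its name and the sample values are made up for this example.

```python
def kube_scheduler_down_condition(up_samples):
    """Model absent(up{job="scheduler"} == 1).

    up_samples holds the current value of the up metric for each
    scheduler target. The condition is true only when none of them is 1,
    including the case where no target is discovered at all.
    """
    return not any(value == 1 for value in up_samples)


print(kube_scheduler_down_condition([1, 1, 1]))  # all schedulers up
print(kube_scheduler_down_condition([0, 1, 0]))  # one scheduler still up
print(kube_scheduler_down_condition([0, 0, 0]))  # every scheduler down
print(kube_scheduler_down_condition([]))         # no targets discovered
```

Because the rule has for: 15m, the condition must hold continuously for
15 minutes before the alert fires, which matches the wait in step 4.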

Based on the above, moving the bug to verified state.

Comment 8 errata-xmlrpc 2022-03-10 16:16:32 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056