Bug 1976765

Summary: AlertmanagerMembersInconsistent fires too quickly, causing serial-test noise
Product: OpenShift Container Platform
Reporter: Filip Petkovski <fpetkovs>
Component: Monitoring
Assignee: Jayapriya Pai <janantha>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: medium
Docs Contact:
Priority: medium
Version: 4.8
CC: alegrand, amcdermo, anpicker, aos-bugs, ccoleman, dgrisonn, erooth, juzhao, kakkoyun, lcosic, pkrupa, pnair, rgudimet, spasquie, wking
Target Milestone: ---
Target Release: 4.8.z
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: 1936919
Environment: [Late] Alerts shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Suite:openshift/conformance/parallel]
Last Closed: 2021-08-16 18:32:12 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1936919
Bug Blocks:

Comment 1 W. Trevor King 2021-07-14 16:37:07 UTC
I'm updating the bug title (and moving the test-case name into the environment field for Sippy) to make it clearer what is being fixed here.

Comment 5 Junqi Zhao 2021-08-10 05:46:45 UTC
Checked the 4.8 periodic CI jobs from the last 48 hours; no AlertmanagerMembersInconsistent alert is firing:
https://search.ci.openshift.org/?search=AlertmanagerMembersInconsistent&maxAge=48h&context=1&type=bug%2Bjunit&name=periodic-ci-openshift-release-master-ci-4.8.*&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

    - alert: AlertmanagerMembersInconsistent
      annotations:
        description: Alertmanager {{ $labels.namespace }}/{{ $labels.pod}} has only
          found {{ $value }} members of the {{$labels.job}} cluster.
        summary: A member of an Alertmanager cluster has not found all other cluster
          members.
      expr: |
        # Without max_over_time, failed scrapes could create false negatives, see
        # https://www.robustperception.io/alerting-on-gauges-in-prometheus-2-0 for details.
          max_over_time(alertmanager_cluster_members{job="alertmanager-main",namespace="openshift-monitoring"}[5m])
        < on (namespace,service) group_left
          count by (namespace,service) (max_over_time(alertmanager_cluster_members{job="alertmanager-main",namespace="openshift-monitoring"}[5m]))
      for: 15m
      labels:
        severity: critical
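
To illustrate how the rule above behaves, here is a minimal Python sketch. It is a simplified model, not Prometheus's implementation: `inconsistent_pods` mimics the `expr` (each pod's highest reported member count in the window is compared with the number of pods that reported at all), and `first_firing_time` mimics the `for:` clause (the condition must hold at every evaluation for the whole duration before the alert moves from pending to firing). The function names, the 30-second evaluation interval, the sample data, and the shorter 5-minute `for` value are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Sequence


def inconsistent_pods(window: Dict[str, List[int]]) -> List[str]:
    """Pods whose max reported member count in the window is below the
    number of pods that reported any sample (simplified model of expr)."""
    reporting = {pod: vals for pod, vals in window.items() if vals}
    expected = len(reporting)  # count by (namespace,service) (...)
    return sorted(pod for pod, vals in reporting.items() if max(vals) < expected)


def first_firing_time(condition: Sequence[bool], interval_s: int,
                      for_s: int) -> Optional[int]:
    """Seconds from the first evaluation at which the alert would fire,
    or None if the condition never holds continuously for for_s."""
    active_since: Optional[int] = None
    for i, holds in enumerate(condition):
        now = i * interval_s
        if not holds:
            active_since = None  # condition cleared: pending state resets
            continue
        if active_since is None:
            active_since = now  # alert becomes pending
        if now - active_since >= for_s:
            return now  # pending long enough: alert fires
    return None


# Hypothetical window: pod -2 only ever saw two of the three members.
window = {"alertmanager-main-0": [3, 3], "alertmanager-main-1": [3, 3],
          "alertmanager-main-2": [2, 2]}
print(inconsistent_pods(window))

interval = 30                       # assumed evaluation interval (seconds)
blip = [True] * 24 + [False] * 10   # inconsistent for about 12 minutes
outage = [True] * 40                # inconsistent for the whole run

print(first_firing_time(blip, interval, for_s=15 * 60))
print(first_firing_time(outage, interval, for_s=15 * 60))
print(first_firing_time(blip, interval, for_s=5 * 60))  # hypothetical shorter for
```

Under these assumptions, a short-lived inconsistency (for example during a rollout in a serial test) never reaches the firing state with `for: 15m`, while a sustained one still does; a shorter `for` would fire on the same blip.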

Comment 7 errata-xmlrpc 2021-08-16 18:32:12 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.8.5 security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:3121