Bug 2012426

Summary: ThanosSidecarBucketOperationsFailed/ThanosSidecarUnhealthy alerts don't have namespace label
Product: OpenShift Container Platform
Reporter: Junqi Zhao <juzhao>
Component: Monitoring
Assignee: Arunprasad Rajkumar <arajkuma>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: low
Docs Contact:
Priority: low
Version: 4.9
CC: amuller, anpicker, aos-bugs, arajkuma, erooth
Target Milestone: ---
Target Release: 4.10.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-03-10 16:18:42 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Junqi Zhao 2021-10-09 09:10:26 UTC
Description of problem:
While reviewing the 4.9 release notes (https://github.com/openshift/openshift-docs/pull/37264), I found that the ThanosSidecarBucketOperationsFailed and ThanosSidecarUnhealthy alerts do not have a namespace label:
*************************************
      - alert: ThanosSidecarBucketOperationsFailed
        annotations:
          description: Thanos Sidecar {{$labels.instance}} bucket operations are failing
          summary: Thanos Sidecar bucket operations are failing
        expr: |
          sum by (job, instance) (rate(thanos_objstore_bucket_operation_failures_total{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}[5m])) > 0
        for: 1h
        labels:
          severity: warning
      - alert: ThanosSidecarUnhealthy
        annotations:
          description: Thanos Sidecar {{$labels.instance}} is unhealthy for more than
            {{$value}} seconds.
          summary: Thanos Sidecar is unhealthy.
        expr: |
          time() - max by (job, instance) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) >= 240
        for: 1h
        labels:
          severity: warning
*************************************
For example, querying the expr for ThanosSidecarUnhealthy (without the threshold):
time() - max by (job, instance) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})
The result does not include the namespace label:
{instance="10.129.2.10:10902", job="prometheus-k8s-thanos-sidecar"}  12.650763988494873
{instance="10.131.0.11:10902", job="prometheus-k8s-thanos-sidecar"}  15.16017460823059

We could add the namespace label to the by clause of the expr, that is:
time() - max by (job, instance, namespace) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"})
The result then includes the namespace label:
{instance="10.129.2.10:10902", job="prometheus-k8s-thanos-sidecar", namespace="openshift-monitoring"}  38.67030143737793
{instance="10.131.0.11:10902", job="prometheus-k8s-thanos-sidecar", namespace="openshift-monitoring"}  41.178159952163696

The same applies to the ThanosSidecarBucketOperationsFailed alert.
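
A sketch of how both rules might look with namespace added to the by clause. This only applies the change proposed above to the rules as quoted in this report; it is not necessarily what the shipped fix looks like, and everything other than the by clause is left unchanged.
*************************************
      - alert: ThanosSidecarBucketOperationsFailed
        annotations:
          description: Thanos Sidecar {{$labels.instance}} bucket operations are failing
          summary: Thanos Sidecar bucket operations are failing
        # proposed change: aggregate by namespace as well, so the label survives the sum
        expr: |
          sum by (job, instance, namespace) (rate(thanos_objstore_bucket_operation_failures_total{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}[5m])) > 0
        for: 1h
        labels:
          severity: warning
      - alert: ThanosSidecarUnhealthy
        annotations:
          description: Thanos Sidecar {{$labels.instance}} is unhealthy for more than
            {{$value}} seconds.
          summary: Thanos Sidecar is unhealthy.
        # proposed change: aggregate by namespace as well, so the label survives the max
        expr: |
          time() - max by (job, instance, namespace) (thanos_sidecar_last_heartbeat_success_time_seconds{job=~"prometheus-(k8s|user-workload)-thanos-sidecar"}) >= 240
        for: 1h
        labels:
          severity: warning
*************************************
Because sum by (...) and max by (...) keep only the labels listed in the by clause, any label that is not listed there, such as namespace, is dropped from the result and therefore from the alert.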

Version-Release number of selected component (if applicable):
4.9.0-0.nightly-2021-10-08-093633

How reproducible:
always

Steps to Reproduce:
1. See the description.

Actual results:
The ThanosSidecarBucketOperationsFailed and ThanosSidecarUnhealthy alerts do not have a namespace label.

Expected results:
The ThanosSidecarBucketOperationsFailed and ThanosSidecarUnhealthy alerts have a namespace label.

Additional info:

Comment 11 errata-xmlrpc 2022-03-10 16:18:42 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056