Bug 2078667

Summary: periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-uwm is permfailing
Product: OpenShift Container Platform
Reporter: Ben Parees <bparees>
Component: Monitoring
Assignee: Arunprasad Rajkumar <arajkuma>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: low
Priority: low
Docs Contact:
Version: 4.6
CC: amuller, anpicker, aos-bugs, arajkuma, erooth, jfajersk, sippy
Target Milestone: ---
Keywords: Reopened
Target Release: 4.6.z
Flags: arajkuma: needinfo-
Hardware: Unspecified
OS: Unspecified
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-26 17:00:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On: 1829723
Bug Blocks:

Description Ben Parees 2022-04-26 01:12:38 UTC

The job periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-uwm is failing frequently in CI, see testgrid results:

sample failure:


   query failed: ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1: promQL query: ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1 had reported incorrect results:

tl;dr: it looks like user workload monitoring isn't being configured successfully in this job, and testing UWM appears to be the entire purpose of this job's existence.
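The failing check above asserts that no unexpected alerts are firing: it excludes a small allowlist of alert names and anything at `info` severity. A minimal Python sketch of that filter (the helper name and sample label sets are hypothetical, not from the test source; only the exclusion list comes from the query above):

```python
import re

# Exclusion list taken from the test's PromQL query:
# alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards"
# PromQL's !~ matcher is anchored, hence fullmatch below.
EXCLUDED = re.compile(
    r"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards"
)

def unexpected_alerts(alerts):
    """Return firing, non-info alerts not covered by the exclusion list.

    `alerts` mimics the label sets an ALERTS query returns from the
    Prometheus HTTP API (hypothetical sample data, not real CI output).
    """
    return [
        a for a in alerts
        if a.get("alertstate") == "firing"
        and a.get("severity") != "info"
        and not EXCLUDED.fullmatch(a.get("alertname", ""))
    ]

# Example label sets in the shape the query returns:
sample = [
    {"alertname": "Watchdog", "alertstate": "firing", "severity": "none"},
    {"alertname": "PrometheusNotIngestingSamples", "alertstate": "firing",
     "severity": "warning", "namespace": "openshift-user-workload-monitoring"},
]
print([a["alertname"] for a in unexpected_alerts(sample)])
# → ['PrometheusNotIngestingSamples']
```

Any alert left after this filter, such as PrometheusNotIngestingSamples here, fails the job.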

Comment 1 Junqi Zhao 2022-04-26 03:11:45 UTC
As seen in bug 1829723, enabling only user workload monitoring triggers the PrometheusNotIngestingSamples alert for the openshift-user-workload-monitoring project.
The fix is in 4.7 and later versions.

Comment 2 Arunprasad Rajkumar 2022-04-26 03:48:30 UTC
IIUC, this happens when UWM is enabled without being configured to scrape any target. It looks like the diff in [1] helps mitigate the alert.

[1] https://github.com/openshift/cluster-monitoring-operator/pull/1018/files#diff-dae955ba385d21a49bc2b1b5d05dfa03b60c77e1219675d54376ac0f7e9f496bR2091
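The idea behind the mitigation, as described in the comment above, can be sketched as a predicate: only alert on zero ingestion when the Prometheus instance actually has something to ingest. This is an illustrative Python sketch of that condition, not the PromQL expression from the linked PR (see [1] for the actual change):

```python
def should_fire(samples_appended_rate, scrape_targets, rule_groups):
    """Sketch of the mitigated PrometheusNotIngestingSamples condition.

    A freshly enabled UWM stack with no user targets and no user rules
    ingests nothing, but that is expected, so the alert stays silent.
    All parameter names here are illustrative.
    """
    has_work = scrape_targets > 0 or rule_groups > 0
    return samples_appended_rate <= 0 and has_work

# UWM enabled but unconfigured: no targets, no rules, no alert.
print(should_fire(samples_appended_rate=0, scrape_targets=0, rule_groups=0))
# → False
```

Under the original (unmitigated) expression, the zero ingestion rate alone was enough to fire the alert, which is exactly the situation this CI job creates.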

Comment 3 Arunprasad Rajkumar 2022-04-26 05:16:00 UTC
@jfajersk Should we consider backporting a portion of bug 1829723 to fix this bug? This is not a high-severity bug: the alert fires due to an incorrect expression, not due to incorrect functionality.

*** This bug has been marked as a duplicate of bug 1829723 ***

Comment 4 Ben Parees 2022-04-26 13:15:50 UTC
The problem with this bug is that it prevents the job from ever passing. Unless someone actively investigates each failure, we won't know if we regress some other aspect of what this job is testing.

Since Monitoring presumably owns the job in question, it's ultimately up to you, but the options seem to be:

1) fix it (either fix the code, or at least change the job to skip this test)
2) delete the job because no one is watching to see if it fails anyway and thus it's not adding any value
3) explain how you're keeping an eye on this job's results even when it fails, to ensure the failure is not for a new/regression reason

Comment 9 Junqi Zhao 2022-05-11 02:40:33 UTC
Tested with 4.6.0-0.nightly-2022-05-10-195008: enabled UWM only, and no PrometheusNotIngestingSamples alert fires.
# oc -n openshift-user-workload-monitoring get pod
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-65799c6c77-cvpks   2/2     Running   0          9m26s
prometheus-user-workload-0             4/4     Running   1          9m20s
prometheus-user-workload-1             4/4     Running   1          9m20s
thanos-ruler-user-workload-0           3/3     Running   0          9m18s
thanos-ruler-user-workload-1           3/3     Running   0          9m18s

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS' | jq '.data.result[].metric.alertname'
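The verification above pipes the ALERTS query through jq to list alert names. The same check can be sketched in Python against the Prometheus HTTP API response shape (the sample response below is hypothetical, showing only the always-firing Watchdog alert; it is not the actual output from this test run):

```python
import json

def alertnames(api_response_text):
    """Extract alertname labels from a Prometheus /api/v1/query?query=ALERTS
    response, like `jq '.data.result[].metric.alertname'` above."""
    data = json.loads(api_response_text)
    return [r["metric"].get("alertname") for r in data["data"]["result"]]

# Hypothetical API response in the standard vector-result shape:
response = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "ALERTS", "alertname": "Watchdog",
                    "alertstate": "firing"}, "value": [1652236800, "1"]},
    ]},
})

names = alertnames(response)
# The fix is verified if the alert is absent:
assert "PrometheusNotIngestingSamples" not in names
print(names)
# → ['Watchdog']
```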

Comment 12 errata-xmlrpc 2022-05-26 17:00:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.6.58 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.