Bug 2078667

Summary: periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-uwm is permfailing
Product: OpenShift Container Platform
Reporter: Ben Parees <bparees>
Component: Monitoring
Assignee: Arunprasad Rajkumar <arajkuma>
Status: CLOSED ERRATA
QA Contact: Junqi Zhao <juzhao>
Severity: low
Priority: low
Docs Contact:
Version: 4.6
CC: amuller, anpicker, aos-bugs, arajkuma, erooth, jfajersk, sippy
Target Milestone: ---
Keywords: Reopened
Target Release: 4.6.z
Flags: arajkuma: needinfo-
Hardware: Unspecified
OS: Unspecified
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-05-26 17:00:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Bug Depends On: 1829723
Bug Blocks:

Description Ben Parees 2022-04-26 01:12:38 UTC

The job periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-uwm is failing frequently in CI, see testgrid results:

sample failure:


   query failed: ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1: promQL query: ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1 had reported incorrect results:

tl;dr: it looks like user workload monitoring isn't being configured successfully in this job, and testing UWM appears to be the entire purpose of this job's existence.
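The failing check above asserts that no unexpected alerts are firing: it excludes a small allowlist of alert names and anything at `info` severity. A minimal Python sketch of that filter (the helper name and sample label sets are hypothetical, not from the test source; only the exclusion list comes from the query above):

```python
import re

# Exclusion list taken from the test's PromQL query:
# alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards"
# PromQL's !~ matcher is anchored, hence fullmatch below.
EXCLUDED = re.compile(
    r"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards"
)

def unexpected_alerts(alerts):
    """Return firing, non-info alerts not covered by the exclusion list.

    `alerts` mimics the label sets an ALERTS query returns from the
    Prometheus HTTP API (hypothetical sample data, not real CI output).
    """
    return [
        a for a in alerts
        if a.get("alertstate") == "firing"
        and a.get("severity") != "info"
        and not EXCLUDED.fullmatch(a.get("alertname", ""))
    ]

# Example label sets in the shape the query returns:
sample = [
    {"alertname": "Watchdog", "alertstate": "firing", "severity": "none"},
    {"alertname": "PrometheusNotIngestingSamples", "alertstate": "firing",
     "severity": "warning", "namespace": "openshift-user-workload-monitoring"},
]
print([a["alertname"] for a in unexpected_alerts(sample)])
# → ['PrometheusNotIngestingSamples']
```

Any alert left after this filter, such as PrometheusNotIngestingSamples here, fails the job.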

Comment 1 Junqi Zhao 2022-04-26 03:11:45 UTC
As seen in bug 1829723, enabling only user workload monitoring triggers the PrometheusNotIngestingSamples alert for the openshift-user-workload-monitoring project.
The fix is in 4.7 and later versions.

Comment 2 Arunprasad Rajkumar 2022-04-26 03:48:30 UTC
IIUC, this happens when UWM is enabled without being configured to scrape any target. It looks like the diff in [1] helps mitigate the alert.

[1] https://github.com/openshift/cluster-monitoring-operator/pull/1018/files#diff-dae955ba385d21a49bc2b1b5d05dfa03b60c77e1219675d54376ac0f7e9f496bR2091
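The idea behind the mitigation, as described in the comment above, can be sketched as a predicate: only alert on zero ingestion when the Prometheus instance actually has something to ingest. This is an illustrative Python sketch of that condition, not the PromQL expression from the linked PR (see [1] for the actual change):

```python
def should_fire(samples_appended_rate, scrape_targets, rule_groups):
    """Sketch of the mitigated PrometheusNotIngestingSamples condition.

    A freshly enabled UWM stack with no user targets and no user rules
    ingests nothing, but that is expected, so the alert stays silent.
    All parameter names here are illustrative.
    """
    has_work = scrape_targets > 0 or rule_groups > 0
    return samples_appended_rate <= 0 and has_work

# UWM enabled but unconfigured: no targets, no rules, no alert.
print(should_fire(samples_appended_rate=0, scrape_targets=0, rule_groups=0))
# → False
```

Under the original (unmitigated) expression, the zero ingestion rate alone was enough to fire the alert, which is exactly the situation this CI job creates.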

Comment 3 Arunprasad Rajkumar 2022-04-26 05:16:00 UTC
@jfajersk Should we consider backporting a portion of bug 1829723 to fix this bug? This is not a high-severity bug: the alert fires due to an incorrect expression, not due to incorrect functionality.

*** This bug has been marked as a duplicate of bug 1829723 ***

Comment 4 Ben Parees 2022-04-26 13:15:50 UTC
The problem with this bug is that it prevents the job from ever passing. Unless someone actively investigates each failure, we won't know if we regress some other aspect of what this job is testing.

Since Monitoring presumably owns the job in question, it's ultimately up to you, but the options seem to be:

1) fix it (either fix the code, or at least change the job to skip this test)
2) delete the job because no one is watching to see if it fails anyway and thus it's not adding any value
3) explain how you're keeping an eye on this job's results even when it fails, to ensure the failure is not for a new/regression reason

Comment 9 Junqi Zhao 2022-05-11 02:40:33 UTC
Tested with 4.6.0-0.nightly-2022-05-10-195008: enabled UWM only, and no PrometheusNotIngestingSamples alert fires.
# oc -n openshift-user-workload-monitoring get pod
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-65799c6c77-cvpks   2/2     Running   0          9m26s
prometheus-user-workload-0             4/4     Running   1          9m20s
prometheus-user-workload-1             4/4     Running   1          9m20s
thanos-ruler-user-workload-0           3/3     Running   0          9m18s
thanos-ruler-user-workload-1           3/3     Running   0          9m18s

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS' | jq '.data.result[].metric.alertname'
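The verification above pipes the ALERTS query through jq to list alert names. The same check can be sketched in Python against the Prometheus HTTP API response shape (the sample response below is hypothetical, showing only the always-firing Watchdog alert; it is not the actual output from this test run):

```python
import json

def alertnames(api_response_text):
    """Extract alertname labels from a Prometheus /api/v1/query?query=ALERTS
    response, like `jq '.data.result[].metric.alertname'` above."""
    data = json.loads(api_response_text)
    return [r["metric"].get("alertname") for r in data["data"]["result"]]

# Hypothetical API response in the standard vector-result shape:
response = json.dumps({
    "status": "success",
    "data": {"resultType": "vector", "result": [
        {"metric": {"__name__": "ALERTS", "alertname": "Watchdog",
                    "alertstate": "firing"}, "value": [1652236800, "1"]},
    ]},
})

names = alertnames(response)
# The fix is verified if the alert is absent:
assert "PrometheusNotIngestingSamples" not in names
print(names)
# → ['Watchdog']
```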

Comment 12 errata-xmlrpc 2022-05-26 17:00:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.6.58 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.