Bug 2078667 - periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-uwm is permfailing
Summary: periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-uwm is permfailing
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.6.z
Assignee: Arunprasad Rajkumar
QA Contact: Junqi Zhao
URL:
Whiteboard:
Depends On: 1829723
Blocks:
 
Reported: 2022-04-26 01:12 UTC by Ben Parees
Modified: 2022-05-26 17:00 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
job=periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-uwm=all
Last Closed: 2022-05-26 17:00:16 UTC
Target Upstream Version:
Embargoed:
arajkuma: needinfo-




Links:
  Github openshift/cluster-monitoring-operator pull 1649 (open): "Bug 2078667: Fix prometheus not ingesting samples", last updated 2022-04-27 14:43:22 UTC
  Red Hat Product Errata RHSA-2022:2264, last updated 2022-05-26 17:00:23 UTC

Description Ben Parees 2022-04-26 01:12:38 UTC
job:
periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-uwm 

is failing frequently in CI, see testgrid results:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-uwm

sample failure:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-uwm/1518271442184048640

error:

   query failed: ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1: promQL query: ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1 had reported incorrect results:
    [{"metric":{"__name__":"ALERTS","alertname":"PrometheusNotIngestingSamples","alertstate":"firing","container":"kube-rbac-proxy","endpoint":"metrics","instance":"10.128.2.5:9091","job":"prometheus-user-workload","namespace":"openshift-user-workload-monitoring","pod":"prometheus-user-workload-1","service":"prometheus-user-workload","severity":"warning"},"value":[1650821149.545,"1"]},{"metric":{"__name__":"ALERTS","alertname":"PrometheusNotIngestingSamples","alertstate":"firing","container":"kube-rbac-proxy","endpoint":"metrics","instance":"10.131.0.8:9091","job":"prometheus-user-workload","namespace":"openshift-user-workload-monitoring","pod":"prometheus-user-workload-0","service":"prometheus-user-workload","severity":"warning"},"value":[1650821149.545,"1"]}]
occurred}


tl;dr: it looks like user workload monitoring isn't being configured successfully in this job, and exercising UWM is the whole point of this job's existence.
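For reference, the failing check can be re-run by hand against a live cluster with something like the following (a hedged sketch, assuming access to the cluster under test and the standard openshift-monitoring stack; it reuses the prometheus-k8s service account token and thanos-querier endpoint shown later in this bug):

# Run the same ALERTS query the upgrade test evaluates; a non-empty result reproduces the failure.
token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
  curl -k -s -G -H "Authorization: Bearer $token" \
  --data-urlencode 'query=ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1' \
  'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query' | jq '.data.result'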

Comment 1 Junqi Zhao 2022-04-26 03:11:45 UTC
As seen in bug 1829723, enabling only user workload monitoring causes the PrometheusNotIngestingSamples alerts to fire for the openshift-user-workload-monitoring project.
The fix is in 4.7 and later versions.
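For context, a minimal sketch of what "enabling only user workload monitoring" means here, assuming the documented cluster-monitoring-config ConfigMap is used and no ServiceMonitors or rules are created in user namespaces:

# Enable UWM with no further configuration; the UWM Prometheus pods come up with nothing to scrape.
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
EOF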

Comment 2 Arunprasad Rajkumar 2022-04-26 03:48:30 UTC
IIUC, this happens when UWM is enabled without configuring it to scrape any target. The diff in [1] looks like it mitigates the alert.

[1] https://github.com/openshift/cluster-monitoring-operator/pull/1018/files#diff-dae955ba385d21a49bc2b1b5d05dfa03b60c77e1219675d54376ac0f7e9f496bR2091
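A hedged way to confirm that precondition on an affected cluster (assuming the Prometheus container serves its API on localhost:9090 inside the pod, which may vary by release): count the UWM Prometheus's active scrape targets; zero active targets is the situation in which the old alert expression misfires.

# Expect 0 on a cluster where UWM is enabled but nothing has been configured for it to scrape.
oc -n openshift-user-workload-monitoring exec -c prometheus prometheus-user-workload-0 -- \
  curl -s 'http://localhost:9090/api/v1/targets?state=active' | jq '.data.activeTargets | length'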

Comment 3 Arunprasad Rajkumar 2022-04-26 05:16:00 UTC
@jfajersk Should we consider backporting a portion of the bug 1829723 fix to address this bug? This is not a high-severity bug; the alert fires because of an incorrect expression, not because of incorrect functionality.

*** This bug has been marked as a duplicate of bug 1829723 ***

Comment 4 Ben Parees 2022-04-26 13:15:50 UTC
The problem with this bug is that it prevents the job from ever passing, so unless someone actively looks into why the job is failing each time, we won't know if we regress some other aspect of what this job is testing.

Since Monitoring presumably owns the job in question, it's ultimately up to you, but the options seem to be:

1) fix it (either fix the code, or at least change the job to skip this test)
2) delete the job because no one is watching to see if it fails anyway and thus it's not adding any value
3) explain how you're keeping an eye on this job's results even when it fails, to ensure the failure is not for a new/regression reason

Comment 9 Junqi Zhao 2022-05-11 02:40:33 UTC
Tested with 4.6.0-0.nightly-2022-05-10-195008 with only UWM enabled; the PrometheusNotIngestingSamples alert no longer fires.
# oc -n openshift-user-workload-monitoring get pod
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-65799c6c77-cvpks   2/2     Running   0          9m26s
prometheus-user-workload-0             4/4     Running   1          9m20s
prometheus-user-workload-1             4/4     Running   1          9m20s
thanos-ruler-user-workload-0           3/3     Running   0          9m18s
thanos-ruler-user-workload-1           3/3     Running   0          9m18s

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS' | jq '.data.result[].metric.alertname'
"AlertmanagerReceiversNotConfigured"
"Watchdog"

Comment 12 errata-xmlrpc 2022-05-26 17:00:16 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.6.58 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:2264

