job: periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-uwm is failing frequently in CI, see testgrid results:
https://testgrid.k8s.io/redhat-openshift-ocp-release-4.7-informing#periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-uwm

sample failure:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-ci-4.7-upgrade-from-stable-4.6-e2e-aws-uwm/1518271442184048640

error:
query failed: ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1: promQL query: ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1 had reported incorrect results:
[{"metric":{"__name__":"ALERTS","alertname":"PrometheusNotIngestingSamples","alertstate":"firing","container":"kube-rbac-proxy","endpoint":"metrics","instance":"10.128.2.5:9091","job":"prometheus-user-workload","namespace":"openshift-user-workload-monitoring","pod":"prometheus-user-workload-1","service":"prometheus-user-workload","severity":"warning"},"value":[1650821149.545,"1"]},
 {"metric":{"__name__":"ALERTS","alertname":"PrometheusNotIngestingSamples","alertstate":"firing","container":"kube-rbac-proxy","endpoint":"metrics","instance":"10.131.0.8:9091","job":"prometheus-user-workload","namespace":"openshift-user-workload-monitoring","pod":"prometheus-user-workload-0","service":"prometheus-user-workload","severity":"warning"},"value":[1650821149.545,"1"]}]
occurred}

tl;dr: it looks like user workload monitoring isn't being configured successfully in this job, and testing UWM is the whole reason this job exists.
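For anyone re-triaging, here is a rough sketch of re-running the CI query by hand against a live cluster. The service account, container, and thanos-querier service names are the standard openshift-monitoring ones and are my assumption, not taken from the CI step itself:

# sketch only: reproduce the CI check's PromQL by hand (standard openshift-monitoring names assumed)
token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
  curl -sk -H "Authorization: Bearer $token" -G \
  --data-urlencode 'query=ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing",severity!="info"} >= 1' \
  'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query' | jq '.data.result'

On a healthy cluster this should return an empty list; on the failing runs it returns the two PrometheusNotIngestingSamples series quoted above.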
As described in bug 1829723: when user workload monitoring is enabled with no further configuration, the PrometheusNotIngestingSamples alerts fire for the openshift-user-workload-monitoring project. The fix is present in 4.7 and later versions.
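For reference, "enable user workload monitoring only" amounts to applying the standard enableUserWorkload knob with nothing else configured. A minimal sketch, assuming no pre-existing cluster-monitoring-config ConfigMap in openshift-monitoring:

# minimal sketch: turn on UWM with no user ServiceMonitors/PodMonitors/rules,
# which is exactly the state in which the alert fired
cat <<'EOF' | oc apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true
EOF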
IIUC, this happens when UWM is enabled without being configured to scrape any target. It looks like the diff in [1] mitigates the alert.

[1] https://github.com/openshift/cluster-monitoring-operator/pull/1018/files#diff-dae955ba385d21a49bc2b1b5d05dfa03b60c77e1219675d54376ac0f7e9f496bR2091
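To see which expression a given cluster actually evaluates, something like the following should work. This is a sketch: it assumes the rule is loaded into the platform Prometheus and that its API is reachable on 127.0.0.1:9090 inside the prometheus container, which is the usual layout.

# sketch: dump the PrometheusNotIngestingSamples expression currently loaded in prometheus-k8s
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
  curl -s 'http://localhost:9090/api/v1/rules' \
  | jq -r '.data.groups[].rules[] | select(.name=="PrometheusNotIngestingSamples") | .query'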
@jfajersk Should we consider backporting the relevant portion of bug 1829723 to fix this bug?

This is not a high-severity bug; the alert fires because of an incorrect expression, not because of incorrect functionality.

*** This bug has been marked as a duplicate of bug 1829723 ***
The problem with this bug is that it prevents the job from ever passing, so unless someone actively investigates each failure, we won't know if we regress some other aspect of what this job is testing. Since Monitoring presumably owns the job in question, it's ultimately up to you, but the options seem to be:
1) fix it (either fix the code, or at least change the job to skip this test)
2) delete the job, because no one is watching to see if it fails anyway and thus it isn't adding any value
3) explain how you're keeping an eye on this job's results even when it fails, to ensure the failure is not for a new/regression reason
tested with 4.6.0-0.nightly-2022-05-10-195008, enabled UWM only, no PrometheusNotIngestingSamples alert

# oc -n openshift-user-workload-monitoring get pod
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-65799c6c77-cvpks   2/2     Running   0          9m26s
prometheus-user-workload-0             4/4     Running   1          9m20s
prometheus-user-workload-1             4/4     Running   1          9m20s
thanos-ruler-user-workload-0           3/3     Running   0          9m18s
thanos-ruler-user-workload-1           3/3     Running   0          9m18s

# token=`oc sa get-token prometheus-k8s -n openshift-monitoring`
# oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- curl -k -H "Authorization: Bearer $token" 'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS' | jq '.data.result[].metric.alertname'
"AlertmanagerReceiversNotConfigured"
"Watchdog"
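As an optional follow-up (reusing the $token and exec path from the commands above), the raw ingestion rate of the user-workload pods can be queried directly; with UWM enabled but nothing to scrape it can legitimately be 0, which, if I read the old rule correctly, is what it alerted on. The useful signal is that the alert now stays silent despite that. This query is my own sketch, not part of the original verification:

# sketch: check the raw ingestion rate that the old alert expression keyed on
oc -n openshift-monitoring exec -c prometheus prometheus-k8s-0 -- \
  curl -k -H "Authorization: Bearer $token" -G \
  --data-urlencode 'query=rate(prometheus_tsdb_head_samples_appended_total{namespace="openshift-user-workload-monitoring",job="prometheus-user-workload"}[5m])' \
  'https://thanos-querier.openshift-monitoring.svc:9091/api/v1/query' | jq '.data.result'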
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.6.58 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:2264