Bug 1768756 - "shouldn't report any alerts in firing state apart from Watchdog" missed dozens of alerts
Summary: "shouldn't report any alerts in firing state apart from Watchdog" missed doze...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Monitoring
Version: 4.3.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: low
Target Milestone: ---
Target Release: 4.4.0
Assignee: Lili Cosic
QA Contact: Junqi Zhao
URL:
Whiteboard: groom
Depends On:
Blocks:
 
Reported: 2019-11-05 08:07 UTC by Tomáš Nožička
Modified: 2020-05-04 11:15 UTC
CC List: 9 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-05-04 11:14:43 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/origin pull 24492 (closed): Bug 1768756: Run Prometheus specific tests cases at the end of the test run (last updated 2020-12-18 09:09:30 UTC)
Red Hat Product Errata RHBA-2020:0581 (last updated 2020-05-04 11:15:11 UTC)

Description Tomáš Nožička 2019-11-05 08:07:31 UTC
Description of problem:

The e2e suite misses a lot of alerts fired during the e2e run. This is expected from how the test is written: being a regular test, it gets scheduled in random order, so in the worst case it runs first and misses every alert caused by the e2e run, because the tests that cause them run after the alerts have already been checked. In the best case it runs last and sees them all. The probability of either extreme is low; on average it runs somewhere in the middle and misses about half of the alerts, which is pretty bad. Firstly, we miss bugs; secondly, catching those alerts would help a lot with triaging e2e failures.

There are already tests that run only after the e2e suite is over, or that run during the e2e run and finish only when e2e is done; see "Monitor cluster while tests execute".

The first test passed at 2019-11-04T15:52:54.

even with the current exception being:
`ALERTS{alertname!~"Watchdog|ClusterMachineApproverDown|UsingDeprecatedAPIExtensionsV1Beta1",alertstate="firing"} >= 1`

(Does this need a range if tests take longer? I don't know what the default is.)
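For illustration only, a range-aware variant of the same check could ask whether any non-excluded alert reached the firing state at any point during the run (a sketch, not the expression the suite actually uses; the [1h] window is a placeholder for the suite duration):

# Hypothetical range-based variant: returns a result (and would fail the check)
# for any non-excluded alert that was in the firing state at any point in the window.
max_over_time(ALERTS{alertname!~"Watchdog|ClusterMachineApproverDown|UsingDeprecatedAPIExtensionsV1Beta1",alertstate="firing"}[1h]) >= 1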


The e2e check passed: (19s) 2019-11-04T16:07:44 "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog [Suite:openshift/conformance/parallel/minimal]"

but the last test ran at around 16:20:29.898866.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/639/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws/2895#1:build-log.txt%3A9342

Checking the Prometheus alerts at https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/639/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws/2895/artifacts/e2e-aws/metrics/prometheus.tar

using: sum_over_time(ALERTS{alertname!~"Watchdog|ClusterMachineApproverDown|UsingDeprecatedAPIExtensionsV1Beta1"}[2d])

these fired:

{alertname="AlertmanagerMembersInconsistent",alertstate="pending",endpoint="web",instance="10.129.2.10:9095",job="alertmanager-main",namespace="openshift-monitoring",pod="alertmanager-main-1",service="alertmanager-main",severity="critical"}	1
{alertname="CPUThrottlingHigh",alertstate="pending",container="sti-build",namespace="e2e-test-s2i-build-quota-f65dt",pod="s2i-build-quota-1-build",severity="warning"}	13
{alertname="ClusterOperatorDegraded",alertstate="pending",condition="Degraded",endpoint="metrics",instance="10.0.131.192:9099",job="cluster-version-operator",name="authentication",namespace="openshift-cluster-version",pod="cluster-version-operator-74c4f5cc49-ctjrx",reason="RouteHealthDegradedFailedGet",service="cluster-version-operator",severity="critical"}	1
{alertname="ClusterOperatorDown",alertstate="pending",endpoint="metrics",instance="10.0.131.192:9099",job="cluster-version-operator",name="authentication",namespace="openshift-cluster-version",pod="cluster-version-operator-74c4f5cc49-ctjrx",service="cluster-version-operator",severity="critical"}	1
{alertname="ClusterVersionOperatorDown",alertstate="pending",severity="critical"}	1
{alertname="KubeAPIErrorsHigh",alertstate="pending",resource="pods",severity="critical",verb="POST"}	3
{alertname="KubeAPIErrorsHigh",alertstate="pending",resource="pods",severity="warning",verb="POST"}	6
{alertname="KubeAPILatencyHigh",alertstate="pending",component="apiserver",endpoint="https",job="apiserver",namespace="default",quantile="0.99",resource="configmaps",scope="namespace",service="kubernetes",severity="critical",verb="POST",version="v1"}	9
{alertname="KubeAPILatencyHigh",alertstate="pending",component="apiserver",endpoint="https",job="apiserver",namespace="default",quantile="0.99",resource="configmaps",scope="namespace",service="kubernetes",severity="warning",verb="POST",version="v1"}	9
{alertname="KubePersistentVolumeErrors",alertstate="pending",endpoint="https-main",instance="10.129.2.4:8443",job="kube-state-metrics",namespace="openshift-monitoring",persistentvolume="pvc-af19d81b-9895-47da-957d-7da743821e51",phase="Failed",pod="kube-state-metrics-7c69f886b4-xlwx9",service="kube-state-metrics",severity="critical"}	4
{alertname="KubePersistentVolumeErrors",alertstate="pending",endpoint="https-main",instance="10.129.2.4:8443",job="kube-state-metrics",namespace="openshift-monitoring",persistentvolume="pvc-b45a1a02-c4b7-438c-a06c-c98de7fbb47e",phase="Failed",pod="kube-state-metrics-7c69f886b4-xlwx9",service="kube-state-metrics",severity="critical"}	4
{alertname="KubePersistentVolumeErrors",alertstate="pending",endpoint="https-main",instance="10.129.2.4:8443",job="kube-state-metrics",namespace="openshift-monitoring",persistentvolume="pvc-bb632cd7-773d-499f-9506-eec87331a388",phase="Failed",pod="kube-state-metrics-7c69f886b4-xlwx9",service="kube-state-metrics",severity="critical"}	4
{alertname="KubePersistentVolumeErrors",alertstate="pending",endpoint="https-main",instance="10.129.2.4:8443",job="kube-state-metrics",namespace="openshift-monitoring",persistentvolume="pvc-fc9b368b-14a2-4d6d-827f-f4f43421c294",phase="Failed",pod="kube-state-metrics-7c69f886b4-xlwx9",service="kube-state-metrics",severity="critical"}	4
{alertname="KubePodNotReady",alertstate="pending",namespace="openshift-must-gather-vcr4h",pod="must-gather-cltrz",severity="critical"}	12
{alertname="KubeletDown",alertstate="pending",severity="critical"}	2
{alertname="MachineAPIOperatorDown",alertstate="pending",severity="critical"}	1
{alertname="MachineWithoutValidNode",alertstate="pending",api_version="machine.openshift.io/v1beta1",endpoint="https",exported_namespace="openshift-machine-api",instance="10.128.0.21:8443",job="machine-api-operator",name="ci-op-83idn38x-63f8c-7b7bq-master-0",namespace="openshift-machine-api",node="ip-10-0-142-89.ec2.internal",phase="Running",pod="machine-api-operator-5f4656fddc-wkpkc",service="machine-api-operator",severity="critical",spec_provider_id="aws:///us-east-1a/i-0340a5e50f5bcc6a4"}	2
{alertname="MachineWithoutValidNode",alertstate="pending",api_version="machine.openshift.io/v1beta1",endpoint="https",exported_namespace="openshift-machine-api",instance="10.128.0.21:8443",job="machine-api-operator",name="ci-op-83idn38x-63f8c-7b7bq-master-1",namespace="openshift-machine-api",node="ip-10-0-150-179.ec2.internal",phase="Running",pod="machine-api-operator-5f4656fddc-wkpkc",service="machine-api-operator",severity="critical",spec_provider_id="aws:///us-east-1b/i-02dcc22b04f252bdc"}	2
{alertname="MachineWithoutValidNode",alertstate="pending",api_version="machine.openshift.io/v1beta1",endpoint="https",exported_namespace="openshift-machine-api",instance="10.128.0.21:8443",job="machine-api-operator",name="ci-op-83idn38x-63f8c-7b7bq-master-2",namespace="openshift-machine-api",node="ip-10-0-131-192.ec2.internal",phase="Running",pod="machine-api-operator-5f4656fddc-wkpkc",service="machine-api-operator",severity="critical",spec_provider_id="aws:///us-east-1a/i-0ae3d6f02a410b956"}	2
{alertname="MachineWithoutValidNode",alertstate="pending",api_version="machine.openshift.io/v1beta1",endpoint="https",exported_namespace="openshift-machine-api",instance="10.128.0.21:8443",job="machine-api-operator",name="ci-op-83idn38x-63f8c-7b7bq-worker-us-east-1a-4x4fb",namespace="openshift-machine-api",node="ip-10-0-134-173.ec2.internal",phase="Running",pod="machine-api-operator-5f4656fddc-wkpkc",service="machine-api-operator",severity="critical",spec_provider_id="aws:///us-east-1a/i-06064777b294e4a69"}	2
{alertname="MachineWithoutValidNode",alertstate="pending",api_version="machine.openshift.io/v1beta1",endpoint="https",exported_namespace="openshift-machine-api",instance="10.128.0.21:8443",job="machine-api-operator",name="ci-op-83idn38x-63f8c-7b7bq-worker-us-east-1a-qb4zr",namespace="openshift-machine-api",node="ip-10-0-138-90.ec2.internal",phase="Running",pod="machine-api-operator-5f4656fddc-wkpkc",service="machine-api-operator",severity="critical",spec_provider_id="aws:///us-east-1a/i-03ebe00e6fb395386"}	2
{alertname="MachineWithoutValidNode",alertstate="pending",api_version="machine.openshift.io/v1beta1",endpoint="https",exported_namespace="openshift-machine-api",instance="10.128.0.21:8443",job="machine-api-operator",name="ci-op-83idn38x-63f8c-7b7bq-worker-us-east-1b-gvdmh",namespace="openshift-machine-api",node="ip-10-0-150-33.ec2.internal",phase="Running",pod="machine-api-operator-5f4656fddc-wkpkc",service="machine-api-operator",severity="critical",spec_provider_id="aws:///us-east-1b/i-04229d168fa0da5fd"}

Version-Release number of selected component (if applicable):
4.3

Comment 1 Frederic Branczyk 2019-11-05 08:24:23 UTC
The alerts in the result returned from the query above are all in `pending` state, meaning the condition didn't last long enough to actually be a problem (at least as the individual alerts are written), so they never resulted in a firing alert. I do agree that we should make sure we look at the entire time range, though, and not only at the instant at which the test runs. Additionally, is it possible to ensure that the test is scheduled at the very end, if that's not already the case (or at least as one of the last tests run)?

Comment 2 Pawel Krupa 2019-11-05 10:25:53 UTC
This is not a release blocker and relates only to e2e test improvements in origin. Setting severity to low and target release to 4.4.

Comment 3 Tomáš Nožička 2019-11-05 15:42:40 UTC
> The alerts in the result returned from the query above are all in `pending` state, meaning the condition didn't last long enough for it to actually be a problem

I'll need to think about it more, but right now I don't see a valid case where alerts like these should even be pending during e2e:
{alertname="KubeletDown",alertstate="pending",severity="critical"}	2
{alertname="MachineAPIOperatorDown",alertstate="pending",severity="critical"}	1

> Additionally is it possible to ensure that the test is scheduled at the very end if that's not already the case (or at least as one of the last tests run)?

I already referenced this in the description: see the existing e2e test "Monitor cluster while tests execute" that Clayton wrote to run at the end of the suite.


> This is not a release blocker and relates only to e2e test improvements in origin. Setting severity to low and target release to 4.4.

Primarily this is not about improving e2e; it is about avoiding shipping a release with bugs.

Comment 4 Frederic Branczyk 2019-11-05 16:06:30 UTC
Pending alerts are a good thing and standard best practice in monitoring, because there are various circumstances under which an expression may trigger intermittently (an intermittent network connection problem, a target discovery problem, etc.), which this case actually shows nicely. I extracted the Prometheus dump, and it showed that Prometheus never actually failed to scrape these two targets; at the very start of Prometheus monitoring the components, the targets were simply not discovered in time for alert rule evaluation, and that resolved within a single scrape interval. Had these alerts triggered immediately, we would have gotten a false positive alert.
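For example (a sketch for triaging a dump like this; the [2d] window simply mirrors the query in the description), restricting the selector to alertstate="firing" drops alerts that only ever reached the pending state, i.e. whose "for" duration never elapsed:

# Sketch: same exclusion list, but only samples where the alert actually
# reached the firing state; pending-only alerts return no result here.
sum_over_time(ALERTS{alertname!~"Watchdog|ClusterMachineApproverDown|UsingDeprecatedAPIExtensionsV1Beta1",alertstate="firing"}[2d])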

To me it looks like the data shows that the alerting rules are correct; however, I still think we need to run this test at the very end. We'll take care of that part. Thank you for opening this issue; this is an important thing we had missed so far.

Comment 5 Tomáš Nožička 2019-11-06 07:10:19 UTC
> Pending alerts are a good thing and standard best practice in monitoring, because there are various circumstances under which an expression may trigger intermittently (an intermittent network connection problem, a target discovery problem, etc.), which this case actually shows nicely. I extracted the Prometheus dump, and it showed that Prometheus never actually failed to scrape these two targets; at the very start of Prometheus monitoring the components, the targets were simply not discovered in time for alert rule evaluation, and that resolved within a single scrape interval. Had these alerts triggered immediately, we would have gotten a false positive alert.

I took this from a failed run, so I assumed there was something actually wrong. The explanation makes sense to me, thanks.

Comment 6 Simon Pasquier 2019-11-13 08:15:33 UTC
I've seen one CI run where the "MachineWithoutValidNode" and "MachineWithNoRunningPhase" alerts were firing [1]. I loaded the Prometheus data in PromeCIeus and the issue seems to be caused by machine-api-operator exposing inconsistent labels for the mapi_machine_created_timestamp_seconds metric.

The PromQL expression of "MachineWithoutValidNode" is "(mapi_machine_created_timestamp_seconds unless on(node) kube_node_info) > 0" [2]. But mapi_machine_created_timestamp_seconds doesn't always have a "node" label [3], in which case the alert will fire.

Either there's a bug in machine-api/machine-api-operator, or the alert should be tweaked/removed (a query to spot the offending series is sketched after the links below).

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/543/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1569
[2] https://github.com/openshift/machine-api-operator/blob/5d1c7835e2393005881532029e3458f7d0c63580/install/0000_90_machine-api-operator_04_alertrules.yaml#L13-L20
[3] https://github.com/openshift/machine-api-operator/blob/5d1c7835e2393005881532029e3458f7d0c63580/pkg/metrics/metrics.go#L78-L99
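For illustration (a sketch, not necessarily how the alert ended up being handled): selecting the metric where the "node" label is empty or absent shows the offending series, since any such series escapes the "unless on(node)" join and therefore trips the alert:

# Sketch: series exported without a "node" label (or with an empty one);
# these never match kube_node_info on(node), so the alert expression keeps them.
mapi_machine_created_timestamp_seconds{node=""}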

Comment 11 Junqi Zhao 2020-02-21 11:21:32 UTC
The PR is merged, and there is no issue in CI.

Comment 13 errata-xmlrpc 2020-05-04 11:14:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0581

