Description of problem:

e2e runs miss a lot of the alerts fired during the run. This is expected from how the test is written: being a regular test, it gets scheduled in random order, so in the worst case it runs first and misses all alerts caused by the e2e run, because those fire after the test has already checked. In the best case it runs last and sees everything. Both extremes are unlikely; on average it runs somewhere in the middle and misses about half of the alerts, which is pretty bad. Firstly, we miss bugs; secondly, catching those alerts would, I think, help a lot with triaging e2e failures. There are already tests that run only after the e2e suite is over, or that run during the e2e run and finish only when e2e is done - see "Monitor cluster while tests execute".

The first test passed at 2019-11-04T15:52:54 even with the current exception there being:

ALERTS{alertname!~"Watchdog|ClusterMachineApproverDown|UsingDeprecatedAPIExtensionsV1Beta1",alertstate="firing"} >= 1

(Does this need a range if tests take longer? I don't know what the default is.)
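To the question about the range: a sketch of how the check could use a range vector so it catches alerts that fired at any point during the run, not only at the evaluation instant. The `[2h]` lookback window is an assumption, not a tested value; it would need to cover the whole suite:

```promql
# Sketch: fail if any non-excluded alert was firing at any sample in the
# lookback window, instead of only at the instant the test runs.
# The [2h] window is an assumed value and would need tuning to the run length.
count_over_time(ALERTS{alertname!~"Watchdog|ClusterMachineApproverDown|UsingDeprecatedAPIExtensionsV1Beta1",alertstate="firing"}[2h]) >= 1
```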
the e2e check passed: (19s) 2019-11-04T16:07:44 "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog [Suite:openshift/conformance/parallel/minimal]" but the last test happened around 16:20:29.898866

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/639/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws/2895#1:build-log.txt%3A9342

checking the prometheus alerts at https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_cluster-kube-apiserver-operator/639/pull-ci-openshift-cluster-kube-apiserver-operator-master-e2e-aws/2895/artifacts/e2e-aws/metrics/prometheus.tar using:

sum_over_time(ALERTS{alertname!~"Watchdog|ClusterMachineApproverDown|UsingDeprecatedAPIExtensionsV1Beta1"}[2d])

these fired:

{alertname="AlertmanagerMembersInconsistent",alertstate="pending",endpoint="web",instance="10.129.2.10:9095",job="alertmanager-main",namespace="openshift-monitoring",pod="alertmanager-main-1",service="alertmanager-main",severity="critical"} 1
{alertname="CPUThrottlingHigh",alertstate="pending",container="sti-build",namespace="e2e-test-s2i-build-quota-f65dt",pod="s2i-build-quota-1-build",severity="warning"} 13
{alertname="ClusterOperatorDegraded",alertstate="pending",condition="Degraded",endpoint="metrics",instance="10.0.131.192:9099",job="cluster-version-operator",name="authentication",namespace="openshift-cluster-version",pod="cluster-version-operator-74c4f5cc49-ctjrx",reason="RouteHealthDegradedFailedGet",service="cluster-version-operator",severity="critical"} 1
{alertname="ClusterOperatorDown",alertstate="pending",endpoint="metrics",instance="10.0.131.192:9099",job="cluster-version-operator",name="authentication",namespace="openshift-cluster-version",pod="cluster-version-operator-74c4f5cc49-ctjrx",service="cluster-version-operator",severity="critical"} 1
{alertname="ClusterVersionOperatorDown",alertstate="pending",severity="critical"} 1
{alertname="KubeAPIErrorsHigh",alertstate="pending",resource="pods",severity="critical",verb="POST"} 3
{alertname="KubeAPIErrorsHigh",alertstate="pending",resource="pods",severity="warning",verb="POST"} 6
{alertname="KubeAPILatencyHigh",alertstate="pending",component="apiserver",endpoint="https",job="apiserver",namespace="default",quantile="0.99",resource="configmaps",scope="namespace",service="kubernetes",severity="critical",verb="POST",version="v1"} 9
{alertname="KubeAPILatencyHigh",alertstate="pending",component="apiserver",endpoint="https",job="apiserver",namespace="default",quantile="0.99",resource="configmaps",scope="namespace",service="kubernetes",severity="warning",verb="POST",version="v1"} 9
{alertname="KubePersistentVolumeErrors",alertstate="pending",endpoint="https-main",instance="10.129.2.4:8443",job="kube-state-metrics",namespace="openshift-monitoring",persistentvolume="pvc-af19d81b-9895-47da-957d-7da743821e51",phase="Failed",pod="kube-state-metrics-7c69f886b4-xlwx9",service="kube-state-metrics",severity="critical"} 4
{alertname="KubePersistentVolumeErrors",alertstate="pending",endpoint="https-main",instance="10.129.2.4:8443",job="kube-state-metrics",namespace="openshift-monitoring",persistentvolume="pvc-b45a1a02-c4b7-438c-a06c-c98de7fbb47e",phase="Failed",pod="kube-state-metrics-7c69f886b4-xlwx9",service="kube-state-metrics",severity="critical"} 4
{alertname="KubePersistentVolumeErrors",alertstate="pending",endpoint="https-main",instance="10.129.2.4:8443",job="kube-state-metrics",namespace="openshift-monitoring",persistentvolume="pvc-bb632cd7-773d-499f-9506-eec87331a388",phase="Failed",pod="kube-state-metrics-7c69f886b4-xlwx9",service="kube-state-metrics",severity="critical"} 4
{alertname="KubePersistentVolumeErrors",alertstate="pending",endpoint="https-main",instance="10.129.2.4:8443",job="kube-state-metrics",namespace="openshift-monitoring",persistentvolume="pvc-fc9b368b-14a2-4d6d-827f-f4f43421c294",phase="Failed",pod="kube-state-metrics-7c69f886b4-xlwx9",service="kube-state-metrics",severity="critical"} 4
{alertname="KubePodNotReady",alertstate="pending",namespace="openshift-must-gather-vcr4h",pod="must-gather-cltrz",severity="critical"} 12
{alertname="KubeletDown",alertstate="pending",severity="critical"} 2
{alertname="MachineAPIOperatorDown",alertstate="pending",severity="critical"} 1
{alertname="MachineWithoutValidNode",alertstate="pending",api_version="machine.openshift.io/v1beta1",endpoint="https",exported_namespace="openshift-machine-api",instance="10.128.0.21:8443",job="machine-api-operator",name="ci-op-83idn38x-63f8c-7b7bq-master-0",namespace="openshift-machine-api",node="ip-10-0-142-89.ec2.internal",phase="Running",pod="machine-api-operator-5f4656fddc-wkpkc",service="machine-api-operator",severity="critical",spec_provider_id="aws:///us-east-1a/i-0340a5e50f5bcc6a4"} 2
{alertname="MachineWithoutValidNode",alertstate="pending",api_version="machine.openshift.io/v1beta1",endpoint="https",exported_namespace="openshift-machine-api",instance="10.128.0.21:8443",job="machine-api-operator",name="ci-op-83idn38x-63f8c-7b7bq-master-1",namespace="openshift-machine-api",node="ip-10-0-150-179.ec2.internal",phase="Running",pod="machine-api-operator-5f4656fddc-wkpkc",service="machine-api-operator",severity="critical",spec_provider_id="aws:///us-east-1b/i-02dcc22b04f252bdc"} 2
{alertname="MachineWithoutValidNode",alertstate="pending",api_version="machine.openshift.io/v1beta1",endpoint="https",exported_namespace="openshift-machine-api",instance="10.128.0.21:8443",job="machine-api-operator",name="ci-op-83idn38x-63f8c-7b7bq-master-2",namespace="openshift-machine-api",node="ip-10-0-131-192.ec2.internal",phase="Running",pod="machine-api-operator-5f4656fddc-wkpkc",service="machine-api-operator",severity="critical",spec_provider_id="aws:///us-east-1a/i-0ae3d6f02a410b956"} 2
{alertname="MachineWithoutValidNode",alertstate="pending",api_version="machine.openshift.io/v1beta1",endpoint="https",exported_namespace="openshift-machine-api",instance="10.128.0.21:8443",job="machine-api-operator",name="ci-op-83idn38x-63f8c-7b7bq-worker-us-east-1a-4x4fb",namespace="openshift-machine-api",node="ip-10-0-134-173.ec2.internal",phase="Running",pod="machine-api-operator-5f4656fddc-wkpkc",service="machine-api-operator",severity="critical",spec_provider_id="aws:///us-east-1a/i-06064777b294e4a69"} 2
{alertname="MachineWithoutValidNode",alertstate="pending",api_version="machine.openshift.io/v1beta1",endpoint="https",exported_namespace="openshift-machine-api",instance="10.128.0.21:8443",job="machine-api-operator",name="ci-op-83idn38x-63f8c-7b7bq-worker-us-east-1a-qb4zr",namespace="openshift-machine-api",node="ip-10-0-138-90.ec2.internal",phase="Running",pod="machine-api-operator-5f4656fddc-wkpkc",service="machine-api-operator",severity="critical",spec_provider_id="aws:///us-east-1a/i-03ebe00e6fb395386"} 2
{alertname="MachineWithoutValidNode",alertstate="pending",api_version="machine.openshift.io/v1beta1",endpoint="https",exported_namespace="openshift-machine-api",instance="10.128.0.21:8443",job="machine-api-operator",name="ci-op-83idn38x-63f8c-7b7bq-worker-us-east-1b-gvdmh",namespace="openshift-machine-api",node="ip-10-0-150-33.ec2.internal",phase="Running",pod="machine-api-operator-5f4656fddc-wkpkc",service="machine-api-operator",severity="critical",spec_provider_id="aws:///us-east-1b/i-04229d168fa0da5fd"}

Version-Release number of selected component (if applicable): 4.3
The alerts in the result returned from the query above are all in `pending` state, meaning the condition didn't last long enough to actually be a problem (at least as the individual alerts are written), so they never resulted in any firing alert. I do agree that we should look at the entire time range, though, and not only at the instant the test runs. Additionally, is it possible to ensure that the test is scheduled at the very end, if that's not already the case (or at least as one of the last tests run)?
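As a sketch, restricting the description's query to the firing state would separate alerts that were real problems from pending blips over the same window (the `[2d]` lookback is carried over from the query above):

```promql
# Same query as in the description, but only counting samples where the
# alert actually reached the firing state, not just pending.
sum_over_time(ALERTS{alertname!~"Watchdog|ClusterMachineApproverDown|UsingDeprecatedAPIExtensionsV1Beta1",alertstate="firing"}[2d])
```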
This is not a release blocker and relates only to e2e test improvements in origin. Setting severity to low and target release to 4.4.
> The alerts in the result returned from the query above are all in `pending` state, meaning the condition didn't last long enough for it to actually be a problem

I'll need to think about it more, but right now I don't see a valid case in which alerts like these should even be pending during e2e:

{alertname="KubeletDown",alertstate="pending",severity="critical"} 2
{alertname="MachineAPIOperatorDown",alertstate="pending",severity="critical"} 1

> Additionally is it possible to ensure that the test is scheduled at the very end if that's not already the case (or at least as one of the last tests run)?

I already referenced that in the description - see the existing e2e test named "Monitor cluster while tests execute" that Clayton wrote to run at the end of the suite.

> This is not a release blocker and relates only to e2e test improvements in origin. Setting severity to low and target release to 4.4.

This is not primarily about improving e2e, but about avoiding shipping a release with bugs.
Pending alerts are a good thing and standard best practice in monitoring, because there are various circumstances in which an expression can trigger intermittently (an intermittent network connection problem, a target discovery problem, etc.), which this case actually shows nicely. I extracted the Prometheus dump, and it showed that Prometheus never failed to scrape these two targets; at the very start of Prometheus monitoring the components, the targets were simply not discovered in time for alert rule evaluation, and that resolved within a single scrape interval. Had these alerts fired immediately, we would have gotten a false positive. To me it looks like all the data shows that the alerting rules are correct. However, I still think we need to run this test at the very end. We'll take care of that part. Thank you for opening this issue - I think this is an important thing that we have missed so far.
> Pending alerts is a good thing and standard best practice in monitoring, because there can be various circumstances why an expression may trigger intermittently (intermittent network connection problem, target discovery problem, etc.), which this case actually nicely shows. I extracted the Prometheus dump, and it showed that Prometheus actually never failed to scrape these two targets, but at the very start of Prometheus monitoring the components the targets were not discovered in time for alert rule evaluation, which resolved within a single scrape interval, had these alerts triggered immediately then we would have gotten a false positive alert. I took this from a failed run, so I assumed there was something actually wrong. The explanation makes sense to me, thanks.
I've seen one CI run where the "MachineWithoutValidNode" and "MachineWithNoRunningPhase" alerts were firing [1]. I loaded the Prometheus data in PromeCIeus and the issue seems to be caused by machine-api-operator exposing inconsistent labels for the mapi_machine_created_timestamp_seconds metric. The PromQL expression of "MachineWithoutValidNode" is [2]:

(mapi_machine_created_timestamp_seconds unless on(node) kube_node_info) > 0

But mapi_machine_created_timestamp_seconds doesn't always have a "node" label [3], in which case the alert will fire. Either there's a bug in machine-api/machine-api-operator or the alert should be tweaked/removed.

[1] https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-monitoring-operator/543/pull-ci-openshift-cluster-monitoring-operator-master-e2e-aws/1569
[2] https://github.com/openshift/machine-api-operator/blob/5d1c7835e2393005881532029e3458f7d0c63580/install/0000_90_machine-api-operator_04_alertrules.yaml#L13-L20
[3] https://github.com/openshift/machine-api-operator/blob/5d1c7835e2393005881532029e3458f7d0c63580/pkg/metrics/metrics.go#L78-L99
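To illustrate the label mismatch: `unless on(node)` drops a left-hand series only when some kube_node_info series carries the same "node" label value. A mapi_machine_created_timestamp_seconds series that has no "node" label at all can never match, so it always survives the `unless` and trips the alert. One possible tweak (an untested sketch, not the fix that was actually merged) is to restrict the left side to series that carry a non-empty "node" label:

```promql
# Only consider machine series that expose a non-empty "node" label;
# series missing the label would otherwise always pass "unless on(node)".
(mapi_machine_created_timestamp_seconds{node!=""} unless on(node) kube_node_info) > 0
```

Note this would also hide machines that genuinely never got a node, so it may not be the right fix; as the comment says, the root cause may be in machine-api/machine-api-operator itself.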
The PR is merged, and there are no issues in CI.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0581