We're seeing the following test fail fairly consistently (10 times out of the last 13 runs - https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-informing#release-openshift-ocp-e2e-aws-scaleup-rhel7-4.4&sort-by-flakiness= ) on the e2e-aws-scaleup-rhel7 suite:

[It] [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel/minimal]

See https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-e2e-aws-scaleup-rhel7-4.4/14 for a sample failure. The trace is below:

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:121]: Unexpected error:
<*errors.errorString | 0xc000aaf960>:
host command failed: error running /usr/bin/kubectl --server=https://api.ci-op-cb3jb5md-f9c74.origin-ci-int-aws.dev.rhcloud.com:6443 --kubeconfig=/tmp/admin.kubeconfig exec --namespace=e2e-test-prometheus-pvxlh execpodlnh49 -- /bin/sh -x -c curl -s -k -H 'Authorization: Bearer <token omitted>' "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%7D+%3E%3D+1":

Command stdout:

stderr:
+ curl -s -k -H 'Authorization: Bearer <token omitted>' 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%7D+%3E%3D+1'
command terminated with exit code 6

error:
exit status 6
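For reference, the URL-encoded query in the curl call above decodes to the following PromQL, which the test expects to return no results:

ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing"} >= 1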
Exit code 6 from curl indicates a DNS resolution failure ("could not resolve host"). Was Route 53 cleaned up prematurely? In any case, moving this to 4.5.0 since it is an e2e test problem in CI, not likely a product problem.
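(Exit code 6 is curl's CURLE_COULDNT_RESOLVE_HOST, and it is easy to reproduce against a name that cannot resolve; the .invalid hostname below is just an example:)

$ curl -s https://does-not-resolve.invalid/
$ echo $?
6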
Seen here:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-e2e-aws-scaleup-rhel7-4.4/18
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-e2e-aws-scaleup-rhel7-4.4/19
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-e2e-aws-scaleup-rhel7-4.4/20

Tagging with "buildcop".
This is a different issue from the other Prometheus job failures we've been seeing. It looks like a node is disappearing during the course of the tests.
Could we get some help from the node team to confirm the nodes are healthy during tests?
Looking at: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-e2e-aws-scaleup-rhel7-4.4/18

From masters-journal:

Apr 21 20:59:22.980451 ip-10-0-155-53 hyperkube[1379]: I0421 20:59:22.980247 1379 prober.go:116] Readiness probe for "sdn-lcznd_openshift-sdn(3763fa53-b0ce-406f-bcaa-a8d9b6a10599):sdn" failed (failure):
Apr 21 20:59:22.980451 ip-10-0-155-53 hyperkube[1379]: I0421 20:59:22.980364 1379 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-sdn", Name:"sdn-lcznd", UID:"3763fa53-b0ce-406f-bcaa-a8d9b6a10599", APIVersion:"v1", ResourceVersion:"2392", FieldPath:"spec.containers{sdn}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed:
Apr 21 20:59:24.453179 ip-10-0-155-53 crio[1308]: time="2020-04-21 20:59:24.453086730Z" level=info msg="exec'd [/bin/bash -c #!/bin/bash\n/usr/share/openvswitch/scripts/ovs-ctl status > /dev/null &&\n/usr/bin/ovs-appctl -T 5 ofproto/list > /dev/null &&\n/usr/bin/ovs-vsctl -t 5 show > /dev/null\n] in openshift-sdn/ovs-kfbwv/openvswitch" id=da126453-ff3c-4354-8383-19718d3c5ccc
Apr 21 20:59:24.453597 ip-10-0-155-53 hyperkube[1379]: I0421 20:59:24.453449 1379 prober.go:129] Readiness probe for "ovs-kfbwv_openshift-sdn(464474c6-6e58-4f62-aa89-c82920e89ff0):openvswitch" succeeded
Apr 21 20:59:24.583770 ip-10-0-155-53 hyperkube[1379]: E0421 20:59:24.583714 1379 kubelet.go:2194] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network

There were 75 occurrences of '):sdn" failed' in the masters-journal. For https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-ocp-e2e-aws-scaleup-rhel7-4.4/20/artifacts/e2e-aws-scaleup-rhel7/nodes/, there were 110 occurrences of '):sdn" failed' (readiness probe failures).
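(The counts above can be reproduced against the downloaded journal artifacts with something like the following; the local file name is whatever the masters-journal was saved as from the artifacts directory:)

# count the journal lines reporting a failed sdn readiness probe
$ grep -c '):sdn" failed' masters-journal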
From test run #20:

Apr 23 20:14:37.373280 ip-10-0-151-118 hyperkube[1376]: E0423 20:14:37.373148 1376 pod_workers.go:191] Error syncing pod 130833e6-dd8e-4310-a16b-8f4b710c9691 ("installer-2-ip-10-0-151-118.ec2.internal_openshift-kube-scheduler(130833e6-dd8e-4310-a16b-8f4b710c9691)"), skipping: network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network

For test run #18, there were 113 occurrences of the above form in the masters-journal.
From test run #20:

Apr 23 20:16:45.915847 ip-10-0-151-118 hyperkube[1376]: I0423 20:16:45.915528 1376 prober.go:116] Readiness probe for "kube-apiserver-ip-10-0-151-118.ec2.internal_openshift-kube-apiserver(78c948c66d882abadbb5d45ca08df1a3):kube-apiserver" failed (failure): Get https://10.0.151.118:6443/healthz: read tcp 10.0.151.118:40566->10.0.151.118:6443: read: connection reset by peer

This reminds me of BZ 1823950.
'Prometheus when installed on the cluster shouldn't report any alerts' has been passing for the past 3 days (17th to 19th).
7 failures in the last 48 hours:
https://search.apps.build01.ci.devcluster.openshift.com/?search=failed%3A.*Prometheus+when+installed+on+the+cluster+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured&maxAge=48h&context=1&type=build-log&name=rhel7&maxMatches=5&maxBytes=20971520&groupBy=job

pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7 - 77 runs, 77% failed, 5% of failures match
#6182 build-log.txt.gz 21 hours ago
failed: (39.6s) 2020-05-19T20:46:33 "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel]"
#6180 build-log.txt.gz 22 hours ago
failed: (35.8s) 2020-05-19T20:16:00 "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel]"
#6170 build-log.txt.gz 27 hours ago
failed: (35.7s) 2020-05-19T14:54:10 "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel]"

pull-ci-openshift-machine-api-operator-master-e2e-aws-scaleup-rhel7 - 42 runs, 79% failed, 3% of failures match
#900 build-log.txt.gz 21 hours ago
failed: (43s) 2020-05-19T20:50:29 "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel]"

pull-ci-openshift-openshift-ansible-master-e2e-aws-scaleup-rhel7 - 1 runs, 100% failed, 100% of failures match
#1112 build-log.txt.gz 26 hours ago
failed: (38.6s) 2020-05-19T16:24:25 "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel]"

pull-ci-openshift-installer-release-4.4-e2e-aws-scaleup-rhel7 - 24 runs, 58% failed, 7% of failures match
#350 build-log.txt.gz 39 hours ago
failed: (20.3s) 2020-05-19T02:52:28 "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel/minimal]"

pull-ci-openshift-machine-config-operator-release-4.4-e2e-aws-scaleup-rhel7 - 8 runs, 63% failed, 20% of failures match
#153 build-log.txt.gz 35 hours ago
failed: (27.7s) 2020-05-19T07:06:45 "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel/minimal]"

Example failure: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/3620/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/6182

Command stdout:

stderr:
+ curl -s -k -H 'Authorization: Bearer <omitted>' 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1'
command terminated with exit code 6

error:
exit status 6
For #6182, according to the timeline:

May 19 20:45:53.477 - 39s I test="[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel]" running

In the workers-journal, I searched for Prometheus-related probes (all succeeded), e.g.:

May 19 20:46:10.995477 ip-10-0-128-29.ec2.internal hyperkube[1578]: I0519 20:46:10.994896 1578 prober.go:133] Liveness probe for "prometheus-k8s-0_openshift-monitoring(42993aec-8c68-49f6-811a-1263aee61de1):prometheus" succeeded

There was no Prometheus probe entry in the masters-journal, but the few probes I did look at succeeded.
There was no '):sdn" failed' entry in the masters-journal or workers-journal for #6182.
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Looking through runs that hit this failure, I'm not finding any errors in any of the default DNS pods. The DNS operator logs also appear normal (at first I suspected the default DNS pods were not getting scheduled onto the new nodes properly during the RHEL scale-up, but that does not seem to be the case). This test has not failed on the e2e-aws-scaleup-rhel7 job in the last 3 days, but I suspect it may fail again soon.
I’m adding UpcomingSprint since I will be revisiting this low severity bug in the coming weeks.
This test is still affecting the stability of the RHEL7 worker scaleup jobs, and we would like to get this resolved before we ship another release. There have been 18 failures in the last 48 hours, so I'm increasing the severity. https://search.ci.openshift.org/?search=failed%3A.*Prometheus+when+installed+on+the+cluster+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog&maxAge=48h&context=1&type=build-log&name=rhel7&maxMatches=5&maxBytes=20971520&groupBy=job
Looks like this test fails when the Prometheus test's curl execpod is scheduled onto one of the new nodes brought up by the rhel7 scaleup job. The Prometheus test is most likely hitting that node's DNS pod just before the DNS pod is fully responsive. The failing test is the first Prometheus e2e test to run, and the remaining Prometheus tests that run after it do not hit the DNS error seen in this BZ, despite using the same curl code. I am going to add retry options to the Prometheus test execpod's curl call, which should prevent this error from occurring in the future.
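Roughly, the idea is something like the following minimal sketch (the attempt count, delay, and wrapper loop are illustrative only, not the exact change being made to the test; $TOKEN stands in for the service account bearer token the test already injects):

# Runs inside the execpod via `kubectl exec ... -- /bin/sh -c '...'`.
# Plain `curl --retry` only treats timeouts and HTTP 408/5xx responses as
# transient, so a DNS failure (exit code 6) needs an explicit retry loop.
url="https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=..."  # full ALERTS query from the test goes here
attempts=15
while [ "$attempts" -gt 0 ]; do
  if curl -s -k -H "Authorization: Bearer $TOKEN" "$url"; then
    exit 0                      # query succeeded
  fi
  attempts=$((attempts - 1))
  sleep 1                       # give DNS on a freshly scaled-up node time to become responsive
done
exit 1                          # still failing after all retries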
Checked https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-e2e-aws-scaleup-rhel7-4.6 and there has been no failure in the last week. Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196