We're seeing the following test fail fairly consistently (10 times out of the last 13 runs - https://testgrid.k8s.io/redhat-openshift-ocp-release-4.4-informing#release-openshift-ocp-e2e-aws-scaleup-rhel7-4.4&sort-by-flakiness= ) on the e2e-aws-scaleup-rhel7 suite:

[It] [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel/minimal]

See https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-e2e-aws-scaleup-rhel7-4.4/14 for a sample failure. The trace is below:

fail [github.com/openshift/origin/test/extended/prometheus/prometheus_builds.go:121]: Unexpected error:
<*errors.errorString | 0xc000aaf960>:
host command failed: error running /usr/bin/kubectl --server=https://api.ci-op-cb3jb5md-f9c74.origin-ci-int-aws.dev.rhcloud.com:6443 --kubeconfig=/tmp/admin.kubeconfig exec --namespace=e2e-test-prometheus-pvxlh execpodlnh49 -- /bin/sh -x -c curl -s -k -H 'Authorization: Bearer <token omitted>' "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%7D+%3E%3D+1":

Command stdout:

stderr:
+ curl -s -k -H 'Authorization: Bearer <token omitted>' 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%7D+%3E%3D+1'
command terminated with exit code 6

error:
exit status 6
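For reference, the URL-encoded query in the curl call above decodes to the following PromQL, which the test expects to return no results:

ALERTS{alertname!~"Watchdog|AlertmanagerReceiversNotConfigured|PrometheusRemoteWriteDesiredShards",alertstate="firing"} >= 1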
Exit code 6 from curl indicates a DNS resolution failure ("could not resolve host"). Was Route 53 cleaned up prematurely? In any case, moving this to 4.5.0 since it is an e2e test problem in CI, not likely a product problem.
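(Exit code 6 is curl's CURLE_COULDNT_RESOLVE_HOST, and it is easy to reproduce against a name that cannot resolve; the .invalid hostname below is just an example:)

$ curl -s https://does-not-resolve.invalid/
$ echo $?
6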
Seen here:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-e2e-aws-scaleup-rhel7-4.4/18
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-e2e-aws-scaleup-rhel7-4.4/19
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-e2e-aws-scaleup-rhel7-4.4/20

Tagging with "buildcop".
This is a different issue from the other Prometheus job failures we've been seeing. It looks like a node is disappearing during the course of the tests.
Could we get some help from the node team to confirm the nodes are healthy during tests?
Looking at: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/logs/release-openshift-ocp-e2e-aws-scaleup-rhel7-4.4/18

From masters-journal:

Apr 21 20:59:22.980451 ip-10-0-155-53 hyperkube[1379]: I0421 20:59:22.980247 1379 prober.go:116] Readiness probe for "sdn-lcznd_openshift-sdn(3763fa53-b0ce-406f-bcaa-a8d9b6a10599):sdn" failed (failure):
Apr 21 20:59:22.980451 ip-10-0-155-53 hyperkube[1379]: I0421 20:59:22.980364 1379 event.go:281] Event(v1.ObjectReference{Kind:"Pod", Namespace:"openshift-sdn", Name:"sdn-lcznd", UID:"3763fa53-b0ce-406f-bcaa-a8d9b6a10599", APIVersion:"v1", ResourceVersion:"2392", FieldPath:"spec.containers{sdn}"}): type: 'Warning' reason: 'Unhealthy' Readiness probe failed:
Apr 21 20:59:24.453179 ip-10-0-155-53 crio[1308]: time="2020-04-21 20:59:24.453086730Z" level=info msg="exec'd [/bin/bash -c #!/bin/bash\n/usr/share/openvswitch/scripts/ovs-ctl status > /dev/null &&\n/usr/bin/ovs-appctl -T 5 ofproto/list > /dev/null &&\n/usr/bin/ovs-vsctl -t 5 show > /dev/null\n] in openshift-sdn/ovs-kfbwv/openvswitch" id=da126453-ff3c-4354-8383-19718d3c5ccc
Apr 21 20:59:24.453597 ip-10-0-155-53 hyperkube[1379]: I0421 20:59:24.453449 1379 prober.go:129] Readiness probe for "ovs-kfbwv_openshift-sdn(464474c6-6e58-4f62-aa89-c82920e89ff0):openvswitch" succeeded
Apr 21 20:59:24.583770 ip-10-0-155-53 hyperkube[1379]: E0421 20:59:24.583714 1379 kubelet.go:2194] Container runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network

There were 75 occurrences of '):sdn" failed' in the masters-journal. For https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/logs/release-openshift-ocp-e2e-aws-scaleup-rhel7-4.4/20/artifacts/e2e-aws-scaleup-rhel7/nodes/, there were 110 occurrences of '):sdn" failed' (readiness probe failures).
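(The counts above can be reproduced against the downloaded journal artifacts with something like the following; the local file name is whatever the masters-journal was saved as from the artifacts directory:)

# count the journal lines reporting a failed sdn readiness probe
$ grep -c '):sdn" failed' masters-journal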
From test run #20:

Apr 23 20:14:37.373280 ip-10-0-151-118 hyperkube[1376]: E0423 20:14:37.373148 1376 pod_workers.go:191] Error syncing pod 130833e6-dd8e-4310-a16b-8f4b710c9691 ("installer-2-ip-10-0-151-118.ec2.internal_openshift-kube-scheduler(130833e6-dd8e-4310-a16b-8f4b710c9691)"), skipping: network is not ready: runtime network not ready: NetworkReady=false reason:NetworkPluginNotReady message:Network plugin returns error: Missing CNI default network

For test run #18, there were 113 occurrences of the above form in the masters-journal.
From test run #20:

Apr 23 20:16:45.915847 ip-10-0-151-118 hyperkube[1376]: I0423 20:16:45.915528 1376 prober.go:116] Readiness probe for "kube-apiserver-ip-10-0-151-118.ec2.internal_openshift-kube-apiserver(78c948c66d882abadbb5d45ca08df1a3):kube-apiserver" failed (failure): Get https://10.0.151.118:6443/healthz: read tcp 10.0.151.118:40566->10.0.151.118:6443: read: connection reset by peer

This reminds me of BZ 1823950.
'Prometheus when installed on the cluster shouldn't report any alerts' has been passing for the past 3 days (17th to 19th).
7 failures in the last 48 hours:
https://search.apps.build01.ci.devcluster.openshift.com/?search=failed%3A.*Prometheus+when+installed+on+the+cluster+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog+and+AlertmanagerReceiversNotConfigured&maxAge=48h&context=1&type=build-log&name=rhel7&maxMatches=5&maxBytes=20971520&groupBy=job

pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7 - 77 runs, 77% failed, 5% of failures match
#6182 build-log.txt.gz 21 hours ago
failed: (39.6s) 2020-05-19T20:46:33 "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel]"
#6180 build-log.txt.gz 22 hours ago
failed: (35.8s) 2020-05-19T20:16:00 "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel]"
#6170 build-log.txt.gz 27 hours ago
failed: (35.7s) 2020-05-19T14:54:10 "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel]"

pull-ci-openshift-machine-api-operator-master-e2e-aws-scaleup-rhel7 - 42 runs, 79% failed, 3% of failures match
#900 build-log.txt.gz 21 hours ago
failed: (43s) 2020-05-19T20:50:29 "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel]"

pull-ci-openshift-openshift-ansible-master-e2e-aws-scaleup-rhel7 - 1 runs, 100% failed, 100% of failures match
#1112 build-log.txt.gz 26 hours ago
failed: (38.6s) 2020-05-19T16:24:25 "[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel]"

pull-ci-openshift-installer-release-4.4-e2e-aws-scaleup-rhel7 - 24 runs, 58% failed, 7% of failures match
#350 build-log.txt.gz 39 hours ago
failed: (20.3s) 2020-05-19T02:52:28 "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel/minimal]"

pull-ci-openshift-machine-config-operator-release-4.4-e2e-aws-scaleup-rhel7 - 8 runs, 63% failed, 20% of failures match
#153 build-log.txt.gz 35 hours ago
failed: (27.7s) 2020-05-19T07:06:45 "[Feature:Prometheus][Conformance] Prometheus when installed on the cluster [Top Level] [Feature:Prometheus][Conformance] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel/minimal]"

Example failure: https://deck-ci.apps.ci.l2s4.p1.openshiftapps.com/view/gcs/origin-ci-test/pr-logs/pull/openshift_installer/3620/pull-ci-openshift-installer-master-e2e-aws-scaleup-rhel7/6182

Command stdout:

stderr:
+ curl -s -k -H 'Authorization: Bearer <omitted>' 'https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=ALERTS%7Balertname%21~%22Watchdog%7CAlertmanagerReceiversNotConfigured%7CPrometheusRemoteWriteDesiredShards%22%2Calertstate%3D%22firing%22%2Cseverity%21%3D%22info%22%7D+%3E%3D+1'
command terminated with exit code 6

error:
exit status 6
For #6182, according to the timeline:

May 19 20:45:53.477 - 39s I test="[sig-instrumentation] Prometheus when installed on the cluster shouldn't report any alerts in firing state apart from Watchdog and AlertmanagerReceiversNotConfigured [Early] [Suite:openshift/conformance/parallel]" running

In the workers-journal, I searched for Prometheus-related probes (all succeeded), e.g.:

May 19 20:46:10.995477 ip-10-0-128-29.ec2.internal hyperkube[1578]: I0519 20:46:10.994896 1578 prober.go:133] Liveness probe for "prometheus-k8s-0_openshift-monitoring(42993aec-8c68-49f6-811a-1263aee61de1):prometheus" succeeded

There was no Prometheus probe entry in the masters-journal, but the few probes I did look at succeeded.
There was no '):sdn" failed' entry in the masters-journal or workers-journal for #6182.
I’m adding UpcomingSprint, because I was occupied by fixing bugs with higher priority/severity, developing new features with higher priority, or developing new features to improve stability at a macro level. I will revisit this bug next sprint.
Looking through runs that hit this failure, I'm not finding any errors in any of the default DNS pods. The DNS operator logs also appear normal (at first I suspected the default DNS pods were not getting scheduled onto the new nodes properly during the RHEL scale-up, but that does not seem to be the case). This test has not failed on the e2e-aws-scaleup-rhel7 job in the last 3 days, but I suspect it may fail again soon.
I’m adding UpcomingSprint since I will be revisiting this low severity bug in the coming weeks.
This test is still affecting the stability of the RHEL7 worker scaleup jobs, and we would like to get this resolved before we ship another release. There have been 18 failures in the last 48 hours, so I'm increasing the severity. https://search.ci.openshift.org/?search=failed%3A.*Prometheus+when+installed+on+the+cluster+shouldn%27t+report+any+alerts+in+firing+state+apart+from+Watchdog&maxAge=48h&context=1&type=build-log&name=rhel7&maxMatches=5&maxBytes=20971520&groupBy=job
Looks like this test fails when the Prometheus test's curl execpod is scheduled onto one of the new nodes brought up by the rhel7 scaleup job. The Prometheus test is most likely hitting that node's DNS pod just before the DNS pod is fully responsive. The failing test is the first Prometheus e2e test to run, and the remaining Prometheus tests that run after it do not hit the DNS error seen in this BZ, despite using the same curl code. I am going to add retry options to the Prometheus test execpod's curl call, which should prevent this error from occurring in the future.
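Roughly, the idea is something like the following minimal sketch (the attempt count, delay, and wrapper loop are illustrative only, not the exact change being made to the test; $TOKEN stands in for the service account bearer token the test already injects):

# Runs inside the execpod via `kubectl exec ... -- /bin/sh -c '...'`.
# Plain `curl --retry` only treats timeouts and HTTP 408/5xx responses as
# transient, so a DNS failure (exit code 6) needs an explicit retry loop.
url="https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/query?query=..."  # full ALERTS query from the test goes here
attempts=15
while [ "$attempts" -gt 0 ]; do
  if curl -s -k -H "Authorization: Bearer $TOKEN" "$url"; then
    exit 0                      # query succeeded
  fi
  attempts=$((attempts - 1))
  sleep 1                       # give DNS on a freshly scaled-up node time to become responsive
done
exit 1                          # still failing after all retries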
Checked https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-e2e-aws-scaleup-rhel7-4.6 and there has been no failure in the last week. Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4196