About 12 Prometheus tests are failing in a cluster upgraded from 4.1 to 4.2 to 4.3 to 4.4. This may be a serious platform issue or a test issue, but if it is a test issue, it is blocking us from knowing whether the metrics are right. This needs immediate triage to determine why the tests are failing, and the fix needs to land ASAP so we can identify whether other blockers exist.
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly/22
[Feature:Prometheus][Conformance] Prometheus when installed on the cluster should provide ingress metrics [Suite:openshift/conformance/parallel/minimal] 34s
fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:308]: Unexpected error: <*errors.errorString | 0xc003c6dd60>: { s: "host command failed: error running /usr/bin/kubectl --server=https://api.ci-op-p9tp56ty-599c3.origin-ci-int-aws.dev.rhcloud.com:6443 --kubeconfig=/tmp/admin.kubeconfig exec --namespace=e2e-test-prometheus-8r97x execpod7fghx -- /bin/sh -x -c curl -s -k -H 'Authorization: Bearer 
eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tN2Y5djUiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiMjM5MzFiMmMtNjgwMy0xMWVhLThiYWQtMGFkZjc4YmFkZmMzIiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.qA3Qtn_B0_qDar0RPfLJq7PrDtYE1kP6zr2IkE27KDYkgld2Bx8Rj3V-dc5IgNkMss_V9k1tGj3x-O5huJ5jUDZ0VTLtMTYdVWjA2pZt5ZMjtAqpXXwBdYQ4TrZpnpVp3t8wIMXNP-Ka0g40q_tmetwRaxhzvHUz_G1B1TF7gqEc327W8-AqUx7LHVA2JdfYtJ6DUC9AByP6uZxUfQgWdrjRYM4pDKeGWrQjTV-y_PQZIC8dyQTMtEcR4OkLPUEs4aZiUCe9zCxWjSr2CIwjp6IKCax89-377IDOWVhipbHBomGYpIGxBMmjsiSdj2I6177Uf-3Wx3_Gn57s-SPU1Q' \"https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/targets\":\nCommand stdout:\n\nstderr:\n+ curl -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tN2Y5djUiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiMjM5MzFiMmMtNjgwMy0xMWVhLThiYWQtMGFkZjc4YmFkZmMzIiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.qA3Qtn_B0_qDar0RPfLJq7PrDtYE1kP6zr2IkE27KDYkgld2Bx8Rj3V-dc5IgNkMss_V9k1tGj3x-O5huJ5jUDZ0VTLtMTYdVWjA2pZt5ZMjtAqpXXwBdYQ4TrZpnpVp3t8wIMXNP-Ka0g40q_tmetwRaxhzvHUz_G1B1TF7gqEc327W8-AqUx7LHVA2JdfYtJ6DUC9AByP6uZxUfQgWdrjRYM4pDKeGWrQjTV-y_PQZIC8dyQTMtEcR4OkLPUEs4aZiUCe9zCxWjSr2CIwjp6IKCax89-377IDOWVhipbHBomGYpIGxBMmjsiSdj2I6177Uf-3Wx3_Gn57s-SPU1Q' 
https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/targets\ncommand terminated with exit code 35\n\nerror:\nexit status 35\n", }
As mentioned on Slack, I suspect this is a test setup issue: the dumps of Prometheus and the logs of cluster-monitoring-operator and Prometheus itself show no errors or failures. The alerts there suggest an SDN problem in run #22.
I don't see any issues with the monitoring stack. Reassigning to the networking team, as the only identified issues are with SDN.
Follow-up comment to clarify: we suspect that Prometheus can't be queried because of networking problems in the cluster. In the dump we can see that metrics and alerts are present for certain components.
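For anyone triaging this further: curl documents exit code 35 as an SSL connect error, meaning the TLS handshake did not complete. One way to narrow down whether that points at the network path is to probe the endpoint in stages. The following is only a minimal sketch, not something taken from the test suite; the function name `probe` and its return values are made up for illustration, and it skips certificate verification the same way `curl -k` does.

```python
import socket
import ssl


def probe(host: str, port: int, timeout: float = 5.0) -> str:
    """Return the first stage that fails: 'dns', 'tcp', 'tls', or 'ok'.

    Hypothetical helper for illustration; not part of openshift/origin.
    """
    # Stage 1: name resolution.
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns"

    # Stage 2: plain TCP connect to the first resolved address.
    family, socktype, proto, _, addr = infos[0]
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        sock.connect(addr)
    except OSError:
        sock.close()
        return "tcp"

    # Stage 3: TLS handshake without certificate verification (like curl -k).
    ctx = ssl.create_default_context()
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    try:
        with ctx.wrap_socket(sock, server_hostname=host):
            return "ok"
    except (ssl.SSLError, OSError):
        sock.close()
        return "tls"


if __name__ == "__main__":
    # Host and port are the ones from the failing curl command; the name
    # only resolves from inside a pod in the cluster.
    print(probe("prometheus-k8s.openshift-monitoring.svc", 9091))
```

Run from an exec pod like the one the test uses, a result of `tls` would match the curl exit code 35 seen here, while `dns` or `tcp` would indicate the failure happens earlier in the connection.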
Removing UpgradeBlocker from this older bug to take it out of the suspect queue described in [1]. If you feel this bug still needs to be a suspect, please add the keyword again. [1]: https://github.com/openshift/enhancements/pull/475