Bug 1814334

Summary: Chunk of 4.4 prometheus tests are failing on cluster upgraded from 4.1 to 4.2 to 4.3 to 4.4
Product: OpenShift Container Platform Reporter: Clayton Coleman <ccoleman>
Component: Networking Assignee: Jacob Tanenbaum <jtanenba>
Networking sub component: openshift-sdn QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DUPLICATE Docs Contact:
Severity: urgent    
Priority: unspecified CC: alegrand, anpicker, erooth, kakkoyun, lcosic, mloibl, pkrupa, ricarril, surbania
Version: 4.4   
Target Milestone: ---   
Target Release: 4.4.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-03-24 20:59:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Clayton Coleman 2020-03-17 16:51:14 UTC
12 or so prometheus tests are failing in a cluster upgraded from 4.1 to 4.2 to 4.3 to 4.4.  This may be a serious platform issue or a test issue, but if it's a test issue it's blocking us from understanding if the metrics are right.

Needs immediate triage to determine why the test is failing, and the fix needs to land ASAP so we can identify whether other blockers exist.

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.1-to-4.2-to-4.3-to-4.4-nightly/22


[Feature:Prometheus][Conformance] Prometheus when installed on the cluster should provide ingress metrics [Suite:openshift/conformance/parallel/minimal] (34s)
fail [github.com/openshift/origin/test/extended/prometheus/prometheus.go:308]: Unexpected error:
    <*errors.errorString | 0xc003c6dd60>: {
        s: "host command failed: error running /usr/bin/kubectl --server=https://api.ci-op-p9tp56ty-599c3.origin-ci-int-aws.dev.rhcloud.com:6443 --kubeconfig=/tmp/admin.kubeconfig exec --namespace=e2e-test-prometheus-8r97x execpod7fghx -- /bin/sh -x -c curl -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tN2Y5djUiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiMjM5MzFiMmMtNjgwMy0xMWVhLThiYWQtMGFkZjc4YmFkZmMzIiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.qA3Qtn_B0_qDar0RPfLJq7PrDtYE1kP6zr2IkE27KDYkgld2Bx8Rj3V-dc5IgNkMss_V9k1tGj3x-O5huJ5jUDZ0VTLtMTYdVWjA2pZt5ZMjtAqpXXwBdYQ4TrZpnpVp3t8wIMXNP-Ka0g40q_tmetwRaxhzvHUz_G1B1TF7gqEc327W8-AqUx7LHVA2JdfYtJ6DUC9AByP6uZxUfQgWdrjRYM4pDKeGWrQjTV-y_PQZIC8dyQTMtEcR4OkLPUEs4aZiUCe9zCxWjSr2CIwjp6IKCax89-377IDOWVhipbHBomGYpIGxBMmjsiSdj2I6177Uf-3Wx3_Gn57s-SPU1Q' \"https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/targets\":\nCommand stdout:\n\nstderr:\n+ curl -s -k -H 'Authorization: Bearer 
eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tN2Y5djUiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiMjM5MzFiMmMtNjgwMy0xMWVhLThiYWQtMGFkZjc4YmFkZmMzIiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.qA3Qtn_B0_qDar0RPfLJq7PrDtYE1kP6zr2IkE27KDYkgld2Bx8Rj3V-dc5IgNkMss_V9k1tGj3x-O5huJ5jUDZ0VTLtMTYdVWjA2pZt5ZMjtAqpXXwBdYQ4TrZpnpVp3t8wIMXNP-Ka0g40q_tmetwRaxhzvHUz_G1B1TF7gqEc327W8-AqUx7LHVA2JdfYtJ6DUC9AByP6uZxUfQgWdrjRYM4pDKeGWrQjTV-y_PQZIC8dyQTMtEcR4OkLPUEs4aZiUCe9zCxWjSr2CIwjp6IKCax89-377IDOWVhipbHBomGYpIGxBMmjsiSdj2I6177Uf-3Wx3_Gn57s-SPU1Q' https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/targets\ncommand terminated with exit code 35\n\nerror:\nexit status 35\n",
    }
    host command failed: error running /usr/bin/kubectl --server=https://api.ci-op-p9tp56ty-599c3.origin-ci-int-aws.dev.rhcloud.com:6443 --kubeconfig=/tmp/admin.kubeconfig exec --namespace=e2e-test-prometheus-8r97x execpod7fghx -- /bin/sh -x -c curl -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tN2Y5djUiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiMjM5MzFiMmMtNjgwMy0xMWVhLThiYWQtMGFkZjc4YmFkZmMzIiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.qA3Qtn_B0_qDar0RPfLJq7PrDtYE1kP6zr2IkE27KDYkgld2Bx8Rj3V-dc5IgNkMss_V9k1tGj3x-O5huJ5jUDZ0VTLtMTYdVWjA2pZt5ZMjtAqpXXwBdYQ4TrZpnpVp3t8wIMXNP-Ka0g40q_tmetwRaxhzvHUz_G1B1TF7gqEc327W8-AqUx7LHVA2JdfYtJ6DUC9AByP6uZxUfQgWdrjRYM4pDKeGWrQjTV-y_PQZIC8dyQTMtEcR4OkLPUEs4aZiUCe9zCxWjSr2CIwjp6IKCax89-377IDOWVhipbHBomGYpIGxBMmjsiSdj2I6177Uf-3Wx3_Gn57s-SPU1Q' "https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/targets":
    Command stdout:
    
    stderr:
    + curl -s -k -H 'Authorization: Bearer eyJhbGciOiJSUzI1NiIsImtpZCI6IiJ9.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJvcGVuc2hpZnQtbW9uaXRvcmluZyIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJwcm9tZXRoZXVzLWFkYXB0ZXItdG9rZW4tN2Y5djUiLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoicHJvbWV0aGV1cy1hZGFwdGVyIiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9zZXJ2aWNlLWFjY291bnQudWlkIjoiMjM5MzFiMmMtNjgwMy0xMWVhLThiYWQtMGFkZjc4YmFkZmMzIiwic3ViIjoic3lzdGVtOnNlcnZpY2VhY2NvdW50Om9wZW5zaGlmdC1tb25pdG9yaW5nOnByb21ldGhldXMtYWRhcHRlciJ9.qA3Qtn_B0_qDar0RPfLJq7PrDtYE1kP6zr2IkE27KDYkgld2Bx8Rj3V-dc5IgNkMss_V9k1tGj3x-O5huJ5jUDZ0VTLtMTYdVWjA2pZt5ZMjtAqpXXwBdYQ4TrZpnpVp3t8wIMXNP-Ka0g40q_tmetwRaxhzvHUz_G1B1TF7gqEc327W8-AqUx7LHVA2JdfYtJ6DUC9AByP6uZxUfQgWdrjRYM4pDKeGWrQjTV-y_PQZIC8dyQTMtEcR4OkLPUEs4aZiUCe9zCxWjSr2CIwjp6IKCax89-377IDOWVhipbHBomGYpIGxBMmjsiSdj2I6177Uf-3Wx3_Gn57s-SPU1Q' https://prometheus-k8s.openshift-monitoring.svc:9091/api/v1/targets
    command terminated with exit code 35
    
    error:
    exit status 35
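
For triage, it may help to note that curl documents exit code 35 as an SSL connect error, meaning the TLS handshake with the server did not complete. That is consistent with, but does not prove, a connectivity problem between the exec pod and the Prometheus service. The sketch below is only an illustration: the `explain_exit` helper and the short table of codes are assumptions for this example, not part of the origin test suite, and the table covers only a handful of curl's documented exit codes.

```python
import re

# A small subset of curl's documented exit codes (see the EXIT CODES
# section of the curl man page). This table is illustrative, not complete.
CURL_EXIT_CODES = {
    6: "could not resolve host",
    7: "failed to connect to host",
    28: "operation timed out",
    35: "SSL connect error (TLS handshake failed)",
    52: "server returned an empty reply",
    56: "failure receiving network data",
    60: "peer certificate could not be verified",
}


def explain_exit(log_text):
    """Hypothetical helper: pull the exit code out of kubectl exec output.

    Returns (code, meaning), or None when the log has no exit-code line.
    """
    match = re.search(r"command terminated with exit code (\d+)", log_text)
    if match is None:
        return None
    code = int(match.group(1))
    return code, CURL_EXIT_CODES.get(code, "unknown; see the curl man page")


if __name__ == "__main__":
    sample = "stderr:\n+ curl -s -k ...\ncommand terminated with exit code 35\n"
    print(explain_exit(sample))
```

Because the test already passes `-k`, certificate verification is disabled, so a failure at this stage points at the handshake itself rather than at certificate trust.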

Comment 1 Lili Cosic 2020-03-17 17:11:41 UTC
As mentioned on Slack, I suspect this is a test setup issue: the Prometheus dumps and the logs of cluster-monitoring-operator and Prometheus itself show no errors or failures. The alerts there suggest an SDN problem in run #22.

Comment 3 Pawel Krupa 2020-03-19 08:08:23 UTC
I don't see any issues with the monitoring stack. Reassigning to the networking team, as the only identified issues are with SDN.

Comment 4 Lili Cosic 2020-03-19 08:37:27 UTC
Follow-up comment to clarify: we suspect that Prometheus can't be queried because of networking problems in the cluster. The dump shows that metrics and alerts are present for certain components.
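
One way to separate a networking failure from a Prometheus failure is to check which connection stage breaks first when reaching the service endpoint: name resolution, TCP connect, or the TLS handshake. The following is a minimal sketch under stated assumptions: the `probe` function is hypothetical, it would have to run from a pod inside the cluster to be meaningful, and it disables certificate verification to mirror the `-k` flag in the failing curl command.

```python
import socket
import ssl


def probe(host, port, timeout=5.0):
    """Return the first failing stage: 'dns', 'tcp', 'tls', or 'ok'."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        return "dns"  # the name did not resolve

    family, socktype, proto, _, addr = infos[0]
    sock = socket.socket(family, socktype, proto)
    sock.settimeout(timeout)
    try:
        try:
            sock.connect(addr)
        except OSError:
            return "tcp"  # refused, unreachable, or timed out

        # Skip certificate checks, as curl -k does in the failing test.
        ctx = ssl.create_default_context()
        ctx.check_hostname = False
        ctx.verify_mode = ssl.CERT_NONE
        try:
            tls = ctx.wrap_socket(sock, server_hostname=host)
        except (ssl.SSLError, OSError):
            return "tls"  # handshake did not complete
        tls.close()
        return "ok"
    finally:
        sock.close()


if __name__ == "__main__":
    # Host and port taken from the failing curl command in this report.
    print(probe("prometheus-k8s.openshift-monitoring.svc", 9091))
```

A result of `tls` would correspond to the stage at which the curl command in the description failed, while `dns` or `tcp` would point at an earlier break in the path.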

Comment 6 W. Trevor King 2021-04-05 17:46:14 UTC
Removing UpgradeBlocker from this older bug, to remove it from the suspect queue described in [1].  If you feel like this bug still needs to be a suspect, please add the keyword again.

[1]: https://github.com/openshift/enhancements/pull/475