Description of problem:
We are failing the Prometheus availability tests on GCP, but the master is available. So either the rule in the test is incorrect, or the endpoint is not responding when the server starts. Either way, we need to fix this before we ship.

https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-network-operator/727/pull-ci-openshift-cluster-network-operator-master-e2e-gcp-ovn/1289911562324152320

Version-Release number of selected component (if applicable):
4.6

How reproducible:
Every time.

Steps to Reproduce:
1. Run the e2e-gcp-ovn test suite

Actual results:
Failing Prometheus tests.

Expected results:
All tests pass.

Additional info:
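To separate the two suspects above ("the rule in the test is wrong" vs. "the endpoint is not responding"), it helps to look at what an availability check against the standard Prometheus HTTP API distinguishes. This is only a hedged sketch: the query, the `prometheus-k8s` job label, and the helper name are illustrative assumptions, not the actual e2e rule.

```python
import json

# Hypothetical helper: decide whether Prometheus reports its targets as up,
# given a response body from /api/v1/query?query=up{job="prometheus-k8s"}.
# The query and job label are assumptions, not the real e2e test rule.
def prometheus_available(body: str) -> bool:
    resp = json.loads(body)
    if resp.get("status") != "success":
        # The endpoint responded, but the query itself failed.
        return False
    results = resp["data"]["result"]
    if not results:
        # No samples at all: the target isn't being scraped, or the
        # rule's label selector matches nothing -- i.e. the rule is wrong.
        return False
    # "up" is 1 when the last scrape of the target succeeded.
    return all(float(s["value"][1]) == 1 for s in results)

# Sample response in the shape the Prometheus HTTP API returns for an
# instant vector query.
sample = json.dumps({
    "status": "success",
    "data": {"resultType": "vector",
             "result": [{"metric": {"job": "prometheus-k8s"},
                         "value": [1596240000, "1"]}]},
})
print(prometheus_available(sample))  # True when every target reports up == 1
```

An empty result vector and a non-success status both come back as unavailable, which is why the test alone can't tell a bad rule from a slow-starting endpoint.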
Ryan is on leave
David made a test to observe this happening in the e2e tests: https://github.com/openshift/origin/pull/25391. It's not clear what would cause the delay; the kubelet's watch-based manager does print an error after just a second: https://github.com/kubernetes/kubernetes/blob/eb8b5a9854e3e113ac6d4d3e66d01548488b8e2f/pkg/kubelet/util/manager/watch_based_manager.go#L189-L192. This has roughly a 5-10% failure rate in e2e-aws: https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-blocking#release-openshift-ocp-installer-e2e-aws-4.6&width=5&include-filter-by-regex=sig-instrumentation&include-filter-by-regex=Prometheus%20when%20installed
The watch-based manager in the kubelet hasn't seen a lot of recent activity: https://github.com/kubernetes/kubernetes/commits/master/pkg/kubelet/util/manager/watch_based_manager.go
This hasn't failed in 5 days at this point.
Looks like this may have been a transient issue. The high flake rate has stopped, and the failures in these tests now are not related to the Prometheus pods failing to start.