Description of problem:
Recently I found that some of our tests fail with the error

unable to get route: the server is currently unable to handle the request (get routes.route.openshift.io testroute)

In our test we have disabled the openshift-apiserver operator, so openshift-apiserver shouldn't be interrupted and should stay available during the test.

Version-Release number of selected component (if applicable):
master

How reproducible:
Sometimes. There are a few occurrences:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/428/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-operator/2202
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/437/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-operator/2196
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/428/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-operator/2195

Expected results:
The apiserver stays highly available during the test suite and always handles `get routes.route.openshift.io` requests.
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/437/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-operator/2196 seems to be a one-off issue. The test tried to connect to the server but failed with "unable to get route: the server is currently unable to handle the request (get routes.route.openshift.io testroute)". The error message implies that the server returned an HTTP 503 status code. However, I didn't find a single request with that HTTP status (see attached file). Additionally, the `aggregator_unavailable_apiservice` metric didn't report anything.
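For anyone repeating the search for 503 responses, here is a minimal sketch of one way to do it. It assumes the requests are available as Kubernetes audit log events, one JSON object per line, with the usual `requestURI` and `responseStatus.code` fields; the attached file may use a different format. The function name `find_unavailable` and the sample events are hypothetical and exist only for illustration.

```python
import json


def find_unavailable(lines, group="route.openshift.io", code=503):
    """Return audit events for API group `group` whose response status is `code`.

    Assumption: each line is one JSON-encoded audit event with
    `requestURI` and `responseStatus.code` fields.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(event, dict):
            continue  # skip JSON values that are not objects
        status = (event.get("responseStatus") or {}).get("code")
        uri = event.get("requestURI", "")
        if status == code and f"/apis/{group}/" in uri:
            hits.append(event)
    return hits


# Made-up sample events, not taken from the CI runs above.
sample = [
    '{"verb":"get","requestURI":"/apis/route.openshift.io/v1/namespaces/ns/routes/testroute","responseStatus":{"code":200}}',
    '{"verb":"get","requestURI":"/apis/route.openshift.io/v1/namespaces/ns/routes/testroute","responseStatus":{"code":503}}',
    '{"verb":"get","requestURI":"/api/v1/namespaces/ns/pods","responseStatus":{"code":503}}',
]

for event in find_unavailable(sample):
    print(event["verb"], event["requestURI"], event["responseStatus"]["code"])
```

An empty result for a run would match what is described above: no request to the route.openshift.io group was answered with 503 by the servers whose logs were searched.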
Created attachment 1662648 [details] http request to route.openshift.io group
Created attachment 1662649 [details] aggregator_unavailable_apiservice metrics
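The metric check can be sketched in a similar way. This assumes the metrics were dumped in the Prometheus text exposition format and that `aggregator_unavailable_apiservice` is a gauge with a `name` label identifying the APIService; the helper `unavailable_apiservices` and the sample values are hypothetical.

```python
import re

_NAME_RE = re.compile(r'name="([^"]*)"')


def unavailable_apiservices(metrics_text, metric="aggregator_unavailable_apiservice"):
    """Return APIService names whose gauge value is greater than zero.

    Assumption: `metrics_text` is in Prometheus text exposition format and
    each sample carries a `name` label.
    """
    unavailable = []
    for line in metrics_text.splitlines():
        line = line.strip()
        # Require "{" right after the metric name so that other metrics
        # sharing the same prefix are not matched.
        if not line.startswith(metric + "{"):
            continue
        labels, sep, rest = line.partition("}")
        fields = rest.split()
        if not sep or not fields:
            continue
        try:
            value = float(fields[0])
        except ValueError:
            continue
        match = _NAME_RE.search(labels)
        if match and value > 0:
            unavailable.append(match.group(1))
    return unavailable


# Made-up sample, not taken from the attachment.
sample_metrics = """\
# TYPE aggregator_unavailable_apiservice gauge
aggregator_unavailable_apiservice{name="v1.route.openshift.io"} 0
aggregator_unavailable_apiservice{name="v1.image.openshift.io"} 1
"""

print(unavailable_apiservices(sample_metrics))
```

A single scrape only reflects the state at that moment, so a short outage between scrapes could still be missed.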
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/428/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-operator/2202 is similar. Additionally, one test reports "dial tcp 10.0.131.15:10250: connect: connection refused".
I've briefly checked other runs to see whether they suffer from the same issue and haven't found any (https://prow.svc.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-operator).
Assuming that the graceful mechanism we have works, you shouldn't see any interruptions, even during restarts. Oleg, does the issue still occur?
I haven't seen this issue for a while.