Bug 1791905

Summary: the server is sometimes unable to handle the get routes.route.openshift.io request
Product: OpenShift Container Platform
Reporter: Oleg Bulatov <obulatov>
Component: openshift-apiserver
Assignee: Lukasz Szaszkiewicz <lszaszki>
Status: CLOSED CURRENTRELEASE
QA Contact: Xingxing Xia <xxia>
Severity: unspecified
Priority: unspecified
Version: 4.4
CC: adam.kaplan, aos-bugs, mfojtik, scuppett, sttts
Target Release: 4.4.0
Hardware: Unspecified
OS: Unspecified
Doc Type: No Doc Update
Type: Bug
Last Closed: 2020-02-17 16:52:36 UTC
Attachments:
  http request to route.openshift.io group (flags: none)
  aggregator_unavailable_apiservice metrics (flags: none)

Description Oleg Bulatov 2020-01-16 16:48:14 UTC
Description of problem:

Recently I found that some of our tests fail because of the error

unable to get route: the server is currently unable to handle the request (get routes.route.openshift.io testroute)

In our test we have disabled the openshift-apiserver operator, so openshift-apiserver shouldn't be interrupted and should stay available during the test.
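
For reference, here is a minimal sketch (not the actual test code) of what such a GET against the aggregated route.openshift.io API looks like with the OpenShift client-go bindings; the namespace, route name, and kubeconfig handling are placeholders, and the method signatures follow recent client-go releases:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"

	routeclient "github.com/openshift/client-go/route/clientset/versioned"
)

func main() {
	// Build a client config from the local kubeconfig (placeholder for however
	// the test suite actually obtains its REST config).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}

	client, err := routeclient.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// This GET is served by openshift-apiserver through the kube-apiserver
	// aggregator; when the aggregated apiservice is unavailable, it fails with
	// "the server is currently unable to handle the request".
	route, err := client.RouteV1().Routes("test-namespace").Get(context.TODO(), "testroute", metav1.GetOptions{})
	if err != nil {
		fmt.Printf("unable to get route: %v\n", err)
		return
	}
	fmt.Printf("route host: %s\n", route.Spec.Host)
}
```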

Version-Release number of selected component (if applicable):

master

How reproducible:

Sometimes. There have been a few occurrences:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/428/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-operator/2202
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/437/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-operator/2196
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/428/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-operator/2195

Expected results:

the apiserver stays highly available during the test suite and always handles `get routes.route.openshift.io` requests
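
One way to express that expectation as a check is to poll the GET for the duration of the test window and count any failure as a disruption. This is a hedged sketch using the same placeholder names as the snippet above, not the suite's real assertion:

```go
package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/tools/clientcmd"

	routeclient "github.com/openshift/client-go/route/clientset/versioned"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := routeclient.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Poll the aggregated API for 10 minutes; every failed GET counts as a
	// disruption of routes.route.openshift.io availability.
	deadline := time.Now().Add(10 * time.Minute)
	disruptions := 0
	for time.Now().Before(deadline) {
		if _, err := client.RouteV1().Routes("test-namespace").Get(context.TODO(), "testroute", metav1.GetOptions{}); err != nil {
			disruptions++
			fmt.Printf("%s get route failed: %v\n", time.Now().Format(time.RFC3339), err)
		}
		time.Sleep(5 * time.Second)
	}
	fmt.Printf("disruptions observed: %d\n", disruptions)
}
```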

Comment 2 Lukasz Szaszkiewicz 2020-02-12 11:21:31 UTC
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/437/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-operator/2196 seems to be a one-off issue.

The test tried to connect to the server but failed with "unable to get route: the server is currently unable to handle the request (get routes.route.openshift.io testroute)". That error message implies the server returned an HTTP 503 status code. However, I didn't find a single request with that status (see the attached file). Additionally, the `aggregator_unavailable_apiservice` metric didn't report anything.
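
For context, here is a hedged sketch of one way to run that kind of metric check against an in-cluster Prometheus with the prometheus/client_golang API client; the Prometheus address and the query window are placeholders, not the exact query used in this investigation:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
)

func main() {
	// Placeholder address, e.g. a port-forward to the cluster Prometheus.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		panic(err)
	}
	promAPI := promv1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// A non-empty result would mean the kube-apiserver aggregator marked an
	// apiservice (e.g. v1.route.openshift.io) unavailable at some point in the
	// last hour; in the run discussed above the metric reported nothing.
	result, warnings, err := promAPI.Query(ctx, `max_over_time(aggregator_unavailable_apiservice[1h]) > 0`, time.Now())
	if err != nil {
		panic(err)
	}
	if len(warnings) > 0 {
		fmt.Println("warnings:", warnings)
	}
	fmt.Println("result:", result)
}
```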

Comment 3 Lukasz Szaszkiewicz 2020-02-12 11:22:37 UTC
Created attachment 1662648 [details]
http request to route.openshift.io group

Comment 4 Lukasz Szaszkiewicz 2020-02-12 11:23:18 UTC
Created attachment 1662649 [details]
aggregator_unavailable_apiservice metrics

Comment 5 Lukasz Szaszkiewicz 2020-02-12 11:54:27 UTC
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-image-registry-operator/428/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-operator/2202 is similar; additionally, one test reports `dial tcp 10.0.131.15:10250: connect: connection refused`.

Comment 7 Lukasz Szaszkiewicz 2020-02-12 12:49:53 UTC
I've briefly checked other runs to see if they suffer from the same issue and haven't found any (https://prow.svc.ci.openshift.org/job-history/origin-ci-test/pr-logs/directory/pull-ci-openshift-cluster-image-registry-operator-master-e2e-aws-operator).

Comment 8 Lukasz Szaszkiewicz 2020-02-12 12:56:21 UTC
Assuming that the graceful mechanism we have works, you shouldn't see any interruptions even during restarts.
Oleg, does the issue still occur?

Comment 9 Oleg Bulatov 2020-02-13 15:50:28 UTC
I haven't seen this issue for a while.