Description of problem:

Running OpenShift 4.4.9 on Azure (UPI), we are seeing problems with openshift-apiserver after some time. Rebooting the master node(s) recovers the state and things work again as expected. After a while, however, it starts to misbehave again and requests to the API fail.

$ oc get clusterversion -o yaml
[...]
> message: |-
> Multiple errors are preventing progress:
> * Could not update imagestream "openshift/cli-artifacts" (299 of 573): the server is down or not responding
> * Could not update oauthclient "console" (342 of 573): the server is down or not responding

$ oc get clusteroperator openshift-apiserver -o yaml
[...]
> message: 'APIServicesAvailable: "image.openshift.io.v1" is not ready: 503 (the
> server is currently unable to handle the request)
>
> APIServicesAvailable: "quota.openshift.io.v1" is not ready: 503 (the server
> is currently unable to handle the request)'
> reason: APIServices_Error

$ oc logs kube-apiserver-indigo-ocp-master-1 -c kube-apiserver
[...]
> 2020-07-27T11:22:42.525578234Z E0727 11:22:42.525520 1 controller.go:114] loading OpenAPI spec for "v1.user.openshift.io" failed with: failed to retrieve openAPI spec, http error:
> ResponseCode: 503, Body: Error trying to reach service: 'net/http: TLS handshake timeout', Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
> 2020-07-27T11:22:42.525578234Z I0727 11:22:42.525552 1 controller.go:127] OpenAPI AggregationController: action for item v1.user.openshift.io: Rate Limited Requeue.

Version-Release number of selected component (if applicable):
- 4.4.9

How reproducible:
- Only reproducible when running on Azure

Steps to Reproduce:
1. N/A

Actual results:
The `openshift-apiserver` operator becomes degraded, and oc commands fail or take a long time to complete.

Expected results:
The cluster remains stable over time.

Additional info:
It looks very much like https://bugzilla.redhat.com/show_bug.cgi?id=1840112. The customer is also using a proxy, and I'm investigating in that direction as well. But given that things return to normal after a reboot, it must be something related to OpenShift 4 or RHEL CoreOS.
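
To narrow down which aggregated APIs are affected when the operator reports the 503 errors shown above, something like the following could be run while the problem is occurring. This is only a rough sketch, not part of the original report: the helper names `unavailable_apiservices` and `fetch_apiservices` are made up for illustration, and it assumes `oc` is on the PATH and logged in to the cluster, and Python 3.7 or later. It reads the output of `oc get apiservices -o json` and lists every APIService whose `Available` condition is not `True`.

```python
import json
import subprocess


def unavailable_apiservices(api_service_list):
    """Return (name, reason, message) for each APIService that is not Available.

    `api_service_list` is the parsed JSON of `oc get apiservices -o json`.
    """
    failing = []
    for item in api_service_list.get("items", []):
        name = item.get("metadata", {}).get("name", "<unknown>")
        conditions = item.get("status", {}).get("conditions", [])
        # Each APIService normally carries a single condition of type "Available".
        available = next(
            (c for c in conditions if c.get("type") == "Available"), None
        )
        if available is None:
            failing.append((name, "NoAvailableCondition", ""))
        elif available.get("status") != "True":
            failing.append(
                (name, available.get("reason", ""), available.get("message", ""))
            )
    return failing


def fetch_apiservices():
    """Hypothetical helper: shell out to `oc` and parse its JSON output."""
    result = subprocess.run(
        ["oc", "get", "apiservices", "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)
```

Calling `unavailable_apiservices(fetch_apiservices())` periodically and recording when entries such as `v1.user.openshift.io` first appear might help correlate the failures with the TLS handshake timeouts in the kube-apiserver log.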
*** This bug has been marked as a duplicate of bug 1825219 ***