Bug 1861359

Summary: OpenShift 4.4.9 on Azure (UPI) is starting to run slow after a certain amount of time
Product: OpenShift Container Platform Reporter: Simon Reber <sreber>
Component: openshift-apiserver Assignee: Stefan Schimanski <sttts>
Status: CLOSED DUPLICATE QA Contact: Xingxing Xia <xxia>
Severity: high Docs Contact:
Priority: high    
Version: 4.4 CC: aos-bugs, mfojtik, rsandu
Target Milestone: --- Keywords: UpcomingSprint
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-08-03 13:11:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Simon Reber 2020-07-28 12:00:22 UTC
Description of problem:

Running OpenShift 4.4.9 on Azure (UPI), we are seeing problems with openshift-apiserver after the cluster has been up for some time. Rebooting the master system(s) recovers the state and things work as expected again. After a while, however, the cluster starts to misbehave again and requests to the API fail.

$ oc get clusterversion -o yaml
[...]
>       message: |-
>         Multiple errors are preventing progress:
>         * Could not update imagestream "openshift/cli-artifacts" (299 of 573): the server is down or not responding
>         * Could not update oauthclient "console" (342 of 573): the server is down or not responding

$ oc get clusteroperator openshift-apiserver -o yaml
[...]
>     message: 'APIServicesAvailable: "image.openshift.io.v1" is not ready: 503 (the
>       server is currently unable to handle the request)
> 
>       APIServicesAvailable: "quota.openshift.io.v1" is not ready: 503 (the server
>       is currently unable to handle the request)'
>     reason: APIServices_Error
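To see which aggregated APIs are affected at a given moment, the table printed by `oc get apiservices` can be filtered for entries that are not available. The following is only a minimal sketch: the helper name `unavailable_apiservices` is made up for illustration, and it assumes the usual NAME / SERVICE / AVAILABLE / AGE column layout of that command.

```shell
# Hypothetical helper: print the names of APIServices whose AVAILABLE
# column is anything other than "True".
# Reads the table printed by `oc get apiservices` on stdin, so it works on
# saved output (e.g. from a must-gather) as well as on a live cluster.
unavailable_apiservices() {
  # Assumed columns: NAME SERVICE AVAILABLE AGE; NR > 1 skips the header.
  awk 'NR > 1 && $3 != "True" { print $1 }'
}

# Usage against a live cluster:
#   oc get apiservices | unavailable_apiservices
```

Running this a few times while the problem is occurring should show whether it is always the same openshift-apiserver backed groups that drop out.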

$ oc logs kube-apiserver-indigo-ocp-master-1 -c kube-apiserver
[...]
> 2020-07-27T11:22:42.525578234Z E0727 11:22:42.525520       1 controller.go:114] loading OpenAPI spec for "v1.user.openshift.io" failed with: failed to retrieve openAPI spec, http error:
>  ResponseCode: 503, Body: Error trying to reach service: 'net/http: TLS handshake timeout', Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
> 2020-07-27T11:22:42.525578234Z I0727 11:22:42.525552       1 controller.go:127] OpenAPI AggregationController: action for item v1.user.openshift.io: Rate Limited Requeue.
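To get an idea of how often each aggregated API fails, the kube-apiserver log can be tallied by APIService name. This is a rough sketch, not a supported tool: the function name `openapi_failures_by_service` is invented here, and it simply matches the `loading OpenAPI spec for "..." failed with` message shown in the excerpt above.

```shell
# Hypothetical helper: count "loading OpenAPI spec ... failed" log lines
# per APIService. Reads kube-apiserver log text on stdin.
openapi_failures_by_service() {
  # Extract the quoted APIService name from each matching log line.
  sed -n 's/.*loading OpenAPI spec for "\([^"]*\)" failed with.*/\1/p' |
    # Count occurrences per name and print "<count> <name>".
    awk '{ count[$0]++ } END { for (s in count) print count[s], s }' |
    # Most frequently failing service first.
    sort -rn
}

# Usage with the pod from this report:
#   oc logs kube-apiserver-indigo-ocp-master-1 -c kube-apiserver | openapi_failures_by_service
```

Comparing the counts shortly after a reboot and again once the cluster has degraded may help to narrow down when the TLS handshake timeouts start.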


Version-Release number of selected component (if applicable):

 - 4.4.9

How reproducible:

 - Only reproducible when running on Azure


Steps to Reproduce:
1. N/A

Actual results:

The `openshift-apiserver` operator becomes degraded, and oc commands fail or take a long time to complete

Expected results:

The cluster remains stable over time

Additional info:

It looks very much like https://bugzilla.redhat.com/show_bug.cgi?id=1840112. The customer is also using a proxy, and I'm investigating in that direction as well. But since things go back to normal after a reboot, it must be something related to OpenShift 4 or RHEL CoreOS.

Comment 10 Simon Reber 2020-08-03 13:11:55 UTC

*** This bug has been marked as a duplicate of bug 1825219 ***