Bug 1861359

Summary:	OpenShift 4.4.9 on Azure (UPI) is starting to run slow after a certain amount of time
Product:	OpenShift Container Platform	Reporter:	Simon Reber <sreber>
Component:	openshift-apiserver	Assignee:	Stefan Schimanski <sttts>
Status:	CLOSED DUPLICATE	QA Contact:	Xingxing Xia <xxia>
Severity:	high	Docs Contact:
Priority:	high
Version:	4.4	CC:	aos-bugs, mfojtik, rsandu
Target Milestone:	---	Keywords:	UpcomingSprint
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-08-03 13:11:55 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Simon Reber 2020-07-28 12:00:22 UTC

Description of problem:

Running OpenShift 4.4.9 on Azure (UPI), we are seeing problems with openshift-apiserver after the cluster has been up for some time. Rebooting the master node(s) restores normal operation, but after a while the cluster starts to misbehave again and requests to the API fail.

$ oc get clusterversion -o yaml
[...]
>       message: |-
>         Multiple errors are preventing progress:
>         * Could not update imagestream "openshift/cli-artifacts" (299 of 573): the server is down or not responding
>         * Could not update oauthclient "console" (342 of 573): the server is down or not responding

$ oc get clusteroperator openshift-apiserver -o yaml
[...]
>     message: 'APIServicesAvailable: "image.openshift.io.v1" is not ready: 503 (the
>       server is currently unable to handle the request)
> 
>       APIServicesAvailable: "quota.openshift.io.v1" is not ready: 503 (the server
>       is currently unable to handle the request)'
>     reason: APIServices_Error
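
To see at a glance which aggregated APIs are affected, the unavailable APIServices can be filtered out of `oc get apiservices`. The following is a minimal sketch, not taken from the original report: the helper name `unavailable_apiservices` is hypothetical, and it assumes the default column layout (NAME, SERVICE, AVAILABLE, AGE) of `oc get apiservices --no-headers`.

```shell
# Hypothetical helper: print the names of APIServices whose AVAILABLE
# column is anything other than "True".
# Input: output of `oc get apiservices --no-headers` on stdin.
# Assumes the default columns NAME, SERVICE, AVAILABLE, AGE.
unavailable_apiservices() {
  awk '$3 != "True" { print $1 }'
}

# Usage on a live cluster (requires a logged-in oc session):
#   oc get apiservices --no-headers | unavailable_apiservices
```

Reading from stdin keeps the filter usable on saved output, for example from a must-gather, as well as on a live cluster.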

$ oc logs kube-apiserver-indigo-ocp-master-1 -c kube-apiserver
[...]
> 2020-07-27T11:22:42.525578234Z E0727 11:22:42.525520       1 controller.go:114] loading OpenAPI spec for "v1.user.openshift.io" failed with: failed to retrieve openAPI spec, http error: ResponseCode: 503, Body: Error trying to reach service: 'net/http: TLS handshake timeout', Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
> 2020-07-27T11:22:42.525578234Z I0727 11:22:42.525552       1 controller.go:127] OpenAPI AggregationController: action for item v1.user.openshift.io: Rate Limited Requeue.
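
When the kube-apiserver log is long, it can help to reduce it to the list of API groups whose OpenAPI spec could not be loaded. The following is a rough sketch under the assumption that the log lines have the same wording as the excerpt above; the helper name `failed_openapi_specs` is hypothetical.

```shell
# Hypothetical helper: list each API item for which the kube-apiserver
# logged 'loading OpenAPI spec for "<item>" failed', one name per line,
# without duplicates.
# Input: kube-apiserver log text on stdin.
failed_openapi_specs() {
  sed -n 's/.*loading OpenAPI spec for "\([^"]*\)" failed.*/\1/p' | sort -u
}

# Usage, with the pod name taken from the command above:
#   oc logs kube-apiserver-indigo-ocp-master-1 -c kube-apiserver | failed_openapi_specs
```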


Version-Release number of selected component (if applicable):

 - 4.4.9

How reproducible:

 - Only reproducible when running on Azure


Steps to Reproduce:
1. N/A

Actual results:

The `openshift-apiserver` operator becomes degraded, and oc commands fail or take a long time to complete

Expected results:

The cluster remains stable over time

Additional info:

It looks very much like https://bugzilla.redhat.com/show_bug.cgi?id=1840112. The customer is also using a proxy, and I am investigating in that direction as well. However, since things return to normal after a reboot, the cause must be something related to OpenShift 4 or RHEL CoreOS.

Comment 10 Simon Reber 2020-08-03 13:11:55 UTC


*** This bug has been marked as a duplicate of bug 1825219 ***