1861359 – OpenShift 4.4.9 on Azure (UPI) is starting to run slow after a certain amount of time

Keywords:
Status:	CLOSED DUPLICATE of bug 1825219
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	openshift-apiserver
Sub Component:
Version:	4.4
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Target Release:	---
Assignee:	Stefan Schimanski
QA Contact:	Xingxing Xia
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2020-07-28 12:00 UTC by Simon Reber
Modified:	2024-12-20 19:11 UTC
CC List:	3 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2020-08-03 13:11:55 UTC
Target Upstream Version:
Embargoed:

Description Simon Reber 2020-07-28 12:00:22 UTC

Description of problem:

Running OpenShift 4.4.9 on Azure (UPI), we are seeing problems with openshift-apiserver after the cluster has been up for some time. Rebooting the master node(s) restores normal operation, but after a while the cluster starts to misbehave again and requests to the API fail.

$ oc get clusterversion -o yaml
[...]
>       message: |-
>         Multiple errors are preventing progress:
>         * Could not update imagestream "openshift/cli-artifacts" (299 of 573): the server is down or not responding
>         * Could not update oauthclient "console" (342 of 573): the server is down or not responding

$ oc get clusteroperator openshift-apiserver -o yaml
[...]
>     message: 'APIServicesAvailable: "image.openshift.io.v1" is not ready: 503 (the
>       server is currently unable to handle the request)
> 
>       APIServicesAvailable: "quota.openshift.io.v1" is not ready: 503 (the server
>       is currently unable to handle the request)'
>     reason: APIServices_Error

$ oc logs kube-apiserver-indigo-ocp-master-1 -c kube-apiserver
[...]
> 2020-07-27T11:22:42.525578234Z E0727 11:22:42.525520       1 controller.go:114] loading OpenAPI spec for "v1.user.openshift.io" failed with: failed to retrieve openAPI spec, http error:
>  ResponseCode: 503, Body: Error trying to reach service: 'net/http: TLS handshake timeout', Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
> 2020-07-27T11:22:42.525578234Z I0727 11:22:42.525552       1 controller.go:127] OpenAPI AggregationController: action for item v1.user.openshift.io: Rate Limited Requeue.
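
A minimal sketch for triaging log output like the above, assuming the failures follow the `loading OpenAPI spec for "<apiservice>" failed` pattern shown in these lines. The function name `count_openapi_failures` is hypothetical and not part of `oc`; it only counts how often each aggregated APIService appears in such error lines read from stdin.

```shell
# Hypothetical helper (not part of oc): count OpenAPI aggregation
# failures per APIService in kube-apiserver log text read from stdin.
count_openapi_failures() {
  # Keep only the 'loading OpenAPI spec for "<name>" failed' fragment,
  # strip everything except <name>, then count occurrences per name.
  grep -o 'loading OpenAPI spec for "[^"]*" failed' \
    | sed -e 's/^loading OpenAPI spec for "//' -e 's/" failed$//' \
    | sort \
    | uniq -c
}
```

It could be fed from the same command used above, for example `oc logs kube-apiserver-indigo-ocp-master-1 -c kube-apiserver | count_openapi_failures`, to see which APIServices are affected most often.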


Version-Release number of selected component (if applicable):

 - 4.4.9

How reproducible:

 - Only reproducible when running on Azure


Steps to Reproduce:
1. N/A

Actual results:

The `openshift-apiserver` operator becomes degraded, and oc commands fail or take a long time to complete.

Expected results:

The cluster remains stable over time.

Additional info:

This looks very much like https://bugzilla.redhat.com/show_bug.cgi?id=1840112. The customer is also using a proxy, and I'm investigating in that direction as well. However, since things return to normal after a reboot, the cause must be something related to OpenShift 4 or RHEL CoreOS.

Comment 10 Simon Reber 2020-08-03 13:11:55 UTC


*** This bug has been marked as a duplicate of bug 1825219 ***
