Bug 1861359 - OpenShift 4.4.9 on Azure (UPI) is starting to run slow after a certain amount of time
Keywords:
Status: CLOSED DUPLICATE of bug 1825219
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: openshift-apiserver
Version: 4.4
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Stefan Schimanski
QA Contact: Xingxing Xia
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-07-28 12:00 UTC by Simon Reber
Modified: 2020-08-03 13:12 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-08-03 13:11:55 UTC
Target Upstream Version:
Embargoed:


Description Simon Reber 2020-07-28 12:00:22 UTC
Description of problem:

Running OpenShift 4.4.9 on Azure (UPI), we are seeing problems with openshift-apiserver after the cluster has been up for some time. Rebooting the master node(s) recovers the cluster, and things work as expected again. After a while, however, it starts to misbehave again and requests to the API fail.

$ oc get clusterversion -o yaml
[...]
>       message: |-
>         Multiple errors are preventing progress:
>         * Could not update imagestream "openshift/cli-artifacts" (299 of 573): the server is down or not responding
>         * Could not update oauthclient "console" (342 of 573): the server is down or not responding

$ oc get clusteroperator openshift-apiserver -o yaml
[...]
>     message: 'APIServicesAvailable: "image.openshift.io.v1" is not ready: 503 (the
>       server is currently unable to handle the request)
> 
>       APIServicesAvailable: "quota.openshift.io.v1" is not ready: 503 (the server
>       is currently unable to handle the request)'
>     reason: APIServices_Error

$ oc logs kube-apiserver-indigo-ocp-master-1 -c kube-apiserver
[...]
> 2020-07-27T11:22:42.525578234Z E0727 11:22:42.525520       1 controller.go:114] loading OpenAPI spec for "v1.user.openshift.io" failed with: failed to retrieve openAPI spec, http error:
>  ResponseCode: 503, Body: Error trying to reach service: 'net/http: TLS handshake timeout', Header: map[Content-Type:[text/plain; charset=utf-8] X-Content-Type-Options:[nosniff]]
> 2020-07-27T11:22:42.525578234Z I0727 11:22:42.525552       1 controller.go:127] OpenAPI AggregationController: action for item v1.user.openshift.io: Rate Limited Requeue.
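
To gauge how many aggregated APIs are affected by the failures in the log above, the errors can be tallied per APIService. This is only a minimal sketch: the helper name `count_openapi_failures` is hypothetical, and it assumes every failure line has the same `loading OpenAPI spec for "<name>" failed` shape as the excerpt.

```shell
# Hypothetical helper: read kube-apiserver log text on stdin and print
# how many OpenAPI spec loading failures were logged per APIService.
count_openapi_failures() {
  grep -o 'loading OpenAPI spec for "[^"]*" failed' \
    | cut -d'"' -f2 \
    | sort \
    | uniq -c
}

# Usage (same pod and container as in the command above):
#   oc logs kube-apiserver-indigo-ocp-master-1 -c kube-apiserver | count_openapi_failures
```

If the output lists only `*.openshift.io` groups, that would be consistent with the problem sitting between kube-apiserver and openshift-apiserver rather than in kube-apiserver itself.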


Version-Release number of selected component (if applicable):

 - 4.4.9

How reproducible:

 - Only reproducible when running on Azure


Steps to Reproduce:
1. N/A

Actual results:

The `openshift-apiserver` operator becomes degraded, and oc commands fail or take a long time to complete.

Expected results:

The cluster remains stable over time.

Additional info:

This looks very much like https://bugzilla.redhat.com/show_bug.cgi?id=1840112. The customer is also using a proxy, and I'm investigating in that direction as well. But since things return to normal after a reboot, it must be something related to OpenShift 4 or RHEL CoreOS.

Comment 10 Simon Reber 2020-08-03 13:11:55 UTC

*** This bug has been marked as a duplicate of bug 1825219 ***

