+++ This bug was initially created as a clone of Bug #1934216 +++

--- snip ---

--- Additional comment from Michael Gugino on 2021-03-02 21:43:43 UTC ---

Redacted some URLs from the lab cluster.

4.6 has a 14-second delay between the "Registering Components" log lines, whereas 4.7 has a 42-second delay. This puts 4.7 just above the 3 x 10 second (30 second) liveness probe timeout on the pod, which causes endless failures and restarts.

A potential workaround for the stuck cluster: copy the machine-controllers deployment, remove all containers except the machineset controller, adjust the liveness probe to 6 x 30, rename the deployment, and apply it. This should restore machineset capabilities while we work on a fix.

I added some debug statements to the binaries between those "Registering Components" calls.

$ ./machineset-debug-47
2021/03/02 16:30:19 Registering Components.
2021/03/02 16:30:19 after get config
I0302 16:30:21.360936 4015643 request.go:655] Throttling request took 1.043307661s, request: GET:https://my-cluster.example.redhat.com:6443/apis/snapshot.storage.k8s.io/v1beta1?timeout=32s
I0302 16:30:31.410138 4015643 request.go:655] Throttling request took 4.949716425s, request: GET:https://my-cluster.example.redhat.com:6443/apis/authorization.openshift.io/v1?timeout=32s
2021/03/02 16:30:33 After new manager call
2021/03/02 16:30:33 After new defaulter call
I0302 16:30:41.577767 4015643 request.go:655] Throttling request took 1.001682443s, request: GET:https://my-cluster.example.redhat.com:6443/apis/apiregistration.k8s.io/v1beta1?timeout=32s
2021/03/02 16:30:47 After new validator call
2021/03/02 16:30:47 After new msdefaulter call
I0302 16:30:51.625198 4015643 request.go:655] Throttling request took 3.149199313s, request: GET:https://my-cluster.example.redhat.com:6443/apis/coordination.k8s.io/v1?timeout=32s
I0302 16:31:01.652524 4015643 request.go:655] Throttling request took 6.905982449s, request: GET:https://my-cluster.example.redhat.com:6443/apis/operator.openshift.io/v1?timeout=32s
2021/03/02 16:31:01 After new msvalidator call
2021/03/02 16:31:01 Registering Components.
2021/03/02 16:31:01 Starting the Cmd.

$ ./machineset-debug-46
2021/03/02 16:34:23 Registering Components.
I0302 16:34:25.690275 4016198 request.go:645] Throttling request took 1.046962653s, request: GET:https://my-cluster.example.redhat.com:6443/apis/hco.kubevirt.io/v1alpha1?timeout=32s
I0302 16:34:35.740404 4016198 request.go:645] Throttling request took 4.851744443s, request: GET:https://my-cluster.example.redhat.com:6443/apis/operator.cert-manager.io/v1alpha1?timeout=32s
2021/03/02 16:34:37 after manager
2021/03/02 16:34:38 after defaulter
2021/03/02 16:34:38 after validator
2021/03/02 16:34:38 after msdefaulter
2021/03/02 16:34:38 after msvalidator
2021/03/02 16:34:38 Registering Components.
2021/03/02 16:34:38 Starting the Cmd.

--- snip ---
This is client-side throttling. Compare the oc fix https://github.com/openshift/oc/pull/696/files.
The cause of the above condition is twofold. First, the machineset controller was building 3 clients instead of just 1. Second, on some clusters, building a client takes much longer than on others. On a typical CI run, I observed a 3-second client creation time on the machineset-controller. On the test cluster I had access to, I observed a 14-second client creation time. The cluster doesn't seem to be starved of any resource, such as CPU. One possible cause is that some aggregated API is slow to respond, or that its backing API server is missing; TBD. The upstream docs state the spec for an aggregated API response is 5 seconds, which seems very high.
To make this actionable:

1. Fix the client throttling. It's not helpful to look into claimed server slowness when the binary is slow because it throttles itself.
2. Show a request (timestamp and URL) that is slower than expected, and provide must-gather and audit logs for the cluster covering the time the request was processed.
Created attachment 1760396 [details]
machineset-controller with slow client creation

Attached is a log of the slow client creation from the in-cluster controller. The controller fails to start after 30 seconds due to the liveness probe settings. This controller performs the 3 client creations, which pushes the total time above 30 seconds. However, on CI, client creation takes 3 seconds, so we're trying to determine where the discrepancy comes from.
The attached log still shows a lot of client-side throttling (> 5s for some requests). Until that is fixed, there is no sense in digging deeper into server-side issues.
Created attachment 1760403 [details]
cluster logging operator slow client creation

This is another controller with very slow rest client creation. Something is causing this cluster to have excessively slow client creation times. Is this due to the number of API groups or cluster objects? If this is caused exclusively client-side, we need to come up with some best practices to ensure this doesn't happen.
The throttling messages in the log are entirely client-side. We enabled log level '-v=4' or higher and did not see the following message, which would indicate server-side throttling:

> Got a Retry-After 1s response for attempt

https://github.com/kubernetes/kubernetes/blob/master/staging/src/k8s.io/client-go/rest/request.go#L959

Moving it back to the cloud team so they can investigate further. We are happy to look at it if there is evidence of slowness on the apiserver side or of any other apiserver-related issue. But right now, based on the log, it is client-side throttling that is slowing the requests down (the server is not involved here).
Freshly installed cluster:

$ grep 'GET https://api.mgugino-devb2.devcluster.openshift.com' machineset.c1.out | wc -l
89

Impacted cluster:

$ grep 'GET https://impacted-cluster.example.redhat.com:6443' machineset.c2.out | wc -l
302
Freshly installed cluster:

$ ./oc get apiservices | wc -l
80

Impacted cluster:

$ ./oc get apiservices | wc -l
145

From the previous comment, the fresh install makes 89 calls in about 3 seconds total, roughly 33 ms per call. The impacted cluster makes 302 calls in about 14 seconds, roughly 46 ms per call. We need a cluster-wide setting for operator rest clients. Other components may be hitting this issue; we'll need to investigate what an appropriate setting is so we don't DoS the API server during upgrades.
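The per-call arithmetic above can be checked mechanically. A trivial sketch (the helper name is made up; the call counts and timings are the ones quoted above):

```go
package main

import "fmt"

// perCallMillis is a hypothetical helper: average milliseconds per
// discovery request, given total client-creation time and call count.
func perCallMillis(totalSeconds float64, calls int) float64 {
	return totalSeconds * 1000 / float64(calls)
}

func main() {
	// Fresh cluster: 89 discovery GETs in ~3 seconds.
	fmt.Printf("fresh:    %.0f ms/call\n", perCallMillis(3, 89))
	// Impacted cluster: 302 discovery GETs in ~14 seconds.
	fmt.Printf("impacted: %.0f ms/call\n", perCallMillis(14, 302))
}
```

So the per-request latency is comparable on both clusters; the extra wall-clock time comes from the sheer number of requests, not from a slower server.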
On the impacted cluster: adjusting the rest client Burst to 100 (up from the default of 10) still results in 7 to 12 second delays (down slightly from 14). A Burst of 250 results in near-instantaneous completion. It seems that once we exhaust the first 100 requests, we incrementally back off for the balance of the requests. Perhaps we need to explore QPS as well: https://pkg.go.dev/k8s.io/client-go/rest#Config
Testing on the impacted cluster, QPS has a greater impact than Burst. QPS defaults to 5 and is what causes the excessive throttling on the client side.

rest.Burst = 150
rest.QPS = 75.0

With these settings, the time decreases to 4 seconds. This should allow us to scale to 1000 requests before encountering any problems. We should investigate what the upper limit with these settings might be so we can alert/do something useful when that limit is hit.
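client-go's default client-side limiter is a token bucket: Burst tokens are available up front, and tokens refill at QPS per second. The sketch below is a simplified model, not client-go code; it assumes every discovery request hits the limiter back-to-back, so it somewhat overstates the delays actually observed (real client creation interleaves network round trips), but it shows why QPS dominates Burst once the initial allowance is spent:

```go
package main

import "fmt"

// startupDelay models a token-bucket rate limiter: the first `burst`
// requests spend pre-filled tokens, and the remainder drain at `qps`
// tokens per second.
func startupDelay(requests, burst int, qps float64) float64 {
	if requests <= burst {
		return 0
	}
	return float64(requests-burst) / qps
}

func main() {
	const requests = 302 // discovery GETs counted on the impacted cluster

	// Defaults (QPS=5, Burst=10): almost every request queues behind QPS.
	fmt.Printf("defaults:         %.1fs\n", startupDelay(requests, 10, 5))
	// Raising only Burst still leaves ~200 requests paced at 5/s,
	// which is why Burst=100 alone gave limited improvement.
	fmt.Printf("burst=100:        %.1fs\n", startupDelay(requests, 100, 5))
	// Raising QPS is what collapses the tail.
	fmt.Printf("burst=150 qps=75: %.1fs\n", startupDelay(requests, 150, 75))
}
```

The same model backs the headroom claim: at QPS=75 and Burst=150, even 1000 requests would queue for only about (1000-150)/75, roughly 11 seconds.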
Note to future readers: we discussed this out of band and decided to investigate creating a static restmapper that can be used in our components, so that we don't have to dynamically build the restmapping from the API on startup. This would eliminate the huge rush of API calls that our components make when they start up. Alternatively, we could switch to typed clients and ditch the controller-runtime client.
I think we can use a DefaultRestMapper [1], which uses the runtime.Scheme (which we always construct anyway), to avoid making the API calls to perform the discovery. If we want a fallback to dynamic discovery, we can also use a MultiRestMapper [2] to let the restmapper fall back to discovery when the resource isn't available in the scheme. If we take this approach, we should use a lazy DynamicRestMapper [3]. Once we create this rest mapper, we can pass it into any controller-runtime manager we create by providing a rest mapper provider function in the options.

[1]: https://github.com/kubernetes/apimachinery/blob/57f2a0733447cfd41294477d833cce6580faaca3/pkg/api/meta/restmapper.go#L76-L97
[2]: https://github.com/kubernetes/apimachinery/blob/57f2a0733447cfd41294477d833cce6580faaca3/pkg/api/meta/multirestmapper.go#L27-L28
[3]: https://github.com/kubernetes-sigs/controller-runtime/blob/fa42462a01b0f33cfb42dd7396d198435a013122/pkg/client/apiutil/dynamicrestmapper.go#L55-L60
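The layering proposed here (consult a static table first, fall back to dynamic discovery only on a miss) can be sketched with toy types. Everything below is invented for illustration; the real interfaces are meta.RESTMapper, DefaultRESTMapper, and MultiRESTMapper in k8s.io/apimachinery, plus controller-runtime's lazy DynamicRESTMapper:

```go
package main

import (
	"errors"
	"fmt"
)

// mapper resolves a kind name to its REST resource name. Toy stand-in
// for meta.RESTMapper.
type mapper interface {
	resourceFor(kind string) (string, error)
}

// staticMapper answers from a precomputed table with no API calls,
// the way a DefaultRESTMapper seeded from the scheme would.
type staticMapper map[string]string

func (s staticMapper) resourceFor(kind string) (string, error) {
	if r, ok := s[kind]; ok {
		return r, nil
	}
	return "", errors.New("no match for " + kind)
}

// multiMapper tries each delegate in order, mirroring MultiRESTMapper:
// the static table first, then (in the real thing) live discovery.
type multiMapper []mapper

func (m multiMapper) resourceFor(kind string) (string, error) {
	for _, d := range m {
		if r, err := d.resourceFor(kind); err == nil {
			return r, nil
		}
	}
	return "", errors.New("no mapper could resolve " + kind)
}

func main() {
	static := staticMapper{"Machine": "machines", "MachineSet": "machinesets"}
	// Second table standing in for dynamic discovery of resources
	// that aren't registered in the scheme.
	fallback := staticMapper{"ClusterLogging": "clusterloggings"}
	m := multiMapper{static, fallback}

	for _, k := range []string{"Machine", "ClusterLogging"} {
		r, _ := m.resourceFor(k)
		fmt.Println(k, "->", r)
	}
}
```

Only kinds missing from the static table would cost a discovery round trip, which is what removes the startup burst of API calls.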
Bug 1934216 had an immediate workaround for this issue, so I'm removing UpgradeBlocker to pull it out of our suspect queue [1]. If folks are concerned about client creation time creating similar breakage even with bug 1934216's mitigation in place, I recommend making this bug blocker+ and fixing it before 4.8 GAs. If, in addition, we hear about folks on 4.7.z with the mitigation in place still breaking, please add UpgradeBlocker back here and float an impact statement. [1]: https://github.com/openshift/enhancements/pull/426
Regarding https://bugzilla.redhat.com/show_bug.cgi?id=1934555#c13: this doesn't actually work; building a static restmapper requires more information than we can gather from a runtime scheme. We will have to keep the dynamic rest mapper for now, as that is far simpler and the benefit of a static mapper is negligible.