Bug 1521169
Summary: | [starter-us-west-1] requests being rejected due to "the server has received too many requests" | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | Justin Pierce <jupierce>
Component: | Master | Assignee: | Michal Fojtik <mfojtik>
Status: | CLOSED ERRATA | QA Contact: | Wang Haoran <haowang>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 3.7.0 | CC: | aos-bugs, ccoleman, deads, decarr, eparis, jokerman, jupierce, mfisher, mifiedle, mmccomas, pdwyer, sspeiche, vcorrea, vizak, xtian
Target Milestone: | --- | Keywords: | DeliveryBlocker, OnlineStarter, OpsBlocker
Target Release: | 3.10.0 | Flags: | jupierce: needinfo-
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | | |
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2018-07-30 19:09:00 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Justin Pierce 2017-12-05 22:21:05 UTC
Justin, has it stabilized now? I think `get nodes` also pulls in information about pods and other objects on each node, so it can generate a lot of requests.

*** Bug 1521126 has been marked as a duplicate of this bug. ***

Spawned https://github.com/kubernetes/kubernetes/issues/57061

gRPC requests to etcd are either not completing or not being released correctly. At some point the HTTP/2 connection to etcd begins rejecting new client requests, which causes other requests to fail. This appears to be happening for only a single resource type (secrets in this case); other calls succeed.

No, different cause. That other bug is on watches and is 3.1 specific.

This looks a lot like what https://github.com/grpc/grpc-go/pull/1005 fixes. If etcd 3.2.x sets http2.SettingMaxConcurrentStreams, then the client will try to honor it. If 3.1.x didn't set it, that would explain why we didn't see this on 3.1.x.

It looks like etcd 3.1.9 + grpc did not set a default max streams, which defaults to 1000 in the Go http2 library but appears to default to MaxInt in the grpc HTTP server. 3.2.9 looks like it has a lower default. Need to test against a 3.2.x server to confirm.

The goroutine trace is hung waiting to get quota out of the pool on gRPC v1.0.4, so I'm fairly confident 1005 is the root cause.

To work around this issue, users can enable the watch cache for secrets (and other affected resources as necessary) by adding this to apiServerArguments in master-config.yaml (a fuller config sketch follows at the end of this thread):

    watch-cache-sizes:
    - secrets#1000

Enabling the watch cache prevents a large chunk of API requests from going to the underlying etcd, which reduces the pressure on the gRPC pool and makes the race less likely to occur.

On us-east-2, Jenkins (okhttp) is successfully creating 200 watches per second: 50 secrets, 50 configmaps, 50 builds, and 50 buildconfigs, probably namespace scoped. This seems likely to be the cause of the contention, and enabling the watch cache for those resources should let us get past this until we can deliver a patch to 3.6 and 3.7 for the quota-pool issue in gRPC.

*** Bug 1535035 has been marked as a duplicate of this bug. ***

The PR for origin 3.7 is merging: https://github.com/openshift/origin/pull/17735

The OSE PR for 3.7 will merge soon: https://github.com/openshift/origin/pull/17735

This should already be fixed in 3.10, as we picked up a new etcd client and version.

Moved to verified as I didn't see this problem in 3.10; feel free to reopen if you hit it again.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816
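For reference, a minimal sketch of the watch-cache workaround above as a master-config.yaml excerpt. The `kubernetesMasterConfig`/`apiServerArguments` nesting reflects the usual OpenShift 3.x master config layout and is an assumption here; only the `watch-cache-sizes` argument and the `secrets#1000` value come from this thread.

```yaml
# master-config.yaml (excerpt) -- a sketch, not a complete config.
# The kubernetesMasterConfig/apiServerArguments nesting is assumed from
# the standard OpenShift 3.x layout; merge into your existing file.
kubernetesMasterConfig:
  apiServerArguments:
    # Serve reads and watches for secrets from the API server's watch
    # cache (size 1000) instead of sending each request to etcd. Add
    # further "resource#size" entries for other affected resources.
    watch-cache-sizes:
    - secrets#1000
```

Note that changes to master-config.yaml take effect only after the master API service is restarted.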
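The max-streams discussion above concerns a server-advertised HTTP/2 setting. As an illustration of the knob involved (not etcd's actual code), here is a minimal grpc-go server sketch: the value passed to grpc.MaxConcurrentStreams is advertised to clients as http2.SettingMaxConcurrentStreams, and once a client has that many streams in flight, new RPCs on that connection wait for stream quota, which is the wait the hung goroutine trace shows and which grpc/grpc-go#1005 fixes on the client side. The address and limit are illustrative.

```go
package main

import (
	"log"
	"math"
	"net"

	"google.golang.org/grpc"
)

func main() {
	// Illustrative listen address, not etcd's real configuration.
	lis, err := net.Listen("tcp", "127.0.0.1:2379")
	if err != nil {
		log.Fatal(err)
	}

	// The server advertises this limit to each client via the HTTP/2
	// SETTINGS frame (http2.SettingMaxConcurrentStreams). Per the thread,
	// the 3.1.x-era grpc server effectively defaulted to MaxInt (no
	// practical limit), while etcd 3.2.x advertised a lower value, which
	// is what exposed the client-side quota-pool race in gRPC v1.0.4.
	srv := grpc.NewServer(
		grpc.MaxConcurrentStreams(math.MaxUint32),
	)

	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}
```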