Bug 1521169

Summary: [starter-us-west-1] requests being rejected due to "the server has received too many requests"
Product: OpenShift Container Platform
Reporter: Justin Pierce <jupierce>
Component: Master
Assignee: Michal Fojtik <mfojtik>
Status: CLOSED ERRATA
QA Contact: Wang Haoran <haowang>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.7.0
CC: aos-bugs, ccoleman, deads, decarr, eparis, jokerman, jupierce, mfisher, mifiedle, mmccomas, pdwyer, sspeiche, vcorrea, vizak, xtian
Target Milestone: ---
Keywords: DeliveryBlocker, OnlineStarter, OpsBlocker
Target Release: 3.10.0
Flags: jupierce: needinfo-
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-07-30 19:09:00 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Justin Pierce 2017-12-05 22:21:05 UTC
Description of problem:
During an upgrade to v3.7 GA, the cluster began to show signs of trouble, with nodes flapping and simple CLI interactions (e.g. oc get nodes) failing because the server reported it had received too many requests.

Version-Release number of selected component (if applicable):
Upgrading from v3.6 to v3.7.9

How reproducible:
Happening now. This is the only cluster that has exhibited this symptom.

Comment 3 Michal Fojtik 2017-12-06 09:47:37 UTC
Justin, has it stabilized now? I think oc get nodes also fetches information about pods and other objects on the node, so it might generate a lot of requests.

Comment 6 Jessica Forrester 2017-12-06 13:43:06 UTC
*** Bug 1521126 has been marked as a duplicate of this bug. ***

Comment 12 Clayton Coleman 2017-12-12 00:52:34 UTC
Spawned https://github.com/kubernetes/kubernetes/issues/57061

gRPC requests to etcd are either not completing or not being released correctly. At some point the HTTP/2 connection to etcd begins rejecting new client requests, which causes other requests to fail. This appears to happen only for a single resource type (secrets in this case); other calls succeed.

Comment 14 Clayton Coleman 2017-12-12 16:20:09 UTC
No, different cause.  That other bug is on watches and is 3.1 specific.

This looks somewhat like what https://github.com/grpc/grpc-go/pull/1005 fixes. If etcd 3.2.x sets http2.SettingMaxConcurrentStreams, then the client is going to try to honor it. If 3.1.x didn't set that, it would explain why we didn't see this on 3.1.x.

Comment 15 Clayton Coleman 2017-12-12 16:49:13 UTC
It looks like etcd 3.1.9 + gRPC did not set a default max streams; that value defaults to 1000 in the Go http2 library but appears to default to MaxInt in the gRPC HTTP server. 3.2.9 looks like it has a lower default. Need to test against a 3.2.x server to confirm.

Comment 16 Clayton Coleman 2017-12-12 20:55:28 UTC
The goroutine trace shows requests hung waiting to get quota out of the pool on gRPC v1.0.4, so I'm fairly confident PR 1005 is the root cause.

To work around this issue, users can enable the watch cache for secrets (and other affected resources as necessary) by adding this to apiServerArguments in master-config.yaml:

    watch-cache-sizes:
    - secrets#1000
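
For reference, a minimal sketch of where that fragment would sit in master-config.yaml, assuming the stock kubernetesMasterConfig stanza on a 3.x master (the master API service has to be restarted to pick up the change):

    kubernetesMasterConfig:
      apiServerArguments:
        # illustrative placement; only watch-cache-sizes comes from this bug,
        # any other arguments already in the config stay as they are
        watch-cache-sizes:
        - secrets#1000

Each entry has the form resource#size; the watch cache keeps recent versions of that resource in the API server's memory so watches can be served without going back to etcd.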

Comment 17 Clayton Coleman 2017-12-12 20:56:32 UTC
Enabling the watch cache prevents a large chunk of API requests from going to the underlying etcd, which reduces pressure on the gRPC pool and makes the race much less likely to occur.

Comment 18 Clayton Coleman 2017-12-13 14:27:11 UTC
On us-east-2, Jenkins (okhttp) is creating 200 watches per second (successfully): 50 secrets, 50 config maps, 50 builds, and 50 buildconfigs. They are probably namespace scoped. This is likely the cause of the contention, and enabling the watch cache for those resources should allow us to get past this until we can deliver a patch to 3.6 and 3.7 for the quotaPool fix in gRPC.
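
A sketch of extending the earlier workaround to the resources named above (the sizes here are illustrative, not tuned values from this cluster):

    kubernetesMasterConfig:
      apiServerArguments:
        watch-cache-sizes:
        # sizes are placeholders; pick values appropriate for the cluster
        - secrets#1000
        - configmaps#1000
        - builds#1000
        - buildconfigs#1000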

Comment 19 Justin Pierce 2018-01-16 21:54:53 UTC
*** Bug 1535035 has been marked as a duplicate of this bug. ***

Comment 21 Michal Fojtik 2018-01-23 12:35:43 UTC
The PR for origin 3.7 is merging: https://github.com/openshift/origin/pull/17735
The OSE PR for 3.7 will merge soon: https://github.com/openshift/origin/pull/17735

Comment 28 Michal Fojtik 2018-04-27 13:58:24 UTC
This should already be fixed in 3.10, as it ships a new etcd client and version.

Comment 30 Wang Haoran 2018-05-16 02:47:42 UTC
Moved to verified as I didn't see this problem in 3.10; feel free to reopen if you hit it again.

Comment 32 errata-xmlrpc 2018-07-30 19:09:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816