Description of problem:
During an upgrade to v3.7 GA, the cluster began to show signs of trouble, with nodes flapping and simple CLI interactions (e.g. oc get nodes) failing because of too many requests.
Version-Release number of selected component (if applicable):
Moving from v3.6 -> v3.7.9
Happening now. This is the only cluster that has exhibited this symptom.
Justin, has it stabilized now? I think get nodes also retrieves information about the pods and other resources on each node, so it might generate a lot of requests.
*** Bug 1521126 has been marked as a duplicate of this bug. ***
gRPC requests to etcd are either not completing or not being released correctly. At some point the HTTP/2 connection to etcd begins rejecting new client requests, which causes other requests to fail. This appears to be happening only for a single resource type (secrets in this case); other calls succeed.
No, different cause. That other bug is on watches and is 3.1 specific.
This looks like what https://github.com/grpc/grpc-go/pull/1005 fixes. If etcd 3.2.x sets http2.SettingMaxConcurrentStreams, then the client will try to honor it. If 3.1.x didn't set that, it would explain why we didn't see this on 3.1.x.
It looks like 3.1.9 etcd + gRPC did not set a default maximum stream count, which defaults to 1000 in the Go http2 library but appears to default to MaxInt in the gRPC HTTP server. 3.2.9 looks like it has a lower default. Need to test against a 3.2.x server to confirm.
The goroutine trace shows it hung waiting to get quota out of the pool on gRPC v1.0.4, so I'm fairly confident PR 1005 is the root cause.
To work around this issue, users can enable the watch cache for secrets (and other affected resources as necessary) by adding this to apiServerArguments in master-config.yaml.
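A minimal sketch of what that could look like, assuming the standard kube-apiserver watch-cache-sizes argument; the resource list and cache sizes here are illustrative, not prescriptive:

```yaml
kubernetesMasterConfig:
  apiServerArguments:
    # resource#size entries; sizing should match your cluster's object counts
    watch-cache-sizes:
    - secrets#1000
    - configmaps#1000
```

After editing master-config.yaml, the master API service must be restarted for the change to take effect.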
Enabling the watch cache prevents a large share of API requests from reaching the underlying etcd, which reduces pressure on the gRPC quota pool and makes the race less likely to occur.
On us-east-2, Jenkins (okhttp) is creating 200 watches per second (successfully): 50 secrets, 50 config maps, 50 builds, and 50 buildconfigs. They are probably namespace-scoped. This seems likely to be the cause of the contention, and enabling the watch cache for those resources should let us get past this until we can deliver a patch for the gRPC quotaPool issue to 3.6 and 3.7.
*** Bug 1535035 has been marked as a duplicate of this bug. ***
The PR for origin 3.7 is merging: https://github.com/openshift/origin/pull/17735
The OSE PR for 3.7 will merge soon: https://github.com/openshift/origin/pull/17735
This should already be fixed in 3.10, as it includes a new etcd client and version.
Moved to VERIFIED as I didn't see this problem in 3.10; feel free to reopen if you hit it again.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below.
If the solution does not work for you, open a new bug report.