Description of problem:

On the CNCF gear (2K nodes on OpenStack 10), executing a cluster horizontal test to create 300 projects with the following specs:
- 75 deployment configs
- 100 replicas/pods
- 2 builds
- 1 image stream
- 2 secrets
- 1 route

A successful run would result in 30K running pods on 2K nodes. During the run, the following panic occurred on one of the master-api servers. The full master log will be linked to this bz.

```
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: [1m4.963099272s] [1m4.91061541s] Writing http response done (20881 items)
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: "List /api/v1/endpoints" [1m4.963100517s] [1.245µs] END
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: I0405 12:56:08.576332 107783 logs.go:41] http2: panic serving 10.10.0.18:42370: kill connection/stream
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: goroutine 550297327 [running]:
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: net/http.(*http2serverConn).runHandler.func1(0xc4612cb150, 0xc454cbdf7f, 0xc42766afc0)
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/net/http/h2_bundle.go:4304 +0xd1
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: panic(0x3a912a0, 0xc42041a550)
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/runtime/panic.go:458 +0x243
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters.(*baseTimeoutWriter).timeout(0xc5f780fa40, 0x0, 0x0)
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: /builddir/build/BUILD/atomic-openshift-git-0.35a4ef0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters/timeout.go:205 +0x336
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters.(*timeoutHandler).ServeHTTP(0xc421adec20, 0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: /builddir/build/BUILD/atomic-openshift-git-0.35a4ef0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters/timeout.go:85 +0x2a0
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters.WithMaxInFlightLimit.func1(0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: /builddir/build/BUILD/atomic-openshift-git-0.35a4ef0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters/maxinflight.go:45 +0x127
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: net/http.HandlerFunc.ServeHTTP(0xc421825080, 0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/net/http/server.go:1726 +0x44
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/apiserver/filters.WithRequestInfo.func1(0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: /builddir/build/BUILD/atomic-openshift-git-0.35a4ef0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/apiserver/filters/requestinfo.go:45 +0x212
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: net/http.HandlerFunc.ServeHTTP(0xc421825140, 0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/net/http/server.go:1726 +0x44
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/api.WithRequestContext.func1(0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: /builddir/build/BUILD/atomic-openshift-git-0.35a4ef0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/api/requestcontext.go:107 +0xef
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: net/http.HandlerFunc.ServeHTTP(0xc421aded20, 0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/net/http/server.go:1726 +0x44
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: net/http.serverHandler.ServeHTTP(0xc423597a80, 0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/net/http/server.go:2202 +0x7d
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: net/http.initNPNRequest.ServeHTTP(0xc42c5de380, 0xc423597a80, 0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/net/http/server.go:2670 +0x93
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: net/http.(*initNPNRequest).ServeHTTP(0xc4236d9f50, 0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: <autogenerated>:288 +0x79
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: net/http.(Handler).ServeHTTP-fm(0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/net/http/h2_bundle.go:4084 +0x4d
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: net/http.(*http2serverConn).runHandler(0xc42766afc0, 0xc4612cb150, 0xc4e0fdd680, 0xc5f780f9e0)
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/net/http/h2_bundle.go:4314 +0x89
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: created by net/http.(*http2serverConn).processHeaders
Apr 5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/net/http/h2_bundle.go:4092 +0x6e2
```

Version-Release number of selected component (if applicable): 3.5.0.39

How reproducible: Unknown. The problem occurred running cluster-loader (https://github.com/openshift/svt/tree/master/openshift_scalability) with the following configuration on the CNCF 2K node cluster:

```yaml
projects:
  - num: 300
    basename: svt-1-
    tuning: default
    templates:
      - num: 2
        file: ./content/build-template.json
      - num: 1
        file: ./content/image-stream-template.json
      - num: 50
        file: ./content/deployment-config-1rep-pause-template.json
        parameters:
          - ENV_VALUE: "asodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij12"
      - num: 25
        file: ./content/deployment-config-2rep-pause-template.json
        parameters:
          - ENV_VALUE: "asodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij12"
      - num: 2
        file: ./content/ssh-secret-template.json
      - num: 1
        file: ./content/route-template.json
# rcs and services are implemented in deployments.
tuningsets:
  - name: default
    templates:
      stepping:
        stepsize: 5
        pause: 10 s
      rate_limit:
        delay: 250 ms
quotas:
  - name: default
```
This is not a one-time panic. The log is filled with them. Once it happens the first time, it occurs repeatedly.
Even after restarting master-api, commands like oc get projects and oc get pods fail, with this panic occurring multiple times for each command.
Also, it occurs on all master-api servers - not just one, as originally stated in the description.
This looks like https://github.com/kubernetes/kubernetes/pull/29594 to me, but evidently that patch is already cherry-picked into OpenShift.
Still open: https://github.com/kubernetes/kubernetes/issues/30305. It's not clear why this is unclosed though: either it was forgotten, or it is truly not yet resolved.
Stefan, could you please investigate and see if you can root cause this?
http://file.rdu.redhat.com/schituku/logFile <- reproduced log file using ReallyCrash option = true

Looks like this is the bit that is getting tripped (in `pkg/genericapiserver/filters/timeout.go`):

```
	// The timeout writer has not been used by the inner handler.
	// We can safely timeout the HTTP request by sending by a timeout
	// handler
	if !tw.wroteHeader && !tw.hijacked {
		tw.w.WriteHeader(http.StatusGatewayTimeout)
		enc := json.NewEncoder(tw.w)
		enc.Encode(&err.ErrStatus)
	} else {
		// The timeout writer has been used by the inner handler. There is
		// no way to timeout the HTTP request at the point. We have to shutdown
		// the connection for HTTP1 or reset stream for HTTP2.
		//
		// Note from: Brad Fitzpatrick
		// if the ServeHTTP goroutine panics, that will do the best possible thing for both
		// HTTP/1 and HTTP/2. In HTTP/1, assuming you're replying with at least HTTP/1.1 and
		// you've already flushed the headers so it's using HTTP chunking, it'll kill the TCP
		// connection immediately without a proper 0-byte EOF chunk, so the peer will recognize
		// the response as bogus. In HTTP/2 the server will just RST_STREAM the stream, leaving
		// the TCP connection open, but resetting the stream to the peer so it'll have an error,
		// like the HTTP/1 case.
		panic(errConnKilled)
	}
```
My suspicion is that the LongRunningRequestCheck does not recognize LIST requests with watch=true in the ListOptions, i.e. those are cut off after the timeout. The timeout defaults to 60 minutes though. Mike, does this match your tests? Or do those timeouts happen earlier?

Here is the origin master LongRunningRequestCheck (which should match that of 3.5): https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters/longrunning.go

In Kube 1.6 this changed a lot, though not for the better, I fear: https://github.com/kubernetes/kubernetes/blob/312c44da2e1946d5b24eafe731ebd00fbdfeece7/staging/src/k8s.io/apiserver/pkg/server/filters/longrunning.go#L31

There are multiple ways to start a watch LIST request: through ListOptions in the body and through ListOptions as GET URL params. In 3.5 at least the latter case is handled: https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/genericapiserver/config.go#L213

@liggit can you comment? I think you touched the watch param logic recently upstream.
@sttts, my observations have been:
- The panics start occurring after a couple of hours into this scale-up run. Now that I know what to look for, I can get exact timing, if that is useful, by re-running starting with a clean cluster.
- In yesterday's case, restarting the controllers/api-servers made no difference. The panics resumed almost immediately after restart.
At the suggestion of @sttts - based on his observation of 75 panics/second in the logs - I set maxRequestsInFlight to 65 as an experiment. Setting it below 75 eliminated the panics. It caused lots of issues with short-running GETs (nodes getting secrets, etc.) but it did eliminate the panics. Hopefully this is a useful data point.
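For the record, the experiment was done via the master configuration; a sketch of the relevant excerpt (field name per OpenShift 3.x master-config.yaml; the rest of the servingInfo stanza is elided):

```yaml
# /etc/origin/master/master-config.yaml (excerpt)
servingInfo:
  maxRequestsInFlight: 65  # experiment: just below the ~75 panics/second observed
```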
The workaround of reducing maxRequestsInFlight to 65 suggests that the rate of requests overloads the apiserver(s). As a consequence, certain LIST requests (we do not know which) reach the (hard-coded) timeout of 60 seconds for non-long-running requests. At the moment, the only way to control the load on the server is to set maxRequestsInFlight to a value low enough that the server can process the expensive LIST requests within those 60 seconds. At the same time, this throttles all inexpensive requests like GETs as well, which had a bad influence on cluster behavior.

Open questions:
- How much of the load is due to cheap requests and how much is due to expensive LIST requests?
- Where is the bottleneck? etcd? Are optimizations possible?
- Which LIST calls are expensive in the experiment?
- Can those LIST calls be optimized client-side (in the kubelet? in the proxy?)?

What we can do:
- Make the 60 sec timeout configurable.
- Add a maxExpensiveRequestsInFlight flag which applies only to LIST requests. This can be set much lower than the generic maxRequestsInFlight value.
- Develop self-controlling mechanisms that adjust maxExpensiveRequestsInFlight dynamically depending on the duration of expensive requests.

@mike: how can we get answers to the first 3 open questions?
Reducing severity as it is not urgent anymore without the test cluster. Also postponing to the next point release.
Mike, have you been able to run any more tests?
No info will be available until June.
Do we have anything else to go on here? At this point I'm going to move this BZ to 3.7.
*** Bug 1467416 has been marked as a duplicate of this bug. ***
Per duplicate bug 1467416, we can recreate this by deploying logging in a 1200 node cluster without the new mux service turned on for fluentd. When the fluentd daemonset was created, repeated/continuous panics appeared in the master-api logs and the cluster became unstable - at least in terms of the oc commands needed to rescue the cluster and get it working reliably again.
Hit this in the OpenShift 3.6 scalability lab trying to delete 5000 projects with 20000 pods. During the delete (out-of-the-box in-flight settings), the same panic as reported above started happening continuously. The master log location will be linked when available.
Scenarios reproducing this:
1. Scale up to ~23K pods on 2000 nodes in 3.5 in the CNCF cluster.
2. Start a daemonset with 1200 fluentd pods trying to contact the master-api server in 3.6.
3. Delete 5000 projects/20K pods in 3.6 in a cluster with 2000 nodes.
I think in this case the problem is that we aren't exerting enough back pressure - clients are effectively crushing the master. So something like the rate of timeouts per unit time should control how much back pressure to put on clients, just as the rate of 429s per second should alter the Retry-After sent back to clients.
In the delete-5000-projects case, lowering max in flight to 100 allowed the deletes to complete without panics. Quite a few nodes went NotReady though, with timeouts/connection issues to the master-api servers via the master-api load balancer.
In prod we're increasing (doubling) the limits because we are not CPU or memory bound, but we are bound by the etcd2 + JSON setup. I don't know what to recommend for this case yet.
For reference, these are the two open upstream issues around 429 handling: https://github.com/kubernetes/client-go/issues/222 and https://github.com/kubernetes/kubernetes/issues/48610
Is there anything we can do here now? Was this an actual panic? I thought this was the logs saying the request took too long, not actual panics. If so, can we at least change the bz title so it is less dire? I thought I discussed similar issues with David on other BZs.
We can probably improve our backpressure logic to keep the cluster functional in case of many timeouts of long-running LIST requests. Other than that, I agree about changing the title. It's more like: "Continuous request timeouts due to ineffective backpressure when creating 22.5K deployment configs with 30K pods".