Bug 1439324 - Continuous request timeouts due to ineffective back-pressure when creating 22.5K deployment configs with 30K pods
Summary: Continuous request timeouts due to ineffective back-pressure when creating 22.5K deployment configs with 30K pods
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Master
Version: 3.5.0
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.7.0
Assignee: Stefan Schimanski
QA Contact: DeShuai Ma
URL:
Whiteboard: aos-scalability-36
Duplicates: 1467416
Depends On:
Blocks:
 
Reported: 2017-04-05 17:21 UTC by Mike Fiedler
Modified: 2017-09-22 15:39 UTC (History)
12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-09-22 15:04:42 UTC
Target Upstream Version:
Embargoed:


Attachments

Description Mike Fiedler 2017-04-05 17:21:13 UTC
Description of problem:

On the CNCF gear (2K nodes on OpenStack 10), we ran a cluster horizontal test to create 300 projects, each with the following specs:

- 75 deployment configs
- 100 replicas/pods
- 2 builds
- 1 image stream
- 2 secrets
- 1 route

A successful run would result in 30K running pods on 2K nodes.

During the run, the following panic occurred on one of the master-api servers. The full master log will be linked to this bz.


Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: [1m4.963099272s] [1m4.91061541s] Writing http response done (20881 items)
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: "List /api/v1/endpoints" [1m4.963100517s] [1.245µs] END
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: I0405 12:56:08.576332  107783 logs.go:41] http2: panic serving 10.10.0.18:42370: kill connection/stream
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: goroutine 550297327 [running]:
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: net/http.(*http2serverConn).runHandler.func1(0xc4612cb150, 0xc454cbdf7f, 0xc42766afc0)
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/net/http/h2_bundle.go:4304 +0xd1
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: panic(0x3a912a0, 0xc42041a550)
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/runtime/panic.go:458 +0x243
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters.(*baseTimeoutWriter).timeout(0xc5f780fa40, 0x0, 0x0)
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: /builddir/build/BUILD/atomic-openshift-git-0.35a4ef0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters/timeout.go:205 +0x336
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters.(*timeoutHandler).ServeHTTP(0xc421adec20, 0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: /builddir/build/BUILD/atomic-openshift-git-0.35a4ef0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters/timeout.go:85 +0x2a0
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters.WithMaxInFlightLimit.func1(0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: /builddir/build/BUILD/atomic-openshift-git-0.35a4ef0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters/maxinflight.go:45 +0x127
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: net/http.HandlerFunc.ServeHTTP(0xc421825080, 0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/net/http/server.go:1726 +0x44
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/apiserver/filters.WithRequestInfo.func1(0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: /builddir/build/BUILD/atomic-openshift-git-0.35a4ef0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/apiserver/filters/requestinfo.go:45 +0x212
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: net/http.HandlerFunc.ServeHTTP(0xc421825140, 0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/net/http/server.go:1726 +0x44
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/api.WithRequestContext.func1(0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: /builddir/build/BUILD/atomic-openshift-git-0.35a4ef0/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/kubernetes/pkg/api/requestcontext.go:107 +0xef
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: net/http.HandlerFunc.ServeHTTP(0xc421aded20, 0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/net/http/server.go:1726 +0x44
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: net/http.serverHandler.ServeHTTP(0xc423597a80, 0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/net/http/server.go:2202 +0x7d
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: net/http.initNPNRequest.ServeHTTP(0xc42c5de380, 0xc423597a80, 0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/net/http/server.go:2670 +0x93
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: net/http.(*initNPNRequest).ServeHTTP(0xc4236d9f50, 0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: <autogenerated>:288 +0x79
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: net/http.(Handler).ServeHTTP-fm(0x93083e0, 0xc4612cb150, 0xc4e0fdd680)
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/net/http/h2_bundle.go:4084 +0x4d
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: net/http.(*http2serverConn).runHandler(0xc42766afc0, 0xc4612cb150, 0xc4e0fdd680, 0xc5f780f9e0)
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/net/http/h2_bundle.go:4314 +0x89
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: created by net/http.(*http2serverConn).processHeaders
Apr  5 12:56:08 svt-m-2 atomic-openshift-master-api: /usr/lib/golang/src/net/http/h2_bundle.go:4092 +0x6e2



Version-Release number of selected component (if applicable): 3.5.0.39


How reproducible:  Unknown

The problem occurred while running cluster-loader (https://github.com/openshift/svt/tree/master/openshift_scalability) with the following configuration on the CNCF 2K-node cluster.

projects:
  - num: 300
    basename: svt-1-
    tuning: default
    templates:
      -
        num: 2
        file: ./content/build-template.json
      -
        num: 1
        file: ./content/image-stream-template.json
      - 
        num: 50 
        file: ./content/deployment-config-1rep-pause-template.json
        parameters:
          -
            ENV_VALUE: "asodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij12"
      -
        num: 25
        file: ./content/deployment-config-2rep-pause-template.json
        parameters:
          -
            ENV_VALUE: "asodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij0emc2oed2ed2ed2e2easodfn209e8j0eij12"
      -
        num: 2
        file: ./content/ssh-secret-template.json
      -
        num: 1
        file: ./content/route-template.json
      # rcs and services are implemented in deployments.
tuningsets:
  - name: default
    templates:
      stepping:
        stepsize: 5 
        pause: 10 s
      rate_limit:
        delay: 250 ms
quotas:
  - name: default

Comment 2 Mike Fiedler 2017-04-05 17:35:35 UTC
This is not a one-time panic.   The log is filled with them.   Once it happens the first time, it occurs repeatedly.

Comment 3 Mike Fiedler 2017-04-05 17:44:11 UTC
Even after restarting master-api, commands like oc get projects and oc get pods are failing, with this panic occurring multiple times for each command.

Comment 5 Mike Fiedler 2017-04-05 18:45:31 UTC
Also, it is happening on all master-api servers - not just one as originally stated in the description.

Comment 6 Jay Vyas 2017-04-05 19:27:50 UTC
This looks like https://github.com/kubernetes/kubernetes/pull/29594 to me, but evidently that patch has already been cherry-picked into OpenShift.

Comment 7 Jay Vyas 2017-04-05 19:41:55 UTC
Not closed yet: https://github.com/kubernetes/kubernetes/issues/30305. It's not clear why this is still open, though: either it was forgotten, or it is truly not yet resolved.

Comment 8 Andy Goldstein 2017-04-05 21:03:58 UTC
Stefan, could you please investigate and see if you can root cause this?

Comment 9 Jay Vyas 2017-04-05 21:11:14 UTC
http://file.rdu.redhat.com/schituku/logFile <- log file from a reproduction with the ReallyCrash option set to true

Looks like this is the bit that is getting tripped:

```
	// The timeout writer has not been used by the inner handler.
	// We can safely timeout the HTTP request by sending by a timeout
	// handler
	if !tw.wroteHeader && !tw.hijacked {
		tw.w.WriteHeader(http.StatusGatewayTimeout)
		enc := json.NewEncoder(tw.w)
		enc.Encode(&err.ErrStatus)
	} else {
		// The timeout writer has been used by the inner handler. There is
		// no way to timeout the HTTP request at the point. We have to shutdown
		// the connection for HTTP1 or reset stream for HTTP2.
		//
		// Note from: Brad Fitzpatrick
		// if the ServeHTTP goroutine panics, that will do the best possible thing for both
		// HTTP/1 and HTTP/2. In HTTP/1, assuming you're replying with at least HTTP/1.1 and
		// you've already flushed the headers so it's using HTTP chunking, it'll kill the TCP
		// connection immediately without a proper 0-byte EOF chunk, so the peer will recognize
		// the response as bogus. In HTTP/2 the server will just RST_STREAM the stream, leaving
		// the TCP connection open, but resetting the stream to the peer so it'll have an error,
		// like the HTTP/1 case.
		panic(errConnKilled)
	}
```

Comment 10 Stefan Schimanski 2017-04-06 06:54:57 UTC
My suspicion is that the LongRunningRequestCheck does not support LIST requests with watch=true in the ListOptions, i.e. those are cut off after the timeout.

The timeout defaults to 60 minutes though. Mike, does this match your tests? Or do those timeouts happen earlier?

Here is the origin master LongRunningRequestCheck (which should match that of 3.5):

  https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters/longrunning.go

In Kube 1.6 this changed a lot, though not for the better, I fear:

  https://github.com/kubernetes/kubernetes/blob/312c44da2e1946d5b24eafe731ebd00fbdfeece7/staging/src/k8s.io/apiserver/pkg/server/filters/longrunning.go#L31

There are multiple ways to start a watch LIST request: through ListOptions in the body and through ListOptions as GET url params. In 3.5 at least the latter case is handled:

  https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/pkg/genericapiserver/config.go#L213

@liggit can you comment? I think you touched the watch param logic recently in upstream.
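
For illustration, here is a minimal sketch of how such a check typically classifies requests; the helper name and the path pattern are assumptions, not the vendored implementation linked above:

```
// Minimal sketch of a long-running request check; IsLongRunning and
// longRunningPathRE are hypothetical, the real logic lives in the vendored
// genericapiserver/filters/longrunning.go referenced above.
package filters

import (
	"net/http"
	"regexp"
)

// Hypothetical pattern for endpoints that are always treated as long-running.
var longRunningPathRE = regexp.MustCompile(`(/watch/|/proxy/|/exec|/attach|/portforward|/log)`)

// IsLongRunning reports whether a request should be exempt from the timeout
// applied to ordinary (non-long-running) requests.
func IsLongRunning(r *http.Request) bool {
	if longRunningPathRE.MatchString(r.URL.Path) {
		return true
	}
	// A watch requested via the GET URL parameter is recognized here...
	if r.URL.Query().Get("watch") == "true" {
		return true
	}
	// ...but a watch expressed only through ListOptions in the request body
	// never reaches this check, so such a request keeps the normal timeout
	// and is eventually cut off -- which is the suspicion above.
	return false
}
```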

Comment 11 Mike Fiedler 2017-04-06 12:14:48 UTC
@stts, My observation has been:
    - the panics start occurring after a couple of hours in this scale-up run.   Now that I know what to look for, I can get exact timing if that is useful by re-running starting with a clean cluster
    - In yesterday's case, restarting the controllers/api-servers made no difference.   The panics resumed almost immediately after restart.

Comment 12 Mike Fiedler 2017-04-07 14:44:03 UTC
At the suggestion of @sttts - based on his observation of 75 panics/second in the logs - I set maxRequestsInFlight to 65 as an experiment. Setting it below 75 eliminated the panics, but it caused lots of issues with short-running GETs (nodes getting secrets, etc.). Hopefully this is a useful data point.

Comment 13 Stefan Schimanski 2017-04-19 08:26:49 UTC
The workaround of reducing maxRequestsInFlight to 65 suggests that the rate of requests overloads the apiserver(s). As a consequence, certain LIST requests (we do not know which) reach the (hard-coded) timeout of 60 seconds for non-long-running requests.

At the moment, the only way to control the load on the server is to set maxRequestsInFlight to a value low enough that the server can process the expensive LIST requests within those 60 seconds. At the same time, this throttles all non-expensive requests like GETs as well, which had a bad influence on cluster behavior.

Open questions:

- how much of the load is due to cheap requests and how much is due to expensive LIST requests?
- where is the bottleneck? etcd? are optimizations possible?
- which LIST calls are expensive in the experiment?
- can those LIST calls be optimized client-side (in the kubelet? in the proxy?)?

What we can do:

- make the 60 sec timeout configurable
- add a maxExpensiveRequestsInFlight flag which applies only to LIST requests. This can be set much lower than the generic maxRequestsInFlight value.
- develop self-controlling mechanisms to control maxExpensiveRequestsInFlight dynamically depending on the duration of expensive requests.

@mike: how can we get answers to the first 3 open questions?
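
As a rough illustration of the proposed maxExpensiveRequestsInFlight flag, here is a hedged sketch of a filter that limits only LIST requests with a separate semaphore; the names, the LIST heuristic, and the 429/Retry-After response are assumptions, not the shipped implementation:

```
// Hedged sketch of an in-flight limit applied only to expensive LIST
// requests; handler name, flag, and response details are illustrative only.
package filters

import "net/http"

// WithMaxExpensiveInFlight wraps handler so that at most limit LIST requests
// run concurrently; all other requests pass through untouched.
func WithMaxExpensiveInFlight(handler http.Handler, limit int) http.Handler {
	sem := make(chan struct{}, limit)
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// Crude heuristic for the sketch: treat any GET without watch=true as
		// a potentially expensive LIST; a real filter would use RequestInfo.
		isList := r.Method == http.MethodGet && r.URL.Query().Get("watch") != "true"
		if !isList {
			handler.ServeHTTP(w, r)
			return
		}
		select {
		case sem <- struct{}{}:
			defer func() { <-sem }()
			handler.ServeHTTP(w, r)
		default:
			// Push back instead of queueing: tell the client to retry later.
			w.Header().Set("Retry-After", "1")
			http.Error(w, "too many expensive requests in flight", http.StatusTooManyRequests)
		}
	})
}
```

In this sketch, rejected LISTs get an immediate 429 rather than queueing, so cheap GETs keep flowing while the expensive requests are shed.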

Comment 17 Stefan Schimanski 2017-04-20 07:09:53 UTC
Reducing severity as it is not urgent anymore without the test cluster. Also postponing to the next point release.

Comment 18 Andy Goldstein 2017-05-02 18:35:01 UTC
Mike, have you been able to run any more tests?

Comment 21 Jordan Liggitt 2017-05-30 15:57:16 UTC
No info will be available until June.

Comment 22 Eric Paris 2017-06-27 23:28:14 UTC
Do we have anything else to go on here? At this point I'm going to move this BZ to 3.7.

Comment 25 Derek Carr 2017-07-03 18:35:44 UTC
*** Bug 1467416 has been marked as a duplicate of this bug. ***

Comment 26 Mike Fiedler 2017-07-04 02:01:52 UTC
Per duplicate bug 1467416, we can recreate this by deploying logging in a 1200-node cluster without the new mux service turned on for fluentd. When the fluentd daemonset was created, repeated/continuous panics appeared in the master-api logs and the cluster became unstable - at least in terms of the oc commands needed to rescue the cluster and get it working reliably again.

Comment 27 Mike Fiedler 2017-07-07 02:39:19 UTC
Hit this in the OpenShift 3.6 scalability lab while trying to delete 5000 projects with 20000 pods. During the delete (with OOTB in-flight settings), the same panic as reported above started happening continuously.

Master log location will be linked when available.

Comment 28 Mike Fiedler 2017-07-07 02:46:43 UTC
Scenarios reproducing this:

1. Scale up to ~23K pods with 2000 nodes in 3.5 in the CNCF cluster
2. Start daemonset with 1200 fluentd pods trying to contact master-api-server in 3.6
3. Delete 5000 projects/20K pods in 3.6 in a cluster with 2000 nodes.

Comment 29 Clayton Coleman 2017-07-07 15:42:37 UTC
I think in this case the problem is that we aren't exerting enough back pressure - clients are effectively crushing the master. So something like the rate of timeouts per unit time should control how much back pressure to put on clients, just as the rate of 429s per second should alter how long a Retry-After to send back to the client.
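
To make that concrete, here is a hedged sketch of one way such adaptive back-pressure could look, with the Retry-After sent on 429 responses growing with the recently observed timeout rate; the type and the scaling rule are purely illustrative, not existing apiserver behavior:

```
// Illustrative only: scale the Retry-After hint with the recent timeout rate.
package backpressure

import (
	"strconv"
	"sync"
	"time"
)

// TimeoutTracker counts request timeouts in one-second buckets and turns the
// last completed bucket's count into a Retry-After value for 429 responses.
type TimeoutTracker struct {
	mu         sync.Mutex
	windowEnds time.Time
	count      int
	lastRate   int
}

// RecordTimeout notes that a request just hit the server-side timeout.
func (t *TimeoutTracker) RecordTimeout() {
	t.mu.Lock()
	defer t.mu.Unlock()
	now := time.Now()
	if now.After(t.windowEnds) {
		t.lastRate = t.count
		t.count = 0
		t.windowEnds = now.Add(time.Second)
	}
	t.count++
}

// RetryAfter returns a Retry-After value in seconds: the more timeouts per
// second were seen in the last bucket, the longer clients are asked to wait.
func (t *TimeoutTracker) RetryAfter() string {
	t.mu.Lock()
	defer t.mu.Unlock()
	seconds := 1 + t.lastRate/10 // e.g. ~75 timeouts/s -> ask clients to wait ~8s
	return strconv.Itoa(seconds)
}
```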

Comment 30 Mike Fiedler 2017-07-07 16:29:45 UTC
In the delete-5000-projects case, lowering max in flight to 100 allowed the deletes to complete without panics. Quite a few nodes went NotReady, though, with timeouts/connection issues to the master-api servers via the master-api load balancer.

Comment 31 Clayton Coleman 2017-07-14 17:41:26 UTC
In prod we're increasing (doubling) the limits because we are not CPU or memory bound, but we are bound by the etcd2 + JSON setup. I don't know what to recommend for this case yet.

Comment 32 Stefan Schimanski 2017-07-17 15:09:34 UTC
For reference, these are the two open upstream issues around 429 handling:

https://github.com/kubernetes/client-go/issues/222

https://github.com/kubernetes/kubernetes/issues/48610

Comment 33 Derek Carr 2017-08-19 03:49:29 UTC
Is there anything we can do here now? Was this an actual panic? I thought this was the logs saying the request took too long, and not actual panics. If so, can we at least change the bz title so it is less dire? I thought I discussed similar issues with David on other BZs.

Comment 34 Stefan Schimanski 2017-08-21 06:59:46 UTC
We can probably improve our back-pressure logic to keep the cluster functional in the case of many timeouts of long-running LIST requests. Other than that, I agree about changing the title. It's more like: "Continuous request timeouts due to ineffective back-pressure when creating 22.5K deployment configs with 30K pods".

