Bug 1467416

Summary: panic in net/http/server.go starting daemonset with 1200 pods
Product: OpenShift Container Platform
Component: Node
Version: 3.6.0
Hardware: x86_64
OS: Linux
Severity: high
Priority: unspecified
Status: CLOSED DUPLICATE
Reporter: Mike Fiedler <mifiedle>
Assignee: Derek Carr <decarr>
QA Contact: DeShuai Ma <dma>
CC: aos-bugs, ekuric, jokerman, mmccomas
Target Milestone: ---
Target Release: ---
Whiteboard: aos-scalability-36
Type: Bug
Last Closed: 2017-07-03 18:35:44 UTC

Description Mike Fiedler 2017-07-03 17:51:12 UTC
Description of problem:

This looks very similar to https://bugzilla.redhat.com/show_bug.cgi?id=1439324, but I hit it via a different route, and the line numbers and packages in the stack are somewhat different, so I am opening a new bug for initial triage.

The scale lab is currently at 1200 nodes (on the way to 2000). I deployed logging in the cluster, and when the logging-fluentd daemonset was created, I started seeing panics in the master API server logs. The stack is:

Jul  3 13:35:22 172 atomic-openshift-master-api: I0703 13:35:22.987317   97941 logs.go:41] http: panic serving 172.16.0.20:41518: kill connection/stream
Jul  3 13:35:22 172 atomic-openshift-master-api: goroutine 3576844 [running]:
Jul  3 13:35:22 172 atomic-openshift-master-api: net/http.(*conn).serve.func1(0xc54ec97580)
Jul  3 13:35:22 172 atomic-openshift-master-api: /usr/lib/golang/src/net/http/server.go:1491 +0x12a
Jul  3 13:35:22 172 atomic-openshift-master-api: panic(0x4777520, 0xc4204e4fd0)
Jul  3 13:35:22 172 atomic-openshift-master-api: /usr/lib/golang/src/runtime/panic.go:458 +0x243
Jul  3 13:35:22 172 atomic-openshift-master-api: github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters.(*baseTimeoutWriter).timeout(0xc980ef01c0, 0xc9801a8a00)
Jul  3 13:35:22 172 atomic-openshift-master-api: /builddir/build/BUILD/atomic-openshift-git-0.c91cc09/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:214 +0x187
Jul  3 13:35:22 172 atomic-openshift-master-api: github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters.(*timeoutHandler).ServeHTTP(0xc421365100, 0xa3f59a0, 0xc6931d32b0, 0xc975fbf0e0)
Jul  3 13:35:22 172 atomic-openshift-master-api: /builddir/build/BUILD/atomic-openshift-git-0.c91cc09/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters/timeout.go:98 +0x28c
Jul  3 13:35:22 172 atomic-openshift-master-api: github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters.WithMaxInFlightLimit.func1(0xa3f59a0, 0xc6931d32b0, 0xc975fbf0e0)
Jul  3 13:35:22 172 atomic-openshift-master-api: /builddir/build/BUILD/atomic-openshift-git-0.c91cc09/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/filters/maxinflight.go:95 +0x197
Jul  3 13:35:22 172 atomic-openshift-master-api: net/http.HandlerFunc.ServeHTTP(0xc42132bfc0, 0xa3f59a0, 0xc6931d32b0, 0xc975fbf0e0)
Jul  3 13:35:22 172 atomic-openshift-master-api: /usr/lib/golang/src/net/http/server.go:1726 +0x44
Jul  3 13:35:22 172 atomic-openshift-master-api: github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/endpoints/filters.WithRequestInfo.func1(0xa3f59a0, 0xc6931d32b0, 0xc975fbf0e0)
Jul  3 13:35:22 172 atomic-openshift-master-api: /builddir/build/BUILD/atomic-openshift-git-0.c91cc09/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/endpoints/filters/requestinfo.go:45 +0x212
Jul  3 13:35:22 172 atomic-openshift-master-api: net/http.HandlerFunc.ServeHTTP(0xc4213667b0, 0xa3f59a0, 0xc6931d32b0, 0xc975fbf0e0)
Jul  3 13:35:22 172 atomic-openshift-master-api: /usr/lib/golang/src/net/http/server.go:1726 +0x44
Jul  3 13:35:22 172 atomic-openshift-master-api: github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/endpoints/request.WithRequestContext.func1(0xa3f59a0, 0xc6931d32b0, 0xc975fbf0e0)
Jul  3 13:35:22 172 atomic-openshift-master-api: /builddir/build/BUILD/atomic-openshift-git-0.c91cc09/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/endpoints/request/requestcontext.go:107 +0xef
Jul  3 13:35:22 172 atomic-openshift-master-api: net/http.HandlerFunc.ServeHTTP(0xc4213651c0, 0xa3f59a0, 0xc6931d32b0, 0xc975fbf0e0)
Jul  3 13:35:22 172 atomic-openshift-master-api: /usr/lib/golang/src/net/http/server.go:1726 +0x44
Jul  3 13:35:22 172 atomic-openshift-master-api: github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server.(*APIServerHandler).ServeHTTP(0xc4213667e0, 0xa3f59a0, 0xc6931d32b0, 0xc975fbf0e0)
Jul  3 13:35:22 172 atomic-openshift-master-api: /builddir/build/BUILD/atomic-openshift-git-0.c91cc09/_output/local/go/src/github.com/openshift/origin/vendor/k8s.io/apiserver/pkg/server/handler.go:193 +0x51
Jul  3 13:35:22 172 atomic-openshift-master-api: net/http.serverHandler.ServeHTTP(0xc4247d2a80, 0xa3f59a0, 0xc6931d32b0, 0xc975fbf0e0)
Jul  3 13:35:22 172 atomic-openshift-master-api: /usr/lib/golang/src/net/http/server.go:2202 +0x7d
Jul  3 13:35:22 172 atomic-openshift-master-api: net/http.(*conn).serve(0xc54ec97580, 0xa3fa120, 0xc9761c2440)
Jul  3 13:35:22 172 atomic-openshift-master-api: /usr/lib/golang/src/net/http/server.go:1579 +0x4b7
Jul  3 13:35:22 172 atomic-openshift-master-api: created by net/http.(*Server).Serve
Jul  3 13:35:22 172 atomic-openshift-master-api: /usr/lib/golang/src/net/http/server.go:2293 +0x44d




Version-Release number of selected component (if applicable): 

openshift v3.6.126.1
kubernetes v1.6.1+5115d708d7



How reproducible: Once so far. I will update the BZ after trying to delete/recreate the daemonset.


Steps to Reproduce:
1.  Deployed HA cluster with 1200 compute nodes
2.  Deployed logging


Actual results:

Panics in the API server logs

Expected results:

No error.
Additional info:

Comment 2 Derek Carr 2017-07-03 18:19:32 UTC
From the logs, the panic is coming from an actual timeout on serving the request:

see: https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/staging/src/k8s.io/apiserver/pkg/server/filters/timeout.go#L214
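For reference, a minimal sketch (not the actual apiserver filter; names are illustrative) of how this kind of timeout filter produces the "http: panic serving ...: kill connection/stream" lines above: the inner handler runs in its own goroutine, and if the deadline expires after the handler has already begun writing the response, the filter can only abort by panicking with a sentinel error, which net/http recovers and logs in (*conn).serve.

package main

import (
	"errors"
	"net/http"
	"time"
)

// errConnKilled is an illustrative sentinel; the real filter panics with a
// value whose message is "kill connection/stream", which is what shows up in
// the log lines above.
var errConnKilled = errors.New("kill connection/stream")

// withTimeout is a simplified stand-in for the apiserver timeout filter.
func withTimeout(h http.Handler, d time.Duration) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		done := make(chan struct{})
		go func() {
			defer close(done)
			h.ServeHTTP(w, r)
		}()
		select {
		case <-done:
			// Handler finished within the deadline; nothing to do.
		case <-time.After(d):
			// Deadline expired. If the handler has already written part of
			// the body, a clean 504 is no longer possible, so the filter
			// panics; net/http recovers this in (*conn).serve and logs
			// "http: panic serving ...". (The real filter also wraps w in a
			// timeout-aware writer -- the baseTimeoutWriter in the trace --
			// so the handler goroutine cannot race with the teardown; that
			// is omitted here for brevity.)
			panic(errConnKilled)
		}
	})
}

func main() {
	// A stand-in for a LIST that starts writing promptly but takes longer
	// than the timeout to finish serializing its (large) response.
	slowList := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("first chunk of a very large LIST response..."))
		time.Sleep(2 * time.Second)
		w.Write([]byte("...rest of the response"))
	})
	http.ListenAndServe(":8080", withTimeout(slowList, time.Second))
}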

Comment 3 Derek Carr 2017-07-03 18:35:44 UTC
Duplicating on the original bug: the line numbers do not align because the code moved between 1.5 and 1.6, but it is the same basic problem of the request timing out.

https://github.com/openshift/origin/blob/release-1.5/vendor/k8s.io/kubernetes/pkg/genericapiserver/filters/timeout.go#L205

is the same as 1.6:

https://github.com/openshift/origin/blob/master/vendor/k8s.io/kubernetes/staging/src/k8s.io/apiserver/pkg/server/filters/timeout.go#L214

It appears that the LIST call starts writing, but is unable to finish serializing the response in time to write it out before hitting the timeout.
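In other words (a hedged sketch with hypothetical names, not the real apiserver types): whether the deadline produces a clean 504 or the panic above depends on whether any bytes have already been written. The filter tracks this through the wrapped writer (the baseTimeoutWriter in the trace); once a large LIST has started streaming its serialized body, the only remaining option is to kill the connection/stream.

package main

import "net/http"

// trackingWriter is a hypothetical wrapper that remembers whether headers or
// body bytes have been sent, analogous to the baseTimeoutWriter in the trace.
type trackingWriter struct {
	http.ResponseWriter
	wrote bool
}

func (t *trackingWriter) WriteHeader(code int) {
	t.wrote = true
	t.ResponseWriter.WriteHeader(code)
}

func (t *trackingWriter) Write(b []byte) (int, error) {
	t.wrote = true
	return t.ResponseWriter.Write(b)
}

// onTimeout sketches the decision made when the deadline fires: a request
// that has not written anything yet can still get a clean 504, but a LIST
// that has begun streaming its serialized body can only be aborted by
// tearing down the connection, which is the panic seen in the logs above.
func onTimeout(t *trackingWriter) {
	if !t.wrote {
		http.Error(t.ResponseWriter, "request timed out", http.StatusGatewayTimeout)
		return
	}
	panic("kill connection/stream")
}

func main() {} // sketch only; see the handler wiring in the previous comment's example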

*** This bug has been marked as a duplicate of bug 1439324 ***