Bug 1340519
Summary:          Running kubernetes density e2e causes docker 1.10 to wedge, completely unresponsive.
Product:          Fedora
Component:        docker
Version:          24
Hardware:         All
OS:               Linux
Status:           CLOSED ERRATA
Severity:         high
Priority:         unspecified
Reporter:         Timothy St. Clair <tstclair>
Assignee:         Antonio Murdaca <amurdaca>
QA Contact:       Fedora Extras Quality Assurance <extras-qa>
CC:               adimania, admiller, agoldste, amurdaca, dwalsh, eparis, erich, ichavero, jcajka, jchaloup, jeder, lnykryn, lsm5, marianne, mifiedle, miminar, mpatel, nalin, qcai, riek, rrati, skuznets, tstclair, vbatts, vgoyal
Fixed In Version: docker-1.10.3-15.gitf476348.fc24
Bug Blocks:       1341906 (view as bug list)
Type:             Bug
Last Closed:      2016-06-18 18:48:36 UTC
Description (Timothy St. Clair, 2016-05-27 16:23:17 UTC)
sudo strace docker ps shows timeouts on /var/run/docker.sock.

Unfortunately those logs aren't related to this hang, and I took care of them (an update will be released tomorrow in F23 and F24). Could you dump the goroutine stacks of the docker daemon while you issue docker ps and attach them here?

    kill -SIGUSR1 "$(pidof docker)"

(The stacks will be printed in the daemon logs.)

Created attachment 1163705 [details]
docker crash w/recover
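The goroutine-dump request above can be sketched as a tiny shell helper (`dump_stacks` is a hypothetical wrapper name; docker 1.10's daemon prints all goroutine stacks to its log when it receives SIGUSR1):

```shell
# dump_stacks PID: ask a process to dump diagnostics by sending SIGUSR1
# (the docker daemon writes its goroutine stacks to its log; sending the
# signal to a root-owned daemon may require sudo).
dump_stacks() {
    kill -s USR1 "$1"
}

# Typical use against the daemon, then reading the stacks back from the journal:
# dump_stacks "$(pidof docker)"
# journalctl -u docker.service --no-pager | tail -n 200
```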
I ran the following controlled experiment to reproduce the behavior and captured a crash.
terminal 1: for i in `seq 1 200`; do sudo docker run -d busybox sleep 1d; done
terminal 2: sudo docker ps (issued at random intervals)
crash
Behavior is sporadic, sometimes it crashes, sometimes it locks. I will try to capture a lock as well.
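The controlled experiment above can be wrapped in a small script for repeated runs. This is a sketch; `spawn_containers` is a hypothetical helper name, and the count and container command are the ones from the experiment:

```shell
# spawn_containers COUNT CMD...: run CMD COUNT times, as in the repro loop
# from terminal 1, and report how many starts were attempted.
spawn_containers() {
    count="$1"; shift
    i=1
    while [ "$i" -le "$count" ]; do
        # In the actual repro, CMD is: sudo docker run -d busybox sleep 1d
        "$@" >/dev/null || echo "spawn $i failed" >&2
        i=$((i + 1))
    done
    echo "attempted $count starts"
}

# Repro (assumes docker and the busybox image are present):
# spawn_containers 200 sudo docker run -d busybox sleep 1d
# Meanwhile, in a second terminal, issue `sudo docker ps` at random intervals.
```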
The only log message seen on SIGUSR1 in the locked condition is:

    Jun 01 14:52:23 timz-kubelet-v1.3.novalocal docker[12159]: time="2016-06-01T14:52:23.056915136Z" level=error msg="Force shutdown daemon"

I may have seen this happen on RHEL 7.2 as well. I was running a density test and, watching top on the kubelet, I saw docker using about 60% of the CPU. Eventually CPU usage dropped and my test failed. I attempted to do a docker ps on the kubelet, but nothing responded. I attempted the 'kill -SIGUSR1' command above, but there is no stack trace.

    Jun 01 14:50:46 kubelet-v1.3.novalocal forward-journal[6888]: time="2016-06-01T14:50:46.341854107-04:00" level=error msg="Force shutdown daemon"
    Jun 01 14:50:41 kubelet-v1.3.novalocal forward-journal[6888]: time="2016-06-01T14:50:41.383174809-04:00" level=info msg="Container aa3c92b9e1248d50a1ab523d6647c83ea9f052a2f797524334fa6c429128f553 failed to exit within 10 seconds of SIGTERM - using the force"
    Jun 01 14:50:41 kubelet-v1.3.novalocal forward-journal[6888]: time="2016-06-01T14:50:41.369585491-04:00" level=info msg="Container 00b4e8276d66753d997368598df2e5355540f224f2a9fe0f25e368d0013bb22c failed to exit within 10 seconds of SIGTERM - using the force"
    Jun 01 14:50:41 kubelet-v1.3.novalocal forward-journal[6888]: time="2016-06-01T14:50:41.366587471-04:00" level=info msg="Container 1c3ab8fe4111cc30c85ed040c08d4b02e7e4b9811bf491620187d49b80924e6a failed to exit within 10 seconds of SIGTERM - using the force"
    Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: process finished code=0
    Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down output type="elasticsearch" plugin_id="object:1532cd4"
    Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down output type="null" plugin_id="object:10717cc"
    Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down filter type="kubernetes_metadata" plugin_id="object:101358c"
    Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down input type="tail" plugin_id="object:111be0c"
    Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down input type="tail" plugin_id="object:113d890"
    Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down input type="tail" plugin_id="object:11a0814"
    Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down input type="tail" plugin_id="object:1170c68"
    Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down input type="tail" plugin_id="object:11552ec"
    Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down input type="tail" plugin_id="object:117a8a8"
    Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down input type="tail" plugin_id="object:1180500"
    Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down input type="tail" plugin_id="object:1185fb4"
    Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down input type="tail" plugin_id="object:152aed0"
    Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down fluentd
    Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: shutting down, got signal: Terminated
    Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: shutting down, got signal: Terminated
    Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: shutting down, got signal: Terminated
    ...
    Jun 01 14:50:31 kubelet-v1.3.novalocal forward-journal[6888]: time="2016-06-01T14:50:31.338924560-04:00" level=info msg="Processing signal 'terminated'"
    Jun 01 14:50:30 kubelet-v1.3.novalocal docker-latest[6887]: 2016/06/01 18:50:30 skydns: failure to forward request "read udp 10.1.4.30:53: i/o timeout"
    Jun 01 14:50:25 kubelet-v1.3.novalocal docker-latest[6887]: 2016/06/01 18:50:25 Worker running nslookup kubernetes.default.svc.cluster.local 127.0.0.1 >/dev/null
    Jun 01 14:50:22 kubelet-v1.3.novalocal docker-latest[6887]: 2016/06/01 18:50:22 skydns: failure to forward request "read udp 10.1.4.30:53: i/o timeout"
    Jun 01 14:50:17 kubelet-v1.3.novalocal docker-latest[6887]: 2016/06/01 18:50:17 Worker running nslookup kubernetes.default.svc.cluster.local 127.0.0.1 >/dev/null
    ...
    Jun 01 14:40:05 kubelet-v1.3.novalocal forward-journal[6888]: time="2016-06-01T14:40:05.610263039-04:00" level=error msg="Unable to read peer creds and server socket address: bad file descriptor"
    Jun 01 14:40:05 kubelet-v1.3.novalocal forward-journal[6888]: time="2016-06-01T14:40:05.610169893-04:00" level=warning msg="ResponseWriter does not contain a field named conn"

(In reply to Timothy St. Clair from comment #3)
> Created attachment 1163705 [details]
> docker crash w/recover
>
> I ran the following controlled experiment to reproduce the behavior and
> captured a crash.
>
> terminal 1: for i in `seq 1 200`; do sudo docker run -d busybox sleep 1d; done
> terminal 2: sudo docker ps <-random
>
> crash
>
> Behavior is sporadic, sometimes it crashes, sometimes it locks. I will try
> to capture a lock as well.

I'm getting the same results (same panic in the journald logs). Nalin, can this be related to the logging work going on in docker? I can see things in the logs related to the log copier (and notify OOM?).

I've tested the same command with upstream docker 1.10.3: docker hangs the same way it does with our docker fork, and the logs are pretty much the same.
I'm gonna report this upstream.

The logs show a panic which really seems related to "godbus"; it could come from either the systemd cgroup driver or the journald log driver. Could you run the test without the systemd cgroup driver and the journald log driver, to see whether docker works fine without using dbus?

Indeed, that is what I just did. Turns out I still experience "runtime/cgo: pthread_create failed: Resource temporarily unavailable".

Well, we could attempt this test against docker-1.11 or docker-1.12-devel to see if it still has it.

Upstream says that setting TasksMax=infinity solves this issue:

https://github.com/docker/docker/issues/20096
https://github.com/docker/docker/issues/21298
https://github.com/docker/docker/issues/22978
https://github.com/docker/docker/issues/9868

Is that something we set in the unit file?

The situation with 1.11.2 from upstream seems worse: it is about 5x slower to spawn containers, and eventually it ends with:

    713bf2f18c49f46f8119c3d10891eb729ac0960eadc895597d2c92c613de237b
    536b66834bb7b046dae6ba791eedef0967f9417ada29320260c07ba04466e35c
    b6206776be50c6285fe18424b4868f57c57912fc0924341a15e95c51277b333b
    1a0c421cbfcf692129a0a98847750a1974ebb1660be4370bf7a311426421365e
    d958e81c82a6b15a18763fbdcc749b3b70aedf46b6ace048b321dea682159468
    821f0a469ba0972214f6f05b4a92c04dae083be11fa3e3590948f827f204577e
    a5413638baf99f1c5aaafdf95edaab606a0b7f5290d45073a6d826f1bb14213b
    c442afb6bb7b3a24cd1fb4ae6aa51c1bd1ee403db473c25507ad1b064a393814
    8c67d23dbac7c8599a78b3385547702252a6748a5fdd174e220c288776ac7e1a
    af30515600c6f724bb53c422bf4d1e19195dd833bbb41a558e78542689328f23
    docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
    17b5d244ca27a71a71226c9a8dfb616e9a1c301fcb5cec23d5345278e18c7c18
    docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
    d30a02560b4c6b2f8d6b4797f75a3d451b7f38577dc9bf5642edd00c157fed21
    docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
    310a6d2a75741b39d78f4ad9e7c33c231ffa6addc382f212caf88902e1b0627b
    docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
    98dad17c5ba6b0407f7e8b3c9b200c50db91b013bf634b3aaa7218fe7011c55f
    docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
    4b0fa3cc7529e02df075c693ea89ea3380033f3c00bdce457a388920b51b272c
    docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
    0e73ffa738c7595fcde323b33e39317565c86dd010e21b066d9c2dc2851a316d
    docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
    180bdc290b5226390f1e091ea1c48a793fa0923230c1ca703d51c54df8226619
    docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
    c61f84e68a9b24619b3c2ba4d6a583723bc549bc82f73b059c9e5c88a6ef9d05
    docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
    6a4b92b8ce6a1e7b5d88d64417ae32357705d18adc5a606a180bcf3861235426
    docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
    4aa37ac242da836c420cef487513ef1645ca9b2f5727430240e8cca1be1cb201
    docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
    cb463deea237796aef20e6dc1a292d40831c93294c2a831a5d4b2c97836f786d
    docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
    1b1c5cded7514b7cb91a2bd41910010975b67a2474e65832687972c78790b177
    docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
    1a34d33f2955175b7a17dfbf5592115ffcf3a59fde99b10a2878c9ce2fee1c56
    docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
    398c9d9a8a71d47a78bcc47dd5641a4dbce3403cab0a447f03557cdc4948bc3e

I'll try to set TasksMax=infinity with 1.10.3, but I remember we dropped it in Fedora because it caused issues somewhere (sorry, I don't remember exactly).

(In reply to Daniel Walsh from comment #13)
> Is that something we set in unit file?

No, we don't right now. I'm going to test on my laptop with TasksMax set to infinity. Someone also replied to my post upstream and pointed me to https://github.com/docker/docker/issues/22865#issuecomment-220718133 (not sure if it's related).

TasksMax defaults to 512, according to googling. I don't know whether this counts threads or processes. For something like docker this is probably low.

Setting TasksMax=infinity solved this BZ; I'm not sure what a safe value would be, though. I don't feel infinity is good enough. Jeremy, do you have any idea? I'm going to fix the docker service file in Fedora after we agree on a safe value.

Antonio, it shouldn't be related to session keys; however, we should set it to a high value regardless. Tim, any chance you can try this out by setting TasksMax=infinity in docker.service? Thanks.

Because the 512 default will blow up many things, if/when TasksMax is backported to RHEL it will default to infinity. See https://bugzilla.redhat.com/show_bug.cgi?id=1337244#c7 and the corresponding discussion here: http://post-office.corp.redhat.com/archives/os-devel-list/2016-May/msg00019.html

Adding Lukas to this BZ since it's about Fedora. Lukas -- what is everyone doing in Fedora for TasksMax? I see ceph is setting it explicitly in their unit file so they don't encounter this issue. Is that the plan for Fedora, to have each unit file specify TasksMax=infinity?

docker-1.10.3-15.gitf476348.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-6ef52e1fc3

I don't maintain the Fedora package, but I would advise sticking with the upstream setting there.
> Is that the plan for Fedora, to have each unit file specify TasksMax=infinity?
Only where it makes sense.
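The workaround discussed above can be expressed as a systemd drop-in rather than an edit to the shipped unit file. This is a sketch, assuming a systemd version with TasksMax support (v228 or later); the drop-in file name is illustrative:

```ini
# /etc/systemd/system/docker.service.d/tasksmax.conf (illustrative path)
[Service]
# Lift the per-unit task limit (512 by default at the time) that the daemon
# exhausts under heavy container churn; a high finite value works as well.
TasksMax=infinity
```

After creating the drop-in, `sudo systemctl daemon-reload && sudo systemctl restart docker` makes it take effect, and `systemctl show docker -p TasksMax` confirms the new limit.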
docker-1.10.3-15.gitf476348.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-6ef52e1fc3

docker-1.10.3-15.gitf476348.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.