Bug 1340519

Summary: Running kubernetes density e2e causes docker 1.10 to wedge, completely unresponsive.
Product: Fedora
Component: docker
Version: 24
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Timothy St. Clair <tstclair>
Assignee: Antonio Murdaca <amurdaca>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: adimania, admiller, agoldste, amurdaca, dwalsh, eparis, erich, ichavero, jcajka, jchaloup, jeder, lnykryn, lsm5, marianne, mifiedle, miminar, mpatel, nalin, qcai, riek, rrati, skuznets, tstclair, vbatts, vgoyal
Target Milestone: ---
Target Release: ---
Fixed In Version: docker-1.10.3-15.gitf476348.fc24
Last Closed: 2016-06-18 18:48:36 UTC
Type: Bug
Clones: 1341906 (view as bug list)
Bug Blocks: 1341906
Attachments: docker crash w/recover

Description Timothy St. Clair 2016-05-27 16:23:17 UTC
Description of problem:
Running the kubernetes density tests on an f24 cluster to validate correctness and performance of docker 1.10 against the latest kubernetes 1.3 master. While the density tests are running, docker becomes unresponsive and does not recover.

Version-Release number of selected component (if applicable):
1.10.3-10.git8ecd47f.fc24

How reproducible:
100%

Steps to Reproduce:
1. e2e.test --host=https://ip.addr:443 --ginkgo.focus="Density" --kubeconfig=/etc/kubernetes/kubectl.kubeconfig --repo-root=/root/kubernetes --provider=local
2. Let the test execute for a while, then log in to a kube node.
3. sudo docker ps.

Actual results:
docker ps is unresponsive and never returns.

Docker cannot be restarted.

Expected results:
docker ps returns its result.

Additional info:
Docker logs show a series of odd messages:
----
May 27 16:18:18 timz-kubelet-v1.3-2.novalocal docker[20185]: time="2016-05-27T16:18:18.239998873Z" level=error msg="Handler for GET /containers/fc56ebd888b97bb1f8b4ce9b35f6a38e325f9a46e1568ff915000921
May 27 16:18:18 timz-kubelet-v1.3-2.novalocal docker[20185]: time="2016-05-27T16:18:18.239811216Z" level=error msg="Unable to read peer creds and server socket address: bad file descriptor"
May 27 16:18:18 timz-kubelet-v1.3-2.novalocal docker[20185]: time="2016-05-27T16:18:18.239690862Z" level=warning msg="tcpconn is not a struct"

---
May 27 16:18:29 timz-kubelet-v1.3-2.novalocal docker[20185]: 2016-05-27 16:18:29 +0000 [warn]: /var/log/containers/density250-0-057e09f0-241f-11e6-a818-fa163eff0a21-64nzd_e2e-tests-density-mplsu_density250-0-057e09f0-241f-11e6-a818-fa163eff0a21-617068ef83b573522ba21a3e8c3b4f224d10f7f9ef49ef3d0beabb1a1ebcde44.log unreadable. It is excluded and would be examined next time.

---

Comment 1 Timothy St. Clair 2016-05-27 16:32:46 UTC
sudo strace docker ps shows timeouts on /var/run/docker.sock.
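
A hedged approximation of that strace invocation (the flags here are illustrative, not taken from the report):

  sudo strace -f -tt -e trace=connect,read,recvfrom docker ps
  # With a wedged daemon, the client typically blocks in read()/recvfrom() on the
  # connected /var/run/docker.sock descriptor until its timeout expires.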

Comment 2 Antonio Murdaca 2016-05-31 20:17:28 UTC
Unfortunately those logs aren't related to this hang, and I have taken care of them (an update will be released tomorrow for F23 and F24).

Could you dump the goroutine stacks of the docker daemon while you issue docker ps, and attach them here?

kill -SIGUSR1 `pidof docker` (the stacks will be printed in the daemon logs)
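
A minimal sketch of capturing and reading that dump on a systemd-managed host (the docker.service unit name and the 1.10-era single "docker" daemon binary are assumptions):

  sudo kill -SIGUSR1 "$(pidof docker)"   # ask the daemon to dump all goroutine stacks
  sudo journalctl -u docker.service -e   # the dump appears in the daemon's journal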

Comment 3 Timothy St. Clair 2016-06-01 14:48:41 UTC
Created attachment 1163705 [details]
docker crash w/recover

I ran the following controlled experiment to reproduce the behavior and captured a crash. 


terminal 1: for i in `seq 1 200`; do sudo docker run -d busybox sleep 1d; done
terminal 2: sudo docker ps   (run repeatedly, at random intervals)

crash 

The behavior is sporadic: sometimes it crashes, sometimes it locks up.  I will try to capture a lock-up as well.

Comment 4 Timothy St. Clair 2016-06-01 14:56:28 UTC
The only log message seen after sending SIGUSR1 in the locked condition is: 

Jun 01 14:52:23 timz-kubelet-v1.3.novalocal docker[12159]: time="2016-06-01T14:52:23.056915136Z" level=error msg="Force shutdown daemon"

Comment 5 Robert Rati 2016-06-01 18:58:32 UTC
I may have seen this happen on RHEL 7.2 as well.  I was running a density test and, watching top on the kubelet, saw docker using about 60% of the CPU.  Eventually CPU usage dropped and my test failed.  I attempted a docker ps on the kubelet, but nothing responded.  I also attempted the 'kill -SIGUSR1' command above, but there is no stack trace.


Jun 01 14:50:46 kubelet-v1.3.novalocal forward-journal[6888]: time="2016-06-01T14:50:46.341854107-04:00" level=error msg="Force shutdown daemon"
Jun 01 14:50:41 kubelet-v1.3.novalocal forward-journal[6888]: time="2016-06-01T14:50:41.383174809-04:00" level=info msg="Container aa3c92b9e1248d50a1ab523d6647c83ea9f052a2f797524334fa6c429128f553 failed to exit within 10 seconds of SIGTERM - using the force"
Jun 01 14:50:41 kubelet-v1.3.novalocal forward-journal[6888]: time="2016-06-01T14:50:41.369585491-04:00" level=info msg="Container 00b4e8276d66753d997368598df2e5355540f224f2a9fe0f25e368d0013bb22c failed to exit within 10 seconds of SIGTERM - using the force"
Jun 01 14:50:41 kubelet-v1.3.novalocal forward-journal[6888]: time="2016-06-01T14:50:41.366587471-04:00" level=info msg="Container 1c3ab8fe4111cc30c85ed040c08d4b02e7e4b9811bf491620187d49b80924e6a failed to exit within 10 seconds of SIGTERM - using the force"
Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: process finished code=0
Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down output type="elasticsearch" plugin_id="object:1532cd4"
Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down output type="null" plugin_id="object:10717cc"
Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down filter type="kubernetes_metadata" plugin_id="object:101358c"
Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down input type="tail" plugin_id="object:111be0c"
Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down input type="tail" plugin_id="object:113d890"
Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down input type="tail" plugin_id="object:11a0814"
Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down input type="tail" plugin_id="object:1170c68"
Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down input type="tail" plugin_id="object:11552ec"
Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down input type="tail" plugin_id="object:117a8a8"
Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down input type="tail" plugin_id="object:1180500"
Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down input type="tail" plugin_id="object:1185fb4"
Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down input type="tail" plugin_id="object:152aed0"
Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: 2016-06-01 18:50:31 +0000 [info]: shutting down fluentd
Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: shutting down, got signal: Terminated
Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: shutting down, got signal: Terminated
Jun 01 14:50:31 kubelet-v1.3.novalocal docker-latest[6887]: shutting down, got signal: Terminated
...
Jun 01 14:50:31 kubelet-v1.3.novalocal forward-journal[6888]: time="2016-06-01T14:50:31.338924560-04:00" level=info msg="Processing signal 'terminated'"
Jun 01 14:50:30 kubelet-v1.3.novalocal docker-latest[6887]: 2016/06/01 18:50:30 skydns: failure to forward request "read udp 10.1.4.30:53: i/o timeout"
Jun 01 14:50:25 kubelet-v1.3.novalocal docker-latest[6887]: 2016/06/01 18:50:25 Worker running nslookup kubernetes.default.svc.cluster.local 127.0.0.1 >/dev/null
Jun 01 14:50:22 kubelet-v1.3.novalocal docker-latest[6887]: 2016/06/01 18:50:22 skydns: failure to forward request "read udp 10.1.4.30:53: i/o timeout"
Jun 01 14:50:17 kubelet-v1.3.novalocal docker-latest[6887]: 2016/06/01 18:50:17 Worker running nslookup kubernetes.default.svc.cluster.local 127.0.0.1 >/dev/null
...
Jun 01 14:40:05 kubelet-v1.3.novalocal forward-journal[6888]: time="2016-06-01T14:40:05.610263039-04:00" level=error msg="Unable to read peer creds and server socket address: bad file descriptor"
Jun 01 14:40:05 kubelet-v1.3.novalocal forward-journal[6888]: time="2016-06-01T14:40:05.610169893-04:00" level=warning msg="ResponseWriter does not contain a field named conn"

Comment 7 Antonio Murdaca 2016-06-02 12:13:58 UTC
(In reply to Timothy St. Clair from comment #3)
> Created attachment 1163705 [details]
> docker crash w/recover
> 
> I ran the following controlled experiment to reproduce the behavior and
> captured a crash. 
> 
> 
> terminal 1: for i in `seq 1 200`; do sudo docker run -d busybox sleep 1d;
> done
> terminal 2: sudo docker ps <-random 
> 
> crash 
> 
> Behavior is sporadic, sometimes it crashes, sometimes it locks.  I will try
> to capture a lock as well.

I'm getting the same results (same panic in the journald logs). Nalin, could it be related to the logging work going on in docker? I can see things in the logs related to the log copier (and the OOM notifier?).

Comment 8 Antonio Murdaca 2016-06-02 13:45:37 UTC
I've tested the same command with upstream docker 1.10.3 - docker hangs the same way it does with our docker fork, and the logs are pretty much the same. I'm gonna report this upstream. The logs show a panic which really seems related to "godbus"; it could come from either the systemd cgroup driver or the journald log driver.

Comment 9 Daniel Walsh 2016-06-02 13:58:12 UTC
Could you run the test without the systemd cgroup driver and the journald log driver, to see if docker works fine without using dbus?
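
A hedged sketch of one way to take dbus out of the picture on the Fedora packaging (the flag names are the docker 1.10 ones; treating /etc/sysconfig/docker as the OPTIONS file is an assumption about the packaging):

  # /etc/sysconfig/docker -- switch to the cgroupfs cgroup driver and json-file logging
  OPTIONS='--exec-opt native.cgroupdriver=cgroupfs --log-driver=json-file'

  sudo systemctl restart docker   # then re-run the reproducer from comment 3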

Comment 10 Antonio Murdaca 2016-06-02 14:03:23 UTC
Indeed, that's what I just did. Turns out I still experience the "runtime/cgo: pthread_create failed: Resource temporarily unavailable" error.

Comment 11 Daniel Walsh 2016-06-02 14:05:19 UTC
Well, we could attempt this test against docker-1.11 or docker-1.12-devel to see if it still has the problem.

Comment 13 Daniel Walsh 2016-06-02 14:14:45 UTC
Is that something we set in the unit file?

Comment 14 Antonio Murdaca 2016-06-02 14:17:31 UTC
The situation with upstream 1.11.2 seems worse: it's roughly 5x slower to spawn containers, and eventually it ends with:

713bf2f18c49f46f8119c3d10891eb729ac0960eadc895597d2c92c613de237b
536b66834bb7b046dae6ba791eedef0967f9417ada29320260c07ba04466e35c
b6206776be50c6285fe18424b4868f57c57912fc0924341a15e95c51277b333b
1a0c421cbfcf692129a0a98847750a1974ebb1660be4370bf7a311426421365e
d958e81c82a6b15a18763fbdcc749b3b70aedf46b6ace048b321dea682159468
821f0a469ba0972214f6f05b4a92c04dae083be11fa3e3590948f827f204577e
a5413638baf99f1c5aaafdf95edaab606a0b7f5290d45073a6d826f1bb14213b
c442afb6bb7b3a24cd1fb4ae6aa51c1bd1ee403db473c25507ad1b064a393814
8c67d23dbac7c8599a78b3385547702252a6748a5fdd174e220c288776ac7e1a
af30515600c6f724bb53c422bf4d1e19195dd833bbb41a558e78542689328f23
docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
17b5d244ca27a71a71226c9a8dfb616e9a1c301fcb5cec23d5345278e18c7c18
docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
d30a02560b4c6b2f8d6b4797f75a3d451b7f38577dc9bf5642edd00c157fed21
docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
310a6d2a75741b39d78f4ad9e7c33c231ffa6addc382f212caf88902e1b0627b
docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
98dad17c5ba6b0407f7e8b3c9b200c50db91b013bf634b3aaa7218fe7011c55f
docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
4b0fa3cc7529e02df075c693ea89ea3380033f3c00bdce457a388920b51b272c
docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
0e73ffa738c7595fcde323b33e39317565c86dd010e21b066d9c2dc2851a316d
docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
180bdc290b5226390f1e091ea1c48a793fa0923230c1ca703d51c54df8226619
docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
c61f84e68a9b24619b3c2ba4d6a583723bc549bc82f73b059c9e5c88a6ef9d05
docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
6a4b92b8ce6a1e7b5d88d64417ae32357705d18adc5a606a180bcf3861235426
docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
4aa37ac242da836c420cef487513ef1645ca9b2f5727430240e8cca1be1cb201
docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
cb463deea237796aef20e6dc1a292d40831c93294c2a831a5d4b2c97836f786d
docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
1b1c5cded7514b7cb91a2bd41910010975b67a2474e65832687972c78790b177
docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
1a34d33f2955175b7a17dfbf5592115ffcf3a59fde99b10a2878c9ce2fee1c56
docker: Error response from daemon: rpc error: code = 2 desc = "containerd: container not started".
398c9d9a8a71d47a78bcc47dd5641a4dbce3403cab0a447f03557cdc4948bc3e


I'll try setting TasksMax=infinity with 1.10.3, but I remember we dropped it in Fedora because it caused issues somewhere (sorry, I don't remember exactly where).

Comment 15 Antonio Murdaca 2016-06-02 14:19:45 UTC
(In reply to Daniel Walsh from comment #13)
> Is that something we set in unit file?

No, we don't right now. I'm gonna test on my laptop setting TasksMax to infinity.

Someone also replied to my upstream post and pointed me to https://github.com/docker/docker/issues/22865#issuecomment-220718133 (not sure if it's related).

Comment 16 Daniel Walsh 2016-06-02 14:20:43 UTC
TasksMax defaults to 512, according to some googling.  I don't know whether it counts threads or processes.  For something like docker this is probably too low.

Comment 17 Antonio Murdaca 2016-06-02 14:31:41 UTC
Setting TasksMax=infinity solves this BZ - I'm just not sure what a safe value would be. I don't feel infinity is a good choice. Jeremy, do you have any idea?
I'm going to fix the docker service file in Fedora after we agree on a safe value.

Comment 18 Mrunal Patel 2016-06-02 14:33:54 UTC
Antonio, it shouldn't be related to session keys; however, we should set it to a high value regardless.

Comment 19 Antonio Murdaca 2016-06-02 14:34:18 UTC
Tim, any chance you can try this out by setting TasksMax=infinity in docker.service?
Thanks
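
A minimal sketch of testing that via a systemd drop-in rather than editing the packaged unit (the drop-in path is an assumption; requires a systemd new enough to understand TasksMax):

  sudo mkdir -p /etc/systemd/system/docker.service.d
  printf '[Service]\nTasksMax=infinity\n' | sudo tee /etc/systemd/system/docker.service.d/tasksmax.conf
  sudo systemctl daemon-reload && sudo systemctl restart docker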

Comment 20 Jeremy Eder 2016-06-02 14:41:30 UTC
Because the 512 default will blow up many things, if/when TasksMax is backported to RHEL it will default to infinity.

https://bugzilla.redhat.com/show_bug.cgi?id=1337244#c7
and the corresponding discussion here:
http://post-office.corp.redhat.com/archives/os-devel-list/2016-May/msg00019.html

Adding Lukas to this BZ since it's about Fedora.  Lukas -- what is everyone doing in Fedora for TasksMax?  I see ceph is setting it explicitly in their unit file so they don't encounter this issue.  Is that the plan for Fedora, to have each unit file specify TasksMax=infinity?

Comment 21 Fedora Update System 2016-06-02 15:43:18 UTC
docker-1.10.3-15.gitf476348.fc24 has been submitted as an update to Fedora 24. https://bodhi.fedoraproject.org/updates/FEDORA-2016-6ef52e1fc3

Comment 22 Lukáš Nykrýn 2016-06-03 08:24:52 UTC
I don't maintain the Fedora package, but I would advise sticking with the upstream setting there.

> Is that the plan for Fedora, to have each unit file specify TasksMax=infinity?

Only where it makes sense.

Comment 23 Fedora Update System 2016-06-03 09:26:20 UTC
docker-1.10.3-15.gitf476348.fc24 has been pushed to the Fedora 24 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2016-6ef52e1fc3

Comment 24 Fedora Update System 2016-06-18 18:48:11 UTC
docker-1.10.3-15.gitf476348.fc24 has been pushed to the Fedora 24 stable repository. If problems still persist, please make note of it in this bug report.