Bug 1302408
| Summary: | Docker 1.9 performance issues | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Andy Goldstein <agoldste> |
| Component: | docker-latest | Assignee: | Lokesh Mandvekar <lsm5> |
| Status: | CLOSED ERRATA | QA Contact: | atomic-bugs <atomic-bugs> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 7.2 | CC: | agrimm, amurdaca, dwalsh, jeder, jkrieger, lsm5, lsu, matt, sdodson, sghosh, szobair, tstclair, twiest, vbatts |
| Target Milestone: | rc | Keywords: | Extras |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2016-05-12 14:54:25 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1303130, 1303656 | | |
Description
Andy Goldstein
2016-01-27 18:55:03 UTC
also related: https://github.com/kubernetes/kubernetes/pull/18385

Reproducer: sporadic

Running density tests against upstream Kubernetes to load 100 pods onto a single machine running against 1.9.1, we have seen random 'docker ps' response times on the order of 30+ seconds.

Experiment 1:

Start Kubernetes from hack/local-up-cluster.sh and append --max-pods=100 to the kubelet args. Create a simple wide replication controller using 'pause' pods and submit the file. While this is going on, run `sudo docker ps | grep pause | wc -l` in a loop, checking the fill rate. This call will at times block for a long time.

Experiment 2:

Same as experiment 1, but halfway through the fill resize the replication controller to 1, wait 10 seconds, then resize back to 100 to force the overlapping operations.

Jeremy's experiment was slightly different, but similar behavior was observed. I believe this is a known docker issue and not something we can fix until docker-1.10. Right now the fix Jen linked above isn't even in the milestone for upstream docker 1.10.

(In reply to Timothy St. Clair from comment #4)
> Reproducer: sporadic
>
> Running density tests against upstream kubernetes to load 100 pods onto a
> single machine running against 1.9.1 we had seen random 'docker ps' on the
> order of 30+ seconds response time.
>
> Experiment 1:
>
> start kubernetes from hack/local-up-cluster.sh and append --max-pods=100 to
> the kubelet args. Create a simple wide replication controller using 'pause'
> pods and submit the file. While this is going on run `sudo docker ps | grep
> pause | wc -l` in a loop checking the fill rate. This call will at times
> block for a long time.
>
> Experiment 2:
>
> Same as experiment 1, but halfway through the fill resize the replication
> controller to 1, wait 10 seconds, then resize back to 100 to force the
> overlapping operations.

Neither of these reproduced it for me. I was able to see a 7 second lag on docker 1.8 when I scaled the RC from 1 --> 100 pods. docker 1.9 lagged about 4 seconds during that same test. I'd consider that perfectly fine, given the fact that scaling to 100 pods actually dumps 200 container start requests on the docker daemon... I wasn't paying close enough attention at the time I hit some sort of "docker ps lag" to say what caused it (my guess was that the lag was related to me breaking DNS on my system shortly before I observed the issue), and I only saw it that one time.

Tim -- what environment did you see this on (scale lab openstack? ec2?), was it upstream kube master, on F23, and what docker storage graph driver were you using? GitHub is down at the moment, but tomorrow at least I'll read through the upstream issues (last I looked, they were terribly convoluted with extraneous information).

NOTE: this was HEAD prior to the PLEG changes. I had noticed it on F23, where I have set up a separate partition for the thin-pool.

Trying to be as clear as possible here, but there's a lot of detail, so let me know if you need further clarification. I can reproduce the symptoms of https://github.com/docker/docker/issues/17720 as follows:

- kube upstream tag v1.1.3 (prior to the pod lifecycle event patch set and some other caching ones)
- set max pods to 200 in hack/local-up-cluster.sh
- start kube with hack/local-up-cluster.sh
- create an rc with 1 replica (sleep pod)
- while true ; do date ; time docker ps | wc -l ; sleep 1 ; done
- cluster/kubectl.sh scale --replicas=100 rc/frontend-1

Then watch the output of the "time docker ps" loop for any variance (a shell sketch of these steps follows below).
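For convenience, here is a minimal shell sketch of that reproducer, assuming a Kubernetes source checkout as the working directory. The RC name `frontend-1` comes from the scale command above, but `frontend-1.yaml` is a hypothetical manifest standing in for the actual sleep-pod RC definition.

```bash
# Terminal 1: start a local cluster from the kubernetes checkout
# (after raising max pods to 200 in hack/local-up-cluster.sh).
./hack/local-up-cluster.sh

# Terminal 2: create a replication controller with a single sleep pod.
# frontend-1.yaml is a hypothetical manifest matching the description above.
./cluster/kubectl.sh create -f frontend-1.yaml

# Terminal 3: sample 'docker ps' latency once per second.
while true; do date; time docker ps | wc -l; sleep 1; done

# Back in terminal 2: scale to 100 replicas, then watch terminal 3
# for any spike in the reported 'docker ps' times.
./cluster/kubectl.sh scale --replicas=100 rc/frontend-1
```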
- With docker 1.8.2-7, the longest delay of docker ps is about 7 seconds.
- With docker 1.9.1-14, the longest delay of docker ps was about 35 seconds.
- *** I ran the docker-1.9.1-14 test a few times, and on a 2nd occasion I got some kernel soft lockups which could only be rectified with a reboot.
- With docker-1.10-dev from https://master.dockerproject.org/, the longest delay was about 3 seconds.
- With openshift origin (post-rebase, i.e. including PR6320) + docker-1.9.1-14, the longest delay was 6 seconds.
- In all cases, the kernel was 3.10.0-336.el7.x86_64 (a RHEL 7.3-dev kernel) and I used a dedicated disk with thinp/lvm storage.

So, in conclusion:

- Up until the last 60 days, Kubernetes was being very inefficient with respect to talking to docker (which everyone knew, and which has been addressed in a series of unrelated patches that, taken in the aggregate, significantly reduce the chatter between docker and kube).
- I don't have a precise bisect, but "something" in docker-1.9 made it more fragile, and this abuse from kube triggered the issue.
- Now that the abuse has been significantly reduced, docker-1.9 no longer exhibits this "hang" with kube upstream master.
- The good news is that the openshift origin rebase that landed yesterday seems to have "enough" of those patches to make docker-1.9 stable again.
- docker-1.10.0-dev also handled the load well.

This is congruent with my findings. Awesome.

Fixed in docker-1.10.

What's the behavior at OpenShift's default maxPods of 40?

Is this bug left open for purposes of tracking docker 1.10 and OSE 3.1/3.0, or will the "Conflicts: atomic-openshift < 3.2" exist forever in Docker 1.9 and newer?

Moving to docker-latest, because 1.10.

With docker-latest-1.10.3-21.el7.x86_64.rpm, `time docker-latest ps` looks good to me: I see times of about 0.0x seconds in the `while true` loop. Just to note, I copied the binary (`cp docker-latest docker`) and the unit file (`cp docker-latest.service docker.service`) so that Kubernetes would run the pods correctly (a sketch of this workaround appears at the end of this report). Moving to verified.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1057.html

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days.
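For reference, a minimal sketch of the docker-latest workaround mentioned in the verification comment above. The source and destination paths are assumptions based on typical RHEL 7 docker-latest packaging, not taken from the bug itself; adjust them to match the installed package.

```bash
# Make the plain 'docker' command and service point at docker-latest so
# that Kubernetes, which invokes 'docker', runs pods against the newer engine.
# The paths below are assumptions based on typical RHEL 7 packaging.
cp /usr/bin/docker-latest /usr/bin/docker
cp /usr/lib/systemd/system/docker-latest.service /usr/lib/systemd/system/docker.service
systemctl daemon-reload
systemctl restart docker
```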