Bug 1302408 - Docker 1.9 performance issues
Summary: Docker 1.9 performance issues
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: docker-latest
Version: 7.2
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Lokesh Mandvekar
QA Contact: atomic-bugs@redhat.com
URL:
Whiteboard:
Depends On:
Blocks: OSOPS_V3 docker-1.10
 
Reported: 2016-01-27 18:55 UTC by Andy Goldstein
Modified: 2023-09-14 03:16 UTC
CC List: 14 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2016-05-12 14:54:25 UTC
Target Upstream Version:
Embargoed:


Attachments


Links
System: Red Hat Product Errata
ID: RHEA-2016:1057
Priority: normal
Status: SHIPPED_LIVE
Summary: new packages: docker-latest
Last Updated: 2016-05-12 18:51:24 UTC

Description Andy Goldstein 2016-01-27 18:55:03 UTC
Description of problem: There are potential performance issues with Docker 1.9.


Version-Release number of selected component (if applicable):


How reproducible: unsure

See the details in https://github.com/docker/docker/issues/17720 and https://github.com/kubernetes/kubernetes/issues/16110

Comment 1 Timothy St. Clair 2016-01-27 18:57:34 UTC
also related: https://github.com/kubernetes/kubernetes/pull/18385

Comment 3 Jen Krieger 2016-01-27 20:33:49 UTC
Related: https://github.com/docker/docker/pull/19729

Comment 4 Timothy St. Clair 2016-01-27 21:13:06 UTC
Reproducer: sporadic

Running density tests against upstream Kubernetes to load 100 pods onto a single machine running Docker 1.9.1, we saw random 'docker ps' response times on the order of 30+ seconds.

Experiment 1: 

Start Kubernetes from hack/local-up-cluster.sh and append --max-pods=100 to the kubelet args. Create a simple, wide replication controller using 'pause' pods and submit the file. While this is going on, run `sudo docker ps | grep pause | wc -l` in a loop to check the fill rate. This call will at times block for a long time.

Experiment 2: 

Same as experiment 1, but halfway through the fill, resize the replication controller to 1, wait 10 seconds, then resize back to 100 to force overlapping operations.
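
A minimal sketch of both experiments, assuming an upstream Kubernetes checkout where kubectl is invoked via cluster/kubectl.sh; the manifest and RC names (pause-rc.yaml, pause-rc) are placeholders, not the ones actually used:

# Experiment 1: fill the node with 'pause' pods and watch the fill rate,
# looking for long 'docker ps' stalls.
cluster/kubectl.sh create -f pause-rc.yaml   # hypothetical RC manifest with replicas: 100

# Run the monitoring loop in a separate terminal.
while true; do
    date
    sudo docker ps | grep pause | wc -l
    sleep 1
done

# Experiment 2: halfway through the fill, force overlapping operations by
# resizing the RC down and back up.
cluster/kubectl.sh scale --replicas=1 rc/pause-rc
sleep 10
cluster/kubectl.sh scale --replicas=100 rc/pause-rc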


Jeremy's experiment was slightly different, but similar behavior was observed.

Comment 5 Daniel Walsh 2016-01-27 22:12:48 UTC
I believe this is a known docker issue and not something we can fix until docker-1.10.

Comment 6 Antonio Murdaca 2016-01-27 23:36:59 UTC
Right now the fix Jen linked above isn't even in the milestone for upstream docker 1.10

Comment 7 Jeremy Eder 2016-01-28 01:53:21 UTC
(In reply to Timothy St. Clair from comment #4)
> Reproducer: sporadic
> 
> Running density tests against upstream Kubernetes to load 100 pods onto a
> single machine running Docker 1.9.1, we saw random 'docker ps' response
> times on the order of 30+ seconds.
> 
> Experiment 1: 
> 
> Start Kubernetes from hack/local-up-cluster.sh and append --max-pods=100 to
> the kubelet args. Create a simple, wide replication controller using 'pause'
> pods and submit the file. While this is going on, run `sudo docker ps | grep
> pause | wc -l` in a loop to check the fill rate. This call will at times
> block for a long time.
> 
> Experiment 2: 
> 
> Same as experiment 1, but halfway through the fill, resize the replication
> controller to 1, wait 10 seconds, then resize back to 100 to force
> overlapping operations.


Neither of these reproduced it for me. I was able to see a 7-second lag on docker 1.8 when I scaled the RC from 1 --> 100 pods; docker 1.9 lagged about 4 seconds during the same test. I'd consider that perfectly fine, given that scaling to 100 pods actually dumps 200 container start requests on the docker daemon...

I wasn't paying close enough attention at the time I hit some sort of "docker ps lag" to say what caused it (my guess is that the lag was related to me breaking DNS on my system shortly before I observed the issue), and I only saw it that one time.

Tim -- what environment did you see this on (scale lab OpenStack? EC2?)? Was it upstream kube master, on F23? What docker storage graph driver were you using?

GitHub is down at the moment, but tomorrow I'll at least read through the upstream issues (last I looked, they were terribly convoluted with extraneous information).
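
For reference, a quick way to gather that information on the node (standard docker and OS commands; the exact output fields vary by version):

sudo docker info | grep -i -A 3 'storage driver'   # graph driver and, for devicemapper, pool details
uname -r                                           # kernel version
head -n 2 /etc/os-release                          # distribution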

Comment 8 Timothy St. Clair 2016-01-28 14:19:17 UTC
NOTE: this was HEAD prior to PLEG changes.

Comment 9 Timothy St. Clair 2016-01-28 14:28:59 UTC
I noticed it on F23, where I have set up a separate partition for the thin pool.
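
A minimal sketch of one common way such a setup is done on Fedora/RHEL of that era (this assumes the docker-storage-setup tooling shipped with the docker package; the device and volume group names are placeholders, not the actual layout used here):

cat > /etc/sysconfig/docker-storage-setup <<'EOF'
# placeholder device and VG names; point DEVS at the dedicated disk/partition
DEVS=/dev/vdb
VG=docker-vg
EOF
docker-storage-setup        # generates /etc/sysconfig/docker-storage for the daemon
systemctl restart docker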

Comment 10 Jeremy Eder 2016-01-28 17:11:24 UTC
Trying to be as clear as possible here, but there's a lot of detail so let me know if you need further clarification.

I can reproduce the symptoms of https://github.com/docker/docker/issues/17720 as follows:

- kube upstream tag v1.1.3 (prior to the pod lifecycle event generator (PLEG) patch set and some other caching ones)
- set max pods to 200 in hack/local-up-cluster.sh
- start kube with hack/local-up-cluster.sh
- create an rc with 1 replica (sleep pod)
- while true ; do date ; time docker ps | wc -l ; sleep 1 ; done
- cluster/kubectl.sh scale --replicas=100 rc/frontend-1

Then watch the output of the "time docker ps" loop for any variance (a small helper for tracking the worst case is sketched below).
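
A small helper, not part of the original procedure, for tracking the worst-case `docker ps` wall time during the scale-up (assumes GNU date with nanosecond support and bc):

max=0
while true; do
    start=$(date +%s.%N)
    docker ps | wc -l
    elapsed=$(echo "$(date +%s.%N) - $start" | bc)
    if (( $(echo "$elapsed > $max" | bc -l) )); then max=$elapsed; fi
    echo "$(date)  last=${elapsed}s  worst=${max}s"
    sleep 1
done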

- With docker 1.8.2-7, the longest delay of docker ps is about 7 seconds.
- With docker 1.9.1-14, the longest delay of docker ps was about 35 seconds.
- *** I ran the docker-1.9.1-14 test a few times, and on a second occasion I got some kernel soft lockups that could only be rectified with a reboot.
- With docker-1.10-dev from https://master.dockerproject.org/, the longest delay was about 3 seconds.
- With openshift origin (post-rebase...IOW including PR6320) + docker-1.9.1-14, the longest delay was 6 seconds.
- In all cases, kernel was 3.10.0-336.el7.x86_64 (which is a RHEL7.3-dev kernel) and I used a dedicated disk with thinp/lvm storage.

So, in conclusion:

- Up until the last 60 days, Kubernetes was being very inefficient with respect to talking to docker (which everyone knew, and which has been addressed in a series of unrelated patches that, taken in the aggregate, significantly reduce the chatter between docker and kube).
- I don't have a precise bisect, but "something" in docker-1.9 made it more fragile and this abuse from Kube triggered this issue.
- Now that the abuse has been significantly reduced, docker-1.9 no longer exhibits this "hang" with kube upstream master.
- The good news is that it seems the openshift origin rebase that landed yesterday has "enough" of those patches to make docker-1.9 stable again.
- docker-1.10.0-dev also handled the load well.

Comment 11 Timothy St. Clair 2016-01-28 17:31:02 UTC
This is congruent with my findings.

Comment 12 Daniel Walsh 2016-01-28 22:26:19 UTC
Awesome.

Comment 13 Daniel Walsh 2016-01-28 22:26:44 UTC
Fixed in docker-1.10

Comment 14 Scott Dodson 2016-02-02 20:52:06 UTC
What's the behavior at OpenShift's default maxPods of 40?

Comment 15 Scott Dodson 2016-04-05 13:34:47 UTC
Is this bug left open for purposes of tracking docker 1.10 and OSE 3.1/3.0, or will the "Conflicts: atomic-openshift < 3.2" exist forever in Docker 1.9 and newer?

Comment 16 Lokesh Mandvekar 2016-04-18 18:34:38 UTC
Moving to docker-latest, since that package carries 1.10.

Comment 18 Luwen Su 2016-05-04 09:36:20 UTC
With docker-latest-1.10.3-21.el7.x86_64.rpm, `time docker-latest ps` looks good to me: I see times of about 0.0x seconds in the `while true` loop.

Just to note, I did `cp docker-latest docker` for the binary and `cp docker-latest.service docker.service` for the service file so that Kubernetes would run the pods correctly.
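
Roughly what that aliasing looks like, assuming the usual RHEL 7 install locations (the comment above does not spell out the paths, so these are assumed):

cp /usr/bin/docker-latest /usr/bin/docker
cp /usr/lib/systemd/system/docker-latest.service /usr/lib/systemd/system/docker.service
systemctl daemon-reload
systemctl restart docker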

Moving to VERIFIED.

Comment 20 errata-xmlrpc 2016-05-12 14:54:25 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHEA-2016-1057.html

Comment 21 Red Hat Bugzilla 2023-09-14 03:16:50 UTC
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days

