Created attachment 1168504 [details]
Docker-registry pod logs at debug level

Description of problem:

I initially thought this was related to https://bugzilla.redhat.com/show_bug.cgi?id=1314381 but am now fairly certain it is a regression or a unique issue.

On two identical AWS EC2 configurations which differ only in that one is OSE 3.2.0/Docker 1.9.1 and the other is OSE 3.2.1/Docker 1.10.3, push times on clusters using AWS S3 for registry storage are at least 9x longer on 3.2.1 than on 3.2.0. Pushing incremental builds of the cakephp-example takes an average of 4.9 seconds on 3.2.0 and 44.8 seconds on 3.2.1.

pbench (see links below) shows much higher CPU usage on the OSE 3.2.1/Docker 1.10.3 system, especially for /usr/bin/dockerregistry_/etc/registrycon.

Version-Release number of selected component (if applicable):

3.2:
  openshift v3.2.0.20
  docker-common-1.9.1-40.el7.x86_64

3.2.1:
  openshift v3.2.1.1-1-g33fa4ea
  docker-common-1.10.3-42.el7.x86_64

How reproducible:
Consistently, with the steps below.

Environment:

Each cluster is 4 m4.xlarge nodes: master, registry/router and 2 workers. The registry's S3 configuration is:

version: 0.1
log:
  level: debug
http:
  addr: :5000
auth:
  openshift:
    realm: openshift
storage:
  cache:
    layerinfo: inmemory
  s3:
    accesskey: key
    secretkey: secret
    region: us-west-2
    bucket: aoe-perf-test
    encrypt: false
    secure: false
    v4auth: true
    rootdirectory: /registry
middleware:
  repository:
    - name: openshift
      options:
        pullthrough: true

The builds use the same bucket, but different project names - no collision between the clusters.

Steps to Reproduce:
1. Create 2 projects with the cakephp-example template.
2. Run 2 builds simultaneously, 5 times.
3. Measure the push time as seen in the oc logs for the build.

Actual results:
4.9s avg push time for 3.2.0 and 44.8s for 3.2.1

Expected results:
Same push time

Additional info:

The docker-registry logs are attached. I will attach the S3 logs once they become available.

pidstat CPU for 3.2.1 registry node:
http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-46-198/3.2.1_s3_2builds_5time/tools-default/ip-172-31-4-84/pidstat/cpu_usage.html

pidstat CPU for 3.2.0 registry node:
http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-20-224/3.2.0_s3_2builds_5time/tools-default/ip-172-31-20-225/pidstat/cpu_usage.html

Profiling the registry pod, top10 during the push shows:

650ms of 1790ms total (36.31%)
Showing top 10 nodes out of 585 (cum >= 40ms)
      flat  flat%   sum%        cum   cum%
     130ms  7.26%  7.26%      130ms  7.26%  crypto/cipher.(*gcm).mul
     100ms  5.59% 12.85%      180ms 10.06%  github.com/ugorji/go/codec.(*jsonDecDriver).appendStringAsBytes
      70ms  3.91% 16.76%       70ms  3.91%  github.com/ugorji/go/codec.(*bytesDecReader).readn1
      70ms  3.91% 20.67%       70ms  3.91%  syscall.Syscall
      60ms  3.35% 24.02%       80ms  4.47%  runtime.mallocgc
      50ms  2.79% 26.82%       50ms  2.79%  encoding/json.stateInString
      50ms  2.79% 29.61%      130ms  7.26%  math/big.nat.divLarge
      40ms  2.23% 31.84%       50ms  2.79%  github.com/ugorji/go/codec.(*Decoder).getDecFn
      40ms  2.23% 34.08%       40ms  2.23%  math/big.mulAddVWW
      40ms  2.23% 36.31%       40ms  2.23%  time.nextStdChunk
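The output above is pprof's top10 view. For anyone who wants to repeat it, a minimal sketch of one way to capture an equivalent CPU profile is below; it assumes the registry config is extended with a debug endpoint (http: debug: addr: :5001), which is NOT in the config shown above, and that the pod IP is reachable from a host with the Go toolchain:

# Find the registry pod IP
oc get pods -o wide | grep docker-registry
# Start this while a push is in flight; it collects a 30-second CPU profile
go tool pprof http://<registry-pod-ip>:5001/debug/pprof/profile
# Then, at the (pprof) prompt:
#   top10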
Notes:
3.2.1 push at approx 14:59 EDT / 19:59 UTC
3.2.0 push at approx 15:18 EDT / 20:18 UTC

Full pbench data for 3.2.1 registry:
http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-46-198/3.2.1_s3_2builds_5time/tools-default/ip-172-31-4-84

Full pbench data for 3.2.0 registry:
http://perf-infra.ec2.breakage.org/pbench/results/ip-172-31-20-224/3.2.0_s3_2builds_5time/tools-default/ip-172-31-20-225
Things I have not tried (yet) - let me know if they would be useful:
- OSE 3.2.1 on Docker 1.9.1. I'd have to downgrade Docker or hack the install, since the install uplifts Docker to 1.10.
- Using non-AWS-S3 storage (NFS, local, etc.) for the registry. I will try that next.
Created attachment 1168508 [details] S3 logs
In comment 4, downgrade was from 1.10.3 (not 1.10.1) to 1.9.1
One difference I see in previous reports vs. this one is that the previous testers used a chunksize of 26214400. Encryption was found to have an impact, but your configuration shows it as off.
Could this be a caching problem at the remote registry? If you repeat the test, do you see docker-1.10 being slower on every push?
Can we get a profile of the docker registry with 1.9.1 as well, to compare with what you have for 1.10 in the description?
Here is an upstream issue that discusses performance problems on S3: https://github.com/docker/distribution/issues/335
There were major changes to the registry between docker 1.9 (which had 2.2) and docker 1.10 (which has 2.3). The docker registry 2.4 release notes talk about performance improvements to the S3 driver and mark the older one as deprecated.
https://github.com/docker/distribution/releases/tag/v2.4.0

What version of the upstream registry is our registry based on?
Mrunal, we still ship distribution v2.2.1.

> The docker registry 2.4 release talks about performance improvements to the S3 driver and marks the older ones as deprecated.

Interesting:

> New S3 storage driver
> The default s3 storage driver is now implemented on top of the official Amazon S3 SDK, boasting major performance and stability goodness. The previous storage is still available, but deprecated.

But that doesn't really explain why the push with 1.10 is so much slower compared to 1.9 with the same registry.
Mike, can you try disabling the "rhel-push-plugin" in docker.service? (Just remove "--authorization-plugin=rhel-push-plugin".) That should not account for this much lag, but it's worth a try.
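A minimal sketch of that change, assuming the flag lives in the stock docker.service unit as described above (the exact path may differ per install):

# Run on each node
sed -i 's/ --authorization-plugin=rhel-push-plugin//' /usr/lib/systemd/system/docker.service
systemctl daemon-reload
systemctl restart docker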
Sorry - spotty network access until tomorrow (21 June).

Comment 6: Chunk size and encryption: the same registry config yaml file is being used for both 1.9.1 and 1.10.3.

Comment 7 (Dan Walsh): Yes, it is slow on subsequent pushes. The push times are very consistent on both 1.9.1 and 1.10.3.

Comment 8: We would be happy to profile. Can you give me instructions on exactly what you want done? I don't have much experience with it.
Vikas,

The cluster that is up right now is configured for S3 and is currently running 1.9.1. When you get the profiling instructions from Mrunal, you can run the test (build cakephp-example and look at the push time in the logs) and collect the profiling information.

Then upgrade all nodes to Docker 1.10.3 (yum -y update docker), systemctl restart atomic-openshift-master, and run the profiling again.

After that is done, disable the push plugin (comment 12 from Antonio) and run the test one more time to see if the push time changes.

I'll be back online tonight (20 June).
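Not part of the instructions above, but here is a minimal sketch of pulling a push duration out of a build log; it keys off the sti.go "Pushing"/"Successfully pushed" lines quoted elsewhere in this bug, and the build and project names are placeholders:

# Placeholders: cakephp-example-2, myproject. Timestamps are field 2 of the glog prefix.
oc logs build/cakephp-example-2 -n myproject | awk '
  /sti.go.*Pushing .* image/    { split($2, s, ":"); start = s[1]*3600 + s[2]*60 + s[3] }
  /sti.go.*Successfully pushed/ { split($2, e, ":"); printf "push took %.1f s\n", (e[1]*3600 + e[2]*60 + e[3]) - start }'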
(In reply to Daniel Walsh from comment #7)
> Could this be a caching problem at the remote registry. If you repeat the
> test do you see the docker-1.10 as being slower on every push?

This seems like a caching problem. Here is some data: I did multiple pushes of the same application and tried with a couple of projects.

Before I ran the following tests on the same env Mike was using, I had to recreate the registry after upgrading docker to 1.10. Mike was using registry.qe.openshift.com/openshift3/ose-docker-registry v3.2.1.1, and after recreating the registry it was updated to registry.qe.openshift.com/openshift3/ose-docker-registry v3.2.1.2.

Project 1

1st push - 82 sec
I0620 10:51:35.904964 1 sti.go:272] Pushing 172.25.15.180:5000/s99/cakephp-example:latest image ...
I0620 10:52:57.478191 1 sti.go:288] Successfully pushed 172.25.15.180:5000/s99/cakephp-example:latest

2nd push - 6 sec
I0620 10:55:33.035430 1 sti.go:272] Pushing 172.25.15.180:5000/s99/cakephp-example:latest image ...
I0620 10:55:39.265890 1 sti.go:288] Successfully pushed 172.25.15.180:5000/s99/cakephp-example:latest

3rd push - 6 sec
I0620 10:58:14.408561 1 sti.go:272] Pushing 172.25.15.180:5000/s99/cakephp-example:latest image ...
I0620 10:58:20.307625 1 sti.go:288] Successfully pushed 172.25.15.180:5000/s99/cakephp-example:latest

Project 2

1st push - 81 sec
I0620 11:03:17.413883 1 sti.go:272] Pushing 172.25.15.180:5000/s100/cakephp-example:latest image ...
I0620 11:04:38.353416 1 sti.go:288] Successfully pushed 172.25.15.180:5000/s100/cakephp-example:latest

2nd push - 6 sec
I0620 11:06:49.291778 1 sti.go:272] Pushing 172.25.15.180:5000/s100/cakephp-example:latest image ...
I0620 11:06:56.766610 1 sti.go:288] Successfully pushed 172.25.15.180:5000/s100/cakephp-example:latest

3rd push - 6 sec
I0620 11:09:51.089837 1 sti.go:272] Pushing 172.25.15.180:5000/s100/cakephp-example:latest image ...
I0620 11:09:57.536442 1 sti.go:288] Successfully pushed 172.25.15.180:5000/s100/cakephp-example:latest

I also ran into the following error sometimes when doing a push; it's random (running another build after the failed one passes):

I0620 11:05:10.297309 1 sti.go:334] Successfully built s100/cakephp-example-3:8bc9e5b1
I0620 11:05:10.319088 1 cleanup.go:23] Removing temporary directory /tmp/s2i-build764296085
I0620 11:05:10.319128 1 fs.go:156] Removing directory '/tmp/s2i-build764296085'
I0620 11:05:10.363238 1 sti.go:268] Using provided push secret for pushing 172.25.15.180:5000/s100/cakephp-example:latest image
I0620 11:05:10.363261 1 sti.go:272] Pushing 172.25.15.180:5000/s100/cakephp-example:latest image ...
I0620 11:06:32.095028 1 sti.go:277] Registry server Address:
I0620 11:06:32.095060 1 sti.go:278] Registry server User Name: serviceaccount
I0620 11:06:32.095071 1 sti.go:279] Registry server Email: serviceaccount
I0620 11:06:32.095081 1 sti.go:284] Registry server Password: <<non-empty>>
F0620 11:06:32.095095 1 builder.go:204] Error: build error: Failed to push image. Response from registry is: open /var/lib/docker/devicemapper/mnt/5d14e637e0f8b8c4485faadf4ad429d7ad2b4b8505da0e5c14805443a38191ac/rootfs/opt/app-root/src/lib/Cake/View/Errors/missing_controller.ctp: no such file or directory
The above data was collected without disabling the plugin; trying one more time after disabling it.
After removing the plugin and restarting docker on the master and all the nodes, here are the push timings. A few pushes in between took longer (see Project 3), but generally the 1st push was much longer (~80 sec) and after that each push was under 10 sec.

Project 1

1st push - 82 sec
I0620 13:12:01.981375 1 sti.go:272] Pushing 172.25.15.180:5000/s104/cakephp-example:latest image ...
I0620 13:13:23.977269 1 sti.go:288] Successfully pushed 172.25.15.180:5000/s104/cakephp-example:latest

2nd push - 6 sec
I0620 13:16:10.136958 1 sti.go:272] Pushing 172.25.15.180:5000/s104/cakephp-example:latest image ...
I0620 13:16:16.597136 1 sti.go:288] Successfully pushed 172.25.15.180:5000/s104/cakephp-example:latest

3rd push - 7 sec
I0620 13:16:57.454555 1 sti.go:272] Pushing 172.25.15.180:5000/s104/cakephp-example:latest image ...
I0620 13:17:04.832616 1 sti.go:288] Successfully pushed 172.25.15.180:5000/s104/cakephp-example:latest

4th push - 7 sec
I0620 13:24:54.891811 1 sti.go:272] Pushing 172.25.15.180:5000/s110/cakephp-example:latest image ...
I0620 13:25:01.377846 1 sti.go:288] Successfully pushed 172.25.15.180:5000/s110/cakephp-example:latest

Project 2

1st push - 79 sec
I0620 13:26:29.678242 1 sti.go:272] Pushing 172.25.15.180:5000/s111/cakephp-example:latest image ...
I0620 13:27:48.221214 1 sti.go:288] Successfully pushed 172.25.15.180:5000/s111/cakephp-example:latest

2nd push - 6 sec
I0620 13:29:53.217358 1 sti.go:272] Pushing 172.25.15.180:5000/s111/cakephp-example:latest image ...
I0620 13:29:59.999959 1 sti.go:288] Successfully pushed 172.25.15.180:5000/s111/cakephp-example:latest

3rd push - 5 sec
I0620 13:30:27.108576 1 sti.go:272] Pushing 172.25.15.180:5000/s111/cakephp-example:latest image ...
I0620 13:30:32.972733 1 sti.go:288] Successfully pushed 172.25.15.180:5000/s111/cakephp-example:latest

Project 3

1st push - 83 sec
I0620 13:32:08.440133 1 sti.go:272] Pushing 172.25.15.180:5000/s112/cakephp-example:latest image ...
I0620 13:33:31.136736 1 sti.go:288] Successfully pushed 172.25.15.180:5000/s112/cakephp-example:latest

2nd push - 48 sec
I0620 13:47:31.087718 1 sti.go:272] Pushing 172.25.15.180:5000/s114/cakephp-example:latest image ...
I0620 13:48:19.094347 1 sti.go:288] Successfully pushed 172.25.15.180:5000/s114/cakephp-example:latest

3rd push - 6 sec
I0620 13:35:02.341559 1 sti.go:272] Pushing 172.25.15.180:5000/s112/cakephp-example:latest image ...
I0620 13:35:08.519137 1 sti.go:288] Successfully pushed 172.25.15.180:5000/s112/cakephp-example:latest
Thanks, that's definitely not the plugin then
Interesting - this env is now pushing fast with 1.10.3. Its history is that it started at 1.10.3 (slow pushes), was downgraded to 1.9.1 (fast pushes), and then was upgraded to 1.10.3 again (still fast). I will try to recreate this tomorrow.
Hi Dan,

Can you give me a summary of what this performance issue is about? Does it have a workaround?
Vikas, do you perform some sort of cleanup before each push, or do you just rerun the build? Short push times for the 2nd and each subsequent push are expected, since the registry already has the layers.

I tried to reproduce this outside of OSE with the upstream docker registry. I tested with registry:v2.2.1 (current OSE), registry:v2.4.0 (upcoming OSE 3.3.0), and docker 1.9.1 and docker 1.10.3, using two different images. My conclusions:

- There is a significant performance boost when using registry:v2.4.0 over registry:v2.2.1 (pushes take nearly half the time).
- With the upstream registry, pushes with Docker 1.10 are generally faster than with the prior Docker version.
- The more layers an image has, the more effective a push with Docker 1.10 will be, because Docker 1.10 pushes several layers concurrently.

Details are available at https://gist.github.com/miminar/0ba69ea5348b45cdd73eda64aa8a4b10

> Interesting - this env is now pushing fast with 1.10.3. Its history is that
> it started at 1.10.3 (slow pushes), was downgraded to 1.9.1 (fast pushes) and
> then upgraded to 1.10.3 again (still fast). I will try to recreate this
> tomorrow.

@Mike, @Vikas, do you clean up your S3 bucket before running the next set of tests? Docker checks for the presence of blobs in the registry/repository before actually pushing them. If the same S3 bucket is reused with just restarts of the registry, and the same image (or one based on a previously pushed one) is pushed again, all the subsequent pushes will still be faster compared to the first one.
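The gist above has the full methodology; roughly, the out-of-OSE comparison can be approximated with a sketch like the following (image names are placeholders, and the Docker Hub tags 2.2.1/2.4.0 are assumed):

# Swap the registry tag between runs to compare versions.
docker run -d --name perfreg -p 5000:5000 registry:2.2.1   # or registry:2.4.0
docker tag some/local-image localhost:5000/perf/test       # any locally built image
time docker push localhost:5000/perf/test
docker rm -fv perfreg                                       # remove container and its storage between runs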
Michal,

We do not clean up in between pushes. I've verified that (for the cakephp-example app) the very first build when the app is created pushes 4 layers, and each subsequent build - the ones we are measuring the times of for the purposes of this bz - only pushes 1 layer. I am cleaning out the bucket when changing versions to account for any possibility of incompatible images between versions.
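For completeness, a sketch of that cleanup step, with the bucket name and root directory taken from the config in the description and assuming the aws CLI is configured with the same credentials:

# Clears the registry's root directory in the shared bucket between version changes.
aws s3 rm s3://aoe-perf-test/registry --recursive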
Performed additional testing to validate the behavior described in comment 15 and comment 19. Unfortunately (for my sanity, anyway), I reproduced it. Details:

Initial 1.10.3 test - slow push
-------------------------------
- New cluster on AWS - 4 m4.xlarge instances: 1 master, 1 router/registry, 2 app nodes
- Installed Docker 1.10.3-42 from RHEL extras
- Installed OpenShift 3.2.1.3
- Configured the registry to use AWS S3 storage
- Cleared the S3 bucket
- Push testing on 1.10.3:
  - Created 40 projects and created an application in each using cakephp-example
  - Verified all 40 apps had successful initial builds with good pushes
  - Performed 5 additional (sequential, 2 at a time to use both app nodes) builds in each namespace
  - Average push time: 48.3 seconds (verified single layer pushed per build); no cleaning of the S3 bucket between builds

1.9.1 "downgrade push" - fast push
----------------------------------
- Downgraded docker to 1.9.1 on all nodes
- Erased all namespaces and S3 bucket contents to avoid compatibility issues
- Push testing on 1.9.1:
  - Created 40 projects and created an application in each using cakephp-example
  - Verified all 40 apps had successful initial builds with good pushes
  - Performed 5 additional (sequential, 2 at a time to use both app nodes) builds in each namespace
  - Average push time: 5.7 seconds

1.10.3 "re-upgrade push" - fast push
------------------------------------
- Upgraded docker back to 1.10.3
- Erased all namespaces and S3 bucket contents to avoid compatibility issues
- Push testing on 1.10.3:
  - Created 40 projects and created an application in each using cakephp-example
  - Verified all 40 apps had successful initial builds with good pushes
  - Performed 5 additional (sequential, 2 at a time to use both app nodes) builds in each namespace
  - Average push time: 6.2 seconds

I don't have a lot of confidence in the second 1.10.3 environment. I am going to run the test again tomorrow, starting with 1.9.1 and then doing a single straight upgrade to 1.10.3. That is more sane than downgrades and then reverting to higher versions.
Very suspicious. Can this be related to the "performance hit when accessing EBS blocks initialized from snapshot" [1]? Quoting "Amazon EBS Performance Tips":

> There is a significant increase in latency when you first access each block of data on a new EBS volume that was restored from a snapshot. You can avoid this performance hit by accessing each block prior to putting the volume into production. This process is called initialization (formerly known as pre-warming). For more information, see Initializing Amazon EBS Volumes.

[1] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSPerformance.html

We were forced to increase timeouts in our tests to accommodate for this [2].

[2] https://github.com/openshift/origin/issues/7243#issuecomment-189573187

If there is a restoration from snapshot involved, could you also try to "initialize" the volume before running the first "Initial 1.10.3 test - slow push", using `fio` as recommended in the EBS docs [3]?

[3] http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ebs-initialize.html#ebs-initialize-linux
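A sketch of that initialization step along the lines of [3]; /dev/xvdf is a placeholder for whatever device backs the Docker storage volume:

# Reads every block once so the first-touch latency is paid up front.
sudo fio --filename=/dev/xvdf --rw=read --bs=128k --iodepth=32 \
    --ioengine=libaio --direct=1 --name=volume-initialize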
Should we change the description to reference Docker 1.10 rather than OSE 3.2.1, since the problem seems to actually be with docker?
@Scott thanks, I was just updating the text.
To simplify the recreate and avoid the upgrade/downgrade/upgrade contortions, I've recreated the problem as described below. In my opinion this should be the canonical way to recreate the issue.

- Create two 4-node (m4.xlarge) clusters on AWS: master, router/registry, 2 application nodes. Both registries are configured to use the same S3 bucket as backend storage, but with separate root directories in that bucket to ensure namespace separation.
- The nodes have no Docker code installed on them at all to begin with. Any partitions previously used for docker storage have been deleted.

Docker 1.9.1:
============
- Install Docker 1.9.1 from RHEL extras; Docker storage is a devicemapper thinp LVM on IOPS-provisioned EBS storage.
- Install OSE 3.2.1.3.
- Create 40 projects and create an application using the django-example template in each.
- After each project's application builds once and pushes to the registry (known long push of all layers), build it again as a warm-up.
- Now run 2 concurrent builds (1 per application node) x 25 times sequentially and record the push time of each build (a sketch of such a driver loop follows this comment). Each push is a single ~6MB application layer.
- Average push time for 50 total builds: 10.58s

Docker 1.10.3:
=============
- Exactly the same as above, command-for-command, using 1.10.3 from RHEL extras.
- Average push time for 50 total builds: 122.36s
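A sketch of a driver for the concurrent-build step above; the project names are placeholders:

# Kicks off one build in each of two projects, waits for both, 25 times.
for i in $(seq 1 25); do
  oc start-build django-example -n proj-a --follow > /tmp/push-a-$i.log 2>&1 &
  oc start-build django-example -n proj-b --follow > /tmp/push-b-$i.log 2>&1 &
  wait
done
# Push durations can then be pulled from each saved log as sketched earlier in this bug.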
Mike, could this be a dupe of bug 1368024 or bug 1372065?
Michal, I expect this to no longer be an issue once we get a build with the new registry on top of 3.2.1.x. Whether it is a dupe or not - not sure, but I would bet it is the same solution.
Ok, thanks. I'm closing it as a dupe of bug 1372065. Feel free to reopen.

*** This bug has been marked as a duplicate of bug 1372065 ***