Created attachment 1143925 [details]
Openshift build log from failure.

Description of problem:
This is a follow-on to https://bugzilla.redhat.com/show_bug.cgi?id=1318395. In that BZ, builds would hang forever on Docker API calls; a dump of such a hang is attached there. The build team has since wrapped their Docker calls with a 20 second timeout and now fails builds that hit the timeout (build log attached). From past experience I believe some of these calls are just slow rather than actually hung, but I have no evidence. In any case the timeout is 20 seconds.

I profiled (pprof, sar, iostat, mpstat) a run with such a failure. The run was 16 concurrent builds across 2 worker nodes (8 builds/node). The nodes are AWS m4.xlarge (4 vCPU, 16 GB). Docker is configured to use thin LVM backed by EBS provisioned IOPS (1800 IOPS) volumes.

The failure in the log above occurred at 13:36:47, and the performance data for the node on which the failed build ran can be found here:
http://perf-infra.ec2.breakage.org/incoming/ip-172-31-47-30/16_builds_2_nodes_b/tools-default/ip-172-31-47-33/

The performance data shows that the node does not appear to be overloaded. CPU maxes out at 80/400, although cpu_03 does show as busier than the others:
http://perf-infra.ec2.breakage.org/incoming/ip-172-31-47-30/16_builds_2_nodes_b/tools-default/ip-172-31-47-33/sar/per_cpu.html

The same data for the master, infrastructure, and other worker node can be found in the same directory tree: master is .30, infra is .31, and the other worker is .32. Let me know what other documentation is required.

Version-Release number of selected component (if applicable):
3.2.0.11

How reproducible:
Always, with enough concurrent builds - it usually kicks in around 2 builds/CPU.

Steps to Reproduce:
1. Install an AWS cluster (1 master, 1 infra, 2 workers) with IOPS-provisioned EBS storage volumes for thin LVM.
2. Create 16 projects and applications (cakephp-example). Let the initial build occur.
3. Launch 16 builds simultaneously, either from a shell or with the python script in the openshift/svt repo.

Actual results:
Usually (not always) 1 or 2 of the builds fail with the timeout error.

Expected results:
All builds run successfully, unless we determine this level of activity is excessive/beyond current capabilities.

Additional info:
Attached: failed build log, journalctl -u docker, and the hung build dump from BZ 1318395. See URLs above for other data.
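The "launch 16 builds simultaneously" step can be sketched in Python. This is a minimal illustration, not the openshift/svt script itself; the `svt-{i}` project names and the injectable `run` callable are assumptions for the example:

```python
import subprocess
import threading

def start_build_cmds(n=16, app="cakephp-example"):
    """Build one `oc start-build` command per project (project names assumed)."""
    return [["oc", "start-build", app, "-n", f"svt-{i}"] for i in range(n)]

def launch_all(cmds, run=subprocess.run):
    """Fire every build at once on its own thread; `run` is injectable
    so the launcher can be dry-run without a cluster."""
    threads = [threading.Thread(target=run, args=(c,)) for c in cmds]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

A dry run (`launch_all(start_build_cmds(), run=print)`) shows the commands without needing `oc` installed.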
Created attachment 1143926 [details] Syslogs for node where build failed
Created attachment 1143927 [details] Stack dump for docker daemon from BZ 1318395 (build hang)
What version of docker is this happening on? If it is docker-1.8, could you attempt to repeat on docker-1.9?
Re: comment 3 - Docker version is:

Client:
 Version:         1.9.1
 API version:     1.21
 Package version: docker-1.9.1-23.el7.x86_64
 Go version:      go1.4.2
 Git commit:      f97fb16/1.9.1
 Built:
 OS/Arch:         linux/amd64

Server:
 Version:         1.9.1
 API version:     1.21
 Package version: docker-1.9.1-23.el7.x86_64
 Go version:      go1.4.2
 Git commit:      f97fb16/1.9.1
 Built:
 OS/Arch:         linux/amd64
Could this just be the hang we have seen with docker pull? I believe we already serialize simultaneous docker pulls in another part of OpenShift; maybe we need to cap the number of simultaneous pulls in docker build as well.
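Capping simultaneous pulls is a plain semaphore pattern. A minimal sketch, assuming a hypothetical `do_pull` callable standing in for the real Docker API call (the cap of 2 is arbitrary and would need tuning):

```python
import threading

MAX_PULLS = 2  # assumed cap for illustration; the real limit would be tuned
_pull_gate = threading.BoundedSemaphore(MAX_PULLS)

def pull_image(name, do_pull):
    """Run do_pull(name), allowing at most MAX_PULLS pulls at once.

    Callers on other threads block here until a slot frees up, so a
    burst of concurrent builds cannot flood the daemon with pulls.
    """
    with _pull_gate:
        return do_pull(name)
```

The same gate could wrap any daemon call that degrades under concurrency, not just pulls.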
With the general improvements in the latest OpenShift and Docker is this issue reproducible?
This issue is still reproducible in OpenShift 3.3. Concurrent S2I builds almost always trigger the 20 second timeout introduced in OpenShift 3.2. This makes gathering stack traces of hangs difficult, as things tend to time out first. Let us know what documentation you'd like to see for this bug.
I am not sure what this bug is about. How did you arrive at this 20 second number, and what guarantees that a build will always succeed within 20 seconds? If you are trying to detect hangs, I would expect the timeout limit to be very high - at least a few minutes, not a few seconds.
20 seconds is an artificial limit added by OpenShift builds in https://bugzilla.redhat.com/show_bug.cgi?id=1318395 during 3.2. It was recently raised to 1 minute in OpenShift 3.3 (https://github.com/openshift/source-to-image/pull/576) and we still see occasional timeouts starting containers in builds.
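The timeout wrapper under discussion is roughly the following pattern. This is a sketch only (the real implementation is in openshift/source-to-image, in Go); the `call_with_timeout` helper name is invented for illustration:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def call_with_timeout(fn, timeout, *args):
    """Run fn(*args) but give up after `timeout` seconds.

    Note the caveat that motivated this bug: if fn is truly hung, the
    abandoned call keeps running in its worker thread - only the caller
    stops waiting, so the underlying hang is masked, not fixed.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(fn, *args).result(timeout=timeout)
    except FutureTimeout:
        raise RuntimeError("docker call timed out after %ss" % timeout)
    finally:
        pool.shutdown(wait=False)
```

Raising the limit from 20s to 1 minute only changes the `timeout` argument; a merely slow call still fails the build once it crosses whatever value is chosen.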
So many things can go wrong. In the cloud especially, storage can be very slow, and a lot of work can back up behind it and take a long time. So while you can use some sort of timeout as a warning, how can you be sure an arbitrary number is a good enough threshold to terminate builds? And even if you do that, how does it translate into a bug? The platform never offered any guarantee like that. This is more of an observation.
This was opened when builds would hang forever trying to create containers - a hard, permanent hang. To work around it, OpenShift now cancels builds that take longer than its threshold to start. I suspect, but cannot prove, that the hangs are still happening and that OpenShift is now killing the container instead. I can try to build openshift binaries without the timeout to recreate the hangs, but that is no longer product behavior.
Thanks, Jhon, for lowering the priority of this. If we are not even sure whether something is a bug, it should not be a high-priority issue.
@Mike, I know docker has had issues with multiple parallel builds. I hope those issues have been fixed in newer docker.
See https://bugzilla.redhat.com/show_bug.cgi?id=1375580 for related information.
With the resolution of Bug 1375580, are you able to reproduce the issue?
This issue no longer occurs in 3.6