Bug 1324180

Summary: Builds fail with Docker operations > 20 seconds, system with build pod not out of capacity
Product: OpenShift Container Platform
Component: Containers
Version: 3.2.0
Reporter: Mike Fiedler <mifiedle>
Assignee: Jhon Honce <jhonce>
QA Contact: Mike Fiedler <mifiedle>
Status: CLOSED WORKSFORME
Severity: low
Priority: medium
CC: amurdaca, aos-bugs, dwalsh, imcleod, jhonce, jokerman, mifiedle, mmccomas, mpatel, vgoyal
Type: Bug
Doc Type: Bug Fix
Last Closed: 2017-09-11 16:17:41 UTC
Attachments:
- Openshift build log from failure
- Syslogs for node where build failed
- Stack dump for docker daemon from BZ 1318395 (build hang)

Description Mike Fiedler 2016-04-05 18:21:11 UTC
Created attachment 1143925 [details]
Openshift build log from failure.

Description of problem:

This is a follow-on to https://bugzilla.redhat.com/show_bug.cgi?id=1318395

In that BZ, builds would hang forever on Docker API calls.   A dump of such a hang is attached to that BZ.  

The build team wrapped their Docker calls with a 20-second timeout and now fail builds that hit it (build log attached).   From past experience I believe some of these calls are just slow, not actually hung, but I have no evidence.  In any case the timeout is 20 seconds.
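The real timeout wrapper was added to OpenShift's Go build code (source-to-image), not Python; as an illustration only, the pattern is roughly the following sketch, with hypothetical names and a stand-in for the Docker API call:

```python
import concurrent.futures
import time

# Illustrative only: the real timeout wrapper lives in OpenShift's Go
# source-to-image code. Names here are hypothetical.
DOCKER_CALL_TIMEOUT = 20  # seconds, the limit described in this bug


def call_with_timeout(fn, *args, timeout=DOCKER_CALL_TIMEOUT):
    """Run a blocking Docker API call, failing the build if it exceeds the limit."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    try:
        future = pool.submit(fn, *args)
        return future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        # A slow-but-alive call and a true hang look identical here, which is
        # exactly the ambiguity raised in this bug report.
        raise RuntimeError(f"docker call exceeded {timeout}s; failing build")
    finally:
        pool.shutdown(wait=False)


def slow_docker_call(duration):
    """Stand-in for a Docker API call that takes `duration` seconds."""
    time.sleep(duration)
    return "ok"
```

Note that from the caller's side a timed-out call and a hung call are indistinguishable, which is why failing the build at the timeout hides whether the underlying hang still occurs.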

I profiled (pprof, sar, iostat, mpstat) a run with such a failure.   The run was 16 concurrent builds across 2 worker nodes (8 builds/node).  The nodes are AWS m4.xlarge (4 vCPU, 16 GB).  Docker is configured to use thin LVM backed by EBS provisioned IOPS (1800 iops) volumes.  

The failure for the log above occurred at 13:36:47 and the performance data for the node on which the failed build ran can be found here:  http://perf-infra.ec2.breakage.org/incoming/ip-172-31-47-30/16_builds_2_nodes_b/tools-default/ip-172-31-47-33/

The performance data shows that the node does not appear to be overloaded.  CPU utilization maxes out at 80 of a possible 400 (4 vCPUs), although cpu_03 does show as busier than the others:  http://perf-infra.ec2.breakage.org/incoming/ip-172-31-47-30/16_builds_2_nodes_b/tools-default/ip-172-31-47-33/sar/per_cpu.html.

The same data for the master, infrastructure and other worker node can be found in the same directory tree.  The master is .30, infra is .31 and the other worker is .32.

Let me know what other documentation is required.

Version-Release number of selected component (if applicable): 3.2.0.11


How reproducible:  Always with enough concurrent builds - usually kicks in around 2 builds/CPU.


Steps to Reproduce:
1. Install an AWS cluster (1 master, 1 infra, 2 workers) with IOPS provisioned EBS storage volume for thin LVM
2. Create 16 projects and applications (cakephp-example).  Let initial build occur.
3. Launch 16 builds simultaneously, from a shell or with the Python script in the openshift/svt repo.
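Step 3 can be sketched roughly as follows (assuming the standard `oc start-build` command; the project names are hypothetical, and the actual script lives in the openshift/svt repo):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def start_build_cmd(project, app="cakephp-example"):
    # One `oc start-build` invocation; project names here are hypothetical.
    return ["oc", "start-build", app, "-n", project, "--wait"]


def launch_builds(n_projects, runner=subprocess.run):
    # Kick off all builds at once. `runner` is injectable so the sketch can
    # be exercised without a live cluster.
    cmds = [start_build_cmd(f"svt-{i}") for i in range(n_projects)]
    with ThreadPoolExecutor(max_workers=n_projects) as pool:
        return list(pool.map(runner, cmds))
```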

Actual results:

Usually (not always) 1 or 2 of the builds will fail with the timeout error.  


Expected results:

All builds run successfully unless we determine this level of activity is excessive/beyond current capabilities.

Additional info:

Attached:  failed build log, journalctl -u docker, hung build dump from BZ 1318395.  See URLs above for other data.

Comment 1 Mike Fiedler 2016-04-05 18:21:55 UTC
Created attachment 1143926 [details]
Syslogs for node where build failed

Comment 2 Mike Fiedler 2016-04-05 18:22:40 UTC
Created attachment 1143927 [details]
Stack dump for docker daemon from BZ 1318395 (build hang)

Comment 3 Daniel Walsh 2016-04-05 20:07:04 UTC
What version of docker is this happening on?  If it is docker-1.8, could you attempt to repeat on docker-1.9?

Comment 4 Mike Fiedler 2016-04-05 20:29:20 UTC
Re: comment 3 - Docker is :

Client:
 Version:         1.9.1
 API version:     1.21
 Package version: docker-1.9.1-23.el7.x86_64
 Go version:      go1.4.2
 Git commit:      f97fb16/1.9.1
 Built:           
 OS/Arch:         linux/amd64

Server:
 Version:         1.9.1
 API version:     1.21
 Package version: docker-1.9.1-23.el7.x86_64
 Go version:      go1.4.2
 Git commit:      f97fb16/1.9.1
 Built:           
 OS/Arch:         linux/amd64

Comment 5 Daniel Walsh 2016-04-06 13:36:14 UTC
Could this just be the hang we have seen with docker pull?  I believe we already limit simultaneous docker pulls in another part of OpenShift; maybe we need to limit the number of simultaneous pulls in docker build as well.

Comment 6 Jhon Honce 2016-08-16 22:12:57 UTC
With the general improvements in the latest OpenShift and Docker, is this issue still reproducible?

Comment 7 Mike Fiedler 2016-08-17 02:10:09 UTC
This issue is still reproducible in OpenShift 3.3.   Concurrent S2I builds almost always trigger the 20-second timeout introduced in OpenShift 3.2.   This makes gathering stack traces of hangs difficult, as things tend to time out first.

Let us know what documentation you'd like to see for this bug.

Comment 8 Vivek Goyal 2016-09-01 17:57:14 UTC
I am not sure what this bug is about. How did you arrive at this 20-second number, and what guarantees that a build will always succeed in 20 seconds?

If you are trying to detect hangs, I would expect the timeout limit to be very high, say a few minutes at least, not a few seconds.

Comment 9 Mike Fiedler 2016-09-01 18:58:35 UTC
20 seconds is an artificial limit added by OpenShift builds in https://bugzilla.redhat.com/show_bug.cgi?id=1318395 during 3.2.   It was recently raised to 1 minute in OpenShift 3.3 (https://github.com/openshift/source-to-image/pull/576), and we still see occasional timeouts starting containers in builds.

Comment 10 Vivek Goyal 2016-09-01 19:04:48 UTC
So many things can go wrong. Especially in the cloud, storage can often be very slow, and a lot of work can get backlogged behind it and take a long time.

So while you can use some sort of timeout for a warning, how can you be sure an arbitrary number is a good threshold for terminating builds?

And even if you do, how does that translate into a bug? The platform never offered any guarantees like that. This is more of an observation.

Comment 11 Mike Fiedler 2016-09-01 19:36:03 UTC
This was opened when builds would hang forever trying to create containers - a hard and permanent hang.   To work around it, OpenShift now cancels builds that take longer than the threshold to start.   I suspect, but cannot prove, that the hangs are still happening and that OpenShift now kills the container.   I can try to build OpenShift binaries without the timeout to recreate the hangs, but that's not product behavior any longer.

Comment 13 Vivek Goyal 2016-09-01 19:55:14 UTC
Thanks, Jhon, for lowering the priority of this. If we are not even sure whether something is a bug, it should not be a high-priority issue.

Comment 14 Vivek Goyal 2016-09-01 19:56:08 UTC
@Mike, I know docker has had issues with multiple parallel builds. I hope those issues have been fixed in newer docker.

Comment 15 Jhon Honce 2016-09-15 16:49:37 UTC
See https://bugzilla.redhat.com/show_bug.cgi?id=1375580 for related information.

Comment 16 Jhon Honce 2017-09-08 15:28:00 UTC
With the resolution of Bug 1375580, are you able to reproduce the issue?

Comment 17 Mike Fiedler 2017-09-11 16:17:41 UTC
This issue no longer occurs in 3.6.