Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1318395

Summary: Build hangs indefinitely during container creation when running simultaneous builds
Product: OpenShift Container Platform
Reporter: Mike Fiedler <mifiedle>
Component: Build
Assignee: Michal Fojtik <mfojtik>
Status: CLOSED ERRATA
QA Contact: Mike Fiedler <mifiedle>
Severity: high
Docs Contact:
Priority: unspecified
Version: 3.2.0
CC: amurdaca, aos-bugs, bparees, dwalsh, jeder, mifiedle, mpatel, tdawson, xtian
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2016-05-12 16:33:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description                                                         Flags
Build log, strace and proc status                                   none
kill -USR1 stacktrace for docker daemon on node where build hung    none

Description Mike Fiedler 2016-03-16 18:17:07 UTC
Created attachment 1137111 [details]
Build log, strace and proc status

Description of problem:

When running stress tests of 10 builds per node repeatedly, eventually at least one build will hang indefinitely.  The logs show the build is stuck at Docker container creation.  oc cancel-build will free the build up.

An strace and /proc/<pid>/status of the wedged build are attached.


Version-Release number of selected component (if applicable):  3.2.0.3


How reproducible: Frequent while running simultaneous builds


Steps to Reproduce:
1.  Create a cluster with a master and 7 nodes:  1 router, 1 registry, 5 builders.  I'm using m4.xlarge AWS instances.
2.  Create 20 projects.  Spread 50 quickstart application instances across the projects (cakephp, django, dancer, and nodejs).
3.  Build all 50 simultaneously after they have built once.  Build repeatedly.  This is roughly 10 builds per node.

Actual results:

Eventually one or more builds (usually one) will get wedged indefinitely.  The build log always shows it stuck here:

I0316 12:11:07.103974       1 docker.go:411] Image contains io.openshift.s2i.scripts-url set to 'image:///usr/libexec/s2i'
I0316 12:11:07.104452       1 docker.go:466] Base directory for STI scripts is '/usr/libexec/s2i'. Untarring destination is '/tmp'.
I0316 12:11:07.104505       1 docker.go:616] Creating container  using config: &{Hostname: Domainname: User: Memory:0..........

Full log attached



Expected results:

Builds run to completion, either successfully or with failure messages.

Additional info:

Attached:  build log, strace and proc status of openshift-sti-build while hanging.

Comment 1 Ben Parees 2016-03-16 18:27:10 UTC
Michal, can you take a look at this?  Cesar and I discussed it, and he suggested (and I agree) that we should just add a short (10s?) timeout on calls to the Docker API and, if the timeout is hit, fail the build.
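
For illustration only, a minimal Go sketch of the kind of timeout wrapper proposed here: run the Docker call in a goroutine and fail the build if it does not return in time. All names (withTimeout, errTimeout, createContainer) are hypothetical; the actual change landed later as the source-to-image PR linked in comment 19.

package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout is what the build would fail with when the daemon never answers.
var errTimeout = errors.New("timeout while waiting on Docker API call")

// withTimeout runs fn in a goroutine and gives up after d. If the daemon is
// truly wedged the goroutine is leaked, but the build fails instead of hanging.
func withTimeout(d time.Duration, fn func() error) error {
	done := make(chan error, 1)
	go func() { done <- fn() }()
	select {
	case err := <-done:
		return err
	case <-time.After(d):
		return errTimeout
	}
}

func main() {
	// Stand-in for the container-creation call that hangs in this bug.
	createContainer := func() error {
		time.Sleep(30 * time.Second) // simulate a wedged docker daemon
		return nil
	}
	if err := withTimeout(10*time.Second, createContainer); err != nil {
		fmt.Println("build error:", err) // here the build would be marked failed
	}
}

The done channel is buffered so that a late return from fn does not block the abandoned goroutine forever.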

Comment 2 Jeremy Eder 2016-03-16 18:29:54 UTC
Hi Ben -- fail that build and retry, right?

Comment 3 Ben Parees 2016-03-16 18:40:09 UTC
I could see us adding retry at some point, but for the immediate fix, no, just fail.  It's basically no different than "your build failed because we couldn't reach your maven repo to pull dependencies" which we also don't retry.  We don't have a good mechanism for retrying a build that has gotten this far along.

Comment 4 Ben Parees 2016-03-16 18:41:01 UTC
(fundamentally I would not expect this to be a frequent occurrence, so if it is, we need to get to the bottom of why the docker daemon is becoming unresponsive, not fix it via retries)

Comment 5 Mike Fiedler 2016-03-16 18:44:14 UTC
Let me know if there is any docker trace/log/etc. that would help.  At the activity level I mentioned, I can hit this fairly often.

Comment 6 Ben Parees 2016-03-16 18:46:37 UTC
Nothing that will help me per se, but if we can get someone more knowledgeable about docker to look at the docker daemon logs, that might help.

Once you hit this, does everything hit it, or is it just random builds that hit it while others work fine at the same time?

Comment 7 Jeremy Eder 2016-03-16 18:57:56 UTC
Mike -- can you reproduce with docker daemon debug logging enabled?

In /etc/sysconfig/docker

OPTIONS='--selinux-enabled --debug=true'

Restart the docker service, reproduce the hang, and capture the logs (they will show up in the journal).

Comment 8 Daniel Walsh 2016-03-16 19:06:11 UTC
What version of docker?  Can you try this on docker-1.9 to see if that fixes the problem?

Comment 9 Mike Fiedler 2016-03-16 19:19:23 UTC
Docker is 1.9.1

Server:
 Version:         1.9.1
 API version:     1.21
 Package version: docker-1.9.1-19.el7.x86_64
 Go version:      go1.4.2
 Git commit:      185277d/1.9.1
 Built:           
 OS/Arch:         linux/amd64

Comment 10 Daniel Walsh 2016-03-16 19:34:52 UTC
I know docker was working on some fixes for some locking issues.

Comment 11 Antonio Murdaca 2016-03-16 19:39:18 UTC
Could you also "kill -3" the docker daemon and get the goroutine stacks?

Comment 12 Antonio Murdaca 2016-03-16 19:40:02 UTC
Hanging on docker build isn't new, though: https://github.com/docker/docker/search?q=hangs&type=Issues&utf8

Comment 13 Ben Parees 2016-03-16 19:41:38 UTC
This is not a "docker build" operation, this is openshift's build system trying to create a docker container to perform an s2i build.

Comment 14 Daniel Walsh 2016-03-16 19:49:57 UTC
Right, but the belief is that docker build is hanging.

Comment 15 Ben Parees 2016-03-16 20:03:33 UTC
There's no docker build, but yes, the docker daemon seems to be hanging, or at least this particular call to the docker daemon does (hence my question of whether, once you hit this, everything fails every time or just random calls seem to hit it).

Comment 16 Mike Fiedler 2016-03-16 20:20:12 UTC
re: comment 15 and comment 6 

I know that if I cancel the build, the next set of builds will work OK; there is nothing "broken" about the node.  I do not know yet whether, while it is hanging, new builds on that node will hang.  I will try that along with --debug=true and kill -3 and report back.

Comment 17 Michal Fojtik 2016-03-17 09:46:59 UTC
Ben, I'm working on a fix today (increase timeout)

Comment 18 Michal Fojtik 2016-03-17 10:17:50 UTC
(actually not increasing a timeout but setting one ;-) ETOOMUCHCOFFEE

Comment 19 Michal Fojtik 2016-03-17 11:03:59 UTC
PR: https://github.com/openshift/source-to-image/pull/429

Comment 20 openshift-github-bot 2016-03-17 13:22:45 UTC
Commit pushed to master at https://github.com/openshift/source-to-image

https://github.com/openshift/source-to-image/commit/edbb438d2caca4b8e3721d6c29a65ac289e5d1e3
Bug 1318395: Add timeout check for Docker API calls

Comment 21 Michal Fojtik 2016-03-17 13:38:50 UTC
Waiting for s2i bump in origin to deliver this change.

Comment 22 Antonio Murdaca 2016-03-18 13:17:06 UTC
Another way to dump the stack for docker is to run kill -USR1 <daemon pid>; the stack will appear in the daemon logs.

Comment 23 Mike Fiedler 2016-03-23 18:22:50 UTC
I now see this error in the log:

Error: build error: calling the function timeout after 10s

However, I think this is too short.  I'm seeing many failures when running simultaneous builds.  I would previously see hangs every 10 runs or so when running 10 simultaneous builds on a node; now I see multiple failures of this type per run - I'm certain they were not all hung.

I understand any value we choose right now is arbitrary, but can we increase it somewhat to better account for busy nodes?  I can't complete the SVT tests with the current failure rates.

wzheng - please move this back to assigned

Comment 25 Michal Fojtik 2016-03-30 08:19:36 UTC
Mike: We added more informative messages here: https://github.com/openshift/source-to-image/pull/445

(so you will know what timed out and why). We can also increase the timeout if necessary, but I would rather fail the build and rerun it (rescheduling to another, less busy node if possible).
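
As a rough illustration of what "more informative" could look like, here is a hypothetical error type that names the Docker operation and the timeout. This is a sketch, not the code in the PR above; timeoutError and its fields are made up, though the output format mirrors the message quoted later in comment 35.

package main

import (
	"fmt"
	"time"
)

// timeoutError records which Docker call timed out and after how long.
type timeoutError struct {
	operation string
	after     time.Duration
}

func (e timeoutError) Error() string {
	return fmt.Sprintf("timeout after waiting %s for Docker to %s", e.after, e.operation)
}

func main() {
	err := timeoutError{operation: "create container", after: 20 * time.Second}
	// Prints: build error: timeout after waiting 20s for Docker to create container
	fmt.Println("build error:", err)
}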

Comment 26 Michal Fojtik 2016-03-30 08:27:19 UTC
The PR should be merged to origin by now, so moving this ON_QA.

Comment 27 Michal Fojtik 2016-03-30 10:25:47 UTC
Additionally, if we think this is a Docker bug, we should open a separate issue for Docker, agreed?

Comment 28 Mike Fiedler 2016-04-01 21:09:30 UTC
Michal, a new location where an indefinite hang was seen:

I0401 16:38:21.478826       1 docker.go:623] Attaching to container "a5732ac3c27f0c4bf457dc8e79f0bc5b52f9cf677485232e25234279b669a11a" ...
I0401 16:38:21.479079       1 docker.go:632] Starting container "a5732ac3c27f0c4bf457dc8e79f0bc5b52f9cf677485232e25234279b669a11a" ...

It's been there for about 20 minutes.


This is on 3.2.0.9 (Wednesday puddle).   Let me know if you prefer a new BZ as opposed to reopening this one.  I know that can get tiresome.

It's clear we need to get an issue opened for the Docker API hangs.

Antonio/runcom, I did get the stack dump suggested in comment 22 while this hang was occurring.  Attaching it to this BZ.  Let me know if you want a separate issue or BZ to track it.

Comment 29 Mike Fiedler 2016-04-01 21:10:51 UTC
Created attachment 1142704 [details]
kill -USR1 stacktrace for docker daemon on node where build hung

Comment 30 Michal Fojtik 2016-04-04 12:21:06 UTC
Mike: That should already be fixed (I can see the timeout wrapper around that function call...). It should time out after 20 seconds. Maybe the changes are not yet synced into the puddle.

Yeah, I really want to open an issue against the Docker API hangs... The wrappers are just a temporary solution; they don't fix the problem, they just turn the hang into a hard failure so we can retry. The Docker API should return an error when it is busy instead of hanging forever.
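
To make the contrast concrete: the current wrapper has to abandon the call from the outside (as in the sketch under comment 1), whereas an API that honored a deadline would return an error itself. Below is a minimal hypothetical illustration using a context deadline; startContainer and the 20-second budget are stand-ins for this sketch, not the actual s2i or Docker client code.

package main

import (
	"context"
	"fmt"
	"time"
)

// startContainer stands in for a blocking attach/start call to the daemon.
func startContainer(ctx context.Context) error {
	select {
	case <-time.After(60 * time.Second): // simulate a daemon that never answers
		return nil
	case <-ctx.Done():
		return ctx.Err() // context.DeadlineExceeded once the 20s budget is spent
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()
	if err := startContainer(ctx); err != nil {
		fmt.Println("build error:", err) // fail the build instead of hanging forever
	}
}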

Comment 31 Mike Fiedler 2016-04-05 01:59:05 UTC
I hit the same hang again on today's puddle (3.2.0.11).  Can you check if the change from last week is in OSE 3.2 builds?

From 3.2.0.11, I hit this hang:

I0404 21:43:50.758093       1 docker.go:623] Attaching to container "c3c193f3ef3f204ca5adfbd87fdb5474e09e27b58ed160492696fe0b158ab961" ...
I0404 21:43:50.758347       1 docker.go:632] Starting container "c3c193f3ef3f204ca5adfbd87fdb5474e09e27b58ed160492696fe0b158ab961" ...

It has been there for 15 minutes.

Comment 32 Mike Fiedler 2016-04-05 02:41:56 UTC
I will open a separate BZ tomorrow to deal with Docker-specific issues.  Antonio/runcom, if you have a chance to look at the attachment referenced in comment 29, it might give us a starting point.

Comment 35 Mike Fiedler 2016-04-05 16:28:05 UTC
Most builds with Docker timeout issues now fail with the new messages:

I0405 12:24:02.786297       1 fs.go:156] Removing directory '/tmp/s2i-build965773972'
F0405 12:24:02.842605       1 builder.go:204] Error: build error: timeout after waiting 20s for Docker to create container


But occasional indefinite hangs occur with the following in the log:

I0404 21:43:50.758093       1 docker.go:623] Attaching to container "c3c193f3ef3f204ca5adfbd87fdb5474e09e27b58ed160492696fe0b158ab961" ...
I0404 21:43:50.758347       1 docker.go:632] Starting container "c3c193f3ef3f204ca5adfbd87fdb5474e09e27b58ed160492696fe0b158ab961" ...

This is on 3.2.0.11

Comment 36 Mike Fiedler 2016-04-15 14:28:32 UTC
Not seeing the issue in comment 35 any longer.  Verified that the line numbers of the messages match the current code with the timeout "protection".

Comment 37 Mike Fiedler 2016-04-25 11:59:19 UTC
The timeouts are currently wrapping the hangs that were being encountered.  A new bug will be opened for any new hangs.

Comment 39 errata-xmlrpc 2016-05-12 16:33:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2016:1064