Bug 1318395
| Field | Value |
|---|---|
| Summary | Build hangs indefinitely during container creation when running simultaneous builds |
| Product | OpenShift Container Platform |
| Component | Build |
| Version | 3.2.0 |
| Status | CLOSED ERRATA |
| Severity | high |
| Priority | unspecified |
| Hardware | x86_64 |
| OS | Linux |
| Reporter | Mike Fiedler <mifiedle> |
| Assignee | Michal Fojtik <mfojtik> |
| QA Contact | Mike Fiedler <mifiedle> |
| CC | amurdaca, aos-bugs, bparees, dwalsh, jeder, mifiedle, mpatel, tdawson, xtian |
| Doc Type | Bug Fix |
| Type | Bug |
| Last Closed | 2016-05-12 16:33:14 UTC |
Michal, can you take a look at this? Cesar and I discussed it and he suggested (and I agree) that we should just add a short (10s?) timeout on calls to the docker API and, if the timeout is hit, fail the build.

Hi Ben -- fail that build and retry, right?

I could see us adding retry at some point, but for the immediate fix, no, just fail. It's basically no different than "your build failed because we couldn't reach your maven repo to pull dependencies", which we also don't retry. We don't have a good mechanism for retrying a build that has gotten this far along. (Fundamentally I would not expect this to be a frequent occurrence, so if it is, we need to get to the bottom of why the docker daemon is becoming unresponsive, not fix it via retries.)

Let me know if there is any docker trace/log/etc. that would help. At the activity level I mentioned, I can hit this a good bit.

Nothing that will help me per se, but if we can get someone more docker-knowledgeable to look at the docker daemon logs, that might help. Once you hit this, does everything hit it, or is it just random builds that hit it while others work fine at the same time?

Mike -- can you reproduce with docker daemon debug logging enabled? In /etc/sysconfig/docker set OPTIONS='--selinux-enabled --debug=true', restart the docker service, reproduce the hang and capture the logs (they will show up in the journal).

What version of docker? Can you try this on docker-1.9 to see if that fixes the problem?

Docker is 1.9.1:

Server:
 Version: 1.9.1
 API version: 1.21
 Package version: docker-1.9.1-19.el7.x86_64
 Go version: go1.4.2
 Git commit: 185277d/1.9.1
 Built:
 OS/Arch: linux/amd64

I know docker was working on some fixes for some locking issues. Could you also "kill -3" the docker daemon and get the goroutine stacks?

Hanging on docker build isn't new, though: https://github.com/docker/docker/search?q=hangs&type=Issues&utf8

This is not a "docker build" operation; this is OpenShift's build system trying to create a docker container to perform an s2i build.

Right, but the belief is that docker build is hanging.

There's no docker build, but yes, the docker daemon seems to be hanging, or at least this particular call to the docker daemon is (hence my question of whether, once you hit this, everything fails every time, or just random calls hit it).

re: comment 15 and comment 6 -- I know that if I cancel the build, the next set of builds will work OK; there is nothing "broken" about the node. I do not know yet whether, while it is hanging, new builds on that node will hang. I will try that along with --debug=true and kill -3 and report back.

Ben, I'm working on a fix today (increase timeout) (actually not increasing the timeout but setting a timeout ;-) ETOOMUCHCOFFEE).

Commit pushed to master at https://github.com/openshift/source-to-image: https://github.com/openshift/source-to-image/commit/edbb438d2caca4b8e3721d6c29a65ac289e5d1e3 -- Bug 1318395: Add timeout check for Docker API calls.

Waiting for the s2i bump in origin to deliver this change.

Another way to dump the stack for docker is to call kill -USR1 <daemon pid>, and you'll get the stack in the daemon logs.

I now see this error in the log:

Error: build error: calling the function timeout after 10s

However, I think this is too short. I'm seeing many failures when running simultaneous builds. I would previously see hangs every 10 runs or so when running 10 simultaneous builds on a node; now I see multiple failures of this type per run, and I'm certain they were not all hung.
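The approach being discussed -- bound each Docker API call with a timeout and fail the build when the daemon does not respond -- can be sketched in a few lines of Go. The example below is a hypothetical illustration of that pattern using only the standard library; callWithTimeout and createContainer are invented names, and this is not the actual source-to-image change, which may differ in detail.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errTimeout is returned when the wrapped call does not finish in time.
var errTimeout = errors.New("timeout while waiting for Docker API call")

// callWithTimeout runs fn in a goroutine and returns an error if fn has not
// finished within the given timeout. Note that the goroutine keeps running
// after a timeout; this only lets the caller fail fast instead of hanging.
func callWithTimeout(timeout time.Duration, fn func() error) error {
	done := make(chan error, 1)
	go func() { done <- fn() }()
	select {
	case err := <-done:
		return err
	case <-time.After(timeout):
		return errTimeout
	}
}

func main() {
	// Hypothetical stand-in for a Docker API call such as container creation.
	createContainer := func() error {
		time.Sleep(30 * time.Second) // simulate an unresponsive docker daemon
		return nil
	}

	if err := callWithTimeout(10*time.Second, createContainer); err != nil {
		// In the build flow, this is where the build would be marked failed.
		fmt.Println("build error:", err)
	}
}
```

A wrapper like this turns a hung daemon into a prompt build failure, which matches the "just fail, don't retry" decision above, but it does not stop the stuck call in the background or address why the daemon became unresponsive.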
I understand any value we choose right now is arbitrary, but can we increase this some to better take into account busy nodes? I can't complete the SVT tests with the current failure rates. wzheng -- please move this back to ASSIGNED.

Mike: We added more informative messages here: https://github.com/openshift/source-to-image/pull/445 (so you will know what timed out and why). We can also increase the timeout if necessary, but I would rather fail the build and rerun (rescheduling to another, less busy node if possible). The PR should be merged into origin by now, so moving this ON_QA. Additionally, if we think this is a Docker bug, we should open a separate issue for Docker, agreed?

Michal, new location where an indefinite hang was seen:

I0401 16:38:21.478826 1 docker.go:623] Attaching to container "a5732ac3c27f0c4bf457dc8e79f0bc5b52f9cf677485232e25234279b669a11a" ...
I0401 16:38:21.479079 1 docker.go:632] Starting container "a5732ac3c27f0c4bf457dc8e79f0bc5b52f9cf677485232e25234279b669a11a" ...

It's been there for about 20 minutes. This is on 3.2.0.9 (Wednesday puddle). Let me know if you prefer a new BZ as opposed to reopening this one; I know that can get tiresome. It's clear we need to get an issue opened for the Docker API hangs.

Antonio/runcom: I did get the stack dump suggested in comment 22 while this hang was occurring. Attaching it to this BZ. Let me know if you want a separate issue or BZ to track it.

Created attachment 1142704 [details]
kill -USR1 stacktrace for docker daemon on node where build hung
Mike: That should already be fixed (I can see the timeout wrapper around that function call...). It should time out after 20 seconds. Maybe the changes are not yet synced into the puddle.

Yeah, I really want to open an issue against the Docker API hangs... The wrappers are just a temporary solution; they don't fix the problem, they just turn the hang into a hard failure so we can retry. The Docker API should return an error when it is busy instead of hanging forever.

I hit the same hang again on today's puddle (3.2.0.11). Can you check whether the change from last week is in OSE 3.2 builds? From 3.2.0.11, I hit this hang:

I0404 21:43:50.758093 1 docker.go:623] Attaching to container "c3c193f3ef3f204ca5adfbd87fdb5474e09e27b58ed160492696fe0b158ab961" ...
I0404 21:43:50.758347 1 docker.go:632] Starting container "c3c193f3ef3f204ca5adfbd87fdb5474e09e27b58ed160492696fe0b158ab961" ...

It has been there for 15 minutes. I will separate out a BZ tomorrow to deal with Docker-specific issues. Antonio/runcom, if you have a chance to look at the attachment referenced in comment 29, it might give us a starting point.

Most builds with Docker timeout issues now fail with the new messages:

I0405 12:24:02.786297 1 fs.go:156] Removing directory '/tmp/s2i-build965773972'
F0405 12:24:02.842605 1 builder.go:204] Error: build error: timeout after waiting 20s for Docker to create container

But occasional indefinite hangs still occur with the following in the log:

I0404 21:43:50.758093 1 docker.go:623] Attaching to container "c3c193f3ef3f204ca5adfbd87fdb5474e09e27b58ed160492696fe0b158ab961" ...
I0404 21:43:50.758347 1 docker.go:632] Starting container "c3c193f3ef3f204ca5adfbd87fdb5474e09e27b58ed160492696fe0b158ab961" ...

This is on 3.2.0.11.

Not seeing the issue in comment 35 any longer. Verified that the line numbers of the messages match the current code with the timeout "protection". The timeouts are currently wrapping the hangs that were being encountered. A new bug will be opened for any new hangs.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2016:1064
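On the point raised above that the Docker API should return an error when busy rather than hanging forever: a client-side timeout at the HTTP layer can at least bound the wait even without per-call wrappers. The sketch below is a hypothetical illustration that talks to the Docker remote API over the default unix socket using only the Go standard library; it is not the code used by s2i or OpenShift, and the socket path and the 20-second value are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

func main() {
	// Dial the local Docker daemon over its unix socket. The socket path is
	// the conventional default; adjust it for your environment.
	transport := &http.Transport{
		DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, "unix", "/var/run/docker.sock")
		},
	}

	// A client-level timeout turns an unresponsive daemon into an error
	// instead of an indefinite hang.
	client := &http.Client{Transport: transport, Timeout: 20 * time.Second}

	// List containers via the Docker remote API; the host part of the URL is
	// ignored once the unix-socket dialer is in place.
	resp, err := client.Get("http://docker/containers/json")
	if err != nil {
		fmt.Println("docker API call failed (or timed out):", err)
		return
	}
	defer resp.Body.Close()

	body, _ := io.ReadAll(resp.Body)
	fmt.Println(string(body))
}
```

With http.Client.Timeout set, a daemon that never answers produces a timeout error the caller can act on rather than blocking indefinitely.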
Created attachment 1137111 [details]
Build log, strace and proc status

Description of problem:
When running stress tests of 10 builds per node repeatedly, eventually at least one build will hang forever. The logs show the build is stuck at docker container creation. oc cancel-build will free the build up. An strace and /proc/<pid>/status of the wedged build are attached.

Version-Release number of selected component (if applicable):
3.2.0.3

How reproducible:
Frequent while running simultaneous builds.

Steps to Reproduce:
1. Create a cluster with a master and 7 nodes: 1 router, 1 registry, 5 builders. I'm using m4.xlarge AWS instances.
2. Create 20 projects. Spread 50 quickstart application instances across the projects (cakephp, django, dancer and nodejs).
3. Build all 50 simultaneously after they have built once. Build repeatedly. This is roughly 10 builds per node.

Actual results:
Eventually one or more (usually one) build will get wedged forever. The build log always shows it stuck here:

I0316 12:11:07.103974 1 docker.go:411] Image contains io.openshift.s2i.scripts-url set to 'image:///usr/libexec/s2i'
I0316 12:11:07.104452 1 docker.go:466] Base directory for STI scripts is '/usr/libexec/s2i'. Untarring destination is '/tmp'.
I0316 12:11:07.104505 1 docker.go:616] Creating container using config: &{Hostname: Domainname: User: Memory:0..........

Full log attached.

Expected results:
Builds run to completion successfully or with failure messages.

Additional info:
Attached: build log, strace and proc status of openshift-sti-build while hanging.