| Summary: | Builds fail with Docker operations > 20 seconds, system with build pod not out of capacity | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Mike Fiedler <mifiedle> | ||||||||
| Component: | Containers | Assignee: | Jhon Honce <jhonce> | ||||||||
| Status: | CLOSED WORKSFORME | QA Contact: | Mike Fiedler <mifiedle> | ||||||||
| Severity: | low | Docs Contact: | |||||||||
| Priority: | medium | ||||||||||
| Version: | 3.2.0 | CC: | amurdaca, aos-bugs, dwalsh, imcleod, jhonce, jokerman, mifiedle, mmccomas, mpatel, vgoyal | ||||||||
| Target Milestone: | --- | ||||||||||
| Target Release: | --- | ||||||||||
| Hardware: | Unspecified | ||||||||||
| OS: | Unspecified | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | Environment: | ||||||||||
| Last Closed: | 2017-09-11 16:17:41 UTC | Type: | Bug | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Attachments: |
Description
Mike Fiedler
2016-04-05 18:21:11 UTC
Created attachment 1143926 [details]
Syslogs for node where build failed
Created attachment 1143927 [details]
Stack dump for docker daemon from BZ 1318395 (build hang)

What version of docker is this happening on? If it is docker-1.8, could you attempt to repeat on docker-1.9?

Re: comment 3 - Docker is:

Client:
 Version: 1.9.1
 API version: 1.21
 Package version: docker-1.9.1-23.el7.x86_64
 Go version: go1.4.2
 Git commit: f97fb16/1.9.1
 Built:
 OS/Arch: linux/amd64
Server:
 Version: 1.9.1
 API version: 1.21
 Package version: docker-1.9.1-23.el7.x86_64
 Go version: go1.4.2
 Git commit: f97fb16/1.9.1
 Built:
 OS/Arch: linux/amd64

Could this just be the hang we have seen with docker pull? I believe we are blocking simultaneous docker pulls in another part of OpenShift; maybe we need to limit the number of simultaneous pulls in docker build as well.

With the general improvements in the latest OpenShift and Docker, is this issue reproducible?

This issue is still reproducible in OpenShift 3.3. Concurrent S2I builds almost always trigger the 20 second timeout introduced in OpenShift 3.2. This makes gathering stack traces of hangs difficult, as things tend to time out first.

Let us know what documentation you'd like to see for this bug.

I am not sure what this bug is about. How did you arrive at this 20 second number, and what guarantees that a build will always succeed in 20 seconds? If you are trying to detect hangs, I would expect the timeout limits to be very high, say a few minutes at least, not a few seconds.

20 seconds is an artificial limit added by OpenShift builds in https://bugzilla.redhat.com/show_bug.cgi?id=1318395 during 3.2. It was recently raised to 1 minute in OpenShift 3.3 (https://github.com/openshift/source-to-image/pull/576) and we still see occasional timeouts starting containers in builds.

So many things can go wrong. Especially in the cloud, storage can often be very slow, and a lot of things can get backlogged behind it and take a long time. So while you can use some sort of timeout for a warning, how can you be sure that a random number is good enough to terminate builds? And even if you do that, how does that translate into a bug? The platform never offered any guarantees like that.

This is more of an observation. This was opened when builds would hang forever trying to create containers: a hard and permanent hang. To work around it, OpenShift now cancels builds that take longer than their threshold to start. I'm guessing, but cannot prove, that the hangs are still happening but OpenShift kills the container. I can try to build openshift binaries that don't have the timeout to recreate the hangs, but that's not product behavior any longer.

Thanks Jhon for lowering the priority of this. If we are not even sure whether something is a bug, it should not be a high priority issue.

@Mike, I know docker has had issues with multiple parallel builds. I hope those issues have been fixed in newer docker.

See https://bugzilla.redhat.com/show_bug.cgi?id=1375580 for related information.

With the resolution of Bug 1375580, are you able to reproduce the issue?

This issue no longer occurs in 3.6
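
For readers following the timeout discussion in this thread, here is a minimal Python sketch of the two ideas raised in the comments: capping how many Docker operations run at once, and treating a short threshold as a warning while reserving termination for a much longer limit. This is not the OpenShift or source-to-image implementation. The names `ConcurrencyLimiter` and `run_with_thresholds` are hypothetical, and `fake_container_start` is only a stand-in for a real container start.

```python
import threading
import time


class ConcurrencyLimiter:
    """Allow at most max_concurrent operations to run at the same time."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, operation):
        # Blocks until a slot is free; the slot is released when the
        # operation returns or raises.
        with self._slots:
            return operation()


def run_with_thresholds(operation, warn_after, fail_after, log=print):
    """Run operation() in a worker thread with a soft and a hard threshold.

    Logs a warning if the operation is still running after warn_after
    seconds and raises TimeoutError after fail_after seconds.
    """
    if fail_after < warn_after:
        raise ValueError("fail_after must be >= warn_after")

    outcome = {}
    done = threading.Event()

    def worker():
        try:
            outcome["value"] = operation()
        except BaseException as exc:  # hand any failure back to the caller
            outcome["error"] = exc
        finally:
            done.set()

    # Python threads cannot be killed, so a daemon thread is used to keep a
    # hung operation from blocking interpreter exit. A real build system
    # would also have to cancel the underlying container operation.
    threading.Thread(target=worker, daemon=True).start()

    if not done.wait(warn_after):
        log(f"warning: operation still running after {warn_after}s")
        if not done.wait(fail_after - warn_after):
            raise TimeoutError(f"operation exceeded {fail_after}s")

    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


if __name__ == "__main__":
    limiter = ConcurrencyLimiter(max_concurrent=2)

    def fake_container_start():
        time.sleep(0.3)  # stand-in for a slow Docker operation
        return "started"

    # Thresholds are scaled down for the demo; the thread above discusses
    # 20 seconds or 1 minute versus several minutes.
    print(run_with_thresholds(lambda: limiter.run(fake_container_start), 0.1, 2.0))
```

In this arrangement the time spent waiting for a free slot counts against the thresholds, because the limiter is called inside the timed operation. Wrapping the calls the other way round would time only the operation itself, which may be closer to what is wanted when detecting hangs rather than backlog.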