Bug 1370032

Summary: Deployments unable to start pods due to "connection reset by peer"
Product: OpenShift Online
Component: Deployments
Version: 3.x
Status: CLOSED CURRENTRELEASE
Severity: low
Priority: medium
Reporter: Pieter Nagel <pieter>
Assignee: Michal Fojtik <mfojtik>
QA Contact: zhou ying <yinzhou>
CC: abhgupta, aos-bugs, jhonce, jokerman, mmccomas
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2017-02-16 22:12:25 UTC
Attachments: Output of oc describe pod/tau-web-dev-gfa-18-z5nvk

Description Pieter Nagel 2016-08-25 06:26:32 UTC
Created attachment 1193884 [details]
Output of oc describe pod/tau-web-dev-gfa-18-z5nvk

Description of problem:

As of yesterday, all my deployments have been timing out due to errors well before even getting around to pulling the image.

Looking at the events on the failed pod, I see lots of messages like "Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: API error (500): Cannot start container 8ed16139f51fb937b8b9ce1747f062142bf1ffe7dd2792031617d92536e8cd0c: [8] System error: read parent: connection reset by peer\n"

More detailed output of "oc describe" for the given pod is attached.
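
For anyone triaging a similar failure, the same information can be pulled from the CLI; a minimal sketch, assuming the pod and project names from this report:

  # Detailed pod state, with the event log at the bottom
  oc describe pod/tau-web-dev-gfa-18-z5nvk -n tau-dev

  # Recent events for the whole project
  oc get events -n tau-dev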

How reproducible:

Consistently reproducible.

Steps to Reproduce:

1. Log in to OpenShift Online as GitHub user 'pjnagel'.
2. Run "oc deploy tau-web-dev-gfa --retry -n tau-dev", or navigate to tau-web-dev-gfa in console and click 'deploy'.
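
A rough way to watch the failure from the command line (same project and deploymentconfig names as above):

  # Trigger a new deployment and watch the pods it creates
  oc deploy tau-web-dev-gfa --retry -n tau-dev
  oc get pods -n tau-dev -w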

Actual results:

At some point a pod will be visible in the overview section of the web console. It will remain in "Container creating" status for a long time. Clicking on the pod and going to the "Events" tab shows errors as described above.


Expected results:

Expected the pod to at least be created and proceed to pulling and running the image.

Comment 1 Pieter Nagel 2016-08-25 08:11:09 UTC
Note: yesterday, before I started experiencing this bug on this deploymentconfig, I first experienced the bug I reported as bug 1370056.

Comment 2 Michal Fojtik 2016-08-25 08:54:48 UTC
Moving this to the containers team, as this seems to be a Docker issue.

Comment 3 Jhon Honce 2016-08-31 18:07:59 UTC
After researching the issue, it appears to be caused by a lack of allocated resources on the node. A better error message could help reduce the confusion.
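
A sketch of how the allocation on the hosting node could be checked (requires cluster-admin access, so this is for the ops side; the grep window is just a guess at how much of the section to show):

  # Compare each node's capacity against what is already allocated
  oc describe nodes | grep -A 6 "Allocated resources"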

Comment 4 Jhon Honce 2016-08-31 18:22:20 UTC
The issue should be resolved in docker builds that include https://github.com/projectatomic/docker/commit/9d9f154f20a906820698c34ee3fc4b6c452fe5b8

Comment 5 Abhishek Gupta 2016-10-14 19:58:17 UTC
The docker version that we now have in INT/STG/PROD should have this fix. Moving this to QE to test.
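
To confirm which docker build is actually running on a node, something like the following should do (the package query assumes the RPM-based install used on these environments):

  # Installed docker package build and daemon version
  rpm -q docker
  docker version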

Comment 6 zhou ying 2016-10-17 01:35:41 UTC
Can't reproduce this issue on INT; will verify it.
openshift version
openshift v3.3.1.1+cb482ab-dirty
kubernetes v1.3.0+52492b4
etcd 2.3.0+git

Comment 7 zhou ying 2016-10-17 01:45:39 UTC
Can't reproduce this issue on STG either.