Created attachment 1193884 [details]
Output of oc describe pod/tau-web-dev-gfa-18-z5nvk

Description of problem:
As of yesterday, all my deployments have been timing out due to errors well before even getting to pulling the image. Looking at the events on the failed pod, I see many messages like:

"Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: API error (500): Cannot start container 8ed16139f51fb937b8b9ce1747f062142bf1ffe7dd2792031617d92536e8cd0c: [8] System error: read parent: connection reset by peer\n"

More detailed output of "oc describe" for the affected pod is attached.

How reproducible:
Consistently reproducible.

Steps to Reproduce:
1. Log in to OpenShift Online as GitHub user 'pjnagel'.
2. Run "oc deploy tau-web-dev-gfa --retry -n tau-dev", or navigate to tau-web-dev-gfa in the web console and click 'Deploy'.

Actual results:
At some point a pod becomes visible in the overview section of the web console. It remains in "Container creating" status for a long time. Clicking on the pod and opening the "Events" tab shows errors as described above (a CLI sketch for inspecting this follows below).

Expected results:
Expected the pod to at least be created and proceed to pulling and running the image.
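For anyone triaging this, a minimal CLI sketch for reproducing and inspecting the stuck pod from a terminal; the pod name is taken from the attachment as an example and will differ between deployments.

  # Trigger a new deployment (retry the failed one)
  oc deploy tau-web-dev-gfa --retry -n tau-dev

  # Watch the deployer and application pods come up
  oc get pods -n tau-dev -w

  # Inspect the stuck pod and the project's events (pod name is an example)
  oc describe pod/tau-web-dev-gfa-18-z5nvk -n tau-dev
  oc get events -n tau-dev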
Note: yesterday, before I started experiencing this bug on this deploymentconfig, I first experienced the bug I just reported as bug 1370056.
Moving this to the containers team, as this seems to be a Docker issue.
After researching the issue, it appears to be caused by a lack of allocated resources. A better error message could help reduce the confusion.
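In case it helps others hitting this, a rough sketch of how one might check whether a node is short on allocatable resources; the node name is a placeholder and this assumes cluster-admin (or node shell) access.

  # List nodes and find the one running the stuck pod
  oc get nodes

  # Compare capacity vs. allocated requests/limits on that node (node name is an example)
  oc describe node node-1.example.com

  # On the node itself, check memory and Docker storage headroom
  free -m
  df -h /var/lib/docker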
The issue should be resolved in docker builds that include https://github.com/projectatomic/docker/commit/9d9f154f20a906820698c34ee3fc4b6c452fe5b8
The docker version that we now have in INT/STG/PROD should have this fix. Moving this to QE to test.
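A quick sketch of how QE might confirm which docker build is running on the nodes before retesting; the output format will vary and no specific package versions are implied here.

  # Check the installed docker package and the running daemon version on a node
  rpm -q docker
  docker version

  # Confirm the daemon is healthy after the update
  systemctl status docker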
Can't reproduce this issue on INT, will verify it.

openshift version
openshift v3.3.1.1+cb482ab-dirty
kubernetes v1.3.0+52492b4
etcd 2.3.0+git
Can't reproduce this issue on STG either.