Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1583133

Summary:	[Free-int][Free-STG]Failed to connect to pod with 'getsockopt' error
Product:	OpenShift Container Platform	Reporter:	HugoNgai <yuwei>
Component:	Containers	Assignee:	Matt Woodson <mwoodson>
Status:	CLOSED ERRATA	QA Contact:	DeShuai Ma <dma>
Severity:	high	Docs Contact:
Priority:	high
Version:	3.10.0	CC:	amurdaca, aos-bugs, bleanhar, eminguez, jokerman, jupierce, mmccomas, mpatel, mwoodson, yuwan
Target Milestone:	---	Keywords:	OnlineStarter, Reopened
Target Release:	3.10.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2018-07-30 19:16:51 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description HugoNgai 2018-05-28 09:28:55 UTC

Description of problem:
Running oc exec ... against a pod fails with the following error:
Error from server: error dialing backend: dial tcp 172.31.59.112:10010: getsockopt: connection timed out

Version-Release number of selected component (if applicable):
oc v3.10.0-0.53.0
kubernetes v1.10.0+b81c8f8

How reproducible:
80%

Steps to Reproduce:
1. oc new-app
2. oc exec <pod name>

Actual results:
Error from server: error dialing backend: dial tcp 172.31.59.112:10010: getsockopt: connection timed out

Expected results:
Can connect to the Pod successfully.

Additional info:

Comment 1 Brenton Leanhardt 2018-05-30 13:26:50 UTC

Hi Yuhao, I would suspect a temporary problem in the environment.  I was not able to reproduce the issue after 1000 requests.

[bleanhar@granby free-int]$ ./oc version
oc v3.10.0-0.53.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://api.free-int.openshift.com:443
openshift v3.10.0-0.53.0
kubernetes v1.10.0+b81c8f8

Definitely re-open this if you continue to see the problem.
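
A repeated-request check like the one mentioned above could be scripted roughly as follows. This is only a sketch, not the commands actually used in this comment: the function name count_exec_failures and the choice of running "true" inside the pod are assumptions made for illustration. It shells out to "oc exec <pod> -- true" and separates the "error dialing backend" failures from any other non-zero exits.

```python
import subprocess


def count_exec_failures(pod, attempts, run=subprocess.run):
    """Run `oc exec <pod> -- true` `attempts` times.

    Returns (dial_errors, other_failures). `run` is injectable so the
    function can be exercised without a cluster (hypothetical helper).
    """
    dial_errors = 0
    other_failures = 0
    for _ in range(attempts):
        result = run(
            ["oc", "exec", pod, "--", "true"],
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            universal_newlines=True,
        )
        if result.returncode == 0:
            continue
        # The error string is taken from the report in this bug.
        if "error dialing backend" in (result.stderr or ""):
            dial_errors += 1
        else:
            other_failures += 1
    return dial_errors, other_failures
```

Calling count_exec_failures("<pod name>", 1000) while logged in to the cluster would give a rough failure rate to compare against the 80% reported in the description.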

Comment 2 wangyu 2018-06-01 05:07:09 UTC

I just tested this issue again on both the free-int and free-stg clusters, and it still fails to connect to pods with the same error message: "Error from server: error dialing backend: dial tcp 172.31.59.226:10010: getsockopt: connection timed out".

The log is as follows:
[root@yuwan ~]# oc get pod
NAME              READY     STATUS    RESTARTS   AGE
ruby-ex-1-build   1/1       Running   0          11s
[root@yuwan ~]# oc get pod
NAME              READY     STATUS    RESTARTS   AGE
ruby-ex-1-build   1/1       Running   0          16s
[root@yuwan ~]# oc get pod
NAME              READY     STATUS      RESTARTS   AGE
ruby-ex-1-build   0/1       Completed   0          4m
ruby-ex-1-zzcpj   1/1       Running     0          3m
[root@yuwan ~]# oc rsh ruby-ex-1-zzcpj
Error from server: error dialing backend: dial tcp 172.31.59.226:10010: getsockopt: connection timed out

Comment 3 wangyu 2018-06-01 05:18:25 UTC

(In reply to wangyu from comment #2)

The projects I used for testing are "yuwan-test-free0011" on server "https://api.free-int.openshift.com:443" and "yuwan-test-stgfree0011" on server "https://api.free-stg.openshift.com:443".

Comment 5 Antonio Murdaca 2018-06-04 13:34:35 UTC

I can't seem to find a rule in iptables which allows port 10010 (the streaming server on any node).
What would have caused that?
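
One way to look for such a rule is to scan the output of "iptables -S" on the node for entries that mention the port. The sketch below is illustrative only: the function name rules_for_port is hypothetical, and it handles just the plain "--dport N" and comma-separated "--dports" forms, not port ranges.

```python
import shlex


def rules_for_port(iptables_s_output, port):
    """Return [(chain, target), ...] for appended rules matching `port`.

    Expects text in the format printed by `iptables -S`. Port ranges
    such as 10000:10100 are not expanded (simplifying assumption).
    """
    wanted = str(port)
    matches = []
    for line in iptables_s_output.splitlines():
        tokens = shlex.split(line)
        # Only "-A <chain> ..." lines describe rules; skip policies etc.
        if len(tokens) < 2 or tokens[0] != "-A":
            continue
        chain = tokens[1]
        ports = set()
        target = None
        for i, tok in enumerate(tokens[:-1]):
            if tok in ("--dport", "--dports"):
                ports.update(tokens[i + 1].split(","))
            elif tok == "-j":
                target = tokens[i + 1]
        if wanted in ports:
            matches.append((chain, target))
    return matches
```

An empty result for port 10010 would match the observation in this comment. A non-empty result lists matching rules in the order they appear, which helps when checking whether a DROP precedes an ACCEPT in the same chain.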

Comment 7 Seth Jennings 2018-06-04 15:14:34 UTC

Seems like a runtime firewall issue.  I wonder if a previous rule in the chain is DROPping it before it reaches the ACCEPT rule.

https://github.com/openshift/openshift-ansible/pull/5911
https://github.com/openshift/openshift-ansible/blob/master/roles/container_runtime/defaults/main.yml#L46
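
To tell a silently dropped connection apart from a closed port, a plain TCP connect from the master to the node's port 10010 can be informative. The following is a minimal sketch using only the Python standard library; the function name probe_tcp and the 5-second default timeout are assumptions for illustration. As a rule of thumb, a "timeout" result is consistent with packets being dropped somewhere on the path (a firewall DROP rule or a cloud security group), while "refused" usually means the host was reached but nothing accepted the connection.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Attempt one TCP connection and classify the outcome.

    Returns "open", "timeout", "refused", or "error: <details>".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No answer at all within the timeout: typical of dropped packets.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered with a reset: port closed or connection rejected.
        return "refused"
    except OSError as exc:
        return "error: {}".format(exc)
```

For example, probe_tcp("172.31.59.112", 10010) run from a master host would exercise the same path as the failing dial in the description.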

Comment 8 Mrunal Patel 2018-06-04 15:42:47 UTC

Yeah, agree with Seth. We are working with the networking team to figure this out.

Comment 11 Antonio Murdaca 2018-06-04 16:09:11 UTC

Figured out this is a security group issue with the cluster itself, not CRI-O related. (Thanks guys!).

Leaving this under the Containers component, as I'm not sure where it belongs, but mwoodson is already working on it.

Comment 12 Antonio Murdaca 2018-06-05 07:17:52 UTC

BTW, this has been fixed in free-int as Matt created the SG for the streaming server.

Comment 13 Matt Woodson 2018-06-05 14:15:19 UTC

I have added the SG to all starter tier nodes.  I have informed the Ops team of the need for the new SG rule, and I have updated our default SGs to include this rule as well.

I'm marking this as done.

Comment 14 wangyu 2018-06-06 02:17:03 UTC

Verification of this bug has finished on free-int1/free-int2 and free-stg1/free-stg2.

Comment 16 errata-xmlrpc 2018-07-30 19:16:51 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816