Bug 1583133 - [Free-int][Free-STG]Failed to connect to pod with 'getsockopt' error
Summary: [Free-int][Free-STG]Failed to connect to pod with 'getsockopt' error
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Containers
Version: 3.10.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 3.10.0
Assignee: Matt Woodson
QA Contact: DeShuai Ma
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2018-05-28 09:28 UTC by HugoNgai
Modified: 2018-07-30 19:17 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-30 19:16:51 UTC
Target Upstream Version:


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2018:1816 None None None 2018-07-30 19:17:40 UTC

Internal Links: 1583640

Description HugoNgai 2018-05-28 09:28:55 UTC
Description of problem:
Running oc exec ... against a pod fails with the following error:
Error from server: error dialing backend: dial tcp 172.31.59.112:10010: getsockopt: connection timed out

Version-Release number of selected component (if applicable):
oc v3.10.0-0.53.0
kubernetes v1.10.0+b81c8f8

How reproducible:
80%

Steps to Reproduce:
1. oc new-app
2. oc exec <pod name>

Actual results:
Error from server: error dialing backend: dial tcp 172.31.59.112:10010: getsockopt: connection timed out

Expected results:
Can connect to the Pod successfully.

Additional info:

Comment 1 Brenton Leanhardt 2018-05-30 13:26:50 UTC
Hi Yuhao, I would suspect a temporary problem in the environment.  I was not able to reproduce the issue after 1000 requests.

[bleanhar@granby free-int]$ ./oc version
oc v3.10.0-0.53.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://api.free-int.openshift.com:443
openshift v3.10.0-0.53.0
kubernetes v1.10.0+b81c8f8

Definitely re-open this if you continue to see the problem.
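
The repeated-request check described above can be approximated with a small script. This is only a sketch: the failure_rate helper and its parameters are hypothetical names, not part of oc, and it assumes that any non-zero exit status or timeout counts as a failure.

```python
import subprocess
from typing import Sequence


def failure_rate(cmd: Sequence[str], attempts: int, timeout: float = 60.0) -> float:
    """Run cmd `attempts` times and return the fraction of runs that failed.

    Hypothetical helper: a run counts as failed if it exits non-zero
    or does not finish within `timeout` seconds.
    """
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    failures = 0
    for _ in range(attempts):
        try:
            result = subprocess.run(
                cmd,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
                timeout=timeout,
            )
            if result.returncode != 0:
                failures += 1
        except subprocess.TimeoutExpired:
            failures += 1
    return failures / attempts
```

For example, failure_rate(["oc", "exec", "<pod name>", "--", "true"], 1000) would report the fraction of failed attempts, assuming oc is on the PATH and already logged in to the cluster.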

Comment 2 wangyu 2018-06-01 05:07:09 UTC
I just tested this issue again on both the free-int and free-stg clusters, and it still fails to connect to pods with the same error message: "Error from server: error dialing backend: dial tcp 172.31.59.226:10010: getsockopt: connection timed out".

The log is as follows:
[root@yuwan ~]# oc get pod
NAME              READY     STATUS    RESTARTS   AGE
ruby-ex-1-build   1/1       Running   0          11s
[root@yuwan ~]# oc get pod
NAME              READY     STATUS    RESTARTS   AGE
ruby-ex-1-build   1/1       Running   0          16s
[root@yuwan ~]# oc get pod
NAME              READY     STATUS      RESTARTS   AGE
ruby-ex-1-build   0/1       Completed   0          4m
ruby-ex-1-zzcpj   1/1       Running     0          3m
[root@yuwan ~]# oc rsh ruby-ex-1-zzcpj
Error from server: error dialing backend: dial tcp 172.31.59.226:10010: getsockopt: connection timed out

Comment 3 wangyu 2018-06-01 05:18:25 UTC
(In reply to wangyu from comment #2)

The projects I used for testing are "yuwan-test-free0011" on server "https://api.free-int.openshift.com:443" and "yuwan-test-stgfree0011" on server "https://api.free-stg.openshift.com:443".

Comment 5 Antonio Murdaca 2018-06-04 13:34:35 UTC
I can't seem to find a rule in iptables which allows port 10010 (the streaming server on any node).
What would have caused that?
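
One way to check whether port 10010 on a node is reachable from another host is a plain TCP probe. The sketch below is a hypothetical helper (probe_tcp is not an existing tool), and the host and port are whatever the caller passes in. A "timeout" result suggests packets are being silently dropped along the way, for example by a firewall or security group, whereas "refused" would suggest that nothing is listening on the port.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 5.0) -> str:
    """Classify one TCP connection attempt.

    Returns "open", "refused", "timeout" or "error".
    Hypothetical helper for illustration only.
    """
    try:
        # create_connection performs the TCP handshake and honours the timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No answer at all within the timeout: typical of dropped packets.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered with a reset: reachable, but no listener.
        return "refused"
    except OSError:
        # Anything else (unreachable network, name resolution failure, ...).
        return "error"
```

From a host that is expected to reach the node, a call such as probe_tcp("172.31.59.112", 10010) would exercise the same address and port that appear in the error message.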

Comment 7 Seth Jennings 2018-06-04 15:14:34 UTC
Seems like a runtime firewall issue.  I wonder if a previous rule in the chain is DROPping it before it reaches the ACCEPT rule.

https://github.com/openshift/openshift-ansible/pull/5911
https://github.com/openshift/openshift-ansible/blob/master/roles/container_runtime/defaults/main.yml#L46
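
The rule-ordering concern can be illustrated with a simplified first-match model of a chain. This is not an iptables parser: the Rule type, the first_verdict function and the example chain are all hypothetical, and only protocol and destination port are modelled. It shows how an earlier catch-all DROP would win over a later ACCEPT for port 10010.

```python
from typing import NamedTuple, Optional, Sequence


class Rule(NamedTuple):
    proto: Optional[str]  # None matches any protocol
    dport: Optional[int]  # None matches any destination port
    target: str           # terminating target: "ACCEPT", "DROP" or "REJECT"


def first_verdict(chain: Sequence[Rule], proto: str, dport: int) -> Optional[Rule]:
    """Return the first rule in the chain that matches, or None if none does."""
    for rule in chain:
        if rule.proto is not None and rule.proto != proto:
            continue
        if rule.dport is not None and rule.dport != dport:
            continue
        return rule  # a terminating target ends traversal of the chain
    return None


# Hypothetical chain: the catch-all DROP sits before the ACCEPT for 10010.
EXAMPLE_CHAIN = [
    Rule("tcp", 22, "ACCEPT"),
    Rule(None, None, "DROP"),
    Rule("tcp", 10010, "ACCEPT"),
]
```

With this example chain, first_verdict(EXAMPLE_CHAIN, "tcp", 10010) returns the catch-all DROP rule, so the later ACCEPT is never reached.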

Comment 8 Mrunal Patel 2018-06-04 15:42:47 UTC
Yeah, agree with Seth. We are working with networking team to figure this out.

Comment 11 Antonio Murdaca 2018-06-04 16:09:11 UTC
Figured out this is a security group issue with the cluster itself, not CRI-O related. (Thanks, guys!)

Leaving this under the Containers component, as I'm not sure where it belongs, but mwoodson is already working on it.

Comment 12 Antonio Murdaca 2018-06-05 07:17:52 UTC
BTW, this has been fixed in free-int as Matt created the SG for the streaming server.

Comment 13 Matt Woodson 2018-06-05 14:15:19 UTC
I have added the SG to all starter tier nodes.  I have informed the Ops team of the need for the new SG rule, and I have updated our default SGs to include this rule as well.

I'm marking this as done.
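
A check for such a rule could look roughly like the sketch below. It rests on assumptions the report does not confirm: that the SG is an AWS EC2 security group, and that its rules are available as a list of dicts shaped like the IpPermissions entries returned by EC2 DescribeSecurityGroups. The function name sg_allows_tcp_port is hypothetical, and the sketch ignores the traffic source (CIDR ranges and group pairs), so it only answers whether any inbound rule covers the port.

```python
from typing import Any, Iterable, Mapping


def sg_allows_tcp_port(ip_permissions: Iterable[Mapping[str, Any]], port: int) -> bool:
    """Return True if any inbound permission covers the given TCP port.

    Assumes entries shaped like EC2 IpPermissions:
    {"IpProtocol": "tcp", "FromPort": 10010, "ToPort": 10010, ...}
    Source restrictions are deliberately not evaluated.
    """
    for perm in ip_permissions:
        proto = perm.get("IpProtocol")
        if proto == "-1":
            # "-1" means all protocols and all ports.
            return True
        if proto != "tcp":
            continue
        low, high = perm.get("FromPort"), perm.get("ToPort")
        if low is not None and high is not None and low <= port <= high:
            return True
    return False
```

For the issue in this bug, the port of interest would be 10010, the streaming server port mentioned in the earlier comments.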

Comment 14 wangyu 2018-06-06 02:17:03 UTC
Verification of this bug has finished on free-int1/free-int2 and free-stg1/free-stg2.

Comment 16 errata-xmlrpc 2018-07-30 19:16:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816

