Description of problem:
Running oc exec ... fails with the following error:
Error from server: error dialing backend: dial tcp 172.31.59.112:10010: getsockopt: connection timed out

Version-Release number of selected component (if applicable):
oc v3.10.0-0.53.0
kubernetes v1.10.0+b81c8f8

How reproducible:
80%

Steps to Reproduce:
1. oc new-app
2. oc exec <pod name>

Actual results:
Error from server: error dialing backend: dial tcp 172.31.59.112:10010: getsockopt: connection timed out

Expected results:
Can connect to the Pod successfully.
Hi Yuhao,

I would suspect a temporary problem in the environment. I was not able to reproduce the issue after 1000 requests.

[bleanhar@granby free-int]$ ./oc version
oc v3.10.0-0.53.0
kubernetes v1.10.0+b81c8f8
features: Basic-Auth GSSAPI Kerberos SPNEGO

Server https://api.free-int.openshift.com:443
openshift v3.10.0-0.53.0
kubernetes v1.10.0+b81c8f8

Definitely re-open this if you continue to see the problem.
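For anyone wanting to compare the "80%" and "1000 requests" figures on their own cluster, a failure rate can be measured with a small loop. This is only a sketch: count_failures is a hypothetical helper name, not something from oc, and the command passed to it is whatever you want to repeat.

```shell
# count_failures RUNS CMD [ARGS...]
# Hypothetical helper: runs CMD the given number of times and prints
# "failed/total". The body runs in a subshell so no variables leak out.
count_failures() (
  runs=$1
  shift
  failed=0
  i=0
  while [ "$i" -lt "$runs" ]; do
    # Only the exit status matters; output and stdin are discarded.
    "$@" >/dev/null 2>&1 </dev/null || failed=$((failed + 1))
    i=$((i + 1))
  done
  echo "${failed}/${runs}"
)

# Possible usage (the pod name is a placeholder, as in the report):
# count_failures 1000 oc exec <pod name> -- true
```

Because pods can be scheduled on different nodes, a run against a single pod may not be representative of the whole cluster.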
I just tested this issue again on both the free-int and free-stg clusters. Connecting to pods still fails with the same error message: "Error from server: error dialing backend: dial tcp 172.31.59.226:10010: getsockopt: connection timed out".

The log is below:

[root@yuwan ~]# oc get pod
NAME              READY     STATUS    RESTARTS   AGE
ruby-ex-1-build   1/1       Running   0          11s
[root@yuwan ~]# oc get pod
NAME              READY     STATUS    RESTARTS   AGE
ruby-ex-1-build   1/1       Running   0          16s
[root@yuwan ~]# oc get pod
NAME              READY     STATUS      RESTARTS   AGE
ruby-ex-1-build   0/1       Completed   0          4m
ruby-ex-1-zzcpj   1/1       Running     0          3m
[root@yuwan ~]# oc rsh ruby-ex-1-zzcpj
Error from server: error dialing backend: dial tcp 172.31.59.226:10010: getsockopt: connection timed out
(In reply to wangyu from comment #2)
> I just test this issue on both free-int and free-sag clusters again, it
> still failed to connect to pods with the same error message "Error from
> server: error dialing backend: dial tcp 172.31.59.226:10010: getsockopt:
> connection timed out".
>
> the log is as bellow:
> [root@yuwan ~]# oc get pod
> NAME READY STATUS RESTARTS AGE
> ruby-ex-1-build 1/1 Running 0 11s
> [root@yuwan ~]# oc get pod
> NAME READY STATUS RESTARTS AGE
> ruby-ex-1-build 1/1 Running 0 16s
> [root@yuwan ~]# oc get pod
> NAME READY STATUS RESTARTS AGE
> ruby-ex-1-build 0/1 Completed 0 4m
> ruby-ex-1-zzcpj 1/1 Running 0 3m
> [root@yuwan ~]# oc rsh ruby-ex-1-zzcpj
> Error from server: error dialing backend: dial tcp 172.31.59.226:10010:
> getsockopt: connection timed out

The projects I used to test are "yuwan-test-free0011" on server "https://api.free-int.openshift.com:443" and "yuwan-test-stgfree0011" on server "https://api.free-stg.openshift.com:443".
I can't seem to find a rule in iptables which allows port 10010 (the streaming server on any node). What would have caused that?
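One way to repeat that check on a node is to filter the output of iptables -S for an ACCEPT rule on the port. The sketch below assumes the rule would be written with a plain --dport match; accept_rule_for_port is a hypothetical helper name, and rules that use multiport (--dports) or port ranges would not be detected.

```shell
# accept_rule_for_port PORT
# Hypothetical helper: reads `iptables -S` style rules on stdin and
# succeeds if at least one rule matches "--dport PORT" and jumps to ACCEPT.
accept_rule_for_port() {
  grep -Eq -e "--dport $1 .*-j ACCEPT"
}

# Possible usage on a node (needs root to read the rules):
# iptables -S | accept_rule_for_port 10010 && echo present || echo missing
```

A "present" result only shows that such a rule exists somewhere in the ruleset; it does not show that traffic actually reaches it.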
Seems like a runtime firewall issue. I wonder if a previous rule in the chain is DROPping it before it reaches the ACCEPT rule.

https://github.com/openshift/openshift-ansible/pull/5911
https://github.com/openshift/openshift-ansible/blob/master/roles/container_runtime/defaults/main.yml#L46
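Since iptables evaluates the rules of a chain in order, the ordering hypothesis can be eyeballed by listing only the rules that mention the port or that drop or reject traffic, together with their position. This is a rough sketch: rules_of_interest is a hypothetical helper name, it expects the output of iptables -S for a single chain, and it does not follow jumps into other chains.

```shell
# rules_of_interest PORT
# Hypothetical helper: reads `iptables -S CHAIN` output on stdin and prints,
# prefixed with its position among the -A rules, every rule that either
# matches "--dport PORT" or has a DROP or REJECT target.
rules_of_interest() {
  awk -v port="$1" '
    $1 == "-A" {
      n++
      if ($0 ~ ("--dport " port "( |$)") || $0 ~ /-j (DROP|REJECT)/)
        print n ": " $0
    }'
}

# Possible usage on a node (the chain name is an example):
# iptables -S INPUT | rules_of_interest 10010
```

If a DROP line is printed with a lower position than the ACCEPT line for the port, and that DROP rule also matches the traffic, the ACCEPT rule is never reached.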
Yeah, I agree with Seth. We are working with the networking team to figure this out.
Figured out that this is a security group issue with the cluster itself, not CRI-O related. (Thanks guys!) Leaving the component as Containers since I'm not sure where it belongs, but mwoodson is already working on it.
BTW, this has been fixed in free-int as Matt created the SG for the streaming server.
I have added the SG to all starter tier nodes. I have informed the Ops team of the need for the new SG rule, and I have updated our default SGs to include this rule as well. I'm marking this as done.
Verification of this bug has finished on free-int1/free-int2 and free-stg1/free-stg2.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2018:1816