Bug 1583133
| Summary: | [Free-int][Free-STG]Failed to connect to pod with 'getsockopt' error | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | HugoNgai <yuwei> |
| Component: | Containers | Assignee: | Matt Woodson <mwoodson> |
| Status: | CLOSED ERRATA | QA Contact: | DeShuai Ma <dma> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 3.10.0 | CC: | amurdaca, aos-bugs, bleanhar, eminguez, jokerman, jupierce, mmccomas, mpatel, mwoodson, yuwan |
| Target Milestone: | --- | Keywords: | OnlineStarter, Reopened |
| Target Release: | 3.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-07-30 19:16:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Description
HugoNgai
2018-05-28 09:28:55 UTC
Hi Yuhao,

I would suspect a temporary problem in the environment. I was not able to reproduce the issue after 1000 requests.

    [bleanhar@granby free-int]$ ./oc version
    oc v3.10.0-0.53.0
    kubernetes v1.10.0+b81c8f8
    features: Basic-Auth GSSAPI Kerberos SPNEGO

    Server https://api.free-int.openshift.com:443
    openshift v3.10.0-0.53.0
    kubernetes v1.10.0+b81c8f8

Definitely re-open this if you continue to see the problem.

I just tested this issue on both the free-int and free-stg clusters again, and it still failed to connect to pods with the same error message: "Error from server: error dialing backend: dial tcp 172.31.59.226:10010: getsockopt: connection timed out".

The log is as below:

    [root@yuwan ~]# oc get pod
    NAME              READY     STATUS    RESTARTS   AGE
    ruby-ex-1-build   1/1       Running   0          11s
    [root@yuwan ~]# oc get pod
    NAME              READY     STATUS    RESTARTS   AGE
    ruby-ex-1-build   1/1       Running   0          16s
    [root@yuwan ~]# oc get pod
    NAME              READY     STATUS      RESTARTS   AGE
    ruby-ex-1-build   0/1       Completed   0          4m
    ruby-ex-1-zzcpj   1/1       Running     0          3m
    [root@yuwan ~]# oc rsh ruby-ex-1-zzcpj
    Error from server: error dialing backend: dial tcp 172.31.59.226:10010: getsockopt: connection timed out

(In reply to wangyu from comment #2)
> I just tested this issue on both the free-int and free-stg clusters again,
> and it still failed to connect to pods with the same error message: "Error
> from server: error dialing backend: dial tcp 172.31.59.226:10010:
> getsockopt: connection timed out".
>
> The log is as below:
> [root@yuwan ~]# oc get pod
> NAME              READY     STATUS    RESTARTS   AGE
> ruby-ex-1-build   1/1       Running   0          11s
> [root@yuwan ~]# oc get pod
> NAME              READY     STATUS    RESTARTS   AGE
> ruby-ex-1-build   1/1       Running   0          16s
> [root@yuwan ~]# oc get pod
> NAME              READY     STATUS      RESTARTS   AGE
> ruby-ex-1-build   0/1       Completed   0          4m
> ruby-ex-1-zzcpj   1/1       Running     0          3m
> [root@yuwan ~]# oc rsh ruby-ex-1-zzcpj
> Error from server: error dialing backend: dial tcp 172.31.59.226:10010:
> getsockopt: connection timed out

The projects I used to test are "yuwan-test-free0011" on server "https://api.free-int.openshift.com:443" and "yuwan-test-stgfree0011" on server "https://api.free-stg.openshift.com:443".

I can't seem to find a rule in iptables which allows port 10010 (the streaming server on any node). What would have caused that?

Seems like a runtime firewall issue. I wonder if a previous rule in the chain is DROPping it before it reaches the ACCEPT rule.

https://github.com/openshift/openshift-ansible/pull/5911
https://github.com/openshift/openshift-ansible/blob/master/roles/container_runtime/defaults/main.yml#L46

Yeah, agree with Seth. We are working with the networking team to figure this out.

Figured out this is a security group issue with the cluster itself, not CRI-O related. (Thanks guys!) Leaving this in the Containers component, as I'm not sure where it belongs, but mwoodson is already working on it. BTW, this has been fixed in free-int, as Matt created the SG for the streaming server.

I have added the SG to all starter tier nodes. I have informed the Ops team of the need for the new SG rule, and I have updated our default SGs to include this rule as well. I'm marking this as done.

Verification of this bug has finished on free-int1/free-int2 and free-stg1/free-stg2.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1816
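
For anyone hitting a similar "dial tcp ...:10010: getsockopt: connection timed out" error, a quick reachability probe against the node's streaming port can help narrow things down. The sketch below is not from this bug's thread; the function name `probe_tcp` and the result labels are made up for illustration. It assumes it is run from a host that should be able to reach the node (for example, a master). As a rough rule of thumb, a "timeout" result is consistent with packets being silently dropped (a firewall or security group rule, as turned out to be the case here), while "refused" usually means the host is reachable but nothing is listening on the port.

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Attempt one TCP connection and classify the outcome.

    Returns "open", "refused", "timeout", or "error: <details>".
    The labels are hypothetical names chosen for this sketch.
    """
    try:
        # create_connection performs the TCP handshake, bounded by `timeout`.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No answer at all: often a DROP rule or a missing security group rule.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered with a reset: the port is reachable but closed.
        return "refused"
    except OSError as exc:
        # Anything else (no route to host, DNS failure, and so on).
        return f"error: {exc}"


# Example call, using the node address and port from the error message above:
#   print(probe_tcp("172.31.59.226", 10010))
```

A single probe only reflects the path from the machine it runs on, so it is worth running from the same host that reported the dial error.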