Bug 1814706
| Summary: | Connection timed out after long time running when accessing Windows pod from Linux pod | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | gaoshang <sgao> |
| Component: | Windows Containers | Assignee: | Aravindh Puthiyaparambil <aravindh> |
| Status: | CLOSED ERRATA | QA Contact: | gaoshang <sgao> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.4 | CC: | aos-bugs, gmarkley, rgudimet |
| Target Milestone: | --- | | |
| Target Release: | 4.5.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | No Doc Update |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-07-13 17:22:38 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
This bug is fixed in 4.5.0-0.nightly-2020-05-05-205255; moving status to VERIFIED. Thanks.

Version-Release number of selected component (if applicable):

    # oc get clusterversion
    NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.5.0-0.nightly-2020-05-05-205255   True        False         8h      Cluster version is 4.5.0-0.nightly-2020-05-05-205255

windows-machine-config-bootstrapper git commit 3f4e97c9a50e07208facfcc3670caf729424a25c

Steps:

1. Bring up the OCP cluster 4.5.0-0.nightly-2020-05-05-205255 with ovn-kubernetes.
2. Bring up a Windows node.
3. Configure the inventory file and run wsu.
4. Create the win-webserver and linux-webserver pods, wait for several hours, and check that the east-west network is still available.

    # oc create -f https://raw.githubusercontent.com/sgaoshang/winc-test/master/data/WinWebServer.yaml
    # oc create -f https://raw.githubusercontent.com/sgaoshang/winc-test/master/data/LinuxWebServer.yaml

    # oc get pod -owide
    NAME                               READY   STATUS    RESTARTS   AGE     IP            NODE                                       NOMINATED NODE   READINESS GATES
    linux-webserver-65b89c7f5c-4x2q9   1/1     Running   0          7h34m   10.128.2.18   ip-10-0-131-9.us-east-2.compute.internal   <none>           <none>
    win-webserver-76659cfd79-5g854     1/1     Running   0          7h34m   10.132.0.3    ip-10-0-37-34.us-east-2.compute.internal   <none>           <none>

    # oc exec linux-webserver-65b89c7f5c-4x2q9 curl 10.132.0.3
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100   126  100   126    0     0    540      0 --:--:-- --:--:-- --:--:--   540
    <html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.3 callerCount 47 <p>IP 10.132.0.3 callerCount 3 </body></html>

    # oc exec win-webserver-76659cfd79-5g854 curl 10.128.2.18
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
      0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
    Linux Container Web Server
    100    27  100    27    0     0     27      0  0:00:01 --:--:--  0:00:01   870

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409
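The verification above checks both directions by hand (Linux pod to Windows pod, then Windows pod to Linux pod). A minimal sketch of how that check might be scripted follows. The function names `probe` and `check_both_ways` and the 10-second `--max-time` value are assumptions made for illustration; they are not part of the fix or of any test suite mentioned in this bug.

```shell
#!/bin/sh
# Sketch only: pod names and IPs are supplied by the caller.

# probe SRC_POD DST_IP
# Run curl against DST_IP from inside SRC_POD, giving up after 10 seconds
# (an assumed value). The response body is discarded locally; only the exit
# status matters.
probe() {
    oc exec "$1" -- curl --silent --show-error --max-time 10 "http://$2" >/dev/null
}

# check_both_ways LINUX_POD LINUX_IP WIN_POD WIN_IP
# Probe each direction once and print one PASS/FAIL line per direction.
# Returns non-zero if either direction fails.
check_both_ways() {
    rc=0
    if probe "$1" "$4"; then
        echo "PASS linux->windows"
    else
        echo "FAIL linux->windows"
        rc=1
    fi
    if probe "$3" "$2"; then
        echo "PASS windows->linux"
    else
        echo "FAIL windows->linux"
        rc=1
    fi
    return "$rc"
}
```

With the pods from the verification run, the call would be `check_both_ways linux-webserver-65b89c7f5c-4x2q9 10.128.2.18 win-webserver-76659cfd79-5g854 10.132.0.3`.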
Description of problem:

When a Linux pod accesses a Windows pod in an AWS cluster, the connection times out after the pods have been running for a long time. See Steps to Reproduce.

Version-Release number of selected component (if applicable):

    # oc get clusterversion
    NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.4.0-0.nightly-2020-03-17-135743   True        False         20h     Cluster version is 4.4.0-0.nightly-2020-03-17-135743

windows-machine-config-bootstrapper commit 69b264d8437746f07c1234daeba8f20dc40710bd

How reproducible: Always

Steps to Reproduce:

1. Create the win-webserver and linux-webserver pods. At first, east-west network testing passes.

    # oc get pod -o wide
    NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
    my-nginx-75897978cd-rd4m2        1/1     Running   0          81m   10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
    win-webserver-79b64df8b9-chw7f   1/1     Running   0          82m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>

    # oc exec my-nginx-75897978cd-rd4m2 curl 10.132.0.2
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100   125  100   125    0     0    543      0 --:--:-- --:--:-- --:--:--   543
    <html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.2 callerCount 3 <p>IP 10.132.0.2 callerCount 5 </body></html>

2. After more than 3 hours (see the pod AGE column), the same east-west request fails.

    # oc get pod -o wide
    NAME                             READY   STATUS    RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
    my-nginx-75897978cd-rd4m2        1/1     Running   0          3h25m   10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
    win-webserver-79b64df8b9-chw7f   1/1     Running   0          3h26m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>

    # oc exec my-nginx-75897978cd-rd4m2 curl 10.132.0.2
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
      0     0    0     0    0     0      0      0 --:--:--  0:02:08 --:--:--     0
    curl: (7) Failed to connect to 10.132.0.2 port 80: Connection timed out
    command terminated with exit code 7

3. Create another linux-webserver pod by editing the deployment. Requests from the new pod to win-webserver still work.

    # oc get pod -o wide
    NAME                             READY   STATUS    RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
    my-nginx-75897978cd-rd4m2        1/1     Running   0          4h     10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
    my-nginx-75897978cd-s2ks8        1/1     Running   0          12m    10.128.2.11   ip-10-0-159-2.us-east-2.compute.internal     <none>           <none>
    win-webserver-79b64df8b9-chw7f   1/1     Running   0          4h1m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>

    # oc exec my-nginx-75897978cd-s2ks8 curl 10.132.0.2
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    100   158  100   158    0     0    731      0 --:--:-- --:--:-- --:--:--   731
    <html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.2 callerCount 2 <p>IP 10.132.0.2 callerCount 30 <p>IP 10.132.0.2 callerCount 22 </body></html>

Something may have changed in the pod network during those 3 hours that broke the path from the Linux pod to the Windows pod. The same failure occurs when the win-webserver pod accesses the linux-webserver pod.

Actual results: Connection timed out

Expected results: East-west networking between the Windows pod and the Linux pod should always work.

Additional info:
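The reproduction steps only establish that the failure appears somewhere between 82 minutes and roughly 3.5 hours of pod age. To narrow that window, a probe loop along the following lines could record when the first request fails. This is a sketch under stated assumptions: the function name `probe_until_failure`, the 60-second default interval, and the 10-second `--max-time` (chosen so a failed probe returns in seconds rather than after the 0:02:08 seen in step 2) are illustrative and do not come from this report.

```shell
#!/bin/sh
# probe_until_failure SRC_POD DST_IP [INTERVAL_SECONDS] [MAX_PROBES]
# Repeatedly curl DST_IP from inside SRC_POD and report the first failed probe
# with a UTC timestamp. MAX_PROBES=0 (the default) loops until a failure.
probe_until_failure() {
    src_pod=$1
    dst_ip=$2
    interval=${3:-60}    # assumed default: one probe per minute
    max_probes=${4:-0}
    count=0
    while :; do
        count=$((count + 1))
        # Discard the response body locally; only the exit status matters.
        if ! oc exec "$src_pod" -- curl --silent --show-error --max-time 10 \
                "http://$dst_ip" >/dev/null; then
            echo "probe $count failed at $(date -u +%Y-%m-%dT%H:%M:%SZ)"
            return 1
        fi
        if [ "$max_probes" -gt 0 ] && [ "$count" -ge "$max_probes" ]; then
            echo "all $count probes succeeded"
            return 0
        fi
        sleep "$interval"
    done
}
```

For the pods in this report, `probe_until_failure my-nginx-75897978cd-rd4m2 10.132.0.2` would run until the Linux-to-Windows path stops responding; comparing the printed timestamp with the pod start time gives an approximate time to failure.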