Description of problem:
Connections from a Linux pod to a Windows pod in an AWS cluster time out after the pods have been running for several hours. See Steps to Reproduce.

Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-03-17-135743   True        False         20h     Cluster version is 4.4.0-0.nightly-2020-03-17-135743

windows-machine-config-bootstrapper commit 69b264d8437746f07c1234daeba8f20dc40710bd

How reproducible:
Always

Steps to Reproduce:
1. Create the win-webserver and linux-webserver pods. At first, east-west network testing passes.
# oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running   0          81m   10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running   0          82m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>
# oc exec my-nginx-75897978cd-rd4m2 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   125  100   125    0     0    543      0 --:--:-- --:--:-- --:--:--   543
<html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.2 callerCount 3 <p>IP 10.132.0.2 callerCount 5 </body></html>

2. After more than 3 hours (see the pod AGE column), the same east-west request fails.
# oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running   0          3h25m   10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running   0          3h26m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>
# oc exec my-nginx-75897978cd-rd4m2 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:08 --:--:--     0
curl: (7) Failed to connect to 10.132.0.2 port 80: Connection timed out
command terminated with exit code 7

3. Create another linux-webserver pod by editing the deployment. The new pod can still reach win-webserver.
# oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running   0          4h     10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
my-nginx-75897978cd-s2ks8        1/1     Running   0          12m    10.128.2.11   ip-10-0-159-2.us-east-2.compute.internal     <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running   0          4h1m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>
# oc exec my-nginx-75897978cd-s2ks8 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   158  100   158    0     0    731      0 --:--:-- --:--:-- --:--:--   731
<html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.2 callerCount 2 <p>IP 10.132.0.2 callerCount 30 <p>IP 10.132.0.2 callerCount 22 </body></html>

Something may have changed in the pod network during these 3 hours that broke the path between the existing Linux pod and the Windows pod. The same failure occurs when the win-webserver pod accesses the linux-webserver pod.

Actual results:
Connection timed out

Expected results:
East-west networking between Windows pods and Linux pods should keep working for as long as the pods are running.

Additional info:
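The report only establishes that connectivity was lost at some point between 81 minutes and 3h25m of pod age. To narrow that window, the manual curl check could be repeated on a timer. The following is a minimal sketch, not part of the original report: `probe_until_failure` is a hypothetical helper name, the 10-second `--max-time` bound is an assumed value, and the source pod is assumed to ship `curl`, as the nginx pod above does.

```shell
# Sketch: poll pod-to-pod connectivity and report when it first fails.
# probe_until_failure is a hypothetical helper, not an oc subcommand.
#
# Usage: probe_until_failure <source-pod> <target-ip> <interval-seconds> <max-attempts>
# Returns 0 if every attempt succeeded, 1 on the first failed attempt.
probe_until_failure() {
  local src_pod="$1" target_ip="$2" interval="$3" max_attempts="$4"
  local attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    # --max-time bounds each request so a dead path fails quickly instead of
    # waiting for the full TCP connect timeout (about 2 minutes in the report).
    if ! oc exec "$src_pod" -- curl -s -o /dev/null --max-time 10 "$target_ip"; then
      echo "FAIL attempt=$attempt time=$(date -u +%FT%TZ) $src_pod -> $target_ip"
      return 1
    fi
    echo "OK attempt=$attempt time=$(date -u +%FT%TZ)"
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  return 0
}
```

For example, `probe_until_failure my-nginx-75897978cd-rd4m2 10.132.0.2 300 60` would probe every 5 minutes for up to 5 hours and stop at the first timeout, printing a UTC timestamp that can be compared against node and network logs.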
This bug is fixed in 4.5.0-0.nightly-2020-05-05-205255; moving status to VERIFIED. Thanks.

Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-05-205255   True        False         8h      Cluster version is 4.5.0-0.nightly-2020-05-05-205255

windows-machine-config-bootstrapper git commit 3f4e97c9a50e07208facfcc3670caf729424a25c

Steps:
1. Bring up the OCP cluster 4.5.0-0.nightly-2020-05-05-205255 with ovn-kubernetes.
2. Bring up the Windows node.
3. Configure the inventory file and run wsu.
4. Create the win-webserver and linux-webserver pods, wait several hours, and check that east-west networking still works in both directions.
# oc create -f https://raw.githubusercontent.com/sgaoshang/winc-test/master/data/WinWebServer.yaml
# oc create -f https://raw.githubusercontent.com/sgaoshang/winc-test/master/data/LinuxWebServer.yaml
# oc get pod -owide
NAME                               READY   STATUS    RESTARTS   AGE     IP            NODE                                       NOMINATED NODE   READINESS GATES
linux-webserver-65b89c7f5c-4x2q9   1/1     Running   0          7h34m   10.128.2.18   ip-10-0-131-9.us-east-2.compute.internal   <none>           <none>
win-webserver-76659cfd79-5g854     1/1     Running   0          7h34m   10.132.0.3    ip-10-0-37-34.us-east-2.compute.internal   <none>           <none>
# oc exec linux-webserver-65b89c7f5c-4x2q9 curl 10.132.0.3
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   126  100   126    0     0    540      0 --:--:-- --:--:-- --:--:--   540
<html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.3 callerCount 47 <p>IP 10.132.0.3 callerCount 3 </body></html>
# oc exec win-webserver-76659cfd79-5g854 curl 10.128.2.18
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
Linux Container Web Server
100    27  100    27    0     0     27      0  0:00:01 --:--:--  0:00:01   870
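The verification above checks both directions by hand. A minimal sketch of the same two-way check as a single function follows; `check_east_west` is a hypothetical helper name, and the 10-second timeout is an assumed value. Response bodies are discarded on the local side so the same curl arguments work in both the Linux and the Windows container.

```shell
# Sketch: check pod-to-pod HTTP reachability in both directions.
# check_east_west is a hypothetical helper, not an oc subcommand.
#
# Usage: check_east_west <linux-pod> <linux-ip> <windows-pod> <windows-ip>
# Returns 0 only if both directions succeed.
check_east_west() {
  local linux_pod="$1" linux_ip="$2" win_pod="$3" win_ip="$4"
  local rc=0

  # Linux pod -> Windows pod. oc exec exits non-zero when curl does.
  if oc exec "$linux_pod" -- curl -s --max-time 10 "$win_ip" > /dev/null; then
    echo "linux->windows: ok"
  else
    echo "linux->windows: FAILED"
    rc=1
  fi

  # Windows pod -> Linux pod.
  if oc exec "$win_pod" -- curl -s --max-time 10 "$linux_ip" > /dev/null; then
    echo "windows->linux: ok"
  else
    echo "windows->linux: FAILED"
    rc=1
  fi

  return "$rc"
}
```

With the pods from this comment, the call would be `check_east_west linux-webserver-65b89c7f5c-4x2q9 10.128.2.18 win-webserver-76659cfd79-5g854 10.132.0.3`.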
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:2409