Bug 1814706

Summary: Connection timed out after long time running when accessing Windows pod from Linux pod
Product: OpenShift Container Platform
Component: Windows Containers
Version: 4.4
Target Release: 4.5.0
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Hardware: Unspecified
OS: Unspecified
Reporter: gaoshang <sgao>
Assignee: Aravindh Puthiyaparambil <aravindh>
QA Contact: gaoshang <sgao>
CC: aos-bugs, gmarkley, rgudimet
Doc Type: No Doc Update
Type: Bug
Last Closed: 2020-07-13 17:22:38 UTC

Description gaoshang 2020-03-18 14:38:56 UTC
Description of problem:
Accessing a Windows pod from a Linux pod in an AWS cluster times out with "Connection timed out" once the pods have been running for a long time (more than 3 hours). See Steps to Reproduce.

Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-03-17-135743   True        False         20h     Cluster version is 4.4.0-0.nightly-2020-03-17-135743

windows-machine-config-bootstrapper commit
69b264d8437746f07c1234daeba8f20dc40710bd

How reproducible:
Always

Steps to Reproduce:
1, Create the win-webserver and linux-webserver pods. At first, the east-west network test passes.
# oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running   0          81m   10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running   0          82m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>
# oc exec my-nginx-75897978cd-rd4m2 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   125  100   125    0     0    543      0 --:--:-- --:--:-- --:--:--   543
<html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.2 callerCount 3 <p>IP 10.132.0.2 callerCount 5 </body></html>

2, After more than 3 hours (see the pod AGE column), the same east-west network test fails.
# oc get pod -o wide
NAME                             READY   STATUS      RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running     0          3h25m   10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running     0          3h26m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>
# oc exec my-nginx-75897978cd-rd4m2 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:08 --:--:--     0curl: (7) Failed to connect to 10.132.0.2 port 80: Connection timed out
command terminated with exit code 7

3, Create another linux-webserver pod by editing the deployment. Access from the new pod to win-webserver still works.

# oc get pod -o wide
NAME                             READY   STATUS      RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running     0          4h     10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
my-nginx-75897978cd-s2ks8        1/1     Running     0          12m    10.128.2.11   ip-10-0-159-2.us-east-2.compute.internal     <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running     0          4h1m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>

# oc exec my-nginx-75897978cd-s2ks8 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   158  100   158    0     0    731      0 --:--:-- --:--:-- --:--:--   731
<html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.2 callerCount 2 <p>IP 10.132.0.2 callerCount 30 <p>IP 10.132.0.2 callerCount 22 </body></html>


Something may have happened in the pod network during these 3 hours that broke the channel between the original Linux pod and the Windows pod. The same failure occurs in the other direction, when the win-webserver pod accesses the linux-webserver pod.
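
To narrow down when the connection actually breaks, a polling loop along the following lines could be left running from the start. This is only a sketch: the helper names probe_once and poll_until_fail, the 300-second interval, and the 10-second curl timeout are assumptions for illustration, not part of the original test. The pod name and IP are the ones from step 1.

```shell
#!/bin/sh
# Hypothetical helpers: log each east-west probe with a UTC timestamp
# and stop at the first failure.

# probe_once CMD...: run CMD quietly; print "<timestamp> OK" on success,
# or "<timestamp> FAIL rc=<exit code>" and return 1 on failure.
probe_once() {
    if "$@" >/dev/null 2>&1; then
        printf '%s OK\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)"
    else
        rc=$?
        printf '%s FAIL rc=%s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$rc"
        return 1
    fi
}

# poll_until_fail INTERVAL_SECONDS CMD...: repeat probe_once, sleeping
# INTERVAL_SECONDS between probes, until the first failed probe.
poll_until_fail() {
    interval=$1
    shift
    while probe_once "$@"; do
        sleep "$interval"
    done
}

# Example invocation (needs a logged-in oc session; interval and timeout are assumed values):
# poll_until_fail 300 oc exec my-nginx-75897978cd-rd4m2 -- curl -s --max-time 10 10.132.0.2
```

The timestamp on the last OK line and on the first FAIL line would bracket the moment the channel stopped working, which could then be matched against node and network logs.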

Actual results:
Connection timed out

Expected results:
East-west networking between Windows pods and Linux pods should keep working, no matter how long the pods have been running.

Additional info:

Comment 1 gaoshang 2020-05-06 14:58:43 UTC
This bug is fixed in 4.5.0-0.nightly-2020-05-05-205255. Moving the status to VERIFIED, thanks.

Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-05-205255   True        False         8h      Cluster version is 4.5.0-0.nightly-2020-05-05-205255

windows-machine-config-bootstrapper git commit 3f4e97c9a50e07208facfcc3670caf729424a25c

Steps:

1, Bring up an OCP 4.5.0-0.nightly-2020-05-05-205255 cluster with ovn-kubernetes
2, Bring up a Windows node
3, Configure the inventory file and run wsu
4, Create the win-webserver and linux-webserver pods, wait for several hours, and check that the east-west network is still available in both directions

# oc create -f https://raw.githubusercontent.com/sgaoshang/winc-test/master/data/WinWebServer.yaml
# oc create -f https://raw.githubusercontent.com/sgaoshang/winc-test/master/data/LinuxWebServer.yaml

# oc get pod -owide
NAME                               READY   STATUS    RESTARTS   AGE     IP            NODE                                       NOMINATED NODE   READINESS GATES
linux-webserver-65b89c7f5c-4x2q9   1/1     Running   0          7h34m   10.128.2.18   ip-10-0-131-9.us-east-2.compute.internal   <none>           <none>
win-webserver-76659cfd79-5g854     1/1     Running   0          7h34m   10.132.0.3    ip-10-0-37-34.us-east-2.compute.internal   <none>           <none>

# oc exec linux-webserver-65b89c7f5c-4x2q9 curl 10.132.0.3
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   126  100   126    0     0    540      0 --:--:-- --:--:-- --:--:--   540<html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.3 callerCount 47 <p>IP 10.132.0.3 callerCount 3 </body></html>

# oc exec win-webserver-76659cfd79-5g854 curl 10.128.2.18
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0Linux Container Web Server
100    27  100    27    0     0     27      0  0:00:01 --:--:--  0:00:01   870

Comment 4 errata-xmlrpc 2020-07-13 17:22:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409