Bug 1814706 - Connection timed out after long time running when accessing Windows pod from Linux pod
Summary: Connection timed out after long time running when accessing Windows pod from ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Windows Containers
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Aravindh Puthiyaparambil
QA Contact: gaoshang
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-03-18 14:38 UTC by gaoshang
Modified: 2020-07-13 17:23 UTC (History)
CC List: 3 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-07-13 17:22:38 UTC
Target Upstream Version:
Embargoed:




Links
System                   ID               Private   Priority   Status   Summary   Last Updated
Red Hat Product Errata   RHBA-2020:2409   0         None       None     None      2020-07-13 17:23:02 UTC

Description gaoshang 2020-03-18 14:38:56 UTC
Description of problem:
Connections from a Linux pod to a Windows pod time out after the pods have been running for several hours in an AWS cluster. See Steps to Reproduce.

Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.4.0-0.nightly-2020-03-17-135743   True        False         20h     Cluster version is 4.4.0-0.nightly-2020-03-17-135743

windows-machine-config-bootstrapper commit
69b264d8437746f07c1234daeba8f20dc40710bd

How reproducible:
Always

Steps to Reproduce:
1, Create the win-webserver and linux-webserver pods. At first, east-west network testing passes.
# oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running   0          81m   10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running   0          82m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>
# oc exec my-nginx-75897978cd-rd4m2 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   125  100   125    0     0    543      0 --:--:-- --:--:-- --:--:--   543
<html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.2 callerCount 3 <p>IP 10.132.0.2 callerCount 5 </body></html>

2, After more than 3 hours (see the pod "AGE" column), the same east-west request fails.
# oc get pod -o wide
NAME                             READY   STATUS      RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running     0          3h25m   10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running     0          3h26m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>
# oc exec my-nginx-75897978cd-rd4m2 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:08 --:--:--     0curl: (7) Failed to connect to 10.132.0.2 port 80: Connection timed out
command terminated with exit code 7

3, Create another linux-webserver pod by editing the deployment. Access from the new pod to win-webserver still works.

# oc get pod -o wide
NAME                             READY   STATUS      RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running     0          4h     10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
my-nginx-75897978cd-s2ks8        1/1     Running     0          12m    10.128.2.11   ip-10-0-159-2.us-east-2.compute.internal     <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running     0          4h1m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>

# oc exec my-nginx-75897978cd-s2ks8 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   158  100   158    0     0    731      0 --:--:-- --:--:-- --:--:--   731
<html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.2 callerCount 2 <p>IP 10.132.0.2 callerCount 30 <p>IP 10.132.0.2 callerCount 22 </body></html>
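
To repeat the comparison in step 3 across several source pods, the pod names and IPs can be pulled out of the listing. This is only a sketch and not part of the original reproduction: `pod_ips` is a hypothetical helper name, and it assumes the default `oc get pod -o wide` column layout shown above (NAME in column 1, IP in column 6).

```shell
# Hypothetical helper: read `oc get pod -o wide` output on stdin and
# print "<pod-name> <pod-ip>" for every pod, skipping the header row.
# Assumes NAME is column 1 and IP is column 6, as in the listings above.
pod_ips() {
  awk 'NR > 1 && NF >= 6 { print $1, $6 }'
}

# Example (needs a live cluster, so it is left commented out):
# curl the Windows pod from every my-nginx pod.
# oc get pod -o wide | pod_ips | while read -r name ip; do
#   case $name in
#     my-nginx-*) echo "== $name ($ip)"
#                 oc exec "$name" -- curl -s --max-time 10 http://10.132.0.2 ;;
#   esac
# done
```

With the pods listed in step 3, the old pod (10.131.0.13) would be expected to fail and the new pod (10.128.2.11) to succeed.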


Something may have changed in the pod network during these 3 hours that broke the path between the existing Linux pod and the Windows pod. The same failure occurs when the win-webserver pod accesses the linux-webserver pod.

Actual results:
Connection timed out
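
The failing curl in step 2 took over two minutes (0:02:08) to give up. For repeated checks, a bounded probe may be easier to work with. The following is a minimal sketch, not something from the original report: `probe_pod` is a hypothetical helper name, the 5 s and 10 s limits are arbitrary, and it assumes `oc exec` passes through the exit code of the remote curl (as the "command terminated with exit code 7" line above suggests).

```shell
# Hypothetical helper: curl TARGET_IP from inside SRC_POD with bounded timeouts.
# Prints "ok <http_code>" on success, "timeout" when curl exits with 28
# (its documented "operation timed out" code), and "fail <exit_code>" otherwise.
probe_pod() {
  src_pod=$1
  target_ip=$2
  # -s: no progress meter; -o /dev/null: discard the body;
  # -w '%{http_code}': print only the HTTP status code.
  code=$(oc exec "$src_pod" -- curl -s -o /dev/null \
    --connect-timeout 5 --max-time 10 \
    -w '%{http_code}' "http://$target_ip")
  rc=$?
  case $rc in
    0)  echo "ok $code" ;;
    28) echo "timeout" ;;
    *)  echo "fail $rc" ;;
  esac
}

# Example (needs a live cluster, so it is left commented out):
# probe_pod my-nginx-75897978cd-rd4m2 10.132.0.2
```

Because of `--connect-timeout`, a blocked path would be reported as "timeout" within a few seconds instead of exit code 7 after the system's own connect timeout.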

Expected results:
East-west networking between Windows pods and Linux pods should keep working, regardless of how long the pods have been running.

Additional info:

Comment 1 gaoshang 2020-05-06 14:58:43 UTC
This bug has been fixed in 4.5.0-0.nightly-2020-05-05-205255. Moving status to VERIFIED, thanks.

Version-Release number of selected component (if applicable):
# oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.5.0-0.nightly-2020-05-05-205255   True        False         8h      Cluster version is 4.5.0-0.nightly-2020-05-05-205255

windows-machine-config-bootstrapper git commit 3f4e97c9a50e07208facfcc3670caf729424a25c

Steps:

1, Bring up an OCP 4.5.0-0.nightly-2020-05-05-205255 cluster with ovn-kubernetes
2, Bring up a Windows node
3, Configure the inventory file and run wsu
4, Create the win-webserver and linux-webserver pods, wait for several hours, and check that the east-west network is still available

# oc create -f https://raw.githubusercontent.com/sgaoshang/winc-test/master/data/WinWebServer.yaml
# oc create -f https://raw.githubusercontent.com/sgaoshang/winc-test/master/data/LinuxWebServer.yaml

# oc get pod -owide
NAME                               READY   STATUS    RESTARTS   AGE     IP            NODE                                       NOMINATED NODE   READINESS GATES
linux-webserver-65b89c7f5c-4x2q9   1/1     Running   0          7h34m   10.128.2.18   ip-10-0-131-9.us-east-2.compute.internal   <none>           <none>
win-webserver-76659cfd79-5g854     1/1     Running   0          7h34m   10.132.0.3    ip-10-0-37-34.us-east-2.compute.internal   <none>           <none>

# oc exec linux-webserver-65b89c7f5c-4x2q9 curl 10.132.0.3
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   126  100   126    0     0    540      0 --:--:-- --:--:-- --:--:--   540
<html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.3 callerCount 47 <p>IP 10.132.0.3 callerCount 3 </body></html>

# oc exec win-webserver-76659cfd79-5g854 curl 10.128.2.18
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0Linux Container Web Server
100    27  100    27    0     0     27      0  0:00:01 --:--:--  0:00:01   870
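
The "wait for hours, then check" part of step 4 could also be run as a periodic check, which would show roughly when connectivity breaks if the problem comes back. This is a hedged sketch rather than the procedure actually used for verification: `watch_connectivity` is a hypothetical helper name, and the interval and attempt count in the example are arbitrary.

```shell
# Hypothetical helper: run a check command every INTERVAL seconds, at most
# COUNT times, printing one UTC-timestamped line per attempt.
# Returns 1 at the first failed attempt, 0 if every attempt succeeds.
# Usage: watch_connectivity INTERVAL COUNT command [args...]
watch_connectivity() {
  interval=$1
  count=$2
  shift 2
  i=1
  while [ "$i" -le "$count" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) attempt $i ok"
    else
      echo "$(date -u +%Y-%m-%dT%H:%M:%SZ) attempt $i FAILED"
      return 1
    fi
    i=$((i + 1))
    # Sleep only between attempts, not after the last one.
    if [ "$i" -le "$count" ]; then
      sleep "$interval"
    fi
  done
  return 0
}

# Example (needs a live cluster, so it is left commented out):
# check every 10 minutes for 8 hours (48 attempts).
# watch_connectivity 600 48 \
#   oc exec linux-webserver-65b89c7f5c-4x2q9 -- curl -s --max-time 10 http://10.132.0.3
```

Stopping at the first failure keeps the last "ok" timestamp next to the first "FAILED" one, which narrows down the window in which the path broke.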

Comment 4 errata-xmlrpc 2020-07-13 17:22:38 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

