Bug 1807193

Summary: Windows pod unreachable with "No route to host" error
Product: OpenShift Container Platform
Reporter: Sebastian Soto <ssoto>
Component: Windows Containers
Assignee: Sebastian Soto <ssoto>
Status: CLOSED WORKSFORME
QA Contact: gaoshang <sgao>
Severity: unspecified
Priority: unspecified
Version: 4.4
CC: anusaxen, aos-bugs, dcbw, gmarkley, rgudimet
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Last Closed: 2020-03-25 05:45:56 UTC
Type: Bug
Attachments:
ovnkube-node (flags: none)

Description Sebastian Soto 2020-02-25 18:59:24 UTC
Description of problem:
After a Windows pod running a webserver is provisioned, curling the webserver fails with:

```
Failed to connect to 10.132.0.6 port 80: No route to host
```

Version-Release number of selected component (if applicable):


How reproducible:
Very

Steps to Reproduce:
1. Deploy Windows pod:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: win-webserver
  name: win-webserver
spec:
  selector:
    matchLabels:
      app: win-webserver
  replicas: 1
  template:
    metadata:
      labels:
        app: win-webserver
      name: win-webserver
    spec:
      tolerations:
      - key: "os"
        value: "Windows"
        effect: "NoSchedule"
      containers:
      - name: windowswebserver
        securityContext:
        image: mcr.microsoft.com/windows/servercore:ltsc2019
        imagePullPolicy: IfNotPresent
        command:
        - powershell.exe
        - -command
        - $listener = New-Object System.Net.HttpListener; $listener.Prefixes.Add('http://*:80/'); $listener.Start();Write-Host('Listening at http://*:80/'); while ($listener.IsListening) { $context = $listener.GetContext(); $response = $context.Response; $content='<html><body><H1>Windows Container Web Server</H1></body></html>'; $buffer = [System.Text.Encoding]::UTF8.GetBytes($content); $response.ContentLength64 = $buffer.Length; $response.OutputStream.Write($buffer, 0, $buffer.Length); $response.Close(); };
      nodeSelector:
        beta.kubernetes.io/os: windows


2. Curl the Windows pod from a Linux pod.

Actual results:
Failed to connect to 10.132.0.6 port 80: No route to host

Expected results:
<html><body><H1>Windows Container Web Server</H1></body></html>

Additional info:
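The error text matters when triaging this kind of failure: the description above reports "No route to host", while a later comment in this bug reports "Connection timed out" for the same test. As a rough guide, the first usually means the sender got an explicit unreachable signal or could not resolve a next hop, while the second usually means packets were silently dropped. The following is a minimal sketch, not part of the original report, of how a probe could tell the two apart. The helper names `classify_connect_error` and `probe` are hypothetical, and it assumes Python 3 is available wherever the probe runs.

```python
import errno
import socket


def classify_connect_error(exc: OSError) -> str:
    """Map a failed TCP connect to a coarse category."""
    # socket.timeout is raised when our own timeout expires.
    if isinstance(exc, socket.timeout):
        return "timed out"
    if exc.errno == errno.EHOSTUNREACH:
        return "no route to host"
    if exc.errno == errno.ETIMEDOUT:
        return "timed out"
    if exc.errno == errno.ECONNREFUSED:
        return "connection refused"
    return "other"


def probe(host: str, port: int = 80, timeout: float = 5.0) -> str:
    """Attempt one TCP connect; return 'ok' or the failure category."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "ok"
    except OSError as exc:
        return classify_connect_error(exc)
```

For example, `probe("10.132.0.6")` run from the Linux pod would return one of the category strings above instead of raw curl output.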

Comment 3 Dan Williams 2020-03-02 22:40:58 UTC
The "k8s.ovn.org/l3-gateway-config annotation not found for node" message should be suppressed; it just means ovnkube couldn't find that annotation, and it shouldn't affect operation. I've now suppressed that message in the hybrid overlay code upstream.

Comment 6 gaoshang 2020-03-03 06:50:59 UTC
Created attachment 1667123 [details]
ovnkube-node

Comment 7 gaoshang 2020-03-03 06:52:30 UTC
Comment on attachment 1667123 [details]
ovnkube-node

# oc logs -n openshift-ovn-kubernetes pod/ovnkube-node-grghd -c ovnkube-node

Comment 9 Sebastian Soto 2020-03-03 21:40:03 UTC
I have not been able to reproduce this bug using openshift-install-linux-4.4.0-0.nightly-2020-03-02-124231

What I've tried:

Running the east-west test on the same 2 VMs 10+ times
Running the WSU and then the east-west test on the same 2 VMs 6 times
Running the WSU and then the east-west test on new VMs 4 times

Comment 10 Anurag saxena 2020-03-04 21:46:24 UTC
Indeed. Not reproducible for me either on 4.4.0-0.nightly-2020-03-04-143604. Shang Gao, can you also check in your env on the latest nightly?

Comment 11 gaoshang 2020-03-05 15:08:28 UTC
(In reply to Anurag saxena from comment #10)
> Indeed. Not reproducible for me either on
> 4.4.0-0.nightly-2020-03-04-143604. Shang Gao, can you also check in your env
> on the latest nightly?

I think this bug still exists; please see the following steps.

1. Create the win-webserver and linux-webserver pods. At first, east-west network testing passes.
[root@sgaoos aws]# oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running   0          81m   10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running   0          82m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>
[root@sgaoos aws]# oc exec my-nginx-75897978cd-rd4m2 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   125  100   125    0     0    543      0 --:--:-- --:--:-- --:--:--   543
<html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.2 callerCount 3 <p>IP 10.132.0.2 callerCount 5 </body></html>

2. After more than 3 hours (see the pod AGE column), the same east-west network test fails.

[root@sgaoos aws]# oc get pod -o wide
NAME                             READY   STATUS      RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running     0          3h25m   10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running     0          3h26m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>
[root@sgaoos aws]# oc exec my-nginx-75897978cd-rd4m2 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:08 --:--:--     0curl: (7) Failed to connect to 10.132.0.2 port 80: Connection timed out
command terminated with exit code 7

3. Create another linux-webserver pod by editing the deployment. Traffic from the new pod to win-webserver still works.

[root@sgaoos aws]# oc get pod -o wide
NAME                             READY   STATUS      RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running     0          4h     10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
my-nginx-75897978cd-s2ks8        1/1     Running     0          12m    10.128.2.11   ip-10-0-159-2.us-east-2.compute.internal     <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running     0          4h1m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>

[root@sgaoos aws]# oc exec my-nginx-75897978cd-s2ks8 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   158  100   158    0     0    731      0 --:--:-- --:--:-- --:--:--   731
<html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.2 callerCount 2 <p>IP 10.132.0.2 callerCount 30 <p>IP 10.132.0.2 callerCount 22 </body></html>
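
The AGE column in the listings above (81m, 3h25m, 4h1m, 12m) is the only timing evidence in these steps. Converting it to seconds makes it easier to compare how long each pod had been running when a test passed or failed. This is a small illustrative sketch; `age_to_seconds` is a hypothetical helper, and it assumes only the `d`, `h`, `m`, and `s` units, so other units that `oc` may print for very old objects are rejected rather than guessed at.

```python
import re

# Seconds per unit for the compact age format shown by `oc get pod`.
_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def age_to_seconds(age: str) -> int:
    """Convert a compact age such as '3h25m' or '81m' into seconds."""
    parts = re.findall(r"(\d+)([dhms])", age)
    # Reject input that is empty or contains anything besides number+unit pairs.
    if not parts or "".join(n + u for n, u in parts) != age:
        raise ValueError(f"unrecognized age: {age!r}")
    return sum(int(n) * _UNIT_SECONDS[u] for n, u in parts)
```

With the values above, the pod that passed at 81m had been running for 4860 seconds, and the same pod failed at 3h25m, or 12300 seconds.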


Maybe something happened in the pod network during these 3 hours that broke connectivity between the Linux pod and the Windows pod. The same failure occurs when the win-webserver pod accesses the linux-webserver pod.
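
Because the failure only appears somewhere between 81 minutes and about 3 hours 25 minutes after pod creation, polling at a fixed interval would narrow down when connectivity actually breaks, so that the time can be matched against the ovnkube-node logs. The following is a minimal sketch under stated assumptions, not something that was run for this bug. `first_failure` is a hypothetical helper, and the `check` callable is left to the caller so the loop itself does not depend on any particular CLI.

```python
import time
from typing import Callable, Optional


def first_failure(
    check: Callable[[], bool],
    interval_s: float,
    max_checks: int,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.time,
) -> Optional[float]:
    """Poll `check` until it fails; return seconds elapsed at the first failure.

    Returns None if all `max_checks` attempts succeed.
    """
    start = clock()
    for attempt in range(max_checks):
        if not check():
            return clock() - start
        # Do not sleep after the final attempt.
        if attempt < max_checks - 1:
            sleep(interval_s)
    return None
```

One possible `check`, assuming `oc` is on the PATH and using the pod name and IP from the listing above, is a function that runs `oc exec my-nginx-75897978cd-rd4m2 -- curl -s --max-time 10 10.132.0.2` through `subprocess.run` and returns whether the exit code is 0. Polling every 60 seconds for 240 checks would cover the 3 to 4 hour window reported here.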