Bug 1807193

Summary:

Windows pod unreachable with "No route to host" error

Product:

OpenShift Container Platform

Reporter:

Sebastian Soto <ssoto>

Component:

Windows Containers

Assignee:

Sebastian Soto <ssoto>

Status:

CLOSED WORKSFORME

QA Contact:

gaoshang <sgao>

Severity:

unspecified

Priority:

unspecified

Version:

4.4

CC:

anusaxen, aos-bugs, dcbw, gmarkley, rgudimet

Target Release:

4.5.0

Hardware:

Unspecified

OS:

Unspecified

Last Closed:

2020-03-25 05:45:56 UTC

Type:

Bug

Attachments:

Description	Flags
ovnkube-node	none

Description Sebastian Soto 2020-02-25 18:59:24 UTC

Description of problem:
After a Windows pod running a web server is provisioned, curling the web server fails with:

```
Failed to connect to 10.132.0.6 port 80: No route to host
```

Version-Release number of selected component (if applicable):


How reproducible:
Very

Steps to Reproduce:
1. Deploy a Windows pod using the following Deployment:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: win-webserver
  name: win-webserver
spec:
  selector:
    matchLabels:
      app: win-webserver
  replicas: 1
  template:
    metadata:
      labels:
        app: win-webserver
      name: win-webserver
    spec:
      tolerations:
      - key: "os"
        value: "Windows"
        effect: "NoSchedule"
      containers:
      - name: windowswebserver
        image: mcr.microsoft.com/windows/servercore:ltsc2019
        imagePullPolicy: IfNotPresent
        command:
        - powershell.exe
        - -command
        - $listener = New-Object System.Net.HttpListener; $listener.Prefixes.Add('http://*:80/'); $listener.Start();Write-Host('Listening at http://*:80/'); while ($listener.IsListening) { $context = $listener.GetContext(); $response = $context.Response; $content='<html><body><H1>Windows Container Web Server</H1></body></html>'; $buffer = [System.Text.Encoding]::UTF8.GetBytes($content); $response.ContentLength64 = $buffer.Length; $response.OutputStream.Write($buffer, 0, $buffer.Length); $response.Close(); };
      nodeSelector:
        beta.kubernetes.io/os: windows


2. Curl the pod from a Linux pod.

Actual results:
Failed to connect to 10.132.0.6 port 80: No route to host

Expected results:
<html><body><H1>Windows Container Web Server</H1></body></html>
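
For reference, step 2 can be scripted so that the check is repeatable. The following is a minimal Python sketch, not the test used in CI. It assumes `oc` is on the PATH and logged in to the cluster. The function names (`build_curl_command`, `response_ok`, `check_east_west`) and the 10-second timeout are illustrative choices, not part of any existing tool.

```python
import subprocess

# Text expected in the page served by the Windows web server pod.
EXPECTED_MARKER = "Windows Container Web Server"


def build_curl_command(linux_pod, target_ip, timeout_s=10):
    """Build the `oc exec ... curl` command that is run from the Linux pod."""
    return [
        "oc", "exec", linux_pod, "--",
        "curl", "-sS", "--max-time", str(timeout_s),
        f"http://{target_ip}:80/",
    ]


def response_ok(body):
    """Return True if the HTTP body looks like the Windows web server page."""
    return EXPECTED_MARKER in body


def check_east_west(linux_pod, target_ip, timeout_s=10):
    """Run the curl from the Linux pod; True means the Windows pod answered."""
    try:
        result = subprocess.run(
            build_curl_command(linux_pod, target_ip, timeout_s),
            capture_output=True,
            text=True,
        )
    except FileNotFoundError:
        # `oc` is not installed or not on the PATH.
        return False
    return result.returncode == 0 and response_ok(result.stdout)
```

Calling `check_east_west("<linux-pod-name>", "10.132.0.6")` would mirror the manual curl in step 2; a `False` result covers both "No route to host" and a timeout, so the curl error text still needs to be read from the command output when diagnosing.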

Additional info:

Comment 1 Sebastian Soto 2020-02-28 20:14:52 UTC

I'm seeing a number of errors in the ovnkube-master logs from a job in which this occurred:

Job it occurred in:
https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-bootstrapper/159/pull-ci-openshift-windows-machine-config-bootstrapper-master-e2e-wsu/190

Log:
https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-bootstrapper/159/pull-ci-openshift-windows-machine-config-bootstrapper-master-e2e-wsu/190/artifacts/e2e-wsu/pods/openshift-ovn-kubernetes_ovnkube-master-m8lrq_ovnkube-master.log
time="2020-02-28T18:06:26Z" level=error msg="k8s.ovn.org/l3-gateway-config annotation not found for node \"ip-10-0-10-167.ec2.internal\""
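
To pull only the error-level entries out of a log like the one above, a small filter can help. This is a rough Python sketch that assumes every line uses the same `key=value` layout as the line quoted here (`time=`, `level=`, `msg=`); the helper names `parse_log_line` and `error_messages` are made up for illustration.

```python
import shlex


def parse_log_line(line):
    """Split one key=value log line into a dict; return {} if it cannot be parsed."""
    try:
        tokens = shlex.split(line)  # handles quoted values and \" escapes
    except ValueError:
        # Unbalanced quotes: not a line in the assumed format.
        return {}
    fields = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if sep:
            fields[key] = value
    return fields


def error_messages(lines):
    """Return the msg field of every line logged at level=error."""
    messages = []
    for line in lines:
        fields = parse_log_line(line)
        if fields.get("level") == "error" and "msg" in fields:
            messages.append(fields["msg"])
    return messages


sample = (
    r'time="2020-02-28T18:06:26Z" level=error '
    r'msg="k8s.ovn.org/l3-gateway-config annotation not found for node '
    r'\"ip-10-0-10-167.ec2.internal\""'
)
print(error_messages([sample]))
```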

Comment 3 Dan Williams 2020-03-02 22:40:58 UTC

The "k8s.ovn.org/l3-gateway-config annotation not found for node" message should be suppressed; it just means ovnkube couldn't find that annotation, and it shouldn't affect operation. I've now suppressed that message in the hybrid overlay code upstream.

Comment 6 gaoshang 2020-03-03 06:50:59 UTC

Created attachment 1667123 [details]
ovnkube-node

Comment 7 gaoshang 2020-03-03 06:52:30 UTC

Comment on attachment 1667123 [details]
ovnkube-node

# oc logs -n openshift-ovn-kubernetes pod/ovnkube-node-grghd -c ovnkube-node

Comment 9 Sebastian Soto 2020-03-03 21:40:03 UTC

I have not been able to reproduce this bug using openshift-install-linux-4.4.0-0.nightly-2020-03-02-124231

What I've tried:

Running the east-west test on the same 2 VMs 10+ times
Running the WSU and then the east-west test on the same 2 VMs 6 times
Running the WSU and then the east-west test on new VMs 4 times

Comment 10 Anurag saxena 2020-03-04 21:46:24 UTC

Indeed. Not reproducible for me either on 4.4.0-0.nightly-2020-03-04-143604. Shang Gao, can you also check in your environment on the latest nightly?

Comment 11 gaoshang 2020-03-05 15:08:28 UTC

(In reply to Anurag saxena from comment #10)
> Indeed. Not reproducible for me either on
> 4.4.0-0.nightly-2020-03-04-143604. Shang Gao, can you also check in your
> environment on the latest nightly?

I think this bug still exists; please see the following steps.

1. Create the win-webserver and linux-webserver pods. At first, the east-west network test passes.
[root@sgaoos aws]# oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running   0          81m   10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running   0          82m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>
[root@sgaoos aws]# oc exec my-nginx-75897978cd-rd4m2 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   125  100   125    0     0    543      0 --:--:-- --:--:-- --:--:--   543
<html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.2 callerCount 3 <p>IP 10.132.0.2 callerCount 5 </body></html>

2. After more than 3 hours (see the pod AGE column), the same east-west network test fails.

[root@sgaoos aws]# oc get pod -o wide
NAME                             READY   STATUS      RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running     0          3h25m   10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running     0          3h26m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>
[root@sgaoos aws]# oc exec my-nginx-75897978cd-rd4m2 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:08 --:--:--     0curl: (7) Failed to connect to 10.132.0.2 port 80: Connection timed out
command terminated with exit code 7

3. Create another linux-webserver pod by editing the deployment. Traffic from the new pod to win-webserver still works.

[root@sgaoos aws]# oc get pod -o wide
NAME                             READY   STATUS      RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running     0          4h     10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
my-nginx-75897978cd-s2ks8        1/1     Running     0          12m    10.128.2.11   ip-10-0-159-2.us-east-2.compute.internal     <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running     0          4h1m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>

[root@sgaoos aws]# oc exec my-nginx-75897978cd-s2ks8 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   158  100   158    0     0    731      0 --:--:-- --:--:-- --:--:--   731
<html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.2 callerCount 2 <p>IP 10.132.0.2 callerCount 30 <p>IP 10.132.0.2 callerCount 22 </body></html>


Maybe something happened in the pod network during these 3 hours that broke the path between the Linux pod and the Windows pod. The same failure occurs when the win-webserver pod accesses the linux-webserver pod.
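
Since the failure seems tied to how long the pods have been running, it may help to record pod age alongside each connectivity check. Below is a small Python sketch that parses `oc get pod -o wide` output like the listings above. It assumes the first seven columns are NAME, READY, STATUS, RESTARTS, AGE, IP and NODE, as in this report, and that AGE uses the `d`/`h`/`m`/`s` suffixes. The helper names are hypothetical.

```python
import re

# Minutes per AGE unit suffix.
_UNIT_MINUTES = {"d": 1440, "h": 60, "m": 1, "s": 1 / 60}


def parse_age_minutes(age):
    """Convert an AGE value such as '81m', '3h25m' or '4h' into minutes."""
    parts = re.findall(r"(\d+)([dhms])", age)
    if not parts or "".join(n + u for n, u in parts) != age:
        raise ValueError(f"unrecognized AGE value: {age!r}")
    return sum(int(n) * _UNIT_MINUTES[u] for n, u in parts)


def parse_pod_rows(text):
    """Parse `oc get pod -o wide` output into one dict per pod row."""
    pods = []
    for line in text.splitlines():
        cols = line.split()
        if len(cols) < 7 or cols[0] == "NAME":
            continue  # skip the header, prompts, and blank lines
        name, _ready, status, _restarts, age, ip, node = cols[:7]
        pods.append({
            "name": name,
            "status": status,
            "age_minutes": parse_age_minutes(age),
            "ip": ip,
            "node": node,
        })
    return pods


listing = """\
NAME                             READY   STATUS      RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running     0          3h25m   10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running     0          3h26m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>
"""
for pod in parse_pod_rows(listing):
    print(pod["name"], pod["ip"], pod["age_minutes"])
```

Logging these values each time the curl from step 1 is repeated would show roughly at what pod age the connection starts timing out.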