Description of problem:
After a Windows pod running a webserver is provisioned, curling the webserver gives:
```
Failed to connect to 10.132.0.6 port 80: No route to host
```

Version-Release number of selected component (if applicable):

How reproducible:
Very

Steps to Reproduce:
1. Deploy a Windows pod:
```
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: win-webserver
  name: win-webserver
spec:
  selector:
    matchLabels:
      app: win-webserver
  replicas: 1
  template:
    metadata:
      labels:
        app: win-webserver
      name: win-webserver
    spec:
      securityContext: {}
      tolerations:
      - key: "os"
        value: "Windows"
        effect: "NoSchedule"
      containers:
      - name: windowswebserver
        securityContext: {}
        image: mcr.microsoft.com/windows/servercore:ltsc2019
        imagePullPolicy: IfNotPresent
        command:
        - powershell.exe
        - -command
        - $listener = New-Object System.Net.HttpListener; $listener.Prefixes.Add('http://*:80/'); $listener.Start(); Write-Host('Listening at http://*:80/'); while ($listener.IsListening) { $context = $listener.GetContext(); $response = $context.Response; $content = '<html><body><H1>Windows Container Web Server</H1></body></html>'; $buffer = [System.Text.Encoding]::UTF8.GetBytes($content); $response.ContentLength64 = $buffer.Length; $response.OutputStream.Write($buffer, 0, $buffer.Length); $response.Close(); };
      nodeSelector:
        beta.kubernetes.io/os: windows
```
2. Curl the pod from a Linux pod.

Actual results:
```
Failed to connect to 10.132.0.6 port 80: No route to host
```

Expected results:
```
<html><body><H1>Windows Container Web Server</H1></body></html>
```

Additional info:
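Step 2 above can be sketched as a small helper. This is a hedged sketch, not part of the original report: the pod name in the usage line is a placeholder, and `10.132.0.6` is the Windows pod IP from this report.

```shell
# Hedged sketch of step 2: run curl inside an existing Linux pod against the
# Windows pod's IP. Both arguments are placeholders to substitute for your own.
curl_from_pod() {
  local linux_pod="$1" target_ip="$2"
  # --max-time keeps a "No route to host" or timeout failure from hanging the test
  oc exec "$linux_pod" -- curl -s --max-time 10 "http://$target_ip/"
}
# Usage (assumes oc is logged in to the cluster):
#   curl_from_pod my-linux-pod 10.132.0.6
```

A success should print the `Windows Container Web Server` HTML; in the failing case curl exits with code 7.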
I'm seeing a number of errors in the ovnkube-master logs in a job where this occurred.

Job it occurred in: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-bootstrapper/159/pull-ci-openshift-windows-machine-config-bootstrapper-master-e2e-wsu/190
Log: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-bootstrapper/159/pull-ci-openshift-windows-machine-config-bootstrapper-master-e2e-wsu/190/artifacts/e2e-wsu/pods/openshift-ovn-kubernetes_ovnkube-master-m8lrq_ovnkube-master.log

```
time="2020-02-28T18:06:26Z" level=error msg="k8s.ovn.org/l3-gateway-config annotation not found for node \"ip-10-0-10-167.ec2.internal\""
```
The "k8s.ovn.org/l3-gateway-config annotation not found for node" message should be suppressed; it just means ovnkube couldn't find that annotation and shouldn't affect operation. I've suppressed that message in the hybrid overlay code upstream now.
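For anyone triaging similar logs, a quick way to see whether a node actually carries that annotation is sketched below. The node name in the usage line is the one from the log message; the jsonpath escaping of the dots in the annotation key follows the standard kubectl/oc convention.

```shell
# Hedged sketch: print a node's k8s.ovn.org/l3-gateway-config annotation, if any.
# An empty result means the annotation is missing, which is what ovnkube logged.
check_l3_gateway_annotation() {
  local node="$1"
  oc get node "$node" -o jsonpath='{.metadata.annotations.k8s\.ovn\.org/l3-gateway-config}'
}
# Usage:
#   check_l3_gateway_annotation ip-10-0-10-167.ec2.internal
```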
Created attachment 1667123 [details] ovnkube-node
Comment on attachment 1667123 [details] ovnkube-node # oc logs -n openshift-ovn-kubernetes pod/ovnkube-node-grghd -c ovnkube-node
I have not been able to reproduce this bug using openshift-install-linux-4.4.0-0.nightly-2020-03-02-124231.

What I've tried:
- Running the east-west test on the same 2 VMs 10+ times
- Running the WSU and then the east-west test on the same 2 VMs 6 times
- Running the WSU and then the east-west test on new VMs 4 times
Indeed, not reproducible for me either on 4.4.0-0.nightly-2020-03-04-143604. Shang Gao, can you also check in your env on the latest nightly?
(In reply to Anurag saxena from comment #10)
> Indeed, not reproducible for me either on 4.4.0-0.nightly-2020-03-04-143604.
> Shang Gao, can you also check in your env on the latest nightly?

I think this bug still exists; please see the following steps.

1. Create win-webserver and linux-webserver pods; at first, east-west network testing passed:
```
[root@sgaoos aws]# oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running   0          81m   10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running   0          82m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>
[root@sgaoos aws]# oc exec my-nginx-75897978cd-rd4m2 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   125  100   125    0     0    543      0 --:--:-- --:--:-- --:--:--   543
<html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.2 callerCount 3 <p>IP 10.132.0.2 callerCount 5 </body></html>
```

2. After more than 3 hours (see the pod AGE), the same east-west test failed:
```
[root@sgaoos aws]# oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running   0          3h25m   10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running   0          3h26m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>
[root@sgaoos aws]# oc exec my-nginx-75897978cd-rd4m2 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:08 --:--:--     0
curl: (7) Failed to connect to 10.132.0.2 port 80: Connection timed out
command terminated with exit code 7
```

3. Created another linux-webserver pod by editing the deployment; the new pod can still reach win-webserver.
```
[root@sgaoos aws]# oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running   0          4h     10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
my-nginx-75897978cd-s2ks8        1/1     Running   0          12m    10.128.2.11   ip-10-0-159-2.us-east-2.compute.internal     <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running   0          4h1m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>
[root@sgaoos aws]# oc exec my-nginx-75897978cd-s2ks8 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   158  100   158    0     0    731      0 --:--:-- --:--:-- --:--:--   731
<html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.2 callerCount 2 <p>IP 10.132.0.2 callerCount 30 <p>IP 10.132.0.2 callerCount 22 </body></html>
```

Maybe something happened in the pod network during these 3 hours that broke the path between the original Linux pod and the Windows pod. The same failure occurs in the other direction, when the win-webserver pod accesses the linux-webserver pod.
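Since connectivity was fine at first and only broke after roughly 3 hours, a polling loop like the sketch below could narrow down exactly when the path fails, so the break can be correlated with ovnkube events. This is an assumption-laden sketch: the pod name and IP in the usage line are taken from the output above and would need to be replaced in another environment.

```shell
# Hedged sketch: poll the Windows pod from the Linux pod until the first failure,
# logging UTC timestamps for each successful probe and for the break itself.
poll_until_failure() {
  local linux_pod="$1" target_ip="$2" interval="${3:-600}"
  while oc exec "$linux_pod" -- curl -s --max-time 10 "http://$target_ip/" >/dev/null; do
    echo "$(date -u +%FT%TZ) reachable"
    sleep "$interval"
  done
  echo "$(date -u +%FT%TZ) FIRST FAILURE"
}
# Usage (names from the output above, probing every 10 minutes):
#   poll_until_failure my-nginx-75897978cd-rd4m2 10.132.0.2 600
```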