Bug 1807193
| Field | Value |
|---|---|
| Summary: | Windows pod unreachable with "No route to host" error |
| Product: | OpenShift Container Platform |
| Component: | Windows Containers |
| Status: | CLOSED WORKSFORME |
| Severity: | unspecified |
| Priority: | unspecified |
| Version: | 4.4 |
| Target Milestone: | --- |
| Target Release: | 4.5.0 |
| Hardware: | Unspecified |
| OS: | Unspecified |
| Reporter: | Sebastian Soto <ssoto> |
| Assignee: | Sebastian Soto <ssoto> |
| QA Contact: | gaoshang <sgao> |
| CC: | anusaxen, aos-bugs, dcbw, gmarkley, rgudimet |
| Doc Type: | If docs needed, set a value |
| Last Closed: | 2020-03-25 05:45:56 UTC |
| Type: | Bug |
Description

Sebastian Soto
2020-02-25 18:59:24 UTC

I'm seeing a number of errors in the ovnkube-master logs in a job where this occurred.

Job it occurred in: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-bootstrapper/159/pull-ci-openshift-windows-machine-config-bootstrapper-master-e2e-wsu/190

Log: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_windows-machine-config-bootstrapper/159/pull-ci-openshift-windows-machine-config-bootstrapper-master-e2e-wsu/190/artifacts/e2e-wsu/pods/openshift-ovn-kubernetes_ovnkube-master-m8lrq_ovnkube-master.log

```
time="2020-02-28T18:06:26Z" level=error msg="k8s.ovn.org/l3-gateway-config annotation not found for node \"ip-10-0-10-167.ec2.internal\""
```

The "k8s.ovn.org/l3-gateway-config annotation not found for node" message should be suppressed; it just means ovnkube couldn't find that annotation and shouldn't affect operation. I've suppressed that message in the hybrid overlay code upstream now.

Created attachment 1667123 [details]
ovnkube-node
Comment on attachment 1667123 [details]
ovnkube-node

```
# oc logs -n openshift-ovn-kubernetes pod/ovnkube-node-grghd -c ovnkube-node
```
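For context on the annotation error in the description: ovnkube looks up the `k8s.ovn.org/l3-gateway-config` annotation on each node object and logs the error when it is absent. Below is a minimal Python sketch of that lookup against a node fetched as JSON (e.g. via `oc get node <name> -o json`). The node object and the annotation value shape shown are illustrative assumptions, not taken from this cluster.

```python
import json

# Hypothetical node object, shaped like `oc get node <name> -o json` output.
# The annotation is commented out to reproduce the "not found" case.
node = {
    "metadata": {
        "name": "ip-10-0-10-167.ec2.internal",
        "annotations": {
            # "k8s.ovn.org/l3-gateway-config": '{"default": {"mode": "shared"}}',
        },
    }
}

def l3_gateway_config(node: dict):
    """Return the parsed l3-gateway-config annotation, or None if absent."""
    raw = node["metadata"].get("annotations", {}).get("k8s.ovn.org/l3-gateway-config")
    return json.loads(raw) if raw is not None else None

cfg = l3_gateway_config(node)
if cfg is None:
    # Mirrors the log line from the description; per the comment above,
    # a missing annotation is harmless and the message was later suppressed.
    print(f'l3-gateway-config annotation not found for node "{node["metadata"]["name"]}"')
```

As the comment above notes, the missing annotation by itself should not affect operation, which is why this message alone does not explain the connectivity failure.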
I have not been able to reproduce this bug using openshift-install-linux-4.4.0-0.nightly-2020-03-02-124231. What I've tried:

- Running the east-west test on the same 2 VMs 10+ times
- Running the WSU and then the east-west test on the same 2 VMs 6 times
- Running the WSU and then the east-west test on new VMs 4 times

Indeed, not reproducible for me as well on 4.4.0-0.nightly-2020-03-04-143604. Shang gao, can you also check in your env on the latest nightly?

(In reply to Anurag saxena from comment #10)
> Indeed, not reproducible for me as well on 4.4.0-0.nightly-2020-03-04-143604. Shang gao, can you also check in your env on the latest nightly?

I think this bug still exists; please see the following steps.

1. Create win-webserver and linux-webserver pods; at first, east-west network testing passed.

```
[root@sgaoos aws]# oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE   IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running   0          81m   10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running   0          82m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>
[root@sgaoos aws]# oc exec my-nginx-75897978cd-rd4m2 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   125  100   125    0     0    543      0 --:--:-- --:--:-- --:--:--   543
<html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.2 callerCount 3 <p>IP 10.132.0.2 callerCount 5 </body></html>
```

2. After more than 3 hours (see the pod AGE below), the same east-west test failed.
```
[root@sgaoos aws]# oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE     IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running   0          3h25m   10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running   0          3h26m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>
[root@sgaoos aws]# oc exec my-nginx-75897978cd-rd4m2 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:02:08 --:--:--     0
curl: (7) Failed to connect to 10.132.0.2 port 80: Connection timed out
command terminated with exit code 7
```

3. Created another linux-webserver pod by editing the deployment; the new pod can still reach win-webserver.

```
[root@sgaoos aws]# oc get pod -o wide
NAME                             READY   STATUS    RESTARTS   AGE    IP            NODE                                         NOMINATED NODE   READINESS GATES
my-nginx-75897978cd-rd4m2        1/1     Running   0          4h     10.131.0.13   ip-10-0-134-251.us-east-2.compute.internal   <none>           <none>
my-nginx-75897978cd-s2ks8        1/1     Running   0          12m    10.128.2.11   ip-10-0-159-2.us-east-2.compute.internal     <none>           <none>
win-webserver-79b64df8b9-chw7f   1/1     Running   0          4h1m   10.132.0.2    ip-10-0-29-113.us-east-2.compute.internal    <none>           <none>
[root@sgaoos aws]# oc exec my-nginx-75897978cd-s2ks8 curl 10.132.0.2
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   158  100   158    0     0    731      0 --:--:-- --:--:-- --:--:--   731
<html><body><H1>Windows Container Web Server</H1><p>IP 10.132.0.2 callerCount 2 <p>IP 10.132.0.2 callerCount 30 <p>IP 10.132.0.2 callerCount 22 </body></html>
```

Maybe something happened to the pod network during these 3 hours that broke the path between the original linux pod and the windows pod. The same failure occurs in the other direction, when the win-webserver pod accesses the linux-webserver pod.
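The failure mode in step 2 (curl hanging for over two minutes before a connection timeout, rather than being refused) suggests packets were silently dropped on the overlay path. A periodic probe that distinguishes "timed out" from "refused" would help pin down when the path breaks. Below is a minimal, self-contained Python sketch of such a probe; the pod IP and port in the comment come from this report, but the demo runs against a local listener so it needs no cluster.

```python
import socket

def probe(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify TCP reachability of host:port as 'ok', 'timeout', or 'unreachable'."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(timeout)
    try:
        s.connect((host, port))
        return "ok"
    except socket.timeout:
        return "timeout"        # packets silently dropped: the symptom seen in step 2
    except OSError:
        return "unreachable"    # e.g. connection refused or no route to host
    finally:
        s.close()

if __name__ == "__main__":
    # In-cluster you would probe the Windows pod, e.g. probe("10.132.0.2", 80),
    # on a schedule and log the first transition away from "ok".
    # Here we demonstrate against a local listener so the sketch is runnable.
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(("127.0.0.1", 0))
    srv.listen(1)
    port = srv.getsockname()[1]
    print(probe("127.0.0.1", port))   # ok
    srv.close()
    print(probe("127.0.0.1", port))   # unreachable (nothing listening anymore)
```

Running a probe like this every few minutes during the 3-hour window would show whether connectivity degrades gradually or cuts off at a specific event, and whether the failure is a drop (timeout) or an active reject (unreachable).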