Bug 1886786
| Summary: | Pods can't reach the azure node IP and service network seems to be not operational | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Michal Fojtik <mfojtik> |
| Component: | Networking | Assignee: | Dan Winship <danw> |
| Networking sub component: | openshift-sdn | QA Contact: | zhaozhanqi <zzhao> |
| Status: | CLOSED DUPLICATE | Docs Contact: | |
| Severity: | urgent | | |
| Priority: | unspecified | CC: | deads |
| Version: | 4.6 | | |
| Target Milestone: | --- | | |
| Target Release: | 4.6.0 | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2020-10-09 14:57:44 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
Michal Fojtik
2020-10-09 12:01:42 UTC
Traffic to the node IPs also fails the same way:
```
W1009 11:12:41.401851 1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.6:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.6:2379: connect: no route to host". Reconnecting...
W1009 11:12:41.401850 1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.8:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.8:2379: connect: no route to host". Reconnecting...
W1009 11:12:41.401891 1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.7:2379 <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.7:2379: connect: no route to host". Reconnecting...
```
We use this connection to write to etcd, so without it the server cannot function. Because we don't yet have Loki on these jobs, we only get "lucky" some of the time and a previous pod log happens to exist containing this information. As we roll out revisions, we lose the logs of previous failures, so this could be more common than the current set of logs implies.
Because we cannot reach either the node IPs or the service IPs, we cannot leave a marker to flag this for CI.
This is just the known systemd-vs-containerized OVS bug (as seen in the OVS logs: "openvswitch is running in container"), which still existed in rc0. It resulted in a cluster that claimed to be working but wasn't, and the subsequent attempt to upgrade it failed before it really even started ("Working towards 4.6.0-rc.1: 1% complete").
*** This bug has been marked as a duplicate of bug 1880591 ***