Description of problem:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1314473755086426112

This is a cluster-bot job running an upgrade from 4.6.0-rc0 to 4.6.0-rc1. There is ~40m of disruption to the OAS, which requires the service network to be operational. The oauth server is crashlooping because:

Error: unable to load configmap based request-header-client-ca-file: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp 172.30.0.1:443: connect: no route to host

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
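For reference, the failing request is an ordinary in-cluster read of the kube-system/extension-apiserver-authentication configmap through the service IP. A minimal sketch of an equivalent lookup (not the oauth-server code itself, and assuming in-cluster service account credentials are available) is:

    // Sketch only: reproduce the failing lookup of the configmap-based
    // request-header CA from inside the cluster. The configmap name and key
    // are the standard ones referenced in the error above.
    package main

    import (
    	"context"
    	"fmt"

    	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    	"k8s.io/client-go/kubernetes"
    	"k8s.io/client-go/rest"
    )

    func main() {
    	// In-cluster config resolves the apiserver via the service IP
    	// (here https://172.30.0.1:443).
    	cfg, err := rest.InClusterConfig()
    	if err != nil {
    		panic(err)
    	}
    	client, err := kubernetes.NewForConfig(cfg)
    	if err != nil {
    		panic(err)
    	}
    	cm, err := client.CoreV1().ConfigMaps("kube-system").Get(
    		context.TODO(), "extension-apiserver-authentication", metav1.GetOptions{})
    	if err != nil {
    		// With the broken service network this is where
    		// "dial tcp 172.30.0.1:443: connect: no route to host" shows up.
    		fmt.Println("lookup failed:", err)
    		return
    	}
    	fmt.Println("requestheader-client-ca-file present:",
    		cm.Data["requestheader-client-ca-file"] != "")
    }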
Also the traffic to the node IPs fails the same way:

W1009 11:12:41.401851       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.6:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.6:2379: connect: no route to host". Reconnecting...
W1009 11:12:41.401850       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.8:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.8:2379: connect: no route to host". Reconnecting...
W1009 11:12:41.401891       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.7:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.7:2379: connect: no route to host". Reconnecting...

We use this connection to write to etcd, so without it the server cannot function.

Because we don't yet have Loki on these jobs, we only get "lucky" some of the time and a previous pod log will exist containing this information. As we roll out revisions, we lose previous failures, so this could be more common than the current set of logs implies.

Because we cannot reach the node IPs and we cannot reach the service IPs, we cannot make a mark to flag this for CI.
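A throwaway probe like the one below (not part of any component; endpoints copied from the logs above) can be run from a pod or node in the affected cluster to confirm that these failures are immediate "no route to host" errors rather than timeouts:

    // Dial the apiserver service IP and the etcd node IPs from the logs above.
    package main

    import (
    	"fmt"
    	"net"
    	"time"
    )

    func main() {
    	endpoints := []string{
    		"172.30.0.1:443", // kube-apiserver service IP
    		"10.0.0.6:2379",  // etcd member node IPs
    		"10.0.0.7:2379",
    		"10.0.0.8:2379",
    	}
    	for _, ep := range endpoints {
    		conn, err := net.DialTimeout("tcp", ep, 5*time.Second)
    		if err != nil {
    			// A working SDN should at least route these; "no route to host"
    			// points at the pod network datapath, not at slow backends.
    			fmt.Printf("%s: %v\n", ep, err)
    			continue
    		}
    		fmt.Printf("%s: connected\n", ep)
    		conn.Close()
    	}
    }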
This is just the known system-vs-containerized OVS bug (as seen in the OVS logs, "openvswitch is running in container"), which still existed in rc0. That resulted in a cluster which claimed to be working but wasn't, and then the attempt to upgrade it failed before it really even started ("Working towards 4.6.0-rc.1: 1% complete").

*** This bug has been marked as a duplicate of bug 1880591 ***