Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1886786

Summary: Pods can't reach the azure node IP and service network seems to be not operational
Product: OpenShift Container Platform
Reporter: Michal Fojtik <mfojtik>
Component: Networking
Assignee: Dan Winship <danw>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: urgent
Priority: unspecified
CC: deads
Version: 4.6
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-09 14:57:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Michal Fojtik 2020-10-09 12:01:42 UTC
Description of problem:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1314473755086426112

This is a cluster-bot job running an upgrade from 4.6.0-rc0 to 4.6.0-rc1. There is ~40m of disruption to OAS, which requires the service network to be operational.

The oauth server is crashlooping because:

Error: unable to load configmap based request-header-client-ca-file: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp 172.30.0.1:443: connect: no route to host
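The "no route to host" failure above can be reproduced with a minimal TCP probe from inside a pod. This is a sketch, not part of the report; the `probe` helper and the use of bash's `/dev/tcp` redirection are assumptions about how one might check reachability of the kubernetes service IP:

```shell
# Sketch (assumed helper, not from the report): minimal TCP probe to
# check whether the kubernetes service IP is reachable from a pod.
# A failure here corresponds to the oauth-server error quoted above.
probe() {
  # Open a TCP connection to $1:$2 with a 2-second timeout; exit
  # status 0 means the endpoint was reachable.
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

if probe 172.30.0.1 443; then
  echo "service network reachable"
else
  echo "service network unreachable"
fi
```

Run inside an affected pod, this would print "service network unreachable" instead of hanging, which is quicker to iterate on than waiting for the crashloop backoff.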



Comment 1 David Eads 2020-10-09 12:04:27 UTC
Also the traffic to the node IPs fails the same way:

W1009 11:12:41.401851       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.6:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.6:2379: connect: no route to host". Reconnecting...
W1009 11:12:41.401850       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.8:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.8:2379: connect: no route to host". Reconnecting...
W1009 11:12:41.401891       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.7:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.7:2379: connect: no route to host". Reconnecting...


We use this to write to etcd, so without that connection the server cannot function.  Because we don't yet have loki on these jobs, we only get "lucky" some of the time and a previous pod log will exist containing this information.  As we roll out revisions, we lose previous failures, so this could be more common than the current set of logs implies.

Because we can reach neither the node IPs nor the service IPs, we cannot record a marker to flag this for CI.
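As a side note, the repeated grpc warnings above can be reduced to the set of unreachable etcd endpoints with a small log filter. This is a sketch; the `unreachable_etcd` helper name is made up:

```shell
# Sketch (helper name made up): reduce repeated grpc dial warnings
# like the ones quoted above to the unique set of unreachable etcd
# endpoints (host:port on the etcd client port 2379).
unreachable_etcd() {
  grep -o 'dial tcp [0-9.]*:2379' | awk '{print $3}' | sort -u
}
```

Piping an apiserver pod log through this yields one line per unreachable member (here 10.0.0.6:2379, 10.0.0.7:2379, 10.0.0.8:2379), which makes it easy to see whether all members or only some are affected.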

Comment 3 Dan Winship 2020-10-09 14:57:44 UTC
This is just the known system-vs-containerized OVS bug (as seen in the OVS logs, "openvswitch is running in container"), which still existed in rc0. That resulted in a cluster which claimed to be working but wasn't, and then the attempt to upgrade it failed before it really even started ("Working towards 4.6.0-rc.1: 1% complete").
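The OVS log message Dan cites can be checked mechanically when triaging similar clusters. A sketch, keyed on the exact message quoted in this comment; the `ovs_mode` helper is hypothetical:

```shell
# Hypothetical helper: classify the OVS deployment mode from a log
# line.  "openvswitch is running in container" is the message quoted
# in comment 3 as the signature of the system-vs-containerized bug.
ovs_mode() {
  case "$1" in
    *"running in container"*) echo containerized ;;
    *)                        echo host ;;
  esac
}
```

Grepping node OVS logs and feeding matching lines through this distinguishes clusters hit by the duplicate bug (containerized) from healthy ones (host).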

*** This bug has been marked as a duplicate of bug 1880591 ***