Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.

Bug 1886786

Summary: Pods can't reach the azure node IP and service network seems to be not operational
Product: OpenShift Container Platform
Reporter: Michal Fojtik <mfojtik>
Component: Networking
Assignee: Dan Winship <danw>
Networking sub component: openshift-sdn
QA Contact: zhaozhanqi <zzhao>
Status: CLOSED DUPLICATE
Docs Contact:
Severity: urgent
Priority: unspecified
CC: deads
Version: 4.6
Target Milestone: ---
Target Release: 4.6.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-10-09 14:57:44 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Michal Fojtik 2020-10-09 12:01:42 UTC
Description of problem:

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-launch-azure/1314473755086426112

This is a cluster-bot job running an upgrade from 4.6.0-rc0 to 4.6.0-rc1. There is ~40m of disruption to OAS, which requires the service network to be operational.

The oauth server is crashlooping because:

Error: unable to load configmap based request-header-client-ca-file: Get "https://172.30.0.1:443/api/v1/namespaces/kube-system/configmaps/extension-apiserver-authentication": dial tcp 172.30.0.1:443: connect: no route to host
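The "no route to host" failure above can be reproduced with a minimal TCP probe from inside a pod. This is a sketch, not part of the report; the `probe` helper and the use of bash's `/dev/tcp` redirection are assumptions about how one might check reachability of the kubernetes service IP:

```shell
# Sketch (assumed helper, not from the report): minimal TCP probe to
# check whether the kubernetes service IP is reachable from a pod.
# A failure here corresponds to the oauth-server error quoted above.
probe() {
  # Open a TCP connection to $1:$2 with a 2-second timeout; exit
  # status 0 means the endpoint was reachable.
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

if probe 172.30.0.1 443; then
  echo "service network reachable"
else
  echo "service network unreachable"
fi
```

Run inside an affected pod, this would print "service network unreachable" instead of hanging, which is quicker to iterate on than waiting for the crashloop backoff.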



Comment 1 David Eads 2020-10-09 12:04:27 UTC
Also the traffic to the node IPs fails the same way:

W1009 11:12:41.401851       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.6:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.6:2379: connect: no route to host". Reconnecting...
W1009 11:12:41.401850       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.8:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.8:2379: connect: no route to host". Reconnecting...
W1009 11:12:41.401891       1 clientconn.go:1223] grpc: addrConn.createTransport failed to connect to {https://10.0.0.7:2379  <nil> 0 <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.0.0.7:2379: connect: no route to host". Reconnecting...


We use this to write to etcd, so without that connection the server cannot function.  Because we don't yet have loki on these jobs, we only get "lucky" some of the time and a previous pod log will exist containing this information.  As we roll out revisions, we lose previous failures, so this could be more common than the current set of logs implies.

Because we can reach neither the node IPs nor the service IPs, we cannot record a marker to flag this for CI.
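As a side note, the repeated grpc warnings above can be reduced to the set of unreachable etcd endpoints with a small log filter. This is a sketch; the `unreachable_etcd` helper name is made up:

```shell
# Sketch (helper name made up): reduce repeated grpc dial warnings
# like the ones quoted above to the unique set of unreachable etcd
# endpoints (host:port on the etcd client port 2379).
unreachable_etcd() {
  grep -o 'dial tcp [0-9.]*:2379' | awk '{print $3}' | sort -u
}
```

Piping an apiserver pod log through this yields one line per unreachable member (here 10.0.0.6:2379, 10.0.0.7:2379, 10.0.0.8:2379), which makes it easy to see whether all members or only some are affected.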

Comment 3 Dan Winship 2020-10-09 14:57:44 UTC
This is just the known system-vs-containerized OVS bug (as seen in the OVS logs, "openvswitch is running in container"), which still existed in rc0. That resulted in a cluster which claimed to be working but wasn't, and then the attempt to upgrade it failed before it really even started ("Working towards 4.6.0-rc.1: 1% complete").
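The OVS log message Dan cites can be checked mechanically when triaging similar clusters. A sketch, keyed on the exact message quoted in this comment; the `ovs_mode` helper is hypothetical:

```shell
# Hypothetical helper: classify the OVS deployment mode from a log
# line.  "openvswitch is running in container" is the message quoted
# in comment 3 as the signature of the system-vs-containerized bug.
ovs_mode() {
  case "$1" in
    *"running in container"*) echo containerized ;;
    *)                        echo host ;;
  esac
}
```

Grepping node OVS logs and feeding matching lines through this distinguishes clusters hit by the duplicate bug (containerized) from healthy ones (host).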

*** This bug has been marked as a duplicate of bug 1880591 ***