Bug 1715515 - Upgrade CI jobs are failing on networking problems (connect: no route to host)
Summary: Upgrade CI jobs are failing on networking problems (connect: no route to host)
Keywords:
Status: CLOSED DUPLICATE of bug 1714699
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Networking
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Target Release: 4.2.0
Assignee: Casey Callendrello
QA Contact: zhaozhanqi
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2019-05-30 14:23 UTC by Petr Muller
Modified: 2019-05-30 16:08 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-05-30 16:08:52 UTC
Target Upstream Version:
Embargoed:



Description Petr Muller 2019-05-30 14:23:31 UTC
Description of problem:

Since ~19:00 CEST on May 29, the release upgrade CI jobs have been failing frequently. The most recent failure is https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade/2053

The failing test is

[Disruptive] Cluster upgrade should maintain a functioning cluster [Feature:ClusterUpgrade] [Suite:openshift] [Serial]


It fails with a "Cluster did not complete upgrade: timed out waiting for the condition" error, and the logs are littered with messages like `dial tcp 172.30.0.1:443: connect: no route to host`.

There is a Slack thread with some ongoing investigation: https://coreos.slack.com/archives/CEKNRGF25/p1559194633000200
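
For quick triage, a minimal stand-alone probe along the following lines can be run from an affected node or pod to see whether the dial to the apiserver service IP fails immediately with "no route to host" (missing routes/OVS flows) or just times out. This is only a sketch; the service IP 172.30.0.1:443 is taken from the log messages above and the probe is not part of the CI suite.

package main

import (
	"fmt"
	"net"
	"time"
)

// Probe the in-cluster apiserver service IP seen in the failing upgrade logs.
// A "connect: no route to host" error here points at missing routes/OVS flows
// on the node, while a plain timeout points at a different failure mode.
func main() {
	conn, err := net.DialTimeout("tcp", "172.30.0.1:443", 5*time.Second)
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("dial succeeded")
}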

Comment 1 Dan Williams 2019-05-30 16:08:52 UTC
Likely a dupe of https://bugzilla.redhat.com/show_bug.cgi?id=1714699, because:

I0530 11:20:44.815682       1 event.go:209] Event(v1.ObjectReference{Kind:"DaemonSet", Namespace:"openshift-sdn", Name:"ovs", UID:"9ae4bdee-82c9-11e9-bd3e-12c64ec43b90", APIVersion:"apps/v1", ResourceVersion:"28327", FieldPath:""}): type: 'Normal' reason: 'SuccessfulDelete' Deleted pod: ovs-vk475
E0530 11:21:03.511611       1 resource_quota_controller.go:414] unable to retrieve the complete list of server APIs: authorization.openshift.io/v1: Get https://localhost:6443/apis/authorization.openshift.io/v1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers), build.openshift.io/v1: Get https://localhost:6443/apis/build.openshift.io/v1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers), oauth.openshift.io/v1: Get https://localhost:6443/apis/oauth.openshift.io/v1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers), route.openshift.io/v1: Get https://localhost:6443/apis/route.openshift.io/v1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers), security.openshift.io/v1: Get https://localhost:6443/apis/security.openshift.io/v1?timeout=32s: net/http: request canceled (Client.Timeout exceeded while awaiting headers)

The node on which kube-controller-manager is running deletes its OVS pod and then can no longer talk to the apiserver on localhost (*not* through the SDN proxy). A new OVS pod is never created for that node.
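
To confirm which nodes are left without an OVS pod after the DaemonSet delete, a client-go sketch along these lines can be used. The "app=ovs" label selector and the default kubeconfig path are assumptions for illustration, not values taken from this bug.

package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the default kubeconfig; adjust as needed.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	nodes, err := client.CoreV1().Nodes().List(ctx, metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	// "app=ovs" is the assumed label on the openshift-sdn OVS DaemonSet pods.
	pods, err := client.CoreV1().Pods("openshift-sdn").List(ctx,
		metav1.ListOptions{LabelSelector: "app=ovs"})
	if err != nil {
		panic(err)
	}

	hasOVS := map[string]bool{}
	for _, p := range pods.Items {
		hasOVS[p.Spec.NodeName] = true
	}
	for _, n := range nodes.Items {
		if !hasOVS[n.Name] {
			fmt.Printf("node %s has no OVS pod\n", n.Name)
		}
	}
}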

*** This bug has been marked as a duplicate of bug 1714699 ***

