Bug 1687881 - UPGRADE Automated upgrade tests have never passed
Summary: UPGRADE Automated upgrade tests have never passed
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cluster Version Operator
Version: 4.1.0
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
: 4.1.0
Assignee: Clayton Coleman
QA Contact: liujia
Duplicates: 1683648
Depends On:
Blocks: 1664187
Reported: 2019-03-12 14:41 UTC by Clayton Coleman
Modified: 2019-06-04 10:45 UTC
CC List: 12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Last Closed: 2019-06-04 10:45:33 UTC
Target Upstream Version:


System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2019:0758 0 None None None 2019-06-04 10:45:41 UTC

Description Clayton Coleman 2019-03-12 14:41:31 UTC
The new automated upgrade tests are failing due to what appears to be a certificate rotation / network connectivity issue.


During upgrade a number of issues crop up, but one of the root causes is that etcd appears to be unreachable after the upgrade.

2019-03-11 12:22:59.676142 I | embed: rejected connection from "" (error "tls: failed to verify client's certificate: x509: certificate specifies an incompatible key usage", ServerName "")
WARNING: 2019/03/11 12:22:59 Failed to dial connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
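For illustration (not part of the original report): the "incompatible key usage" error means the certificate presented to etcd carries an extendedKeyUsage that does not permit the role it is being used for (a client certificate needs clientAuth). A throwaway self-signed cert shows how to inspect this with openssl; the paths and CN below are made up, and this only reproduces the inspection step, not the rotation bug itself:

```shell
# Generate a throwaway cert whose extendedKeyUsage is serverAuth only,
# i.e. a cert that would be rejected if presented as a *client* cert.
# (Requires OpenSSL 1.1.1+ for -addext.)
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/key.pem \
  -out /tmp/cert.pem -days 1 -subj "/CN=demo" \
  -addext "extendedKeyUsage=serverAuth"

# Inspect the extended key usage; a cert missing "TLS Web Client
# Authentication" here triggers etcd's "incompatible key usage" rejection.
openssl x509 -in /tmp/cert.pem -noout -text | grep -A1 "Extended Key Usage"
```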

Until e2e upgrade jobs have passed more than once, this issue will remain open and top priority.

Comment 1 Dan Winship 2019-03-13 00:09:06 UTC
I'm looking at a missing OVS flow problem right now. I'll either reassign this bug to myself or file a new bug blocking this one, once I figure out whether that's the entire problem.

Comment 2 Dan Winship 2019-03-13 02:28:48 UTC
Clayton, would it be possible to make e2e-aws-upgrade grab a set of logs from the cluster immediately before kicking off the upgrade? E.g., in https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/22302/pull-ci-openshift-origin-master-e2e-aws-upgrade/2/, the cluster-network-operator log starts at 01:28:59, but the cluster was clearly running before that (CNO's first update to its operator status marks it as "Available: True").

Also, in this upgrade, it appears that none of the SDN pods were restarted (and, possibly as a result of that, the test passed). What exactly does the upgrade test do? It seems like it ought to fake an update of every image...

Comment 4 Clayton Coleman 2019-03-14 18:04:39 UTC
The bug you hit with the e2e-aws-upgrade PR job was fixed.  Will follow up on other bugs.

Comment 5 Anurag saxena 2019-03-14 20:28:39 UTC
Hi Clayton, 

Is it expected during the upgrade that the "oc get clusterversion" VERSION column should report the version being upgraded to? I believe it should show the old version until the new version has been rolled out successfully.

Existing version on cluster:
$ oc get clusteroperators.config.openshift.io | grep "NAME\|network"
NAME                                  VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
network                               4.0.0-0.nightly-2019-03-13-233958   True        False         False     15m

After running oc adm upgrade:

# oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-03-14-150906   True        True          6m6s    Working towards 4.0.0-0.ci-2019-03-14-150906: 9% complete
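As an aside (not from the original thread): the ClusterVersion status distinguishes the in-flight target from the last completed update via status.history, where a "Partial" entry is the version being worked towards and the newest "Completed" entry is the version actually rolled out. A minimal sketch, using the two versions above as canned history data rather than a live cluster:

```shell
# Canned history, newest entry first: state then version per line.
# (Field names assumed from the ClusterVersion API; no live cluster needed.)
history='Partial 4.0.0-0.ci-2019-03-14-150906
Completed 4.0.0-0.nightly-2019-03-13-233958'

# Version the cluster is working towards (the unfinished update, if any):
target=$(printf '%s\n' "$history" | awk '$1=="Partial"{print $2; exit}')

# Last fully rolled-out version (what is actually running):
current=$(printf '%s\n' "$history" | awk '$1=="Completed"{print $2; exit}')

echo "upgrading to: $target (last completed: $current)"
```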


Comment 6 W. Trevor King 2019-03-15 23:48:29 UTC
*** Bug 1683648 has been marked as a duplicate of this bug. ***

Comment 8 errata-xmlrpc 2019-06-04 10:45:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

