Bug 1687881

Summary:	UPGRADE Automated upgrade tests have never passed
Product:	OpenShift Container Platform	Reporter:	Clayton Coleman <ccoleman>
Component:	Cluster Version Operator	Assignee:	Clayton Coleman <ccoleman>
Status:	CLOSED ERRATA	QA Contact:	liujia <jiajliu>
Severity:	urgent	Docs Contact:
Priority:	urgent
Version:	4.1.0	CC:	anusaxen, aos-bugs, danw, dcaldwel, decarr, jokerman, mmccomas, sjenning, sponnaga, trankin, vrutkovs, wking
Target Milestone:	---
Target Release:	4.1.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2019-06-04 10:45:33 UTC	Type:	Bug
Bug Blocks:	1664187

Description Clayton Coleman 2019-03-12 14:41:31 UTC

The new automated upgrade tests are failing due to what appears to be a certificate rotation / network connectivity issue.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/12

During upgrade a number of issues crop up, but one of the root issues is that etcd appears to be unreachable after upgrade.

2019-03-11 12:22:59.676142 I | embed: rejected connection from "127.0.0.1:52746" (error "tls: failed to verify client's certificate: x509: certificate specifies an incompatible key usage", ServerName "")
WARNING: 2019/03/11 12:22:59 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.

Until the e2e upgrade jobs have passed more than once, this issue will remain open and top priority.
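
For readers unfamiliar with the "incompatible key usage" rejection in the etcd log above: one common way to get this message is a client presenting a certificate whose extended key usage (EKU) list does not allow client authentication. This report does not confirm that this is what happened here. The sketch below only illustrates the general rule, assuming the usual convention that a certificate with no EKU list is unrestricted. The function and constant names are hypothetical and are not taken from etcd or any TLS library.

```python
# Illustrative sketch only; names are hypothetical, not an etcd or TLS library API.
CLIENT_AUTH = "clientAuth"
SERVER_AUTH = "serverAuth"
ANY_USAGE = "anyExtendedKeyUsage"


def permits_usage(cert_ekus, wanted):
    """Return True if a certificate with the given EKU list may be used for `wanted`.

    Assumption: an empty EKU list means the certificate is not restricted.
    """
    if not cert_ekus:
        return True
    return wanted in cert_ekus or ANY_USAGE in cert_ekus


# A serving-only certificate presented as a client certificate is rejected.
serving_only = [SERVER_AUTH]
# A certificate carrying both usages can act as either side of the connection.
dual_use = [SERVER_AUTH, CLIENT_AUTH]

print(permits_usage(serving_only, CLIENT_AUTH))
print(permits_usage(dual_use, CLIENT_AUTH))
```

Under this rule, a process that dials etcd with a certificate issued only for serving would fail the handshake in the way the two log lines show, with the server logging the x509 error and the client seeing "bad certificate".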

Comment 1 Dan Winship 2019-03-13 00:09:06 UTC

I'm looking at a missing OVS flow problem right now... I'll either reassign this bug to myself or else file a new bug blocking this one once I figure out if that's the entire problem.

Comment 2 Dan Winship 2019-03-13 02:28:48 UTC

Clayton, would it be possible to make e2e-aws-upgrade grab a set of logs from the cluster immediately before kicking off the upgrade? E.g., in https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/22302/pull-ci-openshift-origin-master-e2e-aws-upgrade/2/, the cluster-network-operator log starts at 01:28:59, but the cluster was clearly running before that (CNO's first update to its operator status marks it as "Available: True").

Also, in this upgrade, it appears that none of the SDN pods were restarted (and, possibly as a result of that, the test passed). What exactly does the upgrade test do? It seems like it ought to fake an update of every image...

Comment 3 Clayton Coleman 2019-03-14 18:04:03 UTC

Fixed by https://github.com/openshift/origin/pull/22302

https://openshift-release.svc.ci.openshift.org/releasestream/4.0.0-0.ci/release/4.0.0-0.ci-2019-03-14-150906
https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/152

Comment 4 Clayton Coleman 2019-03-14 18:04:39 UTC

The bug you hit with the e2e-aws-upgrade PR job was fixed. I will follow up on the other bugs.

Comment 5 Anurag saxena 2019-03-14 20:28:39 UTC

Hi Clayton, 

Is it expected that, during the upgrade, the VERSION column of "oc get clusterversion" reports the version being upgraded to? I believe it should show the old version until the upgrade to the new version has completed successfully.

Existing version on cluster:
$ oc get clusteroperators.config.openshift.io | grep "NAME\|network"
NAME                                  VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
network                               4.0.0-0.nightly-2019-03-13-233958   True        False         False     15m

After oc adm upgrade:

# oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-03-14-150906   True        True          6m6s    Working towards 4.0.0-0.ci-2019-03-14-150906: 9% complete


//Anurag
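
The observation in the comment above can be checked mechanically. The sketch below is a minimal example that splits the pasted "oc get clusterversion" row into its columns and compares the VERSION column with the target named in the STATUS message. The status format is inferred only from the single line quoted above, so treating it as stable is an assumption, and the helper name is hypothetical.

```python
import re

# Format inferred from the one STATUS message quoted in the comment (an assumption).
STATUS_RE = re.compile(r"Working towards (?P<target>\S+): (?P<percent>\d+)% complete")


def parse_clusterversion_row(row):
    """Split one data row of `oc get clusterversion` output into named fields.

    Hypothetical helper: the first five columns are whitespace-separated and
    the rest of the line is the free-text STATUS message.
    """
    name, version, available, progressing, since, status = row.split(None, 5)
    match = STATUS_RE.search(status)
    target = match.group("target") if match else None
    percent = int(match.group("percent")) if match else None
    return {
        "name": name,
        "version": version,
        "progressing": progressing == "True",
        "target": target,
        "percent": percent,
    }


row = (
    "version   4.0.0-0.ci-2019-03-14-150906   True        True          6m6s    "
    "Working towards 4.0.0-0.ci-2019-03-14-150906: 9% complete"
)
info = parse_clusterversion_row(row)
# True here means VERSION already shows the target while the upgrade is in progress.
print(info["progressing"] and info["version"] == info["target"])
```

For the row quoted above, the VERSION column and the target in the STATUS message are the same string while PROGRESSING is still True, which is the behavior being asked about.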

Comment 6 W. Trevor King 2019-03-15 23:48:29 UTC

*** Bug 1683648 has been marked as a duplicate of this bug. ***

Comment 8 errata-xmlrpc 2019-06-04 10:45:33 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758