Bug 1687881

Summary: UPGRADE Automated upgrade tests have never passed
Product: OpenShift Container Platform
Component: Cluster Version Operator
Version: 4.1.0
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
Reporter: Clayton Coleman <ccoleman>
Assignee: Clayton Coleman <ccoleman>
QA Contact: liujia <jiajliu>
CC: anusaxen, aos-bugs, danw, dcaldwel, decarr, jokerman, mmccomas, sjenning, sponnaga, trankin, vrutkovs, wking
Target Release: 4.1.0
Hardware: Unspecified
OS: Unspecified
Type: Bug
Last Closed: 2019-06-04 10:45:33 UTC
Bug Blocks: 1664187

Description Clayton Coleman 2019-03-12 14:41:31 UTC
The new automated upgrade tests are failing due to what appears to be a certificate rotation / network connectivity issue.

https://openshift-gce-devel.appspot.com/build/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.0/12

During upgrade a number of issues crop up, but one of the root issues is that etcd appears to be unreachable after upgrade.

2019-03-11 12:22:59.676142 I | embed: rejected connection from "127.0.0.1:52746" (error "tls: failed to verify client's certificate: x509: certificate specifies an incompatible key usage", ServerName "")
WARNING: 2019/03/11 12:22:59 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.

Until e2e upgrade jobs have passed more than once this issue will remain open and top priority.
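
For triage across many upgrade runs, the etcd rejection quoted above can be pulled out of the logs mechanically. The following is a minimal sketch, not part of the CI job: the regular expression and the function name parse_rejected are assumptions made for illustration, inferred only from the single log line in this report rather than from etcd's source.

```python
import re

# Pattern for etcd "embed: rejected connection" lines as quoted in this bug.
# Assumption: the format is inferred from that one sample, not from etcd's source.
REJECTED = re.compile(
    r'rejected connection from "(?P<peer>[^"]+)" '
    r'\(error "(?P<error>.*)", ServerName "(?P<server_name>[^"]*)"\)'
)


def parse_rejected(line):
    """Return peer, error and server_name for a rejected-connection line, else None."""
    m = REJECTED.search(line)
    return m.groupdict() if m else None


sample = (
    '2019-03-11 12:22:59.676142 I | embed: rejected connection from '
    '"127.0.0.1:52746" (error "tls: failed to verify client\'s certificate: '
    'x509: certificate specifies an incompatible key usage", ServerName "")'
)
info = parse_rejected(sample)
print(info["peer"], "->", info["error"])
```

Grouping the extracted error strings per run would show whether the "incompatible key usage" rejection appears in every failed upgrade or only in some of them.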

Comment 1 Dan Winship 2019-03-13 00:09:06 UTC
I'm looking at a missing OVS flow problem right now... I'll either reassign this bug to myself or else file a new bug blocking this one once I figure out if that's the entire problem.

Comment 2 Dan Winship 2019-03-13 02:28:48 UTC
Clayton, would it be possible to make e2e-aws-upgrade grab a set of logs from the cluster immediately before kicking off the upgrade? E.g., in https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/22302/pull-ci-openshift-origin-master-e2e-aws-upgrade/2/, the cluster-network-operator log starts at 01:28:59, but the cluster was clearly running before that (CNO's first update to its operator status marks it as "Available: True").

Also, in this upgrade, it appears that none of the SDN pods were restarted (and, possibly as a result of that, the test passed). What exactly does the upgrade test do? It seems like it ought to fake an update of every image...

Comment 4 Clayton Coleman 2019-03-14 18:04:39 UTC
The bug you hit with the e2e-aws-upgrade PR job was fixed. I will follow up on the other bugs.

Comment 5 Anurag saxena 2019-03-14 20:28:39 UTC
Hi Clayton, 

Is it expected that, during the upgrade, the VERSION column of "oc get clusterversion" reports the version being upgraded to? I believe it should show the old version until the upgrade to the new version has completed successfully.

Existing version on the cluster:
$ oc get clusteroperators.config.openshift.io | grep "NAME\|network"
NAME                                  VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
network                               4.0.0-0.nightly-2019-03-13-233958   True        False         False     15m

After oc adm upgrade,

# oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-03-14-150906   True        True          6m6s    Working towards 4.0.0-0.ci-2019-03-14-150906: 9% complete


//Anurag
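
The question in comment 5 (does VERSION already show the target while the upgrade is still progressing?) can be checked mechanically against captured output. This is a minimal sketch under stated assumptions: the column layout and the "Working towards <version>: <N>% complete" status text are taken only from the single sample row above, the function name parse_clusterversion_row is hypothetical, and rows with an empty SINCE column or a different status string are not handled.

```python
import re

# Assumption: STATUS has the form "Working towards <version>: <N>% complete",
# as in the single sample from comment 5; other status strings yield None fields.
PROGRESS = re.compile(r"Working towards (?P<target>\S+): (?P<percent>\d+)% complete")


def parse_clusterversion_row(row):
    """Split one data row of `oc get clusterversion` output into its columns."""
    # The first five columns contain no spaces; STATUS is the free-text remainder.
    name, version, available, progressing, since, status = row.split(None, 5)
    m = PROGRESS.search(status)
    return {
        "version": version,
        "progressing": progressing == "True",
        "target": m.group("target") if m else None,
        "percent": int(m.group("percent")) if m else None,
    }


row = (
    "version   4.0.0-0.ci-2019-03-14-150906   True        True          6m6s    "
    "Working towards 4.0.0-0.ci-2019-03-14-150906: 9% complete"
)
state = parse_clusterversion_row(row)
# A true result means VERSION already names the upgrade target while still progressing.
print(state["progressing"] and state["version"] == state["target"])
```

Run against the row quoted in comment 5, this reports the behavior being asked about: the VERSION column matches the target release while PROGRESSING is still True.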

Comment 6 W. Trevor King 2019-03-15 23:48:29 UTC
*** Bug 1683648 has been marked as a duplicate of this bug. ***

Comment 8 errata-xmlrpc 2019-06-04 10:45:33 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:0758