The new automated upgrade tests are failing due to what appears to be a certificate rotation / network connectivity issue.
A number of issues crop up during the upgrade, but one root cause is that etcd appears to be unreachable afterwards:
2019-03-11 12:22:59.676142 I | embed: rejected connection from "127.0.0.1:52746" (error "tls: failed to verify client's certificate: x509: certificate specifies an incompatible key usage", ServerName "")
WARNING: 2019/03/11 12:22:59 Failed to dial 0.0.0.0:2379: connection error: desc = "transport: authentication handshake failed: remote error: tls: bad certificate"; please retry.
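The "incompatible key usage" error above usually means the certificate presented to etcd lacks the clientAuth extended key usage. As a quick sanity check, openssl can print a cert's EKU directly; the sketch below generates a throwaway cert with only serverAuth to show what a mismatched EKU looks like (the /tmp paths and the cert itself are illustrative, not the cluster's real rotated certs):

```shell
# Generate a throwaway key + self-signed cert that carries ONLY the
# serverAuth extended key usage (requires OpenSSL 1.1.1+ for -addext).
openssl req -x509 -newkey rsa:2048 -nodes -keyout /tmp/key.pem \
  -out /tmp/cert.pem -days 1 -subj "/CN=test" \
  -addext "extendedKeyUsage=serverAuth"

# Print the Extended Key Usage. A cert used for client authentication must
# list "TLS Web Client Authentication"; if it only lists the server usage,
# etcd rejects it with "certificate specifies an incompatible key usage".
openssl x509 -in /tmp/cert.pem -noout -ext extendedKeyUsage
```

Running the same `openssl x509` inspection against the actual client certs on a broken master should show whether rotation produced certs with the wrong EKU.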
Until the e2e upgrade jobs have passed more than once, this issue will remain open and top priority.
I'm looking at a missing OVS flow problem right now. I'll either reassign this bug to myself or file a new bug blocking this one, once I figure out whether that's the entire problem.
Clayton, would it be possible to make e2e-aws-upgrade grab a set of logs from the cluster immediately before kicking off the upgrade? E.g., in https://gcsweb-ci.svc.ci.openshift.org/gcs/origin-ci-test/pr-logs/pull/22302/pull-ci-openshift-origin-master-e2e-aws-upgrade/2/, the cluster-network-operator log starts at 01:28:59, but the cluster was clearly running before that (the CNO's first update to its operator status marks it as "Available: True").
Also, in this upgrade, it appears that none of the SDN pods were restarted (and, possibly as a result of that, the test passed). What exactly does the upgrade test do? It seems like it ought to fake an update of every image...
Fixed by https://github.com/openshift/origin/pull/22302
The bug you hit with the e2e-aws-upgrade PR job was fixed. Will follow up on the other bugs.
Is it expected that during the upgrade, the "oc get clusterversion" VERSION column reports the version being upgraded to? I believe it should show the old version until the new version is upgraded successfully.
Existing version on cluster:
$ oc get clusteroperators.config.openshift.io | grep "NAME\|network"
NAME      VERSION                             AVAILABLE   PROGRESSING   FAILING   SINCE
network   4.0.0-0.nightly-2019-03-13-233958   True        False         False     15m
After running "oc adm upgrade":
# oc get clusterversion
NAME      VERSION                        AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.0.0-0.ci-2019-03-14-150906   True        True          6m6s    Working towards 4.0.0-0.ci-2019-03-14-150906: 9% complete
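For context on why the VERSION column can show the new number mid-upgrade: the ClusterVersion status tracks both a desired version and a history of completed updates. The sketch below uses a hand-written sample JSON (values invented to mirror the output above, not pulled from a real cluster) to show the two fields `oc get clusterversion` is summarizing:

```shell
# Sample ClusterVersion status, mid-upgrade: status.desired.version is
# already the target, while the most recent Completed entry in
# status.history is the version the cluster was actually running.
cat > /tmp/cv.json <<'EOF'
{
  "status": {
    "desired": { "version": "4.0.0-0.ci-2019-03-14-150906" },
    "history": [
      { "state": "Partial",   "version": "4.0.0-0.ci-2019-03-14-150906" },
      { "state": "Completed", "version": "4.0.0-0.nightly-2019-03-13-233958" }
    ]
  }
}
EOF

# Target of the in-progress upgrade:
jq -r '.status.desired.version' /tmp/cv.json

# Last version that completed successfully (the "old" version):
jq -r '[.status.history[] | select(.state=="Completed")][0].version' /tmp/cv.json
```

On a live cluster the same fields can be read with `oc get clusterversion version -o json`; judging by the output above, the VERSION column in this build follows the desired version rather than the last completed one.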
*** Bug 1683648 has been marked as a duplicate of this bug. ***
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.