Bug 1819492

Summary: invalid apiserver certificates causing large blocks of test failures on vsphere
Product: OpenShift Container Platform Reporter: Ben Parees <bparees>
Component: Installer Assignee: Joseph Callen <jcallen>
Installer sub component: openshift-installer QA Contact: jima
Status: CLOSED ERRATA Docs Contact:
Severity: high    
Priority: high CC: aos-bugs, dphillip, jcallen, jima, kgarriso, mfojtik, sdodson
Version: 4.4   
Target Milestone: ---   
Target Release: 4.5.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1824991 Environment:
Last Closed: 2020-07-13 17:24:27 UTC Type: Bug
Bug Depends On:    
Bug Blocks: 1824991    

Description Ben Parees 2020-04-01 01:25:02 UTC
Description of problem:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1358

and

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1357

Both experienced a large number of test failures.  Most of the failures report errors like:

Unable to connect to the server: x509: certificate is valid for api.ci-op-4byd7z0v-3858a.origin-ci-int-aws.dev.rhcloud.com, not api.ci-op-7zg3gn6s-e99c3.origin-ci-int-aws.dev.rhcloud.com  (job 1357)

and

error: failed to discover supported resources: Get https://api.ci-op-3jpl538d-e99c3.origin-ci-int-aws.dev.rhcloud.com:6443/apis/batch/v1?timeout=32s: x509: certificate is valid for kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, openshift, openshift.default, openshift.default.svc, openshift.default.svc.cluster.local, not api.ci-op-3jpl538d-e99c3.origin-ci-int-aws.dev.rhcloud.com  (job 1358)
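
The two messages differ in which names the presented certificate covers, and that difference can be extracted mechanically when triaging job logs. The following is a minimal Python sketch, not part of the CI tooling: the function names, the triage labels, and the regular expression are assumptions derived only from the two messages quoted above.

```python
import re

# Pattern assumed from the two messages quoted in this report:
# "x509: certificate is valid for <name>[, <name>...], not <requested name>"
_MISMATCH = re.compile(
    r"x509: certificate is valid for (?P<valid>.+?), not (?P<asked>\S+)"
)


def parse_x509_mismatch(message):
    """Return (valid_names, requested_name), or None if the message does not match."""
    match = _MISMATCH.search(message)
    if match is None:
        return None
    valid_names = [name.strip() for name in match.group("valid").split(",")]
    return valid_names, match.group("asked")


def classify(message):
    """Hypothetical triage labels.

    'other-cluster': the certificate names some api.* host (as in job 1357).
    'internal-only': the certificate lists only service names (as in job 1358).
    'no-mismatch':   the message is not a hostname mismatch at all.
    """
    parsed = parse_x509_mismatch(message)
    if parsed is None:
        return "no-mismatch"
    valid_names, _requested = parsed
    if any(name.startswith("api.") for name in valid_names):
        return "other-cluster"
    return "internal-only"
```

Calling `classify` on each failing test's output would separate the two failure shapes seen in jobs 1357 and 1358.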

Version-Release number of selected component (if applicable):
4.4


The failures seem to be specific to the vSphere UPI jobs; you can see incidents here:

https://search.svc.ci.openshift.org/?search=x509%3A+certificate+is+valid+for+kubernetes&maxAge=48h&context=0&type=bug%2Bjunit
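
For reference, the query string in that link can be rebuilt programmatically, which helps when searching for the other error variant. This is a minimal Python sketch: the parameter names (`search`, `maxAge`, `context`, `type`) are taken from the URL above, while the function name and its defaults are assumptions.

```python
from urllib.parse import urlencode

SEARCH_BASE = "https://search.svc.ci.openshift.org/"


def ci_search_url(query, max_age="48h", context=0, result_type="bug+junit"):
    """Build a CI search URL using the query parameters seen in the link above."""
    params = {
        "search": query,      # free-text search string
        "maxAge": max_age,    # how far back to look
        "context": context,   # lines of context around each hit
        "type": result_type,  # result kinds to include
    }
    # urlencode percent-encodes ':' and '+' and turns spaces into '+'.
    return SEARCH_BASE + "?" + urlencode(params)
```

Passing `"x509: certificate is valid for kubernetes"` as the query reproduces the link above; passing `"x509: certificate is valid for api."` would look for the job 1357 variant.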

Comment 1 Stefan Schimanski 2020-04-01 07:34:01 UTC
This looks like some components connect to the API with an unknown DNS name. The API server uses the normal SNI mechanism to select the right cert.

  api.ci-op-4byd7z0v-3858a.origin-ci-int-aws.dev.rhcloud.com, not api.ci-op-7zg3gn6s-e99c3.origin-ci-int-aws.dev.rhcloud.com

This means the internal LB name changes during execution. This is worrisome and most probably a UPI platform issue.
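
As a rough illustration of how SNI-based selection can produce both error shapes, here is a toy Python model. It is not the API server's implementation: the function names, the `example.com` hostnames, and the fallback to a default certificate for unknown names are assumptions made for the sketch, and the hostname check is exact-match only (real verification also handles wildcards and IP addresses).

```python
def select_certificate(sni_name, named_certs, default_cert):
    """Return the certificate registered for the SNI name, else the default one.

    A certificate is modelled as the list of DNS names it is valid for.
    """
    return named_certs.get(sni_name, default_cert)


def verify_hostname(requested, cert_names):
    """Return None if the name is covered, else an error string in the reported style."""
    if requested in cert_names:
        return None
    return "x509: certificate is valid for {}, not {}".format(
        ", ".join(cert_names), requested
    )


# Hypothetical cluster A: one named certificate plus a default for unknown names.
DEFAULT_CERT = ["kubernetes", "kubernetes.default", "openshift"]
CLUSTER_A_CERTS = {"api.cluster-a.example.com": ["api.cluster-a.example.com"]}


def connect(requested_name, named_certs=CLUSTER_A_CERTS, default_cert=DEFAULT_CERT):
    """Simulate a client that reaches cluster A while asking for requested_name."""
    cert = select_certificate(requested_name, named_certs, default_cert)
    return verify_hostname(requested_name, cert)
```

In this model, a client that asks for `api.cluster-b.example.com` but is routed to cluster A receives the default certificate and fails with the list of internal service names, which matches the shape of the job 1358 error. The job 1357 shape, where another cluster's `api.` name appears, would require the server to present that other cluster's certificate.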

Comment 3 Scott Dodson 2020-04-02 17:47:50 UTC
This is a mixup in CI jobs and not a bug customers would be exposed to.

Comment 4 Joseph Callen 2020-04-02 18:02:51 UTC
I am actively working on this problem - will update when I have something to report.

Comment 8 Scott Dodson 2020-04-08 20:29:22 UTC
I think we were mostly just focused on the "certificate is valid for api.ci-op-4byd7z0v-3858a.origin-ci-int-aws.dev.rhcloud.com, not api.ci-op-7zg3gn6s-e99c3.origin-ci-int-aws.dev.rhcloud.com" but you're welcome to fix all the CI testing defects you wish.


There's no chance that the pod-Service test can be tied back to replicating this change https://github.com/openshift/machine-config-operator/pull/1628 which has now been applied to both ovirt and OSP? See the linked bugs.

Comment 9 Ben Parees 2020-04-09 15:41:08 UTC
> Updates to UPI complete currently getting PRs ready.


can you link the PRs in this bug?


The flaky tests you're seeing are known to be flaky everywhere, so if you've resolved the cert issue, I say we merge your changes.

Comment 10 Joseph Callen 2020-04-09 16:48:33 UTC
https://github.com/openshift/installer/pull/3429

This PR to update the metal Terraform will also need to be merged:

https://github.com/openshift/installer/pull/3235#issuecomment-611627886

Job template changes:
https://github.com/openshift/release/pull/8259

Comment 11 Kirsten Garrison 2020-04-16 19:00:35 UTC
@joseph, just to confirm, are those PRs the only changes needed to close this BZ?

Comment 12 Ben Parees 2020-04-16 19:35:06 UTC
Created a clone of this bug to be backported to 4.4.z (does not have to be 4.4.0) so we can get our CI cleaned up.

Comment 15 jima 2020-04-24 06:00:07 UTC
The issue is rarely reproduced in the QE CI job; we hit it only once, on an OCP 4.5 nightly build, and could not reproduce it again.

I just checked the DEV CI job, and it seems the issue still happens after the code was merged.
The last nightly build on 4.5 that reproduced the issue is https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.5/873.

So can you confirm that the issue is fixed?

Comment 17 jima 2020-04-30 01:38:14 UTC
Thanks for the info, Joseph. I checked the jobs from the past week: the issue has not recurred since build #873, so it is fixed on 4.5.

Comment 18 errata-xmlrpc 2020-07-13 17:24:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409