Bug 1819492 - invalid apiserver certificates causing large blocks of test failures on vsphere
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.4
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.5.0
Assignee: Joseph Callen
QA Contact: jima
URL:
Whiteboard:
Depends On:
Blocks: 1824991
Reported: 2020-04-01 01:25 UTC by Ben Parees
Modified: 2020-07-13 17:24 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1824991
Environment:
Last Closed: 2020-07-13 17:24:27 UTC
Target Upstream Version:
Embargoed:




Links
System | ID | Private | Priority | Status | Summary | Last Updated
Github | openshift installer pull 3429 | 0 | None | closed | Bug 1819492: vsphere upi and metal: terraform 0.12.x update, general updates and reorg | 2021-01-20 07:35:44 UTC
Red Hat Product Errata | RHBA-2020:2409 | 0 | None | None | None | 2020-07-13 17:24:49 UTC

Description Ben Parees 2020-04-01 01:25:02 UTC
Description of problem:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1358

and

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1357

Both experienced a large number of test failures.  Most of the failures report errors like:

Unable to connect to the server: x509: certificate is valid for api.ci-op-4byd7z0v-3858a.origin-ci-int-aws.dev.rhcloud.com, not api.ci-op-7zg3gn6s-e99c3.origin-ci-int-aws.dev.rhcloud.com  (job 1357)

and

error: failed to discover supported resources: Get https://api.ci-op-3jpl538d-e99c3.origin-ci-int-aws.dev.rhcloud.com:6443/apis/batch/v1?timeout=32s: x509: certificate is valid for kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, openshift, openshift.default, openshift.default.svc, openshift.default.svc.cluster.local, not api.ci-op-3jpl538d-e99c3.origin-ci-int-aws.dev.rhcloud.com  (job 1358)
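
The two messages above share one shape: the names the presented certificate covers, then the name the client asked for. A minimal sketch for pulling those apart when triaging logs follows; the regular expression is written against the wording quoted above, and `cluster_label` is a hypothetical helper that assumes hostnames of the form `api.<cluster>.<domain>`.

```python
import re

# Matches the hostname-mismatch wording seen in the two errors quoted above.
X509_MISMATCH = re.compile(
    r"x509: certificate is valid for (?P<valid>.+?), not (?P<requested>\S+)"
)

def parse_x509_mismatch(message):
    """Return (names the certificate covers, name the client asked for), or None."""
    match = X509_MISMATCH.search(message)
    if match is None:
        return None
    valid = [name.strip() for name in match.group("valid").split(",")]
    return valid, match.group("requested")

def cluster_label(hostname):
    """Hypothetical helper: 'api.<cluster>.<domain>' -> '<cluster>', else None."""
    parts = hostname.split(".")
    return parts[1] if len(parts) > 1 and parts[0] == "api" else None

job_1357 = (
    "Unable to connect to the server: x509: certificate is valid for "
    "api.ci-op-4byd7z0v-3858a.origin-ci-int-aws.dev.rhcloud.com, not "
    "api.ci-op-7zg3gn6s-e99c3.origin-ci-int-aws.dev.rhcloud.com"
)

valid_names, requested = parse_x509_mismatch(job_1357)
# The certificate names one CI cluster while the client asked for another.
print(cluster_label(valid_names[0]), "!=", cluster_label(requested))
```

For the job 1357 message, the certificate's cluster label and the requested one differ, which is the detail the later comments focus on.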

Version-Release number of selected component (if applicable):
4.4


The failures seem to be specific to the vsphere UPI jobs; you can see incidents here:

https://search.svc.ci.openshift.org/?search=x509%3A+certificate+is+valid+for+kubernetes&maxAge=48h&context=0&type=bug%2Bjunit
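
To adjust that query (for example, a different error string or time window), the link can be rebuilt from its parts. This is only a sketch: the parameter names are copied from the URL above, and their meanings are inferred from the values rather than taken from any documentation of the search service.

```python
from urllib.parse import urlencode

BASE = "https://search.svc.ci.openshift.org/"

def search_url(text, max_age="48h", context=0, result_type="bug+junit"):
    """Build a CI search link; parameter semantics are assumed from the URL above."""
    query = urlencode(
        {"search": text, "maxAge": max_age, "context": context, "type": result_type}
    )
    return f"{BASE}?{query}"

url = search_url("x509: certificate is valid for kubernetes")
print(url)
```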

Comment 1 Stefan Schimanski 2020-04-01 07:34:01 UTC
This looks like some components connect to the API with an unknown DNS name. The API server uses the normal SNI mechanism to select the right cert.

  api.ci-op-4byd7z0v-3858a.origin-ci-int-aws.dev.rhcloud.com, not api.ci-op-7zg3gn6s-e99c3.origin-ci-int-aws.dev.rhcloud.com

This means the internal LB name changes during the execution. This is worrisome and most probably a UPI platform issue.
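
As a rough illustration of the SNI behaviour described here, the sketch below models a server that looks up a certificate by the name the client sends and falls back to a default certificate when the name is unknown. It is a simplified model (exact-name lookup only, no wildcards), not the API server's actual implementation; the `example.com` hostnames and the function names are hypothetical, and the default name list is an abbreviated copy of the one in the job 1358 error.

```python
# Simplified model of SNI-based certificate selection: exact-name lookup only,
# no wildcard handling. Not kube-apiserver's actual implementation.

DEFAULT_SANS = (
    "kubernetes",
    "kubernetes.default",
    "kubernetes.default.svc",
    "kubernetes.default.svc.cluster.local",
)

def select_certificate(sni_name, named_certs, default_cert):
    """Return the SAN list of the certificate the server would present."""
    return named_certs.get(sni_name, default_cert)

def hostname_matches(cert_sans, hostname):
    """Client-side check: is the requested hostname among the SANs?"""
    return hostname in cert_sans

# Hypothetical server that only knows its own external API name.
own_name = "api.cluster-a.example.com"
named_certs = {own_name: (own_name,)}

for requested in (own_name, "api.cluster-b.example.com"):
    presented = select_certificate(requested, named_certs, DEFAULT_SANS)
    verdict = "ok" if hostname_matches(presented, requested) else "x509 mismatch"
    print(requested, "->", verdict)
```

Under this model, a client that reaches the server under a name it does not know is handed the default certificate, which would produce a "valid for kubernetes, kubernetes.default, ..." error of the kind reported for job 1358.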

Comment 3 Scott Dodson 2020-04-02 17:47:50 UTC
This is a mixup in CI jobs and not a bug customers would be exposed to.

Comment 4 Joseph Callen 2020-04-02 18:02:51 UTC
I am actively working on this problem - will update when I have something to report.

Comment 8 Scott Dodson 2020-04-08 20:29:22 UTC
I think we were mostly just focused on the "certificate is valid for api.ci-op-4byd7z0v-3858a.origin-ci-int-aws.dev.rhcloud.com, not api.ci-op-7zg3gn6s-e99c3.origin-ci-int-aws.dev.rhcloud.com" but you're welcome to fix all the CI testing defects you wish.


There's no chance that the pod-Service test can be tied back to replicating this change https://github.com/openshift/machine-config-operator/pull/1628 which has now been applied to both ovirt and OSP? See the linked bugs.

Comment 9 Ben Parees 2020-04-09 15:41:08 UTC
> Updates to UPI complete currently getting PRs ready.


Can you link the PRs in this bug?


The flaky tests you're seeing are known to be flaky everywhere, so if you've resolved the cert issue, I say we merge your changes.

Comment 10 Joseph Callen 2020-04-09 16:48:33 UTC
https://github.com/openshift/installer/pull/3429

This PR to update the metal Terraform will also need to be merged:

https://github.com/openshift/installer/pull/3235#issuecomment-611627886

Job template changes:
https://github.com/openshift/release/pull/8259

Comment 11 Kirsten Garrison 2020-04-16 19:00:35 UTC
@joseph, just to confirm, are those PRs the only changes needed to close this BZ?

Comment 12 Ben Parees 2020-04-16 19:35:06 UTC
Created a clone of this bug to be backported to 4.4.z (does not have to be 4.4.0) so we can get our CI cleaned up.

Comment 15 jima 2020-04-24 06:00:07 UTC
The issue is rarely reproduced on the QE CI job; we hit it only once, on an OCP 4.5 nightly build, and could not reproduce it again.

I just checked the DEV CI job, and it seems the issue still happened after the code was merged.
The latest nightly build on 4.5 that reproduced the issue is https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.5/873.

So, can you confirm that the issue is fixed?

Comment 17 jima 2020-04-30 01:38:14 UTC
Thanks for the info, Joseph. I checked the past week of builds: the issue has not recurred since build #873, so it is fixed on 4.5.

Comment 18 errata-xmlrpc 2020-07-13 17:24:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:2409

