Description of problem:
Both experienced a large number of test failures. Most of the failures report errors like:
Unable to connect to the server: x509: certificate is valid for api.ci-op-4byd7z0v-3858a.origin-ci-int-aws.dev.rhcloud.com, not api.ci-op-7zg3gn6s-e99c3.origin-ci-int-aws.dev.rhcloud.com (job 1357)
error: failed to discover supported resources: Get https://api.ci-op-3jpl538d-e99c3.origin-ci-int-aws.dev.rhcloud.com:6443/apis/batch/v1?timeout=32s: x509: certificate is valid for kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, openshift, openshift.default, openshift.default.svc, openshift.default.svc.cluster.local, not api.ci-op-3jpl538d-e99c3.origin-ci-int-aws.dev.rhcloud.com (job 1358)
Version-Release number of selected component (if applicable):
The failures seem to be specific to the vSphere UPI jobs; you can see incidents here:
This looks like some components are connecting to the API with an unknown DNS name. The API server uses the normal SNI mechanism to select the right cert.
api.ci-op-4byd7z0v-3858a.origin-ci-int-aws.dev.rhcloud.com, not api.ci-op-7zg3gn6s-e99c3.origin-ci-int-aws.dev.rhcloud.com
This means the internal LB name changes during execution. This is worrisome and most probably a UPI platform issue.
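To make the two failure signatures above easier to tell apart when scanning logs, here is a minimal Python sketch. It only parses the client-side x509 message quoted in this bug. The function names, the category labels, and the idea of extracting the "ci-op-..." namespace from the hostname are assumptions for illustration, not part of any CI tooling.

```python
import re

# Matches the x509 hostname-mismatch message quoted in this report.
MISMATCH_RE = re.compile(
    r"x509: certificate is valid for (?P<valid>.+?), not (?P<requested>\S+)"
)

# Assumption: CI API hostnames look like api.ci-op-<namespace>-<suffix>.<domain>.
CI_NS_RE = re.compile(r"^api\.(ci-op-[a-z0-9]+)-")


def parse_mismatch(message):
    """Return (names_on_cert, requested_name), or None if no mismatch is found."""
    m = MISMATCH_RE.search(message)
    if m is None:
        return None
    valid = [name.strip() for name in m.group("valid").split(",")]
    return valid, m.group("requested")


def classify(message):
    """Roughly label a mismatch (labels are hypothetical, for triage only)."""
    parsed = parse_mismatch(message)
    if parsed is None:
        return "no-mismatch"
    valid, requested = parsed
    req_ns = CI_NS_RE.match(requested)
    cert_ns = {m.group(1) for m in map(CI_NS_RE.match, valid) if m}
    # The cert names a different ci-op namespace: the client reached another job's cluster.
    if req_ns and cert_ns and req_ns.group(1) not in cert_ns:
        return "other-cluster-cert"
    # The cert lists no api.* name at all, only in-cluster service names.
    if not any(name.startswith("api.") for name in valid):
        return "service-network-cert"
    return "other"
```

Fed the two messages from the description, the job 1357 error would fall under "other-cluster-cert" and the job 1358 error under "service-network-cert". The first label is the case that points at a mixup between CI jobs rather than at the API server itself.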
For tracking, we are seeing the x509 error in the following runs:
Job url: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1357
Number of test failures: 412
Job url: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1362
Number of test failures: 397
Job url: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1358
Number of test failures: 380
Job url: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1347
Number of test failures: 345
Job url: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1367 <--- today
Number of test failures: 304
This is a mixup in CI jobs and not a bug customers would be exposed to.
I am actively working on this problem - will update when I have something to report.
I think we were mostly just focused on the "certificate is valid for api.ci-op-4byd7z0v-3858a.origin-ci-int-aws.dev.rhcloud.com, not api.ci-op-7zg3gn6s-e99c3.origin-ci-int-aws.dev.rhcloud.com" but you're welcome to fix all the CI testing defects you wish.
There's no chance that the pod-Service test can be tied back to replicating this change https://github.com/openshift/machine-config-operator/pull/1628 which has now been applied to both ovirt and OSP? See the linked bugs.
> Updates to UPI complete currently getting PRs ready.
Can you link the PRs in this bug?
The flaky tests you're seeing are known to be flaky everywhere, so if you've resolved the cert issue, I say we merge your changes.
This PR to update the metal Terraform will also need to be merged:
Job template changes:
@joseph, just to confirm, are those PRs the only changes needed to close this BZ?
Created a clone of this bug to be backported to 4.4.z (does not have to be 4.4.0) so we can get our CI cleaned up.
The issue is rarely reproduced on the QE CI job; we hit it only once on an OCP 4.5 nightly build and could not reproduce it again.
I just checked the DEV CI job, and it seems the issue still happened after the code was merged.
The most recent 4.5 nightly build that reproduced the issue is https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.5/873.
So, can you confirm whether the issue is fixed?
See my comments:
Thanks for the info, Joseph. I checked the runs from the past week: the issue has not recurred since build #873, so it is fixed on 4.5.
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.
For information on the advisory, and where to find the updated
files, follow the link below.
If the solution does not work for you, open a new bug report.