+++ This bug was initially created as a clone of Bug #1819492 +++
Description of problem:
Both experienced a large number of test failures. Most of the failures report errors like:
Unable to connect to the server: x509: certificate is valid for api.ci-op-4byd7z0v-3858a.origin-ci-int-aws.dev.rhcloud.com, not api.ci-op-7zg3gn6s-e99c3.origin-ci-int-aws.dev.rhcloud.com (job 1357)
error: failed to discover supported resources: Get https://api.ci-op-3jpl538d-e99c3.origin-ci-int-aws.dev.rhcloud.com:6443/apis/batch/v1?timeout=32s: x509: certificate is valid for kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, openshift, openshift.default, openshift.default.svc, openshift.default.svc.cluster.local, not api.ci-op-3jpl538d-e99c3.origin-ci-int-aws.dev.rhcloud.com (job 1358)
Version-Release number of selected component (if applicable):
The failures seem to be specific to the vSphere UPI jobs; you can see incidents here:
--- Additional comment from Stefan Schimanski on 2020-04-01 07:34:01 UTC ---
This looks like some components connect to the API with an unknown DNS name. The API server uses normal SNI mechanism to select the right cert.
api.ci-op-4byd7z0v-3858a.origin-ci-int-aws.dev.rhcloud.com, not api.ci-op-7zg3gn6s-e99c3.origin-ci-int-aws.dev.rhcloud.com
This means the internal LB name changes during the execution. This is worrisome and most probably a UPI platform issue.
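To illustrate the mismatch reported above, here is a minimal sketch of the hostname check a TLS client performs against the DNS names in the served certificate. It is a simplified stand-in, not the actual Go x509 verification code; the function name `hostname_matches` is hypothetical, and only exact matches and a leftmost-label wildcard are handled.

```python
def hostname_matches(hostname: str, san_dns_names: list[str]) -> bool:
    """Return True if hostname matches any DNS SAN entry.

    Simplified assumption: a wildcard may only replace the whole
    leftmost label (e.g. "*.apps.example.com").
    """
    host_labels = hostname.lower().rstrip(".").split(".")
    for san in san_dns_names:
        san_labels = san.lower().rstrip(".").split(".")
        if len(san_labels) != len(host_labels):
            continue
        if san_labels[0] == "*":
            # Wildcard: everything after the first label must be identical.
            if san_labels[1:] == host_labels[1:]:
                return True
        elif san_labels == host_labels:
            return True
    return False


# Names taken from the error message in job 1357.
requested = "api.ci-op-7zg3gn6s-e99c3.origin-ci-int-aws.dev.rhcloud.com"
served_sans = ["api.ci-op-4byd7z0v-3858a.origin-ci-int-aws.dev.rhcloud.com"]
print(hostname_matches(requested, served_sans))
```

The client asked for one cluster's API name but received a certificate issued for another cluster's API name, so the check fails and the connection is rejected.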
--- Additional comment from Kirsten Garrison on 2020-04-01 22:30:13 UTC ---
For tracking, we are seeing the x509 error in the following runs:
Job url: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1357
Number of test failures: 412
Job url: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1362
Number of test failures: 397
Job url: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1358
Number of test failures: 380
Job url: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1347
Number of test failures: 345
Job url: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1367 <--- today
Number of test failures: 304
--- Additional comment from Scott Dodson on 2020-04-02 17:47:50 UTC ---
This is a mixup in CI jobs and not a bug customers would be exposed to.
--- Additional comment from Joseph Callen on 2020-04-02 18:02:51 UTC ---
I am actively working on this problem - will update when I have something to report.
--- Additional comment from Scott Dodson on 2020-04-08 00:01:43 UTC ---
Was going to wait until tomorrow to ask about this but I figured I'd likely forget to do so. Are we making progress on this one?
--- Additional comment from Joseph Callen on 2020-04-08 04:05:53 UTC ---
I have been testing changes to the vSphere UPI template to get better results with `openshift-tests`
1.) I need to modify the UPI template to check whether the patch of the image registry was actually applied. In testing I found that I needed to run the patch _after_ the cluster was up.
2.) I have been testing using an LB in place of multiple A records for api, api-int and *.apps - significantly fewer test failures.
3.) Since Monday I have been reusing/rewriting the vSphere UPI template. This change will update to Terraform 0.12, remove etcd DNS, add an LB and change api, api-int and *.apps to a single A record.
As of 4/8/2020 12:04 AM I have finished with the changes - I will be testing tomorrow.
--- Additional comment from Joseph Callen on 2020-04-08 14:48:14 UTC ---
Updates to UPI complete currently getting PRs ready.
[sig-network] Networking Granular Checks: Services should function for pod-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s]
[sig-instrumentation] Prometheus when installed on the cluster should start and expose a secured proxy and unsecured metrics [Suite:openshift/conformance/parallel]
[sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Skipped:Network/OVNKubernetes] [Suite:openshift/conformance/parallel] [Suite:k8s]
--- Additional comment from Scott Dodson on 2020-04-08 20:29:22 UTC ---
I think we were mostly just focused on the "certificate is valid for api.ci-op-4byd7z0v-3858a.origin-ci-int-aws.dev.rhcloud.com, not api.ci-op-7zg3gn6s-e99c3.origin-ci-int-aws.dev.rhcloud.com" but you're welcome to fix all the CI testing defects you wish.
There's no chance that the pod-Service test can be tied back to replicating this change https://github.com/openshift/machine-config-operator/pull/1628 which has now been applied to both ovirt and OSP? See the linked bugs.
--- Additional comment from Ben Parees on 2020-04-09 15:41:08 UTC ---
> Updates to UPI complete currently getting PRs ready.
can you link the PRs in this bug?
The flaky tests you're seeing are known to be flaky everywhere, so if you've resolved the cert issue I say we merge your changes.
--- Additional comment from Joseph Callen on 2020-04-09 16:48:33 UTC ---
This PR to update the metal Terraform will also need to be merged:
Job template changes:
--- Additional comment from Kirsten Garrison on 2020-04-16 19:00:35 UTC ---
@joseph, just to confirm, are those PRs the only changes needed to close this BZ?
This issue is not always reproducible; it was hit only once, with a 4.5 nightly build.
What changed to resolve the "x509 certificate" issue?
Testing with the new UPI in CI, I found the problem, at least partially.
The IPs within phpIPAM are being deleted while the cluster is still running. The next CI job comes along and takes one or more of those addresses.
I would still like the Terraform changes to move forward but I also need to figure out a solution for the IPAM issue.
Can we get the issue fully resolved upstream and then backport both changes at once after we've shown improvement in success or is this a situation where any problematic branches are going to taint the results of other branches due to the IPAM issue?
(In reply to Scott Dodson from comment #5)
> Can we get the issue fully resolved upstream and then backport both changes
> at once after we've shown improvement in success or is this a situation
> where any problematic branches are going to taint the results of other
> branches due to the IPAM issue?
Sorry for the late response. I have been on the lookout for this issue and haven't seen it come up again, though there are still openshift-tests failures that shouldn't exist. With the updated Terraform, the last operation that completes is deleting the IPAM records. If teardown for some reason doesn't complete correctly, at least those allocated IP addresses won't be reused.
I have a replacement for phpIPAM in the works (netbox), but the issue will be switching it out across all the previous versions of vSphere UPI. The advantage of changing to netbox is the API; the Terraform is less complex to manage.
In the QE testing environment, this issue is rarely reproduced.
In the dev CI environment, I used the link below to search for failed CI jobs related to the error "x509: certificate is valid for kubernetes".
There are no failed jobs due to the error "x509: certificate is valid for kubernetes" on the 4.4 release.
Do you think the issue is verified?
Yes, I think this can be verified.
I caught one yesterday, found the reason and submitted a PR to fix:
This should fix any remaining x509 certificate issues.
Will that address these? https://search.apps.build01.ci.devcluster.openshift.com/?search=x509%3A+certificate+is+valid+for+kubernetes&maxAge=168h&context=0&type=bug%2Bjunit&name=4.5&maxMatches=5&maxBytes=20971520&groupBy=job
Those look like a different cert issue, so probably needs a new bug if your PR doesn't address it.
(In reply to Ben Parees from comment #12)
> Will that address these?
> Those look like a different cert issue, so probably needs a new bug if your
> PR doesn't address it.
The first result is the one I found yesterday and submitted the PR for.
I will continue to monitor vSphere job results.
If there's a unique bug and fix in comment 12 I'd prefer we open a new bug for that and move this to VERIFIED. I'm fine with engineering moving it to VERIFIED as long as we know the originally reported problem has been addressed. The search results seem ambiguous to me so leaving that up to you.
tl;dr: based on my observations, the certificate failures in CI for vSphere are all caused by the same issue - two running clusters being allocated the same IP addresses via IPAM for a master node.
The certificate issue was caused by IP addresses being used by master instances in two different clusters.
In retrospect, I don't think the changes to the UPI Terraform helped to resolve this, though they might have slightly improved pass rates.
It was removing the destroy of the bootstrap that helped resolve it, and hopefully `terraform destroy -refresh=false` will make sure it doesn't happen again.
We changed the deletion process because of issues moving to Terraform 0.12. My current guess is that this is the original cause of the issue.
When testing the Terraform changes, I made sure that variables were set deliberately so that IPAM was the last to be destroyed.
I think this issue shows a potential problem that needs to be resolved in our move of CI:
- Do we need to destroy the bootstrap node?
- How can we guarantee that addresses are not duplicated?
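As one possible approach to the second question, a CI pre-flight check could compare the addresses recorded for each running cluster and flag any that appear more than once. The sketch below assumes the allocations are already available as a mapping of cluster name to IP list; `find_duplicate_ips` is a hypothetical helper, the IPs are made-up documentation-range addresses, and nothing here calls the phpIPAM or netbox APIs.

```python
from collections import defaultdict


def find_duplicate_ips(allocations: dict[str, list[str]]) -> dict[str, list[str]]:
    """Map each IP claimed by more than one cluster to the clusters claiming it."""
    owners: dict[str, set[str]] = defaultdict(set)
    for cluster, ips in allocations.items():
        for ip in ips:
            owners[ip].add(cluster)
    return {ip: sorted(clusters)
            for ip, clusters in owners.items()
            if len(clusters) > 1}


# Cluster names from the error message; the addresses are hypothetical.
allocations = {
    "ci-op-4byd7z0v-3858a": ["192.0.2.10", "192.0.2.11", "192.0.2.12"],
    "ci-op-7zg3gn6s-e99c3": ["192.0.2.12", "192.0.2.13", "192.0.2.14"],
}
duplicates = find_duplicate_ips(allocations)
print(duplicates)
```

A non-empty result would mean two live clusters share a master address, which is the condition that produced the x509 errors in this bug.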
I will add this comment to: https://issues.redhat.com/browse/SPLAT-2
as a reminder.
According to your comment 15, there is still some work to do on CI to fix this issue.
If we still use this CR to track the issue, can we move the status to "MODIFIED"? Or is it OK to track the issue with https://issues.redhat.com/browse/SPLAT-2, and move this bug to "VERIFIED"?
According to comment 15, setting the bug to "ASSIGNED" to track the two potential problems.
This is CI-specific and we are still monitoring. Moving to 4.6.
I just checked in CI search. This error has not occurred in the past 14 days.
based on comment 21