Bug 1824991

Summary: invalid apiserver certificates causing large blocks of test failures on vsphere
Product: OpenShift Container Platform
Reporter: Ben Parees <bparees>
Component: Installer
Installer sub component: openshift-installer
Assignee: Joseph Callen <jcallen>
QA Contact: jima
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: high
CC: aos-bugs, dphillip, jcallen, jima, kgarriso, mfojtik, scuppett, sdodson
Version: 4.4
Target Release: 4.6.0
Doc Type: No Doc Update
Clone Of: 1819492
Bug Depends On: 1819492
Last Closed: 2020-06-01 18:18:33 UTC

Description Ben Parees 2020-04-16 19:34:03 UTC
+++ This bug was initially created as a clone of Bug #1819492 +++

Description of problem:

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1358

and

https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1357

Both experienced a large number of test failures.  Most of the failures report errors like:

Unable to connect to the server: x509: certificate is valid for api.ci-op-4byd7z0v-3858a.origin-ci-int-aws.dev.rhcloud.com, not api.ci-op-7zg3gn6s-e99c3.origin-ci-int-aws.dev.rhcloud.com  (job 1357)

and

error: failed to discover supported resources: Get https://api.ci-op-3jpl538d-e99c3.origin-ci-int-aws.dev.rhcloud.com:6443/apis/batch/v1?timeout=32s: x509: certificate is valid for kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, openshift, openshift.default, openshift.default.svc, openshift.default.svc.cluster.local, not api.ci-op-3jpl538d-e99c3.origin-ci-int-aws.dev.rhcloud.com  (job 1358)

Version-Release number of selected component (if applicable):
4.4


The failures seem to be specific to the vSphere UPI jobs; you can see incidents here:

https://search.svc.ci.openshift.org/?search=x509%3A+certificate+is+valid+for+kubernetes&maxAge=48h&context=0&type=bug%2Bjunit

--- Additional comment from Stefan Schimanski on 2020-04-01 07:34:01 UTC ---

This looks like some components connect to the API with an unknown DNS name. The API server uses the normal SNI mechanism to select the right cert.

  api.ci-op-4byd7z0v-3858a.origin-ci-int-aws.dev.rhcloud.com, not api.ci-op-7zg3gn6s-e99c3.origin-ci-int-aws.dev.rhcloud.com

This means the internal LB name changes during execution. This is worrisome and most probably a UPI platform issue.
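
For reference, one way to see which certificate the apiserver returns for a given SNI name (the hostname below is a placeholder, not one from these jobs):

  # Send an explicit SNI name and print the subject of the cert the server
  # selects; re-run with a different -servername to see the selection change.
  openssl s_client -connect api.example.cluster:6443 \
    -servername api.example.cluster </dev/null 2>/dev/null \
    | openssl x509 -noout -subject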

--- Additional comment from Kirsten Garrison on 2020-04-01 22:30:13 UTC ---

For tracking, we are seeing the x509 error in the following runs:

Job url: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1357
Number of test failures: 412
Job url: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1362
Number of test failures: 397
Job url: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1358
Number of test failures: 380
Job url: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1347
Number of test failures: 345
Job url: https://prow.svc.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-vsphere-upi-4.4/1367 <--- today
Number of test failures: 304

--- Additional comment from Scott Dodson on 2020-04-02 17:47:50 UTC ---

This is a mixup in CI jobs and not a bug customers would be exposed to.

--- Additional comment from Joseph Callen on 2020-04-02 18:02:51 UTC ---

I am actively working on this problem - will update when I have something to report.

--- Additional comment from Scott Dodson on 2020-04-08 00:01:43 UTC ---

Was going to wait until tomorrow to ask about this but I figured I'd likely forget to do so. Are we making progress on this one?

--- Additional comment from Joseph Callen on 2020-04-08 04:05:53 UTC ---

I have been testing changes to the vSphere UPI template to get better results with `openshift-tests`.

1.) I need to modify the UPI template to check whether the patch of the image registry actually applied.  In testing I found that I needed to run the patch _after_ the cluster was up (see the sketch at the end of this comment).
2.) I have been testing using an LB in place of multiple A records for api, api-int and *.apps - significantly fewer test failures.
3.) Since Monday I have been reusing/rewriting the vSphere UPI template.  This change will update to Terraform 0.12, remove etcd DNS, add an LB and change the api, api-int and *.apps entries to a single A record each.
As of 4/8/2020 12:04 AM I have finished the changes - I will be testing tomorrow.
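
A minimal sketch of the check in (1), assuming the usual vSphere UPI emptyDir patch for the image registry:

  # Apply the registry storage patch, then confirm it actually stuck.
  oc patch configs.imageregistry.operator.openshift.io cluster \
    --type merge --patch '{"spec":{"storage":{"emptyDir":{}}}}'
  # If the patch applied, this prints {"emptyDir":{}}
  oc get configs.imageregistry.operator.openshift.io cluster \
    -o jsonpath='{.spec.storage}'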

--- Additional comment from Joseph Callen on 2020-04-08 14:48:14 UTC ---

Updates to UPI are complete; currently getting PRs ready.


openshift/conformance/parallel

Flaky tests:

[sig-network] Networking Granular Checks: Services should function for pod-Service: udp [Suite:openshift/conformance/parallel] [Suite:k8s]

Failing tests:

[sig-instrumentation] Prometheus when installed on the cluster should start and expose a secured proxy and unsecured metrics [Suite:openshift/conformance/parallel]
[sig-network] Networking Granular Checks: Services should function for endpoint-Service: udp [Skipped:Network/OVNKubernetes] [Suite:openshift/conformance/parallel] [Suite:k8s]

Running openshift/conformance/serial now.

--- Additional comment from Scott Dodson on 2020-04-08 20:29:22 UTC ---

I think we were mostly just focused on the "certificate is valid for api.ci-op-4byd7z0v-3858a.origin-ci-int-aws.dev.rhcloud.com, not api.ci-op-7zg3gn6s-e99c3.origin-ci-int-aws.dev.rhcloud.com" but you're welcome to fix all the CI testing defects you wish.


Is there any chance that the pod-Service test failure can be tied back to replicating this change https://github.com/openshift/machine-config-operator/pull/1628, which has now been applied to both oVirt and OSP? See the linked bugs.

--- Additional comment from Ben Parees on 2020-04-09 15:41:08 UTC ---

> Updates to UPI are complete; currently getting PRs ready.


Can you link the PRs in this bug?


The flaky tests you're seeing are known to be flaky everywhere, so if you've resolved the cert issue I say we merge your changes.

--- Additional comment from Joseph Callen on 2020-04-09 16:48:33 UTC ---

https://github.com/openshift/installer/pull/3429

This PR to update the metal Terraform will also need to be merged:

https://github.com/openshift/installer/pull/3235#issuecomment-611627886

Job template changes:
https://github.com/openshift/release/pull/8259

--- Additional comment from Kirsten Garrison on 2020-04-16 19:00:35 UTC ---

@joseph, just to confirm: are those PRs the only changes needed to close this BZ?

Comment 3 jima 2020-04-21 01:23:08 UTC
This issue is not always reproducible; it was only hit once with a 4.5 nightly build.
https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/88239/console

What changed to resolve the "x509 certificate" issue?

Comment 4 Joseph Callen 2020-04-23 14:50:42 UTC
Testing with the new UPI in CI, I found at least part of the problem.

The IPs within phpIPAM are being deleted while the cluster is still running; the next CI job comes along and takes one or more of those addresses.
I would still like the Terraform changes to move forward, but I also need to figure out a solution for the IPAM issue.
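
Purely as an illustration (not code from the CI templates), a guard against handing out an address that is still live might look like:

  # Hypothetical pre-flight check before a job uses an IPAM-allocated IP:
  # refuse any address that still answers, e.g. from an undeleted cluster.
  if ping -c 2 -W 1 "$CANDIDATE_IP" >/dev/null 2>&1; then
    echo "ERROR: $CANDIDATE_IP is still in use, refusing to reuse it" >&2
    exit 1
  fi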

Comment 5 Scott Dodson 2020-04-23 23:27:52 UTC
Can we get the issue fully resolved upstream and then backport both changes at once after we've shown improvement in success or is this a situation where any problematic branches are going to taint the results of other branches due to the IPAM issue?

Comment 6 Joseph Callen 2020-04-27 21:09:29 UTC
(In reply to Scott Dodson from comment #5)
> Can we get the issue fully resolved upstream and then backport both changes
> at once after we've shown improvement in success or is this a situation
> where any problematic branches are going to taint the results of other
> branches due to the IPAM issue?

Sorry for the late response.  I have been on the lookout for this issue and haven't seen it come up again, though there are still openshift-tests failures that shouldn't exist.  In the updated Terraform, the last operation that completes is deleting the IPAM records.  If teardown for some reason doesn't complete correctly, at least those allocated IP addresses won't be reused.

I have a replacement for phpIPAM in the works (NetBox), but the issue will be switching it out across all the previous versions of vSphere UPI. The advantage of changing to NetBox is its API; the Terraform is less complex to manage.
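
To make the "IPAM records are released last" ordering concrete, a CLI sketch (the module names are hypothetical; the real templates enforce this via resource dependencies):

  # Destroy the cluster resources first; module names are stand-ins for
  # whatever the actual templates define.
  terraform destroy -auto-approve -target=module.masters -target=module.workers
  # Only once the cluster resources are gone does a full destroy release
  # the phpIPAM address records.
  terraform destroy -auto-approve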

Comment 9 jima 2020-05-19 06:34:33 UTC
@Ben Parees,
On the QE testing env, this issue is rarely reproduced.
On the dev CI env, I used the link below to search for failed CI jobs related to the error "x509: certificate is valid for kubernetes":
https://search.svc.ci.openshift.org/?search=x509%3A+certificate+is+valid+for+kubernetes&maxAge=336h&context=0&type=bug%2Bjunit

There are no failed jobs due to the "x509: certificate is valid for kubernetes" error on the 4.4 release.

Do you think the issue is verified?

Comment 10 Ben Parees 2020-05-19 12:46:20 UTC
Yes, I think this can be verified.

Comment 11 Joseph Callen 2020-05-19 12:50:33 UTC
I caught one yesterday, found the reason, and submitted a PR to fix it:
https://github.com/openshift/release/pull/9166
This should fix any remaining x509 certificate issues.

Comment 12 Ben Parees 2020-05-19 12:53:23 UTC
Will that address these?  https://search.apps.build01.ci.devcluster.openshift.com/?search=x509%3A+certificate+is+valid+for+kubernetes&maxAge=168h&context=0&type=bug%2Bjunit&name=4.5&maxMatches=5&maxBytes=20971520&groupBy=job

Those look like a different cert issue, so a new bug is probably needed if your PR doesn't address it.

Comment 13 Joseph Callen 2020-05-19 12:59:05 UTC
(In reply to Ben Parees from comment #12)
> Will that address these? 
> https://search.apps.build01.ci.devcluster.openshift.com/
> ?search=x509%3A+certificate+is+valid+for+kubernetes&maxAge=168h&context=0&typ
> e=bug%2Bjunit&name=4.5&maxMatches=5&maxBytes=20971520&groupBy=job
> 
> Those look like a different cert issue, so probably needs a new bug if your
> PR doesn't address it.

The first result is the one I found yesterday and submitted the PR for.
I will continue to monitor vSphere job results.

Comment 14 Scott Dodson 2020-05-20 12:18:24 UTC
Joe,

If there's a unique bug and fix in comment 12 I'd prefer we open a new bug for that and move this to VERIFIED. I'm fine with engineering moving it to VERIFIED as long as we know the originally reported problem has been addressed. The search results seem ambiguous to me so leaving that up to you.

Comment 15 Joseph Callen 2020-05-20 14:02:06 UTC
tl;dr: based on my observations, the certificate failures in CI for vSphere are caused by the same issue - two running clusters allocating, via IPAM, the same IP addresses for a master node.

The certificate issue was caused by IP addresses being used by master instances in two different clusters.
In retrospect, I don't think the changes to the UPI Terraform [0] helped to resolve this, though they might have slightly improved pass rates.
It was removing the destroy of the bootstrap [1] that helped resolve it, and hopefully `terraform destroy -refresh=false` [2] (a sketch follows the links below) will make sure it doesn't happen again.
We changed the process for deletion because of issues moving to Terraform 0.12 [3]; my current guess is that this is the original cause of the issue.
When testing the Terraform changes I made sure that variables were set purposely so that IPAM was last to be destroyed.

[0] https://github.com/openshift/installer/pull/3429
[1] https://github.com/openshift/release/pull/8617
[2] https://github.com/openshift/release/pull/9166
[3] https://github.com/openshift/release/pull/7618#issuecomment-604033776
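
For reference, the teardown change in [2] boils down to the following (the -auto-approve flag is illustrative):

  # -refresh=false skips re-reading live state from the providers before
  # planning the destroy, so deletion runs from the recorded state and the
  # IPAM records are still released only as the final step.
  terraform destroy -auto-approve -refresh=false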

I think this issue shows potential problems that need to be resolved in our move of CI:
- Do we need to destroy the bootstrap node?
- How can we guarantee that addresses are not duplicated?

I will add this comment to: https://issues.redhat.com/browse/SPLAT-2
as a reminder.

Comment 16 jima 2020-05-20 15:07:45 UTC
Joseph,

From your comment 15, there is still some work on CI to fix this issue.
If we still use this CR to track the issue, can we move the status to "MODIFIED"? Or is it OK to track the issue with https://issues.redhat.com/browse/SPLAT-2 and move this bug to "VERIFIED"?

Comment 17 jima 2020-05-21 01:26:50 UTC
According to comment 15, setting the bug to "ASSIGNED" to track the two potential problems.

Comment 19 Joseph Callen 2020-05-28 19:06:24 UTC
This is CI specific and I am still monitoring.  Moving to 4.6.

Comment 22 Scott Dodson 2020-06-01 18:18:33 UTC
Based on comment 21.