Bug 1824991
Summary: | invalid apiserver certificates causing large blocks of test failures on vsphere | ||
---|---|---|---|
Product: | OpenShift Container Platform | Reporter: | Ben Parees <bparees> |
Component: | Installer | Assignee: | Joseph Callen <jcallen> |
Installer sub component: | openshift-installer | QA Contact: | jima |
Status: | CLOSED CURRENTRELEASE | Docs Contact: | |
Severity: | high | ||
Priority: | high | CC: | aos-bugs, dphillip, jcallen, jima, kgarriso, mfojtik, scuppett, sdodson |
Version: | 4.4 | ||
Target Milestone: | --- | ||
Target Release: | 4.6.0 | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | No Doc Update | |
Doc Text: | Story Points: | --- | |
Clone Of: | 1819492 | Environment: | |
Last Closed: | 2020-06-01 18:18:33 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
Bug Depends On: | 1819492 | ||
Bug Blocks: |
Description
Ben Parees
2020-04-16 19:34:03 UTC
This issue is not always reproducible; it was hit only once, with a 4.5 nightly build.
https://openshift-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/job/Launch%20Environment%20Flexy/88239/console

What changed to resolve the "x509 certificate" issue?

Testing with the new UPI in CI, I found at least part of the problem. The IPs within phpIPAM are being deleted while the cluster is still running. The next CI job comes along and takes one or more of those addresses. I would still like the Terraform changes to move forward, but I also need to figure out a solution for the IPAM issue.

Can we get the issue fully resolved upstream and then backport both changes at once after we've shown improvement in success, or is this a situation where any problematic branches are going to taint the results of other branches due to the IPAM issue?

(In reply to Scott Dodson from comment #5)
> Can we get the issue fully resolved upstream and then backport both changes
> at once after we've shown improvement in success or is this a situation
> where any problematic branches are going to taint the results of other
> branches due to the IPAM issue?

Sorry for the late response. I have been on the lookout for this issue and haven't seen it come up again, though there are still openshift-tests failures that shouldn't exist. With the updated terraform, the last operation that completes is deleting the IPAM records. If teardown for some reason doesn't complete correctly, at least those allocated IP addresses won't be reused. I have a replacement for phpIPAM in the works (netbox), but the issue will be switching it out from all the previous versions of vSphere UPI. The advantage of changing to netbox is the API; the terraform is less complex to manage.

@Ben Parees,
In the QE testing environment, this issue is rarely reproduced. In the dev CI environment, I used the link below to search for failed CI jobs with the error "x509: certificate is valid for kubernetes".
https://search.svc.ci.openshift.org/?search=x509%3A+certificate+is+valid+for+kubernetes&maxAge=336h&context=0&type=bug%2Bjunit
There is no failed job due to the error "x509: certificate is valid for kubernetes" on the 4.4 release. Do you think the issue is verified?

Yes, I think this can be verified.

I caught one yesterday, found the reason and submitted a PR to fix it: https://github.com/openshift/release/pull/9166
This should fix any remaining x509 certificate issues.

Will that address these?
https://search.apps.build01.ci.devcluster.openshift.com/?search=x509%3A+certificate+is+valid+for+kubernetes&maxAge=168h&context=0&type=bug%2Bjunit&name=4.5&maxMatches=5&maxBytes=20971520&groupBy=job
Those look like a different cert issue, so this probably needs a new bug if your PR doesn't address it.

(In reply to Ben Parees from comment #12)
> Will that address these?
> https://search.apps.build01.ci.devcluster.openshift.com/?search=x509%3A+certificate+is+valid+for+kubernetes&maxAge=168h&context=0&type=bug%2Bjunit&name=4.5&maxMatches=5&maxBytes=20971520&groupBy=job
>
> Those look like a different cert issue, so probably needs a new bug if your
> PR doesn't address it.

The first result is the one I found yesterday and submitted the PR for. I will continue to monitor vSphere job results.

Joe,
If there's a unique bug and fix in comment 12, I'd prefer we open a new bug for that and move this to VERIFIED. I'm fine with engineering moving it to VERIFIED as long as we know the originally reported problem has been addressed. The search results seem ambiguous to me, so I'm leaving that up to you.

tl;dr: based on my observations, the certificate failures in CI for vsphere are caused by the same issue: two running clusters allocating the same IP addresses via IPAM for a master node.

The certificate issue was caused by IP addresses being used by master instances in two different clusters. In retrospect, I don't think the changes to the UPI terraform [0] helped to resolve this.
Though they might have slightly improved passing rates. It was removing the destroy of the bootstrap [1] that helped resolve it, and hopefully `terraform destroy -refresh=false` [2] will make sure it doesn't happen again. We changed the deletion process because of issues moving to terraform 0.12 [3]. My current guess is that this is the original cause of the issue. When testing the terraform changes, I purposely set the variables so that IPAM was the last to be destroyed.

[0] https://github.com/openshift/installer/pull/3429
[1] https://github.com/openshift/release/pull/8617
[2] https://github.com/openshift/release/pull/9166
[3] https://github.com/openshift/release/pull/7618#issuecomment-604033776

I think this issue shows potential problems that need to be resolved in our move of CI:
- Do we need to destroy the bootstrap node?
- How can we guarantee that addresses are not duplicated?

I will add this comment to https://issues.redhat.com/browse/SPLAT-2 as a reminder.

Joseph,
From your comment 15, there is still some work on CI to fix this issue. If we still use this CR to track the issue, can we move the status to "MODIFIED"? Or is it OK to track the issue with https://issues.redhat.com/browse/SPLAT-2 and move this bug to "VERIFIED"?

According to comment 15, setting the bug to "ASSIGNED" to track the two potential problems.

This is CI specific and still being monitored. Moving to 4.6.

I just checked in CI search. This error has not occurred in the past 14 days:
https://search.apps.build01.ci.devcluster.openshift.com/?search=Unable+to+connect+to+the+server%3A+x509%3A&maxAge=336h&context=2&type=junit&name=&maxMatches=5&maxBytes=20971520&groupBy=job

Based on comment 21.
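
The duplicate-address failure discussed in this bug can be illustrated with a small sketch. The Python below is a hypothetical in-memory model, not the phpIPAM or netbox API: `ToyIpam`, `find_duplicate_addresses`, `simulate`, the job names and the 192.0.2.x addresses are all invented for illustration. It only shows that releasing a running cluster's IPAM records lets the next job receive the same address, and that a simple cross-cluster check would flag the collision.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_duplicate_addresses(clusters: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Return every IP address claimed by more than one running cluster."""
    owners = defaultdict(set)
    for cluster, addresses in clusters.items():
        for ip in addresses:
            owners[ip].add(cluster)
    return {ip: sorted(names) for ip, names in owners.items() if len(names) > 1}


class ToyIpam:
    """Hypothetical in-memory address pool (stand-in for a real IPAM service)."""

    def __init__(self, pool: Iterable[str]) -> None:
        self.free = sorted(pool)
        self.leases: Dict[str, str] = {}

    def allocate(self, cluster: str) -> str:
        # Hand out the first free address and record who holds it.
        ip = self.free.pop(0)
        self.leases[ip] = cluster
        return ip

    def release(self, cluster: str) -> None:
        # Return all of the cluster's addresses to the free pool.
        for ip in [ip for ip, owner in self.leases.items() if owner == cluster]:
            del self.leases[ip]
            self.free.append(ip)
        self.free.sort()


def simulate(release_while_running: bool) -> Dict[str, List[str]]:
    """Run two CI jobs against one pool and report duplicated addresses."""
    ipam = ToyIpam(["192.0.2.10", "192.0.2.11"])
    running = {"job-a": [ipam.allocate("job-a")]}
    if release_while_running:
        # Failure mode: IPAM records are deleted while job-a's VMs still run.
        ipam.release("job-a")
    running["job-b"] = [ipam.allocate("job-b")]
    return find_duplicate_addresses(running)
```

Under these assumptions, `simulate(True)` reports 192.0.2.10 as claimed by both jobs, while `simulate(False)` reports no duplicates, which is consistent with the reasoning for making IPAM the last resource destroyed during teardown.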