Bug 1940551
| Summary: | About 2% of CI jobs fail (on several platforms) due to "Managed cluster should have the same number of machines and nodes" | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Cloud Compute | Assignee: | Michael Gugino <mgugino> |
| Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | mgugino, mimccune |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1940898 1940972 1941107 (view as bug list) | Environment: | |
| Last Closed: | 2021-08-19 09:57:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1940972, 1941107 | | |
|
Description
Clayton Coleman
2021-03-18 15:37:09 UTC
A much smaller number of failures have happened on AWS / Azure / GCP, so this is probably not common to all platforms (unless there is a race condition triggered by slower provisioning, or there are edge cases where we aren't tolerant of non-cloud errors in the provider stack, etc.). The first test case I looked at was OpenStack, and it failed to successfully provision a machine, so possibly a lack of capacity on that platform. The bare metal OVN tests lack gather-extra for analysis; we need to get that to conform.

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere/1372795001506893824

The vSphere test has one worker node that failed to issue a CSR. The machine was created, and 3 minutes later it was 'provisioned', indicating the cloud was running the instance. However, this cluster has 5 nodes (2 workers, 3 masters) but 15 approved CSRs. There should only be 10 approved CSRs with 5 nodes (1 client and 1 serving CSR per node). This indicates there's some problem with the hosts or the CSR approver.

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1373949894003265536

AWS serial failed during the delete/create:

Mar 22 12:09:19.296: INFO: Error getting nodes from machineSet: not all machines have a node reference: map[ci-op-h1zqznhj-ef0bf-jglv4-worker-us-west-1b-wgwmm:ip-10-0-213-0.us-west-1.compute.internal]

Seems similar.
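For context, the failing check ("Managed cluster should have same number of Machines and Nodes") effectively compares the Machines that have acquired a node reference against the cluster's Node list, which is also what the "not all machines have a node reference" error above is complaining about. Below is a minimal sketch of that comparison, assuming cluster-admin access and the standard machine.openshift.io/v1beta1 Machine CRD; it is an illustration, not the actual test code.

```python
# Sketch: compare Machines that have a nodeRef against the cluster's Nodes.
# Assumes a kubeconfig with cluster-admin access; illustrative only, not the
# origin test implementation.
from kubernetes import client, config

config.load_kube_config()

machines = client.CustomObjectsApi().list_namespaced_custom_object(
    group="machine.openshift.io",
    version="v1beta1",
    namespace="openshift-machine-api",
    plural="machines",
)["items"]

nodes = client.CoreV1Api().list_node().items

# A Machine that never gets status.nodeRef is exactly the failure mode
# reported by the test above.
missing = [
    m["metadata"]["name"]
    for m in machines
    if not m.get("status", {}).get("nodeRef")
]

print(f"machines={len(machines)} nodes={len(nodes)} missing nodeRef={missing}")
```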
(In reply to Clayton Coleman from comment #3)
> https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1373949894003265536
>
> AWS serial failed during the delete/create:
>
> Mar 22 12:09:19.296: INFO: Error getting nodes from machineSet: not all machines have a node reference: map[ci-op-h1zqznhj-ef0bf-jglv4-worker-us-west-1b-wgwmm:ip-10-0-213-0.us-west-1.compute.internal]
>
> Seems similar.

machine-api looks fine here. Created a bug against RHCOS for the super-slow boot process: https://bugzilla.redhat.com/show_bug.cgi?id=1942145

As it turns out, "not all machines have a node reference" uncovered a different issue on Azure for the machine-api: the machine-controller is going OOM there: https://bugzilla.redhat.com/show_bug.cgi?id=1942161

From the same CI job, also found that the API server randomly dies: https://bugzilla.redhat.com/show_bug.cgi?id=1942169

Still seeing super slow boots here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-cloud-controller-manager-operator/34/pull-ci-openshift-cluster-cloud-controller-manager-operator-master-e2e-aws/1385156990736535552/artifacts/e2e-aws/gather-extra/artifacts/nodes/ip-10-0-171-146.us-east-2.compute.internal/

No progress has been made on https://bugzilla.redhat.com/show_bug.cgi?id=1942145

Filtering out the older releases (4.6/4.7), since any changes we've made likely won't have been backported and so won't show up on those builds, and also filtering out the kubevirt failures: https://search.ci.openshift.org/?search=Managed+cluster+should+have+same+number+of+Machines+and+Nodes&maxAge=168h&context=1&type=junit&name=&excludeName=4.7%7C4.6%7Ckubevirt&maxMatches=5&maxBytes=20971520&groupBy=job

We currently have a 0.07% match on this search term, which accounts for 0.26% of failures. Looking through a few of the runs:

- [1] vSphere: From the machine status we can see that the Machine was created and is being reported as existing by vSphere. Something went wrong during the startup of the VM, but I don't have any visibility into what that could be. It will either be an RHCOS issue or something on the vSphere side. Looking at the MCS logs I only see two Ignition requests, which implies that while the VM was created, it never booted to the Ignition phase; could this be a problem with vCenter capacity?
- [2] Agnostic on Azure: The test "flaked". We know VMs can be slow to boot on Azure, especially when you create multiple instances at the same time. It seems that when you request multiple, Azure processes them one by one and there is a small delay between the first and last VMs actually being created and starting. In this case it took an extra 15 minutes for the instance to start up, which is longer than I would have expected. All of the Machines report that the VM was created on the Azure side successfully within a few seconds of each other. Looking at MCS, all of the VMs called Ignition within around 3 minutes of each other. Looking at the CSRs, most were created in a timely manner, but the ones for one of the nodes were created significantly later. There was a large gap between Ignition and the kubelet starting/creating its CSR. We could possibly throw this over to the Node team to investigate.
- [3] vSphere OVN: Same symptoms as [1].
- [4] Azure 4.9 OVN: Same symptoms as [2], except the machines took 4 minutes (about normal), 11 minutes and 21 minutes respectively to come up. (I also looked through another few Azure runs which show similar symptoms.)
- [5] AWS RHEL: Looking at the Machine logs, around the time the test started, the Machine controller was in the process of removing some Machines and adding some new ones. It seems that the RHEL test setup involves spinning up a cluster, then removing the existing MachineSets and creating new RHEL MachineSets. What they aren't doing is waiting for those new Machines to be ready before starting the test, which causes this issue. I would suggest whoever owns that test adds some wait to make sure the Machines come up before they start running the tests (a rough polling sketch follows below).

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere/1427480336643657728
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/637/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic/1427376531528749057
[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/660/pull-ci-openshift-ovn-kubernetes-master-e2e-vsphere-ovn/1427358306669694976
[4] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/18796/rehearse-18796-periodic-ci-openshift-release-master-ci-4.9-e2e-azure-ovn/1427353960854851584
[5] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/5088/pull-ci-openshift-installer-master-e2e-aws-workers-rhel7/1427266814060007424

Scanning through the rest of the tests, I think I've captured all of the failure scenarios we are seeing today for this.
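Regarding the suggestion in [5]: the wait could be as simple as polling the Machine API until every Machine is running and has a node reference. The sketch below assumes the machine.openshift.io/v1beta1 API and the "Running" phase reported by the machine controller; the timeout, interval, and overall shape are illustrative assumptions, not the RHEL job's actual setup code.

```python
# Sketch: block until every Machine in openshift-machine-api reports
# phase "Running" and has a nodeRef, or fail after a timeout.
# Timeout/interval values are assumptions, not the CI job's settings.
import time

from kubernetes import client, config


def wait_for_machines_ready(timeout_seconds=1800, poll_interval=30):
    config.load_kube_config()
    api = client.CustomObjectsApi()
    deadline = time.time() + timeout_seconds
    not_ready = []

    while time.time() < deadline:
        machines = api.list_namespaced_custom_object(
            group="machine.openshift.io",
            version="v1beta1",
            namespace="openshift-machine-api",
            plural="machines",
        )["items"]

        not_ready = [
            m["metadata"]["name"]
            for m in machines
            if m.get("status", {}).get("phase") != "Running"
            or not m.get("status", {}).get("nodeRef")
        ]
        if machines and not not_ready:
            return
        print(f"still waiting for machines: {not_ready}")
        time.sleep(poll_interval)

    raise TimeoutError(f"machines not ready after {timeout_seconds}s: {not_ready}")


if __name__ == "__main__":
    wait_for_machines_ready()
```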
Next steps:

- Understand if there's something better we can do to ensure vSphere capacity is provided for (I suspect some tests aren't requesting the right number of machines)
- Understand if there's anything that can be done to investigate slow Azure boots
- Find out who owns the RHEL tests and see if they can come up with a way to prevent this test from failing in this manner

Spoke to the installer team about the RHEL issue; they are aware and are tracking it in https://bugzilla.redhat.com/show_bug.cgi?id=1979966

The Azure slow boots show the same symptoms as https://bugzilla.redhat.com/show_bug.cgi?id=1942145, which the RHCOS team are aware of. I have added the details from my analysis today to that bug, so hopefully it gets picked up again soon.

I spoke to the splat team yesterday, who showed me that CI jobs now have vSphere diagnostic data captured during the gather steps in deprovision. Looking at the above failures for vSphere, the VM screenshots for the Machines that never came up suggest that the VMs are actually not powered on. The splat team suggested that the issue could be related to this existing bug: https://bugzilla.redhat.com/show_bug.cgi?id=1952739. The team are also going to add additional VM information to the vSphere diagnostics to allow us to determine VM configuration and, in particular, power states.

---

This means that all three of the symptoms identified for the failures in these tests are known issues with teams actively working on them and bugs assigned to the appropriate teams. I don't think there's much more we can do with this bug in particular, so I'm going to close it now and defer to the other teams with their respective bugs.
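For reference on the power-state data mentioned in the splat team note above: whether a VM ever powered on can be answered with a short vCenter query. Below is a hedged sketch using pyvmomi with placeholder host and credentials; it is not the CI diagnostics gather code, just an illustration of the kind of information being added.

```python
# Sketch: list VMs and their power states from vCenter via pyvmomi.
# Host, user and password are placeholders; this is not the CI gather code.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # CI vCenters often use self-signed certs
si = SmartConnect(host="vcenter.example.com", user="user", pwd="password", sslContext=ctx)

try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True
    )
    for vm in view.view:
        # powerState is one of poweredOn, poweredOff, suspended
        print(vm.name, vm.runtime.powerState)
    view.Destroy()
finally:
    Disconnect(si)
```

A Machine whose VM shows up here as poweredOff would line up with the screenshot evidence described above.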