Bug 1940551
| Summary: | About 2% of CI jobs fail (on several platforms) due to "Managed cluster should have the same number of machines and nodes" | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Clayton Coleman <ccoleman> |
| Component: | Cloud Compute | Assignee: | Michael Gugino <mgugino> |
| Cloud Compute sub component: | Other Providers | QA Contact: | sunzhaohua <zhsun> |
| Status: | CLOSED NOTABUG | Docs Contact: | |
| Severity: | high | | |
| Priority: | unspecified | CC: | mgugino, mimccune |
| Version: | 4.8 | | |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1940898 1940972 1941107 (view as bug list) | Environment: | |
| Last Closed: | 2021-08-19 09:57:52 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1940972, 1941107 | | |
|
Description
Clayton Coleman
2021-03-18 15:37:09 UTC
A much smaller number of failures have happened on AWS / Azure / GCP, so this is probably not common to all platforms (unless there is a race condition triggered by slower provisioning, or there are edge cases where we aren't tolerant of non-cloud errors in the provider stack, etc.). The first test case I looked at was OpenStack, and it failed to successfully provision a machine, so possibly a lack of capacity on that platform. The bare metal OVN tests lack gather-extra for analysis; we need to get that to conform.

https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere/1372795001506893824

The vSphere test has one worker node that failed to issue a CSR. The machine was created, and 3 minutes later it was 'provisioned', indicating the cloud was running the instance. However, this cluster has 5 nodes (2 workers, 3 masters) but 15 approved CSRs. There should only be 10 approved CSRs with 5 nodes (1 client and 1 serving CSR per node). This indicates there's some problem with the hosts or the CSR approver.

https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1373949894003265536

AWS serial failed during the delete/create:

Mar 22 12:09:19.296: INFO: Error getting nodes from machineSet: not all machines have a node reference: map[ci-op-h1zqznhj-ef0bf-jglv4-worker-us-west-1b-wgwmm:ip-10-0-213-0.us-west-1.compute.internal]

Seems similar.
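For context, the failing check ("Managed cluster should have same number of Machines and Nodes") effectively compares the Machines that have acquired a node reference against the cluster's Node list, which is also what the "not all machines have a node reference" error above is complaining about. Below is a minimal sketch of that comparison, assuming cluster-admin access and the standard machine.openshift.io/v1beta1 Machine CRD; it is an illustration, not the actual test code.

```python
# Sketch: compare Machines that have a nodeRef against the cluster's Nodes.
# Assumes a kubeconfig with cluster-admin access; illustrative only, not the
# origin test implementation.
from kubernetes import client, config

config.load_kube_config()

machines = client.CustomObjectsApi().list_namespaced_custom_object(
    group="machine.openshift.io",
    version="v1beta1",
    namespace="openshift-machine-api",
    plural="machines",
)["items"]

nodes = client.CoreV1Api().list_node().items

# A Machine that never gets status.nodeRef is exactly the failure mode
# reported by the test above.
missing = [
    m["metadata"]["name"]
    for m in machines
    if not m.get("status", {}).get("nodeRef")
]

print(f"machines={len(machines)} nodes={len(nodes)} missing nodeRef={missing}")
```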
(In reply to Clayton Coleman from comment #3)
> https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.8/1373949894003265536
>
> AWS serial failed during the delete/create:
>
> Mar 22 12:09:19.296: INFO: Error getting nodes from machineSet: not all machines have a node reference: map[ci-op-h1zqznhj-ef0bf-jglv4-worker-us-west-1b-wgwmm:ip-10-0-213-0.us-west-1.compute.internal]
>
> Seems similar.

machine-api looks fine here. Created a bug against RHCOS for the super-slow boot process: https://bugzilla.redhat.com/show_bug.cgi?id=1942145

As it turns out, "not all machines have a node reference" uncovered a different issue on Azure for the machine-api: the machine-controller is going OOM there: https://bugzilla.redhat.com/show_bug.cgi?id=1942161

From the same CI job, also found that the API server randomly dies: https://bugzilla.redhat.com/show_bug.cgi?id=1942169

Still seeing super slow boots here: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/pr-logs/pull/openshift_cluster-cloud-controller-manager-operator/34/pull-ci-openshift-cluster-cloud-controller-manager-operator-master-e2e-aws/1385156990736535552/artifacts/e2e-aws/gather-extra/artifacts/nodes/ip-10-0-171-146.us-east-2.compute.internal/

No progress has been made on https://bugzilla.redhat.com/show_bug.cgi?id=1942145

Filtering out the older releases (4.6/4.7), since any changes we've made likely won't have been backported and so won't show up on those builds, and also filtering out the kubevirt failures: https://search.ci.openshift.org/?search=Managed+cluster+should+have+same+number+of+Machines+and+Nodes&maxAge=168h&context=1&type=junit&name=&excludeName=4.7%7C4.6%7Ckubevirt&maxMatches=5&maxBytes=20971520&groupBy=job

We currently have a 0.07% match on this search term, which accounts for 0.26% of failures. Looking through a few of the runs:

- [1] vSphere: From the machine status we can see that the Machine was created and is being reported as existing by vSphere. Something went wrong during the startup of the VM, but I don't have any visibility into what that could be. It will either be an RHCOS issue or something on the vSphere side. Looking at the MCS logs I only see two Ignition requests, which implies that while the VM was created, it never booted to the Ignition phase; could this be a problem with vCenter capacity?
- [2] Agnostic on Azure: The test "flaked". We know VMs can be slow to boot on Azure, especially when you create multiple instances at the same time. It seems that when you request multiple, Azure processes them one by one and there is a small delay between the first and last VMs actually being created and starting. In this case it took an extra 15 minutes for the instance to start up, which is longer than I would have expected. All of the Machines report that the VM was created on the Azure side successfully within a few seconds of each other. Looking at MCS, all of the VMs called Ignition within around 3 minutes of each other. Looking at the CSRs, most were created in a timely manner, but the ones for one of the nodes were created significantly later. There was a large gap between Ignition and the kubelet starting/creating its CSR. We could possibly throw this over to the Node team to investigate.
- [3] vSphere OVN: Same symptoms as [1].
- [4] Azure 4.9 OVN: Same symptoms as [2], except the machines took 4 minutes (about normal), 11 minutes and 21 minutes respectively to come up. (I also looked through another few Azure runs which show similar symptoms.)
- [5] AWS RHEL: Looking at the Machine logs, around the time the test started, the Machine controller was in the process of removing some Machines and adding some new ones. It seems that the RHEL test setup involves spinning up a cluster, then removing the existing MachineSets and creating new RHEL MachineSets. What they aren't doing is waiting for those new Machines to be ready before starting the test, which causes this issue. I would suggest whoever owns that test adds some wait to make sure the Machines come up before they start running the tests (a rough polling sketch follows below).

[1] https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-vsphere/1427480336643657728
[2] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_cluster-version-operator/637/pull-ci-openshift-cluster-version-operator-master-e2e-agnostic/1427376531528749057
[3] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_ovn-kubernetes/660/pull-ci-openshift-ovn-kubernetes-master-e2e-vsphere-ovn/1427358306669694976
[4] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_release/18796/rehearse-18796-periodic-ci-openshift-release-master-ci-4.9-e2e-azure-ovn/1427353960854851584
[5] https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_installer/5088/pull-ci-openshift-installer-master-e2e-aws-workers-rhel7/1427266814060007424

Scanning through the rest of the tests, I think I've captured all of the failure scenarios we are seeing today for this.
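Regarding the suggestion in [5]: the wait could be as simple as polling the Machine API until every Machine is running and has a node reference. The sketch below assumes the machine.openshift.io/v1beta1 API and the "Running" phase reported by the machine controller; the timeout, interval, and overall shape are illustrative assumptions, not the RHEL job's actual setup code.

```python
# Sketch: block until every Machine in openshift-machine-api reports
# phase "Running" and has a nodeRef, or fail after a timeout.
# Timeout/interval values are assumptions, not the CI job's settings.
import time

from kubernetes import client, config


def wait_for_machines_ready(timeout_seconds=1800, poll_interval=30):
    config.load_kube_config()
    api = client.CustomObjectsApi()
    deadline = time.time() + timeout_seconds
    not_ready = []

    while time.time() < deadline:
        machines = api.list_namespaced_custom_object(
            group="machine.openshift.io",
            version="v1beta1",
            namespace="openshift-machine-api",
            plural="machines",
        )["items"]

        not_ready = [
            m["metadata"]["name"]
            for m in machines
            if m.get("status", {}).get("phase") != "Running"
            or not m.get("status", {}).get("nodeRef")
        ]
        if machines and not not_ready:
            return
        print(f"still waiting for machines: {not_ready}")
        time.sleep(poll_interval)

    raise TimeoutError(f"machines not ready after {timeout_seconds}s: {not_ready}")


if __name__ == "__main__":
    wait_for_machines_ready()
```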
Next steps:

- Understand if there's something better we can do to ensure vSphere capacity is provided for (I suspect some tests aren't requesting the right number of machines)
- Understand if there's anything that can be done to investigate slow Azure boots
- Find out who owns the RHEL tests and see if they can come up with a way to prevent this test from failing in this manner

Spoke to the installer team about the RHEL issue; they are aware and are tracking it in https://bugzilla.redhat.com/show_bug.cgi?id=1979966

The Azure slow boots show the same symptoms as https://bugzilla.redhat.com/show_bug.cgi?id=1942145, which the RHCOS team are aware of. I have added the details from my analysis today to that bug, so hopefully it gets picked up again soon.

I spoke to the splat team yesterday, who showed me that CI jobs now have vSphere diagnostic data captured during the gather steps in deprovision. Looking at the above failures for vSphere, the VM screenshots for the Machines that never came up suggest that the VMs are actually not powered on. The splat team suggested that the issue could be related to this existing bug: https://bugzilla.redhat.com/show_bug.cgi?id=1952739. The team are also going to add additional VM information to the vSphere diagnostics to allow us to determine VM configuration and, in particular, power states.

---

This means that all three of the symptoms identified for the failures in these tests are known issues with teams actively working on them and bugs assigned to the appropriate teams. I don't think there's much more we can do with this bug in particular, so I'm going to close it now and defer to the other teams with their respective bugs.
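For reference on the power-state data mentioned in the splat team note above: whether a VM ever powered on can be answered with a short vCenter query. Below is a hedged sketch using pyvmomi with placeholder host and credentials; it is not the CI diagnostics gather code, just an illustration of the kind of information being added.

```python
# Sketch: list VMs and their power states from vCenter via pyvmomi.
# Host, user and password are placeholders; this is not the CI gather code.
import ssl

from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # CI vCenters often use self-signed certs
si = SmartConnect(host="vcenter.example.com", user="user", pwd="password", sslContext=ctx)

try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True
    )
    for vm in view.view:
        # powerState is one of poweredOn, poweredOff, suspended
        print(vm.name, vm.runtime.powerState)
    view.Destroy()
finally:
    Disconnect(si)
```

A Machine whose VM shows up here as poweredOff would line up with the screenshot evidence described above.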