[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously
is failing frequently in CI, see search results:
This is failing on 5% of all serial runs, but the failures are heavily clustered around AWS, which makes it more impactful on that platform.
@Michael McCune, do you think you could take a look at this during the next sprint? I believe you recently were refactoring this test
(In reply to Joel Speed from comment #2)
> @Michael McCune, do you think you could take a look at this during the next
> sprint? I believe you recently were refactoring this test
yeah, definitely. i've added myself as the assignee. thanks for the heads up Joel!
I have taken a look into this. ci-op-351tf1zl-c470d-vgjf6-worker-us-east-1b-fg78b is one of two machines stuck in 'provisioned' when it should not be. The machine is listed as 'running' in the cloud and has networking info. There are no pending CSRs. There is no trace of the machine in the MCS server.
This looks like an ignition failure to me. Moving to MCO team.
possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1859428#c12
This is failing 10% of runs on OpenStack and AWS:
Bumping severity to urgent to reflect failure percentages
So I took a quick look at a failing job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.6/1300592655343816704
and a passing job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.6/1300691923236818944
> There is no trace of the machine in the MCS server.
That's not the case, see the following MCS logs:
There are 5 requests, 3 initially and 2 a bit later. Those are the 3 initial workers + 2 scaleup workers. It's a bit confusing because I think it's logging based on the IP of the availability zone (2/1 zone split -> 3/2 as defined by the job, so it checks out there), which is odd because I thought we used the IP of the workers.
I don't think it's an ignition failure based on that (it's still possible), but I cannot verify since I have no access to the console logs. See: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.6/1300592655343816704/artifacts/e2e-aws-serial/nodes/
All 8 nodes are there (6 initial + 2 scaleup), but none of them has any console logs (all empty). This is also the case for the passing job, so the gathering of those logs may have regressed. In light of this I don't have the logs to continue debugging. We'd have to somehow reproduce this in a live cluster or have the console logs capture more info.
One suspect is this: https://bugzilla.redhat.com/show_bug.cgi?id=1859428#c16, i.e. that the image pull failed on those nodes. That said, since 2 workers are being scaled up, what are the chances that they both failed a pull if it's just a transient network hiccup? I'm not sure where in the networking stack it fails either, so I think we'd have to check deeper.
The easiest "fix" we can move forward with is to add a retry to that service and see where that gets us, but without more info I have no guarantee that will solve this bug.
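To make the retry idea concrete, here is a rough sketch of retry-with-exponential-backoff around an operation that can fail transiently. This is only an illustration and not the actual change to the service: the `retry` helper, its parameters, and the `pull_image` name in the usage note are hypothetical, and the real fix may live in the unit definition rather than in code like this.

```python
import time


def retry(operation, attempts=5, initial_delay=1.0, factor=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or the attempts are exhausted.

    Hypothetical helper for illustration: waits initial_delay seconds after
    the first failure and multiplies the wait by factor after each further
    failure. The last failure is re-raised to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            # Out of attempts: surface the original error instead of hiding it.
            if attempt == attempts:
                raise
            sleep(delay)
            delay *= factor
```

Usage would look like `retry(pull_image, attempts=5)`, where `pull_image` stands in for whatever call is failing. If both scaleup workers still fail after several spaced-out attempts, that would point away from a transient network hiccup.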
*** Bug 1859428 has been marked as a duplicate of this bug. ***
Looking at the CI history of the e2e-aws-serial job
...shows mostly green for the last 7 days
Failures found in the history were unrelated to the test that was originally reported as failing.