Bug 1870343 - [sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously
Summary: [sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Target Milestone: ---
Target Release: 4.6.0
Assignee: Antonio Murdaca
QA Contact: Michael Nguyen
Duplicates: 1859428
Depends On:
Reported: 2020-08-19 20:11 UTC by David Eads
Modified: 2020-09-19 15:28 UTC
CC List: 7 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment: [sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously
Last Closed:
Target Upstream Version:


System: GitHub openshift/machine-config-operator pull 2055
Priority: None
Status: closed
Summary: Bug 1870343: machine-config-daemon-pull: Retry pulling image
Last Updated: 2020-09-18 13:04:14 UTC

Description David Eads 2020-08-19 20:11:49 UTC
[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously 

is failing frequently in CI; see search results:

This is failing on 5% of all serial runs, but failures are heavily clustered on AWS, which makes it more impactful on that platform.


Comment 2 Joel Speed 2020-08-20 16:39:29 UTC
@Michael McCune, do you think you could take a look at this during the next sprint? I believe you were recently refactoring this test.

Comment 3 Michael McCune 2020-08-20 21:45:44 UTC
(In reply to Joel Speed from comment #2)
> @Michael McCune, do you think you could take a look at this during the next
> sprint? I believe you were recently refactoring this test.

Yeah, definitely. I've added myself as the assignee. Thanks for the heads up, Joel!

Comment 4 Michael Gugino 2020-08-28 16:35:22 UTC
I have taken a look at this. ci-op-351tf1zl-c470d-vgjf6-worker-us-east-1b-fg78b is one of two machines stuck in 'provisioned' when it should not be. The machine is listed as 'running' in the cloud and has networking info. There are no pending CSRs, and there is no trace of the machine in the MCS server.

This looks like an Ignition failure to me. Moving to the MCO team.
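
For context, a Machine normally progresses from 'Provisioning' to 'Provisioned' to 'Running' once its node registers with the cluster, so a machine stuck in 'Provisioned' means the cloud instance booted but the node never joined. The sketch below is hypothetical and not tooling from this bug (the kubeconfig path and printed text are assumptions); it shows one way to spot such machines with the dynamic client:

    // Hypothetical sketch: list Machine objects stuck in the "Provisioned"
    // phase. The GVR matches the machine.openshift.io/v1beta1 Machine
    // resource; everything else here is illustrative.
    package main

    import (
        "context"
        "fmt"

        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
        "k8s.io/apimachinery/pkg/runtime/schema"
        "k8s.io/client-go/dynamic"
        "k8s.io/client-go/tools/clientcmd"
    )

    func main() {
        // Assumes a kubeconfig in the default location (~/.kube/config).
        cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
        if err != nil {
            panic(err)
        }
        client, err := dynamic.NewForConfig(cfg)
        if err != nil {
            panic(err)
        }
        gvr := schema.GroupVersionResource{
            Group: "machine.openshift.io", Version: "v1beta1", Resource: "machines",
        }
        machines, err := client.Resource(gvr).Namespace("openshift-machine-api").
            List(context.TODO(), metav1.ListOptions{})
        if err != nil {
            panic(err)
        }
        for _, m := range machines.Items {
            // status.phase is set by the machine controller.
            phase, _, _ := unstructured.NestedString(m.Object, "status", "phase")
            if phase == "Provisioned" {
                fmt.Printf("%s is stuck in Provisioned\n", m.GetName())
            }
        }
    }

Using the dynamic client keeps the sketch free of the typed machine-api packages, whose import paths have moved between releases.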

Comment 5 Antonio Murdaca 2020-09-01 08:07:19 UTC
possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1859428#c12

Comment 7 Yu Qi Zhang 2020-09-02 16:02:03 UTC
So I took a quick look at a failing job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.6/1300592655343816704
and a passing job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.6/1300691923236818944

First thing:

> There is no trace of the machine in the MCS server.

That's not the case; see the following MCS logs:



There are 5 requests: 3 initially and 2 a bit later. Those are the 3 initial workers plus the 2 scale-up workers. It's a bit confusing because I think it's logging based on the IP of the availability zone (a 2/1 zone split -> 3/2 as defined by the job, so it checks out there), which is odd because I thought we used the IPs of the workers.

I don't think it's an Ignition failure based on that (though it's still possible), but I cannot verify since I have no access to the console logs. See: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.6/1300592655343816704/artifacts/e2e-aws-serial/nodes/

All 8 nodes are there (6 initial + 2 scale-up), but none of them has any console logs (all empty). This is also the case for the passing job, so the gathering of those logs may have regressed. In light of this I don't have the logs to continue debugging; we'd have to somehow reproduce this in a live cluster or have the console logs capture more info.

One suspect is https://bugzilla.redhat.com/show_bug.cgi?id=1859428#c16: the image pull failed on those nodes. That said, since 2 workers are being scaled up, what are the chances that both failed a pull if it's just a transient network hiccup? I'm not sure where in the networking stack it fails either, so I think we'd have to dig deeper.

The easiest "fix" we can move forward with is to add a retry to that service and see where that gets us, but without more info I have no guarantee that it will solve this bug.
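
For illustration only, the retry approach could look like the sketch below, built on wait.ExponentialBackoff from k8s.io/apimachinery. pullImage and the image reference are hypothetical stand-ins; the actual fix (PR 2055, linked above) went into the machine-config-daemon-pull service, so this approximates the idea rather than reproducing the merged change:

    // Sketch of the proposed retry: keep re-attempting the image pull with
    // exponential backoff instead of failing on the first transient error.
    package main

    import (
        "fmt"
        "os/exec"
        "time"

        "k8s.io/apimachinery/pkg/util/wait"
    )

    // pullImage shells out to podman; any error is treated as retriable.
    // This is a hypothetical stand-in for the real pull step.
    func pullImage(image string) error {
        return exec.Command("podman", "pull", image).Run()
    }

    func main() {
        image := "quay.io/example/machine-config-daemon:latest" // hypothetical ref
        backoff := wait.Backoff{
            Duration: 5 * time.Second, // delay before the first retry
            Factor:   2.0,             // double the delay each attempt
            Steps:    5,               // give up after 5 attempts
        }
        err := wait.ExponentialBackoff(backoff, func() (bool, error) {
            if pullErr := pullImage(image); pullErr != nil {
                fmt.Println("pull failed, retrying:", pullErr)
                return false, nil // retriable: try again after the backoff
            }
            return true, nil // success: stop retrying
        })
        if err != nil {
            fmt.Println("giving up:", err)
        }
    }

A bounded number of steps still lets a real outage surface quickly, while absorbing the kind of transient network hiccup suspected here.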

Comment 8 Antonio Murdaca 2020-09-08 12:40:01 UTC
*** Bug 1859428 has been marked as a duplicate of this bug. ***

Comment 12 Micah Abbott 2020-09-19 15:28:36 UTC
Looking at the CI history of the e2e-aws-serial job


...shows mostly green for the last 7 days

Failures found in the history were unrelated to the test that was originally reported as failing.

