1870343 – [sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Machine Config Operator
Sub Component:
Version:	4.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	urgent
Severity:	urgent
Target Milestone:	---
Target Release:	4.6.0
Assignee:	Antonio Murdaca
QA Contact:	Michael Nguyen
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1859428
Depends On:
Blocks:

Reported:	2020-08-19 20:11 UTC by David Eads
Modified:	2020-10-27 16:30 UTC
CC List:	7 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:	[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously
Last Closed:	2020-10-27 16:29:46 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Github	openshift machine-config-operator pull 2055	0	None	closed	Bug 1870343: machine-config-daemon-pull: Retry pulling image	2020-11-16 01:39:29 UTC
Red Hat Product Errata	RHBA-2020:4196	0	None	None	None	2020-10-27 16:30:01 UTC

Description David Eads 2020-08-19 20:11:49 UTC

test:
[sig-cluster-lifecycle][Feature:Machines][Serial] Managed cluster should grow and decrease when scaling different machineSets simultaneously 

is failing frequently in CI, see search results:
https://search.ci.openshift.org/?maxAge=168h&context=1&type=bug%2Bjunit&name=&maxMatches=5&maxBytes=20971520&groupBy=job&search=%5C%5Bsig-cluster-lifecycle%5C%5D%5C%5BFeature%3AMachines%5C%5D%5C%5BSerial%5C%5D+Managed+cluster+should+grow+and+decrease+when+scaling+different+machineSets+simultaneously


This is failing on 5% of all serial runs, but the failures are heavily clustered on AWS, which makes it more impactful on that platform.


https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-serial-4.6/1295796415158554624

Comment 1 Colin Walters 2020-08-19 20:41:56 UTC

https://github.com/openshift/machine-config-operator/blob/master/docs/FAQ.md#q-how-does-this-relate-to-machine-api

Comment 2 Joel Speed 2020-08-20 16:39:29 UTC

@Michael McCune, do you think you could take a look at this during the next sprint? I believe you recently were refactoring this test

Comment 3 Michael McCune 2020-08-20 21:45:44 UTC

(In reply to Joel Speed from comment #2)
> @Michael McCune, do you think you could take a look at this during the next
> sprint? I believe you recently were refactoring this test

yeah, definitely. i've added myself as the assignee. thanks for the heads up Joel!

Comment 4 Michael Gugino 2020-08-28 16:35:22 UTC

I have taken a look into this. ci-op-351tf1zl-c470d-vgjf6-worker-us-east-1b-fg78b is one of two machines stuck in 'provisioned' that should not be. The machine is listed as 'running' in the cloud and has networking info. There are no pending CSRs. There is no trace of the machine in the MCS server.

This looks like an ignition failure to me.  Moving to MCO team.

Comment 5 Antonio Murdaca 2020-09-01 08:07:19 UTC

possibly related to https://bugzilla.redhat.com/show_bug.cgi?id=1859428#c12

Comment 6 David Eads 2020-09-02 15:04:37 UTC

This is failing in 10% of runs on OpenStack and AWS:

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-openstack-serial-4.6

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-blocking#release-openshift-ocp-installer-e2e-aws-serial-4.6

Bumping severity to urgent to reflect failure percentages

Comment 7 Yu Qi Zhang 2020-09-02 16:02:03 UTC

So I took a quick look at a failing job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.6/1300592655343816704
and a passing job https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.6/1300691923236818944

First thing:

> There is no trace of the machine in the MCS server.

That's not the case, see the following MCS logs:
https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.6/1300592655343816704/artifacts/e2e-aws-serial/pods/openshift-machine-config-operator_machine-config-server-5rm2q_machine-config-server.log

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.6/1300592655343816704/artifacts/e2e-aws-serial/pods/openshift-machine-config-operator_machine-config-server-cbbff_machine-config-server.log

https://storage.googleapis.com/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.6/1300592655343816704/artifacts/e2e-aws-serial/pods/openshift-machine-config-operator_machine-config-server-kqtgh_machine-config-server.log

There are 5 requests: 3 initially and 2 a bit later. Those are the 3 initial workers + 2 scale-up workers. It's a bit confusing because I think it's logging based on the IP of the availability zone (2/1 zone split -> 3/2 as defined by the job, so it checks out there), which is odd because I thought we used the IP of the workers.

I don't think it's an ignition failure based on that (it's still possible), but I cannot verify since I have no access to the console logs. See: https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.6/1300592655343816704/artifacts/e2e-aws-serial/nodes/

All 8 nodes are there (6 initial + 2 scale-up), but none of them has any console logs (all empty). This is also the case for the passing job, so something may have regressed in the gathering of those logs. In light of this, I don't have the logs to continue debugging. We'd have to somehow reproduce this in a live cluster or have the console logs capture more info.

One suspect is this: https://bugzilla.redhat.com/show_bug.cgi?id=1859428#c16, i.e. that the image pull failed on those nodes. That said, since 2 workers are being scaled up, what are the chances that they both failed a pull if it's just a transient network hiccup? I'm not sure where in the networking stack it fails either, so I think we'd have to check deeper.
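
As a rough illustration of that question, here is a minimal sketch that assumes the two pulls fail independently with the same per-node probability. Both the independence assumption and the sample probabilities are hypothetical; nothing in the logs confirms them.

```python
def both_fail_probability(p_single):
    """Probability that two pulls both fail, ASSUMING the failures are
    independent and each has probability p_single (hypothetical model)."""
    return p_single ** 2


# Illustrative per-node failure probabilities, not measured values.
for p in (0.01, 0.05, 0.10):
    print(f"per-node failure {p:.0%} -> both fail {both_fail_probability(p):.2%}")
```

Under this model a double failure is much rarer than a single one, so if both scale-up workers fail together in a noticeable share of runs, the failures are probably correlated (for example, a shared network or registry problem) rather than independent hiccups.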

The easiest "fix" we can move forward with is to add a retry to that service and see where that gets us, but without more info I have no guarantee that will solve this bug.
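
The retry idea could look roughly like the sketch below. It is not the actual change to the service: the `podman pull` command line, the function name `pull_with_retry`, the attempt count, and the doubling delay are all assumptions chosen for illustration.

```python
import subprocess
import time


def pull_with_retry(image, attempts=5, delay_seconds=1.0,
                    runner=subprocess.run, sleep=time.sleep):
    """Run an image pull up to `attempts` times, doubling the wait between tries.

    Returns the number of the attempt that succeeded, or raises RuntimeError
    if every attempt fails. The `podman pull` command is an assumption about
    how the image is fetched; `runner` and `sleep` are injectable for testing.
    """
    for attempt in range(1, attempts + 1):
        result = runner(["podman", "pull", image], capture_output=True, text=True)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Back off before the next try so a short network hiccup can clear.
            sleep(delay_seconds)
            delay_seconds *= 2
    raise RuntimeError(f"pull of {image} failed after {attempts} attempts")
```

A retry like this only helps if the failure really is transient; it would hide a short hiccup but not a persistent networking problem.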

Comment 8 Antonio Murdaca 2020-09-08 12:40:01 UTC

*** Bug 1859428 has been marked as a duplicate of this bug. ***

Comment 12 Micah Abbott 2020-09-19 15:28:36 UTC

Looking at the CI history of the e2e-aws-serial job

https://prow.ci.openshift.org/job-history/gs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-aws-serial-4.6

...shows mostly green for the last 7 days

Failures found in the history were unrelated to the test that was originally reported as failing.

Marking VERIFIED

Comment 14 errata-xmlrpc 2020-10-27 16:29:46 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.6 GA Images), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4196
