Bug 1971207 - installer only created one worker node and the install failed
Summary: installer only created one worker node and the install failed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Etienne Simard
QA Contact: Etienne Simard
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-12 22:44 UTC by Ben Parees
Modified: 2021-10-18 17:34 UTC
CC List: 2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
CI-only fix.
Clone Of:
Environment:
Last Closed: 2021-10-18 17:33:48 UTC
Target Upstream Version:
Embargoed:




Links
- GitHub: openshift/release pull 19285 (open) - Bug 1971207: AWS CI dynamic availability zone selection (last updated 2021-06-15 14:14:37 UTC)
- Red Hat Product Errata: RHSA-2021:3759 (last updated 2021-10-18 17:34:08 UTC)

Description Ben Parees 2021-06-12 22:44:16 UTC
Version:

$ openshift-install version
4.8.0-0.nightly-2021-06-11-192710 


Platform: AWS - IPI

What happened?

The install failed when various operators couldn't deploy their replicas because only a single worker node existed:

INFO[2021-06-12T03:46:21Z] level=error msg=Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-7d49958b56-npnxn" cannot be scheduled: 0/4 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity rules, 1 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.) 
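
A quick way to confirm the worker shortfall on a live cluster (illustrative command, not from the original report; the same information is in the nodes.json linked below):

$ oc get nodes -l node-role.kubernetes.io/worker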

Full logs from job run here:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1403543544680943616

must gather here:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1403543544680943616/artifacts/e2e-aws-canary/gather-must-gather/artifacts/must-gather.tar


node status here:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1403543544680943616/artifacts/e2e-aws-canary/gather-extra/artifacts/nodes.json


What did you expect to happen?

the install to complete successfully


How to reproduce it (as minimally and precisely as possible)?

Unknown, but this job has hit it several times, and it uses the standard AWS IPI CI install flow.

In fact it looks like a lot of jobs are hitting it:
https://search.ci.openshift.org/?search=The+%22default%22+ingress+controller+reports+Degraded%3DTrue&maxAge=48h&context=1&type=junit&name=4.8&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 1 Matthew Staebler 2021-06-12 23:21:04 UTC
  errorMessage: 'error launching instance: Your requested instance type (m4.xlarge)
    is not supported in your requested Availability Zone (us-west-2d). Please retry
    your request by not specifying an Availability Zone or choosing us-west-2a, us-west-2b,
    us-west-2c.'
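
For reference, this is the kind of provisioning error recorded on a failed worker Machine object, and it can typically be pulled out with something like the following (illustrative commands, assuming access to the cluster or the must-gather):

$ oc get machines -n openshift-machine-api
$ oc get machine <machine-name> -n openshift-machine-api -o jsonpath='{.status.errorMessage}'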

Comment 2 Matthew Staebler 2021-06-12 23:23:08 UTC
@esimard This may be related to the recent CI changes to select the availability zone dynamically. Could you take a look at this?

Comment 3 Etienne Simard 2021-06-15 13:21:07 UTC
Hello,

I can confirm what you suggested.

There is an edge case (at minimum) in this region where certain instance types are not available in every availability zone. My assumption that looking for the largest one would be enough was not right.

Expanding the instance type lookup to be per availability zone should fix this.
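
A minimal sketch of that per-zone lookup, assuming the AWS CLI is available (the real fix is the openshift/release pull request linked above, not this snippet):

# List only the zones in the region that actually offer the chosen instance type.
$ INSTANCE_TYPE=m4.xlarge
$ aws ec2 describe-instance-type-offerings \
      --region us-west-2 \
      --location-type availability-zone \
      --filters Name=instance-type,Values="${INSTANCE_TYPE}" \
      --query 'InstanceTypeOfferings[].Location' \
      --output text

Per the error in comment 1, at the time of the failure this would have returned us-west-2a, us-west-2b, and us-west-2c, but not us-west-2d.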

Comment 5 Etienne Simard 2021-06-17 00:40:49 UTC
CI-only fix (verified in CI).

Comment 8 errata-xmlrpc 2021-10-18 17:33:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

