Bug 1971207 - installer only created one worker node and the install failed
Summary: installer only created one worker node and the install failed
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Installer
Version: 4.8
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: 4.9.0
Assignee: Etienne Simard
QA Contact: Etienne Simard
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2021-06-12 22:44 UTC by Ben Parees
Modified: 2021-10-18 17:34 UTC
CC List: 2 users

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
CI-only fix.
Clone Of:
Environment:
Last Closed: 2021-10-18 17:33:48 UTC
Target Upstream Version:
Embargoed:




Links
- GitHub: openshift/release pull 19285 (open) - Bug 1971207: AWS CI dynamic availability zone selection (last updated 2021-06-15 14:14:37 UTC)
- Red Hat Product Errata: RHSA-2021:3759 (last updated 2021-10-18 17:34:08 UTC)

Description Ben Parees 2021-06-12 22:44:16 UTC
Version:

$ openshift-install version
4.8.0-0.nightly-2021-06-11-192710 


Platform: AWS - IPI

What happened?

The install failed when various operators couldn't deploy their replicas because only a single worker node existed:

INFO[2021-06-12T03:46:21Z] level=error msg=Cluster operator ingress Degraded is True with IngressDegraded: The "default" ingress controller reports Degraded=True: DegradedConditions: One or more other status conditions indicate a degraded state: PodsScheduled=False (PodsNotScheduled: Some pods are not scheduled: Pod "router-default-7d49958b56-npnxn" cannot be scheduled: 0/4 nodes are available: 1 node(s) didn't match pod affinity/anti-affinity rules, 1 node(s) didn't match pod anti-affinity rules, 3 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate. Make sure you have sufficient worker nodes.) 
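
A quick way to confirm the worker shortfall on a live cluster (illustrative command, not from the original report; the same information is in the nodes.json linked below):

$ oc get nodes -l node-role.kubernetes.io/worker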

Full logs from job run here:
https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1403543544680943616

must gather here:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1403543544680943616/artifacts/e2e-aws-canary/gather-must-gather/artifacts/must-gather.tar


node status here:
https://gcsweb-ci.apps.ci.l2s4.p1.openshiftapps.com/gcs/origin-ci-test/logs/periodic-ci-openshift-release-master-nightly-4.8-e2e-aws-canary/1403543544680943616/artifacts/e2e-aws-canary/gather-extra/artifacts/nodes.json


What did you expect to happen?

the install to complete successfully


How to reproduce it (as minimally and precisely as possible)?

Unknown, but this job has hit it several times, and it uses the standard AWS IPI CI install flow.

In fact it looks like a lot of jobs are hitting it:
https://search.ci.openshift.org/?search=The+%22default%22+ingress+controller+reports+Degraded%3DTrue&maxAge=48h&context=1&type=junit&name=4.8&excludeName=&maxMatches=5&maxBytes=20971520&groupBy=job

Comment 1 Matthew Staebler 2021-06-12 23:21:04 UTC
  errorMessage: 'error launching instance: Your requested instance type (m4.xlarge)
    is not supported in your requested Availability Zone (us-west-2d). Please retry
    your request by not specifying an Availability Zone or choosing us-west-2a, us-west-2b,
    us-west-2c.'
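
For reference, this is the kind of provisioning error recorded on a failed worker Machine object, and it can typically be pulled out with something like the following (illustrative commands, assuming access to the cluster or the must-gather):

$ oc get machines -n openshift-machine-api
$ oc get machine <machine-name> -n openshift-machine-api -o jsonpath='{.status.errorMessage}'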

Comment 2 Matthew Staebler 2021-06-12 23:23:08 UTC
@esimard This may be related to the recent CI changes to select the availability zone dynamically. Could you take a look at this?

Comment 3 Etienne Simard 2021-06-15 13:21:07 UTC
Hello,

I can confirm what you suggested.

There is an edge case (at minimum) in this region where certain instance types are not available in every availability zone. My assumption that looking for the largest one would be enough was not right.

Expanding the instance type lookup to be per availability zone should fix this.
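
A minimal sketch of that per-zone lookup, assuming the AWS CLI is available (the real fix is the openshift/release pull request linked above, not this snippet):

# List only the zones in the region that actually offer the chosen instance type.
$ INSTANCE_TYPE=m4.xlarge
$ aws ec2 describe-instance-type-offerings \
      --region us-west-2 \
      --location-type availability-zone \
      --filters Name=instance-type,Values="${INSTANCE_TYPE}" \
      --query 'InstanceTypeOfferings[].Location' \
      --output text

Per the error in comment 1, at the time of the failure this would have returned us-west-2a, us-west-2b, and us-west-2c, but not us-west-2d.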

Comment 5 Etienne Simard 2021-06-17 00:40:49 UTC
CI-only fix (verified in CI).

Comment 8 errata-xmlrpc 2021-10-18 17:33:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.9.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:3759

