Three azure runs in the last day failed to create the expected number of nodes. This caused an install failure because ingress never became fully available.

https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-azure-4.6

Specifically:
1. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1303312119290138624
2. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1303077792237228032
3. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1302961038026608640
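For anyone triaging similar runs against a live cluster, a quick way to confirm this failure mode is to compare the expected and actual machine/node counts and then check the ingress operator. This is only a sketch of typical triage commands, not something taken from these job artifacts; it assumes standard oc access and the default openshift-machine-api namespace:

    # Compare desired replicas against the machines actually created
    oc get machinesets -n openshift-machine-api
    oc get machines -n openshift-machine-api

    # Nodes that registered; ingress stays unavailable until enough workers join
    oc get nodes
    oc get clusteroperator ingress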
(In reply to David Eads from comment #0)
> Three azure runs in the last day failed to create the expected number of
> nodes. This caused an install failure because ingress never became fully
> available.
>
> https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-openshift-ocp-installer-e2e-azure-4.6
>
> Specifically:
> 1. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1303312119290138624

The master hosts on this cluster must be seriously broken. There are pauses of several minutes between log lines on the machine-controller. There are also possible DNS lookup failures for authentication against the cloud.

> 2. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1303077792237228032

On this one, the machine-api-operator never successfully rolled out:

    {
        "lastTransitionTime": "2020-09-07T21:29:55Z",
        "lastUpdateTime": "2020-09-07T21:29:55Z",
        "message": "pods \"machine-api-operator-84b76f9dcd-\" is forbidden: unable to validate against any security context constraint: []",
        "reason": "FailedCreate",
        "status": "True",
        "type": "ReplicaFailure"
    },

This looks to be some authorization problem on the cluster.

> 3. https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1302961038026608640

On this one, the machine-controller appeared to be working normally, but it didn't start until:

    I0907 14:11:58.463861 1 main.go:105] Watching machine-api objects only in namespace "openshift-machine-api" for reconciliation.

and the test was killed at:

    2020/09/07 14:27:22 Container setup in pod e2e-azure failed, exit code 1, reason Error

So if the test had run for a few more minutes, the machines would most likely have become nodes. I can see the approved CSR requests for the 2 node bootstrappers, within a normal time frame for Azure (instances take about 15-20 minutes from machine creation to the node joining the cluster). In fact, one worker machine did become a node:

    "creationTimestamp": "2020-09-07T14:28:04Z",

I don't see anything that implicates the machine-api in these failures at this time. These look like resource starvation in the first two cases and a slow install in the last case.
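For completeness, the two checks above can be reproduced roughly as follows. This is a sketch rather than commands pulled from the job artifacts; the deployment name is inferred from the pod name in the error message, and <csr-name> is a placeholder:

    # Surface the ReplicaFailure condition on the operator deployment
    oc get deployment machine-api-operator -n openshift-machine-api -o jsonpath='{.status.conditions}'

    # List CSRs; the node-bootstrapper requests should show Approved,Issued
    oc get csr
    oc describe csr <csr-name>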
*** Bug 1875774 has been marked as a duplicate of this bug. ***
I think this is related to https://bugzilla.redhat.com/show_bug.cgi?id=1877483, specifically this run: https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-ocp-installer-e2e-azure-4.6/1303077792237228032
Moving this over to the node team to investigate poor master performance.