Bug 1877059 - missing nodes on azure
Summary: missing nodes on azure
Keywords:
Status: CLOSED DUPLICATE of bug 1857446
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.0
Assignee: Ryan Phillips
QA Contact: Weinan Liu
URL:
Whiteboard:
Duplicates: 1875774
Depends On:
Blocks: 1882116
 
Reported: 2020-09-08 18:48 UTC by David Eads
Modified: 2020-11-03 19:11 UTC
CC: 5 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2020-11-03 19:11:36 UTC
Target Upstream Version:
Embargoed:



Comment 1 Michael Gugino 2020-09-08 19:30:06 UTC
(In reply to David Eads from comment #0)
> Three azure runs in the last day failed to create the expected number of
> nodes. This causes an install failure, as ingress doesn't become fully available.
> 
> https://testgrid.k8s.io/redhat-openshift-ocp-release-4.6-informing#release-
> openshift-ocp-installer-e2e-azure-4.6
> 
> specifically
> 1.
> https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-
> ocp-installer-e2e-azure-4.6/1303312119290138624

The master hosts on this cluster must be seriously broken.  There are several-minute pauses between log lines from the machine-controller, and possible DNS lookup failures when authenticating against the cloud.

> 2.
> https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-
> ocp-installer-e2e-azure-4.6/1303077792237228032

On this one, the machine-api-operator never successfully rolled out:

                    {
                        "lastTransitionTime": "2020-09-07T21:29:55Z",
                        "lastUpdateTime": "2020-09-07T21:29:55Z",
                        "message": "pods \"machine-api-operator-84b76f9dcd-\" is forbidden: unable to validate against any security context constraint: []",
                        "reason": "FailedCreate",
                        "status": "True",
                        "type": "ReplicaFailure"
                    },


Looks to be some authorization problem on the cluster.
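As an aside, a condition like the one pasted above can be spotted programmatically. A minimal sketch, assuming the deployment status JSON has already been fetched (e.g. via `oc get deployment machine-api-operator -n openshift-machine-api -o jsonpath='{.status}'`); the sample payload here is abridged from the excerpt above:

```python
import json

# Abridged sample mirroring the condition quoted above.
status_json = """
{
  "conditions": [
    {
      "lastTransitionTime": "2020-09-07T21:29:55Z",
      "message": "pods \\"machine-api-operator-84b76f9dcd-\\" is forbidden: unable to validate against any security context constraint: []",
      "reason": "FailedCreate",
      "status": "True",
      "type": "ReplicaFailure"
    }
  ]
}
"""

def failed_replica_reasons(status):
    """Return the reasons for any ReplicaFailure=True conditions."""
    return [c["reason"]
            for c in status.get("conditions", [])
            if c["type"] == "ReplicaFailure" and c["status"] == "True"]

print(failed_replica_reasons(json.loads(status_json)))  # ['FailedCreate']
```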

> 3.
> https://prow.ci.openshift.org/view/gcs/origin-ci-test/logs/release-openshift-
> ocp-installer-e2e-azure-4.6/1302961038026608640

On this one, the machine-controller appeared to be working normally, but it didn't start until:

I0907 14:11:58.463861       1 main.go:105] Watching machine-api objects only in namespace "openshift-machine-api" for reconciliation.


and the test was killed at:

2020/09/07 14:27:22 Container setup in pod e2e-azure failed, exit code 1, reason Error

So, if the test had run for a few more minutes, the machines would most likely have become nodes.  I can see the approved CSR requests for the 2 node bootstrappers, within a normal time frame for Azure (instances take about 15-20 minutes from machine creation to the node joining the cluster).  In fact, one worker machine did become a node:

"creationTimestamp": "2020-09-07T14:28:04Z",


I don't see anything that implicates the machine-api in these failures at this time.  These look like resource starvation in the first two cases and a slow install in the last case.
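The timing in case 3 can be sanity-checked with quick arithmetic on the two timestamps quoted above (controller start at 14:11:58, test killed at 14:27:22, both UTC):

```python
from datetime import datetime

# Timestamps taken from the logs quoted above (UTC).
controller_start = datetime.strptime("2020-09-07 14:11:58", "%Y-%m-%d %H:%M:%S")
test_killed      = datetime.strptime("2020-09-07 14:27:22", "%Y-%m-%d %H:%M:%S")

elapsed = test_killed - controller_start
minutes = elapsed.total_seconds() / 60
print(f"{minutes:.1f} minutes")  # 15.4 minutes, at the low end of the
                                 # 15-20 minute Azure machine-to-node window
```

With only ~15.4 minutes between controller start and test termination, the window was barely long enough for an Azure instance to join as a node, which is consistent with one worker making it and the others not.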

Comment 2 Alberto 2020-09-09 07:36:52 UTC
*** Bug 1875774 has been marked as a duplicate of this bug. ***

Comment 4 Michael Gugino 2020-09-21 23:19:31 UTC
Moving this over to the node team to investigate poor master performance.

