1896751 – [RHV IPI] Worker nodes stuck in the Provisioning Stage if the machineset has a long name

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Cloud Compute
Sub Component:
Version:	4.6.z
Hardware:	All
OS:	Linux
Priority:	medium
Severity:	high
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Douglas Schilling Landgraf
QA Contact:	michal
Docs Contact:
URL:
Whiteboard:
Depends On:	1915122 1983690 1983695
Blocks:

Reported:	2020-11-11 13:05 UTC by Miguel Figueiredo Nunes
Modified:	2024-10-01 17:03 UTC
CC List:	15 users
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-27 22:34:10 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2021:2438	0	None	None	None	2021-07-27 22:34:30 UTC

Description Miguel Figueiredo Nunes 2020-11-11 13:05:45 UTC

Description of problem:
Worker nodes added by scaling after installation never leave the Provisioning stage, even once the VMs are available. Their CSRs have to be approved manually.

Version-Release number of selected component (if applicable):
4.4 -> 4.5 -> 4.6.z

How reproducible:
Always

Steps to Reproduce:
1. Provision a cluster
2. Scale up to add a new worker node
3. The node is provisioned, but the machine never leaves the Provisioning stage and its CSRs have to be approved manually

Actual results:
The cluster works, but the machines are in the wrong stage and the provisioning process requires manual intervention when it should not.

Expected results:
A cluster with all machines in the Running stage, and no manual intervention needed during scaling.

Additional info:
Please see comments

Comment 3 Colin Walters 2020-11-11 20:26:14 UTC

At the moment debugging this requires manual work to ssh to the nodes or try to gather instance console logs.
Can you try to gather that please?

Comment 5 Colin Walters 2020-11-12 14:52:22 UTC

See https://github.com/openshift/machine-config-operator/pull/2219/files

Comment 6 Colin Walters 2020-11-12 14:55:40 UTC

OK if the machines join after CSR approval, this is currently a Machine API bug, not a MCO bug.
The component involved is https://github.com/openshift/cluster-machine-approver

Logs from that pod may be helpful, or a full must-gather.

Comment 10 Colin Walters 2020-11-19 14:49:24 UTC

The core failure might be this from the machine-api-controller:
2020-11-05T17:06:05.224968044Z E1105 17:06:05.224830       1 actuator.go:306] failed to lookup the VM IP lookup openshift-stage-wz4zh-worker-0-xlmnc on 172.48.0.10:53: no such host - skip setting addresses for this machine

(This may be related to https://github.com/openshift/machine-config-operator/pull/2042 )

This needs more analysis from the https://github.com/openshift/cluster-api-provider-ovirt maintainers though.

If it works to just manually approve the CSRs, I'd move forward with that in the short term.  See e.g.:
https://docs.openshift.com/container-platform/4.6/machine_management/user_infra/adding-rhel-compute.html#installation-approve-csrs_adding-rhel-compute

Since the bug here is about not wanting manual action, you could add a loop which auto-approves CSRs.  This is a somewhat safe action assuming that you've firewalled off access to the Machine Config Server from external sources and are using the default SDN.  See also https://github.com/openshift/enhancements/pull/443

Basically (in e.g. a pod from openshift/cli):

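# Every 5 seconds, approve all CSRs that have no status yet (i.e. are still pending)
while sleep 5; do
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
done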
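
A narrower variant of the loop above might approve only pending CSRs that come from nodes or from the node bootstrapper, rather than every CSR without a status. The following is a minimal sketch, not something taken from this bug: the function name pending_node_csrs is hypothetical, and it assumes that the last column of `oc get csr --no-headers` is the condition and that the requestor appears as `system:node:<name>` or as the `node-bootstrapper` service account.

```shell
# Hypothetical helper (not part of oc): reads `oc get csr --no-headers` output
# on stdin and prints the names of CSRs that are still Pending and whose
# requestor looks like a node or the node-bootstrapper service account.
# Assumption: the first column is the CSR name and the last is the condition.
pending_node_csrs() {
  awk '$NF == "Pending" && ($0 ~ /system:node:/ || $0 ~ /:node-bootstrapper/) { print $1 }'
}

# Possible use against a live cluster (left commented out here):
#   oc get csr --no-headers | pending_node_csrs \
#     | xargs --no-run-if-empty oc adm certificate approve
```

Filtering on the requestor does not prove a request is legitimate; the firewall caveat in the comment above still applies.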

Comment 48 Gal Zaidman 2021-03-01 07:43:25 UTC

Moving to ON_QA because https://bugzilla.redhat.com/show_bug.cgi?id=1915122# has been moved to ON_QA

Comment 51 michal 2021-04-06 09:05:45 UTC

OCP: 4.8.0-0.nightly-2021-04-02-002210
RHV: 4.4.5.10

Steps:
1) Create a machineset with a longer name
2) Run oc get machines and verify that the machine was created and its status is Running
3) Verify in RHV that the machine was created

Results:
The machine was created and its status is 'Running'
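
The check in step 2 could be scripted roughly as follows. This is a sketch under assumptions, not the procedure QA used: the function name machines_not_running is hypothetical, and it expects "NAME PHASE" pairs on stdin, such as those produced by the custom-columns query shown in the comment.

```shell
# Hypothetical helper (not part of oc): reads "NAME PHASE" pairs on stdin and
# prints the name of every machine whose phase is anything other than Running.
# Blank lines are skipped.
machines_not_running() {
  awk 'NF > 0 && $2 != "Running" { print $1 }'
}

# Possible use against a live cluster (left commented out here); assumes the
# machines live in the openshift-machine-api namespace:
#   oc get machines -n openshift-machine-api --no-headers \
#     -o custom-columns=NAME:.metadata.name,PHASE:.status.phase \
#     | machines_not_running
```

Empty output means that every listed machine reports Running, which is the result described above.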

Comment 55 errata-xmlrpc 2021-07-27 22:34:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
