Bug 1896751

Summary:	[RHV IPI] Worker nodes stuck in the Provisioning Stage if the machineset has a long name
Product:	OpenShift Container Platform	Reporter:	Miguel Figueiredo Nunes <mnunes>
Component:	Cloud Compute	Assignee:	Douglas Schilling Landgraf <dougsland>
Cloud Compute sub component:	oVirt Provider	QA Contact:	michal <mgold>
Status:	CLOSED ERRATA	Docs Contact:
Severity:	high
Priority:	medium	CC:	dougsland, erich, eslutsky, gzaidman, hpopal, jrouth, mburman, mkalinin, ocprhvteam, openshift-bugs-escalate, rcunha, rgregory, trees, vfarias, walters
Version:	4.6.z
Target Milestone:	---
Target Release:	4.8.0
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2021-07-27 22:34:10 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:
Bug Depends On:	1915122, 1983690, 1983695
Bug Blocks:

Description Miguel Figueiredo Nunes 2020-11-11 13:05:45 UTC

Description of problem:
Worker nodes scaled up after installation stay in Provisioning even once they are available. Their CSRs have to be approved manually.

Version-Release number of selected component (if applicable):
4.4 -> 4.5 -> 4.6.z

How reproducible:
Always

Steps to Reproduce:
1. Provision a cluster
2. Scale up a new worker node
3. The node is provisioned, but the machine never leaves Provisioning and its CSRs have to be approved manually

Actual results:
The cluster works, but the machines are in the wrong stage and the provisioning process needs manual intervention when it shouldn't

Expected results:
A cluster with all machines in the Running stage and no manual intervention in the scaling process

Additional info:
Please see comments

Comment 3 Colin Walters 2020-11-11 20:26:14 UTC

At the moment debugging this requires manual work to ssh to the nodes or try to gather instance console logs.
Can you try to gather that please?

Comment 5 Colin Walters 2020-11-12 14:52:22 UTC

See https://github.com/openshift/machine-config-operator/pull/2219/files

Comment 6 Colin Walters 2020-11-12 14:55:40 UTC

OK, if the machines join after CSR approval, this is currently a Machine API bug, not an MCO bug.
The component involved is https://github.com/openshift/cluster-machine-approver

Logs from that pod may be helpful, or a full must-gather.

Comment 10 Colin Walters 2020-11-19 14:49:24 UTC

The core failure might be this from the machine-api-controller:
2020-11-05T17:06:05.224968044Z E1105 17:06:05.224830       1 actuator.go:306] failed to lookup the VM IP lookup openshift-stage-wz4zh-worker-0-xlmnc on 172.48.0.10:53: no such host - skip setting addresses for this machine
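
To see which machines hit this error, the log line above can be matched mechanically. The following is only a sketch: the `failed_lookup_hosts` helper name is hypothetical, and the deployment and container names in the usage comment are assumptions about a default Machine API deployment, not something confirmed in this bug.

```shell
# Hypothetical helper: reads controller log lines on stdin and prints each
# hostname that appears in a "failed to lookup the VM IP" error, once.
failed_lookup_hosts() {
  sed -n 's/.*failed to lookup the VM IP lookup \([^ ]*\) on .*/\1/p' | sort -u
}

# Usage (assumed deployment/container names, adjust to your cluster):
#   oc logs -n openshift-machine-api deployment/machine-api-controllers \
#     -c machine-controller | failed_lookup_hosts
```

Each printed name can then be checked against the DNS server named in the error to see whether the record is missing.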

(This may be related to https://github.com/openshift/machine-config-operator/pull/2042 )

This needs more analysis from the https://github.com/openshift/cluster-api-provider-ovirt maintainers though.

If it works to just manually approve the CSRs, I'd move forward with that in the short term.  See e.g.:
https://docs.openshift.com/container-platform/4.6/machine_management/user_infra/adding-rhel-compute.html#installation-approve-csrs_adding-rhel-compute

Since the bug here is about not wanting manual action, you could add a loop which auto-approves CSRs.  This is a somewhat safe action assuming that you've firewalled off access to the Machine Config Server from external sources and are using the default SDN.  See also https://github.com/openshift/enhancements/pull/443

Basically (in e.g. a pod from openshift/cli):

while sleep 5; do
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
done
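
A somewhat narrower variant of the loop above is sketched here. It approves a pending CSR only when the requestor looks like a node or the node-bootstrapper service account. The function names are hypothetical, and the requestor strings are an assumption about a default cluster; compare them with the REQUESTOR column of `oc get csr` before relying on this.

```shell
# Hypothetical helper: reads "name requestor" pairs on stdin and prints the
# names whose requestor is a node or the node-bootstrapper service account
# (assumed requestor strings, verify against `oc get csr`).
filter_node_csrs() {
  awk '$2 ~ /^system:node:/ ||
       $2 == "system:serviceaccount:openshift-machine-config-operator:node-bootstrapper" { print $1 }'
}

# Prints "name requestor" for every CSR that has no status yet (pending).
pending_csrs() {
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}} {{.spec.username}}{{"\n"}}{{end}}{{end}}'
}

# Polls every 5 seconds and approves only the filtered CSRs.
approve_loop() {
  while sleep 5; do
    pending_csrs | filter_node_csrs | xargs --no-run-if-empty oc adm certificate approve
  done
}
```

Calling `approve_loop` starts the polling. The same caveats about firewalling the Machine Config Server still apply, since the filter only checks who requested the certificate, not whether the request is legitimate.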

Comment 48 Gal Zaidman 2021-03-01 07:43:25 UTC

Moving to ON_QA because https://bugzilla.redhat.com/show_bug.cgi?id=1915122# has been moved to ON_QA

Comment 51 michal 2021-04-06 09:05:45 UTC

OCP - 4.8.0-0.nightly-2021-04-02-002210
RHV - 4.4.5.10

steps:
1) create a MachineSet with a longer name
2) run oc get machines - verify that the machine was created and its status is Running
3) verify in RHV that the machine was created


results:
the machine was created and its status is 'Running'
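
The check in step 2 could be scripted roughly as follows. This is a sketch: the helper names are hypothetical, and it assumes the machines live in the default openshift-machine-api namespace.

```shell
# Hypothetical helper: reads "name phase" pairs on stdin and prints the names
# of machines whose phase is anything other than Running.
not_running() {
  awk '$2 != "Running" { print $1 }'
}

# Prints "name phase" for every Machine (assumed namespace).
machine_phases() {
  oc get machines -n openshift-machine-api \
    -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.phase}{"\n"}{end}'
}

# Usage: an empty result means every machine reports Running.
#   machine_phases | not_running
```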

Comment 55 errata-xmlrpc 2021-07-27 22:34:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438