Bug 1896751

Summary: [RHV IPI] Worker nodes stuck in the Provisioning Stage if the machineset has a long name
Product: OpenShift Container Platform
Reporter: Miguel Figueiredo Nunes <mnunes>
Component: Cloud Compute
Assignee: Douglas Schilling Landgraf <dougsland>
Cloud Compute sub component: oVirt Provider
QA Contact: michal <mgold>
Status: CLOSED ERRATA
Severity: high
Priority: medium
CC: dougsland, erich, eslutsky, gzaidman, hpopal, jrouth, mburman, mkalinin, ocprhvteam, openshift-bugs-escalate, rcunha, rgregory, trees, vfarias, walters
Version: 4.6.z   
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: All   
OS: Linux   
Last Closed: 2021-07-27 22:34:10 UTC
Type: Bug
Bug Depends On: 1915122, 1983690, 1983695    
Bug Blocks:    

Description Miguel Figueiredo Nunes 2020-11-11 13:05:45 UTC
Description of problem:
Worker nodes added by scaling after installation never move out of Provisioning, even when they are available. Their CSRs need to be approved manually.

Version-Release number of selected component (if applicable):
4.4 -> 4.5 -> 4.6.z

How reproducible:
Always

Steps to Reproduce:
1. Provision a cluster
2. Scale up a new node
3. The node is provisioned but never moves out of Provisioning, and its CSRs have to be approved manually

Actual results:
The cluster works, but the machines are not in the correct stage, and the provisioning process needs manual intervention when it shouldn't.

Expected results:
A cluster with all machines in the Running stage, with no manual intervention needed in the scaling process.

Additional info:
Please see comments

Comment 3 Colin Walters 2020-11-11 20:26:14 UTC
At the moment debugging this requires manual work to ssh to the nodes or try to gather instance console logs.
Can you try to gather that please?

Comment 6 Colin Walters 2020-11-12 14:55:40 UTC
OK, if the machines join after CSR approval, this is currently a Machine API bug, not an MCO bug.
The component involved is https://github.com/openshift/cluster-machine-approver

Logs from that pod may be helpful, or a full must-gather.

Comment 10 Colin Walters 2020-11-19 14:49:24 UTC
The core failure might be this from the machine-api-controller:
2020-11-05T17:06:05.224968044Z E1105 17:06:05.224830       1 actuator.go:306] failed to lookup the VM IP lookup openshift-stage-wz4zh-worker-0-xlmnc on 172.48.0.10:53: no such host - skip setting addresses for this machine

(This may be related to https://github.com/openshift/machine-config-operator/pull/2042 )

This needs more analysis from the https://github.com/openshift/cluster-api-provider-ovirt maintainers though.

If it works to just manually approve the CSRs, I'd move forward with that in the short term.  See e.g.:
https://docs.openshift.com/container-platform/4.6/machine_management/user_infra/adding-rhel-compute.html#installation-approve-csrs_adding-rhel-compute

Since the bug here is about not wanting manual action, you could add a loop which auto-approves CSRs.  This is a somewhat safe action assuming that you've firewalled off access to the Machine Config Server from external sources and are using the default SDN.  See also https://github.com/openshift/enhancements/pull/443

Basically (in e.g. a pod from openshift/cli):

# Every 5 seconds, approve all CSRs that have no status yet (i.e. are still pending).
while sleep 5; do
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
done
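
A slightly more defensive variant of the loop above, as a sketch only. It assumes GNU xargs (for -r, which skips the approve call when no CSRs are pending). OC and INTERVAL are hypothetical variables introduced here for illustration, so that the oc binary and the polling interval can be swapped out; they are not oc features.

```shell
# Sketch only: auto-approve pending CSRs.
# OC and INTERVAL are hypothetical knobs for illustration, not part of oc.
OC="${OC:-oc}"

# Print the names of CSRs that have no status yet (still pending).
pending_csrs() {
  "$OC" get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}'
}

# Approve every CSR name read from stdin.
# -r (GNU xargs) means nothing is run when stdin is empty.
approve_from_stdin() {
  xargs -r "$OC" adm certificate approve
}

# Poll forever, like the loop above.
approve_loop() {
  while sleep "${INTERVAL:-5}"; do
    pending_csrs | approve_from_stdin
  done
}
```

Calling approve_loop (for example in a pod from openshift/cli) would start the polling. The same caveats about firewalling the Machine Config Server apply.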

Comment 48 Gal Zaidman 2021-03-01 07:43:25 UTC
Moving to ON_QA because https://bugzilla.redhat.com/show_bug.cgi?id=1915122# has been moved to ON_QA

Comment 51 michal 2021-04-06 09:05:45 UTC
OCP - 4.8.0-0.nightly-2021-04-02-002210
RHV - 4.4.5.10

steps:
1) Create a MachineSet with a longer name
2) Run oc get machines and verify that the machine was created and its status is Running
3) Verify in RHV that the machine was created


results:
The machine was created and its status is 'Running'.
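
The check in step 2 could be scripted roughly as follows. This is a sketch under assumptions, not the procedure QA actually ran: it assumes the Machines live in the default openshift-machine-api namespace and that the stage shown by oc get machines corresponds to the Machine's .status.phase field. OC is a hypothetical override variable.

```shell
# Sketch only: report Machines that are not in the Running phase.
# Assumes the default openshift-machine-api namespace; OC is a hypothetical override.
OC="${OC:-oc}"

# Print "name<TAB>phase" for every Machine.
machine_phases() {
  "$OC" get machines -n openshift-machine-api \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
}

# Read "name<TAB>phase" lines on stdin, print those whose phase is not Running,
# and return non-zero if any such line was seen.
not_running() {
  awk -F '\t' '$2 != "Running" { print; bad = 1 } END { exit bad + 0 }'
}
```

Running machine_phases | not_running prints nothing and returns 0 when every machine is Running.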

Comment 55 errata-xmlrpc 2021-07-27 22:34:10 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438