Bug 1896751 - [RHV IPI] Worker nodes stuck in the Provisioning Stage if the machineset has a long name
Summary: [RHV IPI] Worker nodes stuck in the Provisioning Stage if the machineset has ...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.6.z
Hardware: All
OS: Linux
medium
high
Target Milestone: ---
: 4.8.0
Assignee: Douglas Schilling Landgraf
QA Contact: michal
URL:
Whiteboard:
Depends On: 1915122 1983690 1983695
Blocks:
TreeView+ depends on / blocked
 
Reported: 2020-11-11 13:05 UTC by Miguel Figueiredo Nunes
Modified: 2021-07-27 22:34 UTC (History)
15 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-07-27 22:34:10 UTC
Target Upstream Version:


Attachments (Terms of Use)


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2021:2438 0 None None None 2021-07-27 22:34:30 UTC

Description Miguel Figueiredo Nunes 2020-11-11 13:05:45 UTC
Description of problem:
Worker nodes scaled after installation don't move from Provisioning even when available. CSR's need to be approved manually.

Version-Release number of selected component (if applicable):
4.4 -> 4.5 -> 4.6.z

How reproducible:
Always

Steps to Reproduce:
1. Provision a Cluster
2. Scale a new node
3. The node is provisioned but never moves from Provisioning and needs to have the CSR's approved manually

Actual results:
Cluster working but machines in a not correct stage and the provisioning process needs manual intervention, when it shouldn't

Expected results:
Cluster with all machines in Running stage if everything worked as planned and no manual intervention in the scaling process

Additional info:
Please see comments

Comment 3 Colin Walters 2020-11-11 20:26:14 UTC
At the moment debugging this requires manual work to ssh to the nodes or try to gather instance console logs.
Can you try to gather that please?

Comment 6 Colin Walters 2020-11-12 14:55:40 UTC
OK if the machines join after CSR approval, this is currently a Machine API bug, not a MCO bug.
The component involved is https://github.com/openshift/cluster-machine-approver

Logs from that pod may be helpful, or a full must-gather.

Comment 10 Colin Walters 2020-11-19 14:49:24 UTC
The core failure might be this from the machine-api-controller:
2020-11-05T17:06:05.224968044Z E1105 17:06:05.224830       1 actuator.go:306] failed to lookup the VM IP lookup openshift-stage-wz4zh-worker-0-xlmnc on 172.48.0.10:53: no such host - skip setting addresses for this machine

(This may be related to https://github.com/openshift/machine-config-operator/pull/2042 )

This needs more analysis from the https://github.com/openshift/cluster-api-provider-ovirt maintainers though.

If it works to just manually approve the CSRs, I'd move forward with that in the short term.  See e.g.:
https://docs.openshift.com/container-platform/4.6/machine_management/user_infra/adding-rhel-compute.html#installation-approve-csrs_adding-rhel-compute

Since the bug here is about not wanting manual action, you could add a loop which auto-approves CSRs.  This is a somewhat safe action assuming that you've firewalled off access to the Machine Config Server from external sources and are using the default SDN.  See also https://github.com/openshift/enhancements/pull/443

Basically (in e.g. a pod from openshift/cli):

while sleep 5; do 
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approved
done

Comment 48 Gal Zaidman 2021-03-01 07:43:25 UTC
Moving to ON_QA because https://bugzilla.redhat.com/show_bug.cgi?id=1915122# has been moved to ON_QA

Comment 51 michal 2021-04-06 09:05:45 UTC
OCP-  4.8.0-0.nightly-2021-04-02-002210 
RHV - 4.4.5.10

steps:
1) create Machineset with a longer name
2) run oc get machines - verify that machine was created and the status is Running
3) verify in RHV that machine created


results:
machine was created and here status is 'running'

Comment 55 errata-xmlrpc 2021-07-27 22:34:10 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438


Note You need to log in before you can comment on or make changes to this bug.