Description of problem:
Worker nodes scaled up after installation don't move out of Provisioning even when they are available. CSRs need to be approved manually.

Version-Release number of selected component (if applicable):
4.4 -> 4.5 -> 4.6.z

How reproducible:
Always

Steps to Reproduce:
1. Provision a cluster
2. Scale up a new node
3. The node is provisioned but never moves out of Provisioning; its CSRs must be approved manually

Actual results:
The cluster works, but the machines are in an incorrect stage and the provisioning process needs manual intervention when it shouldn't.

Expected results:
A cluster with all machines in the Running stage, with no manual intervention needed in the scaling process.

Additional info:
Please see comments
At the moment, debugging this requires manual work: SSHing to the nodes or trying to gather instance console logs. Can you try to gather those, please?
See https://github.com/openshift/machine-config-operator/pull/2219/files
OK, if the machines join after CSR approval, this is currently a Machine API bug, not an MCO bug. The component involved is https://github.com/openshift/cluster-machine-approver. Logs from that pod may be helpful, or a full must-gather.
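For reference, the approver logs and a must-gather could be collected along these lines. The namespace and deployment name below match recent 4.x releases, but treat them as assumptions and adjust for your version:

```shell
# Logs from the cluster-machine-approver pod
# (namespace/deployment name assumed; verify with `oc get pods -A | grep approver`).
oc -n openshift-cluster-machine-approver logs deployment/machine-approver --all-containers

# Full must-gather to attach to the bug.
oc adm must-gather
```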
The core failure might be this from the machine-api-controller:

2020-11-05T17:06:05.224968044Z E1105 17:06:05.224830 1 actuator.go:306] failed to lookup the VM IP lookup openshift-stage-wz4zh-worker-0-xlmnc on 172.48.0.10:53: no such host - skip setting addresses for this machine

(This may be related to https://github.com/openshift/machine-config-operator/pull/2042 )

This needs more analysis from the https://github.com/openshift/cluster-api-provider-ovirt maintainers, though. If manually approving the CSRs works, I'd move forward with that in the short term. See e.g.:
https://docs.openshift.com/container-platform/4.6/machine_management/user_infra/adding-rhel-compute.html#installation-approve-csrs_adding-rhel-compute

Since the bug here is about not wanting manual action, you could add a loop which auto-approves CSRs. This is a somewhat safe action assuming that you've firewalled off access to the Machine Config Server from external sources and are using the default SDN. See also https://github.com/openshift/enhancements/pull/443

Basically (in e.g. a pod from openshift/cli):

while sleep 5; do
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' \
    | xargs --no-run-if-empty oc adm certificate approve
done

(Note the subcommand is `approve`; `--no-run-if-empty` keeps xargs from invoking `oc` when no CSRs are pending.)
Moving to ON_QA because https://bugzilla.redhat.com/show_bug.cgi?id=1915122 has been moved to ON_QA.
OCP: 4.8.0-0.nightly-2021-04-02-002210
RHV: 4.4.5.10

Steps:
1) Create a MachineSet with a longer name
2) Run `oc get machines` and verify that the machine was created and its status is Running
3) Verify in RHV that the machine was created

Results:
The machine was created and its status is 'Running'.
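The verification steps above can be sketched roughly as follows. The MachineSet name is a placeholder, not from this bug; substitute one listed on your cluster:

```shell
# List existing worker MachineSets (namespace is the standard machine-api one).
oc -n openshift-machine-api get machinesets

# Scale one up by a replica; <machineset-name> is a placeholder.
oc -n openshift-machine-api scale machineset <machineset-name> --replicas=2

# Watch the new machine; with the fix it should reach the Running phase
# without any manual `oc adm certificate approve`.
oc -n openshift-machine-api get machines -w
```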
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438