Bug 1896751
| Summary: | [RHV IPI] Worker nodes stuck in the Provisioning Stage if the machineset has a long name | | |
|---|---|---|---|
| Product: | OpenShift Container Platform | Reporter: | Miguel Figueiredo Nunes <mnunes> |
| Component: | Cloud Compute | Assignee: | Douglas Schilling Landgraf <dougsland> |
| Cloud Compute sub component: | oVirt Provider | QA Contact: | michal <mgold> |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | medium | CC: | dougsland, erich, eslutsky, gzaidman, hpopal, jrouth, mburman, mkalinin, ocprhvteam, openshift-bugs-escalate, rcunha, rgregory, trees, vfarias, walters |
| Version: | 4.6.z | | |
| Target Milestone: | --- | | |
| Target Release: | 4.8.0 | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2021-07-27 22:34:10 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1915122, 1983690, 1983695 | | |
| Bug Blocks: | | | |
Description
Miguel Figueiredo Nunes
2020-11-11 13:05:45 UTC
At the moment, debugging this requires manual work: SSH to the nodes or try to gather the instance console logs. Can you try to gather that, please?

OK, if the machines join after CSR approval, this is currently a Machine API bug, not an MCO bug. The component involved is cluster-machine-approver (https://github.com/openshift/cluster-machine-approver); logs from that pod, or a full must-gather, may be helpful.

The core failure might be this from the machine-api-controller:

    2020-11-05T17:06:05.224968044Z E1105 17:06:05.224830 1 actuator.go:306] failed to lookup the VM IP lookup openshift-stage-wz4zh-worker-0-xlmnc on 172.48.0.10:53: no such host - skip setting addresses for this machine

(This may be related to https://github.com/openshift/machine-config-operator/pull/2042 )

This needs more analysis from the https://github.com/openshift/cluster-api-provider-ovirt maintainers, though.

If manually approving the CSRs works, I'd move forward with that in the short term. See e.g.: https://docs.openshift.com/container-platform/4.6/machine_management/user_infra/adding-rhel-compute.html#installation-approve-csrs_adding-rhel-compute

Since the bug here is about not wanting manual action, you could add a loop which auto-approves CSRs. This is a somewhat safe action, assuming that you've firewalled off access to the Machine Config Server from external sources and are using the default SDN. See also https://github.com/openshift/enhancements/pull/443

Basically (in e.g. a pod from openshift/cli):

    while sleep 5; do oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve; done

Moving to ON_QA because https://bugzilla.redhat.com/show_bug.cgi?id=1915122# has been moved to ON_QA.

Verified on:
OCP: 4.8.0-0.nightly-2021-04-02-002210
RHV: 4.4.5.10

Steps:
1) Create a MachineSet with a longer name.
2) Run oc get machines and verify that the machine was created and its status is Running.
3) Verify in RHV that the machine was created.

Results: the machine was created and its status is 'Running'.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
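For anyone using the CSR auto-approve workaround from the thread above, here is a minimal sketch of the same loop split into functions. It uses the go-template from the comment unchanged, but approves names one at a time so that `oc adm certificate approve` is never invoked with an empty argument list when nothing is pending. The function names and the `RUN_LOOP` / `INTERVAL` variables are hypothetical, introduced only for illustration; the 5-second default mirrors the `sleep 5` in the original one-liner. This is a sketch under those assumptions, not a supported procedure.

```shell
#!/bin/sh
# Sketch only: function names, RUN_LOOP and INTERVAL are illustrative
# assumptions, not part of any OpenShift tooling.

# Print the names of CSRs that have no status yet (still pending).
# The go-template is the one quoted in the bug comment.
pending_csrs() {
  oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}'
}

# Approve each pending CSR by name; blank lines are skipped, so nothing
# is run when there are no pending CSRs.
approve_pending_csrs() {
  pending_csrs | while IFS= read -r name; do
    [ -n "$name" ] || continue
    oc adm certificate approve "$name"
  done
}

# Poll only when explicitly asked to, so the file can be sourced
# without blocking.
if [ "${RUN_LOOP:-0}" = "1" ]; then
  while sleep "${INTERVAL:-5}"; do
    approve_pending_csrs
  done
fi
```

Running it with `RUN_LOOP=1 sh approve-csrs.sh` (the file name is arbitrary) polls indefinitely. The same caveat from the thread applies: auto-approval is only reasonably safe if access to the Machine Config Server is firewalled off from external sources.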