Bug 2051533
Summary: | Adding day2 remote worker node requires manually approving CSRs | ||
---|---|---|---|
Product: | Red Hat Advanced Cluster Management for Kubernetes | Reporter: | nshidlin <nshidlin> |
Component: | Infrastructure Operator | Assignee: | Ori Amizur <oamizur> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Trey West <trwest> |
Severity: | urgent | Docs Contact: | Derek <dcadzow> |
Priority: | unspecified | ||
Version: | rhacm-2.6 | CC: | ccrum, fpercoco, mfilanov, oamizur, otuchfel, smiron, trwest, yfirst, yobshans, yuhe |
Target Milestone: | --- | Keywords: | Regression, Reopened, TestBlocker |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-10-03 20:20:49 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | |
Description
nshidlin 2022-02-07 12:48:46 UTC
We have been debugging this, and it's mostly related to the DHCP-provided hostname (the full FQDN) being used as the node's hostname. It only occurs when a custom hostname is specified in the BMH. The hostname is properly set by the assisted agent, but when the node is rebooted, the full FQDN provided by DHCP is used instead. This causes the CSR approval for day 2 to fail, because the machine-approver won't find a node (which uses the full FQDN) matching the hostname found in the Machine CR. When no custom hostname is requested, the full FQDN is used for everything and things work as expected.

Arguably, the node should likely not use the full FQDN as the node name, but this is out of our realm. It may be worth bringing this up to the OCP team.

(In reply to Flavio Percoco from comment #1)

In fact, I see now that the hostname annotation set by the user is only honored in the BMH.
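The failing comparison can be sketched in shell. The hostnames below are taken from the outputs later in this report; the comparison logic is an illustrative approximation of why auto-approval fails, not the machine-approver's actual code:

```shell
# Illustrative sketch (not machine-approver code): with a custom BMH hostname,
# the rebooted node registers under the DHCP-provided FQDN, so neither the
# full node name nor its short form matches the hostname in the Machine CR,
# and the day-2 CSR is left Pending instead of being auto-approved.
node_name="spoke-worker-0-0.spoke-0.qe.lab.redhat.com"  # name the kubelet registers with
machine_hostname="foobar-worker-0-0"                    # custom hostname requested via the BMH
short_name="${node_name%%.*}"                           # FQDN with the DHCP domain stripped

if [ "$machine_hostname" != "$node_name" ] && [ "$machine_hostname" != "$short_name" ]; then
  echo "mismatch: CSR will not be auto-approved"
fi
```

When no custom hostname is requested, `machine_hostname` equals the FQDN and the comparison succeeds, which matches the observed behavior.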
Looking at a day 1 install (no remote worker nodes), with the hostname annotation set to differ from the short name:

```
$ oc get agent -o wide
NAME                                   CLUSTER   APPROVED   ROLE     STAGE   HOSTNAME                                     REQUESTED HOSTNAME
1aaff5fc-9e04-4ad8-b8ae-681a5fde5fda   spoke-0   true       master   Done    spoke-master-0-2.spoke-0.qe.lab.redhat.com   foobar-master-0-2
222c3773-9ab4-4cd1-a377-62acafd98441   spoke-0   true       worker   Done    spoke-worker-0-0.spoke-0.qe.lab.redhat.com   foobar-worker-0-0
5a013ba6-6939-43ff-9b3f-882cb717b796   spoke-0   true       worker   Done    spoke-worker-0-1.spoke-0.qe.lab.redhat.com   foobar-worker-0-1
d32ad241-6247-47a4-a81b-8c546708ec0a   spoke-0   true       master   Done    spoke-master-0-0.spoke-0.qe.lab.redhat.com   foo-master-0-0
fb944e70-121b-4363-933f-f428f4281306   spoke-0   true       master   Done    spoke-master-0-1.spoke-0.qe.lab.redhat.com   foobar-master-0-1
```

On the installed spoke cluster, the node name does not honor the annotation:

```
$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
spoke-master-0-0.spoke-0.qe.lab.redhat.com   Ready    master   45m   v1.23.3+b63be7f
spoke-master-0-1.spoke-0.qe.lab.redhat.com   Ready    master   45m   v1.23.3+b63be7f
spoke-master-0-2.spoke-0.qe.lab.redhat.com   Ready    master   21m   v1.23.3+b63be7f
spoke-worker-0-0.spoke-0.qe.lab.redhat.com   Ready    worker   23m   v1.23.3+b63be7f
spoke-worker-0-1.spoke-0.qe.lab.redhat.com   Ready    worker   23m   v1.23.3+b63be7f
```

The BMH, however, does:

```
$ oc get bmh -n openshift-machine-api
NAME                STATE       CONSUMER                       ONLINE   ERROR   AGE
foo-master-0-0      unmanaged   spoke-0-kvhm2-master-0         true             39m
foobar-master-0-1   unmanaged   spoke-0-kvhm2-master-1         true             39m
foobar-master-0-2   unmanaged   spoke-0-kvhm2-master-2         true             39m
foobar-worker-0-0   unmanaged   spoke-0-kvhm2-worker-0-6msh7   true             39m
foobar-worker-0-1   unmanaged   spoke-0-kvhm2-worker-0-zw92r   true             39m

$ oc get machine -n openshift-machine-api -o wide
NAME                           PHASE     TYPE   REGION   ZONE   AGE   NODE                                         PROVIDERID                                                                                      STATE
spoke-0-kvhm2-master-0         Running                          39m   spoke-master-0-0.spoke-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/foo-master-0-0/72587cc3-a450-40e5-963c-d7492c666b84      unmanaged
spoke-0-kvhm2-master-1         Running                          39m   spoke-master-0-1.spoke-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/foobar-master-0-1/3c386561-dd89-4444-84e0-3a74f46cce6e   unmanaged
spoke-0-kvhm2-master-2         Running                          39m   spoke-master-0-2.spoke-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/foobar-master-0-2/8bd2964c-a9f0-4988-ad2e-778d12abcfad   unmanaged
spoke-0-kvhm2-worker-0-6msh7   Running                          25m   spoke-worker-0-0.spoke-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/foobar-worker-0-0/09b00b28-288a-4f6f-a1d6-4428686c2978   unmanaged
spoke-0-kvhm2-worker-0-zw92r   Running                          25m   spoke-worker-0-1.spoke-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/foobar-worker-0-1/84bfe4f8-b811-4f95-996d-026b34baff54   unmanaged
```

SSHing into the nodes also shows the hostname annotation is not honored:

```
[core@spoke-worker-0-0 ~]$ hostname
spoke-worker-0-0.spoke-0.qe.lab.redhat.com
```

This issue continues to reproduce on OCP 4.10.20 / MCE 2.1.0-DOWNANDBACK-2022-06-28-15-44-48. On the spoke cluster:

```
$ oc get bmh -A
NAMESPACE               NAME                STATE                    CONSUMER                       ONLINE   ERROR                            AGE
openshift-machine-api   spoke-master-0-0    unmanaged                spoke-0-cp2wt-master-0         true                                      23h
openshift-machine-api   spoke-master-0-1    unmanaged                spoke-0-cp2wt-master-1         true                                      23h
openshift-machine-api   spoke-master-0-2    unmanaged                spoke-0-cp2wt-master-2         true                                      23h
openshift-machine-api   spoke-rwn-0-0-bmh   externally provisioned   spoke-0-spoke-rwn-0-0-bmh      true     provisioned registration error   22h
openshift-machine-api   spoke-worker-0-0    unmanaged                spoke-0-cp2wt-worker-0-57stz   true                                      23h
openshift-machine-api   spoke-worker-0-1    unmanaged                spoke-0-cp2wt-worker-0-gw7kz   true                                      23h
```

Before manually approving CSRs:

```
$ oc get machine -n openshift-machine-api
NAME                           PHASE         TYPE   REGION   ZONE   AGE
spoke-0-cp2wt-master-0         Running                              21h
spoke-0-cp2wt-master-1         Running                              21h
spoke-0-cp2wt-master-2         Running                              21h
spoke-0-cp2wt-worker-0-57stz   Running                              21h
spoke-0-cp2wt-worker-0-gw7kz   Running                              21h
spoke-0-spoke-rwn-0-0-bmh      Provisioned                          20h

$ oc get nodes -A
NAME                                         STATUS   ROLES    AGE   VERSION
spoke-master-0-0.spoke-0.qe.lab.redhat.com   Ready    master   21h   v1.23.5+3afdacb
spoke-master-0-1.spoke-0.qe.lab.redhat.com   Ready    master   21h   v1.23.5+3afdacb
spoke-master-0-2.spoke-0.qe.lab.redhat.com   Ready    master   20h   v1.23.5+3afdacb
spoke-worker-0-0.spoke-0.qe.lab.redhat.com   Ready    worker   20h   v1.23.5+3afdacb
spoke-worker-0-1.spoke-0.qe.lab.redhat.com   Ready    worker   20h   v1.23.5+3afdacb
```

No node was created for the remote worker node (RWN). After the CSRs were manually approved, the relevant node was created and the Machine resource transitioned to the "Running" phase.

*** This bug has been marked as a duplicate of bug 2087213 ***

I tested this PR (https://github.com/openshift/machine-config-operator/pull/3285), which includes the changes needed for IPv6 DHCP to work on 4.11 (https://github.com/openshift/machine-config-operator/pull/3282) and the changes implemented by Ori (https://github.com/openshift/machine-config-operator/pull/3276). This seems to have solved the issue of CSRs not being auto-approved. @mfilanov Can this move to verified? Thanks!

We cannot verify this until the fix is merged into MCO and present in a 4.11 z-stream. I just did early testing to ensure that the attached PR fixes the issue, but not with an official build.

Agree with Trey. May we know exactly when the OCP release with the PR merged will be available? Does it require a new AI build on the ACM side? Thanks!

PR https://github.com/openshift/machine-config-operator/pull/3276 merged to master.

Hey @oamizur, we need to cherry-pick the PR into the release-4.11 branch in order to get it into 4.11 builds. So far, the fix will only show up in 4.12.

The backport PR for 4.11 is https://github.com/openshift/machine-config-operator/pull/3305.

The backport was merged.

This has been verified by QE.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
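The manual-approval workaround referenced above can be sketched as a small helper. This is an illustrative sketch, assuming a logged-in `oc` session on the spoke cluster and `jq` on the PATH; `pending_csrs` is a helper name invented here, not part of any tooling mentioned in this bug:

```shell
# Print the names of CSRs that have no status conditions yet, i.e. are still
# Pending. Reads `oc get csr -o json` output (a CertificateSigningRequestList)
# on stdin.
pending_csrs() {
  jq -r '.items[] | select((.status.conditions // []) | length == 0) | .metadata.name'
}

# Against a live cluster (requires an oc login, so commented out here):
#   oc get csr -o json | pending_csrs | xargs -r oc adm certificate approve
```

`oc adm certificate approve` is the standard command for approving kubelet CSRs; once the linked MCO fix landed, this manual step was no longer needed.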