Description of problem:

When adding a remote worker node using ZTP, the agent finishes the installation and is marked as done.

oc get agent -o wide
NAME                                   CLUSTER   APPROVED   ROLE     STAGE   HOSTNAME                                      REQUESTED HOSTNAME
0277804e-2a7c-4d95-9d0f-e22a190d582a   spoke-0   true       worker   Done    spoke-worker-0-0.spoke-0.qe.lab.redhat.com    spoke-worker-0-0
12efa520-5b99-4474-805d-931e46ad43f7   spoke-0   true       master   Done    spoke-master-0-2.spoke-0.qe.lab.redhat.com    spoke-master-0-2
3b8eec89-f26f-4896-8f71-8a810894c560   spoke-0   true       master   Done    spoke-master-0-0.spoke-0.qe.lab.redhat.com    spoke-master-0-0
3fb3749e-c132-4258-ad1a-08a0445c9022   spoke-0   true       worker   Done    spoke-worker-0-1.spoke-0.qe.lab.redhat.com    spoke-worker-0-1
728559e9-5543-41d9-adb0-e58196f765af   spoke-0   true       master   Done    spoke-master-0-1.spoke-0.qe.lab.redhat.com    spoke-master-0-1
982e1ff6-6e83-4800-b061-8cdfd0b844fb   spoke-0   true       worker   Done    spoke-rwn-0-1.spoke-rwn-0.qe.lab.redhat.com   spoke-rwn-0-1
a76eaa6a-b351-429f-bfa1-e53a70503573   spoke-0   true       worker   Done    spoke-rwn-0-0.spoke-rwn-0.qe.lab.redhat.com   spoke-rwn-0-0

Logging into the spoke cluster, the bmh and machine resources are created and the node resource is not:

oc get bmh -n openshift-machine-api
NAME                STATE                    CONSUMER                       ONLINE   ERROR                            AGE
spoke-master-0-0    unmanaged                spoke-0-pxbfh-master-0         true                                      3h32m
spoke-master-0-1    unmanaged                spoke-0-pxbfh-master-1         true                                      3h32m
spoke-master-0-2    unmanaged                spoke-0-pxbfh-master-2         true                                      3h32m
spoke-rwn-0-0-bmh   externally provisioned   spoke-0-spoke-rwn-0-0-bmh      true     provisioned registration error   168m
spoke-rwn-0-1-bmh   externally provisioned   spoke-0-spoke-rwn-0-1-bmh      true     provisioned registration error   168m
spoke-worker-0-0    unmanaged                spoke-0-pxbfh-worker-0-65mrb   true                                      3h32m
spoke-worker-0-1    unmanaged                spoke-0-pxbfh-worker-0-nnmcq   true                                      3h32m

oc get machine -n openshift-machine-api
NAME                           PHASE         TYPE   REGION   ZONE   AGE
spoke-0-pxbfh-master-0         Running                              3h33m
spoke-0-pxbfh-master-1         Running                              3h33m
spoke-0-pxbfh-master-2         Running                              3h33m
spoke-0-pxbfh-worker-0-65mrb   Running                              3h19m
spoke-0-pxbfh-worker-0-nnmcq   Running                              3h20m
spoke-0-spoke-rwn-0-0-bmh      Provisioned                          169m
spoke-0-spoke-rwn-0-1-bmh      Provisioned                          169m

Note: bmh is in error state:

Normal   ProvisionedRegistrationError   30m   metal3-baremetal-controller   Host adoption failed: Error while attempting to adopt node 529b3e75-5d04-4486-9296-269081d0ec02: Error validating Redfish virtual media. Some parameters were missing in node's driver_info. Missing are: ['deploy_kernel', 'deploy_ramdisk'].

oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
spoke-master-0-0.spoke-0.qe.lab.redhat.com   Ready    master   72m   v1.22.3+2cb6068
spoke-master-0-1.spoke-0.qe.lab.redhat.com   Ready    master   50m   v1.22.3+2cb6068
spoke-master-0-2.spoke-0.qe.lab.redhat.com   Ready    master   72m   v1.22.3+2cb6068
spoke-worker-0-0.spoke-0.qe.lab.redhat.com   Ready    worker   51m   v1.22.3+2cb6068
spoke-worker-0-1.spoke-0.qe.lab.redhat.com   Ready    worker   51m   v1.22.3+2cb6068

The node-bootstrapper CSR is created but not auto-approved; periodically another node-bootstrapper CSR is created until it is manually approved:

oc get csr | grep Pending
csr-5ll2g   9m9s    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
csr-f8vbl   8m24s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending

Version-Release number of selected component (if applicable):
assisted-service master at revision af0bafb3f7f629932f8c3dc31ccddedfe6984926
ocp version: 4.10.0-rc.1

How reproducible:
Every time

Steps to Reproduce:
1. Install remote worker node using ztp
2. Wait for node resource to be created

Actual results:
The node-bootstrapper and node CSRs are not auto-approved and the node resource is not created. The bmh resource remains in registration error.

Expected results:
The node-bootstrapper and node CSRs should be auto-approved and the node resource created. The bmh resource should not be in registration error.

Additional info:
We have been debugging this and it's mostly related to DHCP's provided hostname (full FQDN) used as the node's hostname. It only occurs when a custom hostname is specified in the BMH.

The hostname is properly set by assisted agent but then the full FQDN is used when the node is rebooted as provided by DHCP. This causes a failure in the CSR approval for day2 because the machine-approver won't find a node (which uses the full FQDN) that matches the hostname found in the Machine CR. When no custom hostname is requested, then full FQDN is used for everything and things work as expected.

Arguably, we could say that the node should likely not use the full FQDN as the node name but this is out of our realm. May be worth bringing this up to the OCP team.
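To illustrate the mismatch described in this comment, here is a minimal sketch in Python. It is a simplified model, not the actual machine-approver code: the `Machine` class, its `hostnames` field, and the exact-match comparison are assumptions made for illustration. Only the host names are taken from the report.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Machine:
    """Hypothetical stand-in for a Machine CR and the hostnames recorded on it."""
    name: str
    hostnames: List[str]


def find_machine_for_node(node_name: str, machines: List[Machine]) -> Optional[Machine]:
    """Return the Machine whose recorded hostnames contain node_name, else None.

    Assumption: the lookup is an exact string comparison.
    """
    for machine in machines:
        if node_name in machine.hostnames:
            return machine
    return None


# The Machine for the remote worker carries the requested (short) hostname.
machines = [Machine("spoke-0-spoke-rwn-0-0-bmh", ["spoke-rwn-0-0"])]

# After reboot the node uses the full FQDN from DHCP: no Machine matches.
fqdn_match = find_machine_for_node(
    "spoke-rwn-0-0.spoke-rwn-0.qe.lab.redhat.com", machines
)

# If the node kept the requested short hostname, the Machine would match.
short_match = find_machine_for_node("spoke-rwn-0-0", machines)

print(fqdn_match)
print(short_match.name if short_match else None)
```

In this model the FQDN lookup finds nothing, which corresponds to the CSR staying Pending, while the short-name lookup succeeds.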
(In reply to Flavio Percoco from comment #1)
> We have been debugging this and it's mostly related to DHCP's provided
> hostname (full FQDN) used as the node's hostname. It only occurs when a
> custom hostname is specified in the BMH.
>
> The hostname is properly set by assisted agent but then the full FQDN is
> used when the node is rebooted as provided by DHCP. This causes a failure in
> the CSR approval for day2 because the machine-approver won't find a node
> (which uses the full FQDN) that matches the hostname found in the Machine
> CR. When no custom hostname is requested, then full FQDN is used for
> everything and things work as expected.
>
> Arguably, we could say that the node should likely not use the full FQDN as
> the node name but this is out of our realm. May be worth bringing this up to
> the OCP team.

In fact, I now see that the hostname annotation set by the user is only honored in the BMH.

Looking at a day1 install (no remote worker nodes), with the hostname annotation set to something different from the short name:

oc get agent -o wide
NAME                                   CLUSTER   APPROVED   ROLE     STAGE   HOSTNAME                                     REQUESTED HOSTNAME
1aaff5fc-9e04-4ad8-b8ae-681a5fde5fda   spoke-0   true       master   Done    spoke-master-0-2.spoke-0.qe.lab.redhat.com   foobar-master-0-2
222c3773-9ab4-4cd1-a377-62acafd98441   spoke-0   true       worker   Done    spoke-worker-0-0.spoke-0.qe.lab.redhat.com   foobar-worker-0-0
5a013ba6-6939-43ff-9b3f-882cb717b796   spoke-0   true       worker   Done    spoke-worker-0-1.spoke-0.qe.lab.redhat.com   foobar-worker-0-1
d32ad241-6247-47a4-a81b-8c546708ec0a   spoke-0   true       master   Done    spoke-master-0-0.spoke-0.qe.lab.redhat.com   foo-master-0-0
fb944e70-121b-4363-933f-f428f4281306   spoke-0   true       master   Done    spoke-master-0-1.spoke-0.qe.lab.redhat.com   foobar-master-0-1

On the installed spoke cluster, the node name does not honor the annotation:

oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
spoke-master-0-0.spoke-0.qe.lab.redhat.com   Ready    master   45m   v1.23.3+b63be7f
spoke-master-0-1.spoke-0.qe.lab.redhat.com   Ready    master   45m   v1.23.3+b63be7f
spoke-master-0-2.spoke-0.qe.lab.redhat.com   Ready    master   21m   v1.23.3+b63be7f
spoke-worker-0-0.spoke-0.qe.lab.redhat.com   Ready    worker   23m   v1.23.3+b63be7f
spoke-worker-0-1.spoke-0.qe.lab.redhat.com   Ready    worker   23m   v1.23.3+b63be7f

The bmh, however, does:

oc get bmh -n openshift-machine-api
NAME                STATE       CONSUMER                       ONLINE   ERROR   AGE
foo-master-0-0      unmanaged   spoke-0-kvhm2-master-0         true             39m
foobar-master-0-1   unmanaged   spoke-0-kvhm2-master-1         true             39m
foobar-master-0-2   unmanaged   spoke-0-kvhm2-master-2         true             39m
foobar-worker-0-0   unmanaged   spoke-0-kvhm2-worker-0-6msh7   true             39m
foobar-worker-0-1   unmanaged   spoke-0-kvhm2-worker-0-zw92r   true             39m

oc get machine -n openshift-machine-api -o wide
NAME                           PHASE     TYPE   REGION   ZONE   AGE   NODE                                         PROVIDERID                                                                                      STATE
spoke-0-kvhm2-master-0         Running                          39m   spoke-master-0-0.spoke-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/foo-master-0-0/72587cc3-a450-40e5-963c-d7492c666b84      unmanaged
spoke-0-kvhm2-master-1         Running                          39m   spoke-master-0-1.spoke-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/foobar-master-0-1/3c386561-dd89-4444-84e0-3a74f46cce6e   unmanaged
spoke-0-kvhm2-master-2         Running                          39m   spoke-master-0-2.spoke-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/foobar-master-0-2/8bd2964c-a9f0-4988-ad2e-778d12abcfad   unmanaged
spoke-0-kvhm2-worker-0-6msh7   Running                          25m   spoke-worker-0-0.spoke-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/foobar-worker-0-0/09b00b28-288a-4f6f-a1d6-4428686c2978   unmanaged
spoke-0-kvhm2-worker-0-zw92r   Running                          25m   spoke-worker-0-1.spoke-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/foobar-worker-0-1/84bfe4f8-b811-4f95-996d-026b34baff54   unmanaged

SSHing into the nodes also shows that the hostname annotation is not honored:

[core@spoke-worker-0-0 ~]$ hostname
spoke-worker-0-0.spoke-0.qe.lab.redhat.com
This issue continues to reproduce on OCP 4.10.20/MCE 2.1.0-DOWNANDBACK-2022-06-28-15-44-48.

On spoke cluster:

$ oc get bmh -A
NAMESPACE               NAME                STATE                    CONSUMER                       ONLINE   ERROR                            AGE
openshift-machine-api   spoke-master-0-0    unmanaged                spoke-0-cp2wt-master-0         true                                      23h
openshift-machine-api   spoke-master-0-1    unmanaged                spoke-0-cp2wt-master-1         true                                      23h
openshift-machine-api   spoke-master-0-2    unmanaged                spoke-0-cp2wt-master-2         true                                      23h
openshift-machine-api   spoke-rwn-0-0-bmh   externally provisioned   spoke-0-spoke-rwn-0-0-bmh      true     provisioned registration error   22h
openshift-machine-api   spoke-worker-0-0    unmanaged                spoke-0-cp2wt-worker-0-57stz   true                                      23h
openshift-machine-api   spoke-worker-0-1    unmanaged                spoke-0-cp2wt-worker-0-gw7kz   true                                      23h

Before manually approving CSRs:

$ oc get machine -n openshift-machine-api
NAME                           PHASE         TYPE   REGION   ZONE   AGE
spoke-0-cp2wt-master-0         Running                              21h
spoke-0-cp2wt-master-1         Running                              21h
spoke-0-cp2wt-master-2         Running                              21h
spoke-0-cp2wt-worker-0-57stz   Running                              21h
spoke-0-cp2wt-worker-0-gw7kz   Running                              21h
spoke-0-spoke-rwn-0-0-bmh      Provisioned                          20h

$ oc get nodes -A
NAME                                         STATUS   ROLES    AGE   VERSION
spoke-master-0-0.spoke-0.qe.lab.redhat.com   Ready    master   21h   v1.23.5+3afdacb
spoke-master-0-1.spoke-0.qe.lab.redhat.com   Ready    master   21h   v1.23.5+3afdacb
spoke-master-0-2.spoke-0.qe.lab.redhat.com   Ready    master   20h   v1.23.5+3afdacb
spoke-worker-0-0.spoke-0.qe.lab.redhat.com   Ready    worker   20h   v1.23.5+3afdacb
spoke-worker-0-1.spoke-0.qe.lab.redhat.com   Ready    worker   20h   v1.23.5+3afdacb

- No node created for the RWN.

After the CSRs were manually approved, the relevant node was created and the machine resource transitioned to the "Running" phase.
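For anyone scripting the manual-approval workaround, the following is a rough Python sketch that picks the pending node-bootstrapper CSR names out of `oc get csr` text output. The function name and the assumption that each row has six whitespace-separated columns (name, age, signer, requestor, duration, condition) are illustrative and based only on the listing in the original description. The returned names could then be passed to `oc adm certificate approve`.

```python
from typing import List

# Requestor string as it appears in the CSR listing in this report.
BOOTSTRAPPER = (
    "system:serviceaccount:openshift-machine-config-operator:node-bootstrapper"
)


def pending_bootstrapper_csrs(csr_output: str) -> List[str]:
    """Return names of Pending CSRs requested by the node-bootstrapper account.

    Assumes six whitespace-separated columns per row, as in the report:
    name, age, signer, requestor, requested duration, condition.
    """
    names = []
    for line in csr_output.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[0] == "NAME":
            continue  # skip blank lines, short lines, and a header row if present
        if fields[3] == BOOTSTRAPPER and fields[-1] == "Pending":
            names.append(fields[0])
    return names


# Sample rows copied from the original description.
sample = """\
csr-5ll2g 9m9s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending
csr-f8vbl 8m24s kubernetes.io/kube-apiserver-client-kubelet system:serviceaccount:openshift-machine-config-operator:node-bootstrapper <none> Pending
"""

pending = pending_bootstrapper_csrs(sample)
print(pending)
```

Parsing the human-readable table is fragile; structured output (for example `-o json`) would be a sturdier input if this were used beyond a quick check.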
*** This bug has been marked as a duplicate of bug 2087213 ***
I tested this PR (https://github.com/openshift/machine-config-operator/pull/3285), which includes the changes needed for IPv6 DHCP to work on 4.11 (https://github.com/openshift/machine-config-operator/pull/3282) and the changes implemented by Ori (https://github.com/openshift/machine-config-operator/pull/3276). It appears to have resolved the issue of CSRs not being auto-approved.
@mfilanov Can this move to verified? Thanks!
We cannot verify this until the fix is merged into MCO and present in a 4.11 z-stream. I just did early testing to ensure that the attached PR fixes the issue but not with an official build.
Agree with Trey
Do we know exactly when an OCP release containing the merged PR will be available? Does it also require a new AI build on the ACM side? Thanks!
PR https://github.com/openshift/machine-config-operator/pull/3276 merged to master
Hey @oamizur, we need to cherry-pick the PR into the release-4.11 branch to get it into 4.11 builds. As it stands, the fix will only show up in 4.12.
The backport PR for 4.11 is https://github.com/openshift/machine-config-operator/pull/3305
The backport was merged.
This has been verified by QE
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days