Bug 2051533
Summary: | Adding day2 remote worker node requires manually approving CSRs | ||
---|---|---|---|
Product: | Red Hat Advanced Cluster Management for Kubernetes | Reporter: | nshidlin <nshidlin> |
Component: | Infrastructure Operator | Assignee: | Ori Amizur <oamizur> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Trey West <trwest> |
Severity: | urgent | Docs Contact: | Derek <dcadzow> |
Priority: | unspecified | ||
Version: | rhacm-2.6 | CC: | ccrum, fpercoco, mfilanov, oamizur, otuchfel, smiron, trwest, yfirst, yobshans, yuhe |
Target Milestone: | --- | Keywords: | Regression, Reopened, TestBlocker |
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | If docs needed, set a value
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2022-10-03 20:20:49 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | | |
Description
nshidlin 2022-02-07 12:48:46 UTC
We have been debugging this, and it's mostly related to the DHCP-provided hostname (the full FQDN) being used as the node's hostname. It only occurs when a custom hostname is specified in the BMH. The hostname is properly set by the assisted agent, but when the node is rebooted, the full FQDN provided by DHCP is used instead. This causes the CSR approval for day 2 to fail, because the machine-approver won't find a node (which uses the full FQDN) matching the hostname found in the Machine CR. When no custom hostname is requested, the full FQDN is used for everything and things work as expected.

Arguably, the node should likely not use the full FQDN as the node name, but this is out of our realm. It may be worth bringing this up to the OCP team.

(In reply to Flavio Percoco from comment #1)

In fact, I see now that the hostname annotation set by the user is only honored in the BMH.
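The failing comparison can be sketched in shell. The hostnames below are taken from the outputs later in this report; the comparison logic is an illustrative approximation of why auto-approval fails, not the machine-approver's actual code:

```shell
# Illustrative sketch (not machine-approver code): with a custom BMH hostname,
# the rebooted node registers under the DHCP-provided FQDN, so neither the
# full node name nor its short form matches the hostname in the Machine CR,
# and the day-2 CSR is left Pending instead of being auto-approved.
node_name="spoke-worker-0-0.spoke-0.qe.lab.redhat.com"  # name the kubelet registers with
machine_hostname="foobar-worker-0-0"                    # custom hostname requested via the BMH
short_name="${node_name%%.*}"                           # FQDN with the DHCP domain stripped

if [ "$machine_hostname" != "$node_name" ] && [ "$machine_hostname" != "$short_name" ]; then
  echo "mismatch: CSR will not be auto-approved"
fi
```

When no custom hostname is requested, `machine_hostname` equals the FQDN and the comparison succeeds, which matches the observed behavior.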
Looking at a day 1 install (no remote worker nodes), with the hostname annotation set to differ from the short name:

```
$ oc get agent -o wide
NAME                                   CLUSTER   APPROVED   ROLE     STAGE   HOSTNAME                                     REQUESTED HOSTNAME
1aaff5fc-9e04-4ad8-b8ae-681a5fde5fda   spoke-0   true       master   Done    spoke-master-0-2.spoke-0.qe.lab.redhat.com   foobar-master-0-2
222c3773-9ab4-4cd1-a377-62acafd98441   spoke-0   true       worker   Done    spoke-worker-0-0.spoke-0.qe.lab.redhat.com   foobar-worker-0-0
5a013ba6-6939-43ff-9b3f-882cb717b796   spoke-0   true       worker   Done    spoke-worker-0-1.spoke-0.qe.lab.redhat.com   foobar-worker-0-1
d32ad241-6247-47a4-a81b-8c546708ec0a   spoke-0   true       master   Done    spoke-master-0-0.spoke-0.qe.lab.redhat.com   foo-master-0-0
fb944e70-121b-4363-933f-f428f4281306   spoke-0   true       master   Done    spoke-master-0-1.spoke-0.qe.lab.redhat.com   foobar-master-0-1
```

On the installed spoke cluster, the node name does not honor the annotation:

```
$ oc get nodes
NAME                                         STATUS   ROLES    AGE   VERSION
spoke-master-0-0.spoke-0.qe.lab.redhat.com   Ready    master   45m   v1.23.3+b63be7f
spoke-master-0-1.spoke-0.qe.lab.redhat.com   Ready    master   45m   v1.23.3+b63be7f
spoke-master-0-2.spoke-0.qe.lab.redhat.com   Ready    master   21m   v1.23.3+b63be7f
spoke-worker-0-0.spoke-0.qe.lab.redhat.com   Ready    worker   23m   v1.23.3+b63be7f
spoke-worker-0-1.spoke-0.qe.lab.redhat.com   Ready    worker   23m   v1.23.3+b63be7f
```

The BMH, however, does:

```
$ oc get bmh -n openshift-machine-api
NAME                STATE       CONSUMER                       ONLINE   ERROR   AGE
foo-master-0-0      unmanaged   spoke-0-kvhm2-master-0         true             39m
foobar-master-0-1   unmanaged   spoke-0-kvhm2-master-1         true             39m
foobar-master-0-2   unmanaged   spoke-0-kvhm2-master-2         true             39m
foobar-worker-0-0   unmanaged   spoke-0-kvhm2-worker-0-6msh7   true             39m
foobar-worker-0-1   unmanaged   spoke-0-kvhm2-worker-0-zw92r   true             39m

$ oc get machine -n openshift-machine-api -o wide
NAME                           PHASE     TYPE   REGION   ZONE   AGE   NODE                                         PROVIDERID                                                                                      STATE
spoke-0-kvhm2-master-0         Running                          39m   spoke-master-0-0.spoke-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/foo-master-0-0/72587cc3-a450-40e5-963c-d7492c666b84      unmanaged
spoke-0-kvhm2-master-1         Running                          39m   spoke-master-0-1.spoke-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/foobar-master-0-1/3c386561-dd89-4444-84e0-3a74f46cce6e   unmanaged
spoke-0-kvhm2-master-2         Running                          39m   spoke-master-0-2.spoke-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/foobar-master-0-2/8bd2964c-a9f0-4988-ad2e-778d12abcfad   unmanaged
spoke-0-kvhm2-worker-0-6msh7   Running                          25m   spoke-worker-0-0.spoke-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/foobar-worker-0-0/09b00b28-288a-4f6f-a1d6-4428686c2978   unmanaged
spoke-0-kvhm2-worker-0-zw92r   Running                          25m   spoke-worker-0-1.spoke-0.qe.lab.redhat.com   baremetalhost:///openshift-machine-api/foobar-worker-0-1/84bfe4f8-b811-4f95-996d-026b34baff54   unmanaged
```

SSHing into the nodes also shows the hostname annotation is not honored:

```
[core@spoke-worker-0-0 ~]$ hostname
spoke-worker-0-0.spoke-0.qe.lab.redhat.com
```

This issue continues to reproduce on OCP 4.10.20 / MCE 2.1.0-DOWNANDBACK-2022-06-28-15-44-48. On the spoke cluster:

```
$ oc get bmh -A
NAMESPACE               NAME                STATE                    CONSUMER                       ONLINE   ERROR                            AGE
openshift-machine-api   spoke-master-0-0    unmanaged                spoke-0-cp2wt-master-0         true                                      23h
openshift-machine-api   spoke-master-0-1    unmanaged                spoke-0-cp2wt-master-1         true                                      23h
openshift-machine-api   spoke-master-0-2    unmanaged                spoke-0-cp2wt-master-2         true                                      23h
openshift-machine-api   spoke-rwn-0-0-bmh   externally provisioned   spoke-0-spoke-rwn-0-0-bmh      true     provisioned registration error   22h
openshift-machine-api   spoke-worker-0-0    unmanaged                spoke-0-cp2wt-worker-0-57stz   true                                      23h
openshift-machine-api   spoke-worker-0-1    unmanaged                spoke-0-cp2wt-worker-0-gw7kz   true                                      23h
```

Before manually approving CSRs:

```
$ oc get machine -n openshift-machine-api
NAME                           PHASE         TYPE   REGION   ZONE   AGE
spoke-0-cp2wt-master-0         Running                              21h
spoke-0-cp2wt-master-1         Running                              21h
spoke-0-cp2wt-master-2         Running                              21h
spoke-0-cp2wt-worker-0-57stz   Running                              21h
spoke-0-cp2wt-worker-0-gw7kz   Running                              21h
spoke-0-spoke-rwn-0-0-bmh      Provisioned                          20h

$ oc get nodes -A
NAME                                         STATUS   ROLES    AGE   VERSION
spoke-master-0-0.spoke-0.qe.lab.redhat.com   Ready    master   21h   v1.23.5+3afdacb
spoke-master-0-1.spoke-0.qe.lab.redhat.com   Ready    master   21h   v1.23.5+3afdacb
spoke-master-0-2.spoke-0.qe.lab.redhat.com   Ready    master   20h   v1.23.5+3afdacb
spoke-worker-0-0.spoke-0.qe.lab.redhat.com   Ready    worker   20h   v1.23.5+3afdacb
spoke-worker-0-1.spoke-0.qe.lab.redhat.com   Ready    worker   20h   v1.23.5+3afdacb
```

No node was created for the remote worker node (RWN). After the CSRs were manually approved, the relevant node was created and the Machine resource transitioned to the "Running" phase.

*** This bug has been marked as a duplicate of bug 2087213 ***

I tested this PR (https://github.com/openshift/machine-config-operator/pull/3285), which includes the changes needed for IPv6 DHCP to work on 4.11 (https://github.com/openshift/machine-config-operator/pull/3282) and the changes implemented by Ori (https://github.com/openshift/machine-config-operator/pull/3276). This seems to have solved the issue of CSRs not being auto-approved. @mfilanov Can this move to verified? Thanks!

We cannot verify this until the fix is merged into MCO and present in a 4.11 z-stream. I just did early testing to ensure that the attached PR fixes the issue, but not with an official build.

Agree with Trey. May we know exactly when the OCP release with the PR merged will be available? Does it require a new AI build on the ACM side? Thanks!

PR https://github.com/openshift/machine-config-operator/pull/3276 merged to master.

Hey @oamizur, we need to cherry-pick the PR into the release-4.11 branch in order to get it into 4.11 builds. So far, the fix will only show up in 4.12.

The backport PR for 4.11 is https://github.com/openshift/machine-config-operator/pull/3305.

The backport was merged.

This has been verified by QE.

The needinfo request[s] on this closed bug have been removed as they have been unresolved for 120 days.
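The manual-approval workaround referenced above can be sketched as a small helper. This is an illustrative sketch, assuming a logged-in `oc` session on the spoke cluster and `jq` on the PATH; `pending_csrs` is a helper name invented here, not part of any tooling mentioned in this bug:

```shell
# Print the names of CSRs that have no status conditions yet, i.e. are still
# Pending. Reads `oc get csr -o json` output (a CertificateSigningRequestList)
# on stdin.
pending_csrs() {
  jq -r '.items[] | select((.status.conditions // []) | length == 0) | .metadata.name'
}

# Against a live cluster (requires an oc login, so commented out here):
#   oc get csr -o json | pending_csrs | xargs -r oc adm certificate approve
```

`oc adm certificate approve` is the standard command for approving kubelet CSRs; once the linked MCO fix landed, this manual step was no longer needed.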