Description of problem:
The environment is OpenShift on OpenStack. The customer adds a floating IP (FIP) to a worker node from the OpenStack Horizon web client. OpenShift picks up this FIP and sets it as 'ExternalIP' on the worker node. This additional IP is visible in the `node` object but NOT in the `machine` object for the same worker node. Because this additional IP (FIP) is not present in the `status.addresses` section of the `machine` object, the cluster-machine-approver's requirements are not fulfilled and it fails to approve CSRs for such nodes.

Version-Release number of selected component (if applicable):
- RHOCP 4.7

How reproducible:
- Every time

Steps to Reproduce:
Customer reproducer:
~~~
Install an OpenShift 4.7 cluster on OpenStack using ACM.
1.1. At this stage, no worker Node or Machine has an "ExternalIP" field defined.

On the OpenStack level:
2.1. From an already defined FIP pool, assign one FIP to each worker instance port. This is a manual task done using the OpenStack Horizon web client.
2.2. The floating IP is active and associated with each worker instance.

On the OpenShift level:
3.1. The "ExternalIP" field is now visible on the node but not on the machine.
~~~

Other:
- Install OCP on OSP.
- Confirm no external IPs are present on the node.
- Assign a FIP to any node from the OpenStack Horizon web client (GUI).
- Confirm that ExternalIP is now updated on the node (`oc get nodes -owide`).
- Verify that the same IP is NOT present in the `machine` object for the same node.

Expected results:
The FIP should be picked up by the `machine` object and added to the `status.addresses` field. This will allow the cluster-machine-approver to automatically approve node CSRs for such nodes.

Additional info:
- We had to debug a situation where an IPI installation of OCP on OSP was failing to auto-approve node CSRs.
- Our investigation led us to this issue: the machine object not picking up the external IP causes the CSR auto-approval failure.
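The node/machine address mismatch described above can be checked mechanically. A minimal sketch with sample data inlined (on a live cluster the two lists would come from `oc get node`/`oc get machine` with a jsonpath query over `.status.addresses`; the addresses below are illustrative):

```shell
# Sample data. On a live cluster these would come from:
#   oc get node <name> -o jsonpath='{.status.addresses[*].address}'
#   oc get machine -n openshift-machine-api <name> -o jsonpath='{.status.addresses[*].address}'
node_addrs="10.196.3.179 10.46.44.49"
machine_addrs="10.196.3.179"

# cluster-machine-approver requires every address on the node (and hence in
# the kubelet serving CSR) to also appear on the machine; report any that don't.
for a in $node_addrs; do
  case " $machine_addrs " in
    *" $a "*) echo "OK      $a" ;;
    *)        echo "MISSING $a" ;;   # this is the FIP case reported here
  esac
done
```

With the sample data above, the FIP `10.46.44.49` is reported as MISSING, which is exactly the condition that blocks CSR approval.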
Would you mind explaining the use case for why you want a floating IP address to be part of the node addresses?
The use case is to configure egress IPs on nodes as part of a future solution. https://docs.openshift.com/container-platform/4.9/networking/openshift_sdn/assigning-egress-ips.html
This is caused by two problems in downstream CAPO. Firstly, we are not setting providerID on the machine object; if we set providerID, addresses are not even used by the node link controller for machine matching. Secondly, we are filtering the addresses we return, which is the immediate cause of this problem. Both of these are already fixed in MAPO.
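The matching precedence described above can be sketched as follows. This is a simplified illustration based on the comment, not the actual controller source; the providerID value is hypothetical (the `openstack:///<uuid>` shape is the usual OpenStack form):

```shell
# Simplified sketch of the node-link matching order: a machine with a
# providerID is matched to the node by providerID alone and status.addresses
# is never consulted; an empty providerID falls back to address matching,
# which is the path that exposes this bug.
machine_provider_id=""    # downstream CAPO leaves this unset
node_provider_id="openstack:///31d0647a-a4b7-4b78-8eb1-dd4176bd23eb"   # hypothetical

if [ -n "$machine_provider_id" ] && [ "$machine_provider_id" = "$node_provider_id" ]; then
  match_mode="providerID"
else
  match_mode="addresses"
fi
echo "node matched by: $match_mode"
```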
Removing the Triaged keyword because: * the target release value is missing
Reproduced this locally. Stand up a 4.9 cluster with a single worker. Scale the worker machineset to 2. While the new machine is provisioning, manually add a floating IP to it. When kubelet starts, there is a pending CSR:

$ oc get csr
NAME        AGE    SIGNERNAME                                    REQUESTOR                                                                   REQUESTEDDURATION   CONDITION
csr-7sdgv   112s   kubernetes.io/kubelet-serving                 system:node:mbooth-psi-sx2gq-worker-0-5x2ct                                 <none>              Pending
csr-zdz4m   2m2s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Approved,Issued
This problem isn't caused by a failure of the nodelink-controller. As reported, kubelet creates a CSR containing all IP addresses on the node, which in turn come from the local metadata service and include all IPs, including the floating IP. CAPO, on the other hand, tries to be clever about the addresses it reports: it reports only a single address from the 'primary' network. Both the floating IP and the fixed IP are exposed via ports on the same network, but it reports only one, whichever one Nova returns last. This is always the floating IP. It doesn't actually matter which of the two addresses CAPO reports: either case will fail in the cluster-machine-approver, because the other IP address will not match and all addresses are required to match. We already fixed this in upstream CAPO and therefore MAPO: https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1004 I think until we release MAPO we should write a downstream-only patch to make this work.
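The claim that kubelet puts every node IP, fixed and floating, into the serving CSR's SANs can be checked by decoding the request. On a live cluster the PEM comes from `oc get csr <name> -o jsonpath='{.spec.request}' | base64 -d`; the sketch below instead generates a throwaway CSR of the same shape (subject and IPs are illustrative; requires OpenSSL 1.1.1+ for `-addext`) and decodes it:

```shell
# Generate a stand-in kubelet serving CSR with both the fixed IP and the FIP
# in the SAN list, the way kubelet would on a node with a FIP attached.
csr_file="$(mktemp)"
openssl req -new -newkey rsa:2048 -nodes -keyout /dev/null \
  -subj "/O=system:nodes/CN=system:node:example-worker" \
  -addext "subjectAltName=IP:10.196.3.179,IP:10.46.44.49" \
  -out "$csr_file" 2>/dev/null

# Decode the CSR and show its SANs: both IPs are present, so the approver
# must find both on the machine for the request to be approved.
openssl req -in "$csr_file" -noout -text | grep -A1 "Subject Alternative Name"
```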
This is a serious bug, but it's not a blocker because it's latent.
Verified on OCP 4.10.0-0.nightly-2021-12-14-083101 on top of OSP 16.1 (RHOS-16.1-RHEL-8-20210903.n.0). A new worker is created and a floating IP is attached to the OSP instance while the machine is in Provisioning status:

$ oc apply -f new_machine.yaml
machine.machine.openshift.io/ostest-kmzqk-new-worker created

$ openstack server list
+--------------------------------------+-----------------------------+--------+--------------------------------------------------------------+--------------------+--------+
| ID                                   | Name                        | Status | Networks                                                     | Image              | Flavor |
+--------------------------------------+-----------------------------+--------+--------------------------------------------------------------+--------------------+--------+
| 31d0647a-a4b7-4b78-8eb1-dd4176bd23eb | ostest-kmzqk-new-worker     | ACTIVE | StorageNFS=172.17.5.229; ostest-kmzqk-openshift=10.196.3.179 |                    |        |
| 436b3af6-375d-40df-894e-0d7865d24b1b | ostest-kmzqk-worker-0-dh2rw | ACTIVE | StorageNFS=172.17.5.203; ostest-kmzqk-openshift=10.196.0.179 | ostest-kmzqk-rhcos |        |
| ce24ca88-9623-4a24-b15c-1c6a8be40a9e | ostest-kmzqk-worker-0-pqq7p | ACTIVE | StorageNFS=172.17.5.222; ostest-kmzqk-openshift=10.196.2.15  | ostest-kmzqk-rhcos |        |
| a0f9a346-32a7-4265-9fb0-8440c971938d | ostest-kmzqk-master-2       | ACTIVE | ostest-kmzqk-openshift=10.196.3.5                            | ostest-kmzqk-rhcos |        |
| 749245dd-8a13-4e11-8329-32ad0fd05cc8 | ostest-kmzqk-master-1       | ACTIVE | ostest-kmzqk-openshift=10.196.2.57                           | ostest-kmzqk-rhcos |        |
| e372bdbb-5601-4e17-a697-0786b6066199 | ostest-kmzqk-master-0       | ACTIVE | ostest-kmzqk-openshift=10.196.2.11                           | ostest-kmzqk-rhcos |        |
+--------------------------------------+-----------------------------+--------+--------------------------------------------------------------+--------------------+--------+

$ openstack port list | grep 10.196.3.179
| c53b7759-5e70-4052-b8a3-94e2b01341d5 | ostest-kmzqk-new-worker-57a851e3-7b67-4606-9a88-c1cffa49b436 | fa:16:3e:30:c6:26 | ip_address='10.196.3.179', subnet_id='57a851e3-7b67-4606-9a88-c1cffa49b436' | ACTIVE |

$ openstack floating ip set --port c53b7759-5e70-4052-b8a3-94e2b01341d5 10.46.44.49

$ openstack server list
+--------------------------------------+-----------------------------+--------+---------------------------------------------------------------------------+--------------------+--------+
| ID                                   | Name                        | Status | Networks                                                                  | Image              | Flavor |
+--------------------------------------+-----------------------------+--------+---------------------------------------------------------------------------+--------------------+--------+
| 31d0647a-a4b7-4b78-8eb1-dd4176bd23eb | ostest-kmzqk-new-worker     | ACTIVE | StorageNFS=172.17.5.229; ostest-kmzqk-openshift=10.196.3.179, 10.46.44.49 |                    |        |

$ oc get machines -A
NAMESPACE               NAME                          PHASE         TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-kmzqk-master-0         Running                                      2d
openshift-machine-api   ostest-kmzqk-master-1         Running                                      2d
openshift-machine-api   ostest-kmzqk-master-2         Running                                      2d
openshift-machine-api   ostest-kmzqk-new-worker       Provisioned   m4.xlarge   regionOne   nova   4m45s
openshift-machine-api   ostest-kmzqk-worker-0-dh2rw   Running       m4.xlarge   regionOne   nova   2d
openshift-machine-api   ostest-kmzqk-worker-0-pqq7p   Running       m4.xlarge   regionOne   nova   2d

$ oc get csr -A -w
NAMESPACE   NAME        AGE   SIGNERNAME                                    REQUESTOR                                                                   REQUESTEDDURATION   CONDITION
csr-dbsvz   0s    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Pending
csr-dbsvz   0s    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Approved
csr-dbsvz   0s    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>   Approved,Issued
csr-rqqbc   0s    kubernetes.io/kubelet-serving                 system:node:ostest-kmzqk-new-worker                                         <none>   Pending
csr-rqqbc   0s    kubernetes.io/kubelet-serving                 system:node:ostest-kmzqk-new-worker                                         <none>   Approved
csr-rqqbc   0s    kubernetes.io/kubelet-serving                 system:node:ostest-kmzqk-new-worker                                         <none>   Approved,Issued

$ oc logs -n openshift-cluster-machine-approver -l app=machine-approver -c machine-approver-controller
I1215 17:11:47.100530       1 csr_check.go:201] Falling back to machine-api authorization for ostest-kmzqk-new-worker
I1215 17:11:47.117256       1 controller.go:227] CSR csr-v48j8 approved
I1216 16:08:11.390045       1 controller.go:118] Reconciling CSR: csr-dbsvz
I1216 16:08:11.452958       1 controller.go:227] CSR csr-dbsvz approved
I1216 16:08:26.041054       1 controller.go:118] Reconciling CSR: csr-rqqbc
I1216 16:08:26.238591       1 csr_check.go:156] csr-rqqbc: CSR does not appear to be client csr
I1216 16:08:26.287170       1 csr_check.go:521] retrieving serving cert from ostest-kmzqk-new-worker (10.196.3.179:10250)
I1216 16:08:26.290118       1 csr_check.go:181] Failed to retrieve current serving cert: remote error: tls: internal error
I1216 16:08:26.290178       1 csr_check.go:201] Falling back to machine-api authorization for ostest-kmzqk-new-worker
I1216 16:08:26.332617       1 controller.go:227] CSR csr-rqqbc approved

$ oc get machine -n openshift-machine-api ostest-kmzqk-new-worker -o json | jq .status.addresses
[
  {
    "address": "172.17.5.229",
    "type": "InternalIP"
  },
  {
    "address": "10.196.3.179",
    "type": "InternalIP"
  },
  {
    "address": "10.46.44.49",
    "type": "ExternalIP"
  },
  {
    "address": "ostest-kmzqk-new-worker",
    "type": "Hostname"
  },
  {
    "address": "ostest-kmzqk-new-worker",
    "type": "InternalDNS"
  }
]

$ oc get nodes
NAME                          STATUS   ROLES    AGE   VERSION
ostest-kmzqk-master-0         Ready    master   2d    v1.22.1+6859754
ostest-kmzqk-master-1         Ready    master   2d    v1.22.1+6859754
ostest-kmzqk-master-2         Ready    master   2d    v1.22.1+6859754
ostest-kmzqk-new-worker       Ready    worker   11m   v1.22.1+6859754
ostest-kmzqk-worker-0-dh2rw   Ready    worker   2d    v1.22.1+6859754
ostest-kmzqk-worker-0-pqq7p   Ready    worker   2d    v1.22.1+6859754

All the IPs now appear in the machine status section, including the FIP attached during instance creation. Therefore, the CSR is valid and the worker is correctly included in the cluster.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:0056