Bug 2022627
| Summary: | Machine object not picking up external FIP added to an openstack vm | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Siddhant More <simore> |
| Component: | Cloud Compute | Assignee: | Matthew Booth <mbooth> |
| Cloud Compute sub component: | OpenStack Provider | QA Contact: | rlobillo |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aos-bugs, emacchi, m.andre, mbooth, mfedosin, pprinett |
| Version: | 4.7 | Keywords: | Triaged |
| Target Milestone: | --- | | |
| Target Release: | 4.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-10 16:26:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Doc Text:

Cause:
On OpenStack, when kubelet creates a certificate signing request (CSR) for its Node, which it does both during initialisation and regularly thereafter, it requests a certificate for every IP address reported to it by OpenStack. If a user has added a floating IP to the host, this includes both the private and the floating IP address. However, only the private IP address was reported on the Machine object, and the cluster machine approver, which approves the CSR, accepts the request only if every requested IP address is present on the Machine corresponding to the requesting Node.

Consequence:
CSRs created by kubelet are never accepted, and the Node cannot join the cluster.

Fix:
On OpenStack, all IP addresses are now reported on the Machine object.

Result:
Adding a floating IP to an OpenStack VM no longer prevents it from joining an OpenShift cluster.
Description

Siddhant More 2021-11-12 08:28:30 UTC

Would you mind explaining the use case for why you want a floating IP address to be part of the node addresses?

The use case is to configure egress IPs on nodes as part of a future solution:
https://docs.openshift.com/container-platform/4.9/networking/openshift_sdn/assigning-egress-ips.html

---

This is caused by two problems in downstream CAPO:

Firstly, we are not setting providerID on the machine object. If we set providerID, then addresses are not even used by the node link controller for machine matching.

Secondly, we are filtering the addresses we return, which is the immediate cause of this problem.

Both of these are already fixed in MAPO.

---

Removing the Triaged keyword because:
* the target release value is missing

---

Reproduced this locally. Stand up a 4.9 cluster with a single worker. Scale the worker machineset to 2. While the new machine is provisioning, manually add a floating IP to it. When kubelet starts, there is a pending CSR:

```
$ oc get csr
NAME        AGE    SIGNERNAME                                    REQUESTOR                                                                   REQUESTEDDURATION   CONDITION
csr-7sdgv   112s   kubernetes.io/kubelet-serving                 system:node:mbooth-psi-sx2gq-worker-0-5x2ct                                 <none>              Pending
csr-zdz4m   2m2s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Approved,Issued
```

This problem isn't caused by a failure of the nodelink-controller. As reported, kubelet creates a CSR containing all IP addresses on the node, which in turn come from the local metadata service and include all IPs, floating IP included.

CAPO, on the other hand, is trying to be clever with the addresses it reports: it reports only a single address from the 'primary' network. Both the floating IP and the fixed IP are exposed via ports on the same network, but it reports only one: whichever one is returned by Nova last. This is always the floating IP.

It doesn't actually matter which of the two addresses CAPO reports: either case will fail in the cluster machine approver, because the other IP address will not match and all of them are required to match. We already fixed this in upstream CAPO and therefore MAPO:
https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1004

I think until we release MAPO we should write a downstream-only patch to make this work. This is a serious bug, but it's not a blocker because it's latent.
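To make the failure mode concrete, here is a minimal Go sketch of the matching rule described above. It is illustrative only: the real check lives in the cluster machine approver (csr_check.go in the logs below), and the function and variable names here are invented for the example. The rule it models is that every IP SAN in a kubelet-serving CSR must appear among the Machine's reported addresses.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"net"
)

// csrCoversOnlyMachineIPs returns nil only if every IP SAN requested in the
// CSR is also listed on the corresponding Machine. This models the
// all-or-nothing matching described in the bug; it is not the approver's
// actual implementation.
func csrCoversOnlyMachineIPs(csrPEM []byte, machineIPs []string) error {
	block, _ := pem.Decode(csrPEM)
	if block == nil {
		return fmt.Errorf("no PEM block in CSR")
	}
	req, err := x509.ParseCertificateRequest(block.Bytes)
	if err != nil {
		return err
	}
	known := map[string]bool{}
	for _, a := range machineIPs {
		known[a] = true
	}
	for _, ip := range req.IPAddresses {
		if !known[ip.String()] {
			return fmt.Errorf("CSR requests IP %s which is not on the Machine", ip)
		}
	}
	return nil
}

func main() {
	// Build a CSR the way kubelet effectively does: one IP SAN for every
	// address the metadata service reports - the fixed IP and the FIP.
	key, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	der, _ := x509.CreateCertificateRequest(rand.Reader, &x509.CertificateRequest{
		Subject:     pkix.Name{CommonName: "system:node:ostest-kmzqk-new-worker"},
		IPAddresses: []net.IP{net.ParseIP("10.196.3.179"), net.ParseIP("10.46.44.49")},
	}, key)
	csrPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: der})

	// Before the fix the Machine reported only the private IP, so the FIP
	// SAN cannot be matched and the CSR is left Pending:
	fmt.Println(csrCoversOnlyMachineIPs(csrPEM, []string{"10.196.3.179"}))
	// After the fix all addresses are reported and the check passes (nil):
	fmt.Println(csrCoversOnlyMachineIPs(csrPEM, []string{"10.196.3.179", "10.46.44.49"}))
}
```

On a live cluster, the IP SANs a pending CSR is requesting can be inspected with standard tooling, for example (using the csr-rqqbc name from the verification below):

```
$ oc get csr csr-rqqbc -o jsonpath='{.spec.request}' | base64 -d \
    | openssl req -noout -text | grep -A1 'Subject Alternative Name'
```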
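The provider-side fix has a similarly simple shape. The sketch below is a simplified illustration rather than the actual cluster-api-provider-openstack change from PR #1004 above (the real code works with the Gophercloud port and floating-IP types; the structs here are stand-ins): instead of filtering down to one address from the 'primary' network, every fixed IP and floating IP is folded into the Machine's address list.

```go
package main

import "fmt"

// Simplified stand-ins for the Gophercloud/Kubernetes types involved.
type Port struct {
	NetworkID string
	FixedIPs  []string
}

type FloatingIP struct {
	IP string
}

type NodeAddress struct {
	Type    string // "InternalIP" or "ExternalIP"
	Address string
}

// nodeAddresses reports every address attached to the instance. The buggy
// version kept only one address from the "primary" network; reporting all
// fixed IPs as InternalIP and all floating IPs as ExternalIP lets the
// machine approver match every SAN in kubelet's CSR.
func nodeAddresses(ports []Port, fips []FloatingIP) []NodeAddress {
	var out []NodeAddress
	for _, p := range ports {
		for _, ip := range p.FixedIPs {
			out = append(out, NodeAddress{Type: "InternalIP", Address: ip})
		}
	}
	for _, f := range fips {
		out = append(out, NodeAddress{Type: "ExternalIP", Address: f.IP})
	}
	return out
}

func main() {
	ports := []Port{
		{NetworkID: "StorageNFS", FixedIPs: []string{"172.17.5.229"}},
		{NetworkID: "ostest-kmzqk-openshift", FixedIPs: []string{"10.196.3.179"}},
	}
	fips := []FloatingIP{{IP: "10.46.44.49"}}
	for _, a := range nodeAddresses(ports, fips) {
		fmt.Printf("%-10s %s\n", a.Type, a.Address)
	}
	// Output matches the IP entries in the verified machine's
	// status.addresses shown in the verification below.
}
```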
---

Verified on OCP4.10.0-0.nightly-2021-12-14-083101 on top of OSP16.1 (RHOS-16.1-RHEL-8-20210903.n.0).

A new worker is created, and a floating IP is attached to the OSP instance while the machine is in Provisioning status:

```
$ oc apply -f new_machine.yaml
machine.machine.openshift.io/ostest-kmzqk-new-worker created

$ openstack server list
+--------------------------------------+-----------------------------+--------+--------------------------------------------------------------+--------------------+--------+
| ID                                   | Name                        | Status | Networks                                                     | Image              | Flavor |
+--------------------------------------+-----------------------------+--------+--------------------------------------------------------------+--------------------+--------+
| 31d0647a-a4b7-4b78-8eb1-dd4176bd23eb | ostest-kmzqk-new-worker     | ACTIVE | StorageNFS=172.17.5.229; ostest-kmzqk-openshift=10.196.3.179 |                    |        |
| 436b3af6-375d-40df-894e-0d7865d24b1b | ostest-kmzqk-worker-0-dh2rw | ACTIVE | StorageNFS=172.17.5.203; ostest-kmzqk-openshift=10.196.0.179 | ostest-kmzqk-rhcos |        |
| ce24ca88-9623-4a24-b15c-1c6a8be40a9e | ostest-kmzqk-worker-0-pqq7p | ACTIVE | StorageNFS=172.17.5.222; ostest-kmzqk-openshift=10.196.2.15  | ostest-kmzqk-rhcos |        |
| a0f9a346-32a7-4265-9fb0-8440c971938d | ostest-kmzqk-master-2       | ACTIVE | ostest-kmzqk-openshift=10.196.3.5                            | ostest-kmzqk-rhcos |        |
| 749245dd-8a13-4e11-8329-32ad0fd05cc8 | ostest-kmzqk-master-1       | ACTIVE | ostest-kmzqk-openshift=10.196.2.57                           | ostest-kmzqk-rhcos |        |
| e372bdbb-5601-4e17-a697-0786b6066199 | ostest-kmzqk-master-0       | ACTIVE | ostest-kmzqk-openshift=10.196.2.11                           | ostest-kmzqk-rhcos |        |
+--------------------------------------+-----------------------------+--------+--------------------------------------------------------------+--------------------+--------+

$ openstack port list | grep 10.196.3.179
| c53b7759-5e70-4052-b8a3-94e2b01341d5 | ostest-kmzqk-new-worker-57a851e3-7b67-4606-9a88-c1cffa49b436 | fa:16:3e:30:c6:26 | ip_address='10.196.3.179', subnet_id='57a851e3-7b67-4606-9a88-c1cffa49b436' | ACTIVE |

$ openstack floating ip set --port c53b7759-5e70-4052-b8a3-94e2b01341d5 10.46.44.49

$ openstack server list
+--------------------------------------+-----------------------------+--------+---------------------------------------------------------------------------+--------------------+--------+
| ID                                   | Name                        | Status | Networks                                                                  | Image              | Flavor |
+--------------------------------------+-----------------------------+--------+---------------------------------------------------------------------------+--------------------+--------+
| 31d0647a-a4b7-4b78-8eb1-dd4176bd23eb | ostest-kmzqk-new-worker     | ACTIVE | StorageNFS=172.17.5.229; ostest-kmzqk-openshift=10.196.3.179, 10.46.44.49 |                    |        |

$ oc get machines -A
NAMESPACE               NAME                          PHASE         TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-kmzqk-master-0         Running                                      2d
openshift-machine-api   ostest-kmzqk-master-1         Running                                      2d
openshift-machine-api   ostest-kmzqk-master-2         Running                                      2d
openshift-machine-api   ostest-kmzqk-new-worker       Provisioned   m4.xlarge   regionOne   nova   4m45s
openshift-machine-api   ostest-kmzqk-worker-0-dh2rw   Running       m4.xlarge   regionOne   nova   2d
openshift-machine-api   ostest-kmzqk-worker-0-pqq7p   Running       m4.xlarge   regionOne   nova   2d
```

The new node's CSRs are approved:

```
$ oc get csr -A -w
NAMESPACE   NAME        AGE   SIGNERNAME                                    REQUESTOR                                                                   REQUESTEDDURATION   CONDITION
            csr-dbsvz   0s    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
            csr-dbsvz   0s    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Approved
            csr-dbsvz   0s    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Approved,Issued
            csr-rqqbc   0s    kubernetes.io/kubelet-serving                 system:node:ostest-kmzqk-new-worker                                         <none>              Pending
            csr-rqqbc   0s    kubernetes.io/kubelet-serving                 system:node:ostest-kmzqk-new-worker                                         <none>              Approved
            csr-rqqbc   0s    kubernetes.io/kubelet-serving                 system:node:ostest-kmzqk-new-worker                                         <none>              Approved,Issued

$ oc logs -n openshift-cluster-machine-approver -l app=machine-approver -c machine-approver-controller
I1215 17:11:47.100530       1 csr_check.go:201] Falling back to machine-api authorization for ostest-kmzqk-new-worker
I1215 17:11:47.117256       1 controller.go:227] CSR csr-v48j8 approved
I1216 16:08:11.390045       1 controller.go:118] Reconciling CSR: csr-dbsvz
I1216 16:08:11.452958       1 controller.go:227] CSR csr-dbsvz approved
I1216 16:08:26.041054       1 controller.go:118] Reconciling CSR: csr-rqqbc
I1216 16:08:26.238591       1 csr_check.go:156] csr-rqqbc: CSR does not appear to be client csr
I1216 16:08:26.287170       1 csr_check.go:521] retrieving serving cert from ostest-kmzqk-new-worker (10.196.3.179:10250)
I1216 16:08:26.290118       1 csr_check.go:181] Failed to retrieve current serving cert: remote error: tls: internal error
I1216 16:08:26.290178       1 csr_check.go:201] Falling back to machine-api authorization for ostest-kmzqk-new-worker
I1216 16:08:26.332617       1 controller.go:227] CSR csr-rqqbc approved
```

The Machine reports all addresses, and the node joins the cluster:

```
$ oc get machine -n openshift-machine-api ostest-kmzqk-new-worker -o json | jq .status.addresses
[
  {
    "address": "172.17.5.229",
    "type": "InternalIP"
  },
  {
    "address": "10.196.3.179",
    "type": "InternalIP"
  },
  {
    "address": "10.46.44.49",
    "type": "ExternalIP"
  },
  {
    "address": "ostest-kmzqk-new-worker",
    "type": "Hostname"
  },
  {
    "address": "ostest-kmzqk-new-worker",
    "type": "InternalDNS"
  }
]

$ oc get nodes
NAME                          STATUS   ROLES    AGE   VERSION
ostest-kmzqk-master-0         Ready    master   2d    v1.22.1+6859754
ostest-kmzqk-master-1         Ready    master   2d    v1.22.1+6859754
ostest-kmzqk-master-2         Ready    master   2d    v1.22.1+6859754
ostest-kmzqk-new-worker       Ready    worker   11m   v1.22.1+6859754
ostest-kmzqk-worker-0-dh2rw   Ready    worker   2d    v1.22.1+6859754
ostest-kmzqk-worker-0-pqq7p   Ready    worker   2d    v1.22.1+6859754
```

All the IPs now appear in the machine's status section, including the FIP attached during instance creation. The CSR is therefore valid, and the worker is correctly included in the cluster.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056