Bug 2022627
| Summary: | Machine object not picking up external FIP added to an openstack vm | | |
| --- | --- | --- | --- |
| Product: | OpenShift Container Platform | Reporter: | Siddhant More <simore> |
| Component: | Cloud Compute | Assignee: | Matthew Booth <mbooth> |
| Cloud Compute sub component: | OpenStack Provider | QA Contact: | rlobillo |
| Status: | CLOSED ERRATA | Docs Contact: | |
| Severity: | high | | |
| Priority: | high | CC: | aos-bugs, emacchi, m.andre, mbooth, mfedosin, pprinett |
| Version: | 4.7 | Keywords: | Triaged |
| Target Milestone: | --- | | |
| Target Release: | 4.10.0 | | |
| Hardware: | Unspecified | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Story Points: | --- | | |
| Clone Of: | | Environment: | |
| Last Closed: | 2022-03-10 16:26:48 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |

Doc Text:

Cause:
On OpenStack, when kubelet creates a certificate signing request (CSR) for its Node, which it does both during initialisation and regularly thereafter, it requests a certificate for every IP address reported to it by OpenStack. If a user has added a floating IP to the host, this includes both the private and the floating IP address. However, only the private IP address was reported on the Machine object, and the cluster machine approver, which approves the CSR, accepts the request only if every requested IP address is present on the Machine corresponding to the requesting Node.

Consequence:
CSRs created by kubelet are never accepted, and the Node cannot join the cluster.

Fix:
On OpenStack, all IP addresses are now reported on the Machine object.

Result:
Adding a floating IP to an OpenStack VM no longer prevents it from joining an OpenShift cluster.
Description

Siddhant More 2021-11-12 08:28:30 UTC

Would you mind explaining the use case for why you want a floating IP address to be part of the node addresses?

The use case is to configure egress IPs on nodes as part of a future solution:
https://docs.openshift.com/container-platform/4.9/networking/openshift_sdn/assigning-egress-ips.html

---

This is caused by two problems in downstream CAPO:

Firstly, we are not setting providerID on the machine object. If we set providerID, then addresses are not even used by the node link controller for machine matching.

Secondly, we are filtering the addresses we return, which is the immediate cause of this problem.

Both of these are already fixed in MAPO.

---

Removing the Triaged keyword because:
* the target release value is missing

---

Reproduced this locally. Stand up a 4.9 cluster with a single worker. Scale the worker machineset to 2. While the new machine is provisioning, manually add a floating IP to it. When kubelet starts, there is a pending CSR:

```
$ oc get csr
NAME        AGE    SIGNERNAME                                    REQUESTOR                                                                   REQUESTEDDURATION   CONDITION
csr-7sdgv   112s   kubernetes.io/kubelet-serving                 system:node:mbooth-psi-sx2gq-worker-0-5x2ct                                 <none>              Pending
csr-zdz4m   2m2s   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Approved,Issued
```

This problem isn't caused by a failure of the nodelink-controller. As reported, kubelet creates a CSR containing all IP addresses on the node, which in turn come from the local metadata service and include all IPs, floating IP included.

CAPO, on the other hand, is trying to be clever with the addresses it reports: it reports only a single address from the 'primary' network. Both the floating IP and the fixed IP are exposed via ports on the same network, but it reports only one: whichever one is returned by Nova last. This is always the floating IP.

It doesn't actually matter which of the two addresses CAPO reports: either case will fail in the cluster machine approver, because the other IP address will not match and all of them are required to match. We already fixed this in upstream CAPO and therefore MAPO:
https://github.com/kubernetes-sigs/cluster-api-provider-openstack/pull/1004

I think until we release MAPO we should write a downstream-only patch to make this work. This is a serious bug, but it's not a blocker because it's latent.
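To make the failure mode concrete, here is a minimal Go sketch of the matching rule described above. It is illustrative only: the real check lives in the cluster machine approver (csr_check.go in the logs below), and the function and variable names here are invented for the example. The rule it models is that every IP SAN in a kubelet-serving CSR must appear among the Machine's reported addresses.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"net"
)

// csrCoversOnlyMachineIPs returns nil only if every IP SAN requested in the
// CSR is also listed on the corresponding Machine. This models the
// all-or-nothing matching described in the bug; it is not the approver's
// actual implementation.
func csrCoversOnlyMachineIPs(csrPEM []byte, machineIPs []string) error {
	block, _ := pem.Decode(csrPEM)
	if block == nil {
		return fmt.Errorf("no PEM block in CSR")
	}
	req, err := x509.ParseCertificateRequest(block.Bytes)
	if err != nil {
		return err
	}
	known := map[string]bool{}
	for _, a := range machineIPs {
		known[a] = true
	}
	for _, ip := range req.IPAddresses {
		if !known[ip.String()] {
			return fmt.Errorf("CSR requests IP %s which is not on the Machine", ip)
		}
	}
	return nil
}

func main() {
	// Build a CSR the way kubelet effectively does: one IP SAN for every
	// address the metadata service reports - the fixed IP and the FIP.
	key, _ := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	der, _ := x509.CreateCertificateRequest(rand.Reader, &x509.CertificateRequest{
		Subject:     pkix.Name{CommonName: "system:node:ostest-kmzqk-new-worker"},
		IPAddresses: []net.IP{net.ParseIP("10.196.3.179"), net.ParseIP("10.46.44.49")},
	}, key)
	csrPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE REQUEST", Bytes: der})

	// Before the fix the Machine reported only the private IP, so the FIP
	// SAN cannot be matched and the CSR is left Pending:
	fmt.Println(csrCoversOnlyMachineIPs(csrPEM, []string{"10.196.3.179"}))
	// After the fix all addresses are reported and the check passes (nil):
	fmt.Println(csrCoversOnlyMachineIPs(csrPEM, []string{"10.196.3.179", "10.46.44.49"}))
}
```

On a live cluster, the IP SANs a pending CSR is requesting can be inspected with standard tooling, for example (using the csr-rqqbc name from the verification below):

```
$ oc get csr csr-rqqbc -o jsonpath='{.spec.request}' | base64 -d \
    | openssl req -noout -text | grep -A1 'Subject Alternative Name'
```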
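The provider-side fix has a similarly simple shape. The sketch below is a simplified illustration rather than the actual cluster-api-provider-openstack change from PR #1004 above (the real code works with the Gophercloud port and floating-IP types; the structs here are stand-ins): instead of filtering down to one address from the 'primary' network, every fixed IP and floating IP is folded into the Machine's address list.

```go
package main

import "fmt"

// Simplified stand-ins for the Gophercloud/Kubernetes types involved.
type Port struct {
	NetworkID string
	FixedIPs  []string
}

type FloatingIP struct {
	IP string
}

type NodeAddress struct {
	Type    string // "InternalIP" or "ExternalIP"
	Address string
}

// nodeAddresses reports every address attached to the instance. The buggy
// version kept only one address from the "primary" network; reporting all
// fixed IPs as InternalIP and all floating IPs as ExternalIP lets the
// machine approver match every SAN in kubelet's CSR.
func nodeAddresses(ports []Port, fips []FloatingIP) []NodeAddress {
	var out []NodeAddress
	for _, p := range ports {
		for _, ip := range p.FixedIPs {
			out = append(out, NodeAddress{Type: "InternalIP", Address: ip})
		}
	}
	for _, f := range fips {
		out = append(out, NodeAddress{Type: "ExternalIP", Address: f.IP})
	}
	return out
}

func main() {
	ports := []Port{
		{NetworkID: "StorageNFS", FixedIPs: []string{"172.17.5.229"}},
		{NetworkID: "ostest-kmzqk-openshift", FixedIPs: []string{"10.196.3.179"}},
	}
	fips := []FloatingIP{{IP: "10.46.44.49"}}
	for _, a := range nodeAddresses(ports, fips) {
		fmt.Printf("%-10s %s\n", a.Type, a.Address)
	}
	// Output matches the IP entries in the verified machine's
	// status.addresses shown in the verification below.
}
```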
---

Verified on OCP4.10.0-0.nightly-2021-12-14-083101 on top of OSP16.1 (RHOS-16.1-RHEL-8-20210903.n.0).

A new worker is created, and a floating IP is attached to the OSP instance while the machine is in Provisioning status:

```
$ oc apply -f new_machine.yaml
machine.machine.openshift.io/ostest-kmzqk-new-worker created

$ openstack server list
+--------------------------------------+-----------------------------+--------+--------------------------------------------------------------+--------------------+--------+
| ID                                   | Name                        | Status | Networks                                                     | Image              | Flavor |
+--------------------------------------+-----------------------------+--------+--------------------------------------------------------------+--------------------+--------+
| 31d0647a-a4b7-4b78-8eb1-dd4176bd23eb | ostest-kmzqk-new-worker     | ACTIVE | StorageNFS=172.17.5.229; ostest-kmzqk-openshift=10.196.3.179 |                    |        |
| 436b3af6-375d-40df-894e-0d7865d24b1b | ostest-kmzqk-worker-0-dh2rw | ACTIVE | StorageNFS=172.17.5.203; ostest-kmzqk-openshift=10.196.0.179 | ostest-kmzqk-rhcos |        |
| ce24ca88-9623-4a24-b15c-1c6a8be40a9e | ostest-kmzqk-worker-0-pqq7p | ACTIVE | StorageNFS=172.17.5.222; ostest-kmzqk-openshift=10.196.2.15  | ostest-kmzqk-rhcos |        |
| a0f9a346-32a7-4265-9fb0-8440c971938d | ostest-kmzqk-master-2       | ACTIVE | ostest-kmzqk-openshift=10.196.3.5                            | ostest-kmzqk-rhcos |        |
| 749245dd-8a13-4e11-8329-32ad0fd05cc8 | ostest-kmzqk-master-1       | ACTIVE | ostest-kmzqk-openshift=10.196.2.57                           | ostest-kmzqk-rhcos |        |
| e372bdbb-5601-4e17-a697-0786b6066199 | ostest-kmzqk-master-0       | ACTIVE | ostest-kmzqk-openshift=10.196.2.11                           | ostest-kmzqk-rhcos |        |
+--------------------------------------+-----------------------------+--------+--------------------------------------------------------------+--------------------+--------+

$ openstack port list | grep 10.196.3.179
| c53b7759-5e70-4052-b8a3-94e2b01341d5 | ostest-kmzqk-new-worker-57a851e3-7b67-4606-9a88-c1cffa49b436 | fa:16:3e:30:c6:26 | ip_address='10.196.3.179', subnet_id='57a851e3-7b67-4606-9a88-c1cffa49b436' | ACTIVE |

$ openstack floating ip set --port c53b7759-5e70-4052-b8a3-94e2b01341d5 10.46.44.49

$ openstack server list
+--------------------------------------+-----------------------------+--------+---------------------------------------------------------------------------+--------------------+--------+
| ID                                   | Name                        | Status | Networks                                                                  | Image              | Flavor |
+--------------------------------------+-----------------------------+--------+---------------------------------------------------------------------------+--------------------+--------+
| 31d0647a-a4b7-4b78-8eb1-dd4176bd23eb | ostest-kmzqk-new-worker     | ACTIVE | StorageNFS=172.17.5.229; ostest-kmzqk-openshift=10.196.3.179, 10.46.44.49 |                    |        |

$ oc get machines -A
NAMESPACE               NAME                          PHASE         TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-kmzqk-master-0         Running                                      2d
openshift-machine-api   ostest-kmzqk-master-1         Running                                      2d
openshift-machine-api   ostest-kmzqk-master-2         Running                                      2d
openshift-machine-api   ostest-kmzqk-new-worker       Provisioned   m4.xlarge   regionOne   nova   4m45s
openshift-machine-api   ostest-kmzqk-worker-0-dh2rw   Running       m4.xlarge   regionOne   nova   2d
openshift-machine-api   ostest-kmzqk-worker-0-pqq7p   Running       m4.xlarge   regionOne   nova   2d
```

The new node's CSRs are approved:

```
$ oc get csr -A -w
NAMESPACE   NAME        AGE   SIGNERNAME                                    REQUESTOR                                                                   REQUESTEDDURATION   CONDITION
            csr-dbsvz   0s    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Pending
            csr-dbsvz   0s    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Approved
            csr-dbsvz   0s    kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   <none>              Approved,Issued
            csr-rqqbc   0s    kubernetes.io/kubelet-serving                 system:node:ostest-kmzqk-new-worker                                         <none>              Pending
            csr-rqqbc   0s    kubernetes.io/kubelet-serving                 system:node:ostest-kmzqk-new-worker                                         <none>              Approved
            csr-rqqbc   0s    kubernetes.io/kubelet-serving                 system:node:ostest-kmzqk-new-worker                                         <none>              Approved,Issued

$ oc logs -n openshift-cluster-machine-approver -l app=machine-approver -c machine-approver-controller
I1215 17:11:47.100530       1 csr_check.go:201] Falling back to machine-api authorization for ostest-kmzqk-new-worker
I1215 17:11:47.117256       1 controller.go:227] CSR csr-v48j8 approved
I1216 16:08:11.390045       1 controller.go:118] Reconciling CSR: csr-dbsvz
I1216 16:08:11.452958       1 controller.go:227] CSR csr-dbsvz approved
I1216 16:08:26.041054       1 controller.go:118] Reconciling CSR: csr-rqqbc
I1216 16:08:26.238591       1 csr_check.go:156] csr-rqqbc: CSR does not appear to be client csr
I1216 16:08:26.287170       1 csr_check.go:521] retrieving serving cert from ostest-kmzqk-new-worker (10.196.3.179:10250)
I1216 16:08:26.290118       1 csr_check.go:181] Failed to retrieve current serving cert: remote error: tls: internal error
I1216 16:08:26.290178       1 csr_check.go:201] Falling back to machine-api authorization for ostest-kmzqk-new-worker
I1216 16:08:26.332617       1 controller.go:227] CSR csr-rqqbc approved
```

The Machine reports all addresses, and the node joins the cluster:

```
$ oc get machine -n openshift-machine-api ostest-kmzqk-new-worker -o json | jq .status.addresses
[
  {
    "address": "172.17.5.229",
    "type": "InternalIP"
  },
  {
    "address": "10.196.3.179",
    "type": "InternalIP"
  },
  {
    "address": "10.46.44.49",
    "type": "ExternalIP"
  },
  {
    "address": "ostest-kmzqk-new-worker",
    "type": "Hostname"
  },
  {
    "address": "ostest-kmzqk-new-worker",
    "type": "InternalDNS"
  }
]

$ oc get nodes
NAME                          STATUS   ROLES    AGE   VERSION
ostest-kmzqk-master-0         Ready    master   2d    v1.22.1+6859754
ostest-kmzqk-master-1         Ready    master   2d    v1.22.1+6859754
ostest-kmzqk-master-2         Ready    master   2d    v1.22.1+6859754
ostest-kmzqk-new-worker       Ready    worker   11m   v1.22.1+6859754
ostest-kmzqk-worker-0-dh2rw   Ready    worker   2d    v1.22.1+6859754
ostest-kmzqk-worker-0-pqq7p   Ready    worker   2d    v1.22.1+6859754
```

All the IPs now appear in the machine's status section, including the FIP attached during instance creation. The CSR is therefore valid, and the worker is correctly included in the cluster.

---

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.10.3 security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:0056