Bug 1747270
Summary: | [osp] Machine with name "<cluster-id>-worker" couldn't join the cluster | |
---|---|---|---
Product: | OpenShift Container Platform | Reporter: | sunzhaohua <zhsun>
Component: | Cloud Compute | Assignee: | Matthew Booth <mbooth>
Cloud Compute sub component: | OpenStack Provider | QA Contact: | rlobillo
Status: | CLOSED ERRATA | Docs Contact: |
Severity: | low | |
Priority: | medium | CC: | adduarte, agarcial, eduen, egarcia, jhou, m.andre, mfedosin, pprinett
Version: | 4.2.0 | Keywords: | Triaged
Target Milestone: | --- | |
Target Release: | 4.8.0 | |
Hardware: | Unspecified | |
OS: | Unspecified | |
Whiteboard: | osp | |
Fixed In Version: | | Doc Type: | No Doc Update
Doc Text: | | Story Points: | ---
Clone Of: | | Environment: |
Last Closed: | 2021-07-27 22:32:19 UTC | Type: | Bug
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
sunzhaohua
2019-08-30 02:59:54 UTC
The team considers this bug as valid. Considering this bug priority and our capacity, we are deferring this bug to an upcoming sprint. If there are reasons for us to reprioritise, please let us know.

Considering the priority assigned to this bug and our team capacity, we are deferring this bug to an upcoming sprint. Please let us know if there are reasons for us to reprioritize.

Mike will check whether this bug is still valid with the new Machine API Operator.

Hi! From the description it's not clear whether we want to create worker machines without a suffix or not:

> Description of problem:
> Creating a machine with name "<cluster-id>-worker", the machine couldn't join the cluster. Create another machine with name "<cluster-id>-worker-aaa" could join the cluster.

> Expected results:
> Machine with name "<cluster-id>-worker" couldn't join the cluster

Should we be able to add suffixless workers to the cluster or not?

@mfedosin Do you remember what the root cause was?

I have reproduced this issue. I manually created a worker called cluster-dsal-8bn7j-worker. MAO has annotated it with the metadata of the existing machine cluster-dsal-8bn7j-worker-0-wnvdx without creating a new machine, and is spinning with:

    I0323 11:24:23.511087 1 controller.go:171] cluster-dsal-8bn7j-worker: reconciling Machine
    I0323 11:24:23.518150 1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
    I0323 11:24:24.209178 1 controller.go:279] cluster-dsal-8bn7j-worker: reconciling machine triggers idempotent update
    I0323 11:24:24.216003 1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
    I0323 11:24:24.660083 1 controller.go:295] cluster-dsal-8bn7j-worker: has no node yet, requeuing

I suggest this warrants further investigation as it's a potentially serious issue in either CAPO or OpenStack.

This could be either a bug in Gophercloud or a misunderstanding of the API. I put a break in GetInstanceList in CAPO. It appears that servers.List for the named server returns all 3 existing workers.

    (dlv) print opts
    *sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/openstack/clients.InstanceListOpts {
        Image: "rhcos",
        Flavor: "m1.xlarge",
        Name: "cluster-dsal-8bn7j-worker",}

    (dlv) print listOpts
    github.com/gophercloud/gophercloud/openstack/compute/v2/servers.ListOpts {
        ChangesSince: "",
        Image: "",
        Flavor: "",
        IP: "",
        IP6: "",
        Name: "cluster-dsal-8bn7j-worker",
        Status: "",
        Host: "",
        Marker: "",
        Limit: 0,
        AllTenants: false,
        TenantID: "",
        UserID: "",
        Tags: "",
        TagsAny: "",
        NotTags: "",
        NotTagsAny: "",}

    (dlv) print instanceList[0]
    *sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/openstack/clients.Instance {
        Server: github.com/gophercloud/gophercloud/openstack/compute/v2/servers.Server {
            ID: "76b03a66-b6e7-42bd-b04e-a82c7e5c1052",
            TenantID: "a2f172c1f76140a7a47d15f249997ef7",
            UserID: "41e8a7a892364179a924fbf8309e3b56",
            Name: "cluster-dsal-8bn7j-worker-0-wnvdx",
            Updated: (*time.Time)(0xc000199220),
            Created: (*time.Time)(0xc000199238),
            HostID: "0e5117846c36bd39e61bc64edb49a4081e7e75f3aeda906a8eab16ed",
            Status: "ACTIVE",
            Progress: 0,
            AccessIPv4: "",
            AccessIPv6: "",
            Image: map[string]interface {} [...],
            Flavor: map[string]interface {} [...],
            Addresses: map[string]interface {} [...],
            Metadata: map[string]string [...],
            Links: []interface {} len: 2, cap: 4, [
                *(*interface {})(0xc0007d0040),
                *(*interface {})(0xc0007d0050),
            ],
            KeyName: "",
            AdminPass: "",
            SecurityGroups: []map[string]interface {} len: 1, cap: 4, [
                [...],
            ],
            AttachedVolumes: []github.com/gophercloud/gophercloud/openstack/compute/v2/servers.AttachedVolume len: 0, cap: 0, [],
            Fault: (*"github.com/gophercloud/gophercloud/openstack/compute/v2/servers.Fault")(0xc000199320),
            Tags: *[]string nil,},}

It is a limitation of the API. From https://docs.openstack.org/api-ref/compute/?expanded=list-servers-detail#list-servers, the name parameter is defined as:

> Filters the response by a server name, as a string. You can use regular expressions in the query. For example, the ?name=bob regular expression returns both bob and bobb. If you must match on only bob, you can use a regular expression that matches the syntax of the underlying database server that is implemented for Compute, such as MySQL or PostgreSQL.

The API is doing a substring match on the name. Because cluster-dsal-8bn7j-worker is a substring of the names of all 3 worker nodes, the query returns all of them, which causes MAO to assume the server already exists.

Given that this appears to be a deliberate, albeit surprising, feature of the Nova API, we should not attempt to fix this in Gophercloud. Instead we should change GetInstanceList to explicitly specify a whole-string match, i.e. name=^cluster-dsal-8bn7j-worker$. I have confirmed with curl that this works.

Verified on 4.8.0-0.nightly-2021-04-15-074503 over RHOS-16.1-RHEL-8-20210311.n.1.

On a running OCP cluster, the manifest below is applied ($ oc apply -f new_machine.yaml):

    $ cat new_machine.yaml
    apiVersion: machine.openshift.io/v1beta1
    kind: Machine
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: ostest-vd4fm
        machine.openshift.io/cluster-api-machine-role: worker
        machine.openshift.io/cluster-api-machine-type: worker
        machine.openshift.io/instance-type: m4.xlarge
        machine.openshift.io/region: regionOne
        machine.openshift.io/zone: nova
      name: ostest-vd4fm-worker
      namespace: openshift-machine-api
    spec:
      metadata: {}
      providerSpec:
        value:
          apiVersion: openstackproviderconfig.openshift.io/v1alpha1
          cloudName: openstack
          cloudsSecret:
            name: openstack-cloud-credentials
            namespace: openshift-machine-api
          flavor: m4.xlarge
          image: ostest-vd4fm-rhcos
          kind: OpenstackProviderSpec
          metadata:
            creationTimestamp: null
          networks:
          - filter: {}
            subnets:
            - filter:
                name: ostest-vd4fm-nodes
                tags: openshiftClusterID=ostest-vd4fm
          securityGroups:
          - filter: {}
            name: ostest-vd4fm-worker
          serverMetadata:
            Name: ostest-vd4fm-worker
            openshiftClusterID: ostest-vd4fm
          tags:
          - openshiftClusterID=ostest-vd4fm
          trunk: true
          userDataSecret:
            name: worker-user-data

As a result, the new worker is properly added to the cluster:

    $ oc get machines -A
    NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
    openshift-machine-api   ostest-vd4fm-master-0         Running   m4.xlarge   regionOne   nova   155m
    openshift-machine-api   ostest-vd4fm-master-1         Running   m4.xlarge   regionOne   nova   155m
    openshift-machine-api   ostest-vd4fm-master-2         Running   m4.xlarge   regionOne   nova   155m
    openshift-machine-api   ostest-vd4fm-worker           Running   m4.xlarge   regionOne   nova   112m
    openshift-machine-api   ostest-vd4fm-worker-0-2pxl8   Running   m4.xlarge   regionOne   nova   143m

    $ oc get nodes
    NAME                          STATUS   ROLES    AGE    VERSION
    ostest-vd4fm-master-0         Ready    master   152m   v1.21.0-rc.0+6825c59
    ostest-vd4fm-master-1         Ready    master   152m   v1.21.0-rc.0+6825c59
    ostest-vd4fm-master-2         Ready    master   152m   v1.21.0-rc.0+6825c59
    ostest-vd4fm-worker           Ready    worker   99m    v1.21.0-rc.0+6825c59
    ostest-vd4fm-worker-0-2pxl8   Ready    worker   131m   v1.21.0-rc.0+6825c59

    $ oc get clusterversion
    NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
    version   4.8.0-0.nightly-2021-04-15-074503   True        False         117m    Cluster version is 4.8.0-0.nightly-2021-04-15-074503

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
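
For readers following the root-cause analysis, the difference between the substring match and the proposed whole-string match (name=^cluster-dsal-8bn7j-worker$) can be illustrated locally. This is only a sketch, not the change that was merged into CAPO: `exactNameFilter` and `matchingServers` are hypothetical helpers, the second and third server names are made up for illustration, and Go's `regexp` package stands in for the database regular-expression engine that Nova actually uses, whose syntax may differ.

```go
package main

import (
	"fmt"
	"regexp"
)

// exactNameFilter (hypothetical helper) anchors a server name so that a
// regex-based name filter matches the whole name rather than a substring.
// QuoteMeta escapes Go regexp metacharacters; the escaping rules of the
// database behind Nova are assumed, not verified, to be compatible.
func exactNameFilter(name string) string {
	return "^" + regexp.QuoteMeta(name) + "$"
}

// matchingServers (hypothetical helper) emulates a regex name filter
// locally: it returns every server name in which the pattern is found.
func matchingServers(pattern string, servers []string) []string {
	re := regexp.MustCompile(pattern)
	var out []string
	for _, s := range servers {
		if re.MatchString(s) {
			out = append(out, s)
		}
	}
	return out
}

func main() {
	// The first name comes from the debugger output in this report;
	// the other two are invented stand-ins for the remaining workers.
	existing := []string{
		"cluster-dsal-8bn7j-worker-0-wnvdx",
		"cluster-dsal-8bn7j-worker-0-aaaaa",
		"cluster-dsal-8bn7j-worker-0-bbbbb",
	}
	name := "cluster-dsal-8bn7j-worker"

	// Unanchored filter: every existing worker contains the name.
	fmt.Println("substring match:", matchingServers(name, existing))

	// Anchored filter: none of the suffixed workers match.
	fmt.Println("exact match:    ", matchingServers(exactNameFilter(name), existing))
}
```

With the unanchored pattern, every suffixed worker is reported as a match, which mirrors how MAO concluded that the server already existed. With the anchored pattern, the suffixed workers no longer match, so a lookup for the suffixless name would come back empty until that server is actually created.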