Description of problem:
Creating a machine with name "<cluster-id>-worker", the machine couldn't join the cluster. Create another machine with name "<cluster-id>-worker-aaa" could join the cluster.

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-27-072819

How reproducible:
always

Steps to Reproduce:
1. Create a machine

apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: zhsun2-7j5gk
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/instance-type: m1.large
    machine.openshift.io/region: regionOne
    machine.openshift.io/zone: nova
  name: zhsun2-7j5gk-worker
  namespace: openshift-machine-api
spec:
  metadata:
    creationTimestamp: null
  providerSpec:
    value:
      apiVersion: openstackproviderconfig.openshift.io/v1alpha1
      cloudName: openstack
      cloudsSecret:
        name: openstack-cloud-credentials
        namespace: openshift-machine-api
      flavor: m1.large
      image: rhcos-42.80.20190815.3
      kind: OpenstackProviderSpec
      metadata:
        creationTimestamp: null
      networks:
      - filter: {}
        subnets:
        - filter:
            name: zhsun2-7j5gk-nodes
            tags: openshiftClusterID=zhsun2-7j5gk
      securityGroups:
      - filter: {}
        name: zhsun2-7j5gk-worker
      serverMetadata:
        Name: zhsun2-7j5gk-worker
        openshiftClusterID: zhsun2-7j5gk
      tags:
      - openshiftClusterID=zhsun2-7j5gk
      trunk: true
      userDataSecret:
        name: worker-user-data

2. Check machine, node, log
3.
Actual results:

$ oc get machine
NAME                        STATE    TYPE       REGION      ZONE   AGE
zhsun2-7j5gk-master-0       ACTIVE   m1.large   regionOne   nova   78m
zhsun2-7j5gk-master-1       ACTIVE   m1.large   regionOne   nova   78m
zhsun2-7j5gk-master-2       ACTIVE   m1.large   regionOne   nova   78m
zhsun2-7j5gk-worker         ACTIVE   m1.large   regionOne   nova   6m18s
zhsun2-7j5gk-worker-aaa     ACTIVE   m1.large   regionOne   nova   4m51s
zhsun2-7j5gk-worker-c498z   ACTIVE   m1.large   regionOne   nova   77m
zhsun2-7j5gk-worker-fbrmz   ACTIVE   m1.large   regionOne   nova   77m
zhsun2-7j5gk-worker-kcfp6   ACTIVE   m1.large   regionOne   nova   77m

$ oc get node
NAME                        STATUS   ROLES    AGE   VERSION
zhsun2-7j5gk-master-0       Ready    master   78m   v1.14.0+ceed07c42
zhsun2-7j5gk-master-1       Ready    master   79m   v1.14.0+ceed07c42
zhsun2-7j5gk-master-2       Ready    master   78m   v1.14.0+ceed07c42
zhsun2-7j5gk-worker-aaa     Ready    worker   45s   v1.14.0+ceed07c42
zhsun2-7j5gk-worker-c498z   Ready    worker   72m   v1.14.0+ceed07c42
zhsun2-7j5gk-worker-fbrmz   Ready    worker   71m   v1.14.0+ceed07c42
zhsun2-7j5gk-worker-kcfp6   Ready    worker   73m   v1.14.0+ceed07c42

$ oc describe machine zhsun2-7j5gk-worker
Status:
  Addresses:
    Address: 192.168.0.27
    Type: InternalIP
    Address: zhsun2-7j5gk-worker
    Type: Hostname
    Address: zhsun2-7j5gk-worker
    Type: InternalDNS
Events: <none>

machine with name "zhsun2-7j5gk-worker"
I0829 06:46:03.446140 1 controller.go:129] Reconciling Machine "zhsun2-7j5gk-worker"
I0829 06:46:03.446364 1 controller.go:298] Machine "zhsun2-7j5gk-worker" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0829 06:46:03.462125 1 controller.go:129] Reconciling Machine "zhsun2-7j5gk-worker"
I0829 06:46:03.462231 1 controller.go:298] Machine "zhsun2-7j5gk-worker" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0829 06:46:04.017690 1 controller.go:238] Reconciling machine "zhsun2-7j5gk-worker" triggers idempotent update
I0829 06:46:04.522499 1 actuator.go:297] Populating current state for boostrap machine zhsun2-7j5gk-worker
E0829 06:46:05.388498 1 controller.go:240] Error updating machine "openshift-machine-api/zhsun2-7j5gk-worker": Operation cannot be fulfilled on machines.machine.openshift.io "zhsun2-7j5gk-worker": the object has been modified; please apply your changes to the latest version and try again
I0829 06:46:06.389065 1 controller.go:129] Reconciling Machine "zhsun2-7j5gk-worker"
I0829 06:46:06.389098 1 controller.go:298] Machine "zhsun2-7j5gk-worker" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0829 06:46:06.904475 1 controller.go:238] Reconciling machine "zhsun2-7j5gk-worker" triggers idempotent update
I0829 06:46:07.853811 1 actuator.go:297] Populating current state for boostrap machine zhsun2-7j5gk-worker
I0829 06:46:08.545622 1 controller.go:129] Reconciling Machine "zhsun2-7j5gk-worker"
I0829 06:46:08.545651 1 controller.go:298] Machine "zhsun2-7j5gk-worker" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0829 06:46:09.151649 1 controller.go:238] Reconciling machine "zhsun2-7j5gk-worker" triggers idempotent update
I0829 06:46:09.152145 1 actuator.go:325] re-creating machine for update.
E0829 06:46:09.152232 1 actuator.go:328] delete machine for update failed: Failed to get Machine Spec from Provider Spec (clients/machineservice.go 138): no such providerSpec found in manifest

machine with name "zhsun2-7j5gk-worker-aaa"
I0829 06:47:30.194734 1 controller.go:129] Reconciling Machine "zhsun2-7j5gk-worker-aaa"
I0829 06:47:30.195267 1 controller.go:298] Machine "zhsun2-7j5gk-worker-aaa" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0829 06:47:30.219562 1 controller.go:129] Reconciling Machine "zhsun2-7j5gk-worker-aaa"
I0829 06:47:30.219593 1 controller.go:298] Machine "zhsun2-7j5gk-worker-aaa" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0829 06:47:30.421552 1 controller.go:247] Reconciling machine object zhsun2-7j5gk-worker-aaa triggers idempotent create.
W0829 06:48:41.327987 1 controller.go:249] Failed to create machine "zhsun2-7j5gk-worker-aaa": Operation cannot be fulfilled on machines.machine.openshift.io "zhsun2-7j5gk-worker-aaa": the object has been modified; please apply your changes to the latest version and try again
I0829 06:48:42.328380 1 controller.go:129] Reconciling Machine "zhsun2-7j5gk-worker-c498z"
I0829 06:48:42.328437 1 controller.go:298] Machine "zhsun2-7j5gk-worker-c498z" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0829 06:48:42.961372 1 controller.go:238] Reconciling machine "zhsun2-7j5gk-worker-c498z" triggers idempotent update
I0829 06:48:42.961689 1 actuator.go:325] re-creating machine for update.
E0829 06:48:42.961706 1 actuator.go:328] delete machine for update failed: Failed to get Machine Spec from Provider Spec (clients/machineservice.go 138): no such providerSpec found in manifest
I0829 06:48:42.961765 1 controller.go:129] Reconciling Machine "zhsun2-7j5gk-worker-aaa"
I0829 06:48:42.961787 1 controller.go:298] Machine "zhsun2-7j5gk-worker-aaa" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0829 06:48:43.559762 1 controller.go:238] Reconciling machine "zhsun2-7j5gk-worker-aaa" triggers idempotent update

Expected results:
Machine with name "<cluster-id>-worker"couldn't join the cluster

Additional info:
The team considers this bug as valid. Considering this bug priority and our capacity, we are deferring this bug to an upcoming sprint. If there are reasons for us to reprioritise, please let us know.
Considering the priority assigned to this bug and our team capacity, we are deferring this bug to an upcoming sprint. Please let us know if there are reasons for us to reprioritize.
The team considers this bug as valid. Considering this bug priority and our capacity, we are deferring this bug to an upcoming sprint. If there are reasons for us to reprioritise, please let us know. Mike will check whether this bug is still valid with the new Machine API Operator.
Hi! From the description it's not clear if we want to create worker machines without a suffix or not:

> Description of problem:
> Creating a machine with name "<cluster-id>-worker", the machine couldn't join the cluster. Create another machine with name "<cluster-id>-worker-aaa" could join the cluster.

> Expected results:
> Machine with name "<cluster-id>-worker"couldn't join the cluster

Should we be able to add suffixless workers to the cluster or not?
@mfedosin Do you remember what the root cause was?
I have reproduced this issue. I manually created a worker called cluster-dsal-8bn7j-worker. MAO has annotated it with the metadata of existing machine cluster-dsal-8bn7j-worker-0-wnvdx without creating a new machine, and is spinning with:

I0323 11:24:23.511087 1 controller.go:171] cluster-dsal-8bn7j-worker: reconciling Machine
I0323 11:24:23.518150 1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
I0323 11:24:24.209178 1 controller.go:279] cluster-dsal-8bn7j-worker: reconciling machine triggers idempotent update
I0323 11:24:24.216003 1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
I0323 11:24:24.660083 1 controller.go:295] cluster-dsal-8bn7j-worker: has no node yet, requeuing

I suggest this warrants further investigation as it's a potentially serious issue in either CAPO or OpenStack.
This could be either a bug in Gophercloud or a misunderstanding of the API. I set a breakpoint in GetInstanceList in CAPO. It appears that servers.List for the named server returns all 3 existing workers.

(dlv) print opts
*sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/openstack/clients.InstanceListOpts {
    Image: "rhcos",
    Flavor: "m1.xlarge",
    Name: "cluster-dsal-8bn7j-worker",}

(dlv) print listOpts
github.com/gophercloud/gophercloud/openstack/compute/v2/servers.ListOpts {
    ChangesSince: "",
    Image: "",
    Flavor: "",
    IP: "",
    IP6: "",
    Name: "cluster-dsal-8bn7j-worker",
    Status: "",
    Host: "",
    Marker: "",
    Limit: 0,
    AllTenants: false,
    TenantID: "",
    UserID: "",
    Tags: "",
    TagsAny: "",
    NotTags: "",
    NotTagsAny: "",}

(dlv) print instanceList[0]
*sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/openstack/clients.Instance {
    Server: github.com/gophercloud/gophercloud/openstack/compute/v2/servers.Server {
        ID: "76b03a66-b6e7-42bd-b04e-a82c7e5c1052",
        TenantID: "a2f172c1f76140a7a47d15f249997ef7",
        UserID: "41e8a7a892364179a924fbf8309e3b56",
        Name: "cluster-dsal-8bn7j-worker-0-wnvdx",
        Updated: (*time.Time)(0xc000199220),
        Created: (*time.Time)(0xc000199238),
        HostID: "0e5117846c36bd39e61bc64edb49a4081e7e75f3aeda906a8eab16ed",
        Status: "ACTIVE",
        Progress: 0,
        AccessIPv4: "",
        AccessIPv6: "",
        Image: map[string]interface {} [...],
        Flavor: map[string]interface {} [...],
        Addresses: map[string]interface {} [...],
        Metadata: map[string]string [...],
        Links: []interface {} len: 2, cap: 4, [
            *(*interface {})(0xc0007d0040),
            *(*interface {})(0xc0007d0050),
        ],
        KeyName: "",
        AdminPass: "",
        SecurityGroups: []map[string]interface {} len: 1, cap: 4, [
            [...],
        ],
        AttachedVolumes: []github.com/gophercloud/gophercloud/openstack/compute/v2/servers.AttachedVolume len: 0, cap: 0, [],
        Fault: (*"github.com/gophercloud/gophercloud/openstack/compute/v2/servers.Fault")(0xc000199320),
        Tags: *[]string nil,},}
It is a limitation of the API. From https://docs.openstack.org/api-ref/compute/?expanded=list-servers-detail#list-servers, the name parameter is defined as:

---
Filters the response by a server name, as a string. You can use regular expressions in the query. For example, the ?name=bob regular expression returns both bob and bobb. If you must match on only bob, you can use a regular expression that matches the syntax of the underlying database server that is implemented for Compute, such as MySQL or PostgreSQL.
---

The API treats the name as an unanchored regular expression, so in effect it does a substring match. Because cluster-dsal-8bn7j-worker is a substring of all 3 worker node names, the query returns all of the worker nodes, which causes MAO to assume the server already exists.
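The behaviour can be illustrated locally. This is a minimal sketch, not Nova's code: it assumes Python's re.search is a fair stand-in for the unanchored database regular-expression match that the API reference describes, and list_servers is a hypothetical helper. The server names are the ones that existed in the original report before the suffixless machine was created.

```python
import re

# Server names from the `oc get machine` output in the original report
# (the servers that already existed when "zhsun2-7j5gk-worker" was requested).
SERVERS = [
    "zhsun2-7j5gk-master-0",
    "zhsun2-7j5gk-master-1",
    "zhsun2-7j5gk-master-2",
    "zhsun2-7j5gk-worker-c498z",
    "zhsun2-7j5gk-worker-fbrmz",
    "zhsun2-7j5gk-worker-kcfp6",
]


def list_servers(name_filter, servers=SERVERS):
    """Hypothetical stand-in for the server-side name filter.

    Assumption: the filter behaves like an unanchored regex search,
    so any name containing a match is returned.
    """
    return [s for s in servers if re.search(name_filter, s)]


# Asking for the suffixless name returns every existing worker,
# even though no server with exactly that name exists yet.
matches = list_servers("zhsun2-7j5gk-worker")
print(matches)
```

Under this assumption the lookup for a server that does not exist yet comes back non-empty, which is consistent with MAO treating the machine as already created.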
Given that this appears to be a deliberate, albeit surprising, feature of the Nova API, we should not attempt to fix this in Gophercloud. Instead, we should change GetInstanceList to explicitly specify a whole-string match, i.e. name=^cluster-dsal-8bn7j-worker$. I have confirmed with curl that this works.
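For anyone repeating the check by hand, the anchored filter has to be URL-encoded before it goes into the query string. The following is a rough sketch under stated assumptions, not the actual CAPO change: exact_name_query is a hypothetical helper, and it assumes server names contain only letters, digits and hyphens, so nothing in the name needs regex escaping.

```python
import re
from urllib.parse import urlencode


def exact_name_query(name):
    """Build a query string that asks for a whole-string name match.

    Hypothetical helper. Assumption: the name contains only letters,
    digits and hyphens, none of which is a regex metacharacter here,
    so wrapping it in ^...$ is enough.
    """
    if not re.fullmatch(r"[A-Za-z0-9-]+", name):
        raise ValueError(f"name needs regex escaping: {name!r}")
    return urlencode({"name": f"^{name}$"})


query = exact_name_query("cluster-dsal-8bn7j-worker")
print(query)  # name=%5Ecluster-dsal-8bn7j-worker%24

# Local sanity check of the anchored pattern, using Python's regex engine
# as a stand-in for the database's: the existing suffixed worker no longer matches.
pattern = "^cluster-dsal-8bn7j-worker$"
print(re.search(pattern, "cluster-dsal-8bn7j-worker-0-wnvdx"))  # None
```

The resulting string can be appended to the list-servers URL for the deployment's compute endpoint. Names that may contain dots or other metacharacters would need escaping in the syntax of the underlying database, which this sketch deliberately refuses to guess.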
Verified on 4.8.0-0.nightly-2021-04-15-074503 over RHOS-16.1-RHEL-8-20210311.n.1.

On a running OCP cluster, the manifest below is loaded ($ oc apply -f new_machine.yaml):

$ cat new_machine.yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: ostest-vd4fm
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/instance-type: m4.xlarge
    machine.openshift.io/region: regionOne
    machine.openshift.io/zone: nova
  name: ostest-vd4fm-worker
  namespace: openshift-machine-api
spec:
  metadata: {}
  providerSpec:
    value:
      apiVersion: openstackproviderconfig.openshift.io/v1alpha1
      cloudName: openstack
      cloudsSecret:
        name: openstack-cloud-credentials
        namespace: openshift-machine-api
      flavor: m4.xlarge
      image: ostest-vd4fm-rhcos
      kind: OpenstackProviderSpec
      metadata:
        creationTimestamp: null
      networks:
      - filter: {}
        subnets:
        - filter:
            name: ostest-vd4fm-nodes
            tags: openshiftClusterID=ostest-vd4fm
      securityGroups:
      - filter: {}
        name: ostest-vd4fm-worker
      serverMetadata:
        Name: ostest-vd4fm-worker
        openshiftClusterID: ostest-vd4fm
      tags:
      - openshiftClusterID=ostest-vd4fm
      trunk: true
      userDataSecret:
        name: worker-user-data

As a result, the new worker is properly added to the cluster:

$ oc get machines -A
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-vd4fm-master-0         Running   m4.xlarge   regionOne   nova   155m
openshift-machine-api   ostest-vd4fm-master-1         Running   m4.xlarge   regionOne   nova   155m
openshift-machine-api   ostest-vd4fm-master-2         Running   m4.xlarge   regionOne   nova   155m
openshift-machine-api   ostest-vd4fm-worker           Running   m4.xlarge   regionOne   nova   112m
openshift-machine-api   ostest-vd4fm-worker-0-2pxl8   Running   m4.xlarge   regionOne   nova   143m

$ oc get nodes
NAME                          STATUS   ROLES    AGE    VERSION
ostest-vd4fm-master-0         Ready    master   152m   v1.21.0-rc.0+6825c59
ostest-vd4fm-master-1         Ready    master   152m   v1.21.0-rc.0+6825c59
ostest-vd4fm-master-2         Ready    master   152m   v1.21.0-rc.0+6825c59
ostest-vd4fm-worker           Ready    worker   99m    v1.21.0-rc.0+6825c59
ostest-vd4fm-worker-0-2pxl8   Ready    worker   131m   v1.21.0-rc.0+6825c59

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-15-074503   True        False         117m    Cluster version is 4.8.0-0.nightly-2021-04-15-074503
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438