Bug 1747270

Summary: [osp] Machine with name "<cluster-id>-worker" couldn't join the cluster
Product: OpenShift Container Platform Reporter: sunzhaohua <zhsun>
Component: Cloud Compute Assignee: Matthew Booth <mbooth>
Cloud Compute sub component: OpenStack Provider QA Contact: rlobillo
Status: CLOSED ERRATA Docs Contact:
Severity: low    
Priority: medium CC: adduarte, agarcial, eduen, egarcia, jhou, m.andre, mfedosin, pprinett
Version: 4.2.0 Keywords: Triaged
Target Milestone: ---   
Target Release: 4.8.0   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard: osp
Fixed In Version: Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-07-27 22:32:19 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description sunzhaohua 2019-08-30 02:59:54 UTC
Description of problem:
A machine created with the name "<cluster-id>-worker" couldn't join the cluster. Another machine created with the name "<cluster-id>-worker-aaa" could join the cluster.

Version-Release number of selected component (if applicable):
4.2.0-0.nightly-2019-08-27-072819

How reproducible:
always

Steps to Reproduce:
1. Create a machine
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: zhsun2-7j5gk
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/instance-type: m1.large
    machine.openshift.io/region: regionOne
    machine.openshift.io/zone: nova
  name: zhsun2-7j5gk-worker
  namespace: openshift-machine-api
spec:
  metadata:
    creationTimestamp: null
  providerSpec:
    value:
      apiVersion: openstackproviderconfig.openshift.io/v1alpha1
      cloudName: openstack
      cloudsSecret:
        name: openstack-cloud-credentials
        namespace: openshift-machine-api
      flavor: m1.large
      image: rhcos-42.80.20190815.3
      kind: OpenstackProviderSpec
      metadata:
        creationTimestamp: null
      networks:
      - filter: {}
        subnets:
        - filter:
            name: zhsun2-7j5gk-nodes
            tags: openshiftClusterID=zhsun2-7j5gk
      securityGroups:
      - filter: {}
        name: zhsun2-7j5gk-worker
      serverMetadata:
        Name: zhsun2-7j5gk-worker
        openshiftClusterID: zhsun2-7j5gk
      tags:
      - openshiftClusterID=zhsun2-7j5gk
      trunk: true
      userDataSecret:
        name: worker-user-data

2. Check machine, node, log

Actual results:
$ oc get machine
NAME                        STATE    TYPE       REGION      ZONE   AGE
zhsun2-7j5gk-master-0       ACTIVE   m1.large   regionOne   nova   78m
zhsun2-7j5gk-master-1       ACTIVE   m1.large   regionOne   nova   78m
zhsun2-7j5gk-master-2       ACTIVE   m1.large   regionOne   nova   78m
zhsun2-7j5gk-worker         ACTIVE   m1.large   regionOne   nova   6m18s
zhsun2-7j5gk-worker-aaa     ACTIVE   m1.large   regionOne   nova   4m51s
zhsun2-7j5gk-worker-c498z   ACTIVE   m1.large   regionOne   nova   77m
zhsun2-7j5gk-worker-fbrmz   ACTIVE   m1.large   regionOne   nova   77m
zhsun2-7j5gk-worker-kcfp6   ACTIVE   m1.large   regionOne   nova   77m

$ oc get node
NAME                        STATUS   ROLES    AGE   VERSION
zhsun2-7j5gk-master-0       Ready    master   78m   v1.14.0+ceed07c42
zhsun2-7j5gk-master-1       Ready    master   79m   v1.14.0+ceed07c42
zhsun2-7j5gk-master-2       Ready    master   78m   v1.14.0+ceed07c42
zhsun2-7j5gk-worker-aaa     Ready    worker   45s   v1.14.0+ceed07c42
zhsun2-7j5gk-worker-c498z   Ready    worker   72m   v1.14.0+ceed07c42
zhsun2-7j5gk-worker-fbrmz   Ready    worker   71m   v1.14.0+ceed07c42
zhsun2-7j5gk-worker-kcfp6   Ready    worker   73m   v1.14.0+ceed07c42

$ oc describe machine zhsun2-7j5gk-worker
Status:
  Addresses:
    Address:  192.168.0.27
    Type:     InternalIP
    Address:  zhsun2-7j5gk-worker
    Type:     Hostname
    Address:  zhsun2-7j5gk-worker
    Type:     InternalDNS
Events:       <none>


machine with name "zhsun2-7j5gk-worker"
I0829 06:46:03.446140       1 controller.go:129] Reconciling Machine "zhsun2-7j5gk-worker"
I0829 06:46:03.446364       1 controller.go:298] Machine "zhsun2-7j5gk-worker" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0829 06:46:03.462125       1 controller.go:129] Reconciling Machine "zhsun2-7j5gk-worker"
I0829 06:46:03.462231       1 controller.go:298] Machine "zhsun2-7j5gk-worker" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0829 06:46:04.017690       1 controller.go:238] Reconciling machine "zhsun2-7j5gk-worker" triggers idempotent update
I0829 06:46:04.522499       1 actuator.go:297] Populating current state for boostrap machine zhsun2-7j5gk-worker
E0829 06:46:05.388498       1 controller.go:240] Error updating machine "openshift-machine-api/zhsun2-7j5gk-worker": Operation cannot be fulfilled on machines.machine.openshift.io "zhsun2-7j5gk-worker": the object has been modified; please apply your changes to the latest version and try again
I0829 06:46:06.389065       1 controller.go:129] Reconciling Machine "zhsun2-7j5gk-worker"
I0829 06:46:06.389098       1 controller.go:298] Machine "zhsun2-7j5gk-worker" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0829 06:46:06.904475       1 controller.go:238] Reconciling machine "zhsun2-7j5gk-worker" triggers idempotent update
I0829 06:46:07.853811       1 actuator.go:297] Populating current state for boostrap machine zhsun2-7j5gk-worker
I0829 06:46:08.545622       1 controller.go:129] Reconciling Machine "zhsun2-7j5gk-worker"
I0829 06:46:08.545651       1 controller.go:298] Machine "zhsun2-7j5gk-worker" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0829 06:46:09.151649       1 controller.go:238] Reconciling machine "zhsun2-7j5gk-worker" triggers idempotent update
I0829 06:46:09.152145       1 actuator.go:325] re-creating machine  for update.
E0829 06:46:09.152232       1 actuator.go:328] delete machine  for update failed: Failed to get Machine Spec from Provider Spec (clients/machineservice.go 138): no such providerSpec found in manifest

machine with name "zhsun2-7j5gk-worker-aaa"
I0829 06:47:30.194734       1 controller.go:129] Reconciling Machine "zhsun2-7j5gk-worker-aaa"
I0829 06:47:30.195267       1 controller.go:298] Machine "zhsun2-7j5gk-worker-aaa" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0829 06:47:30.219562       1 controller.go:129] Reconciling Machine "zhsun2-7j5gk-worker-aaa"
I0829 06:47:30.219593       1 controller.go:298] Machine "zhsun2-7j5gk-worker-aaa" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0829 06:47:30.421552       1 controller.go:247] Reconciling machine object zhsun2-7j5gk-worker-aaa triggers idempotent create.
W0829 06:48:41.327987       1 controller.go:249] Failed to create machine "zhsun2-7j5gk-worker-aaa": Operation cannot be fulfilled on machines.machine.openshift.io "zhsun2-7j5gk-worker-aaa": the object has been modified; please apply your changes to the latest version and try again
I0829 06:48:42.328380       1 controller.go:129] Reconciling Machine "zhsun2-7j5gk-worker-c498z"
I0829 06:48:42.328437       1 controller.go:298] Machine "zhsun2-7j5gk-worker-c498z" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0829 06:48:42.961372       1 controller.go:238] Reconciling machine "zhsun2-7j5gk-worker-c498z" triggers idempotent update
I0829 06:48:42.961689       1 actuator.go:325] re-creating machine  for update.
E0829 06:48:42.961706       1 actuator.go:328] delete machine  for update failed: Failed to get Machine Spec from Provider Spec (clients/machineservice.go 138): no such providerSpec found in manifest
I0829 06:48:42.961765       1 controller.go:129] Reconciling Machine "zhsun2-7j5gk-worker-aaa"
I0829 06:48:42.961787       1 controller.go:298] Machine "zhsun2-7j5gk-worker-aaa" in namespace "openshift-machine-api" doesn't specify "cluster.k8s.io/cluster-name" label, assuming nil cluster
I0829 06:48:43.559762       1 controller.go:238] Reconciling machine "zhsun2-7j5gk-worker-aaa" triggers idempotent update


Expected results:
Machine with name "<cluster-id>-worker"couldn't join the cluster

Additional info:

Comment 3 Pierre Prinetti 2020-05-07 14:40:36 UTC
The team considers this bug as valid. Considering this bug priority and our capacity, we are deferring this bug to an upcoming sprint. If there are reasons for us to reprioritise, please let us know.

Comment 4 Pierre Prinetti 2020-05-14 14:31:51 UTC
Considering the priority assigned to this bug and our team capacity, we are deferring this bug to an upcoming sprint. Please let us know if there are reasons for us to reprioritize.

Comment 6 Martin André 2020-06-25 14:25:47 UTC
The team considers this bug as valid. Considering this bug priority and our capacity, we are deferring this bug to an upcoming sprint. If there are reasons for us to reprioritise, please let us know.

Mike will check whether this bug is still valid with the new Machine API Operator.

Comment 10 Mike Fedosin 2020-09-17 15:40:09 UTC
Hi! From the description it's not clear if we want to create worker machines without a suffix or not:

> Description of problem:
> A machine created with the name "<cluster-id>-worker" couldn't join the cluster. Another machine created with the name "<cluster-id>-worker-aaa" could join the cluster.

> Expected results:
> Machine with name "<cluster-id>-worker"couldn't join the cluster

Should we be able to add suffixless workers to the cluster or not?

Comment 16 Matthew Booth 2021-03-23 10:11:53 UTC
@mfedosin Do you remember what the root cause was?

Comment 19 Matthew Booth 2021-03-23 11:28:00 UTC
I have reproduced this issue. I manually created a worker called cluster-dsal-8bn7j-worker. MAO has annotated it with the metadata of the existing machine cluster-dsal-8bn7j-worker-0-wnvdx without creating a new machine, and is spinning with:

I0323 11:24:23.511087       1 controller.go:171] cluster-dsal-8bn7j-worker: reconciling Machine
I0323 11:24:23.518150       1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
I0323 11:24:24.209178       1 controller.go:279] cluster-dsal-8bn7j-worker: reconciling machine triggers idempotent update
I0323 11:24:24.216003       1 utils.go:99] Cloud provider CA cert not provided, using system trust bundle
I0323 11:24:24.660083       1 controller.go:295] cluster-dsal-8bn7j-worker: has no node yet, requeuing

I suggest this warrants further investigation as it's a potentially serious issue in either CAPO or OpenStack.

Comment 20 Matthew Booth 2021-03-23 13:42:32 UTC
This could be either a bug in Gophercloud or a misunderstanding of the API. I put a breakpoint in GetInstanceList in CAPO. It appears that servers.List for the named server returns all 3 existing workers.

(dlv) print opts
*sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/openstack/clients.InstanceListOpts {
        Image: "rhcos",
        Flavor: "m1.xlarge",
        Name: "cluster-dsal-8bn7j-worker",}

(dlv) print listOpts
github.com/gophercloud/gophercloud/openstack/compute/v2/servers.ListOpts {
        ChangesSince: "",
        Image: "",
        Flavor: "",
        IP: "",
        IP6: "",
        Name: "cluster-dsal-8bn7j-worker",
        Status: "",
        Host: "",
        Marker: "",
        Limit: 0,
        AllTenants: false,
        TenantID: "",
        UserID: "",
        Tags: "",
        TagsAny: "",
        NotTags: "",
        NotTagsAny: "",}

(dlv) print instanceList[0]
*sigs.k8s.io/cluster-api-provider-openstack/pkg/cloud/openstack/clients.Instance {
        Server: github.com/gophercloud/gophercloud/openstack/compute/v2/servers.Server {
                ID: "76b03a66-b6e7-42bd-b04e-a82c7e5c1052",
                TenantID: "a2f172c1f76140a7a47d15f249997ef7",
                UserID: "41e8a7a892364179a924fbf8309e3b56",
                Name: "cluster-dsal-8bn7j-worker-0-wnvdx",
                Updated: (*time.Time)(0xc000199220),
                Created: (*time.Time)(0xc000199238),
                HostID: "0e5117846c36bd39e61bc64edb49a4081e7e75f3aeda906a8eab16ed",
                Status: "ACTIVE",
                Progress: 0,
                AccessIPv4: "",
                AccessIPv6: "",
                Image: map[string]interface {} [...],
                Flavor: map[string]interface {} [...],
                Addresses: map[string]interface {} [...],
                Metadata: map[string]string [...],
                Links: []interface {} len: 2, cap: 4, [
                        *(*interface {})(0xc0007d0040),
                        *(*interface {})(0xc0007d0050),
                ],
                KeyName: "",
                AdminPass: "",
                SecurityGroups: []map[string]interface {} len: 1, cap: 4, [
                        [...],
                ],
                AttachedVolumes: []github.com/gophercloud/gophercloud/openstack/compute/v2/servers.AttachedVolume len: 0, cap: 0, [],
                Fault: (*"github.com/gophercloud/gophercloud/openstack/compute/v2/servers.Fault")(0xc000199320),
                Tags: *[]string nil,},}

Comment 21 Matthew Booth 2021-03-23 13:53:04 UTC
It is a limitation of the API. From https://docs.openstack.org/api-ref/compute/?expanded=list-servers-detail#list-servers, the name parameter is defined as:

---
Filters the response by a server name, as a string. You can use regular expressions in the query. For example, the ?name=bob regular expression returns both bob and bobb. If you must match on only bob, you can use a regular expression that matches the syntax of the underlying database server that is implemented for Compute, such as MySQL or PostgreSQL.
---

The API treats the name as an unanchored regular expression, which in practice is a substring match. Because cluster-dsal-8bn7j-worker is a substring of the names of all 3 worker nodes, the query returns all of them, which causes MAO to assume the server already exists.
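
The matching behaviour described above can be reproduced locally. The following is only a sketch: it uses Go's regexp package as a stand-in for the database regex that Nova actually runs, matchServerNames is a hypothetical helper rather than anything in CAPO or Gophercloud, and every server name except cluster-dsal-8bn7j-worker-0-wnvdx is invented for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// matchServerNames is a hypothetical helper that mimics a server-side name
// filter treating the query as an unanchored regular expression.
func matchServerNames(filter string, names []string) ([]string, error) {
	re, err := regexp.Compile(filter)
	if err != nil {
		return nil, err
	}
	var matched []string
	for _, n := range names {
		// MatchString succeeds if the pattern matches anywhere in the name.
		if re.MatchString(n) {
			matched = append(matched, n)
		}
	}
	return matched, nil
}

func main() {
	// Illustrative names; only the first one appears in this report.
	servers := []string{
		"cluster-dsal-8bn7j-worker-0-wnvdx",
		"cluster-dsal-8bn7j-worker-0-aaaaa",
		"cluster-dsal-8bn7j-master-0",
	}
	got, err := matchServerNames("cluster-dsal-8bn7j-worker", servers)
	if err != nil {
		panic(err)
	}
	// Both worker names are returned even though no server is named
	// exactly "cluster-dsal-8bn7j-worker".
	fmt.Println(len(got), got)
}
```

A caller that only checks whether the returned list is non-empty would conclude that the suffixless server already exists.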

Comment 22 Matthew Booth 2021-03-23 13:57:02 UTC
Given that this appears to be a deliberate, albeit surprising, feature of the Nova API, we should not attempt to fix this in Gophercloud. Instead, we should change GetInstanceList to explicitly specify a whole-string match, i.e. name=^cluster-dsal-8bn7j-worker$. I have confirmed with curl that this works.

Comment 24 rlobillo 2021-04-15 15:16:03 UTC
Verified on 4.8.0-0.nightly-2021-04-15-074503 over RHOS-16.1-RHEL-8-20210311.n.1.

On a running OCP cluster, the manifest below is loaded ($ oc apply -f new_machine.yaml):

$ cat new_machine.yaml
apiVersion: machine.openshift.io/v1beta1
kind: Machine
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: ostest-vd4fm
    machine.openshift.io/cluster-api-machine-role: worker
    machine.openshift.io/cluster-api-machine-type: worker
    machine.openshift.io/instance-type: m4.xlarge
    machine.openshift.io/region: regionOne
    machine.openshift.io/zone: nova
  name: ostest-vd4fm-worker
  namespace: openshift-machine-api
spec:
  metadata: {}
  providerSpec:
    value:
      apiVersion: openstackproviderconfig.openshift.io/v1alpha1
      cloudName: openstack
      cloudsSecret:
        name: openstack-cloud-credentials
        namespace: openshift-machine-api
      flavor: m4.xlarge
      image: ostest-vd4fm-rhcos
      kind: OpenstackProviderSpec
      metadata:
        creationTimestamp: null
      networks:
      - filter: {}
        subnets:
        - filter:
            name: ostest-vd4fm-nodes
            tags: openshiftClusterID=ostest-vd4fm
      securityGroups:
      - filter: {}
        name: ostest-vd4fm-worker
      serverMetadata:
        Name: ostest-vd4fm-worker
        openshiftClusterID: ostest-vd4fm
      tags:
      - openshiftClusterID=ostest-vd4fm
      trunk: true
      userDataSecret:
        name: worker-user-data


As a result, the new worker is properly added to the cluster:

$ oc get machines -A
NAMESPACE               NAME                          PHASE     TYPE        REGION      ZONE   AGE
openshift-machine-api   ostest-vd4fm-master-0         Running   m4.xlarge   regionOne   nova   155m
openshift-machine-api   ostest-vd4fm-master-1         Running   m4.xlarge   regionOne   nova   155m
openshift-machine-api   ostest-vd4fm-master-2         Running   m4.xlarge   regionOne   nova   155m
openshift-machine-api   ostest-vd4fm-worker           Running   m4.xlarge   regionOne   nova   112m
openshift-machine-api   ostest-vd4fm-worker-0-2pxl8   Running   m4.xlarge   regionOne   nova   143m

$ oc get nodes
NAME                          STATUS   ROLES    AGE    VERSION
ostest-vd4fm-master-0         Ready    master   152m   v1.21.0-rc.0+6825c59
ostest-vd4fm-master-1         Ready    master   152m   v1.21.0-rc.0+6825c59
ostest-vd4fm-master-2         Ready    master   152m   v1.21.0-rc.0+6825c59
ostest-vd4fm-worker           Ready    worker   99m    v1.21.0-rc.0+6825c59
ostest-vd4fm-worker-0-2pxl8   Ready    worker   131m   v1.21.0-rc.0+6825c59

$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-04-15-074503   True        False         117m    Cluster version is 4.8.0-0.nightly-2021-04-15-074503

Comment 27 errata-xmlrpc 2021-07-27 22:32:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438