Bug 2082667 - No new machines provisioned while machineset controller drained old nodes for change to machineset
Summary: No new machines provisioned while machineset controller drained old nodes for...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Cloud Compute
Version: 4.10
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Target Release: 4.11.0
Assignee: dmoiseev
QA Contact: sunzhaohua
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-05-06 17:33 UTC by Justin Pierce
Modified: 2022-08-10 11:11 UTC (History)

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Feature: Node drain logic was moved to a separate controller. Reason: Previously, the node drain procedure ran within the main reconcile thread for each deleted machine, so the thread was blocked until the drain finished and deletion took significantly longer. Because machines are handled one at a time, this hurt performance, especially on large machine sets. Result: The drain procedure was refactored to run asynchronously from the rest of the controller logic, which improved deletion performance on large machine sets.
Clone Of:
Environment:
Last Closed: 2022-08-10 11:10:43 UTC
Target Upstream Version:
Embargoed:
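
The behaviour described in the Doc Text can be illustrated with a toy model. This is only a sketch under stated assumptions, not the Machine API code: `drain`, `reconcile_inline`, `reconcile_with_drain_controller`, and `DRAIN_SECONDS` are hypothetical names, and a sleep stands in for a slow node drain. It records how long each newly created machine waits before the single reconcile loop reaches it.

```python
import time
from concurrent.futures import ThreadPoolExecutor

DRAIN_SECONDS = 0.3  # hypothetical stand-in for a slow drain (e.g. held back by a PDB)


def drain(node: str) -> None:
    """Pretend to evict pods from a node; only the delay matters here."""
    time.sleep(DRAIN_SECONDS)


def reconcile_inline(deleted: list[str], created: list[str]) -> dict[str, float]:
    """Model of the old behaviour: the one reconcile loop drains each deleted
    machine itself, so new machines are not handled until every drain ends."""
    start = time.monotonic()
    for machine in deleted:
        drain(machine)  # blocks the only worker
    # Only now does the loop reach the newly created machines.
    return {machine: time.monotonic() - start for machine in created}


def reconcile_with_drain_controller(deleted: list[str], created: list[str]) -> dict[str, float]:
    """Model of the new behaviour: drains are handed to a separate worker,
    so the reconcile loop moves straight on to the new machines."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=1) as drain_controller:
        for machine in deleted:
            drain_controller.submit(drain, machine)  # returns immediately
        waits = {machine: time.monotonic() - start for machine in created}
        # Leaving the "with" block waits for the outstanding drains to finish.
    return waits


if __name__ == "__main__":
    old = ["old-1", "old-2", "old-3"]
    new = ["new-1", "new-2"]
    print("inline:", reconcile_inline(old, new))
    print("separate drain controller:", reconcile_with_drain_controller(old, new))
```

In the inline model the wait for a new machine grows with the number of machines being drained, while in the second model it stays close to zero regardless of how long the drains take.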


Links
System ID Private Priority Status Summary Last Updated
Github openshift cluster-api-provider-alibaba pull 34 0 None Merged Bug 2082667: Bump MAPI dependency. Separate node drain controller. 2022-06-20 14:31:52 UTC
Github openshift cluster-api-provider-baremetal pull 172 0 None Merged Bug 2082667: Bump MAPI dependency. Separate node drain controller. 2022-06-20 14:31:50 UTC
Github openshift cluster-api-provider-ibmcloud pull 21 0 None Merged Bug 2082667: Bump MAPI dependency. Separate node drain controller. 2022-06-20 14:31:49 UTC
Github openshift machine-api-operator pull 1023 0 None Merged Bug 2082667: Separate controller for the node draining 2022-06-16 14:52:59 UTC
Github openshift machine-api-provider-aws pull 42 0 None Merged Bug 2082667: Bump MAPI dependency. Separate node drain controller. 2022-06-16 14:53:00 UTC
Github openshift machine-api-provider-azure pull 26 0 None Merged Bug 2082667: Bump MAPI dependency. Separate node drain controller. 2022-06-16 14:53:00 UTC
Github openshift machine-api-provider-gcp pull 12 0 None Merged Bug 2082667: Bump MAPI dependency. Separate node drain controller. 2022-06-16 14:53:01 UTC
Github openshift machine-api-provider-nutanix pull 18 0 None Merged Bug 2082667: Bump MAPI dependency. Separate node drain controller. 2022-06-20 14:31:46 UTC
Github openshift machine-api-provider-openstack pull 42 0 None Merged Bug 2082667: Bump MAPI dependency. Separate node drain controller. 2022-06-20 14:31:44 UTC
Github openshift machine-api-provider-powervs pull 20 0 None Merged Bug 2082667: Bump MAPI dependency. Separate node drain controller. 2022-06-20 14:31:43 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 11:10:59 UTC

Comment 9 sunzhaohua 2022-06-21 08:38:16 UTC
Verified on aws, clusterversion 4.11.0-0.nightly-2022-06-20-220209

Created a PDB, scaled the machineset to 10, then deleted machines. In 4.10, new machines cannot be created; they wait for the other nodes to drain:
$ oc delete machine rugong-0621-nvx7c-worker-us-east-2cc-444q9 rugong-0621-nvx7c-worker-us-east-2cc-6mxp4 rugong-0621-nvx7c-worker-us-east-2cc-bt9s9 rugong-0621-nvx7c-worker-us-east-2cc-hb68b rugong-0621-nvx7c-worker-us-east-2cc-j85hs rugong-0621-nvx7c-worker-us-east-2cc-lrq8z rugong-0621-nvx7c-worker-us-east-2cc-md29j rugong-0621-nvx7c-worker-us-east-2cc-ndgr4 rugong-0621-nvx7c-worker-us-east-2cc-wkfhw
machine.machine.openshift.io "rugong-0621-nvx7c-worker-us-east-2cc-444q9" deleted
machine.machine.openshift.io "rugong-0621-nvx7c-worker-us-east-2cc-6mxp4" deleted
machine.machine.openshift.io "rugong-0621-nvx7c-worker-us-east-2cc-bt9s9" deleted
machine.machine.openshift.io "rugong-0621-nvx7c-worker-us-east-2cc-hb68b" deleted
machine.machine.openshift.io "rugong-0621-nvx7c-worker-us-east-2cc-j85hs" deleted
machine.machine.openshift.io "rugong-0621-nvx7c-worker-us-east-2cc-lrq8z" deleted
machine.machine.openshift.io "rugong-0621-nvx7c-worker-us-east-2cc-md29j" deleted
machine.machine.openshift.io "rugong-0621-nvx7c-worker-us-east-2cc-ndgr4" deleted
machine.machine.openshift.io "rugong-0621-nvx7c-worker-us-east-2cc-wkfhw" deleted
$ oc get machine
NAME                                         PHASE      TYPE         REGION      ZONE         AGE
rugong-0621-nvx7c-master-0                   Running    m6i.xlarge   us-east-2   us-east-2a   7h45m
rugong-0621-nvx7c-master-1                   Running    m6i.xlarge   us-east-2   us-east-2b   7h45m
rugong-0621-nvx7c-master-2                   Running    m6i.xlarge   us-east-2   us-east-2c   7h45m
rugong-0621-nvx7c-worker-us-east-2a-c5mzh    Running    m6i.large    us-east-2   us-east-2a   7h41m
rugong-0621-nvx7c-worker-us-east-2b-fdplx    Running    m6i.large    us-east-2   us-east-2b   7h41m
rugong-0621-nvx7c-worker-us-east-2c-qgnmj    Running    m6i.large    us-east-2   us-east-2c   7h41m
rugong-0621-nvx7c-worker-us-east-2cc-2phqt                                                    52s
rugong-0621-nvx7c-worker-us-east-2cc-444q9   Deleting   m6i.large    us-east-2   us-east-2c   7m56s
rugong-0621-nvx7c-worker-us-east-2cc-4st7f   Running    m6i.large    us-east-2   us-east-2c   15m
rugong-0621-nvx7c-worker-us-east-2cc-5b4zp                                                    53s
rugong-0621-nvx7c-worker-us-east-2cc-6mxp4   Deleting   m6i.large    us-east-2   us-east-2c   7m56s
rugong-0621-nvx7c-worker-us-east-2cc-b2lfg                                                    51s
rugong-0621-nvx7c-worker-us-east-2cc-bt9s9   Deleting   m6i.large    us-east-2   us-east-2c   7m56s
rugong-0621-nvx7c-worker-us-east-2cc-dxh9l                                                    54s
rugong-0621-nvx7c-worker-us-east-2cc-f87h7                                                    52s
rugong-0621-nvx7c-worker-us-east-2cc-hb68b   Running    m6i.large    us-east-2   us-east-2c   7m56s
rugong-0621-nvx7c-worker-us-east-2cc-j85hs   Running    m6i.large    us-east-2   us-east-2c   7m56s
rugong-0621-nvx7c-worker-us-east-2cc-jh4w5                                                    52s
rugong-0621-nvx7c-worker-us-east-2cc-lrq8z   Running    m6i.large    us-east-2   us-east-2c   7m56s
rugong-0621-nvx7c-worker-us-east-2cc-m22w4                                                    52s
rugong-0621-nvx7c-worker-us-east-2cc-md29j   Running    m6i.large    us-east-2   us-east-2c   7m56s
rugong-0621-nvx7c-worker-us-east-2cc-n7l8s                                                    53s
rugong-0621-nvx7c-worker-us-east-2cc-ndgr4   Running    m6i.large    us-east-2   us-east-2c   7m56s
rugong-0621-nvx7c-worker-us-east-2cc-wkfhw   Running    m6i.large    us-east-2   us-east-2c   7m56s
rugong-0621-nvx7c-worker-us-east-2cc-z29xb   Deleting   m6i.large    us-east-2   us-east-2c   31m
rugong-0621-nvx7c-worker-us-east-2cc-zxl6k                                                    53s

In 4.11.0-0.nightly-2022-06-20-220209, with this fix, new machines are created quickly without waiting for the other nodes to drain:
$ oc delete machine zhsunaws-n52dn-worker-us-east-2c-2hxdj zhsunaws-n52dn-worker-us-east-2c-9slz7 zhsunaws-n52dn-worker-us-east-2c-cfxp4 zhsunaws-n52dn-worker-us-east-2c-kmmvc zhsunaws-n52dn-worker-us-east-2c-l4gjt zhsunaws-n52dn-worker-us-east-2c-qfg6h zhsunaws-n52dn-worker-us-east-2c-qmpg6 zhsunaws-n52dn-worker-us-east-2c-tq8br zhsunaws-n52dn-worker-us-east-2c-xjjs7
machine.machine.openshift.io "zhsunaws-n52dn-worker-us-east-2c-2hxdj" deleted
machine.machine.openshift.io "zhsunaws-n52dn-worker-us-east-2c-9slz7" deleted
machine.machine.openshift.io "zhsunaws-n52dn-worker-us-east-2c-cfxp4" deleted
machine.machine.openshift.io "zhsunaws-n52dn-worker-us-east-2c-kmmvc" deleted
machine.machine.openshift.io "zhsunaws-n52dn-worker-us-east-2c-l4gjt" deleted
machine.machine.openshift.io "zhsunaws-n52dn-worker-us-east-2c-qfg6h" deleted
machine.machine.openshift.io "zhsunaws-n52dn-worker-us-east-2c-qmpg6" deleted
machine.machine.openshift.io "zhsunaws-n52dn-worker-us-east-2c-tq8br" deleted
machine.machine.openshift.io "zhsunaws-n52dn-worker-us-east-2c-xjjs7" deleted
$ oc get machine
NAME                                     PHASE          TYPE         REGION      ZONE         AGE
zhsunaws-n52dn-master-0                  Running        m6i.xlarge   us-east-2   us-east-2a   7h9m
zhsunaws-n52dn-master-1                  Running        m6i.xlarge   us-east-2   us-east-2b   7h9m
zhsunaws-n52dn-master-2                  Running        m6i.xlarge   us-east-2   us-east-2c   7h9m
zhsunaws-n52dn-worker-us-east-2a-hnd9s   Running        m6i.xlarge   us-east-2   us-east-2a   7h7m
zhsunaws-n52dn-worker-us-east-2b-7wn4t   Running        m6i.xlarge   us-east-2   us-east-2b   7h7m
zhsunaws-n52dn-worker-us-east-2c-29tdf   Provisioning   m6i.xlarge   us-east-2   us-east-2c   20s
zhsunaws-n52dn-worker-us-east-2c-2hxdj   Deleting       m6i.xlarge   us-east-2   us-east-2c   7m50s
zhsunaws-n52dn-worker-us-east-2c-46vqz   Provisioning   m6i.xlarge   us-east-2   us-east-2c   20s
zhsunaws-n52dn-worker-us-east-2c-8j5x5   Provisioned    m6i.xlarge   us-east-2   us-east-2c   21s
zhsunaws-n52dn-worker-us-east-2c-94tb9   Provisioning   m6i.xlarge   us-east-2   us-east-2c   21s
zhsunaws-n52dn-worker-us-east-2c-9slz7   Deleting       m6i.xlarge   us-east-2   us-east-2c   7m50s
zhsunaws-n52dn-worker-us-east-2c-cfxp4   Deleting       m6i.xlarge   us-east-2   us-east-2c   7m50s
zhsunaws-n52dn-worker-us-east-2c-cxtkq   Provisioning   m6i.xlarge   us-east-2   us-east-2c   20s
zhsunaws-n52dn-worker-us-east-2c-jwczk   Deleting       m6i.xlarge   us-east-2   us-east-2c   31m
zhsunaws-n52dn-worker-us-east-2c-kmmvc   Deleting       m6i.xlarge   us-east-2   us-east-2c   7m50s
zhsunaws-n52dn-worker-us-east-2c-l4gjt   Deleting       m6i.xlarge   us-east-2   us-east-2c   7m50s
zhsunaws-n52dn-worker-us-east-2c-ppz5t   Running        m6i.xlarge   us-east-2   us-east-2c   14m
zhsunaws-n52dn-worker-us-east-2c-qfg6h   Deleting       m6i.xlarge   us-east-2   us-east-2c   7m50s
zhsunaws-n52dn-worker-us-east-2c-qjbld   Provisioned    m6i.xlarge   us-east-2   us-east-2c   21s
zhsunaws-n52dn-worker-us-east-2c-qmpg6   Deleting       m6i.xlarge   us-east-2   us-east-2c   7m50s
zhsunaws-n52dn-worker-us-east-2c-qrxcm   Provisioned    m6i.xlarge   us-east-2   us-east-2c   21s
zhsunaws-n52dn-worker-us-east-2c-sph5w   Provisioning   m6i.xlarge   us-east-2   us-east-2c   22s
zhsunaws-n52dn-worker-us-east-2c-tq8br   Deleting       m6i.xlarge   us-east-2   us-east-2c   7m50s
zhsunaws-n52dn-worker-us-east-2c-xjjs7   Deleting       m6i.xlarge   us-east-2   us-east-2c   7m50s
zhsunaws-n52dn-worker-us-east-2c-xtpwj   Provisioning   m6i.xlarge   us-east-2   us-east-2c   21s
$ oc get event
17m         Normal    DrainProceeds          machine/zhsunaws-n52dn-worker-us-east-2c-tq8br   Node drain proceeds
17m         Normal    DrainRequeued          machine/zhsunaws-n52dn-worker-us-east-2c-tq8br   Node drain requeued: requeue in: 20s
17m         Normal    Deleted                machine/zhsunaws-n52dn-worker-us-east-2c-tq8br   Node "ip-10-0-223-164.us-east-2.compute.internal" drained
17m         Normal    DrainSucceeded         machine/zhsunaws-n52dn-worker-us-east-2c-tq8br   Node drain succeeded
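
The difference between the two `oc get machine` listings above is easiest to see by counting machines per phase. The following is a minimal sketch of such a tally, assuming the default column layout shown in this comment (NAME PHASE TYPE REGION ZONE AGE); `tally_phases` is a hypothetical helper, not part of `oc`.

```python
from collections import Counter


def tally_phases(oc_output: str) -> Counter:
    """Count machines per PHASE in the text printed by `oc get machine`.

    Assumes the six-column layout NAME PHASE TYPE REGION ZONE AGE. A machine
    that has no phase yet prints only NAME and AGE, so it is counted under
    the placeholder key "<none>".
    """
    counts: Counter = Counter()
    lines = [line for line in oc_output.splitlines() if line.strip()]
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        phase = fields[1] if len(fields) >= 6 else "<none>"
        counts[phase] += 1
    return counts
```

Applied to the 4.10 listing, the replacement machines fall under `<none>` because their PHASE column is still empty, whereas in the 4.11 listing they are counted as `Provisioning` or `Provisioned`.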

Comment 11 sunzhaohua 2022-06-22 10:08:00 UTC
Verified
clusterversion: 4.11.0-0.nightly-2022-06-22-015220
Tested on gcp, azure, alicloud, ibm, vsphere, nutanix, and osp; all work fine.

Comment 13 errata-xmlrpc 2022-08-10 11:10:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

