Bug 2090436 - It takes 30min-60min to update the machine count in custom MachineConfigPools (MCPs) when a node is removed from the pool
Summary: It takes 30min-60min to update the machine count in custom MachineConfigPools...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Machine Config Operator
Version: 4.11
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: low
Target Milestone: ---
Target Release: 4.11.0
Assignee: John Kyros
QA Contact: Sergio
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2022-05-25 18:24 UTC by Sergio
Modified: 2022-08-10 11:14 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 11:14:11 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/machine-config-operator pull 3165 (open): Bug 2090436: It takes 30min-60min to update the machine count in custom MachineConfigPools (MCPs) when a node is removed... (last updated 2022-05-27 22:07:09 UTC)
Red Hat Product Errata RHSA-2022:5069 (last updated 2022-08-10 11:14:22 UTC)

Description Sergio 2022-05-25 18:24:03 UTC
Description of problem:
When a node is removed from a custom MachineConfigPool, the pool does not update the counters.

If we have a pool with 3 nodes and we remove 1 to return that node to the "worker" pool, the worker pool increases its machine count accordingly, but the custom MCP does not decrease its counter.

If we wait long enough (somewhere between 30 minutes and 1 hour) the custom pool finally updates the counter.

From the testing point of view, our end-to-end tests cannot wait 30 minutes for the counters to be updated when we remove a node from a pool.



Version-Release number of MCO (Machine Config Operator) (if applicable):

Platform (AWS, VSphere, Metal, etc.):

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure):
Yes

How reproducible:
Always

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:

2. Profile:

Steps to Reproduce:
1. Label one worker node with the role label "infra"

$ oc label node $(oc get node -l node-role.kubernetes.io/worker -o jsonpath='{.items[0].metadata.name}') node-role.kubernetes.io/infra=


2. Create a custom pool using this label

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""

3. Get all pools info
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
infra    rendered-infra-0acd2a17a5ffbbcf02cf6b0d745ecb71    True      False      False      1              1                   1                     0                      4h19m
master   rendered-master-49e4407b86fae010f7e657afde9a1c5a   True      False      False      3              3                   3                     0                      7h
worker   rendered-worker-0acd2a17a5ffbbcf02cf6b0d745ecb71   True      False      False      2              2                   2                     0                      7h


The worker pool reports 2 nodes, and the infra pool reports 1 node.

4. Remove the label from the node
$ oc label node $(oc get node -l node-role.kubernetes.io/worker -o jsonpath='{.items[0].metadata.name}') node-role.kubernetes.io/infra-



Actual results:

The worker pool updates its machine count, but the custom "infra" pool does not

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
infra    rendered-infra-0acd2a17a5ffbbcf02cf6b0d745ecb71    True      False      False      1              1                   1                     0                      6h21m
master   rendered-master-49e4407b86fae010f7e657afde9a1c5a   True      False      False      3              3                   3                     0                      9h
worker   rendered-worker-0acd2a17a5ffbbcf02cf6b0d745ecb71   True      False      False      3              3                   3                     0                      9h

The worker pool now reports 3 nodes, but the infra pool continues reporting 1 node instead of zero.

After waiting 30-60 minutes, the "infra" pool's machine count is finally updated.


Expected results:

The "infra" node should report 0 nodes once we remove the label from the only node in the "infra" pool without having to wait 30 minutes.


Additional info:

This impacts our end-to-end tests, since they cannot wait 30 minutes for the machine count to be updated.

The counter increases normally when we add a node to the pool; only removal is affected by this delay.

Comment 1 John Kyros 2022-05-27 22:02:41 UTC
Hmmm yeah, we currently don't requeue the "old" pool when a node leaves it, which means the pool doesn't get updated until something else queues it.

This is probably especially apparent when there was only one node left in the pool, since after that last node leaves there is no other "pool business" happening for that pool that would trigger a requeue.
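
To make that requeue gap concrete, here is a minimal, self-contained sketch of the idea (illustrative only, not the change from the linked PR; the Node/Pool types and the enqueuePoolsOnNodeUpdate helper are invented for this example): on a node update, compute the pools matching the old node object and the new one, and enqueue the union, so the pool the node just left is synced and its machine counts recalculated right away.

package main

import "fmt"

// Stand-in types for the example; not the real MCO API types.
type Node struct {
	Name   string
	Labels map[string]string
}

type Pool struct {
	Name         string
	NodeSelector map[string]string // exact-match label selector in this sketch
}

// matches reports whether every label in the pool's nodeSelector is present
// on the node with the same value.
func matches(p Pool, n Node) bool {
	for k, v := range p.NodeSelector {
		got, ok := n.Labels[k]
		if !ok || got != v {
			return false
		}
	}
	return true
}

// poolsFor returns the names of all pools whose selector matches the node.
func poolsFor(pools []Pool, n Node) map[string]bool {
	out := map[string]bool{}
	for _, p := range pools {
		if matches(p, n) {
			out[p.Name] = true
		}
	}
	return out
}

// enqueuePoolsOnNodeUpdate is the core idea: requeue the union of the pools
// matching the old node object and the new one, so a pool the node just left
// is resynced instead of waiting for something else to queue it.
func enqueuePoolsOnNodeUpdate(pools []Pool, oldNode, newNode Node, enqueue func(pool string)) {
	affected := poolsFor(pools, oldNode)
	for name := range poolsFor(pools, newNode) {
		affected[name] = true
	}
	for name := range affected {
		enqueue(name)
	}
}

func main() {
	pools := []Pool{
		{Name: "worker", NodeSelector: map[string]string{"node-role.kubernetes.io/worker": ""}},
		{Name: "infra", NodeSelector: map[string]string{"node-role.kubernetes.io/infra": ""}},
	}
	// The node had both labels; the infra label is then removed (step 4 above).
	oldNode := Node{Name: "node-1", Labels: map[string]string{
		"node-role.kubernetes.io/worker": "",
		"node-role.kubernetes.io/infra":  "",
	}}
	newNode := Node{Name: "node-1", Labels: map[string]string{
		"node-role.kubernetes.io/worker": "",
	}}
	enqueuePoolsOnNodeUpdate(pools, oldNode, newNode, func(name string) {
		fmt.Println("requeue pool:", name) // prints both pools (order not guaranteed)
	})
}

If only the new node's pools were enqueued, the "infra" pool would sit idle until the next periodic resync or some unrelated event, which would line up with the 30-60 minute delay described above.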

Comment 7 errata-xmlrpc 2022-08-10 11:14:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

