Bug 2090436

Summary: It takes 30min-60min to update the machine count in custom MachineConfigPools (MCPs) when a node is removed from the pool
Product: OpenShift Container Platform
Component: Machine Config Operator
Sub Component: Machine Config Operator
Version: 4.11
Target Release: 4.11.0
Hardware: Unspecified
OS: Unspecified
Severity: low
Priority: low
Status: CLOSED ERRATA
Reporter: Sergio <sregidor>
Assignee: John Kyros <jkyros>
QA Contact: Sergio <sregidor>
CC: jkyros, mkrejci, rioliu, wking
Type: Bug
Last Closed: 2022-08-10 11:14:11 UTC

Description Sergio 2022-05-25 18:24:03 UTC
Description of problem:
When a node is removed from a custom MachineConfigPool (MCP), the pool does not update its machine counts promptly.

If we have a pool with 3 nodes and we remove 1 to return it to the "worker" pool, the worker pool increases its machine count accordingly, but the custom MCP does not decrease its counter.

If we wait long enough (somewhere between 30 minutes and 1 hour), the custom pool finally updates the counter.

From the testing point of view, our end-to-end tests cannot wait 30 minutes for the counters to be updated when we remove a node from a pool.



Version-Release number of MCO (Machine Config Operator) (if applicable):

Platform (AWS, VSphere, Metal, etc.):

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure):
Yes

How reproducible:
Always

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:

2. Profile:

Steps to Reproduce:
1. Label one worker node with the role label "infra"

$ oc label node $(oc get node -l node-role.kubernetes.io/worker -o jsonpath='{.items[0].metadata.name}') node-role.kubernetes.io/infra=


2. Create a custom pool using this label:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""
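
For example, the manifest can be saved to a file and applied with the command below (the file name infra-mcp.yaml is only illustrative):

$ oc create -f infra-mcp.yaml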

3. Get all pools info
$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
infra    rendered-infra-0acd2a17a5ffbbcf02cf6b0d745ecb71    True      False      False      1              1                   1                     0                      4h19m
master   rendered-master-49e4407b86fae010f7e657afde9a1c5a   True      False      False      3              3                   3                     0                      7h
worker   rendered-worker-0acd2a17a5ffbbcf02cf6b0d745ecb71   True      False      False      2              2                   2                     0                      7h


The worker pool reports 2 nodes, and the infra pool reports 1 node.
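
Optionally, which node currently carries the custom role can be double-checked with:

$ oc get nodes -l node-role.kubernetes.io/infra -o name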

4. Remove the label from the node
$ oc label node $(oc get node -l node-role.kubernetes.io/worker -o jsonpath='{.items[0].metadata.name}') node-role.kubernetes.io/infra-



Actual results:

The worker pool updates its machine count, but the custom "infra" pool does not:

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
infra    rendered-infra-0acd2a17a5ffbbcf02cf6b0d745ecb71    True      False      False      1              1                   1                     0                      6h21m
master   rendered-master-49e4407b86fae010f7e657afde9a1c5a   True      False      False      3              3                   3                     0                      9h
worker   rendered-worker-0acd2a17a5ffbbcf02cf6b0d745ecb71   True      False      False      3              3                   3                     0                      9h

The worker pool now reports 3 nodes, but the infra pool continues to report 1 node instead of zero.

After waiting 30-60 minutes, the "infra" pool machine count is finally updated.


Expected results:

The "infra" node should report 0 nodes once we remove the label from the only node in the "infra" pool without having to wait 30 minutes.


Additional info:

This impacts our end-to-end tests, since they cannot wait 30 minutes for the machine count to be updated.
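
For reference, the kind of wait the tests would need looks roughly like the sketch below (assuming the count is exposed at .status.machineCount of the MachineConfigPool); with the current behaviour it only completes after 30-60 minutes, far beyond our test timeouts:

$ timeout 600 bash -c 'until [ "$(oc get mcp infra -o jsonpath="{.status.machineCount}")" = "0" ]; do sleep 10; done'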

The counter is incremented promptly when a node is added to the pool; only removal is affected.

Comment 1 John Kyros 2022-05-27 22:02:41 UTC
Hmmm yeah, we currently don't requeue the "old" pool when a node leaves it, which means the pool doesn't get updated until something else queues it.

This is probably especially apparent when there was only one node left in the pool, since after that last node leaves there is no other "pool business" happening for that pool that would result in a requeue.
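
As an illustrative workaround sketch only (assuming any update event on the pool object re-enqueues it for a status sync; the annotation key below is made up), forcing "something else" to touch the pool should make the counts catch up sooner:

$ oc annotate mcp/infra example.com/force-resync="$(date +%s)" --overwrite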

Comment 7 errata-xmlrpc 2022-08-10 11:14:11 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069