Description of problem:

When a node is removed from a custom MachineConfigPool, the pool does not update its counters. If a pool has 3 nodes and we remove 1 to return it to the "worker" pool, the worker pool increases its machine count accordingly, but the custom MCP does not decrease its counter. If we wait long enough (somewhere between 30 minutes and 1 hour), the custom pool finally updates the counter. From a testing point of view, our end-to-end tests cannot wait 30 minutes for the counters to be updated when we remove a node from a pool.

Version-Release number of MCO (Machine Config Operator) (if applicable):

Platform (AWS, VSphere, Metal, etc.):

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)? (Y/N/Not sure):
Yes

How reproducible:
Always

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job:
2. Profile:

Steps to Reproduce:

1. Label one worker node with the role label "infra":

$ oc label node $(oc get node -l node-role.kubernetes.io/worker -o jsonpath='{.items[0].metadata.name}') node-role.kubernetes.io/infra=

2. Create a custom pool using this label:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  name: infra
spec:
  machineConfigSelector:
    matchExpressions:
      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
  nodeSelector:
    matchLabels:
      node-role.kubernetes.io/infra: ""

3. Get the status of all pools:

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
infra    rendered-infra-0acd2a17a5ffbbcf02cf6b0d745ecb71    True      False      False      1              1                   1                     0                      4h19m
master   rendered-master-49e4407b86fae010f7e657afde9a1c5a   True      False      False      3              3                   3                     0                      7h
worker   rendered-worker-0acd2a17a5ffbbcf02cf6b0d745ecb71   True      False      False      2              2                   2                     0                      7h

The worker pool reports 2 nodes and the infra pool reports 1 node.

4. Remove the label from the node:

$ oc label node $(oc get node -l node-role.kubernetes.io/infra -o jsonpath='{.items[0].metadata.name}') node-role.kubernetes.io/infra-

Actual results:

The worker pool updates its machine count, but the custom "infra" pool does not:

$ oc get mcp
NAME     CONFIG                                             UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
infra    rendered-infra-0acd2a17a5ffbbcf02cf6b0d745ecb71    True      False      False      1              1                   1                     0                      6h21m
master   rendered-master-49e4407b86fae010f7e657afde9a1c5a   True      False      False      3              3                   3                     0                      9h
worker   rendered-worker-0acd2a17a5ffbbcf02cf6b0d745ecb71   True      False      False      3              3                   3                     0                      9h

The worker pool now reports 3 nodes, but the infra pool keeps reporting 1 node instead of zero. Only after waiting 30 to 60 minutes is the "infra" pool's machine count updated.

Expected results:

The "infra" pool should report 0 nodes as soon as we remove the label from the only node in the pool, without having to wait 30 minutes.

Additional info:

This impacts our end-to-end tests, since the tests cannot wait 30 minutes for the machine count to be updated. Note that the counter is updated promptly when a node is added to the pool; only removal is affected.
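For context on the test impact, this is a minimal sketch of the kind of wait our e2e suite would need today. The 600-second budget is an arbitrary example value, and `.status.machineCount` is the field behind the MACHINECOUNT column shown above:

# Poll the infra pool until its machine count drops to 0, giving up after 10 minutes.
$ timeout 600 bash -c 'until [ "$(oc get mcp infra -o jsonpath="{.status.machineCount}")" = "0" ]; do sleep 5; done'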
Hmm, yeah: we currently don't requeue the "old" pool when a node leaves it, which means the pool doesn't get updated until something else queues it. This is probably especially apparent when only one node was left in the pool, since after that last node leaves there is no other "pool business" happening for that pool that would result in a requeue.
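If that analysis is right, an untested interim workaround sketch would be to write to the pool object so the update event requeues it and the counters get recalculated on the next sync. The annotation key here is arbitrary, chosen only as a harmless "touch":

# Any write to the MCP should generate an update event that requeues the pool.
$ oc annotate mcp infra example.com/requeue-trigger="$(date +%s)" --overwrite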
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069