Bug 1851965

Summary:	MachineConfigPool Status.Configuration.Source still contains removed machineconfig
Product:	OpenShift Container Platform	Reporter:	Pablo Alonso Rodriguez <palonsor>
Component:	Machine Config Operator	Assignee:	Antonio Murdaca <amurdaca>
Status:	CLOSED DUPLICATE	QA Contact:	Michael Nguyen <mnguyen>
Severity:	low	Docs Contact:
Priority:	low
Version:	4.4
Target Milestone:	---
Target Release:	4.6.0
Hardware:	Unspecified
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	If docs needed, set a value
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2020-06-30 13:03:46 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Pablo Alonso Rodriguez 2020-06-29 14:26:29 UTC

Description of problem:

After removing a concrete machine-config belonging to a machineconfigpool, it disappears from Spec.Configuration.Source but not from Status.Configuration.Sources, causing machine-config-operator to be degraded.

Version-Release number of selected component (if applicable):

4.4.6

How reproducible:

Only at customer scenario

Steps to Reproduce:
1. Delete wrong machineconfig that has no content


Actual results:

Mismatch appears, MCO degraded

Expected results:

No mismatch, no degradation

Additional info:

- Removed machine-config is one autogenerated 99-master-XXXXX-registries, where XXXX IS NOT the correct UID for "master" mcp (as one would expect from here[1])
- It was necessary to remove it because it had a wrong controller revision annotation and that was blocking the upgrade.
- I'll be attaching more information in comments and attachments


[1] - https://github.com/openshift/machine-config-operator/blob/435eeef100e934baa8f05d34ff58a4d83b038876/pkg/controller/container-runtime-config/helpers.go#L183

Comment 3 Antonio Murdaca 2020-06-29 14:34:11 UTC

(In reply to Pablo Alonso Rodriguez from comment #0)
> Description of problem:
> 
> After removing a concrete machine-config belonging to a machineconfigpool,
> it disappears from Spec.Configuration.Source but not from
> Status.Configuration.Sources, causing machine-config-operator to be degraded.
> 
> Version-Release number of selected component (if applicable):
> 
> 4.4.6
> 
> How reproducible:
> 
> Only at customer scenario
> 
> Steps to Reproduce:
> 1. Delete wrong machineconfig that has no content
> 
> 
> Actual results:
> 
> Mismatch appears, MCO degraded
> 
> Expected results:
> 
> No mismatch, no degradation
> 
> Additional info:
> 
> - Removed machine-config is one autogenerated 99-master-XXXXX-registries,
> where XXXX IS NOT the correct UID for "master" mcp (as one would expect from
> here[1])
> - It was necessary to remove it because it had a wrong controller revision
> annotation and that was blocking the upgrade.

I'm pretty sure the above is why the MCP got into this state, so the bug it's not there but rather why that MC had the wrong version (which is a known issues with the MCO). Deleting that is something we don't support and we'll try to enforce that as well.

We'll be looking at this thanks for the report! :)

> - I'll be attaching more information in comments and attachments
> 
> 
> [1] -
> https://github.com/openshift/machine-config-operator/blob/
> 435eeef100e934baa8f05d34ff58a4d83b038876/pkg/controller/container-runtime-
> config/helpers.go#L183

Comment 4 Pablo Alonso Rodriguez 2020-06-29 14:58:42 UTC

However, that MC should have never existed, shouldn't it?

And, although this is not easy to reproduce, this mismatch could potentially happen with a non auto-generated machineconfig, or is there anything different at this machineconfig that could make this reproducible only for the auto-generated machineconfig?

Comment 5 Antonio Murdaca 2020-06-29 15:02:52 UTC

(In reply to Pablo Alonso Rodriguez from comment #4)
> However, that MC should have never existed, shouldn't it?

99-whatever-registries MCs are autogenerated by one of the MCO's controllers (MCC) and they're always generated so they should exist all the times.
The reason why it had a version mismatch is the issue here, it should have moved to the new version (re-generated and re-created) but something either prevented that or failed.

> 
> And, although this is not easy to reproduce, this mismatch could potentially
> happen with a non auto-generated machineconfig, or is there anything
> different at this machineconfig that could make this reproducible only for
> the auto-generated machineconfig?

non auto-generated machineconfigs don't have a controller version - you'll find the controller version only on generated-by-the-mco MCs like the registries one.

Comment 6 Pablo Alonso Rodriguez 2020-06-29 15:10:07 UTC

But the issue is that there wasn't only one, there were many for the same mcp. Is it expected that there are many? 

It surprises me that there are many because, if I understood the controller code correctly, the UID of the MCP is part of the name. And only one among the many 99-master-XXX-registries  had the correct UID in the name (and that one also had correct controller revision).

Comment 7 Antonio Murdaca 2020-06-29 15:20:41 UTC

(In reply to Pablo Alonso Rodriguez from comment #6)
> But the issue is that there wasn't only one, there were many for the same
> mcp. Is it expected that there are many? 
> 
> It surprises me that there are many because, if I understood the controller
> code correctly, the UID of the MCP is part of the name. And only one among
> the many 99-master-XXX-registries  had the correct UID in the name (and that
> one also had correct controller revision).

I'm now wondering if that's because the MCPs themselves were deleted (and the MCO recreates them) - that would explain the UID changes (and only that)

We changed that in 4.5 to avoid relying on UID (it was a bug from the beginning causing other issues)

I'm gonna spin a cluster rn and test my theories, thanks for the inputs again!

Comment 8 Antonio Murdaca 2020-06-30 10:20:53 UTC

This is the fix that went into 4.6 https://github.com/openshift/machine-config-operator/commit/0afd59a5b0c09ba830ebdd50c73895109fd3d2ce

The PR that brings a stable naming to those MC is all but trivial and safe to backport tho: https://github.com/openshift/machine-config-operator/pull/1756

Seems like there's a workaround when this happens so I'll advice to:

- document master and worker MCPs should never be deleted or the MCO will recreate them with a different UID which may cause issues like this
- due to the above, apply the workaround if this ever happens again

Comment 9 Antonio Murdaca 2020-06-30 13:03:46 UTC

Alright, I've confirmed the above is the case and that bug was causing this up to 4.5. 4.6 has the fix and we have a workaround and we can also enhance our docs to notify to NOT delete MCPs.

➜  ~ oc get -o yaml mcp worker | grep -i UID
  uid: 9893f3e9-c781-4366-88a8-a49cb18b6af8
➜  ~ oc delete mcp worker
machineconfigpool.machineconfiguration.openshift.io "worker" deleted

# regenerated with another UID

➜  ~ oc get -o yaml mcp worker | grep -i UID
  uid: 22582a54-6ded-4e23-9b82-1c82f41f1be3

# the registries MC didn't yet update but you can see it still points to the wrong UID (now deleted) - an upgrade or an Image conf change will effectively create a new one with the new UID, leaving the other MC around as in this BZ

➜  ~ oc get mc
NAME                                                        GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
...
99-worker-9893f3e9-c781-4366-88a8-a49cb18b6af8-registries   da87dd6d947d66eb810d74622fb7664289148423   2.2.0             21m

# an easy check would be to create a ContainerRuntimeConfig first and you can see we're now using the new UID

➜  ~ oc get mc
NAME                                                              GENERATEDBYCONTROLLER                      IGNITIONVERSION   AGE
...
99-worker-22582a54-6ded-4e23-9b82-1c82f41f1be3-containerruntime   da87dd6d947d66eb810d74622fb7664289148423   2.2.0             8s
99-worker-9893f3e9-c781-4366-88a8-a49cb18b6af8-registries         da87dd6d947d66eb810d74622fb7664289148423   2.2.0             23m

# lastly, you can see at some point the registries MC is re-generated with a new UID

➜  ~ oc get mc
...
99-worker-22582a54-6ded-4e23-9b82-1c82f41f1be3-containerruntime   da87dd6d947d66eb810d74622fb7664289148423   2.2.0             33m
99-worker-22582a54-6ded-4e23-9b82-1c82f41f1be3-registries         da87dd6d947d66eb810d74622fb7664289148423   2.2.0             15m
99-worker-9893f3e9-c781-4366-88a8-a49cb18b6af8-registries         da87dd6d947d66eb810d74622fb7664289148423   2.2.0             56m

Now, if I upgrade the above, `99-worker-9893f3e9-c781-4366-88a8-a49cb18b6af8-registries` will be left behind and you're forced to manually remove it or the upgrade won't complete (as reported)

So, what you get afterwards:

> After removing a concrete machine-config belonging to a machineconfigpool, it disappears from Spec.Configuration.Source but not from Status.Configuration.Sources, causing machine-config-operator to be degraded.

is something which shouldn't even happen - I'm gonna investigate a better workaround but what we can do is 1) avoid deleting MCPs and 2) apply this workaround if it happens <= 4.5

*** This bug has been marked as a duplicate of bug 1826150 ***

Comment 10 Antonio Murdaca 2020-06-30 14:54:31 UTC

So, lastly, we did have a bug related to why the Source field wasn't "working" and that's tracked here https://github.com/openshift/machine-config-operator/pull/1881

I still think for <4.6 we can stick to doc+workaround since this issue is the first time happening and we can't see other possibilities than deleting the MCPs to trigger this - as said, in 4.6 the situation should be generally better and we don't rely on UIDs anymore + hopefully in 4.7 we'll get some more fixes for the controllers that manage these special MachineConfigs.