Bug 1953627

Summary: Failed to upgrade to 4.6.25 from 4.6.18 due to the machine-config failure
Product: OpenShift Container Platform Reporter: Angel Fortunato Acosta Bencomo <aacostab>
Component: Node Assignee: Yu Qi Zhang <jerzhang>
Node sub component: Kubelet QA Contact: Sunil Choudhary <schoudha>
Status: CLOSED INSUFFICIENT_DATA Docs Contact:
Severity: high    
Priority: unspecified CC: aos-bugs, jerzhang, oarribas, vlours
Version: 4.6   
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2021-04-29 16:27:30 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Angel Fortunato Acosta Bencomo 2021-04-26 14:32:14 UTC
Description of problem:

machine-config cluster operator degraded due to controller version mismatch

~~~
$ omg get co machine-config -o yaml
...
- lastTransitionTime: '2021-04-22T22:41:00Z'
    message: 'Unable to apply 4.6.25: timed out waiting for the condition during syncRequiredMachineConfigPools:
      pool master has not progressed to latest configuration: controller version mismatch
      for 98-master-generated-kubelet expected d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af
      has 14a2b82d9f4c4d8b423f8f05f6926778ef36870d: all 3 nodes are at latest configuration
      rendered-master-381b6c37f8f8020f2e740ba44a1460a2, retrying'
    reason: RequiredPoolsFailed
    status: 'True'
    type: Degraded
...
extension:
    lastSyncError: 'pool master has not progressed to latest configuration: controller
      version mismatch for 98-master-generated-kubelet expected d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af
      has 14a2b82d9f4c4d8b423f8f05f6926778ef36870d: all 3 nodes are at latest configuration
      rendered-master-381b6c37f8f8020f2e740ba44a1460a2, retrying'
    master: all 3 nodes are at latest configuration rendered-master-381b6c37f8f8020f2e740ba44a1460a2
    worker: all 13 nodes are at latest configuration rendered-worker-e08dcb17ae6631b16767bdd8b61c8e93
...
~~~
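For triage it can help to pull the three relevant values out of the mismatch message: the machineconfig name, the controller version the operator expects, and the version the config actually carries. The following is a minimal sketch, not part of the MCO. The function name `parse_mismatch` and the regular expression are illustrative assumptions based only on the message text quoted above, and the sample string is the `lastSyncError` shown above, shortened.

```python
import re

# Assumed message shape, taken from the error quoted in this report.
# \s+ is used between words because the YAML output wraps the message
# across lines with extra indentation.
MISMATCH_RE = re.compile(
    r"controller\s+version\s+mismatch\s+for\s+(?P<name>\S+)\s+"
    r"expected\s+(?P<expected>[0-9a-f]+)\s+"
    r"has\s+(?P<actual>[0-9a-f]+)"
)


def parse_mismatch(message):
    """Return (name, expected, actual), or None if no mismatch is reported."""
    match = MISMATCH_RE.search(message)
    if match is None:
        return None
    return match.group("name"), match.group("expected"), match.group("actual")


# The lastSyncError from this report, including its line wrapping.
message = (
    "pool master has not progressed to latest configuration: controller\n"
    "      version mismatch for 98-master-generated-kubelet expected "
    "d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af\n"
    "      has 14a2b82d9f4c4d8b423f8f05f6926778ef36870d: all 3 nodes are at "
    "latest configuration"
)

result = parse_mismatch(message)
print(result)
```

The "expected" value is the version of the controller currently running, and the "has" value is the version that last generated the named machineconfig.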


Version-Release number of selected component (if applicable):
Source version: 4.6.18
Target version: 4.6.25


Steps to Reproduce:
1. Upgrade to 4.6.25 from 4.6.18


Actual results:

~~~
$ omg get clusterversion
NAME     VERSION  AVAILABLE  PROGRESSING  SINCE  STATUS
version           True       True         2m52s  Unable to apply 4.6.25: the cluster operator machine-config has not yet successfully rolled out
~~~

~~~
$ omg get co
NAME                                      VERSION  AVAILABLE  PROGRESSING  DEGRADED  SINCE
machine-config                            4.6.18   False      True         True      21h
~~~

Expected results:
Upgrade to 4.6.25 successfully.


Additional info:
Attached are the "01-master-kubelet_content.json", "98-master-generated-kubelet_content.json" and "machine-config-operator-57c965559d-66sl2.log" files.

Comment 5 Yu Qi Zhang 2021-04-26 23:50:27 UTC
Hi,

The linked error

```
'Unable to apply 4.6.25: timed out waiting for the condition during syncRequiredMachineConfigPools:
      pool master has not progressed to latest configuration: controller version mismatch
      for 98-master-generated-kubelet expected d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af
      has 14a2b82d9f4c4d8b423f8f05f6926778ef36870d: all 3 nodes are at latest configuration
      rendered-master-381b6c37f8f8020f2e740ba44a1460a2, retrying'
```

is basically saying that a previous version of the MCO created a machineconfig from a kubeletconfig, but the new version did not regenerate it, as seen in the output of your later command:

98-master-generated-kubelet                       14a2b82d9f4c4d8b423f8f05f6926778ef36870d  3.1.0            10d
98-worker-generated-kubelet                       eab9c35dfbeb0d21be6e1db3887acbbb93592d34  3.1.0            10d

That is very odd: neither the master nor the worker generated kubelet machineconfig was ever regenerated by the new controller version (d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af), unlike all the other non-rendered configs.
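The same check can be done across all machineconfigs at once. Below is a minimal sketch, assuming the controller version is recorded in the `machineconfiguration.openshift.io/generated-by-controller-version` annotation and that the input has the shape of `oc get machineconfig -o json`. The helper name `stale_configs` is hypothetical, and the `01-master-kubelet` entry in the sample data is assumed to be at the new version purely for illustration; only the two `generated-kubelet` rows come from the output above.

```python
# Annotation assumed to carry the generating controller's version;
# verify the key against a MachineConfig on the affected cluster.
ANNOTATION = "machineconfiguration.openshift.io/generated-by-controller-version"


def stale_configs(mc_list, expected):
    """List (name, version) for non-rendered configs not generated by `expected`."""
    stale = []
    for item in mc_list.get("items", []):
        meta = item.get("metadata", {})
        name = meta.get("name", "")
        # Rendered configs from earlier versions legitimately keep old values.
        if name.startswith("rendered-"):
            continue
        version = (meta.get("annotations") or {}).get(ANNOTATION)
        if version is not None and version != expected:
            stale.append((name, version))
    return stale


expected = "d5dc2b519aed5b3ed6a6ab9e7f70f33740f9f8af"

# Sample input: the first entry is illustrative, the other two mirror the
# rows quoted in this comment.
sample = {
    "items": [
        {"metadata": {"name": "01-master-kubelet",
                      "annotations": {ANNOTATION: expected}}},
        {"metadata": {"name": "98-master-generated-kubelet",
                      "annotations": {ANNOTATION: "14a2b82d9f4c4d8b423f8f05f6926778ef36870d"}}},
        {"metadata": {"name": "98-worker-generated-kubelet",
                      "annotations": {ANNOTATION: "eab9c35dfbeb0d21be6e1db3887acbbb93592d34"}}},
    ]
}

stale = stale_configs(sample, expected)
for name, version in stale:
    print(f"{name} was generated by {version}, expected {expected}")
```

Rendered configs are skipped on purpose, since the question here is only whether the non-rendered, controller-generated configs were refreshed after the upgrade.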

I have a few questions:

1. Were those machineconfigs ever modified manually?
2. Could you post the kubeletconfigs on the system?
3. Could you post the machine-config-controller pod logs? (oc logs -n openshift-machine-config-operator machine-config-controller-xxx)