Bug 2050466

Summary: machine config update with invalid container runtime config should be more robust
Product: OpenShift Container Platform
Reporter: Karthik Perumal <kramraja>
Component: Node
Assignee: Qi Wang <qiwan>
Node sub component: Kubelet
QA Contact: MinLi <minmli>
Status: CLOSED ERRATA
Docs Contact:
Severity: urgent
Priority: medium
CC: aos-bugs, cblecker, jerzhang, mkrejci, travi
Version: 4.9
Keywords: ServiceDeliveryBlocker
Target Milestone: ---
Target Release: 4.11.0
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2022-08-10 10:47:09 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Karthik Perumal 2022-02-04 02:37:39 UTC
Note:
1. if you are dealing with Machines or MachineSet objects, please select the component as "Cloud Compute" under the same product.
2. if you are dealing with kubelet / kubeletconfigs / container runtime configs, please select the component as "Node" under the same product.

Description of problem:
Managed OpenShift SRE had an incident where a customer updated an imageContentSourcePolicy with an invalid configuration, which led to the MCO starting a rollout of a machine-config containing an invalid container runtime config. Over time, with subsequent machine-config updates, nodes that needed to be rebooted or replaced were unable to rejoin the cluster because kubelet and crio would not start with the invalid container runtime configuration.

Version-Release number of MCO (Machine Config Operator) (if applicable): 
From the CSV: configure-alertmanager-operator.v0.1.400-d73fb19


Platform (AWS, VSphere, Metal, etc.): AWS

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure): Y

How reproducible:

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job: N/A

2. Profile: N/A

Steps to Reproduce:
1. Update one of the imageContentSourcePolicy objects with an invalid configuration, e.g. remove the mirrors. This triggers a machine-config rollout for all machines in the cluster.
2. Trigger another machineConfig update that requires a node reboot. For example, toggle the KubeletConfig spec.autoSizingReserved field between true and false. This triggers another machine-config rollout with a node reboot.
3. At this point, the rebooted node cannot rejoin the cluster and remains "NotReady" because kubelet does not start due to the invalid container runtime config.
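
The two objects from the steps above can be sketched as follows. This is a minimal illustration, not taken from the affected cluster: the object names and the registry host are hypothetical placeholders.

```yaml
# Step 1: an ImageContentSourcePolicy whose mirrors list is empty --
# the invalid shape that, before the fix, was rendered into the
# container runtime config instead of being rejected up front.
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: example-icsp                        # hypothetical name
spec:
  repositoryDigestMirrors:
  - source: registry.example.com/team/app   # hypothetical source repo
    mirrors: []                             # invalid: no mirror entries
---
# Step 2: a KubeletConfig change that forces a second machine-config
# rollout and a node reboot, surfacing the broken runtime config.
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: example-autosize                    # hypothetical name
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  autoSizingReserved: true
```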

The effects of this issue steadily deteriorate cluster health: once a master's machine config is updated, that node won't come back, which causes follow-on effects such as other nodes being rebooted due to lower cluster resources (OOM kills), triggering more node reboots, and so on.


Actual results:
At that point, the cluster becomes quite unhealthy and is almost impossible to recover to a healthy state, even after applying a fix to the broken imageContentSourcePolicy, because the MCO first tries to bring all machines up to date with the kubeletconfig change, which never completes since kubelet (and crio) do not start on the affected nodes.


Expected results:
MCO should be more resilient to such fatal config changes: either detect and refuse to apply them, or recognise that the current machine config update has not gone well and roll back gracefully to the previous machine-config.


Additional info:

1. Please consider attaching a must-gather archive (via oc adm must-gather). Please review must-gather contents for sensitive information before attaching any must-gathers to a Bugzilla report. You may also mark the bug private if you wish.

A must-gather of an affected test cluster where this was reproduced will be provided in the comments. Instead of updating the kubeletconfig (from step 2 of the reproduction steps), I kicked off an upgrade, which showed similar results.

Comment 4 Yu Qi Zhang 2022-02-04 19:09:45 UTC
I'm not sure what degree of validation we would like to apply. Also consider that other MCO-managed configuration can similarly be used to break your nodes.

> Also, is there a recommended approach to manually getting the machine config pool to point to a specific machine-config object so that SRE could potentially use it to fix such situations?

See our debugging guide at https://docs.google.com/document/d/1fgP6Kv1D-75e1Ot0Kg-W2qPyxWDp2_CALltlBLuseec for some general cases and recommendations. It heavily depends on your scenario.
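
For reference, one sketch of how an SRE might pin a pool to a known-good machine-config in this situation (an assumption-laden illustration, not an endorsed procedure; pool name, node name, and rendered-config hash are placeholders, and as the comment above notes, the right steps depend heavily on the scenario):

```shell
# Pause the worker pool so the MCO stops rolling out the broken config.
oc patch machineconfigpool worker --type merge -p '{"spec":{"paused":true}}'

# List rendered machine-configs to identify the last known-good one.
oc get machineconfig | grep rendered-worker

# Point a node back at a known-good rendered config (placeholder name);
# the machine-config daemon on the node then reconciles toward it.
oc annotate node <node-name> --overwrite \
  machineconfiguration.openshift.io/desiredConfig=rendered-worker-<good-hash>

# Unpause once the broken source object has been fixed or removed.
oc patch machineconfigpool worker --type merge -p '{"spec":{"paused":false}}'
```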

Also passing this to the node team, as the node team owns the containerruntimecontroller/ICSP configuration.

Comment 9 MinLi 2022-03-21 03:52:45 UTC
verified!

% oc get clusterversion              
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-03-20-160505   True        False         79m     Cluster version is 4.11.0-0.nightly-2022-03-20-160505

After applying the "invalid" ICSP to the cluster, it didn't generate any machine config, and there is an error log indicating: invalid empty entry for mirror configuration.

% oc logs -f machine-config-controller-b4b69f8b8-6zrh9 -n openshift-machine-config-operator
...
I0321 03:30:39.037209       1 container_runtime_config_controller.go:363] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not update registries config with new changes: invalid empty entry for mirror configuration
E0321 03:32:01.014926       1 container_runtime_config_controller.go:368] could not Create/Update MachineConfig: could not update registries config with new changes: invalid empty entry for mirror configuration
I0321 03:32:01.014946       1 container_runtime_config_controller.go:369] Dropping image config "openshift-config" out of the queue: could not Create/Update MachineConfig: could not update registries config with new changes: invalid empty entry for mirror configuration

Comment 11 errata-xmlrpc 2022-08-10 10:47:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069