Bug 2050466 - machine config update with invalid container runtime config should be more robust
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node
Version: 4.9
Hardware: Unspecified
OS: Linux
Priority: medium
Severity: urgent
Target Milestone: ---
Target Release: 4.11.0
Assignee: Qi Wang
QA Contact: MinLi
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2022-02-04 02:37 UTC by Karthik Perumal
Modified: 2022-08-10 10:47 UTC

Fixed In Version:
Doc Type: No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-08-10 10:47:09 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
Github openshift machine-config-operator pull 2969 0 None open Bug 2050466: Not allow empty string in icsp&image CR 2022-02-25 18:41:11 UTC
Github openshift runtime-utils pull 14 0 None open Bug 2050466: Not allow "" string in registriesConf entry 2022-02-25 03:12:08 UTC
Red Hat Product Errata RHSA-2022:5069 0 None None None 2022-08-10 10:47:34 UTC

Description Karthik Perumal 2022-02-04 02:37:39 UTC
Note:
1. if you are dealing with Machines or MachineSet objects, please select the component as "Cloud Compute" under same product.
2. if you are dealing with kubelet / kubeletconfigs / container runtime configs, please select the component as "Node" under same product.

Description of problem:
Managed OpenShift SRE had an incident where the customer updated the imageContentSourcePolicy with an invalid configuration, which led to the MCO starting a rollout of a machine-config with an invalid container runtime config. Over time, with subsequent machine-config updates, nodes that needed to be rebooted or replaced were not able to join the cluster because kubelet and crio failed to start with the invalid container runtime configuration.

Version-Release number of MCO (Machine Config Operator) (if applicable): 
From the CSV: configure-alertmanager-operator.v0.1.400-d73fb19


Platform (AWS, VSphere, Metal, etc.): AWS

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)?
(Y/N/Not sure): Y

How reproducible:

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job: N/A

2. Profile: N/A

Steps to Reproduce:
1. Update one of the imageContentSourcePolicy objects with an invalid configuration, e.g. remove the mirrors. This should trigger a machine-config rollout for all machines in the cluster.
2. Simulate another machineConfig update that requires a node reboot. For example, modify the kubeletConfig spec.autoSizingReserved to true or false. This will trigger another machine-config rollout with a node reboot.
3. At this point, the rebooted node will not be able to rejoin the cluster and will remain "NotReady" because kubelet does not start with the invalid container runtime config.

This issue steadily degrades cluster health: once the masters' machine config is updated, they won't be able to come back, which causes follow-on effects such as other nodes being rebooted due to potentially lower cluster resources (OOM kill), which triggers more node reboots, and so on.


Actual results:
At that point, the cluster becomes quite unhealthy and almost impossible to recover to a healthy state, even after applying a fix to the broken imageContentSourcePolicy, because the MCO first tries to get all machines up to date with the kubeletconfig change, which will never complete since kubelet (and containerd/crio) does not start on the nodes.


Expected results:
MCO should be more resilient to such fatal config changes and either detect these changes and not apply them or realise that the current machine config update hasn't gone well and roll back gracefully to the previous machine-config instead.
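
A minimal sketch of the "detect these changes and not apply them" option, written in Python for illustration only. It is not the MCO's Go code: `MirrorSet`, `validate_mirror_sets` and `apply_if_valid` are hypothetical names, and the rule shown (reject any empty-string source or mirror) is an assumption based on the titles of the linked pull requests. The error text is borrowed from the controller log in comment 9.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MirrorSet:
    """Hypothetical stand-in for one source/mirrors entry of an ICSP."""
    source: str
    mirrors: List[str] = field(default_factory=list)


def validate_mirror_sets(mirror_sets: List[MirrorSet]) -> None:
    """Raise ValueError if any source or mirror is an empty string."""
    for index, entry in enumerate(mirror_sets):
        if not entry.source.strip():
            raise ValueError(
                f"entry {index}: invalid empty entry for mirror configuration (source)"
            )
        for mirror in entry.mirrors:
            if not mirror.strip():
                raise ValueError(
                    f"entry {index}: invalid empty entry for mirror configuration (mirror)"
                )


def apply_if_valid(mirror_sets: List[MirrorSet],
                   render: Callable[[List[MirrorSet]], str]) -> str:
    """Validate first; call the (hypothetical) render step only for sane input."""
    validate_mirror_sets(mirror_sets)
    return render(mirror_sets)
```

The point of the design is ordering: validation runs before anything is rendered, so a bad entry never reaches a node. This sketch does not decide how an empty mirrors list should be treated.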


Additional info:

1. Please consider attaching a must-gather archive (via oc adm must-gather). Please review must-gather contents for sensitive information before attaching any must-gathers to a Bugzilla report. You may also mark the bug private if you wish.

Must-gather of an affected test cluster where this was reproduced will be provided in the comments. Instead of updating the kubeletconfig (step 2 of the reproduction steps), I kicked off an upgrade, which showed similar results.

Comment 4 Yu Qi Zhang 2022-02-04 19:09:45 UTC
I'm not sure what degree of validation we would like to apply. Also consider that the MCO does similar things where you are able to break your nodes.

> Also, is there a recommended approach to manually getting the machine config pool to point to a specific machine-config object so that SRE could potentially use it to fix such situations?

See our debugging guide at https://docs.google.com/document/d/1fgP6Kv1D-75e1Ot0Kg-W2qPyxWDp2_CALltlBLuseec for some general cases and recommendations. It heavily depends on your scenario.

Also passing this to the node team, as the node team owns the containerruntimecontroller/ICSP configuration.

Comment 9 MinLi 2022-03-21 03:52:45 UTC
verified!

% oc get clusterversion              
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-03-20-160505   True        False         79m     Cluster version is 4.11.0-0.nightly-2022-03-20-160505

After applying an "invalid" ICSP to the cluster, it didn't generate any machine config. The controller log reports the error: invalid empty entry for mirror configuration.

% oc logs -f machine-config-controller-b4b69f8b8-6zrh9 -n openshift-machine-config-operator
...
I0321 03:30:39.037209       1 container_runtime_config_controller.go:363] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not update registries config with new changes: invalid empty entry for mirror configuration
E0321 03:32:01.014926       1 container_runtime_config_controller.go:368] could not Create/Update MachineConfig: could not update registries config with new changes: invalid empty entry for mirror configuration
I0321 03:32:01.014946       1 container_runtime_config_controller.go:369] Dropping image config "openshift-config" out of the queue: could not Create/Update MachineConfig: could not update registries config with new changes: invalid empty entry for mirror configuration
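
For anyone who wants to spot this rejection in a saved machine-config-controller log, the following is a rough Python sketch that filters lines like the ones above. It assumes the line layout seen in this excerpt (severity letter, MMDD, time, thread id, file:line, message); `find_image_config_errors` is a hypothetical helper, not part of any OpenShift tooling.

```python
import re

# Matches the header layout seen in the log excerpt, e.g.
# "E0321 03:32:01.014926       1 some_file.go:368] message"
LOG_LINE = re.compile(
    r"^(?P<severity>[IWEF])(?P<date>\d{4}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d+)\s+"
    r"(?P<thread>\d+) (?P<file>[^:\s]+):(?P<line>\d+)\] "
    r"(?P<message>.*)$"
)

CONTROLLER_FILE = "container_runtime_config_controller.go"
FAILURE_MARKER = "could not Create/Update MachineConfig"


def find_image_config_errors(log_text):
    """Return (severity, time, message) for each rejected image config line."""
    hits = []
    for raw in log_text.splitlines():
        match = LOG_LINE.match(raw.strip())
        if match is None:
            continue  # not a log line in the expected layout (e.g. "...")
        if match["file"] != CONTROLLER_FILE:
            continue  # emitted by some other controller source file
        if FAILURE_MARKER in match["message"]:
            hits.append((match["severity"], match["time"], match["message"]))
    return hits
```

Fed the output of the `oc logs` command above (saved to a file and read as text), this should return one tuple per rejected sync attempt.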

Comment 11 errata-xmlrpc 2022-08-10 10:47:09 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2022:5069

