Note:
1. If you are dealing with Machine or MachineSet objects, please select the component as "Cloud Compute" under the same product.
2. If you are dealing with kubelet / kubeletconfigs / container runtime configs, please select the component as "Node" under the same product.

Description of problem:
Managed OpenShift SRE had an incident where the customer updated an imageContentSourcePolicy with an invalid configuration, which led to the MCO starting a rollout of a machine-config containing an invalid container runtime config. Over time, with subsequent machine-config updates, nodes that needed to be rebooted or replaced were not able to rejoin the cluster because kubelet and CRI-O would not start due to the invalid container runtime configuration.

Version-Release number of MCO (Machine Config Operator) (if applicable):
From the CSV: configure-alertmanager-operator.v0.1.400-d73fb19

Platform (AWS, VSphere, Metal, etc.): AWS

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)? (Y/N/Not sure): Y

How reproducible:

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job: N/A
2. Profile: N/A

Steps to Reproduce:
1. Update one of the imageContentSourcePolicy objects with an invalid configuration, e.g. remove the mirrors. This should trigger a machine-config rollout for all machines in the cluster.
2. Simulate another machine-config update that requires a node reboot, e.g. set the KubeletConfig spec.autoSizingReserved to true or false. This will trigger another machine-config rollout with a node reboot.
3. At this point, the rebooted node will not be able to rejoin the cluster and will remain "NotReady" because kubelet does not start with the invalid container runtime config.

The effects of this issue deteriorate cluster health steadily: when a master's machine config is updated, the node won't be able to come back, which causes follow-on effects such as other nodes being rebooted due to reduced cluster resources (OOM kills), which triggers more node reboots, and so on.

Actual results:
At that point, the cluster becomes quite unhealthy and almost impossible to recover to a healthy state, even after applying a fix to the broken imageContentSourcePolicy, because the MCO first tries to bring all machines up to date with the kubeletconfig change, which never completes since kubelet (and CRI-O) do not start on the nodes.

Expected results:
The MCO should be more resilient to such fatal config changes: either detect them and refuse to apply them, or recognize that the current machine-config update has not gone well and roll back gracefully to the previous machine-config.

Additional info:
1. Please consider attaching a must-gather archive (via oc adm must-gather). Please review must-gather contents for sensitive information before attaching any must-gathers to a Bugzilla report. You may also mark the bug private if you wish.

A must-gather of an affected test cluster where this was reproduced will be provided in the comments. Instead of updating the kubeletconfig (step 2 of the reproduction steps), I also kicked off an upgrade, which showed similar results.
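A minimal sketch of the manifests for reproduction steps 1 and 2 (object names, labels, and the registry host are illustrative placeholders, not taken from the affected cluster):

```yaml
# Step 1: ICSP with an empty mirror list (the invalid configuration)
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: example-icsp
spec:
  repositoryDigestMirrors:
  - source: registry.example.com/team/app
    mirrors: []   # invalid: no mirrors defined
---
# Step 2: a KubeletConfig change that forces a node reboot
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: example-autosize
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  autoSizingReserved: true
```

Applying the first manifest triggers the machine-config rollout described above; the second then forces a reboot into the broken configuration.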
I'm not sure what degree of validation we would like to apply here. Also consider that the MCO allows other, similar changes through which you are able to break your nodes.

> Also, is there a recommended approach to manually getting the machine config pool to point to a specific machine-config object so that SRE could potentially use it to fix such situations?

See our debugging guide at https://docs.google.com/document/d/1fgP6Kv1D-75e1Ot0Kg-W2qPyxWDp2_CALltlBLuseec for some general cases and recommendations. It heavily depends on your scenario.

Also passing this to the Node team, as the Node team owns the container runtime controller / ICSP configuration.
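As a rough sketch of the kind of manual intervention the question above is asking about (pool and object names are illustrative; this is not an officially supported recovery procedure and heavily depends on the scenario, as noted):

```shell
# Pause the pool so the MCO stops rolling the bad config to more nodes
oc patch machineconfigpool/worker --type merge -p '{"spec":{"paused":true}}'

# List rendered machine-configs to identify the last known-good one
oc get machineconfig | grep rendered-worker

# Inspect which rendered config the pool currently targets
oc get machineconfigpool/worker -o jsonpath='{.spec.configuration.name}'

# After fixing the broken ICSP, unpause to let the MCO reconcile again
oc patch machineconfigpool/worker --type merge -p '{"spec":{"paused":false}}'
```

Pausing buys time to fix the source object; nodes already rebooted into the broken config still need manual recovery.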
Verified!

% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-03-20-160505   True        False         79m     Cluster version is 4.11.0-0.nightly-2022-03-20-160505

After applying an "invalid" ICSP to the cluster, it didn't generate any machine config, and there is an error log indicating "invalid empty entry for mirror configuration":

% oc logs -f machine-config-controller-b4b69f8b8-6zrh9 -n openshift-machine-config-operator
...
I0321 03:30:39.037209 1 container_runtime_config_controller.go:363] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not update registries config with new changes: invalid empty entry for mirror configuration
E0321 03:32:01.014926 1 container_runtime_config_controller.go:368] could not Create/Update MachineConfig: could not update registries config with new changes: invalid empty entry for mirror configuration
I0321 03:32:01.014946 1 container_runtime_config_controller.go:369] Dropping image config "openshift-config" out of the queue: could not Create/Update MachineConfig: could not update registries config with new changes: invalid empty entry for mirror configuration
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069