Note:
1. If you are dealing with Machine or MachineSet objects, please select the component as "Cloud Compute" under the same product.
2. If you are dealing with kubelet / kubeletconfigs / container runtime configs, please select the component as "Node" under the same product.

Description of problem:
Managed OpenShift SRE had an incident where the customer updated an imageContentSourcePolicy with an invalid configuration, which led to the MCO starting a rollout of a machine-config containing an invalid container runtime config. Over time, with subsequent machine-config updates, nodes that needed to be rebooted or replaced were not able to rejoin the cluster because kubelet and CRI-O would not start due to the invalid container runtime configuration.

Version-Release number of MCO (Machine Config Operator) (if applicable):
From the CSV: configure-alertmanager-operator.v0.1.400-d73fb19

Platform (AWS, VSphere, Metal, etc.): AWS

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)? (Y/N/Not sure): Y

How reproducible:

Did you catch this issue by running a Jenkins job? If yes, please list:
1. Jenkins job: N/A
2. Profile: N/A

Steps to Reproduce:
1. Update one of the imageContentSourcePolicy objects with an invalid configuration, e.g. remove the mirrors. This should trigger a machine-config rollout for all machines in the cluster.
2. Simulate another machine-config update that requires a node reboot, e.g. set the KubeletConfig spec.autoSizingReserved to true or false. This will trigger another machine-config rollout with a node reboot.
3. At this point, the rebooted node will not be able to rejoin the cluster and will remain "NotReady" because kubelet does not start with the invalid container runtime config.

The effects of this issue deteriorate cluster health steadily: when a master's machine config is updated, the node won't be able to come back, which causes follow-on effects such as other nodes being rebooted due to reduced cluster resources (OOM kills), which triggers more node reboots, and so on.

Actual results:
At that point, the cluster becomes quite unhealthy and almost impossible to recover to a healthy state, even after applying a fix to the broken imageContentSourcePolicy, because the MCO first tries to bring all machines up to date with the kubeletconfig change, which never completes since kubelet (and CRI-O) do not start on the nodes.

Expected results:
The MCO should be more resilient to such fatal config changes: either detect them and refuse to apply them, or recognize that the current machine-config update has not gone well and roll back gracefully to the previous machine-config.

Additional info:
1. Please consider attaching a must-gather archive (via oc adm must-gather). Please review must-gather contents for sensitive information before attaching any must-gathers to a Bugzilla report. You may also mark the bug private if you wish.

A must-gather of an affected test cluster where this was reproduced will be provided in the comments. Instead of updating the kubeletconfig (step 2 of the reproduction steps), I also kicked off an upgrade, which showed similar results.
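A minimal sketch of the manifests for reproduction steps 1 and 2 (object names, labels, and the registry host are illustrative placeholders, not taken from the affected cluster):

```yaml
# Step 1: ICSP with an empty mirror list (the invalid configuration)
apiVersion: operator.openshift.io/v1alpha1
kind: ImageContentSourcePolicy
metadata:
  name: example-icsp
spec:
  repositoryDigestMirrors:
  - source: registry.example.com/team/app
    mirrors: []   # invalid: no mirrors defined
---
# Step 2: a KubeletConfig change that forces a node reboot
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: example-autosize
spec:
  machineConfigPoolSelector:
    matchLabels:
      pools.operator.machineconfiguration.openshift.io/worker: ""
  autoSizingReserved: true
```

Applying the first manifest triggers the machine-config rollout described above; the second then forces a reboot into the broken configuration.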
I'm not sure what degree of validation we would like to apply here. Also consider that the MCO allows other, similar changes through which you are able to break your nodes.

> Also, is there a recommended approach to manually getting the machine config pool to point to a specific machine-config object so that SRE could potentially use it to fix such situations?

See our debugging guide at https://docs.google.com/document/d/1fgP6Kv1D-75e1Ot0Kg-W2qPyxWDp2_CALltlBLuseec for some general cases and recommendations. It heavily depends on your scenario.

Also passing this to the Node team, as the Node team owns the container runtime controller / ICSP configuration.
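As a rough sketch of the kind of manual intervention the question above is asking about (pool and object names are illustrative; this is not an officially supported recovery procedure and heavily depends on the scenario, as noted):

```shell
# Pause the pool so the MCO stops rolling the bad config to more nodes
oc patch machineconfigpool/worker --type merge -p '{"spec":{"paused":true}}'

# List rendered machine-configs to identify the last known-good one
oc get machineconfig | grep rendered-worker

# Inspect which rendered config the pool currently targets
oc get machineconfigpool/worker -o jsonpath='{.spec.configuration.name}'

# After fixing the broken ICSP, unpause to let the MCO reconcile again
oc patch machineconfigpool/worker --type merge -p '{"spec":{"paused":false}}'
```

Pausing buys time to fix the source object; nodes already rebooted into the broken config still need manual recovery.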
Verified!

% oc get clusterversion
NAME      VERSION                              AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.11.0-0.nightly-2022-03-20-160505   True        False         79m     Cluster version is 4.11.0-0.nightly-2022-03-20-160505

After applying an "invalid" ICSP to the cluster, it didn't generate any machine config, and there is an error log indicating "invalid empty entry for mirror configuration":

% oc logs -f machine-config-controller-b4b69f8b8-6zrh9 -n openshift-machine-config-operator
...
I0321 03:30:39.037209 1 container_runtime_config_controller.go:363] Error syncing image config openshift-config: could not Create/Update MachineConfig: could not update registries config with new changes: invalid empty entry for mirror configuration
E0321 03:32:01.014926 1 container_runtime_config_controller.go:368] could not Create/Update MachineConfig: could not update registries config with new changes: invalid empty entry for mirror configuration
I0321 03:32:01.014946 1 container_runtime_config_controller.go:369] Dropping image config "openshift-config" out of the queue: could not Create/Update MachineConfig: could not update registries config with new changes: invalid empty entry for mirror configuration
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: OpenShift Container Platform 4.11.0 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2022:5069