Note:
1. If you are dealing with Machines or MachineSet objects, please select the component as "Cloud Compute" under the same product.
2. If you are dealing with kubelet / kubeletconfigs / container runtime configs, please select the component as "Node" under the same product.

Description of problem:
The MCD applied a new rendered config to nodes even though the change was not targeted at their labels or pools, and then reverted back to the original config after a reboot.

~~~
2021-09-10T23:45:59.652474153Z I0910 23:45:59.652406 3807 update.go:1678] initiating reboot: Node will reboot into config rendered-infra-cb54d95db7c87bafa2af73265130cab7
2021-09-24T23:26:54.172792871Z I0924 23:26:54.172752 3536 update.go:1678] initiating reboot: Node will reboot into config rendered-infra-296e595dd79dd8e6fb76da54d26b7393
2021-09-24T23:38:08.085630499Z I0924 23:38:08.085600 4041 update.go:1678] initiating reboot: Node will reboot into config rendered-infra-cb54d95db7c87bafa2af73265130cab7
~~~

As the log above shows, the node rebooted from rendered-infra-cb54d95db7c87bafa2af73265130cab7 into rendered-infra-296e595dd79dd8e6fb76da54d26b7393 and then reverted back to rendered-infra-cb54d95db7c87bafa2af73265130cab7. We never applied this change, yet it happened automatically. We would like to understand whether this is a bug. We have must-gather data available. If more verbose logs are needed, what would be the ideal steps to collect them?

Version-Release number of MCO (Machine Config Operator) (if applicable):

Platform (AWS, VSphere, Metal, etc.):

Are you certain that the root cause of the issue being reported is the MCO (Machine Config Operator)? (Y/N/Not sure):

How reproducible:

Actual results:
- MCO applied a rendered config to pools whose labels did not match the change, then reverted the config back to the original after it noticed; it is unclear why this happened.

Expected results:
- MCO should not apply changes to pools whose labels do not match.

Additional info:
1. Please consider attaching a must-gather archive (via oc adm must-gather). Please review must-gather contents for sensitive information before attaching any must-gathers to a Bugzilla report. You may also mark the bug private if you wish.

Attaching must-gather data.
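For reference, a minimal sketch of the kind of data collection that would help if this happens again (this is not an official procedure; pod and node names below are placeholders):

~~~
# Capture a full must-gather archive
$ oc adm must-gather

# Save the machine-config-controller logs before the pod restarts and they are lost
$ oc -n openshift-machine-config-operator get pods -o wide
$ oc -n openshift-machine-config-operator logs machine-config-controller-<pod-id> > mcc.log

# Save the machine-config-daemon logs for the affected nodes
$ oc -n openshift-machine-config-operator logs machine-config-daemon-<pod-id> -c machine-config-daemon > mcd-<node>.log

# Record the pool and rendered-config state at the time of the incident
$ oc get mcp
$ oc get machineconfig
$ oc get kubeletconfig
~~~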
Must-gather data is available in supportshell for the attached case.
Just to add some more information on the above issue.

The system was updated to platform 4.6.40 on 28/08/2021. The machine-config-controller had been restarted about 50 minutes before the kubeletconfig was applied.

****************************************
A new kubeletconfig to change the SystemReservedMemory of the master nodes was applied. I applied it to the master MCP. Once the master nodes had been successfully rebooted, the infra and router-pre nodes started to reboot as well.
****************************************

The yaml below was applied (the machineConfigPoolSelector is only meant to match the master pool; see the sketch after the reboot logs below):

~~~
$ cat set-allocatable-master.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-allocatable-master
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet-systemreservation: reserved-resources-master
  kubeletConfig:
    systemReserved:
      cpu: 500m
      memory: 5.5Gi
~~~

~~~
$ oc get mcp
NAME         CONFIG                                                  UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
infra        rendered-infra-cb54d95db7c87bafa2af73265130cab7         True      False      False      3              3                   3                     0                      245d
master       rendered-master-3e92983a28b6609e55b69ea3653982d5        True      False      False      3              3                   3                     0                      246d
router-pre   rendered-router-pre-c88a1ae5ac364578c5bbd26ffeb3c545    True      False      False      3              3                   3                     0                      50d
worker       rendered-worker-cb54d95db7c87bafa2af73265130cab7        True      False      False      0              0                   0                     0                      246d
worker-pre   rendered-worker-pre-cb54d95db7c87bafa2af73265130cab7    False     True       False      47             27                  47                    0                      233d
worker-pro   rendered-worker-pro-cb54d95db7c87bafa2af73265130cab7    False     True       False      34             15                  34                    0                      233d

$ oc get kubeletconfig
NAME                     AGE
logrotation              158d
set-allocatable          14d
set-allocatable-master   76m
~~~

After that, the controller behaved as if the right rendered config for router-pre were rendered-router-pre-c3356e1eb7c5a1f950dbdd5c52b16d0c and the right rendered config for infra were rendered-infra-296e595dd79dd8e6fb76da54d26b7393, so it started rolling them out. Those were created on 2021-08-27 (both the creation timestamps and the logs of the non-rebooted nodes show this clearly).

~~~
2021-09-24T23:31:02.499863462Z I0924 23:31:02.499822 1 node_controller.go:414] Pool infra: Setting node ocjc4-pqh6j-infra-0-mpmzr target to rendered-infra-296e595dd79dd8e6fb76da54d26b7393
2021-09-24T23:31:02.499994502Z I0924 23:31:02.499981 1 node_controller.go:414] Pool router-pre: Setting node ocjc4-pqh6j-router-pre-6b2w9 target to rendered-router-pre-c3356e1eb7c5a1f950dbdd5c52b16d0c
~~~

Shortly after, it realized that those were not the right rendered configs, so it switched to the right rendered-router-pre-c88a1ae5ac364578c5bbd26ffeb3c545 and rendered-infra-cb54d95db7c87bafa2af73265130cab7, which were created on 2021-09-10 when set-allocatable was created.
~~~
2021-09-24T23:31:02.609997730Z I0924 23:31:02.609949 1 render_controller.go:522] Pool worker-pro: now targeting: rendered-worker-pro-cb54d95db7c87bafa2af73265130cab7
2021-09-24T23:31:02.632968231Z I0924 23:31:02.632910 1 render_controller.go:522] Pool infra: now targeting: rendered-infra-cb54d95db7c87bafa2af73265130cab7
2021-09-24T23:31:02.680398232Z I0924 23:31:02.680343 1 render_controller.go:522] Pool router-pre: now targeting: rendered-router-pre-c88a1ae5ac364578c5bbd26ffeb3c545
2021-09-24T23:31:07.525145041Z I0924 23:31:07.525088 1 status.go:90] Pool worker: All nodes are updated with rendered-worker-cb54d95db7c87bafa2af73265130cab7
~~~

To add more confusion, if you diff both files (I'd recommend using a visual diff tool like meld), you can see that the only relevant difference corresponds to the changes introduced by the logrotation kubeletconfig (not set-allocatable, for which the dates match, or set-allocatable-master, which is the one changed at the time of interest).

In short, the machine-config-controller behaves erratically during its startup with regard to kubeletconfig rendering. Those glitches may switch the rendered config of the MCP for long enough for a spurious update to be triggered.

Below are the reboot times for the router-pre nodes:

----------------------------
ocjc4-pqh6j-router-pre-6b2w9
----------------------------
$ omg logs machine-config-daemon-wrnj6 -c machine-config-daemon | grep reboot
2021-09-24T23:31:54.047424691Z I0924 23:31:54.047319 3481 update.go:1678] initiating reboot: Node will reboot into config rendered-router-pre-c3356e1eb7c5a1f950dbdd5c52b16d0c
2021-09-24T23:35:36.312608646Z I0924 23:35:36.312561 3444 update.go:1678] initiating reboot: Node will reboot into config rendered-router-pre-c88a1ae5ac364578c5bbd26ffeb3c545

----------------------------
ocjc4-pqh6j-router-pre-8prp2
----------------------------
This node did not reboot on the 24th as per the logs:
$ omg logs machine-config-daemon-msk6d -c machine-config-daemon | grep reboot
2021-08-27T23:53:56.674494856Z I0827 23:53:56.674420 3123886 update.go:1678] initiating reboot: Node will reboot into config rendered-router-pre-c3356e1eb7c5a1f950dbdd5c52b16d0c
2021-09-10T23:45:59.846527148Z I0910 23:45:59.846458 3902 update.go:1678] initiating reboot: Node will reboot into config rendered-router-pre-c88a1ae5ac364578c5bbd26ffeb3c545

----------------------------
ocjc4-pqh6j-router-pre-gwrrs
----------------------------
$ omg logs machine-config-daemon-bn748 -c machine-config-daemon | grep reboot
2021-09-24T23:26:42.381245275Z I0924 23:26:42.381168 3250 update.go:1678] initiating reboot: Node will reboot into config rendered-router-pre-c3356e1eb7c5a1f950dbdd5c52b16d0c
2021-09-24T23:39:42.001537137Z I0924 23:39:42.001485 3376 update.go:1678] initiating reboot: Node will reboot into config rendered-router-pre-c88a1ae5ac364578c5bbd26ffeb3c545

=================================================================================
Infra nodes

--------------------------
ocjc4-pqh6j-infra-0-6n6tj
--------------------------
$ omg logs machine-config-daemon-bt4sr -c machine-config-daemon | grep reboot
2021-09-24T23:26:54.172792871Z I0924 23:26:54.172752 3536 update.go:1678] initiating reboot: Node will reboot into config rendered-infra-296e595dd79dd8e6fb76da54d26b7393
2021-09-24T23:38:08.085630499Z I0924 23:38:08.085600 4041 update.go:1678] initiating reboot: Node will reboot into config rendered-infra-cb54d95db7c87bafa2af73265130cab7

--------------------------
ocjc4-pqh6j-infra-0-m6s9t
--------------------------
This machine did not reboot on the 24th as per the logs:
$ omg logs machine-config-daemon-tsxxv -c machine-config-daemon | grep reboot
2021-08-27T23:59:21.960996406Z I0827 23:59:21.960939 4043758 update.go:1678] initiating reboot: Node will reboot into config rendered-infra-296e595dd79dd8e6fb76da54d26b7393
2021-09-10T23:50:14.607557463Z I0910 23:50:14.607519 4314 update.go:1678] initiating reboot: Node will reboot into config rendered-infra-cb54d95db7c87bafa2af73265130cab7

--------------------------
ocjc4-pqh6j-infra-0-mpmzr
--------------------------
$ omg logs machine-config-daemon-xg9h7 -c machine-config-daemon | grep reboot
2021-09-24T23:32:24.210514498Z I0924 23:32:24.210435 3765 update.go:1678] initiating reboot: Node will reboot into config rendered-infra-296e595dd79dd8e6fb76da54d26b7393
2021-09-24T23:42:38.389579170Z I0924 23:42:38.389548 3996 update.go:1678] initiating reboot: Node will reboot into config rendered-infra-cb54d95db7c87bafa2af73265130cab7
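For completeness, the machineConfigPoolSelector in the yaml above is only supposed to match pools carrying the corresponding label, so neither infra nor router-pre should have been touched. A quick way to confirm which pools a kubeletconfig can select (a sketch, assuming the label shown in the yaml is the one actually set on the master pool) would be:

~~~
# List all pools with their labels
$ oc get mcp --show-labels

# Show only the pools matched by the set-allocatable-master selector;
# with the selector above, only the master pool should be listed here
$ oc get mcp -l custom-kubelet-systemreservation=reserved-resources-master
~~~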
From Comment#3, it appears that the issue started after applying the kubeletconfig. Moving to the Node team to analyze whether the problem is related to the kubeletconfig controller.
So a few notes:

1. I can see that the controller did at some point attempt to flip as you mentioned, but unfortunately, because the controller restarted about 50 minutes prior to the must-gather, we no longer have the logs that would show why it did the flip.

2. The user in this case has multiple kubeletconfigs per pool. This is actually not supported in 4.6 due to the inability to merge them or properly handle overwriting them. If you decode the rendered configs, you'll see that one of them has

~~~
"containerLogMaxSize": "50Mi",
"systemReserved": {
    "cpu": "500m",
    "ephemeral-storage": "1Gi",
    "memory": "13Gi"
},
~~~

and one of them has

~~~
"containerLogMaxSize": "100Mi",
"containerLogMaxFiles": 7,
"systemReserved": {
    "cpu": "500m",
    "ephemeral-storage": "1Gi",
    "memory": "1Gi"
},
~~~

The user basically has both of those as separate kubeletconfigs: logrotation and set-allocatable (logrotation and set-allocatable-master for the master pool). Right now only one can take effect at any given time, so assuming the desired end result is that both apply together, the user should merge them into one kubeletconfig per pool. For masters, that would look like:

~~~
spec:
  kubeletConfig:
    systemReserved:
      cpu: 500m
      memory: 5.5Gi
    containerLogMaxFiles: 7
    containerLogMaxSize: 100Mi
~~~

For workers (which propagates to all other nodes), it should look like:

~~~
spec:
  kubeletConfig:
    systemReserved:
      cpu: 500m
      memory: 13Gi
    containerLogMaxFiles: 7
    containerLogMaxSize: 100Mi
~~~

Note that in newer versions the overwrite behaviour is less buggy, but there is still no merge, so you must still combine the settings as above if you want both configs to apply.

My best guess as to what is happening is that, because there are multiple configs, the controller fetches them in a different order at some point (or reads one after the other), causing the kubeletconfig controller to update the base kubelet machineconfig with the old one before realizing it should be on the new one. I can't be certain, but it should no longer happen once there is only one kubeletconfig per pool.
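To make that concrete, here is a rough sketch of what a single merged kubeletconfig for the master pool could look like, reusing the pool selector label from the customer's set-allocatable-master object and the values quoted above; the object name here is just a placeholder:

~~~
$ cat merged-kubelet-master.yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: merged-kubelet-master   # placeholder name; any unique name works
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet-systemreservation: reserved-resources-master
  kubeletConfig:
    systemReserved:
      cpu: 500m
      memory: 5.5Gi
    containerLogMaxFiles: 7
    containerLogMaxSize: 100Mi
~~~

The existing logrotation and set-allocatable-master objects would then be removed, so that only this single kubeletconfig selects the master pool.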
Also, I just remembered: I think having multiple kubeletconfigs is technically non-deterministic in 4.6. The outcome depends on when the objects get synced to the MCC, which can be triggered by e.g. a timestamp update or a controller update, I believe.
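As a side note, one way to check which of the competing kubeletconfigs is currently reflected in a pool (just a sketch, not an official procedure; the rendered config name below is taken from the outputs earlier in this bug) is to list the generated kubelet machineconfigs and inspect the rendered config the pool points at:

~~~
# Machineconfigs generated from kubeletconfig objects
$ oc get machineconfig | grep kubelet

# Inspect the rendered config a pool is targeting; decoding the kubelet.conf
# file embedded in it shows which kubeletconfig settings actually took effect
$ oc get machineconfig rendered-infra-cb54d95db7c87bafa2af73265130cab7 -o yaml
~~~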
Sorry I didn't get a chance to get back to this earlier. Could you please follow the steps in this comment: https://bugzilla.redhat.com/show_bug.cgi?id=2009682#c5 and let me know if that works? I think we consider this a mix of user error and buggy behaviour. Please only have one kubeletconfig per pool.
Hi, given that there has been no response, I am going to close this as INSUFFICIENT_DATA for now. If you are able to apply the fix (i.e. ensure there is only one kubeletconfig per pool, merging the necessary kubeletconfigs together) and still see issues with the cluster, please provide updated information and reopen the bug. Thank you.
It was not possible to reproduce it again, so it is ok to leave this closed until we can reproduce. Thanks for your help.