What problem/issue/behavior are you having trouble with? What do you expect to see?
Worker nodes that already have the latest machine config should not be updated when scaling out nodes. In reality, we see un-targeted nodes going through MachineConfig updates; the daemon logs show updates between two rendered MachineConfigs.

Where are you experiencing the behavior? What environment?
OCP 4.7.4

When does the behavior occur? Frequency? Repeatedly? At certain times?
Repeatedly, when performing scale-out operations.

What information can you provide around timeframes and the business impact?
This impacts our ability to predict and avoid tenant application impact when performing updates.
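The flip can be observed directly from the standard MCO node annotations (a sketch; this only assumes stock OCP, no case-specific names):

```shell
# Per-node view of the current vs. desired rendered MachineConfig and the
# MCO daemon state, taken from the machineconfiguration.openshift.io
# annotations (dots in annotation keys must be escaped in custom-columns):
oc get nodes -o custom-columns='NAME:.metadata.name,CURRENT:.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig,DESIRED:.metadata.annotations.machineconfiguration\.openshift\.io/desiredConfig,STATE:.metadata.annotations.machineconfiguration\.openshift\.io/state'
```

A node being updated shows CURRENT != DESIRED and STATE "Working"; an un-targeted node should stay at CURRENT == DESIRED throughout a scale-out.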
Changing assignee to Ben for MCO team.
This may be running on the baremetal platform, but it appears to be a general machine-config issue on scaleout. Based on my reading of the case it sounds like there's some sort of conflict between MCO and PAO (the latter of which I'm not familiar with), so this may not be an MCO bug at all. I'll let the MCO team make that determination though.
So the customer case is pretty long. Distilled down, it contains roughly 4 must-gather timelines of different approaches hitting this issue. Fundamentally, it appears that in the customer scenario, scaling (?) nodes causes the MCO to flip between different rendered configs. Note that the MCO constantly resyncs all MCs, so my first guess is that the PAO-generated configs (and corresponding labelling?) -> kubeletconfig -> machineconfig chain is not being done correctly, i.e. a bug in the PAO?

For simplicity of analysis, I used the 2 must-gathers in this comment (the pre/post must-gather): https://access.redhat.com/support/cases/#/case/03070861/discussion?attachmentId=a092K00002vKCEGQA4

The flip happens between rendered-workerperf-8567689b42ea7dabfb3118dafe136d96 and rendered-workerperf-cb815f01e9974630dcf1c991b9eeccc3:

1. The rendered-workerperf-cb815f01e9974630dcf1c991b9eeccc3 that the system wants to be on has these additions:

   /usr/local/bin/stalld
   /usr/local/bin/throttlectl.sh
   stalld.service

   karg set:
   - skew_tick=1
   - nohz=on
   - rcu_nocbs=4-39,44-79
   - tuned.non_isolcpus=0000ffff,ffff0000,00000f00,0000000f
   - intel_pstate=disable
   - nosoftlockup
   - tsc=nowatchdog
   - intel_iommu=on
   - iommu=pt
   - isolcpus=managed_irq,4-39,44-79
   - systemd.cpu_affinity=0,1,2,3,40,41,42,43,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111
   - default_hugepagesz=2M

2. The one that it flipped to temporarily has none of those changes, but was interestingly generated 11 days ago, whereas the "correct" one was generated 74 days ago — meaning the first action that generated it happened 11 days ago. I see no MCs with that age, meaning that an existing MC was changed. Those MCs are managed by the NTO.
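For anyone re-checking this on a live cluster (rather than in a must-gather), the kernel-argument delta between the two rendered MachineConfigs can be extracted directly; a sketch, using the rendered-config names from this case:

```shell
# Dump the kernelArguments of each rendered MachineConfig, one argument per
# line, then diff to see exactly what the flip adds or removes.
oc get mc rendered-workerperf-8567689b42ea7dabfb3118dafe136d96 \
  -o jsonpath='{range .spec.kernelArguments[*]}{@}{"\n"}{end}' > kargs-a.txt
oc get mc rendered-workerperf-cb815f01e9974630dcf1c991b9eeccc3 \
  -o jsonpath='{range .spec.kernelArguments[*]}{@}{"\n"}{end}' > kargs-b.txt
diff kargs-a.txt kargs-b.txt
```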
I see that in the MachineConfig 50-nto-workerperf.yaml the last update timestamp was 2021-11-17T14:52:38Z, meaning that it did in fact perform some kind of update to it (likely triggering your reboot rollout.) Passing this to the NTO team to see what triggered those updates. In the eyes of the MCO, it is behaving as expected.
Thank you for the report. While I cannot reproduce this right now, based on a review of the relevant parts of the NTO code, I believe that given the right timing (openshift-tuned pods not yet running during node startup) the issue described above is likely caused by the NTO. I'm currently on leave and will start working on a fix upon my return. As a workaround, please pause the MCP during the scale-ups.
Back from leave. Many thanks Yu Qi for pointing me in the right direction.

A minimal (NTO-only) reproducer for the QE:

1) Install a cluster with two worker nodes.

2) Add both worker nodes into the same machine config pool:

   git clone https://github.com/openshift/cluster-node-tuning-operator
   cd cluster-node-tuning-operator/test/e2e/reboots
   node1="your-first-worker-node"   # CHANGE THIS
   node2="your-second-worker-node"  # CHANGE THIS
   profileRealtime="../testing_manifests/stalld.yaml"
   mcpRealtime="../../../examples/realtime-mcp.yaml"
   nodeLabelRealtime="node-role.kubernetes.io/worker-rt"
   oc label no $node1 ${nodeLabelRealtime}= --overwrite
   oc label no $node2 ${nodeLabelRealtime}= --overwrite
   oc create -f $profileRealtime
   oc create -f $mcpRealtime

3) Scale up the cluster by one node:

   oc scale machineset/$your_machine_set --replicas=$existing_number_plus_one -n openshift-machine-api

4) Watch the scale-up, and immediately once the new node's Tuned Profile becomes available, add the new node into the same machine config pool. For example by using this script:

   new_node_get() {
     local nodes nodes_old n
     nodes=$(oc get profile --no-headers -o name -n openshift-cluster-node-tuning-operator)
     while true; do
       nodes_old="$nodes"
       nodes=$(oc get profile --no-headers -o name -n openshift-cluster-node-tuning-operator)
       for n in $nodes
       do
         test "${nodes_old//$n/}" == "${nodes_old}" && break 2
       done
       sleep 1
     done
     printf "${n##*/}"
   }
   new_node=$(new_node_get)
   date
   echo "oc label no $new_node node-role.kubernetes.io/worker-rt="
   oc label no $new_node node-role.kubernetes.io/worker-rt=

5) Watch one of the old nodes get unnecessarily rebooted, and see NTO log messages such as:

   ... updated MachineConfig 50-nto-worker-rt with ignition and kernel parameters: []

The [] is the key; this should not happen in this case.

Upstream fix for 4.10: https://github.com/openshift/cluster-node-tuning-operator/pull/293
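As an aside, the new-node detection in new_node_get relies on bash pattern substitution: a Profile name $n is new when deleting it from the previous listing leaves that listing unchanged. A minimal, cluster-free illustration of just that logic (the node names here are made up):

```shell
#!/usr/bin/env bash
# Previous and current Profile listings (made-up names); node-c just appeared.
nodes_old="profile.tuned.openshift.io/node-a profile.tuned.openshift.io/node-b"
nodes="profile.tuned.openshift.io/node-a profile.tuned.openshift.io/node-b profile.tuned.openshift.io/node-c"

for n in $nodes; do
  # If removing $n from the old listing changes nothing, $n was not in it.
  if [ "${nodes_old//$n/}" == "$nodes_old" ]; then
    new_node="${n##*/}"   # strip the resource prefix, keep the node name
    break
  fi
done
echo "$new_node"   # prints: node-c
```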
This BZ has been fixed in all the recent OCP releases down to 4.8.z. As 4.7 is in maintenance mode and only qualified Critical and Important Security Advisories (RHSAs) and Urgent and Selected High Priority Bug Fix Advisories (RHBAs) may be released (see: https://access.redhat.com/support/policy/updates/openshift), I'm closing this BZ.
We're discussing this bug in https://bugzilla.redhat.com/show_bug.cgi?id=2075126 around a 4.7 cluster. Hmm. So as I understand it...the bug here isn't actually *scaling up* i.e. new nodes - the actual trigger is changing the pool associated with a node. Would it work around the bug to pause the machineconfigpools until all adjustments to the pool assigned to nodes is complete?
(In reply to Colin Walters from comment #13) > We're discussing this bug in > https://bugzilla.redhat.com/show_bug.cgi?id=2075126 around a 4.7 cluster. > > Hmm. So as I understand it...the bug here isn't actually *scaling up* i.e. > new nodes - the actual trigger is changing the pool associated with a node. > > Would it work around the bug to pause the machineconfigpools until all > adjustments to the pool assigned to nodes is complete? Yes, I believe that would work. However, you'd want to wait until the NTO operands apply the new config and report back to the operator the calculated kernel arguments. This should not take longer than a few seconds.
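For the record, pausing and unpausing a pool is one patch each way; a sketch, assuming the "workerperf" pool name from this case:

```shell
# Pause the pool before relabelling nodes / changing pool membership:
oc patch mcp/workerperf --type merge -p '{"spec":{"paused":true}}'

# ... make the node/pool changes, give the NTO a few seconds to report the
# calculated kernel arguments back to the operator, then unpause:
oc patch mcp/workerperf --type merge -p '{"spec":{"paused":false}}'
```

While paused, the MCO still renders new configs but does not roll them out, so any transient flip of the rendered config cannot reboot nodes.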
[ocpadmin@ec2-18-217-45-133 nto]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2022-05-02-180457   True        False         85m     Cluster version is 4.7.0-0.nightly-2022-05-02-180457

[ocpadmin@ec2-18-217-45-133 nto]$ oc logs -f cluster-node-tuning-operator-5dcc9c8d68-4kr9q
I0505 11:34:53.908545       1 main.go:25] Go Version: go1.15.14
I0505 11:34:53.908719       1 main.go:26] Go OS/Arch: linux/amd64
I0505 11:34:53.908729       1 main.go:27] node-tuning Version: v4.7.0-202204262116.p0.gddbc574.assembly.stream-0-g291a1f1-dirty
I0505 11:34:53.912771       1 controller.go:1030] trying to become a leader
I0505 11:34:53.927535       1 leaderelection.go:243] attempting to acquire leader lease openshift-cluster-node-tuning-operator/node-tuning-operator-lock...
I0505 11:34:53.930841       1 controller.go:1103] current leader: cluster-node-tuning-operator-5dcc9c8d68-4kr9q_d68c65d1-26b8-472a-aef5-f7ffbae7c6c8
I0505 11:35:32.503773       1 leaderelection.go:253] successfully acquired lease openshift-cluster-node-tuning-operator/node-tuning-operator-lock
I0505 11:35:32.503944       1 controller.go:1090] became leader: cluster-node-tuning-operator-5dcc9c8d68-4kr9q_563fa332-dde8-458b-b55c-56ab8f7bed1c
I0505 11:35:32.504002       1 controller.go:959] starting Tuned controller
I0505 11:35:32.804440       1 controller.go:1011] started events processor/controller
I0505 11:36:06.734833       1 controller.go:581] created profile liqcui-oc4737ngt-skd8w-worker-b-klwbz.c.openshift-qe.internal [openshift-node]
I0505 11:36:07.409652       1 controller.go:581] created profile liqcui-oc4737ngt-skd8w-worker-c-bnlxp.c.openshift-qe.internal [openshift-node]
I0505 11:36:12.055225       1 controller.go:581] created profile liqcui-oc4737ngt-skd8w-worker-a-qrj9x.c.openshift-qe.internal [openshift-node]
I0505 12:45:42.167888       1 controller.go:496] updated Tuned rendered
I0505 12:45:42.191841       1 controller.go:615] updated profile liqcui-oc4737ngt-skd8w-worker-a-qrj9x.c.openshift-qe.internal [openshift-realtime]
I0505 12:45:43.948819       1 controller.go:668] created MachineConfig 50-nto-worker-rt with kernel parameters: [skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup tsc=nowatchdog systemd.cpu_affinity=0,2,3]
I0505 12:53:40.840752       1 controller.go:615] updated profile liqcui-oc4737ngt-skd8w-worker-b-klwbz.c.openshift-qe.internal [openshift-realtime]
I0505 12:53:58.474204       1 controller.go:581] created profile liqcui-oc4737ngt-skd8w-worker-f-j9lsp.c.openshift-qe.internal [openshift-node]
I0505 12:53:59.438575       1 controller.go:615] updated profile liqcui-oc4737ngt-skd8w-worker-f-j9lsp.c.openshift-qe.internal [openshift-realtime]
I0505 13:16:05.376665       1 controller.go:273] deleted Profile liqcui-oc4737ngt-skd8w-worker-f-j9lsp.c.openshift-qe.internal
I0505 13:19:03.467807       1 controller.go:581] created profile liqcui-oc4737ngt-skd8w-worker-a-bjcr8.c.openshift-qe.internal [openshift-node]
I0505 13:19:04.225443       1 controller.go:615] updated profile liqcui-oc4737ngt-skd8w-worker-a-bjcr8.c.openshift-qe.internal [openshift-realtime]
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (OpenShift Container Platform 4.7.50 bug fix update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:1698