Bug 2024682 - patch pipeline--worker nodes unexpectedly reboot during scale out
Summary: patch pipeline--worker nodes unexpectedly reboot during scale out
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: OpenShift Container Platform
Classification: Red Hat
Component: Node Tuning Operator
Version: 4.7
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: 4.7.z
Assignee: Jiří Mencák
QA Contact: liqcui
URL:
Whiteboard:
Depends On: 2030353
Blocks:
 
Reported: 2021-11-18 16:40 UTC by Alvaro Soto
Modified: 2022-07-21 16:57 UTC
CC List: 9 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 2029371
Environment:
Last Closed: 2022-05-12 18:12:28 UTC
Target Upstream Version:
Embargoed:




Links
Github openshift/cluster-node-tuning-operator pull 352 (open): Bug 2024682: controller: update MC after application by TuneD -- last updated 2022-04-26 17:28:25 UTC
Red Hat Product Errata RHBA-2022:1698 -- last updated 2022-05-12 18:12:37 UTC

Description Alvaro Soto 2021-11-18 16:40:08 UTC
What problem/issue/behavior are you having trouble with?  What do you expect to see?
Worker nodes that already have the latest machine config should not be updated when scaling out. In reality, we see un-targeted nodes going through MachineConfig updates, and the machine-config daemon logs show updates between two rendered MachineConfigs.

Where are you experiencing the behavior? What environment?
OCP 4.7.4

When does the behavior occur? Frequency? Repeatedly? At certain times?
Repeatedly when performing scale out operations.

What information can you provide around timeframes and the business impact?
This impacts our ability to predict and avoid tenant application impact when performing updates.

Comment 2 Bob Fournier 2021-11-18 17:16:36 UTC
Changing assignee to Ben for MCO team.

Comment 3 Ben Nemec 2021-11-18 17:46:01 UTC
This may be running on the baremetal platform, but it appears to be a general machine-config issue on scaleout. Based on my reading of the case it sounds like there's some sort of conflict between MCO and PAO (the latter of which I'm not familiar with), so this may not be an MCO bug at all. I'll let the MCO team make that determination though.

Comment 5 Yu Qi Zhang 2021-11-26 21:32:55 UTC
The customer case is pretty long. Distilled down, it covers roughly 4 must-gather timelines of different approaches hitting this issue. Fundamentally, it appears that in the customer scenario, scaling (?) nodes in their setup causes the MCO to flip between different rendered configs. Note that the MCO constantly resyncs all MCs, so my first guess is that the PAO-generated configs (and corresponding labelling?) -> kubeletconfig -> machineconfig chain is not being handled correctly, i.e. there is a bug in the PAO?

For simplicity of analysis purposes, I used the 2 must-gathers in this comment (the pre/post must-gather): https://access.redhat.com/support/cases/#/case/03070861/discussion?attachmentId=a092K00002vKCEGQA4

The flip happens between rendered-workerperf-8567689b42ea7dabfb3118dafe136d96 and rendered-workerperf-cb815f01e9974630dcf1c991b9eeccc3:

1. The rendered-workerperf-cb815f01e9974630dcf1c991b9eeccc3 that the system wants to be on has these additions:
  /usr/local/bin/stalld
  /usr/local/bin/throttlectl.sh
  stalld.service
  karg set:
  - skew_tick=1
  - nohz=on
  - rcu_nocbs=4-39,44-79
  - tuned.non_isolcpus=0000ffff,ffff0000,00000f00,0000000f
  - intel_pstate=disable
  - nosoftlockup
  - tsc=nowatchdog
  - intel_iommu=on
  - iommu=pt
  - isolcpus=managed_irq,4-39,44-79
  - systemd.cpu_affinity=0,1,2,3,40,41,42,43,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,106,107,108,109,110,111
  - default_hugepagesz=2M

2. The one that it flipped to temporarily has none of those changes but was, interestingly, generated 11 days ago, whereas the "correct" one was generated 74 days ago, meaning that the action that generated it happened 11 days ago. I see no MCs created at that time, meaning that an existing MC was changed.

Those MCs are managed by the NTO. The MachineConfig 50-nto-workerperf.yaml has a last update timestamp of 2021-11-17T14:52:38Z, meaning that the NTO did in fact perform some kind of update to it (likely triggering your reboot rollout).
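
(As an aside, on a live cluster the relative ages of the MachineConfigs can be checked with a standard listing; the command below is just a suggestion, the analysis above was done from the must-gathers:)

# List MachineConfigs sorted by creation time to spot a recently
# (re)generated rendered config.
oc get machineconfigs --sort-by=.metadata.creationTimestamp \
  -o custom-columns=NAME:.metadata.name,CREATED:.metadata.creationTimestamp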

Passing this to the NTO team to see what triggered those updates. In the eyes of the MCO, it is behaving as expected.

Comment 6 Jiří Mencák 2021-11-29 17:48:43 UTC
Thank you for the report.

While I cannot reproduce this right now, having reviewed the relevant parts of the NTO code, I believe that given the right timing (openshift-tuned pods not yet running during node startup) the issues described above are likely caused by the NTO.  I'm currently on leave -- I'll start working on a fix upon my return.

As a workaround, please pause the MCP during the scaleups.
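
For reference, a pool can be paused and later unpaused with something like the following (the pool name is illustrative):

# Pause the pool so the MCO does not start a rollout (and reboots) while
# nodes are being scaled up or relabelled; "workerperf" is an example name.
oc patch mcp/workerperf --type merge -p '{"spec":{"paused":true}}'

# ... perform the scale-out ...

# Unpause once the pool's desired configuration has settled.
oc patch mcp/workerperf --type merge -p '{"spec":{"paused":false}}'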

Comment 7 Jiří Mencák 2021-12-06 10:20:54 UTC
Back from leave.  Many thanks Yu Qi for pointing me in the right direction.

A minimal (NTO-only) reproducer for the QE:

1) Install a two worker node cluster.

2) Add both worker nodes into the same machine config pool.
git clone https://github.com/openshift/cluster-node-tuning-operator
cd cluster-node-tuning-operator/test/e2e/reboots

node1="your-first-worker-node"	# CHANGE THIS
node2="your-second-worker-node"	# CHANGE THIS
profileRealtime="../testing_manifests/stalld.yaml"
mcpRealtime="../../../examples/realtime-mcp.yaml"
nodeLabelRealtime="node-role.kubernetes.io/worker-rt"

oc label no $node1 ${nodeLabelRealtime}= --overwrite
oc label no $node2 ${nodeLabelRealtime}= --overwrite
oc create -f $profileRealtime
oc create -f $mcpRealtime

3) Scale-up the cluster by one node.
oc scale machineset/$your_machine_set --replicas=$existing_number_plus_one -n openshift-machine-api

4) Watch the scale-up and, as soon as the new node's Tuned Profile becomes available, add the new node into the same machine config pool.

For example, by using this script:
new_node_get() {
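  # Poll the Tuned Profile CRs; as soon as a profile name shows up that was
  # not present in the previous poll, return the corresponding node name.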
  local nodes nodes_old n

  nodes=$(oc get profile --no-headers -o name -n openshift-cluster-node-tuning-operator)

  while true;
  do
    nodes_old="$nodes"
    nodes=$(oc get profile --no-headers -o name -n openshift-cluster-node-tuning-operator)
    for n in $nodes
    do
      test "${nodes_old//$n/}" == "${nodes_old}" && break 2
    done
    sleep 1
  done

  printf "${n##*/}"
}

new_node=$(new_node_get)

date
echo "oc label no $new_node node-role.kubernetes.io/worker-rt="
oc label no $new_node node-role.kubernetes.io/worker-rt=

5) Watch one of the old nodes get unnecessarily rebooted and check the NTO logs for messages such as:

... updated MachineConfig 50-nto-worker-rt with ignition and kernel parameters: []

The [] is the key; this should not happen in this case.
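
One way to watch for it as the label is applied (the command is only a suggestion; the operator runs in the openshift-cluster-node-tuning-operator namespace):

# Tail the operator log and look for MachineConfig updates; an update with an
# empty kernel parameter list ("[]") is the spurious one.
oc logs -f deployment/cluster-node-tuning-operator \
  -n openshift-cluster-node-tuning-operator | grep 'updated MachineConfig'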

Upstream fix for 4.10: https://github.com/openshift/cluster-node-tuning-operator/pull/293

Comment 10 Jiří Mencák 2021-12-17 12:18:39 UTC
This BZ has been fixed in all the recent OCP releases down to 4.8.z.  As 4.7 is in maintenance mode and only
qualified Critical and Important Security Advisories (RHSAs) and Urgent and Selected High Priority
Bug Fix Advisories (RHBAs) may be released (see: https://access.redhat.com/support/policy/updates/openshift),
I'm closing this BZ.

Comment 13 Colin Walters 2022-04-22 15:56:47 UTC
We're discussing this bug in https://bugzilla.redhat.com/show_bug.cgi?id=2075126 around a 4.7 cluster.

Hmm.  So as I understand it...the bug here isn't actually *scaling up* i.e. new nodes - the actual trigger is changing the pool associated with a node.

Would it work around the bug to pause the machineconfigpools until all adjustments to the pool assigned to nodes are complete?

Comment 14 Jiří Mencák 2022-04-22 16:52:05 UTC
(In reply to Colin Walters from comment #13)
> We're discussing this bug in
> https://bugzilla.redhat.com/show_bug.cgi?id=2075126 around a 4.7 cluster.
> 
> Hmm.  So as I understand it...the bug here isn't actually *scaling up* i.e.
> new nodes - the actual trigger is changing the pool associated with a node.
> 
> Would it work around the bug to pause the machineconfigpools until all
> adjustments to the pool assigned to nodes are complete?

Yes, I believe that would work.  However, you'd want to wait until the NTO
operands apply the new config and report back to the operator the calculated
kernel arguments.  This should not take longer than a few seconds.
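
A rough sketch of that sequence (pool and node names are placeholders, and the operator log message is simply used as the signal that the calculated kernel arguments were reported back):

# 1) Pause the affected pool(s) so pool-membership changes do not trigger an
#    immediate rollout/reboot.
oc patch mcp/workerperf --type merge -p '{"spec":{"paused":true}}'

# 2) Make the pool/label adjustments, e.g. label the newly added node.
oc label node $new_node node-role.kubernetes.io/worker-rt=

# 3) Wait until the operator log shows the NTO-owned MachineConfig updated
#    with the expected (non-empty) kernel parameters -- usually a few seconds.
oc logs deployment/cluster-node-tuning-operator \
  -n openshift-cluster-node-tuning-operator | grep 'MachineConfig 50-nto-'

# 4) Unpause the pool so the MCO applies the single, correct config.
oc patch mcp/workerperf --type merge -p '{"spec":{"paused":false}}'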

Comment 19 liqcui 2022-05-05 13:30:43 UTC
[ocpadmin@ec2-18-217-45-133 nto]$ oc get clusterversion
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.7.0-0.nightly-2022-05-02-180457   True        False         85m     Cluster version is 4.7.0-0.nightly-2022-05-02-180457
[ocpadmin@ec2-18-217-45-133 nto]$ oc logs -f cluster-node-tuning-operator-5dcc9c8d68-4kr9q
I0505 11:34:53.908545       1 main.go:25] Go Version: go1.15.14
I0505 11:34:53.908719       1 main.go:26] Go OS/Arch: linux/amd64
I0505 11:34:53.908729       1 main.go:27] node-tuning Version: v4.7.0-202204262116.p0.gddbc574.assembly.stream-0-g291a1f1-dirty
I0505 11:34:53.912771       1 controller.go:1030] trying to become a leader
I0505 11:34:53.927535       1 leaderelection.go:243] attempting to acquire leader lease openshift-cluster-node-tuning-operator/node-tuning-operator-lock...
I0505 11:34:53.930841       1 controller.go:1103] current leader: cluster-node-tuning-operator-5dcc9c8d68-4kr9q_d68c65d1-26b8-472a-aef5-f7ffbae7c6c8
I0505 11:35:32.503773       1 leaderelection.go:253] successfully acquired lease openshift-cluster-node-tuning-operator/node-tuning-operator-lock
I0505 11:35:32.503944       1 controller.go:1090] became leader: cluster-node-tuning-operator-5dcc9c8d68-4kr9q_563fa332-dde8-458b-b55c-56ab8f7bed1c
I0505 11:35:32.504002       1 controller.go:959] starting Tuned controller
I0505 11:35:32.804440       1 controller.go:1011] started events processor/controller
I0505 11:36:06.734833       1 controller.go:581] created profile liqcui-oc4737ngt-skd8w-worker-b-klwbz.c.openshift-qe.internal [openshift-node]
I0505 11:36:07.409652       1 controller.go:581] created profile liqcui-oc4737ngt-skd8w-worker-c-bnlxp.c.openshift-qe.internal [openshift-node]
I0505 11:36:12.055225       1 controller.go:581] created profile liqcui-oc4737ngt-skd8w-worker-a-qrj9x.c.openshift-qe.internal [openshift-node]
I0505 12:45:42.167888       1 controller.go:496] updated Tuned rendered
I0505 12:45:42.191841       1 controller.go:615] updated profile liqcui-oc4737ngt-skd8w-worker-a-qrj9x.c.openshift-qe.internal [openshift-realtime]
I0505 12:45:43.948819       1 controller.go:668] created MachineConfig 50-nto-worker-rt with kernel parameters: [skew_tick=1 isolcpus=1 intel_pstate=disable nosoftlockup tsc=nowatchdog systemd.cpu_affinity=0,2,3]
I0505 12:53:40.840752       1 controller.go:615] updated profile liqcui-oc4737ngt-skd8w-worker-b-klwbz.c.openshift-qe.internal [openshift-realtime]
I0505 12:53:58.474204       1 controller.go:581] created profile liqcui-oc4737ngt-skd8w-worker-f-j9lsp.c.openshift-qe.internal [openshift-node]
I0505 12:53:59.438575       1 controller.go:615] updated profile liqcui-oc4737ngt-skd8w-worker-f-j9lsp.c.openshift-qe.internal [openshift-realtime]
I0505 13:16:05.376665       1 controller.go:273] deleted Profile liqcui-oc4737ngt-skd8w-worker-f-j9lsp.c.openshift-qe.internal
I0505 13:19:03.467807       1 controller.go:581] created profile liqcui-oc4737ngt-skd8w-worker-a-bjcr8.c.openshift-qe.internal [openshift-node]
I0505 13:19:04.225443       1 controller.go:615] updated profile liqcui-oc4737ngt-skd8w-worker-a-bjcr8.c.openshift-qe.internal [openshift-realtime]

Comment 21 errata-xmlrpc 2022-05-12 18:12:28 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (OpenShift Container Platform 4.7.50 bug fix update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:1698

