Description of problem:
NTO relies on the Tuned [service] plugin to start/restart/disable the stalld unit file. However, if the stalld service is started after a Tuned Pod tried to disable the service, the disablement never happens.

Version-Release number of selected component (if applicable):
4.6 -- current.

How reproducible:
Rare -- race.

Steps to Reproduce:
1. Create a Tuned profile for stalld with something like:

[service]
service.stalld=stop,disable

Actual results:
The stalld service may be running even though it was specifically disabled.

Expected results:
The stalld service is stopped/disabled on the host.

Additional info:
https://bugzilla.redhat.com/show_bug.cgi?id=1923726
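
The expected result above can be checked on the node by comparing a [service] directive with what systemd reports. The following is only a minimal sketch, not part of NTO or Tuned: the function names (expected_state, state_matches, check_service) are hypothetical, the mapping of start/stop/enable/disable to systemd states is an assumption made for illustration, and check_service is meant to be run on the host (for example after `chroot /host` in a debug pod).

```shell
# Hypothetical helper: translate a Tuned [service] directive such as
# "service.stalld=stop,disable" into "<unit> <expected active> <expected enabled>".
# The body runs in a subshell so the IFS change does not leak to the caller.
expected_state() (
    line=$1
    unit=${line#service.}      # drop the "service." prefix
    unit=${unit%%=*}           # keep only the unit name before "="
    actions=${line#*=}         # comma-separated actions after "="
    active=unknown
    enabled=unknown
    IFS=,
    for action in $actions; do
        case $action in
            start)   active=active ;;
            stop)    active=inactive ;;
            enable)  enabled=enabled ;;
            disable) enabled=disabled ;;
        esac
    done
    echo "$unit $active $enabled"
)

# "unknown" means the directive said nothing about that aspect, so any
# observed value is accepted; otherwise the values must be equal.
state_matches() {
    [ "$1" = "unknown" ] || [ "$1" = "$2" ]
}

# Assumption: run on the host, where systemctl talks to the host's systemd.
check_service() (
    set -- $(expected_state "$1")
    unit=$1
    want_active=$2
    want_enabled=$3
    got_active=$(systemctl is-active "$unit" 2>/dev/null)
    got_enabled=$(systemctl is-enabled "$unit" 2>/dev/null)
    if state_matches "$want_active" "$got_active" &&
       state_matches "$want_enabled" "$got_enabled"; then
        echo "OK: $unit is $got_active/$got_enabled"
    else
        echo "MISMATCH: $unit is $got_active/$got_enabled, expected $want_active/$want_enabled"
        exit 1
    fi
)
```

On an affected node, `check_service service.stalld=stop,disable` would be expected to report a mismatch when the race described above is hit. A unit in the "failed" state also counts as a mismatch in this sketch, which is stricter than "not running".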
$ oc project openshift-cluster-node-tuning-operator
Now using project "openshift-cluster-node-tuning-operator" on server "https://api.skordas302.qe.devcluster.openshift.com:6443".

$ oc get clusterversions.config.openshift.io
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-01-143026   True        False         30h     Cluster version is 4.8.0-0.nightly-2021-03-01-143026

$ node=$(oc get nodes | grep -m 1 worker | cut -f 1 -d ' ') && echo $node
ip-10-0-146-220.us-east-2.compute.internal

$ pod=$(oc get pods -n openshift-cluster-node-tuning-operator -o wide | grep $node | cut -d ' ' -f 1) && echo $pod
tuned-kq2kh

$ oc label node $node node-role.kubernetes.io/worker-rt=
node/ip-10-0-146-220.us-east-2.compute.internal labeled

$ oc create -f- <<EOF
> apiVersion: machineconfiguration.openshift.io/v1
> kind: MachineConfigPool
> metadata:
>   name: worker-rt
>   labels:
>     worker-rt: ""
> spec:
>   machineConfigSelector:
>     matchExpressions:
>       - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-rt]}
>   nodeSelector:
>     matchLabels:
>       node-role.kubernetes.io/worker-rt: ""
> EOF
machineconfigpool.machineconfiguration.openshift.io/worker-rt created

# stalld enabled
$ oc create -f- <<EOF
> apiVersion: tuned.openshift.io/v1
> kind: Tuned
> metadata:
>   name: openshift-realtime
>   namespace: openshift-cluster-node-tuning-operator
> spec:
>   profile:
>   - data: |
>       [main]
>       summary=Custom OpenShift realtime profile
>       include=openshift-node,realtime
>       [variables]
>       # isolated_cores take a list of ranges; e.g. isolated_cores=2,4-7
>       isolated_cores=1
>       #isolate_managed_irq=Y
>       not_isolated_cores_expanded=${f:cpulist_invert:${isolated_cores_expanded}}
>       [bootloader]
>       cmdline_ocp_realtime=+systemd.cpu_affinity=${not_isolated_cores_expanded}
>       [service]
>       service.stalld=start,enable
>     name: openshift-realtime
>
>   recommend:
>   - machineConfigLabels:
>       machineconfiguration.openshift.io/role: "worker-rt"
>     priority: 20
>     profile: openshift-realtime
> EOF
tuned.tuned.openshift.io/openshift-realtime created

$ oc get mcp
NAME        CONFIG                                                UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master      rendered-master-8d3c9cf8802995989eb130e398294563      True      False      False      3              3                   3                     0                      30h
worker      rendered-worker-9d6ebebd33cd11d80eb54c8f514d02e8      True      False      False      2              2                   2                     0                      30h
worker-rt   rendered-worker-rt-97e4e0b4040760c9e1a90fff7fc5a9f9   True      False      False      1              1                   1                     0                      15m

$ oc logs $pod
...
2021-03-03 20:39:57,066 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-realtime' applied

$ oc debug node/$node
Starting pod/ip-10-0-146-220us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.146.220
If you don't see a command prompt, try pressing enter.
sh-4.4# ps auxww | grep stalld
root        3591  0.5  0.0   7920  2728 ?        Ss   20:39   0:04 /usr/local/bin/stalld --systemd -p 1000000000 -r 10000 -d 3 -t 20 --log_syslog --log_kmsg --foreground --pidfile /run/stalld.pid
root        9416  0.0  0.0   9184   988 pts/0    S+   20:54   0:00 grep stalld
sh-4.4# exit
exit

Removing debug pod ...

# stalld is running - as expected!

# disabling stalld
$ oc edit tuned openshift-realtime
# edit service.stalld=start,enable -> service.stalld=stop,disable

$ oc debug node/$node
Starting pod/ip-10-0-146-220us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.146.220
If you don't see a command prompt, try pressing enter.
sh-4.4# ps auxww | grep stalld
root        3775  0.0  0.0   9184   972 pts/0    S+   20:57   0:00 grep stalld
sh-4.4# exit
exit

Removing debug pod ...

# stalld is not running - as expected!

# enable back stalld
$ oc edit tuned openshift-realtime
# edit service.stalld=stop,disable -> service.stalld=start,enable

$ oc debug node/$node
Starting pod/ip-10-0-146-220us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.146.220
If you don't see a command prompt, try pressing enter.
sh-4.4# ps auxww | grep stalld
root        3632  0.7  0.0   7996  2700 ?        Ss   21:03   0:00 /usr/local/bin/stalld --systemd -p 1000000000 -r 10000 -d 3 -t 20 --log_syslog --log_kmsg --foreground --pidfile /run/stalld.pid
root        4008  0.0  0.0   9184  1076 pts/0    S+   21:04   0:00 grep stalld
sh-4.4# exit
exit

Removing debug pod ...

# stalld enabled - as expected!

No problems after enabling and disabling stalld through Tuned multiple times.
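
Because the original failure was a race, a single `ps` check right after editing the profile may miss a late start or stop of stalld. A minimal sketch of a polling helper that repeats a check for a bounded number of attempts; the name wait_for is hypothetical and was not used in the verification above.

```shell
# Hypothetical helper: run a command once per second until it succeeds or the
# given number of attempts is used up.
# Usage: wait_for <attempts> <command> [args...]
wait_for() {
    tries=$1
    shift
    while [ "$tries" -gt 0 ]; do
        if "$@"; then
            return 0               # the check passed
        fi
        tries=$((tries - 1))
        # Sleep only if another attempt is still to come.
        [ "$tries" -gt 0 ] && sleep 1
    done
    return 1                       # the check never passed
}
```

Inside the debug pod, something like `wait_for 30 sh -c '! pgrep -x stalld >/dev/null'` could replace the manual check after switching to stop,disable, and `wait_for 30 pgrep -x stalld` the check after switching back to start,enable.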
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:2438