1926903 – NTO may fail to disable stalld when relying on Tuned '[service]' plugin

Summary: NTO may fail to disable stalld when relying on Tuned '[service]' plugin

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	OpenShift Container Platform
Classification:	Red Hat
Component:	Node Tuning Operator
Sub Component:
Version:	4.7
Hardware:	Unspecified
OS:	Unspecified
Priority:	medium
Severity:	medium
Target Milestone:	---
Target Release:	4.8.0
Assignee:	Jiří Mencák
QA Contact:	Simon
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1928614

Reported:	2021-02-09 16:10 UTC by Jiří Mencák
Modified:	2021-07-27 22:43 UTC
CC List:	1 user
Fixed In Version:
Doc Type:	No Doc Update
Doc Text:
Clone Of:
Environment:
Last Closed:	2021-07-27 22:43:10 UTC
Target Upstream Version:
Embargoed:

Links
System	ID	Priority	Status	Summary	Last Updated
Github	openshift cluster-node-tuning-operator pull 207	None	closed	Bug 1926903: Instantiate the stalld systemd unit as disabled.	2021-02-15 07:26:31 UTC
Github	openshift cluster-node-tuning-operator pull 211	None	open	Bug 1926903: Keep ignition units in sync with [service] plugin.	2021-02-22 15:15:12 UTC
Red Hat Product Errata	RHSA-2021:2438	None	None	None	2021-07-27 22:43:32 UTC

Description Jiří Mencák 2021-02-09 16:10:27 UTC

Description of problem:
NTO relies on the Tuned [service] plugin to start, restart, and disable the stalld systemd unit. However, if the stalld service is started after a Tuned Pod has already tried to disable it, the service is never disabled.

Version-Release number of selected component (if applicable):
4.6 -- current.

How reproducible:
Rare -- race.

Steps to Reproduce:
1. Create a Tuned profile for stalld with something like:
[service]
service.stalld=stop,disable

Actual results:
The stalld service may be running even though it was specifically disabled.

Expected results:
The stalld service is stopped and disabled on the host.
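
One way to check for the expected state on a host is to compare the output of `systemctl is-active` and `systemctl is-enabled` for the unit. The following is only a minimal sketch for a POSIX shell; `stalld_is_off` is a hypothetical helper name and is not part of NTO or Tuned, and treating `failed` and `masked` as acceptable states is an assumption.

```shell
# Hypothetical helper (not part of NTO or Tuned): succeeds only when the
# two systemctl states say the unit is both stopped and disabled.
#   $1: output of `systemctl is-active stalld`  (e.g. active, inactive, failed)
#   $2: output of `systemctl is-enabled stalld` (e.g. enabled, disabled, masked)
stalld_is_off() {
  case "$1" in
    inactive|failed) ;;   # assumption: both mean "not running"
    *) return 1 ;;
  esac
  case "$2" in
    disabled|masked) ;;   # assumption: both mean "not started at boot"
    *) return 1 ;;
  esac
  return 0
}

# Possible usage on the node, e.g. after `chroot /host` in a debug pod:
#   stalld_is_off "$(systemctl is-active stalld)" "$(systemctl is-enabled stalld)" \
#     && echo "stalld is stopped and disabled"
```

Passing the two states as arguments keeps the check separate from how the states are collected, so the same helper can be used from a debug pod or over SSH.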

Additional info:
https://bugzilla.redhat.com/show_bug.cgi?id=1923726

Comment 4 Simon 2021-03-03 21:09:33 UTC

$ oc project openshift-cluster-node-tuning-operator
Now using project "openshift-cluster-node-tuning-operator" on server "https://api.skordas302.qe.devcluster.openshift.com:6443".

$ oc get clusterversions.config.openshift.io 
NAME      VERSION                             AVAILABLE   PROGRESSING   SINCE   STATUS
version   4.8.0-0.nightly-2021-03-01-143026   True        False         30h     Cluster version is 4.8.0-0.nightly-2021-03-01-143026

$ node=$(oc get nodes | grep -m 1 worker | cut -f 1 -d ' ') && echo $node
ip-10-0-146-220.us-east-2.compute.internal

$ pod=$(oc get pods -n openshift-cluster-node-tuning-operator -o wide | grep $node | cut -d ' ' -f 1) && echo $pod
tuned-kq2kh

$ oc label node $node node-role.kubernetes.io/worker-rt=
node/ip-10-0-146-220.us-east-2.compute.internal labeled

$ oc create -f- <<EOF
> apiVersion: machineconfiguration.openshift.io/v1
> kind: MachineConfigPool
> metadata:
>  name: worker-rt
>  labels:
>    worker-rt: ""
> spec:
>  machineConfigSelector:
>    matchExpressions:
>      - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,worker-rt]}
>  nodeSelector:
>    matchLabels:
>      node-role.kubernetes.io/worker-rt: ""
> EOF
machineconfigpool.machineconfiguration.openshift.io/worker-rt created

# stalld enabled
$ oc create -f- <<EOF
> apiVersion: tuned.openshift.io/v1
> kind: Tuned
> metadata:
>  name: openshift-realtime
>  namespace: openshift-cluster-node-tuning-operator
> spec:
>  profile:
>  - data: |
>      [main]
>      summary=Custom OpenShift realtime profile
>      include=openshift-node,realtime
>      [variables]
>      # isolated_cores take a list of ranges; e.g. isolated_cores=2,4-7
>      isolated_cores=1
>      #isolate_managed_irq=Y
>      not_isolated_cores_expanded=${f:cpulist_invert:${isolated_cores_expanded}}
>      [bootloader]
>      cmdline_ocp_realtime=+systemd.cpu_affinity=${not_isolated_cores_expanded}
>      [service]
>      service.stalld=start,enable
>    name: openshift-realtime
> 
>  recommend:
>  - machineConfigLabels:
>      machineconfiguration.openshift.io/role: "worker-rt"
>    priority: 20
>    profile: openshift-realtime
> EOF
tuned.tuned.openshift.io/openshift-realtime created

$ oc get mcp
NAME        CONFIG                                                UPDATED   UPDATING   DEGRADED   MACHINECOUNT   READYMACHINECOUNT   UPDATEDMACHINECOUNT   DEGRADEDMACHINECOUNT   AGE
master      rendered-master-8d3c9cf8802995989eb130e398294563      True      False      False      3              3                   3                     0                      30h
worker      rendered-worker-9d6ebebd33cd11d80eb54c8f514d02e8      True      False      False      2              2                   2                     0                      30h
worker-rt   rendered-worker-rt-97e4e0b4040760c9e1a90fff7fc5a9f9   True      False      False      1              1                   1                     0                      15m

$ oc logs $pod
...
2021-03-03 20:39:57,066 INFO     tuned.daemon.daemon: static tuning from profile 'openshift-realtime' applied

$ oc debug node/$node
Starting pod/ip-10-0-146-220us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.146.220
If you don't see a command prompt, try pressing enter.
sh-4.4# ps auxww | grep stalld
root        3591  0.5  0.0   7920  2728 ?        Ss   20:39   0:04 /usr/local/bin/stalld --systemd -p 1000000000 -r 10000 -d 3 -t 20 --log_syslog --log_kmsg --foreground --pidfile /run/stalld.pid
root        9416  0.0  0.0   9184   988 pts/0    S+   20:54   0:00 grep stalld
sh-4.4# exit
exit

Removing debug pod ...

# stalld is running - as expected!

# disabling stalld
$ oc edit tuned openshift-realtime 
# edit service.stalld=start,enable -> service.stalld=stop,disable
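
The manual edit above could also be scripted. The following is only a sketch under stated assumptions: `set_stalld_option` is a hypothetical helper name, and it assumes the profile data is rendered with `service.stalld=...` on a line of its own (as in the manifest created earlier). It was not part of the verification run.

```shell
# Hypothetical helper: rewrites the value of the service.stalld option in
# a Tuned manifest read from stdin and writes the result to stdout.
#   $1: new value, e.g. "stop,disable" or "start,enable"
set_stalld_option() {
  # Keep the leading indentation and the option name, replace the value.
  sed -E "s/^([[:space:]]*service\.stalld=).*/\1$1/"
}

# Possible usage (assumes the current project is
# openshift-cluster-node-tuning-operator):
#   oc get tuned openshift-realtime -o yaml \
#     | set_stalld_option stop,disable \
#     | oc replace -f -
```

Only the value after `service.stalld=` is touched, so the rest of the profile, including the indentation inside the `data` block, passes through unchanged.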

$ oc debug node/$node
Starting pod/ip-10-0-146-220us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.146.220
If you don't see a command prompt, try pressing enter.
sh-4.4# ps auxww | grep stalld
root        3775  0.0  0.0   9184   972 pts/0    S+   20:57   0:00 grep stalld
sh-4.4# exit
exit

Removing debug pod ...

# stalld is not running - as expected!
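
Note that `ps auxww | grep stalld` always matches the grep process itself, so the output has to be read by eye. A stricter check could look only at the executable name. This is a minimal sketch; `stalld_in_ps` is a hypothetical helper name, and it assumes the usual `ps aux` column layout where the command starts in field 11.

```shell
# Hypothetical helper: reads `ps auxww` output on stdin and succeeds only
# if some process's executable (field 11, the start of COMMAND) is named
# stalld. A `grep stalld` process does not match, because its executable
# is grep.
stalld_in_ps() {
  awk '$11 == "stalld" || $11 ~ /\/stalld$/ { found = 1 } END { exit !found }'
}

# Possible usage on the node:
#   ps auxww | stalld_in_ps && echo "stalld is running" || echo "stalld is not running"
```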

# re-enabling stalld
$ oc edit tuned openshift-realtime 
# edit service.stalld=stop,disable -> service.stalld=start,enable

$ oc debug node/$node
Starting pod/ip-10-0-146-220us-east-2computeinternal-debug ...
To use host binaries, run `chroot /host`
Pod IP: 10.0.146.220
If you don't see a command prompt, try pressing enter.
sh-4.4# ps auxww | grep stalld
root        3632  0.7  0.0   7996  2700 ?        Ss   21:03   0:00 /usr/local/bin/stalld --systemd -p 1000000000 -r 10000 -d 3 -t 20 --log_syslog --log_kmsg --foreground --pidfile /run/stalld.pid
root        4008  0.0  0.0   9184  1076 pts/0    S+   21:04   0:00 grep stalld
sh-4.4# exit
exit

Removing debug pod ...

# stalld enabled - as expected!

No problems after enabling and disabling stalld through Tuned multiple times.

Comment 7 errata-xmlrpc 2021-07-27 22:43:10 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Moderate: OpenShift Container Platform 4.8.2 bug fix and security update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2021:2438
