Bug 1395899

Summary: cpu-partitioning: set workqueue affinity early
Product: Red Hat Enterprise Linux 7 Reporter: Luiz Capitulino <lcapitulino>
Component: tuned Assignee: Jaroslav Škarvada <jskarvad>
Status: CLOSED ERRATA QA Contact: Tereza Cerna <tcerna>
Severity: unspecified Docs Contact: Lenka Kimlickova <lkimlick>
Priority: unspecified    
Version: 7.4 CC: bhu, jeder, jskarvad, psklenar, tcerna
Target Milestone: rc Keywords: Patch, Upstream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: tuned-2.8.0-1.el7 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2017-08-01 12:32:51 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On: 1395860, 1414098    
Bug Blocks: 1394932, 1400961    

Description Luiz Capitulino 2016-11-16 21:35:38 UTC
The cpu-partitioning profile should set the workqueue affinity mask (/sys/devices/virtual/workqueue/cpumask) very early in boot, right when the initrd takes control. This prevents module initialization code from running on CPUs isolated with CPUAffinity, where it can create long-lived timers that will fire later on those CPUs.

The following procedure creates a dracut module to do that (which should be adopted by the cpu-partitioning profile):

1. # mkdir /usr/lib/dracut/modules.d/01cpumask

2. Create module-setup.sh with the following contents:

#!/bin/bash

check() {
        return 0
}

depends() {
        return 0
}

install() {
        inst_hook pre-udev 00 "$moddir/workqueue-cpumask.sh"
}

3. Create workqueue-cpumask.sh with the following contents:

#!/bin/bash

cpumask=MASK
file=/sys/devices/virtual/workqueue/cpumask

log()
{
        echo "$1" >> /dev/kmsg
}

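# dracut sources hook scripts, so "return" is valid here
if [ ! -f "$file" ]; then
        log "ERROR: $file does not exist, cannot set workqueue cpumask"
        return
fi

if ! echo "$cpumask" > "$file"; then
        log "ERROR: could not write cpumask '$cpumask' to $file"
fi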

4. Make both files executable and re-generate the initrd with dracut, so that they are included in the initrd image
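Step 4 can be sketched roughly as follows. This is only an illustration, assuming the two files from steps 2 and 3 were placed in the directory created in step 1; the DRY_RUN variable and the run/install_cpumask_module helpers are names invented for this example:

```shell
#!/bin/bash
# Sketch of step 4. DRY_RUN, run() and install_cpumask_module() are
# hypothetical names used only in this example.
moddir=/usr/lib/dracut/modules.d/01cpumask

# Run a command, or only print it when DRY_RUN=1 is set.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

install_cpumask_module() {
    # Both scripts must be executable to be picked up by dracut.
    run chmod +x "$moddir/module-setup.sh" "$moddir/workqueue-cpumask.sh" || return 1
    # -f overwrites the existing initramfs image for the running kernel.
    run dracut -f || return 1
}
```

Calling `DRY_RUN=1 install_cpumask_module` only prints the commands. After a real run, something like `lsinitrd | grep workqueue-cpumask` should show whether the hook ended up in the image.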

A few important notes:

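1. "MASK" should describe the housekeeping CPUs, written as a hexadecimal CPU bitmask (for example, 3 for CPUs 0 and 1), which is the format the cpumask file accepts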
2. The log() function should be improved with better error messages and maybe a different log location
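Since the workqueue cpumask file takes a hexadecimal bitmask rather than a CPU list, a small helper can do the conversion. The following is a minimal sketch, not what the profile ships; the function name cpulist_to_mask is made up for this example, and it deliberately handles only CPUs 0-31 (larger systems need the mask split into comma-separated 32-bit groups, which is not covered here):

```shell
#!/bin/bash
# Sketch: turn a CPU list such as "0-1,4" into a hex bitmask.
# cpulist_to_mask is a hypothetical helper; it only supports CPUs 0-31.
cpulist_to_mask() {
    local list=$1 mask=0 part first last cpu
    local IFS=','
    for part in $list; do
        # "4" yields first=last=4, "0-1" yields first=0 and last=1
        first=${part%-*}
        last=${part#*-}
        for ((cpu = first; cpu <= last; cpu++)); do
            if ((cpu < 0 || cpu > 31)); then
                echo "cpulist_to_mask: CPU $cpu out of range" >&2
                return 1
            fi
            # Set the bit that corresponds to this CPU.
            ((mask |= 1 << cpu))
        done
    done
    printf '%x\n' "$mask"
}
```

For example, `cpulist_to_mask 0-1,4` prints 13, which could then be used in place of MASK.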

Comment 1 Jeremy Eder 2016-11-17 12:48:39 UTC
Hmm, are you proposing this be part of tuned itself? I can see other profiles requiring the same things, so I'd hope so.

Comment 2 Luiz Capitulino 2016-11-17 13:50:58 UTC
My initial idea was to make it part of the profile. That is, script.sh could implement the steps from the description in start() and then stop() could just remove 01cpumask and re-generate the initrd image.

But having it as part of tuned looks like a good idea too. I think the only problem would be to come up with a good abstraction in tuned.conf for setting this up given that dracut modules have several different options as to when the script is run in initrd.

Comment 3 Jeremy Eder 2016-11-17 13:56:51 UTC
Understood.  To me this sounds like "add dracut support to tuned" and then the first user of said feature would be this bugzilla.

Comment 4 Jaroslav Škarvada 2017-03-01 15:51:50 UTC
I think this could be neatly resolved by a static initrd overlay (with the help of RFE bug 1414098) together with the cpuaffinity kernel/systemd boot option (which was proposed in https://github.com/systemd/systemd/issues/5368). Even now, when CPUAffinity is not supported by systemd, it can be introduced in our scripts, e.g.:

[bootloader]
cmdline=cpuaffinity=DESIRED_AFFINITY
initrd_add_dir=tuned-overlay.img

$ cat tuned-overlay.img/usr/lib/dracut/hooks/pre-udev/00-tuned-pre-udev.sh
#!/bin/sh

type getarg >/dev/null 2>&1 || . /lib/dracut-lib.sh

cpuaffinity="$(getargs cpuaffinity)"

echo "tuned setting CPUAffinity to $cpuaffinity" > /dev/kmsg

----
Well, you need CPUMask, so we could either calculate CPUMask from the CPUAffinity in the script, or just inject the CPUMask into the cmdline instead of the CPUAffinity, e.g.:

[bootloader]
cmdline=cpumask=DESIRED_CPUMASK
initrd_add_dir=tuned-overlay.img

$ cat tuned-overlay.img/usr/lib/dracut/hooks/pre-udev/00-tuned-pre-udev.sh
#!/bin/sh

type getarg >/dev/null 2>&1 || . /lib/dracut-lib.sh

cpumask="$(getargs cpumask)"

file=/sys/devices/virtual/workqueue/cpumask

log()
{
        echo "$1" >> /dev/kmsg
}

if ! echo $cpumask > $file 2>/dev/null; then
       log "ERROR: could not write cpumask"
fi

To avoid conflicts, we can name the cmdline parameter e.g. tuned.cpumask or tuned-cpumask or whatever. This all should now work out of the box.

Also you can pre-generate initrd.img fully statically and use the following:

[bootloader]
cmdline=cpumask=DESIRED_CPUMASK
initrd_add_img=tuned-overlay.img

I.e. Tuned will not re-generate the tuned-overlay.img, because it's not needed.
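On a booted system, the value that was passed on the kernel command line can be looked up without dracut-lib.sh. The sketch below is only an illustration of that idea; get_cmdline_arg is a hypothetical helper, not part of Tuned or dracut, and the parameter name is whatever was configured in the [bootloader] section:

```shell
#!/bin/bash
# Sketch: extract the value of KEY=VALUE from a kernel command line.
# get_cmdline_arg is a hypothetical helper; the second argument defaults
# to the contents of /proc/cmdline.
get_cmdline_arg() {
    local key=$1 cmdline=${2:-$(cat /proc/cmdline)} word value=
    for word in $cmdline; do
        case $word in
            # If the key appears more than once, the last occurrence wins.
            "$key"=*) value=${word#"$key"=} ;;
        esac
    done
    # Fail (non-zero status) when the key is absent or has an empty value.
    [ -n "$value" ] && printf '%s\n' "$value"
}
```

For instance, `get_cmdline_arg tuned.cpumask` would print the mask if the parameter had been named tuned.cpumask, and return a non-zero status otherwise.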

Comment 5 Jaroslav Škarvada 2017-03-03 18:06:21 UTC
Upstream commit introducing functionality mentioned in comment 4:
https://github.com/redhat-performance/tuned/commit/64decfe7568007cb2fde3a679ef05e366014214f

Available in:
tuned-2.7.1-1.20170303git64decfe7.el7

from:
https://jskarvad.fedorapeople.org/tuned/devel/repo/

Comment 6 Luiz Capitulino 2017-03-06 19:19:15 UTC
This is working as expected. I can see in dmesg that the workqueue cpumask is being set. Also, the initial zero-loss test case passed. I'm now running a long-duration test.
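Besides looking at dmesg, the result can be checked by comparing the expected mask against the sysfs file. A minimal sketch follows, assuming the file may print the mask zero-padded and with commas between 32-bit groups; normalize_mask and check_workqueue_mask are names invented for this example:

```shell
#!/bin/bash
# Sketch: compare an expected hex mask with the workqueue cpumask file.
# normalize_mask and check_workqueue_mask are hypothetical helpers.
normalize_mask() {
    local m=${1//,/}    # drop group separators
    m=${m,,}            # lowercase hex digits (needs bash 4 or later)
    # Strip leading zeros but keep at least one digit.
    while [[ $m == 0?* ]]; do
        m=${m#0}
    done
    printf '%s\n' "$m"
}

check_workqueue_mask() {
    local expected=$1 file=${2:-/sys/devices/virtual/workqueue/cpumask}
    local actual
    # Status 2 means the file could not be read at all.
    actual=$(cat "$file") || return 2
    [ "$(normalize_mask "$actual")" = "$(normalize_mask "$expected")" ]
}
```

`check_workqueue_mask 3` succeeds when the file holds a mask equivalent to 3 (CPUs 0 and 1) and fails otherwise; the optional second argument lets the check run against a different file.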

Comment 7 Luiz Capitulino 2017-03-07 14:02:10 UTC
My tests passed, very well done!

The only detail about this one is that I think we should change the kernel parameter "tuned.cpumask" to something more meaningful. This parameter will be visible for the entire lifetime of the system and "cpumask" is too generic.

Here are some suggestions (in order of preference):

tuned.housekeeping_mask
tuned.workqueue_mask
tuned.early_workqueue_mask
tuned.non_isolcpus

Comment 8 Jaroslav Škarvada 2017-03-07 14:17:51 UTC
(In reply to Luiz Capitulino from comment #7)
> My tests passed, very well done!
> 
> The only detail about this one is that I think we should change the kernel
> parameter "tuned.cpumask" to something more meaningful. This parameter will
> be visible for the entire lifetime of the system and "cpumask" is too
> generic.
> 
> Here are some suggestions (in order of preference):
> 
> tuned.housekeeping_mask
> tuned.workqueue_mask
> tuned.early_workqueue_mask
> tuned.non_isolcpus

I like the most:
tuned.non_isolcpus

because, in theory, we can reuse it later for other things as well, not only for the workqueue mask.

Comment 9 Luiz Capitulino 2017-03-07 14:22:03 UTC
tuned.non_isolcpus it is :)

Comment 10 Jaroslav Škarvada 2017-03-07 14:40:49 UTC
Upstream commit changing the kernel boot command line parameter to tuned.non_isolcpus:
https://github.com/redhat-performance/tuned/commit/6906738603b92f34f042486a8e9e924ac8531a2a

Available in:
tuned-2.7.1-1.20170307git69067386.el7

Comment 12 Tereza Cerna 2017-04-26 10:21:23 UTC
Verified in:
    tuned-2.8.0-2.el7.noarch
PASS


::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:: [   LOG    ] :: Check setting of workqueue affinity [BZ#1395899]
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

:: [   PASS   ] :: Check message in dmesg (Expected 0, got 0)
:: [   PASS   ] :: Check isolates cores in cpumask file (Expected 0, got 0)
:: [   LOG    ] :: Duration: 1s
:: [   LOG    ] :: Assertions: 2 good, 0 bad
:: [   PASS   ] :: RESULT: Check setting of workqueue affinity [BZ#1395899]

Comment 16 errata-xmlrpc 2017-08-01 12:32:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2102