Bug 1394965

Summary: cpu-partitioning: CPUAffinity not set in initrd
Product: Red Hat Enterprise Linux 7
Reporter: Luiz Capitulino <lcapitulino>
Component: tuned
Assignee: Jaroslav Škarvada <jskarvad>
Status: CLOSED ERRATA
QA Contact: Tereza Cerna <tcerna>
Severity: unspecified
Docs Contact: Lenka Kimlickova <lkimlick>
Priority: high
Version: 7.4
CC: atragler, bhu, jeder, jskarvad, psklenar, tcerna
Target Milestone: rc
Keywords: Patch, Upstream, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: tuned-2.8.0-1.el7
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1439157
Environment:
Last Closed: 2017-08-01 12:32:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1414098
Bug Blocks: 1394932, 1400961, 1439157

Description Luiz Capitulino 2016-11-14 22:01:28 UTC
Description of problem:

The cpu-partitioning profile sets up CPUAffinity in the system, but not in the initrd image. As a result, the initrd runs without CPUAffinity, so systemd initialization code runs on isolated CPUs during initrd time and creates timers that keep firing for the lifetime of the system.

The solution to this problem is to update the initrd image any time system.conf is changed (e.g. by running "dracut -f"), so that the new system.conf is pulled into the initrd image.

Version-Release number of selected component (if applicable): tuned-2.7.1-5.el7fdp


How reproducible:

Reproducing the problem I was having, i.e. the firing of timers created during initrd time, is difficult, as it involves early tracing and matching the created timers with the timers that are firing on isolated CPUs.

However, it's possible to check for configuration correctness:

1. Activate the cpu-partitioning profile
2. Check the CPUAffinity setting in /etc/systemd/system.conf
3. Unpack the current kernel's initrd image and check whether the CPUAffinity setting in its etc/systemd/system.conf differs from the system's one
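
The check in steps 2 and 3 can be sketched in shell. This is only an illustration, not part of tuned: the function names get_cpu_affinity and same_cpu_affinity are hypothetical, and the sketch simply compares the uncommented CPUAffinity lines of two system.conf files without interpreting how systemd combines them.

```shell
#!/bin/bash
# Hypothetical helpers for comparing CPUAffinity between two system.conf files.

# Read a system.conf on stdin and print the value of every uncommented
# CPUAffinity assignment, one per line. Lines starting with '#' do not match.
get_cpu_affinity() {
    sed -n 's/^[[:space:]]*CPUAffinity[[:space:]]*=[[:space:]]*//p'
}

# same_cpu_affinity FILE_A FILE_B
# Returns 0 when both files carry the same uncommented CPUAffinity values.
same_cpu_affinity() {
    [ "$(get_cpu_affinity < "$1")" = "$(get_cpu_affinity < "$2")" ]
}

# Possible usage, assuming the initrd copy of system.conf has already been
# extracted to ./initrd-system.conf (for example from lsinitrd output):
#   same_cpu_affinity /etc/systemd/system.conf ./initrd-system.conf \
#       || echo "CPUAffinity differs between system and initrd"
```

A mismatch, or an empty result for the initrd copy, indicates the configuration problem described in this report.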

Comment 1 Jaroslav Škarvada 2017-02-14 16:37:24 UTC
After my own mini-research and a discussion with the systemd guys, it seems the only way to resolve this is to update the initrd.

Comment 2 Luiz Capitulino 2017-02-14 16:48:55 UTC
Jaroslav,

Updating the initrd image works, but it's not doable in practice because we have to update the initrd image for all installed kernels and that takes too long (and actually breaks tuned).

A much better solution is to do the following:

1. Implement initrd image layering in tuned as described in bug 1414098
2. Use image layering to pull in only the /etc/systemd/system.conf file in the initrd image

Comment 3 Jaroslav Škarvada 2017-02-14 16:54:01 UTC
(In reply to Luiz Capitulino from comment #2)
> Jaroslav,
> 
> Updating the initrd image works, but it's not doable in practice because we
> have to update the initrd image for all installed kernels and that takes too
> long (and actually breaks tuned).
> 
> A much better solution is to the the following:
> 
> 1. Implement initrd image layering in tuned as described in bug 1414098
> 2. Use image layering to pull in only the /etc/systemd/system.conf file in
> the initrd image

OK working on it, I tried to point out that it cannot be fixed in systemd itself or addressed as systemd RFE. Systemd is trying to update the affinity for everything (except specially prefixed processes), but for timers, it technically cannot do it.

Comment 4 Luiz Capitulino 2017-02-14 16:56:33 UTC
Oh, OK. That's correct. CPUAffinity will only set the affinity of systemd itself (and by extension, all processes it starts).

Comment 5 Jaroslav Škarvada 2017-02-14 16:57:39 UTC
(In reply to Jaroslav Škarvada from comment #3)
> (In reply to Luiz Capitulino from comment #2)
> > Jaroslav,
> > 
> > Updating the initrd image works, but it's not doable in practice because we
> > have to update the initrd image for all installed kernels and that takes too
> > long (and actually breaks tuned).
> > 
> > A much better solution is to the the following:
> > 
> > 1. Implement initrd image layering in tuned as described in bug 1414098
> > 2. Use image layering to pull in only the /etc/systemd/system.conf file in
> > the initrd image
> 
> OK working on it, I tried to point out that it cannot be fixed in systemd
> itself or addressed as systemd RFE. Systemd is trying to update the affinity
> for everything (except specially prefixed processes), but for timers, it
> technically cannot do it.

I.e. the current systemd behaviour is to re-set the CPU affinity from the filesystem system.conf so that it overrides the initrd system.conf, but for timers it cannot do that.

Comment 6 Jaroslav Škarvada 2017-02-16 16:07:12 UTC
After another discussion with the systemd guys I think we came to the best solution: to propose a kernel boot command line parameter (e.g. CPUAffinity) which would be understood by systemd and would override the /etc/systemd/system.conf settings. Upstream RFE:
https://github.com/systemd/systemd/issues/5368

I think this would resolve this problem in a very clean way.

I understand you need the fix quickly and the systemd fix could take some time, so I am also going to introduce the initrd workaround in Tuned.

Comment 7 Luiz Capitulino 2017-02-16 16:22:50 UTC
Yes, the kernel command-line parameter solves this BZ. But note that we also have bug 1395899, which requires initrd support. So, we end up having to implement bug 1414098 anyway.

Comment 8 Jaroslav Škarvada 2017-03-01 15:35:10 UTC
It's preferred to solve this by the systemd RFE (comment 6) which has been accepted upstream. Until it's finished we can use the following workaround (relying on the functionality introduced by RFE bug 1414098 and RFE bug 1246172):

Profile:
[systemd]
cpu_affinity=YOUR_AFFINITY

[script]
priority=5
script=script.sh

[bootloader]
priority=10
initrd_add_dir=tuned-overlay.img

# mkdir -p tuned-overlay.img/etc/systemd
# cat script.sh
#!/bin/bash

. /usr/lib/tuned/functions

start() {
    cp /etc/systemd/system.conf tuned-overlay.img/etc/systemd/
    return 0
}

stop() {
    return 0
}

process $@

----
I.e. it introduces priority=5 for the script, which guarantees that it will run after the [systemd] plugin sets the affinity in /etc/systemd/system.conf. Then the [script] plugin copies the systemd settings to the overlay, and finally the [bootloader] plugin creates the initrd and installs it as an overlay.
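
The ordering described above can be illustrated with a small shell sketch. It is not tuned's implementation: the function name order_by_priority is hypothetical, and treating a section without a priority as priority 0 is an assumption made for the illustration (it matches the order seen in this profile, where [systemd] has no priority and runs first).

```shell
#!/bin/bash
# Hypothetical illustration of ascending-priority ordering.

# Read "section priority" pairs on stdin ("-" means no priority given)
# and print the section names in the order they would run.
order_by_priority() {
    # ASSUMPTION: a missing priority is treated as 0.
    awk '{ p = ($2 == "-" ? 0 : $2); print p, $1 }' \
        | sort -s -n -k1,1 \
        | awk '{ print $2 }'
}

# Example input mirroring the profile above:
#   printf 'bootloader 10\nscript 5\nsystemd -\n' | order_by_priority
# prints systemd, script, bootloader (one per line).
```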

Regarding the problem related to the workqueue-cpumask, I will comment in bug 1395899.

Comment 9 Luiz Capitulino 2017-03-01 18:39:37 UTC
OK, I tried the patch at the bottom but tuned doesn't create the overlay image. It does add the following to grub:

set tuned_initrd=""

But as we can see, it's empty.

Also, in /var/log/tuned/tuned.log there is:

"""
2017-03-01 13:32:22,591 INFO     tuned.plugins.plugin_script: calling script '/usr/lib/tuned/cpu-partitioning/script.sh' with arguments '['start']'
2017-03-01 13:32:22,686 INFO     tuned.plugins.plugin_bootloader: installing additional boot command line parameters to grub2
2017-03-01 13:32:22,686 ERROR    tuned.plugins.plugin_bootloader: error: cannot create initrd image, source directory '/usr/lib/tuned/latency-performance/tuned-overlay.img' doesn't exist
2017-03-01 13:32:22,690 INFO     tuned.daemon.daemon: static tuning from profile 'cpu-partitioning' applied
"""

So, it looks like tuned is confusing cpu-partitioning with latency-performance?

diff -Nparu cpu-partitioning.orig/script.sh cpu-partitioning/script.sh
--- cpu-partitioning.orig/script.sh     2017-01-10 07:56:20.000000000 -0500
+++ cpu-partitioning/script.sh  2017-03-01 13:29:10.826909265 -0500
@@ -8,6 +8,7 @@ start() {
     sed -i '/^IRQBALANCE_BANNED_CPUS=/d' /etc/sysconfig/irqbalance
     echo "IRQBALANCE_BANNED_CPUS=$TUNED_isolated_cpumask" >>/etc/sysconfig/irqbalance
     setup_kvm_mod_low_latency
+    cp /etc/systemd/system.conf tuned-overlay.img/etc/systemd/
     return "$?"
 }

diff -Nparu cpu-partitioning.orig/tuned.conf cpu-partitioning/tuned.conf
--- cpu-partitioning.orig/tuned.conf    2017-01-10 07:44:28.000000000 -0500
+++ cpu-partitioning/tuned.conf 2017-03-01 13:28:04.201401503 -0500
@@ -31,10 +31,13 @@ kernel.timer_migration = 1
 /sys/devices/system/machinecheck/machinecheck*/ignore_ce = 1

 [script]
+priority=5
 script=script.sh

 [systemd]
 cpu_affinity=${not_isolated_cores_expanded}

 [bootloader]
+priority=10
+initrd_add_dir=tuned-overlay.img
 cmdline= nohz=on nohz_full=${isolated_cores} rcu_nocbs=${isolated_cores} intel_pstate=disable nosoftlockup

Comment 10 Jaroslav Škarvada 2017-03-02 09:08:48 UTC
(In reply to Luiz Capitulino from comment #9)

Thanks for info and let me check it. Our tests for this feature ran only from /etc, but IMHO it shouldn't be hard to fix this problem.

Comment 11 Jaroslav Škarvada 2017-03-02 14:09:37 UTC
Upstream related fix:
https://github.com/redhat-performance/tuned/commit/8cef7cdb21264c5030c5f2cf5609935819a28de5

Let's try:
tuned-2.7.1-1.20170302git00746dc6.el7

Comment 12 Luiz Capitulino 2017-03-02 15:06:31 UTC
This almost worked. The only problem now is that /etc/systemd/system.conf was copied into the initrd image before it was modified to contain the CPUAffinity setting. This gives the impression that the "priority" lines are not working as expected (that is, "[bootloader]" ran before "[systemd]").

PS: I wasn't able to find tuned-2.7.1-1.20170302git00746dc6.el7 in your repo, so I just built tuned from the git repo.

Comment 13 Luiz Capitulino 2017-03-02 16:25:42 UTC
A small problem I just found out: when I switch to another profile, grub.cfg is updated correctly (that is, tuned_initrd is set to empty). However, the /boot/tuned-overlay.img file is still there. I think it should be removed.

Comment 14 Jaroslav Škarvada 2017-03-02 16:29:36 UTC
(In reply to Luiz Capitulino from comment #13)
> A small problem I just found out: when I switch to another profile, grub.cfg
> is updated correctly (that is, tuned_initrd is set to empty). However, the
> /boot/tuned-overlay.img file is still there. I think it should be removed.

It could remove it, but I left it there because it's harmless and I did not want to delete something which may be needed later. It will be overwritten again when the profile is activated.

Comment 15 Luiz Capitulino 2017-03-02 16:37:20 UTC
IMO, this is the profile leaking data. If different profiles use different initrd images, then we'll have several stale files in /boot. And /boot is not /tmp :) I think tuned profiles should always undo what they have done when they are turned off or when there's a profile switch. This includes cleaning up stale files (and I'll also update my patch to delete tuned-overlay.img/etc/systemd/system.conf).
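
A minimal sketch of the "undo what you have done" idea for the profile script, under stated assumptions: the helper names overlay_start and overlay_stop are hypothetical, and the directory layout mirrors the tuned-overlay.img directory from comment 8. It only covers the copied system.conf, not the image installed in /boot.

```shell
#!/bin/bash
# Hypothetical helpers; in a real profile script these bodies would live in
# the start() and stop() functions.

# overlay_start SYSTEM_CONF OVERLAY_DIR
# Copy the system configuration into the overlay source directory.
overlay_start() {
    mkdir -p "$2/etc/systemd" && cp "$1" "$2/etc/systemd/"
}

# overlay_stop OVERLAY_DIR
# Remove the copy again so that no stale file is left behind when the
# profile is turned off or switched.
overlay_stop() {
    rm -f "$1/etc/systemd/system.conf"
}
```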

Comment 16 Jaroslav Škarvada 2017-03-03 13:13:57 UTC
(In reply to Luiz Capitulino from comment #15)
> IMO, this is the profile leaking data. If different profiles use different
> initrd images, then we'll have several stale files in /boot. And /boot is
> not /tmp :) I think tuned profiles should always undo what they have done
> when they are turned off or when there's a profile switch. This includes
> cleaning up stale files (and I'll also update my patch to delete
> tuned-overlay.img/etc/systemd/system.conf).

Valid point, I will remove them. I wanted to track and remove them in rpms.

Comment 17 Jaroslav Škarvada 2017-03-03 14:08:27 UTC
(In reply to Jaroslav Škarvada from comment #16)
> (In reply to Luiz Capitulino from comment #15)
> > IMO, this is the profile leaking data. If different profiles use different
> > initrd images, then we'll have several stale files in /boot. And /boot is
> > not /tmp :) I think tuned profiles should always undo what they have done
> > when they are turned off or when there's a profile switch. This includes
> > cleaning up stale files (and I'll also update my patch to delete
> > tuned-overlay.img/etc/systemd/system.conf).
> 
> Valid point, I will remove them. I wanted to track and remove them in rpms.

Upstream commit:
https://github.com/redhat-performance/tuned/commit/d89ab6731b41c9beac2f11e9588916f2d0397e93

Comment 18 Luiz Capitulino 2017-03-03 14:11:59 UTC
Cool! So, I take it you're still looking at the priority problem from comment 12?

Comment 19 Jaroslav Škarvada 2017-03-03 14:16:19 UTC
(In reply to Luiz Capitulino from comment #18)
> Cool! So, I take it you're still looking at the priority problem from
> comment 12?

Trying to reproduce, sorry for delay.

Comment 20 Luiz Capitulino 2017-03-03 14:20:01 UTC
No problem! Just wanted to know we were on the same page.

Comment 21 Jaroslav Škarvada 2017-03-03 15:35:27 UTC
Hmm, I am still unable to reproduce:

script.sh:
#!/bin/sh

. /usr/lib/tuned/functions

start() {
    cp /etc/systemd/system.conf ./tuned-initrd.img/etc/systemd/
    return 0
}

stop() {
    return 0
}

process $@

---

tuned.conf:
[systemd]
cpu_affinity = 2

[script]
priority = 5
script = script.sh

[bootloader]
priority = 10
initrd_add_dir = tuned-initrd.img

---

Tuned run...

# lsinitrd -f /etc/systemd/system.conf /boot/tuned-initrd.img | grep CPUAffinity
#CPUAffinity=1 2
CPUAffinity=2

I.e. it works as expected, which is confirmed by Tuned log:
2017-03-03 16:27:56,200 INFO     tuned.plugins.plugin_systemd: setting 'CPUAffinity' to '2' in the '/etc/systemd/system.conf'
2017-03-03 16:27:56,201 INFO     tuned.plugins.plugin_script: calling script '/etc/tuned/test2/script.sh' with arguments '['start']'
2017-03-03 16:27:56,215 INFO     tuned.plugins.plugin_bootloader: generating initrd image from directory '/etc/tuned/test2/tuned-initrd.img'
2017-03-03 16:27:56,229 INFO     tuned.plugins.plugin_bootloader: installing initrd image as '/boot/tuned-initrd.img'

I.e. it patches system.conf first, then it runs the helper script, and finally it generates and installs the initrd.

Could you provide a snippet from your tuned.log?

Comment 22 Luiz Capitulino 2017-03-03 16:41:57 UTC
I have rebooted into a different profile a few times to do other work on that machine, so I'll have to reproduce again.

However, I did not set the priority for the "[script]" section so maybe it's running first? I'll fix that. But I'm taking the day off today, so this will have to wait until next week.

Thanks for the feedback.

Comment 23 Jaroslav Škarvada 2017-03-03 17:02:08 UTC
(In reply to Luiz Capitulino from comment #22)
> However, I did not set the priority for the "[script]" section so maybe it's
> running first? I'll fix that.

Yes, it's important. If no 'priority' is specified, the plugins are executed in an order which depends on the Python implementation and is unrelated to their position in tuned.conf. On my Fedora it executes in the correct order even without the priority, but that's pure coincidence.

> But I'm taking the day off today, so this will
> have to wait until next week.
> 
OK, NP, let me know next week.

Comment 24 Jaroslav Škarvada 2017-03-03 17:14:39 UTC
Improved by following upstream commit:
https://github.com/redhat-performance/tuned/commit/4eadc691fc3ae6f300087ccef893e5f5031253a2

This adds the 'initrd_remove_dir' option which, if set to 'True' (it is false by default), removes the source directory from which the initrd image is built, e.g.:

[bootloader]
initrd_remove_dir = True
initrd_add_dir = /tmp/tuned-initrd.img

This will create the initrd image from the /tmp/tuned-initrd.img directory and
then remove the tuned-initrd.img directory from /tmp.
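
The effect of these two options can be sketched in shell. This is not necessarily how tuned builds the image: the function name build_overlay and its third argument are hypothetical, and the sketch assumes that an uncompressed cpio archive in "newc" format is an acceptable overlay image.

```shell
#!/bin/bash
# Hypothetical illustration of "build the image, then remove the source".

# build_overlay SRC_DIR OUT_IMG [true|false]
# Pack SRC_DIR into OUT_IMG as a newc cpio archive; when the third argument
# is "true", remove SRC_DIR afterwards (the initrd_remove_dir behaviour).
build_overlay() {
    # The redirection is resolved in the caller's directory, before the cd.
    (cd "$1" && find . | cpio -o -H newc 2>/dev/null) > "$2" || return 1
    if [ "${3:-false}" = true ]; then
        rm -rf "$1"
    fi
}

# Possible usage:
#   build_overlay /tmp/tuned-initrd.img /boot/tuned-initrd.img true
```

The source directory is removed only after the archive has been written successfully, so a failed build leaves the directory in place for inspection.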

Comment 25 Jaroslav Škarvada 2017-03-03 17:31:26 UTC
This is a commit putting it all together in the cpu-partitioning profile:
https://github.com/redhat-performance/tuned/commit/843dc8cf5f3a9ea176e13f6d4086e909c5fbde0e

Comment 26 Jaroslav Škarvada 2017-03-03 17:34:33 UTC
Available in:
tuned-2.7.1-1.20170303git843dc8cf.el7

Comment 27 Luiz Capitulino 2017-03-06 19:17:59 UTC
This is working as expected. I can see that the initrd contains the correct system.conf file. Also, initial zero-loss test-case passed. I'm now running a long duration test.

Comment 28 Luiz Capitulino 2017-03-07 14:02:48 UTC
My tests passed, very well done!

Comment 37 errata-xmlrpc 2017-08-01 12:32:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2102