Bug 1512295
| Summary: | Tuned fails to isolate the pmd processes with regex by using tuned scheduler | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Maxim Babushkin <mbabushk> |
| Component: | tuned | Assignee: | Jaroslav Škarvada <jskarvad> |
| Status: | CLOSED ERRATA | QA Contact: | qe-baseos-daemons |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.4 | CC: | aasmith, atelang, atheurer, fbaudin, jeder, jraju, jskarvad, mbabushk, oblaut, olysonek, pmorey, salmy, skramaja, supadhya, tcerna, thozza, yrachman |
| Target Milestone: | rc | Keywords: | Patch, Upstream, ZStream |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | tuned-2.10.0-0.1.rc1.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | Previously, Tuned migrated threads during core isolation even if you had blacklisted the threads from being migrated. With this update, the scheduler plug-in has been fixed to filter thread names through the same blacklist and whitelist filters as process names. As a result, Tuned no longer migrates blacklisted threads. | | |
| Story Points: | --- | | |
| Clone Of: | | | |
| : | 1592373, 1598031, 1601350 | Environment: | |
| Last Closed: | 2018-10-30 10:48:57 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1592373, 1598031, 1601350 | | |
| Attachments: | | | |
Description
Maxim Babushkin
2017-11-12 13:09:15 UTC
Created attachment 1351184 [details]
/proc/cmdline
Created attachment 1351185 [details]
cpu-partitioning-variables.conf
Created attachment 1351186 [details]
/etc/systemd/system.conf
Created attachment 1351187 [details]
/usr/lib/tuned/cpu-partitioning/tuned.conf
Created attachment 1351188 [details]
/usr/lib/tuned/cpu-partitioning/script.sh
Created attachment 1351189 [details]
tuned.log
Sorry for the delay; I will provide an update soon.

As I thought, it's caused by the systemd CPUAffinity setting. The systemd CPUAffinity setting is applied to the 'init' process and inherited by all forked processes, including your 'pmd' threads. That's why they run only on the non-isolated cores.
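This inheritance is easy to demonstrate outside of systemd or Tuned. A minimal sketch (assuming Linux and CPython, whose os module exposes the sched_setaffinity/sched_getaffinity calls):

```python
# Sketch: a child process inherits its parent's CPU affinity mask, the same
# way every service inherits the mask systemd sets on init via CPUAffinity=.
import os

os.sched_setaffinity(0, {0, 1})     # pin the parent, as "CPUAffinity=0 1" would
pid = os.fork()
if pid == 0:
    # Child: never called sched_setaffinity itself, yet is confined to {0, 1}.
    print("child affinity:", os.sched_getaffinity(0))
    os._exit(0)
os.waitpid(pid, 0)
```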
I tried with five 'pmd' processes:
# tuna -CP | grep pmd
9595 OTHER 0 0xffffff 273 1 pmd1
9604 OTHER 0 0xffffff 271 1 pmd2
9617 OTHER 0 0xffffff 268 1 pmd3
9629 OTHER 0 0xffffff 267 1 pmd4
9640 OTHER 0 0xffffff 266 2 pmd5
It's the original affinity.
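For reference, the 0xffffff column in the tuna output above is the affinity bitmask. A quick sketch (an illustrative helper, not a tuna/tuned function) of how a cpulist maps to such a mask on this 24-core machine:

```python
# Sketch: expand a cpulist like "2-23" into a set of CPUs and render it as a
# hex bitmask, matching the 0xffffff-style affinity values tuna prints.
def cpulist_to_mask(spec: str) -> int:
    cpus = set()
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return sum(1 << c for c in cpus)

print(hex(cpulist_to_mask("0-23")))  # 0xffffff -> free to run on all 24 cores
print(hex(cpulist_to_mask("0,1")))   # 0x3      -> confined to cores 0 and 1
```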
With tuned-2.9.0-1 there is no need to edit the cpu-partitioning profile; just set the isolated cores in /etc/tuned/cpu-partitioning-variables.conf. My testing machine has 24 cores, and I set it the following way:
# cat /etc/tuned/cpu-partitioning-variables.conf
# Examples:
# isolated_cores=2,4-7
# isolated_cores=2-23
#
# To disable the kernel load balancing in certain isolated CPUs:
# no_rebalance_cores=5-10
isolated_cores=2-23
Now just activate the cpu-partitioning profile (if it is not already active):
# tuned-adm profile cpu-partitioning
Check the affinity:
# tuna -CP |grep pmd
9595 OTHER 0 0xffffff 488 1 pmd1
9604 OTHER 0 0xffffff 487 1 pmd2
9617 OTHER 0 0xffffff 484 1 pmd3
9629 OTHER 0 0xffffff 483 1 pmd4
9640 OTHER 0 0xffffff 482 2 pmd5
Affinity unaffected as expected.
Now check it after reboot:
# reboot
# tuna -CP |grep pmd
886 OTHER 0 0,1 132 6 pmd2
895 OTHER 0 0,1 132 0 pmd3
898 OTHER 0 0,1 133 1 pmd4
899 OTHER 0 0,1 132 1 pmd5
915 OTHER 0 0,1 131 3 pmd1
The pmd processes are started before Tuned, and their affinity was changed by systemd as expected (i.e. not what you want).
Systemd doesn't have anything like blacklisting; it just sets the affinity of the init process. So you have to disable the 'systemd' Tuned plugin to get the Tuned blacklisting feature working. Copy the cpu-partitioning profile to /etc/tuned (i.e. make it user-customizable):
# cp -av /usr/lib/tuned/cpu-partitioning /etc/tuned/
Apply the following patch to /etc/tuned/cpu-partitioning to disable the systemd plugin:
diff --git a/cpu-partitioning/tuned.conf b/cpu-partitioning/tuned.conf
index 3c52215..d95a1e6 100644
--- a/cpu-partitioning/tuned.conf
+++ b/cpu-partitioning/tuned.conf
@@ -38,8 +38,8 @@ kernel.timer_migration = 1
/sys/devices/virtual/workqueue/cpumask = ${not_isolated_cpumask}
/sys/devices/system/machinecheck/machinecheck*/ignore_ce = 1
-[systemd]
-cpu_affinity=${not_isolated_cores_expanded}
+#[systemd]
+#cpu_affinity=${not_isolated_cores_expanded}
[script]
priority=5
Stop Tuned (this allows Tuned to perform a full rollback and remove the initrd):
# systemctl stop tuned
Start Tuned again:
# systemctl start tuned
Check that the systemd CPUAffinity is unset (it should be commented out):
# cat /etc/systemd/system.conf | grep CPUAffinity
#CPUAffinity=1 2
Check that the systemd CPUAffinity in the dracut-generated initrd is also unset (commented out):
# lsinitrd -f etc/systemd/system.conf /boot/tuned-initrd.img | grep CPUAffinity
#CPUAffinity=1 2
Now reboot:
# reboot
Check the affinity of pmd processes:
# tuna -CP |grep pmd
954 OTHER 0 0xffffff 91 2 pmd3
958 OTHER 0 0xffffff 88 3 pmd5
959 OTHER 0 0xffffff 88 3 pmd4
960 OTHER 0 0xffffff 88 2 pmd1
964 OTHER 0 0xffffff 88 3 pmd2
Now it works as expected.
Limitation: Tuned currently doesn't change the affinity of newly created processes (processes created after Tuned is started). There is an RFE for it.
We may consider adding a new Tuned profile utilizing the cpu-partitioning settings but without the systemd plugin; otherwise the blacklisting will not work after a reboot.
Thanks for the detailed analysis, Jaroslav. Now that we need to introduce a new profile with the systemd affinity commented out, could we name it "cpu-partitioning-affinity"? Let's take Andrew's view too before concluding.

If we comment out the systemd affinity, won't we lose the affinity for everything else, which would defeat the purpose of using cpu-partitioning? The only problem I see with the original profile and its implementation is that systemd starts the tuned service too late. It should be started as one of the first services, so subsequent services (like openvswitch) can start with the new affinity, but also break out of the affinity with things like sched_setaffinity().

(In reply to Andrew Theurer from comment #11)
> If we comment out the systemd affinity, won't we lose the affinity for
> everything else, which would defeat the purpose of using cpu-partitioning?

At the moment Tuned does one-shot re-pinning, but it has perf support and monitors newly created processes, so it could also re-pin them according to the regex. It's not yet implemented, but it should be easy to add.

I don't believe the regex itself is broken for cpu-partitioning. For example:
/usr/lib/tuned/cpu-partitioning/tuned.conf:
[scheduler]
isolated_cores=${isolated_cores}
ps_blacklist=.*pmd.*;.*PMD.*;^DPDK;.*qemu-kvm.*
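For illustration, here is a sketch of how a ';'-separated ps_blacklist like the one above can be matched against "comm" names (assumed match-anchored semantics for the patterns; this is not a copy of tuned's actual code):

```python
# Sketch: match a ';'-separated list of regexes against process/thread names.
import re

blacklist = ".*pmd.*;.*PMD.*;^DPDK;.*qemu-kvm.*"    # ';' separates the patterns
patterns = [re.compile(p) for p in blacklist.split(";")]

def is_blacklisted(comm: str) -> bool:
    # re.match anchors at the start of the string, hence the leading '.*'
    return any(p.match(comm) for p in patterns)

print(is_blacklisted("pmd53"))         # True  -> left alone by the scheduler plugin
print(is_blacklisted("ovs-vswitchd"))  # False -> re-pinned to non-isolated cores
```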
# systemctl start openvswitch
# ovs_pid=`pgrep ovs-vswitchd`; for i in `/bin/ls /proc/$ovs_pid/task`; do echo `cat /proc/$i/comm` affinity: `taskset -pc $i` ; done
pmd53 affinity: pid 41004's current affinity list: 5
pmd52 affinity: pid 41005's current affinity list: 29
# systemctl restart tuned
# ovs_pid=`pgrep ovs-vswitchd`; for i in `/bin/ls /proc/$ovs_pid/task`; do echo `cat /proc/$i/comm` affinity: `taskset -pc $i` ; done | grep pmd
pmd53 affinity: pid 41004's current affinity list: 0-3
pmd52 affinity: pid 41005's current affinity list: 0-3
# systemctl restart ovs-vswitchd
# ovs_pid=`pgrep ovs-vswitchd`; for i in `/bin/ls /proc/$ovs_pid/task`; do echo `cat /proc/$i/comm` affinity: `taskset -pc $i` ; done | grep pmd
pmd53 affinity: pid 42351's current affinity list: 5
pmd52 affinity: pid 42352's current affinity list: 29
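Incidentally, the repeated shell one-liner above can be written as a more readable helper. A sketch in Python (the process name "ovs-vswitchd" is taken from the transcripts; this just walks /proc):

```python
# Sketch: print the comm and CPU affinity of every thread of a named process,
# equivalent to the taskset one-liner used in the transcripts above.
import os

def thread_affinities(name: str) -> None:
    for pid in filter(str.isdigit, os.listdir("/proc")):
        try:
            with open(f"/proc/{pid}/comm") as f:
                if f.read().strip() != name:
                    continue
        except OSError:
            continue                                  # process exited mid-scan
        for tid in os.listdir(f"/proc/{pid}/task"):
            with open(f"/proc/{pid}/task/{tid}/comm") as f:
                comm = f.read().strip()
            cpus = sorted(os.sched_getaffinity(int(tid)))
            print(f"{comm} (tid {tid}): affinity {cpus}")

thread_affinities("ovs-vswitchd")
```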
<edit tuned config>
[scheduler]
isolated_cores=${isolated_cores}
ps_blacklist=ovs-vswitchd
# systemctl restart tuned
# ovs_pid=`pgrep ovs-vswitchd`; for i in `/bin/ls /proc/$ovs_pid/task`; do echo `cat /proc/$i/comm` affinity: `taskset -pc $i` ; done | grep pmd
pmd52 affinity: pid 43372's current affinity list: 5
pmd53 affinity: pid 43373's current affinity list: 29
So I believe the regex is working, just not on the right name: ps_blacklist needs to check each thread's "comm", not just the main process's comm.
(In reply to Andrew Theurer from comment #13)
> So I believe the regex is working, just not on the right name: ps_blacklist
> needs to check each thread's "comm", not just the main process's comm.

At the moment it checks the process "comm" against the regex. If it matches the blacklist, the process and all its threads are ignored. If the process doesn't match, no additional checks are done and the process and all its threads are processed. We will extend the code so that when the process doesn't match, the individual threads are also checked against the blacklist.

(In reply to Jaroslav Škarvada from comment #9)
> The pmd processes are started before Tuned, and their affinity was changed
> by systemd as expected (i.e. not what you want).

To be clear, the affinity of the pmd processes was not *changed* by systemd. All the CPUAffinity setting in /etc/systemd/system.conf does is set the affinity of the init process. All other processes (including the pmd ones) *inherit* that affinity.

> We may consider adding a new Tuned profile utilizing the cpu-partitioning
> settings but without the systemd plugin; otherwise the blacklisting will
> not work after a reboot.

This would not work either in the current state of the profile, because the ps_blacklist option doesn't contain a regex for the init process. So when tuned applies the cpu-partitioning profile, it sets the affinity of the init process to the non-isolated cores, and any process started after that (e.g. pmd) inherits that affinity. So it behaves the same as if the CPUAffinity option were set.

(In reply to Andrew Theurer from comment #11)
> The only problem I see with the original profile and its implementation is
> that systemd starts the tuned service too late. It should be started as one
> of the first services, so subsequent services (like openvswitch) can start
> with the new affinity, but also break out of the affinity with things like
> sched_setaffinity().

Do I understand correctly that openvswitch calls sched_setaffinity() to set its own CPU affinity? Or perhaps systemd sets its CPU affinity?

(In reply to Jaroslav Škarvada from comment #14)
> We will extend the code so that when the process doesn't match, the
> individual threads are also checked against the blacklist.

Yes, I think this is all that needs to be done to resolve this bug.

(In reply to Jaroslav Škarvada from comment #12)
> At the moment Tuned does one-shot re-pinning, but it has perf support and
> monitors newly created processes, so it could also re-pin them according to
> the regex. It's not yet implemented, but it should be easy to add.

We should probably add that for completeness.

Upstream commit that should fix the problem:
https://github.com/redhat-performance/tuned/commit/72c94b3614a48d14fa635b6b369010ff29ab9619

If a process passes through the whitelist/blacklist, its threads are matched through the same whitelist/blacklist, and the threads that pass are processed. If the process doesn't pass through the whitelist/blacklist, its threads are not matched and not processed.
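To illustrate, here is a sketch of that two-level filtering, distilled from the description above (assumed semantics, not the actual tuned source; an empty whitelist is treated here as "allow everything"):

```python
# Sketch: processes are filtered first; only if a process passes the filters
# are its individual threads filtered, using the same whitelist/blacklist.
import re

def passes(comm, whitelist, blacklist):
    if any(re.match(p, comm) for p in blacklist):
        return False                          # blacklisted -> leave it alone
    if whitelist and not any(re.match(p, comm) for p in whitelist):
        return False
    return True

def threads_to_repin(process_comm, thread_comms, whitelist, blacklist):
    if not passes(process_comm, whitelist, blacklist):
        return []                             # process filtered out: skip all threads
    return [t for t in thread_comms           # the per-thread check is the new part
            if passes(t, whitelist, blacklist)]

# ovs-vswitchd itself passes the blacklist, but its pmd* threads are now skipped:
print(threads_to_repin("ovs-vswitchd", ["pmd52", "handler1"],
                       whitelist=[], blacklist=[r".*pmd.*"]))   # -> ['handler1']
```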
The implementation could be improved, e.g. we could have two groups of lists, one for processes and a second for threads, but I think even this implementation should satisfy the requirements.

Scratch build for testing: tuned-2.9.0-1.20180518git72c94b36.el7.x86_64
Available from: https://jskarvad.fedorapeople.org/tuned/devel/repo/
Please test and let me know.

Looks good to me, but based on comment#13, we should probably add "ovs-vswitchd" to ps_blacklist, right?

(In reply to Ondřej Lysoněk from comment #17)
> Looks good to me, but based on comment#13, we should probably add
> "ovs-vswitchd" to ps_blacklist, right?

I thought just the pmd threads were the issue. Andrew? Do we also need to update the blacklist?

I have used the specified package to validate the issue, and the issue is fixed. Even after restarting the tuned service, the PMD threads' affinity is not altered. I have also verified it by rebooting the overcloud node, and it works. But I observe one behavior: on updating to the new package (provided by Jaroslav), the PMD threads' affinity is changed to all CPUs (0-87), and after updating the tuned package I had to restart ovs to restore the actual PMD affinity. I am not sure if this is a new issue or an existing one, but this could force a reboot (or at least a restart of ovs after a tuned update). Thoughts?

--------------------------------------------------------------

#### Before installing the new package

[root@overcloud-computeovsdpdk-0 heat-admin]# ovs_pid=`pgrep ovs-vswitchd`; for i in `/bin/ls /proc/$ovs_pid/task`; do echo `cat /proc/$i/comm` affinity: `taskset -pc $i` ; done | grep pmd
pmd93 affinity: pid 676288's current affinity list: 11
pmd92 affinity: pid 676289's current affinity list: 10
pmd94 affinity: pid 676290's current affinity list: 23
pmd95 affinity: pid 676291's current affinity list: 22

#### After installing the tuned package

[root@overcloud-computeovsdpdk-0 heat-admin]# yum install -q tuned-2.9.0-1.20180518git72c94b36.el7.noarch.rpm -y
[root@overcloud-computeovsdpdk-0 heat-admin]# ovs_pid=`pgrep ovs-vswitchd`; for i in `/bin/ls /proc/$ovs_pid/task`; do echo `cat /proc/$i/comm` affinity: `taskset -pc $i` ; done | grep pmd
pmd93 affinity: pid 676288's current affinity list: 0-87
pmd92 affinity: pid 676289's current affinity list: 0-87
pmd94 affinity: pid 676290's current affinity list: 0-87
pmd95 affinity: pid 676291's current affinity list: 0-87
[root@overcloud-computeovsdpdk-0 heat-admin]# systemctl restart tuned
[root@overcloud-computeovsdpdk-0 heat-admin]# ovs_pid=`pgrep ovs-vswitchd`; for i in `/bin/ls /proc/$ovs_pid/task`; do echo `cat /proc/$i/comm` affinity: `taskset -pc $i` ; done | grep pmd
pmd93 affinity: pid 676288's current affinity list: 0-87
pmd92 affinity: pid 676289's current affinity list: 0-87
pmd94 affinity: pid 676290's current affinity list: 0-87
pmd95 affinity: pid 676291's current affinity list: 0-87

#### Restart ovs to restore the PMD thread affinity

[root@overcloud-computeovsdpdk-0 heat-admin]# systemctl restart openvswitch
[root@overcloud-computeovsdpdk-0 heat-admin]# ovs_pid=`pgrep ovs-vswitchd`; for i in `/bin/ls /proc/$ovs_pid/task`; do echo `cat /proc/$i/comm` affinity: `taskset -pc $i` ; done | grep pmd
pmd93 affinity: pid 678889's current affinity list: 11
pmd92 affinity: pid 678890's current affinity list: 10
pmd94 affinity: pid 678891's current affinity list: 23
pmd95 affinity: pid 678892's current affinity list: 22

#### The actual issue of this BZ is fixed: restarting tuned retains the PMD threads' affinity

[root@overcloud-computeovsdpdk-0 heat-admin]# systemctl restart tuned
[root@overcloud-computeovsdpdk-0 heat-admin]# ovs_pid=`pgrep ovs-vswitchd`; for i in `/bin/ls /proc/$ovs_pid/task`; do echo `cat /proc/$i/comm` affinity: `taskset -pc $i` ; done | grep pmd
pmd93 affinity: pid 678889's current affinity list: 11
pmd92 affinity: pid 678890's current affinity list: 10
pmd94 affinity: pid 678891's current affinity list: 23
pmd95 affinity: pid 678892's current affinity list: 22
The problem is in the way the isolated_cores tuning of the scheduler plugin is reverted. Currently, when reverting the tuning, tuned changes the affinity of all processes so that they can run on all cores. So when tuned is restarted after the upgrade of the tuned package, the old tuned allows the pmd threads to run on all cores, and then when the new version of tuned starts, it ignores the pmd threads (because of the ps_blacklist regex) and leaves their affinity as is. I'm afraid we cannot fix this problem in tuned, because the new version of tuned has no way of knowing what the correct affinity of the pmd threads should be. But I will change the way the isolated_cores tuning is reverted so that these problems don't happen in the future. I'll make tuned remember the affinity of all processes when applying the tuning and revert to the previous affinity when the tuning is unapplied.

(In reply to Ondřej Lysoněk from comment #20)
> I'll make tuned remember the affinity of all processes when applying the
> tuning and revert to the previous affinity when the tuning is unapplied.

Are you planning to implement save and restore? As per my understanding, this issue did not show up earlier because the blacklist was not working for tasks. Since the enhancement adds a check for tasks as well, it will now hit this issue. We need save and restore, else it will result in a regression after a tuned package update.

(In reply to Saravanan KR from comment #21)
> Are you planning to implement save and restore?

Yes, I am.

> As per my understanding, this issue did not show up earlier because the
> blacklist was not working for tasks. Since the enhancement adds a check for
> tasks as well, it will now hit this issue. We need save and restore, else it
> will result in a regression after a tuned package update.

I didn't quite understand what you're asking, but let me try: the save and restore feature is meant to address the problem you described in comment#19 (it is a separate issue, unrelated to the issue described in this bug's description). However, the feature will not help with the problem of upgrading from the old tuned version that you already have to a new tuned version which implements save and restore. You will need to reboot/restart ovs after upgrading tuned. That's because, as far as I can see, there's no reasonable way to prevent that problem by changing the tuned package. The save and restore feature can only help when upgrading from the new tuned version (which implements save and restore) to an even newer version. I hope that gives you an answer.
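A minimal sketch of the save-and-restore idea described above (an assumed design, not the actual tuned implementation): snapshot each thread's affinity before re-pinning and restore exactly that on revert, instead of widening everything to all cores.

```python
# Sketch: remember per-thread affinity at apply time and restore it on revert
# (assumed design only; tids and allowed_cpus come from the scheduler plugin).
import os

saved = {}                                       # tid -> original affinity set

def apply_isolation(tids, allowed_cpus):
    for tid in tids:
        saved[tid] = os.sched_getaffinity(tid)   # snapshot before touching it
        os.sched_setaffinity(tid, allowed_cpus)

def revert_isolation():
    for tid, cpus in saved.items():
        try:
            os.sched_setaffinity(tid, cpus)      # put back what we found
        except ProcessLookupError:
            pass                                 # thread exited in the meantime
    saved.clear()
```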
Regarding this BZ, the upstream fix is already available as mentioned in comment #16. What is the plan for the tuned package release downstream? As this is linked with providing post-install.yaml in the RHOSP NFV docs, I would like to plan accordingly based on the availability of the fix in the tuned package.

(In reply to Saravanan KR from comment #23)
> Regarding this BZ, the upstream fix is already available as mentioned in
> comment #16. What is the plan for the tuned package release downstream?

It will be addressed by the RHEL-7.6 rebase. NFV errata will follow, bug 1588388.

*** Bug 1592373 has been marked as a duplicate of this bug. ***

Hi, I'm not able to reproduce this bug because OpenStack is missing in RHEL. I've never used it, so if there is somebody who can test it, it would really help me a lot.

# tuna -t ovs-vswitchd -CP | grep pmd
# echo $?
1

What do I need to get ovs-vswitchd threads? I used: tuned-2.8.0-5.el7.noarch, tuna-0.13-5.el7.noarch

@Saravanan: You wrote that it is pending QE verification. What else is necessary to test it? With z-stream I have no problem if somebody tries this feature with the z-stream rpm package. I can do other work (regression tests, tps, filelists, ...) and prepare it for deployment from the QE point of view.

This BZ was verified with the following rpms:
openvswitch-2.9.0-54.el7fdp.x86_64
tuned-2.9.0-1.el7_5.2.noarch
tuned-profiles-cpu-partitioning-2.9.0-1.el7_5.2.noarch
Kernel 3.10.0-862.11.6.el7.x86_64
RHOSP release 2018-08-22.2

Thanks, Yariv, for your testing.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3172