Bug 1613478 - Tuned hangs with cpu-partitioning and offlined cpu(s)
Summary: Tuned hangs with cpu-partitioning and offlined cpu(s)
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: tuned
Version: 7.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: unspecified
Target Milestone: rc
Target Release: ---
Assignee: Jaroslav Škarvada
QA Contact: Tereza Cerna
Docs Contact: Marek Suchánek
URL:
Whiteboard:
Depends On:
Blocks: 1613950 1613951
Reported: 2018-08-07 15:59 UTC by Joe Mario
Modified: 2019-10-22 02:23 UTC (History)

Fixed In Version: tuned-2.10.0-2.el7
Doc Type: If docs needed, set a value
Doc Text:
Previously, setting the cpu-partitioning Tuned profile caused the tuned service to become unresponsive if a CPU was set offline. With this update, the problem has been fixed, and tuned no longer hangs after selecting cpu-partitioning when a CPU is offline.
Clone Of:
: 1613832 1613950 1613951 (view as bug list)
Environment:
Last Closed: 2018-10-30 10:50:19 UTC
Target Upstream Version:
Embargoed:



Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 3561671 0 None None None 2018-08-13 21:02:31 UTC
Red Hat Product Errata RHBA-2018:3172 0 None None None 2018-10-30 10:51:10 UTC

Description Joe Mario 2018-08-07 15:59:46 UTC
Description of problem:
If I offline a cpu and then try to set "tuned-adm profile cpu-partitioning", tuned will just hang.  

If I re-online that cpu, tuned works fine.

Version-Release number of selected component (if applicable):

This exists in RHEL 7.4, 7.5, and 7.6.

How reproducible:


Steps to Reproduce:
1. Boot a system with the default tuned (throughput-performance, balanced,
   or whatever).
2. Then offline a cpu. e.g.: # echo 0 > /sys/devices/system/cpu/cpu41/online
3. Add some other cpu(s) [not the offlined cpu] to the
   /etc/tuned/cpu-partitioning-variables.conf file.
4. Then run: # tuned-adm profile cpu-partitioning
   It will hang.
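
Because step 4 blocks indefinitely, an automated check of the steps above is easier with a deadline around the command. The following is a minimal Python 3 sketch under that assumption; `run_with_deadline` is a hypothetical helper name, not part of tuned, and a short sleep stands in for the hanging `tuned-adm` call so the example runs without root.

```python
import subprocess
import sys


def run_with_deadline(cmd, seconds):
    """Run cmd; return its exit code, or None if it is still running after `seconds`.

    Hypothetical helper for illustration only; not part of tuned.
    """
    try:
        return subprocess.run(cmd, timeout=seconds).returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing is left behind.
        return None


if __name__ == "__main__":
    # On an affected system one would pass ["tuned-adm", "profile", "cpu-partitioning"].
    # Here a sleeping interpreter stands in for the hanging command.
    stand_in = [sys.executable, "-c", "import time; time.sleep(30)"]
    status = run_with_deadline(stand_in, 1)
    print("hang suspected" if status is None else "exit code %d" % status)
```

A `None` result only says the command did not finish in time; it cannot distinguish a real hang from a slow profile switch, so the deadline should be chosen generously.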

In addition to the cpu-partitioning profile, this will likely impact
the realtime profiles.

I emailed Jaroslav on this.  He understands it and has a fix.

Actual results:


Expected results:


Additional info:

Comment 1 Jaroslav Škarvada 2018-08-07 16:04:08 UTC
From devel point of view no problem to backport to 7.5.z and 7.4.z.

To proceed further I need qa_ack, pm_ack and Z stream clones.

Comment 4 Tereza Cerna 2018-08-08 09:59:59 UTC
Hi, I'm not sure I can test three errata and have them ready for release on 14 August. There is a lot of work: I need to test this bug and do package testing and regression testing, which takes hours.

Comment 15 Tereza Cerna 2018-08-10 12:29:44 UTC
Hi, is the reproducer in the bug description complete? I followed these steps and was not able to reproduce the hang. Maybe some step is missing?


# echo 'isolated_cores=2-3' > /etc/tuned/cpu-partitioning-variables.conf
# echo 0 > /sys/devices/system/cpu/cpu4/online

# tuned-adm profile cpu-partitioning       ## this should hang, but it did not
# tuned-adm active | grep cpu-partitioning
Current active profile: cpu-partitioning

# rpm -q tuned{,-profiles-cpu-partitioning}
tuned-2.8.0-5.el7_4.2.noarch
tuned-profiles-cpu-partitioning-2.8.0-5.el7_4.2.noarch

Comment 17 Joe Mario 2018-08-10 13:44:02 UTC
Hi Tereza:
This should not need any special boot flag.  It fails on a default RHEL 7.5 installation.

Here's what I get, and that last cmd, setting it to cpu-partitioning
hangs.

# rpm -q tuned{,-profiles-cpu-partitioning}
tuned-2.9.0-1.el7_5.2.noarch
tuned-profiles-cpu-partitioning-2.9.0-1.el7_5.2.noarch
# tuned-adm active
Current active profile: throughput-performance
# egrep '^isolated' /etc/tuned/cpu-partitioning-variables.conf
isolated_cores=17
# echo 0 > /sys/devices/system/cpu/cpu4/online
# tuned-adm profile cpu-partitioning 

[the above cmd hangs]

Comment 18 Joe Mario 2018-08-10 14:08:31 UTC
Hi Jaroslav:
Do you have any thoughts on how the above scenario for Tereza did not hang?

It looks like she's on RHEL 7.4, which I assumed had the same implementation, reading which cpus are present instead of which are online.

Thanks,
Joe

Comment 20 Tereza Cerna 2018-08-10 15:22:33 UTC
I tested it on RHEL-7.5, because of ER#35664, which should be in REL_PREP on Tuesday.

Comment 21 Tereza Cerna 2018-08-10 15:23:42 UTC
Correction: I tested it on RHEL-7.4.z during testing of ER#35664.

Comment 22 Jeff Bastian 2018-08-13 21:50:52 UTC
Ok, now I'm having trouble reproducing this too on RHEL-7.4.z:

[root@intel-brickland-03 ~]# uname -r
3.10.0-693.37.4.el7.x86_64

[root@intel-brickland-03 ~]# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.4 (Maipo)

[root@intel-brickland-03 ~]# rpm -qa tuned{,-profiles-cpu-partitioning}
tuned-profiles-cpu-partitioning-2.8.0-5.el7.noarch
tuned-2.8.0-5.el7.noarch

[root@intel-brickland-03 ~]# cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list 
0,72

[root@intel-brickland-03 ~]# echo isolated_cores=0-71 >> /etc/tuned/cpu-partitioning-variables.conf

[root@intel-brickland-03 ~]# echo 0 > /sys/devices/system/cpu/cpu72/online
[  955.220873] intel_pstate CPU 72 exiting
[  955.239976] Broke affinity for irq 107
[  955.246524] smpboot: CPU 72 is now offline

[root@intel-brickland-03 ~]# tuned-adm profile cpu-partitioning

[root@intel-brickland-03 ~]# tuned-adm active
Current active profile: cpu-partitioning

[root@intel-brickland-03 ~]# echo Still alive
Still alive

Comment 23 Jeff Bastian 2018-08-14 12:47:51 UTC
This is reproducible on RHEL-7.5 with tuned-2.9

[root@dell-pet620-01 ~]# uname -r
3.10.0-862.11.6.el7.x86_64

[root@dell-pet620-01 ~]# cat /etc/redhat-release
Red Hat Enterprise Linux Server release 7.5 (Maipo)

[root@dell-pet620-01 ~]# rpm -qa tuned{,-profiles-cpu-partitioning}
tuned-2.9.0-1.el7.noarch
tuned-profiles-cpu-partitioning-2.9.0-1.el7.noarch

[root@dell-pet620-01 ~]# lscpu | grep -B1  On-line
CPU(s):                24
On-line CPU(s) list:   0-23

[root@dell-pet620-01 ~]# grep -v ^# /etc/tuned/cpu-partitioning-variables.conf 
isolated_cores=1,3,5,7,9

[root@dell-pet620-01 ~]# echo 0 > /sys/devices/system/cpu/cpu10/online 

[root@dell-pet620-01 ~]# tuned-adm profile cpu-partitioning
<<<HANG>>>
^C
Traceback (most recent call last):
  File "/usr/sbin/tuned-adm", line 94, in <module>
    result = admin.action(action_name, **options)
  File "/usr/lib/python2.7/site-packages/tuned/admin/admin.py", line 75, in action
    res = self._controller.run()
  File "/usr/lib/python2.7/site-packages/tuned/admin/dbus_controller.py", line 59, in run
    self._main_loop.run()
  File "/usr/lib64/python2.7/site-packages/gi/overrides/GLib.py", line 577, in run
    raise KeyboardInterrupt
KeyboardInterrupt

Comment 24 Jeff Bastian 2018-08-14 12:57:19 UTC
If I downgrade this RHEL-7.5 system to the tuned packages from RHEL-7.4, I can no longer reproduce the hang.

[root@dell-pet620-01 ~]# rpm -qa tuned{,-profiles-cpu-partitioning}
tuned-2.8.0-5.el7.noarch
tuned-profiles-cpu-partitioning-2.8.0-5.el7.noarch

[root@dell-pet620-01 ~]# echo 0 > /sys/devices/system/cpu/cpu10/online

[root@dell-pet620-01 ~]# tuned-adm profile cpu-partitioning

[root@dell-pet620-01 ~]# echo Still Alive
Still Alive

Comment 25 Jeff Bastian 2018-08-14 13:07:14 UTC
And if I upgrade to tuned-2.10.0-1.el7 it also works fine, so it seems the hang bug was limited to something in tuned-2.9.0.

Nevertheless, the patch is a good idea for RHEL 7.6, so tuned-2.10.0-2.el7 is the right way to go.

I guess it's up for debate if we want to fix RHEL 7.4.z or not.

Comment 26 Joe Mario 2018-08-14 13:13:39 UTC
Thanks Jeff for the above analysis.  

When I ran into the tuned hang, it occurred on RHEL 7.5.  I assumed the
problem also existed on RHEL 7.4 (because the tuned there also used 
the  /sys/devices/system/cpu/present file instead of the  
/sys/devices/system/cpu/online file).

There must be something in the RHEL 7.4 tuned version that causes this bug 
to not trigger.  But it still looks like Jaroslav's patch is warranted for 
RHEL 7.4 since the cpu information in /sys/devices/system/cpu/present 
is not correct for how tuned used it.
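
To make the difference between the two sysfs files concrete, here is a minimal Python 3 sketch that expands the kernel's cpulist notation and lists CPUs that are present but not online. The function names `parse_cpulist` and `offline_cpus` are hypothetical and this is not tuned's code; the sample strings are illustrative values for a 24-CPU machine with cpu10 offlined.

```python
def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8,10-11" into a set of ints."""
    cpus = set()
    for chunk in text.strip().split(","):
        if not chunk:
            continue
        first, _, last = chunk.partition("-")
        # A single number like "8" has no "-", so `last` is empty and we reuse `first`.
        cpus.update(range(int(first), int(last or first) + 1))
    return cpus


def offline_cpus(present_text, online_text):
    """CPUs that exist in the machine but are currently offline."""
    return parse_cpulist(present_text) - parse_cpulist(online_text)


# On a live system these strings would be read from
# /sys/devices/system/cpu/present and /sys/devices/system/cpu/online.
present = "0-23\n"
online = "0-9,11-23\n"
print(sorted(offline_cpus(present, online)))  # prints [10]
```

Any code that treats the "present" list as the set of usable CPUs will include cpu10 here, even though nothing can currently be scheduled on it.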

Comment 27 Tereza Cerna 2018-08-14 18:36:34 UTC
Hi,

I've tested tps, regression tests, filelists... and all these things are OK.

I've also reviewed the patch. Except for one typo, which is in a comment (so not problematic), the patch is OK. Two or three of my test cases use this code, and once those tests were updated to match the changes in the patch, they worked well.

But as for this bug: I wrote a new test case, /CoreOS/tuned/Regression/offlined-cpu-in-profile-cpu-partitioning, to test the problem specified in the description (tuned hangs when the user offlines some cpu(s) and selects the cpu-partitioning profile). It should hang on cpu-partitioning selection with the old packages and work when the new packages are installed, but see my results:

                    |  current result  |  expected result
    ----------------|------------------|-------------------
    rhel-7.5 (old)  |    it hangs      |    it hangs         -> maybe false positive
    rhel-7.5 (new)  |    it hangs      |    it works         -> fail
    rhel-7.4 (old)  |    it works      |    it hangs         -> fail
    rhel-7.4 (new)  |    it works      |    it works         -> maybe false positive

So, from my point of view there are two options:
  1] My testing was wrong. @jbastian, can you test it again on 7.5? I see that you were
     able to reproduce it, but please try again using the tuned-2.9.0-1.el7_5.2
     version.
  2] My testing was right and this patch works as expected (test cases passed), but it doesn't
     fix this problem.

I don't want to block the 7.5.z release, so if you are satisfied with the patch, I'll finish this erratum and push it to REL_PREP.

Comment 28 Tereza Cerna 2018-08-14 18:43:33 UTC
You can see my results from test case (I can provide full log):

=============================================================
RHEL-7.5 (new packages)
   tuned-2.9.0-1.el7_5.2.noarch.rpm 
   tuned-profiles-cpu-partitioning-2.9.0-1.el7_5.2.noarch.rpm
=============================================================

:: [ PASS ] :: Command 'echo 'isolated_cores=2-3' > /etc/tuned/cpu-partitioning-variables.conf' (Expected 0, got 0)
:: [ LOG  ] :: Offline cpu
:: [ PASS ] :: Command 'echo 0 > /sys/devices/system/cpu/cpu4/online' (Expected 0, got 0)
:: [ LOG  ] :: Select cpu-partitioning profile
:: [ LOG  ] :: Output of 'tuned-adm profile cpu-partitioning':
:: [ LOG  ] :: --------------- OUTPUT START ---------------
:: [ LOG  ] :: Operation timed out after waiting 600 seconds(s), you may try to increase timeout by using --timeout command line option or using --async.
:: [ LOG  ] :: ---------------  OUTPUT END  ---------------
:: [ FAIL ] :: Command 'tuned-adm profile cpu-partitioning' (Expected 0, got 1)
:: [ PASS ] :: Command 'tuned-adm active | grep cpu-partitioning' (Expected 0, got 0)

FAIL: Selection of cpu-partitioning was not successful, it hangs... and it shouldn't...

=============================================================
RHEL-7.5 (old packages)
   tuned-2.9.0-1.el7_5.1.noarch.rpm 
   tuned-profiles-cpu-partitioning-2.9.0-1.el7_5.1.noarch.rpm
=============================================================

:: [ PASS ] :: Command 'echo 'isolated_cores=2-3' > /etc/tuned/cpu-partitioning-variables.conf' (Expected 0, got 0)
:: [ LOG  ] :: Offline cpu
:: [ PASS ] :: Command 'echo 0 > /sys/devices/system/cpu/cpu4/online' (Expected 0, got 0)
:: [ LOG  ] :: Select cpu-partitioning profile
:: [ LOG  ] :: Output of 'tuned-adm profile cpu-partitioning':
:: [ LOG  ] :: --------------- OUTPUT START ---------------
:: [ LOG  ] :: Operation timed out after waiting 600 seconds(s), you may try to increase timeout by using --timeout command line option or using --async.
:: [ LOG  ] :: ---------------  OUTPUT END  ---------------
:: [ FAIL ] :: Command 'tuned-adm profile cpu-partitioning' (Expected 0, got 1)
:: [ PASS ] :: Command 'tuned-adm active | grep cpu-partitioning' (Expected 0, got 0)

?? PASS: Selection of cpu-partitioning was not successful, it hangs... (that's right) but it is still weird, because it also hung with the patch.
maybe FALSE POSITIVE

=============================================================
RHEL-7.4 (new packages)
   tuned-2.8.0-5.el7_4.3.noarch.rpm 
   tuned-profiles-cpu-partitioning-2.8.0-5.el7_4.3.noarch.rpm
=============================================================
   
:: [ PASS ] :: Command 'echo 'isolated_cores=2-3' > /etc/tuned/cpu-partitioning-variables.conf' (Expected 0, got 0)
:: [ LOG  ] :: Offline cpu
:: [ PASS ] :: Command 'echo 0 > /sys/devices/system/cpu/cpu4/online' (Expected 0, got 0)
:: [ LOG  ] :: Select cpu-partitioning profile
:: [ LOG  ] :: Output of 'tuned-adm profile cpu-partitioning':
:: [ LOG  ] :: --------------- OUTPUT START ---------------
:: [ LOG  ] :: ---------------  OUTPUT END  ---------------
:: [ PASS ] :: Command 'tuned-adm profile cpu-partitioning' (Expected 0, got 0)
:: [ PASS ] :: Command 'tuned-adm active | grep cpu-partitioning' (Expected 0, got 0)

?? PASS: Selection of cpu-partitioning works, it didn't hang... (that's right) but it is still weird, because it did not fail without the patch either.
maybe FALSE POSITIVE

=============================================================
RHEL-7.4 (old packages)
   tuned-profiles-cpu-partitioning-2.8.0-5.el7_4.2.noarch
   tuned-2.8.0-5.el7_4.2.noarch   
=============================================================

:: [ PASS ] :: Command 'echo 'isolated_cores=2-3' > /etc/tuned/cpu-partitioning-variables.conf' (Expected 0, got 0)
:: [ LOG  ] :: Offline cpu
:: [ PASS ] :: Command 'echo 0 > /sys/devices/system/cpu/cpu4/online' (Expected 0, got 0)
:: [ LOG  ] :: Select cpu-partitioning profile
:: [ LOG  ] :: Output of 'tuned-adm profile cpu-partitioning':
:: [ LOG  ] :: --------------- OUTPUT START ---------------
:: [ LOG  ] :: ---------------  OUTPUT END  ---------------
:: [ PASS ] :: Command 'tuned-adm profile cpu-partitioning' (Expected 0, got 0)
:: [ PASS ] :: Command 'tuned-adm active | grep cpu-partitioning' (Expected 0, got 0)

FAIL: Selection of cpu-partitioning works, it didn't hang..., but it should hang...

Comment 29 Jeff Bastian 2018-08-14 19:51:26 UTC
Your RHEL-7.4 results are good: tuned-2.8.0-* does not hang, even the old version (before the patch).  This is what I reported in comment 22 and comment 24 above. So for RHEL-7.4, as long as nothing else failed in regression testing, you can move it to Verified.

I'm more concerned about the failed result with RHEL-7.5 and the new packages.  Can you point me to the full logs and Beaker job?

Comment 30 Joe Mario 2018-08-14 20:41:03 UTC
Hi Tereza:
 I just logged onto ibm-x3500m4-01.rhts.eng.bos.redhat.com, and I see she's running with the unpatched tuned.  The /usr/lib/python2.7/site-packages/tuned/utils/commands.py still references the /sys/devices/system/cpu/present file.

Is that correctly updated?

Joe

Comment 31 Tereza Cerna 2018-08-15 07:48:44 UTC
Hi Joe,
as far as I can see, the patch was applied correctly and the file was changed:

with patch (new package):
function cpulist_invert references the file /sys/devices/system/cpu/online

without patch (old package):
function cpulist_invert references the file /sys/devices/system/cpu/present

T.
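
For readers unfamiliar with what inverting a CPU list means, the following Python 3 sketch shows how the choice of reference set changes the result. It is not tuned's actual `cpulist_invert` implementation; `invert_cpus` and `format_cpulist` are hypothetical names, and the CPU numbers mirror the test case above (isolated cores 2-3, cpu4 offlined) on an assumed 8-CPU machine.

```python
def invert_cpus(selected, universe):
    """Return the CPUs in `universe` that are not in `selected`, sorted."""
    return sorted(set(universe) - set(selected))


def format_cpulist(cpus):
    """Collapse ints into kernel cpulist notation, e.g. [0, 1, 2, 4] -> "0-2,4"."""
    parts = []
    run_start = prev = None
    for cpu in sorted(cpus):
        if prev is not None and cpu == prev + 1:
            prev = cpu  # still inside the current consecutive run
            continue
        if run_start is not None:
            parts.append(str(run_start) if run_start == prev else "%d-%d" % (run_start, prev))
        run_start = prev = cpu
    if run_start is not None:
        parts.append(str(run_start) if run_start == prev else "%d-%d" % (run_start, prev))
    return ",".join(parts)


isolated = {2, 3}
present = set(range(8))   # assumed: cpu0..cpu7 exist
online = present - {4}    # cpu4 has been offlined

print(format_cpulist(invert_cpus(isolated, present)))  # prints 0-1,4-7
print(format_cpulist(invert_cpus(isolated, online)))   # prints 0-1,5-7
```

Inverting against the "present" set yields a list that still contains the offline cpu4, while inverting against the "online" set does not.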

Comment 41 Joe Mario 2018-08-15 21:09:01 UTC
Thank you Jaroslav. 
That's a great reproducer.

Jirka Olsa:
 Do you know anything about the perf python module?

 If so, can you take a look at this?  It's pretty important, because
 anyone disabling hyperthreads due to the latest CVE, may hit this, which
 causes tuned to hang (when using the cpu-partitioning profile).

 If it's not yours, do you know who would know?

Thanks,
Joe

Comment 46 Jiri Olsa 2018-08-16 17:49:00 UTC
(In reply to Joe Mario from comment #41)
> Thank you Jaroslav. 
> That's a great reproducer.
> 
> Jirka Olsa:
>  Do you know anything about the perf python module?
> 
>  If so, can you take a look at this?  It's pretty important, because
>  anyone disabling hyperthreads due to the latest CVE, may hit this, which
>  causes tuned to hang (when using the cpu-partitioning profile).
> 
>  If it's not yours, do you know who would know?

yea, it's me.. sry for delay, but AFAICT I got CC-ed just today

there's an issue with perf python mmap interface,
I'm brewing RHEL7 build with the fix in here:
  https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=17917666

not sure how it fixes tuned, but it fixes those valgrind errors for me

I'll CC you on upstream patches

jirka

Comment 47 Jaroslav Škarvada 2018-08-17 06:33:16 UTC
(In reply to Jiri Olsa from comment #46)
> (In reply to Joe Mario from comment #41)
> > Thank you Jaroslav. 
> > That's a great reproducer.
> > 
> > Jirka Olsa:
> >  Do you know anything about the perf python module?
> > 
> >  If so, can you take a look at this?  It's pretty important, because
> >  anyone disabling hyperthreads due to the latest CVE, may hit this, which
> >  causes tuned to hang (when using the cpu-partitioning profile).
> > 
> >  If it's not yours, do you know who would know?
> 
> yea, it's me.. sry for delay, but AFAICT I got CC-ed just today
> 
> there's an issue with perf python mmap interface,
> I'm brewing RHEL7 build with the fix in here:
>   https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=17917666
> 
> not sure how it fixes tuned, but it fixes those valgrind errors for me
> 
> I'll CC you on upstream patches
> 
> jirka

Thanks,

Tereza, could you try your test with the new kernel?

Comment 48 Tereza Cerna 2018-08-17 07:16:08 UTC
Sure, I'll look at it today.

Comment 49 Jiri Olsa 2018-08-17 08:07:28 UTC
> > yea, it's me.. sry for delay, but AFAICT I got CC-ed just today
> > 
> > there's an issue with perf python mmap interface,
> > I'm brewing RHEL7 build with the fix in here:
> >   https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=17917666
> > 
> > not sure how it fixes tuned, but it fixes those valgrind errors for me
> > 
> > I'll CC you on upstream patches
> > 
> > jirka
> 
> Thanks,
> 
> Tereza, could you try your test with the new kernel?

the change is in python's perf.so module (python-perf rpm)
kernel wasn't changed

jirka

Comment 50 Tereza Cerna 2018-08-17 10:07:35 UTC
Hi, the mentioned brew build
   https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=17917666
is built for rhel-7.6, but I need to test it on rhel-7.5.z, where I reproduced the problem.
Can you build it for this release, please?

Comment 51 Jaroslav Škarvada 2018-08-17 10:49:38 UTC
(In reply to Jiri Olsa from comment #49)
> > > yea, it's me.. sry for delay, but AFAICT I got CC-ed just today
> > > 
> > > there's an issue with perf python mmap interface,
> > > I'm brewing RHEL7 build with the fix in here:
> > >   https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=17917666
> > > 
> > > not sure how it fixes tuned, but it fixes those valgrind errors for me
> > > 
> > > I'll CC you on upstream patches
> > > 
> > > jirka
> > 
> > Thanks,
> > 
> > Tereza, could you try your test with the new kernel?
> 
> the change is in python's perf.so module (python-perf rpm)
> kernel wasn't changed
> 
> jirka

Sure, but it will have to be released as a kernel erratum, because it's built from the kernel srpm.

Comment 52 Tereza Cerna 2018-08-17 11:29:04 UTC
Hi, the python-perf package provided in c#46 gave me a positive test result. It looks like this fix solves the problem.

I tried all options:
  tuned                  |  python-perf                           |  result
-----------------------------------------------------------------------------
  2.9.0-1.el7_5.2 (new)  |  3.10.0-934.el7perf_python_fix  (new)  |  PASS
  2.9.0-1.el7_5.2 (new)  |  3.10.0-862.el7                 (old)  |  FAIL
  2.9.0-1.el7_5.1 (old)  |  3.10.0-934.el7perf_python_fix  (new)  |  FAIL
  2.9.0-1.el7_5.1 (old)  |  3.10.0-862.el7                 (old)  |  FAIL

We can see that only the combination of new tuned and new python-perf leads to a positive test result. Can we also deploy the python-perf fix in 7.5.z?

====================================================
Verified in:
    python-perf-3.10.0-934.el7perf_python_fix.x86_64
    tuned-2.9.0-1.el7_5.2.noarch
    tuned-profiles-cpu-partitioning-2.9.0-1.el7_5.2.noarch
PASS
====================================================

[ PASS ] :: Command 'echo 'isolated_cores=2-3' > /etc/tuned/cpu-partitioning-variables.conf' (Expected 0, got 0)
[ LOG  ] :: Offline cpu
[ PASS ] :: Command 'echo 0 > /sys/devices/system/cpu/cpu4/online' (Expected 0, got 0)
[ LOG  ] :: Select cpu-partitioning profile
[ LOG  ] :: Output of 'tuned-adm profile cpu-partitioning':
[ LOG  ] :: --------------- OUTPUT START ---------------
[ LOG  ] :: ---------------  OUTPUT END  ---------------
[ PASS ] :: Command 'tuned-adm profile cpu-partitioning' (Expected 0, got 0)
[ PASS ] :: Command 'tuned-adm active | grep cpu-partitioning' (Expected 0, got 0)


====================================================
Reproduced in:
    python-perf-3.10.0-862.el7.x86_64
    tuned-2.9.0-1.el7_5.2.noarch
    tuned-profiles-cpu-partitioning-2.9.0-1.el7_5.2.noarch
FAIL
====================================================

[ PASS ] :: Command 'echo 'isolated_cores=2-3' > /etc/tuned/cpu-partitioning-variables.conf' (Expected 0, got 0)
[ LOG  ] :: Offline cpu
[ PASS ] :: Command 'echo 0 > /sys/devices/system/cpu/cpu4/online' (Expected 0, got 0)
[ LOG  ] :: Select cpu-partitioning profile
[ LOG  ] :: Output of 'tuned-adm profile cpu-partitioning':
[ LOG  ] :: --------------- OUTPUT START ---------------
[ LOG  ] :: Traceback (most recent call last):
[ LOG  ] :: File "/usr/sbin/tuned-adm", line 94, in <module>
[ LOG  ] :: result = admin.action(action_name, **options)
[ LOG  ] :: File "/usr/lib/python2.7/site-packages/tuned/admin/admin.py", line 75, in action
[ LOG  ] :: res = self._controller.run()
[ LOG  ] :: File "/usr/lib/python2.7/site-packages/tuned/admin/dbus_controller.py", line 59, in run
[ LOG  ] :: self._main_loop.run()
[ LOG  ] :: File "/usr/lib64/python2.7/site-packages/gi/overrides/GLib.py", line 577, in run
[ LOG  ] :: raise KeyboardInterrupt
[ LOG  ] :: KeyboardInterrupt
[ LOG  ] :: ---------------  OUTPUT END  ---------------
[ FAIL ] :: Command 'tuned-adm profile cpu-partitioning' (Expected 0, got 3)
[ PASS ] :: Command 'tuned-adm active | grep cpu-partitioning' (Expected 0, got 0)

Comment 53 Joe Mario 2018-08-17 11:55:05 UTC
> We can see that only combination of new tuned and new python-perf leads 
> to positive result of a test case. Can we deploy in 7.5.z also fix 
> in python-perf package?

Thank you Tereza for testing this.  Here's the way I see it, and Jirka, Jaroslav, Jeff, please do jump in with any additional thoughts.

1) Initially, tuned would hang when hyperthreads was disabled using the new runtime and kernel boottime flags.  This happened on RHEL 7.4z and 7.5z.  Jaroslav's patched tuned fixed those test cases for me.  But given what we've learned since then, his patch may have just masked the perf python problem.

2) Tereza's more simple test case, of just disabling one cpu, uncovered the perf python problem.  

Given we will now have RHEL 7.4z and 7.5z customers who use the new runtime and boottime flags to disable hyperthreads, if any of them are using a tuned profile that isolates cpus, they will need a fix.  Just a new tuned "might" work for them, but it looks like the perf patch will be needed for completeness.

My vote would be to:
 a) Submit the new patched tuned as soon as we can for an ancillary hot patch.
 b) Submit the patched perf python to RHEL 7.4z and 7.5z to be included in the next z-stream update in September.  If a customer needs it sooner, we'll deal with a hot patch if we have to.

Comments?
Joe

Comment 54 Jaroslav Škarvada 2018-08-17 12:06:42 UTC
(In reply to Joe Mario from comment #53)
> > We can see that only combination of new tuned and new python-perf leads 
> > to positive result of a test case. Can we deploy in 7.5.z also fix 
> > in python-perf package?
> 
> Thank you Tereza for testing this.  Here's the way I see it, and Jirka,
> Jaroslav, Jeff, please do jump in with any additional thoughts.
> 
> 1) Initially, tuned would hang when hyperthreads was disabled using the new
> runtime and kernel boottime flags.  This happened on RHEL 7.4z and 7.5z. 
> Jaroslav's patched tuned fixed those test cases for me.  But given what
> we've learned since then, his patch may have just masked the perf python
> problem.
> 
> 2) Tereza's more simple test case, of just disabling one cpu, uncovered the
> perf python problem.  
> 
> Given we will now have RHEL 7.4z and 7.5z customers who use the new runtime
> and boottime flags to disable hyperthreads, if any of them are using a tuned
> profile that isolates cpus, they will need a fix.  Just a new tuned "might"
> work for them, but it looks like the perf patch will be needed for
> completeness.
> 
> My vote would be to:
>  a) Submit the new patched tuned as soon as we can for an ancillary hot
> patch.
>  b) Submit the patched perf python to RHEL 7.4z and 7.5z to be included in
> the next z-stream update in September.  If a customer needs it sooner, we'll
> deal with a hot patch if we have to.
> 
> Comments?
> Joe

IMHO the original Tuned patch shouldn't mask the perf problem - the python perf problem was always there. The crash caused by python perf is just not easy to reproduce (unless you run it under valgrind or "get lucky"). Otherwise I agree.

Comment 55 Jiri Olsa 2018-08-17 12:13:33 UTC
(In reply to Joe Mario from comment #53)

SNIP

> patch.
>  b) Submit the patched perf python to RHEL 7.4z and 7.5z to be included in
> the next z-stream update in September.  If a customer needs it sooner, we'll
> deal with a hot patch if we have to.

7.5.z backport brewing in here:
  https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=17928924

I posted the changes upstream, and I think we need to have them
posted to Y stream first before they could be picked up by zstream.

I'll do that once they are accepted upstream. Also I'll need some
BZ for the Y stream post.

jirka

Comment 56 Tereza Cerna 2018-08-17 12:51:09 UTC
On my laptop I tried disabling one, two, and more cpus, and the functionality covered by this bugzilla works well. Do you want any other testing from tuned QE?

Comment 57 Jeff Bastian 2018-08-17 13:10:19 UTC
(In reply to Joe Mario from comment #53)
> My vote would be to:
>  a) Submit the new patched tuned as soon as we can for an ancillary hot
> patch.
>  b) Submit the patched perf python to RHEL 7.4z and 7.5z to be included in
> the next z-stream update in September.  If a customer needs it sooner, we'll
> deal with a hot patch if we have to.


I agree.  We need to update both components, but they do not necessarily need to ship together.  Let's ship the tuned update ASAP, and the kernel / python-perf update in the next z-stream batch.

Comment 59 Michael Petlan 2018-08-20 20:51:34 UTC
Hmm, unfortunately, this is a weak point in perf testing, for two reasons:
1) perf python module has almost zero test coverage
2) we don't test perf with disabled CPUs

There were (and are) various problems with sparse NUMA nodes, etc., so why not with sparse CPUs? Both (1) and (2) have to be improved.

Comment 60 Michael Petlan 2018-08-20 20:54:40 UTC
Also, if I understand it correctly, there is a bugfix needed in tuned and another in perf. Is there a corresponding bug opened against kernel/perf?

Comment 61 Joe Mario 2018-08-20 21:18:27 UTC
Hi Michael:
 > Is there a corresponding bug opened against kernel/perf?
I will be creating one as soon as I get a chance. Likely tomorrow.

> perf python module has almost zero test coverage
It apparently gets some indirect test coverage whenever tuned is used to select a profile that isolates cpus, (including cpu-partitioning and various realtime profiles).

Joe

Comment 62 Jeff Bastian 2018-08-22 14:45:08 UTC
(In reply to Joe Mario from comment #61)
>  > Is there a corresponding bug opened against kernel/perf?
> I will be creating one as soon as I get a chance. Likely tomorrow.

See bug 1619465

Comment 64 Tereza Cerna 2018-08-23 11:19:31 UTC
I tested this bug with the new package tuned-2.10.0-2.el7 and the known traceback from comment #35 appeared. The python-perf fix should also be deployed in rhel-7.6; I cloned the bug from comment #62 to the 7.6 release, see BZ#1620774.

# rpm -q tuned
tuned-2.10.0-2.el7.noarch
# tuned-adm profile cpu-partitioning
^CTraceback (most recent call last):
  File "/usr/sbin/tuned-adm", line 94, in <module>
    result = admin.action(action_name, **options)
  File "/usr/lib/python2.7/site-packages/tuned/admin/admin.py", line 75, in action
    res = self._controller.run()
  File "/usr/lib/python2.7/site-packages/tuned/admin/dbus_controller.py", line 59, in run
    self._main_loop.run()
  File "/usr/lib64/python2.7/site-packages/gi/overrides/GLib.py", line 577, in run
    raise KeyboardInterrupt
KeyboardInterrupt

Comment 65 Tereza Cerna 2018-09-06 14:00:07 UTC
I verified this bugzilla with the provided python-perf package [1], which contains the fix from #46. This fix is NECESSARY for correct behavior when some cpu is offlined.

Please deploy python-perf in RHEL-7.6, or as soon as possible, because without this fix it does not work.

[1] https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=18201528


I tried all options:
  tuned               |  python-perf                       |  result
-----------------------------------------------------------------------------
  2.10.0-4.el7 (new)  |  3.10.0-944.el7perf_python  (new)  |  PASS
  2.10.0-4.el7 (new)  |  3.10.0-693.el7             (old)  |  FAIL
  2.9.0-1.el7  (old)  |  3.10.0-944.el7perf_python  (new)  |  FAIL
  2.9.0-1.el7  (old)  |  3.10.0-693.el7             (old)  |  FAIL

We can see that only the combination of new tuned and new python-perf leads to a positive test result.


====================================================
Verified in:
    tuned-2.10.0-4.el7.noarch
    tuned-profiles-cpu-partitioning-2.10.0-4.el7.noarch
    python-perf-3.10.0-944.el7perf_python.x86_64
PASS
====================================================

[ PASS ] :: Command 'echo 'isolated_cores=2-3' > /etc/tuned/cpu-partitioning-variables.conf' (Expected 0, got 0)
[ LOG  ] :: Offline cpu
[ PASS ] :: Command 'echo 0 > /sys/devices/system/cpu/cpu4/online' (Expected 0, got 0)
[ LOG  ] :: Select cpu-partitioning profile
[ LOG  ] :: Output of 'tuned-adm profile cpu-partitioning':
[ LOG  ] :: --------------- OUTPUT START ---------------
[ LOG  ] :: ---------------  OUTPUT END  ---------------
[ PASS ] :: Command 'tuned-adm profile cpu-partitioning' (Expected 0, got 0)
[ PASS ] :: Command 'tuned-adm active | grep cpu-partitioning' (Expected 0, got 0)

test case: /CoreOS/tuned/Regression/offlined-cpu-in-profile-cpu-partitioning

Comment 67 errata-xmlrpc 2018-10-30 10:50:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3172

Comment 68 Joe Mario 2019-10-22 02:23:20 UTC
Clearing "needinfo" flag on this long-since-closed BZ.

