Bug 1698238

Summary: Do not try and place measurement threads on offline cpus
Product: Red Hat Enterprise Linux 7
Component: rteval
Version: 7.6
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: medium
Priority: unspecified
Target Milestone: rc
Reporter: Clark Williams <williams>
Assignee: John Kacur <jkacur>
QA Contact: Qiao Zhao <qzhao>
CC: bhu, lgoncalv, tieli
Type: Bug
Clones: 1720687
Bug Blocks: 1655694, 1720687
Last Closed: 2019-08-06 12:40:21 UTC
Attachments:
Check whether a cpu is online (flags: none)
Change hackbench to use systopology (flags: none)

Description Clark Williams 2019-04-09 21:20:28 UTC
Description of problem:

Running rteval on a system with hyperthreads disabled via the boot command line argument 'nosmt' results in a crash while trying to place a measurement thread on an offline cpu.


Version-Release number of selected component (if applicable):
rteval-2.14-11.el7.noarch

How reproducible:
Always

Steps to Reproduce:
1. Add 'nosmt' to the grub command line (turns off hyperthreading)
2. Reboot
3. Run rteval; it crashes trying to use an offline cpu

Additional info:

Backtrace and additional info:
[root@realtime-03 ~]# cd /tmp
[root@realtime-03 tmp]# rteval --duration=1m
got system topology: 2 node system (10 cores per node)
Traceback (most recent call last):
  File "/usr/bin/rteval", line 302, in <module>
    rteval.Prepare(rtevcfg.onlyload)
  File "/usr/lib/python2.7/site-packages/rteval/__init__.py", line 157, in Prepare
    self._measuremods.Setup(params)
  File "/usr/lib/python2.7/site-packages/rteval/modules/measurement/__init__.py", line 182, in Setup
    mp.Setup(modname)
  File "/usr/lib/python2.7/site-packages/rteval/modules/measurement/__init__.py", line 58, in Setup
    modobj = self._InstantiateModule(modname, self._cfg.GetSection(modname))
  File "/usr/lib/python2.7/site-packages/rteval/modules/__init__.py", line 417, in _InstantiateModule
    return self.__modules.InstantiateModule(modname, modcfg, modroot)
  File "/usr/lib/python2.7/site-packages/rteval/modules/__init__.py", line 332, in InstantiateModule
    return mod.create(modcfg, self.__logger)
  File "/usr/lib/python2.7/site-packages/rteval/modules/measurement/cyclictest.py", line 420, in create
    return Cyclictest(params, logger)
  File "/usr/lib/python2.7/site-packages/rteval/modules/measurement/cyclictest.py", line 212, in __init__
    self.__cyclicdata[core].description = info[core]['model name']
KeyError: '20'
[root@realtime-03 tmp]# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                40
On-line CPU(s) list:   0-19
Off-line CPU(s) list:  20-39
Thread(s) per core:    1
Core(s) per socket:    10
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 63
Model name:            Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz
Stepping:              2
CPU MHz:               2297.477
BogoMIPS:              4594.95
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              25600K
NUMA node0 CPU(s):     0-9
NUMA node1 CPU(s):     10-19
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm epb intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm xsaveopt cqm_llc cqm_occup_llc dtherm ida arat pln pts spec_ctrl intel_stibp flush_l1d
[root@realtime-03 tmp]# cat /proc/cmdline 
BOOT_IMAGE=/vmlinuz-3.10.0-957.15.1.rt56.927skipktimersoftd1.el7.x86_64 root=/dev/mapper/rhel_realtime--03-root ro crashkernel=auto rd.lvm.lv=rhel_realtime-03/root rd.lvm.lv=rhel_realtime-03/swap console=ttyS1,115200n81 log_buf_len=1M nosmt isolcpus=10-19 LANG=en_US.UTF-8
[root@realtime-03 tmp]# rpm -q rteval
rteval-2.14-11.el7.noarch
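
The KeyError above is consistent with a per-CPU table that only has entries for online cpus being indexed with a cpu list that still includes offline ones. A minimal sketch of that failure mode follows. The data is made up to match this machine (cpus 0-19 online, 20-39 offline), the name `info` mirrors the traceback, and how rteval actually builds these structures is an assumption here, not something shown in this report.

```python
# Hypothetical stand-in for the table indexed at cyclictest.py line 212.
# Assumption: it only holds entries for the online cpus (0-19 here).
MODEL = "Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz"
info = {str(cpu): {"model name": MODEL} for cpu in range(20)}

# Assumption: the list of cpus to measure covers all 40 possible cpus,
# including the 20 hyperthread siblings that 'nosmt' left offline.
cpulist = [str(cpu) for cpu in range(40)]


def describe(cpus, table):
    """Return {cpu: model name}; raises KeyError for a cpu missing from table."""
    return {cpu: table[cpu]["model name"] for cpu in cpus}


first_missing = None
try:
    describe(cpulist, info)
except KeyError as err:
    # The first offline cpu in the list is the one reported.
    first_missing = err.args[0]
    print("KeyError: %r" % first_missing)
```

With this data the lookup fails on the first offline cpu, which matches the `KeyError: '20'` in the backtrace.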

Comment 2 John Kacur 2019-04-16 19:14:41 UTC
Wasn't able to reproduce it on the machine I tried it on.

To verify that nosmt "took", I looked at the lscpu output
(note that Thread(s) per core is 1, as you would expect):


lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    1
Core(s) per socket:    4
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 26
Model name:            Intel(R) Xeon(R) CPU           E5506  @ 2.13GHz
Stepping:              5
CPU MHz:               2133.000
CPU max MHz:           2133.0000
CPU min MHz:           1600.0000
BogoMIPS:              4266.74
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              4096K
NUMA node0 CPU(s):     0-3
NUMA node1 CPU(s):     4-7
Flags:                 fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf eagerfpu pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid dtherm spec_ctrl intel_stibp flush_l1d

and
cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list
1

So it does seem like I was able to invoke nosmt.
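
The thread_siblings_list check above can be repeated for every cpu rather than just cpu1. This is a rough sketch, not part of rteval: the function name `cpus_with_siblings` is hypothetical, and it assumes that a cpu whose topology file is missing is offline and can be skipped.

```python
import os
import re


def cpus_with_siblings(sysfs_cpu="/sys/devices/system/cpu"):
    """Return the cpus whose thread_siblings_list names more than themselves."""
    shared = []
    for entry in os.listdir(sysfs_cpu):
        match = re.match(r"cpu(\d+)$", entry)
        if not match:
            continue  # skip cpufreq, cpuidle, online, and similar entries
        path = os.path.join(sysfs_cpu, entry, "topology",
                            "thread_siblings_list")
        if not os.path.exists(path):
            continue  # assumption: no topology file means the cpu is offline
        with open(path) as f:
            siblings = f.read().strip()
        # With SMT off, each cpu lists only itself, e.g. "1" for cpu1.
        if siblings != match.group(1):
            shared.append(int(match.group(1)))
    return sorted(shared)
```

An empty result from `cpus_with_siblings()` would agree with "Thread(s) per core: 1" in lscpu.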

Do you know what version of rt-tests you were using?

Comment 3 John Kacur 2019-05-24 11:42:08 UTC
I believe the problem is that, in the case of hotplug, there are entries for cpus that are offline.

I have modified online_cpus() in misc.py to fix this, but that doesn't solve the problem yet, because
during the initialization of Cyclictest() another configuration is passed in that bypasses this function
and merely calls expand_cpulist, passing it data that includes an offline cpu.

Getting closer.
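
The idea in this comment, expanding a cpu list and then dropping anything that is offline, can be sketched as below. This is a standalone illustration and not the attached patch: `expand_cpulist` here is a re-implementation whose signature is assumed, `is_online` and `online_cpulist` are hypothetical names, and treating a cpu directory without an `online` file as online (commonly the case for cpu0 on x86) is an assumption.

```python
import os

SYSFS_CPU = "/sys/devices/system/cpu"


def expand_cpulist(cpulist):
    """Expand a string such as "0-3,8,10-11" into [0, 1, 2, 3, 8, 10, 11]."""
    cpus = set()
    for part in cpulist.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def is_online(cpu, sysfs_cpu=SYSFS_CPU):
    """Report whether sysfs considers the given cpu number online."""
    cpudir = os.path.join(sysfs_cpu, "cpu%d" % cpu)
    path = os.path.join(cpudir, "online")
    if not os.path.exists(path):
        # Assumption: an existing cpu directory with no 'online' file
        # cannot be taken offline, so count it as online.
        return os.path.isdir(cpudir)
    with open(path) as f:
        return f.read().strip() == "1"


def online_cpulist(cpulist, sysfs_cpu=SYSFS_CPU):
    """Expand cpulist and keep only the cpus that are currently online."""
    return [cpu for cpu in expand_cpulist(cpulist)
            if is_online(cpu, sysfs_cpu)]
```

The point of the sketch is that the filter has to sit on every path that produces a cpu list, including the one that calls expand_cpulist directly; filtering in only one helper leaves the other path handing out offline cpus.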

Comment 4 John Kacur 2019-05-29 12:32:48 UTC
Created attachment 1574745 [details]
Check whether a cpu is online

Comment 5 John Kacur 2019-05-29 12:33:28 UTC
Created attachment 1574746 [details]
Change hackbench to use systopology

Comment 12 errata-xmlrpc 2019-08-06 12:40:21 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2063