Bug 592024 - Periodic system preemption on HP z800 when NUMA is enabled
Summary: Periodic system preemption on HP z800 when NUMA is enabled
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: 1.2
Hardware: All
OS: Linux
Target Milestone: ---
Assignee: Red Hat Real Time Maintenance
QA Contact: David Sommerseth
Depends On:
Reported: 2010-05-13 17:35 UTC by Jon Thomas
Modified: 2018-10-27 13:38 UTC (History)
6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Last Closed: 2011-08-08 15:14:58 UTC
Target Upstream Version:

Attachments (Terms of Use)
testcase (4.20 KB, text/plain)
2010-05-13 17:37 UTC, Jon Thomas
no flags Details

Description Jon Thomas 2010-05-13 17:35:19 UTC
A periodic system preemption of 20 uS is noticed when NUMA is enabled on an HP z800.  If NUMA is disabled in the BIOS, the preemption is no longer noticed.  However, that memory configuration has potential performance issues of its own.

->> The system BIOS does not appear to have an option to disable SMI
->> Disabling memory interleaving removes the 20 uS preemption

The customer is concerned about the inefficient memory layout that results from disabling NUMA.

He has provided a test application that demonstrates this problem.

Test file: timerAccess.cpp

The boot option isolcpus=1-7 is used.

The test used for this is attached.  Basically, the High Res Timer (HRT) is read back-to-back and the results compared.  This operation should only take 100-200 nS to complete.  Periodically, the system preempts the process for a couple of microseconds.  On this HP workstation, when NUMA is enabled, the preemption time increases to over 20 uS.  The test application sets RT priority and processor affinity.  All CPUs except zero are isolated using isolcpus=1-7 on the kernel boot line.  Hardware interrupts and kernel interrupt threads are redirected to CPU 0.

-->> Comment from the customer on the test application, after reading the documentation:


The test program uses sched_setaffinity() to pin the thread to a single CPU.  After the thread has been pinned, any necessary memory can then be allocated.  On a NUMA configuration, the OS should then allocate memory as close as possible (assuming it is available) to the thread.
The preferred way to do this is to start the thread and then change its scheduling class/priority and affinity from inside the program.

Observation made by the customer:

->> This has something to do with the 2.6.24 kernel.  The issue disappears on real-time kernels 2.6.25 and later.  Results are shown using the 2.6.31 kernel.

Results of test application

NUMA On

  Min:    0.204 uS
  Avg:    0.210 uS
  Max:   21.731 uS
  Count: 500000000
  Test time: 215 seconds  
  Min:    0.111 uS
  Avg:    0.117 uS
  Max:    2.360 uS
  Count: 500000000
  Test time: 122 seconds

NUMA Off  
  Min:    0.203 uS
  Avg:    0.211 uS
  Max:    5.176 uS
  Count: 500000000
  Test time: 215 seconds  
  Min:    0.111 uS
  Avg:    0.117 uS
  Max:    2.739 uS
  Count: 500000000
  Test time: 121 seconds

->> Informed the customer about RT kernel version 2.6.33, which is going to be released with MRG 1.3.

-> We need to identify the change between kernels that removes the latency issue in the later kernels compared to 2.6.24.
-> Will we fix this issue in the 2.6.24 RT kernel before the MRG 1.3 release?

The customer has changed kernel compile options, and this seems to improve performance.

File attached: MRG-Config

->> Among the parameters disabled are kernel debugging options and CONFIG_CPU_FREQ

How reproducible:

Always for the customer

Steps to Reproduce:

1. Run the attached test application

Actual results:

A periodic system preemption of 20 uS is noted when NUMA is enabled

Expected results:

NUMA can remain enabled without the periodic 20 uS preemption

Comment 1 Jon Thomas 2010-05-13 17:37:06 UTC
Created attachment 413832 [details]

Comment 2 Luis Claudio R. Goncalves 2010-05-13 21:57:35 UTC
I haven't compiled the testcase, but I noticed it uses CLOCK_REALTIME instead of CLOCK_MONOTONIC. I suggest switching to CLOCK_MONOTONIC and running the test again.

It would also be interesting to know which clocksource is in use on that system (the contents of /sys/devices/system/clocksource/clocksource0/*) and whether the VDSO gettimeofday extensions are enabled (/proc/sys/kernel/vsyscall64). These data points can heavily influence the results.


Comment 3 Jon Thomas 2010-05-14 16:22:57 UTC

thanks. I've asked that the customer follow up on your suggestions.

Comment 5 Issue Tracker 2010-05-15 09:47:18 UTC
Event posted on 05-15-2010 05:47am EDT by rrajaram

This event sent from IssueTracker by rrajaram 
 issue 861323
it_file 669733
