Bug 437463 - Evaluate the impact of CONFIG_NUMA on real-time latencies
Status: CLOSED NOTABUG
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-configuration
Version: 1.0
Hardware: x86_64
OS: All
Priority: low
Severity: medium
Assigned To: Red Hat Real Time Maintenance
Reported: 2008-03-14 07:48 EDT by IBM Bug Proxy
Modified: 2008-05-19 11:21 EDT (History)

Doc Type: Bug Fix
Last Closed: 2008-05-02 13:09:20 EDT


Attachments
Effect of CONFIG_NUMA on latencies on an LS21 machine (1.15 KB, text/html)
2008-03-18 09:17 EDT, IBM Bug Proxy
comarison_with_mem_node_interleave_bios_setting (3.36 KB, text/html)
2008-03-20 07:56 EDT, IBM Bug Proxy


External Trackers
IBM Linux Technology Center 42634 (Priority: None, Status: None, Summary: None, Last Updated: Never)
Description IBM Bug Proxy 2008-03-14 07:48:44 EDT
=Comment: #0=================================================
Sripathi Kodi <sripathi@in.ibm.com> - 2008-02-25 10:31 EDT
*** This is a "Task" bug. This has been opened to track a particular task, not
necessarily a bug. ***

Please evaluate whether enabling CONFIG_NUMA affects real-time latencies. RH
has enabled this option in the MRG kernel.

John Stultz has worked on this in bug #40270. He has not reached a definitive
answer yet.
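As a quick illustration of the starting point (a hedged sketch, not part of the original report: the sample config file below is fabricated, and on a live system the file is typically /boot/config-$(uname -r), or /proc/config.gz with CONFIG_IKCONFIG_PROC), one can confirm whether a given kernel build enables CONFIG_NUMA:

```shell
# Fabricated sample config fragment, standing in for /boot/config-$(uname -r).
cat > config-sample <<'EOF'
CONFIG_SMP=y
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
EOF

# The anchored grep matches only the CONFIG_NUMA= line itself,
# not related options such as CONFIG_K8_NUMA.
grep '^CONFIG_NUMA=' config-sample
```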

=Comment: #1=================================================
Vernon Mauery <mauery@us.ibm.com> - 2008-02-28 19:30 EDT
I am currently gathering results for the baseline RH kernel on an LS20.  This
should finish up this evening, at which point I will reboot the machine into a
kernel that has CONFIG_NUMA disabled and start the tests again.  I should have
results tomorrow that I can post here.

=Comment: #2=================================================
Vernon Mauery <mauery@us.ibm.com> - 2008-02-29 10:31 EDT
I must have done something wrong with the config file.  My new kernel does not
boot.  I get this:

Kernel panic - not syncing: Attempted to kill init!
Pid: 1, comm: init Not tainted 2.6.24-21 #1

Call Trace:
 [<ffffffff8023d200>] panic+0xaf/0x169
 [<ffffffff8049ff45>] do_page_fault+0x3f6/0x769
 [<ffffffff80335a05>] lock_list_del_init+0x7c/0xaf
 [<ffffffff80255de2>] blocking_notifier_call_chain+0xf/0x11
 [<ffffffff80240981>] do_exit+0x8d/0x823
 [<ffffffff802411a6>] sys_exit_group+0x0/0x14
 [<ffffffff802411b8>] sys_exit_group+0x12/0x14
 [<ffffffff8020c21e>] system_call+0x7e/0x83

when trying to boot.

=Comment: #3=================================================
Vernon Mauery <mauery@us.ibm.com> - 2008-02-29 13:26 EDT
I think I was just hit by the same abat bug as Darren was yesterday.  I had an
empty /etc/modprobe.conf file so the newly installed initrd was not configured
correctly.  I have booted the CONFIG_NUMA=n kernel and will run 100 calibrate
runs like I did on the original kernel.

=Comment: #4=================================================
Vernon Mauery <mauery@us.ibm.com> - 2008-02-29 18:41 EDT
I have run 100 full calibrate runs on the MRG kernel and another 100 on the MRG
kernel with CONFIG_NUMA disabled.

Basic inspection:
vhmauery@elm3b213 $ grep SUMMARY logs.numa/* | grep -v "0 FAIL" | wc -l
38
vhmauery@elm3b213 $ grep SUMMARY logs.nonuma/* | grep -v "0 FAIL" | wc -l
19

We have twice as many runs with one or more failing tests when CONFIG_NUMA is
enabled (38 of 100) as when it is disabled (19 of 100).

Slightly more detailed results:
I ran the results through calibrate/sum_results.py and diffed them.  This is the
output:
--- nonuma.results      2008-02-29 18:25:42.000000000 -0500
+++ numa.results        2008-02-29 18:25:52.000000000 -0500
@@ -16,7 +16,7 @@
 Checks abs(Start Latency) < 100 µs 
        PASS:  100 FAIL:  0
 NHRT: Checks abs(Maximum Start) < 100 µs 
-       PASS:  100 FAIL:  0
+       PASS:  99 FAIL:  1
 NHRT: Checks abs(Start Latency) < 100 µs 
        PASS:  100 FAIL:  0
 
@@ -32,9 +32,9 @@
 Multi-Processor Performance
 ------------------------------
 Concurrent Time * 2.0 < Sequential Time 
-       PASS:  98 FAIL:  2
+       PASS:  95 FAIL:  5
 XML: Concurrent Time * 2.0 < Sequential Time 
-       PASS:  99 FAIL:  1
+       PASS:  93 FAIL:  7
 
 ------------------------------
 Just-In-Time Compilation Jitter
@@ -80,7 +80,7 @@
 Impact on scheduling latency, GC Latency  < NO-GC Latency + 100 µs 
        PASS:  100 FAIL:  0
 Impact on execution time. GC Duration < 1.1 NO-GC Duration (10% penalty) 
-       PASS:  99 FAIL:  1
+       PASS:  100 FAIL:  0
 
 ------------------------------
 NoHeapRealtimeThread Memory Allocation
@@ -94,9 +94,9 @@
 Dispatch Latency
 ------------------------------
 Bound Handler Latency < 70 µs 
-       PASS:  96 FAIL:  4
+       PASS:  83 FAIL:  17
 Async Handler Latency < 100 µs 
-       PASS:  89 FAIL:  11
+       PASS:  79 FAIL:  21
 
 ------------------------------
 Memory Check Penalty

If necessary, I can go through and find actual latency numbers to back up my
argument, but even on this cursory inspection I think this myth is busted!  We
should tell Red Hat to disable CONFIG_NUMA.

=Comment: #5=================================================
Vernon Mauery <mauery@us.ibm.com> - 2008-02-29 18:42 EDT
I note that this also needs to be tested on an HS21.

=Comment: #6=================================================
Sripathi Kodi <sripathi@in.ibm.com> - 2008-03-04 05:43 EDT
From the minutes of the MRG call, it looks like RH would like to keep this on.
Is it possible to disable NUMA through a kernel command line option? I can't
find any such option in kernel-parameters.txt.

=Comment: #7=================================================
Vernon Mauery <mauery@us.ibm.com> - 2008-03-04 09:27 EDT
The final word on this bug from me.  It appears that the machines most affected
by this are either slow or AMD; I am not sure which influences the test results
more.  I would have to test on an LS21 to confirm, but the HS21 is still faster.
The HS21 failed 3% more tests with NUMA enabled.

--- results.nonuma      2008-03-03 17:44:51.000000000 -0500
+++ results.numa        2008-03-04 09:20:32.000000000 -0500
@@ -24,7 +24,7 @@
 Concurrency Jitter
 ------------------------------
 Checks (maximum - minimum) < 200 µs 
-       PASS:  99 FAIL:  1
+       PASS:  100 FAIL:  0
 Checks Start Jitter < 200 µs 
        PASS:  100 FAIL:  0
 
@@ -94,9 +94,9 @@
 Dispatch Latency
 ------------------------------
 Bound Handler Latency < 70 µs 
-       PASS:  100 FAIL:  0
+       PASS:  99 FAIL:  1
 Async Handler Latency < 100 µs 
-       PASS:  97 FAIL:  3
+       PASS:  96 FAIL:  4
 
 ------------------------------
 Memory Check Penalty

This could be statistical noise.  To be sure we would have to run 1000 runs of
calibrate rather than 100, which would take about 33 hours (* 2 for config
changes) or so.
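One way to put "this could be statistical noise" on firmer footing without 1000 runs is a one-sided Fisher exact test on the PASS/FAIL counts. The sketch below is a hypothetical illustration (not something run for this bug): it compares the LS20 Bound Handler counts from comment #4 (17 vs 4 failures per 100 runs) against the HS21 Async Handler counts above (4 vs 3), using only the counts already reported.

```shell
# Hypothetical significance check on the reported PASS/FAIL counts.
# Computes the one-sided hypergeometric tail: the probability of seeing
# at least this many failures in the NUMA batch if both batches shared
# one failure rate across 100 runs each.
python3 - <<'EOF' > fisher.out
from math import comb

def fisher_one_sided(fail_numa, fail_nonuma, runs=100):
    total_fail = fail_numa + fail_nonuma
    total = 2 * runs
    denom = comb(total, runs)
    return sum(comb(total_fail, k) * comb(total - total_fail, runs - k)
               for k in range(fail_numa, total_fail + 1)) / denom

# LS20 Bound Handler Latency: 17 failures with NUMA vs 4 without.
print("ls20", fisher_one_sided(17, 4))
# HS21 Async Handler Latency: 4 failures with NUMA vs 3 without.
print("hs21", fisher_one_sided(4, 3))
EOF
cat fisher.out
```

On these counts the LS20 difference is far too lopsided to be chance, while the HS21 difference is entirely consistent with noise, which matches the conclusion drawn here.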

CONFIG_NUMA definitely affects the latency on an LS20, but only a little bit on
the HS21.
Comment 1 IBM Bug Proxy 2008-03-17 14:48:42 EDT
------- Comment From sripathi@in.ibm.com 2008-03-17 14:42 EDT-------
The BIOS setting for "Memory Node Interleave" on our machines is "Disabled".
This seems to be the default.
Comment 2 IBM Bug Proxy 2008-03-18 09:17:29 EDT
------- Comment From sripathi@in.ibm.com 2008-03-18 09:11 EDT-------
Some more numbers, this time from the rt-test tests that are part of LTP.
CONFIG_NUMA did not make a significant impact on the HS21 runs, but its impact
was measurable on the LS21. I will attach an HTML file to this bug that shows
the comparison.
Comment 3 IBM Bug Proxy 2008-03-18 09:17:31 EDT
Created attachment 298389 [details]
Effect of CONFIG_NUMA on latencies on an LS21 machine
Comment 4 IBM Bug Proxy 2008-03-18 10:16:44 EDT
------- Comment From dvhltc@us.ibm.com 2008-03-18 10:13 EDT-------
(In reply to comment #13)
> The BIOS setting for "Memory Node Interleave" on our machines "Disabled". This
> seems to be the default.

I believe this is what Clark mentioned to me as his expectation given our
results.  Can we also run with Memory Node Interleave enabled to see how this
affects the LS21 results?
Comment 5 IBM Bug Proxy 2008-03-20 07:56:44 EDT
------- Comment From sudhanshusingh@in.ibm.com 2008-03-20 07:56 EDT-------
(From update of attachment 35546)
These results are for the LS21.
The calibrate and C tests were run 100 times, and the average/max is taken over those runs.
Comment 6 IBM Bug Proxy 2008-03-20 07:56:46 EDT
Created attachment 298687 [details]
comarison_with_mem_node_interleave_bios_setting

Attachment contains a comparison of results with the memory node interleave
BIOS setting (enabled and disabled) for the MRG base kernel and the MRG kernel
with the NUMA option turned off (all four permutations).
Comment 7 IBM Bug Proxy 2008-03-26 02:00:36 EDT
------- Comment From sripathi@in.ibm.com 2008-03-26 01:57 EDT-------
We decided on the mailing lists and the RH call that it is okay to leave NUMA
turned ON. In case we discover problems later, we can use the numa=off boot
parameter.
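For the record, a sketch of how numa=off could be applied on a GRUB legacy system of this era (hypothetical: the grub.conf below is fabricated for demonstration, and the kernel version and paths are illustrative, not taken from this bug):

```shell
# Fabricated GRUB legacy config standing in for /boot/grub/grub.conf.
cat > grub.conf <<'EOF'
title MRG Realtime (2.6.24-rt)
        root (hd0,0)
        kernel /vmlinuz-2.6.24-rt ro root=/dev/sda1
        initrd /initrd-2.6.24-rt.img
EOF

# Append numa=off to every kernel command line.
sed -i '/^[[:space:]]*kernel /s/$/ numa=off/' grub.conf

grep kernel grub.conf
```

After rebooting with this change, a single entry under /sys/devices/system/node/ (only node0) would confirm that the kernel sees no NUMA topology.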
Comment 8 IBM Bug Proxy 2008-03-26 11:40:51 EDT
------- Comment From dvhltc@us.ibm.com 2008-03-26 11:37 EDT-------
Closing it. (Rejecting as not a bug, since there was no change made.)
Comment 9 IBM Bug Proxy 2008-04-01 06:49:08 EDT
------- Comment From sripathi@in.ibm.com 2008-04-01 06:41 EDT-------
Moving this bug to FIX_BY_IBM
Comment 10 Clark Williams 2008-05-02 13:09:20 EDT
Closing on our side.
