Bug 504762
Description
Qian Cai
2009-06-09 09:28:23 UTC
FYI, this test runs fine on RHEL5-Server-U2, kernel-xen-2.6.18-92.1.27.el5.

Is the test really blocking forever? I tried to run it, and it simply takes too much time to execute; the RHTS log shows the test was canceled due to timeout. What about extending the duration of the test? (It takes me about 2.5 hours.)

I don't think I waited long enough, at around 2.5 hours. As far as I can tell, it took much less time on RHEL 5.2.

Tested the 2.6.18-92 and 2.6.18-164 kernels in the RHEL 5.4 release. The test ran three times in a row, with the following results:

2.6.18-92:
1.) systime 7212.84, elapsed time 2:01:50, 100000 minor page faults
2.) systime 7209.36, elapsed time 2:01:46, 99678 minor page faults
3.) systime 7218.31, elapsed time 2:01:56, 99386 minor page faults

2.6.18-164:
1.) systime 7204.55, elapsed time 2:01:40, 96497 minor page faults
2.) systime 7222.78, elapsed time 2:01:58, 97459 minor page faults
3.) systime 7229.79, elapsed time 2:02:04, 97404 minor page faults

As the results show, on RHEL 5.4 the test takes more than the 2 hours allotted to it, so it has to fail. However, this time is also reached with the RHEL 5.2 kernel. The test never blocks forever; it just cannot finish in the expected time.

I got a soft lockup here.
mallocstress D a000000100642180 0 5854 5683 5853 (NOTLB) Call Trace: [<a00000010063d9b0>] schedule+0x1750/0x1bc0 sp=e00000014ccefc60 bsp=e00000014cce93c8 [<a000000100642180>] rwsem_down_read_failed+0x4e0/0x550 sp=e00000014ccefcf0 bsp=e00000014cce9380 [<a0000001000c7fb0>] down_read+0x30/0x60 sp=e00000014ccefd10 bsp=e00000014cce9360 [<a0000001000c9a20>] futex_wake+0x40/0x200 sp=e00000014ccefd10 bsp=e00000014cce9320 [<a0000001000cc240>] do_futex+0x4e0/0x1f20 sp=e00000014ccefd30 bsp=e00000014cce92b0 [<a0000001000cdf00>] sys_futex+0x280/0x2e0 sp=e00000014ccefe20 bsp=e00000014cce9220 [<a000000100069060>] xen_trace_syscall+0x100/0x140 sp=e00000014ccefe30 bsp=e00000014cce9220 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400 sp=e00000014ccf0000 bsp=e00000014cce9220 bash R running task 0 5857 3882 (NOTLB) BUG: soft lockup - CPU#0 stuck for 10s! [bash:5857] Modules linked in: ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables netloop netbk blktap blkbk bridge autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ipv6 xfrm_nalgo crypto_api ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi vfat fat dm_multipath button parport_pc lp parport sg tg3 e100 mii e1000 dm_snapshot dm_zero dm_mirror dm_mod lpfc scsi_transport_fc shpchp mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 5857, CPU 0, comm: bash psr : 00000210085a6010 ifs : 8000000000000003 ip : [<a0000001006432a1>] Not tainted ip is at _spin_unlock_irqrestore+0xc1/0x140 unat: 0000000000000000 pfs : 800000000000050d rsc : 000000000000000b rnat: 0000000000000000 bsps: 0000000000000000 pr : 000000000059aa99 ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f csd : 0000000000000000 ssd : 0000000000000000 b0 : a0000001003bd290 b6 : a0000001003ec7c0 b7 : 0000000000000000 f6 : 000000000000000000000 f7 : 000000000000000000000 f8 : 
000000000000000000000 f9 : 000000000000000000000
f10 : 000000000000000000000 f11 : 000000000000000000000
r1 : a000000100c11fd0 r2 : a000000100a2a418 r3 : a000000100a2a418
r8 : 0000000000000001 r9 : fffffffffff00001 r10 : 0000000000000001
r11 : fffffffffff00000 r12 : e00000014b3c7e20 r13 : e00000014b3c0000
r14 : 0000000000004000 r15 : 0000000000000001 r16 : fffffffffff04c18
r17 : fffffffffff00001 r18 : a000000100a12788 r19 : 0000000000000007
r20 : a000000100849300 r21 : a000000100a12628 r22 : e00000014b3c0208
r23 : a000000100a2a128 r24 : 0000000000000001 r25 : a000000100a122e8
r26 : a0000001007cc138 r27 : a000000100849000 r28 : a000000100a12680
r29 : 00001412085a6010 r30 : 0000000000000000 r31 : 0000000000000000
Call Trace:
[<a00000010001d180>] show_stack+0x40/0xa0 sp=e00000014b3c7a80 bsp=e00000014b3c1630
[<a00000010001da80>] show_regs+0x840/0x880 sp=e00000014b3c7c50 bsp=e00000014b3c15d0
[<a0000001000fad10>] softlockup_tick+0x310/0x380 sp=e00000014b3c7c50 bsp=e00000014b3c1588
[<a0000001000a48f0>] run_local_timers+0x30/0x60 sp=e00000014b3c7c50 bsp=e00000014b3c1568
[<a0000001000a49a0>] update_process_times+0x80/0x100 sp=e00000014b3c7c50 bsp=e00000014b3c1530
[<a000000100042ac0>] timer_interrupt+0x800/0xa40 sp=e00000014b3c7c50 bsp=e00000014b3c14d8
[<a0000001000fb350>] handle_IRQ_event+0x150/0x240 sp=e00000014b3c7c50 bsp=e00000014b3c1498
[<a0000001000fb570>] __do_IRQ+0x130/0x420 sp=e00000014b3c7c50 bsp=e00000014b3c1450
[<a0000001003e80c0>] evtchn_do_upcall+0x160/0x260 sp=e00000014b3c7c50 bsp=e00000014b3c13b8
[<a000000100068c60>] xen_event_callback+0x380/0x3c0 sp=e00000014b3c7c50 bsp=e00000014b3c13b8

Not sure if this is caused by the use of different userspace components, such as a change in the Xen version. The original problem report used RHEL5-Server-U2.

Miroslav, I have installed the system using the RHEL5-Server-U2 distro, and everything seems to work fine there every time.

# time ./mallocstress
Thread [43]: allocate_free() returned 0, succeeded. Thread exiting.
Thread [51]: allocate_free() returned 0, succeeded. Thread exiting. Thread [47]: allocate_free() returned 0, succeeded. Thread exiting. Thread [3]: allocate_free() returned 0, succeeded. Thread exiting. Thread [39]: allocate_free() returned 0, succeeded. Thread exiting. Thread [27]: allocate_free() returned 0, succeeded. Thread exiting. Thread [11]: allocate_free() returned 0, succeeded. Thread exiting. Thread [15]: allocate_free() returned 0, succeeded. Thread exiting. Thread [31]: allocate_free() returned 0, succeeded. Thread exiting. Thread [35]: allocate_free() returned 0, succeeded. Thread exiting. Thread [7]: allocate_free() returned 0, succeeded. Thread exiting. Thread [23]: allocate_free() returned 0, succeeded. Thread exiting. Thread [55]: allocate_free() returned 0, succeeded. Thread exiting. Thread [19]: allocate_free() returned 0, succeeded. Thread exiting. Thread [59]: allocate_free() returned 0, succeeded. Thread exiting. Thread [6]: allocate_free() returned 0, succeeded. Thread exiting. Thread [50]: allocate_free() returned 0, succeeded. Thread exiting. Thread [2]: allocate_free() returned 0, succeeded. Thread exiting. Thread [54]: allocate_free() returned 0, succeeded. Thread exiting. Thread [58]: allocate_free() returned 0, succeeded. Thread exiting. Thread [14]: allocate_free() returned 0, succeeded. Thread exiting. Thread [30]: allocate_free() returned 0, succeeded. Thread exiting. Thread [34]: allocate_free() returned 0, succeeded. Thread exiting. Thread [46]: allocate_free() returned 0, succeeded. Thread exiting. Thread [22]: allocate_free() returned 0, succeeded. Thread exiting. Thread [18]: allocate_free() returned 0, succeeded. Thread exiting. Thread [26]: allocate_free() returned 0, succeeded. Thread exiting. Thread [42]: allocate_free() returned 0, succeeded. Thread exiting. Thread [10]: allocate_free() returned 0, succeeded. Thread exiting. Thread [38]: allocate_free() returned 0, succeeded. Thread exiting. 
Thread [49]: allocate_free() returned 0, succeeded. Thread exiting. Thread [53]: allocate_free() returned 0, succeeded. Thread exiting. Thread [21]: allocate_free() returned 0, succeeded. Thread exiting. Thread [29]: allocate_free() returned 0, succeeded. Thread exiting. Thread [33]: allocate_free() returned 0, succeeded. Thread exiting. Thread [9]: allocate_free() returned 0, succeeded. Thread exiting. Thread [41]: allocate_free() returned 0, succeeded. Thread exiting. Thread [25]: allocate_free() returned 0, succeeded. Thread exiting. Thread [5]: allocate_free() returned 0, succeeded. Thread exiting. Thread [13]: allocate_free() returned 0, succeeded. Thread exiting. Thread [45]: allocate_free() returned 0, succeeded. Thread exiting. Thread [1]: allocate_free() returned 0, succeeded. Thread exiting. Thread [17]: allocate_free() returned 0, succeeded. Thread exiting. Thread [57]: allocate_free() returned 0, succeeded. Thread exiting. Thread [37]: allocate_free() returned 0, succeeded. Thread exiting. Thread [12]: allocate_free() returned 0, succeeded. Thread exiting. Thread [16]: allocate_free() returned 0, succeeded. Thread exiting. Thread [8]: allocate_free() returned 0, succeeded. Thread exiting. Thread [24]: allocate_free() returned 0, succeeded. Thread exiting. Thread [28]: allocate_free() returned 0, succeeded. Thread exiting. Thread [0]: allocate_free() returned 0, succeeded. Thread exiting. Thread [20]: allocate_free() returned 0, succeeded. Thread exiting. Thread [32]: allocate_free() returned 0, succeeded. Thread exiting. Thread [4]: allocate_free() returned 0, succeeded. Thread exiting. Thread [40]: allocate_free() returned 0, succeeded. Thread exiting. Thread [48]: allocate_free() returned 0, succeeded. Thread exiting. Thread [36]: allocate_free() returned 0, succeeded. Thread exiting. Thread [56]: allocate_free() returned 0, succeeded. Thread exiting. Thread [52]: allocate_free() returned 0, succeeded. Thread exiting. 
Thread [44]: allocate_free() returned 0, succeeded. Thread exiting.
main(): test passed.
real 0m9.415s
user 0m0.027s
sys 0m0.235s

CAI's results are for the non-xen kernel. When trying the equivalent xen kernel, the results are:
real 41m36.795s
user 0m0.516s
sys 40m20.836s

This means that in the xen environment the test is much slower. And it is even slower in newer releases.

Created attachment 477629 [details]
strace output for one minute of mallocstress
I reproduced the problem under RHEL-5.6 and did some preliminary measurements.
On the bare metal:
[root@pq0-0 ~]# uname -psrn
Linux pq0-0.lab.bos.redhat.com 2.6.18-238.el5 ia64
mallocstress completes in less than ten seconds.
I traced the same test with the following command, for one minute, in dom0:
[root@pq0-0 ~]# uname -psrn
Linux pq0-0.lab.bos.redhat.com 2.6.18-238.el5xen ia64
$ strace -f -T -etrace=mmap,munmap,mprotect,madvise ./mallocstress 2>&1 \
| tee -i log-one-minute
The '-T' option outputs, for each syscall, the time spent in it once it completes. The following script creates a listing of times for the four traced syscalls:
for I in mmap munmap mprotect madvise; do
grep $I log-one-minute \
| sed -n -r 's/^.*<([0-9]+\.[0-9]+)>$/\1/p' >times-$I
done
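For reference, the summary statistics below can be derived from those times-* files with a short awk pass. This is a hedged sketch, not the script actually used for the bug; the file times-demo and its three values are made-up stand-ins, since the real trace data is attached to the bug rather than reproduced here.

```shell
# Toy stand-in for one of the times-* files generated above
# (made-up values; the real files come from the strace log).
printf '0.1\n0.2\n0.3\n' > times-demo

# Sample size, mean, and sample standard deviation (n-1 denominator).
awk '{ n++; s += $1; ss += $1 * $1 }
     END {
       mean = s / n
       printf "n=%d mean=%.6f stdev=%.6f\n", n, mean,
              sqrt((ss - n * mean * mean) / (n - 1))
     }' times-demo
# prints: n=3 mean=0.200000 stdev=0.100000
```

Run over the real times-munmap file, the same awk program would yield the munmap column of the summary.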
Summary:
madvise mmap mprotect munmap
sample size 22 1490 447 703
mean time (s) 0.155120 0.040717 0.029127 0.199247
sample stdev (s) 0.221008 0.129294 0.093251 0.260324
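As a quick sanity check on these figures: the mean munmap() latency above implies a rate of roughly five calls per second, assuming the calls run back-to-back. A throwaway awk expression confirms the arithmetic:

```shell
# Calls per second implied by the mean munmap() time of 0.199247 s,
# assuming back-to-back calls.
awk 'BEGIN { printf "%.1f munmap calls per second\n", 1 / 0.199247 }'
# prints: 5.0 munmap calls per second
```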
The performance of the munmap() calls looks especially horrifying -- five
munmap() calls per second, on average. The sample standard deviations are huge.
Either the dom0 <-> hv switch takes a lot of time, or there must be some heavy
lock contention in the hypervisor on munmap().
Created attachment 477644 [details]
dom0 dmesg about oprofile (perfmon) startup failure
To get a better understanding of what's happening, I tried to profile the machine with oprofile while running mallocstress. I installed the "kernel-xen-debuginfo" package and configured the xen and vmlinux images for oprofile:
$ opcontrol --vmlinux=/usr/lib/debug/lib/modules/2.6.18-238.el5xen/vmlinux
$ opcontrol --xen=/usr/lib/debug/boot/xen-syms-2.6.18-238.el5.debug
And then tried to start the oprofile daemon.
$ opcontrol --start --verbose
[...]
executing oprofiled --session-dir=/var/lib/oprofile --separate-lib=0 --separate-kernel=0 --separate-thread=0 --separate-cpu=0 --events=CPU_CYCLES:18:0:1000000:0:1:1, --vmlinux=/usr/lib/debug/lib/modules/2.6.18-238.el5xen/vmlinux --kernel-range=a000000100000000,a000000100698940 --xen-image=/usr/lib/debug/boot/xen-syms-2.6.18-238.el5.debug --xen-range=f000000004000000,f0000000040e3470 --verbose=all
[...]
Using log file /var/lib/oprofile/samples/oprofiled.log
Daemon started.
Profiler running.
Contrary to the above, the daemon failed to start; "oprofiled.log" says
"Couldn't start perfmon: : Invalid argument".
I have enabled perfmon debug logging (echo 1 >/proc/sys/kernel/perfmon/debug) and tried to start the daemon again. It failed the same way, but now the attached messages were logged. pfm_start() returns -22 (EINVAL) because no context is loaded. The context is not loaded anymore at that time because pfm_flush() forces an unload ("forcing unload") earlier. The source says "we must stop and unload because we are losing access to the context". I don't understand that, but it seems clear that the oprofile machinery tries to rely on perfmon on IA64, and that on this specific machine perfmon stabs oprofile in the back.
I'm reserving another IA64 machine (hp-rx8640-02) and I will retry the entire
thing there (first the mallocstress, then -- if it is slow there as well --
oprofiling).
I have reserved hp-rx8640-02, and all results match those on pq0-0:
- on the bare metal, the mallocstress test finishes in about 11 seconds,
- in dom0, the test promises to take forever (I interrupted it after one minute or so),
- the oprofile daemon failed to start with the exact same symptoms (dmesg included) as in comment 17.

Andrew had the great idea to repeat the test on x86_64, where oprofile has a better chance to work. This should help us identify the slow path at least on x86_64, and we could try looking at the same code for ia64.

A summary of the results so far:

(1) On x86_64 bare metal, -242 kernel, the times taken by mallocstress are: real 9.16 / user 0.00 / sys 0.01. I will upload an oprofile report for this case.

(2) On x86_64 xen dom0, -242 kernel and hv, the times taken by mallocstress are: real 36.68 / user 0.00 / sys 0.02. I will upload an oprofile report for this case as well.

(3) On ia64 bare metal, -238 kernel (= U6, but also on U2, see comment 10 and comment 11), the test takes, similarly, about 9-10 seconds, and practically no CPU time.

(4) On ia64 xen dom0, the test takes forever, and all wall clock time is spent as sys CPU time.

In cases (1)-(3), the mallocstress process seems to be "IO-bound", except it does no IO. I suspect the my_yield() function, which, instead of calling pthread_yield(), calls usleep() or nanosleep() (depending on the SPEW_SIGNALS macro). Indeed, I re-ran case (1) -- without profiling -- with my_yield() calling pthread_yield() instead of the sleep functions, and the times now look like: real 1.26 / user 0.14 / sys 1.71. This is actually reasonable for a sys-CPU-bound multi-threaded program.

Created attachment 477807 [details]
x86_64 bare metal opreport, with my_yield() calling usleep(0)
Top of the report is
CPU: Intel Core/i7, speed 1596 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples % app name symbol name
169431 25.0721 vmlinux .text.acpi_processor_idle_simple
95385 14.1149 vmlinux mwait_idle_with_hints
55694 8.2415 vmlinux unmap_vmas
52499 7.7687 vmlinux .text.acpi_safe_halt
22182 3.2824 vmlinux schedule
10706 1.5843 vmlinux clear_page
10209 1.5107 vmlinux free_pgd_range
If we put aside the sleeping incurred by usleep(0), then the top CPU consumers are unmap_vmas(), clear_page() and free_pgd_range().
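One quick way to "put aside" those symbols when eyeballing a report is to filter them out with grep. A sketch over the excerpt above; report-demo is a made-up stand-in file holding the lines just quoted, not a real opreport invocation:

```shell
# Stand-in file containing the opreport excerpt quoted above.
cat > report-demo <<'EOF'
169431 25.0721 vmlinux .text.acpi_processor_idle_simple
95385 14.1149 vmlinux mwait_idle_with_hints
55694 8.2415 vmlinux unmap_vmas
52499 7.7687 vmlinux .text.acpi_safe_halt
22182 3.2824 vmlinux schedule
10706 1.5843 vmlinux clear_page
10209 1.5107 vmlinux free_pgd_range
EOF

# Drop the idle/halt symbols; what remains are the real CPU consumers.
grep -Ev 'idle|halt|mwait' report-demo
```

The same filter could be piped onto a live report as well.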
Created attachment 477810 [details]
x86_64 xen dom0 opreport, with my_yield() calling usleep(0)
The top of the report is:
CPU: Intel Core/i7, speed 1596 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples % app name symbol name
131444 4.8761 vmlinux unmap_vmas
95945 3.5592 vmlinux .text.lock.spinlock
81181 3.0115 xen-syms-2.6.18-242.el5.debug alloc_page_type
64907 2.4078 xen-syms-2.6.18-242.el5.debug get_page_from_l1e
Now "unmap_vmas()" is the clear winner. (It also ranked high on the bare metal.) For comparison, this is the bare metal hit count:
55694 8.2415 vmlinux unmap_vmas
It seems that unmap_vmas() spends a lot more CPU cycles under xen; this should include the cycles spent by the hypervisor on its behalf. I believe we should look at the ia64 hv code that is called by unmap_vmas(), or that otherwise appears near the top of this opreport: alloc_page_type(), get_page_from_l1e(), free_page_type(), compat_failsafe_callback(), and put_page_from_l1e().
The results described in comment 20 and comment 21 were gathered like this:
- installed the following packages from brew:
  - kernel-debuginfo-common-2.6.18-242.el5.x86_64.rpm
  - kernel-debuginfo-2.6.18-242.el5.x86_64.rpm
  - kernel-xen-debuginfo-2.6.18-242.el5.x86_64.rpm

Then, for each test:

opcontrol --init
opcontrol --vmlinux=/usr/lib/debug/lib/modules/$(uname -r)/vmlinux
# Under xen only:
opcontrol --xen=/usr/lib/debug/boot/xen-syms-2.6.18-242.el5.debug
opcontrol --start --verbose
for ((K=0; K<20; K++)); do
  echo "==== $K ====" >&2
  time -p ./mallocstress
done
opcontrol --shutdown
# Produces report to stdout.
opreport -l
# archive results for later analysis in standalone form
oparchive -o ~/filestore/bz504762/oprofile-`uname -r`
# stash away current sample in samples directory too, emptying "current" sample
opcontrol --save=mallocstress-`uname -r`
opcontrol --deinit
# Make sure the --xen setting doesn't remain in effect for bare metal, messing
# up the report.
rm -rf ~/.oprofile

*** Bug 533872 has been marked as a duplicate of this bug. ***

Paolo pointed out that the bigger the allocated block is, the longer its deallocation takes. This is the reason why the standard deviation is so high in comment 16: the stddev of the memory sizes is big as well. Paolo also established that the hypervisor does pin the dom0 VCPUs to the PCPUs even on IA64. This is interesting because we figured out that NUMA is the culprit.

Considering the Xen testcase on hp-rx8640-02 (see comment 18), specifying a single repetition of the allocation loop (-l 1):

numthreads | taskset | real | user | sys
1          | none    | 0.43 | 0.00 | 0.04
2          | none    | 3.95 | 0.00 | 3.56
2          | -c 0    | 0.45 | 0.00 | 0.08
2          | -c x,y  | <alternating haphazardly between the above two lines>

So if we pin all threads to the same VCPU, then they are bound to run on the same PCPU as well, and bound to have the same NUMA node affinities.
If we allow more than one VCPU, then the threads will sometimes use the same NUMA node (and the test will complete fast), and sometimes not (and the test will run slowly).

( xm info on hp-rx8640-02 states:
node_to_cpu : node0:0-7 node1:8-15 node2:no cpus )

The "numa=on" xen cmdline switch seems to have no effect. Even though VCPU-PCPU pinning is done, we seem unable to reliably pin threads to NUMA nodes. It is as if we could not bind PCPUs to NUMA nodes.

----o----

Running the mallocstress test with the default parameters, but inside "taskset -c 0", the times are: real 231.16 / user 4.56 / sys 214.80. This is much slower than the 11 seconds on the bare metal (see comment 18), but still an order of magnitude faster than without "taskset".

I'm not sure what we should do with this bug now. The original report has proved imprecise; the test does complete with a functionally good result (see comment 6). If we re-qualify the bug as a performance problem, then that problem can be worked around to a great extent by "taskset -c X" -- however, I'm not sure that wouldn't defeat the original purpose of the test, which tries to attack the memory subsystem from as many sides as possible. Calling my_yield() after each allocation and deallocation apparently tries to ensure that all threads pound the system in lock-step, for maximum stress. In that it succeeds on these machines as well, just at a very high perf price under Xen.

Created attachment 477877 [details]
current CPU and memory topology of hp-rx8640-02
(In reply to comment #24)
> Even though VCPU-PCPU pinning is done, we seem unable to reliably pin threads
> to NUMA nodes. It is as if we could not bind PCPU's to NUMA nodes.

vcpu -> pcpu pinning is one thing, and it works on Xen (i.e. you are guaranteed to use the pcpu you pin to). However, the memory used by those pcpus has no such guarantee. So your comment here is exactly right: you can't "bind pcpus to NUMA nodes" in the sense that, even with a pinned vcpu->pcpu, you can't guarantee that you'll get memory associated with the node of that pcpu. Or, written more succinctly:

vcpu -> pcpu                        // with pinning, which is the default for dom0, is guaranteed
mem = mem_from_node[node_of[pcpu]]  // not guaranteed by Xen, and probably on average it's not even likely (the Xen allocator isn't numa-smart)

To further analyse the possibility that the perf issue here is due to NUMA, I suggest you compare at least one of these things:

a) compare x86_64 xen perf results from a non-numa machine with those from one that is numa (both with and without numa=on)

b) restrict the ia64 numa machine to use a single (the first) node when running the test, for both bare metal and xen, and then compare their results. This requires restricting the memory the kernel (bare metal) or HV (xen) can see to fit within the first node, via the mem=<mem> kernel/HV command line option. Also use taskset and vcpu-set to ensure the processors selected are from the first node.

=============================================================================

Above is all I can think of right now related to the NUMA path this bug is going down. However, another path exists. Recall from comment 1 that this bug was considered a regression at one point. That means something changed: the kernel, the HV, the xen userspace, or the test code. We don't know what, but if something changed and it appeared to regress, then we should be able to bisect it.
(In reply to comment #26)
> =============================================================================
>
> Above is all I can think of right now related to the NUMA path this bug is
> going down.

I'll certainly follow up on this, thanks for the great hints.

> However, another path exists. Recall from comment 1 that this bug was
> considered as a regression at one point. That means something changed; the
> kernel, the HV, the xen userspace, the test code. We don't know what, but if
> something changed and it appeared to regress, then we should be able to
> bisect it.

I repeated Miroslav's test in comment 11, with the same, official RHEL-5.2 xen kernel (2.6.18-92.el5xen). My results are consistent with those of Mirek. So if this is a regression at all, it is not from 5.2 to some later version.

(In reply to comment #1)
> FYI. This test is running fine for RHEL5-Server-U2,
>
> kernel-xen-2.6.18-92.1.27.el5

This is a 5.2.z version. I just re-ran the test on this exact same version, and I got the same ~40 minute runtime. No xen userspace component is installed at all, because the test runs in dom0 and the test doesn't need any xen userspace tools/libs.

[root@pq0-0 ~]# uname -psrn
Linux pq0-0.lab.bos.redhat.com 2.6.18-92.1.27.el5xen ia64
[root@pq0-0 ~]# time -p ./mallocstress
[...]
main(): test passed.
real 2500.81
user 0.52
sys 2423.92

Now we have absolutely no *reproducible* proof that this is a regression.

comment #    | tested kernel-xen version                 | result
-------------+-------------------------------------------+---------------
comment 6    | 2.6.18-92 (RHEL-5.2)                      | takes too long
comment 11   | 2.6.18-92 (RHEL-5.2)                      | takes too long
this comment | 2.6.18-92 (RHEL-5.2)                      | takes too long
-------------+-------------------------------------------+---------------
comment 1    | 2.6.18-92.1.27 (RHEL-5.2.z)               | works
this comment | 2.6.18-92.1.27 (RHEL-5.2.z)               | takes too long
-------------+-------------------------------------------+---------------
comment 0    | 2.6.128 (? probably 2.6.18-128, RHEL-5.3) | takes too long
-------------+-------------------------------------------+---------------
comment 0    | 2.6.152 (? probably 2.6.18-152)           | takes too long
-------------+-------------------------------------------+---------------
comment 6    | 2.6.18-164 (RHEL-5.4)                     | takes too long
-------------+-------------------------------------------+---------------
comment 16   | 2.6.18-238 (RHEL-5.6)                     | takes too long

I'll continue with the NUMA tips Andrew gave in comment 26, under RHEL-5.6.

Created attachment 478079 [details]
dell-pesc1435-01 topology / comment 26 hint "a"

Created attachment 478080 [details]
hp-bl465cg7-01 topology / comment 26 hint "a"

Created attachment 478081 [details]
hp-dl585g5-01 topology / comment 26 hint "a"

Created attachment 478082 [details]
lacos-workstation topology / comment 26 hint "a"

Created attachment 478083 [details]
make my_yield() call pthread_yield() in mallocstress.c / comment 26 hint "a"

(In reply to comment #26)
> To further analyse the potential that the perf issue here is due to numa,
> then I suggest you compare at least one of these things.
>
> a) compare x86_64 xen perf results from a non-numa machine with one that is
> numa (both with and without numa=on)

See the above attachments for the CPU and RAM topologies of the examined machines. lacos-workstation is UMA, the other three are NUMA. The mallocstress source was patched to make it indeed CPU bound (see comment 32).
The following RHEL-5.6 kernels were installed for the bare metal tests:

Linux dell-pesc1435-01.rhts.eng.bos.redhat.com 2.6.18-238.el5 x86_64
Linux hp-bl465cg7-01.rhts.eng.bos.redhat.com 2.6.18-238.el5 x86_64
Linux hp-dl585g5-01.rhts.eng.bos.redhat.com 2.6.18-238.el5 x86_64
Linux lacos-workstation.usersys.redhat.com 2.6.18-238.el5 x86_64

And for xen dom0 (both numa=on and numa=off, in case of the NUMA machines):

Linux dell-pesc1435-01.rhts.eng.bos.redhat.com 2.6.18-238.el5xen x86_64
Linux hp-bl465cg7-01.rhts.eng.bos.redhat.com 2.6.18-238.el5xen x86_64
Linux hp-dl585g5-01.rhts.eng.bos.redhat.com 2.6.18-238.el5xen x86_64
Linux lacos-workstation.usersys.redhat.com 2.6.18-238.el5xen x86_64

Results:

machine                          | config        | real  | user | sys
---------------------------------+---------------+-------+------+------
lacos-workstation (uniform)      | bare metal    |  1.26 | 0.14 | 1.70
                                 | xen           |  3.91 | 0.16 | 0.90
---------------------------------+---------------+-------+------+------
hp-dl585g5-01 (numa nodes: 4)    | bare metal    |  1.98 | 0.29 | 2.31
                                 | xen, numa=on  |  5.62 | 0.18 | 1.98
                                 | xen, numa=off |  5.47 | 0.18 | 1.53
---------------------------------+---------------+-------+------+------
hp-bl465cg7-01 (numa nodes: 2)   | bare metal    |  8.29 | 1.34 | 9.89
                                 | xen, numa=on  | 19.90 | 0.59 | 5.80
                                 | xen, numa=off | 20.25 | 0.49 | 4.64
---------------------------------+---------------+-------+------+------
dell-pesc1435-01 (numa nodes: 2) | bare metal    |  1.95 | 0.29 | 2.34
                                 | xen, numa=on  |  4.10 | 0.23 | 2.18
                                 | xen, numa=off |  4.27 | 0.16 | 1.39

Analysis later.

(In reply to comment #33)
> Analysis later.

- On the NUMA machines, under xen, numa=off vs. numa=on doesn't seem to make a significant difference. Therefore I won't distinguish them from each other below; I'll commonly refer to them as "xen".
- The bare metal case is always CPU bound (real < sys).
- The bare metal -> xen transition multiplies the wall clock time by about 2.19 to 3.1. It also reduces the user and sys times.
- The xen cases all seem to waste wall clock time without using the CPU (real > user + sys).

The above might be explained by the HV taking over some responsibilities from the kernel, spending more CPU time on them, all the while the dom0 kernel can't track that CPU time.

The test shows that Xen incurs a performance penalty on x86_64, but nowhere near as grave as on ia64. Furthermore, the nondeterministic behavior seen on ia64 can't be reproduced on x86_64. (I've run the test multiple times.) The test is inconclusive. Either NUMA is not a significant factor in general, or NUMA under x86_64 vs. ia64 cannot be compared. On to comment 26 hint "b".

Created attachment 478089 [details]
pq0-0 topology / comment 26 hint "b"

(In reply to comment #26)
> b) restrict the ia64 numa machine to use a single (the first) node when running
> the test for both bare-metal and xen, and then compare their results. This
> requires restricting the memory the kernel (bare-metal) or HV (xen) can see to
> fit within the first node, the mem=<mem> kernel/HV command line option. Also
> use taskset and vcpu-set to ensure the processors selected are from the first
> node.

Repeating the test in comment 16 under these circumstances.
Bare metal:
----v----
[root@pq0-0 ~]# uname -psrn
Linux pq0-0.lab.bos.redhat.com 2.6.18-238.el5 ia64
[root@pq0-0 ~]# free -m
             total       used       free     shared    buffers     cached
Mem:          6993        685       6308          0         47        292
[root@pq0-0 ~]# cat /proc/cmdline
BOOT_IMAGE=scsi0:EFI\redhat\vmlinuz-2.6.18-238.el5 root=/dev/VolGroup00/LogVol00 console=ttyS0,19200 rhgb quiet mem=7G ro
[root@pq0-0 ~]# time -p ./mallocstress
[...]
main(): test passed.
real 9.49
user 0.01
sys 0.51
----^----

Xen dom0 "setup":
----v----
(XEN) Xen command line: BOOT_IMAGE=scsi0:EFI\redhat\xen.gz-2.6.18-238.el5 mem=7G
[root@pq0-0 ~]# uname -psrn
Linux pq0-0.lab.bos.redhat.com 2.6.18-238.el5xen ia64
[root@pq0-0 ~]# free -m
             total       used       free     shared    buffers     cached
Mem:          4096        956       3139          0         52        420
[root@pq0-0 ~]# cat /proc/cmdline
root=/dev/VolGroup00/LogVol00 console=ttyS0,19200 rhgb quiet ro
[root@pq0-0 ~]# xm info
node_to_cpu : node0:0-7 node1:8-15
(XEN) VCPU information and callbacks for domain 0:
(XEN) VCPU0: CPU0 [has=F] flags=1 upcall_pend = 00, upcall_mask = 00 dirty_cpus={} cpu_affinity={0}
(XEN) VCPU1: CPU1 [has=F] flags=1 upcall_pend = 00, upcall_mask = 00 dirty_cpus={} cpu_affinity={1}
(XEN) VCPU2: CPU2 [has=F] flags=1 upcall_pend = 00, upcall_mask = 00 dirty_cpus={} cpu_affinity={2}
(XEN) VCPU3: CPU3 [has=F] flags=1 upcall_pend = 00, upcall_mask = 00 dirty_cpus={} cpu_affinity={3}
[root@pq0-0 ~]# xm vcpu-list
Name      ID VCPUs CPU State Time(s) CPU Affinity
Domain-0   0     0   0 r--      30.3 0
Domain-0   0     1   1 -b-      10.9 1
Domain-0   0     2   2 -b-      17.8 2
Domain-0   0     3   3 -b-      10.1 3
----^----

Xen dom0 results (repeated multiple times):
----v----
[root@pq0-0 ~]# time -p taskset -c 0 ./mallocstress -t 2 -l 10
[...]
main(): test passed.
real 4.20
user 0.00
sys 0.50
[root@pq0-0 ~]# time -p taskset -c 0,1 ./mallocstress -t 2 -l 10
[...]
main(): test passed.
real 35.56
user 0.00
sys 32.08
----^----

As soon as we allow the threads to run on separate VCPUs (corresponding separate PCPUs), even though we forced the use of the first NUMA node, the sys CPU time blows up. NUMA seems to be out of the picture.

In the next test, I'll try to pin all VCPUs to a single PCPU. The next culprit candidate is TLB flushes (also mentioned by Paolo earlier). Lock contention or some other contention is unlikely. I compiled the mallocstress test earlier in debug mode, which prints a message for each alloc/dealloc. With two threads, one of the threads finishes much earlier. Even after that, unmappings by the final worker thread take a very long time, if the range being unmapped is big.

Rebooted pq0-0 with "dom0_vcpus_pin=0" as the sole Xen command line parameter. In dom0, pinned all VCPUs to PCPU 1:
----v----
[root@pq0-0 ~]# xm vcpu-pin 0 all 1
Name      ID VCPUs CPU State Time(s) CPU Affinity
Domain-0   0     0   1 -b-      39.0 1
Domain-0   0     1   1 r--      21.8 1
Domain-0   0     2   1 -b-      20.4 1
Domain-0   0     3   1 -b-      14.3 1
(XEN) VCPU information and callbacks for domain 0:
(XEN) VCPU0: CPU1 [has=F] flags=1 upcall_pend = 00, upcall_mask = 00 dirty_cpus={} cpu_affinity={1}
(XEN) VCPU1: CPU1 [has=F] flags=1 upcall_pend = 00, upcall_mask = 00 dirty_cpus={} cpu_affinity={1}
(XEN) VCPU2: CPU1 [has=F] flags=1 upcall_pend = 00, upcall_mask = 00 dirty_cpus={} cpu_affinity={1}
(XEN) VCPU3: CPU1 [has=F] flags=1 upcall_pend = 00, upcall_mask = 00 dirty_cpus={} cpu_affinity={1}
----^----

I have run several instances of the following command:
----v----
[root@pq0-0 ~]# time -p taskset -c X,Y ./mallocstress -t 2 -l 10
----^----

Many times, the times were:
----v----
real 4.16
user 0.00
sys 0.40
----^----

(This is also returned if, instead of "-c X,Y", I only pass "-c X".)
However, many times the results looked like:

real 11.95      real 28.76      real 28.62      real 24.94
user  0.00      user  0.00      user  0.00      user  0.00
sys   5.09      sys  15.82      sys  14.78      sys  12.55

With "-c X", the results are consistent (always 4.1 / 0 / 0). With "-c X,Y", the times vary non-deterministically. For example, running "-c 0,1" five times in a row returned 4.2 / 0 / 0 thrice; then 16 / 0 / 7.1 once; then finally 37.3 / 0 / 20.3. All VCPUs are pinned to PCPU 1. I have absolutely no idea what's going on. An interesting tidbit is that in the "slow" cases, the wall clock time is approximately twice the sys time.

(In reply to comment #37)
> With "-c X", the results are consistent (always 4.1 / 0 / 0). With "-c X,Y",
> the times vary non-deterministically. For example, running "-c 0,1" five times
> in a row returned 4.2 / 0 / 0 thrice; then 16 / 0 / 7.1 once, then finally
> 37.3 / 0 / 20.3.
>
> All VCPUs are pinned to PCPU 1.

This was a good test. AFAICT, it shows we should concentrate on the VCPU context switch penalty: TLB flushing, etc.

(In reply to comment #37)
> With "-c X,Y", the times vary non-deterministically.

I rebooted Xen with "dom0_max_vcpus=16 --", and that seems to have made the "-c X,Y" times deterministic: there is now an almost invariable (if hardly "negligible") factor of about 11 from "-c X" to "-c X,Y". (The test was "mallocstress -t 2 -l 10".)

real 4.20        real 46.84
user 0.00   vs.  user  0.00
sys  0.51        sys  43.15

Enabling all PCPUs (and quadrupling the number of VCPUs) made the performance worse, but it also stabilized it. VCPUs are pinned, and it does not seem to matter whether or not X and Y share a NUMA node.

Created attachment 478250 [details]
systemtap script

The attached systemtap script outputs a backtrace of any VCPU running do_munmap, for every 10 ms spent there.
The output looks like this (I'll attach the complete output shortly):

CPU#4 | thread 10011 (mallocstress) in xen_ptcga after 149 ms in do_munmap
 0xa00000010006a2d0 : xen_ptcga+0x10/0x20 [kernel]
 0xa000000100068b70 : ia64_global_tlb_purge+0x5b0/0xa80 [kernel]
 0xa000000100068530 : flush_tlb_range+0x1b0/0x240 [kernel]
 0xa000000100154500 : unmap_region+0x380/0x480 [kernel]
 0xa000000100157080 : do_munmap+0x3e0/0x520 [kernel]
 0xa000000100066200 : kretprobe_trampoline+0x0/0x20 [kernel]
 0xa000000100068b70 : ia64_global_tlb_purge+0x5b0/0xa80 [kernel]
... still in xen_ptcga after 297 ms
... still in xen_ptcga after 445 ms
... still in xen_ptcga after 593 ms
... still in xen_ptcga after 742 ms
CPU#4 | thread 10011 (mallocstress) in do_munmap for 743 ms

CPU#10 | thread 10010 (mallocstress) in xen_ptcga after 38 ms in do_munmap
 0xa00000010006a2d0 : xen_ptcga+0x10/0x20 [kernel]
 0xa000000100068b70 : ia64_global_tlb_purge+0x5b0/0xa80 [kernel]
 0xa000000100068530 : flush_tlb_range+0x1b0/0x240 [kernel]
 0xa000000100154500 : unmap_region+0x380/0x480 [kernel]
 0xa000000100157080 : do_munmap+0x3e0/0x520 [kernel]
 0xa000000100066200 : kretprobe_trampoline+0x0/0x20 [kernel]
 0xa000000100068b70 : ia64_global_tlb_purge+0x5b0/0xa80 [kernel]
... still in xen_ptcga after 75 ms
... still in xen_ptcga after 113 ms
CPU#10 | thread 10010 (mallocstress) in do_munmap for 113 ms

CPU#10 | thread 10010 (mallocstress) in xen_ptcga after 150 ms in do_munmap
 0xa00000010006a2d0 : xen_ptcga+0x10/0x20 [kernel]
 0xa000000100068b70 : ia64_global_tlb_purge+0x5b0/0xa80 [kernel]
 0xa000000100068530 : flush_tlb_range+0x1b0/0x240 [kernel]
 0xa000000100154500 : unmap_region+0x380/0x480 [kernel]
 0xa000000100157080 : do_munmap+0x3e0/0x520 [kernel]
 0xa000000100066200 : kretprobe_trampoline+0x0/0x20 [kernel]
 0xa000000100068b70 : ia64_global_tlb_purge+0x5b0/0xa80 [kernel]
... still in xen_ptcga after 299 ms
... still in xen_ptcga after 449 ms
CPU#10 | thread 10010 (mallocstress) in do_munmap for 449 ms

Note how sparse the samples are (gaps of up to 150 ms). The script should produce an output line roughly every 10 ms (and it does, if you make it print something on every timer.profile event). This means that the time is indeed spent in the hypervisor (as hinted in comment 21 and comment 34), in the hypercall that executes the ptc.ga (global TLB purge) operation.

The implementation of the hypercall is in hyper_ptc_ga (arch/ia64/xen/hyperprivop.S), but it actually goes straight to dispatch_break_fault, and from there priv_ptc_ga -> vcpu_ptc_ga -> domain_flush_vtlb_range. In turn, domain_flush_vtlb_range calls ia64_global_tlb_purge at the end (which is the same code as on bare metal). First, however, Xen has to flush the "Virtual Hash Page Table" (VHPT), a large extension of the TLB that resides in memory and can be searched automatically by the processor.

http://www.mail-archive.com/xen-ia64-devel@lists.xensource.com/msg07611.html points to the VHPT as a culprit for performance degradation when the OS performs frequent TLB flushing. The next step is to understand the reason for the different treatment of the VHPT in Linux and Xen.

Created attachment 478251 [details]
systemtap output
(In reply to comment #41)
> This means that the time is indeed spent in the hypervisor (as hinted in
> comment 21 and comment 34) in the hypercall that executes the ptc.ga
> (global TLB purge) operation.
>
> The implementation of the hypercall is in hyper_ptc_ga
> (arch/ia64/xen/hyperprivop.S), but it actually goes straight to
> dispatch_break_fault and from there priv_ptc_ga -> vcpu_ptc_ga ->
> domain_flush_vtlb_range.

Fantastic! Actually, I had followed the kernel source from unmap_vmas() (following up on comment 20):

mm/memory.c                    unmap_vmas
include/asm-ia64/tlb.h         tlb_finish_mmu
include/asm-ia64/tlb.h         ia64_tlb_flush_mmu
arch/ia64/mm/tlb.c             flush_tlb_range
include/asm-ia64/machvec.h     platform_global_tlb_purge
arch/ia64/mm/tlb.c             ia64_global_tlb_purge
include/asm-ia64/tlbflush.h    flush_tlb_all
arch/ia64/mm/tlb.c             local_flush_tlb_all
include/asm-ia64/xen/privop.h  ia64_ptce
include/asm-ia64/gcc_intrin.h  "ptc.e %0"

... and the "ptc.e" instruction was where I gave up. I should have been more careful, because "include/asm-ia64/xen/privop.h" has a comment for ia64_ptce (and others) -- just five lines below where "ia64_ptcga" is defined as "xen_ptcga" :) --

    /* the remainder of these are not performance-sensitive so its
     * OK to not paravirtualize and just take a privop trap and emulate */

So "ptc.e" is trapped. Instead of giving up, I should have continued into the hypervisor -- although, from your stap results, it's now clear I was on the wrong path anyway.

> In turn, domain_flush_vtlb_range calls ia64_global_tlb_purge at the end
> (which is the same code as bare-metal). First, however, Xen has to flush the
> "Virtual Hash Page Table", a large extension of TLB that resides in memory
> and can be automatically searched by the processor.
> http://www.mail-archive.com/xen-ia64-devel@lists.xensource.com/msg07611.html
> points to the VHPT as a culprit for performance degradation when the OS
> performs frequent TLB flushing.
>
> The next step is to understand the reason for the different treatment of the
> VHPT in Linux and Xen.

Probably irrelevant, but ia64_tlb_flush_mmu() flushes the TLB first and the VHPT second:

    /* flush the address range from the tlb: */
    flush_tlb_range(&vma, start, end);
    /* now flush the virt. page-table area mapping the address range: */
    flush_tlb_range(&vma, ia64_thash(start), ia64_thash(end));

I'm now thinking that perhaps we have the same problem between Xen and bare metal as described in the mailing list message you found: different VHPT sizes. I can see some "dom0_vhpt_size_log2" Xen parameter patches in xen-unstable.

I added "pervcpu_vhpt=0" after "dom0_max_vcpus=16", just to perturb the results. The consistent 49 / 0 / 46 triple promptly worsened to a consistent 62.22 / 0.00 / 58.82. (With "mallocstress -t 2 -l 10", no taskset.) I think this confirms Paolo's analysis; we have to poke at the VHPT.

The size of the dom0 VHPT in 2.6.18-243.el5 Xen appears to be 64 KB:

include/asm-ia64/vhpt.h:8:
# define VHPT_SIZE_LOG2   16   // 64KB

This is supported by the Xen boot logs:

[...]
(XEN) vhpt_init: vhpt paddr=0x4080030000, end=0x408003ffff
[...]

So far, I have been unable to figure out the VHPT size on bare metal.

Further tests to look at:
- run the test in an ia64 PV domU (thanks to Alex for the hint),
- try it on a non-NUMA ia64 box: rx26XX/rx36XX/rx46XX/rx66XX (ditto),
- rebuild the HV for pq0-0 with VHPT_SIZE_LOG2 defined to e.g. 15.

Changing the VHPT size from 64 KB to 32 KB doesn't seem to affect the performance. As in comment 40, the test was "mallocstress -t 2 -l 10"; the times are basically identical:

real 47.38
user 0.00
sys 43.86

The HV command line was the same "dom0_max_vcpus=16 loglvl=all guest_loglvl=all".

(XEN) vhpt_init: vhpt paddr=0x201fcfe0000, end=0x201fcfe7fff

Repeated the previous test with a 512 KB VHPT:

(XEN) vhpt_init: vhpt paddr=0x4080100000, end=0x408017ffff

No change in times over a factor-of-16 variation in VHPT size:
real 48.01
user 0.00
sys 44.45

I have found a possibly relevant paper under [0]: "A Hash-TLB Approach for MMU Virtualization in Xen/IA64". The direct link is [1]. This paper and the related software may well have been developed later than the code in RHEL5 Xen, but I still have a feeling that the explanation of our problem is staring me in the face from the paper; I just can't quite put my finger on it. Some quotes, with liberal snipping:

----v----
2.2 The single TLB approach in paravirtualization

Paravirtualization in the Xen/IA-64 has used a single TLB approach based on
the VHPT without collision chain. The VMM under such a design implements the
smallest number of virtual TLBs required by the Itanium specification. For
paravirtualization, the single TLB approach has the following
characteristics: [...]

Performance benefits of the TLB flush-all virtualization: In some cases, the
OS may use the ptc.e instruction to purge all the TLB entries. In the
virtualization environment, the VMM similarly needs to purge all the TC
entries from the virtual TLB. If we can ensure that there is only one entry
present in the virtual TLB, the execution time will be reduced and the VM
will get better performance.

3 Hash TLB Virtualization based on VHPT

[...] In the TLB virtualization, the paravirtualization adopts the
8 TR + 1 TC model. For both the ITLB and DTLB, the model represents the
smallest number of TLBs to ensure the system's forward progress according to
the IA64 specification. Such a model is the single TLB approach mentioned in
the previous sections. [...]

4.3 STPV vs HTPV

We have implemented the hash TLB design for PV virtualization. We measure
the overall performance of the PV VMs under both the STPV and HTPV
approaches. [...] [T]he hash TLB does not improve the performance of the
virtual machines under PV virtualization. [...] [U]nder the specialized IO
mode of the PV guest VM, the number of TLB misses as a result of IO
operations is reduced.
Hence, it is not necessary for the guest VM TLB to track these accesses.
Moreover, because the guest OS is modified, instructions for manipulating
the TLB can, after paravirtualization, use relatively simple logic to insert
information into the system TLB and VHPT. As a result, the complex logic of
the hash TLB virtualization can negatively impact the performance of the
paravirtualization approach. [...]
----^----

[0] http://www.cs.purdue.edu/homes/yau/publications.html
[1] http://www.cs.purdue.edu/homes/yau/papers/smtps2008.pdf

(In reply to comment #45)
> - run the test in an ia64 pv domU (thanks to Alex for the hint),

I created a PV domain with 10 GB RAM and 16 VCPUs, and pinned each VCPU to a separate PCPU. In that domain, "./mallocstress -t 2 -l 10" completed reproducibly in

real 3.90
user 0.00
sys 0.27

without any taskset restriction. (Cf. comment 40 -- this largely corresponds to the "taskset -c X" dom0 time.) The complete mallocstress test (i.e. no -t or -l option passed) took

real 6838.64
user 0.72
sys 6619.10

During the test, virt-manager displayed a 12% load for the domain running mallocstress, with occasional bursts up to 20%. Inside the domain, "top" displayed a similar sys load (and not much else), and according to "uptime" the loadavg varied between 20 and 40 (mostly near 20). The HV command line was the usual "dom0_max_vcpus=16 loglvl=all guest_loglvl=all".

Created a fullhvm guest similar to the PV one in comment 49: 10 GB RAM, 16 VCPUs, each VCPU pinned to a different PCPU. Same HV environment.

[root@localhost ~]# time -p ./mallocstress -t 2 -l 10
[...]
real 67.92
user 0.02
sys 67.00

(Reproducibly.) This resembles the "pervcpu_vhpt=0" dom0 results in comment 44.

[root@localhost ~]# time -p taskset -c 9 ./mallocstress -t 2 -l 10
[...]
real 5.68
user 0.00
sys 4.72

I won't try the full mallocstress run, as it would probably take ages.
(In reply to comment #45)
> - try it on a non-numa ia64 box: rx26XX/rx36XX/rx46XX/rx66XX (ditto),

**** hp-rx2660-01, bare metal:

[root@hp-rx2660-01 ~]# uname -psrn
Linux hp-rx2660-01.rhts.eng.bos.redhat.com 2.6.18-238.el5 ia64
[root@hp-rx2660-01 ~]# time -p ./mallocstress
[...]
real 8.98
user 0.02
sys 0.22

**** hp-rx2660-01, dom0:

[root@hp-rx2660-01 ~]# uname -psrn
Linux hp-rx2660-01.rhts.eng.bos.redhat.com 2.6.18-238.el5xen ia64
[root@hp-rx2660-01 ~]# time -p ./mallocstress -t 2 -l 10
[...]
real 9.05
user 0.00
sys 5.77

This was a virtually constant result. Using "taskset -c X", 3.71 / 0 / 0.17 was a stable set of resulting times. With "taskset -c X,Y", the times varied between 3.71 / 0 / 0.17 and the above 9.05 / 0.00 / 5.77. The full test took 1502.62 / 1.52 / 1436.38.

**** hp-rx2660-03, bare metal:

[root@hp-rx2660-03 ~]# uname -psrn
Linux hp-rx2660-03.rhts.eng.bos.redhat.com 2.6.18-238.el5 ia64
[root@hp-rx2660-03 ~]# time -p ./mallocstress
[...]
real 9.78
user 0.00
sys 0.06

**** hp-rx2660-03, dom0:

[root@hp-rx2660-03 ~]# uname -psrn
Linux hp-rx2660-03.rhts.eng.bos.redhat.com 2.6.18-238.el5xen ia64
[root@hp-rx2660-03 ~]# time -p ./mallocstress -t 2 -l 10
[...]
real 22.61
user 0.00
sys 19.06

This result was practically constant as well; trying "taskset -c X,Y" several times never changed these numbers.

[root@hp-rx2660-03 ~]# time -p taskset -c 0 ./mallocstress -t 2 -l 10
[...]
real 4.57
user 0.00
sys 0.79

(Reproducibly.) The full test took 4202.80 / 0.62 / 4116.68.

As these boxen have four PCPUs (topologies to be attached soon), nothing was passed to the HV except loglevel switches. VCPUs are pinned for dom0 by default.

Created attachment 478527 [details]
hp-rx2660-01 topology
Created attachment 478528 [details]
hp-rx2660-03 topology
Latest git version of LTP results on hp-rx2660-04.rhts.eng.bos.redhat.com:

RHEL5.6
=======
kernel:
real    0m8.968s
user    0m0.002s
sys     0m0.021s

kernel-xen:
real    18m49.002s
user    0m0.728s
sys     18m8.408s

RHEL5.5
=======
kernel:
real    0m8.975s
user    0m0.002s
sys     0m0.031s

kernel-xen:
real    18m51.607s
user    0m0.672s
sys     18m11.188s

RHEL5.4
=======
kernel:
real    0m8.970s
user    0m0.002s
sys     0m0.032s

kernel-xen:
real    18m46.917s
user    0m0.956s
sys     18m14.324s