Bug 504762

Summary: extremely expensive TLB flushes on ia64 (LTP Mallocstress Test)
Product: Red Hat Enterprise Linux 5 Reporter: Qian Cai <qcai>
Component: kernel-xen Assignee: Xen Maintainance List <xen-maint>
Status: CLOSED CANTFIX QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: low Docs Contact:
Priority: low    
Version: 5.3 CC: alex.williamson, cdahlin, clalance, czhang, drjones, lersek, mrezanin, pbonzini, syeghiay, tao, xen-maint
Target Milestone: rc   
Target Release: ---   
Hardware: ia64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2011-02-15 08:11:19 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 514490    
Attachments:
Description Flags
strace output for one minute of mallocstress
none
dom0 dmesg about oprofile (perfmon) startup failure
none
x86_64 bare metal opreport, with my_yield() calling usleep(0)
none
x86_64 xen dom0 opreport, with my_yield() calling usleep(0)
none
current CPU and memory topology of hp-rx8640-02
none
dell-pesc1435-01 topology / comment 26 hint "a"
none
hp-bl465cg7-01 topology / comment 26 hint "a"
none
hp-dl585g5-01 topology / comment 26 hint "a"
none
lacos-workstation topology / comment 26 hint "a"
none
make my_yield() call pthread_yield() in mallocstress.c / comment 26 hint "a"
none
pq0-0 topology / comment 26 hint "b"
none
systemtap script
none
systemtap output
none
hp-rx2660-01 topology
none
hp-rx2660-03 topology
none

Description Qian Cai 2009-06-09 09:28:23 UTC
Description of problem:
The LTP test suite could not complete for kernel-xen on ia64 because the mallocstress test case seems to block forever,

mallocstress  D a00000010065cac0     0 26293  24968         26294 26292 (NOTLB)

Call Trace:
 [<a000000100658860>] schedule+0x1800/0x1c60
                                sp=e00000016830fd80 bsp=e0000001683092b8
 [<a00000010065cac0>] rwsem_down_write_failed+0x4c0/0x540
                                sp=e00000016830fe10 bsp=e000000168309270
 [<a0000001000ca410>] down_write+0x90/0xc0
                                sp=e00000016830fe30 bsp=e000000168309250
 [<a00000010014e170>] sys_munmap+0x50/0xa0
                                sp=e00000016830fe30 bsp=e0000001683091d0
 [<a00000010006a680>] xen_trace_syscall+0x100/0x140
                                sp=e00000016830fe30 bsp=e0000001683091d0
 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400
                                sp=e000000168310000 bsp=e0000001683091d0


This problem has not been seen with bare metal kernels.

Version-Release number of selected component (if applicable):
kernel-xen-2.6.152.el5
kernel-xen-2.6.128.el5

How reproducible:
always

Steps to Reproduce:
1. reserve pq0-0.lab.bos.redhat.com installed RHEL5-Server-U3.
2. download LTP test suite from http://porkchop.devel.redhat.com/qa/rhts/lookaside/ltp-full-20090228.tgz
3. ./configure; make; make install
4. cd testcases/kernel/mem/mtest07; ./mallocstress

Actual results:
Blocked

Expected results:
Passed

Additional info:
http://rhts.redhat.com/cgi-bin/rhts/test_log.cgi?id=8432510

Comment 1 Qian Cai 2009-06-29 03:26:25 UTC
FYI. This test is running fine for RHEL5-Server-U2,

kernel-xen-2.6.18-92.1.27.el5

Comment 4 Miroslav Rezanina 2009-10-19 12:41:48 UTC
Is the test really blocking forever? I tried to run it, and it simply takes too much time to execute. The RHTS log shows the test was canceled due to a timeout. What about extending the duration of the test (it takes me about 2.5 hours)?

Comment 5 Qian Cai 2009-10-19 13:05:58 UTC
I do not think I waited as long as 2.5 hours. As far as I can tell, it took much less time in RHEL 5.2.

Comment 6 Miroslav Rezanina 2009-10-21 04:49:59 UTC
Tested the 2.6.18-92 and 2.6.18-164 kernels in the RHEL 5.4 release. The test was run three times in a row with the following results:

2.6.18-92:
1.) systime 7212.84, elapsed time 2:01:50, 100000 minor page faults
2.) systime 7209.36, elapsed time 2:01:46, 99678 minor page faults
3.) systime 7218.31, elapsed time 2:01:56, 99386 minor page faults

2.6.18-164:
1.) systime 7204.55, elapsed time 2:01:40, 96497 minor page faults
2.) systime 7222.78, elapsed time 2:01:58, 97459 minor page faults
3.) systime 7229.79, elapsed time 2:02:04, 97404 minor page faults

As the results show, in RHEL 5.4 the test execution takes more than the 2 hours allotted to the test, so the test has to fail. However, this time is also reached with the RHEL 5.2 kernel.

The test never blocks forever; it just was not able to finish in the expected time.

Comment 7 Qian Cai 2009-10-21 14:11:15 UTC
I got a soft lockup here.

mallocstress  D a000000100642180     0  5854   5683                5853 (NOTLB)

Call Trace:
 [<a00000010063d9b0>] schedule+0x1750/0x1bc0
                                sp=e00000014ccefc60 bsp=e00000014cce93c8
 [<a000000100642180>] rwsem_down_read_failed+0x4e0/0x550
                                sp=e00000014ccefcf0 bsp=e00000014cce9380
 [<a0000001000c7fb0>] down_read+0x30/0x60
                                sp=e00000014ccefd10 bsp=e00000014cce9360
 [<a0000001000c9a20>] futex_wake+0x40/0x200
                                sp=e00000014ccefd10 bsp=e00000014cce9320
 [<a0000001000cc240>] do_futex+0x4e0/0x1f20
                                sp=e00000014ccefd30 bsp=e00000014cce92b0
 [<a0000001000cdf00>] sys_futex+0x280/0x2e0
                                sp=e00000014ccefe20 bsp=e00000014cce9220
 [<a000000100069060>] xen_trace_syscall+0x100/0x140
                                sp=e00000014ccefe30 bsp=e00000014cce9220
 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400
                                sp=e00000014ccf0000 bsp=e00000014cce9220
bash          R  running task       0  5857   3882                     (NOTLB)
BUG: soft lockup - CPU#0 stuck for 10s! [bash:5857]
Modules linked in: ipt_MASQUERADE iptable_nat ip_nat xt_state ip_conntrack nfnetlink ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables netloop netbk blktap blkbk bridge autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ipv6 xfrm_nalgo crypto_api ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr iscsi_tcp libiscsi scsi_transport_iscsi vfat fat dm_multipath button parport_pc lp parport sg tg3 e100 mii e1000 dm_snapshot dm_zero dm_mirror dm_mod lpfc scsi_transport_fc shpchp mptspi mptscsih mptbase scsi_transport_spi sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd

Pid: 5857, CPU 0, comm:                 bash
psr : 00000210085a6010 ifs : 8000000000000003 ip  : [<a0000001006432a1>]    Not tainted
ip is at _spin_unlock_irqrestore+0xc1/0x140
unat: 0000000000000000 pfs : 800000000000050d rsc : 000000000000000b
rnat: 0000000000000000 bsps: 0000000000000000 pr  : 000000000059aa99
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a0000001003bd290 b6  : a0000001003ec7c0 b7  : 0000000000000000
f6  : 000000000000000000000 f7  : 000000000000000000000
f8  : 000000000000000000000 f9  : 000000000000000000000
f10 : 000000000000000000000 f11 : 000000000000000000000
r1  : a000000100c11fd0 r2  : a000000100a2a418 r3  : a000000100a2a418
r8  : 0000000000000001 r9  : fffffffffff00001 r10 : 0000000000000001
r11 : fffffffffff00000 r12 : e00000014b3c7e20 r13 : e00000014b3c0000
r14 : 0000000000004000 r15 : 0000000000000001 r16 : fffffffffff04c18
r17 : fffffffffff00001 r18 : a000000100a12788 r19 : 0000000000000007
r20 : a000000100849300 r21 : a000000100a12628 r22 : e00000014b3c0208
r23 : a000000100a2a128 r24 : 0000000000000001 r25 : a000000100a122e8
r26 : a0000001007cc138 r27 : a000000100849000 r28 : a000000100a12680
r29 : 00001412085a6010 r30 : 0000000000000000 r31 : 0000000000000000

Call Trace:
 [<a00000010001d180>] show_stack+0x40/0xa0
                                sp=e00000014b3c7a80 bsp=e00000014b3c1630
 [<a00000010001da80>] show_regs+0x840/0x880
                                sp=e00000014b3c7c50 bsp=e00000014b3c15d0
 [<a0000001000fad10>] softlockup_tick+0x310/0x380
                                sp=e00000014b3c7c50 bsp=e00000014b3c1588
 [<a0000001000a48f0>] run_local_timers+0x30/0x60
                                sp=e00000014b3c7c50 bsp=e00000014b3c1568
 [<a0000001000a49a0>] update_process_times+0x80/0x100
                                sp=e00000014b3c7c50 bsp=e00000014b3c1530
 [<a000000100042ac0>] timer_interrupt+0x800/0xa40
                                sp=e00000014b3c7c50 bsp=e00000014b3c14d8
 [<a0000001000fb350>] handle_IRQ_event+0x150/0x240
                                sp=e00000014b3c7c50 bsp=e00000014b3c1498
 [<a0000001000fb570>] __do_IRQ+0x130/0x420
                                sp=e00000014b3c7c50 bsp=e00000014b3c1450
 [<a0000001003e80c0>] evtchn_do_upcall+0x160/0x260
                                sp=e00000014b3c7c50 bsp=e00000014b3c13b8
 [<a000000100068c60>] xen_event_callback+0x380/0x3c0
                                sp=e00000014b3c7c50 bsp=e00000014b3c13b8

Comment 8 Qian Cai 2009-10-21 14:15:14 UTC
Not sure if this is because of different userspace components, such as a change in the Xen version. The original problem was seen using RHEL5-Server-U2.

Comment 10 Qian Cai 2009-10-22 12:36:36 UTC
Miroslav, I have installed the system using RHEL5-Server-U2 distro, and everything seems working great there every time.

# time ./mallocstress
Thread [43]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [51]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [47]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [3]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [39]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [27]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [11]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [15]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [31]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [35]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [7]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [23]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [55]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [19]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [59]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [6]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [50]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [2]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [54]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [58]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [14]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [30]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [34]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [46]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [22]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [18]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [26]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [42]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [10]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [38]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [49]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [53]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [21]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [29]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [33]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [9]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [41]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [25]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [5]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [13]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [45]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [1]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [17]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [57]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [37]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [12]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [16]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [8]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [24]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [28]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [0]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [20]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [32]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [4]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [40]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [48]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [36]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [56]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [52]: allocate_free() returned 0, succeeded.  Thread exiting.
Thread [44]: allocate_free() returned 0, succeeded.  Thread exiting.
main(): test passed.

real	0m9.415s
user	0m0.027s
sys	0m0.235s

Comment 11 Miroslav Rezanina 2009-10-23 08:19:52 UTC
CAI's results are for the non-xen kernel. When trying the equivalent xen kernel, the results are:

real	41m36.795s
user	0m0.516s
sys	40m20.836s


This means that in the xen environment the test is much slower. And it is even slower in newer releases.

Comment 16 Laszlo Ersek 2011-02-08 15:09:45 UTC
Created attachment 477629 [details]
strace output for one minute of mallocstress

I reproduced the problem under RHEL-5.6 and did some preliminary measurements.

On the bare metal:

[root@pq0-0 ~]# uname -psrn
Linux pq0-0.lab.bos.redhat.com 2.6.18-238.el5 ia64

mallocstress completes in less than ten seconds.

I traced the same test with the following command, for one minute, in dom0:

[root@pq0-0 ~]# uname -psrn
Linux pq0-0.lab.bos.redhat.com 2.6.18-238.el5xen ia64

$ strace -f -T -etrace=mmap,munmap,mprotect,madvise ./mallocstress 2>&1 \
| tee -i log-one-minute

The '-T' option outputs the time spent in each syscall when it completes. The following script creates a listing of times for the four traced syscalls:

for I in mmap munmap mprotect madvise; do
  grep $I log-one-minute \
  | sed -n -r 's/^.*<([0-9]+\.[0-9]+)>$/\1/p' >times-$I
done
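
A minimal helper along these lines (a sketch; any equivalent tool would do) then produces the per-syscall statistics from the times-* files. Build with "gcc -o stats stats.c -lm" and run e.g. "./stats < times-munmap":

/* Sketch: read one duration per line from stdin, print sample size,
 * mean, and sample standard deviation. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    double x, sum = 0.0, sumsq = 0.0;
    long n = 0;

    while (scanf("%lf", &x) == 1) {
        sum += x;
        sumsq += x * x;
        n++;
    }
    if (n < 2) {
        fprintf(stderr, "need at least two samples\n");
        return 1;
    }
    printf("sample size      %ld\n", n);
    printf("mean time (s)    %f\n", sum / n);
    printf("sample stdev (s) %f\n",
           sqrt((sumsq - sum * sum / n) / (n - 1)));
    return 0;
}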

Summary:

		    madvise	 mmap  mprotect    munmap
sample size	         22      1490       447       703
mean time (s)      0.155120  0.040717  0.029127  0.199247
sample stdev (s)   0.221008  0.129294  0.093251  0.260324

The performance of the munmap() calls looks especially horrifying -- five
munmap()'s per second, on the average. The sample standard deviations are huge.
Either the dom0 <-> hv switch takes a lot of time, or there must be some heavy
lock contention in the hypervisor on munmap().

Comment 17 Laszlo Ersek 2011-02-08 16:31:47 UTC
Created attachment 477644 [details]
dom0 dmesg about oprofile (perfmon) startup failure

To get a better understanding of what's happening, I tried to oprofile the machine while running the malloc stress. I installed the "kernel-xen-debuginfo" package, configured the xen and vmlinux images for oprofile:

$ opcontrol --vmlinux=/usr/lib/debug/lib/modules/2.6.18-238.el5xen/vmlinux
$ opcontrol --xen=/usr/lib/debug/boot/xen-syms-2.6.18-238.el5.debug

And then tried to start the oprofile daemon.

$ opcontrol --start --verbose

[...]
executing oprofiled --session-dir=/var/lib/oprofile --separate-lib=0 --separate-kernel=0 --separate-thread=0 --separate-cpu=0 --events=CPU_CYCLES:18:0:1000000:0:1:1, --vmlinux=/usr/lib/debug/lib/modules/2.6.18-238.el5xen/vmlinux --kernel-range=a000000100000000,a000000100698940 --xen-image=/usr/lib/debug/boot/xen-syms-2.6.18-238.el5.debug --xen-range=f000000004000000,f0000000040e3470 --verbose=all
[...]
Using log file /var/lib/oprofile/samples/oprofiled.log
Daemon started.
Profiler running.

Contrary to the above, the daemon failed to start; "oprofiled.log" says
"Couldn't start perfmon: : Invalid argument".

I have enabled perfmon debug logging (echo 1 >/proc/sys/kernel/perfmon/debug) and tried to start the daemon again. It failed the same way, but now the attached messages were logged. pfm_start() returns -22 (EINVAL) because no context is loaded. The context is not loaded anymore at that time because pfm_flush() forces an unload ("forcing unload") earlier. The source says "we must stop and unload because we are losing access to the context". I don't understand that, but it seems clear that the oprofile machinery tries to rely on perfmon on IA64, and that on this specific machine perfmon stabs oprofile in the back.

I'm reserving another IA64 machine (hp-rx8640-02) and I will retry the entire
thing there (first the mallocstress, then -- if it is slow there as well --
oprofiling).

Comment 18 Laszlo Ersek 2011-02-08 17:24:49 UTC
I have reserved hp-rx8640-02, and all results match those on pq0-0:
- on the bare metal, the mallocstress test finishes in about 11 seconds,
- in dom0, the test promises to take forever (I interrupted it after one minute
  or so),
- The oprofile daemon failed to start with the exact same symptoms (dmesg
  included) as in comment 17.

Comment 19 Laszlo Ersek 2011-02-09 13:30:04 UTC
Andrew had the great idea to repeat the test on x86_64, where oprofile has a better chance to work. This should help us identify the slow path at least on x86_64, and we could try looking at the same code for ia64.

A summary of the results this far:

(1) On x86_64 bare metal, -242 kernel, the times taken by mallocstress are:
real 9.16 / user 0.00 / sys 0.01. I will upload an oprofile report for this case.

(2) On x86_64 xen dom0, -242 kernel and hv, the times taken by mallocstress are:
real 36.68 / user 0.00 / sys 0.02. I will upload an oprofile report for this case as well.

(3) On ia64 bare metal, -238 kernel (= U6, but also on U2, see comment 10 and comment 11), the test takes, similarly, about 9-10 seconds, and practically no CPU time.

(4) On ia64 xen dom0, the test takes forever, and all wall clock time is spent on sys CPU time.

In cases (1)-(3), the mallocstress process seems to be "IO-bound", except that it does no IO. I suspect the "my_yield()" function, which, instead of calling pthread_yield(), calls usleep() or nanosleep() (depending on the SPEW_SIGNALS macro). Indeed, I re-ran case (1) -- without profiling -- with my_yield() calling pthread_yield() instead of the sleep functions, and the times now look like: real 1.26 / user 0.14 / sys 1.71. This is actually reasonable for a sys-CPU-bound multi-threaded program.
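
For reference, a minimal sketch of the my_yield() variants in play (the actual LTP mallocstress.c may differ in detail; USE_PTHREAD_YIELD is a hypothetical switch standing in for the patch described above):

#define _GNU_SOURCE
#include <pthread.h>
#include <time.h>
#include <unistd.h>

void my_yield(void)
{
#if defined(USE_PTHREAD_YIELD)
    /* patched variant: stay runnable, just let another thread run */
    pthread_yield();
#elif defined(SPEW_SIGNALS)
    /* sleep-based yield, interruptible by the signals the test sends */
    usleep(0);
#else
    /* uninterruptible sleep-based yield */
    struct timespec ts = { 0, 0 };
    nanosleep(&ts, NULL);
#endif
}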

Comment 20 Laszlo Ersek 2011-02-09 13:37:45 UTC
Created attachment 477807 [details]
x86_64 bare metal opreport, with my_yield() calling usleep(0)

Top of the report is

CPU: Intel Core/i7, speed 1596 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        app name                 symbol name
169431   25.0721  vmlinux                  .text.acpi_processor_idle_simple
95385    14.1149  vmlinux                  mwait_idle_with_hints
55694     8.2415  vmlinux                  unmap_vmas
52499     7.7687  vmlinux                  .text.acpi_safe_halt
22182     3.2824  vmlinux                  schedule
10706     1.5843  vmlinux                  clear_page
10209     1.5107  vmlinux                  free_pgd_range

If we put aside the sleeping incurred by usleep(0), then the top CPU consumers are unmap_vmas(), clear_page() and free_pgd_range().

Comment 21 Laszlo Ersek 2011-02-09 13:45:07 UTC
Created attachment 477810 [details]
x86_64 xen dom0 opreport, with my_yield() calling usleep(0)

The top of the report is:

CPU: Intel Core/i7, speed 1596 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        app name                 symbol name
131444    4.8761  vmlinux                  unmap_vmas
95945     3.5592  vmlinux                  .text.lock.spinlock
81181     3.0115  xen-syms-2.6.18-242.el5.debug alloc_page_type
64907     2.4078  xen-syms-2.6.18-242.el5.debug get_page_from_l1e

Now "unmap_vmas()" is the clear winner. (It also ranked high on the bare metal.) For comparison, this is the bare metal hit count:

55694     8.2415  vmlinux                  unmap_vmas

It seems that unmap_vmas() spends, under xen, a lot more CPU cycles. This should include the cycles spent by the hypervisor. I believe we should look at the ia64 hv code that is called by unmap_vmas(), or otherwise present near the top of this opreport: alloc_page_type(), get_page_from_l1e(), free_page_type(), compat_failsafe_callback(), and put_page_from_l1e().

Comment 22 Laszlo Ersek 2011-02-09 13:53:52 UTC
The results described in comment 20 and comment 21 were gathered like this:

- installed the following packages from brew:
  - kernel-debuginfo-common-2.6.18-242.el5.x86_64.rpm
  - kernel-debuginfo-2.6.18-242.el5.x86_64.rpm
  - kernel-xen-debuginfo-2.6.18-242.el5.x86_64.rpm

Then, for each test:

opcontrol --init
opcontrol --vmlinux=/usr/lib/debug/lib/modules/$(uname -r)/vmlinux

#Under xen only:
opcontrol --xen=/usr/lib/debug/boot/xen-syms-2.6.18-242.el5.debug 

opcontrol --start --verbose

for ((K=0; K<20; K++)); do
  echo "==== $K ====" >&2
  time -p ./mallocstress
done

opcontrol --shutdown

# Produces report to stdout.
opreport -l

# archive results for later analysis in standalone form
oparchive -o ~/filestore/bz504762/oprofile-`uname -r`

# stash away current sample in samples directory too, emptying "current" sample
opcontrol --save=mallocstress-`uname -r`

opcontrol --deinit

# Make sure the --xen setting doesn't remain in effect for bare metal, messing
# up the report.
rm -rf ~/.oprofile

Comment 23 Paolo Bonzini 2011-02-09 14:57:05 UTC
*** Bug 533872 has been marked as a duplicate of this bug. ***

Comment 24 Laszlo Ersek 2011-02-09 17:36:28 UTC
Paolo pointed out that the bigger the allocated block is, the longer its deallocation takes. This is the reason why the standard deviation is so high in comment 16: the stddev of the memory sizes is big as well.
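
The effect can be seen directly with a stand-alone sketch like the one below (illustrative only, not part of mallocstress): map, touch, and unmap anonymous regions of increasing size and time each munmap(); under the affected dom0 the per-call time should grow with the size of the range.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/time.h>

int main(void)
{
    size_t size;

    for (size = 1UL << 20; size <= 1UL << 28; size <<= 2) {
        struct timeval t0, t1;
        void *p;

        p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return EXIT_FAILURE;
        }
        memset(p, 0, size);             /* fault the pages in */

        gettimeofday(&t0, NULL);
        if (munmap(p, size) == -1) {
            perror("munmap");
            return EXIT_FAILURE;
        }
        gettimeofday(&t1, NULL);

        printf("%8zu KiB unmapped in %ld us\n", size / 1024,
               (t1.tv_sec - t0.tv_sec) * 1000000L +
               (t1.tv_usec - t0.tv_usec));
    }
    return EXIT_SUCCESS;
}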

Paolo also established that the hypervisor does pin the dom0 VCPUs to the PCPUs even on IA64. This is interesting because we figured out that NUMA is the culprit.

Considering the Xen testcase on hp-rx8640-02 (see comment 18), specifying a single repetition of the allocation loop (-l 1):

numthreads    taskset    real  user   sys
         1       none    0.43  0.00  0.04
         2       none    3.95  0.00  3.56
         2       -c 0    0.45  0.00  0.08
         2     -c x,y    <alternating haphazardly between the above two lines>

So if we pin all threads to the same VCPU, then they are bound to run on the same PCPU as well, and bound to have the same NUMA node affinities. If we allow more than one VCPU, then the threads will sometimes use the same NUMA node (and the test will complete fast), and sometimes not (and the test will run slow).

(
xm info on hp-rx8640-02 states:

node_to_cpu            : node0:0-7
                         node1:8-15
                         node2:no cpus
)

The "numa=on" xen cmdline switch seems to have no effect.

Even though VCPU-PCPU pinning is done, we seem unable to reliably pin threads to NUMA nodes. It is as if we could not bind PCPU's to NUMA nodes.

----o----

Running the mallocstress test with the default parameters, but inside "taskset -c 0", the times are real 231.16 / user 4.56 / sys 214.80, which is much slower than the 11 seconds on the bare metal (see comment 18), but still an order of magnitude faster than without "taskset".

I'm not sure what we should do with this bug now. The original report has proved imprecise; the test does complete with a functionally correct result (see comment 6).

If we re-qualify the bug as a performance problem, then that problem can be worked around to a great extent by "taskset -c X" -- however, I'm not sure that wouldn't defeat the original purpose of the test, which tries to attack the memory subsystem from as many sides as possible. Calling my_yield() after each allocation and deallocation apparently tries to ensure that all threads pound the system in lock-step for maximum stress. In that it succeeds on these machines as well, just at a very high perf price under Xen.

Comment 25 Laszlo Ersek 2011-02-09 18:18:43 UTC
Created attachment 477877 [details]
current CPU and memory topology of hp-rx8640-02

Comment 26 Andrew Jones 2011-02-10 10:15:47 UTC
(In reply to comment #24)
> Even though VCPU-PCPU pinning is done, we seem unable to reliably pin threads
> to NUMA nodes. It is as if we could not bind PCPU's to NUMA nodes.

vcpu -> pcpu pinning is one thing, and it works on Xen (i.e. you are guaranteed to use the pcpu you pin to). However, the memory used by those pcpus has no such guarantee. So your comment here is exactly right. You can't "bind pcpus to NUMA nodes" in the sense that you can't guarantee that even with a pinned vcpu->pcpu that you'll get memory associated with the node of that pcpu. Or written more succinctly

vcpu -> pcpu // with pinning, which is the default for dom0 is guaranteed

mem = mem_from_node[node_of[pcpu]] // not guaranteed by Xen, and probably on average it's not even likely (the Xen allocator isn't numa-smart)


To further analyse whether the perf issue here is due to numa, I suggest you compare at least one of these things.

a) compare x86_64 xen perf results from a non-numa machine with one that is numa (both with and without numa=on)

b) restrict the ia64 numa machine to use a single (the first) node when running the test for both bare-metal and xen, and then compare their results. This requires restricting the memory the kernel (bare-metal) or HV (xen) can see to fit within the first node, via the mem=<mem> kernel/HV command line option. Also use taskset and vcpu-set to ensure that the processors selected are from the first node.

=============================================================================

Above is all I can think of right now related to the NUMA path this bug is going down. However, another path exists. Recall from comment 1 that this bug was considered a regression at one point. That means something changed: the kernel, the HV, the xen userspace, or the test code. We don't know which, but if something changed and it appeared to regress, then we should be able to bisect it.

Comment 27 Laszlo Ersek 2011-02-10 14:40:01 UTC
(In reply to comment #26)

> =============================================================================
>
> Above is all I can think of right now related to the NUMA path this bug is
> going down.

I'll certainly follow up on this, thanks for the great hints.

> However, another path exists. Recall from comment 1 that this bug was
> considered as a regression at one point. That means something changed; the
> kernel, the HV, the xen userspace, the test code. We don't know what, but if
> something changed and it appeared to regress, then we should be able to
> bisect it.

I repeated Miroslav's test in comment 11, with the same, official RHEL-5.2 xen
kernel (2.6.18-92.el5xen). My results are consistent with those of Mirek. So if
this is a regression at all, it is not from 5.2 to some later version.


(In reply to comment #1)
> FYI. This test is running fine for RHEL5-Server-U2,
>
> kernel-xen-2.6.18-92.1.27.el5

This is a 5.2.z version. I just re-ran the test on this exact same version, and
I got the same ~40 minute runtime. No xen userspace component is installed
at all, because the test runs in dom0 and doesn't need any xen userspace
tools/libs.

[root@pq0-0 ~]# uname -psrn
Linux pq0-0.lab.bos.redhat.com 2.6.18-92.1.27.el5xen ia64

[root@pq0-0 ~]# time -p ./mallocstress
[...]
main(): test passed.
real 2500.81
user 0.52
sys 2423.92

Now we have absolutely no *reproducible* proof that this is a regression.

comment #    | tested kernel-xen version                 | result
-------------+-------------------------------------------+---------------
comment 6    | 2.6.18-92 (RHEL-5.2)                      | takes too long
comment 11   | 2.6.18-92 (RHEL-5.2)                      | takes too long
this comment | 2.6.18-92 (RHEL-5.2)                      | takes too long
-------------+-------------------------------------------+---------------
comment 1    | 2.6.18-92.1.27 (RHEL-5.2.z)               | works
this comment | 2.6.18-92.1.27 (RHEL-5.2.z)               | takes too long
-------------+-------------------------------------------+---------------
comment 0    | 2.6.128 (? probably 2.6.18-128, RHEL-5.3) | takes too long
-------------+-------------------------------------------+---------------
comment 0    | 2.6.152 (? probably 2.6.18-152)           | takes too long
-------------+-------------------------------------------+---------------
comment 6    | 2.6.18-164 (RHEL-5.4)                     | takes too long
-------------+-------------------------------------------+---------------
comment 16   | 2.6.18-238 (RHEL-5.6)                     | takes too long

I'll continue with the NUMA tips Andrew gave in comment 26, under RHEL-5.6.

Comment 28 Laszlo Ersek 2011-02-10 16:34:42 UTC
Created attachment 478079 [details]
dell-pesc1435-01 topology / comment 26 hint "a"

Comment 29 Laszlo Ersek 2011-02-10 16:35:23 UTC
Created attachment 478080 [details]
hp-bl465cg7-01 topology / comment 26 hint "a"

Comment 30 Laszlo Ersek 2011-02-10 16:36:07 UTC
Created attachment 478081 [details]
hp-dl585g5-01 topology / comment 26 hint "a"

Comment 31 Laszlo Ersek 2011-02-10 16:36:59 UTC
Created attachment 478082 [details]
lacos-workstation topology / comment 26 hint "a"

Comment 32 Laszlo Ersek 2011-02-10 16:38:13 UTC
Created attachment 478083 [details]
make my_yield() call pthread_yield() in mallocstress.c / comment 26 hint "a"

Comment 33 Laszlo Ersek 2011-02-10 16:42:29 UTC
(In reply to comment #26)

> To further analyse the potential that the perf issue here is due to numa,
> then I suggest you compare at least one of these things.
> 
> a) compare x86_64 xen perf results from a non-numa machine with one that is
> numa (both with and without numa=on)

See the above attachments for the CPU and RAM topologies of the examined machines. lacos-workstation is UMA, the other three are NUMA. The mallocstress source was patched to make it indeed CPU bound (see comment 32). The following RHEL-5.6 kernels were installed for bare metal tests:

Linux  dell-pesc1435-01.rhts.eng.bos.redhat.com  2.6.18-238.el5  x86_64
Linux  hp-bl465cg7-01.rhts.eng.bos.redhat.com    2.6.18-238.el5  x86_64
Linux  hp-dl585g5-01.rhts.eng.bos.redhat.com     2.6.18-238.el5  x86_64
Linux  lacos-workstation.usersys.redhat.com      2.6.18-238.el5  x86_64

And for xen dom0 (both numa=on and numa=off, in case of the NUMA machines):

Linux  dell-pesc1435-01.rhts.eng.bos.redhat.com  2.6.18-238.el5xen  x86_64
Linux  hp-bl465cg7-01.rhts.eng.bos.redhat.com    2.6.18-238.el5xen  x86_64
Linux  hp-dl585g5-01.rhts.eng.bos.redhat.com     2.6.18-238.el5xen  x86_64
Linux  lacos-workstation.usersys.redhat.com      2.6.18-238.el5xen  x86_64

Results:

+----------------------------------+---------------+------+---------+
|                                  |               | real |    1.26 |
|                                  | bare metal    | user |    0.14 |
|                                  |               | sys  |    1.70 |
| lacos-workstation (uniform)      +---------------+------+---------+
|                                  |               | real |    3.91 |
|                                  | xen           | user |    0.16 |
|                                  |               | sys  |    0.90 |
+----------------------------------+---------------+------+---------+
|                                  |               | real |    1.98 |
|                                  | bare metal    | user |    0.29 |
|                                  |               | sys  |    2.31 |
|                                  +---------------+------+---------+
|                                  |               | real |    5.62 |
| hp-dl585g5-01 (numa nodes: 4)    | xen, numa=on  | user |    0.18 |
|                                  |               | sys  |    1.98 |
|                                  +---------------+------+---------+
|                                  |               | real |    5.47 |
|                                  | xen, numa=off | user |    0.18 |
|                                  |               | sys  |    1.53 |
+----------------------------------+---------------+------+---------+
|                                  |               | real |    8.29 |
|                                  | bare metal    | user |    1.34 |
|                                  |               | sys  |    9.89 |
|                                  +---------------+------+---------+
|                                  |               | real |   19.90 |
| hp-bl465cg7-01 (numa nodes: 2)   | xen, numa=on  | user |    0.59 |
|                                  |               | sys  |    5.80 |
|                                  +---------------+------+---------+
|                                  |               | real |   20.25 |
|                                  | xen, numa=off | user |    0.49 |
|                                  |               | sys  |    4.64 |
+----------------------------------+---------------+------+---------+
|                                  |               | real |    1.95 |
|                                  | bare metal    | user |    0.29 |
|                                  |               | sys  |    2.34 |
|                                  +---------------+------+---------+
|                                  |               | real |    4.10 |
| dell-pesc1435-01 (numa nodes: 2) | xen, numa=on  | user |    0.23 |
|                                  |               | sys  |    2.18 |
|                                  +---------------+------+---------+
|                                  |               | real |    4.27 |
|                                  | xen, numa=off | user |    0.16 |
|                                  |               | sys  |    1.39 |
+----------------------------------+---------------+------+---------+

Analysis later.

Comment 34 Laszlo Ersek 2011-02-10 17:09:54 UTC
(In reply to comment #33)

> Analysis later.

- On the NUMA machines, under xen, numa=off vs. numa=on doesn't seem to make a
  significant difference. Therefore I won't distinguish them from each other
  below; I'll commonly refer to them as "xen".

- The bare metal case is always CPU bound (real < sys).

- The bare metal -> xen transition multiplies the wall clock time by about
  2.19 to 3.1. It also reduces the user and sys times.

- The xen cases all seem to waste wall clock time without using the CPU
  (real > user + sys).

The above might be explained by the HV taking over some responsibilities from the kernel, spending more CPU time on them, all the while the dom0 kernel can't track that CPU time.

The test shows that Xen incurs a performance penalty on x86_64, but nowhere near as grave as on ia64. Furthermore, the nondeterministic behavior seen on ia64 can't be reproduced on x86_64. (I've run the test multiple times.)

The test is inconclusive. Either NUMA is not a significant factor in general, or NUMA under x86_64 vs. ia64 cannot be compared.

On to comment 26 hint "b".

Comment 35 Laszlo Ersek 2011-02-10 17:15:24 UTC
Created attachment 478089 [details]
pq0-0 topology / comment 26 hint "b"

Comment 36 Laszlo Ersek 2011-02-11 06:29:15 UTC
(In reply to comment #26)

> b) restrict the ia64 numa machine to use a single (the first) node when running
> the test for both bare-metal and xen, and then compare their results. This
> requires restricting the memory the kernel (bare-metal) or HV (xen) can see to
> fit within the first node, the mem=<mem> kernel/HV command line option. Also
> use taskset and vcpu-set to ensure the processors selected are from the first
> node.

Repeating the test in comment 16 under these circumstances. Bare metal:

----v----
[root@pq0-0 ~]# uname -psrn
Linux pq0-0.lab.bos.redhat.com 2.6.18-238.el5 ia64

[root@pq0-0 ~]# free -m
             total       used       free     shared    buffers     cached
Mem:          6993        685       6308          0         47        292

[root@pq0-0 ~]# cat /proc/cmdline 
BOOT_IMAGE=scsi0:EFI\redhat\vmlinuz-2.6.18-238.el5 root=/dev/VolGroup00/LogVol00  console=ttyS0,19200 rhgb quiet mem=7G ro

[root@pq0-0 ~]# time -p ./mallocstress 
[...]
main(): test passed.
real 9.49
user 0.01
sys 0.51
----^----

Xen dom0 "setup":

----v----
(XEN) Xen command line: BOOT_IMAGE=scsi0:EFI\redhat\xen.gz-2.6.18-238.el5  mem=7G 

[root@pq0-0 ~]# uname -psrn
Linux pq0-0.lab.bos.redhat.com 2.6.18-238.el5xen ia64

[root@pq0-0 ~]# free -m
             total       used       free     shared    buffers     cached
Mem:          4096        956       3139          0         52        420

[root@pq0-0 ~]# cat /proc/cmdline 
root=/dev/VolGroup00/LogVol00 console=ttyS0,19200 rhgb quiet ro

[root@pq0-0 ~]# xm info
node_to_cpu            : node0:0-7
                         node1:8-15

(XEN) VCPU information and callbacks for domain 0:
(XEN)     VCPU0: CPU0 [has=F] flags=1 upcall_pend = 00, upcall_mask = 00 dirty_cpus={} cpu_affinity={0}
(XEN)     VCPU1: CPU1 [has=F] flags=1 upcall_pend = 00, upcall_mask = 00 dirty_cpus={} cpu_affinity={1}
(XEN)     VCPU2: CPU2 [has=F] flags=1 upcall_pend = 00, upcall_mask = 00 dirty_cpus={} cpu_affinity={2}
(XEN)     VCPU3: CPU3 [has=F] flags=1 upcall_pend = 00, upcall_mask = 00 dirty_cpus={} cpu_affinity={3}

[root@pq0-0 ~]# xm vcpu-list
Name                              ID VCPUs   CPU State   Time(s) CPU Affinity
Domain-0                           0     0     0   r--      30.3 0
Domain-0                           0     1     1   -b-      10.9 1
Domain-0                           0     2     2   -b-      17.8 2
Domain-0                           0     3     3   -b-      10.1 3
----^----

Xen dom results (repeated multiple times):

----v----
[root@pq0-0 ~]# time -p taskset -c 0 ./mallocstress -t 2 -l 10
[...]
main(): test passed.
real 4.20
user 0.00
sys 0.50

[root@pq0-0 ~]# time -p taskset -c 0,1 ./mallocstress -t 2 -l 10
[...]
main(): test passed.
real 35.56
user 0.00
sys 32.08
----^----

As soon as we allow the threads to run on separate VCPUs (corresponding separate PCPUs), even though we forced the use of the first NUMA node, the sys CPU time blows up. NUMA seems to be out of the picture.

In the next test, I'll try to pin all VCPUs to a single PCPU. The next culprit candidate is TLB flushes (also mentioned by Paolo earlier).

Lock contention or some other contention is unlikely. I compiled the mallocstress test earlier in debug mode, which prints a message for each alloc/dealloc. With two threads, one of the threads finishes much earlier. Even after that, unmappings by the final worker thread take a very long time, if the range being unmapped is big.

Comment 37 Laszlo Ersek 2011-02-11 07:33:28 UTC
Rebooted pq0-0 with "dom0_vcpus_pin=0" as the sole Xen command line parameter. In dom0, pinned all VCPUs to PCPU 1:

----v----
[root@pq0-0 ~]# xm vcpu-pin 0 all 1

Name                              ID VCPUs   CPU State   Time(s) CPU Affinity
Domain-0                           0     0     1   -b-      39.0 1
Domain-0                           0     1     1   r--      21.8 1
Domain-0                           0     2     1   -b-      20.4 1
Domain-0                           0     3     1   -b-      14.3 1

(XEN) VCPU information and callbacks for domain 0:
(XEN)     VCPU0: CPU1 [has=F] flags=1 upcall_pend = 00, upcall_mask = 00
          dirty_cpus={} cpu_affinity={1}
(XEN)     VCPU1: CPU1 [has=F] flags=1 upcall_pend = 00, upcall_mask = 00
          dirty_cpus={} cpu_affinity={1}
(XEN)     VCPU2: CPU1 [has=F] flags=1 upcall_pend = 00, upcall_mask = 00
          dirty_cpus={} cpu_affinity={1}
(XEN)     VCPU3: CPU1 [has=F] flags=1 upcall_pend = 00, upcall_mask = 00
          dirty_cpus={} cpu_affinity={1}
----^----

I have run several instances of the following command:

----v----
[root@pq0-0 ~]# time -p taskset -c X,Y ./mallocstress -t 2 -l 10
----^----

Many times, the times were:

----v----
real 4.16
user 0.00
sys 0.40
----^----

(This is also returned if, instead of "-c X,Y", I only pass "-c X".) However, many times the results looked like:

real 11.95        real 28.76        real 28.62        real 24.94
user  0.00        user  0.00        user  0.00        user  0.00
sys   5.09        sys  15.82         sys 14.78        sys  12.55

With "-c X", the results are consistent (always 4.1 / 0 / 0). With "-c X,Y", the times vary non-deterministically. For example, running "-c 0,1" five times in a row returned 4.2 / 0 / 0 thrice; then 16 / 0 / 7.1 once, then finally
37.3 / 0 / 20.3.

All VCPUs are pinned to PCPU 1.

I have absolutely no idea what's going on.

Comment 38 Laszlo Ersek 2011-02-11 07:38:18 UTC
An interesting tidbit is that in the "slow" cases, the wall clock time is approximately twice the sys time.

Comment 39 Andrew Jones 2011-02-11 07:43:37 UTC
(In reply to comment #37)
> With "-c X", the results are consistent (always 4.1 / 0 / 0). With "-c X,Y",
> the times vary non-deterministically. For example, running "-c 0,1" five times
> in a row returned 4.2 / 0 / 0 thrice; then 16 / 0 / 7.1 once, then finally
> 37.3 / 0 / 20.3.
> 
> All VCPUs are pinned to PCPU 1.

This was a good test. AFAICT, it shows we should concentrate on the vcpu context switch penalty; TLB flushing, etc.

Comment 40 Laszlo Ersek 2011-02-11 10:41:37 UTC
(In reply to comment #37)

> With "-c X,Y", the times vary non-deterministically.

I rebooted xen with "dom0_max_vcpus=16 --", and that seems to have made the "-c X,Y" times deterministic: now there is an almost invariable, "negligible" factor of about 11 from '-c X' to '-c X,Y'. (Test was "mallocstress -t 2 -l 10".)

real 4.20                  real 46.84
user 0.00        vs        user  0.00
sys  0.51                  sys  43.15

Enabling all PCPUs (and quadrupling the number of VCPUs) made the performance worse, but it also stabilized it.

VCPUs are pinned, and it does not seem to have an effect if X and Y share or don't share a NUMA node.

Comment 41 Paolo Bonzini 2011-02-11 14:45:23 UTC
Created attachment 478250 [details]
systemtap script

The attached systemtap script outputs backtraces of VCPUs running do_munmap every 10 ms spent there.

The output looks like this (I'll attach the complete output shortly):

CPU#4 | thread 10011 (mallocstress) in xen_ptcga after 149 ms in do_munmap
 0xa00000010006a2d0 : xen_ptcga+0x10/0x20 [kernel]
 0xa000000100068b70 : ia64_global_tlb_purge+0x5b0/0xa80 [kernel]
 0xa000000100068530 : flush_tlb_range+0x1b0/0x240 [kernel]
 0xa000000100154500 : unmap_region+0x380/0x480 [kernel]
 0xa000000100157080 : do_munmap+0x3e0/0x520 [kernel]
 0xa000000100066200 : kretprobe_trampoline+0x0/0x20 [kernel]
 0xa000000100068b70 : ia64_global_tlb_purge+0x5b0/0xa80 [kernel]
 ... still in xen_ptcga after 297 ms
 ... still in xen_ptcga after 445 ms
 ... still in xen_ptcga after 593 ms
 ... still in xen_ptcga after 742 ms
CPU#4 | thread 10011 (mallocstress) in do_munmap for 743 ms

CPU#10 | thread 10010 (mallocstress) in xen_ptcga after 38 ms in do_munmap
 0xa00000010006a2d0 : xen_ptcga+0x10/0x20 [kernel]
 0xa000000100068b70 : ia64_global_tlb_purge+0x5b0/0xa80 [kernel]
 0xa000000100068530 : flush_tlb_range+0x1b0/0x240 [kernel]
 0xa000000100154500 : unmap_region+0x380/0x480 [kernel]
 0xa000000100157080 : do_munmap+0x3e0/0x520 [kernel]
 0xa000000100066200 : kretprobe_trampoline+0x0/0x20 [kernel]
 0xa000000100068b70 : ia64_global_tlb_purge+0x5b0/0xa80 [kernel]
 ... still in xen_ptcga after 75 ms
 ... still in xen_ptcga after 113 ms
CPU#10 | thread 10010 (mallocstress) in do_munmap for 113 ms

CPU#10 | thread 10010 (mallocstress) in xen_ptcga after 150 ms in do_munmap
 0xa00000010006a2d0 : xen_ptcga+0x10/0x20 [kernel]
 0xa000000100068b70 : ia64_global_tlb_purge+0x5b0/0xa80 [kernel]
 0xa000000100068530 : flush_tlb_range+0x1b0/0x240 [kernel]
 0xa000000100154500 : unmap_region+0x380/0x480 [kernel]
 0xa000000100157080 : do_munmap+0x3e0/0x520 [kernel]
 0xa000000100066200 : kretprobe_trampoline+0x0/0x20 [kernel]
 0xa000000100068b70 : ia64_global_tlb_purge+0x5b0/0xa80 [kernel]
  ... still in xen_ptcga after 299 ms
  ... still in xen_ptcga after 449 ms
CPU#10 | thread 10010 (mallocstress) in do_munmap for 449 ms


Note how the samples are very rare (even 150 ms).  The script should generate an output every 10 ms more or less (and it does if you make it print something on every timer.profile event).  This means that the time is indeed spent in the hypervisor (as hinted in comment 21 and comment 34) in the hypercall that executes the ptc.ga (global TLB purge) operation.

The implementation of the hypercall is in hyper_ptc_ga (arch/ia64/xen/hyperprivop.S), but it actually goes straight to dispatch_break_fault and from there priv_ptc_ga -> vcpu_ptc_ga -> domain_flush_vtlb_range.

In turn, domain_flush_vtlb_range calls ia64_global_tlb_purge at the end (which is the same code as bare-metal).  First, however, Xen has to flush the "Virtual Hash Page Table", a large extension of TLB that resides in memory and can be 
automatically searched by the processor.  http://www.mail-archive.com/xen-ia64-devel@lists.xensource.com/msg07611.html points to the VHPT as a culprit for performance degradation when the OS performs frequent TLB flushing.
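
In pseudo-C, the structure described above would look roughly like the following (a conceptual sketch only, not the actual Xen source; the *_sketch names and stub bodies are made up):

/* Before the bare-metal style ptc.ga, the hypervisor walks the flushed
 * range and purges the matching in-memory VHPT entries, per VCPU when
 * per-VCPU VHPTs are in use.  That O(vcpus * pages) walk is a plausible
 * reason why unmapping a large range costs so much more under Xen than
 * on bare metal, and why adding VCPUs makes it worse. */

#define PAGE_SHIFT 14                   /* 16 KB pages, a common ia64 setup */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define MAX_VCPUS  16

struct vcpu   { int id; /* per-VCPU VHPT would hang off here */ };
struct domain { struct vcpu vcpu[MAX_VCPUS]; unsigned nr_vcpus; };

/* stand-ins for the real primitives named above */
static void vhpt_purge_entry_sketch(struct vcpu *v, unsigned long addr)
{
    (void)v; (void)addr;                /* invalidate the VHPT slot for addr */
}
static void ia64_global_tlb_purge_sketch(unsigned long start, unsigned long end)
{
    (void)start; (void)end;             /* the "same code as bare-metal" part */
}

void domain_flush_vtlb_range_sketch(struct domain *d,
                                    unsigned long start, unsigned long end)
{
    unsigned i;
    unsigned long addr;

    /* flush the in-memory VHPT for the range, per VCPU */
    for (i = 0; i < d->nr_vcpus; i++)
        for (addr = start; addr < end; addr += PAGE_SIZE)
            vhpt_purge_entry_sketch(&d->vcpu[i], addr);

    /* only then issue the hardware TLB purge (ptc.ga) */
    ia64_global_tlb_purge_sketch(start, end);
}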

The next step is to understand the reason for the different treatment of the VHPT in Linux and Xen.

Comment 42 Paolo Bonzini 2011-02-11 14:46:32 UTC
Created attachment 478251 [details]
systemtap output

Comment 43 Laszlo Ersek 2011-02-11 21:34:44 UTC
(In reply to comment #41)
> This means that the time is indeed spent in the
> hypervisor (as hinted in comment 21 and comment 34) in the hypercall that
> executes the ptc.ga (global TLB purge) operation.
> 
> The implementation of the hypercall is in hyper_ptc_ga
> (arch/ia64/xen/hyperprivop.S), but it actually goes straight to
> dispatch_break_fault and from there priv_ptc_ga -> vcpu_ptc_ga ->
> domain_flush_vtlb_range.

Fantastic!

Actually I followed the kernel source from unmap_vmas() (following up on comment 20):

mm/memory.c                    unmap_vmas
include/asm-ia64/tlb.h         tlb_finish_mmu
include/asm-ia64/tlb.h         ia64_tlb_flush_mmu
arch/ia64/mm/tlb.c             flush_tlb_range
include/asm-ia64/machvec.h     platform_global_tlb_purge
arch/ia64/mm/tlb.c             ia64_global_tlb_purge
include/asm-ia64/tlbflush.h    flush_tlb_all
arch/ia64/mm/tlb.c             local_flush_tlb_all
include/asm-ia64/xen/privop.h  ia64_ptce
include/asm-ia64/gcc_intrin.h  "ptc.e %0"

... and the "ptc.e" instruction was where I gave up. I should have been more careful, because "include/asm-ia64/xen/privop.h" has a comment for ia64_ptce (and others) -- just five lines below where "ia64_ptcga" is defined as "xen_ptcga" :) -- : 

/* the remainder of these are not performance-sensitive so its
 * OK to not paravirtualize and just take a privop trap and emulate */

So "ptc.e" is trapped. Instead of giving up, I should have continued in the hypervisor, even though now it's clear from your stap results that I was on the wrong path.


> In turn, domain_flush_vtlb_range calls ia64_global_tlb_purge at the end
> (which is the same code as bare-metal).  First, however, Xen has to flush the
> "Virtual Hash Page Table", a large extension of TLB that resides in memory
> and can be  automatically searched by the processor. 
> http://www.mail-archive.com/xen-ia64-devel@lists.xensource.com/msg07611.html
> points to the VHPT as a culprit for performance degradation when the OS
> performs frequent TLB flushing.
> 
> The next step is to understand the reason for the different treatment of the
> VHPT in Linux and Xen.

Probably irrelevant, but ia64_tlb_flush_mmu() flushes the TLB first, and the VHPT second.

    /* flush the address range from the tlb: */
    flush_tlb_range(&vma, start, end);
    /* now flush the virt. page-table area mapping the address range: */
    flush_tlb_range(&vma, ia64_thash(start), ia64_thash(end));

I'm now thinking, perhaps we have the same problem as described in the mailing
list message that you've found, between Xen and bare-metal -- different VHPT sizes. I can see some "dom0_vhpt_size_log2" Xen parameter patches in xen-unstable.

Comment 44 Laszlo Ersek 2011-02-11 22:42:30 UTC
I added "pervcpu_vhpt=0" after "dom0_max_vcpus=16", just to perturb the results. The consistent 49 / 0 / 46 triple promptly worsened to a consistent 62.22 / 0.00 / 58.82. (With "mallocstress -t 2 -l 10", no taskset.) I think this confirms Paolo's analysis; we have to poke at the VHPT.

The size of the dom0 VHPT in 2.6.18-243.el5 Xen appears to be 64K.

include/asm-ia64/vhpt.h:8:
# define      VHPT_SIZE_LOG2                  16      // 64KB

This is supported by the Xen boot logs.

[...]
(XEN) vhpt_init: vhpt paddr=0x4080030000, end=0x408003ffff
[...]

For the time being I was unable to figure out the VHPT size on the bare metal.

Comment 45 Laszlo Ersek 2011-02-13 11:26:19 UTC
Further tests to look at:
- run the test in an ia64 pv domU (thanks to Alex for the hint),
- try it on a non-numa ia64 box: rx26XX/rx36XX/rx46XX/rx66XX (ditto),
- rebuild the HV for pq0-0 with VHPT_SIZE_LOG2 defined to eg. 15.

Comment 46 Laszlo Ersek 2011-02-13 12:50:29 UTC
Changing the VHPT size from 64KB to 32KB doesn't seem to affect the performance. As in comment 40, the test was "mallocstress -t 2 -l 10"; the times are basically identical:

real 47.38
user 0.00
sys 43.86

The hv command line was the same "dom0_max_vcpus=16 loglvl=all guest_loglvl=all".

(XEN) vhpt_init: vhpt paddr=0x201fcfe0000, end=0x201fcfe7fff

Comment 47 Laszlo Ersek 2011-02-13 13:09:34 UTC
Repeated the previous test with 512KB VHPT.

(XEN) vhpt_init: vhpt paddr=0x4080100000, end=0x408017ffff

No change in times despite a factor-of-16 change in VHPT size.

real 48.01
user 0.00
sys 44.45

Comment 48 Laszlo Ersek 2011-02-13 17:21:20 UTC
I have found a possibly relevant paper under [0]: "A Hash-TLB Approach for MMU Virtualization in Xen/IA64". The direct link is [1]. This paper and the related software may very well have been developed later than the code in RHEL5 Xen, but I still have a feeling that the explanation to our problem is staring me in the face from the paper. I can't quite put my finger on it though.

Some quotes, with liberal snipping:

----v----
2.2 The single TLB approach in paravirtualization

Paravirtualization in the Xen/IA-64 has used a single TLB approach based
on the VHPT without collision chain. The VMM under such a design
implements the smallest number of virtual TLBs required by the Itanium
specification. For paravirtualization, the single TLB approach has the
following characteristics:

    [...]

    Performance benefits of the TLB flush-all virtualization: In some
    cases, the OS may use the ptc.e instruction to purge all the TLB
    entries. In the virtualization environment, the VMM similarly needs
    to purge all the TC entries from the virtual TLB. If we can ensure
    that there is only one entry present in the virtual TLB, the
    execution time will be reduced and the VM will get better
    performance.

3 Hash TLB Virtualization based on VHPT

[...]

In the TLB virtualization, the paravirtualization adopts the 8 TR + 1 TC
model. For both the ITLB and DTLB, the model represents the smallest
number of TLBs to ensure the system's forward progress according to the
IA64 specification. Such a model is the single TLB approach mentioned in
the previous sections. [...]

4.3 STPV vs HTPV

We have implemented the hash TLB design for PV virtualization. We
measure the overall performance of the PV VMs under both the STPV and
HTPV approaches. [...] [T]he hash TLB does not improve the performance
of the virtual machines under PV virtualization. [...] [U]nder the
specialized IO mode of the PV guest VM, the number of TLB misses as a
result of IO operations is reduced. Hence, it is not necessary for the
guest VM TLB to track these accesses. Moreover, because the guest OS is
modified, instructions for manipulating the TLB can, after
paravirtualization, use relatively simple logic to insert information
into the system TLB and VHPT. As a result, the complex logic of the hash
TLB virtualization can negatively impact the performance of the
paravirtualization approach. [...]
----^----

[0] http://www.cs.purdue.edu/homes/yau/publications.html
[1] http://www.cs.purdue.edu/homes/yau/papers/smtps2008.pdf

Comment 49 Laszlo Ersek 2011-02-13 20:49:32 UTC
(In reply to comment #45)

> - run the test in an ia64 pv domU (thanks to Alex for the hint),

I created a PV domain with 10 GB RAM and 16 VCPUs, and pinned each VCPU to a separate PCPU. In said domain "./mallocstress -t 2 -l 10" completed reproducibly in

real 3.90
user 0.00
sys 0.27

without any taskset restriction. (Cf. comment 40 -- this largely corresponds to the "taskset -c X" dom0 time.)

The complete mallocstress test (ie. no -t or -l option passed) took

real 6838.64
user 0.72
sys 6619.10

During the test, virt-manager displayed a 12% load for the domain running mallocstress, with occasional bursts up to 20%. Inside the domain, "top" displayed a similar sys load (and nothing else much), and the loadavg was varying between 20 and 40 (mostly near 20), according to "uptime".

The HV command line was the usual "dom0_max_vcpus=16 loglvl=all guest_loglvl=all".

Comment 50 Laszlo Ersek 2011-02-13 21:08:43 UTC
Created a fullhvm guest similar to the PV one in comment 49; 10 GB RAM, 16 VCPUs, each VCPU pinned to a different PCPU. Same HV environment.

[root@localhost ~]# time -p ./mallocstress -t 2 -l 10
[...]
real 67.92
user 0.02
sys 67.00

(Reproducibly.) This resembles the "pervcpu_vhpt=0" dom0 results in comment 44.

[root@localhost ~]# time -p taskset -c 9 ./mallocstress -t 2 -l 10
[...]
real 5.68
user 0.00
sys 4.72

Won't try the full mallocstress as it would probably take ages.

Comment 51 Laszlo Ersek 2011-02-13 23:44:09 UTC
(In reply to comment #45)

> - try it on a non-numa ia64 box: rx26XX/rx36XX/rx46XX/rx66XX (ditto),

**** hp-rx2660-01, bare metal:

[root@hp-rx2660-01 ~]# uname -psrn
Linux hp-rx2660-01.rhts.eng.bos.redhat.com 2.6.18-238.el5 ia64

[root@hp-rx2660-01 ~]# time -p ./mallocstress 
[...]
real 8.98
user 0.02
sys 0.22

**** hp-rx2660-01, dom0:

[root@hp-rx2660-01 ~]# uname -psrn
Linux hp-rx2660-01.rhts.eng.bos.redhat.com 2.6.18-238.el5xen ia64

[root@hp-rx2660-01 ~]# time -p ./mallocstress -t 2 -l 10
[...]
real 9.05
user 0.00
sys 5.77

This was a virtually constant result.

Using "taskset -c X", 3.71 / 0 / 0.17 was a stable set of resulting times.
With "taskset -c X,Y", the times were varying between 3.71 / 0 / 0.17 and the above 9.05 / 0.00 / 5.77.

The full test took 1502.62 / 1.52 / 1436.38.

**** hp-rx2660-03, bare metal:

[root@hp-rx2660-03 ~]# uname -psrn
Linux hp-rx2660-03.rhts.eng.bos.redhat.com 2.6.18-238.el5 ia64

[root@hp-rx2660-03 ~]# time -p ./mallocstress 
[...]
real 9.78
user 0.00
sys 0.06

**** hp-rx2660-03, dom0:

[root@hp-rx2660-03 ~]# uname -psrn
Linux hp-rx2660-03.rhts.eng.bos.redhat.com 2.6.18-238.el5xen ia64

[root@hp-rx2660-03 ~]# time -p ./mallocstress -t 2 -l 10
[...]
real 22.61
user 0.00
sys 19.06

This result was practically constant as well. Trying "taskset -c X,Y" several times didn't change these numbers, any time.

[root@hp-rx2660-03 ~]# time -p taskset -c 0 ./mallocstress -t 2 -l 10
[...]
real 4.57
user 0.00
sys 0.79

(Reproducibly.)

The full test took 4202.80 / 0.62 / 4116.68.

As these boxen have four PCPUs (topologies to be attached soon), nothing was passed to the HV except loglevel switches. VCPUs are pinned for dom0 by default.

Comment 52 Laszlo Ersek 2011-02-13 23:44:58 UTC
Created attachment 478527 [details]
hp-rx2660-01 topology

Comment 53 Laszlo Ersek 2011-02-13 23:45:37 UTC
Created attachment 478528 [details]
hp-rx2660-03 topology

Comment 56 Caspar Zhang 2011-02-15 06:45:03 UTC
Latest git version of LTP results on hp-rx2660-04.rhts.eng.bos.redhat.com:

RHEL5.6
=======

kernel:
real 0m8.968s
user 0m0.002s
sys 0m0.021s

kernel-xen
real 18m49.002s
user 0m0.728s
sys 18m8.408s

RHEL5.5
=======

kernel:
real 0m8.975s
user 0m0.002s
sys 0m0.031s

kernel-xen:
real 18m51.607s
user 0m0.672s
sys 18m11.188s

RHEL5.4
=======

kernel:
real 0m8.970s
user 0m0.002s
sys 0m0.032s

kernel-xen:
real 18m46.917s
user 0m0.956s
sys 18m14.324s