Description of problem: Currently the vsyscall implementation for the 64-bit Xen kernel is turned off. This makes certain system calls (gettimeofday, in particular) much slower. Consider a benchmark like the following:

    #include <sys/time.h>

    int main(int argc, char **argv)
    {
        int i;
        struct timeval tv;
        for (i = 0; i < 90000000; i++)
            gettimeofday(&tv, NULL);
        return 0;
    }

Running with a bare-metal kernel takes about 14s, but under Xen it takes 2m21s. If I go into arch/x86_64/kernel/vsyscall-xen.c and remove the "sysctl_vsyscall = 0" in vsyscall_init(), the above benchmark drops to 30s (still not as good as bare-metal, but a big improvement). Additionally, it looks like the upstream Linux Xen pv_ops implementation has this enabled. The only caveat is that there is a slight possibility it is unsafe to do this under Xen; I'll have to ask around and find out.
Is there any update on this issue? I note its title references RHEL 5.3, but elsewhere you reference 5.4. Which version can we expect to see it fixed in?
Very unlikely to make 5.3 as it is not in the current proposed beta kernel spin, and this would be something that would need a full beta cycle. Possible for 5.4.
Yes, to reiterate, this can't make 5.3. Besides the fact that this needs a lot of testing, there are a couple of problems I've found:

1. We have some code to handle corner cases of time going backwards. Unfortunately, that code is not vsyscall-friendly, so we would have to find another way to fix that.
2. Upstream Xen hasn't enabled this because vxtime is not being updated properly, which means vsyscall wouldn't work. So we would need to code this up for upstream, get it accepted there, and then get it into RHEL.

So there is quite a bit of work to get this working.

Chris Lalancette
Okay. Thanks to both of you for the update.
I'm not Chris, but I don't think anything has changed from the situation in comment #3.
After looking into this further, there is a fundamental problem with vsyscall in a virtualized environment: vsyscall assumes that virtual CPUs never migrate across multiple physical CPUs. There is code to deal with this in both the upstream hypervisor (but it was buggy, so it is currently disabled even there) and the upstream pvops kernel (ported up to 2.6.31.x). I suggest we change our course of action a bit:

1) First, fix the upstream hypervisor's implementation of VCPUOP_register_runstate_memory_area and get the vsyscalls to work with upstream pvops. For the original patch, see changeset 20339. For the bug, see http://permalink.gmane.org/gmane.comp.emulators.xen.devel/79038.

2) Backport VCPUOP_register_runstate_memory_area to the RHEL5 hypervisor, and vsyscall support to RHEL6.

3) Finally, backport vsyscall support to RHEL5. This is complicated because it probably means using the pvclock infrastructure instead of what is now in time-xen.c.
Here are some benchmark results for 5,000,000 calls using different clocksources:

    syscall:    real 0m15.038s   user 0m3.109s   sys 0m11.722s
    rdtsc:      real 0m0.220s    user 0m0.208s   sys 0m0.030s
    hpet:       real 0m3.344s    user 0m3.339s   sys 0m0.004s
    pvrdtscp:   real 0m2.760s    user 0m2.737s   sys 0m0.004s

All measurements were taken on an F15 machine. Measurements for RHEL5 were consistent with the above (except pvrdtscp was not available for RHEL5). vsyscall speed would be roughly 0.2s slower than rdtsc/hpet/pvrdtscp due to the overhead of scaling. In order to support migration and save/restore, the only possible clocksource is of course rdtscp. The speedup would then be roughly a factor of 5 compared to current syscall performance, rather than the 10-50x it would be for TSC. Based on this data, unless there is a compelling application executing so many gettimeofday (or clock_gettime) syscalls, I do not believe it is worth micro-optimizing Xen's performance at this stage.
Test program (HPET access would segfault without clocksource=hpet):

    #include <sys/time.h>

    static inline void pv_cpuid(unsigned idx, unsigned sub,
                                unsigned *eax, unsigned *ebx,
                                unsigned *ecx, unsigned *edx)
    {
        *eax = idx, *ecx = sub;
        asm volatile ( "ud2a ; .ascii \"xen\"; cpuid"
                       : "=a" (*eax), "=b" (*ebx), "=c" (*ecx), "=d" (*edx)
                       : "0" (*eax), "2" (*ecx));
    }

    static inline unsigned long long do_rdtscp(unsigned *aux)
    {
        unsigned lo32, hi32;

        /* rdtscp opcode */
        asm volatile(".byte 0x0f,0x01,0xf9"
                     : "=a" (lo32), "=d" (hi32), "=c" (*aux));
        return lo32 | ((unsigned long long)hi32 << 32);
    }

    int main()
    {
        int i;
        struct timeval tv;

        for (i = 0; i < 5000000; i++) {
    #if 0
            asm ("rdtsc" : : : "rax", "rdx");
    #elif 1
            /* direct HPET read (see note above) */
            asm ("mov 0xffffffffff5ff0f0,%%eax" : : : "rax");
    #elif 0
            unsigned aux;
            do_rdtscp(&aux);
    #elif 0
            unsigned eax, ebx, ecx, edx;
            pv_cpuid(0x40000000, 0, &eax, &ebx, &ecx, &edx);
    #else
            gettimeofday(&tv, 0L);
    #endif
        }
        return 0;
    }
For pvrdtscp to work, the hardware must support the rdtscp instruction and invariant TSC (CPUID.80000001H:EDX[27] and CPUID.80000007H:EDX[8], respectively), as well as a TSC synchronized across all cores, which hopefully always happens on hardware supporting invariant TSC. That means only newer processors (the oldest being Nehalem or K10) can benefit. Otherwise pvrdtscp is emulated, as in Paolo's case, and thus much slower. The timekeeping code has also changed a lot in Xen since RHEL5, so a conservative backport would touch at least 1000 lines.