Description of problem:
The system deadlocks when virsh schedinfo and xentop are run together continuously.

Version-Release number of selected component (if applicable):
RHEL 5.2 with Xen

How reproducible:
Run virsh schedinfo and xentop continuously in dom0.

Steps to Reproduce:
1. xentop -b -d 0.1
2. while [ true ]; do virsh schedinfo > /dev/null; done

Actual results:
The system deadlocks.

Expected results:
No deadlock.

Additional info:
I attached stack traces taken with the crash program. Based on the stack traces, the deadlock is between vcpu#0, running libvirtd, and vcpu#3, running xentop.

vcpu#0 is processing XEN_DOMCTL_scheduler_op in domctl.c, which calls sched_adjust(). sched_adjust() calls vcpu_pause(v) for each vcpu in the domain, and vcpu_pause(v) calls vcpu_sleep_sync(v), where it waits for vcpu#3 to pause.

vcpu#3, on the other hand, is executing vcpu_runstate_get() in schedule.c, called from XEN_SYSCTL_getdomaininfolist in sysctl.c. At the time of the deadlock, the exception RIP of vcpu#3 somehow points to [compat_failsafe_callback+95], where it tries to take domctl_lock (cmpb $0x0,87987(%rip) # 0xffff828c8019ef00 <domctl_lock.10183>). But vcpu#0 took domctl_lock when it entered do_domctl(), so the two vcpus are now deadlocked.
crash> bt -a
PCPU: 3 VCPU: ffff8300cf9e2080
 #0 [ffff8300cea08f20] crash_nmi_callback at ffff828c80145c93
 #1 [ffff8300cea08f30] do_nmi at ffff828c8013aed9
 #2 [ffff8300cea08f50] handle_ist_exception at ffff828c8017f6f7
    [exception RIP: compat_failsafe_callback+95]
    RIP: ffff828c8018974f RSP: ffff8300cea0fd08 RFLAGS: 00000286
    RAX: 0000000000000000 RBX: fffffffffffffff3 RCX: 0000000000000000

crash> dis compat_failsafe_callback
0xffff828c80189746 <compat_failsafe_callback+86>: cmpb $0x0,87987(%rip) # 0xffff828c8019ef00 <domctl_lock.10183>
0xffff828c8018974d <compat_failsafe_callback+93>: repz nop
0xffff828c8018974f <compat_failsafe_callback+95>: jle 0xffff828c80189746 <compat_failsafe_callback+86>

compat_failsafe_callback is exception fixup code; it should not try to take domctl_lock! This is somehow related to linking. Below is part of xen.lds.S.

  . = __XEN_VIRT_START + 0x100000;
  _start = .;
  _stext = .;                   /* Text and read-only data */
  .text : {
        *(.text)
        *(.fixup)
        *(.gnu.warning)
        } :text = 0x9090
  .text.lock : { *(.text.lock) } :text  /* out-of-line lock text */

The .text.lock section immediately follows the .fixup section. I found nothing in the .gnu.warning section. In the x86_64 source, the only code placed in the .text.lock section is _raw_spin_lock(), defined in xen/include/asm/spinlock.h. vcpu#3 is running exactly the code defined in _raw_spin_lock(). So I believe the exception fixup code (compat_failsafe_callback) does not return correctly and somehow falls through into the code in the .text.lock section, and that is the problem here.
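As an aid to reading the three-instruction loop in the disassembly above (cmpb $0x0 / repz nop / jle), here is a minimal C11 sketch of a byte spinlock with that shape. It is a model under stated assumptions, not the Xen source: the model_* names are hypothetical, the convention "1 = free, <= 0 = held" is assumed, and the real _raw_spin_lock() is inline assembly whose exact text is not reproduced in this report.

```c
#include <stdatomic.h>

/* Hypothetical model of a byte spinlock (assumed: 1 = free, <= 0 = held). */
typedef struct { atomic_schar v; } model_lock_t;

#define MODEL_LOCK_INIT { 1 }

/* Stands in for "repz nop" (the PAUSE hint); a no-op in this portable model. */
static inline void model_cpu_relax(void) { }

static void model_spin_lock(model_lock_t *l)
{
    for (;;) {
        /* Fast path: atomically decrement; if the old value was positive,
         * the lock was free and is now ours. */
        if (atomic_fetch_sub(&l->v, 1) > 0)
            return;
        /* Slow path, the part that would live out of line: keep re-reading
         * the byte (cmpb $0x0), pause (repz nop), and loop while the value
         * is still <= 0 (jle). Only then retry the fast path. */
        while (atomic_load(&l->v) <= 0)
            model_cpu_relax();
    }
}

/* Non-blocking attempt: returns 1 if the lock was taken, 0 if it was held. */
static int model_spin_trylock(model_lock_t *l)
{
    return atomic_exchange(&l->v, 0) > 0;
}

static void model_spin_unlock(model_lock_t *l)
{
    atomic_store(&l->v, 1);
}
```

In this model, a CPU that calls model_spin_lock() on a lock that another CPU never releases stays in the inner while loop forever, which matches a RIP parked on the jle instruction.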
in kernel-2.6.18-151.el5

You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.
An advisory has been issued which should help with the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html