Bug 499013 - Deadlock between libvirt and xentop
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Hardware: x86_64 Linux
Priority: low, Severity: medium
Target Milestone: rc
Target Release: ---
Assigned To: Miroslav Rezanina
QA Contact: Red Hat Kernel QE team
Reported: 2009-05-04 15:44 EDT by kerdosa
Modified: 2009-09-02 05:00 EDT
Doc Type: Bug Fix
Last Closed: 2009-09-02 05:00:33 EDT
Description kerdosa 2009-05-04 15:44:16 EDT
Description of problem: The system deadlocks when virsh schedinfo and xentop are run together continuously.

Version-Release number of selected component (if applicable): RHEL 5.2 with XEN

How reproducible: Run virsh schedinfo and xentop continuously in dom0

Steps to Reproduce:
1. xentop -b -d 0.1
2. while [ true ]; do virsh schedinfo > /dev/null; done
Actual results: The system deadlocks.

Expected results: No deadlock.

Additional info: I attached stack traces taken with the crash program. Based on the stack traces, the deadlock is between vcpu#0, running libvirtd, and vcpu#3, running xentop.

vcpu#0 is processing XEN_DOMCTL_scheduler_op in domctl.c, which calls sched_adjust(). sched_adjust() calls vcpu_pause(v) for each vcpu in the domain, and vcpu_pause(v) calls vcpu_sleep_sync(v), which waits for vcpu#3 to pause. Meanwhile, vcpu#3 is executing vcpu_runstate_get() in schedule.c, called from XEN_SYSCTL_getdomaininfolist in sysctl.c. At the time of the deadlock, vcpu#3's exception RIP somehow points to [compat_failsafe_callback+95], where it is trying to take domctl_lock (cmpb $0x0,87987(%rip)    # 0xffff828c8019ef00 <domctl_lock.10183>). But vcpu#0 took domctl_lock when it entered do_domctl(), so the two vcpus are now deadlocked.
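
The circular wait described above can be pictured as a small wait-for graph. What follows is a minimal Python sketch, not Xen code: the node labels and the find_cycle helper are hypothetical names chosen for illustration, and the two edges only encode what the stack traces suggest (vcpu#0 waits for vcpu#3 to pause; vcpu#3 spins on domctl_lock, which vcpu#0 holds).

```python
def find_cycle(waits_for, start):
    """Follow wait-for edges from start; return the cycle as a list, or None."""
    path = []
    seen = set()
    node = start
    while node is not None and node not in seen:
        seen.add(node)
        path.append(node)
        node = waits_for.get(node)  # None when this node waits for nobody
    if node is None:
        return None
    # node was visited before: the cycle starts at its first occurrence
    return path[path.index(node):] + [node]


# Edges taken from the analysis above (labels are illustrative only):
#   vcpu0 -> vcpu3 : vcpu_sleep_sync() waits for vcpu#3 to pause
#   vcpu3 -> vcpu0 : vcpu#3 spins on domctl_lock, held by vcpu#0
waits_for = {"vcpu0": "vcpu3", "vcpu3": "vcpu0"}

cycle = find_cycle(waits_for, "vcpu0")
print(" -> ".join(cycle) if cycle else "no cycle")
```

If either edge is removed (for example, if vcpu#3 never needed domctl_lock), the helper returns None, which matches the expectation that the two operations would then complete.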

crash> bt -a
PCPU:  3  VCPU: ffff8300cf9e2080
 #0 [ffff8300cea08f20] crash_nmi_callback at ffff828c80145c93
 #1 [ffff8300cea08f30] do_nmi at ffff828c8013aed9
 #2 [ffff8300cea08f50] handle_ist_exception at ffff828c8017f6f7
    [exception RIP: compat_failsafe_callback+95]
    RIP: ffff828c8018974f  RSP: ffff8300cea0fd08  RFLAGS: 00000286
    RAX: 0000000000000000  RBX: fffffffffffffff3  RCX: 0000000000000000

crash> dis compat_failsafe_callback
0xffff828c80189746 <compat_failsafe_callback+86>: cmpb  $0x0,87987(%rip) # 0xffff828c8019ef00 <domctl_lock.10183>
0xffff828c8018974d <compat_failsafe_callback+93>: repz nop
0xffff828c8018974f <compat_failsafe_callback+95>: jle 0xffff828c80189746 <compat_failsafe_callback+86>
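
The address in the disassembler's comment can be checked by hand, since an x86-64 RIP-relative operand is resolved against the address of the next instruction. The short Python sketch below redoes that arithmetic with the values copied from the output above; the helper name rip_relative_target is made up for illustration.

```python
def rip_relative_target(next_insn_addr, displacement):
    """Effective address of a RIP-relative operand, wrapped to 64 bits."""
    return (next_insn_addr + displacement) & 0xFFFFFFFFFFFFFFFF


# Values copied from the crash output above.
NEXT_INSN = 0xFFFF828C8018974D  # compat_failsafe_callback+93, the instruction after cmpb
DISPLACEMENT = 87987            # from "cmpb $0x0,87987(%rip)"

target = rip_relative_target(NEXT_INSN, DISPLACEMENT)
print(hex(target))  # 0xffff828c8019ef00, the address crash labels domctl_lock.10183
```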

compat_failsafe_callback is exception fixup code; it should not try to take domctl_lock! This is somehow related to linking. Below is part of xen.lds.S.

  . = __XEN_VIRT_START + 0x100000;
  _start = .;
  _stext = .;                   /* Text and read-only data */
  .text : {
        *(.text)
        *(.fixup)
        *(.gnu.warning)
        } :text = 0x9090
  .text.lock : { *(.text.lock) } :text  /* out-of-line lock text */

The .text.lock section immediately follows the .fixup section. I found nothing in the .gnu.warning section. In the x86_64 source, the only definition of the .text.lock section is in xen/include/asm/spinlock.h, in _raw_spin_lock(). vcpu#3 is running exactly the code defined in _raw_spin_lock(). So I believe the exception fixup code (compat_failsafe_callback) does not return correctly and somehow falls through into the code in the .text.lock section; that is the problem here.
Comment 5 Don Zickus 2009-05-28 14:07:53 EDT
in kernel-2.6.18-151.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.
Comment 8 errata-xmlrpc 2009-09-02 05:00:33 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

