499013 – Deadlock between libvirt and xentop

Summary: Deadlock between libvirt and xentop

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel-xen
Sub Component:
Version:	5.2
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	rc
Target Release:	---
Assignee:	Miroslav Rezanina
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2009-05-04 19:44 UTC by kerdosa
Modified:	2009-09-02 09:00 UTC
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2009-09-02 09:00:33 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2009:1243	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.4 kernel security and bug fix update	2009-09-01 08:53:34 UTC

Description kerdosa 2009-05-04 19:44:16 UTC

Description of problem: The system deadlocks when virsh schedinfo and xentop are run together continuously.


Version-Release number of selected component (if applicable): RHEL 5.2 with XEN


How reproducible: Run virsh schedinfo and xentop continuously in dom0


Steps to Reproduce:
1. xentop -b -d 0.1
2. while [ true ]; do virsh schedinfo > /dev/null; done

Actual results: The system deadlocks.


Expected results: No deadlock.


Additional info: I attached stack traces taken with the crash program. Based on the stack traces, the deadlock is between vcpu#0 running libvirtd and vcpu#3 running xentop.

vcpu#0 is processing XEN_DOMCTL_scheduler_op in domctl.c, which calls sched_adjust(). sched_adjust() calls vcpu_pause(v) for each vcpu in the domain, and vcpu_pause(v) calls vcpu_sleep_sync(v), which waits for vcpu#3 to pause. Meanwhile, vcpu#3 is executing vcpu_runstate_get() in schedule.c, called from XEN_SYSCTL_getdomaininfolist in sysctl.c. At the time of the deadlock, the exception RIP of vcpu#3 somehow points to [compat_failsafe_callback+95], where it is spinning to acquire domctl_lock (cmpb $0x0,87987(%rip)    # 0xffff828c8019ef00 <domctl_lock.10183>). But vcpu#0 took domctl_lock when it entered do_domctl(), so the two vcpus are now deadlocked.
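
The circular wait described above can be written down as a tiny wait-for table. This is only an illustrative sketch, not Xen code: the enum names, the NO_WAIT marker, and in_wait_cycle() are hypothetical, and the table simply encodes the two dependencies from the stack traces (vcpu#0 waits for vcpu#3 to pause, vcpu#3 waits for the domctl_lock that vcpu#0 holds).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the two parties named in the report. */
enum { VCPU0_LIBVIRTD = 0, VCPU3_XENTOP = 1, NPARTIES = 2 };

#define NO_WAIT (-1)

/*
 * waits_for[i] is the index of the party that i is blocked on, or NO_WAIT.
 * Entries other than NO_WAIT are assumed to be valid indices below n.
 * Returns true if following the chain from `start` leads back to `start`.
 */
static bool in_wait_cycle(const int *waits_for, size_t n, size_t start)
{
    int cur = (int)start;
    for (size_t steps = 0; steps <= n; steps++) {
        cur = waits_for[cur];
        if (cur == NO_WAIT)
            return false;          /* chain ends: someone can make progress */
        if ((size_t)cur == start)
            return true;           /* came back to where we began: deadlock */
    }
    return false;                  /* a cycle exists but does not include start */
}

/* The situation in this report, as read from the stack traces. */
static const int reported_waits[NPARTIES] = {
    [VCPU0_LIBVIRTD] = VCPU3_XENTOP,   /* vcpu_sleep_sync() waits for vcpu#3 to pause */
    [VCPU3_XENTOP]   = VCPU0_LIBVIRTD, /* spins on domctl_lock held by vcpu#0 */
};
```

Calling in_wait_cycle(reported_waits, NPARTIES, VCPU0_LIBVIRTD) reports a cycle. Removing either dependency, for example if vcpu#3 never needed domctl_lock, breaks it.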

crash> bt -a
PCPU:  3  VCPU: ffff8300cf9e2080
 #0 [ffff8300cea08f20] crash_nmi_callback at ffff828c80145c93
 #1 [ffff8300cea08f30] do_nmi at ffff828c8013aed9
 #2 [ffff8300cea08f50] handle_ist_exception at ffff828c8017f6f7
    [exception RIP: compat_failsafe_callback+95]
    RIP: ffff828c8018974f  RSP: ffff8300cea0fd08  RFLAGS: 00000286
    RAX: 0000000000000000  RBX: fffffffffffffff3  RCX: 0000000000000000

crash> dis compat_failsafe_callback
0xffff828c80189746 <compat_failsafe_callback+86>: cmpb  $0x0,87987(%rip) # 0xffff828c8019ef00 <domctl_lock.10183>
0xffff828c8018974d <compat_failsafe_callback+93>: repz nop
0xffff828c8018974f <compat_failsafe_callback+95>: jle 0xffff828c80189746 <compat_failsafe_callback+86>
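
For readers unfamiliar with this instruction pattern, the three instructions above form the polling loop of a byte spinlock: compare the lock byte with zero, pause, and loop while it is less than or equal to zero. Below is a rough C11 approximation, not the Xen source. The names byte_lock, byte_lock_acquire and the max_spins bound are hypothetical (the real loop spins without a bound), and the atomic-decrement fast path is an assumption about how such locks are typically entered.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical byte lock: 1 means free, 0 or negative means held. */
typedef struct { atomic_schar v; } byte_lock;

static void byte_lock_init(byte_lock *l)
{
    atomic_init(&l->v, 1);
}

/*
 * Returns true once the lock is taken. Returns false if the lock stayed
 * held for max_spins polls; this bound exists only so the sketch can be
 * exercised without hanging, which is exactly what vcpu#3 could not avoid.
 */
static bool byte_lock_acquire(byte_lock *l, unsigned long max_spins)
{
    for (;;) {
        /* Assumed fast path: atomic decrement; an old value of 1 means it was free. */
        if (atomic_fetch_sub(&l->v, 1) == 1)
            return true;

        /* Slow path, matching "cmpb $0x0,lock; repz nop; jle": poll while <= 0. */
        while (atomic_load(&l->v) <= 0) {
            if (max_spins-- == 0)
                return false;
        }
        /* Lock looked free; go back and retry the decrement. */
    }
}

static void byte_lock_release(byte_lock *l)
{
    atomic_store(&l->v, 1);
}
```

With the lock already held by another party that never releases it, the inner loop is where an unbounded version would sit forever, which is consistent with the RIP captured for vcpu#3.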

compat_failsafe_callback is exception fixup code, so it should not be trying to take domctl_lock! This appears to be related to linking. Below is part of xen.lds.S.

  . = __XEN_VIRT_START + 0x100000;
  _start = .;
  _stext = .;                   /* Text and read-only data */
  .text : {
        *(.text)
        *(.fixup)
        *(.gnu.warning)
        } :text = 0x9090
  .text.lock : { *(.text.lock) } :text  /* out-of-line lock text */

The .text.lock section immediately follows the .fixup section. I found nothing in the .gnu.warning section. In the x86_64 source, only one place emits code into the .text.lock section: _raw_spin_lock() in xen/include/asm/spinlock.h. vcpu#3 is running exactly the code defined in _raw_spin_lock(). So I believe the exception fixup code (compat_failsafe_callback) does not return correctly and somehow falls through into the code in the .text.lock section, and that is the problem here.

Comment 5 Don Zickus 2009-05-28 18:07:53 UTC

Fixed in kernel-2.6.18-151.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team
has sent specific instructions indicating when to do so.  However feel free
to provide a comment indicating that this fix has been verified.

Comment 8 errata-xmlrpc 2009-09-02 09:00:33 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html
