470304 – el5u3 xenU guest kernel lockup due to mm_unpinned_lock and runqueue spinlock deadlock

Bug 470304 - el5u3 xenU guest kernel lockup due to mm_unpinned_lock and runqueue spinlock deadlock

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel-xen
Sub Component:
Version:	5.3
Hardware:	x86_64
OS:	Linux
Priority:	low
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Larry Woodman
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:
Depends On:	450953
Blocks:

Reported:	2008-11-06 16:35 UTC by Larry Woodman
Modified:	2008-11-25 07:41 UTC (History)
CC List:	8 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-11-25 07:41:21 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Patch that fixes problem by using spin_lock_irqsave() to disable interrupts and prevent recursion. (2.54 KB, patch) 2008-11-06 16:48 UTC, Larry Woodman

Description Larry Woodman 2008-11-06 16:35:17 UTC

+++ This bug was initially created as a clone of Bug #450953 +++

Description of problem:

After running an arbitrary workload involving network traffic for some time (1-2
days), a Xen guest running the 2.6.9-67 x86_64 xenU kernel locks up with both
vCPUs spinning at 100%.

Version-Release number of selected component (if applicable):
kernel-2.6.9-67

How reproducible:
Reproduces after running a test workload (involving network traffic) for 1-2
days, but not consistently.

Steps to Reproduce:
1. Run a test workload involving network traffic
2. Monitor CPU usage of the guest
3. Wait until guest CPU usage goes to 100%

Actual results:
Guest kernel spins on both vCPUs

Expected results:
Guest kernel doesn't spin

Additional info:
The problem is due to a race between the scheduler and network interrupts.  On
one vCPU, the scheduler takes the runqueue spinlock of the other vCPU to
schedule a process, and then attempts to take mm_unpinned_lock.  On the other
vCPU, another process is holding mm_unpinned_lock (because it is starting or
exiting) and is interrupted by a network interrupt.  The network interrupt
handler attempts to wake up the same process that the first vCPU is trying to
schedule, so it tries to take the runqueue spinlock that the first vCPU is
already holding.  Neither vCPU can make progress: the first spins on
mm_unpinned_lock, whose holder cannot resume until the interrupt handler
returns, and the handler spins on the runqueue spinlock.

I was not able to obtain a full kernel stack from the interrupt, but do have
kernel stacks of the tasks on the vCPUs, if needed.
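
The race described above can be pictured as a cycle in a wait-for graph. The following is a minimal user-space sketch, not kernel code: the struct, the enum values, and the functions reported_state() and has_deadlock() are hypothetical names invented for illustration, and the model assumes each vCPU spins on at most one lock at a time.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two locks named in this report. */
enum { NO_LOCK = -1, RUNQUEUE_LOCK = 0, MM_UNPINNED_LOCK = 1, NUM_LOCKS = 2 };
enum { NO_VCPU = -1, NUM_VCPUS = 2 };

struct wait_state {
    int owner[NUM_LOCKS];  /* vCPU currently holding each lock, or NO_VCPU */
    int wants[NUM_VCPUS];  /* lock each vCPU is spinning on, or NO_LOCK */
};

/* State as described in the report: vCPU 0 (scheduler) holds the runqueue
 * lock and wants mm_unpinned_lock; vCPU 1 holds mm_unpinned_lock and its
 * interrupt handler wants the runqueue lock. */
static struct wait_state reported_state(void)
{
    struct wait_state s = {
        .owner = { [RUNQUEUE_LOCK] = 0, [MM_UNPINNED_LOCK] = 1 },
        .wants = { [0] = MM_UNPINNED_LOCK, [1] = RUNQUEUE_LOCK },
    };
    return s;
}

/* Follow "waits for the holder of" edges from start_vcpu; a path that
 * leads back to start_vcpu means that vCPU can never make progress. */
static bool has_deadlock(const struct wait_state *s, int start_vcpu)
{
    assert(start_vcpu >= 0 && start_vcpu < NUM_VCPUS);

    int vcpu = start_vcpu;
    for (int hops = 0; hops < NUM_VCPUS; hops++) {
        int lock = s->wants[vcpu];
        if (lock == NO_LOCK)
            return false;           /* this vCPU is not blocked */
        int holder = s->owner[lock];
        if (holder == NO_VCPU)
            return false;           /* lock is free, spin will end */
        if (holder == start_vcpu)
            return true;            /* cycle closed */
        vcpu = holder;
    }
    return false;
}
```

In this model, has_deadlock() reports a cycle for the state returned by reported_state(). Clearing wants[1], which corresponds to the interrupt handler not running while mm_unpinned_lock is held, breaks the cycle.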

--- Additional comment from herbert.van.den.bergh on 2008-06-11 17:13:09 EDT ---

Created an attachment (id=309000)
fix for mm_unpinned_lock / runq deadlock


--- Additional comment from riel on 2008-07-30 11:17:56 EDT ---

The patch looks good to me.

--- Additional comment from pm-rhel on 2008-09-03 09:17:16 EDT ---

Updating PM score.

--- Additional comment from pm-rhel on 2008-09-17 14:52:15 EDT ---

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 1 Larry Woodman 2008-11-06 16:42:19 UTC


Same problem exists in RHEL5.  Cloned RHEL4 BZ and ported patch to RHEL5.

Larry Woodman

Comment 2 Larry Woodman 2008-11-06 16:48:19 UTC

Created attachment 322756
Patch that fixes problem by using spin_lock_irqsave() to disable interrupts and prevent recursion.
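
The attachment itself is not reproduced in this report. As a rough illustration of why disabling interrupts around the lock prevents the recursion, here is a user-space simulation, assuming a single vCPU with one pending-interrupt flag. All sim_* names are hypothetical and only mimic the save/disable/restore pattern of the kernel's spin_lock_irqsave() and spin_unlock_irqrestore(); this is not the patch and not kernel code.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical single-vCPU model: one lock, one interrupt line. */
struct sim_vcpu {
    bool irqs_enabled;
    bool irq_pending;
    bool lock_held;
    int handler_runs;            /* total handler invocations */
    int handler_ran_under_lock;  /* invocations while the lock was held */
};

static void sim_run_handler(struct sim_vcpu *v)
{
    v->handler_runs++;
    if (v->lock_held)
        v->handler_ran_under_lock++;  /* the dangerous case in this bug */
}

/* Deliver an interrupt now if enabled, otherwise leave it pending. */
static void sim_raise_irq(struct sim_vcpu *v)
{
    if (v->irqs_enabled)
        sim_run_handler(v);
    else
        v->irq_pending = true;
}

/* Plain lock/unlock: interrupts stay as they were. */
static void sim_lock(struct sim_vcpu *v)
{
    assert(!v->lock_held);
    v->lock_held = true;
}

static void sim_unlock(struct sim_vcpu *v)
{
    assert(v->lock_held);
    v->lock_held = false;
}

/* Save the interrupt state, disable interrupts, then take the lock. */
static bool sim_lock_irqsave(struct sim_vcpu *v)
{
    bool flags = v->irqs_enabled;
    v->irqs_enabled = false;
    sim_lock(v);
    return flags;
}

/* Drop the lock, restore the saved state, then run any deferred handler. */
static void sim_unlock_irqrestore(struct sim_vcpu *v, bool flags)
{
    sim_unlock(v);
    v->irqs_enabled = flags;
    if (v->irqs_enabled && v->irq_pending) {
        v->irq_pending = false;
        sim_run_handler(v);
    }
}
```

With sim_lock(), an interrupt raised inside the critical section runs the handler while the lock is held. With sim_lock_irqsave(), the same interrupt is deferred until sim_unlock_irqrestore(), so the handler never runs on top of the lock holder.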

Comment 3 Chris Lalancette 2008-11-25 07:41:21 UTC

After upstream discussion, it looks like we probably already have the fix for RHEL-5 in place.  That fix has now been backported to address the RHEL-4 version of this bug.  I'm going to close this as NOTABUG for now; if customers see an issue similar to this, however, we can reopen it later.

Chris Lalancette
