Bug 361931 - [Stratus 4.7 bug] iounmap may sleep while holding vmlist_lock, causing a deadlock.
Summary: [Stratus 4.7 bug] iounmap may sleep while holding vmlist_lock, causing a dead...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 4
Classification: Red Hat
Component: kernel
Version: 4.6
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Larry Woodman
QA Contact: Martin Jenner
URL:
Whiteboard: GSSApproved
Depends On:
Blocks: 240187 367631 422551 430698 433267
 
Reported: 2007-11-01 14:57 UTC by Peter Martuccelli
Modified: 2018-10-19 20:13 UTC (History)

Fixed In Version: RHSA-2008-0665
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-07-24 19:19:20 UTC


Attachments
Patch to fix this problem (635 bytes, patch)
2007-11-01 19:05 UTC, Larry Woodman


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2008:0665 normal SHIPPED_LIVE Moderate: Updated kernel packages for Red Hat Enterprise Linux 4.7 2008-07-24 16:41:06 UTC

Description Kimball Murray 2007-11-01 14:57:32 UTC
Description of problem:
Drivers and other code that use the iounmap function may deadlock. This happens
because iounmap contains a code path in which a write lock (a spinning lock) is
taken and then, with that lock still held, a semaphore is acquired. Acquiring
the semaphore can sleep, which leads to a sleep-while-holding-spinlock deadlock.
See additional info below:

Version-Release number of selected component (if applicable):
RHEL4.6 (and earlier)

How reproducible:
Moderately

Steps to Reproduce:
1. load/unload a driver that uses ioremap_nocache/iounmap.
2. Put a load on the machine that causes some scheduling activity.
  
Actual results:
After a while, some kernel threads will deadlock.

Expected results:
No deadlock

Additional info:
This problem is fixed in RHEL5 by dropping the writelock before calling the code
that can sleep.  Here is the RHEL4 iounmap function:

void iounmap(void __iomem *addr)
{
        struct vm_struct *p, **pprev;

        if (addr <= high_memory)
                return; 

        write_lock(&vmlist_lock);
        for (p = vmlist, pprev = &vmlist; p != NULL; pprev = &p->next, p = *pprev)
                if (p->addr == (void *)(PAGE_MASK & (unsigned long)addr))
                        break;
        if (!p) { 
                printk("__iounmap: bad address %p\n", addr);
                goto out_unlock;
        }
        *pprev = p->next;
        unmap_vm_area(p);
        if (p->flags >> 20) {
                /* p->size includes the guard page, but cpa doesn't like that */
                ioremap_change_attr(p->phys_addr, (p->size - PAGE_SIZE), 0);
        }
out_unlock:
        write_unlock(&vmlist_lock);
        kfree(p);
}

Note that vmlist_lock is still held when ioremap_change_attr is called.
ioremap_change_attr calls change_page_attr_addr, which in turn does this:

down_write(&init_mm.mmap_sem);

down_write sleeps if another thread already holds init_mm.mmap_sem, so we may
now sleep while holding the vmlist_lock.

Comment 1 Kimball Murray 2007-11-01 16:39:06 UTC
This is a very timing-and-scheduling-sensitive bug.  It has been around for a
long time, and we never hit it through RHEL4.4 and RHEL4.5 until last week.  One
of our drivers, at system startup, does an ioremap_nocache, followed by an
iounmap.  Meanwhile, we must have changed something in the system that changes
the timing of events such that the init_mm.mmap_sem is taken by some other
thread at just the wrong time.  Now our system hits this bug frequently on
bootup, but changing anything in the way the system starts can make this go away
again.

So it's hard to assess the exposure accurately.

Comment 4 Larry Woodman 2007-11-01 19:05:15 UTC
Created attachment 245981 [details]
Patch to fix this problem


This patch fixes this problem:

Comment 6 RHEL Product and Program Management 2007-12-19 03:56:06 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Andrius Benokraitis 2008-02-13 19:57:09 UTC
Simon, can you confirm that the patch below fixes the issue on 4.6? (Since this
may get proposed for 4.6.z). Thanks!

Comment 10 Simon McGrath (Stratus Onsite Engineer) 2008-02-13 22:20:39 UTC
Will patch and rebuild, and will submit to the Stratus Systems Test group for
verification tomorrow morning.

Comment 11 Simon McGrath (Stratus Onsite Engineer) 2008-02-14 19:59:04 UTC
Good news: the Stratus Systems Test group has been running the patched
2.6.9-67.0.4 kernel for several hours now and is unable to repro the problem.
Previously, on one system, the problem would occur in about an hour. They will
continue tests for 24 hours before giving a definitive sign-off on the patch.

Comment 13 Simon McGrath (Stratus Onsite Engineer) 2008-02-15 16:14:47 UTC
The Stratus Systems Test group has run tests successfully for 24 hours without
reproducing the problem or encountering any new ones. Therefore, the patch is GOOD.

Comment 21 Simon McGrath (Stratus Onsite Engineer) 2008-02-19 18:40:52 UTC
On Friday Feb 15th Vivek G. agreed to include patch in next 4.7 build, with no
objections from Larry W.

This bug is cloned for inclusion in the 4.6.z stream with:

https://bugzilla.redhat.com/show_bug.cgi?id=433267



Comment 22 Vivek Goyal 2008-02-21 14:49:52 UTC
Committed in 68.12. RPMS are available at http://people.redhat.com/vgoyal/rhel4/

Comment 24 R.H. 2008-05-02 19:10:55 UTC
What test procedures did Stratus Systems Test group use to test this?

Comment 28 errata-xmlrpc 2008-07-24 19:19:20 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0665.html

