The problem here is that the list of pending_exceptions is somehow corrupt in this loop inside __origin_write():

--------------------------------------------------------------------------
	/*
	 * Now that we have a complete pe list we can start the copying.
	 */
	if (last) {
		pe = last;
		do {
			down_write(&pe->snap->lock);
			if (first)
				bio_list_add(&pe->origin_bios, bio);
			if (!pe->started) {
				pe->started = 1;
				up_write(&pe->snap->lock);
				start_copy(pe);
			} else
				up_write(&pe->snap->lock);
			first = 0;
			pe = list_entry(pe->siblings.next,
					struct pending_exception, siblings);
		} while (pe != last);
	}
-----------------------------------------------------------------------

The 'last' pointer is valid and points to itself, yet the 'pe' pointer contains slab poison, so it has been freed while this list was being processed! Perhaps a semaphore should be held while this list is walked, so that another CPU cannot change the list on the fly?
There is one fix in a related area of the snapshot code included in U3. But we suspect that in __origin_write, 'pe' and/or 'last' could be freed (via start_copy, kcopyd_copy, copy_callback (asynchronous, in a different thread), pending_complete, free_pending_exception), meaning that the 'pe->siblings.next' read and the 'pe != last' comparison would be undefined.
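To make the suspected sequence concrete, here is a tiny single-threaded simulation (not kernel code): a completion callback "frees" a pe before the submitter has finished walking the ring. Freeing is modelled by setting a 'freed' flag instead of calling free(), so the hazard can be asserted instead of crashing; 'pe->next' stands in for 'pe->siblings.next', and all names (pe_sketch, walk_after_async_completion, etc.) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Each sketch element is a node in a circular ring, like pe->siblings. */
struct pe_sketch {
	int freed;                  /* stand-in for actually freeing the pe */
	struct pe_sketch *next;     /* stand-in for pe->siblings.next      */
};

/* Stands in for pending_complete() running on another CPU: in the bug,
 * the copy submitted by start_copy() can complete (and free the pe)
 * before the submitter reads pe->siblings.next. */
static void complete_and_free(struct pe_sketch *pe)
{
	pe->freed = 1;
}

/* Returns nonzero if the walk ever dereferenced a "freed" pe, i.e. the
 * undefined behaviour described above. */
static int walk_after_async_completion(struct pe_sketch *last)
{
	struct pe_sketch *pe = last;
	int touched_freed = 0;

	do {
		complete_and_free(pe);    /* worst case: completes immediately */
		if (pe->freed)
			touched_freed = 1;    /* the pe->next read below is now UB */
		pe = pe->next;
	} while (pe != last);

	return touched_freed;
}
```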
Points to note: They say they do *not* have multiple snapshots of the same origin. That means that __origin_write only passes through its loops once. They are not writing to the snapshots, so snapshot_map's start_copy is not involved. The snapshot reading code there does look worrying, though [a 2nd possible problem]: I don't see how it enforces ordering when reading a part of the snapshot corresponding to a part of the origin that is being written to for the first time. Is this one factor involved in bug 174742 (the other being the lack of a flush at snapshot creation time, which delays the write so that it overlaps with the reads)?
It appears to me that our big problem is that there is no overarching lock. We can use snap->lock to ensure that ITS list pointers don't change, but we cannot ensure that other items in that list won't change. Bottom line: we can lock the elements, but we can't lock the list.

jlayton's patch, I think, tried to address this, but I don't think it went far enough (also, why was the lock being initialized in __find_pending_exception?). The reference counters that I thought would work don't make sense, since that would require a callback for each reference, which isn't there. Even if we check the reference count at the end of __origin_write, it only serves to complicate matters.

It would be great if we could associate the list with the origin somehow and take out a list-related lock (as opposed to an element-related lock) from there. The simplest fix may be to augment jlayton's patch and clean up the list processing. However, every write to any snapshot or origin (even if they are unrelated) would conflict. To do this right, we may need an actual origin structure.
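A minimal userspace sketch of the "lock the list, not the elements" idea: hang one lock off an origin structure and hold it for the entire sibling traversal. The lock is modelled as a plain flag so the sketch stays single-threaded and assertable; in the kernel it would be a rwsem or spinlock in a real origin structure. All names here (origin_sketch, sum_under_list_lock, etc.) are hypothetical illustrations.

```c
#include <assert.h>
#include <stddef.h>

struct element_sketch {
	int value;
	struct element_sketch *next;
};

struct origin_sketch {
	int list_locked;                /* stand-in for a per-origin lock */
	struct element_sketch *head;
};

static void origin_lock(struct origin_sketch *o)
{
	assert(!o->list_locked);        /* no recursive locking in this model */
	o->list_locked = 1;
}

static void origin_unlock(struct origin_sketch *o)
{
	assert(o->list_locked);
	o->list_locked = 0;
}

/* The whole traversal runs under the origin's list lock, so no other
 * writer may unlink or free an element mid-walk -- unlike per-element
 * locking, which protects each node but not the links between them. */
static int sum_under_list_lock(struct origin_sketch *o)
{
	int sum = 0;

	origin_lock(o);
	for (struct element_sketch *e = o->head; e; e = e->next)
		sum += e->value;
	origin_unlock(o);

	return sum;
}
```

The cost, as noted above, is that unrelated writes would serialize on this one lock unless the lock lives in a real per-origin structure.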
Created attachment 123075 [details]
dm-snapshot-fix-origin_write-pending_exception-submission.patch

Say you have several snapshots of the same origin and then you issue a write to some place in the origin for the first time. Before the device-mapper snapshot target lets the write go through to the underlying device, it needs to make a copy of the data that is about to be overwritten. Each snapshot is independent, so it makes one copy for each snapshot.

__origin_write() loops through each snapshot and checks to see whether a copy is needed for that snapshot. (A copy is only needed the first time that data changes.) If a copy is needed, the code allocates a 'pending_exception' structure holding the details. It links these together for all the snapshots, then works its way through this list and submits the copying requests to the kcopyd thread by calling start_copy(). When each request is completed, the original pending_exception structure gets freed in pending_complete(). If you're very unlucky, this structure can get freed *before* the submission process has finished walking the list.

This patch:
1) Creates a new temporary list pe_queue to hold the pending exception structures;
2) Does all the bookkeeping up-front, then walks through the new list safely and calls start_copy() for each pending_exception that needed it;
3) Avoids attempting to add pe->siblings to the list if it's already connected.
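The pe_queue idea can be sketched in a few lines of userspace C: collect the pending exceptions on a private local list while the relevant lock is held, then submit them with no further reference to the shared sibling links. This is a hedged illustration, not the patch itself; all names (pending_sketch, queue_pending, submit_all) are hypothetical, and the private list is a simple singly-linked stack rather than the kernel's list_head.

```c
#include <assert.h>
#include <stddef.h>

struct pending_sketch {
	int started;
	int copies_submitted;           /* stands in for start_copy() firing */
	struct pending_sketch *next;    /* private pe_queue link             */
};

/* Move one pe onto the caller's private queue. In the real patch this
 * bookkeeping happens up-front, under snap->lock, so no other CPU can
 * free the pe while it is being queued. */
static void queue_pending(struct pending_sketch **head,
			  struct pending_sketch *pe)
{
	pe->next = *head;
	*head = pe;
}

/* Walk only the private queue; a pe being freed after its copy completes
 * can no longer corrupt this traversal, because we never touch the shared
 * sibling links again.  Returns the number of copies submitted. */
static int submit_all(struct pending_sketch *head)
{
	int n = 0;

	for (struct pending_sketch *pe = head; pe; pe = pe->next) {
		if (!pe->started) {
			pe->started = 1;
			pe->copies_submitted = 1;   /* start_copy(pe) in the kernel */
			n++;
		}
	}
	return n;
}
```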
That patch tackles the problem I reckon they're seeing, by using a different temporary list. It does not deal with problems when the snapshot is being written to at the same time - but they told us they aren't doing that. More reasoning to check, again assuming there are never any snapshot writes: Say two contiguous origin writes occur relating to the same chunk - the first creates the pending_exceptions. Because pending_complete() holds s->lock while removing the exception and manipulating pe->siblings, the list_empty(pe->siblings) in origin_write cannot run while the pe is being deleted. 'last' might be invalid by this point though, as it refers to the previous snapshot which is not locked. However 'last' can only be invalid if the first pending_exception already got submitted (and completed) - and that means that the list was already created the first time this code got run, so list_empty(pe->siblings) will always be true and that 'last' pointer will never get dereferenced.
s/true/false/ in the last sentence of my last explanation. And a further patch is needed to improve the locking on pe->siblings. [Consider 2 pending_exceptions being reduced to 1 during the 2nd pass around that loop.]
Created attachment 123159 [details] new loopdelay patch for __origin_write Updated loopdelay patch for __origin_write. It should apply cleanly to the -27 kernel with patch 123075 from AGK. This should be used with CONFIG_DEBUG_SLAB enabled.
Created attachment 123184 [details] /tmp/dm-snapshot-replace-sibling-list.patch Follow-on patch that addresses the remaining problems with the sibling lists. This patch hasn't had a proper review yet and needs plenty of testing.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0132.html