The problem here is that the list of pending_exceptions is somehow corrupt in this loop inside __origin_write():

--------------------------------------------------------------------------
	/*
	 * Now that we have a complete pe list we can start the copying.
	 */
	if (last) {
		pe = last;
		do {
			down_write(&pe->snap->lock);
			if (first)
				bio_list_add(&pe->origin_bios, bio);
			if (!pe->started) {
				pe->started = 1;
				up_write(&pe->snap->lock);
				start_copy(pe);
			} else
				up_write(&pe->snap->lock);
			first = 0;
			pe = list_entry(pe->siblings.next,
					struct pending_exception, siblings);
		} while (pe != last);
	}
-----------------------------------------------------------------------

The 'last' pointer is valid and points to itself, yet the 'pe' pointer contains slab poison, so it has been freed while this list was being processed! Perhaps a semaphore should be held while this list is walked, so that another CPU cannot change the list on the fly?
There is one fix in a related area of the snapshot code included in U3. But we suspect that in __origin_write, 'pe' and/or 'last' could be freed (via start_copy, kcopyd_copy, copy_callback (asynchronous, in a different thread), pending_complete, free_pending_exception), meaning that the 'pe->siblings.next' read and the 'pe != last' comparison would be undefined.
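To make the suspected sequence concrete, here is a tiny single-threaded simulation (not kernel code): a completion callback "frees" a pe before the submitter has finished walking the ring. Freeing is modelled by setting a 'freed' flag instead of calling free(), so the hazard can be asserted instead of crashing; 'pe->next' stands in for 'pe->siblings.next', and all names (pe_sketch, walk_after_async_completion, etc.) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Each sketch element is a node in a circular ring, like pe->siblings. */
struct pe_sketch {
	int freed;                  /* stand-in for actually freeing the pe */
	struct pe_sketch *next;     /* stand-in for pe->siblings.next      */
};

/* Stands in for pending_complete() running on another CPU: in the bug,
 * the copy submitted by start_copy() can complete (and free the pe)
 * before the submitter reads pe->siblings.next. */
static void complete_and_free(struct pe_sketch *pe)
{
	pe->freed = 1;
}

/* Returns nonzero if the walk ever dereferenced a "freed" pe, i.e. the
 * undefined behaviour described above. */
static int walk_after_async_completion(struct pe_sketch *last)
{
	struct pe_sketch *pe = last;
	int touched_freed = 0;

	do {
		complete_and_free(pe);    /* worst case: completes immediately */
		if (pe->freed)
			touched_freed = 1;    /* the pe->next read below is now UB */
		pe = pe->next;
	} while (pe != last);

	return touched_freed;
}
```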
Points to note: They say they do *not* have multiple snapshots of the same origin. That means that __origin_write only passes through its loops once. They are not writing to the snapshots, so snapshot_map's start_copy is not involved. The snapshot reading code there does look worrying, though [a 2nd possible problem]: I don't see how it enforces ordering when reading a part of the snapshot corresponding to a part of the origin that is being written to for the first time. Is this one factor involved in bug 174742 (the other being the lack of a flush at snapshot creation time, which delays the write so that it overlaps with the reads)?
It appears to me that our big problem is that there is no overarching lock. We can use snap->lock to ensure that ITS list pointers don't change, but we cannot ensure that other items in that list won't change. Bottom line: we can lock the elements, but we can't lock the list.

jlayton's patch, I think, tried to address this, but I don't think it went far enough (also, why was the lock being initialized in __find_pending_exception?). The reference counters that I thought would work don't make sense, since that would require a callback for each reference, which isn't there. Even if we check the reference count at the end of __origin_write, it only serves to complicate matters.

It would be great if we could associate the list with the origin somehow and take out a list-related lock (as opposed to an element-related lock) from there. The simplest fix may be to augment jlayton's patch and clean up the list processing. However, every write to any snapshot or origin (even if they are unrelated) would conflict. To do this right, we may need an actual origin structure.
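A minimal userspace sketch of the "lock the list, not the elements" idea: hang one lock off an origin structure and hold it for the entire sibling traversal. The lock is modelled as a plain flag so the sketch stays single-threaded and assertable; in the kernel it would be a rwsem or spinlock in a real origin structure. All names here (origin_sketch, sum_under_list_lock, etc.) are hypothetical illustrations.

```c
#include <assert.h>
#include <stddef.h>

struct element_sketch {
	int value;
	struct element_sketch *next;
};

struct origin_sketch {
	int list_locked;                /* stand-in for a per-origin lock */
	struct element_sketch *head;
};

static void origin_lock(struct origin_sketch *o)
{
	assert(!o->list_locked);        /* no recursive locking in this model */
	o->list_locked = 1;
}

static void origin_unlock(struct origin_sketch *o)
{
	assert(o->list_locked);
	o->list_locked = 0;
}

/* The whole traversal runs under the origin's list lock, so no other
 * writer may unlink or free an element mid-walk -- unlike per-element
 * locking, which protects each node but not the links between them. */
static int sum_under_list_lock(struct origin_sketch *o)
{
	int sum = 0;

	origin_lock(o);
	for (struct element_sketch *e = o->head; e; e = e->next)
		sum += e->value;
	origin_unlock(o);

	return sum;
}
```

The cost, as noted above, is that unrelated writes would serialize on this one lock unless the lock lives in a real per-origin structure.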
Created attachment 123075 [details]
dm-snapshot-fix-origin_write-pending_exception-submission.patch

Say you have several snapshots of the same origin and then you issue a write to some place in the origin for the first time. Before the device-mapper snapshot target lets the write go through to the underlying device, it needs to make a copy of the data that is about to be overwritten. Each snapshot is independent, so it makes one copy for each snapshot.

__origin_write() loops through each snapshot and checks to see whether a copy is needed for that snapshot. (A copy is only needed the first time that data changes.) If a copy is needed, the code allocates a 'pending_exception' structure holding the details. It links these together for all the snapshots, then works its way through this list and submits the copying requests to the kcopyd thread by calling start_copy(). When each request is completed, the original pending_exception structure gets freed in pending_complete(). If you're very unlucky, this structure can get freed *before* the submission process has finished walking the list.

This patch:
1) Creates a new temporary list pe_queue to hold the pending exception structures;
2) Does all the bookkeeping up-front, then walks through the new list safely and calls start_copy() for each pending_exception that needed it;
3) Avoids attempting to add pe->siblings to the list if it's already connected.
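The pe_queue idea can be sketched in a few lines of userspace C: collect the pending exceptions on a private local list while the relevant lock is held, then submit them with no further reference to the shared sibling links. This is a hedged illustration, not the patch itself; all names (pending_sketch, queue_pending, submit_all) are hypothetical, and the private list is a simple singly-linked stack rather than the kernel's list_head.

```c
#include <assert.h>
#include <stddef.h>

struct pending_sketch {
	int started;
	int copies_submitted;           /* stands in for start_copy() firing */
	struct pending_sketch *next;    /* private pe_queue link             */
};

/* Move one pe onto the caller's private queue. In the real patch this
 * bookkeeping happens up-front, under snap->lock, so no other CPU can
 * free the pe while it is being queued. */
static void queue_pending(struct pending_sketch **head,
			  struct pending_sketch *pe)
{
	pe->next = *head;
	*head = pe;
}

/* Walk only the private queue; a pe being freed after its copy completes
 * can no longer corrupt this traversal, because we never touch the shared
 * sibling links again.  Returns the number of copies submitted. */
static int submit_all(struct pending_sketch *head)
{
	int n = 0;

	for (struct pending_sketch *pe = head; pe; pe = pe->next) {
		if (!pe->started) {
			pe->started = 1;
			pe->copies_submitted = 1;   /* start_copy(pe) in the kernel */
			n++;
		}
	}
	return n;
}
```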
That patch tackles the problem I reckon they're seeing, by using a different temporary list. It does not deal with problems when the snapshot is being written to at the same time - but they told us they aren't doing that. More reasoning to check, again assuming there are never any snapshot writes: Say two contiguous origin writes occur relating to the same chunk - the first creates the pending_exceptions. Because pending_complete() holds s->lock while removing the exception and manipulating pe->siblings, the list_empty(pe->siblings) in origin_write cannot run while the pe is being deleted. 'last' might be invalid by this point though, as it refers to the previous snapshot which is not locked. However 'last' can only be invalid if the first pending_exception already got submitted (and completed) - and that means that the list was already created the first time this code got run, so list_empty(pe->siblings) will always be true and that 'last' pointer will never get dereferenced.
s/true/false/ in the last sentence of my last explanation. And a further patch is needed to improve the locking on pe->siblings. [Consider 2 pending_exceptions being reduced to 1 during the 2nd pass around that loop.]
Created attachment 123159 [details] new loopdelay patch for __origin_write Updated loopdelay patch for __origin_write. It should apply cleanly to the -27 kernel with patch 123075 from AGK. This should be used with CONFIG_DEBUG_SLAB enabled.
Created attachment 123184 [details] /tmp/dm-snapshot-replace-sibling-list.patch Follow-on patch that addresses the remaining problems with the sibling lists. This patch hasn't had a proper review yet and needs plenty of testing.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHSA-2006-0132.html