133785 – Creating LVM snapshot on DAC960 panics kernel

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Red Hat Enterprise Linux 3
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	3.0
Hardware:	i686
OS:	Linux
Priority:	medium
Severity:	medium
Target Milestone:	---
Assignee:	Heinz Mauelshagen
QA Contact:	Brian Brock
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2004-09-27 13:45 UTC by Chris Adams
Modified:	2007-11-30 22:07 UTC
CC List:	5 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2006-12-22 19:02:27 UTC
Target Upstream Version:
Embargoed:

Attachments
Debugging patch to DAC960.c (739 bytes, patch) 2004-09-29 17:12 UTC, Chris Adams	no flags
Oops output when using debugging patch (2.31 KB, text/plain) 2004-09-29 17:13 UTC, Chris Adams	no flags
Remove bogus buffer count check (710 bytes, patch) 2004-09-29 17:36 UTC, Chris Adams	no flags

Description Chris Adams 2004-09-27 13:45:55 UTC

We use LVM snapshots as part of our backup process.  I have installed
RHEL ES 3 Update 3 (and post U3 errata) on a server here with a Mylex
AcceleRAID 170 card (so it uses the DAC960 module).  When I try to
create the LVM snapshot, I get a kernel panic on the console: 
 
Kernel panic: DAC960: SegmentNumber != SegmentCount 
 
That is it (no stack trace or anything). 
 
I made sure that the RAID firmware was up to date and that didn't make
any difference.

Comment 1 Chris Adams 2004-09-28 19:57:23 UTC

Is there anything else I can do to help debug this?  Anything I can
set up to gather additional data?  The system is still running (as
long as you don't need to do any disk I/O), so SysRq kernel info
doesn't seem likely to be very useful.

If there is any debugging I can do, I'd like to do it soon.  I am
supposed to get this system into production use as soon as possible,
and then I won't be able to test anything anymore
(as it will be a 24x7 server).  We'll either (a) not use snapshots or
possibly (b) replace the RAID card and reload the OS.

Comment 2 Heinz Mauelshagen 2004-09-29 10:25:45 UTC

Any further debugging will help (I don't have access to a DAC960).
Unless we can track this down, option (b) is the obvious way to go.

Comment 3 Chris Adams 2004-09-29 17:12:50 UTC

Created attachment 104525
Debugging patch to DAC960.c

Comment 4 Chris Adams 2004-09-29 17:13:30 UTC

Created attachment 104527
Oops output when using debugging patch

Comment 5 Chris Adams 2004-09-29 17:17:10 UTC

I patched DAC960.c to print a little more info and BUG() when it fails
instead of just panic()ing (I've attached the patch and resulting oops
output).  I am still trying to understand the logic in the code at
this point.

This is all with kernel-smp-2.4.21-20.EL.i686.rpm installed (I am just
rebuilding the DAC960 module and rebuilding the initrd to use it).

If I snapshot a filesystem mounted read-only, it doesn't crash, so I
guess it happens when trying to flush the filesystem to disk before
snapshotting.

If there are other specific steps I can use to debug this, please let
me know (I have not done a lot to debug kernel problems before).

Comment 6 Chris Adams 2004-09-29 17:28:46 UTC

Looking at this some more, it looks like the logic in
DAC960_V2_QueueReadWriteCommand() is trying to coalesce buffers itself
before putting them in a scatter/gather list.  Then at the end it
checks to see if the number of entries in the scatter/gather list is
the same as the number of buffers.  If any of the buffers are
contiguous and get coalesced, the list ends up with fewer entries than
buffers, so the check is guaranteed to panic().  Either
the coalescing code should be removed or the check should be removed,
right?  I'll remove the check and see what happens.

DAC960_V1_QueueReadWriteCommand() looks to have the same problem.
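
To make the reasoning above concrete, here is a minimal user-space
sketch of the pattern being described.  It is not the DAC960 code: the
types struct buf and struct sg_entry and the functions build_sg_list()
and count_check_passes() are hypothetical stand-ins invented for
illustration.  It only shows that once contiguous buffers are merged
while building a scatter/gather list, comparing the number of list
entries to the number of buffers must fail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins; not the real DAC960 or kernel types. */
struct buf {
    uintptr_t addr;   /* start address of the buffer */
    size_t    len;    /* length in bytes */
};

struct sg_entry {
    uintptr_t addr;   /* start address of the segment */
    size_t    len;    /* length in bytes */
};

/*
 * Build a scatter/gather list from n buffers.  A buffer that starts
 * exactly where the previous entry ends is merged into that entry
 * instead of getting an entry of its own.
 * Returns the number of entries written to sg (never more than n).
 */
size_t build_sg_list(const struct buf *bufs, size_t n, struct sg_entry *sg)
{
    size_t count = 0;

    assert(n == 0 || (bufs != NULL && sg != NULL));

    for (size_t i = 0; i < n; i++) {
        if (count > 0 &&
            sg[count - 1].addr + sg[count - 1].len == bufs[i].addr) {
            /* Contiguous with the previous segment: extend it. */
            sg[count - 1].len += bufs[i].len;
        } else {
            /* Not contiguous: start a new segment. */
            sg[count].addr = bufs[i].addr;
            sg[count].len  = bufs[i].len;
            count++;
        }
    }
    return count;
}

/*
 * The kind of check described above: "number of scatter/gather
 * entries equals number of buffers".  Returns nonzero if it holds.
 */
int count_check_passes(const struct buf *bufs, size_t n, struct sg_entry *sg)
{
    return build_sg_list(bufs, n, sg) == n;
}
```

With three buffers where the first two are adjacent in memory,
build_sg_list() produces two entries, so count_check_passes() returns 0.
With no adjacent buffers the counts match and the check holds, which
would fit the panic only showing up under some I/O patterns.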

Comment 7 Chris Adams 2004-09-29 17:33:57 UTC

Okay, removing the check seems to work fine.  I can create a snapshot
and "e2fsck -f -n" it with no errors (so it does appear that the
writes are succeeding).  I'll attach the patch.

Comment 8 Chris Adams 2004-09-29 17:36:14 UTC

Created attachment 104528
Remove bogus buffer count check

This removes a check that compares the number of buffers in the command to the
number of buffers on the scatter/gather list.  Since the code will combine
contiguous buffers when building the scatter/gather list, this check will
always fail in that case.

Comment 9 Chris Adams 2004-10-22 20:03:49 UTC

I see that this bug is still in the RHEL 3 ES beta Update 4 kernel. 
Is this going to be fixed?

Comment 10 Chris Adams 2006-12-22 03:00:21 UTC

I no longer have this hardware, so I can no longer even test this.  However,
Bugzilla won't let me close it.

Comment 11 Ernie Petrides 2006-12-22 19:02:27 UTC

Closing based on last comment.
