We use LVM snapshots as part of our backup process. I have installed
RHEL ES 3 Update 3 (and post U3 errata) on a server here with a Mylex
AcceleRAID 170 card (so it uses the DAC960 module). When I try to
create the LVM snapshot, I get a kernel panic on the console:
Kernel panic: DAC960: SegmentNumber != SegmentCount
That is it (no stack trace or anything).
I made sure that the RAID firmware was up to date, and that didn't make
a difference.
Is there anything else I can do to help debug this? Anything I can
set up to gather additional data? The system is still running (as
long as you don't need to do any disk I/O), so getting SysRq kernel
info doesn't seem very useful (as the kernel is still running).
If there is any debugging I can do, I'd like to gather it as soon as
possible. I am supposed to get this system into production use as
soon as possible, and then I won't be able to test anything anymore
(as it will be a 24x7 server). We'll either (a) not use snapshots or
possibly (b) replace the RAID card and reload the OS.
Any further debugging will help (I don't have access to DAC960 hardware).
Unless we can narrow this down, (b) is the obvious way to go.
Created attachment 104525 [details]
Debugging patch to DAC960.c
Created attachment 104527 [details]
Oops output when using debugging patch
I patched DAC960.c to print a little more info and BUG() when it fails
instead of just panic()ing (I've attached the patch and resulting oops
output). I am still trying to understand the logic in the code at
This is all with kernel-smp-2.4.21-20.EL.i686.rpm installed (I am just
rebuilding the DAC960 module and rebuilding the initrd to use it).
If I snapshot a filesystem mounted read-only, it doesn't crash, so I
guess it happens when trying to flush the filesystem to disk before
creating the snapshot.
If there are other specific steps I can use to debug this, please let
me know (I have not done a lot to debug kernel problems before).
Looking at this some more, it looks like the logic in
DAC960_V2_QueueReadWriteCommand() is trying to coalesce buffers itself
before putting them in a scatter/gather list. Then at the end it
checks to see if the number of entries in the scatter/gather list is
the same as the number of buffers. If any of the buffers were
contiguous and are coalesced, this is guaranteed to panic(). Either
the coalescing code should be removed or the check should be removed,
right? I'll remove the check and see what happens.
DAC960_V1_QueueReadWriteCommand() looks to have the same problem.
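To illustrate the reasoning above, here is a minimal userspace sketch of the
pattern as I understand it. It is not the driver code: struct buf, struct
sg_entry, build_sg_list() and segment_check_passes() are hypothetical names
made up for this example, and the fixed limit of 16 entries is arbitrary. It
only models "merge contiguous buffers, then compare the entry count to the
buffer count".

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a buffer: start address and length. */
struct buf {
    unsigned long addr;
    unsigned long len;
};

/* Hypothetical scatter/gather entry. */
struct sg_entry {
    unsigned long addr;
    unsigned long len;
};

/*
 * Build a scatter/gather list from n_bufs buffers, merging a buffer into
 * the previous entry when it starts exactly where that entry ends.
 * Returns the number of entries written to sg.
 * The caller must provide room for n_bufs entries.
 */
static size_t build_sg_list(const struct buf *bufs, size_t n_bufs,
                            struct sg_entry *sg)
{
    size_t n_sg = 0;
    size_t i;

    assert(bufs != NULL && sg != NULL);

    for (i = 0; i < n_bufs; i++) {
        if (n_sg > 0 &&
            sg[n_sg - 1].addr + sg[n_sg - 1].len == bufs[i].addr) {
            /* Contiguous with the previous entry: extend it. */
            sg[n_sg - 1].len += bufs[i].len;
        } else {
            /* Not contiguous: start a new entry. */
            sg[n_sg].addr = bufs[i].addr;
            sg[n_sg].len = bufs[i].len;
            n_sg++;
        }
    }
    return n_sg;
}

/*
 * Models the check described above: returns nonzero only when the number
 * of scatter/gather entries equals the number of buffers.
 */
static int segment_check_passes(const struct buf *bufs, size_t n_bufs)
{
    struct sg_entry sg[16];

    assert(n_bufs <= 16);
    return build_sg_list(bufs, n_bufs, sg) == n_bufs;
}
```

With two buffers at unrelated addresses, nothing is merged and the counts
match. As soon as two adjacent buffers are contiguous, the list has one entry
fewer than there are buffers and the comparison fails, which in this model
corresponds to the panic().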
Okay, removing the check seems to work fine. I can create a snapshot
and "e2fsck -f -n" it with no errors (so it does appear that the
writes are succeeding). I'll attach the patch.
Created attachment 104528 [details]
Remove bogus buffer count check
This removes a check that compares the number of buffers in the command to the
number of buffers on the scatter/gather list. Since the code will combine
contiguous buffers when building the scatter/gather list, this check will
always fail in that case.
I see that this bug is still in the RHEL 3 ES beta Update 4 kernel.
Is this going to be fixed?
I no longer have this hardware so can no longer even test this. However,
Bugzilla won't let me close it.
Closing based on last comment.