We use LVM snapshots as part of our backup process. I have installed RHEL ES 3 Update 3 (plus post-U3 errata) on a server here with a Mylex AcceleRAID 170 card (so it uses the DAC960 module). When I try to create an LVM snapshot, I get a kernel panic on the console:

    Kernel panic: DAC960: SegmentNumber != SegmentCount

That is all there is (no stack trace or anything). I made sure that the RAID firmware was up to date, and that didn't make any difference.
Is there anything else I can do to help debug this? Anything I can set up to gather additional data? The system is still running (as long as you don't need to do any disk I/O), so SysRq kernel info doesn't seem very useful. If there is any debugging I can do, I'd like to do it soon: I am supposed to get this system into production use as soon as possible, and then I won't be able to test anything anymore (as it will be a 24x7 server). We'll either (a) not use snapshots or possibly (b) replace the RAID card and reload the OS.
Any further debugging will help (I don't have access to DAC960 hardware). Unless we can narrow this down, (b) is the obvious way to go.
Created attachment 104525: Debugging patch to DAC960.c
Created attachment 104527: Oops output when using debugging patch
I patched DAC960.c to print a little more info and BUG() when it fails instead of just panic()ing (I've attached the patch and resulting oops output). I am still trying to understand the logic in the code at this point. This is all with kernel-smp-2.4.21-20.EL.i686.rpm installed (I am just rebuilding the DAC960 module and rebuilding the initrd to use it). If I snapshot a filesystem mounted read-only, it doesn't crash, so I guess it happens when trying to flush the filesystem to disk before snapshotting. If there are other specific steps I can use to debug this, please let me know (I have not done a lot to debug kernel problems before).
Looking at this some more, it looks like the logic in DAC960_V2_QueueReadWriteCommand() is trying to coalesce buffers itself before putting them in a scatter/gather list. Then at the end it checks to see if the number of entries in the scatter/gather list is the same as the number of buffers. If any of the buffers were contiguous and are coalesced, this is guaranteed to panic(). Either the coalescing code should be removed or the check should be removed, right? I'll remove the check and see what happens. DAC960_V1_QueueReadWriteCommand() looks to have the same problem.
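To illustrate the mismatch described above, here is a minimal user-space sketch, not the driver's actual code. The `struct buf`, `struct sg_entry`, `build_sg_list()`, and `counts_match()` names are hypothetical stand-ins invented for this example. It only assumes that a buffer is merged into the previous scatter/gather entry when it starts exactly where that entry ends.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a block-layer buffer: start address and length. */
struct buf {
    unsigned long addr;
    unsigned long len;
};

/* Hypothetical scatter/gather entry (not the driver's real layout). */
struct sg_entry {
    unsigned long addr;
    unsigned long len;
};

/*
 * Build a scatter/gather list from nbufs buffers, merging a buffer into the
 * previous entry when it starts exactly where that entry ends.
 * Returns the number of entries written to sg (never more than nbufs).
 */
static size_t build_sg_list(const struct buf *bufs, size_t nbufs,
                            struct sg_entry *sg)
{
    size_t nsegs = 0;

    assert(bufs != NULL && sg != NULL);
    for (size_t i = 0; i < nbufs; i++) {
        if (nsegs > 0 &&
            sg[nsegs - 1].addr + sg[nsegs - 1].len == bufs[i].addr) {
            /* Contiguous with the previous entry: extend it. */
            sg[nsegs - 1].len += bufs[i].len;
        } else {
            /* Not contiguous: start a new entry. */
            sg[nsegs].addr = bufs[i].addr;
            sg[nsegs].len = bufs[i].len;
            nsegs++;
        }
    }
    return nsegs;
}

/* The kind of check in question: true only when nothing was merged. */
static int counts_match(const struct buf *bufs, size_t nbufs)
{
    struct sg_entry sg[16];

    assert(nbufs <= 16);
    return build_sg_list(bufs, nbufs, sg) == nbufs;
}
```

In this sketch, `counts_match()` returns false as soon as any two adjacent buffers are contiguous, because the merged list is then shorter than the buffer count. That is the situation in which an entries-equals-buffers check would trip.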
Okay, removing the check seems to work fine. I can create a snapshot and "e2fsck -f -n" it with no errors (so it does appear that the writes are succeeding). I'll attach the patch.
Created attachment 104528: Remove bogus buffer count check
This removes a check that compares the number of buffers in the command to the number of entries on the scatter/gather list. Since the code combines contiguous buffers when building the scatter/gather list, this check will always fail in that case.
I see that this bug is still in the RHEL 3 ES beta Update 4 kernel. Is this going to be fixed?
I no longer have this hardware, so I can no longer test this. However, Bugzilla won't let me close the bug.
Closing based on last comment.