Description of problem:
There is a race condition in the md subsystem: if members and md
devices are removed while /proc/mdstat is being read, a kernel panic
can occur. Ultimately, it boils down to a lack of locking on the
internal md data structures that are scanned to produce the output of
/proc/mdstat. This bug has already been fixed in the 2.6 stream
(actually it was fixed in 2.5), and I think the fix was backported to
the 2.4.22 kernel (Neil Brown <email@example.com> created the patches
for both).
Here is an example of the oops and panic:
Unable to handle kernel NULL pointer dereference at virtual
raid1 usbserial parport_pc lp parport autofs4 audit
iptable_filter ip_tables e1
00 floppy sg scsi_mod microcode ke<y6bd>medv: m omud1s esdtevop
npmudt: u sunbb-uinhdci<
: exporCtPU_r: d e v( h3
m d: 0 0un60:b[in<f8da<hfd4ecc123>,]0>
evE(FhLAdcGS12:) 0 rd
EIP is at raid1_status [raid1] 0x13 (2.4.21-27.0.1.ELsmp/i686)
eax: f7bb0a80 ebx: f7e2c400 ecx: 00000006 edx: f6485980
esi: 00000000 edi: 00000000 ebp: f6485980 esp: f586ff18
ds: 0068 es: 0068 ss: 0068
Process cat (pid: 6618, stackpage=f586f000)
Stack: f7bb0a94 0001f500 c0185757 c3bb3108 f7e2c400 f7bb0a94
c0218bd3 f6485980 f7bb0a80 0000fa80 00000000 00000000
000000e4 c0185213 f6485980 f7bb0a80 f586ff74 f6485998
Call Trace: [<c0185757>] seq_printf [kernel] 0x47 (0xf586ff20)
[<c0218bd3>] md_seq_show [kernel] 0x153 (0xf586ff38)
[<c0185213>] seq_read [kernel] 0x173 (0xf586ff5c)
[<c0164<0867>r>a] isdy1s: _rmiearrdo [r kerrensyenl]c 0wxa9s 7n
h:e d[,m< crd0e_1sdtoa_rst6yinnecg7( )0n 9e>gxo]tt t siymsse_i.gf
alta t.6.4. [ekxerintienl]g
x49 (0xf586ffa8) 0
Code: 8b 87 d8 03 00 00 89 44 2<46<>4md>: 0 mcd2 8 sbt op8p7e dd.4
0md3: <4u>nb i00nd <0h0dc1 c57,1> 4
4md<:4> e x24por 0t_4rd
pmdan: icu:nb iFandta<lhd cex1c4e,0pt>i
Version-Release number of selected component (if applicable):
How reproducible:
Always with the test scripts. Under normal system operation the bug
is not likely to happen, but if someone had to fail a disk while
monitoring software was reading /proc/mdstat, it could strike at that
most inopportune moment.
Steps to Reproduce:
1. Start running the test script that creates and removes md devices in a loop.
2. Run the test script that cats /proc/mdstat in a loop.
In a short amount of time the panic will occur.
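The reader side of step 2 can be sketched roughly as below. This is not the attached script, only a minimal stand-in; the function name read_mdstat_loop and its optional iteration count are assumptions added for illustration.

```shell
#!/bin/sh
# Hypothetical stand-in for the reader test script; not the attachment itself.

# read_mdstat_loop FILE [COUNT]
# Reads FILE over and over, discarding its contents.
# COUNT=0 (the default) loops forever; otherwise it stops after COUNT reads
# and prints the number of completed reads.
read_mdstat_loop() {
    file=$1
    count=${2:-0}
    reads=0
    while [ "$count" -eq 0 ] || [ "$reads" -lt "$count" ]; do
        # Each cat forces the kernel to walk the md structures again.
        cat "$file" > /dev/null || return 1
        reads=$((reads + 1))
    done
    echo "$reads"
}

# On the test machine (runs until interrupted or the kernel panics):
#   read_mdstat_loop /proc/mdstat
```

The bounded count exists only so the loop can be tried against an ordinary file first; for the actual reproduction it would run unbounded against /proc/mdstat.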
I am going to attach two test scripts that you can use to reproduce this.
I am also trying to figure out how to patch the kernel to avoid this
problem, but I don't mind if you produce a patch first (-:. I also
have just sent an email to Neil Brown about this issue. Will gladly
copy the responsible engineer on correspondence if requested.
Created attachment 110871
Panic log at normal logging level (will have interspersed md output)
Created attachment 110872
Panic log at loglevel 1
Created attachment 110873
Test script that reads /proc/mdstat in loop
Created attachment 110874
Test script that creates and destroys md devices in a loop
You will probably want to change the partitions used, and perhaps the names of the
md devices, to work with whatever system you test with. Also, just so you know,
we were testing on SMP systems when doing this. One had the E7501 chipset and the
other the E7520 chipset (Nocona/Lindenhurst).
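For anyone adapting the create/destroy side, a rough sketch of such a loop follows. It is not the attached script: it assumes mdadm is the tool in use (the report does not say), the names run and churn_md are made up here, and the devices in the usage line are placeholders to replace. It defaults to a dry run that only prints the commands, since a real run overwrites the named partitions.

```shell
#!/bin/sh
# Hypothetical create/destroy loop; assumes mdadm, which the report does not confirm.
# DRY_RUN=1 (the default) only prints commands. Set DRY_RUN=0 to execute them.
DRY_RUN=${DRY_RUN:-1}

# run CMD...: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# churn_md MD_DEVICE PART_A PART_B COUNT
# Creates a two-disk RAID1 array and stops it again, COUNT times.
churn_md() {
    md=$1
    part_a=$2
    part_b=$3
    count=$4
    i=0
    while [ "$i" -lt "$count" ]; do
        # --run suppresses mdadm's confirmation prompt on reused partitions.
        run mdadm --create "$md" --run --level=1 --raid-devices=2 \
            "$part_a" "$part_b" || return 1
        run mdadm --stop "$md" || return 1
        i=$((i + 1))
    done
}

# Placeholder devices; substitute partitions that are safe to destroy:
#   DRY_RUN=0 churn_md /dev/md1 /dev/sdX1 /dev/sdY1 1000
```

This covers only creating and stopping arrays; failing and removing individual members, which the description also mentions, is left out.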
Created attachment 110942
Add locking around displaying mddev info
It took me a while to grok the fix in 2.5.23 by Neil Brown and then translate it to
2.4.21 + N Red Hat patches, but I think I got it. Basically, I just needed to add
locking on the mddev structure before displaying it in md_seq_show(). The patch is
very tiny, but it seems to work.
Ran the test scripts against the patched kernel all night and it's still going
without any problems. This seems to fix the problem.
Just wondering if you had a moment to look at this?
The patch looks sane to me. I'm currently testing it. If it passes
testing, I'll submit it for review and possible inclusion in our next
update.
The patch has passed my testing so far (it's hard to say it's right
since the problem only reproduces occasionally, but at least it
doesn't deadlock or anything like that). It's been submitted for
review and possible inclusion in the next kernel update (for both
AS2.1 and RHEL3).
A fix for this problem was committed to the RHEL3 U5
patch pool this afternoon (in kernel version 2.4.21-28.EL).
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.