A RAID array failed and I got no email notification from mdadm. Why? Because mdadm was not running. Attempting to start it manually with "service mdmonitor start" fails with this in the syslog:

Feb 3 10:52:15 positron kernel: mdadm[18642]: segfault at 0 ip 0000000000421c76 sp 00007fffde4ca430 error 4 in mdadm[400000+61000]

I installed the mdadm debuginfo package, ran mdadm under gdb, and saw it crash on a NULL pointer in mse->metadata_version (see the stack trace below). I confirmed that some arrays in /proc/mdstat do not report a metadata version number (see the contents of /proc/mdstat below), and that all SL6.1 machines with no metadata version reported do not have mdadm running. This is VERY BAD because there will be no automatic notification of RAID failures, etc. Freshly installed SL6.1 machines have metadata versions 1.0, 1.1 and 1.2, and mdadm runs happily on them.

Also see the same bug filed and fixed in Fedora: https://bugzilla.redhat.com/show_bug.cgi?id=698731

Attached below are the gdb output from mdadm and the contents of /proc/mdstat. Please fix or provide a workaround (i.e. how to convert a 0.90 array into a 1.0 array).

Thanks in advance,
K.O.

[root@positron ~]# gdb `which mdadm`
GNU gdb (GDB) Red Hat Enterprise Linux (7.2-48.el6)
Copyright (C) 2010 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /sbin/mdadm...Reading symbols from /usr/lib/debug/sbin/mdadm.debug...done.
done.
(gdb) run --monitor --scan
Starting program: /sbin/mdadm --monitor --scan
Detaching after fork from child process 21267.
Detaching after fork from child process 21278.

Program received signal SIGSEGV, Segmentation fault.
0x0000000000421c76 in check_array (st=0x67ca40, mdstat=<value optimized out>, test=<value optimized out>,
    ainfo=0x7fffffffde80, increments=<value optimized out>) at Monitor.c:580
580             if (strncmp(mse->metadata_version, "external:", 9) == 0 &&
(gdb) where
#0  0x0000000000421c76 in check_array (st=0x67ca40, mdstat=<value optimized out>, test=<value optimized out>,
    ainfo=0x7fffffffde80, increments=<value optimized out>) at Monitor.c:580
#1  0x00000000004225d6 in Monitor (devlist=<value optimized out>, mailaddr=<value optimized out>,
    alert_cmd=<value optimized out>, period=1000, daemonise=6811760, scan=<value optimized out>, oneshot=0,
    dosyslog=0, test=<value optimized out>, pidfile=0x0, increments=20, share=1) at Monitor.c:223
#2  0x0000000000403d9b in main (argc=<value optimized out>, argv=0x7fffffffe5e8) at mdadm.c:1600
(gdb) up
#1  0x00000000004225d6 in Monitor (devlist=<value optimized out>, mailaddr=<value optimized out>,
    alert_cmd=<value optimized out>, period=1000, daemonise=6811760, scan=<value optimized out>, oneshot=0,
    dosyslog=0, test=<value optimized out>, pidfile=0x0, increments=20, share=1) at Monitor.c:223
223                     if (check_array(st, mdstat, test, &info, increments))
(gdb) p *mdstat
$1 = {dev = 0x67f0c0 "md127", devnum = 127, active = 1, level = 0x67f0e0 "raid1", pattern = 0x67f140 "_U",
  percent = -1, resync = 0, devcnt = 1, raid_disks = 4, metadata_version = 0x0, members = 0x67f100,
  next = 0x67f1c0}
(gdb) p *mdstat->next
$2 = {dev = 0x67f210 "md2", devnum = 2147483647, active = 1, level = 0x67f230 "raid1", pattern = 0x67f2f0 "U_",
  percent = -1, resync = 0, devcnt = 2, raid_disks = 4, metadata_version = 0x67f2d0 "1.0", members = 0x67f290,
  next = 0x67f310}
(gdb) p *mdstat->next->next
$3 = {dev = 0x67f030 "md1", devnum = 2147483647, active = 1, level = 0x67f050 "raid1", pattern = 0x67f3e0 "U_",
  percent = -1, resync = 0, devcnt = 2, raid_disks = 4, metadata_version = 0x0, members = 0x67f3a0, next = 0x0}
(gdb) p *mdstat->next->next->next
Cannot access memory at address 0x0

[root@positron ~]# cat /proc/mdstat
Personalities : [raid1]
md127 : active (auto-read-only) raid1 sdc1[1]
      40957568 blocks [2/1] [_U]
      bitmap: 1/157 pages [4KB], 128KB chunk

md2 : active raid1 sdb3[2] sdc3[1](F)
      40959928 blocks super 1.0 [2/1] [U_]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md1 : active raid1 sdb2[0] sdc2[2](F)
      32764480 blocks [2/1] [U_]
      bitmap: 1/1 pages [4KB], 65536KB chunk

unused devices: <none>

K.O.
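From the data above, the crash mechanism looks straightforward: the md127 and md1 entries in /proc/mdstat carry no "super" field (they are old v0.90 arrays), so the parsed metadata_version stays NULL, and the unguarded strncmp() at Monitor.c:580 dereferences that NULL pointer. Below is a minimal stand-alone sketch of the kind of NULL guard that would avoid the crash; the struct and function names are illustrative only, not the actual mdadm source or the upstream patch.

/* Illustration of the failure mode and a guarded check.
 * Hypothetical names: mdstat_ent / is_external are not real mdadm symbols. */
#include <stdio.h>
#include <string.h>

struct mdstat_ent {
    const char *dev;               /* e.g. "md127"                          */
    const char *metadata_version;  /* NULL when /proc/mdstat has no "super" */
};

static int is_external(const struct mdstat_ent *mse)
{
    /* Check for NULL before strncmp(): a v0.90 array simply has no
     * metadata version string and cannot be an "external:" container. */
    return mse->metadata_version != NULL &&
           strncmp(mse->metadata_version, "external:", 9) == 0;
}

int main(void)
{
    struct mdstat_ent ents[] = {
        { "md127", NULL },            /* v0.90 array, as in this report */
        { "md2",   "1.0" },
        { "md9",   "external:imsm" },
    };

    for (size_t i = 0; i < sizeof(ents) / sizeof(ents[0]); i++)
        printf("%s: external=%d\n", ents[i].dev, is_external(&ents[i]));
    return 0;
}

With the guard in place the monitor would simply treat the v0.90 arrays as non-external and keep running instead of segfaulting.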
This has already been fixed as of RHEL 6.2. Please update to the latest mdadm, which resolves this issue.
I guess I should note the version of mdadm and the contents of /etc/mdadm.conf:

[root@positron ~]# rpm -q mdadm
mdadm-3.2.1-1.el6.x86_64
[root@positron ~]# cat /etc/mdadm.conf
# mdadm.conf written out by anaconda
MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md1 level=raid1 num-devices=2 UUID=1674297a:39ef30ff:969902aa:3eab0b21
ARRAY /dev/md2 level=raid1 num-devices=2 UUID=67358e91:01dc1f3f:07873abf:f749b94d
[root@positron ~]#

K.O.
I confirm this problem does not exist in SL6.2:

[root@positron Packages]# rpm -q mdadm
mdadm-3.2.2-9.el6.x86_64

K.O.