Bug 653207 - mdadm[862]: segfault at 0 ip 00007fc6adff8314 sp 00007fff7016fb90 error 4 in libc-2.12.90.so[7fc6adf91000+199000]
Status: CLOSED ERRATA
Product: Fedora
Classification: Fedora
Component: mdadm
Version: 14
Hardware: Unspecified
OS: Unspecified
Priority: low
Severity: urgent
Assigned To: Doug Ledford
QA Contact: Fedora Extras Quality Assurance
Reported: 2010-11-14 20:11 EST by Jaroslav Kortus
Modified: 2011-08-11 19:30 EDT
CC List: 12 users

Fixed In Version: mdadm-3.1.3-0.git20100804.3.fc14
Doc Type: Bug Fix
Last Closed: 2011-08-11 19:30:54 EDT

Attachments: None
Description Jaroslav Kortus 2010-11-14 20:11:10 EST
Description of problem:
An array moved from an FC12 system to an FC14 system (a different PC) is not assembled as expected.

Version-Release number of selected component (if applicable):
mdadm-3.1.3-0.git20100722.2.fc14.x86_64

How reproducible:
not sure

Steps to Reproduce:
I moved array from one PC with FC12 to another with FC14. The members of the array have been renamed (from /dev/sd{b,c,d} to /dev/sd{b,d,e}).
  
Actual results:
array degraded when there is no reason for it to be
mdadm segfaults

Expected results:
correct assembly

Additional info:
# mdadm --detail /dev/md0
/dev/md0:
        Version : 0.90
  Creation Time : Sun Mar 14 18:00:45 2010
     Raid Level : raid5
     Array Size : 625139712 (596.18 GiB 640.14 GB)
  Used Dev Size : 312569856 (298.09 GiB 320.07 GB)
   Raid Devices : 3
  Total Devices : 2
Preferred Minor : 0
    Persistence : Superblock is persistent

    Update Time : Mon Nov 15 01:59:08 2010
          State : clean, degraded
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

         Layout : left-symmetric
     Chunk Size : 2048K

           UUID : 70ecc28e:5fc25d66:fe8bfe81:3e6c5787 (local to host abcd)
         Events : 0.2396

    Number   Major   Minor   RaidDevice State
       0       0        0        0      removed
       1       8       64        1      active sync   /dev/sde
       2       8       48        2      active sync   /dev/sdd


dmesg snip:
[   15.004941] md: bind<sde>
[   15.057549] mdadm[862]: segfault at 0 ip 00007fc6adff8314 sp 00007fff7016fb90 error 4 in libc-2.12.90.so[7fc6adf91000+199000]
[   15.066942] md: bind<sdd>
[   15.068014] mdadm[870]: segfault at 0 ip 00007f61522a1314 sp 00007fff12e9d4f0 error 4 in libc-2.12.90.so[7f615223a000+199000]

# mdadm --examine /dev/sdb
/dev/sdb:
          Magic : a92b4efc
        Version : 0.90.00
           UUID : 70ecc28e:5fc25d66:fe8bfe81:3e6c5787 (local to host abcd)
  Creation Time : Sun Mar 14 18:00:45 2010
     Raid Level : raid5
  Used Dev Size : 312569856 (298.09 GiB 320.07 GB)
     Array Size : 625139712 (596.18 GiB 640.14 GB)
   Raid Devices : 3
  Total Devices : 3
Preferred Minor : 0

    Update Time : Sun Nov 14 01:48:08 2010
          State : clean
 Active Devices : 3
Working Devices : 3
 Failed Devices : 0
  Spare Devices : 0
       Checksum : 62108ff4 - correct
         Events : 2268

         Layout : left-symmetric
     Chunk Size : 2048K

      Number   Major   Minor   RaidDevice State
this     0       8       16        0      active sync   /dev/sdb

   0     0       8       16        0      active sync   /dev/sdb
   1     1       8       64        1      active sync   /dev/sde
   2     2       8       48        2      active sync   /dev/sdd
Comment 1 John F 2010-11-15 15:22:40 EST
Similar results for me.  This seems to be a problem only when booting.  I am able to create the mdraid volume, but it dies on reboot.


I created my mdraid with

sudo mdadm --create /dev/md0 --level=6 --raid-devices=5 /dev/sdb1 /dev/sdc1 /dev/sdd1 missing missing
sudo mdadm -Es >> /etc/mdadm.conf


Before adding a devices= section to the mdadm.conf file:

[    5.536857] md: bind<sdc1>
[    5.537749] mdadm[745]: segfault at 0 ip 00007f97deccd314 sp 00007fff2142da10 error 4 in libc-2.12.90.so[7f97dec66000+199000]

After adding devices=/dev/sdb1,/dev/sdc1,/dev/sdd1 to the mdadm.conf ARRAY line, I get:

[    5.268594] md: bind<sdc1>
[    5.270883] md: bind<sdb1>
[    5.271098] md: bind<sdd1>
[    5.271865] mdadm[675]: segfault at 0 ip 00007f4ffa7fc314 sp 00007fffabd86870 error 4 in libc-2.12.90.so[7f4ffa795000+199000]


It seems like there is an issue when dealing with missing drives.
Comment 2 John F 2010-11-15 15:26:34 EST
(In reply to comment #1)
> Similar results for me.  This seems to be a problem only when booting.  I am
> able to create the mdraid volume, but it dies on reboot

err, I don't know if this is similar, but I am getting similar results.
Comment 3 Jaroslav Kortus 2010-11-15 15:54:27 EST
During next boot the array contained only one disk (sdd). The array in mdadm.conf is identified by UUID, so it should scan and find the devices, but for some reason, this is not happening.
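For reference, a UUID-based ARRAY entry of the kind comment 3 describes looks like the sketch below. The UUID is the one from the --detail output above; the DEVICE line is an assumption for illustration, not quoted from the reporter's actual config:

```
# /etc/mdadm.conf -- identify the array by UUID so that member device
# renames (/dev/sdb -> /dev/sde etc.) should not matter at assembly time
DEVICE partitions
ARRAY /dev/md0 uuid=70ecc28e:5fc25d66:fe8bfe81:3e6c5787
```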
Comment 4 matt 2010-11-17 16:02:22 EST
I too am having similar problems with mdadm after replacing my working F13 with F14.

Upon each reboot, messages such as this appear in my system log:

> kernel: [   19.036815] mdadm[948]: segfault at 46 ip 00537997 sp bf9d8040 error 4 in libc-2.12.90.so[4d8000+18d000]
> kernel: [   19.038766] mdadm[945]: segfault at 46 ip 0016f997 sp bfecc6c0 error 4 in libc-2.12.90.so[110000+18d000]
> kernel: [   19.039826] mdadm[949]: segfault at 46 ip 00c72997 sp bfbb1b10 error 4 in libc-2.12.90.so[c13000+18d000]
> kernel: [   19.040535] mdadm[947]: segfault at 46 ip 00941997 sp bf83a9e0 error 4 in libc-2.12.90.so[8e2000+18d000]
> kernel: [   19.049072] mdadm[952]: segfault at 46 ip 002bd997 sp bff8eb20 error 4 in libc-2.12.90.so[25e000+18d000]
> kernel: [   19.053141] mdadm[953]: segfault at 46 ip 00314997 sp bfd69e40 error 4 in libc-2.12.90.so[2b5000+18d000]


My 4.7TB raid6 got corrupted, but my 1.7TB raid10 seems okay.


This is my mdadm.conf:

MAILADDR root
AUTO +imsm +1.x -all
ARRAY /dev/md0 uuid=9a4ec903:7c16939f:01090baf:230d550b
ARRAY /dev/md1 uuid=7017cc57:7e18230c:1240623f:cb69db74
Comment 5 Erik Logtenberg 2010-11-20 09:46:34 EST
Same problem here, I think. mdadm crashes during boot, array doesn't get assembled right. If I use mdadm from command line to remove the wrong arrays and assemble them right (mdadm --assemble --scan) then all seems okay again.

I have no idea if this is relevant, but just a fraction of a second before the segfaulting mdadm, libata-core gives a scary warning. Right after that you can see the mdadm segfault.

[   29.576095] udev[554]: starting version 161
[   29.645531] ------------[ cut here ]------------
[   29.645539] WARNING: at drivers/ata/libata-core.c:5124 ata_qc_issue+0xdf/0x250()
[   29.645542] Hardware name: X8SIL
[   29.645544] Modules linked in: cryptd aes_x86_64 aes_generic xts gf128mul dm_crypt mvsas libsas scsi_transport_sas [last unloaded: scsi_wait_scan]
[   29.645556] Pid: 592, comm: ata_id Not tainted 2.6.35.6-48.fc14.x86_64 #1
[   29.645559] Call Trace:
[   29.645565]  [<ffffffff8104d7c1>] warn_slowpath_common+0x85/0x9d
[   29.645570]  [<ffffffff8130a6a9>] ? ata_scsi_pass_thru+0x0/0x222
[   29.645574]  [<ffffffff8104d7f3>] warn_slowpath_null+0x1a/0x1c
[   29.645578]  [<ffffffff8130449f>] ata_qc_issue+0xdf/0x250
[   29.645582]  [<ffffffff8130a848>] ? ata_scsi_pass_thru+0x19f/0x222
[   29.645586]  [<ffffffff812e72cf>] ? scsi_done+0x0/0x49
[   29.645590]  [<ffffffff8130a6a9>] ? ata_scsi_pass_thru+0x0/0x222
[   29.645594]  [<ffffffff8130b929>] __ata_scsi_queuecmd+0x192/0x1ee
[   29.645599]  [<ffffffff814690da>] ? _raw_spin_lock_irqsave+0x12/0x2f
[   29.645603]  [<ffffffff812e72cf>] ? scsi_done+0x0/0x49
[   29.645606]  [<ffffffff8130b9c2>] ata_sas_queuecmd+0x3d/0x59
[   29.645610]  [<ffffffff812e72cf>] ? scsi_done+0x0/0x49
[   29.645618]  [<ffffffffa002291d>] sas_queuecommand+0x99/0x29a [libsas]
[   29.645622]  [<ffffffff812e82e7>] scsi_dispatch_cmd+0x1d8/0x289
[   29.645628]  [<ffffffff812ee80e>] scsi_request_fn+0x445/0x471
[   29.645634]  [<ffffffff81205176>] __blk_run_queue+0x42/0x72
[   29.645639]  [<ffffffff812015c2>] elv_insert+0xb3/0x1ba
[   29.645643]  [<ffffffff81201761>] __elv_add_request+0x98/0x9f
[   29.645647]  [<ffffffff81469116>] ? _raw_spin_lock_irq+0x1f/0x21
[   29.645652]  [<ffffffff81209c6f>] blk_execute_rq_nowait+0x6f/0x9e
[   29.645656]  [<ffffffff81209d3c>] blk_execute_rq+0x9e/0xd6
[   29.645659]  [<ffffffff81209b15>] ? blk_rq_map_user+0x161/0x214
[   29.645664]  [<ffffffff811e437a>] ? selinux_capable+0x37/0x40
[   29.645669]  [<ffffffff8120d949>] sg_io+0x299/0x3df
[   29.645674]  [<ffffffff8120e028>] scsi_cmd_ioctl+0x24d/0x44d
[   29.645680]  [<ffffffff811de14d>] ? avc_has_perm+0x5c/0x6e
[   29.645683]  [<ffffffff812f6fe5>] sd_ioctl+0xa9/0xd0
[   29.645687]  [<ffffffff8120b51a>] __blkdev_driver_ioctl+0x7a/0xa3
[   29.645689]  [<ffffffff8120bebd>] blkdev_ioctl+0x6b0/0x6c3
[   29.645693]  [<ffffffff8111a5ea>] ? cp_new_stat+0xf3/0x10b
[   29.645697]  [<ffffffff8113c93c>] block_ioctl+0x3c/0x40
[   29.645699]  [<ffffffff81123e0f>] vfs_ioctl+0x36/0xa7
[   29.645702]  [<ffffffff81124770>] do_vfs_ioctl+0x468/0x49b
[   29.645704]  [<ffffffff811247f9>] sys_ioctl+0x56/0x79
[   29.645708]  [<ffffffff81009cf2>] system_call_fastpath+0x16/0x1b
[   29.645710] ---[ end trace 9b0443f69eae490a ]---
[   29.887454] md: bind<sdb1>
[   29.888671] md: bind<sda2>
[   29.889804] md: bind<sda1>
[   29.890540] mdadm[701]: segfault at 0 ip 00007f6768399334 sp 00007fff0ae78870 error 4 in libc-2.12.90.so[7f6768332000+19a000]
[   29.890655] md: bind<sdb2>
[   30.108676] md: raid1 personality registered for level 1
[   30.110073] md/raid1:md1: active with 2 out of 2 mirrors
[   30.110093] md1: detected capacity change from 0 to 59774600192
[   30.112031]  md1: unknown partition table

It doesn't seem to matter if I use the default F14 mdadm.conf or if I place the two ARRAY lines underneath, both yield the same results.
Comment 6 mattg 2010-11-20 09:59:08 EST
I'm not getting the libata-core backtrace that Erik is seeing.

Over in comment 5 of bug #649038, Marco Columbo suggested a workaround:

> Edited /sbin/start_udev
> and changed:
> 
> /sbin/udevd -d
> 
> into
> 
> /sbin/udevd -d --children-max=1


This worked for me, too.  I've now rebooted a few times with this workaround and haven't had any mdadm segfaults or degraded arrays.
Comment 7 mattg 2010-11-20 10:06:11 EST
Meta:  comments 4 and 6 above are both from me.  Apparently I have two Red Hat logins.  Apologies.
Comment 8 Steve Snyder 2010-11-23 09:59:28 EST
FYI, I was seeing the same problem with a pre-existing RAID10 configuration.

This on a fully-updated F14/x86_64 system (clean install, not upgrade).  On each boot the array would have only 2 or 3 of the required 4 disks.

I also resolved it by editing /sbin/start_udev as shown above.
Comment 9 Ling Li 2010-11-23 12:33:36 EST
I had the same problem, and found in another bug ticket (sorry, I forget its ticket #) that people suggested removing "rd_NO_MD" from the grub kernel options as a workaround.  I recall the reason was some mdadm version mismatch between the init image and the actual binary (which I don't understand at all).  Hope this helps.
Comment 10 Doug Ledford 2010-11-23 12:48:14 EST
If you install the mdadm package from updates-testing and then rebuild your initrd, I suspect this problem will go away.
Comment 11 Steve Snyder 2010-11-23 13:07:12 EST
(In reply to comment #9)
> I had the same problem, and found in another bug ticket (sorry I forget its
> ticket #) that people suggested to remove "rd_NO_MD" from the grub kernel
> options as a work around.  I recall the reason some mdadm version mismatch
> between the init image and the actual binary (which I don't understand at all).
>  Hope this helps.

I tried that first, before I found the udev work-around.

I figured I would take matters into my own hands to avoid the helpful
auto-(mis)assembling of the array at start-up.  What I found was devnode
confusion as mdadm wanted to use md126 and md127, devnodes which did not exist
and were not automatically created.  I could have created the devnodes myself
but decided all that hard-coding of values would likely bite me later.
Comment 12 matt 2010-11-23 17:47:33 EST
(In reply to comment #10)
> If you install the mdadm package from updates-testing and then rebuild your
> initrd, I suspect this problem will go away.


Yes, thank you, that worked.  I just updated to mdadm-3.1.3-0.git20100804.2.fc14.i686 from updates-testing and rebooted (without any of the workarounds).  There were no segfaults of mdadm, and both of my arrays assembled cleanly.  Thanks.
Comment 13 Erik Logtenberg 2010-11-24 15:33:41 EST
The update from updates-testing appears to work fine. I'll do some more rebooting to see how it holds up.
Doug: thanks!
Comment 14 Matthew West 2010-11-27 14:40:15 EST
I also had this problem on a fresh FC14 install, but the update to mdadm-3.1.3-0.git20100804.2.fc14.x86_64 has fixed the problem. Thanks!
Comment 15 Peter Bieringer 2010-11-28 14:55:07 EST
I still have the segfault on F14 using mdadm-3.1.3-0.git20100804.2.fc14.i686 from the updates-testing repository

mdmon[4165]: segfault at 0 ip 0804aebe sp bffbd500 error 4 in mdmon[8048000+34000]

strace shows:

open("/proc/mdstat", O_RDONLY|O_LARGEFILE) = 3
fcntl64(3, F_SETFD, FD_CLOEXEC)         = 0
fstat64(3, {st_mode=S_IFREG|0444, st_size=0, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb781f000
read(3, "Personalities : [raid1] \nmd1 : a"..., 1024) = 402
read(3, "", 1024)                       = 0
read(3, "", 1024)                       = 0
close(3)                                = 0
munmap(0xb781f000, 4096)                = 0
--- SIGSEGV (Segmentation fault) @ 0 (0) ---
+++ killed by SIGSEGV +++
Segmentation fault
139

# cat /proc/mdstat 
Personalities : [raid1] 
md1 : active raid1 sdc1[1] sdd1[0]
      156288256 blocks [2/2] [UU]
      
md127 : active raid1 sda[1] sdb[0]
      156288000 blocks super external:/md0/0 [2/2] [UU]
      [==========>..........]  resync = 52.1% (81442432/156288132) finish=41.1min speed=30336K/sec
      
md0 : inactive sdb[1](S) sda[0](S)
      4514 blocks super external:imsm
       
unused devices: <none>



Using an Intel software RAID controller
Comment 16 Erik Logtenberg 2010-11-28 17:09:27 EST
Peter: did you run dracut after updating the mdadm package? The new mdadm has to be in the initramfs file in /boot in order to fix the segfault.
Comment 17 Peter Bieringer 2010-11-29 01:23:18 EST
(In reply to comment #16)
> Peter: did you run dracut after updating the mdadm package? The new mdadm has
> to be in the initramfs file in /boot in order to fix the segfault.

No, I hadn't run dracut, but I have now, and it didn't help. Note that my segfault is perhaps a different one; it can be reproduced by "service mdadm restart" or even by running the command directly.

Running the command using gdb and installed debuginfos shows:

Reading symbols from /sbin/mdmon...Reading symbols from /usr/lib/debug/sbin/mdmon.debug...done.
done.
(gdb) set args --takeover --all
(gdb) run
Starting program: /sbin/mdmon --takeover --all
[Thread debugging using libthread_db enabled]

Program received signal SIGSEGV, Segmentation fault.
0x0804aebe in main (argc=3, argv=0xbffff774) at mdmon.c:303
303                             if (strncmp(e->metadata_version, "external:", 9) == 0 &&
(gdb) bt
#0  0x0804aebe in main (argc=3, argv=0xbffff774) at mdmon.c:303
Comment 18 Peter Bieringer 2011-05-29 06:31:21 EDT
Same on F15

# /etc/init.d/mdmonitor restart
Restarting mdmonitor (via systemctl):                      [  OK  ]

mdmon[2198]: segfault at 0 ip 0804a226 sp bf87a0b0 error 4 in mdmon[8048000+36000]
Comment 19 Doug Ledford 2011-07-14 20:04:58 EDT
Peter: your particular problem is caused by having older raid arrays instead of modern raid arrays.  In particular, you have version 0.90 arrays from an old install.  Current installs create either version 1.x arrays or imsm external metadata arrays, both of which will cause e->metadata_version to be non-null.  However, older version 0.90 arrays will leave it NULL.  That's a bug and will be fixed in the next build.
Comment 20 Fedora Update System 2011-07-14 20:15:46 EDT
mdadm-3.1.3-0.git20100804.3.fc14 has been submitted as an update for Fedora 14.
https://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.3.fc14
Comment 21 Johan Swensson 2011-07-15 04:54:37 EDT
FYI I get this in RHEL6 as well.

mdadm-3.2.1-1.el6.x86_64
Comment 22 Rick Warner 2011-07-15 15:05:43 EDT
Please see this bug https://bugzilla.redhat.com/show_bug.cgi?id=716413 for an updated specfile and patch file to build a working mdadm-3.2.2.
Comment 23 Fedora Update System 2011-07-16 03:36:04 EDT
Package mdadm-3.1.3-0.git20100804.3.fc14:
* should fix your issue,
* was pushed to the Fedora 14 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing mdadm-3.1.3-0.git20100804.3.fc14'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/mdadm-3.1.3-0.git20100804.3.fc14
then log in and leave karma (feedback).
Comment 24 Fedora Update System 2011-08-11 19:30:42 EDT
mdadm-3.1.3-0.git20100804.3.fc14 has been pushed to the Fedora 14 stable repository.  If problems still persist, please make note of it in this bug report.
