Bug 759648
Summary: | mdadm resync freeze at random | |
---|---|---|---
Product: | [Fedora] Fedora | Reporter: | Bill McGonigle <bill-bugzilla.redhat.com>
Component: | mdadm | Assignee: | Doug Ledford <dledford>
Status: | CLOSED CURRENTRELEASE | QA Contact: | Fedora Extras Quality Assurance <extras-qa>
Severity: | high | Docs Contact: |
Priority: | unspecified | |
Version: | 15 | CC: | agk, dan.j.williams, dledford, Jes.Sorensen, ketuzsezr, mbroz, scott-brown, trancegenetic, work.eric
Target Milestone: | --- | |
Target Release: | --- | |
Hardware: | x86_64 | |
OS: | Linux | |
Whiteboard: | | |
Fixed In Version: | mdadm-3.2.2-9.fc15.x86_64 | Doc Type: | Bug Fix
Doc Text: | | Story Points: | ---
Clone Of: | 602457 | Environment: |
Last Closed: | 2011-12-09 22:34:50 UTC | Type: | ---
Regression: | --- | Mount Type: | ---
Documentation: | --- | CRM: |
Verified Versions: | | Category: | ---
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: |
Cloudforms Team: | --- | Target Upstream Version: |
Embargoed: | | |
Description
Bill McGonigle
2011-12-02 22:30:02 UTC
I am doubtful this is the same bug - it sounds more like interrupts are not being delivered to your guest. The previous bug has been confirmed as fixed by multiple testers, so in the worst case this would be another bug. It would be helpful if you could try to reproduce the problem without Xen. Note that Xen dom0 isn't really supported in F15: http://fedoraproject.org/wiki/Features/XenPvopsDom0

I agree, not the same bug - I think the reports here are from after that other fix went in. I might be able to test this server without Xen if I'm here late on a weekend night (it's a VM server). Is it possible that only one md mirror would cause Xen to lose interrupts? The system is otherwise fine (including lots of non-md disk activity) and other md mirrors on the same disk can rebuild without causing a system reset. I think the one with this problem has a 0.90 superblock, and the mirror that most recently rebuilt well has a 1.2.

The problem I was seeing appears to be fixed in Linux 3.1. No freezes or reboots, and a successful rebuild. I did see a new error message when attempting a re-add under 3.1:

    [root@librescu ~]# mdadm --add /dev/md1 /dev/sda3
    mdadm: /dev/sda3 reports being an active member for /dev/md1, but a --re-add fails.
    mdadm: not performing --add as that would convert /dev/sda3 in to a spare.
    mdadm: To make this a spare, use "mdadm --zero-superblock /dev/sda3" first.

Nothing in dmesg to indicate why it failed.
This was member info:

    /dev/sda3:
              Magic : a92b4efc
            Version : 0.90.00
               UUID : fffb73ea:e1b209c2:9d4deba6:47ca997f
      Creation Time : Fri Jan 14 21:53:53 2011
         Raid Level : raid1
      Used Dev Size : 1446299712 (1379.30 GiB 1481.01 GB)
         Array Size : 1446299712 (1379.30 GiB 1481.01 GB)
       Raid Devices : 2
      Total Devices : 2
    Preferred Minor : 1
        Update Time : Fri Dec  2 15:29:59 2011
              State : active
     Active Devices : 2
    Working Devices : 2
     Failed Devices : 0
      Spare Devices : 0
           Checksum : 62305ef8 - correct
             Events : 676

          Number   Major   Minor   RaidDevice State
    this     1       8        3        1      active sync   /dev/sda3
       0     0       8       35        0      active sync   /dev/sdc3
       1     1       8        3        1      active sync   /dev/sda3

So I decided to zero the superblock on sda3 and re-add it, and that worked. I see a bunch of raid-1 resync patches went into 3.1. I don't seem to have any bad blocks, but for some reason rebuilding worked as expected on this version. So, marking fixed.

Bill,

Thanks for letting us know. Just a comment on the --re-add issue: when running raid1 or raid10, mdadm has no way of knowing which of the two disks is authoritative if they differ. Hence, when you try to --re-add the drive after having continued to run on the old drive, mdadm has to reject it, since it is unable to guarantee data consistency. mdadm didn't use to reject drives like this, but that was really a bug in itself.

Cheers,
Jes

^ oh, super. I've been bitten by that before. I'm very happy to do the extra work to avoid ruining a mirror. Thanks for the comment!

(In reply to comment #0)
> +++ This bug was initially created as a clone of Bug #602457 +++
> Cloning since there's no activity on the closed bug (for which a legit bug
> appears to have been fixed, so this is probably different). First the
> problem I'm seeing; other unresolved user reports from bug 602457 appended.
> +++
>
> Seeing this on a Fedora 15 dom0:
>
> kernel-2.6.40-4.fc15.x86_64
> mdadm-3.2.2-9.fc15.x86_64
> xen-4.1.2-1.fc15.x86_64
>
> My raid1 / needed a rebuild (inconsistent after a UPS failure) - it's a
> simple mirror on /dev/sda3 and /dev/sdc3
>
> lrwxrwxrwx 1 root root 10 Dec  2 15:30 pci-0000:00:11.0-scsi-0:0:0:0-part3 -> ../../sda3
> lrwxrwxrwx 1 root root 10 Dec  2 15:30 pci-0000:00:11.0-scsi-2:0:0:0-part3 -> ../../sdc3
>
> The source mirror is on an LSI megaraid SAS controller; the destination
> drive is on the mobo SATA 6Gbps controller.
>
> Both are Seagate 1.5TB SATA 3Gbps drives that pass SMART tests:
>
> lrwxrwxrwx 1 root root 10 Dec  2 15:30 scsi-SATA_ST31500341AS_9VS41GA4-part3 -> ../../sdc3
> lrwxrwxrwx 1 root root 10 Dec  2 15:30 scsi-SATA_ST31500341AS_9VS44D67-part3 -> ../../sda3
>
> I haven't seen a rebuild get past 10% without hanging the system hard (no
> caps lock, nothing). It's 100% reproducible, but I don't know how to debug
> it. Nothing onscreen or in log files.
>
> The system was rebooting every hour or so for two days before I noticed the
> RAID rebuild and failed sda3 out of the mirror. Since then, it's totally
> stable again.
>
> --- Additional comment from scott-brown on 2011-01-16 13:41:58 EST ---
>
> Seeing what I presume to be the same behaviour on
> mdadm-3.1.3-0.git20100804.2.fc14.x86_64.
>
> The system runs for days with 4x Seagate Barracuda 7200.12 Model: ST31000528AS
> in a LVM group; however, when they are added to a RAID5 array, the system
> freezes randomly.
>
> Building the RAID array manually or through GNOME DiskUtil (using all
> defaults) doesn't seem to make a difference. EXT4 or XFS doesn't make a
> difference. Mounted or not, doesn't make a difference.
>
> General behaviour:
>
> 1. array starts, begins resync
> 2. eventually all disk activity stops
> 3. system is unresponsive to all input
>
> A hard reset of the system is required. Obviously this is something I would
> prefer not to do. The net result is that I cannot use RAID on this system.
>
> The system is homegrown; the mobo is a Gigabyte GA-H55M-SV2, BIOS set to AHCI.
>
> Linux version 2.6.35.10-74.fc14.x86_64 (mockbuild.fedoraproject.org)
> (gcc version 4.5.1 20100924 (Red Hat 4.5.1-4) (GCC) ) #1 SMP Thu Dec 23 16:04:50 UTC 2010
>
> /dev/md0:
>         Version : 1.2
>   Creation Time : Sat Jan 15 21:27:10 2011
>      Raid Level : raid5
>      Array Size : 2930276352 (2794.53 GiB 3000.60 GB)
>   Used Dev Size : 976758784 (931.51 GiB 1000.20 GB)
>    Raid Devices : 4
>   Total Devices : 4
>     Persistence : Superblock is persistent
>
>   Intent Bitmap : Internal
>
>     Update Time : Sun Jan 16 13:36:18 2011
>           State : active, degraded, recovering
>  Active Devices : 3
> Working Devices : 4
>  Failed Devices : 0
>   Spare Devices : 1
>
>          Layout : left-symmetric
>      Chunk Size : 512K
>
>  Rebuild Status : 94% complete
>
>            Name : :TerraStore
>            UUID : 8151c129:646ea005:34e60918:bc4a0ed6
>          Events : 3737
>
>     Number   Major   Minor   RaidDevice State
>        0       8       81        0      active sync   /dev/sdf1
>        1       8       49        1      active sync   /dev/sdd1
>        2       8       33        2      active sync   /dev/sdc1
>        4       8       17        3      spare rebuilding   /dev/sdb1
>
> Personalities : [raid6] [raid5] [raid4]
> md0 : active raid5 sdf1[0] sdb1[4] sdc1[2] sdd1[1]
>       2930276352 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/3] [UUU_]
>       [===================>.]  recovery = 95.1% (929804532/976758784) finish=18.8min speed=41551K/sec
>       bitmap: 3/8 pages [12KB], 65536KB chunk
>
> unused devices: <none>
>
> --- Additional comment from trancegenetic on 2011-07-03 09:30:13 EDT ---
>
> I still seem to experience this bug.
>
> The system is RHEL 6.0 with mdadm-3.1.3-1.el6.x86_64.
>
> Symptoms: the system freezes/hangs, screen output hangs, and no input is
> possible from keyboard or mouse.
>
> This always happens during an mdadm resync or even a rebuild. Copying
> files through the running samba server seems to trigger the freeze/hang
> much faster.
>
> System details:
>
> /dev/md0:
>         Version : 1.0
>   Creation Time : Wed Jun  8 16:45:24 2011
>      Raid Level : raid1
>      Array Size : 488385400 (465.76 GiB 500.11 GB)
>   Used Dev Size : 488385400 (465.76 GiB 500.11 GB)
>    Raid Devices : 2
>   Total Devices : 2
>     Persistence : Superblock is persistent
>
>   Intent Bitmap : Internal
>
>     Update Time : Sun Jul  3 15:25:19 2011
>           State : active
>  Active Devices : 2
> Working Devices : 2
>  Failed Devices : 0
>   Spare Devices : 0
>
>            Name : neptune-fix:0
>            UUID : 7d13420f:1996e153:6706c024:8e22715f
>          Events : 3047
>
>     Number   Major   Minor   RaidDevice State
>        0       8       33        0      active sync   /dev/sdc1
>        1       8       65        1      active sync   /dev/sde1
>
> /dev/md1:
>         Version : 1.1
>   Creation Time : Wed Jun  8 16:47:55 2011
>      Raid Level : raid5
>      Array Size : 3907023872 (3726.03 GiB 4000.79 GB)
>   Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
>    Raid Devices : 3
>   Total Devices : 3
>     Persistence : Superblock is persistent
>
>   Intent Bitmap : Internal
>
>     Update Time : Sun Jul  3 15:20:37 2011
>           State : active
>  Active Devices : 3
> Working Devices : 3
>  Failed Devices : 0
>   Spare Devices : 0
>
>          Layout : left-symmetric
>      Chunk Size : 512K
>
>            Name : neptune-fix:1
>            UUID : 8a48b835:5f02582b:68f0c66f:f1d85639
>          Events : 17507
>
>     Number   Major   Minor   RaidDevice State
>        0       8        1        0      active sync   /dev/sda1
>        1       8       17        1      active sync   /dev/sdb1
>        3       8       49        2      active sync   /dev/sdd1

UPDATE: In the end my problem had nothing to do with the mdadm raid. It seems my onboard network adapter (or its driver?) was the culprit. I added a new PCI network adapter and nothing freezes anymore.

Thanks for letting us know. It sounds like it could be an interrupt conflict between the network card and the storage controller. I will keep this in mind for similar bug reports coming up.

Jes
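Earlier in this report, the mirror is recovered by zeroing the stale 0.90 superblock on the removed disk before adding it back, exactly as mdadm's error message suggests. A dry-run sketch of that sequence follows; the commands are printed rather than executed, since both are destructive, and the device names are the ones from this report:

```shell
# Dry-run: print the recovery commands instead of running them.
# Replace the 'run' helper with direct execution only after confirming
# the removed disk's data is expendable -- zeroing the superblock
# discards its array membership, and the subsequent add triggers a
# full resync from the surviving mirror half.
run() { echo "+ $*"; }

run mdadm --zero-superblock /dev/sda3   # discard the stale superblock
run mdadm --add /dev/md1 /dev/sda3      # re-add as a fresh member
```

This is the safe path Jes describes: since mdadm cannot tell which half of a diverged raid1 is authoritative, the stale member must be explicitly demoted to a blank disk first.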
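The /proc/mdstat snapshots quoted above report resync progress on a recovery line. A minimal sketch of pulling the percentage out of such a line; the sample line from this report is inlined so the sketch runs anywhere, but on a live system you would read /proc/mdstat instead:

```shell
# Extract the resync percentage from an mdstat recovery line.
# Sample data from this report (substitute: line=$(grep -m1 recovery /proc/mdstat)).
line='[===================>.]  recovery = 95.1% (929804532/976758784) finish=18.8min speed=41551K/sec'

# The only percent-suffixed number on a recovery line is the progress figure.
pct=$(printf '%s\n' "$line" | grep -o '[0-9.]\+%' | head -n1)
echo "resync progress: $pct"
```

Polling like this (rather than sitting at the console) is one way to watch how far a resync gets before a freeze of the kind reported here.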