Bug 718496
| Summary: | mdadm resync freeze at random | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | trancegenetic |
| Component: | mdadm | Assignee: | Jes Sorensen <Jes.Sorensen> |
| Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | qe-baseos-daemons |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 6.0 | CC: | arkiados, dan.j.williams, dledford, rfv781, scott-brown, trancegenetic, vgriit, work.eric |
| Target Milestone: | rc | | |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | 602457 | Environment: | |
| Last Closed: | 2013-02-08 18:59:10 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
Description
trancegenetic
2011-07-03 15:37:14 UTC
I still seem to experience this bug on RHEL 6.0.
The system is RHEL 6.0 with mdadm-3.1.3-1.el6.x86_64.
Symptoms:
The system freezes/hangs, screen output stops, and no input is possible from the keyboard or mouse.
This always happens during an mdadm resync or even a rebuild. Copying files through the running Samba server at the same time seems to trigger the freeze/hang much faster.
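One way to try to reproduce this on demand, instead of waiting for the next scheduled resync, is to start a read-only check pass by hand and watch it run. This is only a sketch, assuming md1 is the array whose resync triggers the hang (adjust the device name as needed):

cat /proc/mdstat                               # confirm nothing is already resyncing
echo check > /sys/block/md1/md/sync_action     # start a check pass on md1
watch -n 5 cat /proc/mdstat                    # follow progress until it completes or the box freezes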
System details:
/dev/md0:
Version : 1.0
Creation Time : Wed Jun 8 16:45:24 2011
Raid Level : raid1
Array Size : 488385400 (465.76 GiB 500.11 GB)
Used Dev Size : 488385400 (465.76 GiB 500.11 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sun Jul 3 15:25:19 2011
State : active
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Name : neptune-fix:0
UUID : 7d13420f:1996e153:6706c024:8e22715f
Events : 3047
Number Major Minor RaidDevice State
0 8 33 0 active sync /dev/sdc1
1 8 65 1 active sync /dev/sde1
/dev/md1:
Version : 1.1
Creation Time : Wed Jun 8 16:47:55 2011
Raid Level : raid5
Array Size : 3907023872 (3726.03 GiB 4000.79 GB)
Used Dev Size : 1953511936 (1863.01 GiB 2000.40 GB)
Raid Devices : 3
Total Devices : 3
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Sun Jul 3 15:20:37 2011
State : active
Active Devices : 3
Working Devices : 3
Failed Devices : 0
Spare Devices : 0
Layout : left-symmetric
Chunk Size : 512K
Name : neptune-fix:1
UUID : 8a48b835:5f02582b:68f0c66f:f1d85639
Events : 17507
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1
3 8 49 2 active sync /dev/sdd1
This is guaranteed not to be the same as the bug you duplicated. That bug was fixed, and it only affected Intel software RAID arrays, while you are using a Linux MD software RAID array (yes, they are different, and the code in question from the original bug is never even used on your array). My first guess from your description is that this actually sounds like a hardware bug of some sort. I would suggest running a memory test on the machine to see if it finds any issues.

Ok, thanks for your reply. I already ran a memory test; no issues. I updated the BIOS of my motherboard (ASUS P5Q SE2 with an Intel Core 2 Duo E8500). This only happens during an mdadm resync; other heavy I/O activity does not trigger this behaviour. The system freezes/hangs, screen output stops, and no input is possible from the keyboard or mouse. It is not even possible to initiate a kernel panic with the SysRq keys. Absolutely nothing is logged in the logs.

Since RHEL 6.2 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as an exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.

Does this problem still happen with RHEL 6.2? Thanks, Jes

We have faced this issue at least two times (with the 6.1 kernel and now with 6.2 too). Both times the system froze during the weekly RAID array check, but I can't reliably reproduce it.
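Since the freeze leaves nothing in the local logs and even SysRq does not respond, one way to capture evidence of the next hang is a remote console. This is only a sketch, not something attempted in this report; the IP addresses, interface name, MAC address, and ports below are placeholders that must be replaced with real values:

sysctl -w kernel.sysrq=1          # make sure all SysRq functions are enabled
dmesg -n 8                        # send all kernel messages to the console
# Stream kernel messages to a second machine over UDP (placeholders: local
# 192.168.1.5 on eth0, remote 192.168.1.100 listening on port 6666):
modprobe netconsole netconsole=6665@192.168.1.5/eth0,6666@192.168.1.100/00:11:22:33:44:55
# On the receiving machine, run a UDP listener on port 6666 (for example
# netcat; exact flags vary between netcat flavours) and keep its output.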
00:1f.2 RAID bus controller: Intel Corporation 82801 SATA Controller [RAID mode]
mdadm --detail /dev/md127
Container : /dev/md0, member 0
Raid Level : raid1
Array Size : 1953511424 (1863.01 GiB 2000.40 GB)
Used Dev Size : 1953511556 (1863.01 GiB 2000.40 GB)
Raid Devices : 2
Total Devices : 2
State : active, checking
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Check Status : 8% complete
UUID : a72f8913:3693ec59:a88612ff:3072c153
Number Major Minor RaidDevice State
1 8 0 0 active sync /dev/sda
0 8 16 1 active sync /dev/sdb
mdadm --examine /dev/md0
/dev/md0:
Magic : Intel Raid ISM Cfg Sig.
Version : 1.1.00
Orig Family : de8f6928
Family : 023c041b
Generation : 0012d34c
Attributes : All supported
UUID : c4cf95d9:3c32945b:10a9a67c:5fa8fc07
Checksum : 1f770fe2 correct
MPB Sectors : 1
Disks : 2
RAID Devices : 1
Disk01 Serial : MN1220F31NYX1D
State : active
Id : 00030000
Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)
[Volume0]:
UUID : a72f8913:3693ec59:a88612ff:3072c153
RAID Level : 1
Members : 2
Slots : [UU]
Failed disk : 1
This Slot : 1
Array Size : 3907022848 (1863.01 GiB 2000.40 GB)
Per Dev Size : 3907023112 (1863.01 GiB 2000.40 GB)
Sector Offset : 0
Num Stripes : 15261808
Chunk Size : 64 KiB
Reserved : 0
Migrate State : idle
Map State : normal
Dirty State : dirty
Disk00 Serial : MN1220F32B1Z3D
State : active
Id : 00020000
Usable Size : 3907023112 (1863.01 GiB 2000.40 GB)
Guybrush

The most likely source of this is mdmon getting killed. Does it happen with the latest 6.3 updates as well? Do you do suspend/resume on the system that sees this? Jes

No, suspend/resume was not used. Unfortunately, I no longer have access to that system, so I don't know whether the update fixes the issue, and I would not be able to help with further investigation.

Ok, without more data I unfortunately have no way of reproducing the problem. The two reports here are for different configs, one for IMSM BIOS RAID and one for regular RAID. Since there haven't been any updates on the original bug since July, I am going to assume it is no longer an issue. If these problems reappear, please open a new BZ. Jes
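As a footnote to the mdmon suggestion above: for an IMSM (BIOS RAID) container, the external-metadata manager mdmon has to stay running, otherwise the array can stall once a metadata update is needed. A minimal sketch of how one could check this on an affected box, using the device names from this report (container /dev/md0, member volume /dev/md127):

ps -ef | grep mdmon            # an mdmon process managing md0 should be listed
cat /proc/mdstat               # container and member array state
mdadm --detail /dev/md127      # member volume; check the State line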