Bug 736521
Summary: mdadm crash/oops when stopping array in installer environment
Product: [Fedora] Fedora
Component: kernel
Version: 16
Hardware: Unspecified
OS: Unspecified
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Whiteboard: AcceptedBlocker
Reporter: David Lehman <dlehman>
Assignee: Kernel Maintainer List <kernel-maint>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: agk, awilliam, clydekunkel7734, dledford, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, mbroz, tflink, tomh0665
Target Milestone: ---
Target Release: ---
Doc Type: Bug Fix
Last Closed: 2011-09-16 17:57:33 UTC
Bug Blocks: 713564
Created attachment 522015 [details]
log of commands run by anaconda
This was not done by manually starting/stopping the array on tty2 -- I let anaconda do it. The result appears to be the same in either case.
Dave, you said you made a custom image without 65-md-incremental.rules. Did that custom image also show this problem (I suspect it did)? Regardless, this is a kernel oops, not an mdadm bug. It is, however, very likely that it's a kernel MD raid stack bug.

(In reply to comment #2)
> Dave, you said you made a custom image without 65-md-incremental.rules. Did
> that custom image also show this problem (I suspect it did)?

Yes, the problem still occurred without the incremental rules.

> Regardless, this is a kernel oops, not an mdadm bug. It is, however, very
> likely that it's a kernel MD raid stack bug.

Ok. Reassigning.

I've pinged Neil Brown about this issue. I suspect it's something that needs to be fixed prior to the final 3.1 kernel release, and therefore he needed to be informed.

(In reply to comment #4)
> I've pinged Neil Brown about this issue. I suspect it's something that needs
> to be fixed prior to the final 3.1 kernel release, and therefore he needed
> to be informed.

Bruno filed bug 737076 for some other raid-related issues and has been working with Neil on them already. Unfortunately, there isn't much info in that bug, and bugzilla.kernel.org is down. I thought I would mention it just in case it winds up being related.

I filed bz 737278, which I expect is a dupe of this one.

I used syslog=192.168.0.11:6666 on the kernel cmd line. It looks like it started. However, on the 192.168.0.11 machine, when I enter eval 'scripts/analog ....' I get: bash: scripts/analog: no such file or directory. What do I need to do to get the logs? Thanks. (PS: what about going back to the kernel used with the alpha tests?)

*** Bug 737278 has been marked as a duplicate of this bug. ***

Discussed at the 2011-09-12 Fedora QA meeting. 
Accepted as a blocker for Fedora 16 Beta due to violation of the following criterion [1]:

"The installer must be able to create and install to software, hardware or BIOS RAID-0, RAID-1 or RAID-5 partitions for anything except /boot"

This bug breaks installation on systems that already have an existing mdraid and is suspected to cause problems with newly created mdraid arrays.

Re: https://bugzilla.redhat.com/show_bug.cgi?id=737278#c6

This may or may not have been fixed in BZ 737076 mentioned above in comment #5, but I re-created an F16 mdraid install, attached the disks to another F16 install, and wasn't able to assemble the array.

[root@f16test ~]# lsblk
NAME        MAJ:MIN RM  SIZE RO MOUNTPOINT
sda           8:0    0    8G  0
├─sda1        8:1    0    1M  0
├─sda2        8:2    0    7G  0 /
└─sda3        8:3    0  700M  0 [SWAP]
sdb           8:16   0    8G  0
├─sdb1        8:17   0    1M  0
├─sdb2        8:18   0    7G  0
│ └─md126     9:126  0         0
└─sdb3        8:19   0  700M  0
  └─md127     9:127  0         0
sdc           8:32   0    8G  0
├─sdc1        8:33   0    1M  0
├─sdc2        8:34   0    7G  0
│ └─md126     9:126  0         0
└─sdc3        8:35   0  700M  0
  └─md127     9:127  0         0

[root@f16test ~]# cat /proc/mdstat
Personalities :
md126 : inactive sdb2[0] sdc2[1]
      14745576 blocks super 1.0

md127 : inactive sdb3[0] sdc3[1]
      1433576 blocks super 1.2

unused devices: <none>

[root@f16test ~]# mdadm --examine /dev/sdb2
/dev/sdb2:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : 86e84327:d3cda684:aac981df:84084e50
           Name : localhost.localdomain:0
  Creation Time : Mon Sep 12 13:38:13 2011
     Raid Level : raid1
   Raid Devices : 2
 Avail Dev Size : 14745576 (7.03 GiB 7.55 GB)
     Array Size : 14745576 (7.03 GiB 7.55 GB)
   Super Offset : 14745584 sectors
          State : clean
    Device UUID : d699d597:a48cbf76:5903b724:e2672b04
Internal Bitmap : -8 sectors from superblock
    Update Time : Mon Sep 12 13:44:11 2011
       Checksum : 6d5c01d4 - correct
         Events : 27
    Device Role : Active device 0
    Array State : AA ('A' == active, '.' == missing)

[root@f16test ~]# mdadm --examine /dev/sdc2
/dev/sdc2:
          Magic : a92b4efc
        Version : 1.0
    Feature Map : 0x1
     Array UUID : 86e84327:d3cda684:aac981df:84084e50
           Name : localhost.localdomain:0
  Creation Time : Mon Sep 12 13:38:13 2011
     Raid Level : raid1
   Raid Devices : 2
 Avail Dev Size : 14745576 (7.03 GiB 7.55 GB)
     Array Size : 14745576 (7.03 GiB 7.55 GB)
   Super Offset : 14745584 sectors
          State : clean
    Device UUID : 33d9f694:8468aa53:ce731827:f997aa13
Internal Bitmap : -8 sectors from superblock
    Update Time : Mon Sep 12 13:44:11 2011
       Checksum : 5948bd9e - correct
         Events : 27
    Device Role : Active device 1
    Array State : AA ('A' == active, '.' == missing)

[root@f16test ~]# mdadm --detail /dev/md126
/dev/md126:
        Version : 1.0
  Creation Time : Mon Sep 12 13:38:13 2011
     Raid Level : raid1
  Used Dev Size : 7372788 (7.03 GiB 7.55 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent
    Update Time : Mon Sep 12 13:44:11 2011
          State : active, Not Started
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0
           Name : localhost.localdomain:0
           UUID : 86e84327:d3cda684:aac981df:84084e50
         Events : 27

    Number   Major   Minor   RaidDevice State
       0       8       18        0      active sync   /dev/sdb2
       1       8       34        1      active sync   /dev/sdc2

[root@f16test ~]# mdadm --assemble /dev/md126 /dev/sdb2 /dev/sdc2
mdadm: cannot open device /dev/sdb2: Device or resource busy
mdadm: /dev/sdb2 has no superblock - assembly aborted

[root@f16test ~]# mdadm --assemble /dev/md127 /dev/sdb3 /dev/sdc3
mdadm: cannot open device /dev/sdb3: Device or resource busy
mdadm: /dev/sdb3 has no superblock - assembly aborted

I've started a 3.1-rc6 build that contains some md fixes. 
Please test this kernel when the build completes:

http://koji.fedoraproject.org/koji/taskinfo?taskID=3346292

I made a test boot iso with the 3.1-rc6 kernel for testing:

http://tflink.fedorapeople.org/iso/20110913_preRC_boot.x86_64.iso
http://tflink.fedorapeople.org/iso/20110913_preRC_boot.x86_64.iso.sha256

I don't have any raid arrays that I can test this with; if someone could try the new kernel out, that would be great.

I wanted to reproduce this to test the fix, but I can't - TC2 can stop my Intel BIOS RAID-0 array just fine. Dunno if that helps narrow it down any, but it means we need David, Clyde and/or Tom to test the fix. Please grab the ISO from comment #11 and let us know if it helps. thanks!

Created attachment 523093 [details]
20110913_preRC_boot.x86_64.iso rescue/VT1
I tried Anaconda's rescue mode with my previous mdraid install. It failed, but this time my laptop wasn't killed. I couldn't access the logs on VT2, but I was able to screen-dump the VTs.
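The `mdadm --examine` outputs earlier in the thread show both members (/dev/sdb2 and /dev/sdc2) carrying the same Array UUID; that field is the key mdadm uses to decide which devices belong to one array. As a rough illustration only (not mdadm's actual implementation; the helper name is mine), a sketch that groups devices by that field:

```python
import re
from collections import defaultdict

def group_by_array_uuid(examine_outputs):
    """Group (device, `mdadm --examine` text) pairs by the Array UUID
    field -- the identifier shared by all members of one md array."""
    groups = defaultdict(list)
    for device, text in examine_outputs:
        # The examine output prints the field as "Array UUID : <uuid>"
        match = re.search(r"Array UUID\s*:\s*(\S+)", text)
        if match:
            groups[match.group(1)].append(device)
    return dict(groups)
```

Feeding it the two outputs from the thread would put sdb2 and sdc2 into a single group keyed by 86e84327:d3cda684:aac981df:84084e50, which is why mdadm tried to assemble them into one array (md126).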
Created attachment 523094 [details]
20110913_preRC_boot.x86_64.iso rescue/VT3
Created attachment 523095 [details]
20110913_preRC_boot.x86_64.iso rescue/VT4
Created attachment 523096 [details]
20110913_preRC_boot.x86_64.iso rescue/VT5
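The test ISOs above are published alongside .sha256 companion files. Before booting one, the download can be verified; a minimal sketch (the function name is mine, not part of the thread -- `sha256sum -c` does the same job from a shell):

```python
import hashlib

def sha256_matches(path, expected_hex):
    """Stream a file and compare its SHA-256 digest against the
    published hex value (e.g. the digest in the .iso.sha256 file)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large ISOs don't need to fit in memory.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_hex.strip().lower()
```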
(In reply to comment #11)
> I made a test boot iso with the 3.1-rc6 kernel for testing:

uname -r says this is an rc5 kernel. Anyway, it looks like it fails in the same manner for me. Will be attaching /tmp obtained via remote syslog shortly.

Created attachment 523148 [details]
contents of /tmp

As mentioned in comment 17

yeah, syslog shows:

[ 183.790520] Pid: 2953, comm: mdadm Not tainted 3.1.0-0.rc5.git0.0.fc16.x86_64 #1 System manufacturer P5K-E/P5K-E
                                                     ^^^
                                                      |
                                                     Note

So, wrong kernel used? Make a new boot.iso and I'll test.

(In reply to comment #17)
> uname -r says this is an rc5 kernel. Anyway, it looks like it fails in the
> same manner for me. Will be attaching /tmp obtained via remote syslog shortly.

Good catch, thanks for letting me know. There was a typo in the command I used to build the iso and it pulled in rc5 instead of rc6.

New boot.iso (I double-checked the kernel version this time) at:

http://tflink.fedorapeople.org/iso/20110914_preRC_boot2.x86_64.iso
http://tflink.fedorapeople.org/iso/20110914_preRC_boot2.x86_64.iso.sha256

(In reply to comment #20)
> Good catch, thanks for letting me know. There was a typo in the command I
> used to build the iso and it pulled in rc5 instead of rc6.
>
> New boot.iso (I double checked the kernel version this time) at:

I'm downloading it and will try it asap. FYI:

(1) After trying rescue mode (comment #13), I tried a new install and it froze my computer after installing 10 packages. There was no way to get any data, but since it was the previous kernel, I won't try again to see if I can preserve something.

(2) I installed 3.1-rc6 on another F16 install and the arrays from my failed install were active and mountable, unlike in comment #9, so I'm hopeful that your new iso will work...

Success!!!

Install is humming along nicely with the rc6 kernel, with LVs over raid10 and a raid1 mirror present. Good work all. Thanks!!

--
Regards
OldFart

I was able to install F16, stop the array within the anaconda environment, and then boot into anaconda's rescue mode and mount and chroot into the array.

But I'm back to booting to an unusable "grub rescue" prompt.

Does everyone agree that 3.1-rc6 fixes this bug?

(In reply to comment #24)
> Does everyone agree that 3.1-rc6 fixes this bug?

Comment 22 for me is a yes.

It would be best if people can test with Beta RC1:

http://dl.fedoraproject.org/pub/alt/stage/16-Beta.RC1/

to confirm. If this issue is fixed but you run into another later, please file that separately (of course, first check if someone else has already filed it). Thanks!

(In reply to comment #23)
> I was able to install F16, stop the array within the anaconda environment,
> and then boot into anaconda's rescue mode and mount and chroot into the
> array.
>
> But I'm back to booting to an unusable "grub rescue" prompt.

Tom, this is certain to be something wrong with the grub installation (the "grub rescue" prompt thing) and not a problem with your raid arrays. The specific problem you ran into with your raid arrays was that mdadm was not able to request the kernel module to support your raid level (which is why the arrays showed up in /proc/mdstat but weren't running, as the raid1 personality wasn't loaded). So, the issue in comment #9 was fixed by the selinux update, starting and stopping arrays was fixed by this bug, and now we need to get to the bottom of the boot issue on your machine (which may just boil down to you needing to select a different install point for grub). In any case, your remaining issue isn't related to this bug, and I hereby declare this issue fixed based upon all the feedback present.

I agree that my "grub rescue" problem isn't related to this bug; I'm sorry that I wasn't clearer in #23. 
I've been too busy to report the grub2 problem, but I'll do so this weekend. (FYI, "grub2-install ..." in Anaconda's rescue mode fails, but "grub2-mkimage ...; grub2-setup ..." succeeds.) Thanks for the mdstat explanation; I hadn't thought of checking that!
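The diagnosis above (arrays visible in /proc/mdstat but not running because the raid1 personality module wasn't loaded) is visible directly in the mdstat text from comment #9: the Personalities line is empty while md126/md127 sit inactive. A small sketch (a hypothetical helper, not part of mdadm) that extracts both facts from that text:

```python
def mdstat_status(mdstat_text):
    """Return (loaded_personalities, inactive_arrays) parsed from the
    contents of /proc/mdstat.  Inactive arrays combined with an empty
    personalities list match the symptom discussed above: members were
    assembled, but no raid personality module was loaded to run them."""
    personalities, inactive = [], []
    for line in mdstat_text.splitlines():
        if line.startswith("Personalities"):
            # e.g. "Personalities : [raid1]" -> ["raid1"]; bare
            # "Personalities :" -> [] (no personality modules loaded)
            personalities = [p.strip("[]") for p in line.split(":", 1)[1].split()]
        elif " : inactive" in line:
            # e.g. "md126 : inactive sdb2[0] sdc2[1]" -> "md126"
            inactive.append(line.split()[0])
    return personalities, inactive
```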
Created attachment 522014 [details]
syslog

Description of problem:
When trying to stop an array in the anaconda runtime environment in f16-beta.tc1, mdadm crashes, which seems to lead to an unkillable udev-spawned blkid process. The udev/blkid part is less clear to me than the mdadm oops. To be clear, mdadm says the array is stopped, and it appears to be stopped. However, any subsequent mdadm commands on that array seem to hang. Also, within a short time there is an 'add' event on md0 that triggers the unkillable blkid trying to probe the non-existent md0.

Version-Release number of selected component (if applicable):
mdadm-3.2.2-6

How reproducible:
Always, from what I can tell

Steps to Reproduce:
1. Boot some f16-beta.tc1 media
2. Switch to the shell on tty2
3. Deactivate any active md array

Actual results:
mdadm seems to succeed, except for the crash dump in the syslog and the fact that the stopped array is now unusable

Expected results:
Array stopped successfully

Additional info:
In trying to track this down I made a custom image without 65-md-incremental.rules. I also manually started the rsyslog service from the shell on tty2 before running any other commands.
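Logs in this thread were captured by pointing the installer at a remote host with syslog=192.168.0.11:6666 on the kernel command line, which sends log lines as plain UDP datagrams. When no syslog daemon is handy on the receiving machine, a throwaway listener is enough to capture the stream. A minimal sketch, assuming the chosen port is free (this is my own illustration, not part of any Fedora tooling):

```python
import socket

def capture_syslog(host, port, count=1, timeout=5.0):
    """Receive `count` UDP datagrams on (host, port) -- e.g. the stream
    produced by booting with syslog=<this-host>:<port> -- and return
    them decoded as text."""
    messages = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.bind((host, port))
        while len(messages) < count:
            # Syslog datagrams are small; 64 KiB covers any UDP payload.
            data, _sender = sock.recvfrom(65535)
            messages.append(data.decode("utf-8", errors="replace"))
    return messages
```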