Bug 738035
Summary: | Incorrectly assembled array (possibly now corrupted) | | |
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Andy Burns <fedora> |
Component: | mdadm | Assignee: | Doug Ledford <dledford> |
Status: | CLOSED NOTABUG | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | unspecified | Docs Contact: | |
Priority: | unspecified | ||
Version: | 16 | CC: | agk, dledford, mbroz |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | Unspecified | ||
OS: | Unspecified | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2011-09-21 20:10:49 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Andy Burns
2011-09-13 17:24:22 UTC
I now think the root cause of this is that udev is not creating the /dev/sd*1 devices for the partitions on all the devices, so I have logged that under another bug (BZ#739296). I am still concerned as to how, when handed 6 raw disks and 2 partitions instead of 8 partitions, mdadm thinks it is doing the correct thing by assembling 5 of the 6 raw disks into one array and attempting to assemble the two partitions into another.

Doug Ledford

This is a known problem with version 0.90 superblocks; it was a primary motivating factor in deprecating that superblock version. They cannot tell the difference between a superblock at the end of a partition that spans the whole disk and a superblock at the end of the whole disk itself. We have moved to version 1.x superblocks, which protect against this issue.

I would *strongly* suggest you upgrade the superblock on your arrays by remaking the array using the partitions and not the whole drives, with the exact same options it was originally created with, except specifying a version 1.0 superblock (which will also place the superblock at the end of the device) and passing the --assume-clean flag to keep it from trying to recreate parity, and then seeing if your data is intact. If it is, then you are good to go (aside from having to update the UUID in the mdadm.conf file to match the new UUID). If not, then the incorrect assembly has likely caused parity generation to corrupt your array.
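Two of the checks this advice relies on can be sketched as follows; the device name /dev/sdX1 is a placeholder rather than anything taken from this bug:

  # show which metadata version a member carries and where its superblock sits
  mdadm --examine /dev/sdX1

  # after re-creating the array and verifying the data, print the ARRAY line
  # (including the new UUID) that belongs in mdadm.conf
  mdadm --detail --scan

Both commands are read-only, so they are safe to run before committing to anything destructive.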
Andy Burns

That sounds like a reasonable explanation of what I am seeing; the arrays are several years old, so they are likely to have an out-of-date metadata format. Although the machine thinks it has assembled the array (sort of!), I hope nothing has been written to the disks. Obviously I can't see the LVM PV which was on the array originally, so the LVs within it haven't been mounted, and I don't think parity regeneration has been triggered on the array. I will try to do as you say once I can persuade the machine to power on again (that's a completely different issue). Thank you very much for the comment.

Andy Burns

OK, the machine is now bootable again (faulty DVB-S2 card removed).

Given that the /dev/sd*1 device files for most of the partitions don't exist, are you saying I should wipe the partition tables, then re-create them (with exactly the same start/end sectors, obviously), and *then* recreate the array with --assume-clean and --metadata=1.0? Is metadata 1.0 compatible with CentOS 5.x, and if not, should I just go with metadata 1.2? I assume the order of the partitions within the array is critical?

I have the dmesg output from the last time the array was successfully mounted under CentOS 5.x; the device names were slightly different under CentOS (ghijklmn instead of chijklmn):

md: Autodetecting RAID arrays.
md: autorun ...
md: considering sdn1 ...
md: adding sdn1 ...
md: adding sdm1 ...
md: adding sdl1 ...
md: adding sdk1 ...
md: adding sdj1 ...
md: adding sdi1 ...
md: adding sdh1 ...
md: adding sdg1 ...
md: created md2
md: bind<sdg1>
md: bind<sdh1>
md: bind<sdi1>
md: bind<sdj1>
md: bind<sdk1>
md: bind<sdl1>
md: bind<sdm1>
md: bind<sdn1>
md: running: <sdn1><sdm1><sdl1><sdk1><sdj1><sdi1><sdh1><sdg1>
raid5: device sdn1 operational as raid disk 0
raid5: device sdm1 operational as raid disk 6
raid5: device sdl1 operational as raid disk 4
raid5: device sdk1 operational as raid disk 7
raid5: device sdj1 operational as raid disk 5
raid5: device sdi1 operational as raid disk 3
raid5: device sdh1 operational as raid disk 1
raid5: device sdg1 operational as raid disk 2
raid5: allocated 8462kB for md2
raid5: raid level 5 set md2 active with 8 out of 8 devices, algorithm 2
RAID5 conf printout:
 --- rd:8 wd:8 fd:0
 disk 0, o:1, dev:sdn1
 disk 1, o:1, dev:sdh1
 disk 2, o:1, dev:sdg1
 disk 3, o:1, dev:sdi1
 disk 4, o:1, dev:sdl1
 disk 5, o:1, dev:sdj1
 disk 6, o:1, dev:sdm1
 disk 7, o:1, dev:sdk1
md: ... autorun DONE.

Doug Ledford

Yes, you'll need to recreate the partition tables, and version 1.0 metadata should work with CentOS 5 just fine. Do *NOT* go with version 1.2 metadata. That will stick the superblock at the beginning of the partition and will offset the start of your data roughly 1MB into the partition. When that happens, nothing will be valid any more, because your real data starts at the beginning of the partition (and will have been partly overwritten by the new superblock as well), while the new array will start reading 1MB into that data. Nothing will be in the right place and the LVM PV will be permanently hosed.

As for the order of the partitions, it is critical. However, as long as you use the --assume-clean flag, you can remake the array as many times as you need to get the order right. So first, try to get the order right using the dmesg from the last successful boot. When it says:

raid5: device sdn1 operational as raid disk 0

that means the device that was sdn1 back then should be the first disk in the list of devices you create the array from. Number 1 is next, and so on. So, from what you listed above, the create command would be something like this:

mdadm -C /dev/md2 -l5 -n8 -e1.0 --name=md2 --chunk=64 --assume-clean /dev/sd[nhgiljmk]1

However, I note that you asked if version 1.0 metadata is OK with CentOS 5. In CentOS we have an older version of mdadm (2.6.9), and although I know it supports --assume-clean, I'm not positive how well version 1.0 metadata is supported. You'll just have to try it and see how it goes.

Andy Burns

This issue is now solved for me. The problem was that 6 of the 8 disks had previously been members of a different md-on-raw-disk array; then 2 additional disks were purchased and 8 partitions created for an md-on-partitions array (used for several years under CentOS 5).

The issue was only noticed when changing to FC16, where udev apparently sees the raw-disk superblock first and uses it. Simply zeroing the superblock on the raw devices allowed the metadata on the partitions to be "unshadowed" and used correctly on the next reboot; no data was lost. I will probably still take the suggestion to upgrade to metadata 1.0, though. Thanks.
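One practical note on the create command quoted above: the shell expands a pattern such as /dev/sd[nhgiljmk]1 in alphabetical order, not in the order the letters are written, so the member order recovered from the dmesg is only guaranteed if the devices are listed explicitly. A sketch of the same command spelled out with the ordering shown in the dmesg above:

  mdadm -C /dev/md2 -l5 -n8 -e1.0 --name=md2 --chunk=64 --assume-clean \
        /dev/sdn1 /dev/sdh1 /dev/sdg1 /dev/sdi1 /dev/sdl1 /dev/sdj1 /dev/sdm1 /dev/sdk1

As noted in the comment, because of --assume-clean the array can be stopped (mdadm --stop /dev/md2) and re-created in a different order if the first attempt turns out to be wrong.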
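The fix described in the closing comment can be sketched roughly as follows, using a hypothetical raw disk /dev/sdX whose stale whole-disk superblock shadows the real member /dev/sdX1; it works here because the superblock at the end of the whole disk and the superblock at the end of the partition sit at different offsets:

  # confirm the stale superblock really is on the whole-disk device
  mdadm --examine /dev/sdX

  # erase only the whole-disk superblock; the partition's metadata is untouched
  mdadm --zero-superblock /dev/sdX

  # on the next boot (or an explicit re-assembly) the array is built from the partitions
  mdadm --assemble --scan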