| Summary: | Md raid bitmaps sometimes get stuck... | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Maciej Żenczykowski <zenczykowski> |
| Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
| Status: | CLOSED WONTFIX | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 14 | CC: | dledford, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-08-16 22:25:40 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Perhaps worth noting is that from the moment the kernel assembled the array, across the entire 'check' operation, until the above commands were run, no errors were reported (in dmesg / /var/log/messages) for the physical drives, the ATA bus, or the raid array itself:

# cat /var/log/messages-20110227 /var/log/messages | egrep 'ata12|sd[ijklm]|md20'
...
Feb 26 14:51:02 zeus kernel: [61667.313109] md/raid:md20: device sdj1 operational as raid disk 1
Feb 26 14:51:02 zeus kernel: [61667.313113] md/raid:md20: device sdl1 operational as raid disk 3
Feb 26 14:51:02 zeus kernel: [61667.313115] md/raid:md20: device sdm1 operational as raid disk 4
Feb 26 14:51:02 zeus kernel: [61667.313117] md/raid:md20: device sdi1 operational as raid disk 0
Feb 26 14:51:02 zeus kernel: [61667.313119] md/raid:md20: device sdk1 operational as raid disk 2
Feb 26 14:51:02 zeus kernel: [61667.313816] md/raid:md20: allocated 5334kB
Feb 26 14:51:02 zeus kernel: [61667.313843] md/raid:md20: raid level 5 active with 5 out of 5 devices, algorithm 2
Feb 26 14:51:02 zeus kernel: [61667.378969] md20: bitmap initialized from disk: read 22/22 pages, set 1898 bits
Feb 26 14:51:02 zeus kernel: [61667.378972] created bitmap (350 pages) for device md20
Feb 26 14:51:02 zeus kernel: [61667.410759] md20: detected capacity change from 0 to 3000613470208
Feb 26 14:51:02 zeus kernel: [61667.411071] md20: unknown partition table
Feb 26 14:52:55 zeus kernel: [61780.466545] md: data-check of RAID array md20
Feb 27 09:58:15 zeus kernel: [130501.035315] md: md20: data-check done.

# mdadm -S /dev/md20
mdadm: stopped /dev/md20
# mdadm -A /dev/md20
mdadm: /dev/md20 has been started with 5 drives.
# mdadm -o /dev/md20

Feb 27 11:46:07 zeus kernel: [136972.309071] md20: detected capacity change from 3000613470208 to 0
Feb 27 11:46:07 zeus kernel: [136972.309076] md: md20 stopped.
Feb 27 11:46:07 zeus kernel: [136972.309136] md: unbind<sdj1>
Feb 27 11:46:07 zeus kernel: [136972.487189] md: export_rdev(sdj1)
Feb 27 11:46:07 zeus kernel: [136972.487203] md: unbind<sdl1>
Feb 27 11:46:07 zeus kernel: [136972.491008] md: export_rdev(sdl1)
Feb 27 11:46:07 zeus kernel: [136972.491017] md: unbind<sdm1>
Feb 27 11:46:07 zeus kernel: [136972.495010] md: export_rdev(sdm1)
Feb 27 11:46:07 zeus kernel: [136972.495018] md: unbind<sdi1>
Feb 27 11:46:07 zeus kernel: [136972.500008] md: export_rdev(sdi1)
Feb 27 11:46:07 zeus kernel: [136972.500018] md: unbind<sdk1>
Feb 27 11:46:07 zeus kernel: [136972.501129] md: export_rdev(sdk1)
Feb 27 11:46:15 zeus kernel: [136980.819945] md: md20 stopped.
Feb 27 11:46:15 zeus kernel: [136980.884111] md: bind<sdj1>
Feb 27 11:46:15 zeus kernel: [136980.884386] md: bind<sdk1>
Feb 27 11:46:15 zeus kernel: [136980.884660] md: bind<sdl1>
Feb 27 11:46:15 zeus kernel: [136980.884902] md: bind<sdm1>
Feb 27 11:46:15 zeus kernel: [136980.885472] md: bind<sdi1>
Feb 27 11:46:17 zeus kernel: [136982.551723] md/raid:md20: device sdi1 operational as raid disk 0
Feb 27 11:46:17 zeus kernel: [136982.551727] md/raid:md20: device sdm1 operational as raid disk 4
Feb 27 11:46:17 zeus kernel: [136982.551729] md/raid:md20: device sdl1 operational as raid disk 3
Feb 27 11:46:17 zeus kernel: [136982.551731] md/raid:md20: device sdk1 operational as raid disk 2
Feb 27 11:46:17 zeus kernel: [136982.551733] md/raid:md20: device sdj1 operational as raid disk 1
Feb 27 11:46:17 zeus kernel: [136982.552244] md/raid:md20: allocated 5334kB
Feb 27 11:46:17 zeus kernel: [136982.552310] md/raid:md20: raid level 5 active with 5 out of 5 devices, algorithm 2
Feb 27 11:46:17 zeus kernel: [136982.594682] md20: bitmap initialized from disk: read 22/22 pages, set 1898 bits
Feb 27 11:46:17 zeus kernel: [136982.594684] created bitmap (350 pages) for device md20
Feb 27 11:46:19 zeus kernel: [136984.282677] md20: detected capacity change from 0 to 3000613470208
Feb 27 11:46:19 zeus kernel: [136984.286755] md20: unknown partition table

The interesting thing to note is that these 1898 bits now appear to be permanently set??? They don't go away after assembling the array, and whether the array is read-only or not doesn't matter. I ran the following command:
dd if=/boot/bitmap/md20 skip=256 bs=1 2>/dev/null | xxd -a | egrep -v '^[0-9a-f]{6}0: 0000 0000 0000 0000 0000 0000 0000 0000 \.{16}$|^\*$' | cut -d: -f1 | while read i; do echo $i; dd if=/dev/md20 of=/dev/null bs=$[4<<20] skip=$[0x$i*8] count=$[16*8]; done
which forces a read of all blocks with a bit set in the bitmap. It completes without errors but doesn't cause the bits to clear.
[The logic above is: take the bitmap, skip the 256-byte header, convert it to hex with xxd, skip the rows that are all zeroes, parse out the starting address of each remaining row, multiply it by 8 to get a bit index, and read a full row's worth of data, i.e. 128 bits = 16 bytes * 8 bits/byte, starting at that point. The bitmap chunk size is 1 MB, but since this is a 5-disk raid5, each bit effectively covers 5-1 = 4 MB of array data.]
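For readability, here is the same read-only probe spelled out as a commented script. This is only a sketch of the one-liner above, assuming the same external bitmap path, array device, and 1 MB-chunk / 4 MB-per-bit geometry; it does not write anything.

```bash
#!/bin/bash
# Sketch of the dirty-region read probe described above (read-only).
BITMAP=/boot/bitmap/md20   # external bitmap file: 256-byte superblock, then the bit area
ARRAY=/dev/md20            # the raid5 array
BS=$((4 << 20))            # 4 MiB: the array data covered by one bitmap bit here

dd if="$BITMAP" skip=256 bs=1 2>/dev/null \
  | xxd -a \
  | egrep -v '^[0-9a-f]{6}0: 0000 0000 0000 0000 0000 0000 0000 0000 \.{16}$|^\*$' \
  | cut -d: -f1 \
  | while read -r off; do
      echo "$off"
      # "$off" is a byte offset into the bit area; *8 turns it into a bit index,
      # which doubles as the dd skip count in 4 MiB blocks. One xxd row is
      # 16 bytes = 128 bits, hence count=128 blocks read per dirty row.
      dd if="$ARRAY" of=/dev/null bs="$BS" skip=$((0x$off * 8)) count=$((16 * 8))
    done
```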
Forcing a resync with:
echo resync > /sys/block/md20/md/sync_action
doesn't accomplish anything.
Setting the bits via:
echo x-y > /sys/block/md20/md/bitmap_set_bits
and then forcing a resync also doesn't do anything (if you set bits that weren't already set, the resync does resync them, but if you set bits that were already set, nothing happens).
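Spelled out as a sequence (the 0-100 range is the same arbitrary example used later in this report; both sysfs files are the ones referenced above):

```bash
echo 0-100 > /sys/block/md20/md/bitmap_set_bits   # mark blocks 0-100 dirty
echo resync > /sys/block/md20/md/sync_action      # kick off a resync
cat /proc/mdstat                                  # watch its progress
mdadm -X /boot/bitmap/md20                        # previously-stuck bits remain dirty
```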
Running the following command (!writes to array!):
dd if=/boot/bitmap/md20 skip=256 bs=1 2>/dev/null | xxd -a | egrep -v '^[0-9a-f]{6}0: 0000 0000 0000 0000 0000 0000 0000 0000 \.{16}$|^\*$' | cut -d: -f1 | head -n 1 | while read i; do echo $i; dd if=/dev/md20 of=/dev/md20 bs=$[4<<20] skip=$[0x$i*8] seek=$[0x$i*8] count=$[16*8]; done
which basically reads and then writes back a fragment of the array where a bit was set in the bitmap. This has resulted in the 1898 set bits becoming 2025 set bits, and these set bits are also not going away. That is a difference of 127 bits, which is the expected value (the command above rewrote 128 bitmap bits' worth of data, and one of those bits was already set).
[the logic of this command is the same as the previous one, except we only process the first row, and instead of discarding the read we immediately write it back]
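A quick arithmetic check of that delta (the figures are the ones quoted above):

```bash
# One xxd row covers 16 bytes * 8 bits/byte = 128 bitmap bits, one of which
# was already dirty before the rewrite.
echo $(( 16 * 8 - 1 ))          # newly dirtied bits: 127
echo $(( 1898 + 16 * 8 - 1 ))   # expected total dirty bits: 2025
```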
VERY VERY WEIRD - why would a normal write cause bits to be permanently set in the bitmap???
Failing, removing, and re-adding a drive doesn't fix this:
mdadm /dev/md20 -f /dev/sdi1
mdadm /dev/md20 -r /dev/sdi1
mdadm /dev/md20 -a /dev/sdi1
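One way to confirm that such a cycle changes nothing is to compare the dirty count before and after; a minimal sketch, reusing the external bitmap path and mdadm -X invocation from elsewhere in this report:

```bash
mdadm -X /boot/bitmap/md20 | grep -i dirty   # record the dirty count
mdadm /dev/md20 -f /dev/sdi1                 # fail the member
mdadm /dev/md20 -r /dev/sdi1                 # remove it
mdadm /dev/md20 -a /dev/sdi1                 # re-add it; recovery runs
cat /proc/mdstat                             # wait for the recovery to finish
mdadm -X /boot/bitmap/md20 | grep -i dirty   # dirty count is unchanged
```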
This could be the problem:
https://patchwork.kernel.org/patch/561151/
http://www.mail-archive.com/stable@linux.kernel.org/msg04004.html
and it looks like the cause is not present in 2.6.35.9 but is in 2.6.35.11:
http://git.kernel.org/?p=linux/kernel/git/longterm/linux-2.6.35.y.git;a=commitdiff;h=c3a057e54f2cb3efc797633d7d7102db90beed50

Patch applied. I've upgraded to "Linux 2.6.35.11-85.fc14.x86_64 #1 SMP Tue Mar 1 15:57:53 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux" on both my laptop and my desktop, and am not seeing the stuck bits go away. However, I believe this update has stopped new stuck bits from appearing. Previously, copying ~260 GB of data onto an array resulted in 713 new stuck bits showing up (at 4x1 MB of data per bit, on a 4+1 disk raid5 array with a 1 MB bitmap chunk), which represents slightly more than 1% of all data written (a rough arithmetic check follows below). After upgrading to the new kernel and copying 23 GB of data onto the array, no new stuck bits appeared (instead of the expected ~63).

I'm seeing similar dirty-bitmap-bits-not-clearing problems with bitmap-equipped raid1 arrays on my laptop (also upgraded to the -85 kernel).

I've also attempted to read 512 MB of data from the array (the first area which is 100% full of 128 stuck/dirty bits), wipe it with zeroes, and rewrite the original content. The dirty count doesn't change: it doesn't go up, and it doesn't go down.

Failing, removing, and re-adding a drive doesn't cause the bits to clear either.

Failing, removing, marking some blocks as dirty (via echo 0-100 > /sys/block/md20/md/bitmap_set_bits) and re-adding the (same as above) drive also doesn't work (although it does resync those 101 blocks).

Failing, removing, stopping the array, starting the array, and re-adding a drive (same as above) does actually appear to have caused some sort of meaningful partial resync, but the bits as they were set are still set. It's as if the resync happened but still didn't clear the bits...

Now to see if, after all this mucking around, the data on the array is actually still kosher (an fsck.ext4 on the array passed without errors, and 'md5sum -c' is now doing data verification on individual files, which will take hours if not days...). After checksumming 286 GB worth of data (the last data to be written to the array, responsible for at least 25% of the stuck bits), no errors were found (another 291 GB are being checked now, but I don't expect any failures; if I do find anything I'll update the bug).
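As a rough check of the "slightly more than 1%" figure and the expected ~63 bits mentioned above (assuming ~4 MB of array data per bitmap bit, as described earlier):

```bash
echo "713 * 4 / 1024" | bc -l                 # ~2.8 GB of data sits behind the 713 stuck bits
echo "713 * 4 * 100 / (260 * 1024)" | bc -l   # ~1.07% of the ~260 GB written
echo "713 * 23 / 260" | bc                    # ~63 bits expected for a 23 GB copy at the same rate
```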
---
I now have 2738 stuck bits on one of my desktop's raid5 arrays, and 1577+3194 stuck bits on two of my laptop's raid1 arrays. No magic incantation I've been able to find ever results in these bits clearing.
The configurations are quite different:
Desktop md20 is a RAID5 array over 5 raw partitions (sd{d,e,f,g,h}1) with an external bitmap; no write-behind or write-only is configured.
# mdadm -X /boot/bitmap/md20
Filename : /boot/bitmap/md20
Magic : 6d746962
Version : 4
UUID : 4d395d0f:9316a96d:f90d1b56:e8dd6a82
Events : 10880
Events Cleared : 10880
State : OK
Chunksize : 1 MB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 732571648 (698.63 GiB 750.15 GB)
Bitmap : 715402 bits (chunks), 2738 dirty (0.4%)
The laptop has two raid1 arrays with 2 drives each, one internal, one external (and often missing), with internal bitmaps; both external drives [partitions] were configured write-only, and one of them had write-behind configured as well (since reverted).
# mdadm -X /dev/sda2
Filename : /dev/sda2
Magic : 6d746962
Version : 4
UUID : bf46baf7:0dac1261:42b884fa:a4142e43
Events : 5668860
Events Cleared : 5668860
State : OK
Chunksize : 128 KB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 41811900 (39.87 GiB 42.82 GB)
Bitmap : 326656 bits (chunks), 1577 dirty (0.5%)
# mdadm -X /dev/sda4
Filename : /dev/sda4
Magic : 6d746962
Version : 4
UUID : c9bfca04:f2e32717:7e3f69c8:913cef71
Events : 14618766
Events Cleared : 14618766
State : OK
Chunksize : 512 KB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 153081967 (145.99 GiB 156.76 GB)
Bitmap : 298989 bits (chunks), 3194 dirty (1.1%)
I'd be grateful for any pointers on how to further debug this; I'm basically running out of ideas.
One more point of interest:
I ran a repair via
# echo repair > /sys/devices/virtual/block/md2/md/sync_action
on the smallest array (raid1 on the laptop), and the number of dirty bits decreased just a tiny bit (from 1577 to 1561):
Filename : /dev/sda2
Magic : 6d746962
Version : 4
UUID : bf46baf7:0dac1261:42b884fa:a4142e43
Events : 5668913
Events Cleared : 5668913
State : OK
Chunksize : 128 KB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 41811900 (39.87 GiB 42.82 GB)
Bitmap : 326656 bits (chunks), 1561 dirty (0.5%)
I've been watching /sys/devices/virtual/block/md*/md/mismatch_cnt and never seen a non-zero value.
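For reference, one way to poll those counters across all arrays (a hypothetical invocation, not taken verbatim from this report):

```bash
grep . /sys/devices/virtual/block/md*/md/mismatch_cnt   # prints "path:value" per array
```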
According to the RAID maintainer, the stuck bits are harmless. I suspect you would have to rewrite the data from user space to reset them.

---

Isn't this something the raid driver should clean up on its own?

---

This message is a notice that Fedora 14 is now at end of life. Fedora has stopped maintaining and issuing updates for Fedora 14. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At this time, all open bugs with a Fedora 'version' of '14' have been closed as WONTFIX. (Please note: Our normal process is to give advance warning of this occurring, but we forgot to do that. A thousand apologies.)

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, feel free to reopen this bug and simply change the 'version' to a later Fedora version.

Bug Reporter: Thank you for reporting this issue and we are sorry that we were unable to fix it before Fedora 14 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" (top right of this page) and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
# uname -a
Linux zeus.lan 2.6.35.11-83.fc14.x86_64 #1 SMP Mon Feb 7 07:06:44 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

(Set the array to read-only mode; the problem was already present earlier, this is just to show that there really is no traffic heading for the array. It's not mounted, so that's not surprising.)
# mdadm -o /dev/md20

(This is much longer than the 5s bitmap flush period; in truth the array has been sitting idle for hours.)
# sleep 100

# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [linear]
md20 : active (read-only) raid5 sdj1[1] sdl1[3] sdm1[5] sdi1[0] sdk1[2]
      2930286592 blocks super 1.1 level 5, 256k chunk, algorithm 2 [5/5] [UUUUU]
      bitmap: 2/350 pages [8KB], 1024KB chunk, file: /boot/bitmap/md20

# mdadm -X /boot/bitmap/md20
Filename : /boot/bitmap/md20
Magic : 6d746962
Version : 4
UUID : 4d395d0f:9316a96d:f90d1b56:e8dd6a82
Events : 10703
Events Cleared : 10652
State : OK
Chunksize : 1 MB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 732571648 (698.63 GiB 750.15 GB)
Bitmap : 715402 bits (chunks), 1898 dirty (0.3%)

# xxd -a /boot/bitmap/md20
0000000: 6269 746d 0400 0000 4d39 5d0f 9316 a96d bitm....M9]....m
0000010: f90d 1b56 e8dd 6a82 9f29 0000 0000 0000 ...V..j..)......
0000020: 9c29 0000 0000 0000 0050 5457 0000 0000 .).......PTW....
0000030: 0000 0000 0000 1000 0500 0000 0000 0000 ................
0000040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
*
00090b0: 0000 0000 0000 0010 0000 0000 0000 0000 ................
00090c0: 0100 0000 0000 0000 0000 0000 0000 00c0 ................
00090d0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00090e0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00090f0: ffff ffff ffff ffff ffff ffff ffff ffff ................
0009100: 0000 0000 0000 0000 0000 0000 0000 0000 ................
*
000c000: 0100 0000 0000 0000 0000 0000 0000 0000 ................
000c010: 0000 0000 0000 0000 0000 0400 0000 0000 ................
000c020: 0000 0000 0000 0000 0000 0000 0040 0000 .............@..
000c030: 0000 0000 0000 0000 ffff ffff ffff ffff ................
000c040: 0100 0000 ffff ffff ffff ffff ffff ffff ................
000c050: ffff ffff ffff ffff ffff ffff ffff ffff ................
000c060: ffff ffff ffff ffff ffff ffff ffff ffff ................
000c070: ffff ffff ffff ffff ffff ffff ffff ffff ................
000c080: 0100 0000 ffff ffff ffff ffff ffff ffff ................
000c090: ffff ffff ffff ffff ffff ffff ffff ffff ................
000c0a0: ffff ffff ffff ffff ffff ffff ffff ffff ................
000c0b0: ffff ffff ffff ffff ffff ffff ffff ffff ................
000c0c0: 0100 0000 ffff ffff ffff ffff ffff ffff ................
000c0d0: ffff ffff ffff ffff ffff ffff ffff ffff ................
000c0e0: ffff ffff ffff ffff ffff ffff ffff ffff ................
000c0f0: ffff ffff ffff ffff ffff ffff ffff ffff ................
000c100: 0000 0000 0000 0000 0000 0000 0000 0000 ................
*
0015e50: 00fc ..

Note that the bitmap is refusing to clear/flush. (This unmounted array just completed a 'check' action a couple of hours ago, which ran at 5*10 MB/s for a dozen or so hours before that.)