| Summary: | Md raid bitmaps sometimes get stuck... | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Maciej Żenczykowski <zenczykowski> |
| Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
| Status: | CLOSED WONTFIX | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 14 | CC: | dledford, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2012-08-16 22:25:40 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Perhaps worth noting is that from the moment the kernel assembled the array, across the entire 'check' operation, until the above commands were run, no errors were reported (in dmesg / /var/log/messages) for the physical drives, the ATA bus, or the raid array itself:

# cat /var/log/messages-20110227 /var/log/messages | egrep 'ata12|sd[ijklm]|md20'
...
Feb 26 14:51:02 zeus kernel: [61667.313109] md/raid:md20: device sdj1 operational as raid disk 1
Feb 26 14:51:02 zeus kernel: [61667.313113] md/raid:md20: device sdl1 operational as raid disk 3
Feb 26 14:51:02 zeus kernel: [61667.313115] md/raid:md20: device sdm1 operational as raid disk 4
Feb 26 14:51:02 zeus kernel: [61667.313117] md/raid:md20: device sdi1 operational as raid disk 0
Feb 26 14:51:02 zeus kernel: [61667.313119] md/raid:md20: device sdk1 operational as raid disk 2
Feb 26 14:51:02 zeus kernel: [61667.313816] md/raid:md20: allocated 5334kB
Feb 26 14:51:02 zeus kernel: [61667.313843] md/raid:md20: raid level 5 active with 5 out of 5 devices, algorithm 2
Feb 26 14:51:02 zeus kernel: [61667.378969] md20: bitmap initialized from disk: read 22/22 pages, set 1898 bits
Feb 26 14:51:02 zeus kernel: [61667.378972] created bitmap (350 pages) for device md20
Feb 26 14:51:02 zeus kernel: [61667.410759] md20: detected capacity change from 0 to 3000613470208
Feb 26 14:51:02 zeus kernel: [61667.411071] md20: unknown partition table
Feb 26 14:52:55 zeus kernel: [61780.466545] md: data-check of RAID array md20
Feb 27 09:58:15 zeus kernel: [130501.035315] md: md20: data-check done.

# mdadm -S /dev/md20
mdadm: stopped /dev/md20
# mdadm -A /dev/md20
mdadm: /dev/md20 has been started with 5 drives.
# mdadm -o /dev/md20

Feb 27 11:46:07 zeus kernel: [136972.309071] md20: detected capacity change from 3000613470208 to 0
Feb 27 11:46:07 zeus kernel: [136972.309076] md: md20 stopped.
Feb 27 11:46:07 zeus kernel: [136972.309136] md: unbind<sdj1>
Feb 27 11:46:07 zeus kernel: [136972.487189] md: export_rdev(sdj1)
Feb 27 11:46:07 zeus kernel: [136972.487203] md: unbind<sdl1>
Feb 27 11:46:07 zeus kernel: [136972.491008] md: export_rdev(sdl1)
Feb 27 11:46:07 zeus kernel: [136972.491017] md: unbind<sdm1>
Feb 27 11:46:07 zeus kernel: [136972.495010] md: export_rdev(sdm1)
Feb 27 11:46:07 zeus kernel: [136972.495018] md: unbind<sdi1>
Feb 27 11:46:07 zeus kernel: [136972.500008] md: export_rdev(sdi1)
Feb 27 11:46:07 zeus kernel: [136972.500018] md: unbind<sdk1>
Feb 27 11:46:07 zeus kernel: [136972.501129] md: export_rdev(sdk1)
Feb 27 11:46:15 zeus kernel: [136980.819945] md: md20 stopped.
Feb 27 11:46:15 zeus kernel: [136980.884111] md: bind<sdj1>
Feb 27 11:46:15 zeus kernel: [136980.884386] md: bind<sdk1>
Feb 27 11:46:15 zeus kernel: [136980.884660] md: bind<sdl1>
Feb 27 11:46:15 zeus kernel: [136980.884902] md: bind<sdm1>
Feb 27 11:46:15 zeus kernel: [136980.885472] md: bind<sdi1>
Feb 27 11:46:17 zeus kernel: [136982.551723] md/raid:md20: device sdi1 operational as raid disk 0
Feb 27 11:46:17 zeus kernel: [136982.551727] md/raid:md20: device sdm1 operational as raid disk 4
Feb 27 11:46:17 zeus kernel: [136982.551729] md/raid:md20: device sdl1 operational as raid disk 3
Feb 27 11:46:17 zeus kernel: [136982.551731] md/raid:md20: device sdk1 operational as raid disk 2
Feb 27 11:46:17 zeus kernel: [136982.551733] md/raid:md20: device sdj1 operational as raid disk 1
Feb 27 11:46:17 zeus kernel: [136982.552244] md/raid:md20: allocated 5334kB
Feb 27 11:46:17 zeus kernel: [136982.552310] md/raid:md20: raid level 5 active with 5 out of 5 devices, algorithm 2
Feb 27 11:46:17 zeus kernel: [136982.594682] md20: bitmap initialized from disk: read 22/22 pages, set 1898 bits
Feb 27 11:46:17 zeus kernel: [136982.594684] created bitmap (350 pages) for device md20
Feb 27 11:46:19 zeus kernel: [136984.282677] md20: detected capacity change from 0 to 3000613470208
Feb 27 11:46:19 zeus kernel: [136984.286755] md20: unknown partition table

The interesting thing to note is that these 1898 bits now appear to be permanently set??? They don't go away after assembling the array, and whether the array is read-only or not doesn't matter. I ran the following command:
dd if=/boot/bitmap/md20 skip=256 bs=1 2>/dev/null | xxd -a | egrep -v '^[0-9a-f]{6}0: 0000 0000 0000 0000 0000 0000 0000 0000 \.{16}$|^\*$' | cut -d: -f1 | while read i; do echo $i; dd if=/dev/md20 of=/dev/null bs=$[4<<20] skip=$[0x$i*8] count=$[16*8]; done
which forces a read of all blocks with a bit set in the bitmap. It completes without errors but doesn't cause the bits to clear.
[The logic above is: take the bitmap, skip the 256-byte header, convert it to hex with xxd, skip the rows that are all zeroes, parse out the starting address of each remaining row, multiply it by 8 to get a bit index, and read a full row's worth of data, i.e. 128 bits = 16 bytes * 8 bits/byte, starting at that point. The bitmap chunk size is 1 MB, but since this is a 5-disk raid5, each bit effectively covers 5-1 = 4 MB of array data.]
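For readability, here is the same read-only probe spelled out as a commented script. This is only a sketch of the one-liner above, assuming the same external bitmap path, array device, and 1 MB-chunk / 4 MB-per-bit geometry; it does not write anything.

```bash
#!/bin/bash
# Sketch of the dirty-region read probe described above (read-only).
BITMAP=/boot/bitmap/md20   # external bitmap file: 256-byte superblock, then the bit area
ARRAY=/dev/md20            # the raid5 array
BS=$((4 << 20))            # 4 MiB: the array data covered by one bitmap bit here

dd if="$BITMAP" skip=256 bs=1 2>/dev/null \
  | xxd -a \
  | egrep -v '^[0-9a-f]{6}0: 0000 0000 0000 0000 0000 0000 0000 0000 \.{16}$|^\*$' \
  | cut -d: -f1 \
  | while read -r off; do
      echo "$off"
      # "$off" is a byte offset into the bit area; *8 turns it into a bit index,
      # which doubles as the dd skip count in 4 MiB blocks. One xxd row is
      # 16 bytes = 128 bits, hence count=128 blocks read per dirty row.
      dd if="$ARRAY" of=/dev/null bs="$BS" skip=$((0x$off * 8)) count=$((16 * 8))
    done
```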
Forcing a resync with:
echo resync > /sys/block/md20/md/sync_action
doesn't accomplish anything.
Setting the bits via:
echo x-y > /sys/block/md20/md/bitmap_set_bits
and then forcing a resync also doesn't do anything (if you set bits that weren't already set, the resync does resync them, but if you set bits that were already set, nothing happens).
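Spelled out as a sequence (the 0-100 range is the same arbitrary example used later in this report; both sysfs files are the ones referenced above):

```bash
echo 0-100 > /sys/block/md20/md/bitmap_set_bits   # mark blocks 0-100 dirty
echo resync > /sys/block/md20/md/sync_action      # kick off a resync
cat /proc/mdstat                                  # watch its progress
mdadm -X /boot/bitmap/md20                        # previously-stuck bits remain dirty
```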
Running the following command (!writes to array!):
dd if=/boot/bitmap/md20 skip=256 bs=1 2>/dev/null | xxd -a | egrep -v '^[0-9a-f]{6}0: 0000 0000 0000 0000 0000 0000 0000 0000 \.{16}$|^\*$' | cut -d: -f1 | head -n 1 | while read i; do echo $i; dd if=/dev/md20 of=/dev/md20 bs=$[4<<20] skip=$[0x$i*8] seek=$[0x$i*8] count=$[16*8]; done
which basically reads and then writes back a fragment of the array where a bit was set in the bitmap. This has resulted in the 1898 set bits becoming 2025 set bits, and these set bits are also not going away. That is a difference of 127 bits, which is the expected value (the command above rewrote 128 bitmap bits' worth of data, and one of those bits was already set).
[the logic of this command is the same as the previous one, except we only process the first row, and instead of discarding the read we immediately write it back]
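A quick arithmetic check of that delta (the figures are the ones quoted above):

```bash
# One xxd row covers 16 bytes * 8 bits/byte = 128 bitmap bits, one of which
# was already dirty before the rewrite.
echo $(( 16 * 8 - 1 ))          # newly dirtied bits: 127
echo $(( 1898 + 16 * 8 - 1 ))   # expected total dirty bits: 2025
```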
VERY VERY WEIRD - why would a normal write cause bits to be permanently set in the bitmap???
Failing, removing, and re-adding a drive doesn't fix this:
mdadm /dev/md20 -f /dev/sdi1
mdadm /dev/md20 -r /dev/sdi1
mdadm /dev/md20 -a /dev/sdi1
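One way to confirm that such a cycle changes nothing is to compare the dirty count before and after; a minimal sketch, reusing the external bitmap path and mdadm -X invocation from elsewhere in this report:

```bash
mdadm -X /boot/bitmap/md20 | grep -i dirty   # record the dirty count
mdadm /dev/md20 -f /dev/sdi1                 # fail the member
mdadm /dev/md20 -r /dev/sdi1                 # remove it
mdadm /dev/md20 -a /dev/sdi1                 # re-add it; recovery runs
cat /proc/mdstat                             # wait for the recovery to finish
mdadm -X /boot/bitmap/md20 | grep -i dirty   # dirty count is unchanged
```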
This could be the problem:
https://patchwork.kernel.org/patch/561151/
http://www.mail-archive.com/stable@linux.kernel.org/msg04004.html
and it looks like the cause is not present in 2.6.35.9 but is in 2.6.35.11:
http://git.kernel.org/?p=linux/kernel/git/longterm/linux-2.6.35.y.git;a=commitdiff;h=c3a057e54f2cb3efc797633d7d7102db90beed50

Patch applied. I've upgraded to "Linux 2.6.35.11-85.fc14.x86_64 #1 SMP Tue Mar 1 15:57:53 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux" on both my laptop and my desktop, and am not seeing the stuck bits go away. However, I believe this update has stopped new stuck bits from appearing. Previously, copying ~260 GB of data onto an array resulted in 713 new stuck bits showing up (at 4x1 MB of data per bit, on a 4+1 disk raid5 array with a 1 MB bitmap chunk), which represents slightly more than 1% of all data written (a rough arithmetic check follows below). After upgrading to the new kernel and copying 23 GB of data onto the array, no new stuck bits appeared (instead of the expected ~63).

I'm seeing similar dirty-bitmap-bits-not-clearing problems with bitmap-equipped raid1 arrays on my laptop (also upgraded to the -85 kernel).

I've also attempted to read 512 MB of data from the array (the first area which is 100% full of 128 stuck/dirty bits), wipe it with zeroes, and rewrite the original content. The dirty count doesn't change: it doesn't go up, and it doesn't go down.

Failing, removing, and re-adding a drive doesn't cause the bits to clear either.

Failing, removing, marking some blocks as dirty (via echo 0-100 > /sys/block/md20/md/bitmap_set_bits) and re-adding the (same as above) drive also doesn't work (although it does resync those 101 blocks).

Failing, removing, stopping the array, starting the array, and re-adding a drive (same as above) does actually appear to have caused some sort of meaningful partial resync, but the bits as they were set are still set. It's as if the resync happened but still didn't clear the bits...

Now to see if, after all this mucking around, the data on the array is actually still kosher (an fsck.ext4 on the array passed without errors, and 'md5sum -c' is now doing data verification on individual files, which will take hours if not days...). After checksumming 286 GB worth of data (the last data to be written to the array, responsible for at least 25% of the stuck bits), no errors were found (another 291 GB are being checked now, but I don't expect any failures; if I do find anything I'll update the bug).
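As a rough check of the "slightly more than 1%" figure and the expected ~63 bits mentioned above (assuming ~4 MB of array data per bitmap bit, as described earlier):

```bash
echo "713 * 4 / 1024" | bc -l                 # ~2.8 GB of data sits behind the 713 stuck bits
echo "713 * 4 * 100 / (260 * 1024)" | bc -l   # ~1.07% of the ~260 GB written
echo "713 * 23 / 260" | bc                    # ~63 bits expected for a 23 GB copy at the same rate
```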
---
I now have 2738 stuck bits on one of my desktop's raid5 arrays, and 1577+3194 stuck bits on two of my laptop's raid1 arrays. No magic incantation I've been able to find ever results in these bits clearing.
The configurations are quite different:
Desktop md20 is a RAID5 array over 5 raw partitions (sd{d,e,f,g,h}1) with an external bitmap; no write-behind or write-only is configured.
# mdadm -X /boot/bitmap/md20
Filename : /boot/bitmap/md20
Magic : 6d746962
Version : 4
UUID : 4d395d0f:9316a96d:f90d1b56:e8dd6a82
Events : 10880
Events Cleared : 10880
State : OK
Chunksize : 1 MB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 732571648 (698.63 GiB 750.15 GB)
Bitmap : 715402 bits (chunks), 2738 dirty (0.4%)
The laptop has two raid1 arrays with 2 drives each, one internal, one external (and often missing), with internal bitmaps; both external drives [partitions] were configured write-only, and one of them had write-behind configured as well (since reverted).
# mdadm -X /dev/sda2
Filename : /dev/sda2
Magic : 6d746962
Version : 4
UUID : bf46baf7:0dac1261:42b884fa:a4142e43
Events : 5668860
Events Cleared : 5668860
State : OK
Chunksize : 128 KB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 41811900 (39.87 GiB 42.82 GB)
Bitmap : 326656 bits (chunks), 1577 dirty (0.5%)
# mdadm -X /dev/sda4
Filename : /dev/sda4
Magic : 6d746962
Version : 4
UUID : c9bfca04:f2e32717:7e3f69c8:913cef71
Events : 14618766
Events Cleared : 14618766
State : OK
Chunksize : 512 KB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 153081967 (145.99 GiB 156.76 GB)
Bitmap : 298989 bits (chunks), 3194 dirty (1.1%)
I'd be grateful for any pointers on how to further debug this; I'm basically running out of ideas.
One more point of interest:
I ran a repair via
# echo repair > /sys/devices/virtual/block/md2/md/sync_action
on the smallest array (raid1 on the laptop), and the number of dirty bits decreased just a tiny bit (from 1577 to 1561):
Filename : /dev/sda2
Magic : 6d746962
Version : 4
UUID : bf46baf7:0dac1261:42b884fa:a4142e43
Events : 5668913
Events Cleared : 5668913
State : OK
Chunksize : 128 KB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 41811900 (39.87 GiB 42.82 GB)
Bitmap : 326656 bits (chunks), 1561 dirty (0.5%)
I've been watching /sys/devices/virtual/block/md*/md/mismatch_cnt and never seen a non-zero value.
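For reference, one way to poll those counters across all arrays (a hypothetical invocation, not taken verbatim from this report):

```bash
grep . /sys/devices/virtual/block/md*/md/mismatch_cnt   # prints "path:value" per array
```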
According to the RAID maintainer, the stuck bits are harmless. I suspect you would have to rewrite the data from user space to reset them.

---

Isn't this something the raid driver should clean up on its own?

---

This message is a notice that Fedora 14 is now at end of life. Fedora has stopped maintaining and issuing updates for Fedora 14. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At this time, all open bugs with a Fedora 'version' of '14' have been closed as WONTFIX. (Please note: Our normal process is to give advance warning of this occurring, but we forgot to do that. A thousand apologies.)

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, feel free to reopen this bug and simply change the 'version' to a later Fedora version.

Bug Reporter: Thank you for reporting this issue and we are sorry that we were unable to fix it before Fedora 14 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" (top right of this page) and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping
# uname -a
Linux zeus.lan 2.6.35.11-83.fc14.x86_64 #1 SMP Mon Feb 7 07:06:44 UTC 2011 x86_64 x86_64 x86_64 GNU/Linux

(Set the array to read-only mode; the problem was already present earlier, this is just to show that there really is no traffic heading for the array. It's not mounted, so that's not surprising.)
# mdadm -o /dev/md20

(This is much longer than the 5s bitmap flush period; in truth the array has been sitting idle for hours.)
# sleep 100

# cat /proc/mdstat
Personalities : [raid1] [raid0] [raid6] [raid5] [raid4] [linear]
md20 : active (read-only) raid5 sdj1[1] sdl1[3] sdm1[5] sdi1[0] sdk1[2]
      2930286592 blocks super 1.1 level 5, 256k chunk, algorithm 2 [5/5] [UUUUU]
      bitmap: 2/350 pages [8KB], 1024KB chunk, file: /boot/bitmap/md20

# mdadm -X /boot/bitmap/md20
Filename : /boot/bitmap/md20
Magic : 6d746962
Version : 4
UUID : 4d395d0f:9316a96d:f90d1b56:e8dd6a82
Events : 10703
Events Cleared : 10652
State : OK
Chunksize : 1 MB
Daemon : 5s flush period
Write Mode : Normal
Sync Size : 732571648 (698.63 GiB 750.15 GB)
Bitmap : 715402 bits (chunks), 1898 dirty (0.3%)

# xxd -a /boot/bitmap/md20
0000000: 6269 746d 0400 0000 4d39 5d0f 9316 a96d bitm....M9]....m
0000010: f90d 1b56 e8dd 6a82 9f29 0000 0000 0000 ...V..j..)......
0000020: 9c29 0000 0000 0000 0050 5457 0000 0000 .).......PTW....
0000030: 0000 0000 0000 1000 0500 0000 0000 0000 ................
0000040: 0000 0000 0000 0000 0000 0000 0000 0000 ................
*
00090b0: 0000 0000 0000 0010 0000 0000 0000 0000 ................
00090c0: 0100 0000 0000 0000 0000 0000 0000 00c0 ................
00090d0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00090e0: ffff ffff ffff ffff ffff ffff ffff ffff ................
00090f0: ffff ffff ffff ffff ffff ffff ffff ffff ................
0009100: 0000 0000 0000 0000 0000 0000 0000 0000 ................
*
000c000: 0100 0000 0000 0000 0000 0000 0000 0000 ................
000c010: 0000 0000 0000 0000 0000 0400 0000 0000 ................
000c020: 0000 0000 0000 0000 0000 0000 0040 0000 .............@..
000c030: 0000 0000 0000 0000 ffff ffff ffff ffff ................
000c040: 0100 0000 ffff ffff ffff ffff ffff ffff ................
000c050: ffff ffff ffff ffff ffff ffff ffff ffff ................
000c060: ffff ffff ffff ffff ffff ffff ffff ffff ................
000c070: ffff ffff ffff ffff ffff ffff ffff ffff ................
000c080: 0100 0000 ffff ffff ffff ffff ffff ffff ................
000c090: ffff ffff ffff ffff ffff ffff ffff ffff ................
000c0a0: ffff ffff ffff ffff ffff ffff ffff ffff ................
000c0b0: ffff ffff ffff ffff ffff ffff ffff ffff ................
000c0c0: 0100 0000 ffff ffff ffff ffff ffff ffff ................
000c0d0: ffff ffff ffff ffff ffff ffff ffff ffff ................
000c0e0: ffff ffff ffff ffff ffff ffff ffff ffff ................
000c0f0: ffff ffff ffff ffff ffff ffff ffff ffff ................
000c100: 0000 0000 0000 0000 0000 0000 0000 0000 ................
*
0015e50: 00fc ..

Note that the bitmap is refusing to clear/flush. (This unmounted array just completed a 'check' action a couple of hours ago, which ran at 5*10 MB/s for a dozen or so hours before that.)