It looks like 9 out of 10 raid-checks freeze my machine; I first noticed it via the weekly cronjob. I have now tried a resync manually:

FINE:   echo repair > /sys/block/md0/md/sync_action
FINE:   echo repair > /sys/block/md1/md/sync_action
FREEZE: echo repair > /sys/block/md2/md/sync_action

The first time this happened was after fsck/repair; a manual "raid-check" then finished without any issue, but now it reproducibly leads to a freeze.

Affected kernels:
3.14.1-200.fc20.x86_64
3.14.2-200.fc20.x86_64
3.14.3-200.fc20.x86_64

"Freeze" means: any write to the disks stalls forever. For now I have disabled the weekly cronjob because it does more harm than good.

[root@srv-rhsoft:~]$ cat /usr/local/bin/raid_status
#!/usr/bin/bash
echo "---------------------------------------------------------------------------------"
/sbin/mdadm --detail /dev/md0
echo "---------------------------------------------------------------------------------"
/sbin/mdadm --detail /dev/md1
echo "---------------------------------------------------------------------------------"
/sbin/mdadm --detail /dev/md2
echo "---------------------------------------------------------------------------------"
cat /proc/mdstat
echo "---------------------------------------------------------------------------------"

[root@srv-rhsoft:~]$ /usr/local/bin/raid_status
---------------------------------------------------------------------------------
/dev/md0:
        Version : 1.0
  Creation Time : Wed Jun  8 13:10:48 2011
     Raid Level : raid1
     Array Size : 511988 (500.07 MiB 524.28 MB)
  Used Dev Size : 511988 (500.07 MiB 524.28 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Sat May 10 00:52:20 2014
          State : clean
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

           Name : localhost.localdomain:0
           UUID : 1d691642:baed26df:1d197496:4fb00ff8
         Events : 660

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       5       8       33        1      active sync   /dev/sdc1
       4       8        1        2      active sync   /dev/sda1
       3       8       17        3      active sync   /dev/sdb1
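Since a frozen repair leaves no chance to react, one workaround idea is to refuse to start a new action while the kernel reports one in flight. This is only a sketch: the `mdstat_state` helper, its `busy`/`idle` output, and the hard-coded md2 path are my own invention, not part of mdadm or the raid-check script; the sysfs file and /proc/mdstat are the real kernel interfaces.

```shell
#!/usr/bin/bash
# Sketch of a guarded check: inspect /proc/mdstat text before writing
# to sync_action, instead of firing "echo repair" blindly.

# Prints "busy" if the given mdstat text shows a running sync operation,
# otherwise "idle". (Helper name and output format are hypothetical.)
mdstat_state() {
    local text=$1
    if printf '%s\n' "$text" | grep -Eq '(resync|recovery|check|repair)'; then
        echo busy
    else
        echo idle
    fi
}

# Only start a check when nothing is already running (assumes /dev/md2).
maybe_check() {
    if [ "$(mdstat_state "$(cat /proc/mdstat)")" = idle ]; then
        echo check > /sys/block/md2/md/sync_action
    fi
}
```

Writing `idle` to the same sync_action file should abort a running check, though on an affected kernel a frozen array may no longer honor that either.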
---------------------------------------------------------------------------------
/dev/md1:
        Version : 1.1
  Creation Time : Wed Jun  8 13:10:52 2011
     Raid Level : raid10
     Array Size : 30716928 (29.29 GiB 31.45 GB)
  Used Dev Size : 15358464 (14.65 GiB 15.73 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sat May 10 00:59:57 2014
          State : active
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : localhost.localdomain:1
           UUID : b7475879:c95d9a47:c5043c02:0c5ae720
         Events : 12429

    Number   Major   Minor   RaidDevice State
       0       8       50        0      active sync set-A   /dev/sdd2
       5       8       34        1      active sync set-B   /dev/sdc2
       4       8        2        2      active sync set-A   /dev/sda2
       3       8       18        3      active sync set-B   /dev/sdb2
---------------------------------------------------------------------------------
/dev/md2:
        Version : 1.1
  Creation Time : Wed Jun  8 13:10:56 2011
     Raid Level : raid10
     Array Size : 3875222528 (3695.70 GiB 3968.23 GB)
  Used Dev Size : 1937611264 (1847.85 GiB 1984.11 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sat May 10 00:59:50 2014
          State : active
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : localhost.localdomain:2
           UUID : ea253255:cb915401:f32794ad:ce0fe396
         Events : 500461

    Number   Major   Minor   RaidDevice State
       0       8       51        0      active sync set-A   /dev/sdd3
       5       8       35        1      active sync set-B   /dev/sdc3
       4       8        3        2      active sync set-A   /dev/sda3
       3       8       19        3      active sync set-B   /dev/sdb3
---------------------------------------------------------------------------------
Personalities : [raid1] [raid10]
md2 : active raid10 sdd3[0] sda3[4] sdc3[5] sdb3[3]
      3875222528 blocks super 1.1 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 1/29 pages [4KB], 65536KB chunk

md1 : active raid10 sda2[4] sdd2[0] sdc2[5] sdb2[3]
      30716928 blocks super 1.1 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md0 : active raid1 sdb1[3] sda1[4] sdd1[0] sdc1[5]
      511988 blocks super 1.0 [4/4] [UUUU]

unused devices: <none>
---------------------------------------------------------------------------------
Observed the same problem on my 32-bit Dell machine with vmlinuz-3.13.10-200.fc20.i686+PAE running RAID1 with md0 and md1. The snapshot of /var/log/messages was taken after the system locked up.

May 11 01:39:20 spruce kernel: [695803.519110] raid-check invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
May 11 01:39:20 spruce kernel: [695803.519121] raid-check cpuset=/ mems_allowed=0
May 11 01:39:20 spruce kernel: [695803.519130] CPU: 0 PID: 11135 Comm: raid-check Not tainted 3.13.10-200.fc20.i686+PAE #1
May 11 01:39:20 spruce kernel: [695803.519134] Hardware name: Dell Inc. Dell DM051 /0HJ054, BIOS A05 03/31/2006
May 11 01:39:20 spruce kernel: [695803.519138] 00000000 00000000 c0111da4 c09c3fab e88ec380 c0111dfc c09c1d81 c0b7720c
May 11 01:39:20 spruce kernel: [695803.519153] e88ec5b8 003000d0 00000001 00000000 c043c603 00000015 c0111e0c c09ca45c
May 11 01:39:20 spruce kernel: [695803.519165] ffffffff 00000000 00000206 c0111dfc c069f783 003000d0 00000001 00000000
May 11 01:39:20 spruce kernel: [695803.519178] Call Trace:
May 11 01:39:20 spruce kernel: [695803.519194] [<c09c3fab>] dump_stack+0x41/0x52
May 11 01:39:20 spruce kernel: [695803.519205] [<c09c1d81>] dump_header.isra.10+0x73/0x1bc
May 11 01:39:20 spruce kernel: [695803.519216] [<c043c603>] ? native_apic_icr_read+0x23/0x40
May 11 01:39:20 spruce kernel: [695803.519225] [<c09ca45c>] ? apic_timer_interrupt+0x34/0x3c
May 11 01:39:20 spruce kernel: [695803.519235] [<c069f783>] ? ___ratelimit+0x83/0xf0
May 11 01:39:20 spruce kernel: [695803.519244] [<c051a47b>] oom_kill_process+0x16b/0x2b0
May 11 01:39:20 spruce kernel: [695803.519252] [<c063dedc>] ? security_capable_noaudit+0x1c/0x30
May 11 01:39:20 spruce kernel: [695803.519260] [<c045e65a>] ? has_capability_noaudit+0x1a/0x30
May 11 01:39:20 spruce kernel: [695803.519268] [<c051a9ab>] out_of_memory+0x22b/0x260
May 11 01:39:20 spruce kernel: [695803.519276] [<c051f000>] __alloc_pages_nodemask+0x830/0x8b0
May 11 01:39:20 spruce kernel: [695803.519285] [<c04541f2>] copy_process.part.36+0xd2/0x1090
May 11 01:39:20 spruce kernel: [695803.519292] [<c045533a>] do_fork+0x9a/0x340
May 11 01:39:20 spruce kernel: [695803.519298] [<c04556a5>] SyS_clone+0x25/0x30
May 11 01:39:20 spruce kernel: [695803.519305] [<c09d118d>] sysenter_do_call+0x12/0x28
May 11 01:39:20 spruce kernel: [695803.519309] Mem-Info:
May 11 01:39:20 spruce kernel: [695803.519312] DMA per-cpu:
May 11 01:39:20 spruce kernel: [695803.519316] CPU    0: hi:    0, btch:   1 usd:   0
May 11 01:39:20 spruce kernel: [695803.519320] CPU    1: hi:    0, btch:   1 usd:   0
May 11 01:39:20 spruce kernel: [695803.519323] Normal per-cpu:
May 11 01:39:20 spruce kernel: [695803.519327] CPU    0: hi:  186, btch:  31 usd:   0
May 11 01:39:20 spruce kernel: [695803.519331] CPU    1: hi:  186, btch:  31 usd: 200
May 11 01:39:20 spruce kernel: [695803.519334] HighMem per-cpu:
May 11 01:39:20 spruce kernel: [695803.519337] CPU    0: hi:  186, btch:  31 usd:   0
May 11 01:39:20 spruce kernel: [695803.519341] CPU    1: hi:  186, btch:  31 usd:  30
May 11 01:39:20 spruce kernel: [695803.519349] active_anon:74319 inactive_anon:91025 isolated_anon:0
May 11 01:39:20 spruce kernel: [695803.519349]  active_file:8554 inactive_file:75918 isolated_file:2
May 11 01:39:20 spruce kernel: [695803.519349]  unevictable:0 dirty:4 writeback:22 unstable:0
May 11 01:39:20 spruce kernel: [695803.519349]  free:320406 slab_reclaimable:4991 slab_unreclaimable:15137
May 11 01:39:20 spruce kernel: [695803.519349]  mapped:8740 shmem:18291 pagetables:1861 bounce:0
May 11 01:39:20 spruce kernel: [695803.519349]  free_cma:0
May 11 01:39:20 spruce kernel: [695803.519363] DMA free:3368kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15996kB managed:5840kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:120kB slab_unreclaimable:1916kB kernel_stack:8kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
May 11 01:39:20 spruce kernel: [695803.519365] lowmem_reserve[]: 0 839 3015 3015
May 11 01:39:20 spruce kernel: [695803.519376] Normal free:27460kB min:3692kB low:4612kB high:5536kB active_anon:15004kB inactive_anon:16744kB active_file:244kB inactive_file:344kB unevictable:0kB isolated(anon):0kB isolated(file):8kB present:890872kB managed:859152kB mlocked:0kB dirty:0kB writeback:0kB mapped:24kB shmem:16828kB slab_reclaimable:19844kB slab_unreclaimable:58632kB kernel_stack:1944kB pagetables:688kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
May 11 01:39:20 spruce kernel: [695803.519379] lowmem_reserve[]: 0 0 17412 17412
May 11 01:39:20 spruce kernel: [695803.519389] HighMem free:1250796kB min:512kB low:2904kB high:5300kB active_anon:282272kB inactive_anon:347356kB active_file:33972kB inactive_file:303328kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2228776kB managed:2228776kB mlocked:0kB dirty:16kB writeback:88kB mapped:34936kB shmem:56336kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:6756kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
May 11 01:39:20 spruce kernel: [695803.519391] lowmem_reserve[]: 0 0 0 0
May 11 01:39:20 spruce kernel: [695803.519396] DMA: 154*4kB (EM) 90*8kB (M) 1*16kB (R) 1*32kB (R) 1*64kB (R) 1*128kB (R) 1*256kB (R) 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB = 3368kB
May 11 01:39:20 spruce kernel: [695803.519416] Normal: 6671*4kB (M) 15*8kB (MR) 9*16kB (R) 2*32kB (R) 0*64kB 1*128kB (R) 1*256kB (R) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 27396kB
May 11 01:39:20 spruce kernel: [695803.519433] HighMem: 30*4kB (UM) 10596*8kB (UM) 3631*16kB (UM) 9241*32kB (UM) 5652*64kB (UM) 2577*128kB (M) 441*256kB (M) 15*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 1250856kB
May 11 01:39:20 spruce kernel: [695803.519454] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
May 11 01:39:20 spruce kernel: [695803.519456] 106670 total pagecache pages
May 11 01:39:20 spruce kernel: [695803.519459] 3917 pages in swap cache
May 11 01:39:20 spruce kernel: [695803.519462] Swap cache stats: add 35963, delete 32046, find 3612812/3613910
May 11 01:39:20 spruce kernel: [695803.519464] Free swap  = 2990240kB
May 11 01:39:20 spruce kernel: [695803.519466] Total swap = 3080188kB
May 11 01:39:20 spruce kernel: [695803.519468] 783911 pages RAM
May 11 01:39:20 spruce kernel: [695803.519470] 557194 pages HighMem/MovableOnly
May 11 01:39:20 spruce kernel: [695803.519472] 0 pages reserved
May 11 01:39:20 spruce kernel: [695803.519474] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
May 11 01:39:20 spruce kernel: [695803.519486] [  510]     0   510    15389     6775      30       42             0 systemd-journal
May 11 01:39:20 spruce kernel: [695803.519490] [  518]     0   518     3823      103       7      229             0 lvmetad
May 11 01:39:20 spruce kernel: [695803.519494] [  525]     0   525     3493      166      10      354         -1000 systemd-udevd
May 11 01:39:20 spruce kernel: [695803.519500] [  634]     0   634     3628      181       6       60         -1000 auditd
yada...yada...yada....
May 11 01:39:20 spruce kernel: [695803.519797] [11129]     0 11129     2288      418       9      101             0 crond
May 11 01:39:20 spruce kernel: [695803.519800] [11135]     0 11135     1642      312       8        0             0 raid-check
May 11 01:39:20 spruce kernel: [695803.519804] [11214]   989 11214    11902     9818      35        0             0 BackupPC_dump
May 11 01:39:20 spruce kernel: [695803.519808] [11215]   989 11215    12915    10212      38        0             0 BackupPC_dump
May 11 01:39:20 spruce kernel: [695803.519811] [11216]   989 11216    13015    10925      40        0             0 BackupPC_dump
May 11 01:39:20 spruce kernel: [695803.519815] [11242]   989 11242     7462     5362      25        0             0 BackupPC_dump
May 11 01:39:20 spruce kernel: [695803.519818] [12411]   989 12411     9531     7522      31        0             0 BackupPC_dump
May 11 01:39:20 spruce kernel: [695803.519822] [12467]   989 12467    11484     9203      36        0             0 BackupPC_dump
May 11 01:39:20 spruce kernel: [695803.519826] [12520]   989 12520    14602    11610      41        0             0 BackupPC_dump
May 11 01:39:20 spruce kernel: [695803.519830] [15611]     0 15611     2288      438       9      101             0 crond
May 11 01:39:20 spruce kernel: [695803.519833] [15613]     0 15613     1641      283       8        0             0 sh
May 11 01:39:20 spruce kernel: [695803.519837] [15662]     0 15662     1375      115       7        0             0 sleep
May 11 01:39:20 spruce kernel: [695803.519840] Out of memory: Kill process 767 (polkitd) score 32 or sacrifice child
May 11 01:39:20 spruce kernel: [695803.519845] Killed process 767 (polkitd) total-vm:267572kB, anon-rss:196552kB, file-rss:2560kB
May 11 01:39:20 spruce kernel: raid-check invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
May 11 01:39:21 spruce kernel: raid-check cpuset=/ mems_allowed=0
May 11 01:39:21 spruce systemd: polkit.service: main process exited, code=killed, status=9/KILL
May 11 01:39:23 spruce systemd: Unit polkit.service entered failed state.
May 11 01:39:26 spruce gnome-session: PolicyKit daemon disconnected from the bus.
May 11 01:39:29 spruce gnome-session: We are no longer a registered authentication agent.
May 11 01:39:29 spruce kernel: CPU: 0 PID: 11135 Comm: raid-check Not tainted 3.13.10-200.fc20.i686+PAE #1
May 11 01:39:30 spruce kernel: Hardware name: Dell Inc. Dell DM051 /0HJ054, BIOS A05 03/31/2006
May 11 01:39:30 spruce kernel: 00000000 00000000 c0111da4 c09c3fab e88ec380 c0111dfc c09c1d81 c0b7720c
May 11 01:39:31 spruce kernel: e88ec5b8 003000d0 00000001 00000000 c043c603 00000015 c0111e0c c09ca45c
May 11 01:39:31 spruce kernel: ffffffff 00000000 00000206 c0111dfc c069f783 003000d0 00000001 00000000
May 11 01:39:31 spruce kernel: Call Trace:
May 11 01:39:32 spruce kernel: [<c09c3fab>] dump_stack+0x41/0x52
May 11 01:39:32 spruce kernel: [<c09c1d81>] dump_header.isra.10+0x73/0x1bc
May 11 01:39:33 spruce kernel: [<c043c603>] ? native_apic_icr_read+0x23/0x40
May 11 01:39:33 spruce kernel: [<c09ca45c>] ? apic_timer_interrupt+0x34/0x3c
May 11 01:39:33 spruce kernel: [<c069f783>] ? ___ratelimit+0x83/0xf0
May 11 01:39:33 spruce kernel: [<c051a47b>] oom_kill_process+0x16b/0x2b0
May 11 01:39:33 spruce kernel: [<c063dedc>] ? security_capable_noaudit+0x1c/0x30
May 11 01:39:34 spruce kernel: [<c045e65a>] ? has_capability_noaudit+0x1a/0x30
May 11 01:39:34 spruce kernel: [<c051a9ab>] out_of_memory+0x22b/0x260
May 11 01:39:34 spruce kernel: [<c051f000>] __alloc_pages_nodemask+0x830/0x8b0
May 11 01:39:35 spruce kernel: [<c04541f2>] copy_process.part.36+0xd2/0x1090
May 11 01:39:35 spruce kernel: [<c045533a>] do_fork+0x9a/0x340
May 11 01:39:35 spruce kernel: [<c04556a5>] SyS_clone+0x25/0x30
May 11 01:39:35 spruce kernel: [<c09d118d>] sysenter_do_call+0x12/0x28
May 11 01:39:36 spruce kernel: Mem-Info:
May 11 01:39:36 spruce kernel: DMA per-cpu:
May 11 01:39:36 spruce kernel: CPU    0: hi:    0, btch:   1 usd:   0
May 11 01:39:37 spruce kernel: CPU    1: hi:    0, btch:   1 usd:   0
May 11 01:39:37 spruce kernel: Normal per-cpu:
May 11 01:39:38 spruce kernel: CPU    0: hi:  186, btch:  31 usd:   0
May 11 01:39:38 spruce kernel: CPU    1: hi:  186, btch:  31 usd: 200
May 11 01:39:39 spruce kernel: HighMem per-cpu:
May 11 01:39:39 spruce kernel: CPU    0: hi:  186, btch:  31 usd:   0
May 11 01:39:39 spruce kernel: CPU    1: hi:  186, btch:  31 usd:  30
May 11 01:39:40 spruce systemd: Failed to mark scope session-12956.scope as abandoned : Stale file handle
May 11 01:39:41 spruce kernel: active_anon:74319 inactive_anon:91025 isolated_anon:0 active_file:8554 inactive_file:75918 isolated_file:2 unevictable:0 dirty:4 writeback:22 unstable:0 free:320406 slab_reclaimable:4991 slab_unreclaimable:15137 mapped:8740 shmem:18291 pagetables:1861 bounce:0 free_cma:0
May 11 01:39:44 spruce kernel: DMA free:3368kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15996kB managed:5840kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:120kB slab_unreclaimable:1916kB kernel_stack:8kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
May 11 01:39:44 spruce kernel: lowmem_reserve[]: 0 839 3015 3015
May 11 01:39:45 spruce kernel: Normal free:27460kB min:3692kB low:4612kB high:5536kB active_anon:15004kB inactive_anon:16744kB active_file:244kB inactive_file:344kB unevictable:0kB isolated(anon):0kB isolated(file):8kB present:890872kB managed:859152kB mlocked:0kB dirty:0kB writeback:0kB mapped:24kB shmem:16828kB slab_reclaimable:19844kB slab_unreclaimable:58632kB kernel_stack:1944kB pagetables:688kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
May 11 01:39:45 spruce kernel: lowmem_reserve[]: 0 0 17412 17412
May 11 01:39:45 spruce kernel: HighMem free:1250796kB min:512kB low:2904kB high:5300kB active_anon:282272kB inactive_anon:347356kB active_file:33972kB inactive_file:303328kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2228776kB managed:2228776kB mlocked:0kB dirty:16kB writeback:88kB mapped:34936kB shmem:56336kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:6756kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
May 11 01:39:45 spruce kernel: lowmem_reserve[]: 0 0 0 0
May 11 01:39:46 spruce kernel: DMA: 154*4kB (EM) 90*8kB (M) 1*16kB (R) 1*32kB (R) 1*64kB (R) 1*128kB (R) 1*256kB (R) 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB = 3368kB
May 11 01:39:46 spruce kernel: Normal: 6671*4kB (M) 15*8kB (MR) 9*16kB (R) 2*32kB (R) 0*64kB 1*128kB (R) 1*256kB (R) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 27396kB
May 11 01:39:47 spruce kernel: HighMem: 30*4kB (UM) 10596*8kB (UM) 3631*16kB (UM) 9241*32kB (UM) 5652*64kB (UM) 2577*128kB (M) 441*256kB (M) 15*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 1250856kB
May 11 01:39:47 spruce kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
May 11 01:39:47 spruce kernel: 106670 total pagecache pages
May 11 01:39:48 spruce kernel: 3917 pages in swap cache
May 11 01:39:48 spruce kernel: Swap cache stats: add 35963, delete 32046, find 3612812/3613910
May 11 01:39:49 spruce kernel: Free swap  = 2990240kB
May 11 01:39:49 spruce kernel: Total swap = 3080188kB
May 11 01:39:50 spruce kernel: 783911 pages RAM
May 11 01:39:50 spruce kernel: 557194 pages HighMem/MovableOnly
May 11 01:39:50 spruce kernel: 0 pages reserved
May 11 01:39:50 spruce kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
May 11 01:39:51 spruce kernel: [  510]     0   510    15389     6775      30       42             0 systemd-journal
May 11 01:39:51 spruce kernel: [  518]     0   518     3823      103       7      229             0 lvmetad
May 11 01:39:51 spruce kernel: [  525]     0   525     3493      166      10      354         -1000 systemd-udevd

NOTE: The RAID status below was taken following the upgrade to 3.14.3-200.fc20.i686+PAE.

[root@spruce tmp]# /usr/local/bin/raid_status
---------------------------------------------------------------------------------
/dev/md0:
        Version : 1.0
  Creation Time : Mon Oct 25 18:52:00 2010
     Raid Level : raid1
     Array Size : 511988 (500.07 MiB 524.28 MB)
  Used Dev Size : 511988 (500.07 MiB 524.28 MB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Mon May 12 22:55:19 2014
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : localhost.localdomain:0
           UUID : 8a8ef0df:e25cee96:66fcc1cf:1043d370
         Events : 1324

    Number   Major   Minor   RaidDevice State
       3       8        1        0      active sync   /dev/sda1
       2       8       17        1      active sync   /dev/sdb1
---------------------------------------------------------------------------------
/dev/md1:
        Version : 1.1
  Creation Time : Mon Oct 25 18:52:03 2010
     Raid Level : raid1
     Array Size : 976247676 (931.02 GiB 999.68 GB)
  Used Dev Size : 976247676 (931.02 GiB 999.68 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon May 12 23:20:43 2014
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : localhost.localdomain:1
           UUID : d0200041:57c65bc2:870f2f1e:b84fa712
         Events : 192907

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
---------------------------------------------------------------------------------
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      976247676 blocks super 1.1 [2/2] [UU]
      bitmap: 3/8 pages [12KB], 65536KB chunk

md0 : active raid1 sda1[3] sdb1[2]
      511988 blocks super 1.0 [2/2] [UU]

unused devices: <none>
---------------------------------------------------------------------------------
[root@spruce tmp]#
I think mine is a different issue, because any write IO blocks completely, so there is no way to get anything into the syslog. I also doubt the oom-killer would fire on a freshly booted machine with 16 GB RAM; besides, I never saw that before 3.14.
My machine is hitting the same bug: a black screen while checking the RAID devices, and nothing is logged in the journal. The RAID devices are level 1 with two or four members, no spares.

Friday, May 2: upgraded to kernel-3.14.2-200.fc20.x86_64 and booted it.
Sunday, May 4: the automatic raid-check script finished successfully.
Sunday, May 11 at 2:19 AM: the machine had no video and didn't respond to pings (it had a locked KDE session). After resetting and starting the machine again, I updated it so kernel-3.14.3-200.fc20.x86_64 was installed, and executed the raid-check script as root to check the RAID devices. In the morning the machine didn't respond to anything, no video, no pings, so I reset it again, but this time did not execute the raid-check script. Almost two days have passed and the machine is still up.
The same with 3.14.4: the first of my two machines passed the raid-check; after that I started it again on my home server, and for some hours all looked fine. Well, my home server died away in the meantime, and the second try on my office workstation also ended with a complete freeze. Please also have a look at https://bugzilla.redhat.com/show_bug.cgi?id=1092937 ("mdadm --stop" of the root device takes a loooong time), which looks more and more like a kernel issue and not dracut.
Is this a DUP of a similar issue noted in FC19: Bug 1075185 - oom-killer invoked, machine hangs on kernel 3.12 and 3.13 but not 3.11?
Per Gabriel's journal (aka "messages") and "ping" comments, the problem I am experiencing seems to be much more an incarnation of the one reported for FC19 in Bug 1075185. Of particular note is the fact that in my case, "ping" succeeds and messages are successfully written into /var/log/messages.
Well, my home server yesterday stayed pingable too, but when it died at around 90% completion of checking the 3.6 TB /dev/md2, even my SSH connection froze in the middle of typing a command. Trying another SSH connection came up with "Last login.." and anything else was dead and done :-(
Kernel 3.14.4 also seems to be unstable on these RAID machines independent of raid-check: recently, random crash number 6 on two different machines. That one took my Thunderbird profile, "happily" restored from backup and merged with recent archives. Complete freeze of the system: music-player-daemon looped the last 2 seconds of the running track forever, no response to the ACPI power button or anything else, and at boot some orphaned inodes on the ext4 level, as well as the log below. My heart stopped as this fast resync started, thinking "WTF, if that freezes again and again the machine is done":

[    1.298603] md: raid10 personality registered for level 10
[    1.299032] md/raid10:md2: not clean -- starting background reconstruction
[    1.299119] md/raid10:md2: active with 4 out of 4 devices
[    1.299347] created bitmap (29 pages) for device md2
With kernel 3.14.4 I cannot reproduce it anymore: the scheduled raid-check on Sunday finished without problems, and I executed the script two more times and it finished without problems. I had "journalctl -f -l" running in an SSH session hoping to get some info about the problem, but it didn't happen. Will try running the script tonight.
Seeing the same on two machines (one with RAID5 and the other with RAID10), both using 3.14.4. :( Trying to reboot (again and again) doesn't help. Luckily for me, both machines (still) have 3.13.5, which manages to sync with no issues. Any workaround (beyond staying with 3.13)? - Gilboa
FWIW, both machines use F20/x86_64; one has 32GB RAM and the other has 36GB RAM. (Hence I doubt I'm hitting the OOM killer during reboot, unless the MD raid-check leaks memory like crazy.) - Gilboa
https://bugzilla.redhat.com/show_bug.cgi?id=1096414
https://bugzilla.redhat.com/show_bug.cgi?id=1092937

*Both* seem to be fixed with 3.14.5-200.fc20.x86_64, although I am unable to find the relevant change in the upstream kernel changelog. However, I rebooted my workstation 40 times after the update from koji, and 4 raid-check runs passed without any freeze on two different machines. If it comes back I will re-open this bug.
Harald, Is this kernel in updates-testing or only in koji? - Gilboa
this morning it was still only on koji https://koji.fedoraproject.org/koji/packageinfo?packageID=8
Just noticed it got pushed to testing. I'll give it a test run and report back. - Gilboa
Sadly, both re-opened. It only looks good after a fresh boot, but after some hours running, the same shutdown problems appear and raid-check freezes the machine as before :-( It is 100% for sure kernel 3.14.x, because my co-developer today had the same shutdown hang in the evening: 2 shutdowns with 3.14, two hangs until SysRq+S a dozen times. The same goes for this bug: I came home 30 minutes ago and fired the first raid-check after one day running (while 3 runs worked fine directly after the upgrade), and at the same moment all IO writes stopped again :-(
The two commits below smell like they are related: https://www.kernel.org/pub/linux/kernel/v3.x/ChangeLog-3.14.6

____________________________________________________________________________________________

commit 0bc4091108e8f2e65faef3082e5261f2c35cd2b4
Author: NeilBrown <neilb>
Date:   Tue May 6 09:36:08 2014 +1000

    md: avoid possible spinning md thread at shutdown.

    commit 0f62fb220aa4ebabe8547d3a9ce4a16d3c045f21 upstream.

    If an md array with externally managed metadata (e.g. DDF or IMSM)
    is in use, then we should not set safemode==2 at shutdown because:
    1/ this is ineffective: user-space need to be involved in any
       'safemode' handling,
    2/ The safemode management code doesn't cope with safemode==2 on
       external metadata and md_check_recover enters an infinite loop.

    Even at shutdown, an infinite-looping process can be problematic,
    so this could cause shutdown to hang.

    Signed-off-by: NeilBrown <neilb>
    Signed-off-by: Greg Kroah-Hartman <gregkh>

____________________________________________________________________________________________

commit 8c7311a1c4a8d804bde91b00a2f2c1a22a954c30
Author: NeilBrown <neilb>
Date:   Mon May 5 13:34:37 2014 +1000

    md/raid10: call wait_barrier() for each request submitted.

    commit cc13b1d1500656a20e41960668f3392dda9fa6e2 upstream.

    wait_barrier() includes a counter, so we must call it precisely once
    (unless balanced by allow_barrier()) for each request submitted.

    Since commit 20d0189b1012a37d2533a87fb451f7852f2418d1
        block: Introduce new bio_split()
    in 3.14-rc1, we don't call it for the extra requests generated when
    we need to split a bio.

    When this happens the counter goes negative, any resync/recovery
    will never start, and "mdadm --stop" will hang.

    Reported-by: Chris Murphy <lists>
    Fixes: 20d0189b1012a37d2533a87fb451f7852f2418d1
    Cc: Kent Overstreet <kmo>
    Signed-off-by: NeilBrown <neilb>
    Signed-off-by: Greg Kroah-Hartman <gregkh>

____________________________________________________________________________________________
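To make the second commit concrete: wait_barrier() bumps a pending-request counter and allow_barrier() drops it, and resync can only proceed once the counter drains back to zero. A toy shell model of that imbalance (the nr_pending name and these stub functions are illustrative only, not the kernel's actual code):

```shell
#!/usr/bin/bash
# Toy model of the raid10 barrier counter: every submitted request must
# call wait_barrier (counter+1) and later allow_barrier (counter-1).
nr_pending=0
wait_barrier()  { nr_pending=$((nr_pending + 1)); }
allow_barrier() { nr_pending=$((nr_pending - 1)); }

# Correct path: one bio, one wait, one allow.
wait_barrier; allow_barrier
echo "balanced: $nr_pending"         # 0 -> resync may start

# Buggy path (pre-3.14.6): bio_split produced two requests, but only the
# original called wait_barrier, while both halves called allow_barrier.
wait_barrier                  # original bio
allow_barrier; allow_barrier  # both split halves complete
echo "after split bug: $nr_pending"  # -1 -> never drains to 0
```

With the counter stuck below zero, the "counter drained" condition never holds, which matches the symptoms here: resync never starts and "mdadm --stop" hangs.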
I can confirm 3.14.5 doesn't solve the crash-on-RAID-check problem. @Harald Reindl, thus far I have only seen this issue on RAID10 devices (and I've got a large number of servers with RAID5 and RAID6). Have you seen it on servers with other RAID levels? - Gilboa
(Sorry for the bad Piglish, I just woke up and found two dead servers.) I meant: thus far I have only seen this issue on RAID10 devices... Have you seen this on servers with other RAID levels? Sorry for the noise.
No, and it only affects *large* RAID10 volumes, although I can't say what counts as big, because I avoid any raid-check on machines not used only by me and have disabled the weekly run on every other machine. However, /dev/md2 is big enough, while /dev/md1 is fine:

/dev/md2:
        Version : 1.1
  Creation Time : Wed Jun  8 13:10:56 2011
     Raid Level : raid10
     Array Size : 3875222528 (3695.70 GiB 3968.23 GB)
  Used Dev Size : 1937611264 (1847.85 GiB 1984.11 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

I can check my 30 GB rootfs, which is also RAID10, 100 times a day and all is fine. It's really frustrating, and hopefully 3.14.6 fixes it, since it's a bad feeling in the case of remote machines.

/dev/md1:
        Version : 1.1
  Creation Time : Wed Jun  8 13:10:52 2011
     Raid Level : raid10
     Array Size : 30716928 (29.29 GiB 31.45 GB)
  Used Dev Size : 15358464 (14.65 GiB 15.73 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent
Closed again - 3.14.6 really fixed it. I moved around some TB of data over several days while repeatedly checking/resyncing the 4 TB RAID10: no freeze and no hang at shutdown.
This just froze my server last night; the last messages are from 1am, when the raid-check cron job ran. This is the first week running Fedora on this machine; it was previously running RHEL5.

Fedora release 21 (Twenty One)
3.18.5-201.fc21.i686+PAE

Feb 22 01:00:01 gserv kernel: md: data-check of RAID array md117
Feb 22 01:00:01 gserv kernel: md: minimum _guaranteed_ speed: 1000 KB/sec/disk.
Feb 22 01:00:01 gserv kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Feb 22 01:00:01 gserv kernel: md: using 128k window, over a total of 24418688k.
Feb 22 01:00:07 gserv kernel: md: delaying data-check of md120 until md117 has finished (they share one or more physical units)
Feb 22 01:00:13 gserv kernel: md: delaying data-check of md125 until md117 has finished (they share one or more physical units)
-- Reboot --
Feb 22 05:13:50 gserv systemd-journal[122]: Runtime journal is using 8.0M (max allowed 404.7M, trying to leave 607.1M free of 3.9G available → current limit 404.7M).