Bug 1096414 - raid-check with 3.14 freezes machine
Summary: raid-check with 3.14 freezes machine
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 20
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: LVM and device-mapper development team
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2014-05-09 23:02 UTC by Harald Reindl
Modified: 2015-02-22 13:18 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-06-12 17:50:24 UTC
Type: Bug
Embargoed:


Attachments: none

Description Harald Reindl 2014-05-09 23:02:40 UTC
It looks like 9 out of 10 raid-checks freeze my machine;
I first noticed it with the weekly cronjob.


I now tried a manual resync:

FINE:   echo repair >/sys/block/md0/md/sync_action
FINE:   echo repair >/sys/block/md1/md/sync_action
FREEZE: echo repair >/sys/block/md2/md/sync_action

The first time this happened, a manual "raid-check" after fsck/repair finished without any issue, but now it reproducibly leads to a freeze.

Affected kernels:
3.14.1-200.fc20.x86_64
3.14.2-200.fc20.x86_64
3.14.3-200.fc20.x86_64

"Freeze" means any write to the disks stalls forever.
For now I have disabled the weekly cronjob because it does more harm than good.
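
(For reference, a minimal sketch of how a running scrub can be inspected or cancelled through the md sysfs interface, and how the weekly job can be switched off. The /etc/sysconfig path and ENABLED variable are assumed from the Fedora mdadm packaging and may differ between releases.)

# Show what each array is currently doing (idle, check, repair, resync)
for md in /sys/block/md*/md; do
    echo "${md}: $(cat "${md}/sync_action")"
done

# Progress of a running scrub on md2
cat /proc/mdstat
cat /sys/block/md2/md/sync_completed

# Cancel a running check/repair on md2 (may not help once writes are already stuck)
echo idle > /sys/block/md2/md/sync_action

# Turn off the scheduled weekly scrub (assumed Fedora raid-check config file)
sed -i 's/^ENABLED=.*/ENABLED=no/' /etc/sysconfig/raid-check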

[root@srv-rhsoft:~]$ cat /usr/local/bin/raid_status
#!/usr/bin/bash

echo "---------------------------------------------------------------------------------"
/sbin/mdadm --detail /dev/md0
echo "---------------------------------------------------------------------------------"
/sbin/mdadm --detail /dev/md1
echo "---------------------------------------------------------------------------------"
/sbin/mdadm --detail /dev/md2
echo "---------------------------------------------------------------------------------"
cat /proc/mdstat
echo "---------------------------------------------------------------------------------"
[root@srv-rhsoft:~]$ /usr/local/bin/raid_status
---------------------------------------------------------------------------------
/dev/md0:
        Version : 1.0
  Creation Time : Wed Jun  8 13:10:48 2011
     Raid Level : raid1
     Array Size : 511988 (500.07 MiB 524.28 MB)
  Used Dev Size : 511988 (500.07 MiB 524.28 MB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

    Update Time : Sat May 10 00:52:20 2014
          State : clean 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

           Name : localhost.localdomain:0
           UUID : 1d691642:baed26df:1d197496:4fb00ff8
         Events : 660

    Number   Major   Minor   RaidDevice State
       0       8       49        0      active sync   /dev/sdd1
       5       8       33        1      active sync   /dev/sdc1
       4       8        1        2      active sync   /dev/sda1
       3       8       17        3      active sync   /dev/sdb1
---------------------------------------------------------------------------------
/dev/md1:
        Version : 1.1
  Creation Time : Wed Jun  8 13:10:52 2011
     Raid Level : raid10
     Array Size : 30716928 (29.29 GiB 31.45 GB)
  Used Dev Size : 15358464 (14.65 GiB 15.73 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sat May 10 00:59:57 2014
          State : active 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : localhost.localdomain:1
           UUID : b7475879:c95d9a47:c5043c02:0c5ae720
         Events : 12429

    Number   Major   Minor   RaidDevice State
       0       8       50        0      active sync set-A   /dev/sdd2
       5       8       34        1      active sync set-B   /dev/sdc2
       4       8        2        2      active sync set-A   /dev/sda2
       3       8       18        3      active sync set-B   /dev/sdb2
---------------------------------------------------------------------------------
/dev/md2:
        Version : 1.1
  Creation Time : Wed Jun  8 13:10:56 2011
     Raid Level : raid10
     Array Size : 3875222528 (3695.70 GiB 3968.23 GB)
  Used Dev Size : 1937611264 (1847.85 GiB 1984.11 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Sat May 10 00:59:50 2014
          State : active 
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2
     Chunk Size : 512K

           Name : localhost.localdomain:2
           UUID : ea253255:cb915401:f32794ad:ce0fe396
         Events : 500461

    Number   Major   Minor   RaidDevice State
       0       8       51        0      active sync set-A   /dev/sdd3
       5       8       35        1      active sync set-B   /dev/sdc3
       4       8        3        2      active sync set-A   /dev/sda3
       3       8       19        3      active sync set-B   /dev/sdb3
---------------------------------------------------------------------------------
Personalities : [raid1] [raid10] 
md2 : active raid10 sdd3[0] sda3[4] sdc3[5] sdb3[3]
      3875222528 blocks super 1.1 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 1/29 pages [4KB], 65536KB chunk

md1 : active raid10 sda2[4] sdd2[0] sdc2[5] sdb2[3]
      30716928 blocks super 1.1 512K chunks 2 near-copies [4/4] [UUUU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md0 : active raid1 sdb1[3] sda1[4] sdd1[0] sdc1[5]
      511988 blocks super 1.0 [4/4] [UUUU]
      
unused devices: <none>
---------------------------------------------------------------------------------

Comment 1 Harry Rarig 2014-05-13 03:26:07 UTC
Observed the same problem on my 32-bit DELL machine with vmlinuz-3.13.10-200.fc20.i686+PAE running RAID1 with md0 and md1.  The snapshot of /var/log/messages was taken after the system locked up.

May 11 01:39:20 spruce kernel: [695803.519110] raid-check invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
May 11 01:39:20 spruce kernel: [695803.519121] raid-check cpuset=/ mems_allowed=0
May 11 01:39:20 spruce kernel: [695803.519130] CPU: 0 PID: 11135 Comm: raid-check Not tainted 3.13.10-200.fc20.i686+PAE #1
May 11 01:39:20 spruce kernel: [695803.519134] Hardware name: Dell Inc.                 Dell DM051                   /0HJ054, BIOS A05 03/31/2006
May 11 01:39:20 spruce kernel: [695803.519138]  00000000 00000000 c0111da4 c09c3fab e88ec380 c0111dfc c09c1d81 c0b7720c
May 11 01:39:20 spruce kernel: [695803.519153]  e88ec5b8 003000d0 00000001 00000000 c043c603 00000015 c0111e0c c09ca45c
May 11 01:39:20 spruce kernel: [695803.519165]  ffffffff 00000000 00000206 c0111dfc c069f783 003000d0 00000001 00000000
May 11 01:39:20 spruce kernel: [695803.519178] Call Trace:
May 11 01:39:20 spruce kernel: [695803.519194]  [<c09c3fab>] dump_stack+0x41/0x52
May 11 01:39:20 spruce kernel: [695803.519205]  [<c09c1d81>] dump_header.isra.10+0x73/0x1bc
May 11 01:39:20 spruce kernel: [695803.519216]  [<c043c603>] ? native_apic_icr_read+0x23/0x40
May 11 01:39:20 spruce kernel: [695803.519225]  [<c09ca45c>] ? apic_timer_interrupt+0x34/0x3c
May 11 01:39:20 spruce kernel: [695803.519235]  [<c069f783>] ? ___ratelimit+0x83/0xf0
May 11 01:39:20 spruce kernel: [695803.519244]  [<c051a47b>] oom_kill_process+0x16b/0x2b0
May 11 01:39:20 spruce kernel: [695803.519252]  [<c063dedc>] ? security_capable_noaudit+0x1c/0x30
May 11 01:39:20 spruce kernel: [695803.519260]  [<c045e65a>] ? has_capability_noaudit+0x1a/0x30
May 11 01:39:20 spruce kernel: [695803.519268]  [<c051a9ab>] out_of_memory+0x22b/0x260
May 11 01:39:20 spruce kernel: [695803.519276]  [<c051f000>] __alloc_pages_nodemask+0x830/0x8b0
May 11 01:39:20 spruce kernel: [695803.519285]  [<c04541f2>] copy_process.part.36+0xd2/0x1090
May 11 01:39:20 spruce kernel: [695803.519292]  [<c045533a>] do_fork+0x9a/0x340
May 11 01:39:20 spruce kernel: [695803.519298]  [<c04556a5>] SyS_clone+0x25/0x30
May 11 01:39:20 spruce kernel: [695803.519305]  [<c09d118d>] sysenter_do_call+0x12/0x28
May 11 01:39:20 spruce kernel: [695803.519309] Mem-Info:
May 11 01:39:20 spruce kernel: [695803.519312] DMA per-cpu:
May 11 01:39:20 spruce kernel: [695803.519316] CPU    0: hi:    0, btch:   1 usd:   0
May 11 01:39:20 spruce kernel: [695803.519320] CPU    1: hi:    0, btch:   1 usd:   0
May 11 01:39:20 spruce kernel: [695803.519323] Normal per-cpu:
May 11 01:39:20 spruce kernel: [695803.519327] CPU    0: hi:  186, btch:  31 usd:   0
May 11 01:39:20 spruce kernel: [695803.519331] CPU    1: hi:  186, btch:  31 usd: 200
May 11 01:39:20 spruce kernel: [695803.519334] HighMem per-cpu:
May 11 01:39:20 spruce kernel: [695803.519337] CPU    0: hi:  186, btch:  31 usd:   0
May 11 01:39:20 spruce kernel: [695803.519341] CPU    1: hi:  186, btch:  31 usd:  30
May 11 01:39:20 spruce kernel: [695803.519349] active_anon:74319 inactive_anon:91025 isolated_anon:0
May 11 01:39:20 spruce kernel: [695803.519349]  active_file:8554 inactive_file:75918 isolated_file:2
May 11 01:39:20 spruce kernel: [695803.519349]  unevictable:0 dirty:4 writeback:22 unstable:0
May 11 01:39:20 spruce kernel: [695803.519349]  free:320406 slab_reclaimable:4991 slab_unreclaimable:15137
May 11 01:39:20 spruce kernel: [695803.519349]  mapped:8740 shmem:18291 pagetables:1861 bounce:0
May 11 01:39:20 spruce kernel: [695803.519349]  free_cma:0
May 11 01:39:20 spruce kernel: [695803.519363] DMA free:3368kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15996kB managed:5840kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:120kB slab_unreclaimable:1916kB kernel_stack:8kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
May 11 01:39:20 spruce kernel: [695803.519365] lowmem_reserve[]: 0 839 3015 3015
May 11 01:39:20 spruce kernel: [695803.519376] Normal free:27460kB min:3692kB low:4612kB high:5536kB active_anon:15004kB inactive_anon:16744kB active_file:244kB inactive_file:344kB unevictable:0kB isolated(anon):0kB isolated(file):8kB present:890872kB managed:859152kB mlocked:0kB dirty:0kB writeback:0kB mapped:24kB shmem:16828kB slab_reclaimable:19844kB slab_unreclaimable:58632kB kernel_stack:1944kB pagetables:688kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
May 11 01:39:20 spruce kernel: [695803.519379] lowmem_reserve[]: 0 0 17412 17412
May 11 01:39:20 spruce kernel: [695803.519389] HighMem free:1250796kB min:512kB low:2904kB high:5300kB active_anon:282272kB inactive_anon:347356kB active_file:33972kB inactive_file:303328kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2228776kB managed:2228776kB mlocked:0kB dirty:16kB writeback:88kB mapped:34936kB shmem:56336kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:6756kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
May 11 01:39:20 spruce kernel: [695803.519391] lowmem_reserve[]: 0 0 0 0
May 11 01:39:20 spruce kernel: [695803.519396] DMA: 154*4kB (EM) 90*8kB (M) 1*16kB (R) 1*32kB (R) 1*64kB (R) 1*128kB (R) 1*256kB (R) 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB = 3368kB
May 11 01:39:20 spruce kernel: [695803.519416] Normal: 6671*4kB (M) 15*8kB (MR) 9*16kB (R) 2*32kB (R) 0*64kB 1*128kB (R) 1*256kB (R) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 27396kB
May 11 01:39:20 spruce kernel: [695803.519433] HighMem: 30*4kB (UM) 10596*8kB (UM) 3631*16kB (UM) 9241*32kB (UM) 5652*64kB (UM) 2577*128kB (M) 441*256kB (M) 15*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 1250856kB
May 11 01:39:20 spruce kernel: [695803.519454] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
May 11 01:39:20 spruce kernel: [695803.519456] 106670 total pagecache pages
May 11 01:39:20 spruce kernel: [695803.519459] 3917 pages in swap cache
May 11 01:39:20 spruce kernel: [695803.519462] Swap cache stats: add 35963, delete 32046, find 3612812/3613910
May 11 01:39:20 spruce kernel: [695803.519464] Free swap  = 2990240kB
May 11 01:39:20 spruce kernel: [695803.519466] Total swap = 3080188kB
May 11 01:39:20 spruce kernel: [695803.519468] 783911 pages RAM
May 11 01:39:20 spruce kernel: [695803.519470] 557194 pages HighMem/MovableOnly
May 11 01:39:20 spruce kernel: [695803.519472] 0 pages reserved
May 11 01:39:20 spruce kernel: [695803.519474] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
May 11 01:39:20 spruce kernel: [695803.519486] [  510]     0   510    15389     6775      30       42             0 systemd-journal
May 11 01:39:20 spruce kernel: [695803.519490] [  518]     0   518     3823      103       7      229             0 lvmetad
May 11 01:39:20 spruce kernel: [695803.519494] [  525]     0   525     3493      166      10      354         -1000 systemd-udevd
May 11 01:39:20 spruce kernel: [695803.519500] [  634]     0   634     3628      181       6       60         -1000 auditd

yada...yada...yada....

May 11 01:39:20 spruce kernel: [695803.519797] [11129]     0 11129     2288      418       9      101             0 crond
May 11 01:39:20 spruce kernel: [695803.519800] [11135]     0 11135     1642      312       8        0             0 raid-check
May 11 01:39:20 spruce kernel: [695803.519804] [11214]   989 11214    11902     9818      35        0             0 BackupPC_dump
May 11 01:39:20 spruce kernel: [695803.519808] [11215]   989 11215    12915    10212      38        0             0 BackupPC_dump
May 11 01:39:20 spruce kernel: [695803.519811] [11216]   989 11216    13015    10925      40        0             0 BackupPC_dump
May 11 01:39:20 spruce kernel: [695803.519815] [11242]   989 11242     7462     5362      25        0             0 BackupPC_dump
May 11 01:39:20 spruce kernel: [695803.519818] [12411]   989 12411     9531     7522      31        0             0 BackupPC_dump
May 11 01:39:20 spruce kernel: [695803.519822] [12467]   989 12467    11484     9203      36        0             0 BackupPC_dump
May 11 01:39:20 spruce kernel: [695803.519826] [12520]   989 12520    14602    11610      41        0             0 BackupPC_dump
May 11 01:39:20 spruce kernel: [695803.519830] [15611]     0 15611     2288      438       9      101             0 crond
May 11 01:39:20 spruce kernel: [695803.519833] [15613]     0 15613     1641      283       8        0             0 sh
May 11 01:39:20 spruce kernel: [695803.519837] [15662]     0 15662     1375      115       7        0             0 sleep
May 11 01:39:20 spruce kernel: [695803.519840] Out of memory: Kill process 767 (polkitd) score 32 or sacrifice child
May 11 01:39:20 spruce kernel: [695803.519845] Killed process 767 (polkitd) total-vm:267572kB, anon-rss:196552kB, file-rss:2560kB
May 11 01:39:20 spruce kernel: raid-check invoked oom-killer: gfp_mask=0x3000d0, order=1, oom_score_adj=0
May 11 01:39:21 spruce kernel: raid-check cpuset=/ mems_allowed=0
May 11 01:39:21 spruce systemd: polkit.service: main process exited, code=killed, status=9/KILL
May 11 01:39:23 spruce systemd: Unit polkit.service entered failed state.
May 11 01:39:26 spruce gnome-session: PolicyKit daemon disconnected from the bus.
May 11 01:39:29 spruce gnome-session: We are no longer a registered authentication agent.
May 11 01:39:29 spruce kernel: CPU: 0 PID: 11135 Comm: raid-check Not tainted 3.13.10-200.fc20.i686+PAE #1
May 11 01:39:30 spruce kernel: Hardware name: Dell Inc.                 Dell DM051                   /0HJ054, BIOS A05 03/31/2006
May 11 01:39:30 spruce kernel: 00000000 00000000 c0111da4 c09c3fab e88ec380 c0111dfc c09c1d81 c0b7720c
May 11 01:39:31 spruce kernel: e88ec5b8 003000d0 00000001 00000000 c043c603 00000015 c0111e0c c09ca45c
May 11 01:39:31 spruce kernel: ffffffff 00000000 00000206 c0111dfc c069f783 003000d0 00000001 00000000
May 11 01:39:31 spruce kernel: Call Trace:
May 11 01:39:32 spruce kernel: [<c09c3fab>] dump_stack+0x41/0x52
May 11 01:39:32 spruce kernel: [<c09c1d81>] dump_header.isra.10+0x73/0x1bc
May 11 01:39:33 spruce kernel: [<c043c603>] ? native_apic_icr_read+0x23/0x40
May 11 01:39:33 spruce kernel: [<c09ca45c>] ? apic_timer_interrupt+0x34/0x3c
May 11 01:39:33 spruce kernel: [<c069f783>] ? ___ratelimit+0x83/0xf0
May 11 01:39:33 spruce kernel: [<c051a47b>] oom_kill_process+0x16b/0x2b0
May 11 01:39:33 spruce kernel: [<c063dedc>] ? security_capable_noaudit+0x1c/0x30
May 11 01:39:34 spruce kernel: [<c045e65a>] ? has_capability_noaudit+0x1a/0x30
May 11 01:39:34 spruce kernel: [<c051a9ab>] out_of_memory+0x22b/0x260
May 11 01:39:34 spruce kernel: [<c051f000>] __alloc_pages_nodemask+0x830/0x8b0
May 11 01:39:35 spruce kernel: [<c04541f2>] copy_process.part.36+0xd2/0x1090
May 11 01:39:35 spruce kernel: [<c045533a>] do_fork+0x9a/0x340
May 11 01:39:35 spruce kernel: [<c04556a5>] SyS_clone+0x25/0x30
May 11 01:39:35 spruce kernel: [<c09d118d>] sysenter_do_call+0x12/0x28
May 11 01:39:36 spruce kernel: Mem-Info:
May 11 01:39:36 spruce kernel: DMA per-cpu:
May 11 01:39:36 spruce kernel: CPU    0: hi:    0, btch:   1 usd:   0
May 11 01:39:37 spruce kernel: CPU    1: hi:    0, btch:   1 usd:   0
May 11 01:39:37 spruce kernel: Normal per-cpu:
May 11 01:39:38 spruce kernel: CPU    0: hi:  186, btch:  31 usd:   0
May 11 01:39:38 spruce kernel: CPU    1: hi:  186, btch:  31 usd: 200
May 11 01:39:39 spruce kernel: HighMem per-cpu:
May 11 01:39:39 spruce kernel: CPU    0: hi:  186, btch:  31 usd:   0
May 11 01:39:39 spruce kernel: CPU    1: hi:  186, btch:  31 usd:  30
May 11 01:39:40 spruce systemd: Failed to mark scope session-12956.scope as abandoned : Stale file handle
May 11 01:39:41 spruce kernel: active_anon:74319 inactive_anon:91025 isolated_anon:0
 active_file:8554 inactive_file:75918 isolated_file:2
 unevictable:0 dirty:4 writeback:22 unstable:0
 free:320406 slab_reclaimable:4991 slab_unreclaimable:15137
 mapped:8740 shmem:18291 pagetables:1861 bounce:0
 free_cma:0
May 11 01:39:44 spruce kernel: DMA free:3368kB min:20kB low:24kB high:28kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15996kB managed:5840kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:120kB slab_unreclaimable:1916kB kernel_stack:8kB pagetables:0kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
May 11 01:39:44 spruce kernel: lowmem_reserve[]: 0 839 3015 3015
May 11 01:39:45 spruce kernel: Normal free:27460kB min:3692kB low:4612kB high:5536kB active_anon:15004kB inactive_anon:16744kB active_file:244kB inactive_file:344kB unevictable:0kB isolated(anon):0kB isolated(file):8kB present:890872kB managed:859152kB mlocked:0kB dirty:0kB writeback:0kB mapped:24kB shmem:16828kB slab_reclaimable:19844kB slab_unreclaimable:58632kB kernel_stack:1944kB pagetables:688kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
May 11 01:39:45 spruce kernel: lowmem_reserve[]: 0 0 17412 17412
May 11 01:39:45 spruce kernel: HighMem free:1250796kB min:512kB low:2904kB high:5300kB active_anon:282272kB inactive_anon:347356kB active_file:33972kB inactive_file:303328kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2228776kB managed:2228776kB mlocked:0kB dirty:16kB writeback:88kB mapped:34936kB shmem:56336kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:6756kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
May 11 01:39:45 spruce kernel: lowmem_reserve[]: 0 0 0 0
May 11 01:39:46 spruce kernel: DMA: 154*4kB (EM) 90*8kB (M) 1*16kB (R) 1*32kB (R) 1*64kB (R) 1*128kB (R) 1*256kB (R) 1*512kB (R) 1*1024kB (R) 0*2048kB 0*4096kB = 3368kB
May 11 01:39:46 spruce kernel: Normal: 6671*4kB (M) 15*8kB (MR) 9*16kB (R) 2*32kB (R) 0*64kB 1*128kB (R) 1*256kB (R) 0*512kB 0*1024kB 0*2048kB 0*4096kB = 27396kB
May 11 01:39:47 spruce kernel: HighMem: 30*4kB (UM) 10596*8kB (UM) 3631*16kB (UM) 9241*32kB (UM) 5652*64kB (UM) 2577*128kB (M) 441*256kB (M) 15*512kB (UM) 0*1024kB 0*2048kB 0*4096kB = 1250856kB
May 11 01:39:47 spruce kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
May 11 01:39:47 spruce kernel: 106670 total pagecache pages
May 11 01:39:48 spruce kernel: 3917 pages in swap cache
May 11 01:39:48 spruce kernel: Swap cache stats: add 35963, delete 32046, find 3612812/3613910
May 11 01:39:49 spruce kernel: Free swap  = 2990240kB
May 11 01:39:49 spruce kernel: Total swap = 3080188kB
May 11 01:39:50 spruce kernel: 783911 pages RAM
May 11 01:39:50 spruce kernel: 557194 pages HighMem/MovableOnly
May 11 01:39:50 spruce kernel: 0 pages reserved
May 11 01:39:50 spruce kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
May 11 01:39:51 spruce kernel: [  510]     0   510    15389     6775      30       42             0 systemd-journal
May 11 01:39:51 spruce kernel: [  518]     0   518     3823      103       7      229             0 lvmetad
May 11 01:39:51 spruce kernel: [  525]     0   525     3493      166      10      354         -1000 systemd-udevd


NOTE: The RAID status below was taken following the upgrade to 3.14.3-200.fc20.i686+PAE.

[root@spruce tmp]# /usr/local/bin/raid_status
---------------------------------------------------------------------------------
/dev/md0:
        Version : 1.0
  Creation Time : Mon Oct 25 18:52:00 2010
     Raid Level : raid1
     Array Size : 511988 (500.07 MiB 524.28 MB)
  Used Dev Size : 511988 (500.07 MiB 524.28 MB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

    Update Time : Mon May 12 22:55:19 2014
          State : clean
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : localhost.localdomain:0
           UUID : 8a8ef0df:e25cee96:66fcc1cf:1043d370
         Events : 1324

    Number   Major   Minor   RaidDevice State
       3       8        1        0      active sync   /dev/sda1
       2       8       17        1      active sync   /dev/sdb1
---------------------------------------------------------------------------------
/dev/md1:
        Version : 1.1
  Creation Time : Mon Oct 25 18:52:03 2010
     Raid Level : raid1
     Array Size : 976247676 (931.02 GiB 999.68 GB)
  Used Dev Size : 976247676 (931.02 GiB 999.68 GB)
   Raid Devices : 2
  Total Devices : 2
    Persistence : Superblock is persistent

  Intent Bitmap : Internal

    Update Time : Mon May 12 23:20:43 2014
          State : active
 Active Devices : 2
Working Devices : 2
 Failed Devices : 0
  Spare Devices : 0

           Name : localhost.localdomain:1
           UUID : d0200041:57c65bc2:870f2f1e:b84fa712
         Events : 192907

    Number   Major   Minor   RaidDevice State
       0       8        2        0      active sync   /dev/sda2
       1       8       18        1      active sync   /dev/sdb2
---------------------------------------------------------------------------------
Personalities : [raid1]
md1 : active raid1 sda2[0] sdb2[1]
      976247676 blocks super 1.1 [2/2] [UU]
      bitmap: 3/8 pages [12KB], 65536KB chunk

md0 : active raid1 sda1[3] sdb1[2]
      511988 blocks super 1.0 [2/2] [UU]

unused devices: <none>
---------------------------------------------------------------------------------
[root@spruce tmp]#

Comment 2 Harald Reindl 2014-05-13 08:37:23 UTC
I think mine is a different issue, because any write I/O blocks completely and so there is no way to get anything into the syslog. I also doubt an OOM kill on a freshly booted machine with 16 GB RAM, and I never saw this before 3.14.

Comment 3 Gabriel Ramirez 2014-05-13 15:41:05 UTC
My machine is hitting the same bug: black screen while checking the RAID devices, and nothing is logged in the journal.

The RAID devices are level 1 with two or four members, no spares.

Friday May 2: upgraded to kernel-3.14.2-200.fc20.x86_64 and booted with it.
Sunday May 4: the automatic raid-check script finished successfully.
Sunday May 11 at 2:19 AM: the machine had no video and didn't respond to pings (it had a locked KDE session).

After resetting the machine and starting it again, I updated it so that kernel-3.14.3-200.fc20.x86_64 was installed,

and executed the raid-check script as root to check the RAID devices.

In the morning the machine didn't respond to anything: no video, no pings.

So I reset the machine, but this time did not execute the raid-check script; almost two days have passed and the machine is still up.

Comment 4 Harald Reindl 2014-05-14 14:54:13 UTC
The same with 3.14.4: the first of my two machines passed the raid-check. After that I started it again on my home server, and for some hours all looked fine, but the home server died in the meantime, and the second try on my office workstation also ended with a complete freeze.

Please also have a look at https://bugzilla.redhat.com/show_bug.cgi?id=1092937 ("mdadm --stop" of the root device takes a very long time), which looks more and more like a kernel issue and not a dracut one.

Comment 5 Harry Rarig 2014-05-15 04:34:27 UTC
Is this a duplicate of a similar issue noted in FC19:
Bug 1075185 - oom-killer invoked, machine hangs on kernel 3.12 and 3.13 but not 3.11?

Comment 6 Harry Rarig 2014-05-15 04:43:09 UTC
Per Gabriel's journal (aka "messages") and "ping" comments, the problem I am experiencing seems much more like an incarnation of the one reported for FC19 in Bug 1075185.

Of particular note is the fact that in my case, "ping" succeeds and messages are successfully written into /var/log/messages.

Comment 7 Harald Reindl 2014-05-15 09:48:24 UTC
Well, my home server yesterday stayed pingable too, but when it died at around 90% completion of checking the 3.6 TB /dev/md2, even my SSH connection froze in the middle of typing a command; trying another SSH connection came up with "Last login..." and anything else was dead and done :-(

Comment 8 Harald Reindl 2014-05-17 13:29:22 UTC
Kernel 3.14.4 also seems to be unstable on these RAID machines independent of raid-check; recently random crash number 6 across two different machines, and this time it hit my Thunderbird profile, which was "happily" restored from backup and merged with recent archives.

Complete freeze of the system: music-player-daemon looped the last 2 seconds of the running track forever, there was no response to the ACPI power button or anything else, and at boot there were some orphaned inodes at the ext4 level as well as the messages below. My heart stopped as this fast resync started, thinking "WTF, if that freezes again and again the machine is done."

[ 1.298603] md: raid10 personality registered for level 10
[ 1.299032] md/raid10:md2: not clean -- starting background reconstruction
[ 1.299119] md/raid10:md2: active with 4 out of 4 devices
[ 1.299347] created bitmap (29 pages) for device md2

Comment 9 Gabriel Ramirez 2014-05-21 18:08:42 UTC
With kernel 3.14.4 I cannot reproduce it anymore:

The scheduled raid-check on Sunday finished without problems.

I executed the script two more times and it finished without problems.

I had a journalctl -f -l running in an SSH session hoping to get some info about the problem, but it didn't happen. I will try running the script again tonight.

Comment 10 Gilboa Davara 2014-05-25 19:31:36 UTC
Seeing the same on two machines (one with RAID5 and the other with RAID10).
Both are using 3.14.4. :(

Trying to reboot (again and again) doesn't help.
Luckily for me, both machines (still) have 3.13.5, which manages to sync with no issues.

Any workaround (beyond staying with 3.13)?

- Gilboa
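
(One possible stopgap, sketched under the assumption that the known-good 3.13 kernel is still installed: make it the GRUB default until a fixed kernel arrives. The entry index below is only an example and is machine-specific; grub2-set-default takes effect when /etc/default/grub has GRUB_DEFAULT=saved.)

# List boot entries with their index (0 = first entry)
awk -F\' '/^menuentry /{print i++ " : " $2}' /boot/grub2/grub.cfg

# Pick the index of the 3.13.x entry (example: 2) and make it the default
grub2-set-default 2
grub2-editenv list        # verify saved_entry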

Comment 11 Gilboa Davara 2014-05-25 19:40:55 UTC
FWIW, both machines use F20/x86_64; one has 32 GB RAM and the other has 36 GB RAM.
(Hence I doubt I'm hitting the OOM killer during reboot, unless the MD raid-check leaks memory like crazy.)

- Gilboa

Comment 12 Harald Reindl 2014-06-03 19:03:10 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1096414
https://bugzilla.redhat.com/show_bug.cgi?id=1092937

*both* seem to be fixed with 3.14.5-200.fc20.x86_64,
although I am unable to find the relevant change in
the upstream kernel changelog.

However, I rebooted my workstation 40 times after the
update from koji, and 4 raid-check runs completed without any
freeze on two different machines.

If it comes back I will reopen this bug.
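
(A quick way to search a given build for md/raid changes, sketched with the package version discussed here; the kernel.org ChangeLog URL follows the pattern cited later in this bug. Note the Fedora RPM changelog mostly lists Fedora-specific patches, so stable backports may only show up in the upstream ChangeLog.)

# Fedora packaging changelog of the installed kernel build
rpm -q --changelog kernel-3.14.5-200.fc20.x86_64 | grep -i -A2 'raid\|md:'

# Upstream stable changelog for a specific point release
curl -s https://www.kernel.org/pub/linux/kernel/v3.x/ChangeLog-3.14.5 | grep -i -B2 -A8 'raid10\|wait_barrier'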

Comment 13 Gilboa Davara 2014-06-04 13:56:09 UTC
Harald,

Is this kernel in updates-testing or only in koji?

- Gilboa

Comment 14 Harald Reindl 2014-06-04 14:17:14 UTC
This morning it was still only on koji:
https://koji.fedoraproject.org/koji/packageinfo?packageID=8

Comment 15 Gilboa Davara 2014-06-04 17:02:37 UTC
Just noticed it got pushed to testing. I'll give it a test run and report back.

- Gilboa

Comment 16 Harald Reindl 2014-06-05 22:24:13 UTC
Sadly, both bugs are reopened: everything only looks good after a fresh boot, but after some hours of running, the same shutdown problems occur and raid-check freezes the machine as before :-(

It is 100% for sure kernel 3.14.x, because my co-developer today had the same shutdown hang in the evening: 2 shutdowns with 3.14, two hangs until SysRq+S was pressed dozens of times.

The same goes for this bug: I came home 30 minutes ago and fired the first raid-check after one day of uptime (while 3 runs worked fine directly after the upgrade), and at the same moment all write I/O stopped again :-(
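
(When all writes stall like this, the magic SysRq keys are often the only way to force a sync and a halfway clean reboot; a minimal sketch, assuming SysRq is permitted by the kernel.sysrq sysctl.)

# Allow all SysRq functions for this boot (persist via kernel.sysrq in sysctl.d)
echo 1 > /proc/sys/kernel/sysrq

# From the console: Alt+SysRq+S (sync), Alt+SysRq+U (remount read-only), Alt+SysRq+B (reboot).
# The same can be triggered from a still-responsive shell:
echo s > /proc/sysrq-trigger    # emergency sync
echo u > /proc/sysrq-trigger    # remount all filesystems read-only
echo b > /proc/sysrq-trigger    # immediate reboot without clean shutdown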

Comment 17 Harald Reindl 2014-06-07 23:50:27 UTC
The two commits below look like they are related:
https://www.kernel.org/pub/linux/kernel/v3.x/ChangeLog-3.14.6
____________________________________________________________________________________________

commit 0bc4091108e8f2e65faef3082e5261f2c35cd2b4
Author: NeilBrown <neilb>
Date:   Tue May 6 09:36:08 2014 +1000

    md: avoid possible spinning md thread at shutdown.
    
    commit 0f62fb220aa4ebabe8547d3a9ce4a16d3c045f21 upstream.
    
    If an md array with externally managed metadata (e.g. DDF or IMSM)
    is in use, then we should not set safemode==2 at shutdown because:
    
    1/ this is ineffective: user-space need to be involved in any 'safemode' handling,
    2/ The safemode management code doesn't cope with safemode==2 on external metadata
       and md_check_recover enters an infinite loop.
    
    Even at shutdown, an infinite-looping process can be problematic, so this
    could cause shutdown to hang.
    
    Signed-off-by: NeilBrown <neilb>
    Signed-off-by: Greg Kroah-Hartman <gregkh>
____________________________________________________________________________________________

commit 8c7311a1c4a8d804bde91b00a2f2c1a22a954c30
Author: NeilBrown <neilb>
Date:   Mon May 5 13:34:37 2014 +1000

    md/raid10: call wait_barrier() for each request submitted.
    
    commit cc13b1d1500656a20e41960668f3392dda9fa6e2 upstream.
    
    wait_barrier() includes a counter, so we must call it precisely once
    (unless balanced by allow_barrier()) for each request submitted.
    
    Since
    commit 20d0189b1012a37d2533a87fb451f7852f2418d1
        block: Introduce new bio_split()
    in 3.14-rc1, we don't call it for the extra requests generated when
    we need to split a bio.
    
    When this happens the counter goes negative, any resync/recovery will
    never start, and  "mdadm --stop" will hang.
    
    Reported-by: Chris Murphy <lists>
    Fixes: 20d0189b1012a37d2533a87fb451f7852f2418d1
    Cc: Kent Overstreet <kmo>
    Signed-off-by: NeilBrown <neilb>
    Signed-off-by: Greg Kroah-Hartman <gregkh>
____________________________________________________________________________________________

Comment 18 Gilboa Davara 2014-06-08 03:45:27 UTC
I can confirm 3.14.5 doesn't solve the crash-on-RAID-check problem.

@Harald Reindl,

Thus far I only seen this issue on RAID10 decides (and I've got a large number of servers w/ RAID5 and RAID6). Have you seen to servers with other RAID levels?

- Gilboa

Comment 19 Gilboa Davara 2014-06-08 03:46:52 UTC
(Sorry for the bad Piglish, just woke up and found two dead servers).

I meant:
Thus far I have only seen this issue on RAID10 devices... Have you seen this on servers with other RAID levels?

Sorry for the noise.

Comment 20 Harald Reindl 2014-06-08 08:38:17 UTC
No, and it only affects *large* RAID10 volumes, though I can't say how big "large" has to be, because I avoid any raid-check on machines not used only by me and have disabled the weekly run on every other machine.

However, /dev/md2 is big enough to trigger it, while /dev/md1 is fine:

/dev/md2:
        Version : 1.1
  Creation Time : Wed Jun  8 13:10:56 2011
     Raid Level : raid10
     Array Size : 3875222528 (3695.70 GiB 3968.23 GB)
  Used Dev Size : 1937611264 (1847.85 GiB 1984.11 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

I can check my 30 GB rootfs, which is also RAID10, 100 times a day and all is fine. It's really frustrating, and hopefully 3.14.6 fixes it, since it's a bad feeling in the case of remote machines:

/dev/md1:
        Version : 1.1
  Creation Time : Wed Jun  8 13:10:52 2011
     Raid Level : raid10
     Array Size : 30716928 (29.29 GiB 31.45 GB)
  Used Dev Size : 15358464 (14.65 GiB 15.73 GB)
   Raid Devices : 4
  Total Devices : 4
    Persistence : Superblock is persistent

Comment 21 Harald Reindl 2014-06-12 17:50:24 UTC
Closed again: 3.14.6 really fixed it. I moved some TB of data around over several days while repeatedly checking/resyncing the 4 TB RAID10, with no freeze and no hang at shutdown.

Comment 22 Daniel DeGraw 2015-02-22 13:18:43 UTC
This just froze my server last night; the last messages are from 1 AM, when the raid-check cron job ran. This is the first week running Fedora Server on this machine; it was previously running RHEL 5.

Fedora release 21 (Twenty One)
3.18.5-201.fc21.i686+PAE

Feb 22 01:00:01 gserv kernel: md: data-check of RAID array md117
Feb 22 01:00:01 gserv kernel: md: minimum _guaranteed_  speed: 1000 KB/sec/disk.
Feb 22 01:00:01 gserv kernel: md: using maximum available idle IO bandwidth (but not more than 200000 KB/sec) for data-check.
Feb 22 01:00:01 gserv kernel: md: using 128k window, over a total of 24418688k.
Feb 22 01:00:07 gserv kernel: md: delaying data-check of md120 until md117 has finished (they share one or more physical units)
Feb 22 01:00:13 gserv kernel: md: delaying data-check of md125 until md117 has finished (they share one or more physical units)
-- Reboot --
Feb 22 05:13:50 gserv systemd-journal[122]: Runtime journal is using 8.0M (max allowed 404.7M, trying to leave 607.1M free of 3.9G available → current limit 404.7M).

