Bug 1258153 - md is hanging in break_stripe_batch_list [NEEDINFO]
Summary: md is hanging in break_stripe_batch_list
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 22
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2015-08-29 17:40 UTC by Thomas Davis
Modified: 2015-11-23 17:24 UTC
CC: 7 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2015-11-23 17:24:11 UTC
Type: Bug
Embargoed:
jforbes: needinfo?



Description Thomas Davis 2015-08-29 17:40:08 UTC
Description of problem:

[91054.119867] WARNING: CPU: 2 PID: 833 at drivers/md/raid5.c:4226 break_stripe_batch_list+0x1b6/0x260 [raid456]()
[91054.119933] Modules linked in: pps_ldisc pps_core xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat tun ebtable_filter ebtables bridge nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack cfg80211 rfkill ip6table_filter ip6_tables it87 hwmon_vid cp210x ppdev btrfs raid456 async_raid6_recov async_memcpy async_pq kvm_amd async_xor async_tx kvm xor crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel fam15h_power edac_core edac_mce_amd k10temp usbtouchscreen raid6_pq snd_hda_codec_realtek snd_hda_codec_generic sp5100_tco snd_hda_codec_hdmi i2c_piix4 snd_hda_intel snd_hda_controller snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore parport_pc parport shpchp wmi acpi_cpufreq nfsd auth_rpcgss nfs_acl
[91054.120612]  lockd grace sunrpc binfmt_misc ata_generic pata_acpi amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper ttm 8021q serio_raw pata_atiixp garp drm stp mpt2sas llc uas mrp usb_storage r8169 sata_sil24 mii raid_class scsi_transport_sas
[91054.120897] CPU: 2 PID: 833 Comm: md2_raid5 Not tainted 4.1.5-200.fc22.x86_64 #1
[91054.120959] Hardware name: Gigabyte Technology Co., Ltd. GA-78LMT-USB3/GA-78LMT-USB3, BIOS F4 10/19/2012
[91054.121033]  0000000000000000 00000000a94b4845 ffff8807ef6c7a98 ffffffff8179b89d
[91054.121085]  0000000000000000 0000000000000000 ffff8807ef6c7ad8 ffffffff810a165a
[91054.121135]  ffff880742f0a170 0000000000000000 ffff8807c2ba93e0 ffff8807c1af68e0
[91054.121185] Call Trace:
[91054.121209]  [<ffffffff8179b89d>] dump_stack+0x45/0x57
[91054.121264]  [<ffffffff810a165a>] warn_slowpath_common+0x8a/0xc0
[91054.121307]  [<ffffffff810a178a>] warn_slowpath_null+0x1a/0x20
[91054.121353]  [<ffffffffa072f356>] break_stripe_batch_list+0x1b6/0x260 [raid456]
[91054.121408]  [<ffffffffa07391c0>] handle_stripe+0xa20/0x2660 [raid456]
[91054.121464]  [<ffffffff81380e0f>] ? blk_peek_request+0x4f/0x290
[91054.121505]  [<ffffffffa073af9e>] handle_active_stripes.isra.45+0x19e/0x4e0 [raid456]
[91054.121557]  [<ffffffffa073b798>] raid5d+0x4b8/0x680 [raid456]
[91054.121612]  [<ffffffff815fd644>] md_thread+0x144/0x150
[91054.121664]  [<ffffffff810e4d40>] ? wake_atomic_t_function+0x70/0x70
[91054.121708]  [<ffffffff815fd500>] ? find_pers+0x80/0x80
[91054.121741]  [<ffffffff810c0ba8>] kthread+0xd8/0xf0
[91054.121778]  [<ffffffff810c0ad0>] ? kthread_worker_fn+0x180/0x180
[91054.121839]  [<ffffffff817a2262>] ret_from_fork+0x42/0x70
[91054.121878]  [<ffffffff810c0ad0>] ? kthread_worker_fn+0x180/0x180
[91054.121921] ---[ end trace a49945b72be76d5c ]---



Version-Release number of selected component (if applicable):

[root@tank ~]# uname -a
Linux tank 4.1.5-200.fc22.x86_64 #1 SMP Mon Aug 10 23:38:23 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux


How reproducible:

Happens about 24 to 48 hours after boot; appears to be load-related. The system is running a MythTV backend.


Steps to Reproduce:
1.
2.
3.

Actual results:

The system has to be force-rebooted; access to the md device is blocked.

Expected results:

No crash.

Additional info:

Not sure. The system has 11 drives in 3 different md arrays (3 of 4 TB, 4 of 3 TB, and 4 of 2 TB).

[root@tank ~]# cat /proc/mdstat 
Personalities : [raid6] [raid5] [raid4] 
md2 : active raid5 sde2[1] sdl2[0] sdk2[3]
      7805961216 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/3] [UUU]
      bitmap: 2/30 pages [8KB], 65536KB chunk

md1 : active raid5 sdo1[0] sdj1[2] sdf1[5] sdi1[4]
      8790402048 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/22 pages [0KB], 65536KB chunk

md0 : active raid5 sdc1[2] sdd1[5] sdn1[1] sdb1[4]
      5860532736 blocks super 1.2 level 5, 512k chunk, algorithm 2 [4/4] [UUUU]
      bitmap: 0/15 pages [0KB], 65536KB chunk

unused devices: <none>
[root@tank ~]# uname -a
Linux tank 4.1.5-200.fc22.x86_64 #1 SMP Mon Aug 10 23:38:23 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
[root@tank ~]# uptime
 10:38:47 up 1 day, 13:28,  2 users,  load average: 7.10, 7.10, 7.06

Comment 1 Thomas Davis 2015-08-29 18:03:49 UTC
I forgot - I have a script that runs on boot that does:

MDS="md0 md1 md2"

for MD in $MDS
do
        echo 8192 > /sys/block/$MD/md/stripe_cache_size
        echo 2048 > /sys/block/$MD/queue/read_ahead_kb
done

I've since commented out the stripe_cache_size increase and will see if the system is any more stable.
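For reference, a dry-run sketch of the boot script above (the `tune_md` helper name is illustrative, not part of the actual script): it prints the sysfs writes the script would perform instead of performing them, so the values can be checked on a machine that doesn't have these arrays.

```shell
#!/bin/sh
# tune_md is a hypothetical dry-run wrapper around the boot script: instead
# of redirecting into /sys, it only prints what would be written where.
tune_md() {
    for MD in "$@"; do
        echo "would write 8192 to /sys/block/$MD/md/stripe_cache_size"
        echo "would write 2048 to /sys/block/$MD/queue/read_ahead_kb"
    done
}

tune_md md0 md1 md2
```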

Comment 2 Calle 2015-09-09 05:58:29 UTC
I've had the same issue twice now in two days on F21. The machine had been running fine for a long time on F20 and was upgraded to F21.

It did not happen under load either time. The RAID resynced after yesterday's crash and seemed clean, and the data appears intact, but last night it blew up again.

Is this a problem with the actual md driver or do I have a hardware issue?


Sep  9 04:46:05 a.hostname kernel: [56837.517234] WARNING: CPU: 2 PID: 685 at drivers/md/raid5.c:4226 break_stripe_batch_list+0x1b6/0x260 [raid456]()
Sep  9 04:46:05 a.hostname kernel: [56837.517237] Modules linked in: vhost_net vhost macvtap macvlan xt_geoip(OE) xt_iprange xt_comment xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc raid456 async_raid6_recov async_memcpy async_pq async_xor kvm_amd ppdev xor async_tx raid6_pq kvm crct10dif_pclmul snd_hda_codec_hdmi crc32_pclmul snd_hda_intel crc32c_intel snd_hda_controller snd_hda_codec snd_hda_core ghash_clmulni_intel snd_hwdep snd_seq snd_seq_device k10temp i2c_piix4 snd_pcm parport_pc snd_timer parport shpchp snd tpm_infineon tpm_tis tpm soundcore acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc amdkfd amd_iommu_v2 radeon i2c_algo_bit drm_kms_helper ttm serio_raw drm r8169 mii
Sep  9 04:46:05 a.hostname kernel: [56837.517303] CPU: 2 PID: 685 Comm: md0_raid5 Tainted: G           OE   4.1.6-100.fc21.x86_64 #1
Sep  9 04:46:05 a.hostname kernel: [56837.517306] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A75M-D3H, BIOS F3 09/20/2012
Sep  9 04:46:05 a.hostname kernel: [56837.517310]  0000000000000000 000000002e5f5144 ffff880818faba98 ffffffff817940d5
Sep  9 04:46:05 a.hostname kernel: [56837.517314]  0000000000000000 0000000000000000 ffff880818fabad8 ffffffff810a163a
Sep  9 04:46:05 a.hostname kernel: [56837.517319]  0000000000000348 0000000000000000 ffff88037e721f60 ffff8806f3395ea8
Sep  9 04:46:05 a.hostname kernel: [56837.517323] Call Trace:
Sep  9 04:46:05 a.hostname kernel: [56837.517332]  [<ffffffff817940d5>] dump_stack+0x45/0x57
Sep  9 04:46:05 a.hostname kernel: [56837.517338]  [<ffffffff810a163a>] warn_slowpath_common+0x8a/0xc0
Sep  9 04:46:05 a.hostname kernel: [56837.517343]  [<ffffffff810a176a>] warn_slowpath_null+0x1a/0x20
Sep  9 04:46:05 a.hostname kernel: [56837.517349]  [<ffffffffa052fa26>] break_stripe_batch_list+0x1b6/0x260 [raid456]
Sep  9 04:46:05 a.hostname kernel: [56837.517357]  [<ffffffffa05337b0>] handle_stripe+0x880/0x26d0 [raid456]
Sep  9 04:46:05 a.hostname kernel: [56837.517366]  [<ffffffffa05357ae>] handle_active_stripes.isra.46+0x1ae/0x520 [raid456]
Sep  9 04:46:05 a.hostname kernel: [56837.517371]  [<ffffffff815f53b9>] ? md_wakeup_thread+0x39/0x70
Sep  9 04:46:05 a.hostname kernel: [56837.517377]  [<ffffffffa0529cd3>] ? do_release_stripe+0xe3/0x190 [raid456]
Sep  9 04:46:05 a.hostname kernel: [56837.517384]  [<ffffffffa05367a8>] raid5d+0x4b8/0x680 [raid456]
Sep  9 04:46:05 a.hostname kernel: [56837.517389]  [<ffffffff8179632d>] ? __schedule+0x2dd/0x960
Sep  9 04:46:05 a.hostname kernel: [56837.517393]  [<ffffffff815f7584>] md_thread+0x154/0x160
Sep  9 04:46:05 a.hostname kernel: [56837.517398]  [<ffffffff810e4830>] ? wait_woken+0x90/0x90
Sep  9 04:46:05 a.hostname kernel: [56837.517402]  [<ffffffff815f7430>] ? find_pers+0x80/0x80
Sep  9 04:46:05 a.hostname kernel: [56837.517406]  [<ffffffff810c06c8>] kthread+0xd8/0xf0
Sep  9 04:46:05 a.hostname kernel: [56837.517410]  [<ffffffff810c05f0>] ? kthread_create_on_node+0x1b0/0x1b0
Sep  9 04:46:05 a.hostname kernel: [56837.517415]  [<ffffffff8179ada2>] ret_from_fork+0x42/0x70
Sep  9 04:46:05 a.hostname kernel: [56837.517418]  [<ffffffff810c05f0>] ? kthread_create_on_node+0x1b0/0x1b0
Sep  9 04:46:05 a.hostname kernel: [56837.517422] ---[ end trace ab62cdc458212cb4 ]---

Comment 3 Thomas Davis 2015-09-23 06:23:33 UTC
I ended up moving back to the only 4.0 kernel I could find:

Linux tank 4.0.4-301.fc22.x86_64 #1 SMP Thu May 21 13:10:33 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux

and the problem went away.

I also found that the code generating this problem was introduced into the 4.1.x kernels.


I noticed kernel-4.1.6 was released into updates recently, but I have no idea whether it includes any updates to the md driver.

Comment 4 Calle 2015-09-29 07:36:50 UTC
Same for me; it works well with 4.0.4-301.fc22.x86_64.

Comment 5 Justin M. Forbes 2015-10-20 19:41:44 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There are a large number of bugs to go through, and several of them have gone stale. Because of this, we are doing a mass bug update across all of the Fedora 22 kernel bugs.

Fedora 22 has now been rebased to 4.2.3-200.fc22. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 23, and are still experiencing this issue, please change the version to Fedora 23.

If you experience different issues, please open a new bug report for those.

Comment 6 Fedora Kernel Team 2015-11-23 17:24:11 UTC
*********** MASS BUG UPDATE **************
This bug is being closed as INSUFFICIENT_DATA because there has not been a response in over 4 weeks. If you are still experiencing this issue, please reopen it and attach the relevant data from the latest kernel you are running, along with any data that may have been requested previously.
