Created attachment 893391 [details]
mini dump from messages

Description of problem:
System lockup under high load on an NFS server.

Version-Release number of selected component (if applicable):
kernel-3.14.2-200.fc20.x86_64

How reproducible:
rsync'ing a 19 TB RAID to another 19 TB RAID.

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:
No crash. I switched back to kernel-3.13.9-200.fc20.x86_64, which seemed more stable: no crashes, no runs to the datacenter. Hope this helps.

Additional info:
The last 2 nights there were heavy loads on this server, with big rsync and batch processes running, and kernel-3.13.9-200.fc20.x86_64 did _not_ lock up. Food for git diff. g.
The first warning is in d_obtain_alias called as part of filehandle lookup. The other two are in shrink_dentry_list->__d_drop. git log v3.13.9..v3.14.2 fs/dcache.c doesn't turn up anything suspicious. I don't see any interesting changes to filehandle lookup code either. On a quick look I'm stumped. Further warnings might be interesting, or experiments with kernels in between those two.
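The "experiments with kernels in between" suggestion boils down to a range inspection or bisect between the two release tags. A toy sketch of the mechanics on a throwaway repo (a real run would use a linux-stable clone with v3.13.9 and v3.14.2 as the endpoints; the repo, tag names, and file name below are stand-ins):

```shell
# Toy sketch of narrowing a regression window with git. Assumed workflow only;
# substitute a linux-stable checkout and v3.13.9..v3.14.2 for the real thing.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "you@example.com"
git config user.name "you"
for n in 1 2 3 4 5; do
    echo "change $n" > dcache.c            # stand-in for fs/dcache.c
    git add dcache.c
    git commit -qm "dcache change $n"
done
git tag v-good HEAD~4                       # plays the role of v3.13.9
git tag v-bad  HEAD                         # plays the role of v3.14.2

# Changes to the suspect file between the known-good and known-bad tags:
count=$(git log --oneline v-good..v-bad -- dcache.c | wc -l)
echo "commits touching dcache.c: $count"

# A real investigation would then run: git bisect start v-bad v-good
# and build/boot/reproduce at each step, marking "good" or "bad".
```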
Created attachment 894805 [details]
pattern of dumps before lockup or crash

Just a 'grep -a BUG message.log' can see the BUGs on nfsd, rsync (what was running), and the kswapd kernel process.
Created attachment 894806 [details]
all the mini-dumps before the hang or freeze

ALL the mini-DUMPs, FYI. g.)
Interesting that it's only happening on 6 CPUs. It's an i7, so maybe the work is just scheduled onto 6 of them, or it's not happening on the other 2? (I don't know my stuff on this, though.)
In theory I guess this could be another consequence of https://bugzilla.redhat.com/show_bug.cgi?id=1082586. Could you confirm whether it's still reproducible on v3.14.3?
Yes, I could... but it's a production server, so I can only do it on the weekend when there's an opening, and I'm not clear on what produces the result. Let me know if you have an NFS or rsync test tool to hammer the server (a little how-to would help with this fried brain). )
I somehow did a yum update and switched back to 3.14.2-200, and 4 days later it crashed again; see the info below. I've since installed the 3.14.4-200 kernel and am seeing what happens. If it fails I'll return to 3.13.9-200, which seems not to have the problem. The doc below shows the NFS options I was using; I switched them all back to the defaults for this test. BTW, the afternoon before it hung I did a poke to increase RPCNFSDCOUNT like so:

echo 16 > /proc/fs/nfsd/threads

Not clear if this means much -- gary.

#------------------------------------------------------#
kernels installed
#------------------------------------------------------#
vmlinuz-3.13.9-200.fc20.x86_64 (may not have the problem)
vmlinuz-3.14.2-200.fc20.x86_64 (crashed on)
vmlinuz-3.14.4-200.fc20.x86_64 (switched to)

#------------------------------------------------------#
in /etc/sysconfig/nfs:
#------------------------------------------------------#
#RQUOTAD="/usr/sbin/rpc.rquotad"
# Port rquotad should listen on.
#RQUOTAD_PORT=875
# Optional options passed to rquotad
RPCRQUOTADOPTS=""
#
# Optional arguments passed to in-kernel lockd
#LOCKDARG=
# TCP port rpc.lockd should listen on.
#LOCKD_TCPPORT=32803
# UDP port rpc.lockd should listen on.
#LOCKD_UDPPORT=32769
LOCKD_UDPPORT=30001
LOCKD_TCPPORT=30001
#
# Optional arguments passed to rpc.nfsd. See rpc.nfsd(8)
RPCNFSDARGS=""
# Number of nfs server processes to be started.
# The default is 8.
RPCNFSDCOUNT=8
# Set V4 grace period in seconds
#NFSD_V4_GRACE=90
#
# Optional arguments passed to rpc.mountd. See rpc.mountd(8)
RPCMOUNTDOPTS=""
#
# Optional arguments passed to rpc.statd. See rpc.statd(8)
STATDARG=""
#
# Optional arguments passed to rpc.idmapd. See rpc.idmapd(8)
RPCIDMAPDARGS=""
#
# Optional arguments passed to rpc.gssd. See rpc.gssd(8)
RPCGSSDARGS=""
#
# Optional arguments passed to rpc.svcgssd. See rpc.svcgssd(8)
RPCSVCGSSDARGS=""
#
# To enable RDMA support on the server by setting this to
# the port the server should listen on
#RDMA_PORT=20049
#
# Optional arguments passed to blkmapd. See blkmapd(8)
BLKMAPDARGS=""

#------------------------------------------------------#
in /etc/rc.d/rc.local
#------------------------------------------------------#
echo Applying NFS performance options...
echo 16 > /proc/fs/nfsd/threads
echo "120" > /sys/block/sdb/device/timeout
echo "120" > /sys/block/sdc/device/timeout
echo 262144 > /proc/sys/net/core/rmem_max
echo 262144 > /proc/sys/net/core/rmem_default
echo 262144 > /proc/sys/net/core/wmem_max
echo 262144 > /proc/sys/net/core/wmem_default
echo 0 > /sys/block/sdb/queue/read_ahead_kb
echo noop > /sys/block/sdb/queue/scheduler

#------------------------------------------------------#
in /var/log/messages at hangup or crash:
#------------------------------------------------------#
May 31 01:01:01 r1epi systemd: Started Session 56 of user root.
May 31 01:36:47 r1epi kernel: BUG: soft lockup - CPU#2 stuck for 22s!
[nfsd:22675] May 31 01:36:47 r1epi kernel: Modules linked in: binfmt_misc sch_sfq xt_iprange bonding ip6t_REJECT nf_conntrack_ipv4 nf_conntrack_ipv6 nf_defrag_ipv4 nf_defrag_ipv6 xt_conntrack ip6table_filter nf_conntrack ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic iTCO_wdt iTCO_vendor_support btrfs x86_pkg_temp_thermal coretemp kvm crct10dif_pclmul crc32_pclmul crc32c_intel raid6_pq eeepc_wmi asus_wmi btusb sparse_keymap mxm_wmi bluetooth xor ghash_clmulni_intel snd_hda_intel 6lowpan_iphc snd_hda_codec microcode rfkill snd_hwdep snd_pcm e1000 serio_raw e1000e lpc_ich i2c_i801 mfd_core snd_timer snd mei_me ptp soundcore mei pps_core shpchp wmi nfsd auth_rpcgss nfs_acl lockd sunrpc i915 i2c_algo_bit drm_kms_helper firewire_ohci drm firewire_core crc_itu_t aacraid i2c_core video May 31 01:36:47 r1epi kernel: CPU: 2 PID: 22675 Comm: nfsd Not tainted 3.14.2-200.fc20.x86_64 #1 May 31 01:36:47 r1epi kernel: Hardware name: System manufacturer System Product Name/P8Z68-V PRO, BIOS 0902 09/19/2011 May 31 01:36:47 r1epi kernel: task: ffff88000e595580 ti: ffff88008cd2a000 task.ti: ffff88008cd2a000 May 31 01:36:47 r1epi kernel: RIP: 0010:[<ffffffff81202607>] [<ffffffff81202607>] d_obtain_alias+0x1b7/0x1d0 May 31 01:36:47 r1epi kernel: RSP: 0018:ffff88008cd2bb00 EFLAGS: 00000202 May 31 01:36:47 r1epi kernel: RAX: ffff88041623d000 RBX: ffffffff812039a3 RCX: ffff88022f9c9e49 May 31 01:36:47 r1epi kernel: RDX: ffff88041623d0b0 RSI: ffffffffa0916140 RDI: ffff88017de7eb98 May 31 01:36:47 r1epi kernel: RBP: ffff88008cd2bb18 R08: 0000000000017bf0 R09: ffffffff812021b5 May 31 01:36:47 r1epi kernel: R10: fd6265d59ea69203 R11: ffffea000fed4f00 R12: 0000000000000000 May 31 01:36:47 r1epi kernel: R13: ffff88027f9ac9a8 R14: ffffffff810d1cb5 R15: ffff88008cd2ba78 May 31 01:36:47 r1epi kernel: FS: 0000000000000000(0000) GS:ffff88042fa80000(0000) knlGS:0000000000000000 May 31 01:36:47 r1epi kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 May 31 01:36:47 
r1epi kernel: CR2: 00007f1369f2ab94 CR3: 0000000001c0c000 CR4: 00000000000407e0 May 31 01:36:47 r1epi kernel: Stack: May 31 01:36:47 r1epi kernel: ffff88027f9ac9a8 ffffffffffffff8c 000000000004027d ffff88008cd2bb88 May 31 01:36:47 r1epi kernel: ffffffffa08d6375 00000000008f9f6b ffff880417b10000 000000000e595580 May 31 01:36:48 r1epi kernel: 6bff88000e595a90 0100000000008f9f 0000000000000000 00000000c08b1472 May 31 01:36:48 r1epi kernel: Call Trace: May 31 01:36:48 r1epi kernel: [<ffffffffa08d6375>] btrfs_get_dentry+0x115/0x140 [btrfs] May 31 01:36:48 r1epi kernel: [<ffffffffa01f9a40>] ? nfsd_proc_getattr+0xa0/0xa0 [nfsd] May 31 01:36:48 r1epi kernel: [<ffffffffa08d6662>] btrfs_fh_to_dentry+0x32/0x60 [btrfs] May 31 01:36:48 r1epi kernel: [<ffffffff812cc842>] exportfs_decode_fh+0x72/0x2e0 May 31 01:36:48 r1epi kernel: [<ffffffffa01ffa2b>] ? exp_find+0x10b/0x1c0 [nfsd] May 31 01:36:48 r1epi kernel: [<ffffffff810c2885>] ? sched_clock_cpu+0x85/0xc0 May 31 01:36:48 r1epi kernel: [<ffffffff811cc925>] ? kmem_cache_alloc+0x35/0x1f0 May 31 01:36:48 r1epi kernel: [<ffffffffa01fa796>] fh_verify+0x316/0x600 [nfsd] May 31 01:36:48 r1epi kernel: [<ffffffff810f0c50>] ? getboottime+0x30/0x40 May 31 01:36:48 r1epi kernel: [<ffffffffa01bbf7e>] ? cache_check+0x12e/0x380 [sunrpc] May 31 01:36:48 r1epi kernel: [<ffffffffa0208ac9>] nfsd4_putfh+0x49/0x50 [nfsd] May 31 01:36:48 r1epi kernel: [<ffffffffa020ad1a>] nfsd4_proc_compound+0x56a/0x7b0 [nfsd] May 31 01:36:48 r1epi kernel: [<ffffffffa01f6dbb>] nfsd_dispatch+0xbb/0x200 [nfsd] May 31 01:36:48 r1epi kernel: [<ffffffffa01b1d00>] svc_process_common+0x480/0x6f0 [sunrpc] May 31 01:36:48 r1epi kernel: [<ffffffffa01b2077>] svc_process+0x107/0x170 [sunrpc] May 31 01:36:48 r1epi kernel: [<ffffffffa01f674f>] nfsd+0xbf/0x130 [nfsd] May 31 01:36:48 r1epi kernel: [<ffffffffa01f6690>] ? nfsd_destroy+0x80/0x80 [nfsd] May 31 01:36:48 r1epi kernel: [<ffffffff810ae211>] kthread+0xe1/0x100 May 31 01:36:48 r1epi kernel: [<ffffffff810ae130>] ? 
insert_kthread_work+0x40/0x40 May 31 01:36:48 r1epi kernel: [<ffffffff816fef7c>] ret_from_fork+0x7c/0xb0 May 31 01:36:48 r1epi kernel: [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40 May 31 01:36:48 r1epi kernel: Code: 00 66 41 83 45 58 01 66 83 83 88 00 00 00 01 48 89 de 4c 89 ef e8 ca 81 0e 00 4c 89 e8 e9 a1 fe ff ff f3 90 48 8b 88 b0 00 00 00 <80> e1 01 75 f2 e9 75 ff ff ff 48 89 f8 e9 86 fe ff ff 0f 0b 0f May 31 01:36:48 r1epi kernel: BUG: soft lockup - CPU#3 stuck for 22s! [kswapd0:82] May 31 01:36:48 r1epi kernel: Modules linked in: binfmt_misc sch_sfq xt_iprange bonding ip6t_REJECT nf_conntrack_ipv4 nf_conntrack_ipv6 nf_defrag_ipv4 nf_defrag_ipv6 xt_conntrack ip6table_filter nf_conntrack ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic iTCO_wdt iTCO_vendor_support btrfs x86_pkg_temp_thermal coretemp kvm crct10dif_pclmul crc32_pclmul crc32c_intel raid6_pq eeepc_wmi asus_wmi btusb sparse_keymap mxm_wmi bluetooth xor ghash_clmulni_intel snd_hda_intel 6lowpan_iphc snd_hda_codec microcode rfkill snd_hwdep snd_pcm e1000 serio_raw e1000e lpc_ich i2c_i801 mfd_core snd_timer snd mei_me ptp soundcore mei pps_core shpchp wmi nfsd auth_rpcgss nfs_acl lockd sunrpc i915 i2c_algo_bit drm_kms_helper firewire_ohci drm firewire_core crc_itu_t aacraid i2c_core video May 31 01:36:48 r1epi kernel: CPU: 3 PID: 82 Comm: kswapd0 Not tainted 3.14.2-200.fc20.x86_64 #1 May 31 01:36:48 r1epi kernel: Hardware name: System manufacturer System Product Name/P8Z68-V PRO, BIOS 0902 09/19/2011 May 31 01:36:48 r1epi kernel: task: ffff880417468000 ti: ffff880417ba4000 task.ti: ffff880417ba4000 May 31 01:36:48 r1epi kernel: RIP: 0010:[<ffffffff812003b5>] [<ffffffff812003b5>] __d_drop+0x95/0xc0 May 31 01:36:48 r1epi kernel: RSP: 0018:ffff880417ba5b60 EFLAGS: 00000202 May 31 01:36:48 r1epi kernel: RAX: ffff88022f9c9e49 RBX: 0000000000000000 RCX: 0000000000060005 May 31 01:36:48 r1epi kernel: RDX: ffff88041623d0b0 RSI: 0000000000000000 RDI: ffff880106f19a80 May 31 
01:36:48 r1epi kernel: RBP: ffff880417ba5b88 R08: ffff880106f19b00 R09: 000000018015000a May 31 01:36:48 r1epi kernel: R10: ffffffff811ff55f R11: ffffea00065f45c0 R12: ffff88041865ba80 May 31 01:36:48 r1epi kernel: R13: 00000000000003f0 R14: 0000000000000032 R15: 0000000000000069 May 31 01:36:48 r1epi kernel: FS: 0000000000000000(0000) GS:ffff88042fac0000(0000) knlGS:0000000000000000 May 31 01:36:48 r1epi kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 May 31 01:36:48 r1epi kernel: CR2: 00007fce0a1d2000 CR3: 0000000001c0c000 CR4: 00000000000407e0 May 31 01:36:48 r1epi kernel: Stack: May 31 01:36:48 r1epi kernel: ffffffff812004f4 ffff880106f19b00 ffff880417ba5be0 ffff880106f19a80 May 31 01:36:48 r1epi kernel: ffff880106f19a80 ffff880417ba5bc8 ffffffff81200881 ffff880106f19b00 May 31 01:36:48 r1epi kernel: 00000000bbfef780 ffff880417ba5be0 0000000000000079 ffff88041623d000 May 31 01:36:48 r1epi kernel: Call Trace: May 31 01:36:48 r1epi kernel: [<ffffffff812004f4>] ? dentry_kill+0xa4/0x210 May 31 01:36:48 r1epi kernel: [<ffffffff81200881>] shrink_dentry_list+0xa1/0x100 May 31 01:36:48 r1epi kernel: [<ffffffff81201f76>] prune_dcache_sb+0x56/0x80 May 31 01:36:48 r1epi kernel: [<ffffffff811ecfb7>] super_cache_scan+0xe7/0x160 May 31 01:36:48 r1epi kernel: [<ffffffff81185688>] shrink_slab_node+0x138/0x290 May 31 01:36:48 r1epi kernel: [<ffffffff811db34b>] ? mem_cgroup_iter+0x16b/0x2d0 May 31 01:36:48 r1epi kernel: [<ffffffff81187c6b>] shrink_slab+0x8b/0x170 May 31 01:36:48 r1epi kernel: [<ffffffff8118a7ca>] kswapd_shrink_zone+0x14a/0x1f0 May 31 01:36:48 r1epi kernel: [<ffffffff8118bc36>] kswapd+0x476/0x860 May 31 01:36:48 r1epi kernel: [<ffffffff8118b7c0>] ? mem_cgroup_shrink_node_zone+0x160/0x160 May 31 01:36:48 r1epi kernel: [<ffffffff810ae211>] kthread+0xe1/0x100 May 31 01:36:48 r1epi kernel: [<ffffffff810ae130>] ? 
insert_kthread_work+0x40/0x40 May 31 01:36:48 r1epi kernel: [<ffffffff816fef7c>] ret_from_fork+0x7c/0xb0 May 31 01:36:48 r1epi kernel: [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40 May 31 01:36:48 r1epi kernel: Code: 10 00 00 00 00 0f ba 32 00 8b 47 58 89 c2 c1 ea 10 66 39 d0 74 28 83 47 04 02 c3 0f 1f 00 f3 c3 66 0f 1f 44 00 00 f3 90 48 8b 02 <a8> 01 75 f7 eb a2 48 8b 47 68 48 8d 90 b0 00 00 00 eb 95 55 48 May 31 01:36:51 r1epi kernel: BUG: soft lockup - CPU#4 stuck for 23s! [nfsd:676] May 31 01:36:51 r1epi kernel: Modules linked in: binfmt_misc sch_sfq xt_iprange bonding ip6t_REJECT nf_conntrack_ipv4 nf_conntrack_ipv6 nf_defrag_ipv4 nf_defrag_ipv6 xt_conntrack ip6table_filter nf_conntrack ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic iTCO_wdt iTCO_vendor_support btrfs x86_pkg_temp_thermal coretemp kvm crct10dif_pclmul crc32_pclmul crc32c_intel raid6_pq eeepc_wmi asus_wmi btusb sparse_keymap mxm_wmi bluetooth xor ghash_clmulni_intel snd_hda_intel 6lowpan_iphc snd_hda_codec microcode rfkill snd_hwdep snd_pcm e1000 serio_raw e1000e lpc_ich i2c_i801 mfd_core snd_timer snd mei_me ptp soundcore mei pps_core shpchp wmi nfsd auth_rpcgss nfs_acl lockd sunrpc i915 i2c_algo_bit drm_kms_helper firewire_ohci drm firewire_core crc_itu_t aacraid i2c_core video May 31 01:36:51 r1epi kernel: CPU: 4 PID: 676 Comm: nfsd Not tainted 3.14.2-200.fc20.x86_64 #1 May 31 01:36:51 r1epi kernel: Hardware name: System manufacturer System Product Name/P8Z68-V PRO, BIOS 0902 09/19/2011 May 31 01:36:51 r1epi kernel: task: ffff8803fd80b900 ti: ffff8800c05e2000 task.ti: ffff8800c05e2000 May 31 01:36:51 r1epi kernel: RIP: 0010:[<ffffffff81202607>] [<ffffffff81202607>] d_obtain_alias+0x1b7/0x1d0 May 31 01:36:51 r1epi kernel: RSP: 0018:ffff8800c05e3b30 EFLAGS: 00000202 May 31 01:36:51 r1epi kernel: RAX: ffff88041623d000 RBX: ffffffff812039a3 RCX: ffff88022f9c9e49 May 31 01:36:51 r1epi kernel: RDX: ffff88041623d0b0 RSI: ffffffffa0916140 RDI: ffff88031f144f58 May 31 
01:36:51 r1epi kernel: RBP: ffff8800c05e3b48 R08: 0000000000017bf0 R09: ffffffff812021b5 May 31 01:36:51 r1epi kernel: R10: fe6f22b26f5e9203 R11: ffffea0003027140 R12: 0000000000000000 May 31 01:36:51 r1epi kernel: R13: ffff880172dda9a8 R14: ffffffff810d1cb5 R15: ffff8800c05e3aa8 May 31 01:36:51 r1epi kernel: FS: 0000000000000000(0000) GS:ffff88042fb00000(0000) knlGS:0000000000000000 May 31 01:36:51 r1epi kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 May 31 01:36:51 r1epi kernel: CR2: 00007f2ec792359c CR3: 0000000001c0c000 CR4: 00000000000407e0 May 31 01:36:51 r1epi kernel: Stack: May 31 01:36:51 r1epi kernel: ffff880172dda9a8 ffffffffffffff8c 000000000005d12a ffff8800c05e3bb8 May 31 01:36:51 r1epi kernel: ffffffffa08d6375 0000000000a9f19e ffff880417b10000 0000000000000001 May 31 01:36:51 r1epi kernel: 9e00000000000000 010000000000a9f1 0000000000000000 00000000ae4e30da May 31 01:36:51 r1epi kernel: Call Trace: May 31 01:36:51 r1epi kernel: [<ffffffffa08d6375>] btrfs_get_dentry+0x115/0x140 [btrfs] May 31 01:36:51 r1epi kernel: [<ffffffffa01f9a40>] ? nfsd_proc_getattr+0xa0/0xa0 [nfsd] May 31 01:36:51 r1epi kernel: [<ffffffffa08d6662>] btrfs_fh_to_dentry+0x32/0x60 [btrfs] May 31 01:36:51 r1epi kernel: [<ffffffff812cc842>] exportfs_decode_fh+0x72/0x2e0 May 31 01:36:51 r1epi kernel: [<ffffffffa01ffa2b>] ? exp_find+0x10b/0x1c0 [nfsd] May 31 01:36:51 r1epi kernel: [<ffffffff810c8c4e>] ? dequeue_task_fair+0x42e/0x640 May 31 01:36:51 r1epi kernel: [<ffffffff810c2885>] ? sched_clock_cpu+0x85/0xc0 May 31 01:36:51 r1epi kernel: [<ffffffff811cc925>] ? kmem_cache_alloc+0x35/0x1f0 May 31 01:36:51 r1epi kernel: [<ffffffff810b3606>] ? 
prepare_creds+0x26/0x1c0 May 31 01:36:51 r1epi kernel: [<ffffffffa01fa796>] fh_verify+0x316/0x600 [nfsd] May 31 01:36:51 r1epi kernel: [<ffffffffa0204c0c>] nfsd3_proc_getattr+0x7c/0x110 [nfsd] May 31 01:36:51 r1epi kernel: [<ffffffffa01f6dbb>] nfsd_dispatch+0xbb/0x200 [nfsd] May 31 01:36:51 r1epi kernel: [<ffffffffa01b1d00>] svc_process_common+0x480/0x6f0 [sunrpc] May 31 01:36:51 r1epi kernel: [<ffffffffa01b2077>] svc_process+0x107/0x170 [sunrpc] May 31 01:36:51 r1epi kernel: [<ffffffffa01f674f>] nfsd+0xbf/0x130 [nfsd] May 31 01:36:51 r1epi kernel: [<ffffffffa01f6690>] ? nfsd_destroy+0x80/0x80 [nfsd] May 31 01:36:51 r1epi kernel: [<ffffffff810ae211>] kthread+0xe1/0x100 May 31 01:36:51 r1epi kernel: [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40 May 31 01:36:51 r1epi kernel: [<ffffffff816fef7c>] ret_from_fork+0x7c/0xb0 May 31 01:36:51 r1epi kernel: [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40 May 31 01:36:51 r1epi kernel: Code: 00 66 41 83 45 58 01 66 83 83 88 00 00 00 01 48 89 de 4c 89 ef e8 ca 81 0e 00 4c 89 e8 e9 a1 fe ff ff f3 90 48 8b 88 b0 00 00 00 <80> e1 01 75 f2 e9 75 ff ff ff 48 89 f8 e: 9 86 fe ff ff 0f 0b 0f May 31 01:36:55 r1epi kernel: BUG: soft lockup - CPU#1 stuck for 22s! [nfsd:682] May 31 01:36:55 r1epi kernel: BUG: soft lockup - CPU#0 stuck for 22s! 
[nfsd:22673] May 31 01:36:55 r1epi kernel: Modules linked in: binfmt_misc sch_sfq xt_iprange bonding ip6t_REJECT nf_conntrack_ipv4 nf_conntrack_ipv6 nf_defrag_ipv4 nf_defrag_ipv6 xt_conntrack ip6table_filter nf_conntrack ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic iTCO_wdt iTCO_vendor_support btrfs x86_pkg_temp_thermal coretemp kvm crct10dif_pclmul crc32_pclmul crc32c_intel raid6_pq eeepc_wmi asus_wmi btusb sparse_keymap mxm_wmi bluetooth xor ghash_clmulni_intel snd_hda_intel 6lowpan_iphc snd_hda_codec microcode rfkill snd_hwdep snd_pcm e1000 serio_raw e1000e lpc_ich i2c_i801 mfd_core snd_timer snd mei_me ptp soundcore mei pps_core shpchp wmi nfsd auth_rpcgss nfs_acl lockd sunrpc i915 i2c_algo_bit drm_kms_helper firewire_ohci drm firewire_core crc_itu_t aacraid i2c_core video May 31 01:36:55 r1epi kernel: CPU: 0 PID: 22673 Comm: nfsd Not tainted 3.14.2-200.fc20.x86_64 #1 May 31 01:36:55 r1epi kernel: Hardware name: System manufacturer System Product Name/P8Z68-V PRO, BIOS 0902 09/19/2011 May 31 01:36:55 r1epi kernel: task: ffff88000e595f00 ti: ffff880003c18000 task.ti: ffff880003c18000 May 31 01:36:55 r1epi kernel: RIP: 0010:[<ffffffff81202607>] [<ffffffff81202607>] d_obtain_alias+0x1b7/0x1d0 May 31 01:36:55 r1epi kernel: RSP: 0018:ffff880003c19a80 EFLAGS: 00000202 May 31 01:36:55 r1epi kernel: RAX: ffff88041623d000 RBX: ffffffff812039a3 RCX: ffff88022f9c9e49 May 31 01:36:55 r1epi kernel: RDX: ffff88041623d0b0 RSI: ffffffffa0916140 RDI: ffff880172c09a18 May 31 01:36:55 r1epi kernel: RBP: ffff880003c19a98 R08: 0000000000017bf0 R09: ffffffff812021b5 May 31 01:36:55 r1epi kernel: R10: fe6f2c3664fd9403 R11: ffffea001012cd00 R12: 0000000000000000 May 31 01:36:55 r1epi kernel: R13: ffff880172d425b0 R14: ffffffff810d1cb5 R15: ffff880003c199f8 May 31 01:36:55 r1epi kernel: FS: 0000000000000000(0000) GS:ffff88042fa00000(0000) knlGS:0000000000000000 May 31 01:36:55 r1epi kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 May 31 01:36:55 
r1epi kernel: CR2: 00007f85fef93000 CR3: 0000000001c0c000 CR4: 00000000000407f0 May 31 01:36:55 r1epi kernel: Stack: May 31 01:36:55 r1epi kernel: ffff880172d425b0 ffffffffffffff8c 000000000005d7a4 ffff880003c19b08 May 31 01:36:55 r1epi kernel: ffffffffa08d6375 0000000000aa6405 ffff880417b10000 00000000ff38d8cc May 31 01:36:55 r1epi kernel: 05ff8803ff38d400 010000000000aa64 0000000000000000 000000003f519fc4 May 31 01:36:55 r1epi kernel: Call Trace: May 31 01:36:55 r1epi kernel: [<ffffffffa08d6375>] btrfs_get_dentry+0x115/0x140 [btrfs] May 31 01:36:55 r1epi kernel: [<ffffffffa01f9a40>] ? nfsd_proc_getattr+0xa0/0xa0 [nfsd] May 31 01:36:55 r1epi kernel: [<ffffffffa08d6662>] btrfs_fh_to_dentry+0x32/0x60 [btrfs] May 31 01:36:55 r1epi kernel: [<ffffffff812cc842>] exportfs_decode_fh+0x72/0x2e0 May 31 01:36:55 r1epi kernel: [<ffffffffa01ffa2b>] ? exp_find+0x10b/0x1c0 [nfsd] May 31 01:36:55 r1epi kernel: [<ffffffff811cc925>] ? kmem_cache_alloc+0x35/0x1f0 May 31 01:36:55 r1epi kernel: [<ffffffffa01fa796>] fh_verify+0x316/0x600 [nfsd] May 31 01:36:55 r1epi kernel: [<ffffffffa01fbb50>] nfsd_open+0x40/0x1d0 [nfsd] May 31 01:36:55 r1epi kernel: [<ffffffff810a4f69>] ? try_to_grab_pending+0xa9/0x150 May 31 01:36:55 r1epi kernel: [<ffffffffa01fe59b>] nfsd_write+0xbb/0x110 [nfsd] May 31 01:36:55 r1epi kernel: [<ffffffffa0204af0>] nfsd3_proc_write+0xc0/0x160 [nfsd] May 31 01:36:55 r1epi kernel: [<ffffffffa01f6dbb>] nfsd_dispatch+0xbb/0x200 [nfsd] May 31 01:36:55 r1epi kernel: [<ffffffffa01b1d00>] svc_process_common+0x480/0x6f0 [sunrpc] May 31 01:36:55 r1epi kernel: [<ffffffffa01b2077>] svc_process+0x107/0x170 [sunrpc] May 31 01:36:55 r1epi kernel: Modules linked in: binfmt_misc May 31 01:36:55 r1epi kernel: May 31 01:36:55 r1epi kernel: [<ffffffffa01f674f>] nfsd+0xbf/0x130 [nfsd] May 31 01:36:55 r1epi kernel: May 31 01:36:55 r1epi kernel: sch_sfq May 31 01:36:55 r1epi kernel: xt_iprange bonding ip6t_REJECT nf_conntrack_ipv4 May 31 01:36:55 r1epi kernel: [<ffffffffa01f6690>] ? 
nfsd_destroy+0x80/0x80 [nfsd] May 31 01:36:55 r1epi kernel: May 31 01:36:55 r1epi kernel: [<ffffffff810ae211>] kthread+0xe1/0x100 May 31 01:36:55 r1epi kernel: May 31 01:36:55 r1epi kernel: [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40May 31 01:36:55 r1epi kernel: Code: May 31 01:36:55 r1epi kernel: 00 May 31 01:36:55 r1epi kernel: 66 May 31 01:36:55 r1epi kernel: 41 May 31 01:36:55 r1epi kernel: 83 May 31 01:36:55 r1epi kernel: 45 May 31 01:36:55 r1epi kernel: 58 May 31 01:36:55 r1epi kernel: 01 May 31 01:36:55 r1epi kernel: 66 May 31 01:36:55 r1epi kernel: 83 May 31 01:36:55 r1epi kernel: 83 May 31 01:36:55 r1epi kernel: 88 May 31 01:36:55 r1epi kernel: 00 May 31 01:36:55 r1epi kernel: 00 May 31 01:36:55 r1epi kernel: 00 May 31 01:36:55 r1epi kernel: 01 May 31 01:36:55 r1epi kernel: 48 May 31 01:36:55 r1epi kernel: 89 May 31 01:36:55 r1epi kernel: de May 31 01:36:55 r1epi kernel: 4c May 31 01:36:55 r1epi kernel: 89 May 31 01:36:55 r1epi kernel: ef May 31 01:36:55 r1epi kernel: e8 May 31 01:36:55 r1epi kernel: ca May 31 01:36:55 r1epi kernel: 81 May 31 01:36:55 r1epi kernel: 0e May 31 01:36:55 r1epi kernel: 00 May 31 01:36:55 r1epi kernel: 4c May 31 01:36:55 r1epi kernel: 89 May 31 01:36:55 r1epi kernel: e8 May 31 01:36:55 r1epi kernel: e9 May 31 01:36:55 r1epi kernel: a1 May 31 01:36:55 r1epi kernel: fe May 31 01:36:55 r1epi kernel: ff May 31 01:36:55 r1epi kernel: ff May 31 01:36:55 r1epi kernel: f3 May 31 01:36:55 r1epi kernel: 90 May 31 01:36:55 r1epi kernel: 48 May 31 01:36:55 r1epi kernel: 8b May 31 01:36:55 r1epi kernel: 88 May 31 01:36:55 r1epi kernel: b0 May 31 01:36:55 r1epi kernel: 00 May 31 01:36:55 r1epi kernel: 00 May 31 01:36:55 r1epi kernel: 00 May 31 01:36:55 r1epi kernel: <80> May 31 01:36:55 r1epi kernel: e1 May 31 01:36:55 r1epi kernel: 48 May 31 01:36:55 r1epi kernel: 89 May 31 01:36:55 r1epi kernel: f8 May 31 01:36:55 r1epi kernel: e9 May 31 01:36:55 r1epi kernel: 86 May 31 01:36:55 r1epi kernel: fe May 31 01:36:55 r1epi kernel: ff May 
31 01:36:55 r1epi kernel: ff May 31 01:36:55 r1epi kernel: 0f May 31 01:36:55 r1epi kernel: 0b May 31 01:36:55 r1epi kernel: 0f May 31 01:36:55 r1epi kernel: May 31 01:36:55 r1epi kernel: nf_conntrack_ipv6 May 31 01:36:55 r1epi kernel: nf_defrag_ipv4 nf_defrag_ipv6 xt_conntrack ip6table_filter nf_conntrack ip6_tables snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic iTCO_wdt iTCO_vendor_support btrfs x86_pkg_temp_thermal coretemp kvm crct10dif_pclmul crc32_pclmul crc32c_intel raid6_pq eeepc_wmi asus_wmi btusb sparse_keymap mxm_wmi bluetooth xor ghash_clmulni_intel snd_hda_intel 6lowpan_iphc snd_hda_codec microcode rfkill snd_hwdep snd_pcm e1000 serio_raw e1000e lpc_ich i2c_i801 mfd_core snd_timer snd mei_me ptp soundcore mei pps_core shpchp wmi nfsd auth_rpcgss nfs_acl lockd sunrpc i915 i2c_algo_bit drm_kms_helper firewire_ohci drm firewire_core crc_itu_t aacraid i2c_core video May 31 01:36:55 r1epi kernel: CPU: 1 PID: 682 Comm: nfsd Not tainted 3.14.2-200.fc20.x86_64 #1 May 31 01:36:55 r1epi kernel: Hardware name: System manufacturer System Product Name/P8Z68-V PRO, BIOS 0902 09/19/2011 May 31 01:36:55 r1epi kernel: task: ffff8803fd80f200 ti: ffff8803fd1c0000 task.ti: ffff8803fd1c0000 May 31 01:36:55 r1epi kernel: RIP: 0010:[<ffffffff81202607>] [<ffffffff81202607>] d_obtain_alias+0x1b7/0x1d0 May 31 01:36:55 r1epi kernel: RSP: 0018:ffff8803fd1c1a80 EFLAGS: 00000202 May 31 01:36:55 r1epi kernel: RAX: ffff88041623d000 RBX: ffffffff812039a3 RCX: ffff88022f9c9e49 May 31 01:36:55 r1epi kernel: RDX: ffff88041623d0b0 RSI: ffffffffa0916140 RDI: ffff880172d5d658 May 31 01:36:55 r1epi kernel: RBP: ffff8803fd1c1a98 R08: 0000000000017bf0 R09: ffffffff812021b5 May 31 01:36:55 r1epi kernel: R10: fe6f2bbe751b9003 R11: ffffea0010462480 R12: 0000000000000000 May 31 01:36:55 r1epi kernel: R13: ffff880172d49da0 R14: ffffffff810d1cb5 R15: ffff8803fd1c19f8 May 31 01:36:55 r1epi kernel: FS: 0000000000000000(0000) GS:ffff88042fa40000(0000) knlGS:0000000000000000 May 31 
01:36:55 r1epi kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 May 31 01:36:55 r1epi kernel: CR2: 00007fce0a1d2000 CR3: 0000000001c0c000 CR4: 00000000000407e0 May 31 01:36:55 r1epi kernel: Stack: May 31 01:36:55 r1epi kernel: ffff880172d49da0 ffffffffffffff8c 000000000005d7a7 ffff8803fd1c1b08 May 31 01:36:55 r1epi kernel: ffffffffa08d6375 0000000000aa6445 ffff880417b10000 00000000ff38b5cc May 31 01:36:55 r1epi kernel: 45ff8803ff38b100 010000000000aa64 0000000000000000 00000000c9f4e379 May 31 01:36:55 r1epi kernel: Call Trace: May 31 01:36:55 r1epi kernel: [<ffffffffa08d6375>] btrfs_get_dentry+0x115/0x140 [btrfs] May 31 01:36:55 r1epi kernel: [<ffffffffa01f9a40>] ? nfsd_proc_getattr+0xa0/0xa0 [nfsd] May 31 01:36:55 r1epi kernel: [<ffffffffa08d6662>] btrfs_fh_to_dentry+0x32/0x60 [btrfs] May 31 01:36:55 r1epi kernel: [<ffffffff812cc842>] exportfs_decode_fh+0x72/0x2e0 May 31 01:36:55 r1epi kernel: [<ffffffffa01ffa2b>] ? exp_find+0x10b/0x1c0 [nfsd] May 31 01:36:55 r1epi kernel: [<ffffffff811cc925>] ? kmem_cache_alloc+0x35/0x1f0 May 31 01:36:55 r1epi kernel: [<ffffffff810b3606>] ? prepare_creds+0x26/0x1c0 May 31 01:36:55 r1epi kernel: [<ffffffffa01fa796>] fh_verify+0x316/0x600 [nfsd] May 31 01:36:55 r1epi kernel: [<ffffffffa01fbb50>] nfsd_open+0x40/0x1d0 [nfsd] May 31 01:36:55 r1epi kernel: [<ffffffff810a4f69>] ? try_to_grab_pending+0xa9/0x150 May 31 01:36:55 r1epi kernel: [<ffffffffa01fe59b>] nfsd_write+0xbb/0x110 [nfsd] May 31 01:36:55 r1epi kernel: [<ffffffffa0204af0>] nfsd3_proc_write+0xc0/0x160 [nfsd] May 31 01:36:55 r1epi kernel: [<ffffffffa01f6dbb>] nfsd_dispatch+0xbb/0x200 [nfsd] May 31 01:36:55 r1epi kernel: [<ffffffffa01b1d00>] svc_process_common+0x480/0x6f0 [sunrpc] May 31 01:36:55 r1epi kMay 31 01:36:55 r1epi kernel: [<ffffffff816fef7c>] ret_from_fork+0x7c/0xb0 May 31 01:36:55 r1epi kernel: [<ffffffff810ae130>] ? 
insert_kthread_work+0x40/0x40 May 31 01:36:55 r1epi kernel: Code: 00 66 41 83 45 58 01 66 83 83 88 00 00 00 01 48 89 de 4c 89 ef e8 ca 81 0e 00 4c 89 e8 e9 a1 fe ff ff f3 90 48 8b 88 b0 00 00 00 <80> e1 01 75 f2 e9 75 ff ff ff 48 89 f8 e9 86 fe ff ff 0f 0b 0f ernel: [<ffffffffa01b2077>] svc_process+0x107/0x170 [sunrpc] May 31 01:36:55 r1epi kernel: [<ffffffffa01f674f>] nfsd+0xbf/0x130 [nfsd] May 31 01:36:55 r1epi kernel: [<ffffffffa01f6690>] ? nfsd_destroy+0x80/0x80 [nfsd] May 31 01:36:55 r1epi kernel: [<ffffffff810ae211>] kthread+0xe1/0x100 May 31 01:36:55 r1epi kernel: [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40 May 31 01:36:55 r1epi kernel: 01 May 31 01:36:55 r1epi kernel: 75 May 31 01:36:55 r1epi kernel: f2 May 31 01:36:55 r1epi kernel: e9 May 31 01:36:55 r1epi kernel: 75 May 31 01:36:55 r1epi kernel: ff May 31 01:36:55 r1epi kernel: ff May 31 01:36:55 r1epi kernel: ff May 31 01:36:55 r1epi kernel: May 31 01:36:55 r1epi kernel: [<ffffffff816fef7c>] ret_from_fork+0x7c/0xb0 May 31 01:36:55 r1epi kernel: May 31 01:36:55 r1epi kernel: [<ffffffff810ae130>] ? insert_kthread_work+0x40/0x40
There were also reports of soft lockups in shrink_dentry_list on lkml recently: https://lkml.org/lkml/2014/5/26/125 Apparently fixed in 3.15-rc7, but I'm unclear when the problem was introduced--possibly too recently to explain your issue.
Just another FYI: these are the kernels installed, vanilla Fedora fc20:

-rwxr-xr-x 1 root root 5329128 Apr  4 05:17 vmlinuz-3.13.9-200.fc20.x86_64
-rwxr-xr-x 1 root root 5514584 Apr 28 07:47 vmlinuz-3.14.2-200.fc20.x86_64
-rwxr-xr-x 1 root root 5514392 May 13 06:56 vmlinuz-3.14.4-200.fc20.x86_64

This seems to be the only stable one -- no "BUG: soft lockup - CPU#0 stuck for...":

Linux r1epi 3.13.9-200.fc20.x86_64 #1 SMP Fri Apr 4 12:13:05 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

It has not crashed for a few weeks. I could try 3.15-rc7 or just stay on 3.13.9-200 for a bit; please advise, thanks!
Actually, taking a quick look at the logs, I think the fix for the soft lockup mentioned above is in -rc8, not -rc7 (b2b80195d882, "dealing with the rest of shrink_dentry_list() livelock"). Anyway, it looks like the latest Fedora 20 kernel is 3.15.4-200; could you just try that? It should have a fix for the known shrink_dentry_list soft lockup.
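Whether a given release tag contains a fix commit can be checked with `git tag --contains`. A toy illustration of the mechanics on a throwaway repo (the tag names are made up; the real check would run `git tag --contains b2b80195d882` inside a linux-stable clone):

```shell
# Toy demo of locating which tags contain a fix commit. All names below are
# stand-ins for the real kernel tags and the real commit b2b80195d882.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q .
git config user.email "you@example.com"
git config user.name "you"

echo base > f
git add f
git commit -qm "base"
git tag v1-rc7                              # release cut before the fix

echo fix > f
git commit -qam "the fix"                   # plays the role of b2b80195d882
fix=$(git rev-parse HEAD)
git tag v1-rc8                              # release cut after the fix

# Only tags reachable from (i.e. containing) the fix commit are listed:
git tag --contains "$fix"
```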
3.13.9-200.fc20.x86_64

Note: I got list_del corruption this week on the above kernel (FYI) -- I thought it would be more stable. I have now moved to 3.15.8-200.fc20.x86_64, hoping for no more crashing at 9pm! ha, thanks, Gary

Aug 11 19:44:00 r1epi kernel: [1083825.700024] ------------[ cut here ]------------
Aug 11 19:44:00 r1epi kernel: [1083825.700031] WARNING: CPU: 1 PID: 18206 at lib/list_debug.c:53 __list_del_entry+0x63/0xd0()
Aug 11 19:44:00 r1epi kernel: [1083825.700033] list_del corruption, ffff88027735e160->next is LIST_POISON1 (dead000000100100)
Aug 11 19:44:00 r1epi kernel: [1083825.700034] Modules linked in: binfmt_misc sch_sfq xt_iprange bonding ip6t_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi snd_hda_codec_realtek btrfs raid6_pq libcrc32c eeepc_wmi asus_wmi sparse_keymap xor mxm_wmi x86_pkg_temp_thermal coretemp kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel btusb bluetooth lpc_ich microcode serio_raw mfd_core rfkill snd_hda_intel e1000 i2c_i801 snd_hda_codec snd_hwdep snd_pcm e1000e snd_page_alloc snd_timer mei_me snd ptp mei soundcore pps_core shpchp wmi nfsd auth_rpcgss nfs_acl lockd sunrpc i915 i2c_algo_bit drm_kms_helper drm firewire_ohci firewire_core crc_itu_t aacraid i2c_core video
Aug 11 19:44:00 r1epi kernel: [1083825.700068] CPU: 1 PID: 18206 Comm: nfsd Not tainted 3.13.9-200.fc20.x86_64 #1
Aug 11 19:44:00 r1epi kernel: [1083825.700070] Hardware name: System manufacturer System Product Name/P8Z68-V PRO, BIOS 0902 09/19/2011
Aug 11 19:44:00 r1epi kernel: [1083825.700072] 0000000000000009 ffff88001ab3bce8 ffffffff81687dac ffff88001ab3bd30
Aug 11 19:44:00 r1epi kernel: [1083825.700074] ffff88001ab3bd20 ffffffff8106d4dd ffff88027735e160 ffff8800c3faa000
Aug 11 19:44:00 r1epi kernel: [1083825.700076] 0000000000000002 000000000000006c 000000000000006c ffff88001ab3bd80
Aug 11 19:44:00 r1epi kernel:
[1083825.700078] Call Trace:
Aug 11 19:44:00 r1epi kernel: [1083825.700083] [<ffffffff81687dac>] dump_stack+0x45/0x56
Aug 11 19:44:00 r1epi kernel: [1083825.700086] [<ffffffff8106d4dd>] warn_slowpath_common+0x7d/0xa0
Aug 11 19:44:00 r1epi kernel: [1083825.700088] [<ffffffff8106d54c>] warn_slowpath_fmt+0x4c/0x50
Aug 11 19:44:00 r1epi kernel: [1083825.700090] [<ffffffff8132cd93>] __list_del_entry+0x63/0xd0
Aug 11 19:44:00 r1epi kernel: [1083825.700097] [<ffffffffa01f8c41>] lru_put_end+0x21/0x60 [nfsd]
Aug 11 19:44:00 r1epi kernel: [1083825.700102] [<ffffffffa01f95d5>] nfsd_cache_update+0x85/0x150 [nfsd]
Aug 11 19:44:00 r1epi kernel: [1083825.700107] [<ffffffffa01ede02>] nfsd_dispatch+0x192/0x200 [nfsd]
Aug 11 19:44:00 r1epi kernel: [1083825.700117] [<ffffffffa01b931d>] svc_process_common+0x46d/0x6d0 [sunrpc]
Aug 11 19:44:00 r1epi kernel: [1083825.700126] [<ffffffffa01b9687>] svc_process+0x107/0x170 [sunrpc]
Aug 11 19:44:00 r1epi kernel: [1083825.700131] [<ffffffffa01ed71f>] nfsd+0xbf/0x130 [nfsd]
Aug 11 19:44:00 r1epi kernel: [1083825.700135] [<ffffffffa01ed660>] ? nfsd_destroy+0x80/0x80 [nfsd]
Aug 11 19:44:00 r1epi kernel: [1083825.700138] [<ffffffff8108f2f2>] kthread+0xd2/0xf0
Aug 11 19:44:00 r1epi kernel: [1083825.700141] [<ffffffff8108f220>] ? insert_kthread_work+0x40/0x40
Aug 11 19:44:00 r1epi kernel: [1083825.700144] [<ffffffff81696cbc>] ret_from_fork+0x7c/0xb0
Aug 11 19:44:00 r1epi kernel: [1083825.700146] [<ffffffff8108f220>] ? insert_kthread_work+0x40/0x40
Aug 11 19:44:00 r1epi kernel: [1083825.700147] ---[ end trace 4ca77c7dc9ca19d2 ]---
Aug 11 19:44:00 r1epi kernel: ------------[ cut here ]------------
Aug 11 19:44:00 r1epi kernel: ------------[ cut here ]------------
Aug 11 19:44:00 r1epi kernel: WARNING: CPU: 1 PID: 18206 at lib/list_debug.c:53 __list_del_entry+0x63/0xd0()
Aug 11 19:44:00 r1epi kernel: list_del corruption, ffff88027735e160->next is LIST_POISON1 (dead000000100100)
Aug 11 19:44:00 r1epi kernel: Modules linked in: binfmt_misc sch_sfq xt_iprange bonding ip6t_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables iTCO_wdt iTCO_vendor_support snd_hda_codec_hdmi snd_hda_codec_realtek btrfs raid6_pq libcrc32c eeepc_wmi asus_wmi sparse_keymap xor mxm_wmi x86_pkg_temp_thermal coretemp kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel btusb bluetooth lpc_ich microcode serio_raw mfd_core rfkill snd_hda_intel e1000 i2c_i801 snd_hda_codec snd_hwdep snd_pcm e1000e snd_page_alloc snd_timer mei_me snd ptp mei soundcore pps_core shpchp wmi nfsd auth_rpcgss nfs_acl lockd sunrpc i915 i2c_algo_bit drm_kms_helper drm firewire_ohci firewire_core crc_itu_t aacraid i2c_core video
Aug 11 19:44:00 r1epi kernel: CPU: 1 PID: 18206 Comm: nfsd Not tainted 3.13.9-200.fc20.x86_64 #1
Aug 11 19:44:00 r1epi kernel: Hardware name: System manufacturer System Product Name/P8Z68-V PRO, BIOS 0902 09/19/2011
Aug 11 19:44:00 r1epi kernel: 0000000000000009 ffff88001ab3bce8 ffffffff81687dac ffff88001ab3bd30
Aug 11 19:44:00 r1epi kernel: ffff88001ab3bd20 ffffffff8106d4dd ffff88027735e160 ffff8800c3faa000
Aug 11 19:44:00 r1epi kernel: 0000000000000002 000000000000006c 000000000000006c ffff88001ab3bd80
Aug 11 19:44:00 r1epi kernel: Call Trace:
Aug 11 19:44:00 r1epi kernel: [<ffffffff81687dac>] dump_stack+0x45/0x56
Aug 11 19:44:00 r1epi kernel: [<ffffffff8106d4dd>] warn_slowpath_common+0x7d/0xa0
Aug 11 19:44:00 r1epi kernel: [<ffffffff8106d54c>] warn_slowpath_fmt+0x4c/0x50
Aug 11 19:44:00 r1epi kernel: [<ffffffff8132cd93>] __list_del_entry+0x63/0xd0
Aug 11 19:44:00 r1epi kernel: [<ffffffffa01f8c41>] lru_put_end+0x21/0x60 [nfsd]
Aug 11 19:44:00 r1epi kernel: [<ffffffffa01f95d5>] nfsd_cache_update+0x85/0x150 [nfsd]
Aug 11 19:44:00 r1epi kernel: [<ffffffffa01ede02>] nfsd_dispatch+0x192/0x200 [nfsd]
Aug 11 19:44:00 r1epi kernel: [<ffffffffa01b931d>] svc_process_common+0x46d/0x6d0 [sunrpc]
Aug 11 19:44:00 r1epi kernel: [<ffffffffa01b9687>] svc_process+0x107/0x170 [sunrpc]
Aug 11 19:44:00 r1epi kernel: [<ffffffffa01ed71f>] nfsd+0xbf/0x130 [nfsd]
Aug 11 19:44:00 r1epi kernel: [<ffffffffa01ed660>] ? nfsd_destroy+0x80/0x80 [nfsd]
Aug 11 19:44:00 r1epi kernel: [<ffffffff8108f2f2>] kthread+0xd2/0xf0
Aug 11 19:44:00 r1epi kernel: [<ffffffff8108f220>] ? insert_kthread_work+0x40/0x40
Aug 11 19:44:00 r1epi kernel: [<ffffffff81696cbc>] ret_from_fork+0x7c/0xb0
Aug 11 19:44:00 r1epi kernel: [<ffffffff8108f220>] ? insert_kthread_work+0x40/0x40
Aug 11 19:44:00 r1epi kernel: ---[ end trace 4ca77c7dc9ca19d2 ]---
Aug 11 19:44:04 r1epi salt-minion: [WARNING ] SaltReqTimeoutError: Waited 60 seconds
Aug 11 19:44:04 r1epi salt-minion: [INFO ] Waiting for minion key to be accepted by the master.
(In reply to g. artim from comment #12)
> 3.13.9-200.fc20.x86_64
>
> note I got list_del corruption this week on the above kernel (fyi) -- I
> thought it would be more stable, I have now moved to
>
> 3.15.8-200.fc20.x86_64
>
> hoping for no more crashing at 9pm! ha, thanks, Gary

Thanks, yes that should have the fix; let us know either way.
did more testing; this combo creates the soft lockup on cpu#n, n=0..7:

run:
===
- tree command on an nfs client against a big (20TB) raid, nfs4 mounted, 1000s of files
- tree on another client, same dir
- scp of a 3TB file from a client to the home server
- rsync backups of all servers across the net, output to the nfs server
- rsync on the server, raid to raid (no nfs required); the from target is 20TB on a 5805 adaptec, the to target is the same
- on the server, watch `dmesg | tail`

With 6 console terms open on the raid, after about 15 mins it hangs and I get:

self-detected stall on cpu (1)
BUG: soft lockup - cpu#0 ...through cpu#7 stuck for 22s! [nfsd: nnn]

in the dmesg, and the funny thing is I keep getting them about every 22 seconds.

Ok, frustrated as hell, I tried the following in order to see if it's ME, something I did (I built the server):

- I flashed the mb, retested: lockup
- I went from an 850W psu to a 1300W psu, retested: lockup
- I replaced the memory, ran memtest, retested: lockup
- I pulled the MB/CPU/memory from my desktop, installed it, then did a complete reinstall and update of the O/S to fc20, retested: lockup

hw now:
======
mb: asus Z87-PRO
cpu: i7-4771 @ 3.50GHz
raid cards: adaptec 5805

software:
========
3.16.6-200.fc20.x86_64 #1 SMP Wed Oct 15 13:06:51 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
btrfs (with lzo compression) on a hw raid 5 and a hw raid 0

My research group is away at a conference, so let me know asap if there is something more I could do to help fix this. Also, would I get more stability if I ran Red Hat? I've run Fedora for about 16 years, production and test, and have never had a problem I couldn't get fixed. If running a stable version would help, let me know. thanks for ANY feedback!! -- gary
correction: a 3GB file was scp'ed, not 3TB... oops!
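For anyone trying to reproduce, the load mix described above can be sketched as a rough shell script. All paths and hostnames below are placeholders, not taken from this report; with the default DRY_RUN=1 it only prints what it would do.

```shell
#!/bin/bash
# Rough sketch of the lockup-triggering load mix described above.
# All paths and hostnames are placeholders; DRY_RUN=1 (the default)
# only prints the commands instead of launching them in parallel.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = 1 ]; then
    echo "WOULD RUN: $*"
  else
    "$@" &        # launch each load generator in the background
  fi
}

run tree /mnt/raid/bigdir                           # NFS client 1, nfs4 mount, 1000s of files
run tree /mnt/raid/bigdir                           # NFS client 2, same dir
run scp /data/big-3gb.file homeserver:/incoming/    # 3GB copy to the home server
run rsync -av otherserver:/srv/ /mnt/raid/backups/  # backups over the net onto the NFS server
run rsync -av /my2 /backup                          # raid-to-raid on the server itself, no NFS
wait                                                # with DRY_RUN=1 nothing was started
```

On the server itself, `watch 'dmesg | tail'` (as in the comment above) is still the simplest way to catch the first soft-lockup message while this runs.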
Could you post the warnings from the new soft lockups? Or are the backtraces really identical to the ones you were seeing before? Hard to judge whether you would have seen this on another distro. I haven't seen this on RHEL, but there could just be something unusual about your setup or workload. (Also, btrfs is unsupported on RHEL (except as a "tech preview"), and probably isn't what would be recommended for production when stability's the priority.)
couldn't find the soft lockup errors in the log and didn't capture them through the terminal (watch 'dmesg|tail') I had open. Didn't see anything more than in the past.

food for thought: I rebooted after the lockup and _just_ ran:

rsync -av /my2 /backup

(both btrfs filesystems) and got a lockup, but not with CPU soft lockup errors -- a backtrace in btrfs instead. I can't locate the messages in the log... maybe they never made it to the log; I need to get the serial port working for a console capture.

At this point I'm compressing the from target so I can easily copy from btrfs to an xfs filesystem in hopes of ridding myself of this instability (do a switch between the 2 to get _only_ xfs filesystems). I have to face the issue of having used lzo compression... more disk space. I did run for quite some time (a year) with this config without as many lockups (but not 0); something happened to make it much more unstable. It could be software, but none of the raid cards or smartd report errors on the drives. I'm thinking btrfs is not production ready yet.
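Since the backtrace apparently never made it to the on-disk log, one alternative to a serial console is netconsole, which streams kernel messages over UDP to a second machine as they are emitted. This is only a configuration sketch; the addresses, interface name, and MAC below are placeholders for your own network, not values from this report.

```shell
# Load netconsole to forward kernel messages to a capture box.
# Parameter format: netconsole=src-port@src-ip/dev,dst-port@dst-ip/dst-mac
# (all values below are placeholders -- substitute your own)
modprobe netconsole \
    netconsole=6665@192.168.1.10/eth0,6666@192.168.1.20/00:11:22:33:44:55

# On the receiving machine, capture everything to a file:
#   nc -u -l 6666 | tee crash-capture.log
```

Unlike syslog, this survives a hang where the filesystem never flushes, so the soft-lockup or btrfs backtrace should land on the capture box even if the server then freezes.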
I guess we should cc: Josef if we think btrfs might be at fault.
okay, more interesting (voodoo) stuff: I converted both raids, within the same computer, to xfs. It seemed to run clean, but then I got:

[176363.245588] aacraid: Host adapter abort request (3,0,0,0)
[176363.245600] aacraid: Host adapter abort request (3,0,0,0)
[176363.245612] aacraid: Host adapter abort request (3,0,0,0)
[176363.245624] aacraid: Host adapter abort request (3,0,0,0)
[176363.245690] aacraid: Host adapter reset request. SCSI hang ?
[193002.981921] aacraid: Host adapter abort request (3,0,0,0)
[193002.981938] aacraid: Host adapter abort request (3,0,0,0)
[193002.981952] aacraid: Host adapter abort request (3,0,0,0)
[193002.981966] aacraid: Host adapter abort request (3,0,0,0)
[193002.981979] aacraid: Host adapter abort request (3,0,0,0)
[193002.981991] aacraid: Host adapter abort request (3,0,0,0)
[193003.956322] aacraid: Host adapter reset request. SCSI hang ?

so I did some browsing on these errors and the Adaptec 5805, and someone said not to use the cfq scheduler with these cards. The machine was throwing these errors every so often, and the df and dmesg commands were slow. So for laughs I did

echo noop > /sys/block/sdb/queue/scheduler
echo noop > /sys/block/sdc/queue/scheduler

and the system came back, or started responding as expected. Now after all I did to try and solve this issue, it ends up being an I/O scheduling issue??? I don't know...
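For reference, the sysfs file shows every available scheduler with the active one in brackets (e.g. "noop deadline [cfq]"), so the current setting can be checked before and after the switch. A small sketch; the helper below and the udev rule path are illustrative assumptions, while sdb/sdc are the devices from the comment above:

```shell
# The sysfs scheduler file looks like "noop deadline [cfq]".
# This helper extracts the bracketed (active) scheduler name from
# such a string, e.g. from: cat /sys/block/sdb/queue/scheduler
get_active_sched() {
  printf '%s\n' "$1" | sed -n 's/.*\[\(.*\)\].*/\1/p'
}

# Runtime switch (root required), as done in the comment above:
#   echo noop > /sys/block/sdb/queue/scheduler
#   echo noop > /sys/block/sdc/queue/scheduler
#
# The echo only lasts until reboot. One assumed way to persist it is a
# udev rule (e.g. /etc/udev/rules.d/60-ioscheduler.rules):
#   ACTION=="add|change", KERNEL=="sd[bc]", ATTR{queue/scheduler}="noop"
```

Since the 5805 firmware does its own command reordering, bypassing cfq's reordering with noop is a plausible reason the hangs stopped, but that remains a guess rather than a confirmed root cause.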
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.17.2-200.fc20. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 21 and are still experiencing this issue, please change the version to Fedora 21. If you experience different issues, please open a new bug report for those.
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in over 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.
Sorry for the automated closing....

(In reply to g. artim from comment #19)
> and the system came back or started responding as expected. Now after all I
> did to try and solve this issue it ends up being a scheduling issue??? I
> dont know or ?

So since then have you seen any recurrence of the problem? It would be interesting to know if switching to xfs helped, in which case it's more likely to be something btrfs-specific.
after switching to xfs the problem seems to have stopped... but because I had 2 raid configs in one system, I turned one off at the same time, so it could have been the pci bus. I moved the second raid to a separate system. I've recently moved to LSI cards and 24 3TB drives, and that seems stable also -- the new config is xfs. I miss the lzo option of btrfs, but it (btrfs-lzo) cornered me when I tried to switch to xfs and couldn't copy all the data. gary
*********** MASS BUG UPDATE **************

We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 20 kernel bugs.

Fedora 20 has now been rebased to 3.19.5-100.fc20. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 21 and are still experiencing this issue, please change the version to Fedora 21. If you experience different issues, please open a new bug report for those.
This message is a reminder that Fedora 20 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 20. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '20'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 20 reached end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 20 changed to end-of-life (EOL) status on 2015-06-23. Fedora 20 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. If you are unable to reopen this bug, please file a new report against the current release. If you experience problems, please add a comment to this bug. Thank you for reporting this bug and we are sorry it could not be fixed.