Bug 827496
Summary: | NFS timeouts with kmod-tg3 | ||
---|---|---|---|
Product: | Red Hat Enterprise Linux 6 | Reporter: | Bogdan-Stefan Rotariu <bogdan> |
Component: | kernel | Assignee: | John Feeney <jfeeney> |
Status: | CLOSED CANTFIX | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 6.2 | CC: | jfeeney, manuel.wolfshant, rwheeler, teruaki.ishizaki, tmay81 |
Target Milestone: | rc | ||
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | Bug Fix | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2012-08-06 19:44:39 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Description
Bogdan-Stefan Rotariu
2012-06-01 15:45:04 UTC
Just so I understand this correctly, tg3 version 3.119 times out but 3.116j-1 works without issue? What type of tg3 nic are you using (lspci -v |grep Broadcom)? John, 3116j-1 worked for a few days, but all the time I had nfslock "not responding errors" in the logs. lspci -v output : 04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5722 Gigabit Ethernet PCI Express 06:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5722 Gigabit Ethernet PCI Express The machine is an HP Proliant SE1102. This is also occurring for me with any size transfer (~3MB or larger). I'm using a Dell Optiplex 790 Minitower that has is running the e1000e: Intel(R) PRO/1000 Network Driver - 1.4.4-k and Linux version 2.6.32-220.17.1.el6.x86_64. Here's the relevant dmesg info: INFO: task cp:2786 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. cp D 0000000000000001 0 2786 2417 0x00000004 ffff8801e5307c78 0000000000000082 ffff8801e5307bf8 ffffffffa05c6380 ffff8801e4845080 ffff8802251eb200 ffff8801e5000001 ffffffff811261c7 ffff88020a56b078 ffff8801e5307fd8 000000000000f4e8 ffff88020a56b078 Call Trace: [<ffffffff811261c7>] ? write_cache_pages+0x117/0x4a0 [<ffffffff8109b949>] ? ktime_get_ts+0xa9/0xe0 [<ffffffff81110d60>] ? sync_page+0x0/0x50 [<ffffffff814ed833>] io_schedule+0x73/0xc0 [<ffffffff81110d9d>] sync_page+0x3d/0x50 [<ffffffff814ee1ef>] __wait_on_bit+0x5f/0x90 [<ffffffff81110f53>] wait_on_page_bit+0x73/0x80 [<ffffffff81090d70>] ? wake_bit_function+0x0/0x50 [<ffffffff811273f5>] ? pagevec_lookup_tag+0x25/0x40 [<ffffffff8111136b>] wait_on_page_writeback_range+0xfb/0x190 [<ffffffff81111538>] filemap_write_and_wait_range+0x78/0x90 [<ffffffff811a577e>] vfs_fsync_range+0x7e/0xe0 [<ffffffff811a584d>] vfs_fsync+0x1d/0x20 [<ffffffffa058b6d0>] nfs_file_flush+0x70/0xa0 [nfs] [<ffffffff81173d2c>] filp_close+0x3c/0x90 [<ffffffff81173e25>] sys_close+0xa5/0x100 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b nfs: server hnasgd_home not responding, still trying nfs: server hnasgd_home not responding, still trying nfs: server hnasgd_home not responding, still trying INFO: task cp:2786 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. cp D 0000000000000001 0 2786 2417 0x00000004 ffff8801e5307c78 0000000000000082 ffff8801e5307bf8 ffffffffa05c6380 ffff8801e4845080 ffff8802251eb200 ffff8801e5000001 ffffffff811261c7 ffff88020a56b078 ffff8801e5307fd8 000000000000f4e8 ffff88020a56b078 Call Trace: [<ffffffff811261c7>] ? write_cache_pages+0x117/0x4a0 [<ffffffff8109b949>] ? ktime_get_ts+0xa9/0xe0 [<ffffffff81110d60>] ? sync_page+0x0/0x50 [<ffffffff814ed833>] io_schedule+0x73/0xc0 [<ffffffff81110d9d>] sync_page+0x3d/0x50 [<ffffffff814ee1ef>] __wait_on_bit+0x5f/0x90 [<ffffffff81110f53>] wait_on_page_bit+0x73/0x80 [<ffffffff81090d70>] ? wake_bit_function+0x0/0x50 [<ffffffff811273f5>] ? pagevec_lookup_tag+0x25/0x40 [<ffffffff8111136b>] wait_on_page_writeback_range+0xfb/0x190 [<ffffffff81111538>] filemap_write_and_wait_range+0x78/0x90 [<ffffffff811a577e>] vfs_fsync_range+0x7e/0xe0 [<ffffffff811a584d>] vfs_fsync+0x1d/0x20 [<ffffffffa058b6d0>] nfs_file_flush+0x70/0xa0 [nfs] [<ffffffff81173d2c>] filp_close+0x3c/0x90 [<ffffffff81173e25>] sys_close+0xa5/0x100 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b This request was not resolved in time for the current release. Red Hat invites you to ask your support representative to propose this request, if still desired, for consideration in the next release of Red Hat Enterprise Linux. Elrepo has recently announced ( http://lists.elrepo.org/pipermail/elrepo/2012-July/001309.html ) an updated Broadcom Tigon 3 (tg3) driver, version 3.122n This request was erroneously removed from consideration in Red Hat Enterprise Linux 6.4, which is currently under development. This request will be evaluated for inclusion in Red Hat Enterprise Linux 6.4. Thanks Manuel, it seems that the update from Elrepo fixes the problem. I've used the NFS for a week now without any issues. The kernel I've used is 2.6.32-220.17.1 For what is worth, I see random stops of communication between a HS20 blade which uses the stock driver from the Centos 6.3 line of kernels and another identical blade connected via the same switch in the same chassis, but which uses the kmod-tg3-3.122n package from elrepo. Please find below a few lines retrieved from the kernel log: Sep 11 11:36:23 mail kernel: INFO: task httpd:21220 blocked for more than 120 seconds. Sep 11 11:36:23 mail kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 11 11:36:23 mail kernel: httpd D e79e5eac 0 21220 3084 0x00000080 Sep 11 11:36:23 mail kernel: c1540000 00000082 00000002 e79e5eac c1f04024 00000000 e79e5ea4 c1e08680 Sep 11 11:36:23 mail kernel: d518cd9c e87ee180 00038324 bfe07354 00038324 c0b25680 c0b25680 c15402a8 Sep 11 11:36:23 mail kernel: c0b25680 c0b21024 c0b25680 c15402a8 3ae88d3d 00000000 00000400 c1540000 Sep 11 11:36:23 mail kernel: Call Trace: Sep 11 11:36:23 mail kernel: [<c0408400>] ? __switch_to+0x130/0x1a0 Sep 11 11:36:23 mail kernel: [<c083c790>] ? schedule+0x3c0/0xae0 Sep 11 11:36:23 mail kernel: [<c0441c23>] ? check_preempt_wakeup+0x183/0x220 Sep 11 11:36:23 mail kernel: [<c083d485>] ? schedule_timeout+0x195/0x250 Sep 11 11:36:23 mail kernel: [<c0476a2b>] ? autoremove_wake_function+0x1b/0x40 Sep 11 11:36:23 mail kernel: [<c083d1e9>] ? wait_for_common+0xe9/0x150 Sep 11 11:36:23 mail kernel: [<c044e420>] ? default_wake_function+0x0/0x10 Sep 11 11:36:23 mail kernel: [<c0472ac8>] ? flush_work+0x58/0xa0 Sep 11 11:36:23 mail kernel: [<c04726a0>] ? wq_barrier_func+0x0/0x10 Sep 11 11:36:23 mail kernel: [<c0472bfe>] ? schedule_on_each_cpu+0xee/0x130 Sep 11 11:36:23 mail kernel: [<c04f5700>] ? lru_add_drain_per_cpu+0x0/0x10 Sep 11 11:36:23 mail kernel: [<c0508eed>] ? sys_mlock+0x3d/0xe0 Sep 11 11:36:23 mail kernel: [<c083ed64>] ? syscall_call+0x7/0xb Sep 11 11:36:23 mail kernel: [<c0830000>] ? xenkbd_probe+0xf8/0x2d1 Sep 11 11:36:23 mail kernel: INFO: task httpd:21276 blocked for more than 120 seconds. Sep 11 11:36:23 mail kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. Sep 11 11:36:23 mail kernel: httpd D cc085eac 0 21276 3084 0x00000080 Sep 11 11:36:23 mail kernel: d4427aa0 00000086 00000002 cc085eac c1f04024 00000000 cc085ea4 c1e08680 Sep 11 11:36:23 mail kernel: f44e7494 e84aa900 0003832c 2d1db042 0003832c c0b25680 c0b25680 d4427d48 Sep 11 11:36:23 mail kernel: c0b25680 c0b21024 c0b25680 d4427d48 3ae906bd 00000000 00000400 d4427aa0 Sep 11 11:36:23 mail kernel: Call Trace: Sep 11 11:36:23 mail kernel: [<c05f9671>] ? __next_cpu+0x11/0x20 Sep 11 11:36:23 mail kernel: [<c0798899>] ? net_rx_action+0x199/0x280 Sep 11 11:36:23 mail kernel: [<f883c269>] ? tg3_interrupt_tagged+0xa9/0xf0 [tg3] Sep 11 11:36:23 mail kernel: [<c083d485>] ? schedule_timeout+0x195/0x250 Sep 11 11:36:23 mail kernel: [<c04b7955>] ? handle_fasteoi_irq+0x85/0xc0 Sep 11 11:36:23 mail kernel: [<c045d1a5>] ? irq_exit+0x35/0x70 Sep 11 11:36:23 mail kernel: [<c040b110>] ? do_IRQ+0x50/0xc0 Sep 11 11:36:23 mail kernel: [<c0476a2b>] ? autoremove_wake_function+0x1b/0x40 Sep 11 11:36:23 mail kernel: [<c083d1e9>] ? wait_for_common+0xe9/0x150 Sep 11 11:36:23 mail kernel: [<c044e420>] ? default_wake_function+0x0/0x10 Sep 11 11:36:23 mail kernel: [<c0472ac8>] ? flush_work+0x58/0xa0 Sep 11 11:36:23 mail kernel: [<c04726a0>] ? wq_barrier_func+0x0/0x10 Sep 11 11:36:23 mail kernel: [<c0472bfe>] ? schedule_on_each_cpu+0xee/0x130 Sep 11 11:36:23 mail kernel: [<c04f5700>] ? lru_add_drain_per_cpu+0x0/0x10 Sep 11 11:36:23 mail kernel: [<c0508eed>] ? sys_mlock+0x3d/0xe0 Sep 11 11:36:23 mail kernel: [<c083ed64>] ? syscall_call+0x7/0xb Sep 11 11:36:23 mail kernel: [<c0830000>] ? xenkbd_probe+0xf8/0x2d1 I think that this bugzilla entry should be reopened or at least not considered closed, ugly bug(s) still seem to exist in the current version of the driver. For the record: httpd was blocked simply because it could no longer access the network. The problem occurs even after httpd was relocated to another server. The network just dies (even arping no longer works) and service network restart must be issued. |