Bug 827496

Summary:	NFS timeouts with kmod-tg3
Product:	Red Hat Enterprise Linux 6	Reporter:	Bogdan-Stefan Rotariu <bogdan>
Component:	kernel	Assignee:	John Feeney <jfeeney>
Status:	CLOSED CANTFIX	QA Contact:	Red Hat Kernel QE team <kernel-qe>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	6.2	CC:	jfeeney, manuel.wolfshant, rwheeler, teruaki.ishizaki, tmay81
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2012-08-06 19:44:39 UTC	Type:	Bug
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description Bogdan-Stefan Rotariu 2012-06-01 15:45:04 UTC

Description of problem:

Fresh install of CentOS 6.2 has timeout issues with long time transfers on nfs.

Used settings within last week :

nfs,udp,intr,mountvers=3,nfsvers=3

Version-Release number of selected component (if applicable):

kmod-tg3-3.119

How reproducible:


Steps to Reproduce:
1. Last 6.2 updates, mount nfs
2. rsync -avr /home /path/to/nfs
  
Actual results:

Jun  1 09:39:14 ark kernel: nfs: server 192.168.1.122 OK
Jun  1 09:40:14 ark kernel: nfs: server 192.168.1.122 not responding, still trying
Jun  1 09:41:00 ark kernel: nfs: server 192.168.1.122 OK
Jun  1 09:42:00 ark kernel: nfs: server 192.168.1.122 not responding, still trying

Jun  1 09:48:36 ark kernel: INFO: task rsync:31874 blocked for more than 120 seconds.
Jun  1 09:48:36 ark kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  1 09:48:36 ark kernel: rsync         D 0000000000000000     0 31874  31873 0x00000080
Jun  1 09:48:36 ark kernel: ffff88003c6cbc78 0000000000000082 ffff88003c6cbbf8 ffffffffa04da380
Jun  1 09:48:36 ark kernel: ffff8801fbe4d0c0 ffff880413979e80 ffff88003c000001 ffffffff811261c7
Jun  1 09:48:36 ark kernel: ffff880412acf0f8 ffff88003c6cbfd8 000000000000f4e8 ffff880412acf0f8
Jun  1 09:48:36 ark kernel: Call Trace:
Jun  1 09:48:36 ark kernel: [<ffffffff811261c7>] ? write_cache_pages+0x117/0x4a0
Jun  1 09:48:36 ark kernel: [<ffffffff8109b949>] ? ktime_get_ts+0xa9/0xe0
Jun  1 09:48:36 ark kernel: [<ffffffff81110d60>] ? sync_page+0x0/0x50
Jun  1 09:48:36 ark kernel: [<ffffffff814ed833>] io_schedule+0x73/0xc0
Jun  1 09:48:36 ark kernel: [<ffffffff81110d9d>] sync_page+0x3d/0x50
Jun  1 09:48:36 ark kernel: [<ffffffff814ee1ef>] __wait_on_bit+0x5f/0x90
Jun  1 09:48:36 ark kernel: [<ffffffff81110f53>] wait_on_page_bit+0x73/0x80
Jun  1 09:48:36 ark kernel: [<ffffffff81090d70>] ? wake_bit_function+0x0/0x50
Jun  1 09:48:36 ark kernel: [<ffffffff811273f5>] ? pagevec_lookup_tag+0x25/0x40
Jun  1 09:48:36 ark kernel: [<ffffffff8111136b>] wait_on_page_writeback_range+0xfb/0x190
Jun  1 09:48:36 ark kernel: [<ffffffff81111538>] filemap_write_and_wait_range+0x78/0x90
Jun  1 09:48:36 ark kernel: [<ffffffff811a577e>] vfs_fsync_range+0x7e/0xe0
Jun  1 09:48:36 ark kernel: [<ffffffff811a584d>] vfs_fsync+0x1d/0x20
Jun  1 09:48:36 ark kernel: [<ffffffffa049f6d0>] nfs_file_flush+0x70/0xa0 [nfs]
Jun  1 09:48:36 ark kernel: [<ffffffff81173d2c>] filp_close+0x3c/0x90
Jun  1 09:48:36 ark kernel: [<ffffffff81173e25>] sys_close+0xa5/0x100
Jun  1 09:48:36 ark kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
Jun  1 09:50:36 ark kernel: INFO: task rsync:31874 blocked for more than 120 seconds.
Jun  1 09:50:36 ark kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Jun  1 09:50:36 ark kernel: rsync         D 0000000000000000     0 31874  31873 0x00000080
Jun  1 09:50:36 ark kernel: ffff88003c6cbc78 0000000000000082 ffff88003c6cbbf8 ffffffffa04da380
Jun  1 09:50:36 ark kernel: ffff8801fbe4d0c0 ffff880413979e80 ffff88003c000001 ffffffff811261c7
Jun  1 09:50:36 ark kernel: ffff880412acf0f8 ffff88003c6cbfd8 000000000000f4e8 ffff880412acf0f8
Jun  1 09:50:36 ark kernel: Call Trace:
Jun  1 09:50:36 ark kernel: [<ffffffff811261c7>] ? write_cache_pages+0x117/0x4a0
Jun  1 09:50:36 ark kernel: [<ffffffff8109b949>] ? ktime_get_ts+0xa9/0xe0
Jun  1 09:50:36 ark kernel: [<ffffffff81110d60>] ? sync_page+0x0/0x50
Jun  1 09:50:36 ark kernel: [<ffffffff814ed833>] io_schedule+0x73/0xc0
Jun  1 09:50:36 ark kernel: [<ffffffff81110d9d>] sync_page+0x3d/0x50
Jun  1 09:50:36 ark kernel: [<ffffffff814ee1ef>] __wait_on_bit+0x5f/0x90
Jun  1 09:50:36 ark kernel: [<ffffffff81110f53>] wait_on_page_bit+0x73/0x80
Jun  1 09:50:36 ark kernel: [<ffffffff81090d70>] ? wake_bit_function+0x0/0x50
Jun  1 09:50:36 ark kernel: [<ffffffff811273f5>] ? pagevec_lookup_tag+0x25/0x40
Jun  1 09:50:36 ark kernel: [<ffffffff8111136b>] wait_on_page_writeback_range+0xfb/0x190
Jun  1 09:50:36 ark kernel: [<ffffffff81111538>] filemap_write_and_wait_range+0x78/0x90
Jun  1 09:50:36 ark kernel: [<ffffffff811a577e>] vfs_fsync_range+0x7e/0xe0
Jun  1 09:50:36 ark kernel: [<ffffffff811a584d>] vfs_fsync+0x1d/0x20
Jun  1 09:50:36 ark kernel: [<ffffffffa049f6d0>] nfs_file_flush+0x70/0xa0 [nfs]
Jun  1 09:50:36 ark kernel: [<ffffffff81173d2c>] filp_close+0x3c/0x90
Jun  1 09:50:36 ark kernel: [<ffffffff81173e25>] sys_close+0xa5/0x100
Jun  1 09:50:36 ark kernel: [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b



Expected results:

It should work without any issue

Additional info:

kmod-tg3-3.116j-1 from elrepo works ok for a few days

Comment 3 John Feeney 2012-06-04 14:56:40 UTC

Just so I understand this correctly, tg3 version 3.119 times out but 3.116j-1
works without issue?

 
What type of tg3 nic are you using (lspci -v |grep Broadcom)?

Comment 5 Bogdan-Stefan Rotariu 2012-06-05 12:06:39 UTC

John, 3116j-1 worked for a few days, but all the time I had nfslock "not responding errors" in the logs.

lspci -v output :

04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5722 Gigabit Ethernet PCI Express
06:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5722 Gigabit Ethernet PCI Express

The machine is an HP Proliant SE1102.

Comment 6 Tyler 2012-06-14 02:29:48 UTC

This is also occurring for me with any size transfer (~3MB or larger).

I'm using a Dell Optiplex 790 Minitower that has is running the e1000e: Intel(R) PRO/1000 Network Driver - 1.4.4-k and Linux version 2.6.32-220.17.1.el6.x86_64.

Here's the relevant dmesg info:

INFO: task cp:2786 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cp            D 0000000000000001     0  2786   2417 0x00000004
 ffff8801e5307c78 0000000000000082 ffff8801e5307bf8 ffffffffa05c6380
 ffff8801e4845080 ffff8802251eb200 ffff8801e5000001 ffffffff811261c7
 ffff88020a56b078 ffff8801e5307fd8 000000000000f4e8 ffff88020a56b078
Call Trace:
 [<ffffffff811261c7>] ? write_cache_pages+0x117/0x4a0
 [<ffffffff8109b949>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff81110d60>] ? sync_page+0x0/0x50
 [<ffffffff814ed833>] io_schedule+0x73/0xc0
 [<ffffffff81110d9d>] sync_page+0x3d/0x50
 [<ffffffff814ee1ef>] __wait_on_bit+0x5f/0x90
 [<ffffffff81110f53>] wait_on_page_bit+0x73/0x80
 [<ffffffff81090d70>] ? wake_bit_function+0x0/0x50
 [<ffffffff811273f5>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffff8111136b>] wait_on_page_writeback_range+0xfb/0x190
 [<ffffffff81111538>] filemap_write_and_wait_range+0x78/0x90
 [<ffffffff811a577e>] vfs_fsync_range+0x7e/0xe0
 [<ffffffff811a584d>] vfs_fsync+0x1d/0x20
 [<ffffffffa058b6d0>] nfs_file_flush+0x70/0xa0 [nfs]
 [<ffffffff81173d2c>] filp_close+0x3c/0x90
 [<ffffffff81173e25>] sys_close+0xa5/0x100
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b
nfs: server hnasgd_home not responding, still trying
nfs: server hnasgd_home not responding, still trying
nfs: server hnasgd_home not responding, still trying
INFO: task cp:2786 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
cp            D 0000000000000001     0  2786   2417 0x00000004
 ffff8801e5307c78 0000000000000082 ffff8801e5307bf8 ffffffffa05c6380
 ffff8801e4845080 ffff8802251eb200 ffff8801e5000001 ffffffff811261c7
 ffff88020a56b078 ffff8801e5307fd8 000000000000f4e8 ffff88020a56b078
Call Trace:
 [<ffffffff811261c7>] ? write_cache_pages+0x117/0x4a0
 [<ffffffff8109b949>] ? ktime_get_ts+0xa9/0xe0
 [<ffffffff81110d60>] ? sync_page+0x0/0x50
 [<ffffffff814ed833>] io_schedule+0x73/0xc0
 [<ffffffff81110d9d>] sync_page+0x3d/0x50
 [<ffffffff814ee1ef>] __wait_on_bit+0x5f/0x90
 [<ffffffff81110f53>] wait_on_page_bit+0x73/0x80
 [<ffffffff81090d70>] ? wake_bit_function+0x0/0x50
 [<ffffffff811273f5>] ? pagevec_lookup_tag+0x25/0x40
 [<ffffffff8111136b>] wait_on_page_writeback_range+0xfb/0x190
 [<ffffffff81111538>] filemap_write_and_wait_range+0x78/0x90
 [<ffffffff811a577e>] vfs_fsync_range+0x7e/0xe0
 [<ffffffff811a584d>] vfs_fsync+0x1d/0x20
 [<ffffffffa058b6d0>] nfs_file_flush+0x70/0xa0 [nfs]
 [<ffffffff81173d2c>] filp_close+0x3c/0x90
 [<ffffffff81173e25>] sys_close+0xa5/0x100
 [<ffffffff8100b0f2>] system_call_fastpath+0x16/0x1b

Comment 7 RHEL Program Management 2012-07-10 08:33:21 UTC

This request was not resolved in time for the current release.
Red Hat invites you to ask your support representative to
propose this request, if still desired, for consideration in
the next release of Red Hat Enterprise Linux.

Comment 8 manuel wolfshant 2012-07-10 11:57:37 UTC

Elrepo has recently announced ( http://lists.elrepo.org/pipermail/elrepo/2012-July/001309.html ) an updated Broadcom Tigon 3 (tg3) driver, version 3.122n

Comment 9 RHEL Program Management 2012-07-10 23:53:58 UTC

This request was erroneously removed from consideration in Red Hat Enterprise Linux 6.4, which is currently under development.  This request will be evaluated for inclusion in Red Hat Enterprise Linux 6.4.

Comment 10 Bogdan-Stefan Rotariu 2012-07-18 10:49:35 UTC

Thanks Manuel, it seems that the update from Elrepo fixes the problem.
I've used the NFS for a week now without any issues.

The kernel I've used is 2.6.32-220.17.1

Comment 11 manuel wolfshant 2012-09-12 21:00:51 UTC

For what is worth, I see random stops of communication between a HS20 blade which uses the stock driver from the Centos 6.3 line of kernels and another identical blade connected via the same switch in the same chassis, but which uses the kmod-tg3-3.122n package from elrepo. Please find below a few lines retrieved from the kernel log:
    Sep 11 11:36:23 mail kernel: INFO: task httpd:21220 blocked for more than 120 seconds.
    Sep 11 11:36:23 mail kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Sep 11 11:36:23 mail kernel: httpd D e79e5eac 0 21220 3084 0x00000080
    Sep 11 11:36:23 mail kernel: c1540000 00000082 00000002 e79e5eac c1f04024 00000000 e79e5ea4 c1e08680
    Sep 11 11:36:23 mail kernel: d518cd9c e87ee180 00038324 bfe07354 00038324 c0b25680 c0b25680 c15402a8
    Sep 11 11:36:23 mail kernel: c0b25680 c0b21024 c0b25680 c15402a8 3ae88d3d 00000000 00000400 c1540000
    Sep 11 11:36:23 mail kernel: Call Trace:
    Sep 11 11:36:23 mail kernel: [<c0408400>] ? __switch_to+0x130/0x1a0
    Sep 11 11:36:23 mail kernel: [<c083c790>] ? schedule+0x3c0/0xae0
    Sep 11 11:36:23 mail kernel: [<c0441c23>] ? check_preempt_wakeup+0x183/0x220
    Sep 11 11:36:23 mail kernel: [<c083d485>] ? schedule_timeout+0x195/0x250
    Sep 11 11:36:23 mail kernel: [<c0476a2b>] ? autoremove_wake_function+0x1b/0x40
    Sep 11 11:36:23 mail kernel: [<c083d1e9>] ? wait_for_common+0xe9/0x150
    Sep 11 11:36:23 mail kernel: [<c044e420>] ? default_wake_function+0x0/0x10
    Sep 11 11:36:23 mail kernel: [<c0472ac8>] ? flush_work+0x58/0xa0
    Sep 11 11:36:23 mail kernel: [<c04726a0>] ? wq_barrier_func+0x0/0x10
    Sep 11 11:36:23 mail kernel: [<c0472bfe>] ? schedule_on_each_cpu+0xee/0x130
    Sep 11 11:36:23 mail kernel: [<c04f5700>] ? lru_add_drain_per_cpu+0x0/0x10
    Sep 11 11:36:23 mail kernel: [<c0508eed>] ? sys_mlock+0x3d/0xe0
    Sep 11 11:36:23 mail kernel: [<c083ed64>] ? syscall_call+0x7/0xb
    Sep 11 11:36:23 mail kernel: [<c0830000>] ? xenkbd_probe+0xf8/0x2d1
     
    Sep 11 11:36:23 mail kernel: INFO: task httpd:21276 blocked for more than 120 seconds.
    Sep 11 11:36:23 mail kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    Sep 11 11:36:23 mail kernel: httpd D cc085eac 0 21276 3084 0x00000080
    Sep 11 11:36:23 mail kernel: d4427aa0 00000086 00000002 cc085eac c1f04024 00000000 cc085ea4 c1e08680
    Sep 11 11:36:23 mail kernel: f44e7494 e84aa900 0003832c 2d1db042 0003832c c0b25680 c0b25680 d4427d48
    Sep 11 11:36:23 mail kernel: c0b25680 c0b21024 c0b25680 d4427d48 3ae906bd 00000000 00000400 d4427aa0
    Sep 11 11:36:23 mail kernel: Call Trace:
    Sep 11 11:36:23 mail kernel: [<c05f9671>] ? __next_cpu+0x11/0x20
    Sep 11 11:36:23 mail kernel: [<c0798899>] ? net_rx_action+0x199/0x280
    Sep 11 11:36:23 mail kernel: [<f883c269>] ? tg3_interrupt_tagged+0xa9/0xf0 [tg3]
    Sep 11 11:36:23 mail kernel: [<c083d485>] ? schedule_timeout+0x195/0x250
    Sep 11 11:36:23 mail kernel: [<c04b7955>] ? handle_fasteoi_irq+0x85/0xc0
    Sep 11 11:36:23 mail kernel: [<c045d1a5>] ? irq_exit+0x35/0x70
    Sep 11 11:36:23 mail kernel: [<c040b110>] ? do_IRQ+0x50/0xc0
    Sep 11 11:36:23 mail kernel: [<c0476a2b>] ? autoremove_wake_function+0x1b/0x40
    Sep 11 11:36:23 mail kernel: [<c083d1e9>] ? wait_for_common+0xe9/0x150
    Sep 11 11:36:23 mail kernel: [<c044e420>] ? default_wake_function+0x0/0x10
    Sep 11 11:36:23 mail kernel: [<c0472ac8>] ? flush_work+0x58/0xa0
    Sep 11 11:36:23 mail kernel: [<c04726a0>] ? wq_barrier_func+0x0/0x10
    Sep 11 11:36:23 mail kernel: [<c0472bfe>] ? schedule_on_each_cpu+0xee/0x130
    Sep 11 11:36:23 mail kernel: [<c04f5700>] ? lru_add_drain_per_cpu+0x0/0x10
    Sep 11 11:36:23 mail kernel: [<c0508eed>] ? sys_mlock+0x3d/0xe0
    Sep 11 11:36:23 mail kernel: [<c083ed64>] ? syscall_call+0x7/0xb
    Sep 11 11:36:23 mail kernel: [<c0830000>] ? xenkbd_probe+0xf8/0x2d1



I think that this bugzilla entry should be reopened or at least not considered closed, ugly bug(s) still seem to exist in the current version of the driver.

Comment 12 manuel wolfshant 2012-09-23 14:03:55 UTC

For the record: httpd was blocked simply because it could no longer access the network. The problem occurs even after httpd was relocated to another server. The network just dies (even arping no longer works) and service network restart must be issued.