Bug 743123 - tasks hung in log_wait_commit while running ftp tests [NEEDINFO]
Summary: tasks hung in log_wait_commit while running ftp tests
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.7
Hardware: x86_64
OS: Linux
unspecified
urgent
Target Milestone: rc
: ---
Assignee: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2011-10-03 21:58 UTC by sritej
Modified: 2014-06-02 13:01 UTC (History)
8 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2014-06-02 13:01:06 UTC
Target Upstream Version:
pm-rhel: needinfo? (sritej.velaga)


Attachments (Terms of Use)
Ftp test script (1022 bytes, text/plain)
2011-10-10 20:45 UTC, sritej
no flags Details
Ftp test script 2 (1.51 KB, text/plain)
2011-10-10 20:47 UTC, sritej
no flags Details
Ftp test script 3 (100 bytes, text/plain)
2011-10-10 20:48 UTC, sritej
no flags Details
dmesg log after running ftp test (12.18 KB, text/plain)
2011-10-10 20:49 UTC, sritej
no flags Details
dmesg after sysrq-t (437.89 KB, text/plain)
2011-10-10 20:50 UTC, sritej
no flags Details

Description sritej 2011-10-03 21:58:09 UTC
Description of problem:

kernel opps message "INFO: task ypbind:7390 blocked for more than 120 seconds" shown while running ftp tests on Qlogic 10GbE CNA.At first glance, it doesnt look like a driver bug. 

Version-Release number of selected component (if applicable):
Kernel Rhek 5.7 2.6.18-274.el5
driver : qlcnic v5.0.24


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

eth0: no IPv6 routers present
INFO: task ypbind:7390 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ypbind        D ffffffff801546d1     0  7390      1          7445  7389 (NOTLB)
 ffff81107cd7fa78 0000000000000082 ffff81087aa603d8 ffffffff88055b5b
 ffff8108716e9850 000000000000000a ffff81107e8ad040 ffff81107f874040
 000002a4236e36b0 000000000051d05a ffff81107e8ad228 0000001c8003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800c8f4f>] __generic_file_write_nolock+0x8f/0xa8
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8000db08>] permission+0x81/0xc8
 [<ffffffff800127e1>] may_open+0x1d5/0x233
 [<ffffffff80063ae9>] mutex_lock+0xd/0x1d
 [<ffffffff800c8fb0>] generic_file_writev+0x48/0xa3
 [<ffffffff8001834e>] do_sync_write+0x0/0x104
 [<ffffffff800e3367>] do_readv_writev+0x172/0x291
 [<ffffffff8001834e>] do_sync_write+0x0/0x104
 [<ffffffff800b9cc0>] audit_syscall_entry+0x1a8/0x1d3
 [<ffffffff800e3510>] sys_writev+0x45/0x93
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:11396 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0 11396      1         11401 10953 (NOTLB)
 ffff810717b5bad8 0000000000000082 ffff810ecc8663a8 ffffffff88055b5b
 ffff810717b5ba88 0000000000000007 ffff81087f507040 ffff81107fea5100
 00000299926679f3 00000000031ff912 ffff81087f507228 000000138003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8001a117>] current_kernel_time+0x14/0x38
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Comment 1 Eric Sandeen 2011-10-05 15:15:14 UTC
not technically an oops, but a hung task.

Can you do a sysrq-w (echo w > /proc/sysrq-trigger)

this will show us all the tasks which are hung.  Attaching the entire dmesg would be useful.

How reproducible is this?  What is the ftp test in use?

Thanks,
-Eric

Comment 2 Eric Sandeen 2011-10-05 15:23:37 UTC
Also, does this ever resolve, or does it stay stuck?

Comment 3 sritej 2011-10-07 00:06:59 UTC
Hi Eric,

The issue happens when we run an ftp test script that sets up for ex say 15 session with its peer and each instance try to get a series of files ranging from 1k to 500M. The above is set to iterate for 15 times. Call trace is seen in the 9th  iteration.

The call trace is seen pretty consistently and no, it doesn't resolve.After some time the server hangs.

The entire smesg looked like this(below).I then tried to check the status of process ID shown in dmesg by running ps auxww |grep 'process_ID' but the status didn't show up. 



INFO: task ftp:8848 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8848      1          8947  8853 (NOTLB)
 ffff8107ed477ad8 0000000000000086 ffff810877cf6c18 ffffffff88055b5b
 ffff8107ed477a88 0000000000000008 ffff81086394e7a0 ffff81107feb1040
 00000130b25ce708 0000000000007a7e ffff81086394e988 000000148003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff801546d1>] __next_cpu+0x19/0x28
 [<ffffffff80062a91>] __sched_text_start+0x6d1/0xbd0
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:8849 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8849      1          8928  8852 (NOTLB)
 ffff8107f0b1dad8 0000000000000082 ffff81107d955cf0 ffffffff88055b5b
 ffff8107f0b1da88 0000000000000007 ffff8108691cc100 ffff81107fd64080
 00000130b25d538a 00000000000113b1 ffff8108691cc2e8 0000000d8003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8000d795>] inotify_inode_queue_event+0xde/0xe8
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:8852 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8852      1          8849  9068 (NOTLB)
 ffff8107ed219ad8 0000000000000086 ffff81107d955918 ffffffff88055b5b
 ffff8107ed219a88 0000000000000009 ffff8107d97df7e0 ffff81107f815100
 00000130b2601547 0000000000001384 ffff8107d97df9c8 0000001b8003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8001a110>] current_kernel_time+0xd/0x38
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:8853 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8853      1          8848  8856 (NOTLB)
 ffff810fec729ad8 0000000000000082 ffff81107d955750 ffffffff88055b5b
 ffff810fec729a88 0000000000000007 ffff81107e211080 ffff81107fea5100
 00000130b25d355d 00000000009a0097 ffff81107e211268 000000138003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:8856 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8856      1          8853  8923 (NOTLB)
 ffff810feaccdad8 0000000000000082 ffff810863936e70 ffffffff88055b5b
 ffff810feaccda88 0000000000000008 ffff81107fa18080 ffff81107f88e0c0
 00000130b25e2c64 000000000000369b ffff81107fa18268 0000001e8003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff801546d1>] __next_cpu+0x19/0x28
 [<ffffffff80062a91>] __sched_text_start+0x6d1/0xbd0
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:8923 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8923      1          8856  8928 (NOTLB)
 ffff81085b621ad8 0000000000000082 ffff8107fce97db0 ffffffff88055b5b
 ffff81085b621a88 0000000000000007 ffff81087dcb60c0 ffff81107fc910c0
 00000130b25e8b61 00000000000043b0 ffff81087dcb62a8 0000000a8003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:8928 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8928      1          8923  8849 (NOTLB)
 ffff81085aac5ad8 0000000000000082 ffff810878b9e840 ffffffff88055b5b
 ffff81085aac5a88 0000000000000009 ffff81011dfe90c0 ffff81107ffa3040
 00000130b25dd46c 0000000000006036 ffff81011dfe92a8 000000188003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8001a129>] current_kernel_time+0x26/0x38
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:8942 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8942      1          9071  8947 (NOTLB)
 ffff8107a8f0fad8 0000000000000086 ffff810863936ee8 ffffffff88055b5b
 ffff8107a8f0fa88 0000000000000007 ffff8107d00e50c0 ffff81107fe480c0
 00000130b25e3d7e 00000000000038e9 ffff8107d00e52a8 000000128003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff8005c34d>] alloc_sock_iocb+0x2/0x57
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:8947 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8947      1          8942  8848 (NOTLB)
 ffff810868ab1ad8 0000000000000086 ffff81107fa6f2b8 ffffffff88055b5b
 ffff810868ab1a88 0000000000000007 ffff81087dcbe0c0 ffff81089c397080
 00000130b25d2867 000000000001b644 ffff81087dcbe2a8 000000058003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8001a123>] current_kernel_time+0x20/0x38
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:8962 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8962   8960                     (NOTLB)
 ffff8107a5d57ad8 0000000000000086 ffff81107d955690 ffffffff88055b5b
 ffff8107a5d57a88 0000000000000008 ffff81087f535080 ffff81107fc17100
 00000130b25f1673 000000000000560e ffff81087f535268 000000078003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80062ff2>] thread_return+0x62/0xfe
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

device eth0 left promiscuous mode
device eth0 entered promiscuous mode
device eth0 left promiscuous mode


I tried echo w > /proc/sysrq-trigger followed by dmesg but I didnt quite understand the dmesg log.It looked something like this

CPU13:
 ffff81011df03f48 0000000000000000 ffff81011defde20 ffffffff801b9a8d
 0000000000000000 0000000000000d00 ffffffff80454400 ffffffff801b9abc
 ffffffff801b9a8d ffffffff8002321a ffff81107fccc910 0000000000000318
Call Trace:
 <IRQ>  [<ffffffff801b9a8d>] showacpu+0x0/0x3b
 [<ffffffff801b9abc>] showacpu+0x2f/0x3b
 [<ffffffff801b9a8d>] showacpu+0x0/0x3b
 [<ffffffff8002321a>] smp_call_function_interrupt+0x57/0x75
 [<ffffffff8005dc22>] call_function_interrupt+0x66/0x6c
 <EOI>  [<ffffffff801a246a>] acpi_processor_idle_simple+0x1af/0x31c
 [<ffffffff801a2427>] acpi_processor_idle_simple+0x16c/0x31c
 [<ffffffff801a22bb>] acpi_processor_idle_simple+0x0/0x31c
 [<ffffffff80048fc5>] cpu_idle+0x95/0xb8
 [<ffffffff80078a9a>] start_secondary+0x479/0x488

It looked pretty much the same for all CPU's 0-31.

Thanks,
Sritej

Comment 4 Lukáš Czerner 2011-10-07 07:45:48 UTC
(In reply to comment #3)
> Hi Eric,
> 
> The issue happens when we run an ftp test script that sets up for ex say 15
> session with its peer and each instance try to get a series of files ranging
> from 1k to 500M. The above is set to iterate for 15 times. Call trace is seen
> in the 9th  iteration.

Hi,

I just want to be sure that I understand the test correctly. The files the ftp server provides are on the ext3 file system and peers are trying to get this series of files. So from the file system point of view it is just reading. Do I understand it correctly ?

Where we can find this test script so we can try to reproduce it on our own ?

Comment 5 Lukáš Czerner 2011-10-07 07:52:12 UTC
> I tried echo w > /proc/sysrq-trigger followed by dmesg but I didnt quite
> understand the dmesg log.It looked something like this
> 
> CPU13:
>  ffff81011df03f48 0000000000000000 ffff81011defde20 ffffffff801b9a8d
>  0000000000000000 0000000000000d00 ffffffff80454400 ffffffff801b9abc
>  ffffffff801b9a8d ffffffff8002321a ffff81107fccc910 0000000000000318
> Call Trace:
>  <IRQ>  [<ffffffff801b9a8d>] showacpu+0x0/0x3b
>  [<ffffffff801b9abc>] showacpu+0x2f/0x3b
>  [<ffffffff801b9a8d>] showacpu+0x0/0x3b
>  [<ffffffff8002321a>] smp_call_function_interrupt+0x57/0x75
>  [<ffffffff8005dc22>] call_function_interrupt+0x66/0x6c
>  <EOI>  [<ffffffff801a246a>] acpi_processor_idle_simple+0x1af/0x31c
>  [<ffffffff801a2427>] acpi_processor_idle_simple+0x16c/0x31c
>  [<ffffffff801a22bb>] acpi_processor_idle_simple+0x0/0x31c
>  [<ffffffff80048fc5>] cpu_idle+0x95/0xb8
>  [<ffffffff80078a9a>] start_secondary+0x479/0x488
> 
> It looked pretty much the same for all CPU's 0-31.

This seems really suspicious, because the output should look more like this:

[ 3908.597685] SysRq : Show Blocked State
[ 3908.597701]   task                        PC stack   pid father
[ 3908.597824] Sched Debug Version: v0.09, 2.6.35.14-96.fc14.x86_64 #1
[ 3908.597832] now at 3908597.824865 msecs
[ 3908.597837]   .jiffies                                 : 4298575893
[ 3908.597843]   .sysctl_sched_latency                    : 12.000000
[ 3908.597849]   .sysctl_sched_min_granularity            : 4.000000
[ 3908.597855]   .sysctl_sched_wakeup_granularity         : 2.000000
[ 3908.597860]   .sysctl_sched_child_runs_first           : 0.000000
[ 3908.597866]   .sysctl_sched_features                   : 31855
[ 3908.597872]   .sysctl_sched_tunable_scaling            : 1 (logaritmic)
[ 3908.597879] 
[ 3908.597881] cpu#0, 2793.121 MHz
[ 3908.597886]   .nr_running                    : 0
[ 3908.597890]   .load                          : 0
[ 3908.597895]   .nr_switches                   : 3515579
[ 3908.597899]   .nr_load_updates               : 264448
[ 3908.597904]   .nr_uninterruptible            : 0
[ 3908.597909]   .next_balance                  : 4298.575957
[ 3908.597914]   .curr->pid                     : 0
[ 3908.597919]   .clock                         : 3908597.305661
[ 3908.597924]   .cpu_load[0]                   : 1024
[ 3908.597929]   .cpu_load[1]                   : 514
[ 3908.597933]   .cpu_load[2]                   : 281
[ 3908.597937]   .cpu_load[3]                   : 171
[ 3908.597941]   .cpu_load[4]                   : 106
[ 3908.597946]   .yld_count                     : 245304
[ 3908.597950]   .sched_switch                  : 0
[ 3908.597955]   .sched_count                   : 3761073
[ 3908.597960]   .sched_goidle                  : 1619379
[ 3908.597965]   .avg_idle                      : 884585
[ 3908.597969]   .ttwu_count                    : 1809834
[ 3908.597974]   .ttwu_local                    : 492334
[ 3908.597979]   .bkl_count                     : 5052
...
...

Have you seen this problem with the ftp test on a different box ? Could you try it ?

Thanks!
-Lukas

Comment 6 Eric Sandeen 2011-10-07 17:06:33 UTC
Ok, I guess sysrq-w doesn't work like I thought on rhel5, sorry about that.  Can you try sysrq-t when it's stuck?

It'll be big, so adding it as an attachment would be best.

Thanks,
-Eric

Comment 7 sritej 2011-10-10 20:45:44 UTC
Created attachment 527323 [details]
Ftp test script

Comment 8 sritej 2011-10-10 20:47:42 UTC
Created attachment 527324 [details]
Ftp test script 2

Comment 9 sritej 2011-10-10 20:48:26 UTC
Created attachment 527325 [details]
Ftp test script 3

Comment 10 sritej 2011-10-10 20:49:37 UTC
Created attachment 527326 [details]
dmesg log after running ftp test

Comment 11 sritej 2011-10-10 20:50:44 UTC
Created attachment 527327 [details]
dmesg after sysrq-t

Comment 12 sritej 2011-10-10 20:56:55 UTC
@ Lukas,

The server host the files on an ext3 file system and the peer connects to the server through ftp to get the files.I have attached the test scripts.

Yes our test engineer reported seeing the issue on different servers running RHEL5.6,5.7 & 6 but not on 5.5  


@Eric,

I have attached the dmesg logs.

Thanks,
Sritej

Comment 13 Eric Sandeen 2011-10-10 22:09:40 UTC
So, we have kjournald trying to commit, and waiting for a request (edited stack a bit):

Call Trace:
 [<ffffffff800637ce>] io_schedule+0x3f/0x67
 [<ffffffff80028eb5>] get_request_wait+0xe7/0x130
 [<ffffffff8000c228>] __make_request+0x40a/0x4ce
 [<ffffffff8001c452>] generic_make_request+0x211/0x228
 [<ffffffff882da415>] :dm_mod:__map_bio+0x4a/0x123
 [<ffffffff882daf57>] :dm_mod:__split_bio+0x17d/0x3b7
 [<ffffffff882db99c>] :dm_mod:dm_request+0x115/0x124
 [<ffffffff8001c452>] generic_make_request+0x211/0x228
 [<ffffffff8003318a>] submit_bio+0xe6/0xed
 [<ffffffff8001abc1>] submit_bh+0xf4/0x114
 [<ffffffff88033453>] :jbd:journal_do_submit_data+0x2e/0x37
 [<ffffffff88033921>] :jbd:journal_commit_transaction+0x433/0x10aa

syslogd and all ftp processes are in log_wait_commit ("Wait for a specified commit to complete.")  So basically waiting for kjournald.

The rm processes are in vfs_unlink, waiting on an i_mutex, probably held by the ftp processes, just guessing.

And pdflush is in writeback_inodes, in blk_congestion_wait.

so kjournald & pdflush seem to be the interesting bits here.

Are you quite sure that the box is completely hung and no progress is being made?  Once it gets to this point, does iostat for example show any IO happening at all?

Comment 14 sritej 2011-10-10 23:30:27 UTC
Eric,

After the call trace, I could still see activity through iostat and tcpdump but after a point the server freezes.I will try to get the call trace when that happens.

Sritej

Comment 15 sritej 2011-10-13 00:12:12 UTC
Eric,

I tried to get the call trace when it hanged through netconsole but unfortunately I couldn't capture anything.

Please let me know if you want me to try anything else.

Sritej

Comment 16 Eric Sandeen 2011-10-13 14:35:00 UTC
Well, if the machine is truly deadlocked I think we will need some way to see what that deadlock is.  A crashdump would work, too.

Comment 17 Marvell Linux NIC Driver 2011-10-13 15:02:10 UTC
Eric, Have you tried to re-produce the issue on your end? I think Sritej has provided you the instructions to re-produce the issue. It is happening with multiple adapters on certain RHEL releases and fairly easy to re-produce.

Comment 18 Eric Sandeen 2011-10-13 15:24:52 UTC
I have not yet.  With the number of bugs I'm handling, I usually try to gather as much info as possible from the reporter up front, when I can.  :)  We can try to repro here, too, though.

BTW is there a support ticket open on this?

Comment 19 Marvell Linux NIC Driver 2011-10-14 01:54:30 UTC
Eric, Server OEM is treating this as a blocker issue for a release. We have raised the severity of this issue.

Let us know what we should do to open a support ticket.

Comment 20 Eric Sandeen 2011-10-17 16:08:40 UTC
You can log a support ticket by logging in at support.redhat.com (see the "submit a support case" link) or you can call the support number at 888-GO-REDHAT.  Other contact info:

https://access.redhat.com/support/contact/technicalSupport.html

You might let them know that this bug has been filed.

Thanks,
-Eric

Comment 21 Eric Sandeen 2011-10-17 18:31:13 UTC
Chad Dupuis is going to see if he can help this along.  I'd really like to try to get a crashdump from your system when it's in the hung state.

Thanks,
-Eric

Comment 22 sritej 2011-11-02 19:45:08 UTC
Hi Eric,

From the test reports that I have received and from my own testing, I am convinced that the issue cannot be reproduced consistently.

This is what our test engineer reported:

Case 1:- ./ftp.sh <host IP> 15 15 (Test pass successfully without any trace seen in dmesg.)

Case 2:- ./ftp.sh <host IP> 25 25  (Test pass successfully but kernel  trace seen in dmesg  and system doesn't hang)

Case 3:- ./ftp.sh <host IP> 35 35 (Test pass successfully but  kernel trace seen in dmesg and system hanged)

In my tests after a call trace was seen, the IP address of the site LAN was withdrawn but the server didn't hang. I could still ssh onto the server through the qlogic interface.
As I didn't see any kernel hang, I manually forced a panic after the IP was withdrawn and collected a kdump. Do you want me to upload this kdump file?

Thanks,
Sritej

Comment 23 Eric Sandeen 2011-11-02 20:27:51 UTC
I think that a kdump will only be useful if it is taken while the server is deadlocked, I'm afraid.

Comment 24 Eric Sandeen 2011-11-02 23:03:10 UTC
Just to be clear, the server & client are different machines when you test?  i.e. you are not testing with an ftp server on localhost?

(And it's the client - the retrieving node - which is experiencing the problem, right?)

Comment 25 sritej 2011-11-03 01:22:14 UTC
Thats right,two different machines and its the retrieving node.

Comment 26 sritej 2011-11-16 22:53:18 UTC
Hi Eric,

When I force a panic on the client (without any traffic running) I able to collect a vmcore file locally and can run crash on it.
However, when I force a panic (with ftp traffic running) when the issue is seen,
the vmcore file occupies the entire disk space and is incomplete. 
Not sure why this is happening.


Sritej

Comment 27 sritej 2011-11-16 22:54:14 UTC
Hi Eric,

When I force a panic on the client (without any traffic running) I am able to collect a vmcore file locally and can run crash on it.
However, when I force a panic (with ftp traffic running) when the issue is seen,
the vmcore file occupies the entire disk space and is incomplete. 
Not sure why this is happening.


Sritej

Comment 28 Eric Sandeen 2011-11-18 18:23:25 UTC
Chad, can you help Sritej figure out the crashdump problem?

Comment 29 Marvell Linux NIC Driver 2011-11-19 00:00:23 UTC
Eric,

We are able to collect kdump sucessfully with and without running traffic. However, when we run the ftp traffic and run into the failure described in the Bug, vmcore file occupies the entire disk space and is incomplete

We can provide you remote access to the system (through WebEx) for live debugging. Would that work for you?

Comment 30 RHEL Program Management 2014-03-07 13:32:23 UTC
This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.

Comment 31 RHEL Program Management 2014-06-02 13:01:06 UTC
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).


Note You need to log in before you can comment on or make changes to this bug.