Description of problem:
Kernel oops message "INFO: task ypbind:7390 blocked for more than 120 seconds" shown while running ftp tests on a QLogic 10GbE CNA. At first glance, it doesn't look like a driver bug.

Version-Release number of selected component (if applicable):
Kernel: RHEL 5.7, 2.6.18-274.el5
Driver: qlcnic v5.0.24

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:

eth0: no IPv6 routers present
INFO: task ypbind:7390 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ypbind        D ffffffff801546d1     0  7390      1          7445  7389 (NOTLB)
 ffff81107cd7fa78 0000000000000082 ffff81087aa603d8 ffffffff88055b5b
 ffff8108716e9850 000000000000000a ffff81107e8ad040 ffff81107f874040
 000002a4236e36b0 000000000051d05a ffff81107e8ad228 0000001c8003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800c8f4f>] __generic_file_write_nolock+0x8f/0xa8
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8000db08>] permission+0x81/0xc8
 [<ffffffff800127e1>] may_open+0x1d5/0x233
 [<ffffffff80063ae9>] mutex_lock+0xd/0x1d
 [<ffffffff800c8fb0>] generic_file_writev+0x48/0xa3
 [<ffffffff8001834e>] do_sync_write+0x0/0x104
 [<ffffffff800e3367>] do_readv_writev+0x172/0x291
 [<ffffffff8001834e>] do_sync_write+0x0/0x104
 [<ffffffff800b9cc0>] audit_syscall_entry+0x1a8/0x1d3
 [<ffffffff800e3510>] sys_writev+0x45/0x93
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:11396 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0 11396      1         11401 10953 (NOTLB)
 ffff810717b5bad8 0000000000000082 ffff810ecc8663a8 ffffffff88055b5b
 ffff810717b5ba88 0000000000000007 ffff81087f507040 ffff81107fea5100
 00000299926679f3 00000000031ff912 ffff81087f507228 000000138003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8001a117>] current_kernel_time+0x14/0x38
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0
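For reference, the watchdog that emits these messages is tunable via the path the message itself mentions; a minimal sketch, assuming the standard RHEL 5 /proc and sysctl layout:

cat /proc/sys/kernel/hung_task_timeout_secs     # current threshold, 120s by default

# Silencing the warning only hides the symptom, it does not fix the stall:
echo 0 > /proc/sys/kernel/hung_task_timeout_secs

# Persist across reboots if desired:
echo "kernel.hung_task_timeout_secs = 0" >> /etc/sysctl.conf
sysctl -p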
Not technically an oops, but a hung task. Can you do a sysrq-w (echo w > /proc/sysrq-trigger)? This will show us all the tasks which are hung. Attaching the entire dmesg would be useful. How reproducible is this? What is the ftp test in use? Thanks, -Eric
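A minimal sketch of capturing that output, assuming magic SysRq needs to be enabled first:

echo 1 > /proc/sys/kernel/sysrq      # enable magic SysRq if it isn't already
echo w > /proc/sysrq-trigger         # dump blocked (uninterruptible) tasks
dmesg > sysrq-w.txt                  # the backtraces land in the kernel ring buffer / syslog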
Also, does this ever resolve, or does it stay stuck?
Hi Eric,

The issue happens when we run an ftp test script that sets up, say, 15 sessions with its peer, and each instance tries to get a series of files ranging from 1k to 500M. The script is set to iterate 15 times; the call trace is seen in the 9th iteration. The call trace is seen pretty consistently and no, it doesn't resolve. After some time the server hangs. The entire dmesg looked like this (below). I then tried to check the status of the process IDs shown in dmesg by running ps auxww | grep '<process_ID>', but nothing showed up.

INFO: task ftp:8848 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8848      1          8947  8853 (NOTLB)
 ffff8107ed477ad8 0000000000000086 ffff810877cf6c18 ffffffff88055b5b
 ffff8107ed477a88 0000000000000008 ffff81086394e7a0 ffff81107feb1040
 00000130b25ce708 0000000000007a7e ffff81086394e988 000000148003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff801546d1>] __next_cpu+0x19/0x28
 [<ffffffff80062a91>] __sched_text_start+0x6d1/0xbd0
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:8849 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8849      1          8928  8852 (NOTLB)
 ffff8107f0b1dad8 0000000000000082 ffff81107d955cf0 ffffffff88055b5b
 ffff8107f0b1da88 0000000000000007 ffff8108691cc100 ffff81107fd64080
 00000130b25d538a 00000000000113b1 ffff8108691cc2e8 0000000d8003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8000d795>] inotify_inode_queue_event+0xde/0xe8
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:8852 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8852      1          8849  9068 (NOTLB)
 ffff8107ed219ad8 0000000000000086 ffff81107d955918 ffffffff88055b5b
 ffff8107ed219a88 0000000000000009 ffff8107d97df7e0 ffff81107f815100
 00000130b2601547 0000000000001384 ffff8107d97df9c8 0000001b8003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8001a110>] current_kernel_time+0xd/0x38
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:8853 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8853      1          8848  8856 (NOTLB)
 ffff810fec729ad8 0000000000000082 ffff81107d955750 ffffffff88055b5b
 ffff810fec729a88 0000000000000007 ffff81107e211080 ffff81107fea5100
 00000130b25d355d 00000000009a0097 ffff81107e211268 000000138003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:8856 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8856      1          8853  8923 (NOTLB)
 ffff810feaccdad8 0000000000000082 ffff810863936e70 ffffffff88055b5b
 ffff810feaccda88 0000000000000008 ffff81107fa18080 ffff81107f88e0c0
 00000130b25e2c64 000000000000369b ffff81107fa18268 0000001e8003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff801546d1>] __next_cpu+0x19/0x28
 [<ffffffff80062a91>] __sched_text_start+0x6d1/0xbd0
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:8923 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8923      1          8856  8928 (NOTLB)
 ffff81085b621ad8 0000000000000082 ffff8107fce97db0 ffffffff88055b5b
 ffff81085b621a88 0000000000000007 ffff81087dcb60c0 ffff81107fc910c0
 00000130b25e8b61 00000000000043b0 ffff81087dcb62a8 0000000a8003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:8928 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8928      1          8923  8849 (NOTLB)
 ffff81085aac5ad8 0000000000000082 ffff810878b9e840 ffffffff88055b5b
 ffff81085aac5a88 0000000000000009 ffff81011dfe90c0 ffff81107ffa3040
 00000130b25dd46c 0000000000006036 ffff81011dfe92a8 000000188003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8001a129>] current_kernel_time+0x26/0x38
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:8942 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8942      1          9071  8947 (NOTLB)
 ffff8107a8f0fad8 0000000000000086 ffff810863936ee8 ffffffff88055b5b
 ffff8107a8f0fa88 0000000000000007 ffff8107d00e50c0 ffff81107fe480c0
 00000130b25e3d7e 00000000000038e9 ffff8107d00e52a8 000000128003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff8005c34d>] alloc_sock_iocb+0x2/0x57
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:8947 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8947      1          8942  8848 (NOTLB)
 ffff810868ab1ad8 0000000000000086 ffff81107fa6f2b8 ffffffff88055b5b
 ffff810868ab1a88 0000000000000007 ffff81087dcbe0c0 ffff81089c397080
 00000130b25d2867 000000000001b644 ffff81087dcbe2a8 000000058003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff8001a123>] current_kernel_time+0x20/0x38
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

INFO: task ftp:8962 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ftp           D ffffffff801546d1     0  8962           8960       (NOTLB)
 ffff8107a5d57ad8 0000000000000086 ffff81107d955690 ffffffff88055b5b
 ffff8107a5d57a88 0000000000000008 ffff81087f535080 ffff81107fc17100
 00000130b25f1673 000000000000560e ffff81087f535268 000000078003f876
Call Trace:
 [<ffffffff88055b5b>] :ext3:__ext3_journal_stop+0x1f/0x3d
 [<ffffffff88036e14>] :jbd:log_wait_commit+0xa3/0xf5
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff88037844>] :jbd:journal_force_commit_nested+0x63/0x6c
 [<ffffffff880504f7>] :ext3:ext3_write_begin+0x1b3/0x1cd
 [<ffffffff8000feb1>] generic_file_buffered_write+0x14b/0x675
 [<ffffffff8001678a>] __generic_file_aio_write_nolock+0x369/0x3b6
 [<ffffffff800219bf>] generic_file_aio_write+0x67/0xc3
 [<ffffffff8804c1c2>] :ext3:ext3_file_write+0x16/0x91
 [<ffffffff80018415>] do_sync_write+0xc7/0x104
 [<ffffffff800a2e5d>] autoremove_wake_function+0x0/0x2e
 [<ffffffff80062ff2>] thread_return+0x62/0xfe
 [<ffffffff80016b92>] vfs_write+0xce/0x174
 [<ffffffff8001745b>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0

device eth0 left promiscuous mode
device eth0 entered promiscuous mode
device eth0 left promiscuous mode

I tried echo w > /proc/sysrq-trigger followed by dmesg, but I didn't quite understand the dmesg log. It looked something like this:

CPU13:
 ffff81011df03f48 0000000000000000 ffff81011defde20 ffffffff801b9a8d
 0000000000000000 0000000000000d00 ffffffff80454400 ffffffff801b9abc
 ffffffff801b9a8d ffffffff8002321a ffff81107fccc910 0000000000000318
Call Trace:
 <IRQ>  [<ffffffff801b9a8d>] showacpu+0x0/0x3b
 [<ffffffff801b9abc>] showacpu+0x2f/0x3b
 [<ffffffff801b9a8d>] showacpu+0x0/0x3b
 [<ffffffff8002321a>] smp_call_function_interrupt+0x57/0x75
 [<ffffffff8005dc22>] call_function_interrupt+0x66/0x6c
 <EOI>  [<ffffffff801a246a>] acpi_processor_idle_simple+0x1af/0x31c
 [<ffffffff801a2427>] acpi_processor_idle_simple+0x16c/0x31c
 [<ffffffff801a22bb>] acpi_processor_idle_simple+0x0/0x31c
 [<ffffffff80048fc5>] cpu_idle+0x95/0xb8
 [<ffffffff80078a9a>] start_secondary+0x479/0x488

It looked pretty much the same for all CPUs 0-31.

Thanks,
Sritej
(In reply to comment #3)
> Hi Eric,
>
> The issue happens when we run an ftp test script that sets up, say, 15
> sessions with its peer, and each instance tries to get a series of files
> ranging from 1k to 500M. The script is set to iterate 15 times; the call
> trace is seen in the 9th iteration.

Hi,

I just want to be sure that I understand the test correctly. The files the ftp server provides are on the ext3 file system, and the peers are trying to get this series of files, so from the file system point of view it is just reading. Do I understand it correctly?

Where can we find this test script so we can try to reproduce it on our own?
> I tried echo w > /proc/sysrq-trigger followed by dmesg, but I didn't quite
> understand the dmesg log. It looked something like this:
>
> CPU13:
>  ffff81011df03f48 0000000000000000 ffff81011defde20 ffffffff801b9a8d
>  0000000000000000 0000000000000d00 ffffffff80454400 ffffffff801b9abc
>  ffffffff801b9a8d ffffffff8002321a ffff81107fccc910 0000000000000318
> Call Trace:
>  <IRQ>  [<ffffffff801b9a8d>] showacpu+0x0/0x3b
>  [<ffffffff801b9abc>] showacpu+0x2f/0x3b
>  [<ffffffff801b9a8d>] showacpu+0x0/0x3b
>  [<ffffffff8002321a>] smp_call_function_interrupt+0x57/0x75
>  [<ffffffff8005dc22>] call_function_interrupt+0x66/0x6c
>  <EOI>  [<ffffffff801a246a>] acpi_processor_idle_simple+0x1af/0x31c
>  [<ffffffff801a2427>] acpi_processor_idle_simple+0x16c/0x31c
>  [<ffffffff801a22bb>] acpi_processor_idle_simple+0x0/0x31c
>  [<ffffffff80048fc5>] cpu_idle+0x95/0xb8
>  [<ffffffff80078a9a>] start_secondary+0x479/0x488
>
> It looked pretty much the same for all CPUs 0-31.

This seems really suspicious, because the output should look more like this:

[ 3908.597685] SysRq : Show Blocked State
[ 3908.597701]   task                        PC stack   pid father
[ 3908.597824] Sched Debug Version: v0.09, 2.6.35.14-96.fc14.x86_64 #1
[ 3908.597832] now at 3908597.824865 msecs
[ 3908.597837]   .jiffies                                 : 4298575893
[ 3908.597843]   .sysctl_sched_latency                    : 12.000000
[ 3908.597849]   .sysctl_sched_min_granularity            : 4.000000
[ 3908.597855]   .sysctl_sched_wakeup_granularity         : 2.000000
[ 3908.597860]   .sysctl_sched_child_runs_first           : 0.000000
[ 3908.597866]   .sysctl_sched_features                   : 31855
[ 3908.597872]   .sysctl_sched_tunable_scaling            : 1 (logaritmic)
[ 3908.597879]
[ 3908.597881] cpu#0, 2793.121 MHz
[ 3908.597886]   .nr_running                    : 0
[ 3908.597890]   .load                          : 0
[ 3908.597895]   .nr_switches                   : 3515579
[ 3908.597899]   .nr_load_updates               : 264448
[ 3908.597904]   .nr_uninterruptible            : 0
[ 3908.597909]   .next_balance                  : 4298.575957
[ 3908.597914]   .curr->pid                     : 0
[ 3908.597919]   .clock                         : 3908597.305661
[ 3908.597924]   .cpu_load[0]                   : 1024
[ 3908.597929]   .cpu_load[1]                   : 514
[ 3908.597933]   .cpu_load[2]                   : 281
[ 3908.597937]   .cpu_load[3]                   : 171
[ 3908.597941]   .cpu_load[4]                   : 106
[ 3908.597946]   .yld_count                     : 245304
[ 3908.597950]   .sched_switch                  : 0
[ 3908.597955]   .sched_count                   : 3761073
[ 3908.597960]   .sched_goidle                  : 1619379
[ 3908.597965]   .avg_idle                      : 884585
[ 3908.597969]   .ttwu_count                    : 1809834
[ 3908.597974]   .ttwu_local                    : 492334
[ 3908.597979]   .bkl_count                     : 5052
...
...

Have you seen this problem with the ftp test on a different box? Could you try it?

Thanks!
-Lukas
Ok, I guess sysrq-w doesn't work like I thought on rhel5, sorry about that. Can you try sysrq-t when it's stuck? It'll be big, so adding it as an attachment would be best. Thanks, -Eric
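A sketch of one way to capture the sysrq-t output for attaching; since the output is large, the kernel ring buffer can wrap, so pulling it from syslog is often the safer source:

echo 1 > /proc/sys/kernel/sysrq
echo t > /proc/sysrq-trigger                 # dumps a stack trace for every task
dmesg > sysrq-t.txt                          # may be truncated if the ring buffer wrapped
cp /var/log/messages sysrq-t-messages.txt    # klogd usually captures the full output here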
Created attachment 527323 [details] Ftp test script
Created attachment 527324 [details] Ftp test script 2
Created attachment 527325 [details] Ftp test script 3
Created attachment 527326 [details] dmesg log after running ftp test
Created attachment 527327 [details] dmesg after sysrq-t
@Lukas,
The server hosts the files on an ext3 file system, and the peer connects to the server through ftp to get the files. I have attached the test scripts. Yes, our test engineer reported seeing the issue on different servers running RHEL 5.6, 5.7 and 6, but not on 5.5.

@Eric,
I have attached the dmesg logs.

Thanks,
Sritej
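The attached scripts are the authoritative reproducer; purely as a hypothetical illustration of the kind of load described (N parallel ftp sessions, each pulling a series of 1k-500M files, repeated M times, with every retrieved file written to the client's local ext3 disk), the shape is roughly:

#!/bin/sh
# Hypothetical sketch only -- file names, paths and the use of wget are made up;
# the real test uses the stock ftp client (the hung tasks above are named "ftp").
HOST=$1
SESSIONS=${2:-15}
ITERATIONS=${3:-15}

for i in $(seq 1 "$ITERATIONS"); do
    for s in $(seq 1 "$SESSIONS"); do
        (
            for f in file_1k file_1M file_100M file_500M; do
                # each retrieved file lands on the local ext3 filesystem
                wget -q "ftp://anonymous:test@$HOST/$f" -O "/data/session$s.$f"
            done
        ) &
    done
    wait    # let all sessions of this iteration finish before starting the next
done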
So, we have kjournald trying to commit, and waiting for a request (edited stack a bit):

Call Trace:
 [<ffffffff800637ce>] io_schedule+0x3f/0x67
 [<ffffffff80028eb5>] get_request_wait+0xe7/0x130
 [<ffffffff8000c228>] __make_request+0x40a/0x4ce
 [<ffffffff8001c452>] generic_make_request+0x211/0x228
 [<ffffffff882da415>] :dm_mod:__map_bio+0x4a/0x123
 [<ffffffff882daf57>] :dm_mod:__split_bio+0x17d/0x3b7
 [<ffffffff882db99c>] :dm_mod:dm_request+0x115/0x124
 [<ffffffff8001c452>] generic_make_request+0x211/0x228
 [<ffffffff8003318a>] submit_bio+0xe6/0xed
 [<ffffffff8001abc1>] submit_bh+0xf4/0x114
 [<ffffffff88033453>] :jbd:journal_do_submit_data+0x2e/0x37
 [<ffffffff88033921>] :jbd:journal_commit_transaction+0x433/0x10aa

syslogd and all ftp processes are in log_wait_commit ("Wait for a specified commit to complete."), so basically waiting for kjournald.

The rm processes are in vfs_unlink, waiting on an i_mutex, probably held by the ftp processes, just guessing. And pdflush is in writeback_inodes, in blk_congestion_wait.

So kjournald & pdflush seem to be the interesting bits here.

Are you quite sure that the box is completely hung and no progress is being made? Once it gets to this point, does iostat for example show any IO happening at all?
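For checking whether any I/O is still completing once the traces appear, something along these lines should do (iostat is in the sysstat package; device names are just examples):

iostat -x 5              # watch w/s, wsec/s and await for the sd*/dm-* device backing the ext3 fs
vmstat 5                 # 'b' column = tasks blocked on I/O, 'bo' = blocks written out per interval
awk '{print $3, $10}' /proc/diskstats    # device name and sectors written; run twice and compare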
Eric,

After the call trace, I could still see activity through iostat and tcpdump, but after a point the server freezes. I will try to get the call trace when that happens.

Sritej
Eric,

I tried to get the call trace over netconsole when it hung, but unfortunately I couldn't capture anything. Please let me know if you want me to try anything else.

Sritej
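For what it's worth, a netconsole setup usually looks roughly like this (all addresses, interfaces and the MAC below are placeholders; ideally the logging goes out a NIC other than the one under test):

# On the client that hangs (sender):
modprobe netconsole netconsole=@192.168.10.5/eth2,6666@192.168.10.9/00:11:22:33:44:55

# On the capture box (receiver) -- any UDP listener works; exact nc flags vary by netcat flavor:
nc -u -l 6666 | tee netconsole.log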
Well, if the machine is truly deadlocked I think we will need some way to see what that deadlock is. A crashdump would work, too.
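A rough sketch of how kdump is typically set up on RHEL 5 (the crashkernel reservation size is the usual default; adjust for the box):

yum install kexec-tools                      # provides kdump on RHEL 5

# Reserve memory for the capture kernel: append e.g. crashkernel=128M@16M
# to the kernel line in /boot/grub/grub.conf, then reboot.

chkconfig kdump on
service kdump start

# Once the box is wedged, force the dump (needs console or netconsole access):
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger                 # panics the box; vmcore lands under /var/crash/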
Eric,

Have you tried to reproduce the issue on your end? I think Sritej has provided you the instructions to reproduce it. It is happening with multiple adapters on certain RHEL releases and is fairly easy to reproduce.
I have not yet. With the number of bugs I'm handling, I usually try to gather as much info as possible from the reporter up front, when I can. :) We can try to repro here, too, though. BTW is there a support ticket open on this?
Eric,

The server OEM is treating this as a blocker issue for a release, and we have raised the severity of this issue. Let us know what we should do to open a support ticket.
You can log a support ticket by logging in at support.redhat.com (see the "submit a support case" link) or you can call the support number at 888-GO-REDHAT. Other contact info: https://access.redhat.com/support/contact/technicalSupport.html You might let them know that this bug has been filed. Thanks, -Eric
Chad Dupuis is going to see if he can help this along. I'd really like to try to get a crashdump from your system when it's in the hung state. Thanks, -Eric
Hi Eric,

From the test reports that I have received and from my own testing, I am convinced that the issue cannot be reproduced consistently. This is what our test engineer reported:

Case 1: ./ftp.sh <host IP> 15 15 (test passes successfully without any trace seen in dmesg)
Case 2: ./ftp.sh <host IP> 25 25 (test passes successfully, but a kernel trace is seen in dmesg and the system doesn't hang)
Case 3: ./ftp.sh <host IP> 35 35 (test passes successfully, but a kernel trace is seen in dmesg and the system hangs)

In my tests, after a call trace was seen, the IP address of the site LAN was withdrawn but the server didn't hang. I could still ssh onto the server through the QLogic interface. As I didn't see any kernel hang, I manually forced a panic after the IP was withdrawn and collected a kdump. Do you want me to upload this kdump file?

Thanks,
Sritej
I think that a kdump will only be useful if it is taken while the server is deadlocked, I'm afraid.
Just to be clear, the server & client are different machines when you test? i.e. you are not testing with an ftp server on localhost? (And it's the client - the retrieving node - which is experiencing the problem, right?)
That's right, two different machines, and it's the retrieving node.
Hi Eric,

When I force a panic on the client (without any traffic running), I am able to collect a vmcore file locally and can run crash on it. However, when I force a panic while the ftp traffic is running and the issue is seen, the vmcore file occupies the entire disk space and is incomplete. I'm not sure why this is happening.

Sritej
Chad, can you help Sritej figure out the crashdump problem?
Eric,

We are able to collect a kdump successfully with and without running traffic. However, when we run the ftp traffic and hit the failure described in this bug, the vmcore file occupies the entire disk space and is incomplete.

We can provide you remote access to the system (through WebEx) for live debugging. Would that work for you?
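One likely reason the vmcore fills the disk is that it is being written unfiltered; a sketch of the usual /etc/kdump.conf filtering, assuming the default local-disk dump target:

# /etc/kdump.conf
core_collector makedumpfile -c -d 31
#   -c    compress each page
#   -d 31 skip zero, cache, cache-private, user and free pages -- this usually
#         shrinks the core to a small fraction of RAM
# Optionally dump to another machine instead of the local disk, e.g. over NFS:
# net 192.168.10.9:/var/crash               (placeholder export)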
This bug/component is not included in scope for RHEL-5.11.0 which is the last RHEL5 minor release. This Bugzilla will soon be CLOSED as WONTFIX (at the end of RHEL5.11 development phase (Apr 22, 2014)). Please contact your account manager or support representative in case you need to escalate this bug.
Thank you for submitting this request for inclusion in Red Hat Enterprise Linux 5. We've carefully evaluated the request, but are unable to include it in RHEL5 stream. If the issue is critical for your business, please provide additional business justification through the appropriate support channels (https://access.redhat.com/site/support).