Description of problem:
The dio_sparse test from the LTP test suite fails on NFSv2+[tcp|udp] when scheduled in Beaker. The automated test always times out on the LOCAL WATCHDOG and triggers a kernel call trace. NFSv[3|4]+[tcp|udp] pass, and when I run the test on NFSv2+[tcp|udp] manually on reserved machines, dio_sparse passes and no kernel call trace appears on the console. So I first want to confirm whether this is a Beaker problem; if not, I will consider whether it is a kernel NFSv2 issue. Please help me check the failure, thanks!

The call trace logs are at:
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/11/1535/153540/320477/3516668/messages
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/11/1535/153541/320479/3516680/messages

The failed jobs are at:
x86_64: https://beaker.engineering.redhat.com/jobs/153538
i386: https://beaker.engineering.redhat.com/jobs/153540
ppc64: https://beaker.engineering.redhat.com/jobs/153541
s390x: https://beaker.engineering.redhat.com/jobs/153543

Actual results:
The dio_sparse test from LTP fails on NFSv2+[tcp|udp].

Expected results:
The test passes on NFSv2+[tcp|udp] like the other combinations.

Additional info:
Created attachment 533669 [details]
Log from a passing manual run
Attached are the logs from the run where I reserved two machines, ran the test manually, and it passed.
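For reference, a rough sketch of the manual setup I used (the hostname, paths, and the dio_sparse invocation below are illustrative, not the exact wrapper commands the Beaker task runs):

# On the reserved NFS server: export a scratch directory.
mkdir -p /export/nfstest
echo '/export/nfstest *(rw,no_root_squash,sync)' >> /etc/exports
exportfs -ra

# On the reserved client: mount it as NFSv2 over tcp (or proto=udp).
mkdir -p /mnt/testarea/nfs
mount -t nfs -o vers=2,proto=tcp server.example.com:/export/nfstest /mnt/testarea/nfs

# Run the aiodio case against the mounted directory; the exact dio_sparse
# arguments depend on the LTP version and the task's wrapper script.
cd /mnt/testarea/nfs
/opt/ltp/testcases/bin/dio_sparse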
any update?
re-assigning to Marian for his input.
To me it looks like NFSv2 has rather poor performance compared to v3 and v4, and the test does not finish in the expected time - what takes 3+ hours on NFSv4 takes 6+ hours on NFSv2. How long did the test run when you ran it manually?

Try running with the KILLTIMEOVERRIDE parameter set to a larger number (50000 - that's 16+ hours).

The External Watchdog may be the result of the test leaving the system in an inconsistent state.
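For example, something along these lines (a sketch only - the workflow command and its options are assumptions on my side; the same value can equally be set as a task <param> in the job XML):

# Assumed beaker-client workflow invocation; adjust family/arch/task to the real job.
bkr workflow-simple \
    --family RedHatEnterpriseLinux6 --arch i386 \
    --task /kernel/filesystems/nfs/ltp-aiodio \
    --taskparam KILLTIMEOVERRIDE=50000    # roughly 16+ hours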
(In reply to comment #5)
> To me it looks like NFSv2 has rather poor performance compared to v3 and v4,
> and the test does not finish in the expected time - what takes 3+ hours on
> NFSv4 takes 6+ hours on NFSv2. How long did the test run when you ran it
> manually?
>
> Try running with the KILLTIMEOVERRIDE parameter set to a larger number
> (50000 - that's 16+ hours).
>
> The External Watchdog may be the result of the test leaving the system in an
> inconsistent state.

Hi Marian,

Thanks for reminding me about the local watchdog setting. Could you also help look at the call trace issue? The RHEL 6.2 call trace logs are at the links below:
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/11/1535/153540/320477/3516668/messages
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/11/1535/153541/320479/3516680/messages

<snippet>
Nov 8 02:57:04 hp-dl180-03 kernel: Call Trace:
Nov 8 02:57:04 hp-dl180-03 kernel: [<f9322eb6>] ? xprt_reserve+0x176/0x190 [sunrpc]
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0409ff0>] ? common_interrupt+0x30/0x38
Nov 8 02:57:04 hp-dl180-03 kernel: [<f7fb0a13>] ? nfs_read_prepare+0x33/0x50 [nfs]
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0830b88>] ? _spin_lock_bh+0x8/0x30
Nov 8 02:57:04 hp-dl180-03 kernel: [<f9327da3>] ? __rpc_execute+0x133/0x260 [sunrpc]
Nov 8 02:57:04 hp-dl180-03 kernel: [<f93200d8>] ? call_bind_status+0x158/0x260 [sunrpc]
Nov 8 02:57:04 hp-dl180-03 kernel: [<f9327f10>] ? rpc_async_schedule+0x0/0x10 [sunrpc]
Nov 8 02:57:04 hp-dl180-03 kernel: [<c04719f7>] ? worker_thread+0x197/0x230
Nov 8 02:57:04 hp-dl180-03 kernel: [<c04760d0>] ? autoremove_wake_function+0x0/0x40
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0471860>] ? worker_thread+0x0/0x230
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0475e94>] ? kthread+0x74/0x80
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0475e20>] ? kthread+0x0/0x80
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0409fff>] ? kernel_thread_helper+0x7/0x10
Nov 8 02:57:04 hp-dl180-03 kernel: rpciod/1 S f0e03f50 0 1539 2 0x00000080
Nov 8 02:57:04 hp-dl180-03 kernel: f33e1570 00000046 00000002 f0e03f50 cde13aa4 00000000 00000000 0000022f
Nov 8 02:57:04 hp-dl180-03 kernel: 00000000 c16913c0 00005b68 ae78936e 00005b68 c0b0f0c0 c0b0f0c0 f33e1818
Nov 8 02:57:04 hp-dl180-03 kernel: c0b0f0c0 c0b0aaa4 c0b0f0c0 f33e1818 05fbf17d f130e5b8 00000246 f33e1570
Nov 8 02:57:04 hp-dl180-03 kernel: Call Trace:
Nov 8 02:57:04 hp-dl180-03 kernel: [<f9322eb6>] ? xprt_reserve+0x176/0x190 [sunrpc]
Nov 8 02:57:04 hp-dl180-03 kernel: [<f932083d>] ? call_transmit+0x18d/0x270 [sunrpc]
Nov 8 02:57:04 hp-dl180-03 kernel: [<f7fb3883>] ? nfs_write_prepare+0x33/0x50 [nfs]
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0830b88>] ? _spin_lock_bh+0x8/0x30
Nov 8 02:57:04 hp-dl180-03 kernel: [<f9327da3>] ? __rpc_execute+0x133/0x260 [sunrpc]
Nov 8 02:57:04 hp-dl180-03 kernel: [<f9327f10>] ? rpc_async_schedule+0x0/0x10 [sunrpc]
Nov 8 02:57:04 hp-dl180-03 kernel: [<f9327f10>] ? rpc_async_schedule+0x0/0x10 [sunrpc]
Nov 8 02:57:04 hp-dl180-03 kernel: [<c04719f7>] ? worker_thread+0x197/0x230
Nov 8 02:57:04 hp-dl180-03 kernel: [<c04760d0>] ? autoremove_wake_function+0x0/0x40
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0471860>] ? worker_thread+0x0/0x230
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0475e94>] ? kthread+0x74/0x80
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0475e20>] ? kthread+0x0/0x80
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0409fff>] ? kernel_thread_helper+0x7/0x10
</snippet>

I also ran the dio_sparse test on RHEL 5.8, and it also triggers a kernel call trace, but a different one from RHEL 6.2. Please check the links below:
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/12/1712/171226/358573/3985546/21827877/console_dmesg--kernel-filesystems-nfs-ltp-aiodio-LOCALWATCHDOG.log
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/12/1712/171230/358581/3985594/21772866/console_dmesg--kernel-filesystems-nfs-ltp-aiodio-LOCALWATCHDOG.log
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/12/1712/171227/358575/3985558/21789549/console_dmesg--kernel-filesystems-nfs-ltp-aiodio-LOCALWATCHDOG.log

<snippet>
[<c048ec0c>] inode_wait+0x5/0x8
[<c0623c17>] __wait_on_bit+0x33/0x58
[<c048ec07>] inode_wait+0x0/0x8
[<c048ec07>] inode_wait+0x0/0x8
[<c0623c9e>] out_of_line_wait_on_bit+0x62/0x6a
[<c0436f58>] wake_bit_function+0x0/0x3c
[<c048ee4b>] ifind_fast+0x59/0x6a
[<c048fcd7>] iget_locked+0x4a/0x111
[<f88aa29d>] ext3_get_dentry+0x2a/0x95 [ext3]
[<f8e9d302>] find_exported_dentry+0x29/0x489 [exportfs]
[<c05e68e5>] ip_queue_xmit+0x3c7/0x406
[<f9006e33>] exp_get_by_name+0x4b/0x5a [nfsd]
[<f90074cd>] expkey_match+0x49/0x58 [nfsd]
[<f8f8516c>] sunrpc_cache_lookup+0x5c/0xea [sunrpc]
[<f9006f86>] svc_expkey_lookup+0xa1/0xa7 [nfsd]
[<f900701b>] exp_find_key+0x8f/0x9e [nfsd]
[<c05f603d>] __tcp_push_pending_frames+0x69c/0x751
[<c0431b0c>] set_current_groups+0x15a/0x166
[<f8e9d7b1>] export_decode_fh+0x4f/0x59 [exportfs]
[<f90033aa>] nfsd_acceptable+0x0/0xbc [nfsd]
[<f8e9d762>] export_decode_fh+0x0/0x59 [exportfs]
[<f90036ef>] fh_verify+0x289/0x4a6 [nfsd]
[<f90033aa>] nfsd_acceptable+0x0/0xbc [nfsd]
[<c05bcd6b>] kernel_sendpage+0x35/0x3c
[<f9004745>] nfsd_open+0x2d/0x176 [nfsd]
[<f9004a43>] nfsd_write+0x69/0xab [nfsd]
[<f900253f>] nfsd_proc_write+0xb2/0xc1 [nfsd]
[<f90011a4>] nfsd_dispatch+0xbb/0x1a9 [nfsd]
[<f8f7f835>] svc_process+0x3c2/0x62d [sunrpc]
</snippet>

But I do not trigger these call traces when I reserve two machines, set them up, and run the test manually, so I am still confused about whether there is a problem related to Beaker.
These call stacks are dumped by the local watchdog handler, and it is understandable that they differ if the task is still running.

What I am concerned about is that there are many zombie processes left behind, so it may be a genuine bug:

Dec 16 02:08:45 ibm-z10-29 LWD:ps-elfH: 0 D root 31217 17658 0 76 0 - 8675 sync_p Dec15 ? 00:01:46 ltp-diorh /mnt/testarea/nfs/aiodio/file5
Dec 16 02:08:45 ibm-z10-29 LWD:ps-elfH: 1 Z root 31218 31217 0 75 0 - 0 exit Dec15 ? 00:00:00 [ltp-diorh] <defunct>
...

How long is the test expected to run?
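If you want to check for these leftovers by hand after a run, plain commands like the following should be enough (nothing Beaker-specific; the process name is taken from the watchdog log above):

# Look for ltp-diorh children left in zombie (Z) or uninterruptible (D) state.
ps -elfH | grep -E 'ltp-diorh|defunct'
ps -eo state,pid,ppid,etime,cmd | awk '$1 == "D" || $1 == "Z"'

# Dump blocked tasks to the kernel log - this gives the same kind of stacks
# the local watchdog printed (requires sysrq to be enabled).
echo w > /proc/sysrq-trigger
dmesg | tail -n 100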
(In reply to comment #7)
> These call stacks are dumped by the local watchdog handler, and it is
> understandable that they differ if the task is still running.
>
> What I am concerned about is that there are many zombie processes left
> behind, so it may be a genuine bug:
>
> Dec 16 02:08:45 ibm-z10-29 LWD:ps-elfH: 0 D root 31217 17658 0 76 0 - 8675 sync_p Dec15 ? 00:01:46 ltp-diorh /mnt/testarea/nfs/aiodio/file5
> Dec 16 02:08:45 ibm-z10-29 LWD:ps-elfH: 1 Z root 31218 31217 0 75 0 - 0 exit Dec15 ? 00:00:00 [ltp-diorh] <defunct>
> ...
>
> How long is the test expected to run?

About 16 hours, and I've rescheduled a job using the KILLTIMEOVERRIDE parameter, so we can monitor it again:
https://beaker.engineering.redhat.com/jobs/173337
(In reply to comment #8)
> (In reply to comment #7)
> > These call stacks are dumped by the local watchdog handler, and it is
> > understandable that they differ if the task is still running.
> >
> > What I am concerned about is that there are many zombie processes left
> > behind, so it may be a genuine bug:
> >
> > Dec 16 02:08:45 ibm-z10-29 LWD:ps-elfH: 0 D root 31217 17658 0 76 0 - 8675 sync_p Dec15 ? 00:01:46 ltp-diorh /mnt/testarea/nfs/aiodio/file5
> > Dec 16 02:08:45 ibm-z10-29 LWD:ps-elfH: 1 Z root 31218 31217 0 75 0 - 0 exit Dec15 ? 00:00:00 [ltp-diorh] <defunct>
> > ...
> >
> > How long is the test expected to run?
>
> About 16 hours, and I've rescheduled a job using the KILLTIMEOVERRIDE
> parameter, so we can monitor it again:
> https://beaker.engineering.redhat.com/jobs/173337

The above job aborted, so I rescheduled it again and extended the runtime to ~17 hours; it has now finished. So I think you're right that NFSv2 has rather poor performance and needs more time to run. Thank you for looking at the problem.
Pasting the link for the rescheduled job: https://beaker.engineering.redhat.com/jobs/174728
Hi Marian,

I now think this is not related to poor NFSv2 performance; it may still be related to the Beaker environment. I extended the ltp-aiodio runtime against NFSv2 to 20 hours and it still timed out, please check https://beaker.engineering.redhat.com/jobs/209625. But in fact, if I reserve two machines and run ltp-aiodio/dio_sparse manually, it only takes several minutes to complete. Other tests like aiocp and aio-stress also finish in several minutes when run manually, but they run for 2 or 3 hours in Beaker. So could you help look at this again? I can reserve and set up a test environment for you if you need one, thanks.
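To help compare the two cases, I can also capture the NFS mount details from both the Beaker-provisioned run and my manual run, for example (illustrative commands; $NFS_SERVER stands for the real server hostname):

# Show the mount options actually in effect (version, proto, rsize/wsize, timeo).
nfsstat -m
grep ' nfs ' /proc/mounts

# Show which NFS versions and transports the server registers.
rpcinfo -p "$NFS_SERVER"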
Bulk reassignment of issues as Bill has moved to another team.
Reverting to NEW to more accurately reflect current status.
This bug is closed as it is either not in the current Beaker scope or we could not find sufficient data in the bug report for consideration. Please feel free to reopen the bug with additional information and/or the business case behind it.