| Summary: | dio_sparse test from LTP failed run on NFSv2+[tcp\|udp] by beaker, but passed run by manual |
|---|---|
| Product: | [Retired] Beaker |
| Reporter: | yanfu,wang <yanwang> |
| Component: | tests |
| Assignee: | Nick Coghlan <ncoghlan> |
| Status: | CLOSED INSUFFICIENT_DATA |
| QA Contact: | yanfu,wang <yanwang> |
| Severity: | medium |
| Docs Contact: | |
| Priority: | medium |
| Version: | 0.6 |
| CC: | bpeck, dcallagh, mcsontos, mishin, rmancy, stl |
| Target Milestone: | --- |
| Keywords: | Reopened |
| Target Release: | --- |
| Hardware: | All |
| OS: | All |
| Whiteboard: | MC |
| Fixed In Version: | |
| Doc Type: | Bug Fix |
| Doc Text: | |
| Story Points: | --- |
| Clone Of: | |
| Environment: | |
| Last Closed: | 2012-11-07 07:24:39 UTC |
| Type: | --- |
| Regression: | --- |
| Mount Type: | --- |
| Documentation: | --- |
| CRM: | |
| Verified Versions: | |
| Category: | --- |
| oVirt Team: | --- |
| RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- |
| Target Upstream Version: | |
| Attachments: | attachment 533669: passed log from manual run |
Description
yanfu,wang
2011-11-15 02:46:45 UTC
Created attachment 533669 [details]
passed log from manual run
Above are the logs from when I reserved two machines and ran the test manually; it passed.

Any update?

Re-assigning to Marian for his input.

To me it looks like NFSv2 has rather poor performance compared to v3 and v4, and the test does not finish in the expected time: what takes 3+ hours on NFSv4 takes 6+ hours on NFSv2. How long did the test run when you ran it manually?

Try running with the KILLTIMEOVERRIDE parameter set to a larger number (50000, which is 16+ hours).

The External Watchdog may be the result of the test leaving the system in an inconsistent state.
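For reference, a minimal sketch of how the KILLTIMEOVERRIDE suggestion above could be passed when scheduling the job with the bkr client; the task path, family/arch, and the --taskparam option used here are assumptions based on this report rather than a verified job definition:

<snippet>
# Rough sketch only: schedule the aiodio task with a larger kill time so the
# external watchdog allows the slow NFSv2 run to finish.
bkr workflow-simple \
    --family RedHatEnterpriseLinux6 --arch i386 \
    --task /kernel/filesystems/nfs/ltp-aiodio \
    --taskparam KILLTIMEOVERRIDE=50000    # value suggested earlier in this report
</snippet>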
(In reply to comment #5)
> To me it looks like NFSv2 has rather poor performance compared to v3 and v4, and
> the test does not finish in the expected time: what takes 3+ hours on NFSv4 takes
> 6+ hours on NFSv2. How long did the test run when you ran it manually?
>
> Try running with the KILLTIMEOVERRIDE parameter set to a larger number (50000,
> which is 16+ hours).
>
> The External Watchdog may be the result of the test leaving the system in an
> inconsistent state.

hi Marian,
Thanks for reminding me about the local watchdog setting. Could you also help to look at the call trace issue? For the RHEL6.2 call trace logs, please refer to the links below:
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/11/1535/153540/320477/3516668/messages
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/11/1535/153541/320479/3516680/messages

<snippet>
Nov 8 02:57:04 hp-dl180-03 kernel: Call Trace:
Nov 8 02:57:04 hp-dl180-03 kernel: [<f9322eb6>] ? xprt_reserve+0x176/0x190 [sunrpc]
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0409ff0>] ? common_interrupt+0x30/0x38
Nov 8 02:57:04 hp-dl180-03 kernel: [<f7fb0a13>] ? nfs_read_prepare+0x33/0x50 [nfs]
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0830b88>] ? _spin_lock_bh+0x8/0x30
Nov 8 02:57:04 hp-dl180-03 kernel: [<f9327da3>] ? __rpc_execute+0x133/0x260 [sunrpc]
Nov 8 02:57:04 hp-dl180-03 kernel: [<f93200d8>] ? call_bind_status+0x158/0x260 [sunrpc]
Nov 8 02:57:04 hp-dl180-03 kernel: [<f9327f10>] ? rpc_async_schedule+0x0/0x10 [sunrpc]
Nov 8 02:57:04 hp-dl180-03 kernel: [<c04719f7>] ? worker_thread+0x197/0x230
Nov 8 02:57:04 hp-dl180-03 kernel: [<c04760d0>] ? autoremove_wake_function+0x0/0x40
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0471860>] ? worker_thread+0x0/0x230
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0475e94>] ? kthread+0x74/0x80
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0475e20>] ? kthread+0x0/0x80
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0409fff>] ? kernel_thread_helper+0x7/0x10
Nov 8 02:57:04 hp-dl180-03 kernel: rpciod/1 S f0e03f50 0 1539 2 0x00000080
Nov 8 02:57:04 hp-dl180-03 kernel: f33e1570 00000046 00000002 f0e03f50 cde13aa4 00000000 00000000 0000022f
Nov 8 02:57:04 hp-dl180-03 kernel: 00000000 c16913c0 00005b68 ae78936e 00005b68 c0b0f0c0 c0b0f0c0 f33e1818
Nov 8 02:57:04 hp-dl180-03 kernel: c0b0f0c0 c0b0aaa4 c0b0f0c0 f33e1818 05fbf17d f130e5b8 00000246 f33e1570
Nov 8 02:57:04 hp-dl180-03 kernel: Call Trace:
Nov 8 02:57:04 hp-dl180-03 kernel: [<f9322eb6>] ? xprt_reserve+0x176/0x190 [sunrpc]
Nov 8 02:57:04 hp-dl180-03 kernel: [<f932083d>] ? call_transmit+0x18d/0x270 [sunrpc]
Nov 8 02:57:04 hp-dl180-03 kernel: [<f7fb3883>] ? nfs_write_prepare+0x33/0x50 [nfs]
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0830b88>] ? _spin_lock_bh+0x8/0x30
Nov 8 02:57:04 hp-dl180-03 kernel: [<f9327da3>] ? __rpc_execute+0x133/0x260 [sunrpc]
Nov 8 02:57:04 hp-dl180-03 kernel: [<f9327f10>] ? rpc_async_schedule+0x0/0x10 [sunrpc]
Nov 8 02:57:04 hp-dl180-03 kernel: [<f9327f10>] ? rpc_async_schedule+0x0/0x10 [sunrpc]
Nov 8 02:57:04 hp-dl180-03 kernel: [<c04719f7>] ? worker_thread+0x197/0x230
Nov 8 02:57:04 hp-dl180-03 kernel: [<c04760d0>] ? autoremove_wake_function+0x0/0x40
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0471860>] ? worker_thread+0x0/0x230
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0475e94>] ? kthread+0x74/0x80
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0475e20>] ? kthread+0x0/0x80
Nov 8 02:57:04 hp-dl180-03 kernel: [<c0409fff>] ? kernel_thread_helper+0x7/0x10
</snippet>

I also ran the dio_sparse test on RHEL5.8 and triggered a kernel call trace there too, though a different one from RHEL6.2; please check the links below:
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/12/1712/171226/358573/3985546/21827877/console_dmesg--kernel-filesystems-nfs-ltp-aiodio-LOCALWATCHDOG.log
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/12/1712/171230/358581/3985594/21772866/console_dmesg--kernel-filesystems-nfs-ltp-aiodio-LOCALWATCHDOG.log
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/12/1712/171227/358575/3985558/21789549/console_dmesg--kernel-filesystems-nfs-ltp-aiodio-LOCALWATCHDOG.log

<snippet>
[<c048ec0c>] inode_wait+0x5/0x8
[<c0623c17>] __wait_on_bit+0x33/0x58
[<c048ec07>] inode_wait+0x0/0x8
[<c048ec07>] inode_wait+0x0/0x8
[<c0623c9e>] out_of_line_wait_on_bit+0x62/0x6a
[<c0436f58>] wake_bit_function+0x0/0x3c
[<c048ee4b>] ifind_fast+0x59/0x6a
[<c048fcd7>] iget_locked+0x4a/0x111
[<f88aa29d>] ext3_get_dentry+0x2a/0x95 [ext3]
[<f8e9d302>] find_exported_dentry+0x29/0x489 [exportfs]
[<c05e68e5>] ip_queue_xmit+0x3c7/0x406
[<f9006e33>] exp_get_by_name+0x4b/0x5a [nfsd]
[<f90074cd>] expkey_match+0x49/0x58 [nfsd]
[<f8f8516c>] sunrpc_cache_lookup+0x5c/0xea [sunrpc]
[<f9006f86>] svc_expkey_lookup+0xa1/0xa7 [nfsd]
[<f900701b>] exp_find_key+0x8f/0x9e [nfsd]
[<c05f603d>] __tcp_push_pending_frames+0x69c/0x751
[<c0431b0c>] set_current_groups+0x15a/0x166
[<f8e9d7b1>] export_decode_fh+0x4f/0x59 [exportfs]
[<f90033aa>] nfsd_acceptable+0x0/0xbc [nfsd]
[<f8e9d762>] export_decode_fh+0x0/0x59 [exportfs]
[<f90036ef>] fh_verify+0x289/0x4a6 [nfsd]
[<f90033aa>] nfsd_acceptable+0x0/0xbc [nfsd]
[<c05bcd6b>] kernel_sendpage+0x35/0x3c
[<f9004745>] nfsd_open+0x2d/0x176 [nfsd]
[<f9004a43>] nfsd_write+0x69/0xab [nfsd]
[<f900253f>] nfsd_proc_write+0xb2/0xc1 [nfsd]
[<f90011a4>] nfsd_dispatch+0xbb/0x1a9 [nfsd]
[<f8f7f835>] svc_process+0x3c2/0x62d [sunrpc]
</snippet>

But I do not trigger these call traces when I reserve two machines, set them up manually, and run the test by hand, so I am still confused about whether there is a problem related to Beaker.

These callstacks are dumped by the local watchdog handler, and it is understandable that they differ if the task is still running.

What I am concerned about is that there are many zombie processes left behind, so it may be a genuine bug:

Dec 16 02:08:45 ibm-z10-29 LWD:ps-elfH: 0 D root 31217 17658 0 76 0 - 8675 sync_p Dec15 ? 00:01:46 ltp-diorh /mnt/testarea/nfs/aiodio/file5
Dec 16 02:08:45 ibm-z10-29 LWD:ps-elfH: 1 Z root 31218 31217 0 75 0 - 0 exit Dec15 ? 00:00:00 [ltp-diorh] <defunct>
...

How long is the test expected to run?
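For reference, a quick sketch of how one might list the leftover ltp-diorh processes on the system under test, similar to what the LWD:ps-elfH dump above shows; the exact ps/awk invocation is just one of several options:

<snippet>
# Rough sketch: list aiodio workers stuck in uninterruptible sleep (D)
# or left behind as zombies (Z) on the system under test.
ps -eo state,pid,ppid,etime,args | awk '$1 ~ /^[DZ]/ && /ltp-diorh/'
</snippet>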
(In reply to comment #7)
> These callstacks are dumped by the local watchdog handler, and it is understandable
> that they differ if the task is still running.
>
> What I am concerned about is that there are many zombie processes left behind,
> so it may be a genuine bug:
>
> Dec 16 02:08:45 ibm-z10-29 LWD:ps-elfH: 0 D root 31217 17658 0 76 0 - 8675 sync_p Dec15 ? 00:01:46 ltp-diorh /mnt/testarea/nfs/aiodio/file5
> Dec 16 02:08:45 ibm-z10-29 LWD:ps-elfH: 1 Z root 31218 31217 0 75 0 - 0 exit Dec15 ? 00:00:00 [ltp-diorh] <defunct>
> ...
>
> How long is the test expected to run?

About 16 hours, and I have rescheduled a job using the KILLTIMEOVERRIDE parameter, so we can monitor it again:
https://beaker.engineering.redhat.com/jobs/173337

(In reply to comment #8)
> (In reply to comment #7)
> > These callstacks are dumped by the local watchdog handler, and it is understandable
> > that they differ if the task is still running.
> >
> > What I am concerned about is that there are many zombie processes left behind,
> > so it may be a genuine bug:
> >
> > Dec 16 02:08:45 ibm-z10-29 LWD:ps-elfH: 0 D root 31217 17658 0 76 0 - 8675 sync_p Dec15 ? 00:01:46 ltp-diorh /mnt/testarea/nfs/aiodio/file5
> > Dec 16 02:08:45 ibm-z10-29 LWD:ps-elfH: 1 Z root 31218 31217 0 75 0 - 0 exit Dec15 ? 00:00:00 [ltp-diorh] <defunct>
> > ...
> >
> > How long is the test expected to run?
> About 16 hours, and I have rescheduled a job using the KILLTIMEOVERRIDE parameter,
> so we can monitor it again:
> https://beaker.engineering.redhat.com/jobs/173337

The above job aborted, so I rescheduled it again and extended the runtime to ~17 hours, and this time it finished. So I think you are right that NFSv2 has rather poor performance and the test needs more time to run. Thank you for looking at the problem. Here is the job link: https://beaker.engineering.redhat.com/jobs/174728

hi Marian,
On reflection, I do not think this is related to NFSv2's poor performance; it may still be related to the Beaker environment. I extended the ltp-aiodio runtime against NFSv2 to 20 hours and it still timed out; please check https://beaker.engineering.redhat.com/jobs/209625. In fact, if I reserve two machines and run ltp-aiodio/dio_sparse manually, it takes only a few minutes for the test to finish. Other tests such as aiocp and aio-stress also finish in a few minutes when run manually, yet they run for 2 or 3 hours in Beaker. So could you help to look at it again? I can reserve and set up a test environment for you if you need it, thanks.

Bulk reassignment of issues as Bill has moved to another team.

Reverting to NEW to more accurately reflect current status.

This bug is closed as it is either not in the current Beaker scope or we could not find sufficient data in the bug report for consideration. Please feel free to reopen the bug with additional information and/or business cases behind it.
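For reference, a minimal sketch of the manual reproduction described in this report; the export options, mount options, and LTP runtest file name are assumptions rather than the exact steps the reporter used:

<snippet>
# On the NFS server: export a scratch directory.
mkdir -p /export/nfs
echo '/export/nfs *(rw,no_root_squash,sync)' >> /etc/exports
exportfs -ra

# On the client: mount it with NFSv2 over TCP and run the aiodio tests.
mkdir -p /mnt/testarea/nfs
mount -t nfs -o vers=2,tcp nfs-server:/export/nfs /mnt/testarea/nfs
cd /opt/ltp && ./runltp -f ltp-aiodio -s dio_sparse -d /mnt/testarea/nfs
</snippet>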