Bug 753986 - dio_sparse test from LTP fails on NFSv2+[tcp|udp] when run by Beaker, but passes when run manually
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Beaker
Classification: Retired
Component: tests
Version: 0.6
Hardware: All
OS: All
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Nick Coghlan
QA Contact: yanfu,wang
URL:
Whiteboard: MC
Depends On:
Blocks:
 
Reported: 2011-11-15 02:46 UTC by yanfu,wang
Modified: 2012-11-07 07:24 UTC
CC List: 6 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-11-07 07:24:39 UTC


Attachments
log of passing manual run (3.55 KB, application/x-gzip)
2011-11-15 02:49 UTC, yanfu,wang

Description yanfu,wang 2011-11-15 02:46:45 UTC
Description of problem:
The dio_sparse test from the LTP test suite fails on NFSv2+[tcp|udp] when scheduled in Beaker. The automated run always times out (killed by the LOCAL WATCHDOG) and triggers a kernel call trace. NFSv[3|4]+[tcp|udp] passes, and when I ran the test on NFSv2+[tcp|udp] manually on reserved machines, dio_sparse passed and no kernel call trace appeared on the console.
I want to confirm first whether this is a Beaker problem; if not, I will investigate whether it is a kernel NFSv2 issue.
Please help me check the failure, thanks!

For the call trace logs, please see:
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/11/1535/153540/320477/3516668/messages
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/11/1535/153541/320479/3516680/messages

The failed jobs:
x86_64:https://beaker.engineering.redhat.com/jobs/153538
i386: https://beaker.engineering.redhat.com/jobs/153540
ppc64: https://beaker.engineering.redhat.com/jobs/153541
s390x: https://beaker.engineering.redhat.com/jobs/153543

Actual results:
The dio_sparse test from LTP fails on NFSv2+[tcp|udp].

Expected results:
The test passes on NFSv2+[tcp|udp] like the other combinations.

Additional info:

Comment 1 yanfu,wang 2011-11-15 02:49:02 UTC
Created attachment 533669 [details]
log of passing manual run

Comment 2 yanfu,wang 2011-11-15 02:50:41 UTC
These are the logs from the two machines I reserved, where I ran the test manually and it passed.

Comment 3 yanfu,wang 2011-12-19 02:28:59 UTC
Any update?

Comment 4 Bill Peck 2011-12-19 15:13:47 UTC
Re-assigning to Marian for his input.

Comment 5 Marian Csontos 2011-12-19 17:29:10 UTC
It looks to me like NFSv2 has rather poor performance compared to v3 and v4, so the test does not finish in the expected time - what takes 3+ hours on NFSv4 takes 6+ hours on NFSv2. How long did the test take when run manually?

Try running with the KILLTIMEOVERRIDE parameter set to a larger number (50000 - that's 16+ hours).
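
For example, something along these lines when scheduling the job (a sketch only - the task path is taken from the log file names in this bug, and the exact bkr options and the parameter's units should be verified against the Beaker client docs):

<snippet>
# Sketch: schedule the ltp-aiodio task with a larger local watchdog
# kill time. KILLTIMEOVERRIDE's units, the option spelling, and the
# distro name are assumptions to verify against the Beaker docs.
bkr workflow-simple \
    --distro RHEL6.2 \
    --arch x86_64 \
    --task /kernel/filesystems/nfs/ltp-aiodio \
    --taskparam "KILLTIMEOVERRIDE=50000"
</snippet>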

The External Watchdog may be the result of the test leaving the system in an inconsistent state.

Comment 6 yanfu,wang 2011-12-20 03:10:10 UTC
(In reply to comment #5)
> It looks to me like NFSv2 has rather poor performance compared to v3 and v4,
> so the test does not finish in the expected time - what takes 3+ hours on
> NFSv4 takes 6+ hours on NFSv2. How long did the test take when run manually?
> 
> Try running with the KILLTIMEOVERRIDE parameter set to a larger number
> (50000 - that's 16+ hours).
> 
> The External Watchdog may be the result of the test leaving the system in an
> inconsistent state.

Hi Marian,
Thanks for the reminder about the local watchdog setting. Could you also help look at the call trace issue?

For the RHEL6.2 call trace logs, please see:
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/11/1535/153540/320477/3516668/messages
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/11/1535/153541/320479/3516680/messages
<snippet>
Nov  8 02:57:04 hp-dl180-03 kernel: Call Trace:
Nov  8 02:57:04 hp-dl180-03 kernel: [<f9322eb6>] ? xprt_reserve+0x176/0x190 [sunrpc]
Nov  8 02:57:04 hp-dl180-03 kernel: [<c0409ff0>] ? common_interrupt+0x30/0x38
Nov  8 02:57:04 hp-dl180-03 kernel: [<f7fb0a13>] ? nfs_read_prepare+0x33/0x50 [nfs]
Nov  8 02:57:04 hp-dl180-03 kernel: [<c0830b88>] ? _spin_lock_bh+0x8/0x30
Nov  8 02:57:04 hp-dl180-03 kernel: [<f9327da3>] ? __rpc_execute+0x133/0x260 [sunrpc]
Nov  8 02:57:04 hp-dl180-03 kernel: [<f93200d8>] ? call_bind_status+0x158/0x260 [sunrpc]
Nov  8 02:57:04 hp-dl180-03 kernel: [<f9327f10>] ? rpc_async_schedule+0x0/0x10 [sunrpc]
Nov  8 02:57:04 hp-dl180-03 kernel: [<c04719f7>] ? worker_thread+0x197/0x230
Nov  8 02:57:04 hp-dl180-03 kernel: [<c04760d0>] ? autoremove_wake_function+0x0/0x40
Nov  8 02:57:04 hp-dl180-03 kernel: [<c0471860>] ? worker_thread+0x0/0x230
Nov  8 02:57:04 hp-dl180-03 kernel: [<c0475e94>] ? kthread+0x74/0x80
Nov  8 02:57:04 hp-dl180-03 kernel: [<c0475e20>] ? kthread+0x0/0x80
Nov  8 02:57:04 hp-dl180-03 kernel: [<c0409fff>] ? kernel_thread_helper+0x7/0x10
Nov  8 02:57:04 hp-dl180-03 kernel: rpciod/1      S f0e03f50     0  1539      2 0x00000080
Nov  8 02:57:04 hp-dl180-03 kernel: f33e1570 00000046 00000002 f0e03f50 cde13aa4 00000000 00000000 0000022f
Nov  8 02:57:04 hp-dl180-03 kernel: 00000000 c16913c0 00005b68 ae78936e 00005b68 c0b0f0c0 c0b0f0c0 f33e1818
Nov  8 02:57:04 hp-dl180-03 kernel: c0b0f0c0 c0b0aaa4 c0b0f0c0 f33e1818 05fbf17d f130e5b8 00000246 f33e1570
Nov  8 02:57:04 hp-dl180-03 kernel: Call Trace:
Nov  8 02:57:04 hp-dl180-03 kernel: [<f9322eb6>] ? xprt_reserve+0x176/0x190 [sunrpc]
Nov  8 02:57:04 hp-dl180-03 kernel: [<f932083d>] ? call_transmit+0x18d/0x270 [sunrpc]
Nov  8 02:57:04 hp-dl180-03 kernel: [<f7fb3883>] ? nfs_write_prepare+0x33/0x50 [nfs]
Nov  8 02:57:04 hp-dl180-03 kernel: [<c0830b88>] ? _spin_lock_bh+0x8/0x30
Nov  8 02:57:04 hp-dl180-03 kernel: [<f9327da3>] ? __rpc_execute+0x133/0x260 [sunrpc]
Nov  8 02:57:04 hp-dl180-03 kernel: [<f9327f10>] ? rpc_async_schedule+0x0/0x10 [sunrpc]
Nov  8 02:57:04 hp-dl180-03 kernel: [<f9327f10>] ? rpc_async_schedule+0x0/0x10 [sunrpc]
Nov  8 02:57:04 hp-dl180-03 kernel: [<c04719f7>] ? worker_thread+0x197/0x230
Nov  8 02:57:04 hp-dl180-03 kernel: [<c04760d0>] ? autoremove_wake_function+0x0/0x40
Nov  8 02:57:04 hp-dl180-03 kernel: [<c0471860>] ? worker_thread+0x0/0x230
Nov  8 02:57:04 hp-dl180-03 kernel: [<c0475e94>] ? kthread+0x74/0x80
Nov  8 02:57:04 hp-dl180-03 kernel: [<c0475e20>] ? kthread+0x0/0x80
Nov  8 02:57:04 hp-dl180-03 kernel: [<c0409fff>] ? kernel_thread_helper+0x7/0x10
</snippet>

I also ran the dio_sparse test on RHEL5.8; it triggers a kernel call trace as well, though a different one from RHEL6.2. Please check the links below:
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/12/1712/171226/358573/3985546/21827877/console_dmesg--kernel-filesystems-nfs-ltp-aiodio-LOCALWATCHDOG.log
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/12/1712/171230/358581/3985594/21772866/console_dmesg--kernel-filesystems-nfs-ltp-aiodio-LOCALWATCHDOG.log
http://beaker-archive.app.eng.bos.redhat.com/beaker-logs/2011/12/1712/171227/358575/3985558/21789549/console_dmesg--kernel-filesystems-nfs-ltp-aiodio-LOCALWATCHDOG.log
<snippet>
 [<c048ec0c>] inode_wait+0x5/0x8
 [<c0623c17>] __wait_on_bit+0x33/0x58
 [<c048ec07>] inode_wait+0x0/0x8
 [<c048ec07>] inode_wait+0x0/0x8
 [<c0623c9e>] out_of_line_wait_on_bit+0x62/0x6a
 [<c0436f58>] wake_bit_function+0x0/0x3c
 [<c048ee4b>] ifind_fast+0x59/0x6a
 [<c048fcd7>] iget_locked+0x4a/0x111
 [<f88aa29d>] ext3_get_dentry+0x2a/0x95 [ext3]
 [<f8e9d302>] find_exported_dentry+0x29/0x489 [exportfs]
 [<c05e68e5>] ip_queue_xmit+0x3c7/0x406
 [<f9006e33>] exp_get_by_name+0x4b/0x5a [nfsd]
 [<f90074cd>] expkey_match+0x49/0x58 [nfsd]
 [<f8f8516c>] sunrpc_cache_lookup+0x5c/0xea [sunrpc]
 [<f9006f86>] svc_expkey_lookup+0xa1/0xa7 [nfsd]
 [<f900701b>] exp_find_key+0x8f/0x9e [nfsd]
 [<c05f603d>] __tcp_push_pending_frames+0x69c/0x751
 [<c0431b0c>] set_current_groups+0x15a/0x166
 [<f8e9d7b1>] export_decode_fh+0x4f/0x59 [exportfs]
 [<f90033aa>] nfsd_acceptable+0x0/0xbc [nfsd]
 [<f8e9d762>] export_decode_fh+0x0/0x59 [exportfs]
 [<f90036ef>] fh_verify+0x289/0x4a6 [nfsd]
 [<f90033aa>] nfsd_acceptable+0x0/0xbc [nfsd]
 [<c05bcd6b>] kernel_sendpage+0x35/0x3c
 [<f9004745>] nfsd_open+0x2d/0x176 [nfsd]
 [<f9004a43>] nfsd_write+0x69/0xab [nfsd]
 [<f900253f>] nfsd_proc_write+0xb2/0xc1 [nfsd]
 [<f90011a4>] nfsd_dispatch+0xbb/0x1a9 [nfsd]
 [<f8f7f835>] svc_process+0x3c2/0x62d [sunrpc]
</snippet>

But I do not trigger these call traces when I reserve two machines and set them up manually, so I am still unsure whether there is a problem related to Beaker.
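
For reference, the manual run is essentially the following (a rough sketch only - server:/export is a placeholder, and the LTP install path and dio_sparse defaults are assumptions):

<snippet>
# Mount the export over NFSv2 (use proto=udp for the UDP variant);
# server:/export is a placeholder, not the exact host used here.
mount -t nfs -o vers=2,proto=tcp server:/export /mnt/testarea/nfs
cd /mnt/testarea/nfs/aiodio
# dio_sparse is built as part of the LTP aiodio test cases; running it
# with its default arguments is assumed sufficient to reproduce.
/opt/ltp/testcases/bin/dio_sparse
</snippet>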

Comment 7 Marian Csontos 2011-12-20 09:47:50 UTC
These call stacks are dumped by the local watchdog handler, and it is understandable that they differ if the task is still running.

What I am concerned about is that many zombie processes are left behind, so it may be a genuine bug:

Dec 16 02:08:45 ibm-z10-29 LWD:ps-elfH: 0 D root     31217 17658  0  76   0 -  8675 sync_p Dec15 ?        00:01:46                 ltp-diorh /mnt/testarea/nfs/aiodio/file5
Dec 16 02:08:45 ibm-z10-29 LWD:ps-elfH: 1 Z root     31218 31217  0  75   0 -     0 exit   Dec15 ?        00:00:00                   [ltp-diorh] <defunct>
...
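
A quick way to spot these leftovers on the machine (illustrative only):

<snippet>
# List defunct (zombie) processes together with their parent PIDs.
ps -eo pid,ppid,stat,comm | awk '$3 ~ /^Z/'
</snippet>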

How long is the test expected to run?

Comment 8 yanfu,wang 2011-12-21 02:29:08 UTC
(In reply to comment #7)
> These call stacks are dumped by the local watchdog handler, and it is
> understandable that they differ if the task is still running.
> 
> What I am concerned about is that many zombie processes are left behind, so
> it may be a genuine bug:
> 
> Dec 16 02:08:45 ibm-z10-29 LWD:ps-elfH: 0 D root     31217 17658  0  76   0 - 
> 8675 sync_p Dec15 ?        00:01:46                 ltp-diorh
> /mnt/testarea/nfs/aiodio/file5
> Dec 16 02:08:45 ibm-z10-29 LWD:ps-elfH: 1 Z root     31218 31217  0  75   0 -  
>   0 exit   Dec15 ?        00:00:00                   [ltp-diorh] <defunct>
> ...
> 
> How long is the test expected to run?
About 16 hours. I have rescheduled a job using the KILLTIMEOVERRIDE parameter, so we can monitor it again:
https://beaker.engineering.redhat.com/jobs/173337

Comment 9 yanfu,wang 2011-12-27 02:27:19 UTC
(In reply to comment #8)
> (In reply to comment #7)
> > These call stacks are dumped by the local watchdog handler, and it is
> > understandable that they differ if the task is still running.
> > 
> > What I am concerned about is that many zombie processes are left behind,
> > so it may be a genuine bug:
> > 
> > Dec 16 02:08:45 ibm-z10-29 LWD:ps-elfH: 0 D root     31217 17658  0  76   0 - 
> > 8675 sync_p Dec15 ?        00:01:46                 ltp-diorh
> > /mnt/testarea/nfs/aiodio/file5
> > Dec 16 02:08:45 ibm-z10-29 LWD:ps-elfH: 1 Z root     31218 31217  0  75   0 -  
> >   0 exit   Dec15 ?        00:00:00                   [ltp-diorh] <defunct>
> > ...
> > 
> > How long is the test expected to run?
> About 16 hours. I have rescheduled a job using the KILLTIMEOVERRIDE
> parameter, so we can monitor it again:
> https://beaker.engineering.redhat.com/jobs/173337
The above job aborted, so I rescheduled it again and extended the runtime to ~17 hours; this time it finished. So I think you are right that NFSv2 has rather poor performance and the test needs more time to run.
Thank you for looking at the problem.

Comment 10 yanfu,wang 2011-12-27 02:28:09 UTC
Pasting the job link:
https://beaker.engineering.redhat.com/jobs/174728

Comment 11 yanfu,wang 2012-03-28 07:11:02 UTC
Hi Marian,
I now think this is not related to poor NFSv2 performance; it may still be related to the Beaker environment. I extended the ltp-aiodio runtime against NFSv2 to 20 hours and it still timed out; please check https://beaker.engineering.redhat.com/jobs/209625.
But in fact, if I reserve two machines and run ltp-aiodio/dio_sparse manually, the test finishes in just a few minutes. Other tests like aiocp and aio-stress also finish in a few minutes when run manually, but take 2 or 3 hours in Beaker.
So could you help to look at this again? I can reserve and set up a test environment for you if needed, thanks.

Comment 12 Nick Coghlan 2012-10-17 04:40:30 UTC
Bulk reassignment of issues as Bill has moved to another team.

Comment 13 Nick Coghlan 2012-11-07 06:01:39 UTC
Reverting to NEW to more accurately reflect current status.

Comment 14 Min Shin 2012-11-07 07:24:39 UTC
This bug is closed because it is either not in the current Beaker scope, or we could not find sufficient data in the bug report for consideration.
Please feel free to reopen the bug with additional information and/or the business case behind it.

