Bug 694639
| Summary: | lldpad has frequent timeouts on selects | ||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | William Cohen <wcohen> | ||||||||
| Component: | lldpad | Assignee: | Petr Šabata <psabata> | ||||||||
| Status: | CLOSED ERRATA | QA Contact: | qe-baseos-daemons | ||||||||
| Severity: | medium | Docs Contact: | |||||||||
| Priority: | urgent | ||||||||||
| Version: | 6.1 | CC: | awilliam, bnocera, ddumas, jack.morgan, jens.osterkamp, john.r.fastabend, mcermak, ovasik, pknirsch, psklenar | ||||||||
| Target Milestone: | rc | Keywords: | ZStream | ||||||||
| Target Release: | 6.2 | ||||||||||
| Hardware: | x86_64 | ||||||||||
| OS: | Linux | ||||||||||
| Whiteboard: | |||||||||||
| Fixed In Version: | lldpad-0.9.43-7.el6 | Doc Type: | Bug Fix | ||||||||
| Doc Text: | Story Points: | --- | |||||||||
| Clone Of: | |||||||||||
| : | 701943 (view as bug list) | Environment: | |||||||||
| Last Closed: | 2011-12-06 14:39:59 UTC | Type: | --- | ||||||||
| Regression: | --- | Mount Type: | --- | ||||||||
| Documentation: | --- | CRM: | |||||||||
| Verified Versions: | Category: | --- | |||||||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||||||
| Embargoed: | |||||||||||
| Bug Depends On: | 695943 | ||||||||||
| Bug Blocks: | 701993 | ||||||||||
| Attachments: | 
 | ||||||||||
| Since RHEL 6.1 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux. Attached gdb to process got the following traceback:
(gdb) break select
Breakpoint 1 at 0x3aa9cde8f0
(gdb) c
Continuing.
Breakpoint 1, 0x0000003aa9cde8f0 in select () from /lib64/libc.so.6
(gdb) where
#0  0x0000003aa9cde8f0 in select () from /lib64/libc.so.6
#1  0x00000000004087c4 in eloop_run () at eloop.c:479
#2  0x0000000000403692 in main (argc=<value optimized out>, 
    argv=0x7fff8a857fe8) at lldpad.c:414
Is there some way to avoid having the eloop_run() constantly go around polling?
Looks like lots of 10ms timeouts. Set breakpoint on  eloop_register_timeout function and see a lot of things of the form:
(gdb) 
Continuing.
Breakpoint 1, eloop_register_timeout (secs=0, usecs=10000, 
    handler=0x426d50 <vdp_timeout_handler>, eloop_data=0x0, 
    user_data=0x1bf23a0) at eloop.c:273
273	{
(gdb) 
Continuing.
Breakpoint 1, eloop_register_timeout (secs=0, usecs=10000, 
    handler=0x423c90 <ecp_timeout_handler>, eloop_data=0x0, 
    user_data=0x1bfe370) at eloop.c:273
273	{
See those coming from:
$ grep 10000 */*.h
ecp/ecp.h:#define ECP_TIMER_GRANULARITY		10000 /* 10 ms in us */
include/lldp_vdp.h:#define VDP_TIMER_GRANULARITY		10000 /* 10 ms in us */
lldp_vdp.c:vdp_timeout_handerl() wakes up every 10ms just to decrement some counters. ecp.c:ecp_timeout_handler() is doing the same.
For vdp machine is literally wakes up a hundreds of times before timeing because the timeout initial setting are for ackTimer (VDP_KEEPALIVE_TIMER_DEFAULT) and keepaliveTimer (VDP_ACK_TIMER_DEFAULT) are:
#define VDP_KEEPALIVE_TIMER_DEFAULT	1000  /* 10s in 10ms chunks */
#define VDP_ACK_TIMER_DEFAULT 	(2*ECP_ACK_TIMER_DEFAULT*ECP_MAX_RETRIES)
#define ECP_MAX_RETRIES			3
#define ECP_ACK_TIMER_DEFAULT		50 /* 500 ms in 10 ms chunks */
Seems like the timeouts for the profiles could be computed in advanced and eliminate this polling hundred's of times a second.
William, Thanks for the report and email. I'll look into this. John, if you have a fix for this we would appreciate it asap, given that this is ugly for power management. (In reply to comment #0) ... > Expected results: > > Fewer select timeouts for lldpad. What would be an acceptable level for the select timeouts ? Hi Jens. We're typically looking at less than once a second ideally for any service, the longer the better obviously. Hope that helps, Thanks & regards, Phil (In reply to comment #10) > Hi Jens. > > We're typically looking at less than once a second ideally for any service, the > longer the better obviously. Thanks for the clarification Phil ! The patches I just posted on lldp-devel reduces the number of select timeouts to ~1 per second on my system. I will attach them here. Please test and let me know of the results. Thanks ! Created attachment 496492 [details]
patch to reduce ECP select timeouts
Created attachment 496493 [details]
patch to reduce VDP select timeouts
I built a local version of the lldpad rpm with the two patches. The patches immensely reduce the frequency of timeouts. For the same: /usr/share/doc/systemtap-1.4/examples/profiling/timeout.stp -c "sleep 10" pid | poll select epoll itimer futex nanosle signal| process 17938 | 0 49 0 0 0 0 0| postmaster 17937 | 0 49 0 0 0 0 0| postmaster 4851 | 0 0 0 0 0 39 0| stapio 1431 | 1 0 0 0 0 9 0| multipathd 17939 | 0 10 0 0 0 0 0| postmaster 23534 | 0 9 0 0 0 0 0| ntpd 4845 | 0 9 0 0 0 0 0| lldpad 17935 | 0 9 0 0 0 0 0| postmaster 17940 | 3 0 0 0 0 0 0| postmaster 2508 | 0 1 0 0 0 0 0| sendmail Fixed in CVS for 6.2, lldpad-0.9.41-5.el6 Update: https://bugzilla.redhat.com/show_bug.cgi?id=701993#c16 All fixes are included in the latest releases. Included in via generic EL6 lldpad updates, see bug 695943. verified by Intel An updated package with 6.1.z patches is now ready in CVS, lldpad-0.9.43-4.el6. The original patch had never made it upstream, so I posted a slightly different patch to fix this problem http://www.open-lldp.org/patchwork/patch/2174/ Petr Sabata verified it fixes the problem as well. Created attachment 520608 [details]
patch to only enable vdp when enabletx=true
A new build with the updated patch is on the way. this doesn't need to block the Fedora bug. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2011-1604.html | 
Description of problem: Even when idle llpad has very frequent timeouts, about 400 select timeouts a second on a 16 processor system. Version-Release number of selected component (if applicable): lldpad-0.9.41-2.el6.x86_64 kernel-2.6.32-122.el6.x86_64 systemtap-1.4-3.el6.x86_64 How reproducible: Every time Steps to Reproduce: 1. Install lldpad and set up 2. Install systemtap, kernel-debuginfo for kernel 3. Use the following script to watch timeout events for 10 seconds: /usr/share/doc/systemtap-1.4/examples/profiling/timeout.stp -c "sleep 10" Actual results: The timeout.stp script tallies the number of times that various syscalls return due to timeouts rather than an actual event. Below is te output of the following 10 second run. pid | poll select epoll itimer futex nanosle signal| process 3952 | 0 3958 0 0 0 0 0| lldpad 4404 | 0 49 0 0 0 0 0| postmaster 4403 | 0 49 0 0 0 0 0| postmaster 4833 | 0 0 0 0 0 39 0| stapio 4405 | 0 10 0 0 0 0 0| postmaster 1431 | 1 0 0 0 0 9 0| multipathd 4401 | 0 9 0 0 0 0 0| postmaster 23534 | 0 9 0 0 0 0 0| ntpd 4406 | 3 0 0 0 0 0 0| postmaster 2508 | 0 1 0 0 0 0 0| sendmail Doing "strace -p 3952" generate a constant stream of the form: select(14, [5 6 8 9 10 11 12 13], [], [], {0, 9799}) = 0 (Timeout) select(14, [5 6 8 9 10 11 12 13], [], [], {0, 0}) = 0 (Timeout) select(14, [5 6 8 9 10 11 12 13], [], [], {0, 0}) = 0 (Timeout) select(14, [5 6 8 9 10 11 12 13], [], [], {0, 0}) = 0 (Timeout) ... Expected results: Fewer select timeouts for lldpad. Additional info: