694639 – lldpad has frequent timeouts on selects


Bug 694639 - lldpad has frequent timeouts on selects

Summary: lldpad has frequent timeouts on selects

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 6
Classification:	Red Hat
Component:	lldpad
Sub Component:
Version:	6.1
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	medium
Target Milestone:	rc
Target Release:	6.2
Assignee:	Petr Šabata
QA Contact:	qe-baseos-daemons
Docs Contact:
URL:
Whiteboard:
Depends On:	695943
Blocks:	701993

Reported:	2011-04-07 20:16 UTC by William Cohen
Modified:	2011-12-06 14:39 UTC
CC List:	10 users
Fixed In Version:	lldpad-0.9.43-7.el6
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Clones:	701943
Environment:
Last Closed:	2011-12-06 14:39:59 UTC
Target Upstream Version:
Embargoed:

Attachments
patch to reduce ECP select timeouts (7.69 KB, patch) 2011-05-03 11:40 UTC, Jens Osterkamp	no flags
patch to reduce VDP select timeouts (14.35 KB, patch) 2011-05-03 11:41 UTC, Jens Osterkamp	no flags
patch to only enable vdp when enabletx=true (5.08 KB, patch) 2011-08-30 13:05 UTC, Jens Osterkamp	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2011:1604	0	normal	SHIPPED_LIVE	lldpad bug fix and enhancement update	2011-12-06 00:51:11 UTC

Description William Cohen 2011-04-07 20:16:46 UTC

Description of problem:

Even when idle, lldpad has very frequent timeouts: about 400 select timeouts a second on a 16-processor system.


Version-Release number of selected component (if applicable):

lldpad-0.9.41-2.el6.x86_64
kernel-2.6.32-122.el6.x86_64
systemtap-1.4-3.el6.x86_64

How reproducible:

Every time

Steps to Reproduce:
1. Install lldpad and set up
2. Install systemtap, kernel-debuginfo for kernel
3. Use the following script to watch timeout events for 10 seconds:
   /usr/share/doc/systemtap-1.4/examples/profiling/timeout.stp -c "sleep 10"

  
Actual results:

The timeout.stp script tallies the number of times that various syscalls return due to timeouts rather than an actual event. Below is the output of a 10-second run.

  pid |   poll  select   epoll  itimer   futex nanosle  signal| process
 3952 |      0    3958       0       0       0       0       0| lldpad
 4404 |      0      49       0       0       0       0       0| postmaster
 4403 |      0      49       0       0       0       0       0| postmaster
 4833 |      0       0       0       0       0      39       0| stapio
 4405 |      0      10       0       0       0       0       0| postmaster
 1431 |      1       0       0       0       0       9       0| multipathd
 4401 |      0       9       0       0       0       0       0| postmaster
23534 |      0       9       0       0       0       0       0| ntpd
 4406 |      3       0       0       0       0       0       0| postmaster
 2508 |      0       1       0       0       0       0       0| sendmail

Running "strace -p 3952" generates a constant stream of the form:


select(14, [5 6 8 9 10 11 12 13], [], [], {0, 9799}) = 0 (Timeout)
select(14, [5 6 8 9 10 11 12 13], [], [], {0, 0}) = 0 (Timeout)
select(14, [5 6 8 9 10 11 12 13], [], [], {0, 0}) = 0 (Timeout)
select(14, [5 6 8 9 10 11 12 13], [], [], {0, 0}) = 0 (Timeout)
...

Expected results:

Fewer select timeouts for lldpad.


Additional info:

Comment 2 RHEL Program Management 2011-04-07 20:23:27 UTC

Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 4 William Cohen 2011-04-14 16:16:30 UTC

Attached gdb to the process and got the following traceback:

(gdb) break select
Breakpoint 1 at 0x3aa9cde8f0
(gdb) c
Continuing.

Breakpoint 1, 0x0000003aa9cde8f0 in select () from /lib64/libc.so.6
(gdb) where
#0  0x0000003aa9cde8f0 in select () from /lib64/libc.so.6
#1  0x00000000004087c4 in eloop_run () at eloop.c:479
#2  0x0000000000403692 in main (argc=<value optimized out>, 
    argv=0x7fff8a857fe8) at lldpad.c:414

Is there some way to avoid having the eloop_run() constantly go around polling?

Comment 5 William Cohen 2011-04-14 21:49:45 UTC

Looks like lots of 10 ms timeouts. Setting a breakpoint on the eloop_register_timeout function shows many calls of the form:

(gdb) 
Continuing.

Breakpoint 1, eloop_register_timeout (secs=0, usecs=10000, 
    handler=0x426d50 <vdp_timeout_handler>, eloop_data=0x0, 
    user_data=0x1bf23a0) at eloop.c:273
273	{
(gdb) 
Continuing.

Breakpoint 1, eloop_register_timeout (secs=0, usecs=10000, 
    handler=0x423c90 <ecp_timeout_handler>, eloop_data=0x0, 
    user_data=0x1bfe370) at eloop.c:273
273	{


Those 10000 us values come from:

$ grep 10000 */*.h
ecp/ecp.h:#define ECP_TIMER_GRANULARITY		10000 /* 10 ms in us */
include/lldp_vdp.h:#define VDP_TIMER_GRANULARITY		10000 /* 10 ms in us */

lldp_vdp.c:vdp_timeout_handler() wakes up every 10 ms just to decrement some counters. ecp.c:ecp_timeout_handler() does the same.

The VDP state machine literally wakes up hundreds of times before timing out, because the initial timeout settings for ackTimer (VDP_ACK_TIMER_DEFAULT) and keepaliveTimer (VDP_KEEPALIVE_TIMER_DEFAULT) are:

#define VDP_KEEPALIVE_TIMER_DEFAULT	1000  /* 10s in 10ms chunks */
#define VDP_ACK_TIMER_DEFAULT 	(2*ECP_ACK_TIMER_DEFAULT*ECP_MAX_RETRIES)
#define ECP_MAX_RETRIES			3
#define ECP_ACK_TIMER_DEFAULT		50 /* 500 ms in 10 ms chunks */

Seems like the timeouts for the profiles could be computed in advance, eliminating this polling hundreds of times a second.
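
To put numbers on the defaults quoted above: one tick is 10 ms, so 1000 ticks is 10 s and 2*50*3 = 300 ticks is 3 s. The C sketch below converts a tick count into the (secs, usecs) pair that eloop_register_timeout() receives in the gdb output above, so that a timer could be armed once for its full duration instead of being re-armed every 10 ms. The names ticks_to_span() and struct timer_span are hypothetical and exist only for illustration; this is not lldpad code and not necessarily how any fix is implemented.

```c
/* Values copied from the header excerpts quoted in this comment. */
#define ECP_TIMER_GRANULARITY       10000 /* 10 ms in us */
#define VDP_TIMER_GRANULARITY       10000 /* 10 ms in us */
#define VDP_KEEPALIVE_TIMER_DEFAULT 1000  /* 10s in 10ms chunks */
#define ECP_MAX_RETRIES             3
#define ECP_ACK_TIMER_DEFAULT       50    /* 500 ms in 10 ms chunks */
#define VDP_ACK_TIMER_DEFAULT       (2*ECP_ACK_TIMER_DEFAULT*ECP_MAX_RETRIES)

/* Hypothetical type: a duration split into whole seconds and microseconds. */
struct timer_span {
    unsigned int secs;
    unsigned int usecs;
};

/*
 * Hypothetical helper: convert a count of timer ticks into one duration.
 * granularity_us is the length of a single tick in microseconds.
 */
struct timer_span ticks_to_span(unsigned int ticks, unsigned int granularity_us)
{
    /* Widen before multiplying so large tick counts cannot overflow. */
    unsigned long long total_us = (unsigned long long)ticks * granularity_us;
    struct timer_span span;

    span.secs = (unsigned int)(total_us / 1000000ULL);
    span.usecs = (unsigned int)(total_us % 1000000ULL);
    return span;
}
```

With these inputs, ticks_to_span(VDP_KEEPALIVE_TIMER_DEFAULT, VDP_TIMER_GRANULARITY) gives 10 s, and ticks_to_span(VDP_ACK_TIMER_DEFAULT, VDP_TIMER_GRANULARITY) gives 3 s. That is one wakeup per timer expiry rather than 1000 or 300 wakeups.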

Comment 6 john.r.fastabend 2011-04-15 03:42:26 UTC

William, Thanks for the report and email. I'll look into this.

Comment 8 Denise Dumas 2011-04-19 21:08:24 UTC

John, if you have a fix for this we would appreciate it asap, given that this is ugly for power management.

Comment 9 Jens Osterkamp 2011-04-21 13:34:05 UTC

(In reply to comment #0)

...

> Expected results:
> 
> Fewer select timeouts for lldpad.

What would be an acceptable level for the select timeouts ?

Comment 10 Phil Knirsch 2011-04-21 13:53:36 UTC

Hi Jens.

We're typically looking at less than once a second ideally for any service, the longer the better obviously.

Hope that helps,

Thanks & regards, Phil

Comment 11 Jens Osterkamp 2011-05-03 11:39:43 UTC

(In reply to comment #10)
> Hi Jens.
> 
> We're typically looking at less than once a second ideally for any service, the
> longer the better obviously.

Thanks for the clarification Phil !

The patches I just posted on lldp-devel reduce the number of select timeouts to ~1 per second on my system.

I will attach them here. Please test and let me know of the results.

Thanks !

Comment 12 Jens Osterkamp 2011-05-03 11:40:44 UTC

Created attachment 496492
patch to reduce ECP select timeouts

Comment 13 Jens Osterkamp 2011-05-03 11:41:20 UTC

Created attachment 496493
patch to reduce VDP select timeouts

Comment 14 William Cohen 2011-05-03 15:16:43 UTC

I built a local version of the lldpad rpm with the two patches. The patches greatly reduce the frequency of timeouts. For the same command:

/usr/share/doc/systemtap-1.4/examples/profiling/timeout.stp -c "sleep 10"


  pid |   poll  select   epoll  itimer   futex nanosle  signal| process
17938 |      0      49       0       0       0       0       0| postmaster
17937 |      0      49       0       0       0       0       0| postmaster
 4851 |      0       0       0       0       0      39       0| stapio
 1431 |      1       0       0       0       0       9       0| multipathd
17939 |      0      10       0       0       0       0       0| postmaster
23534 |      0       9       0       0       0       0       0| ntpd
 4845 |      0       9       0       0       0       0       0| lldpad
17935 |      0       9       0       0       0       0       0| postmaster
17940 |      3       0       0       0       0       0       0| postmaster
 2508 |      0       1       0       0       0       0       0| sendmail
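
For comparison with the first run in the description, the two lldpad rows work out to 3958 / 10 s, or about 396 select timeouts per second, before the patches, and 9 / 10 s, or 0.9 per second, after them. A minimal C sketch of that arithmetic follows; both function names are hypothetical and only illustrate the calculation.

```c
/* Average number of timeouts per second over a sampling window. */
double timeouts_per_second(unsigned long count, double window_secs)
{
    if (window_secs <= 0.0)
        return 0.0; /* guard against an empty or invalid window */
    return (double)count / window_secs;
}

/* How many times fewer timeouts the second measurement shows. */
double reduction_factor(unsigned long before, unsigned long after)
{
    if (after == 0)
        return 0.0; /* undefined ratio; report 0 rather than divide by zero */
    return (double)before / (double)after;
}
```

Calling reduction_factor(3958, 9) shows roughly a 440-fold drop in select timeouts for lldpad between the two runs.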

Comment 23 Petr Šabata 2011-05-04 13:57:35 UTC

Fixed in CVS for 6.2, lldpad-0.9.41-5.el6

Comment 25 Petr Šabata 2011-08-10 08:29:27 UTC

Update:
https://bugzilla.redhat.com/show_bug.cgi?id=701993#c16

All fixes are included in the latest releases via the generic EL6 lldpad updates; see bug 695943.

Comment 29 Jack Morgan 2011-08-22 19:51:46 UTC

verified by Intel

Comment 30 Petr Šabata 2011-08-23 07:46:11 UTC

An updated package with 6.1.z patches is now ready in CVS, lldpad-0.9.43-4.el6.

Comment 34 Jens Osterkamp 2011-08-30 13:02:27 UTC

The original patch never made it upstream, so I posted a slightly different patch to fix this problem:

http://www.open-lldp.org/patchwork/patch/2174/

Petr Sabata verified it fixes the problem as well.

Comment 35 Jens Osterkamp 2011-08-30 13:05:17 UTC

Created attachment 520608
patch to only enable vdp when enabletx=true

Comment 36 Petr Šabata 2011-08-30 13:18:12 UTC

A new build with the updated patch is on the way.

Comment 39 Adam Williamson 2011-09-09 18:30:05 UTC

this doesn't need to block the Fedora bug.

Comment 40 errata-xmlrpc 2011-12-06 14:39:59 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2011-1604.html
