+++ This bug was initially created as a clone of Bug #694639 +++ Description of problem: Even when idle llpad has very frequent timeouts, about 400 select timeouts a second on a 16 processor system. Version-Release number of selected component (if applicable): lldpad-0.9.41-2.el6.x86_64 kernel-2.6.32-122.el6.x86_64 systemtap-1.4-3.el6.x86_64 How reproducible: Every time Steps to Reproduce: 1. Install lldpad and set up 2. Install systemtap, kernel-debuginfo for kernel 3. Use the following script to watch timeout events for 10 seconds: /usr/share/doc/systemtap-1.4/examples/profiling/timeout.stp -c "sleep 10" Actual results: The timeout.stp script tallies the number of times that various syscalls return due to timeouts rather than an actual event. Below is te output of the following 10 second run. pid | poll select epoll itimer futex nanosle signal| process 3952 | 0 3958 0 0 0 0 0| lldpad 4404 | 0 49 0 0 0 0 0| postmaster 4403 | 0 49 0 0 0 0 0| postmaster 4833 | 0 0 0 0 0 39 0| stapio 4405 | 0 10 0 0 0 0 0| postmaster 1431 | 1 0 0 0 0 9 0| multipathd 4401 | 0 9 0 0 0 0 0| postmaster 23534 | 0 9 0 0 0 0 0| ntpd 4406 | 3 0 0 0 0 0 0| postmaster 2508 | 0 1 0 0 0 0 0| sendmail Doing "strace -p 3952" generate a constant stream of the form: select(14, [5 6 8 9 10 11 12 13], [], [], {0, 9799}) = 0 (Timeout) select(14, [5 6 8 9 10 11 12 13], [], [], {0, 0}) = 0 (Timeout) select(14, [5 6 8 9 10 11 12 13], [], [], {0, 0}) = 0 (Timeout) select(14, [5 6 8 9 10 11 12 13], [], [], {0, 0}) = 0 (Timeout) ... Expected results: Fewer select timeouts for lldpad. Additional info:
A clone of Bug #694639.
Removing the private bit. The fact that it's eating CPU isn't really the problem I'm interested in though. I wonder why it's started by default on a machine without any actual ethernet, and when FCoE isn't setup or enabled. It should be started on-demand, when FCoE is setup, not on every start, for every machine. Let me know if you want me to file a separate bug about it.
(In reply to comment #2) > Removing the private bit. > > The fact that it's eating CPU isn't really the problem I'm interested in > though. I wonder why it's started by default on a machine without any actual > ethernet, and when FCoE isn't setup or enabled. > This was required by Anaconda (Bug #571722). I don't know if it's still needed today and/or why is it in the default compose after installation. > It should be started on-demand, when FCoE is setup, not on every start, for > every machine. Let me know if you want me to file a separate bug about it. If you find this to be an issue, then yes.
(In reply to comment #3) > (In reply to comment #2) > > Removing the private bit. > > > > The fact that it's eating CPU isn't really the problem I'm interested in > > though. I wonder why it's started by default on a machine without any actual > > ethernet, and when FCoE isn't setup or enabled. > > > > This was required by Anaconda (Bug #571722). > I don't know if it's still needed today and/or why is it in the default compose > after installation. It's required by fcoe-utils, which is needed by anaconda. I don't really care much about it being _installed_. I care about it starting up when it's not needed. > > It should be started on-demand, when FCoE is setup, not on every start, for > > every machine. Let me know if you want me to file a separate bug about it. > > If you find this to be an issue, then yes. https://bugzilla.redhat.com/show_bug.cgi?id=701999
(In reply to comment #4) > (In reply to comment #3) > > (In reply to comment #2) > > > Removing the private bit. > > > > > > The fact that it's eating CPU isn't really the problem I'm interested in > > > though. I wonder why it's started by default on a machine without any actual > > > ethernet, and when FCoE isn't setup or enabled. > > > > > > > This was required by Anaconda (Bug #571722). > > I don't know if it's still needed today and/or why is it in the default compose > > after installation. > > It's required by fcoe-utils, which is needed by anaconda. I don't really care > much about it being _installed_. I care about it starting up when it's not > needed. Yes, the same applies to fcoe-utils. The fcoe service is also started by default. This change in initscripts was introduced as a fix to the bug I was referencing. In my opinion, those shouldn't even be installed by default unless the user uses an FCoE setup... > > > > It should be started on-demand, when FCoE is setup, not on every start, for > > > every machine. Let me know if you want me to file a separate bug about it. > > > > If you find this to be an issue, then yes. > > https://bugzilla.redhat.com/show_bug.cgi?id=701999 Let's move further discussion about this there.
lldpad-0.9.41-3.fc15 has been submitted as an update for Fedora 15. https://admin.fedoraproject.org/updates/lldpad-0.9.41-3.fc15
Package lldpad-0.9.41-3.fc15: * should fix your issue, * was pushed to the Fedora 15 testing repository, * should be available at your local mirror within two days. Update it with: # su -c 'yum update --enablerepo=updates-testing lldpad-0.9.41-3.fc15' as soon as you are able to. Please go to the following url: https://admin.fedoraproject.org/updates/lldpad-0.9.41-3.fc15 then log in and leave karma (feedback).
I see what looks like the same problem with the new package i.e. Name : lldpad Version : 0.9.41 Release : 3.fc15 Architecture: x86_64 Install Date: Thu 05 May 2011 02:02:16 PM CEST A top program (htop) shows lldpad steadily consuming approx. 9% of cpu, although it isn't actually doing anything in my config/system. I traced it for a while: [root@think ~]# /usr/bin/time -v strace -c -p 1034 Process 1034 attached - interrupt to quit ^CProcess 1034 detached % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 57.81 3.773353 7 532215 select 16.45 1.073787 2 531877 socket 14.85 0.969564 2 531878 close 10.89 0.710998 1 531877 ioctl 0.00 0.000000 0 3 recvfrom ------ ----------- ----------- --------- --------- ---------------- 100.00 6.527702 2127850 total Command being timed: "strace -c -p 1034" User time (seconds): 27.42 System time (seconds): 141.95 Percent of CPU this job got: 50% Elapsed (wall clock) time (h:mm:ss or m:ss): 5:37.40 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 740 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 219 Voluntary context switches: 3966183 Involuntary context switches: 332284 Swaps: 0 File system inputs: 0 File system outputs: 0 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 [root@think ~]# It seems to be constantly "spinning" on select(): > ... > select(11, [4 5 7 8 10], [], [], {0, 270}) = 0 (Timeout) > socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 11 > ioctl(11, SIOCGIFFLAGS, {ifr_name="wlan0", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0 > close(11) = 0 > ... Could the last argument be setting a timeout of 270 microseconds?
(In reply to comment #8) > I see what looks like the same problem with the new package i.e. > Name : lldpad > Version : 0.9.41 > Release : 3.fc15 > Architecture: x86_64 > Install Date: Thu 05 May 2011 02:02:16 PM CEST > > A top program (htop) shows lldpad steadily consuming approx. 9% of cpu, > although it isn't actually doing anything in my config/system. > > I traced it for a while: > [root@think ~]# /usr/bin/time -v strace -c -p 1034 > Process 1034 attached - interrupt to quit > ^CProcess 1034 detached > % time seconds usecs/call calls errors syscall > ------ ----------- ----------- --------- --------- ---------------- > 57.81 3.773353 7 532215 select > 16.45 1.073787 2 531877 socket > 14.85 0.969564 2 531878 close > 10.89 0.710998 1 531877 ioctl > 0.00 0.000000 0 3 recvfrom > ------ ----------- ----------- --------- --------- ---------------- > 100.00 6.527702 2127850 total > Command being timed: "strace -c -p 1034" > User time (seconds): 27.42 > System time (seconds): 141.95 > Percent of CPU this job got: 50% > Elapsed (wall clock) time (h:mm:ss or m:ss): 5:37.40 > Average shared text size (kbytes): 0 > Average unshared data size (kbytes): 0 > Average stack size (kbytes): 0 > Average total size (kbytes): 0 > Maximum resident set size (kbytes): 740 > Average resident set size (kbytes): 0 > Major (requiring I/O) page faults: 0 > Minor (reclaiming a frame) page faults: 219 > Voluntary context switches: 3966183 > Involuntary context switches: 332284 > Swaps: 0 > File system inputs: 0 > File system outputs: 0 > Socket messages sent: 0 > Socket messages received: 0 > Signals delivered: 0 > Page size (bytes): 4096 > Exit status: 0 > [root@think ~]# > > It seems to be constantly "spinning" on select(): > > ... > > select(11, [4 5 7 8 10], [], [], {0, 270}) = 0 (Timeout) > > socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 11 > > ioctl(11, SIOCGIFFLAGS, {ifr_name="wlan0", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0 > > close(11) = 0 > > ... > > Could the last argument be setting a timeout of 270 microseconds? This looks very much like the problem. Can you try the new rpm [1] and report back your results here ? Thanks ! [1] http://koji.fedoraproject.org/koji/taskinfo?taskID=3069319
Looks better: [root@think ~]# rpm -q lldpad lldpad-0.9.41-4.fc15.x86_64 [root@think ~]# alias proc='ps -H -o pid,cpu,lstart,sz,rss,wchan,cmd -C' [root@think ~]# proc lldpad PID CPU STARTED SZ RSS WCHAN CMD 22264 - Thu May 19 15:40:24 2011 3930 808 poll_s /usr/sbin/lldpad -d [root@think ~]# /usr/bin/time -v strace -c -p 22264 Process 22264 attached - interrupt to quit ^CProcess 22264 detached % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 100.00 0.005999 18 336 select 0.00 0.000000 0 3 recvfrom ------ ----------- ----------- --------- --------- ---------------- 100.00 0.005999 339 total Command being timed: "strace -c -p 22264" User time (seconds): 0.01 System time (seconds): 0.03 Percent of CPU this job got: 0% Elapsed (wall clock) time (h:mm:ss or m:ss): 5:33.42 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 736 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 218 Voluntary context switches: 681 Involuntary context switches: 5 Swaps: 0 File system inputs: 0 File system outputs: 0 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 [root@think ~]# strace -p 22264 Process 22264 attached - interrupt to quit select(10, [5 6 8 9], [], [], {0, 760364}) = 0 (Timeout) select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout) select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout) select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout) select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout) select(10, [5 6 8 9], [], [], {0, 999997}) = 0 (Timeout) select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout) select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout) select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout) select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout) select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout) select(10, [5 6 8 9], [], [], {0, 999997}) = 0 (Timeout) select(10, [5 6 8 9], [], [], {0, 999998}) = 1 (in [5], left {0, 994411}) recvfrom(5, "M00000001C2000000000005wlan0", 4095, MSG_DONTWAIT, {sa_family=AF_FILE, path="/tmp/lldpad_clif_22386-0"}, [27]) = 28 sendto(5, "R00C2000000000005wlan00000000000"..., 78, 0, {sa_family=AF_FILE, path="/tmp/lldpad_clif_22386-0"}, 27) = 78 select(10, [5 6 8 9], [], [], {0, 993449}) = 0 (Timeout) select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout) It seems to now use a timeout of about 1 second? htop no longer lists lldpad :-) The extent of my unit testing was to run: [root@think ~]# lldptool stats -i wlan0 Total Frames Transmitted = 0 Total Discarded Frames Received = 0 Total Error Frames Received = 0 Total Frames Received = 0 Total Discarded TLVs = 0 Total Unrecognized TLVs = 0 Total Ageouts = 0 (this caused the recvfrom/sendto sequence in the above strace output). So it _seems_ to function :-)
Well, FWIW, a few days and reboots later, the problem seems to have reappeared... The htop shows lldpad using approx. 6% of a cpu. I am still running the same test/updated version suggested about i.e. [root@think ~]# rpm -q lldpad lldpad-0.9.41-4.fc15.x86_64 [root@think ~]# proc lldpad PID CPU STARTED SZ RSS WCHAN CMD 1066 - Tue May 24 11:21:14 2011 3931 696 - /usr/sbin/lldpad -d Again it seems using a lot of cpu because of a very short timeout on select: [root@think ~]# /usr/bin/time -v strace -c -p 1066 Process 1066 attached - interrupt to quit ^CProcess 1066 detached % time seconds usecs/call calls errors syscall ------ ----------- ----------- --------- --------- ---------------- 84.52 3.994284 40 99325 select 6.26 0.295920 3 99199 socket 5.21 0.246396 2 99199 close 4.00 0.189141 2 99199 ioctl 0.00 0.000000 0 1 recvfrom ------ ----------- ----------- --------- --------- ---------------- 100.00 4.725741 396923 total Command being timed: "strace -c -p 1066" User time (seconds): 5.82 System time (seconds): 28.58 Percent of CPU this job got: 27% Elapsed (wall clock) time (h:mm:ss or m:ss): 2:06.54 Average shared text size (kbytes): 0 Average unshared data size (kbytes): 0 Average stack size (kbytes): 0 Average total size (kbytes): 0 Maximum resident set size (kbytes): 740 Average resident set size (kbytes): 0 Major (requiring I/O) page faults: 0 Minor (reclaiming a frame) page faults: 219 Voluntary context switches: 748034 Involuntary context switches: 56908 Swaps: 0 File system inputs: 0 File system outputs: 0 Socket messages sent: 0 Socket messages received: 0 Signals delivered: 0 Page size (bytes): 4096 Exit status: 0 [root@think ~]# strace -p 1066 Process 1066 attached - interrupt to quit select(11, [4 5 7 8 10], [], [], {0, 761}) = 0 (Timeout) socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 11 ioctl(11, SIOCGIFFLAGS, {ifr_name="wlan0", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0 close(11) = 0 select(11, [4 5 7 8 10], [], [], {0, 379}) = 0 (Timeout) socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 11 ioctl(11, SIOCGIFFLAGS, {ifr_name="wlan0", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0 close(11) = 0 ... Similar to before - that's a bit weird? Some more info. about the executable that is currently running: [root@think ~]# /usr/sbin/lldpad -v lldpad v0.9.41 Copyright (c) 2007-2010, Intel Corporation Portions used and/or modified from: hostapd v 0.5.7 Copyright (c) 2004-2007, Jouni Malinen <j> and contributors [root@think ~]# sha256sum /usr/sbin/lldpad 5799462610ef54edf507b27497c769a2b5173cdee1004f827d1cc7d8dd97c4b5 /usr/sbin/lldpad [root@think ~]# ldd /usr/sbin/lldpad linux-vdso.so.1 => (0x00007fff3d93c000) libconfig.so.9 => /usr/lib64/libconfig.so.9 (0x00000035a5000000) libnl.so.1 => /lib64/libnl.so.1 (0x0000003866e00000) libc.so.6 => /lib64/libc.so.6 (0x00000035a4c00000) libm.so.6 => /lib64/libm.so.6 (0x00000035a5c00000) libpthread.so.0 => /lib64/libpthread.so.0 (0x00000035a5400000) libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003866600000) /lib64/ld-linux-x86-64.so.2 (0x00000035a4800000)
*** Bug 717420 has been marked as a duplicate of this bug. ***
so, we need this fix for f16 as well as f15. thanks to the fcoe-utils dep and lldpad being enabled by default, it's running and eating CPU time on all live images and live installs of F16. i guess either fixing lldpad or bastien's 'run it on demand' would act as a fix for the general issue of 'eating lots of CPU time on everyone's system'.
I believe the following build fixes the issue but I'd appreciate if somebody could confirm that: http://koji.fedoraproject.org/koji/taskinfo?taskID=3311037
*** Bug 720080 has been marked as a duplicate of this bug. ***
*** Bug 735024 has been marked as a duplicate of this bug. ***
Petr Sabata wrote: > I believe the following build fixes the issue but I'd appreciate > if somebody could confirm that ... Sorry, no can do - I only have f15.i686. Test build is for f17. An attempt to install fails with: [root@think ~]# rpm -iv ~robb/Downloads/lldpad-0.9.43-2.fc17.i686.rpm error: Failed dependencies: libconfig.so.9 is needed by lldpad-0.9.43-2.fc17.i686 libnl.so.1 is needed by lldpad-0.9.43-2.fc17.i686
(In reply to comment #17) > Petr Sabata wrote: > > I believe the following build fixes the issue but I'd appreciate > > if somebody could confirm that ... > > Sorry, no can do - I only have f15.i686. Test build is for f17. > An attempt to install fails with: > [root@think ~]# rpm -iv ~robb/Downloads/lldpad-0.9.43-2.fc17.i686.rpm > error: Failed dependencies: > libconfig.so.9 is needed by lldpad-0.9.43-2.fc17.i686 > libnl.so.1 is needed by lldpad-0.9.43-2.fc17.i686 Oh! Here's the same package built for f15: http://koji.fedoraproject.org/koji/taskinfo?taskID=3326513
I rebuilt the test fix for F16 and installed it. It certainly seems to address the high CPU usage. I don't have any FCoE hardware so I can't check if it actually still works in the scenario where it does something useful, though.
if you could do an f16 build and submit it as an update soon that'd be great - we could pull it into Beta TC2.
(In reply to comment #19) > I rebuilt the test fix for F16 and installed it. It certainly seems to address > the high CPU usage. I don't have any FCoE hardware so I can't check if it > actually still works in the scenario where it does something useful, though. Great news. Thank you. (In reply to comment #20) > if you could do an f16 build and submit it as an update soon that'd be great - > we could pull it into Beta TC2. I'm on it...
lldpad-0.9.43-2.fc16 has been submitted as an update for Fedora 16. https://admin.fedoraproject.org/updates/lldpad-0.9.43-2.fc16
lldpad-0.9.41-4.fc15 has been submitted as an update for Fedora 15. https://admin.fedoraproject.org/updates/lldpad-0.9.41-4.fc15
Package lldpad-0.9.43-2.fc16: * should fix your issue, * was pushed to the Fedora 16 testing repository, * should be available at your local mirror within two days. Update it with: # su -c 'yum update --enablerepo=updates-testing lldpad-0.9.43-2.fc16' as soon as you are able to. Please go to the following url: https://admin.fedoraproject.org/updates/lldpad-0.9.43-2.fc16 then log in and leave karma (feedback).
proposing as NTH for Beta due to its impact on lives.
Discussed during the 2011-09-12 Fedora QA meeting. Accepted as Fedora 16 Beta NTH because of it's impact on the performance for live images
lldpad-0.9.43-3.fc16 has been submitted as an update for Fedora 16. https://admin.fedoraproject.org/updates/lldpad-0.9.43-3.fc16
lldpad-0.9.43-5.fc16 has been submitted as an update for Fedora 16. https://admin.fedoraproject.org/updates/lldpad-0.9.43-5.fc16
I'm testing Fedora-16-TC-x86_64-Live-Desktop, which has lldpad.x86_64 0:0.9.43-1.fc16. And it's currently using more than 10% CPU time on a Atom N450 system.
if we can get -5 into final it should definitely fix that as it won't be enabled. needs karma.
lldpad-0.9.43-5.fc16 has been pushed to the Fedora 16 stable repository. If problems still persist, please make note of it in this bug report.
lldpad-0.9.41-4.fc15 has been pushed to the Fedora 15 stable repository. If problems still persist, please make note of it in this bug report.