Bug 701943
| Summary: | lldpad has frequent timeouts on selects | ||
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Petr Šabata <psabata> |
| Component: | lldpad | Assignee: | Petr Šabata <psabata> |
| Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 15 | CC: | alekcejk, awilliam, benjavalero, bnocera, cebbert, ddumas, jens.osterkamp, john.r.fastabend, mdavis, ovasik, pknirsch, poehn, psabata, rbz, rtguille, tflink, wcohen, yann |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | AcceptedNTH | ||
| Fixed In Version: | lldpad-0.9.41-4.fc15 | Doc Type: | Bug Fix |
| Doc Text: | Story Points: | --- | |
| Clone Of: | 694639 | Environment: | |
| Last Closed: | 2011-11-05 01:22:46 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | |||
| Bug Depends On: | |||
| Bug Blocks: | 713565 | ||
|
Description
Petr Šabata
2011-05-04 11:41:46 UTC
A clone of Bug #694639. Removing the private bit. The fact that it's eating CPU isn't really the problem I'm interested in though. I wonder why it's started by default on a machine without any actual ethernet, and when FCoE isn't setup or enabled. It should be started on-demand, when FCoE is setup, not on every start, for every machine. Let me know if you want me to file a separate bug about it. (In reply to comment #2) > Removing the private bit. > > The fact that it's eating CPU isn't really the problem I'm interested in > though. I wonder why it's started by default on a machine without any actual > ethernet, and when FCoE isn't setup or enabled. > This was required by Anaconda (Bug #571722). I don't know if it's still needed today and/or why is it in the default compose after installation. > It should be started on-demand, when FCoE is setup, not on every start, for > every machine. Let me know if you want me to file a separate bug about it. If you find this to be an issue, then yes. (In reply to comment #3) > (In reply to comment #2) > > Removing the private bit. > > > > The fact that it's eating CPU isn't really the problem I'm interested in > > though. I wonder why it's started by default on a machine without any actual > > ethernet, and when FCoE isn't setup or enabled. > > > > This was required by Anaconda (Bug #571722). > I don't know if it's still needed today and/or why is it in the default compose > after installation. It's required by fcoe-utils, which is needed by anaconda. I don't really care much about it being _installed_. I care about it starting up when it's not needed. > > It should be started on-demand, when FCoE is setup, not on every start, for > > every machine. Let me know if you want me to file a separate bug about it. > > If you find this to be an issue, then yes. https://bugzilla.redhat.com/show_bug.cgi?id=701999 (In reply to comment #4) > (In reply to comment #3) > > (In reply to comment #2) > > > Removing the private bit. > > > > > > The fact that it's eating CPU isn't really the problem I'm interested in > > > though. I wonder why it's started by default on a machine without any actual > > > ethernet, and when FCoE isn't setup or enabled. > > > > > > > This was required by Anaconda (Bug #571722). > > I don't know if it's still needed today and/or why is it in the default compose > > after installation. > > It's required by fcoe-utils, which is needed by anaconda. I don't really care > much about it being _installed_. I care about it starting up when it's not > needed. Yes, the same applies to fcoe-utils. The fcoe service is also started by default. This change in initscripts was introduced as a fix to the bug I was referencing. In my opinion, those shouldn't even be installed by default unless the user uses an FCoE setup... > > > > It should be started on-demand, when FCoE is setup, not on every start, for > > > every machine. Let me know if you want me to file a separate bug about it. > > > > If you find this to be an issue, then yes. > > https://bugzilla.redhat.com/show_bug.cgi?id=701999 Let's move further discussion about this there. lldpad-0.9.41-3.fc15 has been submitted as an update for Fedora 15. https://admin.fedoraproject.org/updates/lldpad-0.9.41-3.fc15 Package lldpad-0.9.41-3.fc15: * should fix your issue, * was pushed to the Fedora 15 testing repository, * should be available at your local mirror within two days. Update it with: # su -c 'yum update --enablerepo=updates-testing lldpad-0.9.41-3.fc15' as soon as you are able to. Please go to the following url: https://admin.fedoraproject.org/updates/lldpad-0.9.41-3.fc15 then log in and leave karma (feedback). I see what looks like the same problem with the new package i.e.
Name : lldpad
Version : 0.9.41
Release : 3.fc15
Architecture: x86_64
Install Date: Thu 05 May 2011 02:02:16 PM CEST
A top program (htop) shows lldpad steadily consuming approx. 9% of cpu, although it isn't actually doing anything in my config/system.
I traced it for a while:
[root@think ~]# /usr/bin/time -v strace -c -p 1034
Process 1034 attached - interrupt to quit
^CProcess 1034 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
57.81 3.773353 7 532215 select
16.45 1.073787 2 531877 socket
14.85 0.969564 2 531878 close
10.89 0.710998 1 531877 ioctl
0.00 0.000000 0 3 recvfrom
------ ----------- ----------- --------- --------- ----------------
100.00 6.527702 2127850 total
Command being timed: "strace -c -p 1034"
User time (seconds): 27.42
System time (seconds): 141.95
Percent of CPU this job got: 50%
Elapsed (wall clock) time (h:mm:ss or m:ss): 5:37.40
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 740
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 219
Voluntary context switches: 3966183
Involuntary context switches: 332284
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
[root@think ~]#
It seems to be constantly "spinning" on select():
> ...
> select(11, [4 5 7 8 10], [], [], {0, 270}) = 0 (Timeout)
> socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 11
> ioctl(11, SIOCGIFFLAGS, {ifr_name="wlan0", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0
> close(11) = 0
> ...
Could the last argument be setting a timeout of 270 microseconds?
(In reply to comment #8) > I see what looks like the same problem with the new package i.e. > Name : lldpad > Version : 0.9.41 > Release : 3.fc15 > Architecture: x86_64 > Install Date: Thu 05 May 2011 02:02:16 PM CEST > > A top program (htop) shows lldpad steadily consuming approx. 9% of cpu, > although it isn't actually doing anything in my config/system. > > I traced it for a while: > [root@think ~]# /usr/bin/time -v strace -c -p 1034 > Process 1034 attached - interrupt to quit > ^CProcess 1034 detached > % time seconds usecs/call calls errors syscall > ------ ----------- ----------- --------- --------- ---------------- > 57.81 3.773353 7 532215 select > 16.45 1.073787 2 531877 socket > 14.85 0.969564 2 531878 close > 10.89 0.710998 1 531877 ioctl > 0.00 0.000000 0 3 recvfrom > ------ ----------- ----------- --------- --------- ---------------- > 100.00 6.527702 2127850 total > Command being timed: "strace -c -p 1034" > User time (seconds): 27.42 > System time (seconds): 141.95 > Percent of CPU this job got: 50% > Elapsed (wall clock) time (h:mm:ss or m:ss): 5:37.40 > Average shared text size (kbytes): 0 > Average unshared data size (kbytes): 0 > Average stack size (kbytes): 0 > Average total size (kbytes): 0 > Maximum resident set size (kbytes): 740 > Average resident set size (kbytes): 0 > Major (requiring I/O) page faults: 0 > Minor (reclaiming a frame) page faults: 219 > Voluntary context switches: 3966183 > Involuntary context switches: 332284 > Swaps: 0 > File system inputs: 0 > File system outputs: 0 > Socket messages sent: 0 > Socket messages received: 0 > Signals delivered: 0 > Page size (bytes): 4096 > Exit status: 0 > [root@think ~]# > > It seems to be constantly "spinning" on select(): > > ... > > select(11, [4 5 7 8 10], [], [], {0, 270}) = 0 (Timeout) > > socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 11 > > ioctl(11, SIOCGIFFLAGS, {ifr_name="wlan0", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0 > > close(11) = 0 > > ... > > Could the last argument be setting a timeout of 270 microseconds? This looks very much like the problem. Can you try the new rpm [1] and report back your results here ? Thanks ! [1] http://koji.fedoraproject.org/koji/taskinfo?taskID=3069319 Looks better:
[root@think ~]# rpm -q lldpad
lldpad-0.9.41-4.fc15.x86_64
[root@think ~]# alias proc='ps -H -o pid,cpu,lstart,sz,rss,wchan,cmd -C'
[root@think ~]# proc lldpad
PID CPU STARTED SZ RSS WCHAN CMD
22264 - Thu May 19 15:40:24 2011 3930 808 poll_s /usr/sbin/lldpad -d
[root@think ~]# /usr/bin/time -v strace -c -p 22264
Process 22264 attached - interrupt to quit
^CProcess 22264 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
100.00 0.005999 18 336 select
0.00 0.000000 0 3 recvfrom
------ ----------- ----------- --------- --------- ----------------
100.00 0.005999 339 total
Command being timed: "strace -c -p 22264"
User time (seconds): 0.01
System time (seconds): 0.03
Percent of CPU this job got: 0%
Elapsed (wall clock) time (h:mm:ss or m:ss): 5:33.42
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 736
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 218
Voluntary context switches: 681
Involuntary context switches: 5
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
[root@think ~]# strace -p 22264
Process 22264 attached - interrupt to quit
select(10, [5 6 8 9], [], [], {0, 760364}) = 0 (Timeout)
select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout)
select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout)
select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout)
select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout)
select(10, [5 6 8 9], [], [], {0, 999997}) = 0 (Timeout)
select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout)
select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout)
select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout)
select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout)
select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout)
select(10, [5 6 8 9], [], [], {0, 999997}) = 0 (Timeout)
select(10, [5 6 8 9], [], [], {0, 999998}) = 1 (in [5], left {0, 994411})
recvfrom(5, "M00000001C2000000000005wlan0", 4095, MSG_DONTWAIT, {sa_family=AF_FILE, path="/tmp/lldpad_clif_22386-0"}, [27]) = 28
sendto(5, "R00C2000000000005wlan00000000000"..., 78, 0, {sa_family=AF_FILE, path="/tmp/lldpad_clif_22386-0"}, 27) = 78
select(10, [5 6 8 9], [], [], {0, 993449}) = 0 (Timeout)
select(10, [5 6 8 9], [], [], {0, 999998}) = 0 (Timeout)
It seems to now use a timeout of about 1 second?
htop no longer lists lldpad :-)
The extent of my unit testing was to run:
[root@think ~]# lldptool stats -i wlan0
Total Frames Transmitted = 0
Total Discarded Frames Received = 0
Total Error Frames Received = 0
Total Frames Received = 0
Total Discarded TLVs = 0
Total Unrecognized TLVs = 0
Total Ageouts = 0
(this caused the recvfrom/sendto sequence in the above strace output).
So it _seems_ to function :-)
Well, FWIW, a few days and reboots later, the problem seems to have reappeared...
The htop shows lldpad using approx. 6% of a cpu. I am still running the same test/updated version suggested about i.e.
[root@think ~]# rpm -q lldpad
lldpad-0.9.41-4.fc15.x86_64
[root@think ~]# proc lldpad
PID CPU STARTED SZ RSS WCHAN CMD
1066 - Tue May 24 11:21:14 2011 3931 696 - /usr/sbin/lldpad -d
Again it seems using a lot of cpu because of a very short timeout on select:
[root@think ~]# /usr/bin/time -v strace -c -p 1066
Process 1066 attached - interrupt to quit
^CProcess 1066 detached
% time seconds usecs/call calls errors syscall
------ ----------- ----------- --------- --------- ----------------
84.52 3.994284 40 99325 select
6.26 0.295920 3 99199 socket
5.21 0.246396 2 99199 close
4.00 0.189141 2 99199 ioctl
0.00 0.000000 0 1 recvfrom
------ ----------- ----------- --------- --------- ----------------
100.00 4.725741 396923 total
Command being timed: "strace -c -p 1066"
User time (seconds): 5.82
System time (seconds): 28.58
Percent of CPU this job got: 27%
Elapsed (wall clock) time (h:mm:ss or m:ss): 2:06.54
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 740
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 219
Voluntary context switches: 748034
Involuntary context switches: 56908
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0
[root@think ~]# strace -p 1066
Process 1066 attached - interrupt to quit
select(11, [4 5 7 8 10], [], [], {0, 761}) = 0 (Timeout)
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 11
ioctl(11, SIOCGIFFLAGS, {ifr_name="wlan0", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0
close(11) = 0
select(11, [4 5 7 8 10], [], [], {0, 379}) = 0 (Timeout)
socket(PF_INET, SOCK_DGRAM, IPPROTO_IP) = 11
ioctl(11, SIOCGIFFLAGS, {ifr_name="wlan0", ifr_flags=IFF_UP|IFF_BROADCAST|IFF_RUNNING|IFF_MULTICAST}) = 0
close(11) = 0
...
Similar to before - that's a bit weird?
Some more info. about the executable that is currently running:
[root@think ~]# /usr/sbin/lldpad -v
lldpad v0.9.41
Copyright (c) 2007-2010, Intel Corporation
Portions used and/or modified from: hostapd v 0.5.7
Copyright (c) 2004-2007, Jouni Malinen <j> and contributors
[root@think ~]# sha256sum /usr/sbin/lldpad
5799462610ef54edf507b27497c769a2b5173cdee1004f827d1cc7d8dd97c4b5 /usr/sbin/lldpad
[root@think ~]# ldd /usr/sbin/lldpad
linux-vdso.so.1 => (0x00007fff3d93c000)
libconfig.so.9 => /usr/lib64/libconfig.so.9 (0x00000035a5000000)
libnl.so.1 => /lib64/libnl.so.1 (0x0000003866e00000)
libc.so.6 => /lib64/libc.so.6 (0x00000035a4c00000)
libm.so.6 => /lib64/libm.so.6 (0x00000035a5c00000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x00000035a5400000)
libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x0000003866600000)
/lib64/ld-linux-x86-64.so.2 (0x00000035a4800000)
*** Bug 717420 has been marked as a duplicate of this bug. *** so, we need this fix for f16 as well as f15. thanks to the fcoe-utils dep and lldpad being enabled by default, it's running and eating CPU time on all live images and live installs of F16. i guess either fixing lldpad or bastien's 'run it on demand' would act as a fix for the general issue of 'eating lots of CPU time on everyone's system'. I believe the following build fixes the issue but I'd appreciate if somebody could confirm that: http://koji.fedoraproject.org/koji/taskinfo?taskID=3311037 *** Bug 720080 has been marked as a duplicate of this bug. *** *** Bug 735024 has been marked as a duplicate of this bug. *** Petr Sabata wrote:
> I believe the following build fixes the issue but I'd appreciate
> if somebody could confirm that ...
Sorry, no can do - I only have f15.i686. Test build is for f17.
An attempt to install fails with:
[root@think ~]# rpm -iv ~robb/Downloads/lldpad-0.9.43-2.fc17.i686.rpm
error: Failed dependencies:
libconfig.so.9 is needed by lldpad-0.9.43-2.fc17.i686
libnl.so.1 is needed by lldpad-0.9.43-2.fc17.i686
(In reply to comment #17) > Petr Sabata wrote: > > I believe the following build fixes the issue but I'd appreciate > > if somebody could confirm that ... > > Sorry, no can do - I only have f15.i686. Test build is for f17. > An attempt to install fails with: > [root@think ~]# rpm -iv ~robb/Downloads/lldpad-0.9.43-2.fc17.i686.rpm > error: Failed dependencies: > libconfig.so.9 is needed by lldpad-0.9.43-2.fc17.i686 > libnl.so.1 is needed by lldpad-0.9.43-2.fc17.i686 Oh! Here's the same package built for f15: http://koji.fedoraproject.org/koji/taskinfo?taskID=3326513 I rebuilt the test fix for F16 and installed it. It certainly seems to address the high CPU usage. I don't have any FCoE hardware so I can't check if it actually still works in the scenario where it does something useful, though. if you could do an f16 build and submit it as an update soon that'd be great - we could pull it into Beta TC2. (In reply to comment #19) > I rebuilt the test fix for F16 and installed it. It certainly seems to address > the high CPU usage. I don't have any FCoE hardware so I can't check if it > actually still works in the scenario where it does something useful, though. Great news. Thank you. (In reply to comment #20) > if you could do an f16 build and submit it as an update soon that'd be great - > we could pull it into Beta TC2. I'm on it... lldpad-0.9.43-2.fc16 has been submitted as an update for Fedora 16. https://admin.fedoraproject.org/updates/lldpad-0.9.43-2.fc16 lldpad-0.9.41-4.fc15 has been submitted as an update for Fedora 15. https://admin.fedoraproject.org/updates/lldpad-0.9.41-4.fc15 Package lldpad-0.9.43-2.fc16: * should fix your issue, * was pushed to the Fedora 16 testing repository, * should be available at your local mirror within two days. Update it with: # su -c 'yum update --enablerepo=updates-testing lldpad-0.9.43-2.fc16' as soon as you are able to. Please go to the following url: https://admin.fedoraproject.org/updates/lldpad-0.9.43-2.fc16 then log in and leave karma (feedback). proposing as NTH for Beta due to its impact on lives. Discussed during the 2011-09-12 Fedora QA meeting. Accepted as Fedora 16 Beta NTH because of it's impact on the performance for live images lldpad-0.9.43-3.fc16 has been submitted as an update for Fedora 16. https://admin.fedoraproject.org/updates/lldpad-0.9.43-3.fc16 lldpad-0.9.43-5.fc16 has been submitted as an update for Fedora 16. https://admin.fedoraproject.org/updates/lldpad-0.9.43-5.fc16 I'm testing Fedora-16-TC-x86_64-Live-Desktop, which has lldpad.x86_64 0:0.9.43-1.fc16. And it's currently using more than 10% CPU time on a Atom N450 system. if we can get -5 into final it should definitely fix that as it won't be enabled. needs karma. lldpad-0.9.43-5.fc16 has been pushed to the Fedora 16 stable repository. If problems still persist, please make note of it in this bug report. lldpad-0.9.41-4.fc15 has been pushed to the Fedora 15 stable repository. If problems still persist, please make note of it in this bug report. |