| Summary: | irqbalance crashes in debug mode | | |
|---|---|---|---|
| Product: | [Fedora] Fedora | Reporter: | Stan King <stanley.king> |
| Component: | irqbalance | Assignee: | Neil Horman <nhorman> |
| Status: | CLOSED WONTFIX | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 16 | CC: | anton, nhorman |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2013-02-14 03:02:28 UTC | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
|
Description
Stan King
2011-10-21 22:04:51 UTC
This is fixed with upstream commit http://code.google.com/p/irqbalance/source/detail?r=effc540808e630d1fad423d653c43737e99cc1b6

http://koji.fedoraproject.org/koji/taskinfo?taskID=3450720
Test build for you to validate, please.

I get the same behavior as before with this test build. Here is the info for the irqbalance packages I've loaded:
irqbalance-1.0-3.test1.fc16.i686
irqbalance-debuginfo-1.0-3.test1.fc16.i686

The abrt daemon is not picking up this failure, so if there's a way that I can integrate the fault with the debuginfo in a way that's useful, let me know. Alternatively, I may try to compile it and find where it's bombing. My hunch is that the sysfs details have changed in additional ways.

That's odd. Can you run irqbalance --debug on your system under gdb and attach the backtrace here, please?

Accepted, I'll backport shortly.

Neil, I think this is the backtrace you were asking for:
(gdb) run --debug
Starting program: /usr/sbin/irqbalance --debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".
This machine seems not NUMA capable.
Could not find numa node for node id 0
Could not find numa node for node id 0
Program received signal SIGSEGV, Segmentation fault.
dump_package (d=0x805b780, data=0xbfffbefc) at cputree.c:291
291 printf("Package %i: numa_node is %d cpu mask is %s (load %lu)\n", d->number, package_numa_node(d)->number, buffer, (unsigned long)d->load);
Missing separate debuginfos, use: debuginfo-install glib2-2.30.1-1.fc16.i686 glibc-2.14.90-13.i686 libcap-ng-0.6.6-1.fc16.i686 libgcc-4.6.1-10.fc16.i686 numactl-2.0.7-2.fc16.i686
(gdb) bt
#0 dump_package (d=0x805b780, data=0xbfffbefc) at cputree.c:291
#1 0x0804ad33 in for_each_object (data=0xbfffbefc,
cb=0x804ac30 <dump_package>, list=<optimized out>) at irqbalance.h:118
#2 dump_tree () at cputree.c:301
#3 0x0804b595 in parse_cpu_tree () at cputree.c:353
#4 0x0804917a in build_object_tree () at irqbalance.c:138
#5 main (argc=2, argv=0xbffff674) at irqbalance.c:196
(gdb)
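The backtrace points at the expression `package_numa_node(d)->number` on cputree.c:291. Together with the preceding "Could not find numa node for node id 0" messages, one plausible reading is that the lookup yields a null pointer on a non-NUMA machine and the debug printf dereferences it. The upstream fix is not reproduced in this thread, so the following is only a minimal sketch of a null-safe variant. The struct layout and the helper names `numa_node_number` and `dump_package_safe` are assumptions made for illustration, not irqbalance's actual code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for irqbalance's topo_obj. */
struct topo_obj {
    int number;
    unsigned long load;
    struct topo_obj *numa_node; /* NULL when no NUMA node was found */
};

/* Return the NUMA node number, or -1 when the object has no node. */
int numa_node_number(const struct topo_obj *d)
{
    assert(d != NULL);
    return d->numa_node != NULL ? d->numa_node->number : -1;
}

/* Debug dump that never dereferences a missing NUMA node. */
void dump_package_safe(const struct topo_obj *d, const char *mask)
{
    assert(d != NULL && mask != NULL);
    printf("Package %i: numa_node is %d cpu mask is %s (load %lu)\n",
           d->number, numa_node_number(d), mask, d->load);
}
```

Reporting -1 for a missing node matches the `numa_node is -1` lines that the later test build prints on this machine.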
http://koji.fedoraproject.org/koji/taskinfo?taskID=3456037
There you go, thank you, that clarifies the issue. It's the same problem, just in a different location. The build above should fix it. Please let me know and I'll commit it here and to rawhide.

Neil, this version no longer segfaults when debug mode is requested.
However, I have two concerns, relative to debug output in Fedora 15:
* the new version no longer rescans for CPU topology, although both systems (F15 and F16) report 2 hotplug CPUs in their dmesg output. (both systems are Pentium 4 with HT from 2003-2005)
* the new version reports a "load" value which seems to increase without bound. Although it seems to come from a uint64_t in structure topo_obj, I have a hunch that this behavior was not intended, if the intent was to survey recent interrupt load. Here is a sample of the output after a few minutes of running. The load numbers which continually increase appear below as 362361, 199000, and 173000. The loads of 6000 and 5000 change both up and down from time to time.
-----------------------------------------------------------------------------
Package 0: numa_node is -1 cpu mask is 00000003 (load 362361)
Cache domain 0: numa_node is -1 cpu mask is 00000001 (load 199000)
CPU number 0 numa_node is -1 (load 6000)
Cache domain 1: numa_node is -1 cpu mask is 00000003 (load 173000)
CPU number 1 numa_node is -1 (load 5000)
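The pattern in this sample (package and cache-domain loads climbing while the per-CPU loads move up and down) is what one would expect if each CPU stores only the latest interval's delta while its parents keep a running sum, which is how the maintainer describes the behavior later in this thread. The sketch below illustrates that difference under stated assumptions: `struct load_obj`, `propagate_load`, and `reset_load` are hypothetical names, not irqbalance's real data structures or functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a node in the CPU/cache/package tree. */
struct load_obj {
    uint64_t load;
    struct load_obj *parent; /* NULL at the top of the tree */
};

/* Store the interval delta on the CPU and add it to every ancestor.
 * Without a reset between intervals, ancestors only ever grow. */
void propagate_load(struct load_obj *cpu, uint64_t delta)
{
    assert(cpu != NULL);
    cpu->load = delta;
    for (struct load_obj *p = cpu->parent; p != NULL; p = p->parent)
        p->load += delta;
}

/* Clearing an ancestor before each interval turns the running sum
 * into a per-interval figure comparable to the CPU deltas. */
void reset_load(struct load_obj *obj)
{
    assert(obj != NULL);
    obj->load = 0;
}
```

Calling `reset_load` on each cache domain and package before the per-CPU deltas are propagated would be one way to keep the higher-level numbers tied to recent activity; whether upstream chose that approach is not stated here.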
For what it's worth, here's the initial blast of output:
This machine seems not NUMA capable.
Package 0: numa_node is -1 cpu mask is 00000003 (load 0)
Cache domain 0: numa_node is -1 cpu mask is 00000001 (load 0)
CPU number 0 numa_node is -1 (load 0)
Cache domain 1: numa_node is -1 cpu mask is 00000003 (load 0)
CPU number 1 numa_node is -1 (load 0)
Adding IRQ 16 to database
Adding IRQ 19 to database
Adding IRQ 18 to database
Adding IRQ 23 to database
DROPPING DUPLICATE ENTRY FOR IRQ 18 on path /sys/bus/pci/devices/0000:00:1f.1
Adding IRQ 17 to database
DROPPING DUPLICATE ENTRY FOR IRQ 17 on path /sys/bus/pci/devices/0000:00:1f.5
DROPPING DUPLICATE ENTRY FOR IRQ 16 on path /sys/bus/pci/devices/0000:01:00.0
Adding IRQ 22 to database
Adding IRQ 20 to database
DROPPING DUPLICATE ENTRY FOR IRQ 17 on path /sys/bus/pci/devices/0000:02:0c.0
Adding IRQ 21 to database
DROPPING DUPLICATE ENTRY FOR IRQ 22 on path /sys/bus/pci/devices/0000:02:0c.2
NUMA NODE NUMBER: -1
LOCAL CPU MASK: ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff
Sorry about the long line in the above text - I had assumed it would wrap. I know better now.
Just for comparison, here is a sample of output from irqbalance on Fedora 15, on another similar Pentium 4 with hyperthreading, after running for more than an hour.
It mentions interrupt numbers and their classification, whereas the Fedora 16 irqbalance's stable debug output does not talk about interrupts.
-----------------------------------------------------------------------------
IRQ delta is 9
IRQ delta is 9, switching to power mode
Rescanning cpu topology
Package 0: cpu mask is 00000001 (workload 0)
Cache domain 0: cpu mask is 00000001 (workload 0)
CPU number 0 (workload 0)
CPU number 0 (workload 0)
Package 0: cpu mask is 00000003 (workload 0)
Cache domain 0: cpu mask is 00000003 (workload 0)
CPU number 0 (workload 0)
CPU number 1 (workload 0)
Package 0: cpu mask is 00000001 (workload 4)
Cache domain 0: cpu mask is 00000001 (workload 3)
CPU number 0 (workload 2)
Interrupt 16 (ethernet/1)
CPU number 0 (workload 0)
Interrupt 23 (legacy/0)
Interrupt 14 (other/0)
Package 0: cpu mask is 00000003 (workload 3)
Cache domain 0: cpu mask is 00000003 (workload 3)
CPU number 0 (workload 0)
CPU number 1 (workload 0)
Interrupt 20 (storage/0)
Interrupt 18 (legacy/0)
Interrupt 21 (legacy/0)
Ok, I'll commit the segfault fixes.

As for the CPU topology, it does rescan; it just needs to see that new CPUs are present. It does this by counting the number of CPUs it sees in /proc/interrupts. If that doesn't change, it assumes that it has the right number of CPUs in its topology. It's possible that something has gotten missed there, but unless the number of CPUs in /proc/interrupts has changed, it's doing the right thing.

As for the load, you're right. The CPU loads are a computed delta taken from /proc/stat, but the higher-level objects (caches, packages, and nodes) are all accumulators. That's not really a problem unless load becomes really unbalanced (since an accumulator maintains a permanent history of sorts), but I can fix that upstream. Thanks!

irqbalance-1.0-3.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/irqbalance-1.0-3.fc16

Package irqbalance-1.0-3.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing irqbalance-1.0-3.fc16'
as soon as you are able to. Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2011-14921
then log in and leave karma (feedback).

This message is a reminder that Fedora 16 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 16. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 16 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to click on "Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.

The process we are following is described here: http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed. |
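The rescan check described in the maintainer's reply (count the CPUs listed in /proc/interrupts and rebuild the topology only when that count changes) can be sketched roughly as follows. This is an illustration under stated assumptions, not irqbalance's implementation: the function names are hypothetical, and the sketch assumes the first line of /proc/interrupts is a header with one `CPUn` label per online CPU.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Count the "CPUn" column labels in a /proc/interrupts header line. */
int count_cpu_columns(const char *header)
{
    int count = 0;
    assert(header != NULL);
    for (const char *p = strstr(header, "CPU"); p != NULL;
         p = strstr(p + 3, "CPU"))
        count++;
    return count;
}

/* Read the header line of a /proc/interrupts-style file.
 * Returns the CPU count, or -1 if the file cannot be read.
 * The fixed buffer is an assumption; very large machines may
 * need a longer line. */
int read_cpu_count(const char *path)
{
    char line[4096];
    int n = -1;
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    if (fgets(line, sizeof line, f) != NULL)
        n = count_cpu_columns(line);
    fclose(f);
    return n;
}

/* Rebuild only when a valid count differs from the previous one. */
int topology_needs_rebuild(int previous_count, int current_count)
{
    return current_count >= 0 && current_count != previous_count;
}
```

A caller would keep the last value returned by `read_cpu_count("/proc/interrupts")` and pass it with the new value to `topology_needs_rebuild` on each pass. On a machine where the count never changes, as on the reporter's Pentium 4 systems, no "Rescanning cpu topology" message would be expected under this scheme.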