Bug 748070 - irqbalance crashes in debug mode
Summary: irqbalance crashes in debug mode
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: irqbalance
Version: 16
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Neil Horman
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-10-21 22:04 UTC by Stan King
Modified: 2013-02-14 03:02 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-02-14 03:02:28 UTC
Type: ---



Description Stan King 2011-10-21 22:04:51 UTC
Description of problem:

Fedora 16's irqbalance version 1.0.2 crashes if given the --debug parameter.  Neil Horman has identified a fix.  I'm entering this separate bug report as he recommended in Bug 746159.

Version-Release number of selected component (if applicable):

irqbalance-1.0.2.fc16.i686

How reproducible:
Consistently.

Steps to Reproduce:
1.  irqbalance --debug
  
Actual results:
This machine seems not NUMA capable.
Could not find numa node for node id 0
Could not find numa node for node id 0
Segmentation fault (core dumped)

Expected results:
Normal debug output from irqbalance, as it remains up and running.

Additional info:

Comment 1 Neil Horman 2011-10-21 23:46:50 UTC
This is fixed with upstream commit http://code.google.com/p/irqbalance/source/detail?r=effc540808e630d1fad423d653c43737e99cc1b6

Comment 2 Neil Horman 2011-10-21 23:54:05 UTC
http://koji.fedoraproject.org/koji/taskinfo?taskID=3450720

Test build for you to validate please.

Comment 3 Stan King 2011-10-22 07:04:26 UTC
I get the same behavior as before with this test build.

Here is the info for the irqbalance packages I've loaded:

irqbalance-1.0-3.test1.fc16.i686
irqbalance-debuginfo-1.0-3.test1.fc16.i686

The abrt daemon is not picking up this failure, so if there's a way that I can integrate the fault with the debuginfo in a way that's useful, let me know.

Alternatively, I may try to compile it, and find where it's bombing.  My hunch is that the sysfs details have changed in additional ways.

Comment 4 Neil Horman 2011-10-22 12:24:39 UTC
That's odd. Can you run irqbalance --debug on your system under gdb and attach the backtrace here, please?

Comment 5 Neil Horman 2011-10-22 12:29:22 UTC
Accepted, I'll backport shortly.

Comment 6 Stan King 2011-10-22 20:44:48 UTC
Neil, I think this is the backtrace you were asking for:

(gdb) run --debug
Starting program: /usr/sbin/irqbalance --debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".
This machine seems not NUMA capable.
Could not find numa node for node id 0
Could not find numa node for node id 0

Program received signal SIGSEGV, Segmentation fault.
dump_package (d=0x805b780, data=0xbfffbefc) at cputree.c:291
291		printf("Package %i:  numa_node is %d cpu mask is %s (load %lu)\n", d->number, package_numa_node(d)->number, buffer, (unsigned long)d->load);
Missing separate debuginfos, use: debuginfo-install glib2-2.30.1-1.fc16.i686 glibc-2.14.90-13.i686 libcap-ng-0.6.6-1.fc16.i686 libgcc-4.6.1-10.fc16.i686 numactl-2.0.7-2.fc16.i686
(gdb) bt
#0  dump_package (d=0x805b780, data=0xbfffbefc) at cputree.c:291
#1  0x0804ad33 in for_each_object (data=0xbfffbefc, 
    cb=0x804ac30 <dump_package>, list=<optimized out>) at irqbalance.h:118
#2  dump_tree () at cputree.c:301
#3  0x0804b595 in parse_cpu_tree () at cputree.c:353
#4  0x0804917a in build_object_tree () at irqbalance.c:138
#5  main (argc=2, argv=0xbffff674) at irqbalance.c:196
(gdb)

Comment 7 Neil Horman 2011-10-24 14:44:09 UTC
http://koji.fedoraproject.org/koji/taskinfo?taskID=3456037

There you go, thank you, that clarifies the issue.  It's the same problem, just in a different location.  The build above should fix it.  Please let me know and I'll commit it here and to rawhide.
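
For readers following the backtrace in Comment 6: the crash at cputree.c:291 appears to come from dereferencing the result of package_numa_node(d) when no NUMA node exists. The sketch below shows one way such a lookup could be guarded. It is not the upstream fix; the struct layout and the helper names numa_node_number and format_package are hypothetical simplifications. The -1 default matches the "numa_node is -1" lines in the later debug output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, simplified stand-in for irqbalance's topology object. */
struct topo_obj {
    int number;
    uint64_t load;
    struct topo_obj *numa_node; /* NULL when no NUMA node was found */
};

/* Return the NUMA node number, or -1 when the object has no node. */
static int numa_node_number(const struct topo_obj *d)
{
    assert(d != NULL);
    return d->numa_node ? d->numa_node->number : -1;
}

/* Format one "Package" line without dereferencing a missing node. */
static int format_package(char *out, size_t len,
                          const struct topo_obj *d, const char *mask)
{
    return snprintf(out, len,
                    "Package %i:  numa_node is %d cpu mask is %s (load %lu)",
                    d->number, numa_node_number(d), mask,
                    (unsigned long)d->load);
}
```

With a package whose numa_node pointer is NULL, format_package writes "numa_node is -1" instead of faulting.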

Comment 8 Stan King 2011-10-24 22:18:50 UTC
Neil, this version no longer segfaults when debug mode is requested.

However, I have two concerns, relative to debug output in Fedora 15:

* the new version no longer rescans for CPU topology, although both systems (F15 and F16) report 2 hotplug CPUs in their dmesg output.  (both systems are Pentium 4 with HT from 2003-2005)

* the new version reports a "load" value which seems to increase without bound.  Although it seems to come from a uint64_t in structure topo_obj, I have a hunch that this behavior was not intended, if the intent was to survey recent interrupt load.  Here is a sample of the output after a few minutes of running.  The load numbers which continually increase appear below as 362361, 199000, and 173000.  The loads of 6000 and 5000 change both up and down from time to time.

-----------------------------------------------------------------------------
Package 0:  numa_node is -1 cpu mask is 00000003 (load 362361)
        Cache domain 0:  numa_node is -1 cpu mask is 00000001  (load 199000) 
                CPU number 0  numa_node is -1 (load 6000)
        Cache domain 1:  numa_node is -1 cpu mask is 00000003  (load 173000) 
                CPU number 1  numa_node is -1 (load 5000)

For what it's worth, here's the initial blast of output:

This machine seems not NUMA capable.
Package 0:  numa_node is -1 cpu mask is 00000003 (load 0)
        Cache domain 0:  numa_node is -1 cpu mask is 00000001  (load 0) 
                CPU number 0  numa_node is -1 (load 0)
        Cache domain 1:  numa_node is -1 cpu mask is 00000003  (load 0) 
                CPU number 1  numa_node is -1 (load 0)
Adding IRQ 16 to database
Adding IRQ 19 to database
Adding IRQ 18 to database
Adding IRQ 23 to database
DROPPING DUPLICATE ENTRY FOR IRQ 18 on path /sys/bus/pci/devices/0000:00:1f.1
Adding IRQ 17 to database
DROPPING DUPLICATE ENTRY FOR IRQ 17 on path /sys/bus/pci/devices/0000:00:1f.5
DROPPING DUPLICATE ENTRY FOR IRQ 16 on path /sys/bus/pci/devices/0000:01:00.0
Adding IRQ 22 to database
Adding IRQ 20 to database
DROPPING DUPLICATE ENTRY FOR IRQ 17 on path /sys/bus/pci/devices/0000:02:0c.0
Adding IRQ 21 to database
DROPPING DUPLICATE ENTRY FOR IRQ 22 on path /sys/bus/pci/devices/0000:02:0c.2
NUMA NODE NUMBER: -1
LOCAL CPU MASK: ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff,ffffffff

Comment 9 Stan King 2011-10-24 23:51:25 UTC
Sorry about the long line in the above text - I had assumed it would wrap.  I know better now.

Just for comparison, here is a sample of output from irqbalance on Fedora 15, on another similar Pentium 4 with hyperthreading, after running for more than an hour.

It mentions interrupt numbers and their classification, whereas the Fedora 16 irqbalance's stable debug output does not talk about interrupts.

-----------------------------------------------------------------------------
IRQ delta is 9 
IRQ delta is 9, switching to power mode 
Rescanning cpu topology 
Package 0:  cpu mask is 00000001 (workload 0)
        Cache domain 0: cpu mask is 00000001  (workload 0) 
                CPU number 0  (workload 0)
                CPU number 0  (workload 0)
Package 0:  cpu mask is 00000003 (workload 0)
        Cache domain 0: cpu mask is 00000003  (workload 0) 
                CPU number 0  (workload 0)
                CPU number 1  (workload 0)
Package 0:  cpu mask is 00000001 (workload 4)
        Cache domain 0: cpu mask is 00000001  (workload 3) 
                CPU number 0  (workload 2)
                  Interrupt 16 (ethernet/1) 
                CPU number 0  (workload 0)
          Interrupt 23 (legacy/0) 
  Interrupt 14 (other/0) 
Package 0:  cpu mask is 00000003 (workload 3)
        Cache domain 0: cpu mask is 00000003  (workload 3) 
                CPU number 0  (workload 0)
                CPU number 1  (workload 0)
          Interrupt 20 (storage/0) 
          Interrupt 18 (legacy/0) 
          Interrupt 21 (legacy/0)

Comment 10 Neil Horman 2011-10-25 12:39:38 UTC
Ok, I'll commit the segfault fixes.

As for the CPU topology, it does rescan, but only once it sees that new CPUs are present.  It checks this by counting the number of CPUs it sees in /proc/interrupts.  If that count doesn't change, it assumes that it has the right number of CPUs in its topology.  It's possible that something has gotten missed there, but unless the number of CPUs in /proc/interrupts has changed, it's doing the right thing.
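
As a rough illustration of the check described above, the sketch below counts the "CPUn" column labels in the first line of /proc/interrupts and compares the result with the previous count. This is an assumption about how such a check could look, not irqbalance's actual code; count_cpu_columns and topology_changed are hypothetical names.

```c
#include <assert.h>
#include <string.h>

/* Count the "CPUn" column labels in the header line of /proc/interrupts. */
static int count_cpu_columns(const char *header)
{
    int count = 0;

    assert(header != NULL);
    /* Each match is at least 3 characters long, so p + 3 stays in bounds. */
    for (const char *p = strstr(header, "CPU"); p != NULL;
         p = strstr(p + 3, "CPU"))
        count++;
    return count;
}

/* A rescan is needed only when the count differs from the last one seen. */
static int topology_changed(int previous_count, const char *header)
{
    return count_cpu_columns(header) != previous_count;
}
```

On a two-CPU machine the header holds CPU0 and CPU1, so the count stays at 2 and no rescan is triggered.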

As for the load, you're right.  The CPU loads are computed deltas taken from /proc/stat, but the higher-level objects (caches, packages, and nodes) are all accumulators.  That's not really a problem unless load becomes really unbalanced (since an accumulator maintains a permanent history of sorts), but I can fix that upstream.
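
To make the accumulator behaviour concrete, here is a minimal sketch under assumed names (struct load_obj, propagate_load, and reset_ancestors are hypothetical and are not taken from irqbalance). Each cycle a CPU's delta is added to its ancestors; unless the ancestors are cleared first, their totals only ever grow. Clearing them before each cycle is one possible remedy, not necessarily the one adopted upstream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified topology object: parents sum their children. */
struct load_obj {
    uint64_t load;
    struct load_obj *parent; /* NULL at the top of the tree */
};

/* Add this CPU's per-cycle delta to every ancestor (cache, package, node). */
static void propagate_load(const struct load_obj *cpu)
{
    assert(cpu != NULL);
    for (struct load_obj *p = cpu->parent; p != NULL; p = p->parent)
        p->load += cpu->load;
}

/* Clear every ancestor so the next propagation starts from zero.
 * Call this for all CPUs before propagating any of them. */
static void reset_ancestors(const struct load_obj *cpu)
{
    assert(cpu != NULL);
    for (struct load_obj *p = cpu->parent; p != NULL; p = p->parent)
        p->load = 0;
}
```

Calling propagate_load twice with a CPU delta of 6000 leaves the package at 12000; calling reset_ancestors before the next cycle brings it back to 6000.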

Thanks!

Comment 11 Fedora Update System 2011-10-25 12:55:00 UTC
irqbalance-1.0-3.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/irqbalance-1.0-3.fc16

Comment 12 Fedora Update System 2011-10-25 21:43:51 UTC
Package irqbalance-1.0-3.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing irqbalance-1.0-3.fc16'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2011-14921
then log in and leave karma (feedback).

Comment 13 Fedora End Of Life 2013-01-17 02:12:43 UTC
This message is a reminder that Fedora 16 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 16. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '16'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 16's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that 
we may not be able to fix it before Fedora 16 is end of life. If you 
would still like to see this bug fixed and are able to reproduce it 
against a later version of Fedora, you are encouraged to click on 
"Clone This Bug" and open it against that version of Fedora.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

The process we are following is described here: 
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 14 Fedora End Of Life 2013-02-14 03:02:31 UTC
Fedora 16 changed to end-of-life (EOL) status on 2013-02-12. Fedora 16 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.

