Bug 746159

Summary: Segfault after update to irqbalance-1.0-1.fc16.i686
Product: [Fedora] Fedora
Component: irqbalance
Version: 16
Hardware: i686
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: unspecified
Reporter: Michael Weidner <micha>
Assignee: Neil Horman <nhorman>
QA Contact: Fedora Extras Quality Assurance <extras-qa>
CC: anton, danielbelton, nhorman, stanley.king, tribby21
Fixed In Version: irqbalance-1.0-2.fc16
Doc Type: Bug Fix
Last Closed: 2011-10-25 03:42:45 UTC
Attachments (description: flags):
* dmesg-output of my machine after a system reboot: none
* /proc/stat output: none
* /proc/interrupts output: none
* /proc/interrupts output: none
* /proc/interrupts output: none

Description Michael Weidner 2011-10-14 06:52:46 UTC
Created attachment 528158 [details]
dmesg-output of my machine after a system reboot

Description of problem:

Cannot run irqbalance after update to irqbalance-1.0-1.fc16.i686 because of a segfault.

Version-Release number of selected component (if applicable):

irqbalance-1.0-1.fc16.i686

How reproducible:

Every time the irqbalance service is started after the update to irqbalance-1.0-1.fc16.i686

Steps to Reproduce:
1. Update irqbalance from irqbalance-0.56-4.fc16.i686 to irqbalance-1.0-1.fc16.i686 with yum
2. Start irqbalance with "systemctl start irqbalance.service"

Actual results:

irqbalance crashes with a segfault.

systemctl status irqbalance.service output:

irqbalance.service - irqbalance daemon
          Loaded: loaded (/lib/systemd/system/irqbalance.service; enabled)
          Active: failed since Fri, 14 Oct 2011 07:17:03 +0200; 1h 27min ago
         Process: 816 ExecStart=/usr/sbin/irqbalance $ONESHOT (code=killed, signal=SEGV)
          CGroup: name=systemd:/system/irqbalance.service

entry in /var/log/messages:

irqbalance[816]: segfault at 4 ip 00e20245 sp bfff8870 error 6 in libc-2.14.90.so[d6d000+1a7000]

Expected results:

irqbalance starts.

Additional info:

I attached a complete dmesg-output of my machine after a system reboot with the segfault present at boot time.
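
The kernel log line above can be decoded a little further. The following is a minimal sketch, assuming the usual x86 Linux message layout (hexadecimal fields, and an error code whose low three bits mean "page present", "write access" and "user mode"); the function name `decode_segfault` is hypothetical and not part of any tool mentioned in this report.

```python
import re

# Layout of the kernel's user-space segfault message as it appears in this
# report. All numeric fields are treated as hexadecimal (an assumption that
# matches the x86 kernel format string).
SEGFAULT_RE = re.compile(
    r"segfault at (?P<addr>[0-9a-f]+) ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) "
    r"error (?P<err>[0-9a-f]+) in (?P<obj>[^\[]+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)


def decode_segfault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "fault_address": int(m.group("addr"), 16),
        "object": m.group("obj"),
        # Offset of the faulting instruction inside the mapped object; this
        # stays the same across machines even when the load address differs.
        "offset_in_object": ip - base,
        "page_present": bool(err & 1),
        "write_access": bool(err & 2),
        "user_mode": bool(err & 4),
    }


if __name__ == "__main__":
    line = ("irqbalance[816]: segfault at 4 ip 00e20245 sp bfff8870 "
            "error 6 in libc-2.14.90.so[d6d000+1a7000]")
    print(decode_segfault(line))
```

For the line in this report, the fault is a user-mode write to an unmapped page at address 4, which is what a field access through a NULL pointer typically looks like. The instruction offset inside libc works out to 0xb3245.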

Comment 1 Daniel Belton 2011-10-14 13:55:11 UTC
I get the same problem on 2 of my F16 systems here after the update to irqbalance.

One is a 32 bit and the other is a 64 bit system. 

32 bit system:

[   25.338430] irqbalance[1163]: segfault at 4 ip 41654245 sp bf9213b0 error 6 in libc-2.14.90.so[415a1000+1a7000]

[root@tower11 ~]# systemctl status irqbalance.service
irqbalance.service - irqbalance daemon
	  Loaded: loaded (/lib/systemd/system/irqbalance.service; enabled)
	  Active: failed since Fri, 14 Oct 2011 08:00:07 -0500; 52min ago
	  CGroup: name=systemd:/system/irqbalance.service


On the 64 bit system, it sometimes fails with the same error, but usually I get a different message and it still loads.

[   13.615122] /usr/sbin/irqbalance[845]: WARNING: MSI interrupts found in /proc/interrupts
[   13.617375] /usr/sbin/irqbalance[845]: But none found in sysfs, you need to update your kernel
[   13.619579] /usr/sbin/irqbalance[845]: Until then, IRQs will be improperly classified

[root@tower20 ~]# systemctl status irqbalance.service
irqbalance.service - irqbalance daemon
	  Loaded: loaded (/lib/systemd/system/irqbalance.service; enabled)
	  Active: inactive (dead) since Fri, 14 Oct 2011 08:13:08 -0500; 18min ago
	 Process: 841 ExecStart=/usr/sbin/irqbalance $ONESHOT (code=exited, status=0/SUCCESS)
	Main PID: 845 (code=exited, status=0/SUCCESS)
	  CGroup: name=systemd:/system/irqbalance.service

Comment 2 Neil Horman 2011-10-14 23:40:15 UTC
That warning will go away once I'm finished with the kernel backport. It's not a big problem; it just means some irqs may get improperly classified as legacy/other interrupts rather than as ethernet interrupts.  Can you please post the output of /proc/interrupts and /proc/stat, and a backtrace of the segfault?

Comment 3 Neil Horman 2011-10-14 23:41:19 UTC
Anton, hope you don't mind me grabbing this bug, but I'd like to quash any bugs that I missed in my testing from the re-write.

Comment 4 Daniel Belton 2011-10-15 01:27:26 UTC
Created attachment 528288 [details]
/proc/stat output

Comment 5 Daniel Belton 2011-10-15 01:28:39 UTC
Created attachment 528289 [details]
/proc/interrupts output

Here are the outputs of /proc/stat and /proc/interrupts you requested.

What do I need to do to get a backtrace?

Comment 6 Daniel Belton 2011-10-15 01:36:40 UTC
Created attachment 528290 [details]
/proc/interrupts output

Comment 7 Daniel Belton 2011-10-15 01:38:00 UTC
Created attachment 528291 [details]
/proc/interrupts output

Sorry about that, I attached the wrong output.

Comment 8 Stan King 2011-10-15 02:02:26 UTC
I got this error, too, with the update to irqbalance-1.0-1.fc16.i686.

I'd like to add that although the abort notifier appeared, upon opening the abort tool, no errors were found to report.  In /var/log/messages, it seems as if some information was collected for abrtd to work with.  I'd be glad to provide more info on that; just let me know what's needed.

Comment 9 Michael Weidner 2011-10-15 05:53:31 UTC
I only have a little time at the moment, so here is the backtrace without debug symbols installed. If you need it with debug symbols, let me know and I will do it later:

(gdb) run
Starting program: /usr/sbin/irqbalance 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".

Program received signal SIGSEGV, Segmentation fault.
0x00343245 in readdir () from /lib/libc.so.6
(gdb) thread apply all bt full

Thread 1 (Thread 0xb7ff2700 (LWP 10286)):
#0  0x00343245 in readdir () from /lib/libc.so.6
No symbol table info available.
#1  0x0804c530 in ?? ()
No symbol table info available.
#2  0x08049175 in ?? ()
No symbol table info available.
#3  0x002a9673 in __libc_start_main () from /lib/libc.so.6
No symbol table info available.
#4  0x08049429 in ?? ()
No symbol table info available.
Backtrace stopped: Not enough registers or memory available to unwind further

# cat /proc/interrupts
            CPU0       CPU1       
   0:     609891    7580816   IO-APIC-edge      timer
   1:          0          2   IO-APIC-edge      i8042
   4:      28751     504818   IO-APIC-edge      serial
   7:          1          0   IO-APIC-edge    
   8:     108545    5284077   IO-APIC-edge      rtc0
   9:          0          0   IO-APIC-fasteoi   acpi
  12:          0          4   IO-APIC-edge      i8042
  16:          1         99   IO-APIC-fasteoi   ohci_hcd:usb3, ohci_hcd:usb4
  17:      90533     251191   IO-APIC-fasteoi   ehci_hcd:usb1
  18:         94        820   IO-APIC-fasteoi   ohci_hcd:usb5, ohci_hcd:usb6, ohci_hcd:usb7, radeon
  19:          0          0   IO-APIC-fasteoi   ehci_hcd:usb2
  20:          0          6   IO-APIC-fasteoi   eth1
  21:     287488    2318287   IO-APIC-fasteoi 
  22:       5598      39095   IO-APIC-fasteoi   ahci, serial
  41:     145015    1378068   PCI-MSI-edge      eth0
 NMI:          3          3   Non-maskable interrupts
 LOC:    5585374    1851918   Local timer interrupts
 SPU:          0          0   Spurious interrupts
 PMI:          3          3   Performance monitoring interrupts
 IWI:          0          0   IRQ work interrupts
 RES:    3487662     871222   Rescheduling interrupts
 CAL:       7837       6360   Function call interrupts
 TLB:      26923      33109   TLB shootdowns
 TRM:          0          0   Thermal event interrupts
 THR:          0          0   Threshold APIC interrupts
 MCE:          0          0   Machine check exceptions
 MCP:        296        296   Machine check polls
 ERR:          1
 MIS:          0

# cat /proc/stat
cpu  121719 20025 124027 17428758 22530 66 2337 0 0 0
cpu0 64754 7509 60800 8714530 11374 4 587 0 0 0
cpu1 56965 12515 63227 8714227 11155 61 1749 0 0 0
intr 30510017 8192356 2 0 0 533895 0 0 1 5394350 0 0 0 4 0 0 0 100 341776 914 0 6 2605775 44695 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1523218 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
ctxt 22137222
btime 1318569360
processes 205082
procs_running 1
procs_blocked 0
softirq 10339580 0 4211962 585546 798082 142852 0 297162 1193431 17239 3093306

Comment 10 Neil Horman 2011-10-15 14:50:25 UTC
Thank you for the backtrace, but it wasn't quite done right.  Since irqbalance forks and daemonizes, gdb tends to get confused with stack traces.  Could you please run that again in gdb, but before running, issue this command:
set args --debug
That will prevent irqbalance from forking and give you a proper stack trace that we can use to root-cause this.  Thanks!

Comment 11 Michael Weidner 2011-10-15 15:15:11 UTC
Here it is:

# gdb /usr/sbin/irqbalance
GNU gdb (GDB) Fedora (7.3.50.20110722-7.fc16)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/irqbalance...(no debugging symbols found)...done.
Missing separate debuginfos, use: debuginfo-install irqbalance-1.0-1.fc16.i686
(gdb) set args --debug
(gdb) run
Starting program: /usr/sbin/irqbalance --debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".
This machine seems not NUMA capable.

Program received signal SIGSEGV, Segmentation fault.
0x00343245 in readdir () from /lib/libc.so.6
(gdb) thread apply all bt full

Thread 1 (Thread 0xb7ff2700 (LWP 6517)):
#0  0x00343245 in readdir () from /lib/libc.so.6
No symbol table info available.
#1  0x0804c530 in ?? ()
No symbol table info available.
#2  0x08049175 in ?? ()
No symbol table info available.
#3  0x002a9673 in __libc_start_main () from /lib/libc.so.6
No symbol table info available.
#4  0x08049429 in ?? ()
No symbol table info available.
Backtrace stopped: Not enough registers or memory available to unwind further
(gdb)

Comment 12 Daniel Belton 2011-10-16 14:02:34 UTC
I hope I did this correctly.

[root@tower11 ~]# gdb irqbalance
GNU gdb (GDB) Fedora (7.3.50.20110722-9.fc16)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/irqbalance...Reading symbols from /usr/lib/debug/usr/sbin/irqbalance.debug...done.
done.
(gdb) set args --debug
(gdb) run
Starting program: /usr/sbin/irqbalance --debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".
This machine seems not NUMA capable.

Program received signal SIGSEGV, Segmentation fault.
0x00212245 in __readdir (dirp=0x0) at ../sysdeps/unix/readdir.c:45
45	  __libc_lock_lock (dirp->lock);
(gdb) bt
#0  0x00212245 in __readdir (dirp=0x0) at ../sysdeps/unix/readdir.c:45
#1  0x0804c530 in build_numa_node_list () at numa.c:88
#2  0x08049175 in build_object_tree () at irqbalance.c:137
#3  main (argc=2, argv=0xbffff504) at irqbalance.c:196
(gdb) thread apply all bt full

Thread 1 (Thread 0xb7fc9700 (LWP 2358)):
#0  0x00212245 in __readdir (dirp=0x0) at ../sysdeps/unix/readdir.c:45
        dp = <optimized out>
        saved_errno = <optimized out>
#1  0x0804c530 in build_numa_node_list () at numa.c:88
        dir = 0x0
        entry = <optimized out>
#2  0x08049175 in build_object_tree () at irqbalance.c:137
No locals.
#3  main (argc=2, argv=0xbffff504) at irqbalance.c:196
No locals.
(gdb) c
Continuing.

Program terminated with signal SIGSEGV, Segmentation fault.
The program no longer exists.
(gdb) quit

Comment 13 Neil Horman 2011-10-17 11:10:15 UTC
OK, it's possible we're looking at two different problems here, but I can't be 100% sure.  In either case I can clearly identify at least one from the backtrace in comment 12.  Could everyone please give the test build here a try:
http://koji.fedoraproject.org/koji/taskinfo?taskID=3436599

If I get good feedback, I'll commit this upstream and to f16 & rawhide asap.  Thanks!

Comment 14 Michael Weidner 2011-10-17 11:19:24 UTC
irqbalance-1.0-1.test1.fc16.i686.rpm works for me.

Comment 15 Neil Horman 2011-10-17 13:39:38 UTC
Grand, I'll commit that shortly. Thank you!

Comment 16 Fedora Update System 2011-10-17 18:48:56 UTC
irqbalance-1.0-2.fc16 has been submitted as an update for Fedora 16.
https://admin.fedoraproject.org/updates/irqbalance-1.0-2.fc16

Comment 17 Fedora Update System 2011-10-18 07:21:55 UTC
Package irqbalance-1.0-2.fc16:
* should fix your issue,
* was pushed to the Fedora 16 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing irqbalance-1.0-2.fc16'
as soon as you are able to.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2011-14501
then log in and leave karma (feedback).

Comment 18 Daniel Belton 2011-10-18 18:32:16 UTC
Sorry it took me a while to get back on this. I got a ton of work dumped on me yesterday.

But I have updated to the 1.0-2 build of irqbalance that has been pushed into updates-testing, and it fixed the issue I was having as well.

Comment 19 Stan King 2011-10-18 20:04:55 UTC
With the 1.0-2 build, my irqbalance's segs are no longer faulting.

On the other hand, if someone can direct me on how to verify that my IRQs are actually balanced, I'd be happy to check that, too.

Comment 20 Neil Horman 2011-10-19 11:04:51 UTC
Look in /proc/irq/<N>, where N is an irq number; it should have a mask indicating which cpu(s) it has affinity for.  You can also run irqbalance in the foreground with --debug, and that will dump out a periodic map of which cpus/caches/packages/nodes each affected irq is balanced to (as well as the load on each object).
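
The mask check described above can be scripted. This is a minimal sketch, assuming the mask lives in the `smp_affinity` file under /proc/irq/<N> and is printed as hexadecimal, with wide masks split into comma-separated 32-bit groups; the helper names are hypothetical.

```python
def cpus_from_affinity_mask(mask_text):
    """Turn a hex CPU mask such as '00000003' into a list of CPU numbers."""
    # Wide masks are printed as comma-separated 32-bit hex groups, most
    # significant group first, so dropping the commas gives one hex number.
    value = int(mask_text.strip().replace(",", ""), 16)
    return [cpu for cpu in range(value.bit_length()) if (value >> cpu) & 1]


def read_irq_affinity(irq, proc_root="/proc"):
    """Read the affinity mask of one irq and return the CPUs it may run on."""
    # proc_root is a parameter only so the function can be pointed at a
    # copy of the tree; on a live system the default is what you want.
    with open(f"{proc_root}/irq/{irq}/smp_affinity") as f:
        return cpus_from_affinity_mask(f.read())


if __name__ == "__main__":
    print(cpus_from_affinity_mask("00000003"))  # both CPUs of a two-CPU box
```

A mask of 00000003 means the irq may be delivered to CPU 0 and CPU 1. A mask with a single bit set means the irq is pinned to one CPU.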

Comment 21 Stan King 2011-10-19 21:37:31 UTC
Neil,

Thanks for the info.  The data in /proc/irq/<N> and /proc/interrupts look reasonable, given my machine's light workload.

However, I have some bad news.  When I run "irqbalance --debug", I get the following:

This machine seems not NUMA capable.
Could not find numa node for node id 0
Could not find numa node for node id 0
Segmentation fault (core dumped)

The command "rpm -q irqbalance" reports that I've got irqbalance-1.0-2.fc16.i686.

Comment 22 Daniel Belton 2011-10-20 07:06:09 UTC
I get the same thing if I try to run irqbalance from the command line, but it starts up and runs fine when loaded at boot. This is with the updated 1.0-2 build of irqbalance.

[root@tower11 /]# gdb irqbalance
GNU gdb (GDB) Fedora (7.3.50.20110722-9.fc16)
Copyright (C) 2011 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "i686-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /usr/sbin/irqbalance...Reading symbols from /usr/lib/debug/usr/sbin/irqbalance.debug...done.
done.
(gdb) set arg --debug
(gdb) run
Starting program: /usr/sbin/irqbalance --debug
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/libthread_db.so.1".
This machine seems not NUMA capable.
Could not find numa node for node id 0
Could not find numa node for node id 0

Program received signal SIGSEGV, Segmentation fault.
dump_package (d=0x805b780, data=0xbfffbd8c) at cputree.c:291
291		printf("Package %i:  numa_node is %d cpu mask is %s (load %lu)\n", d->number, package_numa_node(d)->number, buffer, (unsigned long)d->load);
(gdb) bt
#0  dump_package (d=0x805b780, data=0xbfffbd8c) at cputree.c:291
#1  0x0804ad33 in for_each_object (data=0xbfffbd8c, cb=
    0x804ac30 <dump_package>, list=<optimized out>) at irqbalance.h:117
#2  dump_tree () at cputree.c:301
#3  0x0804b595 in parse_cpu_tree () at cputree.c:353
#4  0x0804917a in build_object_tree () at irqbalance.c:138
#5  main (argc=2, argv=0xbffff504) at irqbalance.c:196
(gdb) thread apply all bt full

Thread 1 (Thread 0xb7fc9700 (LWP 20017)):
#0  dump_package (d=0x805b780, data=0xbfffbd8c) at cputree.c:291
        buffer = 0xbfffbd8c "00000003"
#1  0x0804ad33 in for_each_object (data=0xbfffbd8c, cb=
    0x804ac30 <dump_package>, list=<optimized out>) at irqbalance.h:117
        entry = <optimized out>
        next = 0x0
#2  dump_tree () at cputree.c:301
        buffer = 
    "00000003", '\000' <repeats 1744 times>"\343, \257\303B\364\377\317B\025\222\274B\001\000\000\000\000\340\377\267'", '\000' <repeats 19 times>, "'\000\000\000\300\t\320B\000\340\377\267\300\t\320B\364\220\274B\300\t\320B\000\340\377\267'", '\000' <repeats 19 times>"\364, \377\317B'\000\000\000\377\377\377\377\300\t\320B\376\254\274B'", '\000' <repeats 23 times>"\300, \t\320B\272\260\274B\300\t\320B\000\340\377\267'", '\000' <repeats 19 times>"\364, \377\317B\024\336\004\b\000\000\000\000\300\t\320B\242\234\274B\300\t\320B\377\377\377\377%", '\000' <repeats 23 times>, "\001", '\000' <repeats 11 times>, "\001", '\000' <repeats 11 times>, "\r\301\271B\364\377\317B\300\t\320B\024\336\004\b\300\t\320BD̹B\230\312\377\277\000\000\000\000\001", '\000' <repeats 31 times>, "\001\000\000\000\023\336\004\b", '\000' <repeats 112 times>, " ", '\000' <repeats 15 times>"\377"...
#3  0x0804b595 in parse_cpu_tree () at cputree.c:353
---Type <return> to continue, or q <return> to quit---
        dir = 0x8053300
        entry = 0x0
#4  0x0804917a in build_object_tree () at irqbalance.c:138
No locals.
#5  main (argc=2, argv=0xbffff504) at irqbalance.c:196
No locals.
(gdb)

Comment 23 Daniel Belton 2011-10-20 07:09:47 UTC
My guess is that it is getting the error when run from the command line due to the fact that irqbalance is already running.

[root@tower11 /]# systemctl status irqbalance
Failed to issue method call: Unit name irqbalance is not valid.
[root@tower11 /]# systemctl status irqbalance.service
irqbalance.service - irqbalance daemon
	  Loaded: loaded (/lib/systemd/system/irqbalance.service; enabled)
	  Active: active (running) since Tue, 18 Oct 2011 17:19:11 -0500; 1 day and 8h ago
	Main PID: 1189 (irqbalance)
	  CGroup: name=systemd:/system/irqbalance.service
		  └ 1189 /usr/sbin/irqbalance
[root@tower11 /]#

Comment 24 Neil Horman 2011-10-20 11:01:54 UTC
That seems like a kernel problem, in that the system has no numa configuration available, but devices in the sysfs tree are reporting that they are local to node 0, rather than to no node (-1).  We should probably track that down.  Unfortunately, we still shouldn't segfault in such a case.  If you open up a new bug for this and assign it to me I'll fix it asap.

Comment 25 Stan King 2011-10-20 20:12:11 UTC
Neil, while trying to figure out exactly how to phrase a new bug report, I noticed that on the system with the failure, there is no directory /sys/devices/system/node.  Could that be causing the difficulty?

I looked in that direction by looking at irqbalance 1.0 as retrieved from Google code.  If you could point me to a newer version, I'd appreciate it, as there is something else that doesn't seem quite right, but not related to this failure.  Thanks.

Comment 26 Neil Horman 2011-10-21 10:58:38 UTC
Yes, that's it, indirectly.  The problem that you describe will occur if two things happen in sysfs:
1) /sys/devices/system/node doesn't exist
2) the numa_node file for a device in /sys/bus/pci/devices/.../ has the contents '0' rather than '-1'
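
The two conditions above can be checked on an affected machine with a short script. This is only a sketch: the function name and the `sysfs_root` parameter are hypothetical, and it flags any non-negative node id rather than only '0', which is a slight generalization of condition 2.

```python
import os


def find_inconsistent_numa_devices(sysfs_root="/sys"):
    """List (device, node) pairs that claim a NUMA node on a non-NUMA system."""
    # Condition 1: the node directory is missing. If it exists, the system
    # advertises NUMA nodes and the inconsistency cannot occur in this form.
    node_dir = os.path.join(sysfs_root, "devices", "system", "node")
    if os.path.isdir(node_dir):
        return []

    pci_dir = os.path.join(sysfs_root, "bus", "pci", "devices")
    try:
        names = sorted(os.listdir(pci_dir))
    except FileNotFoundError:
        return []

    suspects = []
    for name in names:
        try:
            with open(os.path.join(pci_dir, name, "numa_node")) as f:
                node = int(f.read().strip())
        except (OSError, ValueError):
            continue  # no numa_node file, or unreadable contents
        # Condition 2: the device reports a real node id instead of -1.
        if node >= 0:
            suspects.append((name, node))
    return suspects


if __name__ == "__main__":
    for device, node in find_inconsistent_numa_devices():
        print(f"{device} reports numa_node {node} but no NUMA nodes exist")
```

An empty result means the machine does not show this particular inconsistency.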

What that tells user space is that numa isn't supported on this system, but there is a device that is local to node 0 (which doesn't exist).  The result is irqbalance looks for a data structure internally representing node 0, but doesn't find it, which lets a NULL pointer get assigned to a variable that never expects to be NULL.  This is strictly speaking a kernel problem, but I need to harden irqbalance against it (I shouldn't crash regardless).  I've fixed it upstream with this:
http://code.google.com/p/irqbalance/source/detail?r=effc540808e630d1fad423d653c43737e99cc1b6#
but you need to open a new bug so I can backport it.
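
The kind of hardening described above can be illustrated as follows. This is not the upstream commit (irqbalance is written in C, and the linked change is not reproduced here); it is a Python sketch of the general idea, with hypothetical names throughout: when a lookup for a node id fails, fall back to a placeholder "unspecified" node instead of handing back a null value that callers will dereference.

```python
from dataclasses import dataclass


@dataclass
class NumaNode:
    number: int


# Placeholder used when a device names a node that was never discovered.
# The value -1 mirrors the "no node" convention mentioned in this report.
UNSPECIFIED_NODE = NumaNode(number=-1)


def get_numa_node(nodes, node_id):
    """Look up node_id in the known nodes, never returning None."""
    node = nodes.get(node_id)
    if node is None:
        return UNSPECIFIED_NODE
    return node


def describe_package(package_number, node_id, nodes):
    """Format a debug line; safe even when node_id is unknown."""
    node = get_numa_node(nodes, node_id)
    return f"Package {package_number}:  numa_node is {node.number}"


if __name__ == "__main__":
    # No nodes were discovered, yet a device claims node 0.
    print(describe_package(0, 0, {}))
```

With the fallback in place, the debug dump prints a node number of -1 for the affected package instead of crashing.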

If you want the latest upstream source, you can get it straight from the git tree, instructions found here:
http://code.google.com/p/irqbalance/source/checkout

Comment 27 Stan King 2011-10-21 22:06:46 UTC
I've created Bug report 748070 as requested.  I did not have the power to assign it to you, so it shows as assigned to Anton Arapov.

Comment 28 Fedora Update System 2011-10-25 03:42:45 UTC
irqbalance-1.0-2.fc16 has been pushed to the Fedora 16 stable repository.  If problems still persist, please make note of it in this bug report.