Description of problem:
echo 0 > /sys/devices/system/cpu/cpu1/online crashes the machine

Version-Release number of selected component (if applicable):
2.6.24.7-62.el5rt

How reproducible:
Reproduced on two Dell PowerEdge machines: one 1900 and one 1950

Steps to Reproduce:
1. echo 0 > /sys/devices/system/cpu/cpu1/online

Actual results:
Machine crashes:

rhelrt-2 login:
Red Hat Enterprise Linux Server release 5.1 (Tikanga)
Kernel 2.6.24.7-62.el5rt on an x86_64

rhelrt-2 login: Unable to handle kernel NULL pointer dereference at 0000000000000048 RIP:
 [<ffffffff81033313>] try_to_wake_up+0x18d/0x21e
PGD 768a5067 PUD 76959067 PMD 0
Oops: 0000 [1] PREEMPT SMP
CPU 2
Modules linked in: ipv6 autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc dm_multipath video output sbs sbshc battery ac parport_pc lp parport sg bnx2 sr_mod cdrom button ata_generic pata_acpi serio_raw i5000_edac iTCO_wdt shpchp iTCO_vendor_support edac_core pcspkr dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata megaraid_sas sd_mod scsi_mod ext3 jbd mbcache ehci_hcd ohci_hcd uhci_hcd
Pid: 3522, comm: kstopmachine Not tainted 2.6.24.7-62.el5rt #1
RIP: 0010:[<ffffffff81033313>] [<ffffffff81033313>] try_to_wake_up+0x18d/0x21e
RSP: 0018:ffff81007ec7fea8 EFLAGS: 00010006
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff81007ec7fe98
RDX: 0101010101010101 RSI: ffff81007ca100c0 RDI: ffff810009037900
RBP: ffff81007ec7fef8 R08: ffff81007e1e1ef0 R09: 0000000000000000
R10: ffff810087b1f000 R11: ffff81007ec7fe58 R12: ffff81007ca100c0
R13: ffff810009037900 R14: 0000000000000002 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff81007ec39dc0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000048 CR3: 000000007681e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kstopmachine (pid: 3522, threadinfo ffff81007c4ce000, task ffff81007ed2f500)
Stack: 0000000000000002 0000000000000046 000000000000001f ffff81000901e780
 0000000000000046 ffff81007e1e1e98 ffff81000901e698 ffff81000901e640
 7fffffffffffffff 0000000000000001 ffff81007ec7ff08 ffffffff81033424
Call Trace:
 <IRQ> [<ffffffff81033424>] wake_up_process+0x12/0x14
 [<ffffffff810540f2>] hrtimer_wakeup+0x1d/0x21
 [<ffffffff81054daf>] hrtimer_interrupt+0x11a/0x1ab
 [<ffffffff8101fe6f>] smp_local_timer_interrupt+0x5a/0x5e
 [<ffffffff810204d1>] smp_apic_timer_interrupt+0x3a/0x51
 [<ffffffff8100ce46>] apic_timer_interrupt+0x66/0x70
 <EOI> [<ffffffff810713df>] ? stopmachine+0x9b/0xcc
 [<ffffffff8100d028>] ? child_rip+0xa/0x12
 [<ffffffff8100ada1>] ? default_idle+0x0/0x56
 [<ffffffff81071344>] ? stopmachine+0x0/0xcc
 [<ffffffff8100d01e>] ? child_rip+0x0/0x12

Code: 00 4c 89 ef bb 01 00 00 00 e8 18 cc ff ff ba 01 00 00 00 4c 89 e6 4c 89 ef e8 4e c1 ff ff 49 8b 85 50 07 00 00 4c 89 e6 4c 89 ef <48> 8b 40 48 ff 50 28 eb 02 31 db 80 3d eb f1 3b 00 00 74 2e 49
RIP [<ffffffff81033313>] try_to_wake_up+0x18d/0x21e
 RSP <ffff81007ec7fea8>
CR2: 0000000000000048
---[ end trace 9bf0541e56471d3d ]---
Kernel panic - not syncing: Aiee, killing interrupt handler!
Pid: 3522, comm: kstopmachine Tainted: G D 2.6.24.7-62.el5rt #1
Call Trace:
 <IRQ> [<ffffffff8103cae0>] panic+0xaf/0x169
 [<ffffffff81055c1a>] ? blocking_notifier_call_chain+0xf/0x11
 [<ffffffff81040319>] do_exit+0x8d/0x84e
 [<ffffffff8103ce93>] ? wake_up_klogd+0x32/0x34
 [<ffffffff8128be79>] do_page_fault+0x70d/0x7f2
 [<ffffffff810831ec>] ? cpupri_set+0xca/0xdd
 [<ffffffff810831ec>] ? cpupri_set+0xca/0xdd
 [<ffffffff8103175e>] ? update_rt_migration+0x18/0x99
 [<ffffffff810
 <EOI> [<ffffffff810713df>] ? stopmachine+0x9b/0xcc
 [<ffffffff8100d028>] ? child_rip+0xa/0x12
 [<ffffffff8100ada1>] ? default_idle+0x0/0x56
 [<ffffffff81071344>] ? stopmachine+0x0/0xcc
 [<ffffffff8100d01e>] ? child_rip+0x0/0x12

Unable to handle kernel NULL pointer dereference at 0000000000000048 RIP:
 [<ffffffff81033313>] try_to_wake_up+0x18d/0x21e
PGD 768a5067 PUD 76959067 PMD 0
Oops: 0000 [4] PREEMPT SMP
CPU 2
Modules linked in: ipv6 autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc dm_multipath video output sbs sbshc battery ac parport_pc lp parport sg bnx2 sr_mod cdrom button ata_generic pata_acpi serio_raw i5000_edac iTCO_wdt shpchp iTCO_vendor_support edac_core pcspkr dm_snapshot dm_zero dm_mirror dm_mod ata_piix libata megaraid_sas sd_mod scsi_mod ext3 jbd mbcache ehci_hcd ohci_hcd uhci_hcd
Pid: 3522, comm: kstopmachine Tainted: G D 2.6.24.7-62.el5rt #1
RIP: 0010:[<ffffffff81033313>] [<ffffffff81033313>] try_to_wake_up+0x18d/0x21e
RSP: 0018:ffff81007ec7fa68 EFLAGS: 00010006
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff81007ec7fa58
RDX: 0101010101010101 RSI: ffff81007d96c0c0 RDI: ffff810009058900
RBP: ffff81007ec7fab8 R08: 0000000000000000 R09: ffff8100090703d4
R10: ffff81007ec7fba8 R11: 0000003000000010 R12: ffff81007d96c0c0
R13: ffff810009058900 R14: 0000000000000001 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff81007ec39dc0(0000) knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000048 CR3: 000000007681e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kstopmachine (pid: 3522, threadinfo ffff81007c4ce000, task ffff81007ed2f500)
Stack: 0000000000000000 0000000000000000 000000000000001f ffffffff8103d776
 0000000000000046 ffffffff81486100 00000000000008f6 0000000000000002
 ffffffff814861a8 0000000000000000 ffff81007ec7fac8 ffffffff81033424
Call Trace:
 <IRQ> [<ffffffff8103d776>] ? printk+0x67/0x69
 [<ffffffff81033424>] wake_up_process+0x12/0x14
 [<ffffffff8107af2a>] redirect_hardirq+0x3b/0x48
 [<ffffffff8107c8db>] handle_edge_irq+0xcb/0x158
 [<ffffffff8100ecd1>] do_IRQ+0x106/0x179
 [<ffffffff8100c6f1>] ret_from_intr+0x0/0xa
 [<ffffffff811d8d3d>] ? i8042_panic_blink+0x0/0x144
 [<ffffffff8103cb78>] ? panic+0x147/0x169
 [<ffffffff8103cb03>] ? panic+0xd2/0x169
 [<ffffffff81055c1a>] ? blocking_notifier_call_chain+0xf/0x11
 [<ffffffff81040319>] ? do_exit+0x8d/0x84e
 [<ffffffff8103ce93>] ? wake_up_klogd+0x32/0x34
 [<ffffffff8128be79>] ? do_page_fault+0x70d/0x7f2
 [<ffffffff810831ec>] ? cpupri_set+0xca/0xdd
 [<ffffffff810831ec>] ? cpupri_set+0xca/0xdd
 [<ffffffff8103175e>] ? update_rt_migration+0x18/0x99
 [<ffffffff8103175e>] ? update_rt_migration+0x18/0x99
 [<ffffffff8103697f>] ? enqueue_task_rt+0xb9/0xcf
 [<ffffffff81289f49>] ? error_exit+0x0/0x51
 [<ffffffff81033313>] ? try_to_wake_up+0x18d/0x21e
 [<fffffff

Expected results:
cpu1 goes offline:

[root@rhelrt-2 ~]# cat /sys/devices/system/cpu/cpu1/online
1
[root@rhelrt-2 ~]# echo 0 > /sys/devices/system/cpu/cpu1/online
[root@rhelrt-2 ~]# cat /sys/devices/system/cpu/cpu1/online
0
[root@rhelrt-2 ~]# uname -r
2.6.24.7-62.el5rtvanilla

Additional info:
Clark tried this on an AMD machine without problems, so it may be Intel-specific.
The problem is not Intel-specific. Thomas Gleixner diagnosed it as a cpupri problem, and a kernel RPM package with a hotfix patch is being built.
After talking to tglx, I suspect the issue is in how the root-domain and cpupri code interact with one another when building a domain. The cpus in question should have been marked INVALID which would remove them from the cpupri tables. Instead they were allowed to register with some priority that was != INVALID. This bug is likely to affect 23-rt, 24-rt, 25-rt, and sched-devel (and of course, any derivatives of such). I will hopefully post a solution soon.
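To illustrate the suspected failure mode, here is a toy Python model of a cpupri-style table. It is not the kernel code: the class `ToyCpupri`, its methods, and the priority numbering (a lower number means a lower priority) are assumptions made up for this sketch. The only part taken from the analysis above is the idea that setting a CPU to INVALID removes it from the tables.

```python
CPUPRI_INVALID = -1  # assumption for this sketch: "not tracked in any priority vector"


class ToyCpupri:
    """Toy model: maps each priority level to the set of CPUs at that level."""

    def __init__(self, nr_cpus):
        # Every CPU starts out untracked.
        self.cpu_to_pri = [CPUPRI_INVALID] * nr_cpus
        self.pri_to_cpus = {}

    def set(self, cpu, newpri):
        """Move `cpu` to `newpri`; INVALID removes it from the table entirely."""
        oldpri = self.cpu_to_pri[cpu]
        if oldpri == newpri:
            return
        if oldpri != CPUPRI_INVALID:
            self.pri_to_cpus[oldpri].discard(cpu)
        if newpri != CPUPRI_INVALID:
            self.pri_to_cpus.setdefault(newpri, set()).add(cpu)
        self.cpu_to_pri[cpu] = newpri

    def find(self, below_pri):
        """Return the CPUs at the lowest populated priority below `below_pri`."""
        for pri in sorted(self.pri_to_cpus):
            if pri >= below_pri:
                break
            if self.pri_to_cpus[pri]:
                return set(self.pri_to_cpus[pri])
        return set()
```

In this model, a CPU that goes offline while still registered at some priority other than INVALID keeps being returned by `find()`. That would be consistent with the suspicion that wakeups were being directed at a CPU that should no longer have been in the tables.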
Survives a lot longer, but if we keep offlining/onlining a CPU in an infinite loop we eventually OOPS. The backtrace is different this time; will attach.
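For anyone who wants to reproduce this, the offline/online loop can be sketched roughly as follows. This is a minimal Python sketch, not the script used for the tests in this report. The function names and the `sysfs_root` parameter are hypothetical, added so the loop can be tried against a scratch directory before pointing it at the real `/sys/devices/system/cpu`. It is bounded rather than infinite.

```python
import os
import time

DEFAULT_SYSFS_ROOT = "/sys/devices/system/cpu"


def cpu_online_path(cpu, sysfs_root=DEFAULT_SYSFS_ROOT):
    """Path of the 'online' control file for the given CPU number."""
    return os.path.join(sysfs_root, f"cpu{cpu}", "online")


def set_cpu_online(cpu, online, sysfs_root=DEFAULT_SYSFS_ROOT):
    """Write 1 (online) or 0 (offline), like `echo 0 > .../cpuN/online`."""
    with open(cpu_online_path(cpu, sysfs_root), "w") as f:
        f.write("1" if online else "0")


def get_cpu_online(cpu, sysfs_root=DEFAULT_SYSFS_ROOT):
    """Read back the current state, like `cat .../cpuN/online`."""
    with open(cpu_online_path(cpu, sysfs_root)) as f:
        return f.read().strip() == "1"


def hotplug_stress(cpu, iterations, sysfs_root=DEFAULT_SYSFS_ROOT, delay=0.0):
    """Offline and re-online `cpu` repeatedly; return the completed cycle count."""
    completed = 0
    for _ in range(iterations):
        set_cpu_online(cpu, False, sysfs_root)
        time.sleep(delay)
        set_cpu_online(cpu, True, sysfs_root)
        time.sleep(delay)
        completed += 1
    return completed
```

Calling `hotplug_stress(1, 1000)` as root would toggle cpu1 a thousand times on a real machine. On an affected kernel this is expected to crash the box, so only run it on a test system.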
Created attachment 308312 [details] new oops
Created attachment 308318 [details] Proposed fix
With Gregory's patch applied it now takes a lot longer, but we eventually OOPS.
Created attachment 308373 [details] new oops, with Gregory patch applied
Created attachment 308386 [details] Proposed fix
I have incorporated Peter Zijlstra's feedback, and also fixed some additional holes that I found would still allow the cpupri table to get updated. Please retest!
Created attachment 308389 [details] Proposed fix
Hmmm... I refreshed the patch, but Firefox seems to have uploaded the old one. Let's try again.
Created attachment 313693 [details] sysfs CPU classes, usage example does random CPU off/onlining
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2008-0585.html