Bug 449676 - Turning a CPU offline causes panic
Status: CLOSED ERRATA
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: beta
Hardware: All Linux
Priority: high
Severity: low
Target Milestone: 1.0.1
Assigned To: Peter Zijlstra
Reported: 2008-06-02 20:57 EDT by Arnaldo Carvalho de Melo
Modified: 2014-08-11 01:40 EDT (History)

Doc Type: Bug Fix
Last Closed: 2008-08-26 15:57:14 EDT

Attachments
new oops (6.90 KB, text/plain)
2008-06-03 21:36 EDT, Arnaldo Carvalho de Melo
Proposed fix (4.73 KB, patch)
2008-06-04 00:00 EDT, Gregory Haskins
new oops, with Gregory patch applied (9.45 KB, text/plain)
2008-06-04 13:21 EDT, Arnaldo Carvalho de Melo
Proposed fix (7.65 KB, patch)
2008-06-04 15:38 EDT, Gregory Haskins
Proposed fix (7.95 KB, patch)
2008-06-04 15:43 EDT, Gregory Haskins
sysfs CPU classes, usage example does random CPU off/onlining (2.19 KB, text/plain)
2008-08-07 10:01 EDT, Arnaldo Carvalho de Melo

Description Arnaldo Carvalho de Melo 2008-06-02 20:57:32 EDT
Description of problem:

echo 0 > /sys/devices/system/cpu/cpu1/online

crashes the machine

Version-Release number of selected component (if applicable):

2.6.24.7-62.el5rt

How reproducible:

Reproduced on two Dell PowerEdge machines: one 1900 and one 1950

Steps to Reproduce:
1. echo 0 > /sys/devices/system/cpu/cpu1/online

Actual results:

Machine crashes:

rhelrt-2 login: 
Red Hat Enterprise Linux Server release 5.1 (Tikanga)
Kernel 2.6.24.7-62.el5rt on an x86_64

rhelrt-2 login: Unable to handle kernel NULL pointer dereference at
0000000000000048 RIP: 
 [<ffffffff81033313>] try_to_wake_up+0x18d/0x21e
PGD 768a5067 PUD 76959067 PMD 0 
Oops: 0000 [1] PREEMPT SMP 
CPU 2 
Modules linked in: ipv6 autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc
dm_multipath video output sbs sbshc battery ac parport_pc lp parport sg bnx2
sr_mod cdrom button ata_generic pata_acpi serio_raw i5000_edac iTCO_wdt shpchp
iTCO_vendor_support edac_core pcspkr dm_snapshot dm_zero dm_mirror dm_mod
ata_piix libata megaraid_sas sd_mod scsi_mod ext3 jbd mbcache ehci_hcd ohci_hcd
uhci_hcd
Pid: 3522, comm: kstopmachine Not tainted 2.6.24.7-62.el5rt #1
RIP: 0010:[<ffffffff81033313>]  [<ffffffff81033313>] try_to_wake_up+0x18d/0x21e
RSP: 0018:ffff81007ec7fea8  EFLAGS: 00010006
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff81007ec7fe98
RDX: 0101010101010101 RSI: ffff81007ca100c0 RDI: ffff810009037900
RBP: ffff81007ec7fef8 R08: ffff81007e1e1ef0 R09: 0000000000000000
R10: ffff810087b1f000 R11: ffff81007ec7fe58 R12: ffff81007ca100c0
R13: ffff810009037900 R14: 0000000000000002 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff81007ec39dc0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000048 CR3: 000000007681e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kstopmachine (pid: 3522, threadinfo ffff81007c4ce000, task ffff81007ed2f500)
Stack:  0000000000000002 0000000000000046 000000000000001f ffff81000901e780
 0000000000000046 ffff81007e1e1e98 ffff81000901e698 ffff81000901e640
 7fffffffffffffff 0000000000000001 ffff81007ec7ff08 ffffffff81033424
Call Trace:
 <IRQ>  [<ffffffff81033424>] wake_up_process+0x12/0x14
 [<ffffffff810540f2>] hrtimer_wakeup+0x1d/0x21
 [<ffffffff81054daf>] hrtimer_interrupt+0x11a/0x1ab
 [<ffffffff8101fe6f>] smp_local_timer_interrupt+0x5a/0x5e
 [<ffffffff810204d1>] smp_apic_timer_interrupt+0x3a/0x51
 [<ffffffff8100ce46>] apic_timer_interrupt+0x66/0x70
 <EOI>  [<ffffffff810713df>] ? stopmachine+0x9b/0xcc
 [<ffffffff8100d028>] ? child_rip+0xa/0x12
 [<ffffffff8100ada1>] ? default_idle+0x0/0x56
 [<ffffffff81071344>] ? stopmachine+0x0/0xcc
 [<ffffffff8100d01e>] ? child_rip+0x0/0x12


Code: 00 4c 89 ef bb 01 00 00 00 e8 18 cc ff ff ba 01 00 00 00 4c 89 e6 4c 89 ef
e8 4e c1 ff ff 49 8b 85 50 07 00 00 4c 89 e6 4c 89 ef <48> 8b 40 48 ff 50 28 eb
02 31 db 80 3d eb f1 3b 00 00 74 2e 49 
RIP  [<ffffffff81033313>] try_to_wake_up+0x18d/0x21e
 RSP <ffff81007ec7fea8>
CR2: 0000000000000048
---[ end trace 9bf0541e56471d3d ]---
Kernel panic - not syncing: Aiee, killing interrupt handler!
Pid: 3522, comm: kstopmachine Tainted: G      D  2.6.24.7-62.el5rt #1

Call Trace:
 <IRQ>  [<ffffffff8103cae0>] panic+0xaf/0x169
 [<ffffffff81055c1a>] ? blocking_notifier_call_chain+0xf/0x11
 [<ffffffff81040319>] do_exit+0x8d/0x84e
 [<ffffffff8103ce93>] ? wake_up_klogd+0x32/0x34
 [<ffffffff8128be79>] do_page_fault+0x70d/0x7f2
 [<ffffffff810831ec>] ? cpupri_set+0xca/0xdd
 [<ffffffff810831ec>] ? cpupri_set+0xca/0xdd
 [<ffffffff8103175e>] ? update_rt_migration+0x18/0x99
 [<ffffffff810
 <EOI>  [<ffffffff810713df>] ? stopmachine+0x9b/0xcc
 [<ffffffff8100d028>] ? child_rip+0xa/0x12
 [<ffffffff8100ada1>] ? default_idle+0x0/0x56
 [<ffffffff81071344>] ? stopmachine+0x0/0xcc
 [<ffffffff8100d01e>] ? child_rip+0x0/0x12

Unable to handle kernel NULL pointer dereference at 0000000000000048 RIP: 
 [<ffffffff81033313>] try_to_wake_up+0x18d/0x21e
PGD 768a5067 PUD 76959067 PMD 0 
Oops: 0000 [4] PREEMPT SMP 
CPU 2 
Modules linked in: ipv6 autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc
dm_multipath video output sbs sbshc battery ac parport_pc lp parport sg bnx2
sr_mod cdrom button ata_generic pata_acpi serio_raw i5000_edac iTCO_wdt shpchp
iTCO_vendor_support edac_core pcspkr dm_snapshot dm_zero dm_mirror dm_mod
ata_piix libata megaraid_sas sd_mod scsi_mod ext3 jbd mbcache ehci_hcd ohci_hcd
uhci_hcd
Pid: 3522, comm: kstopmachine Tainted: G      D  2.6.24.7-62.el5rt #1
RIP: 0010:[<ffffffff81033313>]  [<ffffffff81033313>] try_to_wake_up+0x18d/0x21e
RSP: 0018:ffff81007ec7fa68  EFLAGS: 00010006
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff81007ec7fa58
RDX: 0101010101010101 RSI: ffff81007d96c0c0 RDI: ffff810009058900
RBP: ffff81007ec7fab8 R08: 0000000000000000 R09: ffff8100090703d4
R10: ffff81007ec7fba8 R11: 0000003000000010 R12: ffff81007d96c0c0
R13: ffff810009058900 R14: 0000000000000001 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff81007ec39dc0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000048 CR3: 000000007681e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kstopmachine (pid: 3522, threadinfo ffff81007c4ce000, task ffff81007ed2f500)
Stack:  0000000000000000 0000000000000000 000000000000001f ffffffff8103d776
 0000000000000046 ffffffff81486100 00000000000008f6 0000000000000002
 ffffffff814861a8 0000000000000000 ffff81007ec7fac8 ffffffff81033424
Call Trace:
 <IRQ>  [<ffffffff8103d776>] ? printk+0x67/0x69
 [<ffffffff81033424>] wake_up_process+0x12/0x14
 [<ffffffff8107af2a>] redirect_hardirq+0x3b/0x48
 [<ffffffff8107c8db>] handle_edge_irq+0xcb/0x158
 [<ffffffff8100ecd1>] do_IRQ+0x106/0x179
 [<ffffffff8100c6f1>] ret_from_intr+0x0/0xa
 [<ffffffff811d8d3d>] ? i8042_panic_blink+0x0/0x144
 [<ffffffff8103cb78>] ? panic+0x147/0x169
 [<ffffffff8103cb03>] ? panic+0xd2/0x169
 [<ffffffff81055c1a>] ? blocking_notifier_call_chain+0xf/0x11
 [<ffffffff81040319>] ? do_exit+0x8d/0x84e
 [<ffffffff8103ce93>] ? wake_up_klogd+0x32/0x34
 [<ffffffff8128be79>] ? do_page_fault+0x70d/0x7f2
 [<ffffffff810831ec>] ? cpupri_set+0xca/0xdd
 [<ffffffff810831ec>] ? cpupri_set+0xca/0xdd
 [<ffffffff8103175e>] ? update_rt_migration+0x18/0x99
 [<ffffffff8103175e>] ? update_rt_migration+0x18/0x99
 [<ffffffff8103697f>] ? enqueue_task_rt+0xb9/0xcf
 [<ffffffff81289f49>] ? error_exit+0x0/0x51
 [<ffffffff81033313>] ? try_to_wake_up+0x18d/0x21e
 [<fffffff


Expected results:

cpu1 goes offline:

[root@rhelrt-2 ~]# cat /sys/devices/system/cpu/cpu1/online 
1
[root@rhelrt-2 ~]# echo 0 > /sys/devices/system/cpu/cpu1/online 
[root@rhelrt-2 ~]# cat /sys/devices/system/cpu/cpu1/online 
0
[root@rhelrt-2 ~]# uname -r
2.6.24.7-62.el5rtvanilla

Additional info:

Clark tried this on an AMD machine without problems, so it may be Intel specific.
Comment 1 Arnaldo Carvalho de Melo 2008-06-03 20:27:03 EDT
The problem is not Intel specific. Thomas Gleixner diagnosed it as a cpupri
problem, and a kernel rpm package with a hotfix patch is being built.
Comment 2 Gregory Haskins 2008-06-03 20:31:37 EDT
After talking to tglx, I suspect the issue is in how the root-domain and cpupri
code interact with one another when building a domain.  The cpus in question
should have been marked INVALID, which would remove them from the cpupri tables.
Instead they were allowed to register with some priority that was != INVALID.

This bug is likely to affect 23-rt, 24-rt, 25-rt, and sched-devel (and of
course, any derivatives of such).  I will hopefully post a solution soon.
Comment 3 Arnaldo Carvalho de Melo 2008-06-03 21:35:14 EDT
It survives a lot longer, but if we keep offlining/onlining a cpu in an infinite
loop we eventually OOPS, with a different backtrace this time. Will attach.
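
For reference, a minimal sketch of this kind of offline/online loop. It is not
the exact loop used for the test above. CPU_SYSFS_ROOT and the function names
set_cpu_online and toggle_cpu are hypothetical, introduced only for
illustration; the sysfs path is the one from the description. The loop is
bounded by a count rather than infinite so that it can be stopped cleanly.

```shell
#!/bin/sh
# Hypothetical stress helper: repeatedly offline and online one CPU via sysfs.
# CPU_SYSFS_ROOT is an assumption added so the functions can be pointed at a
# scratch directory; by default it is the standard sysfs CPU directory.
CPU_SYSFS_ROOT="${CPU_SYSFS_ROOT:-/sys/devices/system/cpu}"

# set_cpu_online CPU STATE: write STATE (0 or 1) to the CPU's "online" file.
set_cpu_online() {
    ctl="$CPU_SYSFS_ROOT/cpu$1/online"
    # Refuse if the control file is missing or not writable (not root, or a
    # CPU that cannot be hotplugged) instead of creating a stray file.
    [ -w "$ctl" ] || { echo "cannot control cpu$1" >&2; return 1; }
    echo "$2" > "$ctl"
}

# toggle_cpu CPU COUNT: offline then online CPU, COUNT times.
toggle_cpu() {
    i=0
    while [ "$i" -lt "$2" ]; do
        set_cpu_online "$1" 0 || return 1
        set_cpu_online "$1" 1 || return 1
        i=$((i + 1))
    done
}
```

Run as root after sourcing the file, for example `toggle_cpu 1 1000`. On an
affected kernel this is expected to crash the machine, so use a test box.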
Comment 4 Arnaldo Carvalho de Melo 2008-06-03 21:36:41 EDT
Created attachment 308312
new oops
Comment 5 Gregory Haskins 2008-06-04 00:00:09 EDT
Created attachment 308318
Proposed fix
Comment 6 Arnaldo Carvalho de Melo 2008-06-04 13:20:44 EDT
With Gregory's patch applied it now takes a lot longer, but we eventually OOPS.
Comment 7 Arnaldo Carvalho de Melo 2008-06-04 13:21:44 EDT
Created attachment 308373
new oops, with Gregory patch applied
Comment 8 Gregory Haskins 2008-06-04 15:38:54 EDT
Created attachment 308386
Proposed fix

I have incorporated Peter Zijlstra's feedback, and also fixed some additional
holes that I found would still allow the cpupri table to get updated.  Please
retest!
Comment 9 Gregory Haskins 2008-06-04 15:43:30 EDT
Created attachment 308389
Proposed fix

Hmmm... I refreshed the patch, but Firefox seems to have uploaded the old
one. Let's try again.
Comment 10 Arnaldo Carvalho de Melo 2008-08-07 10:01:48 EDT
Created attachment 313693
sysfs CPU classes, usage example does random CPU off/onlining
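
The attachment itself is not reproduced here. As a rough sketch of what random
CPU off/onlining over sysfs can look like (not the attached code), the
following flips the state of one randomly chosen hotpluggable CPU per call.
CPU_SYSFS_ROOT, list_hotplug_cpus and flip_random_cpu are hypothetical names
used only for this illustration, and shuf from GNU coreutils is assumed to be
installed.

```shell
#!/bin/sh
# Hypothetical helper for random CPU off/onlining; not the attached code.
# CPU_SYSFS_ROOT is an assumption so the functions can run on a scratch tree.
CPU_SYSFS_ROOT="${CPU_SYSFS_ROOT:-/sys/devices/system/cpu}"

# list_hotplug_cpus: print the number of every CPU that has an "online" file.
# CPUs without that file (often cpu0) cannot be hotplugged and are skipped.
list_hotplug_cpus() {
    for ctl in "$CPU_SYSFS_ROOT"/cpu[0-9]*/online; do
        [ -e "$ctl" ] || continue      # glob did not match anything
        cpu=${ctl%/online}             # strip the trailing "/online"
        echo "${cpu##*/cpu}"           # keep only the CPU number
    done
}

# flip_random_cpu: pick one hotpluggable CPU at random, invert its online
# state, and print the chosen CPU number.
flip_random_cpu() {
    cpu=$(list_hotplug_cpus | shuf -n 1)
    [ -n "$cpu" ] || return 1          # nothing to flip
    ctl="$CPU_SYSFS_ROOT/cpu$cpu/online"
    if [ "$(cat "$ctl")" = "1" ]; then
        echo 0 > "$ctl"
    else
        echo 1 > "$ctl"
    fi
    echo "$cpu"
}
```

Calling flip_random_cpu in a loop as root exercises both the offline and the
online paths across all hotpluggable CPUs.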
Comment 14 errata-xmlrpc 2008-08-26 15:57:14 EDT
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0585.html
