Bug 449676 - Turning a CPU offline causes panic
Summary: Turning a CPU offline causes panic
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: beta
Hardware: All
OS: Linux
Priority: high
Severity: low
Target Milestone: 1.0.1
Target Release: ---
Assignee: Peter Zijlstra
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2008-06-03 00:57 UTC by Arnaldo Carvalho de Melo
Modified: 2014-08-11 05:40 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-08-26 19:57:14 UTC
Target Upstream Version:
Embargoed:


Attachments
new oops (6.90 KB, text/plain)
2008-06-04 01:36 UTC, Arnaldo Carvalho de Melo
Proposed fix (4.73 KB, patch)
2008-06-04 04:00 UTC, Gregory Haskins
new oops, with Gregory patch applied (9.45 KB, text/plain)
2008-06-04 17:21 UTC, Arnaldo Carvalho de Melo
Proposed fix (7.65 KB, patch)
2008-06-04 19:38 UTC, Gregory Haskins
Proposed fix (7.95 KB, patch)
2008-06-04 19:43 UTC, Gregory Haskins
sysfs CPU classes, usage example does random CPU off/onlining (2.19 KB, text/plain)
2008-08-07 14:01 UTC, Arnaldo Carvalho de Melo


Links
System: Red Hat Product Errata
ID: RHSA-2008:0585
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Important: kernel security and bug fix update
Last Updated: 2008-08-26 19:56:57 UTC

Description Arnaldo Carvalho de Melo 2008-06-03 00:57:32 UTC
Description of problem:

echo 0 > /sys/devices/system/cpu/cpu1/online

crashes the machine

Version-Release number of selected component (if applicable):

2.6.24.7-62.el5rt

How reproducible:

Reproduced on two Dell PowerEdge machines: one 1900 and one 1950

Steps to Reproduce:
1. echo 0 > /sys/devices/system/cpu/cpu1/online

Actual results:

Machine crashes:

rhelrt-2 login: 
Red Hat Enterprise Linux Server release 5.1 (Tikanga)
Kernel 2.6.24.7-62.el5rt on an x86_64

rhelrt-2 login: Unable to handle kernel NULL pointer dereference at
0000000000000048 RIP: 
 [<ffffffff81033313>] try_to_wake_up+0x18d/0x21e
PGD 768a5067 PUD 76959067 PMD 0 
Oops: 0000 [1] PREEMPT SMP 
CPU 2 
Modules linked in: ipv6 autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc
dm_multipath video output sbs sbshc battery ac parport_pc lp parport sg bnx2
sr_mod cdrom button ata_generic pata_acpi serio_raw i5000_edac iTCO_wdt shpchp
iTCO_vendor_support edac_core pcspkr dm_snapshot dm_zero dm_mirror dm_mod
ata_piix libata megaraid_sas sd_mod scsi_mod ext3 jbd mbcache ehci_hcd ohci_hcd
uhci_hcd
Pid: 3522, comm: kstopmachine Not tainted 2.6.24.7-62.el5rt #1
RIP: 0010:[<ffffffff81033313>]  [<ffffffff81033313>] try_to_wake_up+0x18d/0x21e
RSP: 0018:ffff81007ec7fea8  EFLAGS: 00010006
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff81007ec7fe98
RDX: 0101010101010101 RSI: ffff81007ca100c0 RDI: ffff810009037900
RBP: ffff81007ec7fef8 R08: ffff81007e1e1ef0 R09: 0000000000000000
R10: ffff810087b1f000 R11: ffff81007ec7fe58 R12: ffff81007ca100c0
R13: ffff810009037900 R14: 0000000000000002 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff81007ec39dc0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000048 CR3: 000000007681e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kstopmachine (pid: 3522, threadinfo ffff81007c4ce000, task ffff81007ed2f500)
Stack:  0000000000000002 0000000000000046 000000000000001f ffff81000901e780
 0000000000000046 ffff81007e1e1e98 ffff81000901e698 ffff81000901e640
 7fffffffffffffff 0000000000000001 ffff81007ec7ff08 ffffffff81033424
Call Trace:
 <IRQ>  [<ffffffff81033424>] wake_up_process+0x12/0x14
 [<ffffffff810540f2>] hrtimer_wakeup+0x1d/0x21
 [<ffffffff81054daf>] hrtimer_interrupt+0x11a/0x1ab
 [<ffffffff8101fe6f>] smp_local_timer_interrupt+0x5a/0x5e
 [<ffffffff810204d1>] smp_apic_timer_interrupt+0x3a/0x51
 [<ffffffff8100ce46>] apic_timer_interrupt+0x66/0x70
 <EOI>  [<ffffffff810713df>] ? stopmachine+0x9b/0xcc
 [<ffffffff8100d028>] ? child_rip+0xa/0x12
 [<ffffffff8100ada1>] ? default_idle+0x0/0x56
 [<ffffffff81071344>] ? stopmachine+0x0/0xcc
 [<ffffffff8100d01e>] ? child_rip+0x0/0x12


Code: 00 4c 89 ef bb 01 00 00 00 e8 18 cc ff ff ba 01 00 00 00 4c 89 e6 4c 89 ef
e8 4e c1 ff ff 49 8b 85 50 07 00 00 4c 89 e6 4c 89 ef <48> 8b 40 48 ff 50 28 eb
02 31 db 80 3d eb f1 3b 00 00 74 2e 49 
RIP  [<ffffffff81033313>] try_to_wake_up+0x18d/0x21e
 RSP <ffff81007ec7fea8>
CR2: 0000000000000048
---[ end trace 9bf0541e56471d3d ]---
Kernel panic - not syncing: Aiee, killing interrupt handler!
Pid: 3522, comm: kstopmachine Tainted: G      D  2.6.24.7-62.el5rt #1

Call Trace:
 <IRQ>  [<ffffffff8103cae0>] panic+0xaf/0x169
 [<ffffffff81055c1a>] ? blocking_notifier_call_chain+0xf/0x11
 [<ffffffff81040319>] do_exit+0x8d/0x84e
 [<ffffffff8103ce93>] ? wake_up_klogd+0x32/0x34
 [<ffffffff8128be79>] do_page_fault+0x70d/0x7f2
 [<ffffffff810831ec>] ? cpupri_set+0xca/0xdd
 [<ffffffff810831ec>] ? cpupri_set+0xca/0xdd
 [<ffffffff8103175e>] ? update_rt_migration+0x18/0x99
 [<ffffffff810 (console output garbled here)
 <EOI>  [<ffffffff810713df>] ? stopmachine+0x9b/0xcc
 [<ffffffff8100d028>] ? child_rip+0xa/0x12
 [<ffffffff8100ada1>] ? default_idle+0x0/0x56
 [<ffffffff81071344>] ? stopmachine+0x0/0xcc
 [<ffffffff8100d01e>] ? child_rip+0x0/0x12

Unable to handle kernel NULL pointer dereference at 0000000000000048 RIP: 
 [<ffffffff81033313>] try_to_wake_up+0x18d/0x21e
PGD 768a5067 PUD 76959067 PMD 0 
Oops: 0000 [4] PREEMPT SMP 
CPU 2 
Modules linked in: ipv6 autofs4 i2c_dev i2c_core nfs lockd nfs_acl sunrpc
dm_multipath video output sbs sbshc battery ac parport_pc lp parport sg bnx2
sr_mod cdrom button ata_generic pata_acpi serio_raw i5000_edac iTCO_wdt shpchp
iTCO_vendor_support edac_core pcspkr dm_snapshot dm_zero dm_mirror dm_mod
ata_piix libata megaraid_sas sd_mod scsi_mod ext3 jbd mbcache ehci_hcd ohci_hcd
uhci_hcd
Pid: 3522, comm: kstopmachine Tainted: G      D  2.6.24.7-62.el5rt #1
RIP: 0010:[<ffffffff81033313>]  [<ffffffff81033313>] try_to_wake_up+0x18d/0x21e
RSP: 0018:ffff81007ec7fa68  EFLAGS: 00010006
RAX: 0000000000000000 RBX: 0000000000000001 RCX: ffff81007ec7fa58
RDX: 0101010101010101 RSI: ffff81007d96c0c0 RDI: ffff810009058900
RBP: ffff81007ec7fab8 R08: 0000000000000000 R09: ffff8100090703d4
R10: ffff81007ec7fba8 R11: 0000003000000010 R12: ffff81007d96c0c0
R13: ffff810009058900 R14: 0000000000000001 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff81007ec39dc0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000048 CR3: 000000007681e000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kstopmachine (pid: 3522, threadinfo ffff81007c4ce000, task ffff81007ed2f500)
Stack:  0000000000000000 0000000000000000 000000000000001f ffffffff8103d776
 0000000000000046 ffffffff81486100 00000000000008f6 0000000000000002
 ffffffff814861a8 0000000000000000 ffff81007ec7fac8 ffffffff81033424
Call Trace:
 <IRQ>  [<ffffffff8103d776>] ? printk+0x67/0x69
 [<ffffffff81033424>] wake_up_process+0x12/0x14
 [<ffffffff8107af2a>] redirect_hardirq+0x3b/0x48
 [<ffffffff8107c8db>] handle_edge_irq+0xcb/0x158
 [<ffffffff8100ecd1>] do_IRQ+0x106/0x179
 [<ffffffff8100c6f1>] ret_from_intr+0x0/0xa
 [<ffffffff811d8d3d>] ? i8042_panic_blink+0x0/0x144
 [<ffffffff8103cb78>] ? panic+0x147/0x169
 [<ffffffff8103cb03>] ? panic+0xd2/0x169
 [<ffffffff81055c1a>] ? blocking_notifier_call_chain+0xf/0x11
 [<ffffffff81040319>] ? do_exit+0x8d/0x84e
 [<ffffffff8103ce93>] ? wake_up_klogd+0x32/0x34
 [<ffffffff8128be79>] ? do_page_fault+0x70d/0x7f2
 [<ffffffff810831ec>] ? cpupri_set+0xca/0xdd
 [<ffffffff810831ec>] ? cpupri_set+0xca/0xdd
 [<ffffffff8103175e>] ? update_rt_migration+0x18/0x99
 [<ffffffff8103175e>] ? update_rt_migration+0x18/0x99
 [<ffffffff8103697f>] ? enqueue_task_rt+0xb9/0xcf
 [<ffffffff81289f49>] ? error_exit+0x0/0x51
 [<ffffffff81033313>] ? try_to_wake_up+0x18d/0x21e
 [<fffffff
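
A note on reading the first oops above: the byte marked <48> in the Code: section is where the faulting instruction starts. The bytes 48 8b 40 48 encode mov 0x48(%rax),%rax, and RAX is 0 in the register dump, which is consistent with CR2 = 0x48: a load through a NULL pointer at offset 0x48. The sketch below pulls those bytes out of a wrapped Code: section. The function name faulting_bytes is illustrative and does not come from the report.

```shell
#!/bin/sh
# Sketch: print the opcode bytes of an oops "Code:" section starting at the
# faulting instruction, which the kernel marks as <xx>.
# faulting_bytes is a hypothetical helper name, not part of any kernel tool.
faulting_bytes() {
    # Join the wrapped lines into one, then keep everything from the
    # two-digit <xx> marker onward, dropping the angle brackets.
    tr '\n' ' ' | sed -n 's/.*<\([0-9a-f][0-9a-f]\)>\(.*\)/\1\2/p'
}

# Usage: paste only the Code: lines of the oops on stdin, for example
#   faulting_bytes < code-lines.txt | cut -d' ' -f1-4
```

For a full disassembly, the kernel source tree ships scripts/decodecode, which reads an oops on standard input.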


Expected results:

cpu1 goes offline:

[root@rhelrt-2 ~]# cat /sys/devices/system/cpu/cpu1/online 
1
[root@rhelrt-2 ~]# echo 0 > /sys/devices/system/cpu/cpu1/online 
[root@rhelrt-2 ~]# cat /sys/devices/system/cpu/cpu1/online 
0
[root@rhelrt-2 ~]# uname -r
2.6.24.7-62.el5rtvanilla

Additional info:

Clark tried this on an AMD machine without problems, so this may be Intel-specific.

Comment 1 Arnaldo Carvalho de Melo 2008-06-04 00:27:03 UTC
The problem is not Intel-specific. Thomas Gleixner diagnosed it as a cpupri
problem, and a kernel RPM package with a hotfix patch is being built.

Comment 2 Gregory Haskins 2008-06-04 00:31:37 UTC
After talking to tglx, I suspect the issue is in how the root-domain and cpupri
code interact with one another when building a domain.  The CPUs in question
should have been marked INVALID, which would remove them from the cpupri tables.
Instead they were allowed to register with some priority that was != INVALID.

This bug is likely to affect 23-rt, 24-rt, 25-rt, and sched-devel (and of
course, any derivatives of such).  I will hopefully post a solution soon.

Comment 3 Arnaldo Carvalho de Melo 2008-06-04 01:35:14 UTC
It survives a lot longer, but if we keep offlining and onlining a CPU in an
infinite loop we eventually oops, with a different backtrace this time. Will
attach.
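
The kind of offline/online loop described here can be sketched as follows. This is a minimal sketch, not the script actually used for the test: CPU_ROOT, toggle_cpu and stress_cpu are illustrative names, the loop is bounded rather than infinite, and it must run as root against the real sysfs tree.

```shell
#!/bin/sh
# Sketch: repeatedly offline and online one CPU through sysfs.
# CPU_ROOT, toggle_cpu and stress_cpu are hypothetical names for illustration.
CPU_ROOT=${CPU_ROOT:-/sys/devices/system/cpu}

# Write 0 or 1 to cpuN/online and confirm that reading it back gives the
# same value.
toggle_cpu() {
    cpu=$1
    state=$2
    ctl="$CPU_ROOT/cpu$cpu/online"
    [ -w "$ctl" ] || { echo "cpu$cpu: no writable online file" >&2; return 1; }
    echo "$state" > "$ctl" || return 1
    [ "$(cat "$ctl")" = "$state" ]
}

# Offline, then online, the given CPU for a bounded number of iterations.
stress_cpu() {
    cpu=$1
    iterations=$2
    i=0
    while [ "$i" -lt "$iterations" ]; do
        toggle_cpu "$cpu" 0 || return 1
        toggle_cpu "$cpu" 1 || return 1
        i=$((i + 1))
    done
    echo "cpu$cpu survived $iterations offline/online cycles"
}

# Example (as root): stress_cpu 1 1000
```

On an affected kernel the machine is expected to oops before the loop finishes, so the final message only appears when every cycle completes.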

Comment 4 Arnaldo Carvalho de Melo 2008-06-04 01:36:41 UTC
Created attachment 308312
new oops

Comment 5 Gregory Haskins 2008-06-04 04:00:09 UTC
Created attachment 308318
Proposed fix

Comment 6 Arnaldo Carvalho de Melo 2008-06-04 17:20:44 UTC
With Gregory's patch applied it now takes a lot longer, but we still eventually oops.

Comment 7 Arnaldo Carvalho de Melo 2008-06-04 17:21:44 UTC
Created attachment 308373
new oops, with Gregory patch applied

Comment 8 Gregory Haskins 2008-06-04 19:38:54 UTC
Created attachment 308386
Proposed fix

I have incorporated Peter Zijlstra's feedback, and also fixed some additional
holes that I found would still allow the cpupri table to get updated.  Please
retest!

Comment 9 Gregory Haskins 2008-06-04 19:43:30 UTC
Created attachment 308389
Proposed fix

Hmmm... I refreshed the patch, but Firefox seems to have uploaded the old
one. Let's try again.

Comment 10 Arnaldo Carvalho de Melo 2008-08-07 14:01:48 UTC
Created attachment 313693
sysfs CPU classes, usage example does random CPU off/onlining

Comment 14 errata-xmlrpc 2008-08-26 19:57:14 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2008-0585.html

