Bug 455436 - x86: fix cpu hotplug crash
Summary: x86: fix cpu hotplug crash
Keywords:
Status: CLOSED WONTFIX
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: All
OS: Linux
Priority: low
Severity: low
Target Milestone: rc
Target Release: ---
Assignee: Prarit Bhargava
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks: 455409
Reported: 2008-07-15 14:21 UTC by Prarit Bhargava
Modified: 2008-10-28 12:58 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2008-10-28 12:58:05 UTC
Target Upstream Version:
Embargoed:


Attachments
Upstream fix for this issue (4.91 KB, patch)
2008-07-15 14:21 UTC, Prarit Bhargava

Description Prarit Bhargava 2008-07-15 14:21:10 UTC
Backport the following upstream commit:

Gitweb:    
http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=fcb43042ef55d2f46b0efa5d7746967cef38f056
Commit:     fcb43042ef55d2f46b0efa5d7746967cef38f056
Parent:     0b1faeef5f9243bb5fc5713a34bbf1ceab0de562
Author:     Zhang, Yanmin <yanmin_zhang.com>
AuthorDate: Tue Jun 24 16:06:23 2008 +0800
Committer:  Ingo Molnar <mingo>
CommitDate: Mon Jun 30 13:15:43 2008 +0200

    x86: fix cpu hotplug crash
    
    Vegard Nossum reported crashes during cpu hotplug tests:
    
      http://marc.info/?l=linux-kernel&m=121413950227884&w=4
    
    In function _cpu_up, the panic happens on the second call to
    __raw_notifier_call_chain. The kernel doesn't panic on the first
    call, so attributing the crash to nr_cpu_ids alone is not right.
    
    By checking the source code, I found that function do_boot_cpu is the culprit.
    Consider the call chain below:
     _cpu_up=>__cpu_up=>smp_ops.cpu_up=>native_cpu_up=>do_boot_cpu.
    
    So do_boot_cpu is called in the end. In do_boot_cpu, if
    boot_error==true, cpu_clear(cpu, cpu_possible_map) is executed. Later
    on, when _cpu_up calls __raw_notifier_call_chain for the second time to
    report CPU_UP_CANCELED, this cpu has already been cleared from
    cpu_possible_map, so get_cpu_sysdev returns NULL.
    
    Many resources are related to cpu_possible_map, so it's better not to
    change it.
    
    The patch below, against 2.6.26-rc7, fixes it by removing the bit
    clearing in cpu_possible_map.
    
    Signed-off-by: Zhang Yanmin <yanmin_zhang.com>
    Tested-by: Vegard Nossum <vegard.nossum>
    Acked-by: Rusty Russell <rusty.au>
    Signed-off-by: Ingo Molnar <mingo>
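
The failure sequence described in the commit message can be modelled in a few
lines of user-space C. This is only an illustrative sketch, not kernel code:
possible_map, cpu_devs and every model_* function are hypothetical stand-ins
for cpu_possible_map, the per-cpu sysdev objects, get_cpu_sysdev, do_boot_cpu
and _cpu_up, and the assumption that get_cpu_sysdev returns NULL for a cpu
that is not in the possible map is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NR_CPUS 4

/* Toy stand-in for cpu_possible_map: bit N set means cpu N is possible. */
static unsigned long possible_map;

/* Hypothetical per-cpu device objects, one per cpu. */
struct sys_device { int id; };
static struct sys_device cpu_devs[NR_CPUS] = { {0}, {1}, {2}, {3} };

/* Model of get_cpu_sysdev: NULL for any cpu not in the possible map. */
static struct sys_device *model_get_cpu_sysdev(int cpu)
{
    if (cpu >= 0 && cpu < NR_CPUS && (possible_map & (1UL << cpu)))
        return &cpu_devs[cpu];
    return NULL;
}

/*
 * Model of do_boot_cpu. With clear_possible == true it behaves like the
 * code before the patch and drops the cpu from the possible map on a
 * boot error; with clear_possible == false it leaves the map alone.
 */
static int model_do_boot_cpu(int cpu, bool boot_error, bool clear_possible)
{
    if (!boot_error)
        return 0;
    if (clear_possible)
        possible_map &= ~(1UL << cpu);  /* the bit clearing the patch removes */
    return -1;
}

/*
 * Model of _cpu_up with a failing boot. Returns true if the second
 * notifier call (CPU_UP_CANCELED) would still find a valid sysdev.
 */
static bool model_cancel_sees_sysdev(int cpu, bool clear_possible)
{
    possible_map = (1UL << NR_CPUS) - 1;        /* all cpus possible */

    /* First notifier call: the sysdev lookup succeeds. */
    assert(model_get_cpu_sysdev(cpu) != NULL);

    if (model_do_boot_cpu(cpu, true, clear_possible) == 0)
        return true;

    /* Second notifier call: the lookup fails if the bit was cleared. */
    return model_get_cpu_sysdev(cpu) != NULL;
}
```

Calling model_cancel_sees_sysdev(1, true) models the unpatched behaviour and
reports that the second lookup finds nothing, while
model_cancel_sees_sysdev(1, false) models the patched behaviour, where the
lookup still succeeds.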

Comment 1 Prarit Bhargava 2008-07-15 14:21:10 UTC
Created attachment 311839 [details]
Upstream fix for this issue

Comment 2 Prarit Bhargava 2008-07-23 11:29:42 UTC
I tried virtual cpu removal and addition and it seems to work.  However, if you
remove cpu A, then B, then re-add A, then B ... you get a nasty panic very
similar to what is described in the link above.

i.e., I can reproduce this issue in RHEL5.

The patch above seems to make everything happy.

P.

Comment 3 RHEL Program Management 2008-07-25 17:01:53 UTC
This request was evaluated by Red Hat Product Management for
inclusion, but this component is not scheduled to be updated in
the current Red Hat Enterprise Linux release. If you would like
this request to be reviewed for the next minor release, ask your
support representative to set the next rhel-x.y flag to "?".

Comment 4 Ludek Smid 2008-07-25 21:53:51 UTC
Unfortunately the previous automated notification about the
non-inclusion of this request in Red Hat Enterprise Linux 5.3 used
the wrong text template. It should have read: this request has been
reviewed by Product Management and is not planned for inclusion
in the current minor release of Red Hat Enterprise Linux.

If you would like this request to be reviewed for the next minor
release, ask your support representative to set the next rhel-x.y
flag to "?" or raise an exception.

