Bug 437944

Summary: "cpu hotplug" leads to CPU deadlock and system crash
Product: Red Hat Enterprise Linux 5
Component: pm-utils
Version: 5.2
Hardware: ia64
OS: Linux
Status: CLOSED NOTABUG
Severity: low
Priority: low
Target Milestone: rc
Target Release: ---
Reporter: Song, Youquan <youquan.song>
Assignee: Phil Knirsch <pknirsch>
QA Contact:
Docs Contact:
CC: grgustaf, luyu, rvokal
Doc Type: Bug Fix
Last Closed: 2008-03-24 03:48:30 UTC

Description Song, Youquan 2008-03-18 11:38:23 UTC
Description of problem:

CPU hotplug leads to a CPU deadlock, followed by a system crash. When we take
the logical CPUs offline and then bring them back online, the system deadlocks
and crashes.

Platform:  Hitachi, with 2 Montvale CPUs.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1. Run the following script:

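# Requires bash or ksh: typeset -i declares integer variables, so
# loop=$loop+1 is evaluated arithmetically.

# Count the logical CPUs listed in /proc/cpuinfo.
typeset -i num_cpus=$(grep -c "^processor" /proc/cpuinfo)

# Pass 1: take every online CPU except cpu0 offline.
typeset -i loop=1
while [ $loop -lt $num_cpus ]; do
    typeset -i online=$(cat /sys/devices/system/cpu/cpu$loop/online)
    if [ $online -eq 1 ]; then
        echo 0 > /sys/devices/system/cpu/cpu$loop/online
    fi
    loop=$loop+1
done

# Pass 2: bring every offline CPU except cpu0 back online.
loop=1
while [ $loop -lt $num_cpus ]; do
    typeset -i online=$(cat /sys/devices/system/cpu/cpu$loop/online)
    if [ $online -eq 0 ]; then
        echo 1 > /sys/devices/system/cpu/cpu$loop/online
    fi
    loop=$loop+1
done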
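
The same offline/online cycle can also be written as a function that takes the sysfs directory as a parameter, so the logic can be dry-run against a scratch directory before touching real CPUs. This is only a sketch under assumptions: the function name `set_cpus_online` and its optional directory argument are hypothetical and not part of the original report. Like the script above, it leaves cpu0 alone.

```shell
#!/bin/bash
# Hypothetical helper (not from the original report).
# Usage: set_cpus_online <0|1> [cpu_root]
# Writes the wanted state to every cpuN/online file under cpu_root whose
# current value differs, skipping cpu0, and prints how many files changed.
set_cpus_online() {
    local want=$1
    local root=${2:-/sys/devices/system/cpu}
    local f cur changed=0
    for f in "$root"/cpu[0-9]*/online; do
        [ -e "$f" ] || continue                 # glob matched nothing
        case "$f" in
            "$root"/cpu0/online) continue ;;    # leave cpu0 untouched
        esac
        cur=$(cat "$f")
        if [ "$cur" != "$want" ]; then
            echo "$want" > "$f" || return 1     # stop on the first failed write
            changed=$((changed + 1))
        fi
    done
    echo "$changed"
}
```

Running `set_cpus_online 0` followed by `set_cpus_online 1` as root would repeat the offline/online cycle on the real system; passing a scratch directory as the second argument exercises only the file handling.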


  
Actual results:
The system crashes with an exception. dmesg shows the following:

BUG: soft lockup - CPU#3 stuck for 10s! [bash:5022]
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc cpufreq_ondemand 
acpi_cpufreq freq_table vfat fat dm_mirror dm_multipath dm_mod button ipv6 
xfrm_nalgo crypto_api parport_pc lp parport sg ide_cd shpchp e100 cdrom mii 
e1000e ata_piix libata mptsas mptscsih mptbase scsi_transport_sas sd_mod 
scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd

Pid: 5022, CPU 3, comm:                 bash
psr : 0000101008526030 ifs : 8000000000000001 ip  : [<a000000100037900>]    
Not tainted
ip is at ia64_itc_udelay+0x80/0xc0
unat: 0000000000000000 pfs : 0000000000000205 rsc : 0000000000000003
rnat: 0000000000000ca1 bsps: 0000000000000000 pr  : 000000000059a659
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
csd : 0000000000000000 ssd : 0000000000000000
b0  : a000000100036b60 b6  : a000000100037880 b7  : a0000001000110a0
f6  : 1003e000000acff757ebe f7  : 1003e00000000000001a1
f8  : 1003e000000acff74dbda f9  : 1003e0000000000000064
f10 : 1003e37910802855736d0 f11 : 1003e0000000000000067
r1  : a000000100be0270 r2  : ffffffffffff0070 r3  : ffffffffffffb9f0
r8  : 000000acff7538b8 r9  : a000000100abad68 r10 : a0000001009f66c8
r11 : 0000000000000100 r12 : e000000117c7fd90 r13 : e000000117c78000
r14 : 00000000000001a1 r15 : 000000acff757ebe r16 : 00000000000000bf
r17 : 0000000000000000 r18 : 0000000000000000 r19 : 0000000000000001
r20 : a000000100abae68 r21 : a000000100abae68 r22 : a0000001009f68d0
r23 : 0000000000000260 r24 : a000000100abbe70 r25 : a000000100abbe70
r26 : 0000000000000013 r27 : a0000001009fb3d0 r28 : a0000001009fb3b0
r29 : a0000001009fb3b0 r30 : a0000001009f8398 r31 : e00000027f9b0000

Call Trace:
 [<a000000100013ae0>] show_stack+0x40/0xa0
                                sp=e000000117c7f9f0 bsp=e000000117c79690
 [<a0000001000143e0>] show_regs+0x840/0x880
                                sp=e000000117c7fbc0 bsp=e000000117c79638
 [<a0000001000e8450>] softlockup_tick+0x2b0/0x320
                                sp=e000000117c7fbc0 bsp=e000000117c795e8
 [<a000000100093cd0>] run_local_timers+0x30/0x60
                                sp=e000000117c7fbc0 bsp=e000000117c795c8
 [<a000000100093d80>] update_process_times+0x80/0x100
                                sp=e000000117c7fbc0 bsp=e000000117c79590
 [<a0000001000376a0>] timer_interrupt+0x180/0x360
                                sp=e000000117c7fbc0 bsp=e000000117c79550
 [<a0000001000e8af0>] handle_IRQ_event+0x90/0x120
                                sp=e000000117c7fbc0 bsp=e000000117c79510
 [<a0000001000e8cb0>] __do_IRQ+0x130/0x420
                                sp=e000000117c7fbc0 bsp=e000000117c794c8
 [<a000000100011750>] ia64_handle_irq+0xf0/0x1a0
                                sp=e000000117c7fbc0 bsp=e000000117c79498
 [<a00000010000c020>] __ia64_leave_kernel+0x0/0x280
                                sp=e000000117c7fbc0 bsp=e000000117c79498
 [<a000000100037900>] ia64_itc_udelay+0x80/0xc0
                                sp=e000000117c7fd90 bsp=e000000117c79490
 [<a000000100036b60>] udelay+0x40/0x60
                                sp=e000000117c7fd90 bsp=e000000117c79470
 [<a000000100057aa0>] __cpu_up+0x440/0x9a0
                                sp=e000000117c7fd90 bsp=e000000117c79418
 [<a0000001000bb270>] cpu_up+0x230/0x360
                                sp=e000000117c7fe20 bsp=e000000117c793d8
 [<a0000001003e7990>] store_online+0x70/0xe0
                                sp=e000000117c7fe20 bsp=e000000117c793a8
 [<a0000001003def20>] sysdev_store+0x60/0xa0
                                sp=e000000117c7fe20 bsp=e000000117c79370
 [<a0000001001fd7a0>] sysfs_write_file+0x220/0x2c0
                                sp=e000000117c7fe20 bsp=e000000117c79320
 [<a000000100164320>] vfs_write+0x200/0x3a0
                                sp=e000000117c7fe20 bsp=e000000117c792d0
 [<a000000100164e70>] sys_write+0x70/0xe0
                                sp=e000000117c7fe20 bsp=e000000117c79258
 [<a00000010000bdb0>] __ia64_trace_syscall+0xd0/0x110
                                sp=e000000117c7fe30 bsp=e000000117c79258
 [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400
                                sp=e000000117c80000 bsp=e000000117c79258
Processor 0x1/0x100 is stuck.
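
To check whether another run hit the same failure, the kernel log can be scanned for the soft lockup message shown above. The sketch below assumes the message format seen in this log ("BUG: soft lockup - CPU#N stuck"); other kernel versions may word it differently, and the function name `soft_lockup_cpus` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper (not from the original report).
# Reads kernel log text on stdin and prints, one per line and without
# duplicates, the CPU numbers named in soft lockup messages.
soft_lockup_cpus() {
    sed -n 's/^.*BUG: soft lockup - CPU#\([0-9][0-9]*\) stuck.*$/\1/p' | sort -n -u
}
```

A typical use would be `dmesg | soft_lockup_cpus` after running the reproduction script.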


Expected results:


Additional info:

Comment 1 Luming Yu 2008-03-19 03:29:52 UTC
Tested the RHEL 5.2 beta on an Intel Tiger box with 16 logical CPUs; the script
works just fine.

Tested upstream 2.6.25-rc3 on a Hitachi Coldfusion 4s4u with 16 logical CPUs;
there the script does trigger some kernel problems.

So I believe this problem is related to something specific to the Hitachi
Coldfusion box, and it is still not fixed upstream.

Comment 2 Luming Yu 2008-03-24 03:48:30 UTC
It has been root-caused as a firmware problem; please check for a firmware update.


Comment 3 Song, Youquan 2008-03-24 09:12:08 UTC
But I had already updated the firmware to FW_04-21_03-34 before I reported the bug.

Comment 4 Luming Yu 2008-03-24 09:22:16 UTC
If fw_04-21_03-34 doesn't work, please ask for a newer update.
In any case, this sounds more like a firmware problem than a kernel problem.