Description of problem:
"CPU hotplug" leads to a CPU deadlock, then a system crash. When we offline the logical CPUs and then online them again, the system deadlocks and crashes.

Platform: Hitachi, with 2 Montvale CPUs.

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1. Run the following script.

    typeset -i num_cpus=`cat /proc/cpuinfo| grep "^processor"| wc -l`
    typeset -i loop=1
    while [ $loop -lt $num_cpus ];do
        typeset -i online=`cat /sys/devices/system/cpu/cpu$loop/online`
        if [ $online -eq 1 ];then
            echo 0 > /sys/devices/system/cpu/cpu$loop/online
        fi
        loop=$loop+1
    done
    loop=1
    while [ $loop -lt $num_cpus ];do
        typeset -i online=`cat /sys/devices/system/cpu/cpu$loop/online`
        if [ $online -eq 0 ];then
            echo 1 > /sys/devices/system/cpu/cpu$loop/online
        fi
        loop=$loop+1
    done

Actual results:
The system crashes, with the following dmesg output.

    BUG: soft lockup - CPU#3 stuck for 10s! [bash:5022]
    Modules linked in: autofs4 hidp rfcomm l2cap bluetooth sunrpc cpufreq_ondemand acpi_cpufreq freq_table vfat fat dm_mirror dm_multipath dm_mod button ipv6 xfrm_nalgo crypto_api parport_pc lp parport sg ide_cd shpchp e100 cdrom mii e1000e ata_piix libata mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd

    Pid: 5022, CPU 3, comm: bash
    psr : 0000101008526030 ifs : 8000000000000001 ip : [<a000000100037900>] Not tainted
    ip is at ia64_itc_udelay+0x80/0xc0
    unat: 0000000000000000 pfs : 0000000000000205 rsc : 0000000000000003
    rnat: 0000000000000ca1 bsps: 0000000000000000 pr : 000000000059a659
    ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c8a70033f
    csd : 0000000000000000 ssd : 0000000000000000
    b0 : a000000100036b60 b6 : a000000100037880 b7 : a0000001000110a0
    f6 : 1003e000000acff757ebe f7 : 1003e00000000000001a1
    f8 : 1003e000000acff74dbda f9 : 1003e0000000000000064
    f10 : 1003e37910802855736d0 f11 : 1003e0000000000000067
    r1 : a000000100be0270 r2 : ffffffffffff0070 r3 : ffffffffffffb9f0
    r8 : 000000acff7538b8 r9 : a000000100abad68 r10 : a0000001009f66c8
    r11 : 0000000000000100 r12 : e000000117c7fd90 r13 : e000000117c78000
    r14 : 00000000000001a1 r15 : 000000acff757ebe r16 : 00000000000000bf
    r17 : 0000000000000000 r18 : 0000000000000000 r19 : 0000000000000001
    r20 : a000000100abae68 r21 : a000000100abae68 r22 : a0000001009f68d0
    r23 : 0000000000000260 r24 : a000000100abbe70 r25 : a000000100abbe70
    r26 : 0000000000000013 r27 : a0000001009fb3d0 r28 : a0000001009fb3b0
    r29 : a0000001009fb3b0 r30 : a0000001009f8398 r31 : e00000027f9b0000

    Call Trace:
    [<a000000100013ae0>] show_stack+0x40/0xa0 sp=e000000117c7f9f0 bsp=e000000117c79690
    [<a0000001000143e0>] show_regs+0x840/0x880 sp=e000000117c7fbc0 bsp=e000000117c79638
    [<a0000001000e8450>] softlockup_tick+0x2b0/0x320 sp=e000000117c7fbc0 bsp=e000000117c795e8
    [<a000000100093cd0>] run_local_timers+0x30/0x60 sp=e000000117c7fbc0 bsp=e000000117c795c8
    [<a000000100093d80>] update_process_times+0x80/0x100 sp=e000000117c7fbc0 bsp=e000000117c79590
    [<a0000001000376a0>] timer_interrupt+0x180/0x360 sp=e000000117c7fbc0 bsp=e000000117c79550
    [<a0000001000e8af0>] handle_IRQ_event+0x90/0x120 sp=e000000117c7fbc0 bsp=e000000117c79510
    [<a0000001000e8cb0>] __do_IRQ+0x130/0x420 sp=e000000117c7fbc0 bsp=e000000117c794c8
    [<a000000100011750>] ia64_handle_irq+0xf0/0x1a0 sp=e000000117c7fbc0 bsp=e000000117c79498
    [<a00000010000c020>] __ia64_leave_kernel+0x0/0x280 sp=e000000117c7fbc0 bsp=e000000117c79498
    [<a000000100037900>] ia64_itc_udelay+0x80/0xc0 sp=e000000117c7fd90 bsp=e000000117c79490
    [<a000000100036b60>] udelay+0x40/0x60 sp=e000000117c7fd90 bsp=e000000117c79470
    [<a000000100057aa0>] __cpu_up+0x440/0x9a0 sp=e000000117c7fd90 bsp=e000000117c79418
    [<a0000001000bb270>] cpu_up+0x230/0x360 sp=e000000117c7fe20 bsp=e000000117c793d8
    [<a0000001003e7990>] store_online+0x70/0xe0 sp=e000000117c7fe20 bsp=e000000117c793a8
    [<a0000001003def20>] sysdev_store+0x60/0xa0 sp=e000000117c7fe20 bsp=e000000117c79370
    [<a0000001001fd7a0>] sysfs_write_file+0x220/0x2c0 sp=e000000117c7fe20 bsp=e000000117c79320
    [<a000000100164320>] vfs_write+0x200/0x3a0 sp=e000000117c7fe20 bsp=e000000117c792d0
    [<a000000100164e70>] sys_write+0x70/0xe0 sp=e000000117c7fe20 bsp=e000000117c79258
    [<a00000010000bdb0>] __ia64_trace_syscall+0xd0/0x110 sp=e000000117c7fe30 bsp=e000000117c79258
    [<a000000000010620>] __start_ivt_text+0xffffffff00010620/0x400 sp=e000000117c80000 bsp=e000000117c79258
    Processor 0x1/0x100 is stuck.

Expected results:

Additional info:
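For repeated runs, the offline/online pass in the reproducer can be sketched as a small shell function. This is only a sketch under stated assumptions: the name `set_all_cpus` and its optional directory argument are hypothetical and not part of the original script, and it walks the `cpuN/online` files directly instead of counting entries in /proc/cpuinfo, so it may visit a different set of CPUs than the original loop.

```sh
#!/bin/sh
# Hypothetical helper (not from the original report):
#   set_all_cpus VALUE [CPU_DIR]
# Writes VALUE (0 = offline, 1 = online) to every cpuN/online file under
# CPU_DIR whose current content differs, skipping cpu0 as the original
# script does. CPU_DIR defaults to the real sysfs location.
set_all_cpus() {
    want=$1
    cpu_dir=${2:-/sys/devices/system/cpu}
    for f in "$cpu_dir"/cpu[0-9]*/online; do
        # The glob stays literal when nothing matches; skip that case.
        [ -e "$f" ] || continue
        case $f in
            "$cpu_dir"/cpu0/online) continue ;;  # leave the boot CPU alone
        esac
        if [ "$(cat "$f")" != "$want" ]; then
            echo "$want" > "$f" || return 1
        fi
    done
}

# One offline/online cycle as in the report (needs root on a real system):
#   set_all_cpus 0 && set_all_cpus 1
```

The directory argument exists only so the function can be exercised against a scratch directory before it is pointed at the real sysfs tree.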
Tested RHEL 5.2 beta on an Intel Tiger box with 16 logical CPUs: the script works just fine. Tested upstream 2.6.25-rc3 on a Hitachi Coldfusion 4s4u with 16 logical CPUs: the script does trigger some kernel problems. So I believe this problem is related to something specific to the Hitachi Coldfusion box, and it is still not fixed upstream.
It has been root-caused as a firmware problem; please check for a firmware update.
But I had already updated the firmware to FW_04-21_03-34 before I reported the bug.
If FW_04-21_03-34 doesn't work, please ask for a newer update. In any case, this sounds more like a firmware problem than a kernel problem.