541953 – kernel panic when doing cpu offline/online frequently on hp-dl785g5-01.rhts.eng.bos.redhat.com

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.5
Hardware:	All
OS:	Linux
Priority:	urgent
Severity:	urgent
Target Milestone:	rc
Target Release:	5.5
Assignee:	Prarit Bhargava
QA Contact:	Red Hat Kernel QE team
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	520837 545583

Reported:	2009-11-27 17:03 UTC by Zhang Kexin
Modified:	2013-01-11 02:37 UTC
CC List:	9 users
Fixed In Version:
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2010-03-30 06:58:08 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
Initial patch (947 bytes, patch) 2009-12-04 15:53 UTC, Prarit Bhargava	no flags
RHEL5 fix for this issue (2.37 KB, patch) 2009-12-08 19:22 UTC, Prarit Bhargava	no flags

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHSA-2010:0178	0	normal	SHIPPED_LIVE	Important: Red Hat Enterprise Linux 5.5 kernel security and bug fix update	2010-03-29 12:18:21 UTC

Description Zhang Kexin 2009-11-27 17:03:30 UTC

Description of problem:
On the RHTS machine hp-dl785g5-01.rhts.eng.bos.redhat.com, the kernel panics when running the /kernel/hotplug/cpusofthotplug test.

Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:
Initializing CPU#47
Six-Core AMD Opteron(tm) Processor 8439 SE stepping 00
CPU 13 is now offline
CPU 2 is now offline
CPU 15 is now offline
CPU 9 is now offline
CPU 10 is now offline
CPU 35 is now offline
Initializing CPU#9
Six-Core AMD Opteron(tm) Processor 8439 SE stepping 00
CPU 4 is now offline
CPU 40 is now offline
CPU 1 is now offline
CPU 11 is now offline
Unable to handle kernel NULL pointer dereference at 00000000000000c0 RIP: 
 [<ffffffff80080312>] cacheinfo_cpu_callback+0x458/0x4c3
PGD 202835b067 PUD 202bd7e067 PMD 0 
Oops: 0002 [1] SMP 
last sysfs file: /kernel/kexec_crash_loaded
CPU 0 
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api cpufreq_ondemand powernow_k8 freq_table dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport shpchp ide_cd cdrom i2c_piix4 i2c_core hpilo bnx2 serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc cciss sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 29539, comm: runtest.sh Not tainted 2.6.18-164.8.1.el5 #1
RIP: 0010:[<ffffffff80080312>]  [<ffffffff80080312>] cacheinfo_cpu_callback+0x458/0x4c3
RSP: 0018:ffff812029e6dd28  EFLAGS: 00010293
RAX: 00000000000000a8 RBX: 000000000000000b RCX: 0000000000000000
RDX: 0000000000000023 RSI: 00000000000000ff RDI: ffff81042f9c9c40
RBP: ffff81042f9c9c28 R08: 000000000000001c R09: 0000000000000024
R10: ffff81042f9c9c40 R11: ffffffff8002f3d9 R12: 00000000000000a8
R13: 0000000000000003 R14: 0000000000000150 R15: ffff81041fe578c0
FS:  00002b7a7ced2e10(0000) GS:ffffffff803c1000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000000c0 CR3: 0000002029253000 CR4: 00000000000006e0
Process runtest.sh (pid: 29539, threadinfo ffff812029e6c000, task ffff81202feff860)
Stack:  ffff810422acc300 0000000000000282 ffff810422acc300 ffff81042d89d880
 ffffffff8045b408 0000000000000282 ffff81010eaeb690 ffff81010eaeb680
 ffff81010eaeb690 ffffffff803333d0 ffffffff80333320 ffff81010eaeb690
Call Trace:
 [<ffffffff8014e7f1>] kobject_cleanup+0x62/0x7e
 [<ffffffff8014e80d>] kobject_release+0x0/0x9
 [<ffffffff8014e80d>] kobject_release+0x0/0x9
 [<ffffffff8014e80d>] kobject_release+0x0/0x9
 [<ffffffff80066eaa>] notifier_call_chain+0x20/0x32
 [<ffffffff800a45bb>] _cpu_down+0x191/0x265
 [<ffffffff800a46b8>] cpu_down+0x29/0x3d
 [<ffffffff801c3eaf>] store_online+0x29/0x67
 [<ffffffff8010ae69>] sysfs_write_file+0xb9/0xe8
 [<ffffffff80016942>] vfs_write+0xce/0x174
 [<ffffffff800171fa>] sys_write+0x45/0x6e
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0


Code: f0 0f b3 58 18 48 8d 75 18 89 d7 e8 00 dd 0c 00 3d fe 00 00 
RIP  [<ffffffff80080312>] cacheinfo_cpu_callback+0x458/0x4c3
 RSP <ffff812029e6dd28>
CR2: 00000000000000c0
 <0>Kernel panic - not syncing: Fatal exception
 


Expected results:


Additional info:
On 2.6.18-164.el5, the kernel does not panic.

Comment 1 Zhang Kexin 2009-11-28 06:20:44 UTC

(In reply to comment #0)

> How reproducible:
> 100%
> 
> Steps to Reproduce:
> 1.yum -y install rh-tests-kernel-hotplug-cpusofthotplug.noarch
> 2.cd /mnt/tests/kernel/hotplug/cpusofthotplug
> 3.make run

Comment 3 Zhang Kexin 2009-12-01 05:30:48 UTC

The panic also happens on amd-drachma-01.lab.bos.redhat.com with the i386 2.6.18-164.8.1 debug kernel.

console messages:

lockdep: not fixing up alternatives.
Booting processor 9/25 eip 11000
CPU 9 irqstacks, hard=c080e000 soft=c07ee000
Leaving ESR disabled.
CPU9: AMD Engineering Sample stepping 01
lockdep: not fixing up alternatives.
Booting processor 4/20 eip 11000
CPU 4 irqstacks, hard=c0809000 soft=c07e9000
Leaving ESR disabled.
CPU4: AMD Engineering Sample stepping 01
Breaking affinity for irq 122
Breaking affinity for irq 137
Breaking affinity for irq 145
CPU 23 is now offline
BUG: unable to handle kernel NULL pointer dereference at virtual address 0000004c
 printing eip:
c040dfca
*pde = 00000000
Oops: 0002 [#1]
SMP 
last sysfs file: /devices/system/cpu/cpu23/online
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc ipv6 xfrm_nalgo crypto_api cpufreq_ondemand powernow_k8 dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi ac parport_pc lp parport joydev sr_mod i2c_piix4 sg ide_cd i2c_core serio_raw bnx2 cdrom pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage ahci libata mptsas mptscsih mptbase scsi_transport_sas sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
CPU:    0
EIP:    0060:[<c040dfca>]    Not tainted VLI
EFLAGS: 00010206   (2.6.18-164.8.1.el5debug #1) 
EIP is at cacheinfo_cpu_callback+0x37c/0x3d7
eax: 0000003c   ebx: 00000017   ecx: 00000014   edx: 00000014
esi: f7b53f04   edi: 0000003c   ebp: 0000012c   esp: f7bf4edc
ds: 007b   es: 007b   ss: 0068
Process runtest.sh (pid: 9365, ti=f7bf4000 task=f740ad20 task.ti=f7bf4000)
Stack: c05c967b c0826fdc f740ad20 c05c967b 00000003 00000007 c043b3cd c06a1e34 
       c069ba20 00000017 00000007 c0628b5c 00000017 00000000 f7bcd1a0 f7bcd1a0 
       c0440c65 ffffffff ff7fffff 00000017 fffffff0 f7f37000 00000002 c0440d21 
Call Trace:
 [<c05c967b>] dev_cpu_callback+0x73/0xaa
 [<c05c967b>] dev_cpu_callback+0x73/0xaa
 [<c043b3cd>] trace_hardirqs_on+0xf8/0x118
 [<c0628b5c>] notifier_call_chain+0x19/0x29
 [<c0440c65>] _cpu_down+0x135/0x1cc
 [<c0440d21>] cpu_down+0x25/0x36
 [<c0566aba>] store_online+0x24/0x56
 [<c0566a96>] store_online+0x0/0x56
 [<c0563d8a>] sysdev_store+0x1e/0x22
 [<c04b57d9>] sysfs_write_file+0xa3/0xcd
 [<c04b5736>] sysfs_write_file+0x0/0xcd
 [<c047d45b>] vfs_write+0xa1/0x143
 [<c047da5b>] sys_write+0x3c/0x63
 [<c0404f93>] syscall_call+0x7/0xb
 =======================
Code: ed 31 ff 8b 1a c7 44 24 10 00 00 00 00 eb 4a 89 fe 03 34 9d e0 74 82 c0 8d 46 10 e8 e9 64 0e 00 eb 18 89 f8 03 04 8d e0 74 82 c0 <f0> 0f b3 58 10 8d 56 10 89 c8 e8 e9 64 0e 00 83 f8 1f 89 c1 7e 
EIP: [<c040dfca>] cacheinfo_cpu_callback+0x37c/0x3d7 SS:ESP 0068:f7bf4edc
 <0>Kernel panic - not syncing: Fatal exception
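
The fault addresses in both oopses can be cross-checked against the register dumps. The faulting bytes `f0 0f b3 58 <disp8>` decode as `lock btr %ebx, disp8(%rax)` (a locked bit clear), so the faulting address should equal the base register plus the displacement, and the bit index in ebx/rbx (0xb = 11, 0x17 = 23) matches the CPU being taken offline. The sketch below is only a sanity check of that arithmetic. The helper names are hypothetical, and the idea that the base register holds "NULL + 3 * leaf size" is an inference from the numbers, not something the logs state.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Effective address of "lock btr %reg, disp8(%base)": the base register
 * value plus the 8-bit displacement taken from the Code: bytes.
 */
uint64_t btr_target(uint64_t base_reg, uint8_t disp8)
{
	return base_reg + disp8;
}

/*
 * Assumption (not stated in the logs): the base register is a NULL array
 * pointer plus leaf_index * leaf_size. Returns the leaf size that this
 * assumption implies, or 0 if base_reg is not a multiple of leaf_index.
 */
uint64_t implied_leaf_size(uint64_t base_reg, unsigned int leaf_index)
{
	assert(leaf_index != 0);
	if (base_reg % leaf_index != 0)
		return 0;
	return base_reg / leaf_index;
}
```

With the x86_64 values (RAX = 0xa8, displacement 0x18) `btr_target()` gives 0xc0, which matches CR2. With the i386 values (eax = 0x3c, displacement 0x10) it gives 0x4c, which matches the reported virtual address. Both base values divide evenly by 3, which is consistent with an index-3 (L3) leaf being computed from a NULL per-CPU pointer.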

Comment 4 Zhang Kexin 2009-12-01 06:01:23 UTC

2.6.18-164.el5PAE kernel finished the test smoothly.

Comment 5 Prarit Bhargava 2009-12-01 15:20:37 UTC

This is reproducible with experimental.10 and the 164.1.7.el5 z-stream.

P.

Comment 6 Prarit Bhargava 2009-12-01 16:19:44 UTC

This same panic happens on -164.el5 (RHEL5.4 base kernel).

Not a blocker for z-stream.

Still a bug (obviously).

P.

Comment 7 Bhavna Sarathy 2009-12-01 17:02:24 UTC

Do the kernels you are testing have the CPU hotplug patch from BZ 526770?

Comment 8 Prarit Bhargava 2009-12-01 22:51:54 UTC

(In reply to comment #7)
> The kernels you are testing has the CPU hot plug patch from BZ 526770?  

Yes.

P.

Comment 12 Prarit Bhargava 2009-12-03 18:54:47 UTC

Bhavna,

I've tracked this issue down to this RHEL5 commit:

commit 8c0ce9bfb7f4053fec9cfa70322a2391c7422314
Author: Bhavna Sarathy <bnagendr>
Date:   Tue Sep 22 16:11:06 2009 -0400

    [x86] fix up L3 cache information for AMD Magny-cours
    
    Message-id: <20090922161302.7053.39434.sendpatchset>
    Patchwork-id: 20918
    O-Subject: [RHEL5.5 PATCH 2/3] Fix up L3 cache information for AMD Magny-cou
    Bugzilla: 513684
    RH-Acked-by: Christopher Lalancette <clalance>
    RH-Acked-by: Prarit Bhargava <prarit>
    
    Resolves BZ 513684
    Fixup L3 cache information for AMD multi-node processors.

Looking at the code now.

P.

Comment 13 Prarit Bhargava 2009-12-03 21:05:14 UTC

arch/i386/kernel/cpu/intel_cacheinfo.c, line 474

        if ((index == 3) && (c->x86_vendor == X86_VENDOR_AMD)) {
                for_each_online_cpu(i) {
                        if (cpuid4_info[i] == NULL)
                                continue;
                        this_leaf = CPUID4_INFO_IDX(i, index);
                        this_leaf->shared_cpu_map = c[i].llc_shared_map;
                }
                return;
        }

Removing this chunk fixes the panic in comment #1.

P.

Comment 14 Bhavna Sarathy 2009-12-03 21:15:04 UTC

Can you please add AMD confidential group to this bug?

Comment 17 Prarit Bhargava 2009-12-04 15:52:27 UTC

Sent to Andreas Hermann @ AMD in private email:

-------------------------
I'm currently attempting to resolve a panic in the hotplug code which is related to your patchset posted in these links:

http://lkml.org/lkml/2009/6/3/244

http://lkml.org/lkml/2009/6/3/246

http://lkml.org/lkml/2009/6/3/247

http://lkml.org/lkml/2009/6/3/248

http://lkml.org/lkml/2009/6/3/249

http://lkml.org/lkml/2009/6/3/251


(In the code below I am referencing a RHEL5 backport of this code. Please contact
our onsite AMD engineer, Bhavna (cc'd), for details on accessing this codebase.)

When doing a random walk of CPU hotplug events, I occasionally see

Unable to handle kernel NULL pointer dereference at 00000000000000c0 RIP:
[<ffffffff80080312>] cacheinfo_cpu_callback+0x458/0x4c3
PGD 202835b067 PUD 202bd7e067 PMD 0
Oops: 0002 [1] SMP
last sysfs file: /kernel/kexec_crash_loaded
CPU 0
Modules linked in: autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc
ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink
iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables
x_tables ipv6 xfrm_nalgo crypto_api cpufreq_ondemand powernow_k8 freq_table
dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi
acpi_memhotplug ac parport_pc lp parport shpchp ide_cd cdrom i2c_piix4 i2c_core
hpilo bnx2 serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache
dm_snapshot dm_zero dm_mirror dm_log dm_mod qla2xxx scsi_transport_fc cciss
sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 29539, comm: runtest.sh Not tainted 2.6.18-164.8.1.el5 #1
RIP: 0010:[<ffffffff80080312>] [<ffffffff80080312>]
cacheinfo_cpu_callback+0x458/0x4c3
RSP: 0018:ffff812029e6dd28 EFLAGS: 00010293
RAX: 00000000000000a8 RBX: 000000000000000b RCX: 0000000000000000
RDX: 0000000000000023 RSI: 00000000000000ff RDI: ffff81042f9c9c40
RBP: ffff81042f9c9c28 R08: 000000000000001c R09: 0000000000000024
R10: ffff81042f9c9c40 R11: ffffffff8002f3d9 R12: 00000000000000a8
R13: 0000000000000003 R14: 0000000000000150 R15: ffff81041fe578c0
FS: 00002b7a7ced2e10(0000) GS:ffffffff803c1000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000000000000c0 CR3: 0000002029253000 CR4: 00000000000006e0
Process runtest.sh (pid: 29539, threadinfo ffff812029e6c000, task
ffff81202feff860)
Stack: ffff810422acc300 0000000000000282 ffff810422acc300 ffff81042d89d880
ffffffff8045b408 0000000000000282 ffff81010eaeb690 ffff81010eaeb680
ffff81010eaeb690 ffffffff803333d0 ffffffff80333320 ffff81010eaeb690
Call Trace:
[<ffffffff8014e7f1>] kobject_cleanup+0x62/0x7e
[<ffffffff8014e80d>] kobject_release+0x0/0x9
[<ffffffff8014e80d>] kobject_release+0x0/0x9
[<ffffffff8014e80d>] kobject_release+0x0/0x9
[<ffffffff80066eaa>] notifier_call_chain+0x20/0x32
[<ffffffff800a45bb>] _cpu_down+0x191/0x265
[<ffffffff800a46b8>] cpu_down+0x29/0x3d
[<ffffffff801c3eaf>] store_online+0x29/0x67
[<ffffffff8010ae69>] sysfs_write_file+0xb9/0xe8
[<ffffffff80016942>] vfs_write+0xce/0x174
[<ffffffff800171fa>] sys_write+0x45/0x6e
[<ffffffff8005d28d>] tracesys+0xd5/0xe0


Code: f0 0f b3 58 18 48 8d 75 18 89 d7 e8 00 dd 0c 00 3d fe 00 00
RIP [<ffffffff80080312>] cacheinfo_cpu_callback+0x458/0x4c3
RSP <ffff812029e6dd28>
CR2: 00000000000000c0

I believe I have discovered what the problem is, and it relates to the following two chunks of
code in RHEL5 (sorry for the cut-and-paste). FWIW, this code is very similar to upstream...

static void __cpuinit cache_shared_cpu_map_setup(unsigned int cpu, int index)
{
   struct _cpuid4_info *this_leaf, *sibling_leaf;
   unsigned long num_threads_sharing;
   int index_msb, i;
   struct cpuinfo_x86 *c = cpu_data;

   if ((index == 3) && (c->x86_vendor == X86_VENDOR_AMD)) {
       for_each_online_cpu(i) {
           if (cpuid4_info[i] == NULL)
               continue;
           this_leaf = CPUID4_INFO_IDX(i, index);
           this_leaf->shared_cpu_map = c[i].llc_shared_map;
       }
       return;
   }


and

static void __cpuinit cache_remove_shared_cpu_map(unsigned int cpu, int index)
{
   struct _cpuid4_info *this_leaf, *sibling_leaf;
   int sibling;

   this_leaf = CPUID4_INFO_IDX(cpu, index);
   for_each_cpu_mask(sibling, this_leaf->shared_cpu_map) {
       sibling_leaf = CPUID4_INFO_IDX(sibling, index);
       cpu_clear(cpu, sibling_leaf->shared_cpu_map);
   }
}


When a cpu_up (protected by a mutex_lock) is executed, cache_shared_cpu_map_setup() is called.

When a cpu_down (protected by the same mutex) is executed, cache_remove_shared_cpu_map() is
called.

Consider the following example with two CPUs, A and B, which are siblings.

At boot time, the global cpuinfo_x86 cpu_data structs are populated such that A and B
both have the same llc_shared_map. This value, AFAICT, is only added to and never has
elements removed from it.  For most uses, this value is *static* after the initial system boot.


DOWN CPU A

results in cpuid4_info[A] = NULL

DOWN CPU B

results in cpuid4_info[B] = NULL

UP CPU B

sets

cpuid4_info[B] not NULL

and

this_leaf = CPUID4_INFO_IDX(i, index);
this_leaf->shared_cpu_map = c[i].llc_shared_map

for all cpus EXCEPT A (because cpuid4_info[A] == NULL)

DOWN CPU B

in cache_remove_shared_cpu_map()

this_leaf = CPUID4_INFO_IDX(cpu, index);
for_each_cpu_mask(sibling, this_leaf->shared_cpu_map) {

^^^ this_leaf->shared_cpu_map = the static llc_shared_map, includes CPU A.

sibling_leaf = CPUID4_INFO_IDX(sibling, index);

^^^^ when we get to CPU A, sibling_leaf = NULL

cpu_clear(cpu, sibling_leaf->shared_cpu_map);

^^^^^ NULL pointer dereference.

}

I'm not entirely sure what the correct fix is here. I'm somewhat confused about the loop in cache_shared_cpu_map_setup() -- I'm not sure why the code has to examine EVERY CPU in the system -- shouldn't it only look at the cpus in c[cpu].llc_shared_map?

I have attached a patch that seems to resolve the problem (and addresses my concern about the loop in cache_shared_cpu_map_setup()).

Andreas, do you have any other suggestions on a fix here?  I'm obviously more than willing to defer to your opinion. 

-------------------

Additionally, I have been running this patch for a few hours without any panics.

P.
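
The DOWN A, DOWN B, UP B, DOWN B sequence from the email can be replayed in user space. This is a minimal sketch, not kernel code: the two-CPU model, the `struct leaf` type, and the `model_*` and `replay_sequence` names are all hypothetical, and a NULL sibling is reported rather than dereferenced so that the program does not crash.

```c
#include <assert.h>
#include <stddef.h>

#define NCPUS 2                 /* CPU A is 0, CPU B is 1 */

struct leaf {
	unsigned long shared_cpu_map;   /* one bit per CPU */
};

static struct leaf storage[NCPUS];
static struct leaf *cpuid4_info[NCPUS];           /* NULL while a CPU is offline */
static const unsigned long llc_shared_map = 0x3;  /* static after boot: A and B */

/* Models the AMD index-3 chunk of cache_shared_cpu_map_setup(). */
void model_cpu_up(int cpu)
{
	assert(cpu >= 0 && cpu < NCPUS);
	cpuid4_info[cpu] = &storage[cpu];
	for (int i = 0; i < NCPUS; i++) {
		if (cpuid4_info[i] == NULL)
			continue;       /* offline CPUs are skipped, as in the original */
		cpuid4_info[i]->shared_cpu_map = llc_shared_map;
	}
}

/*
 * Models cache_remove_shared_cpu_map(). Returns the first sibling whose
 * leaf pointer is NULL (where the kernel would fault), or -1 if none.
 */
int model_cpu_down(int cpu)
{
	assert(cpu >= 0 && cpu < NCPUS && cpuid4_info[cpu] != NULL);
	struct leaf *this_leaf = cpuid4_info[cpu];
	int faulting_sibling = -1;

	for (int sibling = 0; sibling < NCPUS; sibling++) {
		if (!(this_leaf->shared_cpu_map & (1UL << sibling)))
			continue;
		if (cpuid4_info[sibling] == NULL) {
			if (faulting_sibling < 0)
				faulting_sibling = sibling;
			continue;
		}
		cpuid4_info[sibling]->shared_cpu_map &= ~(1UL << cpu);
	}
	cpuid4_info[cpu] = NULL;
	return faulting_sibling;
}

/* Replays the sequence from the email and returns the faulting sibling. */
int replay_sequence(void)
{
	model_cpu_up(0);        /* boot: both CPUs online */
	model_cpu_up(1);
	model_cpu_down(0);      /* DOWN CPU A */
	model_cpu_down(1);      /* DOWN CPU B */
	model_cpu_up(1);        /* UP CPU B: map again includes offline A */
	return model_cpu_down(1);
}
```

In this model `replay_sequence()` returns 0, meaning the final DOWN of CPU B reaches CPU A through the static map while A's leaf pointer is still NULL.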

Comment 18 Prarit Bhargava 2009-12-04 15:53:09 UTC

Created attachment 376107
Initial patch
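
The attachment contents are not reproduced in this report. Purely as an illustration of the direction raised in comment 17 (walk only the CPUs in the LLC map, and only record siblings that currently have a leaf), here is a user-space sketch. It is not the attached patch. The model, the `sketch_*` names, and the use of a non-NULL `cpuid4_info[]` entry as a stand-in for "online" are all assumptions.

```c
#include <assert.h>
#include <stddef.h>

#define NCPUS 2                 /* CPU A is 0, CPU B is 1 */

struct leaf {
	unsigned long shared_cpu_map;   /* one bit per CPU */
};

static struct leaf storage[NCPUS];
static struct leaf *cpuid4_info[NCPUS];           /* NULL while a CPU is offline */
static const unsigned long llc_shared_map = 0x3;  /* static after boot: A and B */

/* Hypothetical setup: only live LLC siblings are written into each map. */
void sketch_cpu_up(int cpu)
{
	assert(cpu >= 0 && cpu < NCPUS);
	cpuid4_info[cpu] = &storage[cpu];
	storage[cpu].shared_cpu_map = 0;
	for (int i = 0; i < NCPUS; i++) {
		if (!(llc_shared_map & (1UL << i)) || cpuid4_info[i] == NULL)
			continue;
		for (int s = 0; s < NCPUS; s++) {
			if ((llc_shared_map & (1UL << s)) && cpuid4_info[s] != NULL)
				cpuid4_info[i]->shared_cpu_map |= 1UL << s;
		}
	}
}

/* Same removal walk as the original; returns how many NULL siblings it met. */
int sketch_cpu_down(int cpu)
{
	assert(cpu >= 0 && cpu < NCPUS && cpuid4_info[cpu] != NULL);
	struct leaf *this_leaf = cpuid4_info[cpu];
	int null_siblings = 0;

	for (int sibling = 0; sibling < NCPUS; sibling++) {
		if (!(this_leaf->shared_cpu_map & (1UL << sibling)))
			continue;
		if (cpuid4_info[sibling] == NULL) {
			null_siblings++;
			continue;
		}
		cpuid4_info[sibling]->shared_cpu_map &= ~(1UL << cpu);
	}
	cpuid4_info[cpu] = NULL;
	return null_siblings;
}

/* Replays DOWN A, DOWN B, UP B, DOWN B and counts NULL siblings seen. */
int replay_with_sketch(void)
{
	int null_siblings = 0;

	sketch_cpu_up(0);       /* boot: both CPUs online */
	sketch_cpu_up(1);
	null_siblings += sketch_cpu_down(0);
	null_siblings += sketch_cpu_down(1);
	sketch_cpu_up(1);       /* map for B now holds only B */
	null_siblings += sketch_cpu_down(1);
	return null_siblings;
}
```

In this model the same hotplug sequence never reaches an offline sibling, so `replay_with_sketch()` returns 0. Whether the real fix takes this shape cannot be determined from the report alone.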

Comment 19 Prarit Bhargava 2009-12-04 16:27:52 UTC

NOTE: THIS COMMENT MAY LEAD TO ANOTHER BZ BEING OPENED.  I AM NOT SURE IF THE TWO ISSUES ARE RELATED.  AS I SAID PREVIOUSLY, I'M SEEING THIS PANIC AS WELL AS THE PANIC IN COMMENT #1.  THE PANIC BELOW, HOWEVER, SEEMS TO ALSO BE REPRODUCIBLE WITH 164.EL5.

So the first panic, in comment #1, seems to be resolved.  I'm now left with

 at 0000000000000024 RIP: 
 [<ffffffff8843845f>] :powernow_k8:powernowk8_get+0x109/0x152
PGD 0 
Oops: 0000 [1] SMP 
last sysfs file: /class/cpuid/cpu10/dev
CPU 18 
Modules linked in: autofs4(U) hidp(U) nfs(U) fscache(U) nfs_acl(U) rfcomm(U) l2cap(U) bluetooth(U) lockd(U) sunrpc(U) ip_conntrack_netbios_ns(U) ipt_REJECT(U) xt_state(U) ip_conntrack(U) nfnetlink(U) iptable_filter(U) ip_tables(U) ip6t_REJECT(U) xt_tcpudp(U) ip6table_filter(U) ip6_tables(U) x_tables(U) ipv6(U) xfrm_nalgo(U) crypto_api(U) cpufreq_ondemand(U) powernow_k8(U) freq_table(U) dm_multipath(U) scsi_dh(U) video(U) hwmon(U) backlight(U) sbs(U) i2c_ec(U) dell_wmi(U) wmi(U) button(U) battery(U) asus_acpi(U) acpi_memhotplug(U) ac(U) parport_pc(U) lp(U) parport(U) ide_cd(U) i2c_piix4(U) cdrom(U) bnx2(U) shpchp(U) i2c_core(U) amd64_edac_mod(U) hpilo(U) serio_raw(U) pcspkr(U) edac_mc(U) dm_raid45(U) dm_message(U) dm_region_hash(U) dm_mem_cache(U) dm_snapshot(U) dm_zero(U) dm_mirror(U) dm_log(U) dm_mod(U) qla2xxx(U) scsi_transport_fc(U) cciss(U) sd_mod(U) scsi_mod(U) ext3(U) jbd(U) uhci_hcd(U) ohci_hcd(U) ehci_hcd(U)
Pid: 146, comm: events/0 Tainted: G      2.6.18.4 #15
RIP: 0010:[<ffffffff8843845f>]  [<ffffffff8843845f>] :powernow_k8:powernowk8_get+0x109/0x152
RSP: 0018:ffff8103fed3fd50  EFLAGS: 00010202
RAX: 0000000000000020 RBX: ffff8103ffb84cc0 RCX: 00000000c0010063
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8103ffb84cc0
RBP: 0000000000000012 R08: ffff8103fed3e000 R09: 0000000000000001
R10: 00002b04acc08880 R11: ffff8103ffc268e0 R12: 0000000000000000
R13: 0000000000000282 R14: 0000000000000000 R15: ffffffff8006db05
FS:  00002b04ace17710(0000) GS:ffff8103ffd0a2c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000024 CR3: 0000000000201000 CR4: 00000000000006a0
Process events/0 (pid: 146, threadinfo ffff8103fed3e000, task ffff8103fed2a7a0)
Stack:  ffffffffffffffff 0000000000000000 0000000000000000 ffff8103fed2a7a0
 0000000000040000 0000000000000000 0000000000000000 0000000000000000
 0000000000000001 0000000000000000 0000000000000000 0000000000000000
Call Trace:
 [<ffffffff8021d96d>] __cpufreq_get+0x24/0x5e
 [<ffffffff8021e64d>] cpufreq_get+0x28/0x42
 [<ffffffff8006db1b>] handle_cpufreq_delayed_get+0x16/0x39
 [<ffffffff8004d9db>] run_workqueue+0x94/0xe4
 [<ffffffff8004a246>] worker_thread+0x0/0x122
 [<ffffffff8004a336>] worker_thread+0xf0/0x122
 [<ffffffff8008c9ef>] default_wake_function+0x0/0xe
 [<ffffffff80032afe>] kthread+0xfe/0x132
 [<ffffffff8005dfb1>] child_rip+0xa/0x11
 [<ffffffff80032a00>] kthread+0x0/0x132
 [<ffffffff8005dfa7>] child_rip+0x0/0x11


Code: 44 8b 60 04 eb 0e 69 43 2c a0 86 01 00 44 8d a0 00 35 0c 00 
RIP  [<ffffffff8843845f>] :powernow_k8:powernowk8_get+0x109/0x152

P.

Comment 22 Prarit Bhargava 2009-12-08 19:22:11 UTC

Created attachment 376973
RHEL5 fix for this issue

Comment 24 RHEL Program Management 2009-12-08 19:54:41 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 28 Don Zickus 2009-12-09 18:12:27 UTC

in kernel-2.6.18-178.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Please update the appropriate value in the Verified field
(cf_verified) to indicate this fix has been successfully
verified. Include a comment with verification details.

Comment 30 Russell Doty 2009-12-10 00:05:31 UTC

Script developed by Prarit to reproduce this bug:

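#!/bin/bash

# Random walk of CPU hotplug events via sysfs. Must be run as root.
function version2 () {
        for k in $(seq 1 10); do
                for j in $(seq 1 100000); do
                        # Pick a random value in 1..126.
                        i=$(( (RANDOM % 126) + 1 ))
                        if [ $i -lt 64 ]; then
                                # 1..63: take CPU $i offline.
                                echo "OFFLINING CPU $i"
                                echo 0 > /sys/devices/system/cpu/cpu$i/online
                        elif [ $i -gt 64 ]; then
                                # 65..126: bring CPU ($i - 64), i.e. 1..62, back online.
                                i=$(( i - 64 ))
                                echo "ONLINING CPU $i"
                                echo 1 > /sys/devices/system/cpu/cpu$i/online
                        fi
                        # A value of exactly 64 changes nothing.
                        # Stop as soon as the kernel log mentions 'kref'.
                        if dmesg | grep -q 'kref'; then
                                exit 1
                        fi
                done
        done
}

version2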

Comment 31 Russell Doty 2009-12-10 18:43:32 UTC

I believe the 5.4.z flag needs to be set for this to be included in the 15Dec09 Z-stream release.

Comment 33 George Herman 2009-12-11 16:30:50 UTC

Tested the fix. We were able to trigger the issue on a Dinar with the RC1 kernel and a script similar to the one posted in comment #30 of Red Hat bug #541953. It took only a couple of seconds to take the box down. After installing and booting into the RC2 kernel (2.6.18-164.9.1.el5) we ran the script again. The box survived 2 hours of random CPU hotplug. It's safe to assume that this fixed the problem.

Comment 36 errata-xmlrpc 2010-03-30 06:58:08 UTC

An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2010-0178.html
