428331 – booting with maxcpus=1 panics when starting cpufreq service

Summary: booting with maxcpus=1 panics when starting cpufreq service

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Linux 5
Classification:	Red Hat
Component:	kernel
Sub Component:
Version:	5.2
Hardware:	x86_64
OS:	Linux
Priority:	urgent
Severity:	high
Target Milestone:	rc
Target Release:	---
Assignee:	Doug Chapman
QA Contact:	Martin Jenner
Docs Contact:
URL:
Whiteboard:	GSSApproved
Depends On:
Blocks:	429516

Reported:	2008-01-10 21:16 UTC by Doug Chapman
Modified:	2018-10-19 20:15 UTC
CC List:	5 users
Fixed In Version:	RHBA-2008-0314
Doc Type:	Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:	2008-05-21 15:06:13 UTC
Target Upstream Version:
Embargoed:
Dependent Products:

Attachments
patch from upstream (394 bytes, patch) 2008-01-11 15:25 UTC, Doug Chapman

Links
System	ID	Private	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2008:0314	0	normal	SHIPPED_LIVE	Updated kernel packages for Red Hat Enterprise Linux 5.2	2008-05-20 18:43:34 UTC

Description Doug Chapman 2008-01-10 21:16:18 UTC

Description of problem:
This appears to be a regression caused by my patches for BZ 253416.  It is so
far only seen on x86_64 systems that support acpi_cpufreq.

When booting the kdump kernel after a crash, it panics with:


----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at drivers/cpufreq/cpufreq.c:81
invalid opcode: 0000 [1] SMP 
last sysfs file: 
CPU 0 
Modules linked in:
Pid: 1, comm: swapper Not tainted 2.6.18-62.el5 #1
RIP: 0010:[<ffffffff802062b1>]  [<ffffffff802062b1>] lock_policy_rwsem_write+0x23/0x78
RSP: 0000:ffff810008c39e60  EFLAGS: 00010246
RAX: 00000000ffffffff RBX: ffffffff804326a8 RCX: 0000000000000000
RDX: ffffffff8039a080 RSI: 0000000000000282 RDI: 0000000000000001
RBP: 0000000000000001 R08: 0000000000000001 R09: ffff810008c307a0
R10: 0000000000000000 R11: 0000000000000050 R12: ffffffff80310e40
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffffffff8039a000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000006ae6f8 CR3: 0000000001001000 CR4: 00000000000006e0
Process swapper (pid: 1, threadinfo ffff810008c38000, task ffff810008c307a0)
Stack:  ffffffff804326a8 ffffffff804326a8 ffffffff80310e40 ffffffff8020650f
 ffffffff803214a0 ffffffff801ac7a0 00000000ffffffed ffffffff802e74e8
 00000000ffffffed ffffffff802055df 0000000000000000 ffffffff80402ed0
Call Trace:
 [<ffffffff8020650f>] cpufreq_remove_dev+0xb/0x22
 [<ffffffff801ac7a0>] sysdev_driver_unregister+0x4d/0x97
 [<ffffffff802055df>] cpufreq_register_driver+0x130/0x194
 [<ffffffff803d5a5b>] init+0x1f9/0x2f9
 [<ffffffff8005cfb1>] child_rip+0xa/0x11
 [<ffffffff8016a19e>] acpi_ds_init_one_object+0x0/0x80
 [<ffffffff803d5862>] init+0x0/0x2f9
 [<ffffffff8005cfa7>] child_rip+0x0/0x11

Code: 0f 0b 68 2d 28 2b 80 c2 51 00 4c 63 e0 48 c7 c3 30 bc 40 80 
RIP  [<ffffffff802062b1>] lock_policy_rwsem_write+0x23/0x78
 RSP <ffff810008c39e60>
 <0>Kernel panic - not syncing: Fatal exception


Version-Release number of selected component (if applicable):

kernel-2.6.18-62.el5


How reproducible:



Steps to Reproduce:
1. Ensure the system supports cpufreq (service cpuspeed start)
2. Configure kdump
3. Force a crash

  
Actual results:


Expected results:


Additional info:

Comment 1 Doug Chapman 2008-01-10 21:31:19 UTC

Turns out this is specific to booting with maxcpus=1 (which is also how the
kdump kernel boots).  The panic comes from this bit of code:

int lock_policy_rwsem_##mode                                           \
(int cpu)                                                              \
{                                                                      \
        int policy_cpu = per_cpu(policy_cpu, cpu);                      \
        BUG_ON(policy_cpu == -1);                                       \
        down_##mode(&per_cpu(cpu_policy_rwsem, policy_cpu));            \
        if (unlikely(!cpu_online(cpu))) {                               \
                up_##mode(&per_cpu(cpu_policy_rwsem, policy_cpu));      \
                return -1;                                              \
        }                                                               \
                                                                        \
        return 0;                                                       \
}


Since we have CPUs that are not online, it looks like policy_cpu is still -1
for them, which trips the BUG_ON.

I will look at if/how this was fixed upstream.

- Doug
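
To illustrate the failure mode, here is a minimal userspace sketch in C.  It is
not kernel code: NR_CPUS, model_boot, lock_would_bug and model_unregister are
hypothetical names made up for this example, and the assumption that policy_cpu
stays at -1 for CPUs that never came online is a reading of the trace above
(RAX = 0xffffffff, RDI = 1), not something confirmed in this report.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4   /* hypothetical number of present CPUs */

/* Userspace stand-ins for the kernel's per-CPU data (model only). */
static int  policy_cpu[NR_CPUS];
static bool online[NR_CPUS];

/*
 * Mimic booting with maxcpus=N: only the first N CPUs come online.
 * Assumption: only online CPUs get a valid policy_cpu; the rest stay at -1.
 */
static void model_boot(int maxcpus)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        online[cpu] = cpu < maxcpus;
        policy_cpu[cpu] = online[cpu] ? cpu : -1;
    }
}

/*
 * Returns true where the quoted lock_policy_rwsem_write() would hit
 * BUG_ON(policy_cpu == -1).
 */
static bool lock_would_bug(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    return policy_cpu[cpu] == -1;
}

/*
 * Model of the unregister path from the call trace: the remove callback
 * runs for every present CPU, online or not.  Returns the first CPU that
 * would trigger the BUG, or -1 if none would.
 */
static int model_unregister(void)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++)
        if (lock_would_bug(cpu))
            return cpu;
    return -1;
}
```

In this model, calling model_boot(1) and then model_unregister() reports CPU 1
as the first one that would hit the BUG, while model_boot(NR_CPUS) reports none.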

Comment 2 Dave Anderson 2008-01-10 22:28:41 UTC

FYI: I tested both 2.6.18-60.el5 and 2.6.18-61.el5 on dhcp83-53
and verified that the same panic as above did occur.  In my
initial attempts, I was booting with "rhgb", and even though
I clicked "show details", the system just seemed to freeze,
and no panic info was shown.  After removing "rhgb", I get
the same panic data shown above with -61.el5.

Comment 3 Doug Chapman 2008-01-10 23:12:37 UTC

It turns out this panic only happens when it tries to use the "centrino"
cpuspeed driver, which is only available on x86 (which explains why I never saw
this on ia64).  Evidently that driver doesn't initialize things properly when
maxcpus=1.

Looking upstream, it appears CONFIG_X86_SPEEDSTEP_CENTRINO is deprecated, but I
cannot find much discussion as to why.  Fedora 8 no longer uses it, which
explains why I could not reproduce the panic there.

Comment 4 Doug Chapman 2008-01-11 00:11:22 UTC

Found the fix for this upstream.  I will post a RHEL5.1 patch tomorrow:


commit ec28297a562f2b022115b9eb82e4ea724d996240
Author: Venki Pallipadi <venkatesh.pallipadi>
Date:   Mon Mar 26 12:03:19 2007 -0700

    [PATCH] Fix maxcpus=1 trigerring BUG() in cpufreq
    
    Ingo reported it on lkml in the thread
      "2.6.21-rc5: maxcpus=1 crash in cpufreq: kernel BUG at drivers/cpufreq/cpufreq.c:82!"
    
    This check added to remove_dev  is symmetric to one in add_dev and handles
    callbacks for offline cpus cleanly.
    
    Signed-off-by: Venkatesh Pallipadi <venkatesh.pallipadi>
    Acked-by: Ingo Molnar <mingo>
    Signed-off-by: Linus Torvalds <torvalds>
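
The commit message describes a check in remove_dev that mirrors the one in
add_dev.  Below is a minimal userspace sketch of that idea in C, assuming the
check is an early return for offline CPUs.  This is not the upstream patch
text: NR_CPUS, model_boot, model_add_dev and model_remove_dev are hypothetical
names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

#define NR_CPUS 4   /* hypothetical number of present CPUs */

/* Userspace stand-ins for the kernel's per-CPU data (model only). */
static bool online[NR_CPUS];
static int  policy_cpu[NR_CPUS];

/* Model of add_dev: offline CPUs are skipped, so nothing is set up. */
static int model_add_dev(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    if (!online[cpu])
        return 0;
    policy_cpu[cpu] = cpu;
    return 0;
}

/*
 * Model of remove_dev with the symmetric guard (assumed form of the fix):
 * offline CPUs return early, before the lock that checks policy_cpu.
 * Returns -1 where the kernel would have hit the BUG.
 */
static int model_remove_dev(int cpu)
{
    assert(cpu >= 0 && cpu < NR_CPUS);
    if (!online[cpu])
        return 0;               /* nothing was set up for this CPU */
    if (policy_cpu[cpu] == -1)
        return -1;              /* would be BUG_ON in the kernel */
    policy_cpu[cpu] = -1;       /* tear down the policy */
    return 0;
}

/* Mimic booting with maxcpus=N, then running add_dev on every present CPU. */
static void model_boot(int maxcpus)
{
    for (int cpu = 0; cpu < NR_CPUS; cpu++) {
        online[cpu] = cpu < maxcpus;
        policy_cpu[cpu] = -1;
        model_add_dev(cpu);
    }
}
```

With the guard in place, model_boot(1) followed by model_remove_dev() on every
CPU returns 0 each time in this model, because offline CPUs never reach the
policy_cpu check.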

Comment 5 Doug Chapman 2008-01-11 15:25:57 UTC

Created attachment 291398
patch from upstream

This fixes the panic I saw.  Dave is trying it on his system also.  I will post
to rhkernel-list this afternoon assuming Dave's test goes OK.

Comment 6 Dave Anderson 2008-01-11 16:31:18 UTC

(In reply to comment #5)
> Created an attachment (id=291398)
> patch from upstream
> 
> This fixes the panic I saw.  Dave is trying it on his system also.  I will post
> to rhkernel-list this afternoon assuming Dave's test goes OK.
>

Applied the patch above to kernel 2.6.18-65.el5, and kdump now works OK.

ACK!

Comment 7 Doug Chapman 2008-01-11 16:47:27 UTC

Patch posted to rhkernel-list for review.

Comment 8 RHEL Program Management 2008-01-11 17:36:25 UTC

This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 9 Qian Cai 2008-01-16 06:06:46 UTC

I still see this with kernel 2.6.18-68.el5:

Red Hat Enterprise Linux Server release 5.1 (Tikanga)
Kernel 2.6.18-68.el5 on an x86_64

dell-pesc430-03.rhts.boston.redhat.com login: root
Password: 
Login incorrect

login: SysRq : Trigger a crashdump
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
PCI: BIOS Bug: MCFG area at f0000000 is not E820-reserved
PCI: Not using MMCONFIG.
ACPI: Getting cpuindex for acpiid 0x3
ACPI: Getting cpuindex for acpiid 0x4
Mounting proc filesystem
Mounting sysfs filesystem
Creating /dev
Creating initial device nodes
Loading scsi_mod.ko module
Loading sd_mod.ko module
Loading libata.ko module
Loading ata_piix.ko module
Loading jbd.ko module
Loading ext3.ko module
Creating Block Devices
Creating block device ram0
Creating block device ram1
Creating block device ram10
Creating block device ram11
Creating block device ram12
Creating block device ram13
Creating block device ram14
Creating block device ram15
Creating block device ram2
Creating block device ram3
Creating block device ram4
Creating block device ram5
Creating block device ram6
Creating block device ram7
Creating block device ram8
Creating block device ram9
Creating block device sda
Attempting to enter user-space to capture vmcore
Creating root device.
Checking root filesystem.
fsck 1.38 (30-Jun-2005)
fsck: WARNING: couldn't open /etc/fstab: No such file or directory
e2fsck 1.38 (30-Jun-2005)
fsck.ext3: while determining whether /dev/sda1 is mounted.
/: recovering journal
/: clean, 137938/5124480 files, 1211991/5120710 blocks
Mounting root filesystem.
Trying mount -t ext3 /dev/sda1 /sysroot
Using ext3 on root filesystem
Switching to new root and running init.
INIT: version 2.86 booting
                Welcome to Red Hat Enterprise Linux Server
                Press 'I' to enter interactive startup.
Setting clock  (utc): Wed Jan 16 00:57:30 EST 2008 [  OK  ]
Starting udev: [  OK  ]
Loading default keymap (us): [  OK  ]
Setting hostname localhost.localdomain:  [  OK  ]
No devices found
Setting up Logical Volume Management:   No volume groups found
[  OK  ]
Remounting root filesystem in read-write mode:  [  OK  ]
Mounting local filesystems:  [  OK  ]
Enabling local filesystem quotas:  [  OK  ]
Enabling /etc/fstab swaps:  [  OK  ]
INIT: Entering runlevel: 3
Entering non-interactive startup
Applying Intel CPU microcode update: [  OK  ]
Checking for hardware changes [  OK  ]
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at drivers/cpufreq/cpufreq.c:81
invalid opcode: 0000 [1] SMP 
last sysfs file: /devices/pnp0/00:00/id
CPU 0 
Modules linked in: acpi_cpufreq dm_mirror dm_multipath dm_mod video sbs backlight i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev floppy sg tg3 ide_cd i3000_edac uhci_hcd i2c_i801 edac_mc i2c_core ehci_hcd shpchp pcspkr cdrom serio_raw ext3 jbd ata_piix libata sd_mod scsi_mod
Pid: 2123, comm: modprobe Not tainted 2.6.18-68.el5 #1
RIP: 0010:[<ffffffff80206b5c>]  [<ffffffff80206b5c>] lock_policy_rwsem_write+0x23/0x78
RSP: 0000:ffff810006b23df8  EFLAGS: 00010246
RAX: 00000000ffffffff RBX: ffffffff804346a8 RCX: 0000000000000000
RDX: ffffffff8039c080 RSI: 0000000000000292 RDI: 0000000000000001
RBP: 0000000000000001 R08: ffffffff802a32a6 R09: 0000000000000000
R10: ffffffff80420260 R11: 0000000000000000 R12: ffffffff80312e40
R13: ffff810007069dc0 R14: ffffffff882a3200 R15: ffffc200000f9710
FS:  00002aaaaaac5240(0000) GS:ffffffff8039c000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000555569afe908 CR3: 0000000006a24000 CR4: 00000000000006e0
Process modprobe (pid: 2123, threadinfo ffff810006b22000, task ffff810008435860)
Stack:  ffffffff804346a8 ffffffff804346a8 ffffffff80312e40 ffffffff80206dba
 ffffffff803234a0 ffffffff801ad065 00000000ffffffed ffffffff882a3168
 00000000ffffffed ffffffff80205e8a ffffffff882a3200 ffff810007069d58
Call Trace:
 [<ffffffff80206dba>] cpufreq_remove_dev+0xb/0x22
 [<ffffffff801ad065>] sysdev_driver_unregister+0x4d/0x97
 [<ffffffff80205e8a>] cpufreq_register_driver+0x130/0x194
 [<ffffffff800a3500>] sys_init_module+0x16a6/0x1857
 [<ffffffff8005b116>] system_call+0x7e/0x83


Code: 0f 0b 68 c6 3d 2b 80 c2 51 00 4c 63 e0 48 c7 c3 30 dc 40 80 
RIP  [<ffffffff80206b5c>] lock_policy_rwsem_write+0x23/0x78
 RSP <ffff810006b23df8>
 <0>Kernel panic - not syncing: Fatal exception

Comment 10 Qian Cai 2008-01-16 06:11:59 UTC

Sorry, I suppose the patch has not been included yet.

Comment 11 Doug Chapman 2008-01-16 13:55:50 UTC

(In reply to comment #10)
> Sorry, I suppose the patch has not been included yet.

Correct, Don will change the state of this BZ to "MODIFIED" when a kernel has
been built with this patch.

Comment 12 Martin Jenner 2008-01-16 19:15:03 UTC

QE ack for RHEL 5.2

Comment 17 Don Zickus 2008-01-21 17:30:35 UTC

Included in 2.6.18-71.el5.
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 21 errata-xmlrpc 2008-05-21 15:06:13 UTC

An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2008-0314.html
