Bug 467941 - Kernel BUG at drivers/cpufreq/cpufreq_userspace.c:136
Status: CLOSED DUPLICATE of bug 500311
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All Linux
Priority: medium
Severity: high
Target Milestone: rc
Target Release: 5.4
Assigned To: John Villalovos
QA Contact: Martin Jenner
Keywords: Reopened
Depends On: 500311
Blocks: 480792
Reported: 2008-10-21 15:26 EDT by Rafael Garabato
Modified: 2015-05-08 09:59 EDT

Doc Type: Bug Fix
Last Closed: 2009-05-26 09:15:33 EDT

Attachments
Test kernel (16.18 MB, application/octet-stream)
2008-10-29 14:37 EDT, Matthew Garrett
kernel-2.6.18-120.el5dz_test boot log (18.90 KB, text/plain)
2008-10-29 15:31 EDT, Rafael Garabato
Adding _PSS invalidation check (789 bytes, patch)
2008-11-19 21:49 EST, Song, Youquan
Updated patch for the 2.6.18-149 kernel (874 bytes, patch)
2009-05-20 14:48 EDT, John Villalovos

Description Rafael Garabato 2008-10-21 15:26:01 EDT
Description of problem:
I have downloaded and installed kernel-2.6.18-120.el5.gtest.59.x86_64.rpm and kernel-2.6.18-116.el5.gtest.57.x86_64.rpm from http://people.redhat.com/agospoda/#rhel5.

Both packages fail to boot on Intel's Shoffner platform. The kernel panics in __cpufreq_governor: invalid opcode: 0000 [1] SMP



Version-Release number of selected component (if applicable):
kernel-2.6.18-116.el5.gtest.57.x86_64
kernel-2.6.18-120.el5.gtest.59.x86_64


How reproducible:
Always

Steps to Reproduce:
1. Install the packages
2. reboot the computer

  

Actual results:

Version 1.20.1093 Copyright (C) 2005-2007 American Megatrends, Inc.
Press <F2> to enter setup, <F12> Network Boot
Bios Version: S5400.86B.06.00.0026.040820080929
Platform ID:  S5400SF
4 GB system memory found
Current Memory Speed: 667 MT/s (333 MHz)
Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
Intel(R) Xeon(R) CPU           X5355  @ 2.66GHz
Booting from BIOS Partition 0
USB keyboard detected
USB mouse detected



Memory for crash kernel (0x0 to 0x0) notwithin permissible range
Red Hat nash version 5.1.19.6 starting
  Reading all physical volumes.  This may take a while...
  Found volume group "VolGroup00" using metadata type lvm2
  2 logical volume(s) in volume group "VolGroup00" now active
                Welcome to Red Hat Enterprise Linux Server
                Press 'I' to enter interactive startup.
Setting clock  (utc): Tue Oct 21 16:45:51 ARST 2008 [  OK  ]
Starting udev: [  OK  ]
Loading default keymap (us): [  OK  ]
Setting hostname rhhpcsf.intel.com:  [  OK  ]
Setting up Logical Volume Management:   2 logical volume(s) in volume group "VolGroup00" now active
[  OK  ]
Checking filesystems
Checking all file systems.
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/VolGroup00/LogVol00
/dev/VolGroup00/LogVol00: clean, 148762/60522496 files, 4355939/60514304 blocks
[/sbin/fsck.ext3 (1) -- /boot] fsck.ext3 -a /dev/sda1
/boot: recovering journal
/boot: clean, 39/26104 files, 21917/104388 blocks
[  OK  ]
Remounting root filesystem in read-write mode:  [  OK  ]
Mounting local filesystems:  [  OK  ]
Enabling local filesystem quotas:  [  OK  ]
Enabling /etc/fstab swaps:  [  OK  ]
INIT: Entering runlevel: 3
Entering non-interactive startup
Applying Intel CPU microcode update: [  OK  ]
Starting monitoring for VG VolGroup00:   /dev/hdb: open failed: Read-only file system
  2 logical volume(s) in volume group "VolGroup00" monitored
[  OK  ]
Starting background readahead: [  OK  ]
Checking for hardware changes [  OK  ]
Loading OpenIB kernel modules:[  OK  ]
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at drivers/cpufreq/cpufreq_userspace.c:136
invalid opcode: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/class
CPU 0
Modules linked in: acpi_cpufreq ib_iser libiscsi scsi_transport_iscsi ib_srp ib_sdp ib_ipoib ipv6 xfrm_nalgo crypto_api rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa dm_multipath scsi_dh video backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sr_mod ide_cd ib_mthca cdrom shpchp i2c_i801 ib_mad i2c_core ib_core sg serio_raw e1000e pcspkr dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 12673, comm: modprobe Not tainted 2.6.18-116.el5.gtest.57 #1
RIP: 0010:[<ffffffff80215a14>]  [<ffffffff80215a14>] cpufreq_governor_userspace+0x44/0x205
RSP: 0000:ffff810174c85ca8  EFLAGS: 00010246
RAX: 00000000ffffffff RBX: ffff81017e220400 RCX: 0000000000000000
RDX: 00000000ffffffea RSI: 0000000000000000 RDI: ffff81017e220400
RBP: ffff81017e220400 R08: 0000000000000001 R09: 0000000000000000
R10: ffff81017e220400 R11: 0000000000000058 R12: 0000000000000000
R13: 0000000000000000 R14: ffffffff80450688 R15: 0000000000000000
FS:  00002aaaaaac7240(0000) GS:ffffffff803b8000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000062e19f CR3: 0000000176504000 CR4: 00000000000006e0
Process modprobe (pid: 12673, threadinfo ffff810174c84000, task ffff81017c588860)
Stack:  ffff81017e220400 0000000000000001 0000000000000000 ffffffff802142c7
 ffff810174c85d48 ffff81017e220400 0000000000000000 ffffffff802144cd
 0000000000000000 ffff81017e220400 ffff810174c85d48 ffffffff8033d300
Call Trace:
 [<ffffffff802142c7>] __cpufreq_governor+0x6a/0xf6
 [<ffffffff802144cd>] __cpufreq_set_policy+0x17a/0x1f4
 [<ffffffff80214f40>] cpufreq_set_policy+0x33/0x7c
 [<ffffffff80215410>] cpufreq_add_dev+0x435/0x57f
 [<ffffffff80214ee5>] handle_update+0x0/0x28
 [<ffffffff801b66cc>] sysdev_driver_register+0x61/0xbd
 [<ffffffff80214182>] cpufreq_register_driver+0xb9/0x194
 [<ffffffff800a4d15>] sys_init_module+0xaf/0x1e8
 [<ffffffff8005d116>] system_call+0x7e/0x83


Code: 0f 0b 68 89 64 2c 80 c2 88 00 44 89 e3 48 c7 c7 c0 d7 33 80
RIP  [<ffffffff80215a14>] cpufreq_governor_userspace+0x44/0x205
 RSP <ffff810174c85ca8>
 <0>Kernel panic - not syncing: Fatal exception



Expected results:
Boot succeeds

Additional info:

I also have the Red Hat HPC bits downloaded. The bug occurs right after the system loads the OpenIB kernel modules.
Comment 1 John Villalovos 2008-10-23 10:47:52 EDT
What are the results with the RHEL 5.3 Alpha?
Comment 2 Rafael Garabato 2008-10-23 10:56:08 EDT
That would be good to try. I will let you know once I have tried it. However, the problem could be caused by having the HPC solution installed together with the new kernel.

Now I have a clean install of 5.2 without Red Hat HPC so I will try the new kernel and see what happens.
Then I will try Red Hat 5.3 Alpha.

Rafael.
Comment 3 John Villalovos 2008-10-23 10:57:20 EDT
I'm curious why you are pulling from Gospo's kernel and not
http://people.redhat.com/dzickus/el5/ if you are pulling a development kernel.

Also, am I correct in saying that the Shoffner platform is the Dual Socket High
Performance Compute Platform?
Comment 4 Matthew Garrett 2008-10-23 11:48:27 EDT
I'm building a kernel with extra debugging to give some extra information on this.
Comment 5 Rafael Garabato 2008-10-23 12:26:03 EDT
It is the Dual Socket HPC platform.

I have installed the kernel on Red Hat EL 5.2 without Red Hat HPC software and it also hangs in the same way. 

I will now try Red Hat EL 5.3.
Comment 6 Rafael Garabato 2008-10-24 14:24:49 EDT
Red Hat EL 5.3 Alpha is also crashing in the same way.
Comment 7 John Villalovos 2008-10-24 14:29:36 EDT
Youquan,

Since you are our cpufreq expert, can you take a look at this and work with Rafael?

Rafael, Youquan is in Beijing, so it might be a couple of days before he responds.  I don't think we have any SDVs for that platform, so we will have to do some remote debugging.
Comment 8 Song, Youquan 2008-10-27 23:47:38 EDT
Sorry, I am just back in the office from vacation.
Rafael, could you provide remote access to the machine, since I do not have such an SDV? Then I can try some remote debugging.
Comment 9 Rafael Garabato 2008-10-29 09:45:11 EDT
Where can I get the sources for the Red Hat 5.3 Alpha kernel (2.6.18-118.el5)?

I couldn't find them on the DVD.
Comment 10 Matthew Garrett 2008-10-29 14:37:12 EDT
Created attachment 321843
Test kernel

Can you provide dmesg output when booting with the attached kernel?
Comment 11 Rafael Garabato 2008-10-29 15:31:10 EDT
Created attachment 321858
 kernel-2.6.18-120.el5dz_test boot log
Comment 12 Rafael Garabato 2008-10-29 15:32:45 EDT
The test kernel from Comment #10 doesn't boot. It crashes during USB initialization, before it reaches the point of the cpufreq crash.

See logs: https://bugzilla.redhat.com/attachment.cgi?id=321858
Comment 13 Matthew Garrett 2008-10-29 15:44:49 EDT
Are you able to test without any USB input devices plugged in? This seems to be triggered by building RHEL kernels on Rawhide systems, for some reason.
Comment 14 Song, Youquan 2008-11-05 09:57:55 EST
According to Rafael, the issue can be solved by enabling EIST (Enhanced Intel SpeedStep Technology) in the BIOS.
Rafael, can you build an upstream kernel and check whether the issue occurs when EIST is disabled?
Comment 15 Rafael Garabato 2008-11-05 13:21:54 EST
By upstream kernel, do you mean the latest kernel from kernel.org?
I thought that there was a patch for this under development, isn't there?
Comment 16 Song, Youquan 2008-11-05 21:59:34 EST
Yes, you can try it with the 2.6.27 kernel.  I want to find out whether the upstream kernel can handle this kind of BIOS issue.  If upstream can handle it, we can try to backport the patch for RHEL 5.3. If not, we can ask the upstream developers for help.
Comment 17 Rafael Garabato 2008-11-06 09:56:21 EST
No luck with the upstream kernel. Seems to be the same issue although the behaviour is not the same.


Linux version 2.6.27.4 (root@redhat53alpha.intel.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #2 SMP Thu Nov 6 10:08:15 ARST 2008
Command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS1,115200n8
.
.
.
------------[ cut here ]------------
kernel BUG at drivers/cpufreq/cpufreq_userspace.c:122!
invalid opcode: 0000 [1] SMP
CPU 0
Modules linked in: acpi_cpufreq(+) dm_multipath scsi_dh sbs sbshc battery acpi_memhotplug ac parport_pc lp parport e1000e joydev sr_mod mlx4_core shpchp sg rtc_cmos button rtc_core pcspkr rtc_lib ide_cd_mod cdrom serio_raw i2c_i801 i2c_core dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 3532, comm: modprobe Not tainted 2.6.27.4 #2
RIP: 0010:[<ffffffff803f6617>]  [<ffffffff803f6617>] cpufreq_governor_userspace+0x4a/0x31f
RSP: 0018:ffff880175d65ca8  EFLAGS: 00010246
RAX: 00000000ffffffff RBX: ffff88017d8de200 RCX: 0000000000000000
RDX: 00000000ffffffea RSI: 0000000000000000 RDI: ffff88017d8de200
RBP: ffff88017d8de200 R08: 0000000000000001 R09: 0000000000000000
R10: ffffffff805d41c0 R11: ffff880175d65c68 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS:  00007f4949a7d6e0(0000) GS:ffffffff80717a80(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fff75a69130 CR3: 0000000175d75000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 3532, threadinfo ffff880175d64000, task ffff88017b6f6400)
Stack:  0000000000000000 0000000000000000 0000000000000292 ffff88017d8de200
 ffff88017d8de200 0000000000000001 0000000000000000 ffffffff803f4d35
 ffff880175d65d58 ffff88017d8de200 0000000000000000 ffffffff803f4eed
Call Trace:
 [<ffffffff803f4d35>] ? __cpufreq_governor+0x91/0xc8
 [<ffffffff803f4eed>] ? __cpufreq_set_policy+0x181/0x1fb
 [<ffffffff803f5e75>] ? cpufreq_add_dev+0x4c7/0x5f4
 [<ffffffff803f5934>] ? handle_update+0x0/0x28
 [<ffffffff8032432a>] ? __next_cpu_nr+0x1a/0x21
 [<ffffffff803a20a6>] ? sysdev_driver_register+0xa4/0x100
 [<ffffffff803f4bae>] ? cpufreq_register_driver+0xbb/0x1b1
 [<ffffffffa019b000>] ? acpi_cpufreq_init+0x0/0x90 [acpi_cpufreq]
 [<ffffffff80209041>] ? _stext+0x41/0x110
 [<ffffffff80257671>] ? sys_init_module+0x9e/0x1ad
 [<ffffffff8020be6b>] ? system_call_fastpath+0x16/0x1b


Code: f2 01 00 00 31 d2 ff ce 0f 85 e5 02 00 00 44 0f a3 25 1e 17 32 00 19 c0 85 c0 ba ea ff ff ff 0f 84 ce 02 00 00 83 7f 5c 00 75 04 <0f> 0b eb fe 48 c7 c7 d0 45 5d 80 e8 d0 6f 09 00 83 3d 02 aa 48
RIP  [<ffffffff803f6617>] cpufreq_governor_userspace+0x4a/0x31f
 RSP <ffff880175d65ca8>
---[ end trace 2b1230b4297c4a04 ]---
/etc/rc3.d/S06cpuspeed: line 112:  3532 Segmentation fault      /sbin/modprobe acpi-cpufreq 2> /dev/null
Comment 18 Song, Youquan 2008-11-06 22:20:50 EST
OK, understood. I will push upstream to fix this kind of BIOS issue.
Comment 19 Rafael Garabato 2008-11-19 07:29:51 EST
Which version will contain the fixes?
I know there is a RHEL5.3-Snap3 available. Does this version contain the fix?
Comment 20 Matthew Garrett 2008-11-19 07:40:44 EST
There's currently no patch or fix. Have you been able to test the kernel mentioned earlier?
Comment 21 Rafael Garabato 2008-11-19 07:51:28 EST
I tested a fix provided by Youquan on November 11 (not logged here) with successful results. But I don't know the status of this fix in Red Hat.
Comment 22 Matthew Garrett 2008-11-19 08:05:10 EST
The change doesn't appear to have been submitted for the Red Hat kernel.
Comment 23 Rafael Garabato 2008-11-19 08:53:06 EST
Youquan, can you submit the fix?
Comment 24 Song, Youquan 2008-11-19 21:49:02 EST
Created attachment 324133
Adding _PSS invalidation check
Comment 25 Song, Youquan 2008-12-05 00:11:04 EST
What's the status of the bug?
Comment 26 Rafael Garabato 2008-12-05 09:34:01 EST
Was the patch included in RH EL 5.3 GA Snapshot 4 or before?
Comment 27 Rafael Garabato 2008-12-16 09:41:30 EST
What is the status of this bug? Was the patch included in any of the 5.3 snapshots? If not, will it be included later?
Comment 28 Song, Youquan 2008-12-16 10:40:09 EST
No, RHEL 5.3 does not include the patch. We can only hope to include it in RHEL 5.4.
The patch is included in the upstream -mm tree now.
Comment 29 Rafael Garabato 2008-12-29 13:28:11 EST
The bug still reproduces in RHEL 5.3 RC1.
This comment is just for the record and tracking purposes.
Comment 30 Ronald Pacheco 2009-01-05 13:08:15 EST
Rafael,

Per comment 28, the code is not in Linus' kernel yet.  We will target RHEL 5.4, assuming the code is upstream by the time we code freeze.
Comment 31 Keve Gabbert 2009-01-05 18:52:53 EST
please update subject with "RHEL 5.4"
Comment 32 John Villalovos 2009-01-06 13:12:24 EST
Youquan,

Is it possible to work around this bug via a kernel command line argument?
Comment 33 Song, Youquan 2009-01-07 02:59:34 EST
There is no kernel command line option to work around it in this situation.
Comment 34 John Villalovos 2009-01-20 21:09:45 EST
Rafael,

The upstream maintainer of ACPI (Len Brown) rejected the patch because he believes that this issue is a BIOS bug.  Can you try to have the BIOS team investigate this issue and fix it?
Comment 35 Ronald Pacheco 2009-01-20 21:43:38 EST
John and Rafael,

Based on comment 34, I am closing this as "not a bug".
Comment 36 Len Brown 2009-03-15 23:19:28 EDT
The patch in comment #24 fixes the problem at hand,
and will not break anything else.  So I'd not lose
any sleep about putting it as a workaround into a distro
until a better patch is available.

But the reason that I will not accept that patch upstream
is that it is checking for a random bit pattern to determine
that p-states are disabled by the BIOS.  This bit pattern
is completely arbitrary.

If the BIOS is going to throw random bit patterns at us,
then we need to either
1. be smarter about sanity checking them -- checking
   one bit and not others, why?

or better...
2. harden linux to handle total garbage in that field.
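
As a rough illustration of option 2 above, the following C sketch rejects a P-state table that contains implausible values, rather than matching one specific bit pattern. Everything in it is hypothetical: `struct pss_entry`, `PSS_MAX_PLAUSIBLE_MHZ`, and `pss_table_is_sane()` are illustrative names, not the kernel's ACPI structures and not the patch from comment #24. The descending-frequency check is also an assumption about how a well-formed table is ordered.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical upper bound on a believable core frequency, in MHz. */
#define PSS_MAX_PLAUSIBLE_MHZ 10000u

/* Hypothetical, simplified view of one _PSS entry (not the kernel's struct). */
struct pss_entry {
    uint64_t core_frequency_mhz;
    uint64_t power_mw;
    uint64_t control;
    uint64_t status;
};

/*
 * Return true only if every entry looks usable:
 *  - the table is non-empty,
 *  - each frequency is non-zero and below the plausibility bound,
 *  - frequencies do not increase from one entry to the next
 *    (assumed ordering: highest-performance state first).
 */
static bool pss_table_is_sane(const struct pss_entry *table, size_t count)
{
    if (table == NULL || count == 0)
        return false;

    for (size_t i = 0; i < count; i++) {
        uint64_t mhz = table[i].core_frequency_mhz;

        if (mhz == 0 || mhz > PSS_MAX_PLAUSIBLE_MHZ)
            return false;
        if (i > 0 && mhz > table[i - 1].core_frequency_mhz)
            return false;
    }
    return true;
}
```

A driver could run a check of this kind before registering with cpufreq and decline to load when it fails, so that a governor is never started on unusable data.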
Comment 37 Len Brown 2009-05-20 12:23:52 EDT
re-opening, as there is no indication that this issue has gone away,
either with a BIOS upgrade or a kernel patch.
Comment 38 Len Brown 2009-05-20 12:25:46 EDT
Rafael,
Please make sure that the board is running a production BIOS,
and then please attach the output from acpidump and dmidecode
Comment 39 John Villalovos 2009-05-20 14:48:32 EDT
Created attachment 344864
Updated patch for the 2.6.18-149 kernel
Comment 40 John Villalovos 2009-05-20 14:50:02 EDT
This bug seems to be the same as Bug 500311
Comment 41 Rafael Garabato 2009-05-22 09:23:36 EDT
(In reply to comment #38)
> Rafael,
> Please make sure that the board is running a production BIOS,
> and then please attach the output from acpidump and dmidecode  

The BIOS I was using was a production BIOS available at www.intel.com. I am not sure what your concern is.
Comment 42 John Villalovos 2009-05-22 09:31:45 EDT
Rafael,

Do you know what version of the BIOS you are running?

Could we get the output of dmidecode please?

Also is it possible to get a copy of the acpidump output?  You can get the pmtools RPM in Bug 500311 and acpidump is inside that package.  You can use that one if you like or:

The pmtools source can be found at:
http://www.lesswatts.org/projects/acpi/utilities.php

I was able to build the Fedora SRPM on my RHEL5 build root:
ftp://mirrors.kernel.org/fedora/releases/10/Everything/source/SRPMS/pmtools-20071116-1.fc9.src.rpm
Comment 43 Rafael Garabato 2009-05-22 09:33:22 EDT
(In reply to comment #37)
> re-opening, as there is no indication that this issue has gone away,
> either with a BIOS upgrade or a kernel patch.  

Update:
We have successfully installed Red Hat 5.3 on this board with the latest BIOS release. I don't know exactly which BIOS version or kernel version resolved this issue, but the system works correctly with the latest versions.
Comment 44 John Villalovos 2009-05-22 09:41:35 EDT
Rafael,

Thanks for the info.  Glad to hear the BIOS update fixed it for you.
Comment 45 Ronald Pacheco 2009-05-22 16:06:40 EDT
John,

Based on comments #33 and #34, it appears that we should close this as notabug.  If you concur, please close.  If not, then please provide an update as to what the problem is and how to reproduce it.  Thanks!
Comment 46 John Villalovos 2009-05-22 16:24:01 EDT
Ron,

I believe this is a bug.  This and Bug 500311 seem to be the same bug.  So we could mark this as a duplicate of that bug if desired.

Customers in the field are seeing this issue.
Comment 47 Ronald Pacheco 2009-05-26 08:37:51 EDT
John,

If it's a dup, then please close this as a dup.
Comment 48 John Villalovos 2009-05-26 09:15:33 EDT

*** This bug has been marked as a duplicate of bug 500311 ***
