Description of problem:
I have downloaded and installed kernel-2.6.18-120.el5.gtest.59.x86_64.rpm and kernel-2.6.18-116.el5.gtest.57.x86_64.rpm from http://people.redhat.com/agospoda/#rhel5. Both packages failed to work on Intel's Shoffner platform. There is a kernel panic at __cpufreq_governor: invalid opcode: 0000 [1] SMP

Version-Release number of selected component (if applicable):
kernel-2.6.18-116.el5.gtest.57.x86_64
kernel-2.6.18-120.el5.gtest.59.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Install the packages
2. Reboot the computer
3.

Actual results:
Version 1.20.1093 Copyright (C) 2005-2007 American Megatrends, Inc.
Press <F2> to enter setup, <F12> Network Boot
Bios Version: S5400.86B.06.00.0026.040820080929
Platform ID: S5400SF
4 GB system memory found
Current Memory Speed: 667 MT/s (333 MHz)
Intel(R) Xeon(R) CPU X5355 @ 2.66GHz
Intel(R) Xeon(R) CPU X5355 @ 2.66GHz
Booting from BIOS Partition 0
USB keyboard detected
USB mouse detected
Memory for crash kernel (0x0 to 0x0) notwithin permissible range
Red Hat nash version 5.1.19.6 starting
Reading all physical volumes. This may take a while...
Found volume group "VolGroup00" using metadata type lvm2
2 logical volume(s) in volume group "VolGroup00" now active
Welcome to Red Hat Enterprise Linux Server
Press 'I' to enter interactive startup.
Setting clock (utc): Tue Oct 21 16:45:51 ARST 2008 [ OK ]
Starting udev: [ OK ]
Loading default keymap (us): [ OK ]
Setting hostname rhhpcsf.intel.com: [ OK ]
Setting up Logical Volume Management: 2 logical volume(s) in volume group "VolGroup00" now active [ OK ]
Checking filesystems
Checking all file systems.
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/VolGroup00/LogVol00
/dev/VolGroup00/LogVol00: clean, 148762/60522496 files, 4355939/60514304 blocks
[/sbin/fsck.ext3 (1) -- /boot] fsck.ext3 -a /dev/sda1
/boot: recovering journal
/boot: clean, 39/26104 files, 21917/104388 blocks
[ OK ]
Remounting root filesystem in read-write mode: [ OK ]
Mounting local filesystems: [ OK ]
Enabling local filesystem quotas: [ OK ]
Enabling /etc/fstab swaps: [ OK ]
INIT: Entering runlevel: 3
Entering non-interactive startup
Applying Intel CPU microcode update: [ OK ]
Starting monitoring for VG VolGroup00: /dev/hdb: open failed: Read-only file system
2 logical volume(s) in volume group "VolGroup00" monitored [ OK ]
Starting background readahead: [ OK ]
Checking for hardware changes [ OK ]
Loading OpenIB kernel modules:[ OK ]
----------- [cut here ] --------- [please bite here ] ---------
Kernel BUG at drivers/cpufreq/cpufreq_userspace.c:136
invalid opcode: 0000 [1] SMP
last sysfs file: /devices/pci0000:00/0000:00:00.0/class
CPU 0
Modules linked in: acpi_cpufreq ib_iser libiscsi scsi_transport_iscsi ib_srp ib_sdp ib_ipoib ipv6 xfrm_nalgo crypto_api rdma_ucm ib_ucm ib_uverbs ib_umad rdma_cm ib_cm iw_cm ib_addr ib_sa dm_multipath scsi_dh video backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev sr_mod ide_cd ib_mthca cdrom shpchp i2c_i801 ib_mad i2c_core ib_core sg serio_raw e1000e pcspkr dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 12673, comm: modprobe Not tainted 2.6.18-116.el5.gtest.57 #1
RIP: 0010:[<ffffffff80215a14>] [<ffffffff80215a14>] cpufreq_governor_userspace+0x44/0x205
RSP: 0000:ffff810174c85ca8 EFLAGS: 00010246
RAX: 00000000ffffffff RBX: ffff81017e220400 RCX: 0000000000000000
RDX: 00000000ffffffea RSI: 0000000000000000 RDI: ffff81017e220400
RBP: ffff81017e220400 R08: 0000000000000001 R09: 0000000000000000
R10: ffff81017e220400 R11: 0000000000000058 R12: 0000000000000000
R13: 0000000000000000 R14: ffffffff80450688 R15: 0000000000000000
FS: 00002aaaaaac7240(0000) GS:ffffffff803b8000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000062e19f CR3: 0000000176504000 CR4: 00000000000006e0
Process modprobe (pid: 12673, threadinfo ffff810174c84000, task ffff81017c588860)
Stack: ffff81017e220400 0000000000000001 0000000000000000 ffffffff802142c7
 ffff810174c85d48 ffff81017e220400 0000000000000000 ffffffff802144cd
 0000000000000000 ffff81017e220400 ffff810174c85d48 ffffffff8033d300
Call Trace:
 [<ffffffff802142c7>] __cpufreq_governor+0x6a/0xf6
 [<ffffffff802144cd>] __cpufreq_set_policy+0x17a/0x1f4
 [<ffffffff80214f40>] cpufreq_set_policy+0x33/0x7c
 [<ffffffff80215410>] cpufreq_add_dev+0x435/0x57f
 [<ffffffff80214ee5>] handle_update+0x0/0x28
 [<ffffffff801b66cc>] sysdev_driver_register+0x61/0xbd
 [<ffffffff80214182>] cpufreq_register_driver+0xb9/0x194
 [<ffffffff800a4d15>] sys_init_module+0xaf/0x1e8
 [<ffffffff8005d116>] system_call+0x7e/0x83
Code: 0f 0b 68 89 64 2c 80 c2 88 00 44 89 e3 48 c7 c7 c0 d7 33 80
RIP [<ffffffff80215a14>] cpufreq_governor_userspace+0x44/0x205
RSP <ffff810174c85ca8>
<0>Kernel panic - not syncing: Fatal exception

Expected results:
Boot succeeds

Additional info:
I also have the Red Hat HPC bits downloaded. Actually, the bug occurs after the system has loaded the OpenIB kernel modules.
What are the results with the RHEL 5.3 Alpha?
That would be good to try. I will let you know once I have tried it. However, it could be a problem of having the HPC solution installed together with the new kernel. Now I have a clean install of 5.2 without Red Hat HPC, so I will try the new kernel and see what happens. Then I will try Red Hat 5.3 Alpha. Rafael.
I'm curious why you are pulling from Gospo's kernel and not http://people.redhat.com/dzickus/el5/ if you are pulling a development kernel. Also, am I correct in saying that the Shoffner platform is the Dual Socket High Performance Compute Platform?
I'm building a kernel with extra debugging to give some extra information on this.
It is the Dual Socket HPC platform. I have installed the kernel on Red Hat EL 5.2 without Red Hat HPC software and it also hangs in the same way. I will now try Red Hat EL 5.3.
Red Hat EL 5.3 Alpha is also crashing in the same way.
Youquan, Since you are our cpufreq expert, can you take a look at this and work with Rafael? Rafael, Youquan is in Beijing, so it might be a couple of days before he responds. I don't think we have any SDVs for that platform, so we will have to do some remote debugging.
Sorry, I am just back in the office from vacation. Rafael, could you provide remote access to the machine, since I do not have such an SDV? Then I can try to do some remote debugging.
Where can I get the sources for the Red Hat 5.3 alpha kernel (2.6.18-118.el5)? I couldn't find them on the DVD.
Created attachment 321843 [details] Test kernel
Can you provide dmesg output when booting with the attached kernel?
Created attachment 321858 [details] kernel-2.6.18-120.el5dz_test boot log
The kernel from Comment #11 doesn't boot. It crashes during USB initialization, before it reaches the cpufreq crash. See logs: https://bugzilla.redhat.com/attachment.cgi?id=321858
Are you able to test without any USB input devices plugged in? This seems to be triggered by building RHEL kernels on Rawhide systems, for some reason.
From Rafael's information, the issue can be solved by enabling EIST (Enhanced Intel SpeedStep Technology) in the BIOS. Rafael, can you build an upstream kernel to check whether the issue occurs when EIST is disabled?
By upstream kernel, do you mean the latest kernel from kernel.org? I thought that there was a patch for this under development, isn't there?
Yes, you can try it with the 2.6.27 kernel. I want to find out whether the upstream kernel can handle this kind of BIOS issue. If upstream can handle it, we can try to backport the patch for RHEL 5.3. If not, we can ask the upstream developers for help.
No luck with the upstream kernel. It seems to be the same issue, although the behaviour is not the same.

Linux version 2.6.27.4 (root.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-44)) #2 SMP Thu Nov 6 10:08:15 ARST 2008
Command line: ro root=/dev/VolGroup00/LogVol00 console=ttyS1,115200n8
.
.
.
------------[ cut here ]------------
kernel BUG at drivers/cpufreq/cpufreq_userspace.c:122!
invalid opcode: 0000 [1] SMP
CPU 0
Modules linked in: acpi_cpufreq(+) dm_multipath scsi_dh sbs sbshc battery acpi_memhotplug ac parport_pc lp parport e1000e joydev sr_mod mlx4_core shpchp sg rtc_cmos button rtc_core pcspkr rtc_lib ide_cd_mod cdrom serio_raw i2c_i801 i2c_core dm_snapshot dm_zero dm_mirror dm_log dm_mod usb_storage ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 3532, comm: modprobe Not tainted 2.6.27.4 #2
RIP: 0010:[<ffffffff803f6617>] [<ffffffff803f6617>] cpufreq_governor_userspace+0x4a/0x31f
RSP: 0018:ffff880175d65ca8 EFLAGS: 00010246
RAX: 00000000ffffffff RBX: ffff88017d8de200 RCX: 0000000000000000
RDX: 00000000ffffffea RSI: 0000000000000000 RDI: ffff88017d8de200
RBP: ffff88017d8de200 R08: 0000000000000001 R09: 0000000000000000
R10: ffffffff805d41c0 R11: ffff880175d65c68 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
FS: 00007f4949a7d6e0(0000) GS:ffffffff80717a80(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fff75a69130 CR3: 0000000175d75000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process modprobe (pid: 3532, threadinfo ffff880175d64000, task ffff88017b6f6400)
Stack: 0000000000000000 0000000000000000 0000000000000292 ffff88017d8de200
 ffff88017d8de200 0000000000000001 0000000000000000 ffffffff803f4d35
 ffff880175d65d58 ffff88017d8de200 0000000000000000 ffffffff803f4eed
Call Trace:
 [<ffffffff803f4d35>] ? __cpufreq_governor+0x91/0xc8
 [<ffffffff803f4eed>] ? __cpufreq_set_policy+0x181/0x1fb
 [<ffffffff803f5e75>] ? cpufreq_add_dev+0x4c7/0x5f4
 [<ffffffff803f5934>] ? handle_update+0x0/0x28
 [<ffffffff8032432a>] ? __next_cpu_nr+0x1a/0x21
 [<ffffffff803a20a6>] ? sysdev_driver_register+0xa4/0x100
 [<ffffffff803f4bae>] ? cpufreq_register_driver+0xbb/0x1b1
 [<ffffffffa019b000>] ? acpi_cpufreq_init+0x0/0x90 [acpi_cpufreq]
 [<ffffffff80209041>] ? _stext+0x41/0x110
 [<ffffffff80257671>] ? sys_init_module+0x9e/0x1ad
 [<ffffffff8020be6b>] ? system_call_fastpath+0x16/0x1b
Code: f2 01 00 00 31 d2 ff ce 0f 85 e5 02 00 00 44 0f a3 25 1e 17 32 00 19 c0 85 c0 ba ea ff ff ff 0f 84 ce 02 00 00 83 7f 5c 00 75 04 <0f> 0b eb fe 48 c7 c7 d0 45 5d 80 e8 d0 6f 09 00 83 3d 02 aa 48
RIP [<ffffffff803f6617>] cpufreq_governor_userspace+0x4a/0x31f
RSP <ffff880175d65ca8>
---[ end trace 2b1230b4297c4a04 ]---
/etc/rc3.d/S06cpuspeed: line 112: 3532 Segmentation fault /sbin/modprobe acpi-cpufreq 2> /dev/null
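For comparing traces like the two posted so far, a small script can pull out the BUG location and the faulting symbol. This is only a sketch: the function name `summarize_oops` and the two regular expressions are illustrative assumptions written against the log text in this report, not part of any kernel tooling.

```python
import re

# Matches both spellings seen in this report, for example:
#   "Kernel BUG at drivers/cpufreq/cpufreq_userspace.c:136"
#   "kernel BUG at drivers/cpufreq/cpufreq_userspace.c:122!"
BUG_RE = re.compile(r"[Kk]ernel BUG at (?P<file>\S+?):(?P<line>\d+)")

# Matches the summary line near the end of the oops, for example:
#   "RIP [<ffffffff803f6617>] cpufreq_governor_userspace+0x4a/0x31f"
# The earlier "RIP: 0010:[<...>]" register line is skipped on purpose.
RIP_RE = re.compile(
    r"RIP\s+\[<(?P<addr>[0-9a-f]+)>\]\s+"
    r"(?P<symbol>\w+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)


def summarize_oops(text):
    """Return the BUG file/line and faulting symbol found in an oops log.

    Fields that cannot be found are returned as None.
    """
    bug = BUG_RE.search(text)
    rip = RIP_RE.search(text)
    return {
        "file": bug.group("file") if bug else None,
        "line": int(bug.group("line")) if bug else None,
        "symbol": rip.group("symbol") if rip else None,
        "offset": rip.group("offset") if rip else None,
    }
```

Run against both logs in this report, it should show the same faulting function, cpufreq_governor_userspace, with a different source line in each kernel version.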
OK, understood. I will push upstream to fix this kind of BIOS issue.
Which version will contain the fixes? I know there is a RHEL5.3-Snap3 available. Does this version contain the fix?
There's currently no patch or fix. Have you been able to test the kernel mentioned earlier?
I tested a fix provided by Youquan on November 11 (not logged here) with successful results. But I don't know the status of this fix in Red Hat.
The change doesn't appear to have been submitted for the Red Hat kernel.
Youquan, can you submit the fix?
Created attachment 324133 [details] Adding _PSS invalidation check
What's the status of the bug?
Was the patch included in RH EL 5.3 GA Snapshot 4 or before?
What is the status of this bug? Was the fix included in any of the 5.3 snapshots? If not, will it be included later?
No, RHEL 5.3 does not include the patch. We can only hope to include it in RHEL 5.4. The patch is now in the upstream -mm tree.
The bug still reproduces in RH EL 5.3 RC1. This comment is just for the record and tracking purposes.
Rafael, Per comment 28, the code is not in Linus' kernel yet. We will target RHEL 5.4, assuming the code is upstream by the time we code freeze.
Please update the subject with "RHEL 5.4".
Youquan, Is it possible to work around this bug via a kernel command line argument?
There is no kernel command line option to work around it in this situation.
Rafael, The upstream maintainer of ACPI (Len Brown) rejected the patch because he believes that this issue is a BIOS bug. Can you try to have the BIOS team investigate this issue and fix it?
John and Rafael, Based on comment 34, I am closing this as "not a bug".
The patch in comment #24 fixes the problem at hand, and will not break anything else. So I'd not lose any sleep about putting it as a workaround into a distro until a better patch is available. But the reason that I will not accept that patch upstream is that it is checking for a random bit pattern to determine that p-states are disabled by the BIOS. This bit pattern is completely arbitrary. If the BIOS is going to throw random bit patterns at us, then we need to either 1. be smarter about sanity checking them -- checking one bit and not others, why? or better... 2. harden linux to handle total garbage in that field.
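To illustrate option 1 above (sanity checking the whole table instead of matching one bit pattern), here is a rough sketch in Python. Everything in it is an assumption made for illustration: the `PState` class mirrors the six fields of an ACPI _PSS entry, while `pss_problems`, the plausibility ceiling, and the ordering rule are invented here and are not taken from the rejected patch or from any kernel code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PState:
    """One _PSS entry; field names are illustrative, not kernel names."""
    core_frequency_mhz: int
    power_mw: int
    transition_latency_us: int
    bus_master_latency_us: int
    control: int
    status: int


def pss_problems(table: List[PState], max_plausible_mhz: int = 10_000) -> List[str]:
    """Return human-readable reasons to distrust a _PSS table.

    An empty list means no problem was found. The rules are assumptions:
    the table must be non-empty, every frequency must lie in
    (0, max_plausible_mhz], and frequencies must not increase from one
    entry to the next.
    """
    problems = []
    if not table:
        problems.append("empty table")
    previous = None
    for index, state in enumerate(table):
        freq = state.core_frequency_mhz
        if not 0 < freq <= max_plausible_mhz:
            problems.append(f"entry {index}: implausible frequency {freq} MHz")
        if previous is not None and freq > previous:
            problems.append(f"entry {index}: frequency increases")
        previous = freq
    return problems
```

A caller would refuse to register P-states whenever the returned list is non-empty, so that garbage in the table leads to cpufreq being disabled instead of a BUG.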
re-opening, as there is no indication that this issue has gone away, either with a BIOS upgrade or a kernel patch.
Rafael, Please make sure that the board is running a production BIOS, and then please attach the output from acpidump and dmidecode
Created attachment 344864 [details] Updated patch for the 2.6.18-149 kernel
This bug seems to be the same as Bug 500311
(In reply to comment #38)
> Rafael,
> Please make sure that the board is running a production BIOS,
> and then please attach the output from acpidump and dmidecode

The BIOS I was using was a production BIOS available at www.intel.com. I am not sure what your concern is.
Rafael, Do you know what version of the BIOS you are running? Could we get the output of dmidecode please? Also is it possible to get a copy of the acpidump output? You can get the pmtools RPM in Bug 500311 and acpidump is inside that package. You can use that one if you like or: The pmtools source can be found at: http://www.lesswatts.org/projects/acpi/utilities.php I was able to build the Fedora SRPM on my RHEL5 build root: ftp://mirrors.kernel.org/fedora/releases/10/Everything/source/SRPMS/pmtools-20071116-1.fc9.src.rpm
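If it helps, the BIOS version can be picked out of saved dmidecode output with a few lines of Python. This is a sketch only: the helper name `bios_version` is made up here, and it assumes the usual dmidecode layout of an unindented "BIOS Information" heading followed by indented "Key: value" lines.

```python
from typing import Optional


def bios_version(dmidecode_text: str) -> Optional[str]:
    """Return the Version field of the "BIOS Information" section, if any."""
    in_bios_section = False
    for line in dmidecode_text.splitlines():
        # Section headings and "Handle ..." lines are not indented.
        if not line.startswith(("\t", " ")):
            in_bios_section = line.strip() == "BIOS Information"
            continue
        if in_bios_section:
            key, sep, value = line.strip().partition(":")
            if sep and key == "Version":
                return value.strip()
    return None
```

The input would typically be the text saved from running `dmidecode -t bios` as root on the affected board.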
(In reply to comment #37)
> re-opening, as there is no indication that this issue has gone away,
> either with a BIOS upgrade or a kernel patch.

Update: We have successfully installed Red Hat 5.3 on this board with the latest BIOS release. I don't know the exact BIOS version or kernel version that resolved this issue, but the system is working correctly with the latest versions.
Rafael, Thanks for the info. Glad to hear the BIOS update fixed it for you.
John, Based on comments #33 and #34, it appears that we should close this as notabug. If you concur, please close. If not, then please provide an update as to what the problem is and how to reproduce it. Thanks!
Ron, I believe this is a bug. This and Bug 500311 seem to be the same bug. So we could mark this as a duplicate of that bug if desired. Customers in the field are seeing this issue.
John, If it's a dup, then please close this as a dup.
*** This bug has been marked as a duplicate of bug 500311 ***