After using the machine for a while (mainly yum upgrades and installs), I get a panic. I've transcribed it by hand below, and will attach an image.

This is on a freshly installed Fedora 15, fully updated this afternoon. I also attempted to install Fedora 15 on this machine a few weeks ago and got similar results (but wasn't able to catch the panic that time), so I'd be reasonably confident of being able to reproduce this within an hour or two. The machine has run Fedora 13 for months with no trouble.

[Hardware Error]: CPU 4: Machine Check Exception: 4 Bank 5: be00000000800400
Clocksource tsc unstable (delta = -8589933399 ns)
[Hardware Error]: TSC a3fe0652426 ADDR 3fff81080b5d MISC 7fff
[Hardware Error]: PROCESSOR 0:106e5 TIME 1308792167 SOCKET 0 APIC 1
[Hardware Error]: No human readable MCE decoding support on this CPU type.
[Hardware Error]: Run the message through 'mcelog --ascii' to decode.
[Hardware Error]: CPU 0: Machine Check Exception: 4 Bank 5: be00000000800400
[Hardware Error]: TSC a3fe0633fb7 ADDR 3fff81080b5d MISC 7fff
[Hardware Error]: PROCESSOR 0:106e5 TIME 1308792167 SOCKET 0 APIC 0
[Hardware Error]: No human readable MCE decoding support on this CPU type.
[Hardware Error]: Run the message through 'mcelog --ascii' to decode.
[Hardware Error]: Machine check: Processor context corrupt
Kernel panic - not syncing: Fatal Machine check
Pid: 12496, comm: prelink Tainted: G M 2.6.38.8-32.fc15.x86_64 #1
Call Trace:
 <#MC> [<ffffffff8146c6e6>] panic+0x91/0x19c
 [<ffffffff8101b1bd>] mce_panic+0x191/0x1c7
 [<ffffffff8101b9b9>] do_machine_check+0x59a/0x741
 [<ffffffff8147622c>] machine_check+0x1c/0x30
 [<ffffffff81080b5d>] ? arch_local_irq_disable+0x4/0xd
 <<EOE>> [<ffffffff814759a2>] _raw_spin_lock_irq+0x13/0x1e
 [<ffffffff810d8a0a>] add_to_page_cache_locked+0x93/0x118
 [<ffffffff8119864b>] ? ext4_get_block+0x0/0x18
 [<ffffffff810d8ab9>] add_to_page_cache_lru+0x2a/0x58
 [<ffffffff8114c14a>] mpage_readpages+0x99/0x104
 [<ffffffff8119864b>] ? ext4_get_block+0x0/0x18
 [<ffffffff8110875e>] ? alloc_pages_current+0xc7/0xd8
 [<ffffffff81194b9d>] ext4_readpages+0x1d/0x1f
 [<ffffffff810e0870>] __do_page_cache_readahead+0x100/0x177
 [<ffffffff810e0b4d>] ra_submit+0x21/0x25
 [<ffffffff810e0d1a>] ondemand_readahead+0x1c9/0x1d8
 [<ffffffff810e0da4>] page_cache_async_readahead+0x7b/0xa3
 [<ffffffff8122c8bc>] ? radix_tree_lookup_slot+0xe/0x10
 [<ffffffff810d7f42>] ? find_get_page+0x40/0x62
 [<ffffffff810d9708>] generic_file_aio_read+0x2bd/0x5e0
 [<ffffffff8112114a>] do_sync_read+0xbf/0xff
 [<ffffffff811e8102>] ? security_file_permission+0x2e/0x33
 [<ffffffff81121436>] ? rw_verify_area+0xb0/0xcd
 [<ffffffff811217b1>] vfs_read+0xa9/0xf0
 [<ffffffff8112192e>] sys_pread64+0x5a/0x76
 [<ffffffff81009bc2>] system_call_fastpath+0x16/0x1b
panic occurred, switching back to text console
Rebooting in 30 seconds..
Created attachment 506108 [details] screenshot of panic
What kind of machine is it (vendor and model)?

Does the problem go away if you disable hyperthreading in the BIOS?

Decoded MCE:

Wed Jun 22 21:22:47 2011
HARDWARE ERROR. This is *NOT* a software problem!
Please contact your hardware vendor
CPU 0 BANK 5 TSC a3fe0633fb7
MISC 7fff ADDR 3fff81080b5d
MCG status:MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
MCi_ADDR register valid
Processor context corrupt
MCA: Internal Timer error
STATUS be00000000800400 MCGSTATUS 4
CPUID Vendor Intel Family 6 Model 30
PROCESSOR 0:106e5 TIME 1308792167 SOCKET 0 APIC 0
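For readers who want to see how the STATUS word maps onto the decoded flags above, here is a minimal sketch in Python. It assumes the architectural IA32_MCi_STATUS bit layout (bits 63 down to 57 for the flags, bits 15:0 for the MCA error code, bits 31:16 for the model-specific code); the function name and the flag labels are made up for illustration and are not mcelog's own code.

```python
# Flag bits of an IA32_MCi_STATUS value (architectural layout, assumed here).
STATUS_FLAGS = {
    63: "VAL (register valid)",
    62: "OVER (error overflow)",
    61: "UC (uncorrected error)",
    60: "EN (error enabled)",
    59: "MISCV (MCi_MISC register valid)",
    58: "ADDRV (MCi_ADDR register valid)",
    57: "PCC (processor context corrupt)",
}


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into flags and error codes.

    Hypothetical helper for illustration only.
    """
    flags = [
        name
        for bit, name in sorted(STATUS_FLAGS.items(), reverse=True)
        if (status >> bit) & 1
    ]
    return {
        "flags": flags,
        "mca_code": status & 0xFFFF,                    # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


# STATUS value taken from the panic in this report.
decoded = decode_mci_status(0xBE00000000800400)

if __name__ == "__main__":
    for flag in decoded["flags"]:
        print(flag)
    print("MCA error code: 0x%04x" % decoded["mca_code"])
    print("model-specific code: 0x%04x" % decoded["model_specific_code"])
```

The MCA error code comes out as 0x0400, which is the value mcelog reports as "Internal Timer error", and the set flags match the "MCi status" lines in the decoded output.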
The CPU is a core i7-870, in this barebones: http://www.newegg.com/Product/Product.aspx?Item=N82E16856101098. I'll try turning off hyperthreading.
Created attachment 509574 [details] dmesg including oops after ht turned off After turning off hyperthreading, booting, and playing around for a few minutes (mainly with setting up a kvm guest), I got a kernel oops. (dmesg attached). No idea if it's related. Hm, I forgot that the driver for the built-in network interface is buggy in Fedora >=14. (See https://bugzilla.redhat.com/show_bug.cgi?id=654147, for a machine built from the same barebones.) So that may be a factor as well. I'll see if I can find a workaround for that bug to help isolate this one.
This looks like broken hardware.

  16: b9 10 00 00 00          mov    $0x10,%ecx
  1b: 45 85 ed                test   %r13d,%r13d
  1e: c7 85 ac fe ff ff ff    movl   $0xffffffff,-0x154(%rbp)
  25: ff ff ff
  28: 48 89 d7                mov    %rdx,%rdi
   0: f3 ab                   rep stos %eax,%es:(%rdi)

%rcx should contain 0x10 but it contains 0xffff8801f2ce382c
%rdi points to userspace when it should be a copy of the kernel pointer in %rdx

I would try installing some other OS to rule out hardware problems.
"I would try installing some other OS to rule out hardware problems." The same machine runs fine under Fedora 13, and has been for months. I've also tried downgrading it to Fedora 13 in case there's a hardware problem that developed only recently, but am still unable to reproduce the problem under Fedora 13, whereas it happens within an hour or two of use under Fedora 15.
Something very strange is going on. Can you try "iommu=soft" to rule out DMAR bugs? Do older F15 kernels work?
After some further testing I've seen it freeze (and couldn't get debugging information) after adding iommu=soft to the kernel commandline. I believe I've seen similar problems under older F15 kernels, but haven't retested to confirm that. Apologies, I use the machine a lot while I'm working and am not getting a lot of time to boot it to Fedora 15 for testing.
I just started seeing a very similar [hardware error] on an MSI laptop with an AMD e350 after updating to kernel 2.6.40.6-0.fc15.x86_64 yesterday. It happened three times in a row, so of course I went to grab my camera and now the machine has been fine since. Anyway, adding myself to the cc: list for if/when it happens again.
Was this machine hibernated at all? I'm wondering if this was more fallout from the recent i915 memory corruption bug that got fixed.
No, the machine never hibernates.
Created attachment 585981 [details] Photo of the error
Created attachment 585982 [details] Photo of the error
Hi, I'm a new Fedora user and I get the same kind of error on my machine, about once per day.

Fedora 16
CPU: Core i7 920
RAM: 3x 2 GB DDR3

Hyperthreading is already disabled because of another problem with josm (https://bugzilla.redhat.com/show_bug.cgi?id=819345).

I don't have this problem when I'm working in Windows 7 on the same computer.

I've created 2 new attachments, which are photos of the error on my computer:
https://bugzilla.redhat.com/attachment.cgi?id=585981
https://bugzilla.redhat.com/attachment.cgi?id=585982
Fedora 15 has reached its end of life as of June 26, 2012. As a result, we will not be fixing any remaining bugs found in Fedora 15.

In the event that you have upgraded to a newer release and the bug you reported is still present, please reopen the bug and set the version field to the newest release you have encountered the issue with. Before doing so, please ensure you are testing the latest kernel update in that release and attach any new and relevant information you may have gathered.

Thank you for taking the time to file a report. We hope newer versions of Fedora suit your needs.
I've attempted to install more recent Fedora versions several times, including most recently the F18 alpha, but continue to have random bugs and MCEs. F13 continues to work.

I finally took the time to experiment some more. It looks now like the bug began when CONFIG_INTEL_IDLE was turned on for F14. Upstream report: http://mid.gmane.org/<20121005222357.GC30139>

I'm not completely positive this is the same bug, but for now it looks likely. I'll do some more work with the modified kernel and report the results. I'd also like to work out how to create an F18 live CD with a modified kernel to see whether this makes F18 reliable for me.

Resetting the bug's state to ASSIGNED, but let me know if that's not the right thing to do.
I just noticed intel_idle has a "max_cstate" parameter. Booting with "intel_idle.max_cstate=0" also fixes the problem without the need to rebuild the kernel. I can now get through a Fedora 18 install successfully, whereas previously it always crashed either during the install itself or in the initial post-boot configuration.
Any chance you can incrementally increase max_cstate until the point where it starts failing?
Yep. So far:

intel_idle.max_cstate=2 is bad
intel_idle.max_cstate=1 is good?
intel_idle.max_cstate=0 is good

A question mark for max_cstate=1 just because my quick "dd" reproducer hasn't been 100% reliable. It's probably good, but I'll do my work with max_cstate=1 today (as I did yesterday with max_cstate=0) and report if it crashes.

This is all with Fedora 18 and 3.6.1-1.fc18.x86_64.
Booting with intel_idle.max_cstate=0, and then with intel_idle.max_cstate=1, please show the output from

dmesg | grep idle
grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*
To confirm: it survived all day yesterday with max_cstate=1. So max_cstate=2 is the first that reproduces the bug.

With max_cstate=0:

# dmesg|grep idle
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.6.1-1.fc18.x86_64 root=UUID=0526310d-dcb8-4371-a785-752590fe62c1 ro rd.md=0 rd.lvm=0 rd.dm=0 rd.luks=0 rhgb quiet intel_idle.max_cstate=0
[ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-3.6.1-1.fc18.x86_64 root=UUID=0526310d-dcb8-4371-a785-752590fe62c1 ro rd.md=0 rd.lvm=0 rd.dm=0 rd.luks=0 rhgb quiet intel_idle.max_cstate=0
[ 0.002931] process: using mwait in idle threads
[ 0.922878] intel_idle: disabled
[ 1.198689] cpuidle: using governor ladder
[ 1.198690] cpuidle: using governor menu
# grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*
grep: /sys/devices/system/cpu/cpu0/cpuidle/*/*: No such file or directory

With max_cstate=1:

# dmesg|grep idle
[ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.6.1-1.fc18.x86_64 root=UUID=0526310d-dcb8-4371-a785-752590fe62c1 ro rd.md=0 rd.lvm=0 rd.dm=0 rd.luks=0 rhgb quiet intel_idle.max_cstate=1
[ 0.000000] Kernel command line: BOOT_IMAGE=/vmlinuz-3.6.1-1.fc18.x86_64 root=UUID=0526310d-dcb8-4371-a785-752590fe62c1 ro rd.md=0 rd.lvm=0 rd.dm=0 rd.luks=0 rhgb quiet intel_idle.max_cstate=1
[ 0.002926] process: using mwait in idle threads
[ 0.923045] intel_idle: MWAIT substates: 0x1120
[ 0.923050] intel_idle: v0.4 model 0x1E
[ 0.923051] intel_idle: lapic_timer_reliable_states 0x2
[ 0.923052] intel_idle: max_cstate 1 reached
[ 0.923060] intel_idle: max_cstate 1 reached
[ 0.923064] intel_idle: max_cstate 1 reached
[ 0.923067] intel_idle: max_cstate 1 reached
[ 0.923068] intel_idle: max_cstate 1 reached
[ 1.198811] cpuidle: using governor ladder
[ 1.198840] cpuidle: using governor menu
# grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*
/sys/devices/system/cpu/cpu0/cpuidle/state0/desc:CPUIDLE CORE POLL IDLE
/sys/devices/system/cpu/cpu0/cpuidle/state0/disable:0
/sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu0/cpuidle/state0/name:POLL
/sys/devices/system/cpu/cpu0/cpuidle/state0/power:4294967295
/sys/devices/system/cpu/cpu0/cpuidle/state0/time:28659510
/sys/devices/system/cpu/cpu0/cpuidle/state0/usage:236965
/sys/devices/system/cpu/cpu0/cpuidle/state1/desc:MWAIT 0x00
/sys/devices/system/cpu/cpu0/cpuidle/state1/disable:0
/sys/devices/system/cpu/cpu0/cpuidle/state1/latency:3
/sys/devices/system/cpu/cpu0/cpuidle/state1/name:C1-NHM
/sys/devices/system/cpu/cpu0/cpuidle/state1/power:4294967294
/sys/devices/system/cpu/cpu0/cpuidle/state1/time:63419910896
/sys/devices/system/cpu/cpu0/cpuidle/state1/usage:92907312
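The same information can be collected with a short script instead of grep. This is a rough sketch, assuming the sysfs layout shown in the output above (one stateN directory per idle state, each with "name" and "latency" files); the function name is hypothetical. An empty result corresponds to the "No such file or directory" case seen with max_cstate=0.

```python
from pathlib import Path


def read_cpuidle_states(cpuidle_dir):
    """Return (state dir, name, exit latency) for each cpuidle state.

    Hypothetical helper; returns an empty list when the cpuidle
    directory does not exist (no cpuidle driver registered).
    """
    base = Path(cpuidle_dir)
    if not base.is_dir():
        return []
    states = []
    # Sort numerically so that state10 would come after state9.
    for state_dir in sorted(base.glob("state[0-9]*"), key=lambda p: int(p.name[5:])):
        name = (state_dir / "name").read_text().strip()
        latency = int((state_dir / "latency").read_text())
        states.append((state_dir.name, name, latency))
    return states


if __name__ == "__main__":
    found = read_cpuidle_states("/sys/devices/system/cpu/cpu0/cpuidle")
    if not found:
        print("no cpuidle states exported for cpu0")
    for directory, name, latency in found:
        print(directory, name, "latency", latency)
```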
It appears that when you disable intel_idle via intel_idle.max_cstate=0, that instead of running acpi_idle, you are running with no C-states at all.

Do you have ACPI C-states disabled in the BIOS or the Linux acpi "processor" driver disabled?

Please go into BIOS SETUP and select defaults and verify that you see the same thing. Also look for BIOS options related to idle c-states.

Please attach the output of "acpidump" to this bug report.

In the intel_idle.max_cstate=0 case, on an idle system, please show the output from

# turbostat -v sleep 1

turbostat and acpidump can be found in the latest upstream kernel tree under utils/power/
(In reply to comment #22)
> It appears that when you disable intel_idle via intel_idle.max_cstate=0,
> that instead of running acpi_idle, you are running with no C-states at all.
>
> Do you have ACPI C-states disabled in the BIOS or the Linux
> acpi "processor" driver disabled?

Apologies, I don't know how to answer either of those questions!

> Please go into BIOS SETUP
> and select defaults and verify that you see the same thing.
> Also look for BIOS options related to idle c-states.

The only possibly relevant items I see in the BIOS menus are "C1E support", "SpeedStep", and "TurboMode". All are set to "enabled".

I did try restoring all BIOS defaults. Output above in the max_cstate=0 case was unchanged. (Still no /sys/devices/system/cpu/cpu0/cpuidle directory.)

> Please attach the output of "acpidump" to this bug report.
>
> In the intel_idle.max_cstate=0 case, on an idle system,
> please show the output from
> # turbostat -v sleep 1
>
> turbostat and acpidump can be found in the latest upstream kernel tree
> under utils/power/

(Actually looks like they're in tools/power/acpi and tools/power/x86/turbostat).

Thanks, I'll do that next.
Created attachment 625706 [details] acpidump output
Created attachment 625707 [details] turbostat output
Please verify that this motherboard officially supports this processor, that you are running the latest BIOS, and that the BIOS supports this processor.

Even in ACPI mode, this box is running with C1 in idle only, which is an indication that something is quite wrong.

In the FADT...

[05Fh 0095 1] _CST Support : E3
[060h 0096 2] C2 Latency : 0065
[062h 0098 2] C3 Latency : 03E9

The two latencies translate to 101 and 1001 decimal, which disable C2 and C3 in non-CST mode. The E3 means that the BIOS wants the OS to tell it that the OS has _CST support, but the tables you sent don't have any _CST present.

Are there any dynamic tables in /sys/firmware/acpi/tables/dynamic? If yes, please attach them.

BTW, it is also interesting that your BIOS would offer to disable C1E, as that would void the warranty on your processor.
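To make the FADT arithmetic explicit, here is a minimal Python sketch. It assumes the ACPI convention that a C2 latency above 100 microseconds, or a C3 latency above 1000 microseconds, tells the OS the state is not supported when no _CST object is present; the constant and function names are illustrative only.

```python
# Thresholds from the ACPI FADT definition (assumed here): latencies above
# these values mean the state is not supported in non-_CST mode.
C2_MAX_LATENCY_US = 100
C3_MAX_LATENCY_US = 1000


def fadt_cstate_support(c2_latency, c3_latency):
    """Report whether the FADT latencies leave C2 and C3 usable.

    Hypothetical helper for illustration only.
    """
    return {
        "C2": c2_latency <= C2_MAX_LATENCY_US,
        "C3": c3_latency <= C3_MAX_LATENCY_US,
    }


# Values from the FADT dump in this report: 0x0065 = 101, 0x03E9 = 1001.
support = fadt_cstate_support(0x0065, 0x03E9)

if __name__ == "__main__":
    print("C2 latency:", 0x0065, "us, usable:", support["C2"])
    print("C3 latency:", 0x03E9, "us, usable:", support["C3"])
```

Both values sit exactly one above their thresholds, which looks like a deliberate "disabled" marker from the BIOS rather than a measured latency.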
Looking at Shuttle's web site, this product claims to support the i7-870 processor. However, their BIOS download page has only this description for version 2010/09/01 BIOS:

"Improved stability for some CPUs."

So it would be a good idea to verify you've got that version or later.

http://global.shuttle.com/products/productsDownload?productId=1409

What do you see here?:
$ grep . /sys/devices/system/cpu/cpu0/cpufreq/*

One possibility is that there is an electrical problem on this board and Shuttle tried to de-feature voltage scaling in their BIOS. Under "Advanced", what is "Intel(R)SpeedStep(tm) tech" set to?

If it is off and if you enable it when C-states are off and you see stability issues, that may indicate a voltage issue.

If you have an easy way to reproduce the failure, I'd be interested to know if the settings under "Advanced"/"Frequency Voltage Configuration" have an effect. In particular, does the system get more stable if you increase the processor and DIMM voltages?

Please show the output from

# turbostat -M 0xe2 sleep 1

MSR 0xE2 is the MSR_PKG_CST_CONFIG_CONTROL register. The bottom 3 bits say what the deepest enabled package C-state is. If this is mis-configured, then using the core c-states could result in a package c-state which has issues. If bit 15 is clear, then this MSR is unlocked and you could write this MSR with the bottom 3 bits clear to disable package C-states. Note this is a per-core MSR, so you'd use the version of wrmsr with the -a capability. turbostat or rdmsr -a can tell you if it worked.

If the MSR is locked, then a low-tech way to prevent package c-states (as a test) is to have 1 thread running (e.g., a spin loop), and see if the other 3 cores are able to get into a deep core c-state w/o problems (as shown by turbostat).
(In reply to comment #27)
> However, their BIOS download page has only this description
> for version 2010/09/01 BIOS:
>
> "Improved stability for some CPUs."
>
> So it would be a good idea to verify you've got that version or later.

In fact, the BIOS reports version 103 (06/18/10); thanks for the suggestion, I'll try their latest. Results below are before doing that, and with max_cstate still 0:

> http://global.shuttle.com/products/productsDownload?productId=1409
>
> What do you see here?:
> $ grep . /sys/devices/system/cpu/cpu0/cpufreq/*

/sys/devices/system/cpu/cpu0/cpufreq/affected_cpus:0
/sys/devices/system/cpu/cpu0/cpufreq/bios_limit:2934000
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq:1200000
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq:2934000
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_min_freq:1200000
/sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_transition_latency:10000
/sys/devices/system/cpu/cpu0/cpufreq/related_cpus:0 1 2 3 4 5 6 7
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_frequencies:2934000 2933000 2800000 2667000 2533000 2400000 2267000 2133000 2000000 1867000 1733000 1600000 1467000 1333000 1200000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_available_governors:conservative userspace powersave ondemand performance
/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq:1200000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_driver:acpi-cpufreq
/sys/devices/system/cpu/cpu0/cpufreq/scaling_governor:ondemand
/sys/devices/system/cpu/cpu0/cpufreq/scaling_max_freq:2934000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_min_freq:1200000
/sys/devices/system/cpu/cpu0/cpufreq/scaling_setspeed:<unsupported>

> One possibility is that there is an electrical problem on this
> board and Shuttle tried to de-feature voltage scaling in their BIOS.
> Under "Advanced", what is "Intel(R)SpeedStep(tm) tech" set to?
>
> if it is off and if you enable it when C-states are off
> and you see stability issues, that may indicate a voltage issue.

That's set to "enabled" and always has been.

> If you have an easy way to reproduce the failure,

I think I can reproduce it reliably in under an hour.

> I'd be interested
> to know if they settings under
> "Advanced"/"Frequency Voltage Configuration"
> have an effect. In particular, does the system get more stable
> if you increase the processor and DIMM voltages?

I could try that, sure.

> Please show the output from
>
> # turbostat -M 0xe2 sleep 1

# ./turbostat -M 0xe2 sleep 1
cor CPU    %c0  GHz  TSC          MSR 0x0E2    %c1    %c3    %c6   %pc3   %pc6
          0.05 1.20 2.93 0x0000000000000000  99.95   0.00   0.00   0.00   0.00
  0   0   0.05 1.20 2.93 0x0000000000000003  99.95   0.00   0.00   0.00   0.00
  0   4   0.04 1.20 2.93 0x0000000000000003  99.96
  1   1   0.04 1.20 2.93 0x0000000000000003  99.96   0.00   0.00
  1   5   0.11 1.20 2.93 0x0000000000000003  99.89
  2   2   0.04 1.20 2.93 0x0000000000000003  99.96   0.00   0.00
  2   6   0.02 1.20 2.93 0x0000000000000003  99.98
  3   3   0.03 1.20 2.93 0x0000000000000003  99.97   0.00   0.00
  3   7   0.03 1.20 2.93 0x0000000000000003  99.97
1.001803 sec

> MSR 0xE2 is the MSR_PKG_CST_CONFIG_CONTROL register.
> The bottom 3 bits say what the deepest enabled package C-state is.
> If this is mis-configured, then using the core c-states could result
> in a package c-state which has issues. If bit 15 is clear, then
> this MSR is unlocked and you could write this MSR with the bottom 3-bits
> clear to disable package C-states. Note this is a per-core MSR, so you'd
> use the version of wrmsr with the -a capability.

"yum install msr-tools" gets me wr/rdmsr without any (documented) "-a" option, and googling isn't finding anything else. Would

for (( i=0; i<9; i++ )); do wrmsr -p$i 0xe2; done

do the job?

> turbostat or rdmsr -a can tell you if it worked.

Apologies, I'm not completely sure what you're asking for here. The problem was only reproduceable on booting with intel_idle.max_cstate >= 2. So I should boot with max_cstate >=2, then try the above wrmsr, then see if the problem still occurs?

Anyway, I'm assuming I should try the BIOS upgrade first.

> If the MSR is locked, then a low-tech way to prevent package c-states
> (as a test) is to have 1 thread running (eg, a spin loop), and see
> if the other 3 cores are able to get into a deep core c-state w/o problems.
> (as shown by turbostat).
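As a reading aid for the MSR 0xE2 discussion quoted above, here is a small Python sketch that pulls out the two fields mentioned: the bottom 3 bits (package C-state limit) and bit 15 (lock). It only manipulates the value; it does not read or write any MSR. The function and field names are invented for illustration, and the meaning of each limit value is model-specific, so it is left as a raw number.

```python
PKG_CST_LIMIT_MASK = 0x7  # bottom 3 bits, as described in the quoted text
LOCK_BIT = 15             # bit 15: register locked when set


def decode_pkg_cst_config(value):
    """Extract the package C-state limit field and the lock bit.

    Hypothetical helper for illustration only.
    """
    return {
        "pkg_cstate_limit": value & PKG_CST_LIMIT_MASK,
        "locked": bool((value >> LOCK_BIT) & 1),
    }


def with_pkg_cstates_disabled(value):
    """Return the value with the bottom 3 bits cleared, other bits kept."""
    return value & ~PKG_CST_LIMIT_MASK


if __name__ == "__main__":
    current = 0x0000000000000003  # per-CPU value shown by turbostat above
    print(decode_pkg_cst_config(current))
    print("value to write: 0x%x" % with_pkg_cstates_disabled(current))
```

For the value reported here the lock bit is clear and the value to write is simply 0, applied to every CPU since the MSR is per-core.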
Yes, best to focus first on updating the BIOS.

After the upgrade, please re-send:

acpidump output

and with intel_idle.max_cstate=0

grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*
turbostat -M 0xe2 sleep 1

Thanks for verifying that P-states are enabled. That suggests that the configuration isn't totally crippled. Note, however, that the lack of deep C-state support on this configuration may prevent you from reaching maximum frequency:

 9 * 133 = 1200 MHz max efficiency
22 * 133 = 2933 MHz TSC frequency
24 * 133 = 3200 MHz max turbo 4 active cores
24 * 133 = 3200 MHz max turbo 3 active cores
26 * 133 = 3467 MHz max turbo 2 active cores
27 * 133 = 3600 MHz max turbo 1 active core

You can find out with a simple test:

# cat /dev/zero > /dev/null &
# cat /dev/zero > /dev/null &
# turbostat

and see if you get up to 3.4 GHz. Kill one of the threads and see if you can get up to 3.6 GHz.

It is possible that the lack of C-states deeper than C1 will limit turbo to 3.2 GHz.

Good news on MSR 0xE2. First, bit 15 is clear, so this MSR is unlocked and enabled for writing. The 3 means that PC6 is enabled. Set this MSR to 0 and re-test (and re-run your test above to see if you can then get to 3.6 GHz :-)
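The frequency table above writes the base clock as 133, but the rounded results only work out with 133.33 MHz (400/3). A quick Python check, assuming that base clock; the names are illustrative:

```python
from fractions import Fraction

# Assumed base clock: 133.33 MHz, written exactly as 400/3 so that the
# rounding below is not affected by floating-point error.
BCLK_MHZ = Fraction(400, 3)


def ratio_to_mhz(ratio):
    """Convert a clock multiplier to a frequency in MHz, rounded."""
    return round(ratio * BCLK_MHZ)


RATIOS = {
    "max efficiency": 9,
    "TSC frequency": 22,
    "max turbo, 3 or 4 active cores": 24,
    "max turbo, 2 active cores": 26,
    "max turbo, 1 active core": 27,
}

if __name__ == "__main__":
    for label, ratio in RATIOS.items():
        print("%2d x 133.33 = %4d MHz  %s" % (ratio, ratio_to_mhz(ratio), label))
```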
Created attachment 626014 [details] rdmsr.c Here is the rdmsr.c that I use. I modified it some time ago to add the -a parameter. Looks like I failed to get that change back upstream.
Created attachment 626015 [details] wrmsr.c This version has -a option
Thanks! The BIOS upgrade did indeed help: the machine's been running for a couple days with intel_idle.max_cstate=2 without any crashes.

But I also have a backup of the original BIOS and would be happy to reflash back to that if it would be useful. (Presumably it was buggy, but should the kernel have been able to work around whatever the problem was?)

(For future reference, the BIOS upgrade was:

# download and unzip update from http://global.shuttle.com/products/productsDownload?productId=1409
yum install flashrom
flashrom -pinternal -r backup.bin
flashrom -pinternal -w SH55JSHU.107
)
With new BIOS:

# cat /sys/module/intel_idle/parameters/max_cstate
0
# grep . /sys/devices/system/cpu/cpu0/cpuidle/*/*
grep: /sys/devices/system/cpu/cpu0/cpuidle/*/*: No such file or directory
Also with new BIOS and max_cstate=0, turbostat output looks the same?:

# ./turbostat -M 0xe2 sleep 1
cor CPU    %c0  GHz  TSC          MSR 0x0E2    %c1    %c3    %c6   %pc3   %pc6
          2.13 1.20 2.93 0x0000000000000000  97.87   0.00   0.00   0.00   0.00
  0   0   3.61 1.20 2.93 0x0000000000000003  96.39   0.00   0.00   0.00   0.00
  0   4   1.47 1.20 2.93 0x0000000000000003  98.53
  1   1   6.14 1.20 2.93 0x0000000000000003  93.86   0.00   0.00
  1   5   0.12 1.20 2.93 0x0000000000000003  99.88
  2   2   0.42 1.20 2.93 0x0000000000000003  99.58   0.00   0.00
  2   6   0.02 1.20 2.93 0x0000000000000003  99.98
  3   3   5.21 1.20 2.93 0x0000000000000003  94.79   0.00   0.00
  3   7   0.05 1.20 2.93 0x0000000000000003  99.95
1.001825 sec
Created attachment 627002 [details] acpidump output (new BIOS, max_cstate=0)
"you can find out with a simple test" Right, I never see anything in the "GHz" column over 3.20. After: # ./wrmsr -a 0x0E2 0 [root@pop ~]# ./rdmsr -a 0x0E2 0 0 0 0 0 0 0 0 there's no change--still nothing over 3.20.
re: comment #32 "intel_idle.max_cstate=2 is now stable"

Promising news.

Please show the turbostat output for this case to verify that we are getting the c-state residency we expect.

Please also try with no intel_idle.max_cstate parameter at all (the out of the box case) to see if we can get stable c6 residency in addition to c3 residency. Send turbostat output.

re: comment #33

This is with intel_idle not loaded, yes? ("dmesg |grep idle" will confirm what is loaded.) This is consistent with comment #32, where it seems that ACPI mode is still exporting just C1, and thus you'll see no cpuidle stuff in sysfs.

Indeed, it is strange (and likely some sort of BIOS bug) that you're getting only C1 in ACPI mode.

re: comment #35

I'll have to get back to you on this - maybe I can figure out why ACPI is exporting just C1 -- though we care here more about the intel_idle case -- the ACPI case is primarily for comparison.

re: comment #36

Okay, in the ACPI case where you have just C1, you can never run faster than 3.2 GHz. Certainly that isn't what customers will want. Unclear if Shuttle disabled this on purpose (say, they don't have power or cooling capacity), or if this is a BIOS bug. Of course it would be interesting to see what Windows does on this box. It will use ACPI, and its C-states should be visible in its perfmon utility.

re: clearing of MSR 0xE2.

This is interesting only if you have a failure with deep c-states and this makes it go away. But since we don't have a failure to fix at the moment, this doesn't tell us anything.
Apologies for the delayed response:

(In reply to comment #37)
> re: comment #32 "intel_idle.max_cstate=2 is now stable"
>
> Promising news.
>
> Please show the turbostat output for this case
> to verify that we are getting c-state residency we expect.

[root@pop turbostat]# cat /sys/module/intel_idle/parameters/max_cstate
2
[root@pop turbostat]# ./turbostat sleep 1
cor CPU    %c0  GHz  TSC    %c1    %c3    %c6   %pc3   %pc6
          5.21 1.20 2.93   7.12  87.67   0.00  43.23   0.00
  0   0   3.99 1.20 2.93   2.03  93.98   0.00  43.23   0.00
  0   4   0.04 1.20 2.93   5.98
  1   1   8.50 1.20 2.93   2.69  88.81   0.00
  1   5   0.05 1.20 2.93  11.15
  2   2  12.56 1.20 2.93  15.93  71.51   0.00
  2   6  13.78 1.20 2.93  14.71
  3   3   2.69 1.20 2.93   0.93  96.39   0.00
  3   7   0.05 1.19 2.93   3.57
1.002127 sec

> Please also try with no intel_idle.max_cstate parameter at all,
> (the out of the box case)
> to see if we can get stable c6 residency in addition
> to c3 residency. Send turbostat output.

# cat /sys/module/intel_idle/parameters/max_cstate
7
[root@pop turbostat]# ./turbostat sleep 1
cor CPU    %c0  GHz  TSC    %c1    %c3    %c6   %pc3   %pc6
          4.42 1.25 2.93   7.83  46.71  41.04  38.62   4.58
  0   0   7.25 1.24 2.93   3.04  69.86  19.85  38.62   4.58
  0   4   0.39 1.62 2.93   9.90
  1   1   3.26 1.25 2.93   9.48  33.45  53.81
  1   5   5.55 1.25 2.93   7.18
  2   2   7.39 1.22 2.93   3.33  40.26  49.03
  2   6   0.36 1.66 2.93  10.36
  3   3   9.36 1.23 2.93   5.90  43.26  41.48
  3   7   1.79 1.37 2.93  13.47
1.002048 sec

Neato.

> re: comment #33
>
> this is with intel_idle not loaded, yes?
> ("dmesg |grep idle" will confirm you what is loaded)

I don't remember.... Booting with max_cstate=0 to check: that's right, it's not loaded.

> This is consistent with comment #32
> where it seems that ACPI mode is still exporting just C1,
> and thus you'll see no cpuidle stuff in sysfs.
>
> Indeed, it is strange (and likely some sort of BIOS bug,
> that you're getting only C1 in ACPI mode)
>
> re: comment #35
>
> I'll have to get back to you on this - maybe can figure
> out why ACPI is exporting just C1 -- though we care here
> more about the intel_idle case -- the ACPI case is primarily
> for comparison.
>
> re: comment #36
>
> Okay, in the ACPI case where you have just C1,
> you can never run faster than 3.2 GHz.
> Certainly that isn't what customers will want.
> Unclear if Shuttle disabled this on purpose,
> say, they don't have power or cooling capacity,
> or if this is a BIOS bug. Of course it would be interesting
> to see what windows does on this box. It will use ACPI,
> and its C-states should be visible in its perfmon utility.

I'm pretty ignorant of Windows--unless there's some Windows equivalent to a live CD that I could get my hands on easily, getting it on this box is probably more of a project than I can take on right now.
So your system is stable and working properly after the BIOS upgrade, and with no special boot parameters, intel_idle is loading, and c6 and pc6 are being utilized?

I expect you will also find that turbo mode goes faster now. Try a single-threaded cycle-soaker

# cat /dev/zero > /dev/null &

and see if turbostat shows that you are now able to get past 3.2 GHz.

If this is the case, then this bug is closed, yes?

The remaining mystery is why you see only C1 in legacy ACPI mode (intel_idle.max_cstate=0). That, of course, would be an ACPI-mode bug, not an intel_idle bug :-) If you file that bug, I'll look at it.
> see if turbostat shows that you are now able to get past 3.2 Ghz. Yep, looks like it: cor CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6 14.86 3.46 2.93 18.26 42.67 24.21 0.00 0.00 0 0 5.22 3.16 2.93 6.30 57.91 30.57 0.00 0.00 0 4 1.96 3.05 2.93 9.56 1 1 4.24 3.16 2.93 4.77 39.02 51.96 1 5 1.10 2.74 2.93 7.91 2 2 98.38 3.52 2.93 1.62 0.00 0.00 2 6 0.13 3.33 2.93 99.87 3 3 6.43 3.23 2.93 5.51 73.74 14.32 3 7 1.42 3.22 2.93 10.52 > If this is the case, then this bug is closed, yes? My one remaining concern aside from the ACPI mode behavior is whether the kernel could have worked around the buggy BIOS. I'm lame for not thinking to check for a BIOS upgrade, but: my experience as a user was that a machine that had been stable for months under F13 suddenly started crashing on upgrade to F14, so my first thought was to blame the software.... That said, my immediate problems are solved so I'm not going to push for anything more unless you judge it's a big priority--I'm fine with closing the bug.
I recommend closing this bug.

I don't think Linux can check for this issue in the general case -- since we have no idea what the BIOS changed for "Improved stability for some CPUs".

In theory, we could add a specific DMI check for the bad BIOS version -- but we typically don't do that when there is a known good BIOS. And this is a pretty low-volume system, making it hard to justify carrying code to check the BIOS version.

Finally, this is an end-user assembled "bare bones" system. The integrator selected and installed an i7-870, but failed to notice that they paid extra for higher MHz and the system didn't deliver that MHz. To say that FC13 was functioning would be fair, but with no C-states and no turbo mode, it wasn't working properly, and it is likely that most system integrators would have noticed that and installed the latest BIOS as part of system integration.

I think that it is an additional bug that Linux in ACPI mode (intel_idle.max_cstate=0) is not working properly on this system, and I would be interested in debugging that one if you open a new report for it.
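For anyone wanting to check their own box by hand, a user-space sketch of the kind of BIOS-date comparison discussed above might look like the following. This is not a kernel DMI quirk. It assumes the firmware date is exposed as MM/DD/YYYY in /sys/class/dmi/id/bios_date (the exact format depends on the firmware) and uses the 2010/09/01 date quoted from Shuttle's download page as the threshold; the function names are hypothetical.

```python
from datetime import date, datetime

# Date of the BIOS described as "Improved stability for some CPUs",
# as quoted earlier in this report.
KNOWN_GOOD_BIOS_DATE = date(2010, 9, 1)


def parse_dmi_bios_date(text):
    """Parse a DMI BIOS date string, assumed to be MM/DD/YYYY."""
    return datetime.strptime(text.strip(), "%m/%d/%Y").date()


def bios_predates_fix(bios_date_text, fixed=KNOWN_GOOD_BIOS_DATE):
    """True when the installed BIOS is older than the known-good one."""
    return parse_dmi_bios_date(bios_date_text) < fixed


if __name__ == "__main__":
    try:
        with open("/sys/class/dmi/id/bios_date") as f:
            text = f.read()
        print("BIOS date:", text.strip())
        print("older than known-good BIOS:", bios_predates_fix(text))
    except (OSError, ValueError) as err:
        print("could not read or parse BIOS date:", err)
```

A date comparison like this is only meaningful for this particular board, which is part of why carrying such a check in the kernel is hard to justify.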
(In reply to comment #41)
> I recommend closing this bug.
>
> I don't think Linux can check for this issue in the general case --
> since we have no idea what the BIOS changed for
> "Improved stability for some CPUs"
>
> In theory, we could add a specific DMI check for the bad BIOS version --
> but we typically don't do that when there is a known good BIOS.
> And this is a pretty low-volume system, making it hard to justify
> carrying code to check BIOS version.

OK, makes sense.

> Finally, this is an end-user assembled "bare bones" system.
> The integrator selected and installed an i7-870, but failed
> to notice that they paid extra for higher MHz, but the system
> didn't deliver that MHz. To say that FC13 was functioning would
> be fair, but it with no C-states and no turbo-mode, it wasn't
> working properly, and it is likely that most system integrators
> would have noticed that and installed the latest BIOS as part
> of system integration.

Yeah, my bad; it worked and built my kernels fast enough, so I was happy....

> I think that it is an additional bug that Linux in ACPI mode
> (intel_idle.max_cstate=0) is not working properly on this system,
> and I would be interested in debugging that one if you open
> a new report for it.

OK, I've opened bug 875988.

Thanks for all your help!