Description of problem:
I have a system running Fedora 17 x86_64. It randomly freezes and doesn't reboot, and kdump is not triggered either (kdump works fine when I echo c to sysrq-trigger). I had been using this machine with Fedora 16 for about half a year before, and it never hit this problem.

I haven't found any way to trigger the freeze; it may happen once a week in the middle of the night, or even 2~3 times in one day while I'm working. I redirected the kernel log to the serial console, but unfortunately nothing was printed when it froze. After the freeze, the system doesn't respond to any input from the mouse, the keyboard, or the magic SysRq key via the serial console.

The only lucky thing is that I captured the following log once last week:

[125515.772726] Disabling lock debugging due to kernel taint
[125515.778584] [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 4: b200000011000402
[125515.788790] [Hardware Error]: RIP !INEXACT! 10:<ffffffff8103de46> {native_safe_halt+0x6/0x10}
[125515.799515] [Hardware Error]: TSC 161319a22f68a
[125515.805994] [Hardware Error]: PROCESSOR 0:206a7 TIME 1341674682 SOCKET 0 APIC 6 microcode 28
[125515.816637] [Hardware Error]: Run the above through 'mcelog --ascii'
[125515.824998] [Hardware Error]: Some CPUs didn't answer in synchronization
[125515.833774] [Hardware Error]: Machine check: Processor context corrupt
[125515.842346] Kernel panic - not syncing: Fatal machine check on current CPU

Decoded with mcelog:

# mcelog --ascii --cpu sandybridge < error.txt
Hardware event. This is not a software error.
CPU 3 BANK 4 TSC 161319a22f68a
RIP !INEXACT! 10:ffffffff8103de46
TIME 1341674682 Sat Jul 7 23:24:42 2012
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal unclassified error: 402
PCU: No error <24:11>
STATUS b200000011000402 MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 42
RIP: native_safe_halt+0x6/0x10}
SOCKET 0 APIC 6 microcode 28

My system configuration:
CPU: Intel(R) Xeon(R) CPU E31225 @ 3.10GHz
Machine model: Lenovo ThinkStation E30
Running kernel: kernel-3.4.4-3.fc17.x86_64
Also tried: kernel-3.5.0-0.rc5.git3.1.fc18 (rebuilt on Fedora 17); it still has the same problem.

Version-Release number of selected component (if applicable):

How reproducible: randomly

Steps to Reproduce:
1. Boot up the system, keep it running and wait (may take a week).

Actual results: System randomly freezes.

Expected results: Should not freeze.

Additional info:
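As a cross-check of the mcelog output above, the raw STATUS and MCGSTATUS words from the log can be decoded by hand. The following is only a minimal sketch, assuming the architectural bit layout of the IA32_MCi_STATUS and IA32_MCG_STATUS registers from the Intel SDM; the helper names (set_flags, decode_mci_status) are made up for illustration and are not part of mcelog or the kernel.

```python
# Sketch: decode the raw machine-check words reported in the log.
# Bit positions are assumed from the architectural register layout
# (Intel SDM, Machine-Check Architecture); names here are illustrative.

MCI_STATUS_FLAGS = {
    63: "VAL",    # register contains valid error information
    62: "OVER",   # a previous error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MCi_MISC register is valid
    58: "ADDRV",  # MCi_ADDR register is valid
    57: "PCC",    # processor context corrupt
}

MCG_STATUS_FLAGS = {
    0: "RIPV",  # restart IP valid
    1: "EIPV",  # error IP valid
    2: "MCIP",  # machine check in progress
}


def set_flags(value, table):
    """Return the names of all flag bits set in value, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_mci_status(status):
    """Split a 64-bit MCi_STATUS value into its flags and error codes."""
    return {
        "flags": set_flags(status, MCI_STATUS_FLAGS),
        "mca_error_code": status & 0xFFFF,              # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


# Values taken from the captured log: "Bank 4: b200000011000402", "MCGSTATUS 5".
status = decode_mci_status(0xB200000011000402)
mcg = set_flags(0x5, MCG_STATUS_FLAGS)

print("MCi status flags:", status["flags"])
print("MCA error code: 0x%04x" % status["mca_error_code"])
print("Model-specific code: 0x%04x" % status["model_specific_code"])
print("MCG status flags:", mcg)
```

With the values from the log, the UC, EN and PCC bits are set, which lines up with the "Uncorrected error / Error enabled / Processor context corrupt" lines printed by mcelog, and the low 16 bits give the 402 error code it reports.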
Created attachment 597722
hardware info

Output of:
[1] cat /proc/cpuinfo
[2] lspci -vvvv
[3] dmidecode
(In reply to bug 715485 comment 10)
Dave Jones said:
> was this machine hibernated at all ? I'm wondering if this was more fallout
> from the recent i915 memory corruption bug that got fixed.

Hi Dave,

Could you give me the link to that i915 bug? I have an issue similar to the one described in this bug. I'm not sure whether the system is hibernated: the LEDs are still on, but the system doesn't respond (even from the serial console). There is no call trace in the serial log.
there were literally dozens of them, so there's no single bug. that problem got fixed though, so you're seeing something different if the current builds don't work for you. Can you try running memtest86 on that machine for a while just to rule out hardware problems ? Machine checks happen quite a lot from things like overheating, or bad memory.
(In reply to comment #3)
> there were literally dozens of them, so there's no single bug.
> that problem got fixed though, so you're seeing something different if the
> current builds don't work for you.
>
> Can you try running memtest86 on that machine for a while just to rule out
> hardware problems ? Machine checks happen quite a lot from things like
> overheating, or bad memory.

memtest86 passed without any issues. I'll replace the processor and memory next week and update the status here afterwards.
The freeze no longer happens after replacing the motherboard and CPU, so I'm closing this as not a bug.