Bug 839511 - system random freeze with fedora 17 x86_64
Status: CLOSED NOTABUG
Product: Fedora
Classification: Fedora
Component: kernel
Version: 17
Hardware: x86_64 Linux
Priority: unspecified
Severity: high
Assigned To: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2012-07-12 03:21 EDT by Steven
Modified: 2012-08-05 23:06 EDT
Doc Type: Bug Fix
Last Closed: 2012-08-05 23:06:18 EDT
Type: Bug

Attachments
hardware info (42.52 KB, text/plain)
2012-07-12 03:25 EDT, Steven

Description Steven 2012-07-12 03:21:55 EDT
Description of problem:
I have a system running Fedora 17 x86_64. It freezes randomly and doesn't reboot, and kdump is not triggered either (kdump works well when I echo c to sysrq-trigger). I had been using this machine with Fedora 16 for about half a year before, and it never hit this problem.

I haven't found any clue as to what triggers the freeze. It may happen once a week in the middle of the night, or 2~3 times in one day while I'm working. I redirected the kernel log to the serial console, but unfortunately it didn't print any message when the system froze. After the freeze, the system doesn't respond to any input from the mouse, the keyboard, or the magic SysRq key via the serial console. The only lucky thing is that I captured the following log once last week:

[125515.772726] Disabling lock debugging due to kernel taint
[125515.778584] [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 4: b200000011000402
[125515.788790] [Hardware Error]: RIP !INEXACT! 10:<ffffffff8103de46> {native_safe_halt+0x6/0x10}
[125515.799515] [Hardware Error]: TSC 161319a22f68a 
[125515.805994] [Hardware Error]: PROCESSOR 0:206a7 TIME 1341674682 SOCKET 0 APIC 6 microcode 28
[125515.816637] [Hardware Error]: Run the above through 'mcelog --ascii'
[125515.824998] [Hardware Error]: Some CPUs didn't answer in synchronization
[125515.833774] [Hardware Error]: Machine check: Processor context corrupt
[125515.842346] Kernel panic - not syncing: Fatal machine check on current CPU
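
When a serial capture is long, the machine check header line can be pulled out with a short script before handing the text to mcelog. The following is a minimal sketch and not part of the original report: the function name find_mce_headers and the dictionary keys are hypothetical, and the pattern assumes the exact line layout shown in the log above.

```python
import re

# Matches the "Machine Check Exception" header line as printed in the log
# above. The group names are this sketch's own, not kernel terminology.
MCE_HEADER = re.compile(
    r"CPU (?P<cpu>\d+): Machine Check Exception:\s*(?P<mcgstatus>[0-9a-fA-F]+)"
    r"\s+Bank (?P<bank>\d+): (?P<status>[0-9a-fA-F]{16})"
)


def find_mce_headers(log_text):
    """Return one dict per MCE header line found in the captured log text."""
    events = []
    for line in log_text.splitlines():
        # Only lines tagged as hardware errors are of interest here.
        if "[Hardware Error]" not in line:
            continue
        match = MCE_HEADER.search(line)
        if match:
            events.append({
                "cpu": int(match.group("cpu")),
                "mcgstatus": int(match.group("mcgstatus"), 16),
                "bank": int(match.group("bank")),
                "status": int(match.group("status"), 16),
            })
    return events


# The header line copied from the log in this report.
sample = (
    "[125515.778584] [Hardware Error]: CPU 3: Machine Check Exception: "
    "5 Bank 4: b200000011000402"
)

for event in find_mce_headers(sample):
    print(event["cpu"], event["bank"], hex(event["status"]))
    # prints: 3 4 0xb200000011000402
```

The sketch only extracts the CPU, bank, and the two status values; it does not try to interpret them.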

decoded with mcelog:

# mcelog --ascii --cpu sandybridge < error.txt 
Hardware event. This is not a software error.
CPU 3 BANK 4 TSC 161319a22f68a 
RIP !INEXACT! 10:ffffffff8103de46
TIME 1341674682 Sat Jul  7 23:24:42 2012
MCG status:RIPV MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal unclassified error: 402
PCU: No error <24:11>

STATUS b200000011000402 MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 42
RIP: native_safe_halt+0x6/0x10}
SOCKET 0 APIC 6 microcode 28
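
As a cross-check on the mcelog output, the flags can be read directly from the raw values. The sketch below is illustrative only and not part of the original report. It assumes the architectural IA32_MCi_STATUS and IA32_MCG_STATUS bit layout described in the Intel SDM and the standard CPUID signature layout; the helper names set_flags and decode_cpu_signature are hypothetical.

```python
# Architectural flag bits of IA32_MCi_STATUS (assumed from the Intel SDM).
STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting was enabled
    59: "MISCV",  # MISC register is valid
    58: "ADDRV",  # ADDR register is valid
    57: "PCC",    # processor context corrupt
}

# Flag bits of IA32_MCG_STATUS (assumed from the Intel SDM).
MCG_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}


def set_flags(value, flags):
    """List the names of all flag bits that are set in value."""
    return [name for bit, name in flags.items() if (value >> bit) & 1]


def decode_cpu_signature(signature):
    """Split a CPUID signature into (family, model, stepping)."""
    stepping = signature & 0xF
    model = (signature >> 4) & 0xF
    family = (signature >> 8) & 0xF
    ext_model = (signature >> 16) & 0xF
    ext_family = (signature >> 20) & 0xFF
    # The extended model field only applies to families 6 and 15.
    if family in (0x6, 0xF):
        model |= ext_model << 4
    if family == 0xF:
        family += ext_family
    return family, model, stepping


status = 0xB200000011000402  # STATUS value from the log
mcgstatus = 0x5              # MCGSTATUS value from the log
signature = 0x206A7          # from "PROCESSOR 0:206a7"

print(set_flags(status, STATUS_FLAGS))   # ['VAL', 'UC', 'EN', 'PCC']
print(set_flags(mcgstatus, MCG_FLAGS))   # ['RIPV', 'MCIP']
print(hex(status & 0xFFFF))              # MCA error code: 0x402
print(hex((status >> 16) & 0xFFFF))      # model-specific code: 0x1100
print(decode_cpu_signature(signature))   # (6, 42, 7)
```

The set bits agree with what mcelog reports: an uncorrected, enabled error with processor context corrupt, MCA error code 402, and a Family 6 Model 42 CPU.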


My system configuration:

CPU: Intel(R) Xeon(R) CPU E31225 @ 3.10GHz
Machine model: Lenovo ThinkStation E30

Running kernel:
kernel-3.4.4-3.fc17.x86_64

also tried: 
kernel-3.5.0-0.rc5.git3.1.fc18 (rebuilt on Fedora 17), which still has the same problem.

Version-Release number of selected component (if applicable):


How reproducible:
randomly

Steps to Reproduce:
1. Boot up the system, keep it running and wait (may take a week).

Actual results:
The system freezes randomly.

Expected results:
The system should not freeze.

Additional info:
Comment 1 Steven 2012-07-12 03:25:54 EDT
Created attachment 597722
hardware info

output of:
[1] cat /proc/cpuinfo
[2] lspci -vvvv
[3] dmidecode
Comment 2 Steven 2012-07-12 05:53:08 EDT
(In reply to bug 715485 comment 10)
Dave Jones said:
> was this machine hibernated at all ? I'm wondering if this was more fallout
> from the recent i915 memory corruption bug that got fixed.

Hi Dave,

Could you give me the link to that i915 bug? I have an issue similar to the one described in this bug. I'm not sure whether the system is hibernated: the LEDs are still on, but the system can't respond (even via the serial console). There is no call trace info in the serial log.
Comment 3 Dave Jones 2012-07-12 10:32:52 EDT
there were literally dozens of them, so there's no single bug.
that problem got fixed though, so you're seeing something different if the current builds don't work for you.

Can you try running memtest86 on that machine for a while just to rule out hardware problems ? Machine checks happen quite a lot from things like overheating, or bad memory.
Comment 4 Qixiang Wan 2012-07-13 00:35:40 EDT
(In reply to comment #3)
> there were literally dozens of them, so there's no single bug.
> that problem got fixed though, so you're seeing something different if the
> current builds don't work for you.
> 
> Can you try running memtest86 on that machine for a while just to rule out
> hardware problems ? Machine checks happen quite a lot from things like
> overheating, or bad memory.

memtest86 passed without issue. I'll replace the processor and memory next week and update the status here afterwards.
Comment 5 Steven 2012-08-05 23:06:18 EDT
The freeze no longer happens after replacing the motherboard and CPU, so I'm closing this as not a bug.
