839511 – system random freeze with fedora 17 x86_64

Summary: system random freeze with fedora 17 x86_64

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	17
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2012-07-12 07:21 UTC by Steven
Modified:	2012-08-06 03:06 UTC
CC List:	6 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2012-08-06 03:06:18 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
hardware info (42.52 KB, text/plain) 2012-07-12 07:25 UTC, Steven

Description Steven 2012-07-12 07:21:55 UTC

Description of problem:
I have a system running Fedora 17 x86_64. It freezes at random and doesn't reboot, and kdump is not triggered either (kdump works fine with echo c > /proc/sysrq-trigger). I had been using this machine with Fedora 16 for about half a year before, and it never hit this problem.

I haven't found any clue as to what triggers the freeze; it may happen once a week in the middle of the night, or 2-3 times in one day while I'm working. I redirected the kernel log to a serial console, but unfortunately nothing was printed when it froze. After the freeze, the system doesn't respond to any input from the mouse, the keyboard, or the magic SysRq key via the serial console. The one lucky thing is that I captured the following log once last week:

[125515.772726] Disabling lock debugging due to kernel taint
[125515.778584] [Hardware Error]: CPU 3: Machine Check Exception: 5 Bank 4: b200000011000402
[125515.788790] [Hardware Error]: RIP !INEXACT! 10:<ffffffff8103de46> {native_safe_halt+0x6/0x10}
[125515.799515] [Hardware Error]: TSC 161319a22f68a 
[125515.805994] [Hardware Error]: PROCESSOR 0:206a7 TIME 1341674682 SOCKET 0 APIC 6 microcode 28
[125515.816637] [Hardware Error]: Run the above through 'mcelog --ascii'
[125515.824998] [Hardware Error]: Some CPUs didn't answer in synchronization
[125515.833774] [Hardware Error]: Machine check: Processor context corrupt
[125515.842346] Kernel panic - not syncing: Fatal machine check on current CPU

decoded with mcelog:

# mcelog --ascii --cpu sandybridge < error.txt 
Hardware event. This is not a software error.
CPU 3 BANK 4 TSC 161319a22f68a 
RIP !INEXACT! 10:ffffffff8103de46
TIME 1341674682 Sat Jul  7 23:24:42 2012
MCG status:RIPV MCIP 
MCi status:
Uncorrected error
Error enabled
Processor context corrupt
MCA: Internal unclassified error: 402
PCU: No error <24:11>

STATUS b200000011000402 MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 42
RIP: native_safe_halt+0x6/0x10}
SOCKET 0 APIC 6 microcode 28
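
For readers who want to cross-check the mcelog output above, here is a minimal Python sketch that decodes the raw values by hand. It assumes the standard Intel layout of IA32_MCi_STATUS, IA32_MCG_STATUS and the CPUID signature; the function and table names are made up for illustration and are not part of mcelog or the kernel.

```python
# Hypothetical helper names; bit positions assume the Intel SDM register layouts.
MCI_STATUS_FLAGS = {
    63: "VAL",    # register contents are valid
    62: "OVER",   # an earlier error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MISC register valid
    58: "ADDRV",  # ADDR register valid
    57: "PCC",    # processor context corrupt
}
MCG_STATUS_FLAGS = {0: "RIPV", 1: "EIPV", 2: "MCIP"}


def set_flags(value, table):
    """Return the names of the flags whose bit is set, highest bit first."""
    return [name for bit, name in sorted(table.items(), reverse=True)
            if (value >> bit) & 1]


def decode_mci_status(status):
    """Split a 64-bit bank status value into flags and error codes."""
    return {
        "flags": set_flags(status, MCI_STATUS_FLAGS),
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


def decode_cpuid_signature(sig):
    """Turn a CPUID signature such as 0x206a7 into family/model/stepping."""
    family = (sig >> 8) & 0xF
    model = (sig >> 4) & 0xF
    if family in (6, 15):
        # The extended model field supplies the upper four bits of the model.
        model |= ((sig >> 16) & 0xF) << 4
    if family == 15:
        # The extended family field is added only when the base family is 15.
        family += (sig >> 20) & 0xFF
    return {"family": family, "model": model, "stepping": sig & 0xF}


if __name__ == "__main__":
    print(decode_mci_status(0xB200000011000402))
    print(set_flags(0x5, MCG_STATUS_FLAGS))
    print(decode_cpuid_signature(0x206A7))
```

With the values from the log, this gives the flags VAL, UC, EN and PCC with MCA error code 0x402 for the bank status, MCIP and RIPV for MCGSTATUS 5, and family 6, model 42, stepping 7 for signature 206a7. That agrees with the "Uncorrected error", "Error enabled", "Processor context corrupt" and "Family 6 Model 42" lines that mcelog printed.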


My system configuration:

CPU: Intel(R) Xeon(R) CPU E31225 @ 3.10GHz
Machine model: Lenovo ThinkStation E30

Running kernel:
kernel-3.4.4-3.fc17.x86_64

Also tried:
kernel-3.5.0-0.rc5.git3.1.fc18 (rebuilt on Fedora 17), which still has the same problem.

Version-Release number of selected component (if applicable):


How reproducible:
randomly

Steps to Reproduce:
1. Boot up the system, keep it running and wait (may take a week).

Actual results:
System freezes at random.

Expected results:
Should not freeze.

Additional info:

Comment 1 Steven 2012-07-12 07:25:54 UTC

Created attachment 597722
hardware info

output of:
[1] cat /proc/cpuinfo
[2] lspci -vvvv
[3] dmidecode

Comment 2 Steven 2012-07-12 09:53:08 UTC

(In reply to bug 715485 comment 10)
Dave Jones said:
> was this machine hibernated at all ? I'm wondering if this was more fallout
> from the recent i915 memory corruption bug that got fixed.

Hi Dave,

Could you give me the link to that i915 bug? I have an issue similar to the one described in this bug. I'm not sure whether the system hibernated; the LEDs are still on, but the system doesn't respond (even from the serial console). There is no call trace in the serial log.

Comment 3 Dave Jones 2012-07-12 14:32:52 UTC

there were literally dozens of them, so there's no single bug.
that problem got fixed though, so you're seeing something different if the current builds don't work for you.

Can you try running memtest86 on that machine for a while just to rule out hardware problems ? Machine checks happen quite a lot from things like overheating, or bad memory.

Comment 4 Qixiang Wan 2012-07-13 04:35:40 UTC

(In reply to comment #3)
> there were literally dozens of them, so there's no single bug.
> that problem got fixed though, so you're seeing something different if the
> current builds don't work for you.
> 
> Can you try running memtest86 on that machine for a while just to rule out
> hardware problems ? Machine checks happen quite a lot from things like
> overheating, or bad memory.

memtest86 passed without issue. I'll replace the processor and memory next week and update the status here afterwards.

Comment 5 Steven 2012-08-06 03:06:18 UTC

It no longer happens after replacing the motherboard and CPU, so I'm closing this as not a bug.
