Bug 929420 - doesn't log actual event data

Summary: doesn't log actual event data

Keywords:
Status:	CLOSED WONTFIX
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	mcelog
Sub Component:
Version:	17
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	Prarit Bhargava
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2013-03-30 12:48 UTC by udo
Modified:	2013-08-01 18:02 UTC
CC List:	1 user
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-08-01 18:02:38 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
dmidecode output (14.75 KB, text/plain) 2013-04-01 15:36 UTC, udo

Description udo 2013-03-30 12:48:05 UTC

Description of problem: The kernel reports that machine check events were logged, but mcelog records no actual event data. Where are the events logged, and how can they be read? Nothing appears in /var/log/mcelog or in dmesg.


Version-Release number of selected component (if applicable):
mcelog-1.0-0.4.6e4e2a00.fc17.x86_64

How reproducible:
Boot the box.

Steps to Reproduce:
1. reboot
2. log in, open terminal, wait a few minutes
3. see messages file and dmesg
  
Actual results: see below


Expected results: either no mention of this or a decent mcelog.


Additional info:
Mar 27 16:02:03 a1 kernel: [  299.484876] mce: [Hardware Error]: Machine check events logged
Mar 27 16:11:46 a1 kernel: [  299.500184] mce: [Hardware Error]: Machine check events logged
Mar 27 16:49:26 a1 kernel: [  299.494439] mce: [Hardware Error]: Machine check events logged
Mar 27 17:42:58 a1 kernel: [  299.492441] mce: [Hardware Error]: Machine check events logged
Mar 28 17:57:35 a1 kernel: [  299.491062] mce: [Hardware Error]: Machine check events logged
Mar 28 18:20:33 a1 kernel: [  299.475068] mce: [Hardware Error]: Machine check events logged
Mar 28 18:43:17 a1 kernel: [  299.478879] mce: [Hardware Error]: Machine check events logged
Mar 30 11:11:16 a1 kernel: [  299.483371] mce: [Hardware Error]: Machine check events logged
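
To see how often the kernel raised this notice, the matching lines in the messages file can be counted. This is a minimal sketch: the function name count_mce_notices is hypothetical, and /var/log/messages is an assumed location for the syslog file.

```shell
# Hypothetical helper: count kernel "Machine check events logged" notices.
# Reads syslog-format text on stdin and prints the number of matching lines.
count_mce_notices() {
    grep -c 'mce: \[Hardware Error\]: Machine check events logged'
}

# Example (the path is an assumption; adjust to where syslog writes):
# count_mce_notices < /var/log/messages
```

The count only shows how many notices the kernel printed. It says nothing about the content of the events themselves.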

Comment 1 Prarit Bhargava 2013-03-30 17:18:35 UTC

> 3. see messages file and dmesg


You need to look at the mcelog log -- by default, IIRC, that is under /var.

P.

Comment 2 udo 2013-03-30 17:27:01 UTC

# pwd
/var
# ls -l
total 100
drwxr-xr-x.   2 root root  4096 Jan 16 18:18 account
drwxr-xr-x.   2 root root  4096 Feb  3  2012 adm
drwxr-xr-x.  20 root root  4096 Oct 25 09:40 cache
drwxr-xr-x.   2 root root  4096 Feb 12 14:01 cvs
drwxr-xr-x.   4 root root  4096 Nov 19 16:08 db
drwxr-xr-x.   3 root root  4096 Feb  3  2012 empty
drwxr-xr-x.   2 root root  4096 Feb  3  2012 games
drwxrwx--T.   2 root gdm   4096 Jun  9  2012 gdm
drwxr-xr-x.   2 root root  4096 Feb  3  2012 gopher
drwxr-xr-x.  62 root root  4096 Nov 12 11:28 lib
drwxr-xr-x.   2 root root  4096 Feb  3  2012 local
lrwxrwxrwx.   1 root root    11 Jul  6  2012 lock -> ../run/lock
drwxr-xr-x.  22 root root  4096 Mar 30 17:20 log
drwx------.   2 root root 16384 Dec  2  2008 lost+found
lrwxrwxrwx.   1 root root    10 Jul  6  2012 mail -> spool/mail
drwxr-xr-x.   2 root root  4096 Feb  3  2012 nis
drwxr-xr-x.   2 root root  4096 Feb  3  2012 opt
drwxr-xr-x.   2 root root  4096 Feb  3  2012 preserve
lrwxrwxrwx.   1 root root     6 Jul  6  2012 run -> ../run
drwxr-xr-x.  17 root root  4096 Jul  6  2012 spool
drwxrwxrwt. 165 root root 12288 Mar 30 17:54 tmp
drwxr-xr-x.  15 root root  4096 Mar 24 11:41 www
drwxr-xr-x.   2 root root  4096 Feb  3  2012 yp
#

So there's nothing there.

Comment 3 udo 2013-03-30 17:28:04 UTC

Also:

# strings /sbin/mcelog|grep /var
/var/run/mcelog-client
/var/log/mcelog
/var/run/mcelog.pid
#
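
The three paths found by strings can be checked for existence in one pass. A minimal sketch follows; the helper name report_paths is hypothetical, and the check only tells whether each path exists, not whether mcelog has written anything to it.

```shell
# Hypothetical helper: report which of the given paths exist.
report_paths() {
    for p in "$@"; do
        if [ -e "$p" ]; then
            printf 'present %s\n' "$p"
        else
            printf 'missing %s\n' "$p"
        fi
    done
}

# The paths below are the ones printed by `strings /sbin/mcelog` above:
# report_paths /var/run/mcelog-client /var/log/mcelog /var/run/mcelog.pid
```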

Comment 4 Prarit Bhargava 2013-03-31 14:13:21 UTC

hmm -- what happens when you manually run 'mcelog'?

Is the mcelog service running?

P.

Comment 5 udo 2013-03-31 14:22:08 UTC

# systemctl status mcelog.service
mcelog.service - Machine Check Exception Logging Daemon
	  Loaded: loaded (/usr/lib/systemd/system/mcelog.service; enabled)
	  Active: active (running) since Sun, 31 Mar 2013 13:36:44 +0200; 2h 44min ago
	 Process: 3001 ExecStartPre=/etc/mcelog/mcelog.setup (code=exited, status=0/SUCCESS)
	Main PID: 3105 (mcelog)
	  CGroup: name=systemd:/system/mcelog.service
		  └ 3105 /usr/sbin/mcelog --ignorenodev --daemon --foreground --syslog

Mar 31 13:36:44 a1.hierzo mcelog[3105]: Kernel does not support page offline interface
# ps -ef|grep mcelog
root      3105     1  0 13:36 ?        00:00:00 /usr/sbin/mcelog --ignorenodev --daemon --foreground --syslog
root      9199  5001  0 16:20 pts/0    00:00:00 grep --color=auto mcelog
# mcelog
#
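
The daemon's command line hints at where decoded events would go. The sketch below inspects only the --syslog and --logfile= options of mcelog; the helper name mcelog_log_targets is hypothetical, and anything set in mcelog.conf is not considered.

```shell
# Hypothetical helper: list the log destinations named on an mcelog
# command line. Only --syslog and --logfile=FILE are recognised here.
mcelog_log_targets() {
    found=0
    for arg in "$@"; do
        case "$arg" in
            --syslog) echo "syslog"; found=1 ;;
            --logfile=*) echo "file ${arg#--logfile=}"; found=1 ;;
        esac
    done
    [ "$found" -eq 1 ] || echo "unspecified"
}

# Command line taken from the systemctl output above:
mcelog_log_targets /usr/sbin/mcelog --ignorenodev --daemon --foreground --syslog
```

For the command line shown in this comment the sketch prints syslog, which suggests that any decoded events would be sent to syslog rather than to /var/log/mcelog.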

Comment 6 Prarit Bhargava 2013-04-01 15:29:52 UTC

That's really strange.  Can you attach a dmidecode output?

P.

Comment 7 Prarit Bhargava 2013-04-01 15:30:42 UTC

FWIW ...

[root@intel-canoepass-07 mcelog]# systemctl status mcelog
mcelog.service - Machine Check Exception Logging Daemon
          Loaded: loaded (/usr/lib/systemd/system/mcelog.service; enabled)
          Active: active (running) since Mon 2013-04-01 11:12:39 EDT; 17min ago
         Process: 907 ExecStartPre=/etc/mcelog/mcelog.setup (code=exited, status=0/SUCCESS)
        Main PID: 916 (mcelog)
          CGroup: name=systemd:/system/mcelog.service
                  └─916 /usr/sbin/mcelog --ignorenodev --daemon --foreground

Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: TIME 1364829412 Mon Apr  1 11:16:52 2013
Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: MCG status:
Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: MCi status:
Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: Corrected error
Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: Error enabled
Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: MCi_ADDR register valid
Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: MCA: No Error
Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: STATUS 9400000000000000 MCGSTATUS 0
Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: MCGCAP 1000c1b APICID 2 SOCKETID 0
Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: CPUID Vendor Intel Family 6 Model 62

P.
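
The decoded lines in the log above can be cross-checked against the raw STATUS value. The sketch below reads the architectural flag bits 63 to 57 of IA32_MCi_STATUS as laid out in the Intel SDM (VAL, OVER, UC, EN, MISCV, ADDRV, PCC). The function name decode_mci_status is hypothetical, and the input is assumed to be exactly 16 hex digits.

```shell
# Decode the architectural flag bits (63..57) of an MCi_STATUS value.
# Argument: exactly 16 hex digits, as printed after "STATUS" in the log.
decode_mci_status() {
    top=$(printf '%s' "$1" | cut -c1-2)   # top byte holds bits 63..56
    byte=$((0x$top))
    printf 'VAL=%d OVER=%d UC=%d EN=%d MISCV=%d ADDRV=%d PCC=%d\n' \
        $(( (byte >> 7) & 1 )) $(( (byte >> 6) & 1 )) \
        $(( (byte >> 5) & 1 )) $(( (byte >> 4) & 1 )) \
        $(( (byte >> 3) & 1 )) $(( (byte >> 2) & 1 )) \
        $(( (byte >> 1) & 1 ))
}

decode_mci_status 9400000000000000
```

For 9400000000000000 this gives UC=0, EN=1 and ADDRV=1, which agrees with the "Corrected error", "Error enabled" and "MCi_ADDR register valid" lines in the log.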

Comment 8 udo 2013-04-01 15:36:30 UTC

Created attachment 730325
dmidecode output

Comment 9 udo 2013-04-01 15:37:15 UTC

(In reply to comment #7)
> FWIW ...

What does this mean w.r.t. the problem of this bug?

Comment 10 Prarit Bhargava 2013-04-01 15:40:52 UTC

(In reply to comment #9)
> (In reply to comment #7)
> > FWIW ...
> 
> What does this mean w.r.t. the problem of this bug?

It's interesting that it is working on my system and not on yours.  This could indicate a HW/FW issue on your system, or that mcelog isn't actually supported on your system.  If it is the first case, then there's not much I can do about that; if it is the second case, maybe a code update is required to make it work for you.

P.

Comment 11 Prarit Bhargava 2013-04-01 15:43:02 UTC

Hmm, you're using AMD -- can you try using EDAC instead of mcelog?  IIRC, the preferred method of decoding on AMD is EDAC.

P.

Comment 12 Prarit Bhargava 2013-04-01 15:44:14 UTC

https://github.com/andikleen/mcelog/commit/b986691d9c5656beb8a6a0f65b8c7abc29d73a96

P.

Comment 13 udo 2013-04-01 15:48:26 UTC

(In reply to comment #11)
> Hmm, you're using AMD -- can you try using EDAC instead of mcelog.  IIRC,
> the preferred method of decoding is using EDAC (on AMD).

How to do that? (url is OK)



(In reply to comment #10)
> It's interesting that it is working on my system and not on yours.  This
> could indicate a HW/FW issue on your system, 

The HW/FW issue is what I'd like to find out. I did switch between the F4 and F3[letter] firmware versions to verify.

> or that mcelog isn't actually
> supported on your system.  

Why would that be?

> If it is the first case, then there's not much I
> can do about that, 

I'd like to use mcelog to find the problem... :-)

> if it is the second case maybe a code update is required
> to make it work for you.

I was already in contact with Gigabyte about a BIOS bug that caused AMD-Vi messages in the logs and more or less froze the system.
The kernel workaround message no longer shows up with the F4 BIOS.

Comment 14 udo 2013-04-01 15:57:21 UTC

It's a kernel option I see.
Does that option require mcelog to be running or be present?

Comment 15 udo 2013-04-01 16:17:38 UTC

# modprobe amd64_edac_mod
ERROR: could not insert 'amd64_edac_mod': No such device

So that won't help me now.

Comment 16 Prarit Bhargava 2013-04-01 17:08:35 UTC

(In reply to comment #14)
> It's a kernel option I see.
> Does that option require mcelog to be running or be present?

mcelog can be running but is not required.

(In reply to comment #15)
> # modprobe amd64_edac_mod
> ERROR: could not insert 'amd64_edac_mod': No such device
> 
> So that won't help me now.

Hmm.  Okay, let me see if I can grab an AMD system and reproduce this.

P.

Comment 17 Prarit Bhargava 2013-04-01 17:36:47 UTC

I installed F18 on an AMD system,

processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 21
model           : 1
model name      : AMD Opteron(TM) Processor 6274                 

AFAICT, the edac module is loaded by default:

[root@amd-dinar-07 ~]# lsmod | grep edac
amd64_edac_mod         23665  0 
edac_core              56455  2 amd64_edac_mod
edac_mce_amd           22634  1 amd64_edac_mod

I'm not sure why it isn't working on your system.  
Can you double check that edac isn't already loaded?

P.
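
Checking for loaded EDAC modules can be scripted by filtering lsmod output. A minimal sketch; the helper name loaded_edac_modules is hypothetical.

```shell
# Hypothetical helper: print the names of loaded EDAC-related modules,
# given `lsmod` output on stdin.
loaded_edac_modules() {
    awk '$1 ~ /edac/ { print $1 }'
}

# Example:
# lsmod | loaded_edac_modules
```

If only edac_core is listed, no platform driver such as amd64_edac_mod has been loaded.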

Comment 18 Prarit Bhargava 2013-04-01 17:49:08 UTC

Also what does 

dmesg | grep edac

show?

P.

Comment 19 udo 2013-04-02 13:20:12 UTC

# lsmod|grep edac
edac_core              42927  0 


amd64_edac_mod does not support AMD's A10 APU.

Comment 20 Prarit Bhargava 2013-04-02 13:42:02 UTC

(In reply to comment #19)
> # lsmod|grep edac
> edac_core              42927  0 
> 
> 
> amd64_edac_mod does not support AMD's A10 APU.

Ah ... so that would explain this.  Hopefully AMD will do something to support A10 in the future.

There's still a bug here.  mcelog should have returned an error on the AMD processor and not started.

P.

Comment 21 udo 2013-04-02 13:47:46 UTC

In that case please fix that small mcelog issue.
After that you can close this issue.

edac_core gives me output, see edac mailinglist.

Comment 22 Fedora End Of Life 2013-07-04 06:28:00 UTC

This message is a reminder that Fedora 17 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 17. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as WONTFIX if it remains open with a Fedora 
'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that
we may not be able to fix it before Fedora 17 is end of life. If you
would still like to see this bug fixed and are able to reproduce it
against a later version of Fedora, you are encouraged to change the
'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 23 Fedora End Of Life 2013-08-01 18:02:44 UTC

Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is 
no longer maintained, which means that it will not receive any further 
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of 
Fedora please feel free to reopen this bug against that version.

Thank you for reporting this bug and we are sorry it could not be fixed.
