Description of problem:
mcelog doesn't log actual event data but does complain that events were logged. Where? How? Not in /var/log/mcelog or in dmesg.

Version-Release number of selected component (if applicable):
mcelog-1.0-0.4.6e4e2a00.fc17.x86_64

How reproducible:
Boot the box.

Steps to Reproduce:
1. reboot
2. log in, open terminal, wait a few minutes
3. see messages file and dmesg

Actual results:
see below

Expected results:
either no mention of this or a decent mcelog.

Additional info:
Mar 27 16:02:03 a1 kernel: [ 299.484876] mce: [Hardware Error]: Machine check events logged
Mar 27 16:11:46 a1 kernel: [ 299.500184] mce: [Hardware Error]: Machine check events logged
Mar 27 16:49:26 a1 kernel: [ 299.494439] mce: [Hardware Error]: Machine check events logged
Mar 27 17:42:58 a1 kernel: [ 299.492441] mce: [Hardware Error]: Machine check events logged
Mar 28 17:57:35 a1 kernel: [ 299.491062] mce: [Hardware Error]: Machine check events logged
Mar 28 18:20:33 a1 kernel: [ 299.475068] mce: [Hardware Error]: Machine check events logged
Mar 28 18:43:17 a1 kernel: [ 299.478879] mce: [Hardware Error]: Machine check events logged
Mar 30 11:11:16 a1 kernel: [ 299.483371] mce: [Hardware Error]: Machine check events logged
> 3. see messages file and dmesg

You need to look at the mcelog -- by default IIRC, that is in /var.

P.
# pwd
/var
# ls -l
total 100
drwxr-xr-x. 2 root root 4096 Jan 16 18:18 account
drwxr-xr-x. 2 root root 4096 Feb 3 2012 adm
drwxr-xr-x. 20 root root 4096 Oct 25 09:40 cache
drwxr-xr-x. 2 root root 4096 Feb 12 14:01 cvs
drwxr-xr-x. 4 root root 4096 Nov 19 16:08 db
drwxr-xr-x. 3 root root 4096 Feb 3 2012 empty
drwxr-xr-x. 2 root root 4096 Feb 3 2012 games
drwxrwx--T. 2 root gdm 4096 Jun 9 2012 gdm
drwxr-xr-x. 2 root root 4096 Feb 3 2012 gopher
drwxr-xr-x. 62 root root 4096 Nov 12 11:28 lib
drwxr-xr-x. 2 root root 4096 Feb 3 2012 local
lrwxrwxrwx. 1 root root 11 Jul 6 2012 lock -> ../run/lock
drwxr-xr-x. 22 root root 4096 Mar 30 17:20 log
drwx------. 2 root root 16384 Dec 2 2008 lost+found
lrwxrwxrwx. 1 root root 10 Jul 6 2012 mail -> spool/mail
drwxr-xr-x. 2 root root 4096 Feb 3 2012 nis
drwxr-xr-x. 2 root root 4096 Feb 3 2012 opt
drwxr-xr-x. 2 root root 4096 Feb 3 2012 preserve
lrwxrwxrwx. 1 root root 6 Jul 6 2012 run -> ../run
drwxr-xr-x. 17 root root 4096 Jul 6 2012 spool
drwxrwxrwt. 165 root root 12288 Mar 30 17:54 tmp
drwxr-xr-x. 15 root root 4096 Mar 24 11:41 www
drwxr-xr-x. 2 root root 4096 Feb 3 2012 yp
#

So there's nothing there.
Also:

# strings /sbin/mcelog|grep /var
/var/run/mcelog-client
/var/log/mcelog
/var/run/mcelog.pid
#
hmm -- what happens when you manually run 'mcelog'? Is the mcelog service running? P.
# systemctl status mcelog.service
mcelog.service - Machine Check Exception Logging Daemon
      Loaded: loaded (/usr/lib/systemd/system/mcelog.service; enabled)
      Active: active (running) since Sun, 31 Mar 2013 13:36:44 +0200; 2h 44min ago
     Process: 3001 ExecStartPre=/etc/mcelog/mcelog.setup (code=exited, status=0/SUCCESS)
    Main PID: 3105 (mcelog)
      CGroup: name=systemd:/system/mcelog.service
              └ 3105 /usr/sbin/mcelog --ignorenodev --daemon --foreground --syslog

Mar 31 13:36:44 a1.hierzo mcelog[3105]: Kernel does not support page offline interface
# ps -ef|grep mcelog
root 3105 1 0 13:36 ? 00:00:00 /usr/sbin/mcelog --ignorenodev --daemon --foreground --syslog
root 9199 5001 0 16:20 pts/0 00:00:00 grep --color=auto mcelog
# mcelog
#
That's really strange. Can you attach a dmidecode output? P.
FWIW ...

[root@intel-canoepass-07 mcelog]# systemctl status mcelog
mcelog.service - Machine Check Exception Logging Daemon
      Loaded: loaded (/usr/lib/systemd/system/mcelog.service; enabled)
      Active: active (running) since Mon 2013-04-01 11:12:39 EDT; 17min ago
     Process: 907 ExecStartPre=/etc/mcelog/mcelog.setup (code=exited, status=0/SUCCESS)
    Main PID: 916 (mcelog)
      CGroup: name=systemd:/system/mcelog.service
              └─916 /usr/sbin/mcelog --ignorenodev --daemon --foreground

Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: TIME 1364829412 Mon Apr 1 11:16:52 2013
Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: MCG status:
Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: MCi status:
Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: Corrected error
Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: Error enabled
Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: MCi_ADDR register valid
Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: MCA: No Error
Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: STATUS 9400000000000000 MCGSTATUS 0
Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: MCGCAP 1000c1b APICID 2 SOCKETID 0
Apr 01 11:16:52 intel-canoepass-07.lab.bos.redhat.com mcelog[916]: CPUID Vendor Intel Family 6 Model 62

P.
Created attachment 730325: dmidecode output
(In reply to comment #7)
> FWIW ...

What does this mean w.r.t. the problem of this bug?
(In reply to comment #9)
> (In reply to comment #7)
> > FWIW ...
>
> What does this mean w.r.t. the problem of this bug?

It's interesting that it is working on my system and not on yours. This could indicate a HW/FW issue on your system, or that mcelog isn't actually supported on your system. If it is the first case, then there's not much I can do about that; if it is the second case, maybe a code update is required to make it work for you.

P.
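One way to make the "working here, not there" comparison concrete is to count the two kinds of lines separately in a syslog file: the kernel's "Machine check events logged" notices and the decoded records written by the mcelog daemon. The following is only a rough sketch. The function names are made up for illustration, and it assumes each decoded record contains exactly one "STATUS" line tagged "mcelog[PID]:", as in the sample output earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helpers, not part of mcelog.

# Count kernel-side notices that machine check events were logged.
count_kernel_mce_notices() {
    grep -c 'mce: \[Hardware Error]: Machine check events logged' "$1" || true
}

# Count decoded records from the mcelog daemon.
# Assumption: one "STATUS ..." line per record, tagged "mcelog[PID]:".
count_mcelog_records() {
    grep -c 'mcelog\[[0-9][0-9]*]: STATUS ' "$1" || true
}
```

Running both functions against the messages file (for example /var/log/messages, if that is where syslog goes on the box) would show many kernel notices and zero decoded records on a system with the reported behaviour.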
Hmm, you're using AMD -- can you try using EDAC instead of mcelog? IIRC, the preferred method of decoding is using EDAC (on AMD).

P.
https://github.com/andikleen/mcelog/commit/b986691d9c5656beb8a6a0f65b8c7abc29d73a96 P.
(In reply to comment #11)
> Hmm, you're using AMD -- can you try using EDAC instead of mcelog. IIRC,
> the preferred method of decoding is using EDAC (on AMD).

How to do that? (url is OK)

(In reply to comment #10)
> It's interesting that it is working on my system and not on yours. This
> could indicate a HW/FW issue on your system,

The HW/FW issue is what I'd like to find out. I did switch between F4 and F3[letter] firmwares to verify.

> or that mcelog isn't actually
> supported on your system.

Why would that be?

> If it is the first case, then there's not much I
> can do about that,

I'd like to use mcelog to find the problem... :-)

> if it is the second case maybe a code update is required
> to make it work for you.

I was already in contact with Gigabyte about a BIOS bug that caused AMD-Vi messages in the logs and more or less froze the system. The kernel workaround no longer shows up with the F4 BIOS.
It's a kernel option I see. Does that option require mcelog to be running or be present?
# modprobe amd64_edac_mod
ERROR: could not insert 'amd64_edac_mod': No such device

So that won't help me now.
(In reply to comment #14)
> It's a kernel option I see.
> Does that option require mcelog to be running or be present?

mcelog can be running but is not required.

(In reply to comment #15)
> # modprobe amd64_edac_mod
> ERROR: could not insert 'amd64_edac_mod': No such device
>
> So that won't help me now.

Hmm. Okay, let me see if I can grab an AMD system and reproduce this.

P.
I installed F18 on an AMD system,

processor : 0
vendor_id : AuthenticAMD
cpu family : 21
model : 1
model name : AMD Opteron(TM) Processor 6274

AFAICT, the edac module is loaded by default:

[root@amd-dinar-07 ~]# lsmod | grep edac
amd64_edac_mod 23665 0
edac_core 56455 2 amd64_edac_mod
edac_mce_amd 22634 1 amd64_edac_mod

I'm not sure why it isn't working on your system. Can you double check that edac isn't already loaded?

P.
Also what does dmesg | grep edac show? P.
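For the "is edac already loaded" check, a small filter over lsmod output can tell a hardware-specific EDAC driver apart from the generic core module. This is a sketch only. The function name is hypothetical, and the naming rule (driver module names end in "_edac" or "_edac_mod", while edac_core and edac_mce_amd do not count as drivers) is an assumption based on the module names shown in this thread.

```shell
#!/bin/sh
# Hypothetical helper: read lsmod-style text on stdin and succeed only if a
# hardware-specific EDAC driver is listed.
# Assumption: such drivers are named *_edac or *_edac_mod (e.g. amd64_edac_mod);
# edac_core and edac_mce_amd alone are not treated as a driver.
edac_driver_loaded() {
    awk '$1 ~ /_edac(_mod)?$/ { found = 1 } END { exit (found ? 0 : 1) }'
}
```

Typical use would be `lsmod | edac_driver_loaded && echo "EDAC driver present"`. If only edac_core shows up, the function returns non-zero.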
# lsmod|grep edac
edac_core 42927 0

amd64_edac_mod does not support AMD's A10 GPU.
(In reply to comment #19)
> # lsmod|grep edac
> edac_core 42927 0
>
>
> amd64_edac_mod does not support AMD's A10 GPU.

Ah ... so that would explain this. Hopefully AMD will do something to support A10 in the future.

There's still a bug here. mcelog should have returned an error on the AMD processor and not started.

P.
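To illustrate the kind of check meant by "should have returned an error and not started", here is a rough pre-start guard in shell. It is not how mcelog behaves and not taken from the mcelog sources. Both function names are invented, and treating only GenuineIntel as supported is an assumption made purely for this example.

```shell
#!/bin/sh
# Hypothetical pre-start guard, not mcelog's actual logic.

# Print the vendor_id of the first CPU listed in a cpuinfo-style file.
cpu_vendor() {
    awk -F': *' '/^vendor_id/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Succeed only for vendors this sketch treats as supported.
# Assumption for illustration: GenuineIntel only.
vendor_supported() {
    case "$(cpu_vendor "$1")" in
        GenuineIntel) return 0 ;;
        *) return 1 ;;
    esac
}
```

A wrapper could call `vendor_supported || { echo "CPU vendor not supported" >&2; exit 1; }` before starting the daemon, so the failure is visible in the service status instead of a silently idle process.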
In that case please fix that small mcelog issue. After that you can close this issue. edac_core gives me output; see the EDAC mailing list.
This message is a reminder that Fedora 17 is nearing its end of life. Approximately 4 (four) weeks from now Fedora will stop maintaining and issuing updates for Fedora 17. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as WONTFIX if it remains open with a Fedora 'version' of '17'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version prior to Fedora 17's end of life.

Bug Reporter: Thank you for reporting this issue and we are sorry that we may not be able to fix it before Fedora 17 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version prior to Fedora 17's end of life.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Fedora 17 changed to end-of-life (EOL) status on 2013-07-30. Fedora 17 is no longer maintained, which means that it will not receive any further security or bug fix updates. As a result we are closing this bug. If you can reproduce this bug against a currently maintained version of Fedora please feel free to reopen this bug against that version. Thank you for reporting this bug and we are sorry it could not be fixed.