Bug 1069652 - mcelog AMD Processor family 16: Please load edac_mce_amd module
Summary: mcelog AMD Processor family 16: Please load edac_mce_amd module
Status: NEW
Alias: None
Product: Fedora
Classification: Fedora
Component: mcelog
Version: 29
Hardware: x86_64
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Prarit Bhargava
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Keywords: Reopened
Duplicates: 1274193
Depends On:
Blocks:
Reported: 2014-02-25 13:25 UTC by Davide Repetto
Modified: 2019-05-03 07:45 UTC (History)
CC List: 29 users
Clone Of:
Last Closed: 2016-12-20 12:45:50 UTC


Attachments
patch for changing exit code (818 bytes, patch)
2014-12-16 02:00 UTC, Yu Watanabe
service file for mcelog (597 bytes, text/plain)
2014-12-16 02:01 UTC, Yu Watanabe
patch for the spec file (200 bytes, patch)
2014-12-16 02:02 UTC, Yu Watanabe
spec file (6.63 KB, text/x-matlab)
2014-12-16 02:04 UTC, Yu Watanabe

Description Davide Repetto 2014-02-25 13:25:34 UTC
Description of problem:
=======================
mcelog fails with:
AMD Processor family 16: Please load edac_mce_amd module


Version-Release number of selected component (if applicable):
=============================================================
mcelog-1.0-0.11.f0d7654.fc20


How reproducible:
=================
Always with AMD CPUs of Family 16 and up.


Additional info:
================
The problem seems to be fixed already upstream and in RHEL packages.

Comment 1 Luya Tshimbalanga 2014-05-24 17:34:05 UTC
I have successfully reproduced the bug, as shown by systemctl --failed:
# systemctl --failed
UNIT           LOAD   ACTIVE SUB    DESCRIPTION
mcelog.service loaded failed failed Machine Check Exception Logging Daemon
rngd.service   loaded failed failed Hardware RNG Entropy Gatherer Daemon

The service status shows:
# systemctl status mcelog.service
mcelog.service - Machine Check Exception Logging Daemon
   Loaded: loaded (/usr/lib/systemd/system/mcelog.service; enabled)
   Active: failed (Result: exit-code) since Sat 2014-05-24 02:34:54 PDT; 7h ago
 Main PID: 682 (code=exited, status=1/FAILURE)

May 24 02:35:32 mcelog.setup[652]: CPU is unsupported
May 24 02:35:33 mcelog[682]: mcelog: AMD Processor family 16....
May 24 02:35:33 mcelog[682]: : Success
May 24 02:35:33 mcelog[682]: CPU is unsupported

Starting mcelog returns the result:
# mcelog start
mcelog: AMD Processor family 16: Please load edac_mce_amd module.
: Success
CPU is unsupported

The running CPU is an AMD Phenom II X4 940

Comment 2 collura 2014-05-25 10:43:11 UTC
similar but with different processor AMD APU A4-5000 (Family 22 not 16)

kernel thinks family 16 according to journalctl:
  'May 24 23:00:28 <removed>.<removed> kernel: smpboot: 
     CPU0: AMD A4-5000 APU with Radeon(TM) HD Graphics 
     (fam: 16, model: 00, stepping: 01)'

  'lscpu' and 'systemctl status mcelog.service' report family 22

Should I clone this bug (family 22, not 16) for the different processor, or leave it here (comment#0: 'Family 16 and up') and attach '/proc/cpuinfo' to this bug?

$ systemctl -l status mcelog.service
mcelog.service - Machine Check Exception Logging Daemon
   Loaded: loaded (/usr/lib/systemd/system/mcelog.service; enabled)
   Active: failed (Result: exit-code) since Sun 2014-05-25 03:00:46 EDT; 2h 49min left
  Process: 627 ExecStart=/usr/sbin/mcelog --ignorenodev --daemon --foreground (code=exited, status=1/FAILURE)
  Process: 606 ExecStartPre=/etc/mcelog/mcelog.setup (code=exited, status=0/SUCCESS)
 Main PID: 627 (code=exited, status=1/FAILURE)
   CGroup: /system.slice/mcelog.service

May 25 03:00:52 <removed>.<removed> mcelog.setup[606]: CPU is unsupported
May 25 03:00:52 <removed>.<removed> mcelog[627]: mcelog: AMD Processor family 22: Please load edac_mce_amd module.
May 25 03:00:52 <removed>.<removed> mcelog[627]: : Success
May 25 03:00:52 <removed>.<removed> mcelog[627]: CPU is unsupported
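
The family mismatch above is probably a matter of number bases rather than a second bug. As far as I can tell, the kernel's smpboot line prints the family in hexadecimal, while lscpu and mcelog print it in decimal, so "fam: 16" (16h) and "family 22" would be the same family. A minimal sketch of the conversion, assuming that reading is right (hex_family_to_dec is a made-up helper name, not part of any tool):

```shell
# Convert a CPU family written in hexadecimal (as it seems to appear in the
# kernel's "fam: 16" boot line) to the decimal form printed by lscpu and
# mcelog. hex_family_to_dec is a hypothetical helper for illustration only.
hex_family_to_dec() {
    printf '%d\n' "0x$1"
}

hex_family_to_dec 16   # the A4-5000 boot line above (16h)
hex_family_to_dec 15   # 15h
```

With this reading, 16h corresponds to decimal 22 and 15h to decimal 21, which matches the values mcelog prints in its own message.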

Comment 3 Davide Repetto 2014-05-28 11:55:09 UTC
The error should probably be downgraded to a simple warning. It can be safely ignored without any loss of functionality, because the edac_mce_amd module, which is loaded right after the error is issued, directly implements machine-check error logging on those AMD CPUs that are incompatible with the mcelog daemon.

Here is the information source:
http://www.novell.com/support/kb/doc.php?id=7013006

At this point the severity of this bug should drop to "low" since it is merely a cosmetic issue.

Comment 4 collura 2014-05-28 22:32:13 UTC
In reply to comment#3:

That's a great link, thanks.

the information might be clearer to the user,
if the error message could be changed from

  'cpu not supported' 
     (which sounds like an mcelog issue that should be reported)

  to something like

  'AMD cpu families > 16h do not support mcelog so loading kernel module edac_mce_amd instead' 
     (which tells a better story of the cause/functional_status 
      so we don't file bug reports on a fixed problem :') )


thanks again, good find.

Comment 5 Richard W.M. Jones 2014-07-03 09:01:16 UTC
model name	: AMD FX(tm)-8320 Eight-Core Processor           

$ lsmod | grep mce
edac_mce_amd           22349  0 
$ sudo mcelog
mcelog: AMD Processor family 21: Please load edac_mce_amd module.
: Success
CPU is unsupported

The error is obviously wrong.  I have the module loaded.

What does the error mean?

What is going wrong?

How can I see MCE events?

Comment 6 Prarit Bhargava 2014-07-08 12:31:01 UTC
(In reply to Richard W.M. Jones from comment #5)
> model name	: AMD FX(tm)-8320 Eight-Core Processor           
> 
> $ lsmod | grep mce
> edac_mce_amd           22349  0 
> $ sudo mcelog
> mcelog: AMD Processor family 21: Please load edac_mce_amd module.
> : Success
> CPU is unsupported
> 
> The error is obviously wrong.  I have the module loaded.
> 

The error is from the userspace mcelog program, not the kernel.

> What does the error mean?

It is trying to tell you to not use mcelog on AMD HW.

> 
> What is going wrong?

You're using mcelog on AMD HW.

> 
> How can I see MCE events?

The edac_mce_amd module will put messages in the kernel log for mce events.

P.

Comment 7 Richard W.M. Jones 2014-07-08 12:39:30 UTC
This bug is filed against 'mcelog', not the kernel.

Reopening because the error is incredibly obtuse and should
be fixed in mcelog.

Comment 8 Fomalhaut 2014-11-06 19:07:11 UTC
model name	: AMD FX(tm)-8350 Eight-Core Processor

I noticed an error in the journal:

mcelog[772]: AMD Processor family 21: Please load edac_mce_amd module.
                : Success

Further investigation:

$ lsmod | grep mce
edac_mce_amd           22310  0

$ systemctl status mcelog
mcelog.service - Machine Check Exception Logging Daemon
   Loaded: loaded (/usr/lib/systemd/system/mcelog.service; enabled)
   Active: failed (Result: exit-code) since Thu 2014-11-06 21:50:27 MSK; 1s ago
  Process: 5663 ExecStart=/usr/sbin/mcelog --ignorenodev --daemon --foreground (code=exited, status=1/FAILURE)
  Process: 5659 ExecStartPre=/etc/mcelog/mcelog.setup (code=exited, status=0/SUCCESS)
 Main PID: 5663 (code=exited, status=1/FAILURE)

Nov 06 21:50:27 <removed>.<removed> mcelog.setup[5659]: CPU is unsupported
Nov 06 21:50:27 <removed>.<removed> systemd[1]: Started Machine Check Exception Logging Daemon.
Nov 06 21:50:27 <removed>.<removed> mcelog[5663]: mcelog: AMD Processor family 21: Please load edac_mce_amd module.
Nov 06 21:50:27 <removed>.<removed> mcelog[5663]: : Success
Nov 06 21:50:27 <removed>.<removed> mcelog[5663]: CPU is unsupported
Nov 06 21:50:27 <removed>.<removed> systemd[1]: mcelog.service: main process exited, code=exited, status=1/FAILURE
Nov 06 21:50:27 <removed>.<removed> systemd[1]: Unit mcelog.service entered failed state.

I don't know about "cosmetic errors", but I often have problems: sudden reboots with CPU errors, and kernel panics. Everything is as described at this link.

[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:7 (15:2:0) MC1_STATUS[-|CE|MiscV|-|AddrV|-|-]: 0x8c00002000010151
[Hardware Error]: MC1_ADDR: 0x00007f3c71771c83
[Hardware Error]: MC1 Error: Parity error during data load from IC.
[Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD

[Hardware Error]: Uncorrected, software containable error.
[Hardware Error]: CPU:6 (15:2:0) MC1_STATUS[-|UE|-|-|-|-|-]: 0xb080000000040151
[Hardware Error]: MC1 Error: Parity error in prediction queue.
[Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD

Comment 9 bob 2014-11-22 09:49:59 UTC
I am having similar problems with an FX-8350 on F21Beta.  I just filed bug 1166978 on F21.

Is the rngd.service failure coupled to the mcelog.service failure?

How are AMD users supposed to deal with these problems?  Should both the mcelog and rngd services be disabled?

Comment 10 Davide Repetto 2014-11-22 12:02:48 UTC
Hello bob,
you can safely disable rngd since it tries to use a feature that is present only on certain CPUs and is not present on yours.

As for the mcelog.service, you may want to leave it there, since it just informs you (the warning) that your CPU does not support the intel MCE logging facility and that it has thus kicked in the AMD module, which handles that very same functionality in AMD CPUs.

Comment 11 bob 2014-11-23 00:17:59 UTC
>you can safely disable rngd since it tries to use a feature that is present only on certain CPUs and is not present on yours.

thanks.  that fixes one problem.

>As for the mcelog.service, you may want to leave it there, since it just informs you (the warning) that your CPU does not support the intel MCE logging facility and that it has thus kicked in the AMD module, which handles that very same functionality in AMD CPUs.

that may be the way it's supposed to work, but it's not working that way.  if you read Bug 1166978 you'll see that no AMD module ever gets loaded, that the mcelog.service never starts, and that no logs are ever generated on the system.  the bug on AMD family 21 is real -- the daemon DOES NOT work, there is no logging functionality, and the failure reports are accurate.

Comment 12 Davide Repetto 2014-11-23 12:07:26 UTC
Well yes, you're right in that the error message is misleading at best.

However, you're drawing the wrong conclusion about machine check exception logging not working on your computer.
In comment #3 you can find a link to an article that explains it better, but to explain it briefly, the mcelog service will never actually load on your CPU and it is the correct behaviour, because mcelog is useful only on intel CPUs.

On your AMD CPU the mcelog service is not needed at all. Instead a kernel module, edac_mce_amd, is what handles the Machine exceptions logging for AMDs.
This module is correctly loaded on your machine and the mcelog service acknowledges this before exiting with the "Success" message.
Upon verifying that the module is loaded, the MCElog service can die as it is not necessary anymore.
The final "cpu unsupported" message basically means: "bye bye, I do not support AMD CPUs, thus I'll get out of the way and let the AMD kernel module do its job undisturbed" 

You can double-check edac_mce_amd is loaded on your machine by:

   lsmod | grep mce

It is finally absolutely a good sign that you don't get any MCE messages in your logs, as it would indicate a CPU hardware failure and I suppose that is something you may not be looking forward to. ;)

I'm leaving this bug on "assigned" anyway, since it is actually a bug that the error messages in the logs are so misleading.
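
Note that `lsmod | grep mce` matches any line containing "mce", so it can report a hit for a different module. A stricter check compares the first column exactly. This is only a sketch; module_in_lsmod is a hypothetical helper name, and it reads lsmod-style text from stdin so it can be tried on saved output as well:

```shell
# Succeeds only if the first column of lsmod-style input (read from stdin)
# is exactly the requested module name. module_in_lsmod is a hypothetical
# helper, not an existing command.
module_in_lsmod() {
    awk -v m="$1" '$1 == m { found = 1 } END { exit !found }'
}

# Only run the live check where lsmod is available.
if command -v lsmod >/dev/null 2>&1; then
    if lsmod | module_in_lsmod edac_mce_amd; then
        echo "edac_mce_amd is loaded"
    else
        echo "edac_mce_amd is not loaded"
    fi
fi
```

An empty result here means the module is not loaded, which is a different situation from "loaded, but no errors logged yet".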

Comment 13 Yu Watanabe 2014-12-16 01:59:25 UTC
I think this bug can be fixed in the same way as http://pkgs.fedoraproject.org/cgit/rng-tools.git/commit/?id=95fb228e859df8162028819da0b6d31e9e1a708a

I will propose a patch that changes the exit code if the CPU is not supported, and a modified service file. The patch combined with the modified service file works for me.
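
The attachments themselves are not reproduced in this report, so the following is only a sketch of how such a scheme could look, not the content of the patch. It assumes mcelog returns a dedicated exit status when the CPU is unsupported (the value 2 below is made up) and that the service file lists that status via systemd's SuccessExitStatus= option. print_dropin is a hypothetical helper name:

```shell
# Hypothetical: the exit status mcelog would return for "CPU not supported".
# The value actually used by the attached patch is not shown in this report.
UNSUPPORTED_EXIT=2

# Print a systemd drop-in that treats the given status as a clean exit.
# SuccessExitStatus= is a standard [Service] option (see systemd.service(5)).
print_dropin() {
    printf '[Service]\nSuccessExitStatus=%s\n' "$1"
}

print_dropin "$UNSUPPORTED_EXIT"
```

The printed text could be saved as a drop-in such as /etc/systemd/system/mcelog.service.d/unsupported.conf (the file name is arbitrary), followed by `systemctl daemon-reload`. Without a matching change in mcelog's exit status, the drop-in alone would also hide real failures that exit with the same code.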

Comment 14 Yu Watanabe 2014-12-16 02:00:57 UTC
Created attachment 969388 [details]
patch for changing exit code

Comment 15 Yu Watanabe 2014-12-16 02:01:30 UTC
Created attachment 969389 [details]
service file for mcelog

Comment 16 Yu Watanabe 2014-12-16 02:02:30 UTC
Created attachment 969390 [details]
patch for the spec file

Comment 17 Yu Watanabe 2014-12-16 02:04:10 UTC
Created attachment 969391 [details]
spec file

Comment 18 Yu Watanabe 2014-12-16 02:06:02 UTC
Note that I tested the patch on fedora 21. I guess that this also works for fedora 20.

Comment 19 Prarit Bhargava 2014-12-19 14:02:24 UTC
Everyone, I have updated mcelog to the latest upstream.  Before I do any work on this change discussed in this bugzilla I need to get the new packages enough karma to be pushed into stable.

https://admin.fedoraproject.org/updates/mcelog-101-1.9bfaad8f92c5.fc21

and

https://admin.fedoraproject.org/updates/mcelog-101-1.9bfaad8f92c5.fc20

and

https://admin.fedoraproject.org/updates/mcelog-101-1.9bfaad8f92c5.fc19

(in case anyone needs it)

I cannot make the above changes unless these packages get karma.

P.

Comment 20 Yu Watanabe 2014-12-25 03:51:17 UTC
(In reply to Prarit Bhargava from comment #19)
> Everyone, I have updated mcelog to the latest upstream.  Before I do any
> work on this change discussed in this bugzilla I need to get the new
> packages enough karma to be pushed into stable.

As commented to the update pages, please revert the changes to the mcelog.spec and use systemd rpm macros.

If mcelog.service is disabled before updating, updating to mcelog-101-1 makes mcelog.service enabled.

> I cannot make the above changes unless these packages get karma.

OK.

Comment 21 Shivaji Sathe 2014-12-28 15:19:42 UTC
I realized this issue is affecting my system as well, and the edac_mce_amd module is not loaded either. However, I am running Fedora 21 (XFCE spin).

# lscpu | grep "Model name"
Model name:            AMD A8-6600K APU with Radeon(tm) HD Graphics

# systemctl status mcelog.service
● mcelog.service - Machine Check Exception Logging Daemon
   Loaded: loaded (/usr/lib/systemd/system/mcelog.service; enabled)
   Active: failed (Result: exit-code) since Sun 2014-12-28 20:35:58 IST; 9s ago
...
Dec 28 20:35:58 panther.jungle mcelog.setup[2303]: CPU is unsupported
Dec 28 20:35:58 panther.jungle mcelog[2307]: mcelog: AMD Processor family 21: Please load edac_mce_amd module.
Dec 28 20:35:58 panther.jungle mcelog[2307]: : Success
Dec 28 20:35:58 panther.jungle mcelog[2307]: CPU is unsupported
Dec 28 20:35:58 panther.jungle systemd[1]: mcelog.service: main process exited, code=exited, status=1/FAILURE
Dec 28 20:35:58 panther.jungle systemd[1]: Unit mcelog.service entered failed state.
Dec 28 20:35:58 panther.jungle systemd[1]: mcelog.service failed.


# lsmod |grep "edac_mce_amd"

doesn't give any output.

I also tried to look for the module, but looks like there is no such module available on my system.

What should be done now? Should it be a separate bug, as it's Fedora 21?

Comment 22 Shivaji Sathe 2014-12-28 15:26:49 UTC
Sorry. another quick update.

If I do 

# modprobe edac_mce_amd
# lsmod |grep "edac_mce_amd"
edac_mce_amd           22310  0 

I see it loaded. Still wondering why it doesn't seem to happen by itself.

Comment 23 Peter Trenholme 2015-01-17 00:49:10 UTC
For what it's worth, I just noticed this bug persists in F-22 (Rawhide), which
seems to imply that the (trivial) changes proposed above have not yet even made it
to [testing].

Comment 24 Shivaji Sathe 2015-01-17 07:51:27 UTC
# yum --enablerepo=updates-testing install mcelog

# yum list mcelog*
...
Installed Packages
mcelog.x86_64                                   3:101-1.9bfaad8f92c5.fc21                                    @updates-testing


Rebooted and it still gives:

# systemctl status mcelog.service 
● mcelog.service - Machine Check Exception Logging Daemon
   Loaded: loaded (/usr/lib/systemd/system/mcelog.service; enabled)
   Active: failed (Result: exit-code) since Sat 2015-01-17 12:49:30 IST; 4min 4s ago
  Process: 656 ExecStart=/usr/sbin/mcelog --ignorenodev --daemon --foreground (code=exited, status=1/FAILURE)

So I guess it is still not behaving as expected, as it should have exited with success.

At least the module got loaded automatically:

$ lsmod |grep "edac_mce_amd"
edac_mce_amd           22310  0 

So I am going to keep the package, and disable the service. Still it is an improvement for me, so +1 karma.

Comment 25 Prarit Bhargava 2015-01-28 11:52:23 UTC
(In reply to Davide Repetto from comment #0)
> Description of problem:
> =======================
> mcelog fails with:
> AMD Processor family 16: Please load edac_mce_amd module
> 
> 
> Version-Release number of selected component (if applicable):
> =============================================================
> mcelog-1.0-0.11.f0d7654.fc20
> 
> 
> How reproducible:
> =================
> Always with AMD CPUs of Family 16 and up.
> 
> 
> Additional info:
> ================
> The problem seems to be fixed already upstream and in RHEL packages.

Going back to the original problem ... this message has been modified with something a bit more verbose:

https://github.com/andikleen/mcelog/commit/0fc9f702232cb2d9969916f899c67c3e64deedda

I will bring this in in the next update.

P.

Comment 26 Fedora End Of Life 2015-05-29 11:03:09 UTC
This message is a reminder that Fedora 20 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 20. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora  'version'
of '20'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 20 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora 
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 27 Luya Tshimbalanga 2015-06-09 23:24:21 UTC
The bug is still valid in Fedora 22 and later. Assigning to Rawhide.

Comment 28 Jan Kurik 2015-07-15 14:42:29 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 23 development cycle.
Changing version to '23'.

(As we did not run this process for some time, it could affect also pre-Fedora 23 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 23 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora23

Comment 29 Suren Karapetyan 2015-09-01 21:26:46 UTC
This is still triggered in Fedora 22.

Comment 30 Richard Shaw 2015-10-05 15:53:06 UTC
I'm not sure what the current "fix" is, but I can verify regardless of errors, that it's not staying running even if I run it manually.

Comment 31 Adam Williamson 2015-10-22 19:14:09 UTC
*** Bug 1274193 has been marked as a duplicate of this bug. ***

Comment 32 Richard Shaw 2015-11-13 14:13:09 UTC
No response from the mailing list but I was wondering if systemd-generators could be used to test for compatibility first, and if true, generate the unit file on the fly.

I'm not sure how this would affect it being enabled or disabled intentionally by the user though.

Comment 33 Richard Shaw 2015-11-13 20:59:33 UTC
I completely forgot to mention here that mcelog is NOT compatible with newer AMD CPUs. The error is misleading, making you think that you just need to load another kernel module for it to work. 

edac_mce_amd is a completely different method of catching exceptions and is kernel only, if the module is loaded, you're done. 

There needs to be a way to detect when you're running one of the newer AMD processors and not attempt to start mcelog.

Comment 34 Felipe 2015-11-19 20:10:40 UTC
This issue is real and important: it is blocking my boot-up on an HP Pavilion g6 2004ss with a hybrid AMD/Intel graphics card. I am stuck on kernel 3.19.8-100.fc20.x86_64 with Fedora 20, and it is impossible for me to update the system to Fedora 21 right now. Please increase Priority to HIGH.

Reading the history of this bug now that it is happening to me, I want to cry, because I cannot understand how an issue like this could have been bouncing around for more than a year. I thought the idea of Bugzilla was to improve the system, not to close as many bugs as you can. If that is how it works, I will seriously reconsider using Fedora again.

Thanks anyway

Comment 35 Richard Shaw 2015-11-19 22:16:26 UTC
Felipe, I'm confused: this bug is about an Intel/AMD CPU issue, not graphics. Are you sure *mcelog* is preventing you from upgrading?

Comment 36 Felipe 2015-11-19 23:46:38 UTC
Sorry, I just mentioned the AMD graphic card just in case, but you are right, it has nothing to do here.

The thing is that I am using Qubes OS 3.0, which is based on Fedora 20. Sometimes it uses the Fedora kernel directly, like 3.19.8-100.fc20.x86_64, which does not have this issue, and sometimes its own. If I use the Qubes kernel (which is entirely based on Fedora's), this issue stops the boot-up on my Intel i7 processor. I have already opened a bug with them too.

I think their version has some change related to mce, but anyway, it was working with their previous kernel; the error only appeared when I updated yesterday (the new kernel is based on Linux 4.1).

When I posted before, I confused the kernel versions and thought that all of them were pure Fedora (fc21, fc22, etc.). Sorry for that.

But in any case it is still a bug in Fedora and needs to be checked. I am really stuck on the 3.19.8-100.fc20 version until this issue is fixed.

Thanks for your time, I hope that it clarified the problem, but it really annoyed me when I started checking and saw the bug since May of last year.

Comment 37 Davide Repetto 2015-11-20 09:09:12 UTC
Hi Felipe, this bug is certainly not what's blocking your boot-up because all it does is writing an error message that shouldn't be there.
It's just a cosmetic nuisance and nothing that impacts any functionality whatsoever.

Mcelog is an intel-CPUs-only utility which cannot run - and is not needed - on AMD CPUs. Fedora launches it unconditionally at boot and it gives a misguided error message (which should be a "warning" at best...), but that's all.

AMD CPUs don't use and don't need MCELOG because their logging is already handled by an amd-specific kernel module that is already loaded at startup anyway.

This is why this error-message appears only on machines with AMD CPUs and why we can safely ignore it.

But if you're like me and you don't want to see it anyways, just disable the mcelog service on your AMD machines and you're golden.
This way you even shave a few milliseconds off your boot time.

Comment 38 cooloutac 2015-12-11 04:31:29 UTC
I have the issue returning after updating to fedora 23.  I can't remember how i disabled this message on boot in fedora 22.   disabling and removing the package doesn't do it.

Does anybody have any ideas,  its been a while since i had this annoying message and I don't have a clue as to how i originally stopped it.

Tks.

Rich.

Comment 39 cooloutac 2015-12-11 04:33:07 UTC
I have an amd board and cpu so would it be safe to just disable mce with this method?  https://access.redhat.com/solutions/367773

Comment 40 Richard Shaw 2015-12-11 13:53:37 UTC
(In reply to cooloutac from comment #39)
> I have an amd board and cpu so would it be safe to just disable mce with
> this method?  https://access.redhat.com/solutions/367773

Yes, I found that even though I disabled the service that it would be started anyway, some other service must want it. My solution was just to remove the package with the caveat you'd have to remember to reinstall it if you moved to an Intel system.

Comment 41 cooloutac 2015-12-11 21:52:16 UTC
I also tried to remove the package mcelog,  but i still get the message on boot.

Comment 42 Gerald Cox 2016-01-31 20:19:50 UTC
(In reply to Richard Shaw from comment #40)
> (In reply to cooloutac from comment #39)
> > I have an amd board and cpu so would it be safe to just disable mce with
> > this method?  https://access.redhat.com/solutions/367773
> 
> Yes, I found that even though I disabled the service that it would be
> started anyway, some other service must want it. My solution was just to
> remove the package with the caveat you'd have to remember to reinstall it if
> you moved to an Intel system.

Thanks, that was a good workaround.  The problem for me was that when I entered systemctl I received a message that the system was running degraded, because mcelog failed.  So I had to research and track down the error, basically wasting a lot of time.  Now, I understand people are saying you can ignore that message, but it defeats the purpose of error messages if they are misleading and inaccurate.

Comment 43 Oliver Henshaw 2016-02-08 20:43:11 UTC
mcelog doesn't load any modules for amd cpus (and, as far as I can tell, it never did) - it just prints the warning messages whether or not the edac_mce_amd module is actually loaded.

I have one machine that does load the edac_mce_amd module and one that doesn't, so I poked around a bit. It turns out that edac_mce_amd is loaded by amd64_edac_mod which is loaded when certain pci devices are present - see 'modinfo amd64_edac_mod' for details. On an Athlon X3/Gigabyte 890GPA the modules are loaded by/for the "Family 10h Processor DRAM Controller"; on a Brazos E-350 (part of the Bobcat/14h family) the modules aren't loaded because none of the pci devices are present.

So this might explain why some reporters see the modules have loaded, and some don't (like comment #21).

Now I wonder whether the modules not loading on the E-350 is an oversight. Family 14h processors do offer MCE events I think - indeed I see in the journal the message "mce: CPU supports 6 MCE banks" during boot; and manually loading edac_mce_amd and edac_core succeeds. Even if the memory controller doesn't report errors, the pci bus and other components might, correct? So should there be a module alias or udev rule shipped for these devices?

Comment 44 cornel panceac 2016-03-03 18:14:59 UTC
AMD E-450 in Fedora 23 x64 (4.4.2-301.fc23.x86_64). The computer hangs from time to time, and that module, edac_mce_amd, is never loaded.

Comment 45 Davide Repetto 2016-03-04 02:07:43 UTC
Mcelog is an intel-only utility that has no place on machines with AMD processors.

So, if you get this error and you want to get rid of it, just remove or disable mcelog.

In other words, this is not really a bug. It's just an ill-worded message.

Comment 46 Richard W.M. Jones 2016-03-04 08:54:44 UTC
Reopening.  The bug *is* that the warning is badly worded.

Comment 47 cornel panceac 2016-03-04 10:04:49 UTC
Agree. Thank you Richard. Also there are more mysteries: why the service starts if AMD is not supported? and, Is it true that AMD processors are not supported? The mcelog web site's FAQ page looks a little unclear on this subject. See http://www.mcelog.org/faq.html#18 for details.

Comment 48 Richard Shaw 2016-03-24 16:24:46 UTC
(In reply to Richard W.M. Jones from comment #46)
> Reopening.  The bug *is* that the warning is badly worded.

I agree but upstream actually changed the wording to be more clear (which is funny as the new wording isn't much better). The bug needs to be upstream not here.

(In reply to cornel panceac from comment #47)
> Agree. Thank you Richard. Also there are more mysteries: why the service
> starts if AMD is not supported? and, Is it true that AMD processors are not
> supported? The mcelog web site's FAQ page looks a little unclear on this
> subject. See http://www.mcelog.org/faq.html#18 for details.

Without going back and re-reading, I believe some older processors worked with mcelog but all modern AMD processors use a kernel module only solution (no daemon required).

Comment 49 Luan Cestari 2016-04-22 22:36:02 UTC
Hi,

I saw this bug https://bugzilla.redhat.com/show_bug.cgi?id=1166978#c11 and I think it is the same as the one you guys are talking about here. I think this piece of code https://github.com/andikleen/mcelog/blob/master/mcelog.c#L538-L543 is outdated as there are many new models. 

Do you guys think it would be better to update mcelog, open another bug here in Bugzilla, open an issue on GitHub, or just keep this Bugzilla entry?

Thank you in advance,
Luan Cestari

Comment 50 Richard Shaw 2016-04-23 12:07:37 UTC
I'm not sure what you are saying; the logic in the code says anything past K8s (family 15) is unsupported. Are you saying that is not the case?
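
As a rough illustration of the rule as described in this comment (not a transcription of mcelog.c), the check amounts to reading the decimal "cpu family" value and rejecting anything above 15. Both function names below are hypothetical, and the sample cpuinfo text is made up for the demonstration:

```shell
# Extract the first "cpu family" value from /proc/cpuinfo-style text on stdin.
cpu_family() {
    awk -F': *' '/^cpu family/ { print $2; exit }'
}

# Mirror of the rule described above: AMD families above 15 are rejected.
# The threshold comes from the comment, not from reading the mcelog source.
amd_family_supported() {
    [ "$1" -le 15 ]
}

# Sample input standing in for /proc/cpuinfo on a family 21 machine.
fam=$(printf 'vendor_id\t: AuthenticAMD\ncpu family\t: 21\n' | cpu_family)
if amd_family_supported "$fam"; then
    echo "family $fam: handled by mcelog"
else
    echo "family $fam: left to edac_mce_amd"
fi
```

On a real system the input would come from `cpu_family < /proc/cpuinfo`.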

My understanding is that the error message is misleading, it tells you to use another module which the end user ASSUMES they mean within mcelog but in fact it's a separate kernel module that has nothing to do with mcelog. 

It would be nice if mcelog failed gracefully, i.e., exited with code 0 or another specific code just for this situation so the systemd service file could be updated to not show a failure.

Comment 51 Luan Cestari 2016-04-24 21:45:44 UTC
Hi Richard,

Sorry if my last message wasn't very good. I jumped from the other bug to this one as I saw this one seems to be the same case and this one is active.

About the code, I think nobody has updated the AMD code since late 2006 (as far as I could tell from https://en.wikipedia.org/wiki/AMD_K8, that was the last release of family 15; by the way, we are on family 23 now). 

So, from my point of view, it needs a committer who knows how to capture the CPU information on the newer families, so that it works for AMD processors as it does for Intel (which even has its own .c file for that: https://github.com/andikleen/mcelog/blob/master/intel.c ).

I also agree that mcelog should fail gracefully and that the message is definitely wrong.

Thank you for your understanding and help.

Kind regards,
Luan Cestari

Comment 52 Shivaji Sathe 2016-04-25 03:07:14 UTC
Apart from error message being misleading, the issue for me is that systemctl tells me the mcelog service failed. Which is why I had to check, understand the problem and then disable the service. In my opinion, the service MUST exit gracefully such that systemctl does not report any problems.

Also for last two fedora upgrades the service got enabled again, and I had to disable it.

Comment 53 Richard Shaw 2016-04-25 12:28:51 UTC
(In reply to Shivaji Sathe from comment #52)
> Apart from error message being misleading, the issue for me is that
> systemctl tells me the mcelog service failed. Which is why I had to check,
> understand the problem and then disable the service. In my opinion, the
> service MUST exit gracefully such that systemctl does not report any
> problems.

That would require that mcelog exit with code "0" or a specific code for situations where it's just not supported; then that exit code can be added to the systemd service file. Sounds like a request that should be made upstream :)


> Also for last two fedora upgrades the service got enabled again, and I had
> to disable it.

Did you just disable it or disable and mask it? A disabled service can still be started if another service is dependent on it. Since I have no intention of running an Intel processor (too expensive) I just removed the package.
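
For anyone unsure of the difference: `systemctl disable` only removes the links that start a unit at boot, while `systemctl mask` points the unit at /dev/null so that nothing can start it. A small sketch of that distinction follows; startable_as_dependency is a hypothetical helper that interprets the word printed by `systemctl is-enabled`:

```shell
# Commands (run as root) that the distinction maps to:
#   systemctl disable mcelog.service   # no longer started at boot by itself
#   systemctl mask mcelog.service      # cannot be started at all
#
# startable_as_dependency is a hypothetical helper: given the state word
# printed by "systemctl is-enabled UNIT", report whether another unit could
# still pull the service in.
startable_as_dependency() {
    case "$1" in
        masked|masked-runtime) return 1 ;;
        *) return 0 ;;
    esac
}

state=disabled   # on a real system: state=$(systemctl is-enabled mcelog.service)
if startable_as_dependency "$state"; then
    echo "$state: can still be started by a dependent unit"
else
    echo "$state: blocked"
fi
```

Masking can be undone with `systemctl unmask mcelog.service`, which avoids having to reinstall the package later.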

Comment 54 Shivaji Sathe 2016-04-25 18:51:48 UTC
(In reply to Richard Shaw from comment #53)

> That would require that mcelog exit with code "0" or a specific code for
> situations where it's just not supported, then that exit code can be added
> to the systemd service file. Sounds like a request that should be made
> upstream :)

I never checked upstream for this one. I know this is not critical, just a little annoying. I was hoping something would be possible at the distribution level. I will check what can be done upstream.
 
> Did you just disable it or disable and mask it? A disabled service can still
> be started if another service is dependent on it. Since I have no intention
> of running an Intel processor (too expensive) I just removed the package.

I just disabled it every time. I didn't know about masking. After using AMD for nearly two years now, I don't think I will go back to an Intel processor (again, just too expensive :)), so I might remove the mcelog package as well.

Comment 55 Orion Poplawski 2016-08-18 21:50:18 UTC
Is there any file in /sys that could determine whether mcelog should run, using ConditionPathExists/ConditionDirectoryNotEmpty? For example, trigger on the non-existence of /sys/module/edac_mce_amd, assuming that module is loaded before mcelog tries to start?
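As a sketch of what that could look like (untested, and it depends on the ordering assumption above actually holding): a drop-in with a negated ConditionPathExists=, plus a shell emulation of the same check. The function names and the optional root-prefix argument are invented for illustration.

```shell
# Print a drop-in that skips the unit when edac_mce_amd is loaded.
# A leading "!" negates the condition in systemd unit files.
render_condition_dropin() {
    printf '[Unit]\nConditionPathExists=!/sys/module/edac_mce_amd\n'
}

# Emulate the condition in shell: succeed only if the module directory
# is absent. $1 is an optional root prefix, useful for trying this
# against a scratch directory instead of the real /sys.
condition_met() {
    [ ! -e "${1:-}/sys/module/edac_mce_amd" ]
}

# Example:
#   condition_met && echo "mcelog would be started"
```

When a condition is not met, systemd skips the unit rather than marking it failed, so it would no longer show up in systemctl --failed.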

Comment 56 Fedora End Of Life 2016-11-24 11:07:51 UTC
This message is a reminder that Fedora 23 is nearing its end of life.
Approximately 4 (four) weeks from now Fedora will stop maintaining
and issuing updates for Fedora 23. It is Fedora's policy to close all
bug reports from releases that are no longer maintained. At that time
this bug will be closed as EOL if it remains open with a Fedora 'version'
of '23'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 23 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 57 Fedora End Of Life 2016-12-20 12:45:50 UTC
Fedora 23 changed to end-of-life (EOL) status on 2016-12-20. Fedora 23 is
no longer maintained, which means that it will not receive any further
security or bug fix updates. As a result we are closing this bug.

If you can reproduce this bug against a currently maintained version of
Fedora please feel free to reopen this bug against that version. If you
are unable to reopen this bug, please file a new report against the
current release. If you experience problems, please add a comment to this
bug.

Thank you for reporting this bug and we are sorry it could not be fixed.

Comment 58 reescf 2017-01-03 03:31:04 UTC
Seeing this on a just upgraded machine which now has Fedora 25. As far as I can tell, this is the only thing failing now, so I assumed it explained the overall failure of the upgrade. 

Note that I've not seen this in the journal before, although I am not as familiar with the logs as I would be on my own machine.

I'm not sure if commenting is sufficient to reopen this or not, but it should be reopened and tagged as a bug on Fedora 25.

Comment 59 reescf 2017-01-03 03:33:07 UTC
Correction: actually, I apparently have seen this before but had forgotten. (I wouldn't be watching this bug otherwise - this is the only AMD machine I am responsible for.)

Comment 60 Alvin 2018-07-09 12:36:18 UTC
Should this bug be closed with EOL? It's still there on Fedora 28. mcelog.service still fails (together with rngd.service)

# systemctl status mcelog.service
● mcelog.service - Machine Check Exception Logging Daemon
   Loaded: loaded (/usr/lib/systemd/system/mcelog.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2018-07-09 14:29:59 CEST; 23s ago
  Process: 3218 ExecStart=/usr/sbin/mcelog --ignorenodev --daemon --foreground (code=exited, status=1/FAILURE)
 Main PID: 3218 (code=exited, status=1/FAILURE)

Jul 09 14:29:59 dvdpc02.local.dvdlaw.be systemd[1]: Started Machine Check Exception Logging Daemon.
Jul 09 14:29:59 dvdpc02.local.dvdlaw.be mcelog[3218]: mcelog: ERROR: AMD Processor family 20: mcelog does not support this processor.  Please use the edac_mce_amd module instead.
Jul 09 14:29:59 dvdpc02.local.dvdlaw.be mcelog[3218]: CPU is unsupported
Jul 09 14:29:59 dvdpc02.local.dvdlaw.be systemd[1]: mcelog.service: Main process exited, code=exited, status=1/FAILURE
Jul 09 14:29:59 dvdpc02.local.dvdlaw.be systemd[1]: mcelog.service: Failed with result 'exit-code'.

Comment 61 Prarit Bhargava 2018-08-02 11:49:42 UTC
(In reply to Alvin from comment #60)
> Should this bug be closed with EOL? It's still there on Fedora 28.
> mcelog.service still fails (together with rngd.service)
> 
> # systemctl status mcelog.service
> ● mcelog.service - Machine Check Exception Logging Daemon
>    Loaded: loaded (/usr/lib/systemd/system/mcelog.service; enabled; vendor
> preset: enabled)
>    Active: failed (Result: exit-code) since Mon 2018-07-09 14:29:59 CEST;
> 23s ago
>   Process: 3218 ExecStart=/usr/sbin/mcelog --ignorenodev --daemon
> --foreground (code=exited, status=1/FAILURE)
>  Main PID: 3218 (code=exited, status=1/FAILURE)
> 
> Jul 09 14:29:59 dvdpc02.local.dvdlaw.be systemd[1]: Started Machine Check
> Exception Logging Daemon.
> Jul 09 14:29:59 dvdpc02.local.dvdlaw.be mcelog[3218]: mcelog: ERROR: AMD
> Processor family 20: mcelog does not support this processor.  Please use the
> edac_mce_amd module instead.
> Jul 09 14:29:59 dvdpc02.local.dvdlaw.be mcelog[3218]: CPU is unsupported
> Jul 09 14:29:59 dvdpc02.local.dvdlaw.be systemd[1]: mcelog.service: Main
> process exited, code=exited, status=1/FAILURE
> Jul 09 14:29:59 dvdpc02.local.dvdlaw.be systemd[1]: mcelog.service: Failed
> with result 'exit-code'.

What version of mcelog do you have? Also, please cut-and-paste the output of lscpu.

Put this BZ into NEEDINFO for me when you have that information.

Thanks,

P.

Comment 62 Gerald Cox 2018-08-02 17:04:57 UTC
Isn't this more or less the same issue as:
https://bugzilla.redhat.com/show_bug.cgi?id=1166978

In any event, as I mentioned in that ticket, this needs to be fixed or an exception needs to be requested:

https://fedoraproject.org/wiki/Packaging:DefaultServices?rd=DefaultServices

Comment 63 Pierguido Lambri 2018-10-01 17:21:32 UTC
I have this on F29 too:

$ lscpu 
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              12
On-line CPU(s) list: 0-11
Thread(s) per core:  1
Core(s) per socket:  6
Socket(s):           2
NUMA node(s):        2
Vendor ID:           AuthenticAMD
CPU family:          16
Model:               8
Model name:          Six-Core AMD Opteron(tm) Processor 2435
Stepping:            0
CPU MHz:             2600.045
BogoMIPS:            5200.09
Virtualization:      AMD-V
L1d cache:           64K
L1i cache:           64K
L2 cache:            512K
L3 cache:            6144K
NUMA node0 CPU(s):   0,2,4,6,8,10
NUMA node1 CPU(s):   1,3,5,7,9,11
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3dnow constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt hw_pstate vmmcall npt lbrv svm_lock nrip_save pausefilter


$ sudo systemctl --failed              
  UNIT           LOAD   ACTIVE SUB    DESCRIPTION                           
● mcelog.service loaded failed failed Machine Check Exception Logging Daemon



$ systemctl status mcelog.service   
● mcelog.service - Machine Check Exception Logging Daemon
   Loaded: loaded (/usr/lib/systemd/system/mcelog.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Mon 2018-10-01 18:15:37 BST; 3min 6s ago
  Process: 16893 ExecStart=/usr/sbin/mcelog --ignorenodev --daemon --foreground (code=exited, status=1/FAILURE)
 Main PID: 16893 (code=exited, status=1/FAILURE)

Oct 01 18:15:37 server systemd[1]: Started Machine Check Exception Logging Daemon.
Oct 01 18:15:37 server mcelog[16893]: mcelog: ERROR: AMD Processor family 16: mcelog does not support this processor.  Please use the edac_mce_amd module instead.
Oct 01 18:15:37 server mcelog[16893]: CPU is unsupported
Oct 01 18:15:37 server systemd[1]: mcelog.service: Main process exited, code=exited, status=1/FAILURE
Oct 01 18:15:37 server systemd[1]: mcelog.service: Failed with result 'exit-code'.

$ uname -a
Linux server 4.18.9-300.fc29.x86_64 #1 SMP Thu Sep 20 02:32:53 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

$ rpm -qa | grep mcelog
mcelog-153-3.fc29.x86_64

Comment 64 Peter Trenholme 2018-10-02 01:47:07 UTC
Look, this is a message reporting a failure to load an UNUSABLE, Intel-specific, piece of code. The bug is that it's reported for an AMD processor.

Yes, it's ugly to see a "FATAL" in the log. But, if you're using an AMD processor, just ignore it.

Comment 65 Pierguido Lambri 2018-10-02 07:49:46 UTC
(In reply to Peter Trenholme from comment #64)
> Look, this is a message reporting a failure to load an UNUSABLE,
> Intel-specific, piece of code. The bug is that it's reported for an AMD
> processor.
> 
> Yes, it's ugly to see a "FATAL" in the log. But, if you're using an AMD
> processor, just ignore it.

I know I can just ignore it; still, having a failed service and these messages
doesn't look nice.
And I was just replying to what Prarit asked for.
If this bug is not going to be fixed, then it should be closed as WONTFIX (or mark it as dup of bz#1166978).

Comment 66 Richard Shaw 2018-10-02 12:27:33 UTC
I mask it so it doesn't start, but is there a possibility of using dbus or a file trigger so that it only even attempts to load on Intel systems?
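One simple way to express "Intel only" would be a vendor check against /proc/cpuinfo. This is only a sketch: the function names are made up, and nothing like it is wired into the current mcelog package or its unit file.

```shell
# Print the vendor_id of the first CPU listed in a cpuinfo-style file.
# $1 is the file to read; it defaults to /proc/cpuinfo.
cpu_vendor() {
    awk -F': *' '/^vendor_id/ { print $2; exit }' "${1:-/proc/cpuinfo}"
}

# Succeed only when the reported vendor is Intel.
is_intel() {
    [ "$(cpu_vendor "$@")" = "GenuineIntel" ]
}

# Example:
#   is_intel || echo "not an Intel CPU, skipping mcelog"
```

A guard like this could live in a wrapper script, or possibly in an ExecCondition= line on systemd versions that provide that option; neither has been tried here.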

Comment 67 Andrea V. 2019-01-12 05:44:44 UTC
mcelog.service still fails to start on Fedora 29 x86_64 with AMD processor family 23... will this bug ever be fixed?

Comment 68 Tom Chiverton 2019-02-06 21:42:57 UTC
Still fails here too.


Feb 06 21:20:33 bookcase systemd[1]: Started Machine Check Exception Logging Daemon.
Feb 06 21:20:33 bookcase mcelog[535]: mcelog: ERROR: AMD Processor family 21: mcelog does not support this processor.  Please use the edac_mce_amd module instead.
Feb 06 21:20:33 bookcase mcelog[535]: CPU is unsupported
Feb 06 21:20:33 bookcase systemd[1]: mcelog.service: Main process exited, code=exited, status=1/FAILURE
Feb 06 21:20:33 bookcase systemd[1]: mcelog.service: Failed with result 'exit-code'.

Comment 69 Ben Cotton 2019-05-02 19:56:22 UTC
This message is a reminder that Fedora 28 is nearing its end of life.
On 2019-May-28 Fedora will stop maintaining and issuing updates for
Fedora 28. It is Fedora's policy to close all bug reports from releases
that are no longer maintained. At that time this bug will be closed as
EOL if it remains open with a Fedora 'version' of '28'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 28 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

