Bug 1830404 - Correctable MCE errors logged for CPU0/CPU12 L1 instruction cache with AMD Ryzen 9 3900X 12-Core Processor
Summary: Correctable MCE errors logged for CPU0/CPU12 L1 instruction cache with AMD Ry...
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 32
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: low
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2020-05-01 20:18 UTC by Phil Hale
Modified: 2021-05-17 22:32 UTC (History)
CC List: 19 users

Fixed In Version:
Doc Type: ---
Doc Text:
Clone Of:
Environment:
Last Closed: 2021-05-17 22:32:55 UTC
Type: Bug
Embargoed:


Attachments
requested kernel log from step 7. (123.89 KB, text/plain)
2020-05-01 20:18 UTC, Phil Hale
sensors reading of the cpu temp every 5 seconds during a 30 minute stress test. (31.82 KB, text/plain)
2020-05-16 19:00 UTC, Phil Hale
journalctl dump of mce/kernel log during 30 minute stress test. (133.96 KB, text/plain)
2020-05-16 19:01 UTC, Phil Hale


Links
System ID Private Priority Status Summary Last Updated
Linux Kernel 202005 0 None None None 2020-05-29 13:57:23 UTC
Linux Kernel 207907 0 None None None 2020-05-27 14:25:21 UTC

Description Phil Hale 2020-05-01 20:18:00 UTC
Created attachment 1683843 [details]
requested kernel log from step 7.

1. Please describe the problem:

I've just built a new system and installed Fedora 32 Workstation on it.  Components are:
1. Asus ROG CROSSHAIR VIII HERO motherboard
2. AMD Ryzen 9 3900x Processor
3. 4 x Corsair CMK32GX4M2D3600C18 Memory (running at 2133 MT/s due to 64GB installed).

I'm getting random MCE errors in the logs like so:
May 01 15:06:59 kernel: mce: [Hardware Error]: Machine check events logged
May 01 15:06:59 kernel: [Hardware Error]: Corrected error, no action required.
May 01 15:06:59 kernel: [Hardware Error]: CPU:12 (17:71:0) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000030151
May 01 15:06:59 kernel: [Hardware Error]: Error Addr: 0x000000076da32ae0
May 01 15:06:59 kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000001a030507
May 01 15:06:59 kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
May 01 15:06:59 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
May 01 15:06:59 kernel: mce: [Hardware Error]: Machine check events logged
May 01 15:06:59 kernel: [Hardware Error]: Corrected error, no action required.
May 01 15:06:59 kernel: [Hardware Error]: CPU:0 (17:71:0) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000030151
May 01 15:06:59 kernel: [Hardware Error]: Error Addr: 0x0000000fbedc2ae0
May 01 15:06:59 kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000001a030507
May 01 15:06:59 kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
May 01 15:06:59 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD

I tried to get information on the issue, but mcelog is not able to provide any additional information and the abrt reports are empty as well.  I've run a 10 minute stress test with GTKStressTesting and the system is stable.  Everything seems to be running ok, except this message keeps occurring in the logs.
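
The flag list and the two decoded lines at the end of each event can be reproduced by hand from the raw MC1_STATUS value. What follows is a minimal Python sketch, not the kernel's decoder. The bit positions are assumed from the MCi_STATUS layout in AMD's documentation (SyndV at bit 53 is an assumption for this processor family), and the names FLAG_BITS and decode_status are made up for illustration.

```python
# Assumed bit positions for MCi_STATUS flags (illustrative, not exhaustive).
FLAG_BITS = {
    "Val": 63,    # register holds a valid error
    "Over": 62,   # an earlier error was overwritten or dropped
    "UC": 61,     # uncorrected error (clear means corrected, printed as "CE")
    "En": 60,     # reporting enabled for this error source
    "MiscV": 59,  # MCi_MISC holds valid data
    "AddrV": 58,  # MCi_ADDR holds a valid address
    "PCC": 57,    # processor context corrupt
    "SyndV": 53,  # syndrome register valid (assumed position)
}

# Memory-hierarchy error codes in the low 16 bits: 0000 0001 RRRR TTLL
LEVELS = ["L0", "L1", "L2", "LG"]
TX_TYPES = ["INSN", "DATA", "GEN", "RESV"]
MEM_TX = ["GEN", "RD", "WR", "DRD", "DWR", "IRD", "PRF", "EV", "SNP"]


def decode_status(status):
    """Split an MCi_STATUS value into flags and error-code fields."""
    flags = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    error_code = status & 0xFFFF
    result = {
        "flags": flags,
        "ext_error_code": (status >> 16) & 0x3F,
        "error_code": error_code,
    }
    # Decode RRRR/TT/LL only when the code has the memory-error signature.
    if error_code >> 8 == 0x01:
        rrrr = (error_code >> 4) & 0xF
        result["mem_tx"] = MEM_TX[rrrr] if rrrr < len(MEM_TX) else "RESV"
        result["tx"] = TX_TYPES[(error_code >> 2) & 0x3]
        result["cache_level"] = LEVELS[error_code & 0x3]
    return result


# The status value logged for both CPU:0 and CPU:12 in this report.
decoded = decode_status(0xDC20000000030151)
print(decoded)
```

For the value in this report, the sketch gives extended error code 3 and the fields L1 / INSN / IRD, with Over set and UC clear, which agrees with the kernel's own decoded lines.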

2. What is the Version-Release number of the kernel:

kernel-5.6.6-300.fc32.x86_64
kernel-5.6.7-300.fc32.x86_64
kernel-5.6.14-300.fc32.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue
   *first* appear?  Old kernels are available for download at
   https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

This is a new build, so I can't say for sure.  I did the install from a Fedora 32 Workstation live thumbdrive and while running the live OS I didn't see any issues.

4. Can you reproduce this issue? If so, please provide the steps to reproduce
   the issue below:

Continues to occur, but at random times.

5. Does this problem occur with the latest Rawhide kernel? To install the
   Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by
   ``sudo dnf update --enablerepo=rawhide kernel``:

I've not tested with the rawhide kernel.  I can't find much information on the error to determine a root cause.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No.

7. Please attach the kernel logs. You can get the complete kernel log
   for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the
   issue occurred on a previous boot, use the journalctl ``-b`` flag.

Comment 1 Phil Hale 2020-05-02 03:43:57 UTC
Closing this bug report.  I decided to test all 64GB of memory using memtest86+ and I was able to narrow down the problem to a single DIMM causing issues.

Comment 2 Phil Hale 2020-05-14 01:01:29 UTC
I've reopened this as the issue continues even with confirmed good memory in the system.  I've searched AMD support forums and found references to an issue going back to 2016 with Ryzen chips.  Most recommended disabling C-State in the bios, which I have done, but the issue continues.

Comment 3 Steve 2020-05-14 21:40:54 UTC
> ... [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
                                                                   ^^
If "IC" means "Instruction Cache", the problem is in the processor.

Although the log doesn't show any thermal issues, have you tried monitoring the CPU core temperature and the CPU fan speed?

May 01 14:56:00 kernel: smpboot: CPU0: AMD Ryzen 9 3900X 12-Core Processor (family: 0x17, model: 0x71, stepping: 0x0)

According to the AMD specs for that processor, TDP is 105W and Max Temp is 95°C.

AMD Ryzen™ 9 3900X
https://www.amd.com/en/products/cpu/amd-ryzen-9-3900x

Comment 4 Steve 2020-05-14 21:48:57 UTC
If you haven't already, I suggest installing the "lm_sensors" package.

The "sensors-detect" command will configure hardware monitoring for your system.

The "sensors" command will "show the current readings of all sensor chips."

Comment 5 Phil Hale 2020-05-15 21:27:34 UTC
I had previously installed lm_sensors, but it wasn't picking up the CPU temperature or voltage settings. I did some research and found that I had to add the "acpi_enforce_resources=lax" boot option to get sensor data.  Here is what I'm seeing now after running for about four hours doing basic desktop work:

 sensors
nct6798-isa-0290
Adapter: ISA adapter
in0:                     1.30 V  (min =  +0.00 V, max =  +1.74 V)
in1:                   1000.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in2:                     3.36 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in3:                     3.30 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in4:                     1.71 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in5:                   592.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in6:                   1000.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in7:                     3.36 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in8:                     3.23 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in9:                   952.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in10:                   32.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in11:                   96.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
in12:                    1.02 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in13:                    1.19 V  (min =  +0.00 V, max =  +0.00 V)  ALARM
in14:                  952.00 mV (min =  +0.00 V, max =  +0.00 V)  ALARM
fan1:                   554 RPM  (min =    0 RPM)
fan2:                  1750 RPM  (min =    0 RPM)
fan3:                   551 RPM  (min =    0 RPM)
fan4:                   898 RPM  (min =    0 RPM)
fan5:                     0 RPM  (min =    0 RPM)
fan6:                     0 RPM  (min =    0 RPM)
fan7:                     0 RPM  (min =    0 RPM)
SYSTIN:                 +37.0°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
CPUTIN:                 +41.5°C  (high = +80.0°C, hyst = +75.0°C)  sensor = thermistor
AUXTIN0:                +26.0°C    sensor = thermistor
AUXTIN1:               +127.0°C    sensor = thermistor
AUXTIN2:               +101.0°C    sensor = thermistor
AUXTIN3:                +29.0°C    sensor = thermistor
PCH_CHIP_CPU_MAX_TEMP:   +0.0°C  
PCH_CHIP_TEMP:           +0.0°C  
PCH_CPU_TEMP:            +0.0°C  
PCH_MCH_TEMP:            +0.0°C  
intrusion0:            ALARM
intrusion1:            ALARM
beep_enable:           disabled

k10temp-pci-00c3
Adapter: PCI adapter
Vcore:         1.30 V  
Vsoc:          1.01 V  
Tdie:         +49.4°C  
Tctl:         +49.4°C  
Tccd1:        +43.0°C  
Tccd2:        +44.0°C  
Icore:        17.00 A  
Isoc:          9.75 A  

nvme-pci-0100
Adapter: PCI adapter
Composite:    +43.9°C  (low  =  -0.1°C, high = +89.8°C)
                       (crit = +94.8°C)

amdgpu-pci-0b00
Adapter: PCI adapter
vddgfx:      725.00 mV 
fan1:           0 RPM  (min =    0 RPM, max = 3500 RPM)
edge:         +50.0°C  (crit = +118.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
junction:     +50.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
mem:          +50.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
power1:        8.00 W  (cap = 220.00 W)


Looks like my CPU is hovering right around 105 - 106 degrees F.

Phil
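
As a cross-check on the units: the sensors output above is in Celsius, while the figure quoted is in Fahrenheit. A small sketch of the conversion follows; c_to_f is an illustrative helper, and the readings are copied from the output above.

```python
def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32


# Readings copied from the sensors output in this comment.
readings_c = {"CPUTIN": 41.5, "Tdie": 49.4, "Tccd1": 43.0, "Tccd2": 44.0}

for name, celsius in readings_c.items():
    print(f"{name}: {celsius:.1f} C = {c_to_f(celsius):.1f} F")
```

CPUTIN at 41.5 C converts to about 106.7 F, so the 105 - 106 F figure lines up with the motherboard's CPUTIN thermistor. The k10temp Tdie reading of 49.4 C is about 120.9 F.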

Comment 6 Steve 2020-05-16 04:01:01 UTC
Thanks for your follow-up report.

I would suggest comparing some of those numbers with what the BIOS shows you.

k10temp is AMD-specific, so it might be the most reliable:

$ modinfo k10temp | grep description
description:    AMD Family 10h+ CPU core temperature monitor

If you want to monitor those continuously, there are several ways to do that:

1. Install the "xsensors" package.

2. Install the "gkrellm" package. (Press "F1" to get to the configurator.)

3. Install a desktop-specific applet:
   $ dnf -C search sensors

4. Write your own shell script using files from here:
   $ find -L /sys/class/hwmon -maxdepth 2 2>/dev/null | sort | less

Comment 7 Steve 2020-05-16 04:12:02 UTC
> Looks like my CPU is hovering right around 105 - 106 degrees F.

Temps can spike quite quickly when there is a brief pulse of CPU utilization.

However, the next step is to run a stress test to see if you can induce MCEs.

# dnf install stress stress-ng

I'm not sure what would be the best test for your situation, but the documentation will show you the options.

While running the stress test, run these in separate terminal windows:

$ top  # Press "1" to see each CPU separately. Press "d" to change the update delay. Press "n" to change the number of tasks shown.

$ journalctl --no-hostname -k -f  # This will show you any MCEs as they occur.

Comment 8 Phil Hale 2020-05-16 18:59:22 UTC
An update from today:

I went into the BIOS and set "Optimal Defaults" for the board.  After a clean reboot I set up a bash while loop to dump the CPU temperature to a file every five seconds as I ran a full 30 minute stress test using the GTKStressTesting tool. It looks like the temperature got up to 47.0 C at the end of the run (attached sensors-results.txt file).  I was able to generate MCE errors, but they don't appear to be as frequent as before.  Attaching a journalctl dump from the run.  Again, during the whole process, the system remained stable and usable.

Phil

Comment 9 Phil Hale 2020-05-16 19:00:34 UTC
Created attachment 1689223 [details]
sensors reading of the cpu temp every 5 seconds during a 30 minute stress test.

Comment 10 Phil Hale 2020-05-16 19:01:33 UTC
Created attachment 1689232 [details]
journalctl dump of mce/kernel log during 30 minute stress test.

Comment 11 Steve 2020-05-16 20:21:42 UTC
Nice work on the stress test and logging. Here is a brief analysis of the log file.

There are 16 'Error Addr' records:

$ cat mce-errors-during-stress-test.txt | fgrep 'Error Addr' | wc -l
16

However, there are only 9 unique addresses. Also, only CPU:0 and CPU:12 have the errors.

$ cat mce-errors-during-stress-test.txt | fgrep '[Hardware Error]' | sed 's/^.*kernel: //' | sort -u
[Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:71:0) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000030151
[Hardware Error]: CPU:12 (17:71:0) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000030151
[Hardware Error]: Error Addr: 0x00000000b1041ae0
[Hardware Error]: Error Addr: 0x00000000b1111ae0
[Hardware Error]: Error Addr: 0x00000000b1117ae0
[Hardware Error]: Error Addr: 0x00000000b111aae0
[Hardware Error]: Error Addr: 0x00000000b1135ae0
[Hardware Error]: Error Addr: 0x00000000b116bae0
[Hardware Error]: Error Addr: 0x00000000b163bae0
[Hardware Error]: Error Addr: 0x00000000b1679ae0
[Hardware Error]: Error Addr: 0x00000000b1a33ae0
[Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
[Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000001a030507
mce: [Hardware Error]: Machine check events logged
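
The same tallies can be produced in one pass with a short script. This is a minimal sketch; summarize and the two regular expressions are illustrative names, and the sample lines are the four CPU and address lines from the description at the top of this report, not the attached stress-test log.

```python
import re
from collections import Counter

ADDR_RE = re.compile(r"Error Addr: (0x[0-9a-fA-F]+)")
CPU_RE = re.compile(r"\bCPU:(\d+)\b")


def summarize(lines):
    """Return (address counts, per-CPU event counts) for MCE log lines."""
    addrs, cpus = Counter(), Counter()
    for line in lines:
        addr = ADDR_RE.search(line)
        cpu = CPU_RE.search(line)
        if addr:
            addrs[int(addr.group(1), 16)] += 1
        if cpu:
            cpus[int(cpu.group(1))] += 1
    return addrs, cpus


# Four lines copied from the description of this report.
sample = [
    "May 01 15:06:59 kernel: [Hardware Error]: CPU:12 (17:71:0) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000030151",
    "May 01 15:06:59 kernel: [Hardware Error]: Error Addr: 0x000000076da32ae0",
    "May 01 15:06:59 kernel: [Hardware Error]: CPU:0 (17:71:0) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000030151",
    "May 01 15:06:59 kernel: [Hardware Error]: Error Addr: 0x0000000fbedc2ae0",
]
addrs, cpus = summarize(sample)
print(sum(addrs.values()), "records,", len(addrs), "unique addresses")
print("events per CPU:", dict(cpus))

# The nine unique addresses listed above, reduced to their offset in a 4 KiB page.
unique_addrs = [
    0xB1041AE0, 0xB1111AE0, 0xB1117AE0, 0xB111AAE0, 0xB1135AE0,
    0xB116BAE0, 0xB163BAE0, 0xB1679AE0, 0xB1A33AE0,
]
page_offsets = {hex(a & 0xFFF) for a in unique_addrs}
print("page offsets:", page_offsets)
```

One pattern the sorted list makes visible: every address ends in ae0, that is, the same offset within a 4 KiB page, as do the two addresses in the original description. What that implies is not established in this thread.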

Comment 12 Steve 2020-05-16 20:29:31 UTC
"... using the GTKStressTesting tool."

I found that online, but there is no Fedora package, AFAICT. How did you install it?

Comment 13 Steve 2020-05-16 20:58:50 UTC
Chapter 9 of this AMD document is on the "Machine Check Architecture" (MCA):

AMD64 Architecture Programmer’s Manual
Volume 2: System Programming
September 2012
https://developer.amd.com/wordpress/media/2012/10/24593_APM_v2.pdf

Figure 9-6 has the bit layout of the "MCi_STATUS Register", which can be used to interpret some of the "MC1_STATUS" bits in the log.

Here are a few cherry-picked quotes:

"The error-reporting registers retain their values through a warm reset." (p. 266)

"The i in each register name corresponds to the number of a supported register bank." (p. 270)

"Software clears the VAL bit after reading the contents of this [MCi_STATUS] register (after reading and saving valid information stored in any of the other logging registers) to indicate to hardware that it has saved the information, making the registers available to log the next error." (p. 270)

Comment 14 Steve 2020-05-16 21:24:37 UTC
> ... only CPU:0 and CPU:12 have the errors.

That seems a bit peculiar. According to the specs*, the AMD Ryzen 9 3900X has 24 threads, which appear to be numbered starting from 0:

 0 ... 11
12 ... 23

What do you get for this:

$ grep processor /proc/cpuinfo | wc -l

* https://www.amd.com/en/products/cpu/amd-ryzen-9-3900x

Comment 15 Phil Hale 2020-05-16 23:13:08 UTC
(In reply to Steve from comment #12)
> "... using the GTKStressTesting tool."
> 
> I found that online, but there is no Fedora package, AFAICT. How did you
> install it?

I found it on Gnome Software. From what I can see the source of the install was Fedora RPM, but it's also available from Flathub as a flatpak.

The package is:

gst-0.7.2-1.fc32.noarch : System utility designed to stress and monitoring various hardware components
Repo        : updates
Matched from:
Filename    : /usr/bin/gst

Phil

Comment 16 Phil Hale 2020-05-16 23:15:58 UTC
(In reply to Steve from comment #14)
> > ... only CPU:0 and CPU:12 have the errors.
> 
> That seems a bit peculiar. According to the specs*, the AMD Ryzen 9 3900X
> has 24 threads, which appear to be numbered starting from 0:
> 
>  0 ... 11
> 12 ... 23
> 
> What do you get for this:
> 
> $ grep processor /proc/cpuinfo | wc -l
> 
> * https://www.amd.com/en/products/cpu/amd-ryzen-9-3900x

Showing all 24 threads:

grep processor /proc/cpuinfo | wc -l
24

Which matches up with what the GTKStressTesting tool shows.  It provides a graphical load display on 12 cores.

Phil

Comment 17 Phil Hale 2020-05-16 23:22:38 UTC
(In reply to Steve from comment #13)
> Chapter 9 of this AMD document is on the "Machine Check Architecture" (MCA):
> 
> AMD64 Architecture Programmer’s Manual
> Volume 2: System Programming
> September 2012
> https://developer.amd.com/wordpress/media/2012/10/24593_APM_v2.pdf
> 
> Figure 9-6 has the bit layout of the "MCi_STATUS Register", which can be
> used to interpret some of the "MC1_STATUS" bits in the log.
> 
> Here are a few cherry-picked quotes:
> 
> "The error-reporting registers retain their values through a warm reset."
> (p. 266)
> 
> "The i in each register name corresponds to the number of a supported
> register bank." (p. 270)
> 
> "Software clears the VAL bit after reading the contents of this [MCi_STATUS]
> register (after reading and saving valid information stored in any of the
> other logging registers) to indicate to hardware that it has saved the
> information, making the registers available to log the next error." (p. 270)

Maybe I should do a cold cycle or as Dell would call it a "Flea Power Release" of the system.  Pull the plug, hold the power button for 30 seconds (I usually do a minute) to make sure the board, psu, etc... have completely de-powered and then boot back up with the same "Optimal Default" BIOS settings.  The one thing I've noticed since I enabled the BIOS change is the board's "Q-Code" display has switched from "0C" to a steady "00".  These codes are not very helpful, but from what I can tell in the Asus manual, "00" is "Not Used" and probably means no board codes now.  The "0C" wasn't any help either, as it's marked as "Reserved for future AMI SEC error codes" in the manual.

Phil

Comment 18 Steve 2020-05-17 01:03:54 UTC
Thanks for your follow-up replies. I'm on F30, which doesn't have "gst" -- it appears to have been added in F31:

$ dnf -q repoquery gst --releasever=31
gst-0:0.7.2-1.fc31.noarch

This might show whether "processor" 0 and "processor" 12 are on the same core or not:

$ egrep 'processor|core id' /proc/cpuinfo

Comment 19 Steve 2020-05-17 02:49:27 UTC
> ... whether "processor" 0 and "processor" 12 are on the same core or not:

If they are on the same core, they appear to share the L1 instruction cache (32 KB per core).

Those details are in a table, "L1 Instruction Cache Identifiers", on page 72 here:

Processor Programming Reference (PPR) for AMD Family 17h Model 71h, Revision B0 Processors
https://developer.amd.com/wp-content/resources/56176_ppr_Family_17h_Model_71h_B0_pub_Rev_3.06.zip

For the record, your processor is:

May 16 13:16:41 kernel: smpboot: CPU0: AMD Ryzen 9 3900X 12-Core Processor (family: 0x17, model: 0x71, stepping: 0x0)
                                                                                      ^^           ^^
The log explicitly mentions "L1":

May 16 13:22:12 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
                                                       ^^

More AMD documentation is here:

Developer Guides, Manuals & ISA Documents
https://developer.amd.com/resources/developer-guides-manuals/

Comment 20 Steve 2020-05-17 03:28:22 UTC
(In reply to Steve from comment #18)
> Thanks for your follow-up replies. I'm on F30, which doesn't have "gst" -- it appears to have been added in F31:

I tried "gst" with a test install of F31 on a separate hard drive and am very impressed.

I especially like the max/min reporting for temperatures and other data.

(In reply to Steve from comment #19)
> Those details are in a table, "L1 Instruction Cache Identifiers", on page 72 here:

"gst" shows all the cache details in a nice table. Much more convenient than digging through the documentation ... :-)

Comment 21 Steve 2020-05-17 17:22:46 UTC
As the log shows, the kernel boots from CPU0 and then brings up the other CPUs.

If the BIOS settings allow it, it might be informative to disable CPU0/CPU12 and see if the problem occurs when another CPU is the boot CPU.

Also noteworthy is this from the log: "Max logical packages: 2". /proc/cpuinfo might shed some light on what that means.

There are also some kernel command-line options related to "SMP":

The kernel’s command-line parameters
https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html

== snippet from second attached log ==
May 16 13:16:41 kernel: smp: Bringing up secondary CPUs ...
May 16 13:16:41 kernel: x86: Booting SMP configuration:
May 16 13:16:41 kernel: .... node  #0, CPUs:        #1  #2  #3  #4  #5  #6  #7  #8  #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23
May 16 13:16:41 kernel: smp: Brought up 1 node, 24 CPUs
May 16 13:16:41 kernel: smpboot: Max logical packages: 2
May 16 13:16:41 kernel: smpboot: Total of 24 processors activated (182060.20 BogoMIPS)
==

Comment 22 Phil Hale 2020-05-18 16:14:32 UTC
(In reply to Steve from comment #18)
> Thanks for your follow-up replies. I'm on F30, which doesn't have "gst" --
> it appears to have been added in F31:
> 
> $ dnf -q repoquery gst --releasever=31
> gst-0:0.7.2-1.fc31.noarch
> 
> This might show whether "processor" 0 and "processor" 12 are on the same
> core or not:
> 
> $ egrep 'processor|core id' /proc/cpuinfo

So I did the egrep this morning and it does appear that "processor 0" and "processor 12" are indeed on the same core:

egrep 'processor|core id' /proc/cpuinfo
processor	: 0
core id		: 0

processor	: 1
core id		: 1
processor	: 2
core id		: 2
processor	: 3
core id		: 4
processor	: 4
core id		: 5
processor	: 5
core id		: 6
processor	: 6
core id		: 8
processor	: 7
core id		: 9
processor	: 8
core id		: 10
processor	: 9
core id		: 12
processor	: 10
core id		: 13
processor	: 11
core id		: 14

processor	: 12
core id		: 0

processor	: 13
core id		: 1
processor	: 14
core id		: 2
processor	: 15
core id		: 4
processor	: 16
core id		: 5
processor	: 17
core id		: 6
processor	: 18
core id		: 8
processor	: 19
core id		: 9
processor	: 20
core id		: 10
processor	: 21
core id		: 12
processor	: 22
core id		: 13
processor	: 23
core id		: 14

Could this mean a defective core?

Phil
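
The pairing can be read off mechanically by grouping logical processors by core id. This is a minimal sketch; smt_siblings is an illustrative name, and it assumes a "physical id" of 0 when that field is missing, as it is from the filtered egrep output above. The sample text is a four-processor excerpt of that output.

```python
from collections import defaultdict


def smt_siblings(cpuinfo_text):
    """Map (physical id, core id) to the logical processors on that core."""
    cores = defaultdict(list)
    processor = None
    physical_id = 0  # assumed when the field is absent from the input
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        key, value = key.strip(), value.strip()
        if key == "processor":
            processor = int(value)
            physical_id = 0
        elif key == "physical id":
            physical_id = int(value)
        elif key == "core id" and processor is not None:
            cores[(physical_id, int(value))].append(processor)
    return dict(cores)


# Excerpt of the egrep output above (processors 0, 1, 12 and 13 only).
sample = (
    "processor\t: 0\ncore id\t\t: 0\n"
    "processor\t: 1\ncore id\t\t: 1\n"
    "processor\t: 12\ncore id\t\t: 0\n"
    "processor\t: 13\ncore id\t\t: 1\n"
)
for core, threads in sorted(smt_siblings(sample).items()):
    print(core, threads)
```

On a running system the kernel also exposes the pairing directly in /sys/devices/system/cpu/cpu0/topology/thread_siblings_list, which avoids parsing /proc/cpuinfo.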

Comment 23 Steve 2020-05-18 19:01:04 UTC
The core IDs don't seem to be sequential. Is that what you saw in "gst"?

$ cat ryzen-cpuinfo-1.txt | grep core | sort -Vu | cat -n
     1	core id		: 0
     2	core id		: 1
     3	core id		: 2
     4	core id		: 4
     5	core id		: 5
     6	core id		: 6
     7	core id		: 8
     8	core id		: 9
     9	core id		: 10
    10	core id		: 12
    11	core id		: 13
    12	core id		: 14

Comment 24 Steve 2020-05-18 19:17:57 UTC
> Could this mean a defective core?

The only conclusion I would cautiously draw is that since CPU0 and CPU12 share their L1 caches, it could be that some of the MCEs are duplicates.

A conjecture is that since the kernel boots from CPU0, there could be a problem with how the machine-check status registers are first read.

And note the phrase "model-specific registers" in this sentence:

"The AMD64 Machine-Check Architecture defines the set of model-specific registers (MCA MSRs) used to log and report hardware errors." (p. 265)

The kernel could be incorrectly identifying the "model-specific registers" on your processor.

AMD64 Architecture Programmer’s Manual
Volume 2: System Programming
September 2012
https://developer.amd.com/wordpress/media/2012/10/24593_APM_v2.pdf

Comment 25 Steve 2020-05-18 19:40:58 UTC
(In reply to Steve from comment #23)
> The core IDs don't seem to be sequential.

There could be an explanation:

The AMD Ryzen 9 3950X has 16 cores, so the 12-core model (3900X) could actually have 16 cores, but some cores are disabled.

Or that could be an oversimplification:

AMD Clarifies "Best Cores" vs "Preferred Cores" Discrepancies For Ryzen CPUs
by Andrei Frumusanu
November 21, 2019
https://www.anandtech.com/show/15137/amd-clarifies-best-cores-vs-preferred-cores

Comment 26 Phil Hale 2020-05-22 19:42:47 UTC
(In reply to Steve from comment #25)
> (In reply to Steve from comment #23)
> > The core IDs don't seem to be sequential.
> 
> There could be an explanation:
> 
> The AMD Ryzen 9 3950X has 16 cores, so the 12-core model (3900X) could
> actually have 16 cores, but some cores are disabled.
> 
> Or that could be an oversimplification:
> 
> AMD Clarifies "Best Cores" vs "Preferred Cores" Discrepancies For Ryzen CPUs
> by Andrei Frumusanu
> November 21, 2019
> https://www.anandtech.com/show/15137/amd-clarifies-best-cores-vs-preferred-cores

Hey Steve,

Just wanted to reach back out on this.  Do we think this is a "non-issue" now?  Or is this something I need to RMA the processor for?

The system has still been very stable, just the sporadic MCE messages.

Phil

Comment 27 Steve 2020-05-23 06:22:26 UTC
An upstream search for "mce"* found a request that the kernel ignore certain correctable errors on Intel processors:

Bug_206587 - x86/mce: Do not log spurious corrected mce errors 
https://bugzilla.kernel.org/show_bug.cgi?id=206587

And the request cites various Intel errata.

However, I couldn't find any AMD errata for your specific processor.

So there are too many possibilities to say what is going on.

I would suggest:

1. Updating the bug summary to say something like this:

"correctable MCE errors logged for CPU0/CPU12 L1 instruction cache with AMD Ryzen 9 3900X 12-Core Processor"

2. Opening a bug upstream under "Platform Specific/Hardware", "x86-64":
https://bugzilla.kernel.org/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&component=x86-64&product=Platform%20Specific%2FHardware&query_format=advanced

3. Checking the AMD web site for errata. This is a search for "revision guide 17h":
https://www.amd.com/en/support/tech-docs?keyword=revision+guide+17h

* https://bugzilla.kernel.org/buglist.cgi?quicksearch=mce

Comment 28 Steve 2020-05-25 04:47:55 UTC
FYI, Linus is "now rocking an AMD Threadripper 3970x":

From	Linus Torvalds <>
Date	Sun, 24 May 2020 16:00:50 -0700
Subject	Linux 5.7-rc7
https://lkml.org/lkml/2020/5/24/407

That processor has the same AMD Zen 2 architecture as yours:
https://en.wikipedia.org/wiki/Zen_2#Desktop_processors

So if there are any kernel problems related to MCEs, they will receive all due attention. :-)

Comment 29 Phil Hale 2020-05-27 00:41:32 UTC
Made changes to Subject and opened an up-stream bug report per the recommendations:

https://bugzilla.kernel.org/show_bug.cgi?id=207907

Guess I'll wait and see if they can provide an answer to this.

(In reply to Steve from comment #27)
> An upstream search for "mce"* found a request that the kernel ignore certain
> correctable errors on Intel processors:
> 
> Bug_206587 - x86/mce: Do not log spurious corrected mce errors 
> https://bugzilla.kernel.org/show_bug.cgi?id=206587
> 
> And the request cites various Intel errata.
> 
> However, I couldn't find any AMD errata for your specific processor.
> 
> So there are too many possibilities to say what is going on.
> 
> I would suggest:
> 
> 1. Updating the bug summary to say something like this:
> 
> "correctable MCE errors logged for CPU0/CPU12 L1 instruction cache with AMD
> Ryzen 9 3900X 12-Core Processor"
> 
> 2. Opening a bug upstream under "Platform Specific/Hardware", "x86-64":
> https://bugzilla.kernel.org/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&component=x86-64&product=Platform%20Specific%2FHardware&query_format=advanced
> 
> 3. Checking the AMD web site for errata. This is a search for "revision
> guide 17h":
> https://www.amd.com/en/support/tech-docs?keyword=revision+guide+17h
> 
> * https://bugzilla.kernel.org/buglist.cgi?quicksearch=mce

Comment 30 Steve 2020-05-27 11:50:34 UTC
(In reply to Phil Hale from comment #29)
> Made changes to Subject and opened an up-stream bug report per the recommendations:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=207907

Thanks. Could you add a link in the "Links" section near the top of the BZ page, so that the upstream bug report is easy to find?

> Guess I'll wait and see if they can provide an answer to this.

Comment 31 Steve 2020-05-27 19:10:43 UTC
Thanks for adding the link to your upstream bug report. And thanks for attaching the logs for your stress tests to it.

It might be a good idea to document the stress test you are using:

$ rpm -q gst

https://gitlab.com/leinardi/gst

Comment 32 Phil Hale 2020-05-27 19:18:14 UTC
(In reply to Steve from comment #31)
> Thanks for adding the link to your upstream bug report. And thanks for
> attaching the logs for your stress tests to it.
> 
> It might be a good idea to document the stress test you are using:
> 
> $ rpm -q gst
> 
> https://gitlab.com/leinardi/gst

Done.

Comment 33 Steve 2020-05-27 23:24:32 UTC
Borislav appears to be referring to this command:*

$ dnf -Cq repoquery --whatprovides /usr/\*bin/cpupower
kernel-tools-0:5.3.7-300.fc31.x86_64
kernel-tools-0:5.6.7-200.fc31.x86_64

* https://bugzilla.kernel.org/show_bug.cgi?id=207907#c5

Comment 34 Phil Hale 2020-05-29 14:04:19 UTC
(In reply to Steve from comment #33)
> Borislav appears to be referring to this command:*
> 
> $ dnf -Cq repoquery --whatprovides /usr/\*bin/cpupower
> kernel-tools-0:5.3.7-300.fc31.x86_64
> kernel-tools-0:5.6.7-200.fc31.x86_64
> 
> * https://bugzilla.kernel.org/show_bug.cgi?id=207907#c5

Yep,

I did additional testing adjusting the two items he requested, separately and together.  Still had the MCEs logged.  In fact doing just one setting, then the two together generated the same number of MCEs in a 30 minute stress test.  I found some other bugs on the kernel bugzilla that appear to be the same issue back in the 4.10 tree, but it doesn't appear to have been resolved.

Comment 35 Steve 2020-05-29 14:22:56 UTC
(In reply to Phil Hale from comment #34)
...
> Yep,
> 
> I did additional testing adjusting the two items he requested, separately
> and together.  Still had the MCEs logged.  In fact doing just one setting,
> then the two together generated the same number of MCEs in a 30 minute
> stress test.  I found some other bugs on the kernel bugzilla that appear to
> be the same issue back in the 4.10 tree, but it doesn't appear to have been
> resolved.

Thanks for your update. None of that conclusively points to a processor bug.

However, AMD has been known to replace faulty processors:

"He got a replacement from AMD [that fixed a problem with segfaults while under load]."

That is from the section headed "Background from original developer (Suaefar)" here:*
https://github.com/Oxalin/ryzen-test

That has a link to this thread on the AMD web site:

gcc segmentation faults on Ryzen / Linux
May 8, 2017; Latest reply on Jul 10, 2019
https://community.amd.com/thread/215773

* Linked from unrelated Bug 1840969.

Comment 36 Phil Hale 2020-06-06 00:23:18 UTC
So I got a new power supply in today.  It was the only component I had not switched out; the old one was a 10+ year old 750 Watt Ultra.  The new one is a Seasonic PX-1000.  On boot up, I didn't see any MCE errors and thought, "Ah ha!" then about 5 minutes in I got one.  So off to do a new stress test.  Same 30 minute stress test as before while following the logs with the following command, "journalctl --no-hostname -k -f > 20200605-journal-during-30min-stress-test.txt".  From a previous test with the old power supply, same stress test settings, I got a total of 9 MCE events.  Tonight I got 6.  So, better?  Not conclusive, but less.  I'm still leaning towards some sort of issue with the kernel.  I've run out of hardware to replace, except to RMA the CPU.  I'm not really wanting to do this as the system remains rock-solid, running for days, doing lots of heavy duty Sys Admin work... I'll add this to the kernel.org bug, but I don't expect much to come of it.
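
Whether 6 events versus 9 is a real improvement can be checked with a quick calculation. This is a rough sketch under a stated assumption: MCEs arrive independently at a constant rate during each 30 minute run. If the power supply made no difference, each of the 15 events would be equally likely to land in either run, so the count in the new run would follow a Binomial(15, 0.5) distribution. The helper name binom_tail_le is illustrative.

```python
from math import comb


def binom_tail_le(k, n, p=0.5):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


old_psu, new_psu = 9, 6  # MCE events per 30 minute run, from this comment
total = old_psu + new_psu

one_sided = binom_tail_le(new_psu, total)
two_sided = min(1.0, 2 * one_sided)  # the distribution is symmetric at p = 0.5

print(f"P(new run has <= {new_psu} of {total} events) = {one_sided:.3f}")
print(f"two-sided p-value = {two_sided:.3f}")
```

The one-sided probability is about 0.30, so a 9 versus 6 split is well within what chance alone would produce, which fits the "not conclusive" reading.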

Comment 37 Steve 2020-06-06 21:58:11 UTC
Thanks for your status update.

FYI, the 5.8 merge window is open. Look for a 5.8-rc1 release here:

https://www.kernel.org/  # mainline kernels are usually released on Sunday afternoons.*

And a Fedora build here:

https://koji.fedoraproject.org/koji/packageinfo?packageID=8

* However, the merge window appears to be 2 weeks,
so the 5.8-rc1 release would be two weeks after the 5.7 release on 2020-05-31 -- which would be 2020-06-14:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tag/?h=v5.7

Comment 38 demo 2020-09-14 14:55:55 UTC
Sorry to revive this issue, but I have the very same one, also on a 3900X and I wondered if you found a way to solve it?

By the way this is not kernel-related (or at least, probably not) because I have the same error on Windows, where it gets logged in the System event log as "WHEA Event ID 19".

Mine occurs on core 6, which is also 18, so it logs 2 events as well.

Tried a million things with BIOS and memory configs, nothing seems to help. 

On Windows it's very apparent the issue only happens when the CPU is idle. Although a very small pause in a workload can make it appear.

Just like you, it's got no apparent effect on stability or performance, unless I'm missing something.

Would really like to hear what you ended up doing, thanks a lot!

Comment 39 Phil Hale 2020-10-16 21:02:25 UTC
Hello demo,

I'm still seeing the issue.  I've applied the latest BIOS updates for my board and have been thru several rounds of kernel updates running on Fedora 32. I've also tried all sorts of different BIOS settings, but none make a difference.  I tend to leave my system running for weeks at a time, and it remains perfectly stable other than the MCE log entries. I'm guessing this is just something on the processors.  I had filed a support ticket with AMD and they informed me that official support was only available for Windows 10.  If you're seeing the same issue with Windows, maybe you could open a ticket on that with them directly.  If you do come across something let me know!

Phil

Comment 40 Fedora Program Management 2021-04-29 17:10:04 UTC
This message is a reminder that Fedora 32 is nearing its end of life.
Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25.
It is Fedora's policy to close all bug reports from releases that are no longer
maintained. At that time this bug will be closed as EOL if it remains open with a
Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you
plan to fix it in a currently maintained version, simply change the 'version' 
to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not 
able to fix it before Fedora 32 is end of life. If you would still like 
to see this bug fixed and are able to reproduce it against a later version 
of Fedora, you are encouraged to change the 'version' to a later Fedora
version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's 
lifetime, sometimes those efforts are overtaken by events. Often a 
more recent Fedora release includes newer upstream software that fixes 
bugs or makes them obsolete.

Comment 41 Phil Hale 2021-05-17 22:31:59 UTC
Just an update on this issue.  I've replaced the CPU with a Ryzen 9 5950X and the issue is no longer occurring.  I think I can now identify this as an issue with the CPU and not the kernel/OS.

