Created attachment 1683843 [details] requested kernel log from step 7.

1. Please describe the problem:

I've just built a new system and installed Fedora 32 Workstation on it. Components are:

1. Asus ROG CROSSHAIR VIII HERO motherboard
2. AMD Ryzen 9 3900X processor
3. 4 x Corsair CMK32GX4M2D3600C18 memory (running at 2133 MT/s due to 64GB installed)

I'm getting random MCE errors in the logs like so:

May 01 15:06:59 kernel: mce: [Hardware Error]: Machine check events logged
May 01 15:06:59 kernel: [Hardware Error]: Corrected error, no action required.
May 01 15:06:59 kernel: [Hardware Error]: CPU:12 (17:71:0) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000030151
May 01 15:06:59 kernel: [Hardware Error]: Error Addr: 0x000000076da32ae0
May 01 15:06:59 kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000001a030507
May 01 15:06:59 kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
May 01 15:06:59 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
May 01 15:06:59 kernel: mce: [Hardware Error]: Machine check events logged
May 01 15:06:59 kernel: [Hardware Error]: Corrected error, no action required.
May 01 15:06:59 kernel: [Hardware Error]: CPU:0 (17:71:0) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000030151
May 01 15:06:59 kernel: [Hardware Error]: Error Addr: 0x0000000fbedc2ae0
May 01 15:06:59 kernel: [Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000001a030507
May 01 15:06:59 kernel: [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
May 01 15:06:59 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD

I tried to get information on the issue, but mcelog is not able to provide any additional information and the abrt reports are empty as well. I've run a 10-minute stress test with GTKStressTesting and the system is stable.
Everything seems to be running ok, except this message keeps occurring in the logs.

2. What is the Version-Release number of the kernel:

kernel-5.6.6-300.fc32.x86_64
kernel-5.6.7-300.fc32.x86_64
kernel-5.6.14-300.fc32.x86_64

3. Did it work previously in Fedora? If so, what kernel version did the issue *first* appear? Old kernels are available for download at https://koji.fedoraproject.org/koji/packageinfo?packageID=8 :

This is a new build, so I can't say for sure. I did the install from a Fedora 32 Workstation live thumbdrive and while running the live OS I didn't see any issues.

4. Can you reproduce this issue? If so, please provide the steps to reproduce the issue below:

Continues to occur, but at random times.

5. Does this problem occur with the latest Rawhide kernel? To install the Rawhide kernel, run ``sudo dnf install fedora-repos-rawhide`` followed by ``sudo dnf update --enablerepo=rawhide kernel``:

I've not tested with the Rawhide kernel. I can't find much information on the error to determine a root cause.

6. Are you running any modules that are not shipped directly with Fedora's kernel?:

No.

7. Please attach the kernel logs. You can get the complete kernel log for a boot with ``journalctl --no-hostname -k > dmesg.txt``. If the issue occurred on a previous boot, use the journalctl ``-b`` flag.
Closing this bug report. I decided to test all 64GB of memory using memtest86+ and I was able to narrow down the problem to a single DIMM causing issues.
I've reopened this as the issue continues even with confirmed good memory in the system. I've searched the AMD support forums and found references to an issue with Ryzen chips going back to 2016. Most recommended disabling C-States in the BIOS, which I have done, but the issue continues.
> ... [Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
                                                                   ^^

If "IC" means "Instruction Cache", the problem is in the processor.

Although the log doesn't show any thermal issues, have you tried monitoring the CPU core temperature and the CPU fan speed?

May 01 14:56:00 kernel: smpboot: CPU0: AMD Ryzen 9 3900X 12-Core Processor (family: 0x17, model: 0x71, stepping: 0x0)

According to the AMD specs for that processor, TDP is 105W and Max Temp is 95°C.

AMD Ryzen™ 9 3900X
https://www.amd.com/en/products/cpu/amd-ryzen-9-3900x
If you haven't already, I suggest installing the "lm_sensors" package. The "sensors-detect" command will configure hardware monitoring for your system. The "sensors" command will "show the current readings of all sensor chips."
I had previously installed lm_sensors, but it wasn't picking up the CPU temperature or voltage settings. I did some research and found that I had to add the "acpi_enforce_resources=lax" boot option to get sensor data. Here is what I'm seeing now after running for about four hours doing basic desktop work:

sensors

nct6798-isa-0290
Adapter: ISA adapter
in0: 1.30 V (min = +0.00 V, max = +1.74 V)
in1: 1000.00 mV (min = +0.00 V, max = +0.00 V) ALARM
in2: 3.36 V (min = +0.00 V, max = +0.00 V) ALARM
in3: 3.30 V (min = +0.00 V, max = +0.00 V) ALARM
in4: 1.71 V (min = +0.00 V, max = +0.00 V) ALARM
in5: 592.00 mV (min = +0.00 V, max = +0.00 V) ALARM
in6: 1000.00 mV (min = +0.00 V, max = +0.00 V) ALARM
in7: 3.36 V (min = +0.00 V, max = +0.00 V) ALARM
in8: 3.23 V (min = +0.00 V, max = +0.00 V) ALARM
in9: 952.00 mV (min = +0.00 V, max = +0.00 V) ALARM
in10: 32.00 mV (min = +0.00 V, max = +0.00 V) ALARM
in11: 96.00 mV (min = +0.00 V, max = +0.00 V) ALARM
in12: 1.02 V (min = +0.00 V, max = +0.00 V) ALARM
in13: 1.19 V (min = +0.00 V, max = +0.00 V) ALARM
in14: 952.00 mV (min = +0.00 V, max = +0.00 V) ALARM
fan1: 554 RPM (min = 0 RPM)
fan2: 1750 RPM (min = 0 RPM)
fan3: 551 RPM (min = 0 RPM)
fan4: 898 RPM (min = 0 RPM)
fan5: 0 RPM (min = 0 RPM)
fan6: 0 RPM (min = 0 RPM)
fan7: 0 RPM (min = 0 RPM)
SYSTIN: +37.0°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
CPUTIN: +41.5°C (high = +80.0°C, hyst = +75.0°C) sensor = thermistor
AUXTIN0: +26.0°C sensor = thermistor
AUXTIN1: +127.0°C sensor = thermistor
AUXTIN2: +101.0°C sensor = thermistor
AUXTIN3: +29.0°C sensor = thermistor
PCH_CHIP_CPU_MAX_TEMP: +0.0°C
PCH_CHIP_TEMP: +0.0°C
PCH_CPU_TEMP: +0.0°C
PCH_MCH_TEMP: +0.0°C
intrusion0: ALARM
intrusion1: ALARM
beep_enable: disabled

k10temp-pci-00c3
Adapter: PCI adapter
Vcore: 1.30 V
Vsoc: 1.01 V
Tdie: +49.4°C
Tctl: +49.4°C
Tccd1: +43.0°C
Tccd2: +44.0°C
Icore: 17.00 A
Isoc: 9.75 A

nvme-pci-0100
Adapter: PCI adapter
Composite: +43.9°C (low = -0.1°C, high = +89.8°C) (crit = +94.8°C)

amdgpu-pci-0b00
Adapter: PCI adapter
vddgfx: 725.00 mV
fan1: 0 RPM (min = 0 RPM, max = 3500 RPM)
edge: +50.0°C (crit = +118.0°C, hyst = -273.1°C) (emerg = +99.0°C)
junction: +50.0°C (crit = +99.0°C, hyst = -273.1°C) (emerg = +99.0°C)
mem: +50.0°C (crit = +99.0°C, hyst = -273.1°C) (emerg = +99.0°C)
power1: 8.00 W (cap = 220.00 W)

Looks like my CPU is hovering right around 105 - 106 degrees F.

Phil
Thanks for your follow-up report. I would suggest comparing some of those numbers with what the BIOS shows you.

k10temp is AMD-specific, so it might be the most reliable:

$ modinfo k10temp | grep description
description: AMD Family 10h+ CPU core temperature monitor

If you want to monitor those continuously, there are several ways to do that:

1. Install the "xsensors" package.
2. Install the "gkrellm" package. (Press "F1" to get to the configurator.)
3. Install a desktop-specific applet:
   $ dnf -C search sensors
4. Write your own shell script using files from here:
   $ find -L /sys/class/hwmon -maxdepth 2 2>/dev/null | sort | less
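For option 4, a minimal sketch of such a script might look like the following. It assumes the usual sysfs hwmon convention that "temp*_input" files hold millidegrees Celsius; the function names are made up for illustration, and which tempN file corresponds to Tdie or Tctl depends on the driver and kernel version, so check the matching "temp*_label" file first. Readings below zero are not handled.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of lm_sensors.

# Convert a millidegree reading (e.g. 49375) to degrees with one decimal,
# truncating rather than rounding.
millideg_to_c() {
    local m=$1
    printf '%d.%d\n' $(( m / 1000 )) $(( (m % 1000) / 100 ))
}

# Print the hwmon directory whose "name" file matches the given driver name.
# The second argument (hwmon root) defaults to /sys/class/hwmon.
find_hwmon() {
    local want=$1 root=${2:-/sys/class/hwmon} d
    for d in "$root"/hwmon*; do
        [ -r "$d/name" ] || continue
        if [ "$(cat "$d/name")" = "$want" ]; then
            printf '%s\n' "$d"
            return 0
        fi
    done
    return 1
}

# Append "timestamp temperature" lines to a file until interrupted.
log_temp() {
    local input=$1 out=$2 interval=${3:-5}
    while true; do
        printf '%s %s\n' "$(date '+%F %T')" \
            "$(millideg_to_c "$(cat "$input")")" >> "$out"
        sleep "$interval"
    done
}

# Example (assumes temp1_input is the reading you want):
#   dir=$(find_hwmon k10temp) && log_temp "$dir/temp1_input" cpu-temp.txt 5
```

Running the logger in one terminal while a stress test runs in another gives a timestamped file that can be lined up against the kernel log afterwards.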
> Looks like my CPU is hovering right around 105 - 106 degrees F.

Temps can spike quite quickly when there is a brief pulse of CPU utilization.

However, the next step is to run a stress test to see if you can induce MCEs.

# dnf install stress stress-ng

I'm not sure what would be the best test for your situation, but the documentation will show you the options.

While running the stress test, run these in separate terminal windows:

$ top # Press "1" to see each CPU separately. Press "d" to change the update delay. Press "n" to change the number of tasks shown.

$ journalctl --no-hostname -k -f # This will show you any MCEs as they occur.
An update from today: I went into the BIOS and set "Optimal Defaults" for the board. After a clean reboot I set up a bash while loop to dump the CPU temperature to a file every five seconds as I ran a full 30-minute stress test using the GTKStressTesting tool. It looks like the temperature got up to 47.0 C at the end of the run (attached sensors-results.txt file). I was able to generate MCE errors, but they don't appear to be as frequent as before. Attaching a journalctl dump from the run. Again, during the whole process, the system remained stable and usable. Phil
Created attachment 1689223 [details] sensors reading of the cpu temp every 5 seconds during a 30 minute stress test.
Created attachment 1689232 [details] journalctl dump of mce/kernel log during 30 minute stress test.
Nice work on the stress test and logging. Here is a brief analysis of the log file.

There are 16 'Error Addr' records:

$ cat mce-errors-during-stress-test.txt | fgrep 'Error Addr' | wc -l
16

However, there are only 9 unique addresses. Also, only CPU:0 and CPU:12 have the errors.

$ cat mce-errors-during-stress-test.txt | fgrep '[Hardware Error]' | sed 's/^.*kernel: //' | sort -u
[Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
[Hardware Error]: Corrected error, no action required.
[Hardware Error]: CPU:0 (17:71:0) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000030151
[Hardware Error]: CPU:12 (17:71:0) MC1_STATUS[Over|CE|MiscV|AddrV|-|-|SyndV|-|-|-]: 0xdc20000000030151
[Hardware Error]: Error Addr: 0x00000000b1041ae0
[Hardware Error]: Error Addr: 0x00000000b1111ae0
[Hardware Error]: Error Addr: 0x00000000b1117ae0
[Hardware Error]: Error Addr: 0x00000000b111aae0
[Hardware Error]: Error Addr: 0x00000000b1135ae0
[Hardware Error]: Error Addr: 0x00000000b116bae0
[Hardware Error]: Error Addr: 0x00000000b163bae0
[Hardware Error]: Error Addr: 0x00000000b1679ae0
[Hardware Error]: Error Addr: 0x00000000b1a33ae0
[Hardware Error]: Instruction Fetch Unit Ext. Error Code: 3, IC Data Array Parity Error.
[Hardware Error]: IPID: 0x000100b000000000, Syndrome: 0x000000001a030507
mce: [Hardware Error]: Machine check events logged
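The same kind of summary can be scripted so it is easy to repeat after each stress run. This is only a sketch: the two function names are made up, and it assumes the "[Hardware Error]: CPU:N ..." and "Error Addr: 0x..." line formats seen in the logs in this report.

```shell
#!/bin/sh
# Sketch only: function names are hypothetical.

# Count MCE status records per logical CPU in a saved kernel log.
mce_per_cpu() {
    sed -n 's/.*\[Hardware Error]: CPU:\([0-9]*\) .*/\1/p' "$1" |
        sort -n | uniq -c | awk '{ printf "CPU %s: %s\n", $2, $1 }'
}

# Report how many "Error Addr" records there are and how many are distinct.
mce_addr_summary() {
    local total unique
    total=$(grep -c 'Error Addr: ' "$1")
    unique=$(sed -n 's/.*Error Addr: \(0x[0-9a-fA-F]*\).*/\1/p' "$1" |
        sort -u | wc -l | tr -d ' ')
    printf 'records=%s unique=%s\n' "$total" "$unique"
}

# Example:
#   mce_per_cpu mce-errors-during-stress-test.txt
#   mce_addr_summary mce-errors-during-stress-test.txt
```

Comparing the per-CPU counts between runs shows quickly whether the events stay confined to the same pair of logical CPUs.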
"... using the GTKStressTesting tool." I found that online, but there is no Fedora package, AFAICT. How did you install it?
Chapter 9 of this AMD document is on the "Machine Check Architecture" (MCA):

AMD64 Architecture Programmer’s Manual
Volume 2: System Programming
September 2012
https://developer.amd.com/wordpress/media/2012/10/24593_APM_v2.pdf

Figure 9-6 has the bit layout of the "MCi_STATUS Register", which can be used to interpret some of the "MC1_STATUS" bits in the log.

Here are a few cherry-picked quotes:

"The error-reporting registers retain their values through a warm reset." (p. 266)

"The i in each register name corresponds to the number of a supported register bank." (p. 270)

"Software clears the VAL bit after reading the contents of this [MCi_STATUS] register (after reading and saving valid information stored in any of the other logging registers) to indicate to hardware that it has saved the information, making the registers available to log the next error." (p. 270)
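As a rough illustration of reading those bits, here is a small sketch that decodes a status value like the one in the log. The positions of Val, Over, UC, En, MiscV, AddrV and PCC (bits 63 down to 57) follow the MCi_STATUS layout in the manual; SyndV at bit 53 and a 6-bit extended error code in bits 21:16 are assumptions based on how the Linux kernel decodes Family 17h banks, so check them against the PPR for your processor. The function name is made up.

```shell
#!/bin/sh
# Sketch only: decode selected MCi_STATUS fields from a 64-bit hex value.
decode_mci_status() {
    local hex hi lo flags pair name bit
    hex=${1#0x}
    # Left-pad to 16 hex digits.
    while [ "${#hex}" -lt 16 ]; do hex=0$hex; done
    # Split into 32-bit halves so shell integer overflow is not a concern.
    hi=$(( 0x$(printf '%s' "$hex" | cut -c1-8) ))
    lo=$(( 0x$(printf '%s' "$hex" | cut -c9-16) ))
    flags=
    # name:bit pairs; bit numbers refer to the full 64-bit register.
    # SyndV:53 is an assumption (see the note above).
    for pair in Val:63 Over:62 UC:61 En:60 MiscV:59 AddrV:58 PCC:57 SyndV:53; do
        name=${pair%%:*}
        bit=${pair##*:}
        if [ $(( (hi >> (bit - 32)) & 1 )) -eq 1 ]; then
            flags="$flags $name"
        fi
    done
    printf 'flags:%s\n' "$flags"
    # Extended error code (assumed bits 21:16) and base error code (bits 15:0).
    printf 'ext_error_code: %d\n' $(( (lo >> 16) & 0x3f ))
    printf 'error_code: 0x%04x\n' $(( lo & 0xffff ))
}

# Example:
#   decode_mci_status 0xdc20000000030151
```

For 0xdc20000000030151 this prints the flags Val, Over, En, MiscV, AddrV and SyndV, an extended error code of 3, and an error code of 0x0151. UC is clear, which is why the kernel labels the event "CE" (corrected error), and the extended code matches the "Ext. Error Code: 3" line in the log.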
> ... only CPU:0 and CPU:12 have the errors.

That seems a bit peculiar. According to the specs*, the AMD Ryzen 9 3900X has 24 threads, which appear to be numbered starting from 0:

0 ... 11
12 ... 23

What do you get for this:

$ grep processor /proc/cpuinfo | wc -l

* https://www.amd.com/en/products/cpu/amd-ryzen-9-3900x
(In reply to Steve from comment #12) > "... using the GTKStressTesting tool." > > I found that online, but there is no Fedora package, AFAICT. How did you > install it? I found it on Gnome Software. From what I can see the source of the install was Fedora RPM, but it's also available from Flathub as a flatpak. The package is: gst-0.7.2-1.fc32.noarch : System utility designed to stress and monitoring various hardware components Repo : updates Matched from: Filename : /usr/bin/gst Phil
(In reply to Steve from comment #14)
> > ... only CPU:0 and CPU:12 have the errors.
>
> That seems a bit peculiar. According to the specs*, the AMD Ryzen 9 3900X
> has 24 threads, which appear to be numbered starting from 0:
>
> 0 ... 11
> 12 ... 23
>
> What do you get for this:
>
> $ grep processor /proc/cpuinfo | wc -l
>
> * https://www.amd.com/en/products/cpu/amd-ryzen-9-3900x

Showing all 24 threads:

grep processor /proc/cpuinfo | wc -l
24

Which matches up with what the GTKStressTesting tool shows. It provides a graphical load display on 12 cores.

Phil
(In reply to Steve from comment #13) > Chapter 9 of this AMD document is on the "Machine Check Architecture" (MCA): > > AMD64 Architecture Programmer’s Manual > Volume 2: System Programming > September 2012 > https://developer.amd.com/wordpress/media/2012/10/24593_APM_v2.pdf > > Figure 9-6 has the bit layout of the "MCi_STATUS Register", which can be > used to interpret some of the "MC1_STATUS" bits in the log. > > Here are a few cherry-picked quotes: > > "The error-reporting registers retain their values through a warm reset." > (p. 266) > > "The i in each register name corresponds to the number of a supported > register bank." (p. 270) > > "Software clears the VAL bit after reading the contents of this [MCi_STATUS] > register (after reading and saving valid information stored in any of the > other logging registers) to indicate to hardware that it has saved the > information, making the registers available to log the next error." (p. 270) Maybe I should do a cold cycle or as Dell would call it a "Flea Power Release" of the system. Pull the plug, hold the power button for 30 seconds (I usually do a minute) to make sure the board, psu, etc... have completely de-powered and then boot back up with the same "Optimal Default" BIOS settings. The one thing I've noticed since I enabled the BIOS change is the board's "Q-Code" display has switched from "0C" to a steady "00". These codes are not very helpful, but from what I can tell in the Asus manual, "00" is "Not Used" and probably means no board codes now. The "0C" wasn't any help either, as it's marked as "Reserved for future AMI SEC error codes" in the manual. Phil
Thanks for your follow-up replies. I'm on F30, which doesn't have "gst" -- it appears to have been added in F31:

$ dnf -q repoquery gst --releasever=31
gst-0:0.7.2-1.fc31.noarch

This might show whether "processor" 0 and "processor" 12 are on the same core or not:

$ egrep 'processor|core id' /proc/cpuinfo
> ... whether "processor" 0 and "processor" 12 are on the same core or not:

If they are on the same core, they appear to share the L1 instruction cache (32 KB per core). Those details are in a table, "L1 Instruction Cache Identifiers", on page 72 here:

Processor Programming Reference (PPR) for AMD Family 17h Model 71h, Revision B0 Processors
https://developer.amd.com/wp-content/resources/56176_ppr_Family_17h_Model_71h_B0_pub_Rev_3.06.zip

For the record, your processor is:

May 16 13:16:41 kernel: smpboot: CPU0: AMD Ryzen 9 3900X 12-Core Processor (family: 0x17, model: 0x71, stepping: 0x0)
                                                                                      ^^           ^^

The log explicitly mentions "L1":

May 16 13:22:12 kernel: [Hardware Error]: cache level: L1, tx: INSN, mem-tx: IRD
                                                       ^^

More AMD documentation is here:

Developer Guides, Manuals & ISA Documents
https://developer.amd.com/resources/developer-guides-manuals/
(In reply to Steve from comment #18) > Thanks for your follow-up replies. I'm on F30, which doesn't have "gst" -- it appears to have been added in F31: I tried "gst" with a test install of F31 on a separate hard drive and am very impressed. I especially like the max/min reporting for temperatures and other data. (In reply to Steve from comment #19) > Those details are in a table, "L1 Instruction Cache Identifiers", on page 72 here: "gst" shows all the cache details in a nice table. Much more convenient than digging through the documentation ... :-)
As the log shows, the kernel boots from CPU0 and then brings up the other CPUs. If the BIOS settings allow it, it might be informative to disable CPU0/CPU12 and see if the problem occurs when another CPU is the boot CPU.

Also noteworthy is this from the log: "Max logical packages: 2". /proc/cpuinfo might shed some light on what that means.

There are also some kernel command-line options related to "SMP":

The kernel’s command-line parameters
https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html

== snippet from second attached log ==
May 16 13:16:41 kernel: smp: Bringing up secondary CPUs ...
May 16 13:16:41 kernel: x86: Booting SMP configuration:
May 16 13:16:41 kernel: .... node #0, CPUs: #1 #2 #3 #4 #5 #6 #7 #8 #9 #10 #11 #12 #13 #14 #15 #16 #17 #18 #19 #20 #21 #22 #23
May 16 13:16:41 kernel: smp: Brought up 1 node, 24 CPUs
May 16 13:16:41 kernel: smpboot: Max logical packages: 2
May 16 13:16:41 kernel: smpboot: Total of 24 processors activated (182060.20 BogoMIPS)
==
(In reply to Steve from comment #18)
> Thanks for your follow-up replies. I'm on F30, which doesn't have "gst" --
> it appears to have been added in F31:
>
> $ dnf -q repoquery gst --releasever=31
> gst-0:0.7.2-1.fc31.noarch
>
> This might show whether "processor" 0 and "processor" 12 are on the same
> core or not:
>
> $ egrep 'processor|core id' /proc/cpuinfo

So I did the egrep this morning and it does appear that "processor 0" and "processor 12" are indeed on the same core:

egrep 'processor|core id' /proc/cpuinfo
processor : 0
core id : 0
processor : 1
core id : 1
processor : 2
core id : 2
processor : 3
core id : 4
processor : 4
core id : 5
processor : 5
core id : 6
processor : 6
core id : 8
processor : 7
core id : 9
processor : 8
core id : 10
processor : 9
core id : 12
processor : 10
core id : 13
processor : 11
core id : 14
processor : 12
core id : 0
processor : 13
core id : 1
processor : 14
core id : 2
processor : 15
core id : 4
processor : 16
core id : 5
processor : 17
core id : 6
processor : 18
core id : 8
processor : 19
core id : 9
processor : 20
core id : 10
processor : 21
core id : 12
processor : 22
core id : 13
processor : 23
core id : 14

Could this mean a defective core?

Phil
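To make the pairing easier to read, the cpuinfo output can be grouped by core. The following is a sketch under one assumption: a single physical package, so that "core id" alone identifies a core (on a multi-socket system "physical id" would also have to be taken into account). The function name is made up.

```shell
#!/bin/sh
# Sketch only: list which logical processors share each core id.
# Reads /proc/cpuinfo by default, or a saved copy given as an argument.
siblings_by_core() {
    awk -F'[ \t]*:[ \t]*' '
        $1 == "processor" { cpu = $2 }
        $1 == "core id" {
            if (!($2 in list)) { order[++n] = $2; list[$2] = cpu }
            else               { list[$2] = list[$2] "," cpu }
        }
        END {
            for (i = 1; i <= n; i++)
                printf "core %s: %s\n", order[i], list[order[i]]
        }
    ' "${1:-/proc/cpuinfo}"
}

# Example:
#   siblings_by_core
#   siblings_by_core ryzen-cpuinfo-1.txt
```

With the output above, the first line would read "core 0: 0,12". The kernel also exposes the same pairing directly in /sys/devices/system/cpu/cpu0/topology/thread_siblings_list.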
The core IDs don't seem to be sequential. Is that what you saw in "gst"?

$ cat ryzen-cpuinfo-1.txt | grep core | sort -Vu | cat -n
     1  core id : 0
     2  core id : 1
     3  core id : 2
     4  core id : 4
     5  core id : 5
     6  core id : 6
     7  core id : 8
     8  core id : 9
     9  core id : 10
    10  core id : 12
    11  core id : 13
    12  core id : 14
> Could this mean a defective core? The only conclusion I would cautiously draw is that since CPU0 and CPU12 share their L1 caches, it could be that some of the MCEs are duplicates. A conjecture is that since the kernel boots from CPU0, there could be a problem with how the machine-check status registers are first read. And note the phrase "model-specific registers" in this sentence: "The AMD64 Machine-Check Architecture defines the set of model-specific registers (MCA MSRs) used to log and report hardware errors." (p. 265) The kernel could be incorrectly identifying the "model-specific registers" on your processor. AMD64 Architecture Programmer’s Manual Volume 2: System Programming September 2012 https://developer.amd.com/wordpress/media/2012/10/24593_APM_v2.pdf
(In reply to Steve from comment #23) > The core IDs don't seem to be sequential. There could be an explanation: The AMD Ryzen 9 3950X has 16 cores, so the 12-core model (3900X) could actually have 16 cores, but some cores are disabled. Or that could be an oversimplification: AMD Clarifies "Best Cores" vs "Preferred Cores" Discrepancies For Ryzen CPUs by Andrei Frumusanu November 21, 2019 https://www.anandtech.com/show/15137/amd-clarifies-best-cores-vs-preferred-cores
(In reply to Steve from comment #25)
> (In reply to Steve from comment #23)
> > The core IDs don't seem to be sequential.
>
> There could be an explanation:
>
> The AMD Ryzen 9 3950X has 16 cores, so the 12-core model (3900X) could
> actually have 16 cores, but some cores are disabled.
>
> Or that could be an oversimplification:
>
> AMD Clarifies "Best Cores" vs "Preferred Cores" Discrepancies For Ryzen CPUs
> by Andrei Frumusanu
> November 21, 2019
> https://www.anandtech.com/show/15137/amd-clarifies-best-cores-vs-preferred-cores

Hey Steve,

Just wanted to reach back out on this. Do we think this is a "non-issue" now? Or is this something I need to RMA the processor for? The system has still been very stable, with just the sporadic MCE messages.

Phil
An upstream search for "mce"* found a request that the kernel ignore certain correctable errors on Intel processors:

Bug_206587 - x86/mce: Do not log spurious corrected mce errors
https://bugzilla.kernel.org/show_bug.cgi?id=206587

And the request cites various Intel errata.

However, I couldn't find any AMD errata for your specific processor.

So there are too many possibilities to say what is going on.

I would suggest:

1. Updating the bug summary to say something like this:

"correctable MCE errors logged for CPU0/CPU12 L1 instruction cache with AMD Ryzen 9 3900X 12-Core Processor"

2. Opening a bug upstream under "Platform Specific/Hardware", "x86-64":
https://bugzilla.kernel.org/buglist.cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&component=x86-64&product=Platform%20Specific%2FHardware&query_format=advanced

3. Checking the AMD web site for errata. This is a search for "revision guide 17h":
https://www.amd.com/en/support/tech-docs?keyword=revision+guide+17h

* https://bugzilla.kernel.org/buglist.cgi?quicksearch=mce
FYI, Linus is "now rocking an AMD Threadripper 3970x": From Linus Torvalds <> Date Sun, 24 May 2020 16:00:50 -0700 Subject Linux 5.7-rc7 https://lkml.org/lkml/2020/5/24/407 That processor has the same AMD Zen 2 architecture as yours: https://en.wikipedia.org/wiki/Zen_2#Desktop_processors So if there are any kernel problems related to MCEs, they will receive all due attention. :-)
Made changes to Subject and opened an up-stream bug report per the recommendations: https://bugzilla.kernel.org/show_bug.cgi?id=207907 Guess I'll wait and see if they can provide an answer to this. (In reply to Steve from comment #27) > An upstream search for "mce"* found a request that the kernel ignore certain > correctable errors on Intel processors: > > Bug_206587 - x86/mce: Do not log spurious corrected mce errors > https://bugzilla.kernel.org/show_bug.cgi?id=206587 > > And the request cites various Intel errata. > > However, I couldn't find any AMD errata for your specific processor. > > So there are too many possibilities to say what is going on. > > I would suggest: > > 1. Updating the bug summary to say something like this: > > "correctable MCE errors logged for CPU0/CPU12 L1 instruction cache with AMD > Ryzen 9 3900X 12-Core Processor" > > 2. Opening a bug upstream under "Platform Specific/Hardware", "x86-64": > https://bugzilla.kernel.org/buglist. > cgi?bug_status=NEW&bug_status=ASSIGNED&bug_status=REOPENED&component=x86- > 64&product=Platform%20Specific%2FHardware&query_format=advanced > > 3. Checking the AMD web site for errata. This is a search for "revision > guide 17h": > https://www.amd.com/en/support/tech-docs?keyword=revision+guide+17h > > * https://bugzilla.kernel.org/buglist.cgi?quicksearch=mce
(In reply to Phil Hale from comment #29) > Made changes to Subject and opened an up-stream bug report per the recommendations: > > https://bugzilla.kernel.org/show_bug.cgi?id=207907 Thanks. Could you add a link in the "Links" section near the top of the BZ page, so that the upstream bug report is easy to find? > Guess I'll wait and see if they can provide an answer to this.
Thanks for adding the link to your upstream bug report. And thanks for attaching the logs for your stress tests to it. It might be a good idea to document the stress test you are using: $ rpm -q gst https://gitlab.com/leinardi/gst
(In reply to Steve from comment #31) > Thanks for adding the link to your upstream bug report. And thanks for > attaching the logs for your stress tests to it. > > It might be a good idea to document the stress test you are using: > > $ rpm -q gst > > https://gitlab.com/leinardi/gst Done.
Borislav appears to be referring to this command:*

$ dnf -Cq repoquery --whatprovides /usr/\*bin/cpupower
kernel-tools-0:5.3.7-300.fc31.x86_64
kernel-tools-0:5.6.7-200.fc31.x86_64

* https://bugzilla.kernel.org/show_bug.cgi?id=207907#c5
(In reply to Steve from comment #33) > Borislav appears to be referring to this command:* > > $ dnf -Cq repoquery --whatprovides /usr/\*bin/cpupower > kernel-tools-0:5.3.7-300.fc31.x86_64 > kernel-tools-0:5.6.7-200.fc31.x86_64 > > * https://bugzilla.kernel.org/show_bug.cgi?id=207907#c5 Yep, I did additional testing adjusting the two items he requested, separately and together. Still had the MCEs logged. In fact doing just one setting, then the two together generated the same number of MCEs in a 30 minute stress test. I found some other bugs on the kernel bugzilla that appear to be the same issue back in the 4.10 tree, but it doesn't appear to have been resolved.
(In reply to Phil Hale from comment #34) ... > Yep, > > I did additional testing adjusting the two items he requested, separately > and together. Still had the MCEs logged. In fact doing just one setting, > then the two together generated the same number of MCEs in a 30 minute > stress test. I found some other bugs on the kernel bugzilla that appear to > be the same issue back in the 4.10 tree, but it doesn't appear to have been > resolved. Thanks for your update. None of that conclusively points to a processor bug. However, AMD has been known to replace faulty processors: "He got a replacement from AMD [that fixed a problem with segfaults while under load]." That is from the section headed "Background from original developer (Suaefar)" here:* https://github.com/Oxalin/ryzen-test That has a link to this thread on the AMD web site: gcc segmentation faults on Ryzen / Linux May 8, 2017; Latest reply on Jul 10, 2019 https://community.amd.com/thread/215773 * Linked from unrelated Bug 1840969.
So I got a new power supply in today. It was the only component I had not switched out; the old one was a 10+ year old 750 Watt Ultra. The new one is a Seasonic PX-1000. On boot up, I didn't see any MCE errors and thought, "Ah ha!" Then about 5 minutes in I got one. So off to do a new stress test. Same 30-minute stress test as before, while following the logs with the following command: "journalctl --no-hostname -k -f > 20200605-journal-during-30min-stress-test.txt". From a previous test with the old power supply, same stress test settings, I got a total of 9 MCE events. Tonight I got 6. So, better? Not conclusive, but fewer. I'm still leaning towards some sort of issue with the kernel. I've run out of hardware to replace, except to RMA the CPU. I'm not really wanting to do this as the system remains rock-solid, running for days, doing lots of heavy-duty sysadmin work... I'll add this to the kernel.org bug, but I don't expect much to come of it.
Thanks for your status update. FYI, the 5.8 merge window is open. Look for a 5.8-rc1 release here: https://www.kernel.org/ # mainline kernels are usually released on Sunday afternoons.* And a Fedora build here: https://koji.fedoraproject.org/koji/packageinfo?packageID=8 * However, the merge window appears to be 2 weeks, so the 5.8-rc1 release would be two weeks after the 5.7 release on 2020-05-31 -- which would be 2020-06-14: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tag/?h=v5.7
Sorry to revive this issue, but I have the very same one, also on a 3900X and I wondered if you found a way to solve it? By the way this is not kernel-related (or at least, probably not) because I have the same error on Windows, where it gets logged in the System event log as "WHEA Event ID 19". Mine occurs on core 6, which is also 18, so it logs 2 events as well. Tried a million things with BIOS and memory configs, nothing seems to help. On Windows it's very apparent the issue only happens when the CPU is idle. Although a very small pause in a workload can make it appear. Just like you, it's got no apparent effect on stability or performance, unless I'm missing something. Would really like to hear what you ended up doing, thanks a lot!
Hello demo, I'm still seeing the issue. I've applied the latest BIOS updates for my board and have been through several rounds of kernel updates running on Fedora 32. I've also tried all sorts of different BIOS settings, but none make a difference. I tend to leave my system running for weeks at a time, and it remains perfectly stable other than the MCE log entries. I'm guessing this is just something on the processors. I had filed a support ticket with AMD and they informed me that official support was only available for Windows 10. If you're seeing the same issue with Windows, maybe you could open a ticket on that with them directly. If you do come across something, let me know! Phil
This message is a reminder that Fedora 32 is nearing its end of life. Fedora will stop maintaining and issuing updates for Fedora 32 on 2021-05-25. It is Fedora's policy to close all bug reports from releases that are no longer maintained. At that time this bug will be closed as EOL if it remains open with a Fedora 'version' of '32'.

Package Maintainer: If you wish for this bug to remain open because you plan to fix it in a currently maintained version, simply change the 'version' to a later Fedora version.

Thank you for reporting this issue and we are sorry that we were not able to fix it before Fedora 32 is end of life. If you would still like to see this bug fixed and are able to reproduce it against a later version of Fedora, you are encouraged to change the 'version' to a later Fedora version before this bug is closed as described in the policy above.

Although we aim to fix as many bugs as possible during every release's lifetime, sometimes those efforts are overtaken by events. Often a more recent Fedora release includes newer upstream software that fixes bugs or makes them obsolete.
Just an update on this issue. I've replaced the CPU with a Ryzen 9 5950X and the issue is no longer occurring. I think I can now identify this as an issue with the CPU and not the kernel/OS.