Bug 1450769

Summary: AMD Ryzen 7 soft lockup on md-udevd
Product: [Fedora] Fedora Reporter: matti aarnio <matti.aarnio>
Component: kernel Assignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED INSUFFICIENT_DATA QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: 27 CC: bjorn, cks-rhbugzilla, gansalmon, ichavero, itamar, jonathan, kernel-maint, madhu.chinakonda, matti.aarnio, mchehab
Target Milestone: --- Flags: jforbes: needinfo?
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-08-29 15:11:04 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
Kernel messages on first trouble indication
none
ryzen7 boot dmesg log
none
kernel log before hang, with many "soft lockup" messages
none
kernel log from boot to hang, with "soft lockup" messages
none
kernel log from boot to hang, with many "soft lockup" messages
none
kernel log from boot to hang, with many "soft lockup" messages
none
compressed kernel log from boot to reset, with many many "soft lockup" messages
none

Description matti aarnio 2017-05-15 07:29:26 UTC
Created attachment 1278822 [details]
Kernel messages on first trouble indication

Description of problem:
  Workstation gets stuck on md-udevd

Version-Release number of selected component (if applicable):
  kernel-4.11.0-0.rc8.git0.1.fc26.x86_64

How reproducible:
  At random on idle machine every few days.


Machine has new AMD Ryzen 7 CPU on new ASUS 
  AMD Ryzen 7 1700X Eight-Core Processor (family: 0x17, model: 0x1, stepping: 0x1)

Comment 1 matti aarnio 2017-05-15 07:32:26 UTC
Created attachment 1278824 [details]
ryzen7 boot dmesg log

Comment 2 matti aarnio 2017-05-15 09:17:58 UTC
This system was originally installed as Fedora back in 2012 and has been updated since then.  Previously it ran an AMD Phenom II for several years and was very stable.

Comment 3 Björn Persson 2017-11-20 17:23:34 UTC
This seems similar to what I'm seeing.

Symptoms:
About a day or two after boot the computer appears to lock up. So far it has always happened while I was away and the console was locked. When I return and turn on the monitor the system does not respond to keypresses or mouse movements. The screen remains black. In some cases the system has responded to ping but not to SSH. In other cases it hasn't responded even to ping.

To get more data I've configured a serial console and attached a null-modem cable. The messages I get there vary. In two cases so far there have been "soft lockup" messages like in this bug report, though not in md-udevd but other processes.

This may or may not be related to bug 1507173, which seems similar to other cases I've seen.

Linux 4.13.11-300.fc27.x86_64

Motherboard: Asus Prime X370-pro
Processor: AMD Ryzen
Memory: Kingston KVR24E17D8/16MA with ECC support
Graphics card: AMD/XFX Radeon RX 460

Comment 4 Björn Persson 2017-11-20 17:27:45 UTC
Created attachment 1355979 [details]
kernel log before hang, with many "soft lockup" messages

This log was captured from the serial console. The beginning was lost from the scrollback buffer.

Comment 5 Björn Persson 2017-11-20 17:33:49 UTC
Created attachment 1355990 [details]
kernel log from boot to hang, with "soft lockup" messages

This is everything that Linux wrote to the serial console until it hung. In this case the system responded to ping, but not to anything else.

Comment 6 Björn Persson 2017-11-25 17:55:17 UTC
Created attachment 1359009 [details]
kernel log from boot to hang, with many "soft lockup" messages

Here's a log from Linux 4.13.13-300.fc27.x86_64 with repeated messages about kworker being stuck on one core followed by a hang.

Comment 7 Björn Persson 2017-12-02 21:13:23 UTC
Created attachment 1362105 [details]
kernel log from boot to hang, with many "soft lockup" messages

Another case of one stuck core, this time with Linux 4.13.15-300.fc27.x86_64.

Comment 8 Björn Persson 2017-12-02 21:31:40 UTC
Created attachment 1362107 [details]
compressed kernel log from boot to reset, with many many "soft lockup" messages

This case started out with one stuck core, increasing to seven stuck cores (actually threads) before I pressed the reset button. The computer did not respond to ping or keypresses, but continued printing "soft lockup" messages until I reset it. The log from the serial console grew so large that I had to compress it.
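
For anyone sifting through a large serial-console or netconsole capture like this one, the following is a minimal Python sketch that tallies which CPUs reported soft lockups and for how long. It assumes the usual watchdog message shape ("soft lockup - CPU#N stuck for Ns! [comm:pid]"); the function name and the sample lines are hypothetical and are not taken from the attachments.

```python
import re
from collections import defaultdict

# Assumed message shape, e.g.
# "watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kworker/3:1:1234]"
SOFT_LOCKUP = re.compile(
    r"soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)

def summarize_soft_lockups(lines):
    """Return {cpu: [(seconds_stuck, comm), ...]} for each soft lockup line."""
    stuck = defaultdict(list)
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            stuck[int(m["cpu"])].append((int(m["secs"]), m["comm"]))
    return dict(stuck)

# Illustrative input only; these lines are made up for the example.
sample = [
    "[ 1234.567890] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kworker/3:1:1234]",
    "[ 1262.567890] watchdog: BUG: soft lockup - CPU#3 stuck for 48s! [kworker/3:1:1234]",
    "[ 1290.000000] watchdog: BUG: soft lockup - CPU#9 stuck for 23s! [kworker/9:0:456]",
    "[ 1291.000000] some unrelated message",
]
summary = summarize_soft_lockups(sample)
print("stuck CPUs:", sorted(summary))
```

On a real capture, the lines of the (decompressed) log file can be passed in place of `sample`; the number of keys in the result gives the count of distinct stuck CPUs (hardware threads).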

Comment 9 Chris Siebenmann 2018-01-26 20:37:29 UTC
I've also experienced this issue under Fedora 27 on an ASUS Prime X370-PRO
motherboard with a Ryzen 1800X CPU (not overclocked and with Kingston server
RAM and a modern high-quality PSU). I captured some kernel logs through
netconsole and, as with previous comments, they show the stuck CPUs hanging
in smp_call_function_many and smp_call_function_single (generally from TLB
flushing functions).

There is at least one upstream kernel bug for this,
https://bugzilla.kernel.org/show_bug.cgi?id=196683

My machine has been stable so far when booted with additional kernel
arguments of 'rcu_nocbs=0-15 processor.max_cstate=5' (the common
workaround uses 'processor.max_cstate=1', which is probably slightly
safer).
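
To confirm after a reboot that the workaround parameters actually reached the kernel, the command line can be inspected. The following is a minimal Python sketch under stated assumptions: the helper names are hypothetical, the sample command line (including its BOOT_IMAGE and root values) is made up, and the parser ignores quoted arguments containing spaces. On a live system the text would come from /proc/cmdline.

```python
def parse_cmdline(text):
    """Split a kernel command line into {name: value}; bare flags map to None."""
    params = {}
    for token in text.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params

def expand_cpu_list(spec):
    """Expand a CPU list such as '0-3,8' into a sorted list of CPU numbers."""
    cpus = set()
    for part in spec.split(","):
        lo, sep, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi if sep else lo) + 1))
    return sorted(cpus)

# Illustrative command line; only the last two parameters come from this report.
sample = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro quiet rcu_nocbs=0-15 processor.max_cstate=5"
params = parse_cmdline(sample)
offloaded = expand_cpu_list(params["rcu_nocbs"])
print(len(offloaded), "CPUs listed in rcu_nocbs;",
      "max_cstate =", params.get("processor.max_cstate"))
```

The range 0-15 covers all 16 hardware threads of an eight-core Ryzen with SMT enabled; a CPU with a different thread count would need a different range.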

Comment 10 Björn Persson 2018-02-07 16:58:02 UTC
Thanks for the link, Chris. The rcu_nocbs workaround improves my uptime dramatically. It seems to prevent both these soft lockups and the kernel panics of bug 1507173.

Comment 11 Laura Abbott 2018-02-20 19:48:14 UTC
We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  As kernel maintainers, we try to keep up with bugzilla but due to the rate at which the upstream kernel project moves, bugs may be fixed without any indication to us. Due to this, we are doing a mass bug update across all of the Fedora 27 kernel bugs.
 
Fedora 27 has now been rebased to 4.15.3-300.fc27.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you experience different issues, please open a new bug report for those.

Comment 12 Björn Persson 2018-02-21 12:11:04 UTC
Soft lockup happened 2.5 hours after I rebooted kernel-4.15.3-300.fc27.x86_64 without rcu_nocbs.

Comment 13 Justin M. Forbes 2018-07-23 15:16:33 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 27 kernel bugs.

Fedora 27 has now been rebased to 4.17.7-100.fc27.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 28, and are still experiencing this issue, please change the version to Fedora 28.

If you experience different issues, please open a new bug report for those.

Comment 14 Justin M. Forbes 2018-08-29 15:11:04 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 5 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.