Bug 1450769
Summary: | AMD Ryzen 7 soft lockup on md-udevd | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | matti aarnio <matti.aarnio> |
Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
Status: | CLOSED INSUFFICIENT_DATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | urgent | Docs Contact: | |
Priority: | unspecified | ||
Version: | 27 | CC: | bjorn, cks-rhbugzilla, gansalmon, ichavero, itamar, jonathan, kernel-maint, madhu.chinakonda, matti.aarnio, mchehab |
Target Milestone: | --- | Flags: | jforbes: needinfo? |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | |
Clone Of: | Environment: | ||
Last Closed: | 2018-08-29 15:11:04 UTC | Type: | Bug |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | Category: | --- | |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: | |||
Attachments: |
Created attachment 1278824 [details]
ryzen7 boot dmesg log
This system was originally installed as Fedora back in 2012, and has been updated since then. Previously this was AMD Phenom-II for several years and very stable.
This seems similar to what I'm seeing.
Symptoms: About a day or two after boot the computer appears to lock up. So far it has always happened while I was away and the console was locked. When I return and turn on the monitor the system does not respond to keypresses or mouse movements. The screen remains black. In some cases the system has responded to ping but not to SSH. In other cases it hasn't responded even to ping.
To get more data I've configured a serial console and attached a null-modem cable. The messages I get there vary. In two cases so far there have been "soft lockup" messages like in this bug report, though not in md-udevd but other processes.
This may or may not be related to bug 1507173, which seems similar to other cases I've seen.
Linux 4.13.11-300.fc27.x86_64
Motherboard: Asus Prime X370-pro
Processor: AMD Ryzen
Memory: Kingston KVR24E17D8/16MA with ECC support
Graphics card: AMD/XFX Radeon RX 460
Created attachment 1355979 [details]
kernel log before hang, with many "soft lockup" messages
This log was captured from the serial console. The beginning was lost from the scrollback buffer.
Created attachment 1355990 [details]
kernel log from boot to hang, with "soft lockup" messages
This is everything that Linux wrote to the serial console until it hung. In this case the system responded to ping, but not to anything else.
Created attachment 1359009 [details]
kernel log from boot to hang, with many "soft lockup" messages
Here's a log from Linux 4.13.13-300.fc27.x86_64 with repeated messages about kworker being stuck on one core followed by a hang.
Created attachment 1362105 [details]
kernel log from boot to hang, with many "soft lockup" messages
Another case of one stuck core, this time with Linux 4.13.15-300.fc27.x86_64.
Created attachment 1362107 [details]
compressed kernel log from boot to reset, with many many "soft lockup" messages
This case started out with one stuck core, increasing to seven stuck cores (actually threads) before I pressed the reset button. The computer did not respond to ping or keypresses, but continued printing "soft lockup" messages until I reset it. The log from the serial console grew so large that I had to compress it.
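For anyone comparing the serial console logs above, a rough way to count which CPUs are reported stuck is sketched below. It assumes the watchdog message looks like `watchdog: BUG: soft lockup - CPU#N stuck for Ns! [task:pid]`; the function name and the sample lines are made up for illustration and are not copied from the attachments.

```python
import re
from collections import defaultdict

# Assumed message format (the "watchdog: " prefix may differ between kernels):
#   watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kworker/3:1:1234]
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>.+):(?P<pid>\d+)\]"
)


def summarize_soft_lockups(lines):
    """Return {cpu: {"count": n, "max_secs": s, "tasks": set of task names}}."""
    summary = defaultdict(lambda: {"count": 0, "max_secs": 0, "tasks": set()})
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match is None:
            continue  # not a soft lockup report
        entry = summary[int(match.group("cpu"))]
        entry["count"] += 1
        entry["max_secs"] = max(entry["max_secs"], int(match.group("secs")))
        entry["tasks"].add(match.group("comm"))
    return dict(summary)


# Hypothetical sample lines, NOT taken from the attached logs.
SAMPLE = [
    "[ 1234.567890] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [kworker/3:1:1234]",
    "[ 1262.567890] watchdog: BUG: soft lockup - CPU#3 stuck for 48s! [kworker/3:1:1234]",
    "[ 1262.600000] watchdog: BUG: soft lockup - CPU#7 stuck for 23s! [systemd-udevd:567]",
    "[ 1263.000000] usb 1-1: new high-speed USB device",
]

if __name__ == "__main__":
    for cpu, info in sorted(summarize_soft_lockups(SAMPLE).items()):
        print(cpu, info["count"], info["max_secs"], sorted(info["tasks"]))
```

To run it against a captured log, pass the open file instead of `SAMPLE`, for example `summarize_soft_lockups(open("console.log", errors="replace"))`, where `console.log` is a placeholder file name. The number of keys in the result is the number of distinct stuck logical CPUs.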
I've also experienced this issue under Fedora 27 on an ASUS Prime X370-PRO motherboard with a Ryzen 1800X CPU (not overclocked and with Kingston server RAM and a modern high-quality PSU). I captured some kernel logs through netconsole and, as with previous comments, they show the stuck CPUs hanging in smp_call_function_many and smp_call_function_single (generally from TLB flushing functions).
There is at least one upstream kernel bug for this, https://bugzilla.kernel.org/show_bug.cgi?id=196683
My machine has been stable so far when booted with additional kernel arguments of 'rcu_nocbs=0-15 processor.max_cstate=5' (the common workaround uses 'processor.max_cstate=1', which is probably slightly safer).
Thanks for the link, Chris. The rcu_nocbs workaround improves my uptime dramatically. It seems to prevent both these soft lockups and the kernel panics of bug 1507173.
We apologize for the inconvenience. There is a large number of bugs to go through and several of them have gone stale. As kernel maintainers, we try to keep up with bugzilla but due to the rate at which the upstream kernel project moves, bugs may be fixed without any indication to us. Due to this, we are doing a mass bug update across all of the Fedora 27 kernel bugs.
Fedora 27 has now been rebased to 4.15.3-300.fc27. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
If you experience different issues, please open a new bug report for those.
Soft lockup happened 2.5 hours after I rebooted into kernel-4.15.3-300.fc27.x86_64 without rcu_nocbs.
*********** MASS BUG UPDATE **************
We apologize for the inconvenience. There are a large number of bugs to go through and several of them have gone stale. Due to this, we are doing a mass bug update across all of the Fedora 27 kernel bugs.
Fedora 27 has now been rebased to 4.17.7-100.fc27. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
If you have moved on to Fedora 28, and are still experiencing this issue, please change the version to Fedora 28.
If you experience different issues, please open a new bug report for those.
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 5 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously. |
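When retesting a new kernel with or without the workaround discussed in the comments above, it helps to confirm what the running kernel was actually booted with. The following is a minimal sketch that looks for the `rcu_nocbs` and `processor.max_cstate` arguments in a kernel command line string. The helper names and the sample command line are hypothetical, and the whitespace split is a simplification that ignores quoted values.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: value}; bare flags map to None.

    Simplification: splits on whitespace, so quoted values containing
    spaces are not handled.
    """
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        params[name] = value if sep else None
    return params


def workaround_status(cmdline):
    """Report the two workaround arguments mentioned in this bug, if present."""
    params = parse_cmdline(cmdline)
    cstate = params.get("processor.max_cstate")
    return {
        "rcu_nocbs": params.get("rcu_nocbs"),
        "processor.max_cstate": int(cstate) if cstate and cstate.isdigit() else None,
    }


def read_running_cmdline(path="/proc/cmdline"):
    """Read the command line of the currently running kernel (Linux only)."""
    with open(path) as handle:
        return handle.read().strip()


# Hypothetical command line for illustration only.
SAMPLE_CMDLINE = (
    "BOOT_IMAGE=/vmlinuz-4.15.3-300.fc27.x86_64 ro quiet "
    "rcu_nocbs=0-15 processor.max_cstate=5"
)

if __name__ == "__main__":
    print(workaround_status(SAMPLE_CMDLINE))
```

On a live system, `workaround_status(read_running_cmdline())` reports the values for the current boot. A result of `None` for both keys means the kernel is running without the workaround, which is the case worth noting when reporting whether a lockup still occurs.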
Created attachment 1278822 [details]
Kernel messages on first trouble indication
Description of problem: Workstation gets stuck on md-udevd
Version-Release number of selected component (if applicable): kernel-4.11.0-0.rc8.git0.1.fc26.x86_64
How reproducible: At random on idle machine every few days.
Machine has new AMD Ryzen 7 CPU on new ASUS AMD Ryzen 7 1700X Eight-Core Processor (family: 0x17, model: 0x1, stepping: 0x1)