1502095 – Kernels 4.13.4-200 and 4.13.5-200 are no good on my AMD Ryzen (previous ones were OK)

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	26
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	urgent
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2017-10-14 06:51 UTC by Bruno Antunes
Modified:	2018-02-28 11:17 UTC
CC List:	24 users
Fixed In Version:
Doc Text:
Clone Of:
Environment:
Last Closed:	2018-02-28 11:17:46 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
Full dmesg output (113.53 KB, text/plain) 2017-11-11 17:36 UTC, Bruno Antunes

Description Bruno Antunes 2017-10-14 06:51:17 UTC

Description of problem:

Spontaneous reset a few minutes after boot, "machine check events logged" in Problem Reporting tool


Version-Release number of selected component (if applicable):

kernel-4.13.5-200.fc26.x86_64 and kernel-4.13.4-200.fc26.x86_64


How reproducible:

Always

Steps to Reproduce:
1. Boot
2. Login to GNOME
3. Wait for 5-15 minutes
4. PC reboots without warning

Actual results:

mce: [Hardware Error]: Machine check events logged


Expected results:

No spontaneous reset.


Additional info:

I'm on an AMD Ryzen 1700 processor. The Problem Reporting tool is hard to get info from, but I can attach whatever fields from the kerneloops you find interesting. I see that mcelog doesn't support AMD processors, or at least mine, so there doesn't seem to be a lot of info.
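Since mcelog is not usable here, one option is to pull the machine check lines straight out of a saved kernel log. The following is only a minimal sketch, assuming the log has been saved as plain text (for example from dmesg); the function name `hardware_error_lines` is hypothetical and not part of any existing tool.

```python
import re

# Matches the kernel's "[Hardware Error]" marker, with or without a dmesg timestamp.
HW_ERROR = re.compile(
    r"^(?:\[\s*\d+\.\d+\]\s*)?(?P<source>\w+): \[Hardware Error\]: (?P<message>.*)$"
)


def hardware_error_lines(log_text):
    """Return (source, message) pairs for each "[Hardware Error]" line (hypothetical helper)."""
    found = []
    for line in log_text.splitlines():
        match = HW_ERROR.match(line.strip())
        if match:
            found.append((match.group("source"), match.group("message")))
    return found


if __name__ == "__main__":
    # Sample line copied from the "Actual results" section of this report.
    sample = "mce: [Hardware Error]: Machine check events logged\n"
    for source, message in hardware_error_lines(sample):
        print(f"{source}: {message}")
```

This only collects what the kernel already printed; it does not decode the machine check banks themselves.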

This didn't happen on 4.12 and earlier kernels. Those only showed severe lag during heavy file copying, which I suspect is related to AMD's SMT (its equivalent of hyperthreading), as I've seen some articles mentioning it as a problem.

Thank you for your time.

Comment 1 Jeremy Cline 2017-10-16 18:33:32 UTC

Hello,

Thanks for the bug report. Can you reproduce this with the latest kernel-debug and attach the oops message?

Comment 2 Bruno Antunes 2017-10-16 21:52:20 UTC

I've installed the debug kernel for 4.13.5-200, and it hasn't rebooted yet (been up for around an hour). Not sure what's going on. I've even installed the "stress" package, but it doesn't seem to be rebooting now.

Sorry if I've wasted your time.

Comment 3 Bruno Antunes 2017-10-16 21:55:39 UTC

I've uploaded what I see in Problem Reporting. If it's useful, let me know and I'll copy all of the info in:

https://imgur.com/KK4lNBu
https://imgur.com/YAxXStU

Comment 4 Bruno Antunes 2017-11-11 17:35:55 UTC

The reboots are happening again on 4.13.11-200, and I've reinstalled kernel-debug (I had removed it for lack of space on /boot).

I'm also seeing errors in my boot log. Here's the relevant dmesg output:

...
[    0.989610] microcode: CPU0: patch_level=0x08001129
[    0.989629] microcode: CPU1: patch_level=0x08001129
[    0.989645] microcode: CPU2: patch_level=0x08001129
[    0.989659] microcode: CPU3: patch_level=0x08001129
[    0.989683] microcode: CPU4: patch_level=0x08001129
[    0.989694] microcode: CPU5: patch_level=0x08001129
[    0.989723] microcode: CPU6: patch_level=0x08001129
[    0.989740] microcode: CPU7: patch_level=0x08001129
[    0.989754] microcode: CPU8: patch_level=0x08001129
[    0.989766] microcode: CPU9: patch_level=0x08001129
[    0.989778] microcode: CPU10: patch_level=0x08001129
[    0.989791] microcode: CPU11: patch_level=0x08001129
[    0.989816] microcode: CPU12: patch_level=0x08001129
[    0.989828] microcode: CPU13: patch_level=0x08001129
[    0.989843] microcode: CPU14: patch_level=0x08001129
[    0.989854] microcode: CPU15: patch_level=0x08001129
[    0.989956] microcode: Microcode Update Driver: v2.2.
[    0.989977] AVX2 version of gcm_enc/dec engaged.
[    0.989979] AES CTR mode by8 optimization enabled
[    1.006842] sched_clock: Marking stable (1006817073, 0)->(1373205207, -366388134)
[    1.007490] registered taskstats version 1
[    1.007511] Loading compiled-in X.509 certificates
[    1.037902] Loaded X.509 cert 'Fedora kernel signing key: f43d7ced727e60b9e0617c90e8cd95fa0fbcc263'
[    1.039581] Couldn't get size: 0x800000000000000e
[    1.039614] MODSIGN: Couldn't get UEFI db list
[    1.041307] Loaded UEFI:MokListRT cert 'Fedora Secure Boot CA: fde32599c2d61db1bf5807335d7b20e4cd963b42' linked to secondary sys keyring
[    1.042034] Couldn't get size: 0x800000000000000e
[    1.042056] MODSIGN: Couldn't get UEFI dbx list
[    1.042187] zswap: loaded using pool lzo/zbud
[    1.048086] Key type big_key registered
[    1.050797] Key type encrypted registered
[    1.051392]   Magic number: 9:36:441
[    1.051571] rtc_cmos 00:02: setting system clock to 2017-11-11 17:24:07 UTC (1510421047)
[    1.051693] PM: Hibernation image not present or could not be loaded.
...
[    3.834559] BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:52
[    3.834617] in_atomic(): 1, irqs_disabled(): 1, pid: 463, name: systemd-udevd
[    3.834666] 3 locks held by systemd-udevd/463:
[    3.834668]  #0:  (&dev->mutex){......}, at: [<ffffffffb5644ae9>] __driver_attach+0x49/0xf0
[    3.834681]  #1:  (&dev->mutex){......}, at: [<ffffffffb5644af7>] __driver_attach+0x57/0xf0
[    3.834690]  #2:  (ccp_debugfs_lock){......}, at: [<ffffffffc02b43bb>] ccp5_debugfs_setup+0x5b/0x1a0 [ccp]
[    3.834701] irq event stamp: 15870
[    3.834705] hardirqs last  enabled at (15869): [<ffffffffb59ad4c6>] _raw_spin_unlock_irqrestore+0x36/0x60
[    3.834707] hardirqs last disabled at (15870): [<ffffffffb59adc17>] _raw_write_lock_irqsave+0x27/0x88
[    3.834711] softirqs last  enabled at (15854): [<ffffffffb5812c44>] peernet2id+0x54/0x80
[    3.834713] softirqs last disabled at (15852): [<ffffffffb5812c26>] peernet2id+0x36/0x80
[    3.834716] CPU: 1 PID: 463 Comm: systemd-udevd Not tainted 4.13.11-200.fc26.x86_64+debug #1
[    3.834718] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AB350 Pro4, BIOS P3.20 09/05/2017
[    3.834719] Call Trace:
[    3.834724]  dump_stack+0x8e/0xd6
[    3.834730]  ___might_sleep+0x164/0x250
[    3.834734]  __might_sleep+0x4a/0x80
[    3.834739]  down_write+0x32/0xc0
[    3.834745]  start_creating+0x5f/0x110
[    3.834750]  debugfs_create_dir+0x13/0x110
[    3.834758]  ccp5_debugfs_setup+0x18c/0x1a0 [ccp]
[    3.834762]  ? ccp_dmaengine_register+0x32a/0x3c0 [ccp]
[    3.834770]  ccp5_init+0xa06/0xa10 [ccp]
[    3.834783]  ccp_pci_probe+0x260/0x420 [ccp]
[    3.834789]  local_pci_probe+0x42/0xa0
[    3.834794]  pci_device_probe+0x18d/0x1a0
[    3.834802]  driver_probe_device+0x2ff/0x450
[    3.834809]  __driver_attach+0xa8/0xf0
[    3.834819]  ? driver_probe_device+0x450/0x450
[    3.834822]  bus_for_each_dev+0x75/0xc0
[    3.834829]  driver_attach+0x1e/0x20
[    3.834833]  bus_add_driver+0x1ca/0x270
[    3.834837]  ? 0xffffffffc02bd000
[    3.834843]  driver_register+0x60/0xe0
[    3.834847]  ? 0xffffffffc02bd000
[    3.834850]  __pci_register_driver+0x60/0x70
[    3.834856]  ccp_pci_init+0x23/0x30 [ccp]
[    3.834862]  ccp_mod_init+0x9/0x1000 [ccp]
[    3.834866]  do_one_initcall+0x50/0x192
[    3.834872]  ? rcu_read_lock_sched_held+0x79/0x80
[    3.834878]  ? kmem_cache_alloc_trace+0x273/0x2e0
[    3.834881]  ? do_init_module+0x27/0x1eb
[    3.834890]  do_init_module+0x5f/0x1eb
[    3.834895]  load_module+0x26e6/0x2de0
[    3.834919]  SYSC_init_module+0x183/0x1c0
[    3.834922]  ? SYSC_init_module+0x183/0x1c0
[    3.834938]  SyS_init_module+0xe/0x10
[    3.834942]  do_syscall_64+0x6c/0x1c0
[    3.834946]  entry_SYSCALL64_slow_path+0x25/0x25
[    3.834949] RIP: 0033:0x7f13816c93ea
[    3.834951] RSP: 002b:00007fffb3719628 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[    3.834954] RAX: ffffffffffffffda RBX: 000055bcf80085f0 RCX: 00007f13816c93ea
[    3.834955] RDX: 00007f13822039c5 RSI: 000000000001c3cb RDI: 000055bcf8018c00
[    3.834957] RBP: 00007f13822039c5 R08: 000055bcf8007280 R09: 00000000000000b0
[    3.834959] R10: 00007f1381987b00 R11: 0000000000000246 R12: 000055bcf8018c00
[    3.834961] R13: 000055bcf8007160 R14: 0000000000020000 R15: 000055bcf674cdf7
[    3.834987] DEBUG_LOCKS_WARN_ON(irqs_disabled_flags(flags))
[    3.834992] ------------[ cut here ]------------
[    3.835029] WARNING: CPU: 1 PID: 463 at kernel/locking/lockdep.c:2897 lockdep_trace_alloc+0xb8/0x100
[    3.835064] Modules linked in: ccp(+) drm r8169 fjes(-) mii
[    3.835093] CPU: 1 PID: 463 Comm: systemd-udevd Tainted: G        W       4.13.11-200.fc26.x86_64+debug #1
[    3.835129] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AB350 Pro4, BIOS P3.20 09/05/2017
[    3.835171] task: ffff979a417c8000 task.stack: ffffaf0302e3c000
[    3.835205] RIP: 0010:lockdep_trace_alloc+0xb8/0x100
[    3.835229] RSP: 0018:ffffaf0302e3f7e0 EFLAGS: 00010082
[    3.835256] RAX: 000000000000002f RBX: 0000000000000046 RCX: 0000000000000000
[    3.835286] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffffb512bc16
[    3.835315] RBP: ffffaf0302e3f7e8 R08: 0000000000000000 R09: 0000000000000001
[    3.835345] R10: ffffaf0302e3f790 R11: 0000000000000001 R12: ffffaf0302e3f8e0
[    3.835375] R13: 00000000014000c0 R14: ffff979a4c5572c0 R15: 0000000000000000
[    3.835407] FS:  00007f1382a768c0(0000) GS:ffff979a4cc00000(0000) knlGS:0000000000000000
[    3.835443] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    3.835470] CR2: 00005614123b7180 CR3: 00000004017b4000 CR4: 00000000003406e0
[    3.835499] Call Trace:
[    3.835512]  kmem_cache_alloc+0x33/0x2e0
[    3.835529]  ? __d_alloc+0x27/0x220
[    3.835546]  __d_alloc+0x27/0x220
[    3.835565]  d_alloc+0x25/0xc0
[    3.835586]  __lookup_hash+0x45/0xa0
[    3.835605]  lookup_one_len+0x110/0x120
[    3.835627]  start_creating+0x74/0x110
[    3.835648]  debugfs_create_dir+0x13/0x110
[    3.835670]  ccp5_debugfs_setup+0x18c/0x1a0 [ccp]
[    3.835691]  ? ccp_dmaengine_register+0x32a/0x3c0 [ccp]
[    3.835715]  ccp5_init+0xa06/0xa10 [ccp]
[    3.835735]  ccp_pci_probe+0x260/0x420 [ccp]
[    3.835752]  local_pci_probe+0x42/0xa0
[    3.835768]  pci_device_probe+0x18d/0x1a0
[    3.835786]  driver_probe_device+0x2ff/0x450
[    3.835803]  __driver_attach+0xa8/0xf0
[    3.835818]  ? driver_probe_device+0x450/0x450
[    3.835836]  bus_for_each_dev+0x75/0xc0
[    3.835852]  driver_attach+0x1e/0x20
[    3.835868]  bus_add_driver+0x1ca/0x270
[    3.835884]  ? 0xffffffffc02bd000
[    3.836674]  driver_register+0x60/0xe0
[    3.837449]  ? 0xffffffffc02bd000
[    3.838238]  __pci_register_driver+0x60/0x70
[    3.839032]  ccp_pci_init+0x23/0x30 [ccp]
[    3.839826]  ccp_mod_init+0x9/0x1000 [ccp]
[    3.840603]  do_one_initcall+0x50/0x192
[    3.841403]  ? rcu_read_lock_sched_held+0x79/0x80
[    3.842238]  ? kmem_cache_alloc_trace+0x273/0x2e0
[    3.842988]  ? do_init_module+0x27/0x1eb
[    3.843753]  do_init_module+0x5f/0x1eb
[    3.844587]  load_module+0x26e6/0x2de0
[    3.845384]  SYSC_init_module+0x183/0x1c0
[    3.846213]  ? SYSC_init_module+0x183/0x1c0
[    3.847095]  SyS_init_module+0xe/0x10
[    3.847915]  do_syscall_64+0x6c/0x1c0
[    3.848699]  entry_SYSCALL64_slow_path+0x25/0x25
[    3.849462] RIP: 0033:0x7f13816c93ea
[    3.850254] RSP: 002b:00007fffb3719628 EFLAGS: 00000246 ORIG_RAX: 00000000000000af
[    3.851027] RAX: ffffffffffffffda RBX: 000055bcf80085f0 RCX: 00007f13816c93ea
[    3.851832] RDX: 00007f13822039c5 RSI: 000000000001c3cb RDI: 000055bcf8018c00
[    3.852775] RBP: 00007f13822039c5 R08: 000055bcf8007280 R09: 00000000000000b0
[    3.853727] R10: 00007f1381987b00 R11: 0000000000000246 R12: 000055bcf8018c00
[    3.854631] R13: 000055bcf8007160 R14: 0000000000020000 R15: 000055bcf674cdf7
[    3.855514] Code: f6 c7 02 75 48 e8 e9 84 3a 00 85 c0 74 1f 8b 05 cf b9 46 02 85 c0 75 15 48 c7 c6 57 80 ca b5 48 c7 c7 b3 ed c8 b5 e8 59 3c 01 00 <0f> ff 65 48 8b 04 25 c0 d4 00 00 48 89 df c7 80 3c 0d 00 00 00 
[    3.857018] ---[ end trace 613fa35f2c88c6ac ]---
[    3.857912] ccp 0000:11:00.2: enabled
[    3.866984] ata_id (499) used greatest stack depth: 12560 bytes left
[    3.916422] r8169 0000:0a:00.0 enp10s0: renamed from eth0
[    3.949127] [drm] amdgpu kernel modesetting enabled.
[    3.952814] AMD IOMMUv2 driver by Joerg Roedel <jroedel>
...
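Reading the excerpt above, the first report appears to show the ccp driver reaching debugfs_create_dir (and from there down_write, which may sleep) while ccp_debugfs_lock is held and interrupts are disabled. As a rough sketch of how that summary could be extracted mechanically, the hypothetical helper below picks out the BUG line, the held locks, and the reliable call trace frames from a dmesg excerpt. It assumes the line layout seen in this log and is not a general-purpose oops parser.

```python
import re

TIMESTAMP = re.compile(r"^\[\s*\d+\.\d+\] ?")
LOCK = re.compile(r"^\s*#\d+:\s+\((?P<name>[^)]+)\)")
FRAME = re.compile(
    r"^\s+(?P<guess>\? )?(?P<func>[A-Za-z_]\w*)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<module>\w+)\])?$"
)


def summarize_trace(log_text):
    """Summarize the first BUG line, held locks, and call trace (hypothetical helper)."""
    summary = {"bug": None, "locks": [], "frames": []}
    in_trace = False
    for raw in log_text.splitlines():
        line = TIMESTAMP.sub("", raw.rstrip())
        if in_trace:
            if not line.startswith(" "):
                break  # the first trace has ended (for example at the "RIP:" line)
            frame = FRAME.match(line)
            # Skip "?" entries: the kernel prints those as unreliable guesses.
            if frame and not frame.group("guess"):
                summary["frames"].append((frame.group("func"), frame.group("module")))
        elif line.startswith("BUG:") and summary["bug"] is None:
            summary["bug"] = line
        elif line == "Call Trace:":
            in_trace = True
        else:
            lock = LOCK.match(line)
            if lock:
                summary["locks"].append(lock.group("name"))
    return summary


if __name__ == "__main__":
    # A few lines copied from the excerpt above.
    sample = """\
[    3.834559] BUG: sleeping function called from invalid context at kernel/locking/rwsem.c:52
[    3.834690]  #2:  (ccp_debugfs_lock){......}, at: [<ffffffffc02b43bb>] ccp5_debugfs_setup+0x5b/0x1a0 [ccp]
[    3.834719] Call Trace:
[    3.834739]  down_write+0x32/0xc0
[    3.834750]  debugfs_create_dir+0x13/0x110
[    3.834758]  ccp5_debugfs_setup+0x18c/0x1a0 [ccp]
[    3.834762]  ? ccp_dmaengine_register+0x32a/0x3c0 [ccp]
[    3.834949] RIP: 0033:0x7f13816c93ea
"""
    result = summarize_trace(sample)
    print(result["bug"])
    print("locks held:", result["locks"])
    for func, module in result["frames"]:
        print(f"  {func}" + (f" [{module}]" if module else ""))
```

Frames tagged with a module name in brackets (here ccp) are the ones that point at the driver rather than the core kernel.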

I'll keep running this kernel until it reboots by itself again.

Comment 5 Bruno Antunes 2017-11-11 17:36:40 UTC

Created attachment 1350948
Full dmesg output

Comment 6 Bruno Antunes 2017-11-12 00:36:17 UTC

I have more info after another crash. From Problem Reporting:


not-reportable: The backtrace does not contain enough meaningful function frames to be reported. It is annoying but it does not necessary signalize a problem with your computer. ABRT will not allow you to create a report in a bug tracking system but you can contact kernel maintainers via e-mail.

reason: 
WARNING: CPU: 1 PID: 508 at kernel/locking/lockdep.c:2897 lockdep_trace_alloc+0xb8/0x100

backtrace:

WARNING: CPU: 1 PID: 508 at kernel/locking/lockdep.c:2897 lockdep_trace_alloc+0xb8/0x100
Modules linked in: ccp(+) r8169 fjes(-) mii
CPU: 1 PID: 508 Comm: systemd-udevd Tainted: G        W       4.13.11-200.fc26.x86_64+debug #1
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./AB350 Pro4, BIOS P3.20 09/05/2017
task: ffff9bdb43d0b300 task.stack: ffffba7342f60000
RIP: 0010:lockdep_trace_alloc+0xb8/0x100
RSP: 0018:ffffba7342f637e0 EFLAGS: 00010082
RAX: 000000000000002f RBX: 0000000000000046 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffffffff8512bc16
RBP: ffffba7342f637e8 R08: 0000000000000000 R09: 0000000000000001
R10: ffffba7342f63790 R11: 0000000000000001 R12: ffffba7342f638e0
R13: 00000000014000c0 R14: ffff9bdb5ed55340 R15: 0000000000000000
FS:  00007fb5b93978c0(0000) GS:ffff9bdb4ca00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb5b92ed000 CR3: 00000003ff904000 CR4: 00000000003406e0
Call Trace:


Hope this helps...
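The WARNING header reported here can be compared with the one in the boot log from comment 4 to check whether both point at the same place in the kernel. A minimal sketch, assuming the header layout shown in this report; `warning_site` is a hypothetical helper name.

```python
import re

WARNING = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<file>\S+):(?P<line>\d+) (?P<func>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def warning_site(text):
    """Return (file, line, function) for the first WARNING header in text, or None."""
    match = WARNING.search(text)
    if match is None:
        return None
    return match.group("file"), int(match.group("line")), match.group("func")


if __name__ == "__main__":
    # Headers copied from comment 4 (boot log) and comment 6 (Problem Reporting).
    boot = "WARNING: CPU: 1 PID: 463 at kernel/locking/lockdep.c:2897 lockdep_trace_alloc+0xb8/0x100"
    later = "WARNING: CPU: 1 PID: 508 at kernel/locking/lockdep.c:2897 lockdep_trace_alloc+0xb8/0x100"
    print(warning_site(boot) == warning_site(later))
```

Only the PID differs between the two headers; the source file, line, and function are identical, which suggests the same warning fires on each boot.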

Comment 7 Laura Abbott 2018-02-28 03:42:39 UTC

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale. The kernel moves very fast so bugs may get fixed as part of a kernel update. Due to this, we are doing a mass bug update across all of the Fedora 26 kernel bugs.
 
Fedora 26 has now been rebased to 4.15.4-200.fc26.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 27, and are still experiencing this issue, please change the version to Fedora 27.
 
If you experience different issues, please open a new bug report for those.
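To check whether an installed kernel is at or beyond the 4.15.4-200 rebase mentioned above, something like the following could be used. This is a simplified sketch: it compares only the numeric version and release fields and ignores the dist tag, so it is not a substitute for RPM's own version comparison. Both function names are hypothetical.

```python
import re


def kernel_key(release):
    """Turn e.g. '4.15.4-200.fc26.x86_64' into a comparable tuple (4, 15, 4, 200)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)-(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def is_at_least(release, minimum="4.15.4-200.fc26"):
    """True if release is the same as or newer than minimum (numeric fields only)."""
    return kernel_key(release) >= kernel_key(minimum)


if __name__ == "__main__":
    # Kernel releases mentioned in this report.
    for release in ("4.13.5-200.fc26.x86_64", "4.13.11-200.fc26.x86_64", "4.15.4-200.fc26.x86_64"):
        print(release, is_at_least(release))
```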

Comment 8 Bruno Antunes 2018-02-28 11:17:46 UTC

No need to apologize. I've tested the 4.15 kernels on F27 and the problems are gone. Thanks!

