1525279 – 4.14.8-200 x86_64 intermittently fails to reboot and panics

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	26
Hardware:	x86_64
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Duplicates (1):	1530318 (view as bug list)
Depends On:
Blocks:

Reported:	2017-12-13 00:09 UTC by York Possemiers
Modified:	2018-01-05 16:08 UTC (History)
CC List:	24 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2018-01-05 16:08:55 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
output from journalctl -b -1 (occurred previous boot) but does not show all kernel messages (456.68 KB, text/plain) 2017-12-13 00:09 UTC, York Possemiers	no flags	Details
Extra kernel output not entirely present in log (5.58 MB, image/jpeg) 2017-12-13 00:15 UTC, York Possemiers	no flags	Details
new kernel console dump in newer kernel point releases (4.48 MB, image/jpeg) 2017-12-21 23:10 UTC, York Possemiers	no flags	Details
Another kernel panic example (5.45 MB, image/jpeg) 2017-12-26 22:37 UTC, Graham Mainwaring	no flags	Details
Upstream patch to fix the issue up to 4.14.9 (4.32 KB, patch) 2017-12-29 22:20 UTC, Thomas Jarosch	no flags	Details | Diff

Description York Possemiers 2017-12-13 00:09:39 UTC

Created attachment 1367022 [details]
output from journalctl -b -1 (occurred previous boot) but does not show all kernel messages

Removed extra boot options to confirm this issue.
Occurs on perhaps 50% of reboots. Exact triggers not yet identified.
Using Ryzen 1600X on B350M
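
For anyone collecting the same data: a minimal Python sketch of pulling the previous boot's kernel messages out of the journal. It uses the same `journalctl -b -1` invocation as the attachment, plus `-k` to restrict output to kernel messages. The function names are made up for illustration. As noted above, the journal may not contain everything the kernel printed before the hang, so this is only a starting point.

```python
import subprocess


def previous_boot_kernel_log_cmd(boot_offset=-1):
    """Build the journalctl argv for the kernel messages of an earlier boot.

    boot_offset=-1 means "the boot before the current one" (journalctl -b -1).
    """
    return ["journalctl", "--no-pager", "-k", "-b", str(boot_offset)]


def read_previous_boot_kernel_log(boot_offset=-1):
    """Run journalctl and return its output as text.

    Raises subprocess.CalledProcessError if journalctl fails, for example
    when no journal from that boot was kept.
    """
    result = subprocess.run(
        previous_boot_kernel_log_cmd(boot_offset),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Messages printed while the machine is locking up may never reach the disk, so a console photo or a network console can still be needed for the final lines.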

Comment 1 York Possemiers 2017-12-13 00:15:59 UTC

Created attachment 1367024 [details]
Extra kernel output not entirely present in log

Pressed Ctrl-Alt-Del at the console to obtain extra kernel output. The freeze also occurs when rebooting from the GUI.

Comment 2 York Possemiers 2017-12-13 00:27:18 UTC

The panic is such that using the reset button is ineffective, and ACPI hard shutdown (holding the power button) is required

Comment 3 York Possemiers 2017-12-17 23:10:38 UTC

Bug still present in 4.14.6

Comment 4 York Possemiers 2017-12-21 23:07:58 UTC

Bug still present in 4.14.8

Comment 5 York Possemiers 2017-12-21 23:10:07 UTC

Created attachment 1371095 [details]
new kernel console dump in newer kernel point releases

4.14.6 and 4.14.8 now dump even more to the console, and I believe (though it's hard to tell because it scrolls by so fast) the original console dump still applies.

Comment 6 Graham Mainwaring 2017-12-26 22:37:49 UTC

Created attachment 1372607 [details]
Another kernel panic example

I think this is another example of the same issue.  It happens on reboot on 4.14.5-300 and 4.14.8-300.

Note that this only happens when rebooting via systemd.  If I do a reboot --force it behaves normally.

Comment 7 Thomas Jarosch 2017-12-29 22:20:49 UTC

Created attachment 1374213 [details]
Upstream patch to fix the issue up to 4.14.9

Problem happens on Fedora 27 with 4.14.9, too:
Kernel crashes on reboot on three workstations at work.

Fix is available here:
https://marc.info/?l=linux-cgroups&m=151378282108794&w=2

I'll also attach the upstream patch to this bug report.

Comment 8 Thomas Jarosch 2017-12-29 22:27:13 UTC

Additional partial kernel oops backtraces so search engines can find this bugzilla entry:

**************************************
[ 1164.913034] NMI watchdog: Watchdog detected hard LOCKUP on cpu 3
[ 1164.913034] Modules linked in: netconsole vhost_net vhost tap xt_CHECKSUM tun ebtable_filter ebtables rpcsec_gss_krb5 auth_rpcgs
[ 1164.913058]  ghash_clmulni_intel r8169 mii sunrpc scsi_transport_iscsi [last unloaded: libcrc32c]
[ 1164.913062] CPU: 3 PID: 519 Comm: kworker/dying Not tainted 4.14.9-300.fc27.x86_64 #1
[ 1164.913062] Hardware name: ASUS All Series/B85M-E, BIOS 2306 11/09/2015
[ 1164.913063] task: ffff8ac2f061be80 task.stack: ffffae6c839cc000
[ 1164.913068] RIP: 0010:queued_spin_lock_slowpath+0x12d/0x190
[ 1164.913068] RSP: 0018:ffffae6c839cfe68 EFLAGS: 00000002
[ 1164.913069] RAX: 00000000001c0101 RBX: ffffffffb7e6c200 RCX: 0000000000000001
[ 1164.913070] RDX: 0000000000000101 RSI: 0000000000000001 RDI: ffffffffb856f3a0
[ 1164.913070] RBP: ffff8ac2f061be80 R08: 0000000000000101 R09: 0000000000000000
[ 1164.913071] R10: 0000000000000000 R11: 0000000000000300 R12: 0000000000000000
[ 1164.913071] R13: ffff8ac2f061be01 R14: 0000000000000000 R15: ffffffffb70bdc80
[ 1164.913072] FS:  0000000000000000(0000) GS:ffff8ac31dcc0000(0000) knlGS:0000000000000000
[ 1164.913073] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1164.913073] CR2: 00000000b7746000 CR3: 0000000373e09003 CR4: 00000000001626e0
[ 1164.913074] Call Trace:
[ 1164.913077]  cgroup_exit+0x4a/0xf0
[ 1164.913081]  do_exit+0x2f8/0xba0
[ 1164.913084]  ? worker_thread+0x252/0x380
[ 1164.913085]  ? process_one_work+0x3a0/0x3a0
[ 1164.913086]  kthread+0xe7/0x130
[ 1164.913087]  ? kthread_park+0x60/0x60
[ 1164.913089]  ? do_syscall_64+0x61/0x170
[ 1164.913090]  ? SyS_exit_group+0x10/0x10
[ 1164.913092]  ret_from_fork+0x1f/0x30
..
[ 1165.126398] NMI watchdog: Watchdog detected hard LOCKUP on cpu 2
[ 1165.126399] Modules linked in: netconsole vhost_net vhost tap xt_CHECKSUM tun ebtable_filter ebtables rpcsec_gss_krb5 auth_rpcgs
[ 1165.126417]  ghash_clmulni_intel r8169 mii sunrpc scsi_transport_iscsi [last unloaded: libcrc32c]
[ 1165.126420] CPU: 2 PID: 1 Comm: systemd Not tainted 4.14.9-300.fc27.x86_64 #1
[ 1165.126420] Hardware name: ASUS All Series/B85M-E, BIOS 2306 11/09/2015
[ 1165.126421] task: ffff8ac2f9ed3e80 task.stack: ffffae6c83144000
[ 1165.126423] RIP: 0010:css_task_iter_advance+0x22/0x70
[ 1165.126424] RSP: 0018:ffffae6c83147da8 EFLAGS: 00000002
[ 1165.126424] RAX: ffff8ac2ef1ca090 RBX: ffff8ac2f4711c90 RCX: dead000000000200
[ 1165.126425] RDX: ffff8ac2ef1ca0a0 RSI: ffff8ac2ef1ca0a0 RDI: ffff8ac2ef1ef4e0
[ 1165.126426] RBP: ffff8ac2ef1ef480 R08: ffff8ac2ef1ca0a0 R09: ffff8ac2ef1ef480
[ 1165.126426] R10: 00007f2edf875b38 R11: 0000000000003000 R12: ffff8ac2ef1ef300
[ 1165.126427] R13: ffffae6c83147e18 R14: ffff8ac2f413f100 R15: ffff8ac2ef1ef300
[ 1165.126428] FS:  00007f2ee0f0fa00(0000) GS:ffff8ac31dc80000(0000) knlGS:0000000000000000
[ 1165.126428] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1165.126429] CR2: 000055a4c2fcc068 CR3: 00000007f85d7004 CR4: 00000000001626e0
[ 1165.126430] Call Trace:
[ 1165.126432]  css_task_iter_next+0x4f/0x70
[ 1165.126435]  kernfs_seq_start+0x4a/0x80
[ 1165.126438]  seq_read+0xa9/0x440
[ 1165.126439]  __vfs_read+0x33/0x160
[ 1165.126441]  vfs_read+0x89/0x130
[ 1165.126442]  SyS_read+0x52/0xc0
[ 1165.126444]  entry_SYSCALL_64_fastpath+0x1a/0x7d
**************************************

From a second machine:
**************************************
[  266.102397] WARNING: CPU: 6 PID: 1 at kernel/fork.c:414 __put_task_struct+0xeb/0x150
[  266.102407] Modules linked in: netconsole rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache vhost_net vhost
[  266.102438]  e1000e crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ptp pps_core hid_microsoft sunrpc scsi_transp
[  266.102447] CPU: 6 PID: 1 Comm: systemd Not tainted 4.14.9-300.fc27.x86_64 #1
[  266.102452] Hardware name:                  /DH87MC, BIOS MCH8710H.86A.0157.2014.0530.1830 05/30/2014
[  266.102456] task: ffff95aa7bed1f40 task.stack: ffffaaac83144000
[  266.102461] RIP: 0010:__put_task_struct+0xeb/0x150
[  266.102466] RSP: 0018:ffffaaac83147db8 EFLAGS: 00010246
[  266.102472] RAX: 0000000000000000 RBX: ffff95aa58161110 RCX: 0000000000000001
[  266.102477] RDX: ffffaaac83147e20 RSI: ffff95aa58161110 RDI: ffff95aa58161110
[  266.102481] RBP: ffffaaac83147f20 R08: 0000000000001000 R09: 0000000000000007
[  266.102486] R10: ffff95aa793c2f38 R11: ffff95aa4c768006 R12: ffff95aa793c2f00
[  266.102490] R13: 00000000ffffffff R14: ffff95aa58161110 R15: ffff95aa74f48f00
[  266.102495] FS:  00007f3f9ff93a00(0000) GS:ffff95aa9fb80000(0000) knlGS:0000000000000000
[  266.102499] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  266.102504] CR2: 000055a318166620 CR3: 00000007f2781004 CR4: 00000000001626e0
[  266.102508] Call Trace:
[  266.102514]  css_task_iter_next+0x68/0x70
[  266.102521]  kernfs_seq_next+0x23/0x50
[  266.102528]  ? cgroup_procs_show+0x26/0x30
[  266.102534]  seq_read+0x313/0x440
[  266.102539]  __vfs_read+0x33/0x160
[  266.102543]  vfs_read+0x89/0x130
[  266.102549]  SyS_read+0x52/0xc0
[  266.102557]  entry_SYSCALL_64_fastpath+0x1a/0x7d
**************************************

**************************************
[  266.102663] BUG: unable to handle kernel NULL pointer dereference at 00000000000000b0
[  266.102671] IP: pids_free+0x11/0x40
[  266.102674] PGD 0 P4D 0 
[  266.102682] Oops: 0000 [#1] SMP
[  266.102686] Modules linked in: netconsole rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache vhost_net vhost
[  266.102716]  e1000e crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ptp pps_core hid_microsoft sunrpc scsi_transp
[  266.102726] CPU: 6 PID: 1 Comm: systemd Tainted: G        W       4.14.9-300.fc27.x86_64 #1
[  266.102730] Hardware name:                  /DH87MC, BIOS MCH8710H.86A.0157.2014.0530.1830 05/30/2014
[  266.102736] task: ffff95aa7bed1f40 task.stack: ffffaaac83144000
[  266.102741] RIP: 0010:pids_free+0x11/0x40
[  266.102745] RSP: 0018:ffffaaac83147d70 EFLAGS: 00010246
[  266.102748] RAX: ffff95a93f6828c0 RBX: 0000000000000000 RCX: 000000000000000b
[  266.102751] RDX: 000000000000000b RSI: 000000000000000c RDI: ffff95aa58161110
[  266.102755] RBP: ffff95aa58161110 R08: 0000000000001000 R09: 0000000000000007
[  266.102759] R10: ffff95aa793c2f38 R11: ffff95aa4c768006 R12: ffffffff82e6c640
[  266.102763] R13: ffff95a93f6828c0 R14: ffff95aa58161110 R15: ffff95aa74f48f00
[  266.102769] FS:  00007f3f9ff93a00(0000) GS:ffff95aa9fb80000(0000) knlGS:0000000000000000
[  266.102774] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  266.102778] CR2: 00000000000000b0 CR3: 00000007f2781004 CR4: 00000000001626e0
[  266.102783] Call Trace:
[  266.102790]  cgroup_free+0x5c/0xd0
[  266.102795]  __put_task_struct+0x3d/0x150
[  266.102801]  css_task_iter_next+0x68/0x70
[  266.102809]  kernfs_seq_next+0x23/0x50
[  266.102816]  ? cgroup_procs_show+0x26/0x30
[  266.102825]  seq_read+0x313/0x440
[  266.102832]  __vfs_read+0x33/0x160
[  266.102838]  vfs_read+0x89/0x130
[  266.102845]  SyS_read+0x52/0xc0
[  266.102853]  entry_SYSCALL_64_fastpath+0x1a/0x7d
**************************************
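
To compare traces like the ones above across machines, it can help to reduce each "Call Trace:" section to its list of function names. The following is a rough Python sketch under stated assumptions: it expects dmesg-style lines with a bracketed timestamp as pasted in this comment, and `extract_call_traces` is a hypothetical helper name, not part of any existing tool.

```python
import re

# A frame looks like "[ 1164.913077]  cgroup_exit+0x4a/0xf0".
# A leading "?" (as in "?  worker_thread+0x252/0x380") marks a frame the
# kernel's unwinder considers unreliable.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<guess>\?\s+)?"
    r"(?P<symbol>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+\s*$"
)
TRACE_START_RE = re.compile(r"^\[\s*\d+\.\d+\]\s+Call Trace:\s*$")


def extract_call_traces(log_text, include_unreliable=False):
    """Return one list of symbol names per 'Call Trace:' section."""
    traces = []
    current = None  # frames of the trace being read, or None outside a trace
    for line in log_text.splitlines():
        if TRACE_START_RE.match(line):
            current = []
            traces.append(current)
            continue
        if current is None:
            continue
        match = FRAME_RE.match(line)
        if match is None:
            current = None  # the first non-frame line ends the trace
            continue
        if match.group("guess") and not include_unreliable:
            continue
        current.append(match.group("symbol"))
    return traces
```

Run over the traces in this comment, the reads issued by systemd (PID 1) all pass through `css_task_iter_next`, which matches the function named in the fix.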

Comment 9 Thomas Jarosch 2017-12-29 22:39:54 UTC

Just to sum this up:

The upstream patch fixes the issue on all three affected workstations at work.
Two of them crashed on every reboot.

Comment 10 Laura Abbott 2018-01-02 17:24:42 UTC

*** Bug 1530318 has been marked as a duplicate of this bug. ***

Comment 11 Gerd v. Egidy 2018-01-05 12:22:37 UTC

The patch seems to be included in current Fedora kernel builds with

Patch631: cgroup-for-4.15-fixes-cgroup-fix-css_task_iter-crash-on-CSS_TASK_ITER_PROC.patch

It came with

kernel-4.14.11-300.fc27
kernel-4.14.11-200.fc26

but without mentioning this rhbz entry.
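
A quick way to check whether a given Fedora 4.14.x kernel is at or past the builds listed above is to compare the upstream version part of the release string. This is only a sketch: the function names are invented here, and the comparison is a heuristic based solely on the two builds named in this comment (4.14.11-300.fc27 and 4.14.11-200.fc26). It says nothing about whether an older or non-Fedora kernel is affected.

```python
import re

# Upstream version of the first Fedora builds reported here as carrying the patch.
FIXED_UPSTREAM_VERSION = (4, 14, 11)


def upstream_version(release):
    """Parse the leading 'X.Y.Z' of a release string like '4.14.9-300.fc27.x86_64'."""
    match = re.match(r"^(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    return tuple(int(part) for part in match.groups())


def at_or_after_fixed_build(release):
    """True if the release is at or after the first fixed Fedora build."""
    return upstream_version(release) >= FIXED_UPSTREAM_VERSION
```

For the running kernel, the release string can be obtained with `platform.release()` from the standard library, for example `at_or_after_fixed_build(platform.release())`.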

Comment 12 Laura Abbott 2018-01-05 16:08:55 UTC

Yes, we missed this when doing the updates. Thanks for pointing that out.

