1552124 – All kernels from kernel-4.15.6-300 onwards lock up frequently

Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	28
Hardware:	Unspecified
OS:	Linux
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-03-06 14:31 UTC by David Anderson
Modified:	2019-02-21 21:10 UTC
CC List:	24 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2019-02-21 21:10:38 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
fpaste --sysinfo output (16.99 KB, patch) 2018-03-20 12:41 UTC, Valentin Bajrami

Description David Anderson 2018-03-06 14:31:23 UTC

Description of problem:

Since the update to kernel-4.15.6-300.fc27.x86_64, my computer (~ 2 years old, running current Fedora versions throughout) locks up frequently. Around half the time, it locks up during boot. By "locks up" I mean that no further interaction at all of any sort is possible (mouse pointer frozen, not responsive to any keypresses). If it happens after boot whilst sound is playing, then whatever is in the sound buffer replays continually on loop. This has happened around 10 times in the last week.

This has never happened before, and if I boot from kernel-4.14.18-300.fc27.x86_64 it does not happen (either on boot, or later).

Nothing at all is logged in /var/log/messages when this happens, unfortunately. It really is as if it's a complete lock-up.

Version-Release number of selected component (if applicable):

kernel-4.15.6-300.fc27.x86_64 

How reproducible:

Around half the time on boot; or, infrequently if it survives the boot process.

Comment 1 David Anderson 2018-03-07 09:26:59 UTC

Since writing the report, I've booted the machine 5 times; 3 times to the 4.14.18 kernel without problems, twice to the 4.15.6 kernel - it locked up before reaching the desktop both times.

Comment 2 Emmanuel Touzery 2018-03-09 16:42:44 UTC

4.15.6-300 locked up the same way for me, minutes into the GUI session, twice in a row. Rebooting to the older 4.15.4-300.fc27.x86_64 makes the issue go away.

Relatively old Intel computer, built-in Intel graphics.

I don't see anything that seems relevant towards the end of "journalctl --boot=-1".

Comment 3 Roger Heflin 2018-03-12 14:06:42 UTC

My behavior is consistent with what is being reported.  Mine sometimes boots and sometimes does not, but fairly quickly it locks up accessing my RAID device, while my other disk (not RAID) is still functioning.

Here is what I get:

[23842.276861] WARNING: CPU: 2 PID: 2249 at kernel/rcu/tree.c:2792 rcu_process_callbacks+0x4cb/0x4e0
[23842.276905] Modules linked in: nfsd auth_rpcgss nfs_acl lockd grace sunrpc vhost_net vhost tap xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables cfg80211 rfkill it87 hwmon_vid xfs edac_mce_amd kvm_amd raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor kvm irqbypass crct10dif_pclmul crc32_pclmul mt2131 ghash_clmulni_intel raid6_pq libcrc32c lgdt330x s5h1409 cx88_dvb cx88_vp3054_i2c ir_rc5_decoder rc_hauppauge ir_lirc_codec lirc_dev tuner_simple tuner_types tda9887 tda8290 tuner k10temp cx8802 cx8800 cx88_alsa cx88xx sp5100_tco i2c_piix4 cx25840 pl2303 cx23885 altera_ci tda18271 altera_stapl videobuf2_dma_sg
[23842.277207]  m88ds3103 tveeprom cx2341x videobuf2_memops snd_hda_codec_realtek snd_usb_audio videobuf2_dvb snd_hda_codec_hdmi snd_hda_codec_generic videobuf2_v4l2 snd_usbmidi_lib rc_ati_x10 snd_rawmidi videobuf2_core snd_hda_intel dvb_core ati_remote snd_hda_codec rc_core v4l2_common videodev snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm media i2c_mux snd_timer snd soundcore shpchp acpi_cpufreq radeon mpt3sas crc32c_intel raid_class scsi_transport_sas uas i2c_algo_bit usb_storage drm_kms_helper ttm drm r8169 mii
[23842.277402] CPU: 2 PID: 2249 Comm: ml5 Not tainted 4.15.6-300.fc27.x86_64 #1
[23842.277430] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A85X-D3H, BIOS F3 04/08/2013
[23842.277471] RIP: 0010:rcu_process_callbacks+0x4cb/0x4e0
[23842.277494] RSP: 0000:ffff95c13ed03f08 EFLAGS: 00010002
[23842.277516] RAX: ffffffffffffd800 RBX: ffff95c13ed21980 RCX: dead000000000201
[23842.277545] RDX: 0000000000000001 RSI: ffff95c13ed03f10 RDI: ffff95c13ed219b8
[23842.277573] RBP: ffffffffaa25e900 R08: ffffffffaa2cb040 R09: 0000000000000000
[23842.277601] R10: 0000000000000098 R11: 0000000000000000 R12: ffff95c13ed219b8
[23842.277629] R13: 7fffffffffffffff R14: 0000000000000246 R15: ffffffffffffffff
[23842.277657] FS:  00007f2494743700(0000) GS:ffff95c13ed00000(0000) knlGS:0000000000000000
[23842.277689] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[23842.277713] CR2: 0000000004412ff7 CR3: 00000007da606000 CR4: 00000000000406e0
[23842.277742] Call Trace:
[23842.277761]  <IRQ>
[23842.278916]  __do_softirq+0xe7/0x2cb
[23842.280061]  irq_exit+0xf1/0x100
[23842.281204]  smp_apic_timer_interrupt+0x6c/0x120
[23842.282344]  apic_timer_interrupt+0x87/0x90
[23842.283472]  </IRQ>
[23842.284584] RIP: 0033:0x7f24a8ef01a6
[23842.285696] RSP: 002b:00007f2494742700 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff11
[23842.286837] RAX: 0000000000000002 RBX: 00007f2488012f30 RCX: 0000000000000001
[23842.287994] RDX: 0000000000000001 RSI: 0000000000000046 RDI: 0000000000000002
[23842.289157] RBP: 0000000000000470 R08: 0000000000000468 R09: 0000000000000008
[23842.290320] R10: 00007f248801c660 R11: 00000000fffffffe R12: 00007f2488012bd0
[23842.291468] R13: 00007f24a8f0b010 R14: 00007f2488011ef0 R15: 00007f2494742870
[23842.292596] Code: ff 48 8b 05 90 f9 13 01 48 89 83 b0 00 00 00 e9 c8 fd ff ff 0f 0b e9 c8 fb ff ff 4c 89 f6 4c 89 e7 e8 aa b3 78 00 e9 18 fc ff ff <0f> 0b e9 f2 fd ff ff 0f 0b e9 e9 fc ff ff e8 d2 0d f9 ff 66 90
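
When comparing a trace like this against upstream reports, the most useful search key is usually the warning site: the source file, line, and symbol from the first line. The following is a minimal Python sketch, not part of the original report, that extracts those fields. The regular expression and the field names (`cpu`, `pid`, `file`, `line`, `symbol`, `offset`, `size`) are assumptions chosen for illustration, and the example string is the first line of the trace with terminal line wrapping removed.

```python
import re

# Pattern for the "WARNING: CPU: ... at <file>:<line> <symbol>+<offset>/<size>"
# line that opens a kernel warning report. The group names are our own labels.
WARNING_RE = re.compile(
    r"WARNING: CPU: (?P<cpu>\d+) PID: (?P<pid>\d+) at "
    r"(?P<file>\S+):(?P<line>\d+) "
    r"(?P<symbol>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)


def parse_warning(log_line):
    """Return the warning site as a dict, or None if this is not a WARNING header."""
    match = WARNING_RE.search(log_line)
    if match is None:
        return None
    site = match.groupdict()
    # Convert the purely numeric fields; offsets stay as hex strings.
    for key in ("cpu", "pid", "line"):
        site[key] = int(site[key])
    return site


example = (
    "[23842.276861] WARNING: CPU: 2 PID: 2249 at kernel/rcu/tree.c:2792 "
    "rcu_process_callbacks+0x4cb/0x4e0"
)
site = parse_warning(example)
print(f"{site['file']}:{site['line']} in {site['symbol']}")
```

Searching a bug tracker for the file and symbol together tends to be more reliable than searching for the register dump, which differs on every machine.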

Comment 4 Roger Heflin 2018-03-12 23:16:53 UTC

It appears to be this one.

Summary: there is a bug when dealing with disk devices that are having some issues.

https://bugzilla.kernel.org/show_bug.cgi?id=198861

Comment 5 David Anderson 2018-03-13 20:16:23 UTC

Roger Heflin, if you get an oops stack trace, and are using SCSI RAID, then you appear to have a different issue, so you'll want to file a separate report. In the reported case, nothing is logged at all and I am not using SCSI RAID.

Just to note that the same problem is happening for me in 4.15.7-300.fc27.x86_64 as in 4.15.6-300.fc27.x86_64.

Comment 6 Roger Heflin 2018-03-14 01:11:09 UTC

The listed bug is not fixed in any kernel.org kernels as of yet.

If I did not have my /boot and rootvg on another disk controller (AHCI), this would probably lock up completely and take out my system.  It certainly would not be able to log any messages about the issue, since the disk device would be locked up.

I am using MD-RAID on a JBOD SAS controller, and the disks on that one act up sometimes (possibly a cabling issue).

If on the working kernel you get random ATA retries (not the normal initial disk status messages), then from what the bug report says, this bug could be it.  My disks on the AHCI controller are newer and better behaved than the disks on the second controller.  Most of the time on boot-up I will get an ATA error on one of these disks.  If you don't normally get these errors, then this is probably not your bug.

And I don't really need to report it, since the bug I appear to have is already supposed to be fixed and is queued for inclusion in the next kernels.

Here was the error I got just prior to the lockup:
Mar 11 14:13:50 rahrah kernel: [   32.234447] ata8.00: irq_stat 0x08000000, interface fatal error
Mar 11 14:13:50 rahrah kernel: [   32.235919] ata8: SError: { Handshk }
Mar 11 14:13:50 rahrah kernel: [   32.237299] ata8.00: failed command: WRITE FPDMA QUEUED
Mar 11 14:13:50 rahrah kernel: [   32.238763] ata8.00: cmd 61/04:b8:10:e0:cd/00:00:5a:00:00/40 tag 23 ncq dma 2048 out
Mar 11 14:13:50 rahrah kernel: [   32.238763]          res 40/00:c8:08:58:19/00:00:07:01:00/40 Emask 0x10 (ATA bus error)
Mar 11 14:13:50 rahrah kernel: [   32.241601] ata8.00: status: { DRDY }
Mar 11 14:13:50 rahrah kernel: [   32.242991] ata8.00: failed command: READ FPDMA QUEUED
Mar 11 14:13:50 rahrah kernel: [   32.244443] ata8.00: cmd 60/08:c8:08:58:19/00:00:07:01:00/40 tag 25 ncq dma 4096 in
Mar 11 14:13:50 rahrah kernel: [   32.244443]          res 40/00:c8:08:58:19/00:00:07:01:00/40 Emask 0x10 (ATA bus error)
Mar 11 14:13:50 rahrah kernel: [   32.247230] ata8.00: status: { DRDY }
Mar 11 14:13:50 rahrah kernel: [   32.248627] ata8: hard resetting link

Mar 11 14:13:50 rahrah kernel: [   32.711159] ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Mar 11 14:13:50 rahrah kernel: [   32.713490] ata8.00: NCQ Send/Recv Log not supported
Mar 11 14:13:50 rahrah kernel: [   32.715717] ata8.00: NCQ Send/Recv Log not supported
Mar 11 14:13:50 rahrah kernel: [   32.716683] ata8.00: configured for UDMA/133
Mar 11 14:13:50 rahrah kernel: [   32.717696] ata8: EH complete
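
For anyone wanting to check whether their working kernel shows the same kind of ATA retries, here is a rough Python sketch (an illustration added alongside this comment, not a tool from the thread). It counts error-looking lines per ATA port. The marker strings in `ATA_ERROR_MARKERS` are an assumption taken from the excerpt above and are not an exhaustive list of libata messages; `count_ata_errors` is a hypothetical helper name.

```python
import re

# Substrings treated as signs of ATA link trouble. This list is an assumption
# drawn from the log excerpt, not a complete catalogue of libata messages.
ATA_ERROR_MARKERS = (
    "interface fatal error",
    "failed command",
    "ATA bus error",
    "hard resetting link",
)

# Matches a port or device prefix such as "ata8:" or "ata8.00:".
ATA_PORT_RE = re.compile(r"\b(ata\d+)(?:\.\d+)?:")


def count_ata_errors(lines):
    """Count error-marker lines per ATA port, e.g. {"ata8": 3}."""
    counts = {}
    for line in lines:
        if not any(marker in line for marker in ATA_ERROR_MARKERS):
            continue
        port = ATA_PORT_RE.search(line)
        # Continuation lines ("res 40/00:...") carry no port prefix; skip them
        # so one failed command is not counted twice.
        if port is None:
            continue
        counts[port.group(1)] = counts.get(port.group(1), 0) + 1
    return counts


sample = [
    "kernel: [   32.234447] ata8.00: irq_stat 0x08000000, interface fatal error",
    "kernel: [   32.237299] ata8.00: failed command: WRITE FPDMA QUEUED",
    "kernel: [   32.238763]          res 40/00:c8:08:58:19/00:00:07:01:00/40 Emask 0x10 (ATA bus error)",
    "kernel: [   32.248627] ata8: hard resetting link",
    "kernel: [   32.716683] ata8.00: configured for UDMA/133",
]
print(count_ata_errors(sample))
```

The same function could be fed the lines of a saved kernel log from a boot on the known-good kernel. An empty result there would suggest this particular bug is not the one being hit.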

Comment 7 David Anderson 2018-03-14 10:08:00 UTC

> If on the working kernel you get random ATA retries (not the normal initial disk status messages)

I don't.

Comment 8 David Anderson 2018-03-16 14:22:11 UTC

No change with 4.15.8-300.fc27.x86_64 - still locks up during the boot process.

Comment 9 Valentin Bajrami 2018-03-20 10:38:52 UTC

Experiencing the same problem on  kernel-4.15.9-300.fc27.x86_64

The only working kernel right now is kernel-4.15.3-300.fc27.x86_64

I did a diff as follows:

localhost:/boot# diff -ruN config-4.15.3-300.fc27.x86_64 config-4.15.9-300.fc27.x86_64 
--- config-4.15.3-300.fc27.x86_64    2018-02-13 18:16:14.000000000 +0100
+++ config-4.15.9-300.fc27.x86_64    2018-03-12 18:27:50.000000000 +0100
@@ -1,6 +1,6 @@
 #
 # Automatically generated file; DO NOT EDIT.
-# Linux/x86_64 4.15.3-300.fc27.x86_64 Kernel Configuration
+# Linux/x86_64 4.15.9-300.fc27.x86_64 Kernel Configuration
 #
 CONFIG_64BIT=y
 CONFIG_X86_64=y
@@ -5123,7 +5123,7 @@
 CONFIG_DRM_RADEON=m
 CONFIG_DRM_RADEON_USERPTR=y
 CONFIG_DRM_AMDGPU=m
-# CONFIG_DRM_AMDGPU_SI is not set
+CONFIG_DRM_AMDGPU_SI=y
 CONFIG_DRM_AMDGPU_CIK=y
 CONFIG_DRM_AMDGPU_USERPTR=y
 # CONFIG_DRM_AMDGPU_GART_DEBUGFS is not set
localhost:/boot# 

The only change I see is that CONFIG_DRM_AMDGPU_SI=y has been added. Is there a fix yet?
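
The same comparison can be made without reading a unified diff by hand. Below is a small Python sketch (an illustration, not something used in this thread) that parses two kernel config texts and reports only the options whose value changed. The helper names `parse_config` and `changed_options` are hypothetical, and the two inline config fragments are trimmed to the lines shown in the diff above.

```python
import re

# "CONFIG_FOO=value" and "# CONFIG_FOO is not set" are the two forms an option
# takes in a kernel config file.
SET_RE = re.compile(r"^(CONFIG_\w+)=(.*)$")
UNSET_RE = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_config(text):
    """Map each option name to its value; unset options map to None."""
    options = {}
    for raw in text.splitlines():
        line = raw.strip()
        match = SET_RE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = UNSET_RE.match(line)
        if match:
            options[match.group(1)] = None
    return options


def changed_options(old_text, new_text):
    """Return {name: (old, new)} for options whose value differs."""
    old, new = parse_config(old_text), parse_config(new_text)
    return {
        name: (old.get(name), new.get(name))
        for name in sorted(old.keys() | new.keys())
        if old.get(name) != new.get(name)
    }


old_cfg = """CONFIG_DRM_AMDGPU=m
# CONFIG_DRM_AMDGPU_SI is not set
CONFIG_DRM_AMDGPU_CIK=y
"""
new_cfg = """CONFIG_DRM_AMDGPU=m
CONFIG_DRM_AMDGPU_SI=y
CONFIG_DRM_AMDGPU_CIK=y
"""
print(changed_options(old_cfg, new_cfg))
```

One design choice to be aware of: an option that is absent and an option marked "is not set" are both treated as `None`, so the sketch does not distinguish between them. Note also that a config difference only shows what was built differently; it does not by itself show that the option causes the lock-ups.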

Comment 10 Valentin Bajrami 2018-03-20 12:41:45 UTC

Created attachment 1410445
fpaste --sysinfo output

Comment 11 Valentin Bajrami 2018-03-20 14:00:39 UTC

Installing kernel 4.15.11-300.fc27.x86_64 from https://koji.fedoraproject.org/koji/buildinfo?buildID=1060016 seems to solve the problem(s).

Comment 12 David Anderson 2018-03-26 08:04:47 UTC

No change with kernel-4.15.10-300.fc27.x86_64 (I haven't yet made time to test the 4.15.11 build).

Comment 13 David Anderson 2018-07-18 22:10:58 UTC

I've carried on trying out the latest kernel-du-jour with some regularity. Latest is kernel-4.17.4-200.fc28.x86_64. (I'm running Fedora 28 now).

Basically, anything I boot past kernel-4.14.18-300.fc27.x86_64 will cause a CPU lock-up within an hour or two. If I boot kernel-4.14.18-300.fc27.x86_64, I can use it all day, no problems.

I've hesitated to post that until now because you always think "maybe it's random..."... but it's gone on long enough now, across so many updates and kernel versions, and the outcome is always the same: choose the old kernel, all is well; choose the new one, then wait for the lock-up.

Comment 14 Nem Lawford 2018-07-23 00:55:32 UTC

I have been watching this bug report for several months now. I was experiencing the same issues with the same kernel version. I even had the problem where audio would loop over the same few seconds if I played a sound/video file after getting into the desktop. It would consistently crash within a minute or two.

I have a Gigabyte Z170 Gaming 7 rev 1.0 motherboard running a Core i7 6700K. Although I have a K-variant CPU, I have never overclocked the system, so I don't believe that to be a factor. I decided to try updating the BIOS to see if that would help. That seems to have worked. I haven't tried running for hours or days to see if it remains stable for an extended period of time, but after several reboot cycles, I have yet to experience any freezing or hanging.

One might reasonably argue that this should still be a bug because earlier Linux kernels don't have this behavior, nor does Windows 10. If someone is looking for a possible fix and they feel comfortable updating their BIOS, I might recommend trying that. I understand that not everyone feels comfortable updating their BIOS or is even capable or permitted to do so. If that's an option though, I would be interested in seeing if anyone else sees similar positive results.

Comment 15 David Anderson 2018-07-23 12:34:35 UTC

Mine's an Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz. I did, in fact, try updating the BIOS to the latest before my most recent post, and unfortunately it hasn't resolved the problem.

Comment 16 David Anderson 2018-07-23 22:50:26 UTC

If someone is able to provide bisected kernels for me to test, I'm happy to test them. As I say, the behaviour is very consistent. If I run anything after kernel-4.14.18-300.fc27.x86_64 (including current F28 kernels), it'll lock up before the end of the day. If I run that kernel, it won't. (Actually I don't know that I ran anything after kernel-4.14.18-300.fc27.x86_64 and before kernel-4.15.6-300.fc27.x86_64, if there were such kernels in Fedora).
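
To show why bisected builds would help, here is a minimal Python sketch of the search itself (added for illustration; no such builds were actually provided in this thread). It assumes an ordered list of builds whose first entry is known good and last entry is known bad, and that every build after the first bad one is also bad. The names `first_bad`, `locks_up`, and the placeholder builds `build-1` to `build-6` are all hypothetical.

```python
def first_bad(candidates, is_bad):
    """Binary-search an ordered list for the first item where is_bad() is true.

    Assumes candidates[0] is known good, candidates[-1] is known bad, and that
    every build after the first bad one is also bad.
    """
    good, bad = 0, len(candidates) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(candidates[mid]):
            bad = mid
        else:
            good = mid
    return candidates[bad]


# Hypothetical test builds between the known-good and known-bad kernels.
# "build-1" .. "build-6" are placeholders, not real package names.
builds = (
    ["4.14.18-300.fc27"]
    + [f"build-{i}" for i in range(1, 7)]
    + ["4.15.6-300.fc27"]
)

# Stand-in for "boot it and see whether it locks up". Here we simply pretend
# the regression landed in build-4.
tested = []


def locks_up(build):
    tested.append(build)
    return builds.index(build) >= builds.index("build-4")


print(first_bad(builds, locks_up), "found after", len(tested), "test boots")
```

Because the lock-up is intermittent (around half of boots, or only after some hours of use), a single clean boot is weak evidence that a build is good. In practice each "good" verdict would need several boots or a full day of use before moving on.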

Comment 17 Steve Kierstead 2018-08-20 19:51:38 UTC

I'm having what sounds like the same symptoms.  My trouble kernel is 4.15.0-30 and I find that I can get back to sanity by reverting to something older (we chose 4.13.0-41 - I won't bother you with the whole explanation).

Since I was unable to find any clues in any of the regular logs, I set up console logging on a serial port and was able to get backtraces the next time it froze.  Since we are an Ubuntu shop I've posted that in my Launchpad bug report, here: https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe/+bug/1788024

I hope the extra information in those traces is helpful to someone.

Comment 18 David Anderson 2018-08-20 22:32:24 UTC

FWIW, today I again ventured out from the safety of 4.14.18-300.fc27.x86_64 (still working beautifully) to try the latest, 4.17.14-202.fc28.x86_64, and yes, a few hours later, it locked up.

Comment 19 Leigh Orf 2018-08-24 15:29:09 UTC

I give up; I'm switching to CentOS after 15 or more years of being on the Fedora train. I can't be 100% sure it's not some weird hardware thing on my end, but I doubt it. I can't be dealing with disk corruptions because I'm having to hard reboot every other day. If you don't hear from me again, going back to the 3.10 kernel that is the latest with CentOS "solved" my problem, which was hangs that occurred both during interactive use, and overnight just doing its thing, neither of which left any kernel trace messages to help me troubleshoot. I did run memtest86 on my memory and it came up clean. I am on a newer AMD machine.

Comment 20 Leigh Orf 2018-08-25 14:27:41 UTC

(In reply to Leigh Orf from comment #19)

Well, my machine was locked up this morning with CentOS - so I am next going to replace the only thing I haven't replaced yet, the power supply. I now do not believe my problems have been kernel related. Son of a...

Comment 21 Laura Abbott 2018-10-01 21:27:27 UTC

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 28 kernel bugs.
 
Fedora 28 has now been rebased to 4.18.10-300.fc28.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you have moved on to Fedora 29, and are still experiencing this issue, please change the version to Fedora 29.
 
If you experience different issues, please open a new bug report for those.

Comment 22 David Anderson 2018-10-04 16:27:34 UTC

After a couple more happy lock-up-free weeks running on kernel-4.14.18-300.fc27.x86_64, today I again ventured to "dnf update" and then try out the latest available F28 kernel. This was kernel-4.18.10-200.fc28.x86_64. A few hours later, it locked up. So, yes, there's been no change in the situation. After some point in the sequence, all kernels predictably lock up within a few hours of use, whereas 4.14.18-300.fc27.x86_64 does not.

Comment 23 David Anderson 2018-12-15 11:21:43 UTC

I've been running 4.19.4-300.fc29.x86_64 for a bit now.... without lockups. Hurrah. Whatever caused the lock-ups, now causes it no more, and I can now stop using the Fedora 27 kernel.

I do get this issue on 3 separate machines, though: https://bugzilla.redhat.com/show_bug.cgi?id=1654803

Comment 24 Justin M. Forbes 2019-01-29 16:25:55 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 28 kernel bugs.

Fedora 28 has now been rebased to 4.20.5-100.fc28.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 29, and are still experiencing this issue, please change the version to Fedora 29.

If you experience different issues, please open a new bug report for those.

Comment 25 Justin M. Forbes 2019-02-21 21:10:38 UTC

*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.
