Description of problem:
Since the update to kernel-4.15.6-300.fc27.x86_64, my computer (~2 years old, running current Fedora versions throughout) locks up frequently. Around half the time, it locks up during boot. By "locks up" I mean that no further interaction of any sort is possible (mouse pointer frozen, not responsive to any keypresses). If it happens after boot whilst sound is playing, whatever is in the sound buffer replays continually on loop. This has happened around 10 times in the last week. It has never happened before, and if I boot from kernel-4.14.18-300.fc27.x86_64 it does not happen (either on boot, or later). Unfortunately, nothing at all is logged in /var/log/messages when this happens. It really is as if it's a complete lock-up.

Version-Release number of selected component (if applicable):
kernel-4.15.6-300.fc27.x86_64

How reproducible:
Around half the time on boot; or, infrequently, if it survives the boot process.
Since writing the report, I've booted the machine 5 times; 3 times to the 4.14.18 kernel without problems, twice to the 4.15.6 kernel - it locked up before reaching the desktop both times.
4.15.6-300 locked up the same way for me within minutes of starting the GUI session, twice in a row. Rebooting to the older 4.15.4-300.fc27.x86_64 makes the issue go away. This is a relatively old Intel computer with built-in Intel graphics. I don't see anything that seems relevant towards the end of "journalctl --boot=-1".
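For anyone else digging through logs after one of these lock-ups, the journal from the crashed boot can be pulled out roughly like this (a sketch only; it assumes a systemd-based system, and the persistence step is only needed if the journal is volatile):

```shell
# Sketch: inspect kernel messages from the boot that locked up.
# List the boots the journal knows about; -1 refers to the previous boot,
# -k restricts output to kernel messages, -e jumps to the end.
journalctl --list-boots
journalctl -k -b -1 -e

# If only the current boot is listed, the journal is stored in memory and
# is lost on a hard reset. Creating /var/log/journal makes it persistent:
sudo mkdir -p /var/log/journal
sudo systemctl restart systemd-journald
```

With a hard lock-up the last seconds before the freeze often never reach disk, so an empty tail here does not rule out a kernel problem.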
My behavior is consistent with what is being reported. Mine sometimes boots and sometimes does not, but fairly quickly it locks up accessing my RAID device, while my other (non-RAID) disk is still functioning. Here is what I get:

[23842.276861] WARNING: CPU: 2 PID: 2249 at kernel/rcu/tree.c:2792 rcu_process_callbacks+0x4cb/0x4e0
[23842.276905] Modules linked in: nfsd auth_rpcgss nfs_acl lockd grace sunrpc vhost_net vhost tap xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables cfg80211 rfkill it87 hwmon_vid xfs edac_mce_amd kvm_amd raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor kvm irqbypass crct10dif_pclmul crc32_pclmul mt2131 ghash_clmulni_intel raid6_pq libcrc32c lgdt330x s5h1409 cx88_dvb cx88_vp3054_i2c ir_rc5_decoder rc_hauppauge ir_lirc_codec lirc_dev tuner_simple tuner_types tda9887 tda8290 tuner k10temp cx8802 cx8800 cx88_alsa cx88xx sp5100_tco i2c_piix4 cx25840 pl2303 cx23885 altera_ci tda18271 altera_stapl videobuf2_dma_sg
[23842.277207] m88ds3103 tveeprom cx2341x videobuf2_memops snd_hda_codec_realtek snd_usb_audio videobuf2_dvb snd_hda_codec_hdmi snd_hda_codec_generic videobuf2_v4l2 snd_usbmidi_lib rc_ati_x10 snd_rawmidi videobuf2_core snd_hda_intel dvb_core ati_remote snd_hda_codec rc_core v4l2_common videodev snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm media i2c_mux snd_timer snd soundcore shpchp acpi_cpufreq radeon mpt3sas crc32c_intel raid_class scsi_transport_sas uas i2c_algo_bit usb_storage drm_kms_helper ttm drm r8169 mii
[23842.277402] CPU: 2 PID: 2249 Comm: ml5 Not tainted 4.15.6-300.fc27.x86_64 #1
[23842.277430] Hardware name: Gigabyte Technology Co., Ltd. To be filled by O.E.M./F2A85X-D3H, BIOS F3 04/08/2013
[23842.277471] RIP: 0010:rcu_process_callbacks+0x4cb/0x4e0
[23842.277494] RSP: 0000:ffff95c13ed03f08 EFLAGS: 00010002
[23842.277516] RAX: ffffffffffffd800 RBX: ffff95c13ed21980 RCX: dead000000000201
[23842.277545] RDX: 0000000000000001 RSI: ffff95c13ed03f10 RDI: ffff95c13ed219b8
[23842.277573] RBP: ffffffffaa25e900 R08: ffffffffaa2cb040 R09: 0000000000000000
[23842.277601] R10: 0000000000000098 R11: 0000000000000000 R12: ffff95c13ed219b8
[23842.277629] R13: 7fffffffffffffff R14: 0000000000000246 R15: ffffffffffffffff
[23842.277657] FS: 00007f2494743700(0000) GS:ffff95c13ed00000(0000) knlGS:0000000000000000
[23842.277689] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[23842.277713] CR2: 0000000004412ff7 CR3: 00000007da606000 CR4: 00000000000406e0
[23842.277742] Call Trace:
[23842.277761] <IRQ>
[23842.278916] __do_softirq+0xe7/0x2cb
[23842.280061] irq_exit+0xf1/0x100
[23842.281204] smp_apic_timer_interrupt+0x6c/0x120
[23842.282344] apic_timer_interrupt+0x87/0x90
[23842.283472] </IRQ>
[23842.284584] RIP: 0033:0x7f24a8ef01a6
[23842.285696] RSP: 002b:00007f2494742700 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff11
[23842.286837] RAX: 0000000000000002 RBX: 00007f2488012f30 RCX: 0000000000000001
[23842.287994] RDX: 0000000000000001 RSI: 0000000000000046 RDI: 0000000000000002
[23842.289157] RBP: 0000000000000470 R08: 0000000000000468 R09: 0000000000000008
[23842.290320] R10: 00007f248801c660 R11: 00000000fffffffe R12: 00007f2488012bd0
[23842.291468] R13: 00007f24a8f0b010 R14: 00007f2488011ef0 R15: 00007f2494742870
[23842.292596] Code: ff 48 8b 05 90 f9 13 01 48 89 83 b0 00 00 00 e9 c8 fd ff ff 0f 0b e9 c8 fb ff ff 4c 89 f6 4c 89 e7 e8 aa b3 78 00 e9 18 fc ff ff <0f> 0b e9 f2 fd ff ff 0f 0b e9 e9 fc ff ff e8 d2 0d f9 ff 66 90
It appears to be this one. Summary: there is a bug when dealing with disk devices that are having some issues. https://bugzilla.kernel.org/show_bug.cgi?id=198861
Roger Heflin, if you get an oops stack trace and are using SCSI RAID, then you appear to have a different issue, so you'll want to file a separate report. In the case reported here, nothing is logged at all, and I am not using SCSI RAID. Just to note that the same problem is happening for me in 4.15.7-300.fc27.x86_64 as in 4.15.6-300.fc27.x86_64.
The listed bug is not fixed in any kernel.org kernels as of yet. If I did not have my /boot and rootvg on another disk controller (AHCI), this would probably lock up completely and take out my system. It certainly would not be able to log any messages about the issue, since the disk device would be locked up. I am using MD-RAID on a JBOD SAS controller, and the disks on that one act up sometimes (possibly a cabling issue). If on the working kernel you get random ATA retries (not the normal initial disk status messages), then from what the bug report says, this bug could be it. My disks on the AHCI controller are newer and better behaved than the disks on the 2nd controller. Most of the time on boot-up I will get an ATA error on one of these disks. If you don't normally get these errors, then this is probably not your bug. And I don't really need to report it, since the bug I appear to have is already supposed to be fixed and is queued for inclusion in the next kernels. Here was the error I got just prior to the lockup:

Mar 11 14:13:50 rahrah kernel: [   32.234447] ata8.00: irq_stat 0x08000000, interface fatal error
Mar 11 14:13:50 rahrah kernel: [   32.235919] ata8: SError: { Handshk }
Mar 11 14:13:50 rahrah kernel: [   32.237299] ata8.00: failed command: WRITE FPDMA QUEUED
Mar 11 14:13:50 rahrah kernel: [   32.238763] ata8.00: cmd 61/04:b8:10:e0:cd/00:00:5a:00:00/40 tag 23 ncq dma 2048 out
Mar 11 14:13:50 rahrah kernel: [   32.238763]          res 40/00:c8:08:58:19/00:00:07:01:00/40 Emask 0x10 (ATA bus error)
Mar 11 14:13:50 rahrah kernel: [   32.241601] ata8.00: status: { DRDY }
Mar 11 14:13:50 rahrah kernel: [   32.242991] ata8.00: failed command: READ FPDMA QUEUED
Mar 11 14:13:50 rahrah kernel: [   32.244443] ata8.00: cmd 60/08:c8:08:58:19/00:00:07:01:00/40 tag 25 ncq dma 4096 in
Mar 11 14:13:50 rahrah kernel: [   32.244443]          res 40/00:c8:08:58:19/00:00:07:01:00/40 Emask 0x10 (ATA bus error)
Mar 11 14:13:50 rahrah kernel: [   32.247230] ata8.00: status: { DRDY }
Mar 11 14:13:50 rahrah kernel: [   32.248627] ata8: hard resetting link
Mar 11 14:13:50 rahrah kernel: [   32.711159] ata8: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Mar 11 14:13:50 rahrah kernel: [   32.713490] ata8.00: NCQ Send/Recv Log not supported
Mar 11 14:13:50 rahrah kernel: [   32.715717] ata8.00: NCQ Send/Recv Log not supported
Mar 11 14:13:50 rahrah kernel: [   32.716683] ata8.00: configured for UDMA/133
Mar 11 14:13:50 rahrah kernel: [   32.717696] ata8: EH complete
> If on the working kernel you get random ATA retries (not the normal initial disk status messages) I don't.
No change with 4.15.8-300.fc27.x86_64 - still locks up during the boot process.
Experiencing the same problem on kernel-4.15.9-300.fc27.x86_64. The only working kernel right now is kernel-4.15.3-300.fc27.x86_64. I did a diff as follows:

localhost:/boot# diff -ruN config-4.15.3-300.fc27.x86_64 config-4.15.9-300.fc27.x86_64
--- config-4.15.3-300.fc27.x86_64	2018-02-13 18:16:14.000000000 +0100
+++ config-4.15.9-300.fc27.x86_64	2018-03-12 18:27:50.000000000 +0100
@@ -1,6 +1,6 @@
 #
 # Automatically generated file; DO NOT EDIT.
-# Linux/x86_64 4.15.3-300.fc27.x86_64 Kernel Configuration
+# Linux/x86_64 4.15.9-300.fc27.x86_64 Kernel Configuration
 #
 CONFIG_64BIT=y
 CONFIG_X86_64=y
@@ -5123,7 +5123,7 @@
 CONFIG_DRM_RADEON=m
 CONFIG_DRM_RADEON_USERPTR=y
 CONFIG_DRM_AMDGPU=m
-# CONFIG_DRM_AMDGPU_SI is not set
+CONFIG_DRM_AMDGPU_SI=y
 CONFIG_DRM_AMDGPU_CIK=y
 CONFIG_DRM_AMDGPU_USERPTR=y
 # CONFIG_DRM_AMDGPU_GART_DEBUGFS is not set
localhost:/boot#

The only change I see is that CONFIG_DRM_AMDGPU_SI=y has been enabled. Is there a fix yet?
Created attachment 1410445 [details] fpaste --sysinfo output
Installing kernel 4.15.11-300.fc27.x86_64 from https://koji.fedoraproject.org/koji/buildinfo?buildID=1060016 seems to solve the problem(s).
No change with kernel-4.15.10-300.fc27.x86_64 (I haven't yet made time to test the 4.15.11 build).
I've carried on trying out the latest kernel-du-jour with some regularity. The latest is kernel-4.17.4-200.fc28.x86_64 (I'm running Fedora 28 now). Basically, anything I boot past kernel-4.14.18-300.fc27.x86_64 will cause a CPU lock-up within an hour or two. If I boot kernel-4.14.18-300.fc27.x86_64, I can use it all day with no problems. I've hesitated to post that until now because you always think "maybe it's random...", but it's gone on long enough now, across so many updates and kernel versions, and the outcome is always the same: choose the old kernel, all is well; choose the new one, then wait for the lock-up.
I have been watching this bug report for several months now. I was experiencing the same issues with the same kernel version. I even had the problem where audio would loop over the same few seconds if I played a sound/video file after getting into the desktop. It would consistently crash within a minute or two. I have a Gigabyte Z170 Gaming 7 rev 1.0 motherboard running a Core i7 6700K. Although I have a K-variant CPU, I have never overclocked the system, so I don't believe that to be a factor.

I decided to try updating the BIOS to see if that would help. That seems to have worked. I haven't tried running for hours or days to see if it remains stable for an extended period, but after several reboot cycles I have yet to experience any freezing or hanging. One might reasonably argue that this should still be a bug, because earlier Linux kernels don't have this behavior, nor does Windows 10.

If someone is looking for a possible fix and feels comfortable updating their BIOS, I might recommend trying that. I understand that not everyone feels comfortable updating their BIOS, or is even capable or permitted to do so. If it is an option, though, I would be interested in seeing whether anyone else gets similar positive results.
Mine's an Intel(R) Core(TM) i7-5500U CPU @ 2.40GHz. I did, in fact, try updating the BIOS to the latest before my most recent post, and unfortunately it hasn't resolved the problem.
If someone is able to provide bisected kernels for me to test, I'm happy to test them. As I say, the behaviour is very consistent. If I run anything after kernel-4.14.18-300.fc27.x86_64 (including current F28 kernels), it'll lock up before the end of the day. If I run that kernel, it won't. (Actually, I don't know that I ran anything after kernel-4.14.18-300.fc27.x86_64 and before kernel-4.15.6-300.fc27.x86_64, if there were such kernels in Fedora.)
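In case it helps anyone set that up, the usual upstream bisection between the last good and first bad series can be sketched as follows. This is an outline only, not runnable as-is: it assumes an upstream git checkout and that each candidate build is installed and boot-tested on the affected machine.

```shell
# Outline: bisect the v4.14..v4.15 window upstream. Each iteration needs a
# build, install, and boot test on the affected hardware, so this cannot be
# automated without a way to detect the lock-up.
git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
cd linux
git bisect start
git bisect bad v4.15        # first series observed to lock up
git bisect good v4.14       # last series known to be stable

# For each revision git checks out: build it, boot it, and report back.
make olddefconfig && make -j"$(nproc)" && sudo make modules_install install
# ...reboot into the test kernel and run for the usual few hours...
git bisect bad              # if it locked up
git bisect good             # if it stayed up
# Repeat (roughly log2 of ~14k commits, so ~14 rounds) until git names
# the first bad commit, then:
git bisect reset
```

Given a lock-up that takes hours to trigger, each round is a day at worst, so the whole window is around two weeks of elapsed testing.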
I'm having what sounds like the same symptoms. My trouble kernel is 4.15.0-30 and I find that I can get back to sanity by reverting to something older (we chose 4.13.0-41 - I won't bother you with the whole explanation). Since I was unable to find any clues in any of the regular logs, I set up console logging on a serial port and was able to get backtraces the next time it froze. Since we are an Ubuntu shop I've posted that in my Launchpad bug report, here: https://bugs.launchpad.net/ubuntu/+source/linux-signed-hwe/+bug/1788024 I hope the extra information in those traces is helpful to someone.
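For others wanting to capture traces from a hard lock-up the same way, a serial-console setup can be sketched roughly as below. The console= parameters are standard kernel ones, but the port (ttyS0), baud rate, and capture device are assumptions to adapt to your hardware.

```shell
# Sketch: route kernel console output to the first serial port so that an
# oops/backtrace survives a hard lock-up. ttyS0 and 115200n8 are assumptions.
#
# Add to the kernel command line (e.g. GRUB_CMDLINE_LINUX in
# /etc/default/grub on Fedora):
#
#   console=tty0 console=ttyS0,115200n8 ignore_loglevel
#
# Then regenerate the grub config and reboot:
sudo grub2-mkconfig -o /boot/grub2/grub.cfg

# On a second machine connected to that port, log everything, e.g.:
screen -L /dev/ttyUSB0 115200
```

Because serial output is pushed out synchronously, it often captures the final backtrace that never makes it to the on-disk logs.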
FWIW, today I again ventured out from the safety of 4.14.18-300.fc27.x86_64 (still working beautifully) to try the latest, 4.17.14-202.fc28.x86_64, and yes, a few hours later, it locked up.
I give up; I'm switching to CentOS after 15 or more years of being on the Fedora train. I can't be 100% sure it's not some weird hardware thing on my end, but I doubt it. I can't be dealing with disk corruptions because I'm having to hard reboot every other day. If you don't hear from me again, going back to the 3.10 kernel that is the latest with CentOS "solved" my problem, which was hangs that occurred both during interactive use, and overnight just doing its thing, neither of which left any kernel trace messages to help me troubleshoot. I did run memtest86 on my memory and it came up clean. I am on a newer AMD machine.
(In reply to Leigh Orf from comment #19) > I give up; I'm switching to CentOS after 15 or more years of being on the > Fedora train. I can't be 100% sure it's not some weird hardware thing on my > end, but I doubt it. I can't be dealing with disk corruptions because I'm > having to hard reboot every other day. If you don't hear from me again, > going back to the 3.10 kernel that is the latest with CentOS "solved" my > problem, which was hangs that occurred both during interactive use, and > overnight just doing its thing, neither of which left any kernel trace > messages to help me troubleshoot. I did run memtest86 on my memory and it > came up clean. I am on a newer AMD machine. Well, my machine was locked up this morning with CentOS - so I am next going to replace the only thing I haven't replaced yet, the power supply. I now do not believe my problems have been kernel related. Son of a...
We apologize for the inconvenience. There are a large number of bugs to go through, and several of them have gone stale. Because of this, we are doing a mass bug update across all of the Fedora 28 kernel bugs. Fedora 28 has now been rebased to 4.18.10-300.fc28. Please test this kernel update (or newer) and let us know if your issue has been resolved, or if it is still present with the newer kernel. If you have moved on to Fedora 29 and are still experiencing this issue, please change the version to Fedora 29. If you experience different issues, please open a new bug report for those.
After a couple more happy lock-up-free weeks running on kernel-4.14.18-300.fc27.x86_64, today I again ventured to "dnf update" and then tried out the latest available F28 kernel, kernel-4.18.10-200.fc28.x86_64. A few hours later, it locked up. So, yes, there's been no change in the situation: after some point in the sequence, all kernels predictably lock up within a few hours of use, whereas 4.14.18-300.fc27.x86_64 does not.
I've been running 4.19.4-300.fc29.x86_64 for a bit now, without lockups. Hurrah. Whatever caused the lock-ups no longer does, and I can now stop using the Fedora 27 kernel. I do get this issue on 3 separate machines, though: https://bugzilla.redhat.com/show_bug.cgi?id=1654803
*********** MASS BUG UPDATE ************** We apologize for the inconvenience. There are a large number of bugs to go through, and several of them have gone stale. Because of this, we are doing a mass bug update across all of the Fedora 28 kernel bugs. Fedora 28 has now been rebased to 4.20.5-100.fc28. Please test this kernel update (or newer) and let us know if your issue has been resolved, or if it is still present with the newer kernel. If you have moved on to Fedora 29 and are still experiencing this issue, please change the version to Fedora 29. If you experience different issues, please open a new bug report for those.
*********** MASS BUG UPDATE ************** This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.