Bug 1598462
Summary: | kernel 4.17.4, 4.17.5, 4.17.6 hang | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Didier G <didierg-divers> |
Component: | kernel | Assignee: | Kernel Maintainer List <kernel-maint> |
Status: | CLOSED ERRATA | QA Contact: | Fedora Extras Quality Assurance <extras-qa> |
Severity: | high | Docs Contact: | |
Priority: | unspecified | ||
Version: | 28 | CC: | airlied, alex.bramford, aliakc, anatoli, antonio.montagnani, avinash, berend.de.schouwer, bmrhbugzilla, brian, bskeggs, bugzilla.redhat, chrism, chunnayya, cks-rhbugzilla, CreeVinicio, dan.cermak, daniel2196, dimhen, dominik, doug.hs, dswbike, dwlegg, ego.cordatus, ewk, fedora, gabriele.svelto, gbcox, goodyca48, hdegoede, horsley1953, ichavero, itamar, jarodwilson, jblecker, jglisse, john.j5live, jonathan, josef, kapustka.k, kernel-maint, labbott, leigh.orf, linville, luigic, marco.guazzone, mchehab, media, mihai, mike, mischmitz, mjg59, ncross, ngaywood, piratmac, rolf, ryan, samoht0-bugzilla, samuel-rhbugs, stanley.king, steved, thacuop |
Target Milestone: | --- | Keywords: | Reopened |
Target Release: | --- | ||
Hardware: | x86_64 | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | kernel-4.17.7-200.fc28 | Doc Type: | If docs needed, set a value |
Last Closed: | 2018-07-29 00:44:21 UTC | Type: | Bug |
Description
Didier G
2018-07-05 14:39:53 UTC
I set up a netconsole and caught two occurrences of the hang.
Created attachment 1457113 [details]
hang occurrence 1 caught on netconsole
Created attachment 1457115 [details]
hang occurrence 2 caught on netconsole
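For anyone who wants to capture the same kind of log, here is a minimal sketch of building the netconsole module parameter. It follows the parameter layout from the kernel's netconsole documentation (src-port@src-ip/dev,tgt-port@tgt-ip/tgt-mac). The helper name `netconsole_param`, the default port 6666 and the addresses in the example are placeholders of mine, not values from this report.

```shell
#!/bin/sh
# netconsole_param SRC_IP DEV TGT_IP [TGT_MAC] [PORT]
# Print the module parameter string for netconsole.
# Hypothetical helper; all addresses passed in are the caller's own.
netconsole_param() {
    src_ip=$1          # IP of the machine that hangs
    dev=$2             # its network interface, e.g. eth0
    tgt_ip=$3          # IP of the machine that collects the log
    tgt_mac=${4:-}     # optional; left empty the kernel uses broadcast
    port=${5:-6666}    # same UDP port on both ends in this sketch
    printf 'netconsole=%s@%s/%s,%s@%s/%s\n' \
        "$port" "$src_ip" "$dev" "$port" "$tgt_ip" "$tgt_mac"
}
```

On the machine that hangs, something like `modprobe netconsole "$(netconsole_param 192.168.1.10 eth0 192.168.1.20)"` run as root would load the module with that parameter, and the other machine needs a UDP listener on the chosen port.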
I confirm this issue. The kernel hung twice during my nightly backup session.

The first time it hung during my regular nightly backup, which writes changes from my USB stick to my external USB HDD. rsync worked, md5sum'ing the files afterwards worked, but dd'ing the free space caused the kernel to hang. In the morning the entire machine was basically dead until I rebooted. No kernel dump, no entries in journalctl. Luckily no files were harmed, but I had to run fsck to make sure the filesystem was still error-free.

The next day I ran another backup because the first one had failed. I cleaned up my system beforehand and uninstalled all unnecessary packages. During the night the same issue happened at the same step. I had to fsck my external HDD once again and am now in the middle of rescuing the backup to yet another medium, just in case.

The issues came up after switching from 4.17.3 to 4.17.4. Other issues that appeared with this kernel:
- The middle mouse button stopped working.
- Occasional crashes of random programs: xfce4-terminal crashed all instances 2 times, google-chrome crashed a couple of times, and some other programs simply exited. It is not so much a crash as a "close".
- Switching VirtualBox to fullscreen froze the entire system (even though akmod built the modules successfully).

Switching back to 4.17.3 solved all issues. There seem to be some serious regressions in 4.17.4. This is an absolute no-go for a so-called stable kernel on a working machine.

Both of the backtraces have the out-of-tree VirtualBox drivers installed. Can you reproduce the issue without those installed?

After I removed VB from my system I got two kernel hangs:
- the first while copying a file to an encfs-encrypted directory on an NTFS-formatted USB 3.1 external disk using Nautilus
- the second while reading a file from an encrypted directory on an NTFS-formatted USB 3.1 external disk
Created attachment 1457583 [details]
hang occurrence 3 VB removed caught on netconsole
Created attachment 1457585 [details]
hang occurrence 4 VB removed caught on netconsole
Problem still exists with 4.17.5 Created attachment 1457611 [details]
hang occurrence 5
If you have a working and a non-working kernel, the best option is going to be to run a bisect between 4.17.3 and 4.17.4. I don't see any patches that jump out as causing that particular bug to be hit.

https://bugzilla.redhat.com/show_bug.cgi?id=1598989 seems to be a duplicate of this bug.

It would be interesting to know what storage is involved in the bugs encountered by Ali Akcaagac in comment 4 and by Jens Lody in bz1598989.

To investigate this bug I wrote a basic test script that copies a 4.7 GB file 100 times with the cp command, overwriting the target on each copy. First I eliminated encfs: the hangs also happen when writing to a non-encrypted directory.
- Booting 4.17.3: test on external USB NTFS partition, NO problem
- Booting 4.17.4: test on the same NTFS partition, kernel HUNG after only a few iterations
- Booting 4.17.4: test on internal SATA EXT4 partition, NO problem
- Booting 4.17.4: test on internal SATA NTFS partition, kernel HUNG after only a few iterations
- Booting 4.17.3: test on internal SATA NTFS partition, NO problem

In summary, in my case:
4.17.3: OK on any partition
4.17.4: OK on EXT4, HANG on NTFS, both internal SATA and external USB

For me the (undefined) issue is reproducible:
1) rsyncing to an external hard disk (even large chunks of files) works. rsyncing 250 GB and more works perfectly.
2) md5sum'ing the files afterwards (to compare source with destination) works perfectly.
3) dd'ing free space is *always* the point where the kernel seems to crash (without a log).

dd'ing free space on an internal hard disk works, no issues. dd'ing free space on an external USB stick works, no issues. dd'ing free space on a USB hard disk (the one that drew my attention to this issue) always causes the kernel to die, right after a few MB are written.

So the issue can be any of these:
1) kernel issue
2) heavy load issue (there was a bug report about this for 4.17.4 on bugzilla.redhat.com)
3) broken hard drive (but then why does the kernel die?)
4) USB issue within the drivers (regression)
5) FS issue (XFS in particular is used, but then rsync and md5sum'ing would cause the same issues). But we seem to be using different filesystems here (XFS, EXT4, NTFS), so this is not an FS issue.

I for my part will wait for 4.17.5 to show up, so I can do the following steps:
1) repartition the external HD
2) format the external HD
3) rsync and md5sum the data
4) dd the free space

Since dd'ing is a heavy task, this would explain the other bug report, https://bugzilla.redhat.com/show_bug.cgi?id=1599101, where the reporter writes about kernel oopses under "heavy" system load. This of course depends on the system used. Heavy load on a less powerful system means something different from the same tasks on a powerful system. So 'dd' can count as a heavy-load process on my system. This might also explain why I see programs "closing" once the system gets into heavy-load tasks. This is of course just a description of what I have seen here since I upgraded to 4.17.4.

The changelog for 4.17.5 (on kernel.org) describes some USB regressions fixed, so it can be anything. There are also 50 Fedora-specific patches applied to the kernel, any of which could trigger the issue in combination with the upstream changes. A free somewhere in the kernel or a driver close somewhere can easily turn one of these patches from working to critical. https://src.fedoraproject.org/rpms/kernel/tree/master

4.17.5 addresses a bunch of FS- and XFS-related as well as timer-related issues. https://koji.fedoraproject.org/koji/buildinfo?buildID=1104321 I have a good feeling (reading the CVS's) that this version may solve various issues, since timing is involved everywhere...

Same hang with 4.17.6. The last hang-free kernel is 4.17.3.

(In reply to Laura Abbott from comment #11)
> If you have working and non-working kernel, the best option is going to be
> to run a bisect between 4.17.3 and 4.17.4. I don't see any patches that jump
> out as causing that particular bug to be hit.

Where can I find the right howto for bisecting a Fedora kernel?

I've been tracking down this issue over the last few days, and have reduced the suspicious commits to the following interval:
- 54428453efda4c1c35ca75a0a5aa170de87ff1b0 (x86/e820: put !E820_TYPE_RAM regions into memblock.reserved) -- bad, hang reproduces.
- 323252c83194268cadefc2c0ea55827bf4dd04b8 (i2c: gpio: initialize SCL to HIGH again) -- good, no hangs after a considerable amount of trying

Thanks for the bisect range. There was a thread going on linux-mm so I gave this information to the maintainers. We'll see if anyone responds.

Having been running "block: Fix cloning of requests with a special payload" for a few busy hours without incident, I am now reasonably convinced that "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved" is the culprit.

I'm also having a problem with system hang/freeze and it definitely did not appear with 4.16.6. It is appearing in 4.17.6, 4.17.5 and 4.17.4. I'm currently testing 4.17.2 and 4.17.3 to see if it appears there.

I'd like to report that I just had a freeze again today (4.17.5-200) during the backup process...
1) rsync worked, 250 GB
2) md5sum'ing worked, 250 GB
3) dd'ing crashed...
So again it's crashing at the dd'ing stage... I somehow had the feeling that the XFS fixes and the timing fixes introduced with 4.17.5-200 might have solved the issue. But comment 18 sounds pretty realistic here. I hope I have spare time to look into creating some sort of backtrace. This urgently needs to get fixed. Today I need to spend the third time in a row manually md5sum'ing and checking all the files of the backup for consistency. :(

I've had this 4 times over the last 2 days, each time while there is heavy disk I/O going on (while running an automated backup, and while compiling a large piece of software).
One time it hung it managed to save "kernel BUG at mm/page_alloc.c:2019" to syslog, which looks like the other logs here. I was not seeing this prior to updating to 4.17.5-200.fc28.x86_64. Same thing here. System hangs while doing rdiff-backup. 4.17.5-200.fc28.x86_64 kernel BUG at mm/page_alloc.c:2019! no other baktrace available. This needs urgent attention as going to STABLE FC28 made my system completely instable with daily hangs. i may have the same or a related issue ... since upgrading from kernel 4.17.2 to 4.17.4, my computer freezes up completely each night, and I have to hard reset. i was able to determine this happens when my nightly script runs nwipe on a small (~4.5GB) internal partition. am i the only one who sees writing to internal partitions affected? the software and configuration for nwipe have not changed the command that's triggering the hang is (run as root): nwipe --nogui --nowait --autonuke --method=zero --verify=off /dev/sda2; I'm able to reproduce this consistently, although it's not fun because it hangs up completely all at once and there's no way to recover. i tried running the command on other and smaller partitions; i noticed that for a smaller partition it didn't hang -- it looked like it was about to, as the computer would be not very responsive for a few seconds, but then it recovered, and the command line showed "segmentation fault (core dumped)" and nwipe stopped. i reproduced this a few times on that partition, but on maybe the 4th time i wasn't so lucky and instead got a hang. i'm therefore wondering if the "segmentation fault (core dumped)" is a clue to this bug. Following the mailing lists, this bug has been identified and fixed, and will be included in 4.17.7. (In reply to Robert Holmes from comment #26) > Following the mailing lists, this bug has been identified and fixed, and > will be included in 4.17.7. Thanks for the info! This is really a showstopper. 
Apart from NTFS also happening here on external USB stick with exFAT via FUSE. Hallelujah! This is exactly what I am seeing backing up to encrypted USB 3.0 drives. 4.17.3 seems solid, 4.17.4 stays stable for up to 90 minutes during heavy load, 4.17.5 crashes in less than 10 minutes under heavy load. Looking forward to 4.17.7. Will keep my eyes open and avoid updating to 4.17.6 until 4.17.7 is available (since otherwise 4.17.3 will roll off my list of available kernels unless I figure out how to keep it around). @Robert Holmes great news. Do you have a link to the kernel archives? The kernel archives for the memory management issues can be followed here: https://marc.info/?l=linux-mm&r=1&b=201807&w=2 The conversation where Linus tests the issue is here: https://marc.info/?t=153152578000001&r=1&w=2&n=16 The report made by Laura can be found here: https://marc.info/?t=153117953400001&r=1&w=2&n=3 All on the same ML. *** Bug 1601176 has been marked as a duplicate of this bug. *** *** Bug 1600736 has been marked as a duplicate of this bug. *** *** Bug 1601407 has been marked as a duplicate of this bug. *** >Last hang free kernel is 4.17.3<
I can confirm that. 4.17.3 is the last stable one.
>kernel hang during access in read or write to USB 3.1 external drive
This disk is formated in NTFS and directories are encrypted using encfs<
in my case its probably USB3 too, but btrfs, unencrypted,
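To check a machine against the versions named in this report, a small sketch follows. It assumes the affected range is 4.17.4 through 4.17.6, with 4.17.3 the last good kernel as reported above, and it relies on GNU `sort -V`. The function names are made up for illustration, and versions outside that range are deliberately not judged.

```shell
#!/bin/sh
# version_le A B: succeed if version A sorts at or before version B
# (GNU sort -V compares dotted version numbers numerically).
version_le() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$1" ]
}

# kernel_status VERSION: print "affected" for 4.17.4..4.17.6 (the range
# assumed from this report), otherwise "outside-reported-range".
kernel_status() {
    # Drop the Fedora release suffix: 4.17.5-200.fc28.x86_64 -> 4.17.5
    v=${1%%-*}
    if version_le 4.17.4 "$v" && version_le "$v" 4.17.6; then
        echo affected
    else
        echo outside-reported-range
    fi
}
```

Running `kernel_status "$(uname -r)"` classifies the currently booted kernel.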
I can also confirm that 4.17.3 is the last stable one. Just wanted to mention that I too have had random hangs with 4.17.4 and 4.17.5, but not with 4.17.3. However I don't have any USB-3 devices connected and am not doing heavy I/O. These hangs occurred during normal daytime desktop use, but my system is running QEMU/KVM with VFIO GPU passthrough and 8GB of hugepages locked (out of 16GB in the box). I have noticed the same thing. The latest kernel (4.17.5-100.fc27) has frozen twice: once on my Dell Inspiron laptop and once on a Dell Poweredge T20 server. Both froze while doing I/O (no USB3, just normal SATA). The Inspiron froze when I tried to open Libreoffice while doing a DNF upgrade. The Poweredge froze in the middle of an backup. When the computer freezes, nothing can be done apart from restarting using the power button and the log does not indicate anything (i.e. it only contains normal lines then nothing until, of course, -- Restarting --) Reverting to 4.17.3 until a solution is found... For a few days I have been reverting to 4.17.2, because I was afraid that 4.17.3 was also affected by this. I don't have any USB 3.0 devices, but I'm connecting USB 2.0 disks through the USB 3.0 ports. With 4.17.4 the whole PC freezes. You can even hear that all the disks stop writing and reading. Nothing works. When can we expect the fixed 17.7 in the main updates repo? when using kernel 4.17.5 I could not synchronize my home directory to an external USB (2.0) hard disk, after some time Grsync stopped and machine froze. Same when trying to copy my home directory to another external USB disk Running 4.17.3 was fine.Not tried 4.17.4. I experienced also some freezes randomly (once a day) Fixed in kernel 4.17.7 with commit 5ea45736209c8efd04ed793f81084925097f84ed 4.17.7 announcement https://lkml.org/lkml/2018/7/17/434 (In reply to Norman Gaywood from comment #42) > 4.17.7 announcement Party time :) wait and see. 
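The copy test from comment #13 was never attached, so here is a rough sketch of what such a loop could look like. The function name `copy_loop`, its arguments and the `sync` after each copy are my own choices, not the reporter's script. On an affected kernel this is expected to hang the machine, so only run it where a hard reset is acceptable.

```shell
#!/bin/sh
# copy_loop SRC DST [COUNT]
# Copy SRC over DST COUNT times (default 100, as in comment #13),
# printing the iteration number so the last line shows how far it got.
copy_loop() {
    src=$1
    dst=$2
    count=${3:-100}
    i=1
    while [ "$i" -le "$count" ]; do
        echo "iteration $i"
        cp -- "$src" "$dst" || return 1
        sync    # assumption: force the data out to the device on each pass
        i=$((i + 1))
    done
}
```

For example, `copy_loop /path/to/large-file /mnt/ntfs/target` with a multi-GB source file and a target on the partition under test; both paths are placeholders.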
(In reply to Didier G from comment #13)
> To investigate this bug I written a basic test script to copy 100 times a
> 4,7 GB file using cp command overwriting the target at each copy.

With 4.17.4 I never got past 6 or 7 iterations before hitting a hang. I just installed 4.17.7 from koji and have already done 27 iterations with no problem.

Fedora updated to 4.17.6 today. I haven't booted it yet. Probably tomorrow.

4.17.6 will still have this problem (for me at least); you need 4.17.7. 4.17.7 should be available in updates-testing soon, so you will be able to run:
dnf --enablerepo=updates-testing update kernel-core
if you want it quicker. Even faster is to download it from koji: https://koji.fedoraproject.org/koji/packageinfo?packageID=8 There's probably a dnf command for that but I've never used it.

There might have been something left out of 4.17.7; there's 4.17.8 on the way: https://lkml.org/lkml/2018/7/17/453

Is 4.17.7 onward really relevant for this bug? Greg discusses that the kernel release is broken for i386 systems. I run amd64 by Intel.

(In reply to Donald O. from comment #49)
> is 4.17.7 onward really relevant for this bug?

Yes, on x86_64 4.17.7 fixes this problem. I read https://lkml.org/lkml/2018/7/17/453 as saying that a patch which was part of the fix for this problem was missed. The i386 fix is still in the future, and is not 4.17.8. But I could be reading it wrong.

I can verify that a vanilla 4.17.7 kernel does not exhibit the bug; I was rsyncing a few TB of data to external USB 3 drives and it was locking up every hour, now I have stability.

I can verify that a vanilla 4.17.7 kernel (x86_64) does not exhibit the bug; I was rsyncing a few TB of data to external USB 3 drives and it was locking up every hour, now I have stability after building and installing 4.17.7.

No freezes when rsyncing my hard disk to an external USB 2.0 hard disk with kernel 4.17.7.

(In reply to Norman Gaywood from comment #51)
> The i386 fix is still in the future, and is not 4.17.8.
You are referencing to the "don't boot" issue that is happening to i386 machines. I am not really sure if this is just an issue for i386 only, because I had a "don't boot" issue after updating to 4.17.5 on my x86_64 machine. I had to shut down the computer and power it on again. To clarify: grub install and dracut has been run before. The kernel seem to have more issues after 4.17.3 got released. kernel-4.17.7-200.fc28 kernel-tools-4.17.7-200.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-898f23c2f3 Regarding those asking about 4.17.8 - here is a pertinent comment from the 4.17.7 update: "We will not be shipping 4.17.8, that release was a single patch to fix an i686 issue introduced with 4.17.7. We have included that patch in the 4.17.7 build, so this kernel is 100% equivalent to 4.17.8." Can the title be updated to mention "4.17.5"? That would make this showstopper easier to find, so people know to revert to 4.17.3 manually. Can someone explain, in relative simple terms, how the kernel will be better after having gone through this struggle? I enjoyed reading the references above concerning narrowing down the problem, but I missed the overall purpose of the change itself and how it improves the system. Thanks. I think this hang is also provoked by doing lots of reads from a DVD-ROM too (in case it helps.) I observed the same problem when doing nightly backups of an ext4 filesystem on an internal SATA disk to our remote Amanda server. Amanda has been configured to use /usr/sbin/dump for taking backups. kernel-4.17.7-200.fc28, kernel-tools-4.17.7-200.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report. See https://fedoraproject.org/wiki/QA:Updates_Testing for instructions on how to install test updates. 
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-898f23c2f3 I have no idea if this is relevant, but my system has indeed been crashing when I try to backup to a USB disk, but at the same time it crashes, it somehow seems to crash my router along with it. I can't imagine any way it could be connected, but I thought I'd add this note just to record yet another thing man was not meant to know. I've reverted to an older kernel for now and I'll see if it keeps happening. Created attachment 1465416 [details]
hang occurrence "screenshot"
kernel-4.17.6-100.fc27 hang
Why is 4.17.7, which has been available on koji for a few days, not yet in updates or at least in testing?

(In reply to Didier G from comment #65)
> For which raison 4.7.17 available on koji since few days is not yet in
> updates or at least in testing ?

I installed it from updates-testing over 6 hours ago.

(In reply to Rolf Fokkens from comment #64)
> Created attachment 1465416 [details]
> han occurrence "screenshot"
>
> kernel-4.17.6-100.fc27 hang

That looks very similar to what I and others were seeing, but ultimately it was the "kernel BUG at mm/page_alloc.c:2019" message, which showed up when I was running at runlevel 3 and monitoring the console, that brought me here and identified the problem. However, your kernel is known to contain the bug. Wait until 4.17.7 shows up in the Fedora repos, or just hand-roll your own 4.17.7 kernel using the config options of your current kernel.

(In reply to Patrick O'Callaghan from comment #66)
> (In reply to Didier G from comment #65)
> > For which raison 4.7.17 available on koji since few days is not yet in
> > updates or at least in testing ?
>
> I installed it from updates-testing over 6 hours ago.

I can verify that
sudo dnf update kernel --enablerepo=updates-testing
worked for me a few minutes ago, installing kernel 4.17.7-200.fc28.

I confirm 4.17.7 solved the issue for me.

I used the following to ensure that all of the kernel packages updated in concert from the testing repo:
dnf --enablerepo=updates-testing update 'kernel*'
All 5 packages updated from the testing repo as follows:
kernel-4.17.7-200.fc28.x86_64
kernel-core-4.17.7-200.fc28.x86_64
kernel-devel-4.17.7-200.fc28.x86_64
kernel-headers-4.17.7-200.fc28.x86_64
kernel-modules-4.17.7-200.fc28.x86_64
I have copied 340 GB to a freshly encrypted USB3 partition using 4.17.7 without a hiccup. Before, I generally got 1.5 GB under 4.17.5, or 50 to 100 GB under 4.17.4.

*** Bug 1602939 has been marked as a duplicate of this bug.
*** *** Bug 1605855 has been marked as a duplicate of this bug. *** Worried about this: https://bugzilla.redhat.com/show_bug.cgi?id=1597559#c12 Hang with 4.17.7 Might be a different problem with NFS Seems like I've got filesystem corruption with all these hangs. Is there a way to check consistency of all packages on Fedora comparing what is available in repositories with actual filesystem contents? https://ask.fedoraproject.org/en/question/124355/using-dnf-to-compare-filesystem-contents-with-repositories/ kernel-4.17.7-200.fc28, kernel-tools-4.17.7-200.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report. I'm running 4.17.7-200 for a couple of hours. Lots of USB IO's. No problems. Everything is fine. Again thanks a lot for th great Fedora community! System dying for me also with kernel 4.17.4 onwards. Only updated a few days ago. For me it was my nightly backup which uses dd. No need for USB device, I think that USB device is not critical to the problem just probably showed it up faster. On a second machine I was able to track it down to the dd process. I have regressed back to 4.17.3 on my server all OK with that version. I also observed on my second machine that gparted trying to do image copying was also dying. I see in the items above that: Robert Holmes 2018-07-15 12:06:12 EDT Following the mailing lists, this bug has been identified and fixed, and will be included in 4.17.7. My dnf updates are not up to 4.17.7 yet so I will keep updating my second machine until its on that version and confirm it is now fixed. If so then I'll update my main system also. If all is then fixed I will not post again but if I still have this issue I will report back. I hope it is all fixed. The fixed kernel works like a charm! Thanks. Kernel 4.17.7 showed up yesterday in the main Fedora 28 Updates repo. No issues so far after update. The problem seems to be gone. 
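On the earlier question about checking installed packages after all these hangs: `rpm -Va` verifies installed files against the RPM database (not against the repositories). Below is a small sketch for picking out the files whose content digest changed, assuming the usual `rpm -V` output layout where a "5" in the third attribute column marks a digest mismatch. The helper name `changed_files` is made up, and paths containing spaces are not handled.

```shell
#!/bin/sh
# changed_files: read `rpm -V`-style output on stdin and print the paths
# whose content digest differs ("5" in the third attribute column).
# Lines such as "missing  /path" are skipped by the same test.
changed_files() {
    awk 'substr($1, 3, 1) == "5" { print $NF }'
}
```

Typical use would be `rpm -Va | changed_files` as root. Configuration files (marked "c" in the rpm output) often differ for legitimate reasons, so hits under /etc are not necessarily corruption.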
(In reply to Fedora Update System from comment #75) > kernel-4.17.7-200.fc28, kernel-tools-4.17.7-200.fc28 has been pushed to the > Fedora 28 stable repository. If problems still persist, please make note of > it in this bug report. I did update to 4.17.7-200.fc28.x86_64 and still experience system hangs which mostly do occur when the system is idle/unattended... This bug is tracking a particular issue which should now be fixed. If you are still seeing issues on 4.17.7 or greater please open a separate bugzilla with system information. My rdiff-backups have now been running daily for more than a week without any hangs. Source drives with XFS filesystems are: Model: ATA WDC WD20EARX-00P (scsi) Model: ATA M4-CT128M4SSD2 (scsi) Destination drive also with XFS filesystem. Model: ATA ST2000DM001-9YN1 (scsi) Thanks for fixing this annoying problem. Reopening... issue returned with the 4.17.10 kernel - may also be with the 4.17.9 kernel, but I didn't test that one. Please report a new bug. I'm running 4.17.9-200 now sind 9 hours. No probs up to now. If this is a kernel regression, it isn't a new bug. It's the same one. I am the initial reporter of this bug I encountered this hang with 4.17.4, 4.17.5 and 4.17.6 I did not encountered it since 4.17.7 and I now run 4.17.10 with no problem. (In reply to Didier G from comment #87) > I am the initial reporter of this bug > > I encountered this hang with 4.17.4, 4.17.5 and 4.17.6 > > I did not encountered it since 4.17.7 and I now run 4.17.10 with no problem. I also encountered this hang with 4.17.4, 4.17.5 and 4.17.6. It was resolved for me in 4.17.7 - but unfortunately it has returned with exactly the same symptoms for 4.17.10 (hang during early morning hours). I'm currently testing with 4.17.9 and will report tomorrow if it occurs there. (In reply to Gerald Cox from comment #88) > I also encountered this hang with 4.17.4, 4.17.5 and 4.17.6. 
It was > resolved for me in 4.17.7 - but unfortunately it has returned with exactly > the same symptoms for 4.17.10 (hang during early morning hours). I'm > currently testing with 4.17.9 and will report tomorrow if it occurs there. Are you sure it is exactly the same hang ? If you have two computers on your network it will be nice to setup a netconsole to caught log when hang occurs. (In reply to Didier G from comment #89) > (In reply to Gerald Cox from comment #88) > > I also encountered this hang with 4.17.4, 4.17.5 and 4.17.6. It was > > resolved for me in 4.17.7 - but unfortunately it has returned with exactly > > the same symptoms for 4.17.10 (hang during early morning hours). I'm > > currently testing with 4.17.9 and will report tomorrow if it occurs there. > > Are you sure it is exactly the same hang ? > > If you have two computers on your network it will be nice to setup a > netconsole to caught log when hang occurs. I do have a laptop that I could use for that... if you could point me to instructions on how to do that, I'd appreciate it. In the meantime, I'll just go ahead and close out this bug since you can't reproduce. I also notice that 4.7.11 is already out, so maybe (hopefully) it's fixed there. So far, no issues with 4.17.9. (In reply to Gerald Cox from comment #90) > (In reply to Didier G from comment #89) > > (In reply to Gerald Cox from comment #88) > > > I also encountered this hang with 4.17.4, 4.17.5 and 4.17.6. It was > > > resolved for me in 4.17.7 - but unfortunately it has returned with exactly > > > the same symptoms for 4.17.10 (hang during early morning hours). I'm > > > currently testing with 4.17.9 and will report tomorrow if it occurs there. > > > > Are you sure it is exactly the same hang ? > > > > If you have two computers on your network it will be nice to setup a > > netconsole to caught log when hang occurs. > > I do have a laptop that I could use for that... 
if you could point me to > instructions on how to do that, I'd appreciate it. In the meantime, I'll > just go ahead and close out this bug since you can't reproduce. I also > notice that 4.7.11 is already out, so maybe (hopefully) it's fixed there. > So far, no issues with 4.17.9. Nevermind... found the instructions. Thanks! (In reply to Gerald Cox from comment #90) > I do have a laptop that I could use for that... if you could point me to > instructions on how to do that, I'd appreciate it. You will find information on netconsole on this page: https://fedoraproject.org/wiki/Netconsole To just get information for this hang you can omit section "netconsole at boot - client side" To test if your netconsole setup is correct, on the debugee machine open a terminal in root and send following command: echo Test > /dev/kmsg "Test" should be displayed on the debugger. Thank for the info... the correct close status of this bug is ERRATA, see comment #75. Just because it was reopened and subsequently closed doesn't change that fact. (In reply to Donald O. from comment #85) > I'm running 4.17.9-200 now sind 9 hours. No probs up to now. Thanks for the feedback - I've verified that 4.17.9-200 works for me also. The bug returned in 4.17.10-200. I'm going to wait and see if it gets fixed again in 4.17.11 - which should be in the repository in a few days. (In reply to Gerald Cox from comment #94) > Thanks for the feedback - I've verified that 4.17.9-200 works for me also. > The bug returned in 4.17.10-200. I'm going to wait and see if it gets fixed > again in 4.17.11 - which should be in the repository in a few days. thanks for the bad news. Better fedora stops delivering buggy kernels until its fixed (again). no problems: up 1 day, 11 hours, 37 minutes (In reply to Donald O. from comment #95) > (In reply to Gerald Cox from comment #94) > > Thanks for the feedback - I've verified that 4.17.9-200 works for me also. > > The bug returned in 4.17.10-200. 
I'm going to wait and see if it gets fixed > > again in 4.17.11 - which should be in the repository in a few days. > thanks for the bad news. Better fedora stops delivering buggy kernels until > its fixed (again). > > no problems: up 1 day, 11 hours, 37 minutes Well, to be fair, this isn't the fault of any fedora personnel - it's upstream. I must say however, I can't remember the last time I've had so many issues with kernel updates. 4.17 is breaking a record for me. Is the upstream aware that the issue has resurfaced? I have had no problems with 4.17.7-200 nor with 4.17.9-200, but I'm not doing any USB-based bulk I/O. My issue was always with QEMU/KVM freezing the system. (In reply to Patrick O'Callaghan from comment #98) > I have had no problems with 4.17.7-200 nor with 4.17.9-200, 4.17.7-200 and 4.17.9-200 are considered good. 4.17.4-4.17.6 are bad, and 4.17.10 is reported to bring the issue back. (In reply to Ozkan Sezer from comment #97) > Is the upstream aware that the issue has resurfaced? good question. Shouldn't we make the upstream aware that the issue has resurfaced? Again, This bug is tracking a particular issue which should now be fixed. If you are still seeing issues on 4.17.7 or greater please open a separate bugzilla with system information. (In reply to Laura Abbott from comment #101) > Again, This bug is tracking a particular issue which should now be fixed. If > you are still seeing issues on 4.17.7 or greater please open a separate > bugzilla with system information. Laura, if this bug had been closed a month or so I could see your point... in this instance we have a pervasive issue that spanned three kernel updates, has been only closed a week and apparently identical symptoms have reappeared after skipping only two fast paced updates. I believe there is some value with letting the folks that have cc: this bug to be aware of this before they decide to apply 4.17.10 and beyond. 
If I had opened a new bug they wouldn't be aware until after the fact. AFAIK the patch for this issue could have somehow been accidentally been omitted from 4.17.10 build - stranger things have happened (and hopefully the Fedora kernel team has verified that somehow this didn't occur). Regarding 4.17.10, I'm not going to waste my time with doing any debugging since 4.17.11 is being built in koji as I type this reply. As I mentioned above, both 4.17.7 and 4.17.9 work fine. 4.17.10 is where I've immediately encountered the same or suspiciously similar issue. (In reply to Gerald Cox from comment #102) > (In reply to Laura Abbott from comment #101) > > Again, This bug is tracking a particular issue which should now be fixed. If > > you are still seeing issues on 4.17.7 or greater please open a separate > > bugzilla with system information. > > Laura, if this bug had been closed a month or so I could see your point... > in this instance we have a pervasive issue that spanned three kernel > updates, has been only closed a week and apparently identical symptoms have > reappeared after skipping only two fast paced updates. I believe there is > some value with letting the folks that have cc: this bug to be aware of this > before they decide to apply 4.17.10 and beyond. If I had opened a new bug > they wouldn't be aware until after the fact. AFAIK the patch for this issue > could have somehow been accidentally been omitted from 4.17.10 build - > stranger things have happened (and hopefully the Fedora kernel team has > verified that somehow this didn't occur). Regarding 4.17.10, I'm not going > to waste my time with doing any debugging since 4.17.11 is being built in > koji as I type this reply. > > As I mentioned above, both 4.17.7 and 4.17.9 work fine. 4.17.10 is where > I've immediately encountered the same or suspiciously similar issue. My concern is confusing issues. The patch fixed a specific issue with a known backtrace signature. 
If you are just seeing "hangs" that don't have the same backtrace that's definitely a different issue that needs to be tracked separately. This bug has also become a bit unwieldy with multiple reports of "works/doesn't work", hence my request to split out to a separate bugzilla to reduce confusion (for me more than anything). You are welcome to point people to the new bug for tracking.

(In reply to Laura Abbott from comment #103)
> (In reply to Gerald Cox from comment #102)
> > (In reply to Laura Abbott from comment #101)
>
> My concern is confusing issues. The patch fixed a specific issue with a
> known backtrace signature. If you are just seeing "hangs" that don't have
> the same backtrace that's definitely a different issue that needs to be
> tracked separately. This bug has also become a bit unwieldy with multiple
> reports of "works/doesn't work", hence my request to split out to a separate
> bugzilla to reduce confusion (for me more than anything). You are welcome to
> point people to the new bug for tracking.

Completely agree, I will test with 4.17.11 once it's built and open a new bug and reference that bug number here if I have issues with that update. I just wanted people to be aware of the issue, and believed commenting in this ticket was the best way to raise awareness to those on cc: - if you know a better way to do this, please let me know.

(In reply to Gerald Cox from comment #104)
> (In reply to Laura Abbott from comment #103)
> > (In reply to Gerald Cox from comment #102)
> > > (In reply to Laura Abbott from comment #101)
> >
> > My concern is confusing issues. The patch fixed a specific issue with a
> > known backtrace signature. If you are just seeing "hangs" that don't have
> > the same backtrace that's definitely a different issue that needs to be
> > tracked separately.
> > This bug has also become a bit unwieldy with multiple
> > reports of "works/doesn't work", hence my request to split out to a separate
> > bugzilla to reduce confusion (for me more than anything). You are welcome to
> > point people to the new bug for tracking.
>
> Completely agree, I will test with 4.17.11 once it's built and open a new
> bug and reference that bug number here if I have issues with that update. I
> just wanted people to be aware of the issue, and believed commenting in this
> ticket was the best way to raise awareness to those on cc: - if you know a
> better way to do this, please let me know.

For what it's worth I too am still experiencing random freezes up to 4.17.9 with Fedora 28. I have just subscribed to a bunch of other recent 4.17 related bug reports with hangs. I will submit a new one if I ever get anything useful from the kernel logs. It took about 5 freezes before I got the kernel message that identified the bug for this ticket. I will say I am now seeing freezes with NO activity to external USB drives - just sitting idly. Before early July I can't recall having an unexplained freeze. Since early July I always cringe when I smack the space bar to wake up the machine, for lately it's often just frozen. I have never had that experience and I've been running Fedora since there was a Fedora. I get the impression (but it could be selective bias) that others are in the same boat.

(In reply to Leigh Orf from comment #105)
> (In reply to Gerald Cox from comment #104)
> > (In reply to Laura Abbott from comment #103)
> > > (In reply to Gerald Cox from comment #102)
> > > > (In reply to Laura Abbott from comment #101)
> > >
> > > My concern is confusing issues. The patch fixed a specific issue with a
> > > known backtrace signature. If you are just seeing "hangs" that don't have
> > > the same backtrace that's definitely a different issue that needs to be
> > > tracked separately.
> > > This bug has also become a bit unwieldy with multiple
> > > reports of "works/doesn't work", hence my request to split out to a separate
> > > bugzilla to reduce confusion (for me more than anything). You are welcome to
> > > point people to the new bug for tracking.
> >
> > Completely agree, I will test with 4.17.11 once it's built and open a new
> > bug and reference that bug number here if I have issues with that update. I
> > just wanted people to be aware of the issue, and believed commenting in this
> > ticket was the best way to raise awareness to those on cc: - if you know a
> > better way to do this, please let me know.
>
> For what it's worth I too am still experiencing random freezes up to 4.17.9
> with Fedora 28. I have just subscribed to a bunch of other recent 4.17
> related bug reports with hangs. I will submit a new one if I ever get
> anything useful from the kernel logs. It took about 5 freezes before I got
> the kernel message that identified the bug for this ticket. I will say I am
> now seeing freezes with NO activity to external USB drives - just
> sitting idly. Before early July I can't recall having an unexplained freeze.
> Since early July I always cringe when I smack the space bar to wake up the
> machine, for lately it's often just frozen. I have never had that experience
> and I've been running Fedora since there was a Fedora. I get the impression
> (but it could be selective bias) that others are in the same boat.

I am with you in this boat. I had upgraded the server and was scared by the instability. The system froze and sometimes a kernel panic message was written on the console. I downgraded to 4.17.3 and everything has been fine since. Now I am waiting for a longer stable period before I start the next round. I also can't remember seeing issues like this, and I have used Fedora for years. Hopefully it will be stable soon...
I've run 4.17.11 for 2 nights in a row and it appears to be good - so whatever was introduced or regressed in 4.17.10 has been fixed. Thank goodness.

please post the link to the new 4.17.10 bug.

(In reply to Donald O. from comment #108)
> please post the link to the new 4.17.10 bug.

Please see my comment in #102 and #104. Since 4.17.11 release was imminent I didn't pursue opening a bug for 4.17.10. 4.17.11 has been running fine for 3 nights so far so nothing to report other than that it appears whatever the issue was with 4.17.10 was resolved in 4.17.11.

(In reply to Gerald Cox from comment #109)
> (In reply to Donald O. from comment #108)
> > please post the link to the new 4.17.10 bug.
>
> Please see my comment in #102 and #104. Since 4.17.11 release was imminent
> I didn't pursue opening a bug for 4.17.10. 4.17.11 has been running fine
> for 3 nights so far so nothing to report other than that it appears whatever
> the issue was with 4.17.10 was resolved in 4.17.11.

Who knows when kernel 4.17.11 will be available in the standard update stream? The latest kernel there is 4.17.9.

It's in updates-testing. If you don't have that repo enabled, you can add "--enablerepo=updates-testing" to the dnf command to get it right now.

4.17.11 was deployed a couple of hours ago by the regular update channel. I've been running it for 5 minutes.

I am not on the beta channel so I only get full releases, thus I did not get the "buggy" 4.17.10. I can confirm though that the earlier bug was fixed up to 4.17.9 for me, and it was working fine on several machines. I downloaded 4.17.11 a few hours ago on one of my home machines and the cases that caused the earlier bug to appear (hanging on large data transfers) worked fine for me, so that bug still appears fixed in 4.17.11. I will test 4.17.11 at work in two days but I am sure it will be fine.

4.17.11 up now for 15 hours, 45 minutes. No problems. Everything's fine. Thanks RH. Thanks community!!!

Have had no 'hangs' since abandoning 4.17.7.
Suggest discouraging its use somehow? Now on 4.17.11 from Fedora testing repo. Thanks to everyone who helped.

4.17.11 up for 1 day, 13 hours, 37 minutes. Everything's fine. Thanks! (We have some concurrent USB access slowdowns, even if the background job is running with the lowest task/IO priorities. That isn't a 4.17 problem, but a general problem.)
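
For anyone skimming the thread, the per-version reports above can be collected into a small lookup. This is only a sketch under stated assumptions: the table reflects anecdotal reports from commenters in this bug, not an official list, and the reports were not unanimous (one commenter still saw freezes up to 4.17.9, and another reported no hangs only after abandoning 4.17.7). The names `REPORTED_STATUS`, `parse_release` and `reported_status` are hypothetical and exist only for this illustration.

```python
import re
from typing import Tuple

# Kernel status as reported by commenters in this bug (anecdotal, Fedora 28).
# Assumption: each version is labelled with the majority report in the thread.
REPORTED_STATUS = {
    (4, 17, 3): "good",   # several commenters downgraded to 4.17.3
    (4, 17, 4): "bad",
    (4, 17, 5): "bad",
    (4, 17, 6): "bad",
    (4, 17, 7): "good",   # "Fixed In Version: kernel-4.17.7-200.fc28"
    (4, 17, 9): "good",
    (4, 17, 10): "bad",   # hang reported to be back
    (4, 17, 11): "good",
}


def parse_release(release: str) -> Tuple[int, int, int]:
    """Extract (major, minor, patch) from a string like '4.17.10-200.fc28.x86_64'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor, patch = (int(part) for part in match.groups())
    return (major, minor, patch)


def reported_status(release: str) -> str:
    """Return 'good', 'bad', or 'unknown' for a kernel release string."""
    return REPORTED_STATUS.get(parse_release(release), "unknown")


print(reported_status("4.17.10-200.fc28.x86_64"))  # prints "bad"
```

On a running system the release string could be taken from `platform.release()`, which returns the same value that `uname -r` prints. Versions nobody commented on, such as 4.17.8, deliberately come back as "unknown" rather than being guessed.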