Bug 1598462 - kernel 4.17.4, 4.17.5, 4.17.6 hang
Summary: kernel 4.17.4, 4.17.5, 4.17.6 hang
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 28
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 1600736 1601176 1601407 1602939 1605855
Depends On:
Blocks:
Reported: 2018-07-05 14:39 UTC by Didier G
Modified: 2018-08-05 19:51 UTC (History)
61 users (show)

Fixed In Version: kernel-4.17.7-200.fc28
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-07-29 00:44:21 UTC
Type: Bug


Attachments
hang occurrence 1 caught on netconsole (15.87 KB, text/plain)
2018-07-07 02:06 UTC, Didier G
hang occurrence 2 caught on netconsole (22.84 KB, text/plain)
2018-07-07 02:07 UTC, Didier G
hang occurrence 3 VB removed caught on netconsole (3.78 KB, text/plain)
2018-07-09 20:03 UTC, Didier G
hang occurrence 4 VB removed caught on netconsole (3.93 KB, text/plain)
2018-07-09 20:04 UTC, Didier G
hang occurrence 5 (19.40 KB, text/plain)
2018-07-09 21:44 UTC, Didier G
hang occurrence "screenshot" (1.12 MB, image/jpeg)
2018-07-20 15:16 UTC, Rolf Fokkens

Description Didier G 2018-07-05 14:39:53 UTC
Description of problem:

The kernel hangs during read or write access to a USB 3.1 external drive.

The disk is formatted as NTFS and its directories are encrypted using encfs.


Version-Release number of selected component (if applicable):

kernel-4.17.4-200.fc28.x86_64
ntfs-3g-2017.3.23-6.fc28.x86_64
fuse-encfs-1.9.5-1.fc28.x86_64

/:  Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 5000M
    |__ Port 1: Dev 2, If 0, Class=Hub, Driver=hub/4p, 5000M
        |__ Port 1: Dev 3, If 0, Class=Hub, Driver=hub/4p, 5000M
            |__ Port 1: Dev 4, If 0, Class=Mass Storage, Driver=uas, 5000M


Steps to Reproduce:

Just do many read or write operations on this disk using kernel 4.17.4


Actual results:

kernel hang
screen frozen
no keyboard action
host does not answer pings from another host


Additional info:

No problem with kernel-4.17.3-200.fc28.x86_64 and previous

I ran many tests with both 4.17.3 and 4.17.4. The hang occurs only with 4.17.4, and I never had this problem with previous kernels.

Nothing in journalctl

Comment 1 Didier G 2018-07-07 02:04:11 UTC
I set up a netconsole and caught two occurrences of the hang.

Comment 2 Didier G 2018-07-07 02:06:33 UTC
Created attachment 1457113 [details]
hang occurrence 1 caught on netconsole

hang occurrence 1 caught on netconsole

Comment 3 Didier G 2018-07-07 02:07:22 UTC
Created attachment 1457115 [details]
hang occurrence 2 caught on netconsole

hang occurrence 2 caught on netconsole

Comment 4 Ali Akcaagac 2018-07-09 09:22:17 UTC
I confirm this issue. The kernel hung twice during my nightly backup session.

The first time, it hung during my regular nightly backup, which writes changes from my USB stick to my external USB HDD. The rsync worked, md5sum'ing the files afterwards worked, and dd'ing the free space caused the kernel to hang. In the morning the entire machine was basically dead until I rebooted. No kernel dump, no entries in journalctl.

Luckily no files were harmed, but I had to run fsck to ensure that the filesystem was still error-free.

The next day I started another backup because the first one had failed. I cleaned up my system beforehand and uninstalled all unnecessary packages. During the night the same issue happened at the same step.

I had to fsck my external HDD once again and am now in the middle of rescuing the backup to yet another medium, just in case.

The issues came up after switching from 4.17.3 to 4.17.4.

Other issues that appeared with this kernel: the middle mouse button stopped working, and random programs occasionally crashed. For example, xfce4-terminal crashed in all instances twice, google-chrome crashed a couple of times, and some other programs simply crashed. Well, it's not so much a crash as a "close". Switching VirtualBox to fullscreen caused an entire system freeze (even though akmod built the modules successfully).

Switching back to 4.17.3 solved all issues.

There seem to be some serious regressions in 4.17.4.

This is an absolute no-go for a so-called stable kernel on a working machine.

Comment 5 Laura Abbott 2018-07-09 15:26:14 UTC
Both of the backtraces have the out of tree virtualbox drivers installed, can you reproduce the issue without those installed?

Comment 6 Didier G 2018-07-09 20:02:35 UTC
After I removed VB from my system I got two kernel hangs:

- the first while copying a file to an encfs-encrypted directory on the NTFS-formatted USB 3.1 external disk using Nautilus
- the second while reading a file from an encrypted directory on the NTFS-formatted USB 3.1 external disk

Comment 7 Didier G 2018-07-09 20:03:59 UTC
Created attachment 1457583 [details]
hang occurrence 3 VB removed caught on netconsole

Comment 8 Didier G 2018-07-09 20:04:32 UTC
Created attachment 1457585 [details]
hang occurrence 4 VB removed caught on netconsole

Comment 9 Didier G 2018-07-09 21:43:35 UTC
Problem still exists with 4.17.5

Comment 10 Didier G 2018-07-09 21:44:23 UTC
Created attachment 1457611 [details]
hang occurrence 5

Comment 11 Laura Abbott 2018-07-09 23:34:35 UTC
If you have a working and a non-working kernel, the best option is going to be to run a bisect between 4.17.3 and 4.17.4. I don't see any patches that jump out as causing that particular bug to be hit.
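
A bisect between the two releases could be started roughly as follows. This is only a sketch: it assumes a clone of the upstream linux-stable tree that carries the v4.17.3 and v4.17.4 tags, so it does not cover Fedora's additional patches. The helper name start_bisect and the GIT override are hypothetical and used only for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the current directory is a clone of the upstream
# linux-stable tree containing the tags v4.17.3 and v4.17.4.

# GIT may be overridden (for example GIT=echo) to preview the command.
GIT=${GIT:-git}

# start_bisect GOOD BAD: hypothetical helper that opens a bisect session.
# "git bisect start" takes the bad revision first, then the good one.
start_bisect() {
    local good=$1 bad=$2
    "$GIT" bisect start "$bad" "$good"
}

# Typical session (run by hand, one test boot per step):
#   start_bisect v4.17.3 v4.17.4
#   ...build, boot and stress the kernel that git checked out...
#   git bisect good     # if it survived the stress test
#   git bisect bad      # if it hung
#   git bisect reset    # once git has named the first bad commit
```

Each step halves the remaining commit range, so the number of test boots grows only with the logarithm of the number of commits between the two tags.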

Comment 12 Didier G 2018-07-10 22:40:48 UTC
https://bugzilla.redhat.com/show_bug.cgi?id=1598989 seems to be a duplicate of this bug

Comment 13 Didier G 2018-07-11 02:47:35 UTC
It would be interesting to know what storage was involved in the bugs encountered by Ali Akcaagac in comment 4 and by Jens Lody in bz1598989.

To investigate this bug I wrote a basic test script that copies a 4.7 GB file 100 times using the cp command, overwriting the target on each copy.

First I eliminated encfs: hangs also happen when writing to a NON-encrypted directory.


Booting on 4.17.3 - test with NO problem on external USB NTFS partition

Booting on 4.17.4 - test on the same NTFS partition: kernel 4.17.4 HUNG after only a few iterations

Booting on 4.17.4 - test with NO problem on internal SATA EXT4 partition

Booting on 4.17.4 - test on internal SATA NTFS partition: kernel 4.17.4 HUNG after only a few iterations

Booting on 4.17.3 - test with NO problem on internal SATA NTFS partition


In summary and in my case:

4.17.3: OK on any partitions
4.17.4: OK on EXT4, HANG on NTFS both SATA internal and USB external
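
The repeated-copy test described in this comment could look roughly like the sketch below. It is not the original script, which was not attached to this report; the function name copy_stress, the sync after each pass, and the paths in the usage line are assumptions.

```shell
#!/usr/bin/env bash
# Sketch of a repeated-copy stress test (not the original script).

# copy_stress SRC DST [ITERATIONS]: copy SRC over DST repeatedly,
# overwriting the target on every pass. ITERATIONS defaults to 100.
copy_stress() {
    local src=$1 dst=$2 iterations=${3:-100}
    local i=1
    while [ "$i" -le "$iterations" ]; do
        if ! cp -- "$src" "$dst"; then
            echo "copy failed on iteration $i" >&2
            return 1
        fi
        sync    # assumption: flush cached data so each pass reaches the disk
        echo "iteration $i done"
        i=$((i + 1))
    done
}

# Example (hypothetical paths):
#   copy_stress /home/user/big-file.iso /run/media/user/ntfs-disk/big-file.iso 100
```

If the machine hangs, the last "iteration N done" line visible on the screen or on netconsole shows how far the loop got.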

Comment 14 Ali Akcaagac 2018-07-11 09:35:14 UTC
For me the (undefined) issue is reproducible:

1) rsyncing to an external hard disk (even large chunks of files) works. rsyncing 250 GB and more works perfectly.
2) md5sum'ing the files afterwards (to compare source with destination) works perfectly.
3) dd'ing free space is *always* the point where the kernel seems to crash (without a log).

dd'ing free space on an internal hard disk works, no issues!
dd'ing free space on an external USB stick works, no issues!
dd'ing free space on the USB hard disk (the one that brought this issue to my attention) always causes the kernel to die, right after a few MB are written.
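
The exact dd invocation is not given in this report. One common way to overwrite free space is sketched below; the function name zero_fill, the fill file name and the mount point in the example are assumptions, not details taken from the backup script.

```shell
#!/usr/bin/env bash
# Sketch: fill free space under a directory with zeros using dd,
# then delete the fill file again.

# zero_fill DIR [MAX_MIB]: write zeros to DIR/zero.fill.
# Without MAX_MIB, dd runs until the filesystem is full (dd then exits
# with an error, which is expected). With MAX_MIB, at most that many
# MiB are written. Prints the number of bytes written.
zero_fill() {
    local dir=$1 max_mib=${2:-}
    local fill="$dir/zero.fill"
    if [ -n "$max_mib" ]; then
        dd if=/dev/zero of="$fill" bs=1048576 count="$max_mib" 2>/dev/null
    else
        dd if=/dev/zero of="$fill" bs=1048576 2>/dev/null || true
    fi
    sync
    # Report the size of the fill file before removing it.
    wc -c < "$fill" | tr -d ' '
    rm -f -- "$fill"
}

# Example (hypothetical mount point, fills all free space):
#   zero_fill /run/media/user/backup-hdd
```

The optional size limit is there only so the sketch can be tried on a small scale before pointing it at a whole disk.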

So the issue could be any of these:
1) kernel issue
2) heavy load issue (there was a bug report about this for 4.17.4 on bugzilla.redhat.com)
3) broken hard drive (but then why does the kernel die?)
4) USB issue within the drivers (regression)
5) FS issue (XFS in particular is used here, but then rsync and md5sum'ing would cause the same issues). But we seem to be using different filesystems here (XFS, EXT4, NTFS), so this is not an FS issue.

For my part, I will wait for 4.17.5 to show up so that I can do the following steps:

1) repartition external HD
2) format external HD
3) rsync and md5sum'ing the data
4) dd'ing free space

Since dd'ing is a heavy "process" task, this would explain the other bug report:
https://bugzilla.redhat.com/show_bug.cgi?id=1599101

There the reporter writes about kernel oopses under "heavy" system load. This of course depends on the system used. Heavy load on a less powerful system means something different than doing the same tasks on a powerful system.

So using 'dd' can be regarded as a heavy-load process on my system. This might also explain why I encounter "closing" programs once the system gets into heavy-load tasks.

This is of course just a description of the things that I see here and that have happened here since I upgraded to 4.17.4.

The changelog for 4.17.5 (on kernel.org) describes some USB regression fixes. So it could be anything.

Comment 15 Ali Akcaagac 2018-07-11 09:58:53 UTC
There are also 50 Fedora-specific patches applied to the kernel, any one of which could trigger the issue in combination with the upstream changes. A free somewhere in the kernel or a driver close somewhere can easily turn one of these patches from working to critical.

https://src.fedoraproject.org/rpms/kernel/tree/master

Comment 16 Ali Akcaagac 2018-07-11 11:07:14 UTC
4.17.5 addresses a bunch of FS, XFS related as well as TIMER related issues.

https://koji.fedoraproject.org/koji/buildinfo?buildID=1104321

I have a good feeling (reading the CVS's) that this version may solve various issues, since timing is involved everywhere...

Comment 17 Didier G 2018-07-12 19:39:47 UTC
Same hang with 4.17.6.

Last hang free kernel is 4.17.3


(In reply to Laura Abbott from comment #11)
> If you have  working and non-working kernel, the best option is going to be
> to run a bisect between 4.17.3 and 4.17.4. I don't see any patches that jump
> out as causing that particular bug to be hit.

Where can I find a howto for bisecting the Fedora kernel?

Comment 18 Robert Holmes 2018-07-13 12:38:05 UTC
I've been tracking down this issue over the last few days, and have reduced the suspicious commits to the following interval:

 - 54428453efda4c1c35ca75a0a5aa170de87ff1b0 (x86/e820: put !E820_TYPE_RAM regions into memblock.reserved) -- bad, hang reproduces.

 - 323252c83194268cadefc2c0ea55827bf4dd04b8 (i2c: gpio: initialize SCL to HIGH again) -- good, no hangs after considerable amount of trying

Comment 19 Laura Abbott 2018-07-13 16:42:57 UTC
Thanks for the bisect range. There was a thread going on linux-mm so I gave this information to the maintainers. We'll see if anyone responds.

Comment 20 Robert Holmes 2018-07-13 22:41:55 UTC
Having been running "block: Fix cloning of requests with a special payload" for a few busy hours without incident, I am now reasonably convinced that "x86/e820: put !E820_TYPE_RAM regions into memblock.reserved" is the culprit.

Comment 21 Gerald Cox 2018-07-14 05:04:33 UTC
I'm also having a problem with system hangs/freezes, and it definitely did not appear with 4.16.6. It appears in 4.17.6, 4.17.5 and 4.17.4. I'm currently testing 4.17.2 and 4.17.3 to see if it appears there.

Comment 22 Ali Akcaagac 2018-07-14 07:16:28 UTC
I'd like to report that I just had a freeze again today (4.17.5-200) during the backup process...

1) rsync worked, 250 GB
2) md5sum'ing worked, 250 GB
3) dd'ing crashed... So again it's crashing at the dd'ing stage...

I somehow had the feeling that the XFS fixes and the timing fixes introduced with 4.17.5-200 might have solved the issue.

But comment 18 sounds pretty realistic here. I hope to have spare time to look into creating some sort of backtrace.

This urgently needs to get fixed. Today I need to spend time, for the third time in a row, manually md5sum'ing and checking all the files of the backup for consistency. :(

Comment 23 Alex Smith 2018-07-15 09:24:02 UTC
I've had this 4 times over the last 2 days, each time while there is heavy disk I/O going on (while running an automated backup, and while compiling a large piece of software).

One time it hung it managed to save "kernel BUG at mm/page_alloc.c:2019" to syslog, which looks like the other logs here.

I was not seeing this prior to updating to 4.17.5-200.fc28.x86_64.

Comment 24 John Damm Sørensen 2018-07-15 10:24:46 UTC
Same thing here. System hangs while doing rdiff-backup.
4.17.5-200.fc28.x86_64
kernel BUG at mm/page_alloc.c:2019!
no other backtrace available.

This needs urgent attention, as going to STABLE FC28 made my system completely unstable with daily hangs.

Comment 25 thacuop 2018-07-15 12:26:33 UTC
i may have the same or a related issue ...

since upgrading from kernel 4.17.2 to 4.17.4, my computer freezes up completely each night, and I have to hard reset. i was able to determine this happens when my nightly script runs nwipe on a small (~4.5GB) internal partition. am i the only one who sees writing to internal partitions affected?

the software and configuration for nwipe have not changed

the command that's triggering the hang is (run as root):
nwipe --nogui --nowait --autonuke --method=zero --verify=off /dev/sda2;

I'm able to reproduce this consistently, although it's not fun because it hangs up completely all at once and there's no way to recover.

i tried running the command on other and smaller partitions; i noticed that for a smaller partition it didn't hang -- it looked like it was about to, as the computer would be not very responsive for a few seconds, but then it recovered, and the command line showed "segmentation fault (core dumped)" and nwipe stopped. i reproduced this a few times on that partition, but on maybe the 4th time i wasn't so lucky and instead got a hang. i'm therefore wondering if the "segmentation fault (core dumped)" is a clue to this bug.

Comment 26 Robert Holmes 2018-07-15 16:06:12 UTC
Following the mailing lists, this bug has been identified and fixed, and will be included in 4.17.7.

Comment 27 samoht0 2018-07-15 19:32:33 UTC
(In reply to Robert Holmes from comment #26)
> Following the mailing lists, this bug has been identified and fixed, and
> will be included in 4.17.7.

Thanks for the info! This is really a showstopper.
Apart from NTFS also happening here on external USB stick with exFAT via FUSE.

Comment 28 Toby Ovod-Everett 2018-07-16 00:54:17 UTC
Hallelujah!  This is exactly what I am seeing backing up to encrypted USB 3.0 drives.  4.17.3 seems solid, 4.17.4 stays stable for up to 90 minutes during heavy load, 4.17.5 crashes in less than 10 minutes under heavy load.

Looking forward to 4.17.7.  Will keep my eyes open and avoid updating to 4.17.6 until 4.17.7 is available (since otherwise 4.17.3 will roll off my list of available kernels unless I figure out how to keep it around).

Comment 29 Norman Gaywood 2018-07-16 02:20:47 UTC
@Robert Holmes great news. Do you have a link to the kernel archives?

Comment 30 Ali Akcaagac 2018-07-16 07:07:21 UTC
The kernel archives for the memory management issues can be followed here:

https://marc.info/?l=linux-mm&r=1&b=201807&w=2

The conversation where Linus tests the issue is here:

https://marc.info/?t=153152578000001&r=1&w=2&n=16

The report made by Laura can be found here:

https://marc.info/?t=153117953400001&r=1&w=2&n=3

All on the same ML.

Comment 31 Jeremy Cline 2018-07-16 17:46:23 UTC
*** Bug 1601176 has been marked as a duplicate of this bug. ***

Comment 32 Jeremy Cline 2018-07-16 17:58:30 UTC
*** Bug 1600736 has been marked as a duplicate of this bug. ***

Comment 33 Jeremy Cline 2018-07-16 18:01:27 UTC
*** Bug 1601407 has been marked as a duplicate of this bug. ***

Comment 34 Donald O. 2018-07-16 18:10:22 UTC
>Last hang free kernel is 4.17.3<
I can confirm that. 4.17.3 is the last stable one.

Comment 35 Donald O. 2018-07-16 18:12:38 UTC
>kernel hang during access in read or write to USB 3.1 external drive
This disk is formated in NTFS and directories are encrypted using encfs<
in my case its probably USB3 too, but btrfs, unencrypted,

Comment 36 Gerald Cox 2018-07-16 18:45:09 UTC
I can also confirm that 4.17.3 is the last stable one.

Comment 37 Patrick O'Callaghan 2018-07-16 21:28:06 UTC
Just wanted to mention that I too have had random hangs with 4.17.4 and 4.17.5, but not with 4.17.3. However I don't have any USB-3 devices connected and am not doing heavy I/O. These hangs occurred during normal daytime desktop use, but my system is running QEMU/KVM with VFIO GPU passthrough and 8GB of hugepages locked (out of 16GB in the box).

Comment 38 Avinash Meetoo 2018-07-17 02:49:44 UTC
I have noticed the same thing.

The latest kernel (4.17.5-100.fc27) has frozen twice: once on my Dell Inspiron laptop and once on a Dell Poweredge T20 server.

Both froze while doing I/O (no USB3, just normal SATA). The Inspiron froze when I tried to open Libreoffice while doing a DNF upgrade. The Poweredge froze in the middle of a backup.

When the computer freezes, nothing can be done apart from restarting using the power button and the log does not indicate anything (i.e. it only contains normal lines then nothing until, of course, -- Restarting --)

Reverting to 4.17.3 until a solution is found...

Comment 39 Krzysztof Kapustka 2018-07-17 07:24:38 UTC
For a few days I have been reverting to 4.17.2, because I was afraid that 4.17.3 was also affected by this. I don't have any USB 3.0 devices, but I'm connecting USB 2.0 disks through the USB 3.0 ports. With 4.17.4 the whole PC freezes. You can even hear that all the disks stop writing and reading. Nothing works. When can we expect the fixed 4.17.7 in the main updates repo?

Comment 40 antonio montagnani 2018-07-17 09:33:26 UTC
When using kernel 4.17.5 I could not synchronize my home directory to an external USB (2.0) hard disk: after some time Grsync stopped and the machine froze. The same happened when trying to copy my home directory to another external USB disk.

Running 4.17.3 was fine. I have not tried 4.17.4.

I also experienced some random freezes (about once a day).

Comment 41 ValdikSS 2018-07-17 11:07:13 UTC
Fixed in kernel 4.17.7 with commit 5ea45736209c8efd04ed793f81084925097f84ed

Comment 42 Norman Gaywood 2018-07-17 11:14:46 UTC
4.17.7 announcement

https://lkml.org/lkml/2018/7/17/434

Comment 43 Ali Akcaagac 2018-07-17 12:25:03 UTC
(In reply to Norman Gaywood from comment #42)
> 4.17.7 announcement

Party time :)

Comment 44 Donald O. 2018-07-17 16:15:08 UTC
wait and see.

Comment 45 Didier G 2018-07-17 20:33:59 UTC
(In reply to Didier G from comment #13)
> To investigate this bug I written a basic test script to copy 100 times a
> 4,7 GB file using cp command overwriting the target at each copy.

With 4.17.4 I never got beyond 6 or 7 iterations before encountering a hang.

I just installed 4.17.7 from koji and I have already done 27 iterations with no problem.

Comment 46 Donald O. 2018-07-17 22:47:41 UTC
fedora updated today to 4.17.6. Up to now I didn't boot it. Probably tomorrow.

Comment 47 Norman Gaywood 2018-07-17 23:44:29 UTC
4.17.6 will still have this problem (for me at least), you need 4.17.7
4.17.7 Should be available in updates-testing soon, so you will be able to:

dnf --enablerepo=updates-testing update kernel-core

if you want it quicker.
Even faster is to download from koji:

https://koji.fedoraproject.org/koji/packageinfo?packageID=8

There's probably a dnf command for that but I've never used it.

Comment 48 Norman Gaywood 2018-07-18 00:43:15 UTC
There might have been something left out of 4.17.7, there's 4.17.8 on the way:

https://lkml.org/lkml/2018/7/17/453

Comment 49 Donald O. 2018-07-18 08:20:44 UTC
is 4.17.7 onward really relevant for this bug?
Greg discusses that the kernel release is broken for i386 systems. I run amd64 by intel.

Comment 50 Didier G 2018-07-18 08:36:40 UTC
(In reply to Donald O. from comment #49)
> is 4.17.7 onward really relevant for this bug?

Yes on x86_64, 4.17.7 fixes this problem.

Comment 51 Norman Gaywood 2018-07-18 09:55:59 UTC
I read https://lkml.org/lkml/2018/7/17/453 as a patch was missed that was part of the fix for this problem.

The i386 fix is still in the future, and is not 4.17.8.

But I could be reading it wrong.

Comment 52 Leigh Orf 2018-07-18 12:53:09 UTC
I can verify that a vanilla 4.17.7 kernel does not exhibit the bug; I was rsyncing a few TB of data to external USB 3 drives and it was locking up every hour, now I have stability.

Comment 53 Leigh Orf 2018-07-18 12:55:05 UTC
I can verify that a vanilla 4.17.7 kernel (x86_64) does not exhibit the bug; I was rsyncing a few TB of data to external USB 3 drives and it was locking up every hour, now I have stability after building and installing 4.17.7.

Comment 54 antonio montagnani 2018-07-18 15:54:55 UTC
No freezes when rsyncing my hard disk to an external USB 2.0 hard disk with kernel 4.17.7.

Comment 55 Ali Akcaagac 2018-07-18 15:57:44 UTC
(In reply to Norman Gaywood from comment #51)
> The i386 fix is still in the future, and is not 4.17.8.

You are referring to the "don't boot" issue that is happening on i386 machines. I am not really sure this is an i386-only issue, because I had a "don't boot" issue after updating to 4.17.5 on my x86_64 machine. I had to shut down the computer and power it on again. To clarify: grub install and dracut had been run before. The kernel seems to have more issues since 4.17.3 was released.

Comment 56 Fedora Update System 2018-07-18 17:14:44 UTC
kernel-4.17.7-200.fc28 kernel-tools-4.17.7-200.fc28 has been submitted as an update to Fedora 28. https://bodhi.fedoraproject.org/updates/FEDORA-2018-898f23c2f3

Comment 57 Gerald Cox 2018-07-18 17:21:28 UTC
Regarding those asking about 4.17.8 - here is a pertinent comment from the 4.17.7 update:

"We will not be shipping 4.17.8, that release was a single patch to fix an i686 issue introduced with 4.17.7. We have included that patch in the 4.17.7 build, so this kernel is 100% equivalent to 4.17.8."

Comment 58 CR 2018-07-18 23:19:42 UTC
Can the title be updated to mention "4.17.5"?  That would make this showstopper easier to find, so people know to revert to 4.17.3 manually.

Comment 59 Stan King 2018-07-18 23:57:05 UTC
Can someone explain, in relatively simple terms, how the kernel will be better after having gone through this struggle?

I enjoyed reading the references above concerning narrowing down the problem, but I missed the overall purpose of the change itself and how it improves the system.

Thanks.

Comment 60 David W. Legg 2018-07-19 08:48:44 UTC
I think this hang is also provoked by doing lots of reads from a DVD-ROM too (in case it helps.)

Comment 61 Michael Schmitz 2018-07-19 08:58:27 UTC
I observed the same problem when doing nightly backups of an ext4 filesystem on an internal SATA disk to our remote Amanda server. Amanda has been configured to use /usr/sbin/dump for taking backups.

Comment 62 Fedora Update System 2018-07-19 20:19:04 UTC
kernel-4.17.7-200.fc28, kernel-tools-4.17.7-200.fc28 has been pushed to the Fedora 28 testing repository. If problems still persist, please make note of it in this bug report.
See https://fedoraproject.org/wiki/QA:Updates_Testing for
instructions on how to install test updates.
You can provide feedback for this update here: https://bodhi.fedoraproject.org/updates/FEDORA-2018-898f23c2f3

Comment 63 Tom Horsley 2018-07-19 21:00:41 UTC
I have no idea if this is relevant, but my system has indeed been crashing when I try to backup to a USB disk, but at the same time it crashes, it somehow seems to crash my router along with it. I can't imagine any way it could be connected, but I thought I'd add this note just to record yet another thing man was not meant to know.

I've reverted to an older kernel for now and I'll see if it keeps happening.

Comment 64 Rolf Fokkens 2018-07-20 15:16:42 UTC
Created attachment 1465416 [details]
hang occurrence "screenshot"

kernel-4.17.6-100.fc27 hang

Comment 65 Didier G 2018-07-20 15:22:36 UTC
Why is 4.17.7, which has been available on koji for a few days, not yet in updates or at least in testing?

Comment 66 Patrick O'Callaghan 2018-07-20 15:28:08 UTC
(In reply to Didier G from comment #65)
> For which raison 4.7.17 available on koji since few days is not yet in
> updates or at least in testing ?

I installed it from updates-testing over 6 hours ago.

Comment 67 Leigh Orf 2018-07-20 15:33:19 UTC
(In reply to Rolf Fokkens from comment #64)
> Created attachment 1465416 [details]
> han occurrence "screenshot"
> 
> kernel-4.17.6-100.fc27 hang

That looks very similar to what I and others were seeing, but ultimately it was the "kernel BUG at mm/page_alloc.c:2019" that showed up when I was running at runlevel 3 and monitoring the console that brought me here and identified the problem. However your kernel is known to contain the bug - wait until 4.17.7 shows up in the Fedora repos or just hand-roll your own 4.17.7 kernel using the config options for your current kernel.

Comment 68 Leigh Orf 2018-07-20 15:37:10 UTC
(In reply to Patrick O'Callaghan from comment #66)
> (In reply to Didier G from comment #65)
> > For which raison 4.7.17 available on koji since few days is not yet in
> > updates or at least in testing ?
> 
> I installed it from updates-testing over 6 hours ago.

I can verify that 

sudo dnf update kernel --enablerepo=updates-testing

worked for me a few minutes ago, installing kernel 4.17.7-200.fc28

Comment 69 Ali Akcaagac 2018-07-20 16:00:58 UTC
I confirm 4.17.7 solved the issue for me.

Comment 70 Toby Ovod-Everett 2018-07-20 16:38:15 UTC
I used the following to ensure that all of the kernel packages updated in concert from the testing repo:

dnf --enablerepo=updates-testing update 'kernel*'

All 5 packages updated from the testing repo as follows:
kernel-4.17.7-200.fc28.x86_64
kernel-core-4.17.7-200.fc28.x86_64
kernel-devel-4.17.7-200.fc28.x86_64
kernel-headers-4.17.7-200.fc28.x86_64
kernel-modules-4.17.7-200.fc28.x86_64

I have copied 340 GB to a freshly encrypted USB3 partition using 4.17.7 without a hiccup.  Before I generally got 1.5 GB under 4.17.5, or 50 to 100 GB under 4.17.4.

Comment 71 Nick Judd 2018-07-20 16:56:32 UTC
*** Bug 1602939 has been marked as a duplicate of this bug. ***

Comment 72 Jeremy Cline 2018-07-20 17:42:21 UTC
*** Bug 1605855 has been marked as a duplicate of this bug. ***

Comment 73 Norman Gaywood 2018-07-20 21:53:56 UTC
Worried about this:

https://bugzilla.redhat.com/show_bug.cgi?id=1597559#c12

Hang with 4.17.7
Might be a different problem with NFS

Comment 74 Anatoli Babenia 2018-07-22 02:53:13 UTC
Seems like I've got filesystem corruption with all these hangs. Is there a way to check consistency of all packages on Fedora comparing what is available in repositories with actual filesystem contents?

https://ask.fedoraproject.org/en/question/124355/using-dnf-to-compare-filesystem-contents-with-repositories/

Comment 75 Fedora Update System 2018-07-22 03:03:19 UTC
kernel-4.17.7-200.fc28, kernel-tools-4.17.7-200.fc28 has been pushed to the Fedora 28 stable repository. If problems still persist, please make note of it in this bug report.

Comment 76 Donald O. 2018-07-22 19:15:29 UTC
I'm running 4.17.7-200 for a couple of hours. Lots of USB IO's. No problems. Everything is fine.
Again, thanks a lot for the great Fedora community!

Comment 77 Luigi Cantoni 2018-07-23 08:47:04 UTC
System dying for me also with kernel 4.17.4 onwards.
Only updated a few days ago.
For me it was my nightly backup which uses dd.
No need for USB device, I think that USB device is not critical to the problem just probably showed it up faster.
On a second machine I was able to track it down to the dd process.
I have regressed back to 4.17.3 on my server all OK with that version.
I also observed on my second machine that gparted trying to do image copying was also dying.

I see in the items above that:
Robert Holmes 2018-07-15 12:06:12 EDT
Following the mailing lists, this bug has been identified and fixed, and will be included in 4.17.7.

My dnf updates are not up to 4.17.7 yet, so I will keep updating my second machine until it's on that version and confirm it is now fixed. If so, then I'll update my main system also.
If all is then fixed I will not post again but if I still have this issue I will report back.
I hope it is all fixed.

Comment 78 Michael Schmitz 2018-07-24 07:42:25 UTC
The fixed kernel works like a charm! Thanks.

Comment 79 Krzysztof Kapustka 2018-07-24 08:09:26 UTC
Kernel 4.17.7 showed up yesterday in the main Fedora 28 Updates repo. No issues so far after update. The problem seems to be gone.

Comment 80 Saša Janiška 2018-07-25 16:23:41 UTC
(In reply to Fedora Update System from comment #75)
> kernel-4.17.7-200.fc28, kernel-tools-4.17.7-200.fc28 has been pushed to the
> Fedora 28 stable repository. If problems still persist, please make note of
> it in this bug report.

I did update to 4.17.7-200.fc28.x86_64 and still experience system hangs which mostly do occur when the system is idle/unattended...

Comment 81 Laura Abbott 2018-07-25 20:57:39 UTC
This bug is tracking a particular issue which should now be fixed. If you are still seeing issues on 4.17.7 or greater please open a separate bugzilla with system information.

Comment 82 John Damm Sørensen 2018-07-25 21:26:27 UTC
My rdiff-backups have now been running daily for more than a week without any hangs.

Source drives with XFS filesystems are:
Model: ATA WDC WD20EARX-00P (scsi)
Model: ATA M4-CT128M4SSD2 (scsi)

Destination drive also with XFS filesystem.
Model: ATA ST2000DM001-9YN1 (scsi)

Thanks for fixing this annoying problem.

Comment 83 Gerald Cox 2018-07-28 16:32:35 UTC
Reopening... issue returned with the 4.17.10 kernel - may also be with the 4.17.9 kernel, but I didn't test that one.

Comment 84 Piotr Drąg 2018-07-28 16:45:27 UTC
Please report a new bug.

Comment 85 Donald O. 2018-07-28 16:51:44 UTC
I have been running 4.17.9-200 for 9 hours now. No problems so far.

Comment 86 Gerald Cox 2018-07-28 22:28:38 UTC
If this is a kernel regression, it isn't a new bug.  It's the same one.

Comment 87 Didier G 2018-07-28 23:26:54 UTC
I am the initial reporter of this bug

I encountered this hang with 4.17.4, 4.17.5 and 4.17.6

I have not encountered it since 4.17.7, and I now run 4.17.10 with no problem.
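
A quick way to check whether a machine is running one of the releases named above is sketched below. The function name is_affected_kernel is hypothetical, and the range it tests (4.17.4 through 4.17.6) is simply the one reported in this bug.

```shell
#!/usr/bin/env bash
# Sketch: report whether a kernel release string falls in the range
# reported as hanging in this bug (4.17.4 through 4.17.6).

# is_affected_kernel [RELEASE]: RELEASE defaults to the output of
# "uname -r", e.g. 4.17.5-200.fc28.x86_64.
# Returns 0 if the release is in the reported range, 1 otherwise.
is_affected_kernel() {
    local release=${1:-$(uname -r)}
    local version=${release%%-*}    # drop the "-200.fc28.x86_64" suffix
    case $version in
        4.17.4|4.17.5|4.17.6) return 0 ;;
        *) return 1 ;;
    esac
}

if is_affected_kernel; then
    echo "Running kernel $(uname -r) is in the range reported as hanging (4.17.4 to 4.17.6)."
else
    echo "Running kernel $(uname -r) is outside the 4.17.4 to 4.17.6 range."
fi
```

Only the upstream version is compared, so the Fedora 27 builds (such as 4.17.5-100.fc27) are matched in the same way as the Fedora 28 ones.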

Comment 88 Gerald Cox 2018-07-29 00:26:46 UTC
(In reply to Didier G from comment #87)
> I am the initial reporter of this bug
> 
> I encountered this hang with 4.17.4, 4.17.5 and 4.17.6
> 
> I did not encountered it since 4.17.7 and I now run 4.17.10 with no problem.

I also encountered this hang with 4.17.4, 4.17.5 and 4.17.6.  It was resolved for me in 4.17.7 - but unfortunately it has returned with exactly the same symptoms for 4.17.10 (hang during early morning hours).  I'm currently testing with 4.17.9 and will report tomorrow if it occurs there.

Comment 89 Didier G 2018-07-29 00:31:41 UTC
(In reply to Gerald Cox from comment #88)
> I also encountered this hang with 4.17.4, 4.17.5 and 4.17.6.  It was
> resolved for me in 4.17.7 - but unfortunately it has returned with exactly
> the same symptoms for 4.17.10 (hang during early morning hours).  I'm
> currently testing with 4.17.9 and will report tomorrow if it occurs there.

Are you sure it is exactly the same hang ?

If you have two computers on your network, it would be nice to set up a netconsole to catch the log when the hang occurs.
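
A netconsole setup between two machines could be sketched as follows. The parameter format is the one given in the kernel's netconsole documentation; the helper name netconsole_param and every address, interface name and port shown are placeholders, not values from this report.

```shell
#!/usr/bin/env bash
# Sketch: build the netconsole module parameter for the machine being
# debugged. Format, per the kernel's netconsole documentation:
#   [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-macaddr]

# netconsole_param SRC_IP DEV TGT_IP [TGT_MAC] [PORT]
# SRC_* describe the hanging machine, TGT_* the machine collecting logs.
# An empty TGT_MAC leaves the MAC field blank (the kernel then broadcasts).
netconsole_param() {
    local src_ip=$1 dev=$2 tgt_ip=$3 tgt_mac=${4:-} port=${5:-6666}
    printf '%s@%s/%s,%s@%s/%s\n' \
        "$port" "$src_ip" "$dev" "$port" "$tgt_ip" "$tgt_mac"
}

# On the machine that hangs (as root), with placeholder addresses:
#   modprobe netconsole netconsole="$(netconsole_param 192.168.1.10 eth0 192.168.1.20)"
# On the machine collecting the log, listen for the UDP messages; the
# option syntax depends on the netcat variant installed:
#   nc -u -l 6666        # or: nc -u -l -p 6666
```

Redirecting the listener's output to a file keeps the last messages available after the hanging machine has been power-cycled.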

Comment 90 Gerald Cox 2018-07-29 00:44:21 UTC
(In reply to Didier G from comment #89)
> (In reply to Gerald Cox from comment #88)
> > I also encountered this hang with 4.17.4, 4.17.5 and 4.17.6.  It was
> > resolved for me in 4.17.7 - but unfortunately it has returned with exactly
> > the same symptoms for 4.17.10 (hang during early morning hours).  I'm
> > currently testing with 4.17.9 and will report tomorrow if it occurs there.
> 
> Are you sure it is exactly the same hang ?
> 
> If you have two computers on your network it will be nice to setup a
> netconsole to caught log when hang occurs.

I do have a laptop that I could use for that... if you could point me to instructions on how to do that, I'd appreciate it.  In the meantime, I'll just go ahead and close out this bug since you can't reproduce.  I also notice that 4.17.11 is already out, so maybe (hopefully) it's fixed there.  So far, no issues with 4.17.9.

Comment 91 Gerald Cox 2018-07-29 00:50:07 UTC
(In reply to Gerald Cox from comment #90)
> (In reply to Didier G from comment #89)
> > (In reply to Gerald Cox from comment #88)
> > > I also encountered this hang with 4.17.4, 4.17.5 and 4.17.6.  It was
> > > resolved for me in 4.17.7 - but unfortunately it has returned with exactly
> > > the same symptoms for 4.17.10 (hang during early morning hours).  I'm
> > > currently testing with 4.17.9 and will report tomorrow if it occurs there.
> > 
> > Are you sure it is exactly the same hang ?
> > 
> > If you have two computers on your network it will be nice to setup a
> > netconsole to caught log when hang occurs.
> 
> I do have a laptop that I could use for that... if you could point me to
> instructions on how to do that, I'd appreciate it.  In the meantime, I'll
> just go ahead and close out this bug since you can't reproduce.  I also
> notice that 4.7.11 is already out, so maybe (hopefully) it's fixed there. 
> So far, no issues with 4.17.9.

Never mind... found the instructions.  Thanks!

Comment 92 Didier G 2018-07-29 00:58:37 UTC
(In reply to Gerald Cox from comment #90)

> I do have a laptop that I could use for that... if you could point me to
> instructions on how to do that, I'd appreciate it.

You will find information on netconsole on this page: https://fedoraproject.org/wiki/Netconsole

To capture information for this hang only, you can skip the section "netconsole at boot - client side".
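
As a rough sketch of what the manual setup looks like (not taken from the wiki page): the netconsole kernel module takes a parameter of the form [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]. The helper name netconsole_param, the IP addresses, the ports and the interface name eth0 below are all placeholders I made up for illustration; substitute the values for your own network. The nc invocation is an assumption too, since its options differ between netcat variants.

```shell
#!/bin/sh
# Sketch only: every address, port and interface name here is a placeholder.

# Print a netconsole module parameter in the kernel's documented format:
#   [src-port]@[src-ip]/[dev],[tgt-port]@<tgt-ip>/[tgt-mac]
# An empty tgt-mac leaves the kernel default (broadcast) in place.
netconsole_param() {
    src_port=$1
    src_ip=$2
    dev=$3
    tgt_port=$4
    tgt_ip=$5
    tgt_mac=$6
    printf '%s@%s/%s,%s@%s/%s\n' \
        "$src_port" "$src_ip" "$dev" "$tgt_port" "$tgt_ip" "$tgt_mac"
}

# On the debugger (receiving) machine, listen for the UDP log stream, e.g.:
#   nc -u -l 6666
#
# On the debuggee (the machine that hangs), as root, load the module:
#   modprobe netconsole \
#       "netconsole=$(netconsole_param 6665 192.168.1.10 eth0 6666 192.168.1.20 '')"
```

The modprobe and nc lines are left as comments on purpose so that sourcing the snippet changes nothing on the system; copy them out once the values are right.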

To test whether your netconsole setup is correct, open a terminal as root on the debuggee machine and run the following command:

echo Test > /dev/kmsg

"Test" should then be displayed on the debugger machine.

Comment 93 Gerald Cox 2018-07-29 02:15:52 UTC
Thanks for the info... the correct close status of this bug is ERRATA, see comment #75.  

Just because it was reopened and subsequently closed doesn't change that fact.

Comment 94 Gerald Cox 2018-07-29 17:23:11 UTC
(In reply to Donald O. from comment #85)
> I'm running 4.17.9-200 now sind 9 hours. No probs up to now.

Thanks for the feedback - I've verified that 4.17.9-200 works for me also.  The bug returned in 4.17.10-200.  I'm going to wait and see if it gets fixed again in 4.17.11 - which should be in the repository in a few days.

Comment 95 Donald O. 2018-07-29 18:47:17 UTC
(In reply to Gerald Cox from comment #94)
> Thanks for the feedback - I've verified that 4.17.9-200 works for me also. 
> The bug returned in 4.17.10-200.  I'm going to wait and see if it gets fixed
> again in 4.17.11 - which should be in the repository in a few days.
Thanks for the bad news. It would be better if Fedora stopped delivering buggy kernels until it's fixed (again).

No problems: up 1 day, 11 hours, 37 minutes

Comment 96 Gerald Cox 2018-07-29 19:03:34 UTC
(In reply to Donald O. from comment #95)
> (In reply to Gerald Cox from comment #94)
> > Thanks for the feedback - I've verified that 4.17.9-200 works for me also. 
> > The bug returned in 4.17.10-200.  I'm going to wait and see if it gets fixed
> > again in 4.17.11 - which should be in the repository in a few days.
> thanks for the bad news. Better fedora stops delivering buggy kernels until
> its fixed (again).
> 
> no problems: up 1 day, 11 hours, 37 minutes

Well, to be fair, this isn't the fault of any Fedora personnel - it's upstream.  I must say, however, that I can't remember the last time I had so many issues with kernel updates.  4.17 is breaking a record for me.

Comment 97 Ozkan Sezer 2018-07-30 08:16:00 UTC
Is upstream aware that the issue has resurfaced?

Comment 98 Patrick O'Callaghan 2018-07-30 10:31:38 UTC
I have had no problems with 4.17.7-200 nor with 4.17.9-200, but I'm not doing any USB-based bulk I/O. My issue was always with QEMU/KVM freezing the system.

Comment 99 Ozkan Sezer 2018-07-30 13:42:36 UTC
(In reply to Patrick O'Callaghan from comment #98)
> I have had no problems with 4.17.7-200 nor with 4.17.9-200, 

4.17.7-200 and 4.17.9-200 are considered good. 4.17.4-4.17.6
are bad, and 4.17.10 is reported to bring the issue back.
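
For anyone who wants to check a machine against that summary, here is a small sketch. It encodes only the versions listed in this comment; the function name kernel_report and the labels are made up for illustration, and anything not listed (including later releases) is reported as unknown rather than guessed.

```shell
#!/bin/sh
# Classify a kernel release string using only the reports summarised above.
# Sketch only: the function name and labels are illustrative.
kernel_report() {
    case "$1" in
        4.17.[456]|4.17.[456]-*)          echo "bad" ;;
        4.17.7|4.17.7-*|4.17.9|4.17.9-*)  echo "good" ;;
        4.17.10|4.17.10-*)                echo "regressed" ;;
        *)                                echo "unknown" ;;
    esac
}

# Report on the currently running kernel.
kernel_report "$(uname -r)"
```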

Comment 100 Donald O. 2018-07-30 14:37:54 UTC
(In reply to Ozkan Sezer from comment #97)
> Is the upstream aware that the issue has resurfaced?

Good question. Shouldn't we make upstream aware that the issue has resurfaced?

Comment 101 Laura Abbott 2018-07-30 14:55:40 UTC
Again, this bug is tracking a particular issue which should now be fixed. If you are still seeing issues on 4.17.7 or greater, please open a separate bugzilla with system information.

Comment 102 Gerald Cox 2018-07-30 17:05:31 UTC
(In reply to Laura Abbott from comment #101)
> Again, This bug is tracking a particular issue which should now be fixed. If
> you are still seeing issues on 4.17.7 or greater please open a separate
> bugzilla with system information.

Laura, if this bug had been closed for a month or so I could see your point... in this instance we have a pervasive issue that spanned three kernel updates and has only been closed for a week, and apparently identical symptoms have reappeared after skipping only two fast-paced updates.  I believe there is some value in letting the folks cc'd on this bug know about this before they decide to apply 4.17.10 and beyond.  If I had opened a new bug they wouldn't be aware until after the fact.  AFAIK the patch for this issue could somehow have been accidentally omitted from the 4.17.10 build - stranger things have happened (and hopefully the Fedora kernel team has verified that this didn't occur).  Regarding 4.17.10, I'm not going to waste my time doing any debugging since 4.17.11 is being built in koji as I type this reply.

As I mentioned above, both 4.17.7 and 4.17.9 work fine.  4.17.10 is where I've immediately encountered the same or a suspiciously similar issue.

Comment 103 Laura Abbott 2018-07-30 17:12:35 UTC
(In reply to Gerald Cox from comment #102)
> (In reply to Laura Abbott from comment #101)
> > Again, This bug is tracking a particular issue which should now be fixed. If
> > you are still seeing issues on 4.17.7 or greater please open a separate
> > bugzilla with system information.
> 
> Laura, if this bug had been closed a month or so I could see your point...
> in this instance we have a pervasive issue that spanned three kernel
> updates, has been only closed a week and apparently identical symptoms have
> reappeared after skipping only two fast paced updates.  I believe there is
> some value with letting the folks that have cc: this bug to be aware of this
> before they decide to apply 4.17.10 and beyond.  If I had opened a new bug
> they wouldn't be aware until after the fact.  AFAIK the patch for this issue
> could have somehow been accidentally been omitted from 4.17.10 build -
> stranger things have happened (and hopefully the Fedora kernel team has
> verified that somehow this didn't occur).  Regarding 4.17.10, I'm not going
> to waste my time with doing any debugging since 4.17.11 is being built in
> koji as I type this reply.
> 
> As I mentioned above, both 4.17.7 and 4.17.9 work fine.  4.17.10 is where
> I've immediately encountered the same or suspiciously similar issue.


My concern is confusing issues. The patch fixed a specific issue with a known backtrace signature. If you are just seeing "hangs" that don't have the same backtrace, that's definitely a different issue that needs to be tracked separately. This bug has also become a bit unwieldy with multiple reports of "works/doesn't work", hence my request to split out to a separate bugzilla to reduce confusion (for me more than anything). You are welcome to point people to the new bug for tracking.

Comment 104 Gerald Cox 2018-07-30 17:20:23 UTC
(In reply to Laura Abbott from comment #103)
> (In reply to Gerald Cox from comment #102)
> > (In reply to Laura Abbott from comment #101)
> 
> My concern is confusing issues. The patch fixed a specific issue with a
> known backtrace signature. If you are just seeing "hangs" that don't have
> the same backtrace that's definitely a different issue that needs to be
> tracked separately. This bug has also become a big unwieldy with multiple
> reports of "works/doesn't work", hence my request to split out to a separate
> bugzilla to reduce confusion (for me more than anything). You are welcome to
> point people to the new bug for tracking.

Completely agree, I will test with 4.17.11 once it's built and open a new bug and reference that bug number here if I have issues with that update.  I just wanted people to be aware of the issue, and believed commenting in this ticket was the best way to raise awareness to those on cc: - if you know a better way to do this, please let me know.

Comment 105 Leigh Orf 2018-07-30 18:52:43 UTC
(In reply to Gerald Cox from comment #104)
> (In reply to Laura Abbott from comment #103)
> > (In reply to Gerald Cox from comment #102)
> > > (In reply to Laura Abbott from comment #101)
> > 
> > My concern is confusing issues. The patch fixed a specific issue with a
> > known backtrace signature. If you are just seeing "hangs" that don't have
> > the same backtrace that's definitely a different issue that needs to be
> > tracked separately. This bug has also become a big unwieldy with multiple
> > reports of "works/doesn't work", hence my request to split out to a separate
> > bugzilla to reduce confusion (for me more than anything). You are welcome to
> > point people to the new bug for tracking.
> 
> Completely agree, I will test with 4.17.11 once it's built and open a new
> bug and reference that bug number here if I have issues with that update.  I
> just wanted people to be aware of the issue, and believed commenting in this
> ticket was the best way to raise awareness to those on cc: - if you know a
> better way to do this, please let me know.

For what it's worth, I too am still experiencing random freezes up to 4.17.9 with Fedora 28. I have just subscribed to a bunch of other recent 4.17-related bug reports with hangs. I will submit a new one if I ever get anything useful from the kernel logs. It took about 5 freezes before I got the kernel message that identified the bug for this ticket. I will say I am now seeing freezes with NO activity to external USB drives - the machine is just sitting idle. Before early July I can't recall having an unexplained freeze. Since early July I always cringe when I smack the space bar to wake up the machine, for lately it's often just frozen. I have never had that experience and I've been running Fedora since there was a Fedora. I get the impression (but it could be selective bias) that others are in the same boat.

Comment 106 Frank Haefemeier 2018-07-30 22:46:27 UTC
(In reply to Leigh Orf from comment #105)
> (In reply to Gerald Cox from comment #104)
> > (In reply to Laura Abbott from comment #103)
> > > (In reply to Gerald Cox from comment #102)
> > > > (In reply to Laura Abbott from comment #101)
> > > 
> > > My concern is confusing issues. The patch fixed a specific issue with a
> > > known backtrace signature. If you are just seeing "hangs" that don't have
> > > the same backtrace that's definitely a different issue that needs to be
> > > tracked separately. This bug has also become a big unwieldy with multiple
> > > reports of "works/doesn't work", hence my request to split out to a separate
> > > bugzilla to reduce confusion (for me more than anything). You are welcome to
> > > point people to the new bug for tracking.
> > 
> > Completely agree, I will test with 4.17.11 once it's built and open a new
> > bug and reference that bug number here if I have issues with that update.  I
> > just wanted people to be aware of the issue, and believed commenting in this
> > ticket was the best way to raise awareness to those on cc: - if you know a
> > better way to do this, please let me know.
> 
> For what it's worth I too am still experiencing random freezes up to 4.17.9
> with Fedora 28. I have just subscribed to a bunch of other recent 4.17
> related bug reports with hangs. I will submit a new one if I ever get
> anything useful from the kernel logs. It took about 5 freezes before I got
> the kernel message that identified the bug for this ticket. I will say I am
> now seeing freezes with NO activity to external USB drives now - just
> sitting idly. Before early July I can't recall having an unexplained freeze.
> Since early July I always cringe when I smack the space bar to wake up the
> machine, for lately it's often just frozen. I have never had that experience
> and I've been running Fedora since there was a Fedora. I get the impression
> (but it could be selective bias) that others are in the same boat.
I am with you in this boat. I had upgraded the server and was scared by the 
instability. The system froze, and sometimes a kernel panic message was written 
to the console. I downgraded to 4.17.3 and everything has been fine since.
Now I am waiting for a longer stable period before I start the next round.
I also can't remember seeing issues like this before, and I have used Fedora for years.
Hopefully it will be stable soon...

Comment 107 Gerald Cox 2018-08-01 17:21:37 UTC
I've run 4.17.11 for 2 nights in a row and it appears to be good - so whatever was introduced or regressed in 4.17.10 has been fixed.  Thank goodness.

Comment 108 Donald O. 2018-08-02 08:24:55 UTC
Please post the link to the new 4.17.10 bug.

Comment 109 Gerald Cox 2018-08-02 18:08:52 UTC
(In reply to Donald O. from comment #108)
> please post the link to the new  4.17.10 bug.

Please see my comments #102 and #104.  Since the 4.17.11 release was imminent, I didn't pursue opening a bug for 4.17.10.  4.17.11 has been running fine for 3 nights so far, so there is nothing to report other than that whatever the issue was with 4.17.10 appears to have been resolved in 4.17.11.

Comment 110 Frank Haefemeier 2018-08-03 22:20:21 UTC
(In reply to Gerald Cox from comment #109)
> (In reply to Donald O. from comment #108)
> > please post the link to the new  4.17.10 bug.
> 
> Please see my comment in #102 and #104.  Since 4.17.11 release was imminent
> I didn't pursue opening a bug for 4.17.10.  4.17.11 has been running fine
> for 3 nights so far so nothing to report other than appears whatever the
> issue was with 4.17.10 - it was resolved in 4.17.11.

Who knows when kernel 4.17.11 will be available in the standard update stream? The latest kernel there is 4.17.9.

Comment 111 Samuel Sieb 2018-08-03 23:23:05 UTC
It's in updates-testing.  If you don't have that repo enabled, you can add "--enablerepo=updates-testing" to the dnf command to get it right now.
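
A sketch of what that looks like in practice. The wrapper function upgrade_kernel_from_testing and the RUN variable are made up for illustration; only the --enablerepo=updates-testing option comes from the comment above, and the kernel* package glob is an assumption about which packages you want to pull in. By default the snippet only prints the command, so you can check it before running anything as root.

```shell
#!/bin/sh
# Print the dnf command by default; set RUN=1 to actually execute it via sudo.
# Sketch only: the function name and the RUN switch are illustrative.
upgrade_kernel_from_testing() {
    # Enable updates-testing for this one transaction and limit it to kernel packages.
    set -- dnf upgrade --enablerepo=updates-testing 'kernel*'
    if [ "${RUN:-0}" = 1 ]; then
        sudo "$@"
    else
        echo "$*"
    fi
}

upgrade_kernel_from_testing
```

Because the repo is enabled only on the command line, later plain "dnf upgrade" runs keep using the regular update channel.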

Comment 112 Donald O. 2018-08-04 06:17:29 UTC
4.17.11 was deployed a couple of hours ago through the regular update channel. I have been running it for 5 minutes.

Comment 113 Luigi Cantoni 2018-08-04 08:42:07 UTC
I am not on the beta channel, so I only get full releases and thus did not get the "buggy" 4.17.10. I can confirm, though, that the earlier bug was fixed for me up to 4.17.9, which was working fine on several machines. I downloaded 4.17.11 a few hours ago on one of my home machines, and the cases that caused the earlier bug to appear (hanging on large data transfers) worked fine for me, so that bug still appears fixed in 4.17.11.
I will test 4.17.11 at work in two days, but I am sure it will be fine.

Comment 114 Donald O. 2018-08-04 21:56:40 UTC
4.17.11 up now for 15 hours, 45 minutes. No problems. Everything's fine. Thanks RH. Thanks community!!!

Comment 115 David W. Legg 2018-08-05 08:44:02 UTC
Have had no 'hangs' since abandoning 4.17.7.
Suggest discouraging its use somehow?
Now on 4.17.11 from Fedora testing repo.
Thanks to everyone who helped.

Comment 116 Donald O. 2018-08-05 19:51:23 UTC
4.17.11 up for 1 day, 13 hours, 37 minutes. Everything's fine. Thanks!
(We have some slowdowns with concurrent USB access, even when the background job is running with the lowest task/IO priorities. That isn't a 4.17 problem, but a general one.)

