1627288 – F29 becomes unresponsive with no error indication


Keywords:
Status:	CLOSED INSUFFICIENT_DATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	29
Hardware:	x86_64
OS:	Linux
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2018-09-10 15:07 UTC by Larkin Lowrey
Modified:	2019-10-24 02:24 UTC
CC List:	17 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2019-09-17 20:09:54 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
Output from echo "l" > /proc/sysrq-trigger (8.42 KB, text/plain) 2018-09-10 15:07 UTC, Larkin Lowrey
Output from echo "t" > /proc/sysrq-trigger (2.42 MB, text/plain) 2018-09-10 15:08 UTC, Larkin Lowrey
sysrq 'l' -- handle_mm_fault (6.23 KB, text/plain) 2018-09-18 15:09 UTC, Larkin Lowrey
sysrq-l w/ 4.19.5-300.fc29.x86_64 (10.24 KB, text/plain) 2018-12-14 16:32 UTC, Larkin Lowrey
sysrq-t w/ 4.19.5-300.fc29.x86_64 (2.34 MB, text/plain) 2018-12-14 16:33 UTC, Larkin Lowrey

Description Larkin Lowrey 2018-09-10 15:07:34 UTC

Created attachment 1482176
Output from echo "l" > /proc/sysrq-trigger

Description of problem:
The system becomes unresponsive except for a few very basic commands. I can still interrogate the /proc fs, but any command that has to be loaded from storage, or that itself reads from storage, appears to freeze.

I am able to SSH to the host and it accepts my credentials but hangs just before showing the shell prompt.

Version-Release number of selected component (if applicable):
Kernel 4.17.19-200.fc28.x86_64

How reproducible:
Happens about once every 48-72 hours.

Steps to Reproduce:
1. Use system normally

Actual results:
All processes seem to freeze. I can log in but the session will freeze just before showing the shell prompt.


Expected results:
All processes continue to operate as expected.

Additional info:

There are no error messages on the console. I'm using a serial console that is connected to a logger so I can say for sure that there are no errors output to console.

This happens every few days, typically during or shortly after the nightly backup run.

The system is dual 16-core Opteron with ECC memory.

I updated from F27 to F28 about 6 weeks ago. This phenomenon started just a week ago when I added a new network card and did a 'dnf update' and switched from 4.17.12-200 to 4.17.19-200. The network still functions so I find it hard to believe the NIC is the problem.

I've reverted back to 4.17.12-200.fc28.x86_64 which had been reliable for me. I'll report back if the problem recurs.

Comment 1 Larkin Lowrey 2018-09-10 15:08:43 UTC

Created attachment 1482177
Output from echo "t" > /proc/sysrq-trigger
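
For anyone reproducing this, the two dumps attached so far came from writing a single command character to /proc/sysrq-trigger: 'l' prints a backtrace for each active CPU and 't' dumps the task list, both into the kernel log. A minimal Python sketch of the same action is below. The function name and the trigger_path parameter are illustrative only (the parameter exists so the function can be pointed at a scratch file), and on a real system it has to run as root.

```python
def trigger_sysrq(command, trigger_path="/proc/sysrq-trigger"):
    """Write one SysRq command character (e.g. 'l' or 't') to the trigger file.

    trigger_path is an illustrative parameter so the function can be
    exercised against an ordinary file instead of the real /proc entry.
    """
    if len(command) != 1:
        raise ValueError("SysRq commands are a single character")
    with open(trigger_path, "w") as trigger:
        trigger.write(command)


if __name__ == "__main__":
    # Needs root; the output lands in the kernel log (dmesg, serial console).
    trigger_sysrq("l")  # backtrace of all active CPUs
    trigger_sysrq("t")  # dump of all tasks
```

On a host that is already hung, the interpreter would have to be running (or cached) beforehand, since starting a new program may itself block on storage.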

Comment 2 Larkin Lowrey 2018-09-10 15:09:55 UTC

# free
              total        used        free      shared  buff/cache   available
Mem:       90718964    16834516    35883396        2652    38001052    71913592
Swap:      33554428     1991680    31562748

# swapon
NAME      TYPE      SIZE USED PRIO
/dev/dm-1 partition  32G 1.9G   -2

# top
top - 09:58:00 up 23:57,  2 users,  load average: 46.08, 46.70, 45.34
Tasks: 833 total,   1 running, 528 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.4 us,  3.6 sy,  0.0 ni, 92.6 id,  0.0 wa,  0.1 hi,  0.3 si,  0.0 st
KiB Mem : 90718960 total, 35880104 free, 16836428 used, 38002432 buff/cache
KiB Swap: 33554428 total, 31562748 free,  1991680 used. 71911088 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
 3539 qemu      20   0 21.470g 7.138g  13216 D 200.0  8.3   1885:36 qemu-system-x86
25343 root      20   0  256068   5420   3708 R  13.6  0.0   0:00.82 top
 2957 plex      20   0  949356  65444  22720 S   4.5  0.1   6:32.82 Plex Media Serv
 3716 qemu      20   0 9507448 799448  13104 S   4.5  0.9  19:23.66 qemu-system-x86
 4282 root      20   0 1577044  36664   5864 S   4.5  0.0  16:35.56 python3
 4704 root      20   0 27.258g 578280  11628 S   4.5  0.6  39:13.72 java
    1 root      20   0  240820  13576   6700 S   0.0  0.0   0:20.01 systemd
    2 root      20   0       0      0      0 S   0.0  0.0   0:00.47 kthreadd
    3 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 rcu_gp
    5 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/0:0H
    8 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 mm_percpu_wq
    9 root      20   0       0      0      0 S   0.0  0.0   0:00.83 ksoftirqd/0
   10 root      20   0       0      0      0 I   0.0  0.0   1:06.87 rcu_sched
   11 root      20   0       0      0      0 I   0.0  0.0   0:00.00 rcu_bh
   12 root      rt   0       0      0      0 S   0.0  0.0   0:00.09 migration/0
   13 root      rt   0       0      0      0 S   0.0  0.0   0:00.01 watchdog/0
   14 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/0
   15 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/1
   16 root      rt   0       0      0      0 S   0.0  0.0   0:00.16 watchdog/1
   17 root      rt   0       0      0      0 S   0.0  0.0   0:00.03 migration/1
   18 root      20   0       0      0      0 S   0.0  0.0   0:00.26 ksoftirqd/1
   20 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/1:0H
   21 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/2
   22 root      rt   0       0      0      0 S   0.0  0.0   0:00.15 watchdog/2
   23 root      rt   0       0      0      0 S   0.0  0.0   0:00.07 migration/2
   24 root      20   0       0      0      0 S   0.0  0.0   0:00.17 ksoftirqd/2
   26 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/2:0H
   27 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/3
   28 root      rt   0       0      0      0 S   0.0  0.0   0:00.16 watchdog/3
   29 root      rt   0       0      0      0 S   0.0  0.0   0:00.04 migration/3
   30 root      20   0       0      0      0 S   0.0  0.0   0:00.10 ksoftirqd/3
   32 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/3:0H
   33 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/4
   34 root      rt   0       0      0      0 S   0.0  0.0   0:00.15 watchdog/4
   35 root      rt   0       0      0      0 S   0.0  0.0   0:00.06 migration/4
   36 root      20   0       0      0      0 S   0.0  0.0   0:02.87 ksoftirqd/4
   38 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/4:0H
   39 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/5
   40 root      rt   0       0      0      0 S   0.0  0.0   0:00.15 watchdog/5
   41 root      rt   0       0      0      0 S   0.0  0.0   0:00.04 migration/5
   42 root      20   0       0      0      0 S   0.0  0.0   0:00.10 ksoftirqd/5
   44 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/5:0H
   45 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/6
   46 root      rt   0       0      0      0 S   0.0  0.0   0:00.15 watchdog/6
   47 root      rt   0       0      0      0 S   0.0  0.0   0:00.06 migration/6
   48 root      20   0       0      0      0 S   0.0  0.0   0:02.77 ksoftirqd/6
   50 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/6:0H
   51 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/7
   52 root      rt   0       0      0      0 S   0.0  0.0   0:00.16 watchdog/7
   53 root      rt   0       0      0      0 S   0.0  0.0   0:00.04 migration/7
   54 root      20   0       0      0      0 S   0.0  0.0   0:00.09 ksoftirqd/7
   56 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/7:0H
   57 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/8
   58 root      rt   0       0      0      0 S   0.0  0.0   0:00.11 watchdog/8
   59 root      rt   0       0      0      0 S   0.0  0.0   0:00.06 migration/8
   60 root      20   0       0      0      0 S   0.0  0.0   0:00.12 ksoftirqd/8
   62 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/8:0H
   64 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/9
   65 root      rt   0       0      0      0 S   0.0  0.0   0:00.11 watchdog/9
   66 root      rt   0       0      0      0 S   0.0  0.0   0:00.04 migration/9
   67 root      20   0       0      0      0 S   0.0  0.0   0:00.06 ksoftirqd/9
   69 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/9:0H
   70 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/10
   71 root      rt   0       0      0      0 S   0.0  0.0   0:00.09 watchdog/10
   72 root      rt   0       0      0      0 S   0.0  0.0   0:00.05 migration/10
   73 root      20   0       0      0      0 S   0.0  0.0   0:00.12 ksoftirqd/10
   75 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/10:0H
   76 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/11
   77 root      rt   0       0      0      0 S   0.0  0.0   0:00.10 watchdog/11
   78 root      rt   0       0      0      0 S   0.0  0.0   0:00.04 migration/11
   79 root      20   0       0      0      0 S   0.0  0.0   0:00.05 ksoftirqd/11
   81 root       0 -20       0      0      0 I   0.0  0.0   0:00.00 kworker/11:0H
   82 root      20   0       0      0      0 S   0.0  0.0   0:00.00 cpuhp/12
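
The top output above shows a load average near 46 while the CPUs are about 92% idle, which is consistent with many tasks sitting in uninterruptible sleep (state D in the S column, as for PID 3539), since Linux counts such tasks toward the load average. A rough sketch for listing them by reading /proc/<pid>/stat is below. The function names are mine, and it only looks at thread-group leaders, not at the individual threads under /proc/<pid>/task.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat.

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so split on the first '(' and the last ')'.
    """
    left = stat_text.index("(")
    right = stat_text.rindex(")")
    pid = int(stat_text[:left])
    comm = stat_text[left + 1:right]
    state = stat_text[right + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D."""
    blocked = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "stat")) as stat_file:
                pid, comm, state = parse_stat(stat_file.read())
        except (OSError, ValueError):
            continue  # the process exited while we were scanning
        if state == "D":
            blocked.append((pid, comm))
    return blocked


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

Because this only reads /proc, it should keep working in the situation described here, provided the interpreter was started before the hang.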

Comment 3 Larkin Lowrey 2018-09-18 15:09:15 UTC

Created attachment 1484410
sysrq 'l' -- handle_mm_fault

Comment 4 Larkin Lowrey 2018-09-18 15:15:21 UTC

This happened again but with kernel 4.18.5-200.fc28.x86_64.

A qemu process was hung at 100% CPU. I managed to kill it, after which kworker/0:1+eve hung at 100% CPU.

I did a 'l' sysrq and saw the same thing I always see: handle_mm_fault + do_swap_page.

I had several days of stability with 4.17.12-200.fc28.x86_64. Going back to that now.

Comment 5 Larkin Lowrey 2018-11-26 16:33:04 UTC

I had over 60 days of uptime with kernel 4.17.12-200.fc28.x86_64. I "upgraded" to F29 and to kernel 4.19.2-301.fc29.x86_64 and less than 48 hours later the host froze, just like before with 4.18.

I've now lost the ability to go back to 4.17.12-200.fc28.x86_64.

There are no messages logged to the console. The host continues to function except for operations which require disk I/O. Those commands hang and I can't ctrl-c out of them.

Comment 6 Larkin Lowrey 2018-11-26 16:42:23 UTC

I just noticed that backups were still in progress on the host. The FS is btrfs and the backups are from a snapshot subvol. Apparently, the backup process can still read from the filesystem.

It's possible that the hangs I'm seeing are due to failure to write to the FS, possibly due to a deadlock related to the snapshot subvol.
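
One way to test the write-side hypothesis would be a small probe that attempts a write plus fsync and reports whether it finishes within a time limit. The sketch below is only an illustration: the function name, the file prefix, and the default timeout are arbitrary choices, and the directory to probe would be whichever btrfs mount is suspected.

```python
import os
import tempfile
import threading


def probe_write(directory, timeout=5.0):
    """Return True if a small write + fsync in `directory` completes in time.

    The write runs in a daemon thread so that the caller gets an answer
    even when the write itself blocks. An I/O error counts as a failed
    probe and returns False without waiting for the timeout.
    """
    finished = threading.Event()
    outcome = {"ok": False}

    def worker():
        try:
            fd, path = tempfile.mkstemp(dir=directory, prefix="write-probe-")
            try:
                os.write(fd, b"probe\n")
                os.fsync(fd)  # force the data out to the filesystem
            finally:
                os.close(fd)
                os.unlink(path)
            outcome["ok"] = True
        except OSError:
            pass  # reported as a failed probe rather than a hang
        finally:
            finished.set()

    threading.Thread(target=worker, daemon=True).start()
    return finished.wait(timeout) and outcome["ok"]
```

If the write really is stuck in uninterruptible sleep, the probing process may not be able to exit until the write completes, so it is best run as a throwaway process.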

Comment 7 Jeremy Cline 2018-12-03 17:30:58 UTC

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.
 
Fedora 29 has now been rebased to 4.19.5-300.fc29.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.
 
If you experience different issues, please open a new bug report for those.

Comment 8 Larkin Lowrey 2018-12-14 16:32:31 UTC

Created attachment 1514407
sysrq-l w/ 4.19.5-300.fc29.x86_64

Comment 9 Larkin Lowrey 2018-12-14 16:33:06 UTC

Created attachment 1514408
sysrq-t w/ 4.19.5-300.fc29.x86_64

Comment 10 Larkin Lowrey 2018-12-14 16:40:45 UTC

The issue still occurs with 4.19.5-300.fc29.x86_64.

Every time it happens the stack trace contains a call to try_async_pf. There are always at least 6 VMs running, some amount of swap usage, and at least 50G RAM free.

It looks like a page fault occurs in a VM, the kernel tries to handle it asynchronously, and the page-in completion never arrives.
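
To check a saved sysrq dump for this pattern without reading megabytes of text by eye, something like the following could be used. It is a sketch that assumes the usual kernel stack-frame notation of symbol+0xOFFSET/0xSIZE; the function name is mine.

```python
import re


def frames_mentioning(log_text, symbol):
    """Return the stack-frame lines in a kernel log that name `symbol`.

    Frames are assumed to look like 'symbol+0x6f/0x200', optionally
    preceded by '? ' and followed by a module name in brackets.
    """
    # The lookbehind keeps 'try_async_pf' from matching inside a longer
    # symbol name such as 'kvm_try_async_pf'.
    pattern = re.compile(
        r"(?<![\w.])" + re.escape(symbol) + r"\+0x[0-9a-f]+/0x[0-9a-f]+"
    )
    return [line.strip() for line in log_text.splitlines() if pattern.search(line)]


if __name__ == "__main__":
    import sys

    # Usage: python frames.py sysrq-t.txt try_async_pf
    with open(sys.argv[1], errors="replace") as dump:
        for frame in frames_mentioning(dump.read(), sys.argv[2]):
            print(frame)
```

Running it once with try_async_pf and once with do_swap_page against the same dump would show whether both appear, as described in the comments above.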

Comment 11 Larkin Lowrey 2019-01-29 02:38:46 UTC

This still occurs with kernel 4.20.3-200.fc29.x86_64.

A successful workaround is to add 'no-kvmapf' to the guest kernel command-line to disable async page fault handling. My host and guests have been completely stable since adding that option.
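
Since the option has to be present on every guest, a quick check run inside each guest can confirm it took effect after a reboot. The sketch below simply looks for the token in /proc/cmdline; the function names are illustrative.

```python
def has_kernel_param(cmdline, name):
    """Return True if `name` appears as a parameter on a kernel command line.

    Matches both bare flags ('no-kvmapf') and 'name=value' forms.
    """
    for token in cmdline.split():
        if token == name or token.startswith(name + "="):
            return True
    return False


def guest_has_no_kvmapf(cmdline_path="/proc/cmdline"):
    """Check the running kernel's command line for the no-kvmapf flag."""
    with open(cmdline_path) as cmdline_file:
        return has_kernel_param(cmdline_file.read(), "no-kvmapf")


if __name__ == "__main__":
    print("no-kvmapf set:", guest_has_no_kvmapf())
```

This only verifies that the flag was passed to the guest kernel; it does not by itself prove that async page faults are disabled.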

Comment 12 Justin M. Forbes 2019-08-20 17:44:05 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 29 kernel bugs.

Fedora 29 has now been rebased to 5.2.9-100.fc29.  Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you have moved on to Fedora 30, and are still experiencing this issue, please change the version to Fedora 30.

If you experience different issues, please open a new bug report for those.

Comment 13 Justin M. Forbes 2019-09-17 20:09:54 UTC

*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 3 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.
