Bug 1567725 - NVMe IO hang after offline CPU
Summary: NVMe IO hang after offline CPU
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 28
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Kernel Maintainer List
QA Contact: Fedora Extras Quality Assurance
Reported: 2018-04-16 05:31 UTC by Zhang Yi
Modified: 2018-08-29 15:03 UTC
CC: 17 users

Last Closed: 2018-08-29 15:03:03 UTC
Type: Bug
jforbes: needinfo?



Description Zhang Yi 2018-04-16 05:31:17 UTC
Description of problem:
NVMe IO hang after offline CPU

Version-Release number of selected component (if applicable):
4.16.1-300.fc28.x86_64

How reproducible:
100%

Steps to Reproduce:
#fio -filename=/dev/nvme0n1p1 -iodepth=1 -thread -rw=randwrite -ioengine=psync -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 -direct=1 -runtime=120 -size=-group_reporting -name=mytest -numjobs=60 &
#sleep 10
#echo 0 > /sys/devices/system/cpu/cpu1/online
#echo 0 > /sys/devices/system/cpu/cpu2/online
#echo 0 > /sys/devices/system/cpu/cpu3/online
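The steps above can be wrapped in a small script. This is a sketch, not part of the original report: the fio arguments are taken from the reproducer, with the `-bssplit` list and the truncated `-size=` value left out, and the script only runs the workload when root privileges, fio, and the NVMe partition are all present.

```shell
#!/bin/sh
# Sketch of the reproducer above. Needs root, fio, and the NVMe
# partition from the report; otherwise it prints a note and exits.

DEV=/dev/nvme0n1p1

reproduce() {
    # Start the random-write workload in the background.
    fio -filename="$DEV" -iodepth=1 -thread -rw=randwrite -ioengine=psync \
        -direct=1 -runtime=120 -group_reporting -name=mytest -numjobs=60 &

    # Give fio time to get I/O in flight, then offline CPUs 1-3.
    sleep 10
    for cpu in 1 2 3; do
        echo 0 > /sys/devices/system/cpu/cpu"$cpu"/online
    done

    wait          # hangs here when the bug triggers
    echo "fio finished"

    # Bring the CPUs back online afterwards.
    for cpu in 1 2 3; do
        echo 1 > /sys/devices/system/cpu/cpu"$cpu"/online
    done
}

if [ "$(id -u)" = 0 ] && command -v fio >/dev/null 2>&1 && [ -b "$DEV" ]; then
    reproduce
else
    echo "prerequisites missing (root + fio + $DEV); not running"
fi
```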

Actual results:
fio never finishes; the I/O hangs indefinitely

Expected results:
fio finishes normally after the specified runtime

Additional info:

[  122.024084] smpboot: CPU 1 is now offline
[  122.055263] smpboot: CPU 2 is now offline
[  122.299090] smpboot: CPU 3 is now offline
[  153.009736] nvme nvme0: I/O 898 QID 2 timeout, completion polled
[  153.016460] nvme nvme0: I/O 900 QID 2 timeout, aborting
[  153.022310] nvme nvme0: I/O 901 QID 2 timeout, aborting
[  153.028135] nvme nvme0: Abort status: 0x0
[  153.032627] nvme nvme0: I/O 902 QID 2 timeout, aborting
[  153.038458] nvme nvme0: Abort status: 0x0
[  153.042948] nvme nvme0: I/O 765 QID 7 timeout, completion polled
[  153.049652] nvme nvme0: Abort status: 0x0
[  153.054146] nvme nvme0: I/O 767 QID 7 timeout, aborting
[  153.059986] nvme nvme0: I/O 768 QID 7 timeout, aborting
[  153.065817] nvme nvme0: Abort status: 0x0
[  153.070304] nvme nvme0: I/O 769 QID 7 timeout, aborting
[  153.076133] nvme nvme0: Abort status: 0x0
[  153.080619] nvme nvme0: I/O 1015 QID 8 timeout, completion polled
[  153.087420] nvme nvme0: Abort status: 0x0
[  153.091913] nvme nvme0: I/O 1018 QID 8 timeout, completion polled
[  153.098728] nvme nvme0: I/O 1019 QID 8 timeout, completion polled
[  183.650509] nvme nvme0: I/O 900 QID 2 timeout, reset controller
[  186.020822] nvme nvme0: I/O 901 QID 2 timeout, disable controller
[  186.038595] nvme nvme0: I/O 902 QID 2 timeout, disable controller
[  186.054499] nvme nvme0: I/O 767 QID 7 timeout, disable controller
[  186.072495] nvme nvme0: I/O 768 QID 7 timeout, disable controller
[  186.087494] nvme nvme0: I/O 769 QID 7 timeout, disable controller
[  218.659194] nvme nvme0: I/O 900 QID 2 timeout, completion polled
[  218.665926] nvme nvme0: I/O 901 QID 2 timeout, disable controller

# ps aux | grep fio
root      1471  1.1  9.6 6428504 1568544 pts/0 Sl+  01:06   0:14 fio -filename=/dev/nvme0n1p1 -iodepth=1 -thread -rw=randwrite -ioengine=psync -bssplit=5k/10:9k/10:13k/10:17k/10:21k/10:25k/10:29k/10:33k/10:37k/10:41k/10 -direct=1 -runtime=600 -size=-group_reporting -name=mytest -numjobs=60
 
# cat /proc/1471/status 
Name:	fio
Umask:	0022
State:	S (sleeping)
Tgid:	1471
Ngid:	1471
Pid:	1471
PPid:	1470
TracerPid:	0
Uid:	0	0	0	0
Gid:	0	0	0	0
FDSize:	64
Groups:	0 
NStgid:	1471
NSpid:	1471
NSpgid:	1272
NSsid:	1093
VmPeak:	 6494040 kB
VmSize:	 6428504 kB
VmLck:	       0 kB
VmPin:	       0 kB
VmHWM:	 1568544 kB
VmRSS:	 1568544 kB
RssAnon:	 1157260 kB
RssFile:	    3284 kB
RssShmem:	  408000 kB
VmData:	 1673888 kB
VmStk:	     132 kB
VmExe:	     632 kB
VmLib:	    3908 kB
VmPTE:	    3700 kB
VmSwap:	       0 kB
HugetlbPages:	       0 kB
CoreDumping:	0
Threads:	62
SigQ:	3/62985
SigPnd:	0000000000000000
ShdPnd:	0000000000000000
SigBlk:	0000000000000000
SigIgn:	0000000000000000
SigCgt:	0000000180004202
CapInh:	0000000000000000
CapPrm:	0000003fffffffff
CapEff:	0000003fffffffff
CapBnd:	0000003fffffffff
CapAmb:	0000000000000000
NoNewPrivs:	0
Seccomp:	0
Cpus_allowed:	fff
Cpus_allowed_list:	0-11
Mems_allowed:	00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000003
Mems_allowed_list:	0-1
voluntary_ctxt_switches:	122885
nonvoluntary_ctxt_switches:	20

# cat /proc/1471/stack 
[<0>] hrtimer_nanosleep+0xd4/0x1e0
[<0>] SyS_nanosleep+0x75/0xa0
[<0>] do_syscall_64+0x74/0x180
[<0>] 0xffffffffffffffff
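The stack above is only the main fio process, which is harmlessly parked in nanosleep; the hang is in the worker threads. One way to confirm that is to look at each thread under /proc/<pid>/task/ rather than the top-level process. The helper below is a hypothetical sketch (not from the report) that tallies per-thread scheduler states; during the hang the stuck workers would show up in D (uninterruptible) sleep. Per-thread stack files need root, but the status files read here are world-readable.

```shell
#!/bin/sh
# Hypothetical helper: summarize the scheduler state of every thread
# in a process, e.g. "thread_states 1471" for the fio run above.

thread_states() {
    pid=$1
    for st in /proc/"$pid"/task/*/status; do
        # The State: line looks like "State: S (sleeping)"; keep the letter.
        awk '/^State:/ {print $2}' "$st"
    done | sort | uniq -c    # count threads per state
}

# Demonstrate on the current shell.
thread_states "$$"
```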

Comment 1 Justin M. Forbes 2018-07-23 14:53:06 UTC
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There are a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 28 kernel bugs.

Fedora 28 has now been rebased to 4.17.7-200.fc28. Please test this kernel update (or newer) and let us know if your issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 2 Justin M. Forbes 2018-08-29 15:03:03 UTC
*********** MASS BUG UPDATE **************
This bug is being closed with INSUFFICIENT_DATA as there has not been a response in 5 weeks. If you are still experiencing this issue, please reopen and attach the relevant data from the latest kernel you are running and any data that might have been requested previously.

