Bug 1724345
Summary: mkfs.xfs hangs issuing discards
Product: Red Hat Enterprise Linux 7
Component: kernel (sub component: NVMe)
Version: 7.6
Reporter: John Pittman <jpittman>
Assignee: Ming Lei <minlei>
QA Contact: Zhang Yi <yizhan>
Status: CLOSED ERRATA
Severity: urgent
Priority: urgent
CC: alex.wang, amityony, asanders, bubrown, dmilburn, emilne, gtiwari, linville, minlei, pdwyer, rgirase, yizhan, ysudarev
Keywords: OtherQA
Target Milestone: rc
Hardware: All
OS: Linux
Fixed In Version: kernel-3.10.0-1145.el7
Last Closed: 2020-09-29 21:01:42 UTC
Type: Bug
Description
John Pittman
2019-06-26 20:21:33 UTC
The following script may show what the last bad request is. Could you please run it before starting the mkfs test, then post the log after the timeout is triggered?

/usr/share/bcc/tools/trace -t -C \
    -I 'linux/blkdev.h' -I 'linux/blk-mq.h' \
    'nvme_queue_rq(void *p, struct blk_mq_queue_data *bd) "%d + %d %x", bd->rq->__sector, bd->rq->__data_len >> 9, bd->rq->cmd_flags' \
    'blk_mq_complete_request(struct request *rq, int error) "%d %d + %d %x", error, rq->__sector, rq->__data_len >> 9, rq->cmd_flags'

BTW, the 'bcc' package is required for the above script; please also run it in a separate standalone session.

(In reply to Ming Lei from comment #4)
> The following script may show what the last bad request is [...]

Maybe better to dump rq->tag as well, so please use the following one:

/usr/share/bcc/tools/trace -t -C \
    -I 'linux/blkdev.h' -I 'linux/blk-mq.h' \
    'scsi_queue_rq(void *p, struct blk_mq_queue_data *bd) "%d: %d + %d %x", bd->rq->tag, bd->rq->__sector, bd->rq->__data_len >> 9, bd->rq->cmd_flags' \
    'blk_mq_complete_request(struct request *rq, int error) "%d %d: %d + %d %x", rq->tag, error, rq->__sector, rq->__data_len >> 9, rq->cmd_flags'

Created attachment 1585181 [details]
bcc output 1
Hi Ming. Thanks a lot for looking. I tried the command on one of our 7.6 test systems and the compile failed. Could you please take a look? Thanks!
Comment on attachment 1585181 [details]
bcc output 1
I didn't see this issue in my RHEL 7 VM; maybe that's because I have the kernel-devel or kernel-headers package installed.
However, you can work around the issue by changing the following line in
/lib/modules/3.10.0-957.21.3.el7.x86_64/build/include/linux/compiler-gcc.h
#define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)
into
#define asm_volatile_goto(x...)
Thanks,
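The header tweak above can be applied with a single sed edit. The sketch below is mine, not from the thread: it operates on a scratch copy of the offending line so it is safe to run anywhere; to apply it for real, point HDR at /lib/modules/$(uname -r)/build/include/linux/compiler-gcc.h and keep a backup first.

```shell
# Sketch of Ming's workaround as a sed one-liner, demonstrated on a scratch
# copy of the offending line rather than the real compiler-gcc.h.
HDR=$(mktemp)
printf '%s\n' '#define asm_volatile_goto(x...) do { asm goto(x); asm (""); } while (0)' > "$HDR"
# Reduce the macro to a no-op so the clang pass used by bcc no longer
# trips over 'asm goto'.
sed -i 's/#define asm_volatile_goto(x\.\.\.).*/#define asm_volatile_goto(x...)/' "$HDR"
cat "$HDR"
```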
Hi Ming. The customer finally got back to us. Unfortunately they were unable to change the discard_max_bytes value.
# echo 2147483647 > /sys/block/nvme0n1/queue/discard_max_bytes
-bash: echo: write error: Invalid argument
# cat /sys/block/nvme0n1/queue/discard_max_bytes
2199023255040
Additionally, I made a small error in providing your test kernel to the customer. Your original packages are of course no longer available, and I forgot to give the customer kernel-devel, so the script failed with the error below:
> # /usr/share/bcc/tools/trace -t -C \
>     -I 'linux/blkdev.h' -I 'linux/blk-mq.h' \
>     'nvme_queue_rq(struct blk_mq_hw_ctx *hctx, struct blk_mq_queue_data *bd) "%d-%d: %u + %u %x", hctx->queue_num, bd->rq->tag, bd->rq->__sector, bd->rq->__data_len >> 9, bd->rq->cmd_flags' \
>     'nvme_pci_complete_rq(struct request *rq) "%d: %u + %u %x", rq->tag, rq->__sector, rq->__data_len >> 9, rq->cmd_flags'
> chdir(/lib/modules/3.10.0-1059.el7.1724345.x86_64/build): No such file or directory
> Failed to compile BPF text
How can we fix the write error on discard_max_bytes? Can you provide another set of test kernels?
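One note on the numbers above (my own arithmetic, not something stated in the thread): the value the customer could not change, 2199023255040 bytes, is exactly 0xffffffff 512-byte sectors, which matches the MaxHWSectors value of 4294967295 reported elsewhere in this bug. A quick shell check:

```shell
# The rejected sysfs value expressed in 512-byte sectors: 2199023255040
# bytes is exactly 0xffffffff sectors, i.e. the device advertises the
# 32-bit sector-count ceiling as its per-request discard limit.
bytes=2199023255040
sectors=$(( bytes / 512 ))
echo "$sectors"                # 4294967295
printf '0x%x\n' "$sectors"     # 0xffffffff
```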
Hi Ming, unfortunately the customer has moved to a different hardware profile and can no longer provide testing. The details of the hardware are provided below: NVMe (2x 1.6 TB HHHL AIC HGST SN260 NVMe Extreme Perf. High Endurance PCIe cards) using striped LVM in RHEL 7.6.

5e:00.0 Non-Volatile memory controller: HGST, Inc. Ultrastar SN200 Series NVMe SSD (rev 02)
d8:00.0 Non-Volatile memory controller: HGST, Inc. Ultrastar SN200 Series NVMe SSD (rev 02)

/dev/nvme0n1   SDM00000ED1E   UCSC-F-H16003   1   1.60 TB / 1.60 TB   512 B + 0 B   KNCCD101
/dev/nvme1n1   SDM0000230CB   UCSC-F-H16003   1   1.60 TB / 1.60 TB   512 B + 0 B   KNCCD101

Ming, is there any way we can fix this issue without that particular hardware?

(In reply to John Pittman from comment #18)
> Ming, is there any way we can fix this issue without that particular hardware?

From the IO trace, the IO timeout is triggered on big or numerous discard requests, which should be an issue with this type of drive, or even this specific disk. We do run these kinds of tests on NVMe and have not seen such a report yet, so I believe we need to run some tests on the drive to figure out the root cause. Yeah, it is always helpful to loop in our partner to take a look at it.

Hi Ming, we are still trying to get our hands on the proper hardware. Rupesh happened to find this thread: https://patchwork.kernel.org/patch/10028457/. Do you think this is what is happening here?

(In reply to John Pittman from comment #23)
> Do you think this is what is happening here?

From the IO trace shared, it is basically the same issue Keith is trying to address; see the following comments from Keith:

"The block limit only specifies the maximum size of a *single* discard request as seen by the end device. This single request is not a problem for timeouts, as far as I know. The timeouts occur when queueing many of them at the same time: the last one in the queue will have very high latency compared to the ones ahead of it if the device processes discards serially (many do). There is no limit specifying the maximum number of outstanding discard requests that can be dispatched at the same time; the maximum number of dispatched commands is shared with reads and writes."

Hi John,

Could you ask our customer to test the following workaround?

rmmod nvme_core
modprobe nvme_core nvme_io_timeout=90

Thanks,

Hi Ming, below are the stats for the disk as pulled from the vmcore. It's 1 TiB large and has a very high max hw sectors value. The queue depth is 1024.

crash> epython /cores/crashext/epython/storage/nvme.py -l
Node           SN             Model           Namespace   Capacity(gendisk)   Format        FW Rev
/dev/nvme0n1   SDM00000ED1E   UCSC-F-H16003   1           1 TiB               512 B + 0 B   KNCCD101
/dev/nvme1n1   SDM0000230CB   UCSC-F-H16003   1           1 TiB               512 B + 0 B   KNCCD101

crash> epython /cores/crashext/epython/storage/nvme.py -c
Name    Ctrl Addr          Namespaces(list_head)   AdminQ             ConnectQ   Subsystem       Ctrl Device
nvme0   ffff9317e6b5c1b0   ffff9317e6b5c1f8        ffff9317e6bb8998   0          <unavailable>   ffff9317e6b5c228
    Quirks: NVME_QUIRK_DELAY_BEFORE_CHK_RDY
    NumQueues: 17
    CtrlState: NVME_CTRL_CONNECTING
    MaxHWSectors: 4294967295  <====
    MaxSegments: <unavailable>
    PageSize: 4096
nvme1   ffff9317e6b581b0   ffff9317e6b581f8        ffff9317da3a8000   0          <unavailable>   ffff9317e6b58228
    Quirks: NVME_QUIRK_DELAY_BEFORE_CHK_RDY
    NumQueues: 17
    CtrlState: NVME_CTRL_CONNECTING
    MaxHWSectors: 4294967295  <====
    MaxSegments: <unavailable>
    PageSize: 4096

crash> epython /cores/crashext/epython/storage/nvme.py -q nvme0 -i 1
pci:Ctrl[qid]   Queue Addr         DMA Dev            NVMe Dev           SQ Cmds            Completion         Tags
nvme0[1]        ffff9317e0de0098   ffff931aef988098   ffff9317e6b5c000   ffff9317e72a0000   ffff9317e5f88000   ffff925bd9db7800
    QDepth: 1024
    CQVector: -1
    SQTail: 0
    LastSQTail: <unavailable>
    CQHead: 0
    LastCQHead: <unavailable>
    QiD: 1
    CQPhase: 1
    Flags: <unavailable>
    QCount: 17

Hi Ming. It looks like we have a second customer tripping over discards building up in the queues. They are hitting timeouts when issuing "many concurrent blkdiscard operations along with some dd write commands on specific drives".

[29335.840122] nvme 0000:db:00.0: irq 238 for MSI/MSI-X
[29335.840145] nvme 0000:db:00.0: irq 239 for MSI/MSI-X
[29335.840167] nvme 0000:db:00.0: irq 240 for MSI/MSI-X
....
[235416.159141] nvme1n1: p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 p13 p14 p15 p16
[235441.266349] nvme2n1: p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 p13 p14 p15 p16
[235448.816967] nvme3n1: p1 p2 p3 p4 p5 p6 p7 p8 p9 p10 p11 p12 p13 p14 p15 p16
....
[259511.358607] nvme nvme2: I/O 83 QID 25 timeout, aborting
[259511.429656] nvme nvme2: I/O 84 QID 25 timeout, aborting
[259511.500677] nvme nvme2: I/O 122 QID 25 timeout, aborting
[259511.571018] nvme nvme2: I/O 978 QID 25 timeout, aborting
[259519.767856] nvme nvme2: Abort status: 0x0
[259519.817955] nvme nvme2: Abort status: 0x0
[259519.867994] nvme nvme2: Abort status: 0x0
[259519.918008] nvme nvme2: Abort status: 0x0
[260498.023600] nvme nvme2: I/O 592 QID 36 timeout, aborting
[260498.044565] nvme nvme1: I/O 30 QID 29 timeout, aborting
[260498.044576] nvme nvme1: I/O 33 QID 29 timeout, aborting
[260498.044579] nvme nvme1: I/O 34 QID 29 timeout, aborting
[260498.044582] nvme nvme1: I/O 35 QID 29 timeout, aborting
[260498.346809] nvme nvme2: I/O 594 QID 36 timeout, aborting
[260498.411955] nvme nvme2: I/O 597 QID 36 timeout, aborting
[260498.477086] nvme nvme2: I/O 599 QID 36 timeout, aborting
[262142.182133] nvme nvme1: I/O 69 QID 18 timeout, aborting
[262144.248473] nvme nvme1: Abort status: 0x0
[262144.297620] nvme nvme1: Abort status: 0x0
[262144.346745] nvme nvme1: Abort status: 0x0
[262144.395862] nvme nvme1: Abort status: 0x0
....
[262173.108846] nvme nvme2: I/O 334 QID 30 timeout, aborting
[262173.173862] nvme nvme2: I/O 335 QID 30 timeout, aborting
[262173.238739] nvme nvme2: I/O 336 QID 30 timeout, aborting
[262173.303647] nvme nvme2: I/O 337 QID 30 timeout, aborting
[262173.368660] nvme nvme1: I/O 69 QID 18 timeout, reset controller

I have suggested that they increase the io_timeout to 90, and I have also requested that they attempt the same test as the other customer, to see if they can reproduce just by doing a mkfs. Unfortunately, our original customer had to move the systems into production so they can no longer assist with testing, but I think the new customer may be able to.
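Keith's explanation above suggests a mitigation independent of bumping the timeout: issue discards in bounded chunks, one at a time, so that no single request sits behind a deep queue of serially processed discards. The loop below is only an illustrative sketch, not something from this bug: the device path and sizes are made up, and echo stands in for the real blkdiscard invocation (util-linux blkdiscard does accept --offset and --length) so the loop is safe to run anywhere.

```shell
# Illustrative only: walk a region in fixed-size chunks, issuing one bounded
# discard at a time instead of queueing one huge request. 'echo' stands in
# for the real blkdiscard call.
DEV=/dev/nvme0n1                        # hypothetical target
TOTAL=$(( 16 * 1024 * 1024 * 1024 ))    # pretend 16 GiB region
STEP=$((  4 * 1024 * 1024 * 1024 ))     # 4 GiB per discard request
off=0
while [ "$off" -lt "$TOTAL" ]; do
    len=$STEP
    # Clamp the final chunk to the end of the region.
    [ $(( off + len )) -gt "$TOTAL" ] && len=$(( TOTAL - off ))
    echo blkdiscard --offset "$off" --length "$len" "$DEV"
    off=$(( off + len ))
done
```

Pacing the loop (e.g. a short sleep between chunks) would further bound how many discards can pile up in front of concurrent writes.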
New customer's (comment 31) discard sysfs values from the sosreport (kernel 3.10.0-862.el7):

ATTR{discard_zeroes_data}=="1"
ATTR{discard_granularity}=="512"
ATTR{discard_max_bytes}=="2199023255040"
ATTRS{discard_alignment}=="512"

The customer was able to reproduce the issue with the 90 second timeout. They are trying 300 next. This issue happened when they issued blkdiscard as an ioctl on their filesystem, which is assembled on four NVMe partitions with RAID level 1+0.

Aug 29 11:19:16 user.warn kernel: [10586.454123] nvme nvme0: I/O 218 QID 25 timeout, aborting
Aug 29 11:19:16 user.warn kernel: [10586.517541] nvme nvme0: I/O 219 QID 25 timeout, aborting
Aug 29 11:19:16 user.warn kernel: [10586.580952] nvme nvme0: I/O 220 QID 25 timeout, aborting
Aug 29 11:19:16 user.warn kernel: [10586.644357] nvme nvme0: I/O 221 QID 25 timeout, aborting
Aug 29 11:20:17 user.warn kernel: [10647.307580] nvme nvme0: I/O 1 QID 0 timeout, reset controller
Aug 29 11:20:47 user.warn kernel: [10677.249119] nvme nvme0: I/O 218 QID 25 timeout, reset controller
Aug 29 11:22:17 user.err kernel: [10767.624454] nvme nvme0: Device not ready; aborting reset
Aug 29 11:22:17 user.warn kernel: [10767.688198] nvme nvme0: Abort status: 0x7
Aug 29 11:22:17 user.warn kernel: [10767.736038] nvme nvme0: Abort status: 0x7
Aug 29 11:22:17 user.warn kernel: [10767.783881] nvme nvme0: Abort status: 0x7
Aug 29 11:22:17 user.warn kernel: [10767.831736] nvme nvme0: Abort status: 0x7
Aug 29 11:22:17 user.debug kernel: [10767.832138] nvme 0000:d8:00.0: irq 406 for MSI/MSI-X
Aug 29 11:23:18 user.err kernel: [10828.221837] nvme nvme0: Device not ready; aborting reset
Aug 29 11:23:18 user.warn kernel: [10828.285244] nvme nvme0: Removing after probe failure status: -19
Aug 29 11:23:18 user.info kernel: [10828.356992] nvme0n1: detected capacity change from 3200631791616 to 0
Aug 29 11:23:20 user.err kernel: [10830.678618] blk_update_request: I/O error, dev nvme0n1, sector 1377623040
Aug 29 11:23:21 user.err kernel: [10830.764185] blk_update_request: I/O error, dev nvme0n1, sector 1386011136
Aug 29 11:23:21 user.err kernel: [10830.849618] blk_update_request: I/O error, dev nvme0n1, sector 1394399232
Aug 29 11:23:21 user.err kernel: [10830.935051] blk_update_request: I/O error, dev nvme0n1, sector 1402787328
Aug 29 11:23:21 user.err kernel: [10831.020473] blk_update_request: I/O error, dev nvme0n1, sector 1411175424
Aug 29 11:23:21 user.err kernel: [10831.105900] blk_update_request: I/O error, dev nvme0n1, sector 1419563520
Aug 29 11:23:21 user.err kernel: [10831.191311] blk_update_request: I/O error, dev nvme0n1, sector 1427951616
Aug 29 11:23:21 user.err kernel: [10831.276738] blk_update_request: I/O error, dev nvme0n1, sector 1436339712
Aug 29 11:23:21 user.err kernel: [10831.362159] blk_update_request: I/O error, dev nvme0n1, sector 1444727808
Aug 29 11:23:21 user.err kernel: [10831.447588] blk_update_request: I/O error, dev nvme0n1, sector 1453115904
Aug 29 11:23:22 user.err kernel: [10831.793476] md: super_written gets error=-5, uptodate=0
Aug 29 11:23:22 user.crit kernel: [10831.855850] md/raid10:md100: Disk failure on nvme0n1p1, disabling device.
Aug 29 11:23:22 user.crit kernel: [10831.855850] md/raid10:md100: Operation continuing on 3 devices.

I was talking to Dave Jeffery about this and he mentioned that upstream discard_max_bytes is writable. Is that something that's easily backported? Would it potentially help here?

BTW, the issue may be addressed by backporting blk-wbt to RHEL 7. However, that would be a new feature, and the change is a bit big.

Hi guys,

Please test the following kernel build and see if it fixes the task hung issue:

http://people.redhat.com/minlei/.1724345/

(In reply to Ming Lei from comment #58)
> Please test the following kernel build and see if it fixes the task hung issue:
> http://people.redhat.com/minlei/.1724345/

Can you confirm the sha256sums for me please?
I have the following:

e76e1f9f0af6dc3c621a8fe2117fa2a1228775b364d1a1f74933a6ad693677e8  kernel-3.10.0-1135.el7.1724345.x86_64.rpm
6f6ec452b79782f8c4354ea3f779ab96a4725a380c08c9b6464adc0fa24e3b58  kernel-devel-3.10.0-1135.el7.1724345.x86_64.rpm

(In reply to Andrew Sanders from comment #59)
> Can you confirm the sha256sums for me please?

Yeah, they are correct, or you may download them from the brew build link directly.

[minlei@slayer temp]$ sha256sum *
e76e1f9f0af6dc3c621a8fe2117fa2a1228775b364d1a1f74933a6ad693677e8  kernel-3.10.0-1135.el7.1724345.x86_64.rpm
6f6ec452b79782f8c4354ea3f779ab96a4725a380c08c9b6464adc0fa24e3b58  kernel-devel-3.10.0-1135.el7.1724345.x86_64.rpm

Thanks,
Ming

Thanks Ming. As Andrew has the only case left open, moving the needinfo to him for potential results.

By the time the customer got the test packages in hand, they had already used the workaround to provision their servers and get them into production. Since their affected servers are all in production, they no longer have servers to test the fix against. There's no reason to hold on me or my customer at this point for testing feedback.

Patch(es) committed on kernel-3.10.0-1145.el7

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: kernel security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.
https://access.redhat.com/errata/RHSA-2020:4060