Bug 1433831 - NVMe SSD fails to initialize on AWS i3.4xlarge instances
Summary: NVMe SSD fails to initialize on AWS i3.4xlarge instances
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 7
Classification: Red Hat
Component: kernel
Version: 7.3
Hardware: x86_64
OS: Linux
Priority: urgent
Severity: urgent
Target Milestone: rc
Target Release: ---
Assignee: Vitaly Kuznetsov
QA Contact: Zhang Yi
URL:
Whiteboard:
Duplicates: 1432737 (view as bug list)
Depends On:
Blocks: 1450037
Reported: 2017-03-20 03:45 UTC by artem
Modified: 2020-08-13 08:58 UTC (History)
14 users

Fixed In Version: kernel-3.10.0-660.el7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1450037 (view as bug list)
Environment:
Last Closed: 2017-08-02 05:55:43 UTC
Target Upstream Version:


Attachments (Terms of Use)
Interrupts with kernel-3.10.0-514.17.1.el7.nvme_avoid_mult_resets.1 (31.43 KB, text/plain)
2017-03-23 21:44 UTC, artem
no flags Details
interrupts with kernel-4.1.12-61.1.25.el7uek.x86_64 (40.08 KB, text/plain)
2017-03-23 21:53 UTC, artem
no flags Details
interrupts with 3.10.0-514.18.1.el7.nvme_avoid_mult_resets.2.x86_64 (31.43 KB, text/plain)
2017-03-30 07:52 UTC, artem
no flags Details
linux-4.10 dmesg output with failure (40.87 KB, text/plain)
2017-04-05 18:58 UTC, David Milburn
no flags Details
dmesg after applying debug patch (47.95 KB, text/plain)
2017-04-18 15:25 UTC, David Milburn
no flags Details
dmesg after applying debug patch and dump_stack() (161.25 KB, text/plain)
2017-04-18 17:49 UTC, David Milburn
no flags Details
dmesg after applying debug patch and dump_stack() and others (139.34 KB, text/plain)
2017-04-19 14:05 UTC, David Milburn
no flags Details
dmesg after applying debug patches and checking for NVMe device (113.42 KB, text/plain)
2017-04-19 18:02 UTC, David Milburn
no flags Details
Patch to make __startup_pirq() thread safe (1.66 KB, patch)
2017-04-24 09:48 UTC, Vitaly Kuznetsov
no flags Details | Diff
RHEL7 version of the patch (3.56 KB, patch)
2017-04-24 18:03 UTC, Vitaly Kuznetsov
no flags Details | Diff


Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:1842 normal SHIPPED_LIVE Important: kernel security, bug fix, and enhancement update 2017-08-01 18:22:09 UTC

Description artem 2017-03-20 03:45:16 UTC
Description of problem:
NVMe SSD fails to initialize on AWS i3.4xlarge instances

Version-Release number of selected component (if applicable):
3.10.0-514.10.2.el7.x86_64

How reproducible:
Intermittent, possibly 10%

Steps to Reproduce:
1. Reboot VM


Additional info:


[   81.268665] nvme 0000:00:1e.0: I/O 0 QID 0 timeout, disable controller
[   81.271694] ------------[ cut here ]------------
[   81.273808] WARNING: at kernel/irq/manage.c:1273 __free_irq+0xa3/0x220()
[   81.276870] Trying to free already-free IRQ 157
[   81.278920] Modules linked in: dm_queue_length sd_mod sg iscsi_tcp libiscsi_tcp libiscsi target_core_user uio target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod crc_t10dif crct10dif_generic dm_multipath cls_u32 vxlan ip6_udp_tunnel udp_tunnel sch_htb scsi_transport_iscsi dm_mod intel_powerclamp intel_rapl iosf_mbi ppdev crc32_pclmul ghash_clmulni_intel xen_netfront aesni_intel lrw gf128mul glue_helper ablk_helper cryptd pcspkr i2c_piix4 parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables xfs libcrc32c ata_generic pata_acpi xen_blkfront cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix crct10dif_pclmul crct10dif_common crc32c_intel libata serio_raw i2c_core nvme floppy fjes
[   81.318175] CPU: 1 PID: 950 Comm: kworker/1:1H Not tainted 3.10.0-514.10.2.el7.x86_64 #1
[   81.322507] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/12/2016
[   81.325881] Workqueue: kblockd blk_mq_timeout_work
[   81.328557]  ffff881def71fb20 00000000b19c337d ffff881def71fad8 ffffffff816863ef
[   81.332773]  ffff881def71fb10 ffffffff81085940 0000000000000000 ffff881e84267a00
[   81.337023]  000000000000009d ffff881dfec95484 ffff881dfec95400 ffff881def71fb78
[   81.341271] Call Trace:
[   81.342616]  [<ffffffff816863ef>] dump_stack+0x19/0x1b
[   81.345452]  [<ffffffff81085940>] warn_slowpath_common+0x70/0xb0
[   81.348684]  [<ffffffff810859dc>] warn_slowpath_fmt+0x5c/0x80
[   81.351785]  [<ffffffff811309b3>] __free_irq+0xa3/0x220
[   81.354676]  [<ffffffff81130bc9>] free_irq+0x49/0xb0
[   81.357479]  [<ffffffffa0037527>] nvme_suspend_queue+0xa7/0x100 [nvme]
[   81.360583]  [<ffffffffa003848f>] nvme_dev_disable+0x8f/0x410 [nvme]
[   81.363601]  [<ffffffff814289fc>] ? dev_warn+0x6c/0x90
[   81.366123]  [<ffffffffa0038a4d>] nvme_timeout+0x11d/0x1e0 [nvme]
[   81.369233]  [<ffffffff810ce034>] ? update_curr+0x104/0x190
[   81.371819]  [<ffffffff810caa2e>] ? account_entity_dequeue+0xae/0xd0
[   81.374700]  [<ffffffff812f86af>] blk_mq_rq_timed_out+0x2f/0x80
[   81.377384]  [<ffffffff812f874f>] blk_mq_check_expired+0x4f/0x90
[   81.380072]  [<ffffffff812fadf4>] bt_for_each+0xe4/0xf0
[   81.382567]  [<ffffffff812f8700>] ? blk_mq_rq_timed_out+0x80/0x80
[   81.385469]  [<ffffffff812fb76e>] blk_mq_queue_tag_busy_iter+0x4e/0xc0
[   81.388755]  [<ffffffff812f8700>] ? blk_mq_rq_timed_out+0x80/0x80
[   81.392516]  [<ffffffff812f6f6b>] blk_mq_timeout_work+0x8b/0x180
[   81.396210]  [<ffffffff810a810b>] process_one_work+0x17b/0x470
[   81.399808]  [<ffffffff810a8f46>] worker_thread+0x126/0x410
[   81.403279]  [<ffffffff810a8e20>] ? rescuer_thread+0x460/0x460
[   81.406906]  [<ffffffff810b06ff>] kthread+0xcf/0xe0
[   81.409941]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
[   81.414033]  [<ffffffff81696958>] ret_from_fork+0x58/0x90
[   81.417394]  [<ffffffff810b0630>] ? kthread_create_on_node+0x140/0x140
[   81.421460] ---[ end trace a5b2cd20c0a24dde ]---
[  141.525946] nvme 0000:00:1e.0: Removing after probe failure status: -4

Comment 2 artem 2017-03-20 03:53:52 UTC
No such problem with Oracle kernel 4.1.12-61.1.28.el7uek.x86_64

Comment 3 David Milburn 2017-03-21 14:11:40 UTC
Hi,

I think it may be related to this upstream fix:

commit c8f3d2ad6faa282d3446da7425e689a8b226a846
Author: Keith Busch <keith.busch@intel.com>
Date:   Wed Oct 5 16:32:45 2016 -0400

    nvme: don't schedule multiple resets

    The queue_work only fails if the work is pending, but not yet running. If
    the work is running, the work item would get requeued, triggering a
    double reset. If the first reset fails for any reason, the second
    reset triggers:

        WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING)

    Hitting that schedules controller deletion for a second time, which
    potentially takes a reference on the device that is being deleted.
    If the reset occurs at the same time as a hot removal event, this causes
    a double-free.

Comment 4 David Milburn 2017-03-21 16:52:21 UTC
Here you go, please try kernel-3.10.0-514.17.1.el7.nvme_avoid_mult_resets.1.x86_64.rpm, thanks.

http://people.redhat.com/dmilburn/.nvme_avoid_multi_reset.736373695f72685f6461/

Comment 5 artem 2017-03-21 23:11:57 UTC
Thanks. We will test and provide an update.

Comment 6 artem 2017-03-22 17:06:37 UTC
We are still seeing the same problem with the test kernel

------------
[   84.820641] nvme 0000:00:1e.0: I/O 0 QID 0 timeout, disable controller
[   84.823599] ------------[ cut here ]------------
[   84.825593] WARNING: at kernel/irq/manage.c:1273 __free_irq+0xa3/0x220()
[   84.828474] Trying to free already-free IRQ 151
[   84.830349] Modules linked in: cls_u32 vxlan ip6_udp_tunnel udp_tunnel sch_htb scsi_transport_iscsi dm_mod intel_powerclamp intel_rapl iosf_mbi crc32_pclmul ppdev ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ena(OE) i2c_piix4 pcspkr parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables xfs libcrc32c ata_generic pata_acpi xen_blkfront cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crct10dif_pclmul ata_piix crct10dif_common crc32c_intel libata serio_raw i2c_core nvme floppy fjes
[   84.854052] CPU: 12 PID: 604 Comm: kworker/12:1H Tainted: G           OE  ------------   3.10.0-514.17.1.el7.nvme_avoid_mult_resets.1.x86_64 #1
[   84.859230] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/12/2016
[   84.861779] Workqueue: kblockd blk_mq_timeout_work
[   84.863827]  ffff881e02e5fb20 00000000db0efa48 ffff881e02e5fad8 ffffffff816869f3
[   84.867157]  ffff881e02e5fb10 ffffffff81085cb0 0000000000000000 ffff881e04493300
[   84.870492]  0000000000000097 ffff881dfd91ae84 ffff881dfd91ae00 ffff881e02e5fb78
[   84.873830] Call Trace:
[   84.874870]  [<ffffffff816869f3>] dump_stack+0x19/0x1b
[   84.876951]  [<ffffffff81085cb0>] warn_slowpath_common+0x70/0xb0
[   84.879387]  [<ffffffff81085d4c>] warn_slowpath_fmt+0x5c/0x80
[   84.881693]  [<ffffffff81130e23>] __free_irq+0xa3/0x220
[   84.883791]  [<ffffffff81131039>] free_irq+0x49/0xb0
[   84.885745]  [<ffffffffa0037537>] nvme_suspend_queue+0xa7/0x100 [nvme]
[   84.888269]  [<ffffffffa003849f>] nvme_dev_disable+0x8f/0x410 [nvme]
[   84.891075]  [<ffffffff81428f7c>] ? dev_warn+0x6c/0x90
[   84.893132]  [<ffffffffa0038a3d>] nvme_timeout+0x10d/0x1d0 [nvme]
[   84.895582]  [<ffffffff810ce474>] ? update_curr+0x104/0x190
[   84.897804]  [<ffffffff810cae6e>] ? account_entity_dequeue+0xae/0xd0
[   84.900357]  [<ffffffff812f8bdf>] blk_mq_rq_timed_out+0x2f/0x80
[   84.902700]  [<ffffffff812f8c7f>] blk_mq_check_expired+0x4f/0x90
[   84.905256]  [<ffffffff812fb324>] bt_for_each+0xe4/0xf0
[   84.907315]  [<ffffffff812f8c30>] ? blk_mq_rq_timed_out+0x80/0x80
[   84.909713]  [<ffffffff812fbc9e>] blk_mq_queue_tag_busy_iter+0x4e/0xc0
[   84.912280]  [<ffffffff812f8c30>] ? blk_mq_rq_timed_out+0x80/0x80
[   84.914710]  [<ffffffff812f749b>] blk_mq_timeout_work+0x8b/0x180
[   84.917100]  [<ffffffff810a845b>] process_one_work+0x17b/0x470
[   84.919422]  [<ffffffff810a9296>] worker_thread+0x126/0x410
[   84.921632]  [<ffffffff810a9170>] ? rescuer_thread+0x460/0x460
[   84.923973]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[   84.925949]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[   84.928533]  [<ffffffff81696fd8>] ret_from_fork+0x58/0x90
[   84.930678]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[   84.933267] ---[ end trace 1521630dc73bc5a4 ]---
[  145.036978] nvme 0000:00:1e.0: Removing after probe failure status: -4

Comment 7 David Milburn 2017-03-22 17:59:33 UTC
Ok, thanks.

That is odd, because there is already a check in nvme_suspend_queue() for
"nvmeq->cq_vector == -1", and the checking and setting of cq_vector happen
under a spin_lock (nvmeq->q_lock). So it shouldn't try to suspend an already
suspended queue, but at the end of nvme_suspend_queue() it is obviously still
calling free_irq(vector, nvmeq).

Comment 8 David Milburn 2017-03-22 19:43:53 UTC
Would you please attach the output from "cat /proc/interrupts"? Is there more
than one queue assigned to IRQ 151? Thanks.

Comment 9 David Milburn 2017-03-23 13:24:58 UTC
Also, if possible, would you attach the output from "cat /proc/interrupts" after
booting the kernel that works. Thanks.

Comment 10 artem 2017-03-23 21:44:26 UTC
Created attachment 1265908 [details]
Interrupts with kernel-3.10.0-514.17.1.el7.nvme_avoid_mult_resets.1

Comment 11 artem 2017-03-23 21:53:27 UTC
Created attachment 1265909 [details]
interrupts with kernel-4.1.12-61.1.25.el7uek.x86_64

Comment 12 artem 2017-03-23 22:20:07 UTC
Attached interrupt maps for 3.10.0-514.17.1.el7.nvme_avoid_mult_resets.1 (fails) and 4.1.12-61.1.25.el7uek.x86_64 (works)

Comment 13 David Milburn 2017-03-28 21:30:55 UTC
Hi,

The nvme_timeout() code in this kernel-4.1.12-61.1.25.el7uek.x86_64 looks quite
outdated. Would you please try this .2 version of the test kernel? It brings the
functionality of nvme_timeout() very close to upstream. Thanks.

http://people.redhat.com/dmilburn/.nvme_avoid_multi_reset.736373695f72685f6461/

Comment 14 artem 2017-03-30 07:49:58 UTC
Still seeing the problem with the new test kernel:

[   84.525584] nvme 0000:00:1e.0: I/O 0 QID 0 timeout, disable controller
[   84.528452] ------------[ cut here ]------------
[   84.530410] WARNING: at kernel/irq/manage.c:1273 __free_irq+0xa3/0x220()
[   84.533122] Trying to free already-free IRQ 153
[   84.534993] Modules linked in: dm_queue_length sd_mod sg iscsi_tcp libiscsi_tcp libiscsi target_core_user uio target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod crc_t10dif crct10dif_generic dm_multipath cls_u32 vxlan ip6_udp_tunnel udp_tunnel sch_htb scsi_transport_iscsi dm_mod ppdev intel_powerclamp intel_rapl iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ena(OE) i2c_piix4 pcspkr parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables xfs libcrc32c ata_generic pata_acpi xen_blkfront cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common ata_piix drm crc32c_intel libata nvme serio_raw i2c_core floppy fjes
[   84.566928] CPU: 15 PID: 854 Comm: kworker/15:1H Tainted: G           OE  ------------   3.10.0-514.18.1.el7.nvme_avoid_mult_resets.2.x86_64 #1
[   84.572178] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/12/2016
[   84.574730] Workqueue: kblockd blk_mq_timeout_work
[   84.576802]  ffff881effea7b20 000000005d0c66a0 ffff881effea7ad8 ffffffff816869b3
[   84.580119]  ffff881effea7b10 ffffffff81085cb0 0000000000000000 ffff881e840e9d00
[   84.583363]  0000000000000099 ffff881dfeeedc84 ffff881dfeeedc00 ffff881effea7b78
[   84.586733] Call Trace:
[   84.587823]  [<ffffffff816869b3>] dump_stack+0x19/0x1b
[   84.590020]  [<ffffffff81085cb0>] warn_slowpath_common+0x70/0xb0
[   84.592517]  [<ffffffff81085d4c>] warn_slowpath_fmt+0x5c/0x80
[   84.594844]  [<ffffffff81130e23>] __free_irq+0xa3/0x220
[   84.596974]  [<ffffffff81131039>] free_irq+0x49/0xb0
[   84.599051]  [<ffffffffa00515b7>] nvme_suspend_queue+0xa7/0x100 [nvme]
[   84.601742]  [<ffffffffa005251f>] nvme_dev_disable+0x8f/0x410 [nvme]
[   84.604216]  [<ffffffff81428f7c>] ? dev_warn+0x6c/0x90
[   84.606219]  [<ffffffffa0052b53>] nvme_timeout+0x1a3/0x1c0 [nvme]
[   84.608576]  [<ffffffff810bc454>] ? __wake_up+0x44/0x50
[   84.610602]  [<ffffffff810ce474>] ? update_curr+0x104/0x190
[   84.612794]  [<ffffffff810cae6e>] ? account_entity_dequeue+0xae/0xd0
[   84.615397]  [<ffffffff812f8bdf>] blk_mq_rq_timed_out+0x2f/0x80
[   84.617851]  [<ffffffff812f8c7f>] blk_mq_check_expired+0x4f/0x90
[   84.620347]  [<ffffffff812fb324>] bt_for_each+0xe4/0xf0
[   84.622548]  [<ffffffff812f8c30>] ? blk_mq_rq_timed_out+0x80/0x80
[   84.625028]  [<ffffffff812fbc9e>] blk_mq_queue_tag_busy_iter+0x4e/0xc0
[   84.627758]  [<ffffffff812f8c30>] ? blk_mq_rq_timed_out+0x80/0x80
[   84.630330]  [<ffffffff812f749b>] blk_mq_timeout_work+0x8b/0x180
[   84.632850]  [<ffffffff810a845b>] process_one_work+0x17b/0x470
[   84.635314]  [<ffffffff810a9296>] worker_thread+0x126/0x410
[   84.637678]  [<ffffffff810a9170>] ? rescuer_thread+0x460/0x460
[   84.640124]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[   84.642133]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[   84.644762]  [<ffffffff81696f98>] ret_from_fork+0x58/0x90
[   84.647007]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[   84.649750] ---[ end trace ee4c2172bfd41a5f ]---
[  144.754109] nvme 0000:00:1e.0: Removing after probe failure status: -4

Comment 15 artem 2017-03-30 07:52:24 UTC
Created attachment 1267439 [details]
interrupts with 3.10.0-514.18.1.el7.nvme_avoid_mult_resets.2.x86_64

Comment 16 David Milburn 2017-03-30 16:38:22 UTC
Hmm, I thought this kernel would fix the problem; it includes upstream
bb8d261e08881 ("nvme: introduce a controller state machine").

This time it is IRQ 153:
[   84.533122] Trying to free already-free IRQ 153

Looks like maybe "cat /proc/interrupts" was taken after freeing 153?

 152:         17          0          2          0          0          0          0          0          0          0          0          0          0          0          0          0  xen-pirq-msi-x     nvme0q6
 154:          1          0          0        469          0          0          0          0          0          0          0          0          0          0          0          0  xen-pirq-msi-x     nvme0q7

I will go back through some code paths, but I may need to put some debug in the
next test kernel. Thanks.

Comment 17 David Milburn 2017-03-30 17:53:42 UTC
It does look like 7.3 has c21377f8 ("nvme: Suspend all queues before deletion"),
but we don't have commit 7065906096273 ("nvme: Delete created IO queues on reset"),
which includes a fix for c21377f8.

I will add this one to the test kernel. Thanks.

Comment 18 David Milburn 2017-03-30 19:49:16 UTC
Would you please try the .3 version which includes the patch from Comment#17? Thanks.

http://people.redhat.com/dmilburn/.nvme_avoid_multi_reset.736373695f72685f6461/

Comment 19 artem 2017-03-31 17:08:08 UTC
David,
The issue is still happening.
Would it help if we gave you access?

[   81.896508] nvme 0000:00:1e.0: I/O 0 QID 0 timeout, disable controller
[   81.901542] ------------[ cut here ]------------
[   81.905072] WARNING: at kernel/irq/manage.c:1273 __free_irq+0xa3/0x220()
[   81.910244] Trying to free already-free IRQ 154
[   81.913787] Modules linked in: cls_u32 vxlan ip6_udp_tunnel udp_tunnel sch_htb scsi_transport_iscsi dm_mod intel_powerclamp intel_rapl iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd i2c_piix4 pcspkr parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables xfs libcrc32c ata_generic pata_acpi xen_blkfront cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix crct10dif_pclmul crct10dif_common drm crc32c_intel libata ena(OE) serio_raw nvme i2c_core floppy fjes
[   81.953043] CPU: 4 PID: 802 Comm: kworker/4:1H Tainted: G           OE  ------------   3.10.0-514.18.1.el7.nvme_avoid_mult_resets.3.x86_64 #1
[   81.961261] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/12/2016
[   81.965367] Workqueue: kblockd blk_mq_timeout_work
[   81.968627]  ffff881df6317b20 000000004e5ffb23 ffff881df6317ad8 ffffffff816869b3
[   81.973921]  ffff881df6317b10 ffffffff81085cb0 0000000000000000 ffff881eff970380
[   81.979139]  000000000000009a ffff881dfeefa184 ffff881dfeefa100 ffff881df6317b78
[   81.984383] Call Trace:
[   81.985947]  [<ffffffff816869b3>] dump_stack+0x19/0x1b
[   81.988871]  [<ffffffff81085cb0>] warn_slowpath_common+0x70/0xb0
[   81.993433]  [<ffffffff81085d4c>] warn_slowpath_fmt+0x5c/0x80
[   81.997977]  [<ffffffff81130e23>] __free_irq+0xa3/0x220
[   82.002255]  [<ffffffff81131039>] free_irq+0x49/0xb0
[   82.006229]  [<ffffffffa00495b7>] nvme_suspend_queue+0xa7/0x100 [nvme]
[   82.011480]  [<ffffffffa004a51f>] nvme_dev_disable+0x8f/0x410 [nvme]
[   82.016359]  [<ffffffff81428f7c>] ? dev_warn+0x6c/0x90
[   82.020247]  [<ffffffffa004ab53>] nvme_timeout+0x1a3/0x1c0 [nvme]
[   82.025035]  [<ffffffff810ce474>] ? update_curr+0x104/0x190
[   82.029533]  [<ffffffff810cae6e>] ? account_entity_dequeue+0xae/0xd0
[   82.034657]  [<ffffffff812f8bdf>] blk_mq_rq_timed_out+0x2f/0x80
[   82.039334]  [<ffffffff812f8c7f>] blk_mq_check_expired+0x4f/0x90
[   82.043981]  [<ffffffff812fb324>] bt_for_each+0xe4/0xf0
[   82.048109]  [<ffffffff812f8c30>] ? blk_mq_rq_timed_out+0x80/0x80
[   82.052810]  [<ffffffff812fbc9e>] blk_mq_queue_tag_busy_iter+0x4e/0xc0
[   82.057926]  [<ffffffff812f8c30>] ? blk_mq_rq_timed_out+0x80/0x80
[   82.062666]  [<ffffffff812f749b>] blk_mq_timeout_work+0x8b/0x180
[   82.067409]  [<ffffffff810a845b>] process_one_work+0x17b/0x470
[   82.071905]  [<ffffffff810a9296>] worker_thread+0x126/0x410
[   82.076384]  [<ffffffff810a9170>] ? rescuer_thread+0x460/0x460
[   82.081069]  [<ffffffff810b0a4f>] kthread+0xcf/0xe0
[   82.085050]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[   82.090358]  [<ffffffff81696f98>] ret_from_fork+0x58/0x90
[   82.094771]  [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140
[   82.099973] ---[ end trace aad9b72a0c2bbb96 ]---
[  142.205711] nvme 0000:00:1e.0: Removing after probe failure status: -4

Comment 20 David Milburn 2017-03-31 18:34:07 UTC
(In reply to artem from comment #19)
> David,
> The issue is still happening.
> Would it help if we give you access

Yeah, it would be great if I could get into an environment where I can build
and boot some debug kernels. Thanks.

Comment 21 David Milburn 2017-04-05 18:57:23 UTC
Hi Keith,

This is reproducible on upstream linux-4.10, but I don't think it is an NVMe
driver problem. I put in some debugging; I don't see where the NVMe driver does
a double free_irq, but I do see this each time:

[   19.702913] xen:events: Failed to obtain physical IRQ 259

(attaching dmesg output)

Do you have any thoughts?

Comment 22 David Milburn 2017-04-05 18:58:00 UTC
Created attachment 1269062 [details]
linux-4.10 dmesg output with failure

Comment 23 Keith Busch 2017-04-05 19:15:33 UTC
I'm not familiar with this part, but it looks suspicious to me that the __startup_pirq xen irq startup callback returns success even when it fails.

Comment 24 David Milburn 2017-04-05 19:21:41 UTC
Hi Vitaly,

We are seeing a NVMe timeout followed by the driver disabling the controller
leading to a free_irq error. Would you please search the dmesg output (Comment#22) for "xen:events: Failed to obtain physical IRQ"?

Can you give us an idea why this would fail? Thanks.

Comment 25 Vitaly Kuznetsov 2017-04-18 14:46:22 UTC
Sorry for the delayed reply guys,

Xen uses a so-called 'event channel' mechanism to deliver interrupts (both physical and virtual), and here we can see EVTCHNOP_bind_pirq failing, which means we can't bind some physical IRQ to an event channel.

AWS has its own Xen version with non-public patches, but looking at upstream Xen I can see several possible reasons for such behavior:
1) We reached the limit of physical IRQs per domain. This limit is set dynamically, so it is hard to say why we would be above it.
2) The physical IRQ we're asking for is already bound to some other event channel.
3) We're not allowed to bind to this physical IRQ.
4) We're out of free event channels (very unlikely, especially for upstream, as FIFO-based event channels are used and there are plenty of them).

Please try the following debug patch (upstream):

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 6a53577..b65792e 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -528,6 +528,7 @@ static unsigned int __startup_pirq(unsigned int irq)
        bind_pirq.flags = info->u.pirq.flags & PIRQ_SHAREABLE ?
                                        BIND_PIRQ__WILL_SHARE : 0;
        rc = HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, &bind_pirq);
+       pr_info("EVTCHNOP_bind_pirq: pirq: %d, flags: 0x%x, rc=%d (port: %d)\n", bind_pirq.pirq, bind_pirq.flags, rc, bind_pirq.port);
        if (rc != 0) {
                pr_warn("Failed to obtain physical IRQ %d\n", irq);
                return 0;

Let's see what's going on.

It would also make sense to start a discussion on the xen-devel mailing list; people there may have some valuable suggestions.

Comment 26 David Milburn 2017-04-18 15:23:46 UTC
(In reply to Vitaly Kuznetsov from comment #25)
> Sorry for the delayed reply guys,
> 
> Xen uses so-called 'event channel' event delivery mechanism for interrupts
> (both physical and virtual) and here we can see EVTCHNOP_bind_pirq failing
> which means we can't bind some physical IRQ to an event channel.
> 
> AWS has its own Xen version with non-public patches but looking at the
> upstream Xen I can see several possible reasons for such behavior:
> 1) We reached the limit of physical IRQs per domain. This limit is set
> dynamically, hard to say why we're above the limit.
> 2) The physical IRQ we're asking for is already bound to some other event
> channel.
> 3) We're not allowed to bind to this physical IRQ.
> 4) We're out of free event channels (very unlikely, especially for upstream,
> as FIFO-based event channels are used and there are plenty of them).
> 

Thanks Vitaly, I will attach the full dmesg

[   20.195765] xen:events: EVTCHNOP_bind_pirq: pirq: 583, flags: 0x0, rc=-17 (port: -1)
[   20.195766] xen:events: Failed to obtain physical IRQ 260

Comment 27 David Milburn 2017-04-18 15:25:51 UTC
Created attachment 1272366 [details]
dmesg after applying debug patch

Comment 28 Vitaly Kuznetsov 2017-04-18 16:04:29 UTC
-17 is -EEXIST, so we're probably trying to bind PIRQ 583 twice:

$ grep 'pirq: 583' /tmp/dmesg.txt 
[   20.171888] xen:events: EVTCHNOP_bind_pirq: pirq: 583, flags: 0x0, rc=0 (port: 253)
[   20.188750] xen:events: EVTCHNOP_bind_pirq: pirq: 583, flags: 0x0, rc=0 (port: 251)
[   20.195765] xen:events: EVTCHNOP_bind_pirq: pirq: 583, flags: 0x0, rc=-17 (port: -1)

I would expect that the pirq was unbound between the first and the second attempts but it wasn't between the second and the third. To continue investigating I would add dump_stack() to __startup_pirq() (and, probably, to shutdown_pirq()) to see why we're trying to acquire the same pirq twice.

I'm a little bit puzzled as in the beginning of __startup_pirq() we actually check if we already have an event channel for the irq:

int evtchn = evtchn_from_irq(irq);
...
if (VALID_EVTCHN(evtchn))
    goto out;

but this check apparently doesn't catch it. I can only imagine a race where we call __startup_pirq() on multiple CPUs simultaneously...

Comment 29 David Milburn 2017-04-18 17:49:11 UTC
Created attachment 1272391 [details]
dmesg after applying debug patch and dump_stack()

Ok, here is dmesg after inserting dump_stack() at beginning of __startup_pirq()
and shutdown_pirq().

Comment 30 Vitaly Kuznetsov 2017-04-19 11:20:52 UTC
Sorry, I should've asked to print some debug info from shutdown_pirq() too, e.g.:

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 6a53577..58e744d 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -237,6 +237,9 @@ static int xen_irq_info_pirq_setup(unsigned irq,
 
 static void xen_irq_info_cleanup(struct irq_info *info)
 {
+       pr_info("xen_irq_info_cleanup: %d\n", info->evtchn);
+       dump_stack();
+
        set_evtchn_to_irq(info->evtchn, -1);
        info->evtchn = 0;
 }
@@ -528,6 +531,8 @@ static unsigned int __startup_pirq(unsigned int irq)
        bind_pirq.flags = info->u.pirq.flags & PIRQ_SHAREABLE ?
                                        BIND_PIRQ__WILL_SHARE : 0;
        rc = HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, &bind_pirq);
+       pr_info("EVTCHNOP_bind_pirq: IRQ: %d, pirq: %d, flags: 0x%x, rc=%d (port: %d)\n", irq, bind_pirq.pirq, bind_pirq.flags, rc, bind_pirq.port);
+       dump_stack();
        if (rc != 0) {
                pr_warn("Failed to obtain physical IRQ %d\n", irq);
                return 0;
@@ -572,6 +577,9 @@ static void shutdown_pirq(struct irq_data *data)
 
        BUG_ON(info->type != IRQT_PIRQ);
 
+       pr_info("shutdown_pirq: irq %d evtchn %d\n", irq, evtchn);
+       dump_stack();
+
        if (!VALID_EVTCHN(evtchn))
                return;

Here I'm adding some debug info to xen_irq_info_cleanup() too to see if the irq<->evtchn mapping is being cleaned from some other place, we have a few places for that.

Anyway, looking at dmesg there is no shutdown_pirq() between the second and the third 'EVTCHNOP_bind_pirq: pirq: 580', which means that the driver is trying to bind the same irq twice. For some reason the irq<->evtchn mapping is cleaned up in Linux but not in Xen, so we end up with -EEXIST.

My wild guess: this is also broken on non-Xen installs, but a Xen specific (not returning an error code when 'Failed to obtain physical IRQ' happens) makes the error fatal here. I would expect that on non-Xen setups the nvme driver sometimes just starts with one less queue.

I would suggest we also try changing "return 0" to "return rc" in __startup_pirq() to see what happens.

David, I feel like I'm torturing you with these debug patches and numerous guesses here :-( I can try reproducing the issue myself; is it easy enough? Do I just reboot an i3.4xlarge instance 10 times and the issue will show up?

Comment 31 David Milburn 2017-04-19 14:04:56 UTC
Hi Vitaly,

(In reply to Vitaly Kuznetsov from comment #30)
> Sorry, I should've asked to print some debug info from shutdown_pirq() too,
> e.g.:
> 
> diff --git a/drivers/xen/events/events_base.c
> b/drivers/xen/events/events_base.c
> index 6a53577..58e744d 100644
> --- a/drivers/xen/events/events_base.c
> +++ b/drivers/xen/events/events_base.c
> @@ -237,6 +237,9 @@ static int xen_irq_info_pirq_setup(unsigned irq,
>  
>  static void xen_irq_info_cleanup(struct irq_info *info)
>  {
> +       pr_info("xen_irq_info_cleanup: %d\n", info->evtchn);
> +       dump_stack();
> +
>         set_evtchn_to_irq(info->evtchn, -1);
>         info->evtchn = 0;
>  }
> @@ -528,6 +531,8 @@ static unsigned int __startup_pirq(unsigned int irq)
>         bind_pirq.flags = info->u.pirq.flags & PIRQ_SHAREABLE ?
>                                         BIND_PIRQ__WILL_SHARE : 0;
>         rc = HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, &bind_pirq);
> +       pr_info("EVTCHNOP_bind_pirq: IRQ: %d, pirq: %d, flags: 0x%x, rc=%d
> (port: %d)\n", irq, bind_pirq.pirq, bind_pirq.flags, rc, bind_pirq.port);
> +       dump_stack();
>         if (rc != 0) {
>                 pr_warn("Failed to obtain physical IRQ %d\n", irq);
>                 return 0;
> @@ -572,6 +577,9 @@ static void shutdown_pirq(struct irq_data *data)
>  
>         BUG_ON(info->type != IRQT_PIRQ);
>  
> +       pr_info("shutdown_pirq: irq %d evtchn %d\n", irq, evtchn);
> +       dump_stack();
> +
>         if (!VALID_EVTCHN(evtchn))
>                 return;
> 
> Here I'm adding some debug info to xen_irq_info_cleanup() too to see if the
> irq<->evtchn mapping is being cleaned from some other place, we have a few
> places for that.
> 
> Anyway, looking at dmesg there is no shutdown_pirq() between the second and
> the third 'EVTCHNOP_bind_pirq: pirq: 580', which means that the driver is
> trying to bind the same irq twice. For some reason the irq<->evtchn mapping
> is cleaned up in Linux but not in Xen, so we end up with -EEXIST.
> 
> My wild guess: this is also broken on non-Xen installs, but a Xen specific
> (not returning an error code when 'Failed to obtain physical IRQ' happens)
> makes the error fatal here. I would expect that on non-Xen setups the nvme
> driver sometimes just starts with one less queue.

Ok, I will attach another dmesg output.

> 
> I would suggest we also try changing "return 0" to "return rc" in
> __startup_pirq() to see what happens.
> 
> David, I feel like I'm torturing you with these debug patches and numerous
> guesses here :-( I can try reproducing the issue myself, is it easy enough?
> Do I just reboot i3.4xlarge instance 10 times and the issue will show up?

I think you will need some large NVMe drives; usually this happens on the
first reboot. (I did get into a state where I couldn't reproduce, but after
building a fresh kernel from the linux-4.10 sources I am back to reproducing
on each reboot.)

The first thing I check is lsblk (there should be 4 x 1.7T disks); then I
check the dmesg output.

$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda    202:0    0   64G  0 disk 
├─xvda2 202:2    0   64G  0 part /
└─xvda1 202:1    0    1M  0 part 
nvme1n1 259:0    0  1.7T  0 disk 
nvme3n1 259:1    0  1.7T  0 disk 

Artem, do you think you can give Vitaly access to the same instance? Thanks.

Comment 32 David Milburn 2017-04-19 14:05:53 UTC
Created attachment 1272630 [details]
dmesg after applying debug patch and dump_stack() and others

Comment 33 Vitaly Kuznetsov 2017-04-19 14:27:53 UTC
From the last dmesg I conclude we have a race:

[   20.422183] xen:events: EVTCHNOP_bind_pirq: pirq: 582, flags: 0x0, rc=0 (port: 281)
[   20.422185] CPU: 21 PID: 565 Comm: kworker/u256:2 Not tainted 4.10.0 #3
[   20.422185] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/12/2016
[   20.422187] Workqueue: nvme nvme_reset_work [nvme]
[   20.422188] Call Trace:
[   20.422191]  dump_stack+0x63/0x87
[   20.422193]  __startup_pirq+0xdb/0x170
[   20.422194]  startup_pirq+0x11/0x20
[   20.422196]  irq_startup+0x47/0x80
[   20.422197]  ? alloc_cpumask_var_node+0x24/0x30
[   20.422198]  __setup_irq+0x550/0x640
[   20.422200]  request_threaded_irq+0x10d/0x1f0
[   20.422202]  queue_request_irq+0x4e/0x80 [nvme]
[   20.422203]  nvme_setup_io_queues+0x1c0/0x410 [nvme]
[   20.422205]  ? blk_queue_write_cache+0x4e/0x80
[   20.422207]  ? nvme_set_queue_limits+0x4c/0xa0 [nvme_core]
[   20.422209]  ? nvme_init_identify+0x1fa/0x2f0 [nvme_core]
[   20.422211]  nvme_reset_work+0x5c0/0x900 [nvme]
[   20.422212]  process_one_work+0x165/0x410
[   20.422213]  worker_thread+0x137/0x4c0
[   20.422215]  kthread+0x101/0x140
[   20.422216]  ? rescuer_thread+0x3b0/0x3b0
[   20.422218]  ? kthread_park+0x90/0x90
[   20.422220]  ret_from_fork+0x2c/0x40
[   20.426199] xen:events: EVTCHNOP_bind_pirq: pirq: 582, flags: 0x0, rc=-17 (port: -1)
[   20.426201] CPU: 0 PID: 5 Comm: kworker/u256:0 Not tainted 4.10.0 #3
[   20.426202] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/12/2016
[   20.426204] Workqueue: nvme nvme_reset_work [nvme]
[   20.426205] Call Trace:
[   20.426208]  dump_stack+0x63/0x87
[   20.426210]  __startup_pirq+0xdb/0x170
[   20.426212]  startup_pirq+0x11/0x20
[   20.426213]  irq_startup+0x47/0x80
[   20.426214]  ? alloc_cpumask_var_node+0x24/0x30
[   20.426215]  __setup_irq+0x550/0x640
[   20.426217]  request_threaded_irq+0x10d/0x1f0
[   20.426218]  queue_request_irq+0x4e/0x80 [nvme]
[   20.426220]  nvme_setup_io_queues+0x1c0/0x410 [nvme]
[   20.426222]  ? nvme_init_identify+0x1fa/0x2f0 [nvme_core]
[   20.426223]  ? kfree+0x122/0x170
[   20.426225]  ? nvme_init_identify+0x1fa/0x2f0 [nvme_core]
[   20.426226]  nvme_reset_work+0x5c0/0x900 [nvme]
[   20.426228]  process_one_work+0x165/0x410
[   20.426229]  worker_thread+0x137/0x4c0
[   20.426230]  kthread+0x101/0x140
[   20.426231]  ? rescuer_thread+0x3b0/0x3b0
[   20.426233]  ? kthread_park+0x90/0x90
[   20.426235]  ret_from_fork+0x2c/0x40
[   20.426236] xen:events: Failed to obtain physical IRQ 259
[   20.429439] xen:events: EVTCHNOP_bind_pirq: pirq: 581, flags: 0x0, rc=0 (port: 282)

It seems we have nvme_setup_io_queues() running on CPU0 and CPU21 simultaneously, and they're both trying to set up the same IRQ. Any idea why this is happening in nvme?

Comment 34 Keith Busch 2017-04-19 14:37:20 UTC
Are the two nvme_reset_work threads executing on two different NVMe drives, or on the same one? The former should be okay as far as I know, but the latter would definitely be a bug. I know there are three nvme drives in this system, so I would expect parallel initialization, but I can't tell from the stack trace whether these are definitely from two different device initializations.

Comment 35 David Milburn 2017-04-19 17:41:07 UTC
Hi Keith,

(In reply to Keith Busch from comment #34)
> Are the two nvme_reset_work threads executing on two different NVMe drives,
> or on the same one? The former should be okay as far as I know, but the
> later would definitely be a bug. I know there are three nvme drives in this
> system, so I would expect parallel initialization, but I can't tell from the
> stack trace if these are definitely from two different device
> initializations.

I added this in nvme_setup_io_queues() right before queue_request_irq(), but
it didn't fail that time; I will try somewhere else.

dev_info(dev->ctrl.device, "NVME_SETUP_IO_QUEUES: pci function %s\n", dev_name(&pdev->dev));

(4 drives)

[   19.998098] nvme nvme3: NVME_SETUP_IO_QUEUES: pci function 0000:00:1e.0
[   20.001832] nvme nvme1: NVME_SETUP_IO_QUEUES: pci function 0000:00:1c.0
[   20.103319] nvme nvme0: NVME_SETUP_IO_QUEUES: pci function 0000:00:1b.0
[   20.106907] nvme nvme2: NVME_SETUP_IO_QUEUES: pci function 0000:00:1d.0

Comment 36 David Milburn 2017-04-19 18:02:46 UTC
Created attachment 1272695 [details]
dmesg after applying debug patches and checking for NVMe device

Looks like different devices. Please note I moved this to nvme_create_io_queues(),
which is called from nvme_setup_io_queues(). Sorry, I forgot to change the message
to NVME_CREATE_IO_QUEUES, but that is where it really came from. Thanks.

dev_info(dev->ctrl.device, "NVME_SETUP_IO_QUEUES: pci function %s\n", dev_name(&pdev->dev));

Comment 37 Keith Busch 2017-04-20 17:20:46 UTC
This is something the nvme driver should be allowed to do. It sounds like whatever is assigning these isn't thread safe, though it seems odd that pirq 582 is always the only duplicate.

Comment 38 Vitaly Kuznetsov 2017-04-21 10:33:09 UTC
Yes, it seems that some of this Xen code is not thread safe (I'll take a look), but my question would be: why is the nvme driver calling request_threaded_irq() for the same irq for different devices?

BTW, it's not always 582; I saw 580 in one of the dmesgs from David.

Comment 39 Vitaly Kuznetsov 2017-04-21 10:43:48 UTC
Oh, I missed the fact that request_threaded_irq() is called with the IRQF_SHARED flag, so we probably have a shared irq, which is OK.

Comment 40 Vitaly Kuznetsov 2017-04-24 09:37:12 UTC
I tried reproducing the issue on Friday with an i3.8xlarge instance, and with the RHEL kernel it always reproduces: I was never able to boot with all 4 NVMes; 1-3 of them always fail. After that I tried to reproduce the upstream issue ("xen:events: Failed to obtain physical IRQ ...") with 4.11-rc7 but failed. I wrote a patch which is supposed to fix the discovered race, but as I wasn't able to reproduce the issue I'm not sure whether it works. I'll attach it here; if anyone is still able to reproduce the upstream issue, please let me know if the patch helps. I'll keep investigating.

Comment 41 Vitaly Kuznetsov 2017-04-24 09:48:34 UTC
Created attachment 1273561 [details]
Patch to make __startup_pirq() thread safe

Comment 42 David Milburn 2017-04-24 12:37:41 UTC
Hi Vitaly, I am still seeing the failure on linux-4.10 with the patch from comment #41.

[ec2-user@ip-172-31-7-115 events]$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda    202:0    0   64G  0 disk 
├─xvda2 202:2    0   64G  0 part /
└─xvda1 202:1    0    1M  0 part 
nvme0n1 259:1    0  1.7T  0 disk 
nvme3n1 259:0    0  1.7T  0 disk 


[    0.000000] Linux version 4.10.0 (ec2-user@ip-172-31-7-115.us-west-2.compute.internal) (gcc version 4.8.5 20150623 (Red Hat 4.8.5-11) (GCC) ) #4 SMP Mon Apr 24 08:24:12 EDT 2017

<snip>

[   21.026659] xen:events: EVTCHNOP_bind_pirq: pirq: 583, flags: 0x0, rc=-17 (port: -1)
[   21.026660] CPU: 19 PID: 556 Comm: kworker/u256:2 Not tainted 4.10.0 #4
[   21.026661] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/12/2016
[   21.026663] Workqueue: nvme nvme_reset_work [nvme]
[   21.026664] Call Trace:
[   21.026666]  dump_stack+0x63/0x87
[   21.026669]  __startup_pirq+0xf3/0x180
[   21.026670]  startup_pirq+0x11/0x20
[   21.026672]  irq_startup+0x47/0x80
[   21.026673]  ? alloc_cpumask_var_node+0x24/0x30
[   21.026674]  __setup_irq+0x550/0x640
[   21.026676]  request_threaded_irq+0x10d/0x1f0
[   21.026678]  queue_request_irq+0x4e/0x80 [nvme]
[   21.026679]  nvme_setup_io_queues+0x1c0/0x440 [nvme]
[   21.026681]  ? blk_queue_write_cache+0x4e/0x80
[   21.026683]  ? nvme_set_queue_limits+0x4c/0xa0 [nvme_core]
[   21.026685]  ? nvme_init_identify+0x1fa/0x2f0 [nvme_core]
[   21.026687]  nvme_reset_work+0x5c0/0x900 [nvme]
[   21.026688]  process_one_work+0x165/0x410
[   21.026689]  worker_thread+0x137/0x4c0
[   21.026691]  kthread+0x101/0x140
[   21.026692]  ? rescuer_thread+0x3b0/0x3b0
[   21.026693]  ? kthread_park+0x90/0x90
[   21.026695]  ret_from_fork+0x2c/0x40
[   21.026696] xen:events: Failed to obtain physical IRQ 258
[   21.026704] nvme nvme2: NVME_SETUP_IO_QUEUES: pci function 0000:00:1d.0
[   21.026719] xen:events: EVTCHNOP_bind_pirq: pirq: 536, flags: 0x0, rc=0 (port: 266)
[   21.026720] CPU: 25 PID: 557 Comm: kworker/u256:3 Not tainted 4.10.0 #4
[   21.026720] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/12/2016
[   21.026721] Workqueue: nvme nvme_reset_work [nvme]
[   21.026722] Call Trace:
[   21.026723]  dump_stack+0x63/0x87
[   21.026724]  __startup_pirq+0xf3/0x180
[   21.026726]  startup_pirq+0x11/0x20
[   21.026727]  irq_startup+0x47/0x80
[   21.026728]  ? alloc_cpumask_var_node+0x24/0x30
[   21.026729]  __setup_irq+0x550/0x640
[   21.026730]  request_threaded_irq+0x10d/0x1f0
[   21.026731]  queue_request_irq+0x4e/0x80 [nvme]
[   21.026733]  nvme_setup_io_queues+0x318/0x440 [nvme]
[   21.026734]  nvme_reset_work+0x5c0/0x900 [nvme]
[   21.026735]  process_one_work+0x165/0x410
[   21.026736]  worker_thread+0x137/0x4c0
[   21.026737]  kthread+0x101/0x140
[   21.026738]  ? rescuer_thread+0x3b0/0x3b0
[   21.026739]  ? kthread_park+0x90/0x90
[   21.026741]  ret_from_fork+0x2c/0x40

Comment 43 David Milburn 2017-04-24 13:50:29 UTC
I booted 4.11.0-rc8 (without the patch from comment #41) and it didn't fail; I will
reboot several times to see if I can recreate the issue on the latest upstream.

[ec2-user@ip-172-31-7-115 ~]$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
xvda    202:0    0   64G  0 disk 
├─xvda2 202:2    0   64G  0 part /
└─xvda1 202:1    0    1M  0 part 
nvme2n1 259:3    0  1.7T  0 disk 
nvme1n1 259:0    0  1.7T  0 disk 
nvme0n1 259:1    0  1.7T  0 disk 
nvme3n1 259:2    0  1.7T  0 disk

Comment 44 Vitaly Kuznetsov 2017-04-24 14:10:40 UTC
Thanks, I see; the race is not what's causing us problems.

I tried linux-4.10 on i3.8xlarge today and voilà, the issue reproduced right away:
# lsblk | grep nvm
nvme2n1 259:2    0  1.7T  0 disk 
nvme0n1 259:0    0  1.7T  0 disk 
nvme3n1 259:1    0  1.7T  0 disk 
(nvme1 is missing). However, I'm not able to reproduce the issue with 4.11-rc7; all four devices are always present. I went through the Xen-related changes in 4.11 and couldn't find anything relevant; the event channel code is the same. The NVMe driver has some changes, but I couldn't spot anything obvious either (though it's not code I'm familiar with). Could you please try 4.11-rc7 (or even rc8) to see if the issue still reproduces? If it doesn't, we can take a look at the changes again and/or bisect.

Comment 45 Vitaly Kuznetsov 2017-04-24 14:22:37 UTC
One other thing which bothers me a bit: we never saw the "xen:events: Failed to obtain physical ..." message on RHEL (or at least I don't see it in my testing), but the pr_info() is there and the code is more or less the same. There is a possibility that we're failing for different reasons in RHEL and upstream...

Comment 46 David Milburn 2017-04-24 14:30:06 UTC
Hi,

(In reply to Vitaly Kuznetsov from comment #44)
> Thanks, I see, the race is not what causing us problems.
> 
> I tried linux-4.10 on i3.8xlarge today and voilà, the issue reproduced right
> away:
> # lsblk | grep nvm
> nvme2n1 259:2    0  1.7T  0 disk 
> nvme0n1 259:0    0  1.7T  0 disk 
> nvme3n1 259:1    0  1.7T  0 disk 
> (nvme1 is missing). However, I'm not able to reproduce the issue with
> 4.11-rc7, all four devices are always present. I went through Xen related
> changes in 4.11 and couldn't find anything relevant, the event channel code
> is the same. NVMe driver has some changes but I couldn't spot anything
> obvious too (though, it's not the code I'm familiar with). Could you please
> try 4.11-rc7 (or even rc8) to see if the issue still reproduces? If it
> doesn't we can take a look at the changes again or/and bisect.

I rebooted 4.11-rc8 ten times without seeing the failure. Thanks.

Comment 47 David Milburn 2017-04-24 14:34:37 UTC
(In reply to Vitaly Kuznetsov from comment #45)
> One other thing which bothers me a bit: we never saw "xen:events: Failed to
> obtain physical ..." message on RHEL (or at least I don't see it in my testing)
> but the pr_info() is there and the code is more or less the same, there is a
> possibility that we're failing for different reasons in RHEL and upstream...

Yeah, I did notice that upstream changed that to pr_warn() and RHEL uses pr_info(). Thanks.

Comment 48 David Milburn 2017-04-24 14:42:20 UTC
(In reply to David Milburn from comment #47)
> (In reply to Vitaly Kuznetsov from comment #45)
> > One other thing which bothers me a bit: we never saw "xen:events: Failed to
> > obtain physical ..." message on RHEL (or at least I don't see it in my testing)
> > but the pr_info() is there and the code is more or less the same, there is a
> > possibility that we're failing for different reasons in RHEL and upstream...
> 
> Yeah, I did notice that upstream changed that to pr_warn() and RHEL uses
> pr_info(). Thanks.

Though, the pr_info message should still have been logged.

Comment 49 Vitaly Kuznetsov 2017-04-24 16:39:27 UTC
Blind bisect shows the culprit:

commit c74fd80f2f41d05f350bb478151021f88551afe8
Author: Dan Streetman <ddstreet@ieee.org>
Date:   Fri Jan 13 15:07:51 2017 -0500

    xen: do not re-use pirq number cached in pci device msi msg data
    
    Revert the main part of commit:
    af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
    
    That commit introduced reading the pci device's msi message data to see
    if a pirq was previously configured for the device's msi/msix, and re-use
    that pirq.  At the time, that was the correct behavior.  However, a
    later change to Qemu caused it to call into the Xen hypervisor to unmap
    all pirqs for a pci device, when the pci device disables its MSI/MSIX
    vectors; specifically the Qemu commit:
    c976437c7dba9c7444fb41df45468968aaa326ad
    ("qemu-xen: free all the pirqs for msi/msix when driver unload")
    
    Once Qemu added this pirq unmapping, it was no longer correct for the
    kernel to re-use the pirq number cached in the pci device msi message
    data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
    pirqs when the pci device disables its MSI/MSIX vectors.
    
    This bug is causing failures to initialize multiple NVMe controllers
    under Xen, because the NVMe driver sets up a single MSIX vector for
    each controller (concurrently), and then after using that to talk to
    the controller for some configuration data, it disables the single MSIX
    vector and re-configures all the MSIX vectors it needs.  So the MSIX
    setup code tries to re-use the cached pirq from the first vector
    for each controller, but the hypervisor has already given away that
    pirq to another controller, and its initialization fails.
    
    This is discussed in more detail at:
    https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html
    
    Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
    Signed-off-by: Dan Streetman <dan.streetman@canonical.com>
    Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
    Acked-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
    Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>

The commit message says something about several NVMes, so I think this is a good guess. The next question is whether it helps RHEL or not. I'll try.

Comment 52 Vitaly Kuznetsov 2017-04-24 18:03:44 UTC
Created attachment 1273684 [details]
RHEL7 version of the patch

Comment 57 Zhang Yi 2017-04-28 02:07:25 UTC
Hi Artem,
QE doesn't have the environment; could you help by reporting the test results once it's fixed? Thanks in advance.

Yi

Comment 59 artem 2017-04-29 19:24:34 UTC
Hi Yi,
We can do some testing. But I would suggest that Red Hat urgently needs to arrange easy access to AWS for kernel QA. RHEL is a major OS offered on AWS. And AWS is a leading "server vendor" today. Red Hat kernel QA not having access to AWS is a major gap. How can we recommend using RHEL on AWS to our large enterprise customers if kernel testing is not done and any troubleshooting is problematic? This issue is a perfect case for urgently fixing this gap.

-Artem

Comment 60 Rafael Aquini 2017-04-30 03:05:27 UTC
Patch(es) committed on kernel repository and an interim kernel build is undergoing testing

Comment 63 Rafael Aquini 2017-05-01 13:43:49 UTC
Patch(es) available on kernel-3.10.0-660.el7

Comment 66 Zhang Yi 2017-05-02 02:12:28 UTC
(In reply to artem from comment #59)
> Hi Yi,
> We can do some testing. But I would suggest that Red Hat urgently needs to
> arrange easy access to AWS for kernel QA. RHEL is a major OS offered on AWS.
> And AWS is a leading "server vendor" today. Red Hat kernel QA not having
> access to AWS is a major gap. How can we recommend using RHEL on AWS to our
> large enterprise customers if kernel testing is not done and any
> troubleshooting is problematic? This issue is a perfect case for urgently
> fixing this gap.
> 
> -Artem

Hi Artem
Thanks for the reminder. I have found a way to access AWS inside Red Hat and will give this bug a try ASAP.

Thanks
Yi

Comment 67 Zhang Yi 2017-05-04 08:24:34 UTC
Sometimes this issue cannot be reproduced just by rebooting the system. I tried the script below [1], which I found in [2]; the issue is easy to reproduce with it.

[1]
#!/bin/bash
num=0
while [ "$num" -lt 30 ]
do
	echo "-----$num"
	rmmod nvme
	modprobe nvme
	((num++))
	sleep 0.5
done

[2]
https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html

Comment 75 Vitaly Kuznetsov 2017-05-10 08:32:58 UTC
*** Bug 1432737 has been marked as a duplicate of this bug. ***

Comment 78 errata-xmlrpc 2017-08-02 05:55:43 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1842

