Bug 1433831
Summary: | NVMe SSD fails to initialize on AWS i3.4xlarge instances | |||
---|---|---|---|---|
Product: | Red Hat Enterprise Linux 7 | Reporter: | artem | |
Component: | kernel | Assignee: | Vitaly Kuznetsov <vkuznets> | |
kernel sub component: | Platform Enablement | QA Contact: | Zhang Yi <yizhan> | |
Status: | CLOSED ERRATA | Docs Contact: | ||
Severity: | urgent | |||
Priority: | urgent | CC: | ailan, artem, dhoward, jboutaud, jreznik, keith.busch, leiwang, linl, mtessun, nkshirsa, rmadhuso, vkuznets, wshi, yizhan | |
Version: | 7.3 | Keywords: | ZStream | |
Target Milestone: | rc | |||
Target Release: | --- | |||
Hardware: | x86_64 | |||
OS: | Linux | |||
Whiteboard: | ||||
Fixed In Version: | kernel-3.10.0-660.el7 | Doc Type: | If docs needed, set a value | |
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 1450037 (view as bug list) | Environment: | ||
Last Closed: | 2017-08-02 05:55:43 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | ||||
Bug Blocks: | 1450037 | |||
Attachments: |
Description
artem
2017-03-20 03:45:16 UTC
No such problem with Oracle kernel 4.1.12-61.1.28.el7uek.x86_64.

Hi, I think it may be related to this upstream fix:

commit c8f3d2ad6faa282d3446da7425e689a8b226a846
Author: Keith Busch <keith.busch>
Date:   Wed Oct 5 16:32:45 2016 -0400

    nvme: don't schedule multiple resets

    The queue_work only fails if the work is pending, but not yet running.
    If the work is running, the work item would get requeued, triggering a
    double reset. If the first reset fails for any reason, the second reset
    triggers:

        WARN_ON(dev->ctrl.state == NVME_CTRL_RESETTING)

    Hitting that schedules controller deletion for a second time, which
    potentially takes a reference on the device that is being deleted. If
    the reset occurs at the same time as a hot removal event, this causes
    a double-free.

Here you go, please try kernel-3.10.0-514.17.1.el7.nvme_avoid_mult_resets.1.x86_64.rpm, thanks.
http://people.redhat.com/dmilburn/.nvme_avoid_multi_reset.736373695f72685f6461/

Thanks. We will test and provide an update.

We are still seeing the same problem with the test kernel:
------------
[ 84.820641] nvme 0000:00:1e.0: I/O 0 QID 0 timeout, disable controller
[ 84.823599] ------------[ cut here ]------------
[ 84.825593] WARNING: at kernel/irq/manage.c:1273 __free_irq+0xa3/0x220()
[ 84.828474] Trying to free already-free IRQ 151
[ 84.830349] Modules linked in: cls_u32 vxlan ip6_udp_tunnel udp_tunnel sch_htb scsi_transport_iscsi dm_mod intel_powerclamp intel_rapl iosf_mbi crc32_pclmul ppdev ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ena(OE) i2c_piix4 pcspkr parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables xfs libcrc32c ata_generic pata_acpi xen_blkfront cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm crct10dif_pclmul ata_piix crct10dif_common crc32c_intel libata serio_raw i2c_core nvme floppy fjes
[ 84.854052] CPU: 12 PID: 604 Comm: kworker/12:1H Tainted: G OE ------------ 
3.10.0-514.17.1.el7.nvme_avoid_mult_resets.1.x86_64 #1 [ 84.859230] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/12/2016 [ 84.861779] Workqueue: kblockd blk_mq_timeout_work [ 84.863827] ffff881e02e5fb20 00000000db0efa48 ffff881e02e5fad8 ffffffff816869f3 [ 84.867157] ffff881e02e5fb10 ffffffff81085cb0 0000000000000000 ffff881e04493300 [ 84.870492] 0000000000000097 ffff881dfd91ae84 ffff881dfd91ae00 ffff881e02e5fb78 [ 84.873830] Call Trace: [ 84.874870] [<ffffffff816869f3>] dump_stack+0x19/0x1b [ 84.876951] [<ffffffff81085cb0>] warn_slowpath_common+0x70/0xb0 [ 84.879387] [<ffffffff81085d4c>] warn_slowpath_fmt+0x5c/0x80 [ 84.881693] [<ffffffff81130e23>] __free_irq+0xa3/0x220 [ 84.883791] [<ffffffff81131039>] free_irq+0x49/0xb0 [ 84.885745] [<ffffffffa0037537>] nvme_suspend_queue+0xa7/0x100 [nvme] [ 84.888269] [<ffffffffa003849f>] nvme_dev_disable+0x8f/0x410 [nvme] [ 84.891075] [<ffffffff81428f7c>] ? dev_warn+0x6c/0x90 [ 84.893132] [<ffffffffa0038a3d>] nvme_timeout+0x10d/0x1d0 [nvme] [ 84.895582] [<ffffffff810ce474>] ? update_curr+0x104/0x190 [ 84.897804] [<ffffffff810cae6e>] ? account_entity_dequeue+0xae/0xd0 [ 84.900357] [<ffffffff812f8bdf>] blk_mq_rq_timed_out+0x2f/0x80 [ 84.902700] [<ffffffff812f8c7f>] blk_mq_check_expired+0x4f/0x90 [ 84.905256] [<ffffffff812fb324>] bt_for_each+0xe4/0xf0 [ 84.907315] [<ffffffff812f8c30>] ? blk_mq_rq_timed_out+0x80/0x80 [ 84.909713] [<ffffffff812fbc9e>] blk_mq_queue_tag_busy_iter+0x4e/0xc0 [ 84.912280] [<ffffffff812f8c30>] ? blk_mq_rq_timed_out+0x80/0x80 [ 84.914710] [<ffffffff812f749b>] blk_mq_timeout_work+0x8b/0x180 [ 84.917100] [<ffffffff810a845b>] process_one_work+0x17b/0x470 [ 84.919422] [<ffffffff810a9296>] worker_thread+0x126/0x410 [ 84.921632] [<ffffffff810a9170>] ? rescuer_thread+0x460/0x460 [ 84.923973] [<ffffffff810b0a4f>] kthread+0xcf/0xe0 [ 84.925949] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140 [ 84.928533] [<ffffffff81696fd8>] ret_from_fork+0x58/0x90 [ 84.930678] [<ffffffff810b0980>] ? 
kthread_create_on_node+0x140/0x140
[ 84.933267] ---[ end trace 1521630dc73bc5a4 ]---
[ 145.036978] nvme 0000:00:1e.0: Removing after probe failure status: -4

Ok, thanks. That is odd, because there already is a check in nvme_suspend_queue() for "nvmeq->cq_vector == -1", and there is a spin_lock (nvmeq->q_lock) around checking and setting the cq_vector, so it shouldn't try to suspend an already suspended queue. Obviously, though, at the end of nvme_suspend_queue() it is trying to free_irq(vector, nvmeq).

Would you please attach the output from "cat /proc/interrupts"? Is there more than one queue assigned to IRQ 151? Thanks.

Also, if possible, would you attach the output from "cat /proc/interrupts" after booting the kernel that works. Thanks.

Created attachment 1265908 [details]
Interrupts with kernel-3.10.0-514.17.1.el7.nvme_avoid_mult_resets.1
Created attachment 1265909 [details]
interrupts with kernel-4.1.12-61.1.25.el7uek.x86_64
Attached interrupt maps for 3.10.0-514.17.1.el7.nvme_avoid_mult_resets.1 (fails) and 4.1.12-61.1.25.el7uek.x86_64 (works).

Hi,

The nvme_timeout() code in this kernel-4.1.12-61.1.25.el7uek.x86_64 looks quite outdated. Would you please try this .2 version of the test kernel? It brings the functionality of nvme_timeout() very close to upstream. Thanks.
http://people.redhat.com/dmilburn/.nvme_avoid_multi_reset.736373695f72685f6461/

Still seeing the problem with the new test kernel:
[ 84.525584] nvme 0000:00:1e.0: I/O 0 QID 0 timeout, disable controller
[ 84.528452] ------------[ cut here ]------------
[ 84.530410] WARNING: at kernel/irq/manage.c:1273 __free_irq+0xa3/0x220()
[ 84.533122] Trying to free already-free IRQ 153
[ 84.534993] Modules linked in: dm_queue_length sd_mod sg iscsi_tcp libiscsi_tcp libiscsi target_core_user uio target_core_pscsi target_core_file target_core_iblock iscsi_target_mod target_core_mod crc_t10dif crct10dif_generic dm_multipath cls_u32 vxlan ip6_udp_tunnel udp_tunnel sch_htb scsi_transport_iscsi dm_mod ppdev intel_powerclamp intel_rapl iosf_mbi crc32_pclmul ghash_clmulni_intel aesni_intel lrw gf128mul glue_helper ablk_helper cryptd ena(OE) i2c_piix4 pcspkr parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables xfs libcrc32c ata_generic pata_acpi xen_blkfront cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm crct10dif_pclmul crct10dif_common ata_piix drm crc32c_intel libata nvme serio_raw i2c_core floppy fjes
[ 84.566928] CPU: 15 PID: 854 Comm: kworker/15:1H Tainted: G OE ------------ 3.10.0-514.18.1.el7.nvme_avoid_mult_resets.2.x86_64 #1
[ 84.572178] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/12/2016
[ 84.574730] Workqueue: kblockd blk_mq_timeout_work
[ 84.576802] ffff881effea7b20 000000005d0c66a0 ffff881effea7ad8 ffffffff816869b3
[ 84.580119] ffff881effea7b10 ffffffff81085cb0 0000000000000000 ffff881e840e9d00
[ 84.583363] 0000000000000099 ffff881dfeeedc84 ffff881dfeeedc00 
ffff881effea7b78 [ 84.586733] Call Trace: [ 84.587823] [<ffffffff816869b3>] dump_stack+0x19/0x1b [ 84.590020] [<ffffffff81085cb0>] warn_slowpath_common+0x70/0xb0 [ 84.592517] [<ffffffff81085d4c>] warn_slowpath_fmt+0x5c/0x80 [ 84.594844] [<ffffffff81130e23>] __free_irq+0xa3/0x220 [ 84.596974] [<ffffffff81131039>] free_irq+0x49/0xb0 [ 84.599051] [<ffffffffa00515b7>] nvme_suspend_queue+0xa7/0x100 [nvme] [ 84.601742] [<ffffffffa005251f>] nvme_dev_disable+0x8f/0x410 [nvme] [ 84.604216] [<ffffffff81428f7c>] ? dev_warn+0x6c/0x90 [ 84.606219] [<ffffffffa0052b53>] nvme_timeout+0x1a3/0x1c0 [nvme] [ 84.608576] [<ffffffff810bc454>] ? __wake_up+0x44/0x50 [ 84.610602] [<ffffffff810ce474>] ? update_curr+0x104/0x190 [ 84.612794] [<ffffffff810cae6e>] ? account_entity_dequeue+0xae/0xd0 [ 84.615397] [<ffffffff812f8bdf>] blk_mq_rq_timed_out+0x2f/0x80 [ 84.617851] [<ffffffff812f8c7f>] blk_mq_check_expired+0x4f/0x90 [ 84.620347] [<ffffffff812fb324>] bt_for_each+0xe4/0xf0 [ 84.622548] [<ffffffff812f8c30>] ? blk_mq_rq_timed_out+0x80/0x80 [ 84.625028] [<ffffffff812fbc9e>] blk_mq_queue_tag_busy_iter+0x4e/0xc0 [ 84.627758] [<ffffffff812f8c30>] ? blk_mq_rq_timed_out+0x80/0x80 [ 84.630330] [<ffffffff812f749b>] blk_mq_timeout_work+0x8b/0x180 [ 84.632850] [<ffffffff810a845b>] process_one_work+0x17b/0x470 [ 84.635314] [<ffffffff810a9296>] worker_thread+0x126/0x410 [ 84.637678] [<ffffffff810a9170>] ? rescuer_thread+0x460/0x460 [ 84.640124] [<ffffffff810b0a4f>] kthread+0xcf/0xe0 [ 84.642133] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140 [ 84.644762] [<ffffffff81696f98>] ret_from_fork+0x58/0x90 [ 84.647007] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140 [ 84.649750] ---[ end trace ee4c2172bfd41a5f ]--- [ 144.754109] nvme 0000:00:1e.0: Removing after probe failure status: -4 Created attachment 1267439 [details]
interrupts with 3.10.0-514.18.1.el7.nvme_avoid_mult_resets.2.x86_64
Hmm, I thought this kernel would fix the problem; it includes upstream bb8d261e08881 (nvme: introduce a controller state machine). This time it is IRQ 153:

[ 84.533122] Trying to free already-free IRQ 153

Looks like maybe "cat /proc/interrupts" was taken after freeing 153?

152: 17 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 xen-pirq-msi-x nvme0q6
154: 1 0 0 469 0 0 0 0 0 0 0 0 0 0 0 0 xen-pirq-msi-x nvme0q7

I will go back through some code paths, but I may need to put some debug in the next test kernel. Thanks.

It does look like 7.3 has c21377f8 ("nvme: Suspend all queues before deletion"), but we don't have commit 7065906096273 ("nvme: Delete created IO queues on reset") which includes a fix for c21377f8. I will add this one to the test kernel. Thanks.

Would you please try the .3 version, which includes the patch from Comment#17? Thanks.
http://people.redhat.com/dmilburn/.nvme_avoid_multi_reset.736373695f72685f6461/

David,

The issue is still happening. Would it help if we give you access

[ 81.896508] nvme 0000:00:1e.0: I/O 0 QID 0 timeout, disable controller
[ 81.901542] ------------[ cut here ]------------
[ 81.905072] WARNING: at kernel/irq/manage.c:1273 __free_irq+0xa3/0x220()
[ 81.910244] Trying to free already-free IRQ 154
[ 81.913787] Modules linked in: cls_u32 vxlan ip6_udp_tunnel udp_tunnel sch_htb scsi_transport_iscsi dm_mod intel_powerclamp intel_rapl iosf_mbi crc32_pclmul ghash_clmulni_intel ppdev aesni_intel lrw gf128mul glue_helper ablk_helper cryptd i2c_piix4 pcspkr parport_pc parport nfsd auth_rpcgss nfs_acl lockd grace sunrpc binfmt_misc ip_tables xfs libcrc32c ata_generic pata_acpi xen_blkfront cirrus drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ata_piix crct10dif_pclmul crct10dif_common drm crc32c_intel libata ena(OE) serio_raw nvme i2c_core floppy fjes
[ 81.953043] CPU: 4 PID: 802 Comm: kworker/4:1H Tainted: G OE ------------ 3.10.0-514.18.1.el7.nvme_avoid_mult_resets.3.x86_64 #1
[ 81.961261] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/12/2016 [ 
81.965367] Workqueue: kblockd blk_mq_timeout_work [ 81.968627] ffff881df6317b20 000000004e5ffb23 ffff881df6317ad8 ffffffff816869b3 [ 81.973921] ffff881df6317b10 ffffffff81085cb0 0000000000000000 ffff881eff970380 [ 81.979139] 000000000000009a ffff881dfeefa184 ffff881dfeefa100 ffff881df6317b78 [ 81.984383] Call Trace: [ 81.985947] [<ffffffff816869b3>] dump_stack+0x19/0x1b [ 81.988871] [<ffffffff81085cb0>] warn_slowpath_common+0x70/0xb0 [ 81.993433] [<ffffffff81085d4c>] warn_slowpath_fmt+0x5c/0x80 [ 81.997977] [<ffffffff81130e23>] __free_irq+0xa3/0x220 [ 82.002255] [<ffffffff81131039>] free_irq+0x49/0xb0 [ 82.006229] [<ffffffffa00495b7>] nvme_suspend_queue+0xa7/0x100 [nvme] [ 82.011480] [<ffffffffa004a51f>] nvme_dev_disable+0x8f/0x410 [nvme] [ 82.016359] [<ffffffff81428f7c>] ? dev_warn+0x6c/0x90 [ 82.020247] [<ffffffffa004ab53>] nvme_timeout+0x1a3/0x1c0 [nvme] [ 82.025035] [<ffffffff810ce474>] ? update_curr+0x104/0x190 [ 82.029533] [<ffffffff810cae6e>] ? account_entity_dequeue+0xae/0xd0 [ 82.034657] [<ffffffff812f8bdf>] blk_mq_rq_timed_out+0x2f/0x80 [ 82.039334] [<ffffffff812f8c7f>] blk_mq_check_expired+0x4f/0x90 [ 82.043981] [<ffffffff812fb324>] bt_for_each+0xe4/0xf0 [ 82.048109] [<ffffffff812f8c30>] ? blk_mq_rq_timed_out+0x80/0x80 [ 82.052810] [<ffffffff812fbc9e>] blk_mq_queue_tag_busy_iter+0x4e/0xc0 [ 82.057926] [<ffffffff812f8c30>] ? blk_mq_rq_timed_out+0x80/0x80 [ 82.062666] [<ffffffff812f749b>] blk_mq_timeout_work+0x8b/0x180 [ 82.067409] [<ffffffff810a845b>] process_one_work+0x17b/0x470 [ 82.071905] [<ffffffff810a9296>] worker_thread+0x126/0x410 [ 82.076384] [<ffffffff810a9170>] ? rescuer_thread+0x460/0x460 [ 82.081069] [<ffffffff810b0a4f>] kthread+0xcf/0xe0 [ 82.085050] [<ffffffff810b0980>] ? kthread_create_on_node+0x140/0x140 [ 82.090358] [<ffffffff81696f98>] ret_from_fork+0x58/0x90 [ 82.094771] [<ffffffff810b0980>] ? 
kthread_create_on_node+0x140/0x140
[ 82.099973] ---[ end trace aad9b72a0c2bbb96 ]---
[ 142.205711] nvme 0000:00:1e.0: Removing after probe failure status: -4

(In reply to artem from comment #19)
> David,
> The issue is still happening.
> Would it help if we give you access

Yeah, that would be great if I could get into an environment where I can build and boot some debug kernels. Thanks.

Hi Keith,

This is reproducible on upstream linux-4.10, but I don't think it is an NVMe driver problem. I put in some debug; I don't see where the NVMe driver does a double free_irq, but I do see this each time:

[ 19.702913] xen:events: Failed to obtain physical IRQ 259

(attaching dmesg output) Do you have any thoughts?

Created attachment 1269062 [details]
linux-4.10 dmesg output with failure
I'm not familiar with this part, but it looks suspicious to me that the __startup_pirq xen irq startup callback returns success even when it fails.

Hi Vitaly,

We are seeing an NVMe timeout followed by the driver disabling the controller, leading to a free_irq error. Would you please search the dmesg output (Comment#22) for "xen:events: Failed to obtain physical IRQ"? Can you give us an idea why this would fail? Thanks.

Sorry for the delayed reply, guys.

Xen uses a so-called 'event channel' event delivery mechanism for interrupts (both physical and virtual), and here we can see EVTCHNOP_bind_pirq failing, which means we can't bind some physical IRQ to an event channel.

AWS has its own Xen version with non-public patches, but looking at the upstream Xen I can see several possible reasons for such behavior:
1) We reached the limit of physical IRQs per domain. This limit is set dynamically; hard to say why we're above the limit.
2) The physical IRQ we're asking for is already bound to some other event channel.
3) We're not allowed to bind to this physical IRQ.
4) We're out of free event channels (very unlikely, especially for upstream, as they use FIFO based event channels; there are too many of them).

Please try the following debug patch (upstream):

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 6a53577..b65792e 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -528,6 +528,7 @@ static unsigned int __startup_pirq(unsigned int irq)
 	bind_pirq.flags = info->u.pirq.flags & PIRQ_SHAREABLE ?
 					BIND_PIRQ__WILL_SHARE : 0;
 	rc = HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, &bind_pirq);
+	pr_info("EVTCHNOP_bind_pirq: pirq: %d, flags: 0x%x, rc=%d (port: %d)\n", bind_pirq.pirq, bind_pirq.flags, rc, bind_pirq.port);
 	if (rc != 0) {
 		pr_warn("Failed to obtain physical IRQ %d\n", irq);
 		return 0;

let's see what's going on. 
It would also make sense to start a discussion on the xen-devel mailing list; the guys there may give some valuable suggestions.

(In reply to Vitaly Kuznetsov from comment #25)
> Sorry for the delayed reply guys,
> 
> Xen uses so-called 'event channel' event delivery mechanism for interrupts
> (both physical and virtual) and here we can see EVTCHNOP_bind_pirq failing
> which means we can't bind some physical IRQ to an event channel.
> 
> AWS has its own Xen version with non-public patches but looking at the
> upstream Xen I can see several possible reasons for such behavior:
> 1) We reached the limit of physical IRQs per domain. This limit is set
> dynamically, hard to say why we're above the limit.
> 2) The physical IRQ we're asking for is already bound to some other event
> channel.
> 3) We're not allowed to bind to to this phisycal IRQ.
> 4) We're out of free event channels (very unlikely, especially for upstream
> as they use FIFO based event channels, the're are too many of them).

Thanks Vitaly, I will attach the full dmesg.

[ 20.195765] xen:events: EVTCHNOP_bind_pirq: pirq: 583, flags: 0x0, rc=-17 (port: -1)
[ 20.195766] xen:events: Failed to obtain physical IRQ 260

Created attachment 1272366 [details]
dmesg after applying debug patch
-17 is -EEXIST, so we're probably trying to bind pirq 583 twice:

$ grep 'pirq: 583' /tmp/dmesg.txt
[ 20.171888] xen:events: EVTCHNOP_bind_pirq: pirq: 583, flags: 0x0, rc=0 (port: 253)
[ 20.188750] xen:events: EVTCHNOP_bind_pirq: pirq: 583, flags: 0x0, rc=0 (port: 251)
[ 20.195765] xen:events: EVTCHNOP_bind_pirq: pirq: 583, flags: 0x0, rc=-17 (port: -1)

I would expect that the pirq was unbound between the first and the second attempts, but it wasn't between the second and the third. To continue investigating, I would add dump_stack() to __startup_pirq() (and, probably, to shutdown_pirq()) to see why we're trying to acquire the same pirq twice.

I'm a little bit puzzled, as at the beginning of __startup_pirq() we actually check whether we already have an event channel for the irq:

    int evtchn = evtchn_from_irq(irq);
    ...
    if (VALID_EVTCHN(evtchn))
        goto out;

but this check fails, apparently. I can only imagine a race where we're trying to call __startup_pirq() on multiple CPUs simultaneously...

Created attachment 1272391 [details]
dmesg after applying debug patch and dump_stack()
Ok, here is dmesg after inserting dump_stack() at beginning of __startup_pirq()
and shutdown_pirq().
Sorry, I should've asked to print some debug info from shutdown_pirq() too, e.g.:

diff --git a/drivers/xen/events/events_base.c b/drivers/xen/events/events_base.c
index 6a53577..58e744d 100644
--- a/drivers/xen/events/events_base.c
+++ b/drivers/xen/events/events_base.c
@@ -237,6 +237,9 @@ static int xen_irq_info_pirq_setup(unsigned irq,
 
 static void xen_irq_info_cleanup(struct irq_info *info)
 {
+	pr_info("xen_irq_info_cleanup: %d\n", info->evtchn);
+	dump_stack();
+
 	set_evtchn_to_irq(info->evtchn, -1);
 	info->evtchn = 0;
 }
@@ -528,6 +531,8 @@ static unsigned int __startup_pirq(unsigned int irq)
 	bind_pirq.flags = info->u.pirq.flags & PIRQ_SHAREABLE ?
 					BIND_PIRQ__WILL_SHARE : 0;
 	rc = HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, &bind_pirq);
+	pr_info("EVTCHNOP_bind_pirq: IRQ: %d, pirq: %d, flags: 0x%x, rc=%d (port: %d)\n", irq, bind_pirq.pirq, bind_pirq.flags, rc, bind_pirq.port);
+	dump_stack();
 	if (rc != 0) {
 		pr_warn("Failed to obtain physical IRQ %d\n", irq);
 		return 0;
@@ -572,6 +577,9 @@ static void shutdown_pirq(struct irq_data *data)
 
 	BUG_ON(info->type != IRQT_PIRQ);
 
+	pr_info("shutdown_pirq: irq %d evtchn %d\n", irq, evtchn);
+	dump_stack();
+
 	if (!VALID_EVTCHN(evtchn))
 		return;

Here I'm adding some debug info to xen_irq_info_cleanup() too, to see if the irq<->evtchn mapping is being cleaned from some other place; we have a few places for that.

Anyway, looking at dmesg there is no shutdown_pirq() between the second and the third 'EVTCHNOP_bind_pirq: pirq: 580', which means that the driver is trying to bind the same irq twice. For some reason the irq<->evtchn mapping is cleaned in Linux but not in Xen, so we end up with -EEXIST.

My wild guess would be: this is also broken on non-Xen installs, but Xen specifics (not returning an error code when 'Failed to obtain physical IRQ' happens) make this error fatal. I would expect that on non-Xen setups the nvme driver starts with one less queue sometimes. 
I would suggest we also try changing "return 0" to "return rc" in __startup_pirq() to see what happens.

David, I feel like I'm torturing you with these debug patches and numerous guesses here :-( I can try reproducing the issue myself; is it easy enough? Do I just reboot an i3.4xlarge instance 10 times and the issue will show up?

Hi Vitaly,

(In reply to Vitaly Kuznetsov from comment #30)
> Sorry, I should've asked to print some debug info from shutdown_pirq() too,
> e.g.:
> 
> diff --git a/drivers/xen/events/events_base.c
> b/drivers/xen/events/events_base.c
> index 6a53577..58e744d 100644
> --- a/drivers/xen/events/events_base.c
> +++ b/drivers/xen/events/events_base.c
> @@ -237,6 +237,9 @@ static int xen_irq_info_pirq_setup(unsigned irq,
> 
>  static void xen_irq_info_cleanup(struct irq_info *info)
>  {
> +	pr_info("xen_irq_info_cleanup: %d\n", info->evtchn);
> +	dump_stack();
> +
>  	set_evtchn_to_irq(info->evtchn, -1);
>  	info->evtchn = 0;
>  }
> @@ -528,6 +531,8 @@ static unsigned int __startup_pirq(unsigned int irq)
>  	bind_pirq.flags = info->u.pirq.flags & PIRQ_SHAREABLE ?
>  					BIND_PIRQ__WILL_SHARE : 0;
>  	rc = HYPERVISOR_event_channel_op(EVTCHNOP_bind_pirq, &bind_pirq);
> +	pr_info("EVTCHNOP_bind_pirq: IRQ: %d, pirq: %d, flags: 0x%x, rc=%d (port: %d)\n", irq, bind_pirq.pirq, bind_pirq.flags, rc, bind_pirq.port);
> +	dump_stack();
>  	if (rc != 0) {
>  		pr_warn("Failed to obtain physical IRQ %d\n", irq);
>  		return 0;
> @@ -572,6 +577,9 @@ static void shutdown_pirq(struct irq_data *data)
> 
>  	BUG_ON(info->type != IRQT_PIRQ);
> 
> +	pr_info("shutdown_pirq: irq %d evtchn %d\n", irq, evtchn);
> +	dump_stack();
> +
>  	if (!VALID_EVTCHN(evtchn))
>  		return;
> 
> Here I'm adding some debug info to xen_irq_info_cleanup() too to see if the
> irq<->evtchn mapping is being cleaned from some other place, we have a few
> places for that.
> 
> Anyway, looking at dmesg there is no shutdown_pirq() between the second and
> the third 'EVTCHNOP_bind_pirq: pirq: 580' which means that the driver is
> trying to bind the same irq twice. For some reason irq<->evtchn mapping is
> cleaned in Linux but not in Xen so we end up with -EEXIST.
> 
> My wild guess would be: this is also broken on non-Xen installs but Xen
> specifics (not returning an error code when 'Failed to obtain physical IRQ'
> happens makes this error fatal. I would expect that on non-Xen setups the
> nvme driver starts with one less queue sometimes.

Ok, I will attach another dmesg output.

> I would suggest we also try changing "return 0" to "return rc" in
> __startup_pirq() to see what happens.
> 
> David, I feel like I'm torturing you with these debug patches and numerous
> guesses here :-( I can try reproducing the issue myself, is it easy enough?
> Do I just reboot i3.4xlarge instance 10 times and the issue will show up?

I think you will need some large NVMe drives; usually this happens on the first reboot. (I did get into a state where I couldn't reproduce, but I built a fresh kernel from the linux-4.10 sources, and I am back to reproducing on each reboot.) The first thing I check is lsblk (there should be 4 x 1.7T disks), then I check the dmesg output.

$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda 202:0 0 64G 0 disk
├─xvda2 202:2 0 64G 0 part /
└─xvda1 202:1 0 1M 0 part
nvme1n1 259:0 0 1.7T 0 disk
nvme3n1 259:1 0 1.7T 0 disk

Artem, do you think you can give Vitaly access to the same instance? Thanks.

Created attachment 1272630 [details]
dmesg after applying debug patch and dump_stack() and others
From the last dmesg I conclude we have a race: [ 20.422183] xen:events: EVTCHNOP_bind_pirq: pirq: 582, flags: 0x0, rc=0 (port: 281) [ 20.422185] CPU: 21 PID: 565 Comm: kworker/u256:2 Not tainted 4.10.0 #3 [ 20.422185] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/12/2016 [ 20.422187] Workqueue: nvme nvme_reset_work [nvme] [ 20.422188] Call Trace: [ 20.422191] dump_stack+0x63/0x87 [ 20.422193] __startup_pirq+0xdb/0x170 [ 20.422194] startup_pirq+0x11/0x20 [ 20.422196] irq_startup+0x47/0x80 [ 20.422197] ? alloc_cpumask_var_node+0x24/0x30 [ 20.422198] __setup_irq+0x550/0x640 [ 20.422200] request_threaded_irq+0x10d/0x1f0 [ 20.422202] queue_request_irq+0x4e/0x80 [nvme] [ 20.422203] nvme_setup_io_queues+0x1c0/0x410 [nvme] [ 20.422205] ? blk_queue_write_cache+0x4e/0x80 [ 20.422207] ? nvme_set_queue_limits+0x4c/0xa0 [nvme_core] [ 20.422209] ? nvme_init_identify+0x1fa/0x2f0 [nvme_core] [ 20.422211] nvme_reset_work+0x5c0/0x900 [nvme] [ 20.422212] process_one_work+0x165/0x410 [ 20.422213] worker_thread+0x137/0x4c0 [ 20.422215] kthread+0x101/0x140 [ 20.422216] ? rescuer_thread+0x3b0/0x3b0 [ 20.422218] ? kthread_park+0x90/0x90 [ 20.422220] ret_from_fork+0x2c/0x40 [ 20.426199] xen:events: EVTCHNOP_bind_pirq: pirq: 582, flags: 0x0, rc=-17 (port: -1) [ 20.426201] CPU: 0 PID: 5 Comm: kworker/u256:0 Not tainted 4.10.0 #3 [ 20.426202] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/12/2016 [ 20.426204] Workqueue: nvme nvme_reset_work [nvme] [ 20.426205] Call Trace: [ 20.426208] dump_stack+0x63/0x87 [ 20.426210] __startup_pirq+0xdb/0x170 [ 20.426212] startup_pirq+0x11/0x20 [ 20.426213] irq_startup+0x47/0x80 [ 20.426214] ? alloc_cpumask_var_node+0x24/0x30 [ 20.426215] __setup_irq+0x550/0x640 [ 20.426217] request_threaded_irq+0x10d/0x1f0 [ 20.426218] queue_request_irq+0x4e/0x80 [nvme] [ 20.426220] nvme_setup_io_queues+0x1c0/0x410 [nvme] [ 20.426222] ? nvme_init_identify+0x1fa/0x2f0 [nvme_core] [ 20.426223] ? kfree+0x122/0x170 [ 20.426225] ? 
nvme_init_identify+0x1fa/0x2f0 [nvme_core] [ 20.426226] nvme_reset_work+0x5c0/0x900 [nvme] [ 20.426228] process_one_work+0x165/0x410 [ 20.426229] worker_thread+0x137/0x4c0 [ 20.426230] kthread+0x101/0x140 [ 20.426231] ? rescuer_thread+0x3b0/0x3b0 [ 20.426233] ? kthread_park+0x90/0x90 [ 20.426235] ret_from_fork+0x2c/0x40 [ 20.426236] xen:events: Failed to obtain physical IRQ 259 [ 20.429439] xen:events: EVTCHNOP_bind_pirq: pirq: 581, flags: 0x0, rc=0 (port: 282)

It seems we have nvme_setup_io_queues() running on CPU0 and CPU21 simultaneously, and they're both trying to set up the same IRQ. Any idea why this is happening in nvme?

Are the two nvme_reset_work threads executing on two different NVMe drives, or on the same one? The former should be okay as far as I know, but the latter would definitely be a bug. I know there are three nvme drives in this system, so I would expect parallel initialization, but I can't tell from the stack trace if these are definitely from two different device initializations.

Hi Keith,

(In reply to Keith Busch from comment #34)
> Are the two nvme_reset_work threads executing on two different NVMe drives,
> or on the same one? The former should be okay as far as I know, but the
> later would definitely be a bug. I know there are three nvme drives in this
> system, so I would expect parallel initialization, but I can't tell from the
> stack trace if these are definitely from two different device
> initializations.

I added this in nvme_setup_io_queues() right before queue_request_irq(), but it didn't fail that time; I will try somewhere else. 
    dev_info(dev->ctrl.device, "NVME_SETUP_IO_QUEUES: pci function %s\n",
             dev_name(&pdev->dev));

(4 drives)
[ 19.998098] nvme nvme3: NVME_SETUP_IO_QUEUES: pci function 0000:00:1e.0
[ 20.001832] nvme nvme1: NVME_SETUP_IO_QUEUES: pci function 0000:00:1c.0
[ 20.103319] nvme nvme0: NVME_SETUP_IO_QUEUES: pci function 0000:00:1b.0
[ 20.106907] nvme nvme2: NVME_SETUP_IO_QUEUES: pci function 0000:00:1d.0

Created attachment 1272695 [details]
dmesg after applying debug patches and checking for NVMe device
Looks like different devices. Please note I moved this to nvme_create_io_queues(), which is returned from nvme_setup_io_queues(). Sorry, I forgot to change the message to NVME_CREATE_IO_QUEUES, but that is where it really came from. Thanks.
    dev_info(dev->ctrl.device, "NVME_SETUP_IO_QUEUES: pci function %s\n",
             dev_name(&pdev->dev));
This should be allowed for the nvme driver to do. It sounds like whatever is assigning these isn't thread safe, though it seems odd that pirq 582 is always the only duplicate.

Yes, it seems that some of this Xen code is not thread safe (I'll take a look), but my question would be: why is the nvme driver trying to call request_threaded_irq() for the same irq for different devices? BTW, it's not always 582; I saw 580 in one of the dmesgs from David.

Oh, I missed the fact that request_threaded_irq() is done with the IRQF_SHARED flag, so we probably have a shared irq, which is OK.

I tried reproducing the issue on Friday with an i3.8xlarge instance, and with the RHEL kernel it always reproduces; I was never able to boot with all 4 NVMes, 1-3 of them always fail. After that I tried to reproduce the upstream issue ("xen:events: Failed to obtain physical IRQ ...") with 4.11-rc7 but failed.

I wrote a patch which is supposed to fix the discovered race, but as I wasn't able to reproduce the issue I'm not sure whether it works. I'll attach it here, so if anyone is still able to reproduce the upstream issue, please let me know whether the patch helps. I'll be investigating further.

Created attachment 1273561 [details]
Patch to make __startup_pirq() thread safe
Hi Vitaly, I am still seeing the failure on linux-4.10 with the patch in comment#41 [ec2-user@ip-172-31-7-115 events]$ lsblk NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT xvda 202:0 0 64G 0 disk ├─xvda2 202:2 0 64G 0 part / └─xvda1 202:1 0 1M 0 part nvme0n1 259:1 0 1.7T 0 disk nvme3n1 259:0 0 1.7T 0 disk [ 0.000000] Linux version 4.10.0 (ec2-user.compute.internal) (gcc version 4.8.5 20150623 (Red \ Hat 4.8.5-11) (GCC) ) #4 SMP Mon Apr 24 08:24:12 EDT 2017 <snip> [ 21.026659] xen:events: EVTCHNOP_bind_pirq: pirq: 583, flags: 0x0, rc=-17 (port: -1) [ 21.026660] CPU: 19 PID: 556 Comm: kworker/u256:2 Not tainted 4.10.0 #4 [ 21.026661] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/12/2016 [ 21.026663] Workqueue: nvme nvme_reset_work [nvme] [ 21.026664] Call Trace: [ 21.026666] dump_stack+0x63/0x87 [ 21.026669] __startup_pirq+0xf3/0x180 [ 21.026670] startup_pirq+0x11/0x20 [ 21.026672] irq_startup+0x47/0x80 [ 21.026673] ? alloc_cpumask_var_node+0x24/0x30 [ 21.026674] __setup_irq+0x550/0x640 [ 21.026676] request_threaded_irq+0x10d/0x1f0 [ 21.026678] queue_request_irq+0x4e/0x80 [nvme] [ 21.026679] nvme_setup_io_queues+0x1c0/0x440 [nvme] [ 21.026681] ? blk_queue_write_cache+0x4e/0x80 [ 21.026683] ? nvme_set_queue_limits+0x4c/0xa0 [nvme_core] [ 21.026685] ? nvme_init_identify+0x1fa/0x2f0 [nvme_core] [ 21.026687] nvme_reset_work+0x5c0/0x900 [nvme] [ 21.026688] process_one_work+0x165/0x410 [ 21.026689] worker_thread+0x137/0x4c0 [ 21.026691] kthread+0x101/0x140 [ 21.026692] ? rescuer_thread+0x3b0/0x3b0 [ 21.026693] ? 
kthread_park+0x90/0x90 [ 21.026695] ret_from_fork+0x2c/0x40 [ 21.026696] xen:events: Failed to obtain physical IRQ 258 [ 21.026704] nvme nvme2: NVME_SETUP_IO_QUEUES: pci function 0000:00:1d.0 [ 21.026719] xen:events: EVTCHNOP_bind_pirq: pirq: 536, flags: 0x0, rc=0 (port: 266) [ 21.026720] CPU: 25 PID: 557 Comm: kworker/u256:3 Not tainted 4.10.0 #4 [ 21.026720] Hardware name: Xen HVM domU, BIOS 4.2.amazon 12/12/2016 [ 21.026721] Workqueue: nvme nvme_reset_work [nvme] [ 21.026722] Call Trace: [ 21.026723] dump_stack+0x63/0x87 [ 21.026724] __startup_pirq+0xf3/0x180 [ 21.026726] startup_pirq+0x11/0x20 [ 21.026727] irq_startup+0x47/0x80 [ 21.026728] ? alloc_cpumask_var_node+0x24/0x30 [ 21.026729] __setup_irq+0x550/0x640 [ 21.026730] request_threaded_irq+0x10d/0x1f0 [ 21.026731] queue_request_irq+0x4e/0x80 [nvme] [ 21.026733] nvme_setup_io_queues+0x318/0x440 [nvme] [ 21.026734] nvme_reset_work+0x5c0/0x900 [nvme] [ 21.026735] process_one_work+0x165/0x410 [ 21.026736] worker_thread+0x137/0x4c0 [ 21.026737] kthread+0x101/0x140 [ 21.026738] ? rescuer_thread+0x3b0/0x3b0 [ 21.026739] ? kthread_park+0x90/0x90 [ 21.026741] ret_from_fork+0x2c/0x40

I booted 4.11.0-rc8 (without the patch from Comment#41), but it didn't fail. I will reboot several times to see if I can recreate on latest upstream.

[ec2-user@ip-172-31-7-115 ~]$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
xvda 202:0 0 64G 0 disk
├─xvda2 202:2 0 64G 0 part /
└─xvda1 202:1 0 1M 0 part
nvme2n1 259:3 0 1.7T 0 disk
nvme1n1 259:0 0 1.7T 0 disk
nvme0n1 259:1 0 1.7T 0 disk
nvme3n1 259:2 0 1.7T 0 disk

Thanks. I see, the race is not what is causing us problems.

I tried linux-4.10 on i3.8xlarge today and voilà, the issue reproduced right away:

# lsblk | grep nvm
nvme2n1 259:2 0 1.7T 0 disk
nvme0n1 259:0 0 1.7T 0 disk
nvme3n1 259:1 0 1.7T 0 disk

(nvme1 is missing). However, I'm not able to reproduce the issue with 4.11-rc7; all four devices are always present. 
I went through the Xen-related changes in 4.11 and couldn't find anything relevant; the event channel code is the same. The NVMe driver has some changes, but I couldn't spot anything obvious either (though it's not code I'm familiar with). Could you please try 4.11-rc7 (or even rc8) to see if the issue still reproduces? If it doesn't, we can take another look at the changes and/or bisect.

One other thing which bothers me a bit: we never saw the "xen:events: Failed to obtain physical ..." message on RHEL (or, at least, I don't see it in my testing), but the pr_info() is there and the code is more or less the same. There is a possibility that we're failing for different reasons in RHEL and upstream...

Hi,

(In reply to Vitaly Kuznetsov from comment #44)
> Thanks, I see, the race is not what's causing our problems.
>
> I tried linux-4.10 on i3.8xlarge today and voilà, the issue reproduced
> right away:
> # lsblk | grep nvm
> nvme2n1 259:2 0 1.7T 0 disk
> nvme0n1 259:0 0 1.7T 0 disk
> nvme3n1 259:1 0 1.7T 0 disk
> (nvme1 is missing). However, I'm not able to reproduce the issue with
> 4.11-rc7, all four devices are always present. I went through the
> Xen-related changes in 4.11 and couldn't find anything relevant, the event
> channel code is the same. The NVMe driver has some changes but I couldn't
> spot anything obvious either (though it's not code I'm familiar with).
> Could you please try 4.11-rc7 (or even rc8) to see if the issue still
> reproduces? If it doesn't we can take another look at the changes and/or
> bisect.

I rebooted 4.11-rc8 ten times without seeing the failure. Thanks.

(In reply to Vitaly Kuznetsov from comment #45)
> One other thing which bothers me a bit: we never saw the "xen:events:
> Failed to obtain physical ..." message on RHEL (or, at least, I don't see
> it in my testing), but the pr_info() is there and the code is more or less
> the same. There is a possibility that we're failing for different reasons
> in RHEL and upstream...
Yeah, I did notice that upstream changed that to pr_warn() while RHEL uses pr_info(). Thanks.

(In reply to David Milburn from comment #47)
> (In reply to Vitaly Kuznetsov from comment #45)
> > One other thing which bothers me a bit: we never saw the "xen:events:
> > Failed to obtain physical ..." message on RHEL (or, at least, I don't see
> > it in my testing), but the pr_info() is there and the code is more or
> > less the same. There is a possibility that we're failing for different
> > reasons in RHEL and upstream...
>
> Yeah, I did notice that upstream changed that to pr_warn() while RHEL uses
> pr_info(). Thanks.

Though, it should still have been logged by the pr_info().

Blind bisect shows the culprit:

commit c74fd80f2f41d05f350bb478151021f88551afe8
Author: Dan Streetman <ddstreet>
Date:   Fri Jan 13 15:07:51 2017 -0500

    xen: do not re-use pirq number cached in pci device msi msg data

    Revert the main part of commit:
    af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")

    That commit introduced reading the pci device's msi message data to see
    if a pirq was previously configured for the device's msi/msix, and
    re-use that pirq.  At the time, that was the correct behavior.  However,
    a later change to Qemu caused it to call into the Xen hypervisor to
    unmap all pirqs for a pci device, when the pci device disables its
    MSI/MSIX vectors; specifically the Qemu commit:
    c976437c7dba9c7444fb41df45468968aaa326ad
    ("qemu-xen: free all the pirqs for msi/msix when driver unload")

    Once Qemu added this pirq unmapping, it was no longer correct for the
    kernel to re-use the pirq number cached in the pci device msi message
    data.  All Qemu releases since 2.1.0 contain the patch that unmaps the
    pirqs when the pci device disables its MSI/MSIX vectors.
    This bug is causing failures to initialize multiple NVMe controllers
    under Xen, because the NVMe driver sets up a single MSIX vector for
    each controller (concurrently), and then after using that to talk to
    the controller for some configuration data, it disables the single
    MSIX vector and re-configures all the MSIX vectors it needs.  So the
    MSIX setup code tries to re-use the cached pirq from the first vector
    for each controller, but the hypervisor has already given away that
    pirq to another controller, and its initialization fails.

    This is discussed in more detail at:
    https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html

    Fixes: af42b8d12f8a ("xen: fix MSI setup and teardown for PV on HVM guests")
    Signed-off-by: Dan Streetman <dan.streetman>
    Reviewed-by: Stefano Stabellini <sstabellini>
    Acked-by: Konrad Rzeszutek Wilk <konrad.wilk>
    Signed-off-by: Boris Ostrovsky <boris.ostrovsky>

The commit message mentions multiple NVMe controllers, so I think this is a good guess. The next question is whether it helps RHEL or not. I'll try.

Created attachment 1273684 [details]
RHEL7 version of the patch
Hi artem,

QE doesn't have the environment; could you help verify and report test results once this is fixed? Thanks in advance.

Yi

Hi Yi,

We can do some testing. But I would suggest that Red Hat urgently needs to arrange easy access to AWS for kernel QA. RHEL is a major OS offered on AWS, and AWS is a leading "server vendor" today. Red Hat kernel QA not having access to AWS is a major gap. How can we recommend using RHEL on AWS to our large enterprise customers if kernel testing is not done and any troubleshooting is problematic? This issue is a perfect case for urgently fixing this gap.

-Artem

Patch(es) committed on kernel repository and an interim kernel build is undergoing testing

Patch(es) available on kernel-3.10.0-660.el7

(In reply to artem from comment #59)
> Hi Yi,
> We can do some testing. But I would suggest that Red Hat urgently needs to
> arrange easy access to AWS for kernel QA. RHEL is a major OS offered on
> AWS, and AWS is a leading "server vendor" today. Red Hat kernel QA not
> having access to AWS is a major gap. How can we recommend using RHEL on
> AWS to our large enterprise customers if kernel testing is not done and
> any troubleshooting is problematic? This issue is a perfect case for
> urgently fixing this gap.
>
> -Artem

Hi Artem,

Thanks for the reminder. I have found a way to access AWS inside Red Hat and will try it for this bug ASAP.

Thanks,
Yi

Sometimes this issue cannot be reproduced after a system reboot. I tried the script [1] below, which I found at [2], and the issue is easy to reproduce with it.

[1]
#!/bin/bash
num=0
while [ $num -lt 30 ]
do
    echo "-----$num"
    rmmod nvme
    modprobe nvme
    ((num++))
    sleep 0.5
done

[2] https://lists.xen.org/archives/html/xen-devel/2017-01/msg00447.html

*** Bug 1432737 has been marked as a duplicate of this bug. ***

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2017:1842