Bug 862862
Description
Pasi Karkkainen
2012-10-03 19:07:27 UTC
Pasi, please upload a full guest boot log. I'm interested in any udev messages. See http://www.redhat.com/archives/libvir-list/2011-August/msg00937.html http://www.redhat.com/archives/libvir-list/2011-August/msg00937.html There's a difference in how RHEL-5 and RHEL-6 udev identify virtual functions persistently. Harald, what could be the reason of an eth interface being named "__tmp1145796886"? Thank you both. Hi Laine, did the discussion in <http://www.redhat.com/archives/libvir-list/2011-August/msg00937.html> lead anywhere? Is there a way in RHEL-5 libvirt to assign a persistent MAC to a VF? Thanks. (In reply to comment #2) > Pasi, > > please upload a full guest boot log. I'm interested in any udev messages. > > See > http://www.redhat.com/archives/libvir-list/2011-August/msg00937.html > http://www.redhat.com/archives/libvir-list/2011-August/msg00937.html > > There's a difference in how RHEL-5 and RHEL-6 udev identify virtual > functions persistently. > > Harald, > > what could be the reason of an eth interface being named "__tmp1145796886"? > > Thank you both. clash in renaming interfaces... especially, if you have ifcfg with interface names assigned to a HWADDR and you have interfaces without ifcfg files. udev does not know what to do with the "without" interfaces, when they occupy the name of the "HWADDR" interfaces. So they are renamed. This can be resolved by not naming the HWADDR interfaces in the "eth*" namespace. Thanks. So it's the consequence of the varying VF MAC and probably unrelated to the crash. (In reply to comment #0) > rhel 5.8 x64 hvm guest kernel: > Linux version 2.6.18-308.13.1.el5 (mockbuild.bos.redhat.com) > (gcc version 4.1.2 20080704 (Red Hat 4.1.2-52)) #1 SMP Thu Jul 26 05:45:09 > general protection fault: 0000 [1] SMP > RIP: 0010:[<ffffffff881fae5f>] [<ffffffff881fae5f>] > :ixgbevf:ixgbevf_msix_clean_tx+0x58/0x2a2 > <EOI> [<ffffffff881f8113>] :ixgbevf:ixgbevf_set_vfta_vf+0x0/0x2e > [<ffffffff881f83b8>] :ixgbevf:ixgbevf_read_v2p_mailbox+0x10/0x23 > [<ffffffff881f83d8>] :ixgbevf:ixgbevf_check_for_bit_vf+0xd/0x25 > [<ffffffff881f83fe>] :ixgbevf:ixgbevf_check_for_msg_vf+0xe/0x23 > [<ffffffff881f85fe>] :ixgbevf:ixgbevf_write_mbx_vf+0x23/0x79 > [<ffffffff881f852f>] :ixgbevf:ixgbevf_write_posted_mbx+0x17/0x5c > [<ffffffff881f813c>] :ixgbevf:ixgbevf_set_vfta_vf+0x29/0x2e > [<ffffffff881ad35a>] :8021q:vlan_device_event+0x50/0x1b2 > [<ffffffff800675cb>] notifier_call_chain+0x20/0x32 > [<ffffffff8023a75f>] dev_open+0x96/0x9c > [<ffffffff80238854>] dev_change_flags+0x5a/0x119 > [<ffffffff8026d035>] devinet_ioctl+0x235/0x59c > [<ffffffff8022f748>] sock_ioctl+0x1c1/0x1e5 > [<ffffffff80041f7f>] do_ioctl+0x21/0x6b > [<ffffffff8002ffcb>] vfs_ioctl+0x457/0x4b9 $ gdb /usr/lib/debug/lib/modules/2.6.18-308.13.1.el5/kernel/drivers/net/ixgbevf/ixgbevf.ko.debug (gdb) list *(ixgbevf_msix_clean_tx+0x58) 0x2e5f is in ixgbevf_msix_clean_tx (drivers/net/ixgbevf/ixgbevf_main.c:197). 192 union ixgbe_adv_tx_desc *tx_desc, *eop_desc; 193 struct ixgbevf_tx_buffer *tx_buffer_info; 194 unsigned int i, eop, count = 0; 195 unsigned int total_bytes = 0, total_packets = 0; 196 197 i = tx_ring->next_to_clean; ^^^^^^^^^^^^^^^^^^^^^^ 198 eop = tx_ring->tx_buffer_info[i].next_to_watch; 199 eop_desc = IXGBE_TX_DESC_ADV(*tx_ring, eop); 200 201 while ((eop_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)) && Backport candidate: http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=10cc1bdd Created attachment 624960 [details]
[2/2] ixgbevf: Fix panic when loading driver
RHEL-5 backport of the following upstream Linux patch:
commit 10cc1bdd5ef65f60f570b594c4c066d763c128fb
Author: Alexander Duyck <alexander.h.duyck>
Date: Mon Jul 16 23:44:48 2012 +0000
ixgbevf: Fix panic when loading driver
This patch addresses a kernel panic seen when setting up the
interface. Specifically we see a NULL pointer dereference on the
Tx descriptor cleanup path when enabling interrupts. This change
corrects that so it cannot occur.
Signed-off-by: Alexander Duyck <alexander.h.duyck>
Acked-by: Greg Rose <gregory.v.rose>
Tested-by: Sibai Li <sibai.li>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher>
Signed-off-by: David S. Miller <davem>
---
drivers/net/ixgbevf/ixgbevf_main.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
RHEL-6 seems to lack the patch as well, but before thinking about cloning this BZ for RHEL-6, we must see if the patch solves the RHEL-5 crash. I'll prepare a 2.6.18-308.13.1.el5 based RPM. Unfortunately it seems the patch didn't help.. I still get this crash: Unable to handle kernel paging request at ffff810078626d76 RIP: [<ffffffff88223e5f>] :ixgbevf:ixgbevf_msix_clean_tx+0x58/0x2b2 PGD 10063 PUD 0 Oops: 0002 [1] SMP last sysfs file: /block/ram0/dev CPU 0 Modules linked in: ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button ba ttery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy xen_vnif joydev xen_vbd xen_balloon i2c_piix4 i2c_core xen_platform_pc i ixgbevf e1000 pcspkr tpm_tis tpm tpm_bios serio_raw ide_cd cdrom 8021q dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 1965, comm: ifconfig Not tainted 2.6.18-308.13.1.el5.bz862862_ixgbevf_crash #1 RIP: 0010:[<ffffffff88223e5f>] [<ffffffff88223e5f>] :ixgbevf:ixgbevf_msix_clean_tx+0x58/0x2b2 RSP: 0018:ffffffff804b1ec8 EFLAGS: 00010286 RAX: 0000000000000000 RBX: ffff810078626d3a RCX: 0000000000000000 RDX: ffff81003f678380 RSI: 0000000000000040 RDI: 0000000000000000 RBP: ffff81003d319140 R08: ffff81003ff92f78 R09: ffff81003f772410 R10: ffffffff8009a264 R11: 0000000000001002 R12: ffff81003f037500 R13: 0000000000000000 R14: ffff8100357f3ca8 R15: ffff8100357f3ca8 FS: 00002acbff6f05e0(0000) GS:ffffffff80431000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: ffff810078626d76 CR3: 0000000037095000 CR4: 00000000000006a0 Process ifconfig (pid: 1965, threadinfo ffff8100357f2000, task ffff81003e72b7e0) Stack: ffff81003f678380 0000000000000000 ffff81003f037990 0000000000000246 ffff81003f037500 0000000000000000 ffff81003d319140 00000000000000c1 0000000000000000 ffff8100357f3ca8 ffff8100357f3ca8 ffffffff80010d91 Call Trace: <IRQ> [<ffffffff80010d91>] handle_IRQ_event+0x51/0xa6 [<ffffffff800be15f>] __do_IRQ+0xfb/0x15b [<ffffffff8006d4d1>] do_IRQ+0xe9/0xf7 [<ffffffff8005d615>] ret_from_intr+0x0/0xa <EOI> [<ffffffff80249398>] qdisc_alloc+0xf/0x99 [<ffffffff80249430>] qdisc_create_dflt+0xe/0x3f [<ffffffff802494d1>] dev_activate+0x37/0xee [<ffffffff8023a74b>] dev_open+0x82/0x9c [<ffffffff80238854>] dev_change_flags+0x5a/0x119 [<ffffffff8026d035>] devinet_ioctl+0x235/0x59c [<ffffffff8022f748>] sock_ioctl+0x1c1/0x1e5 [<ffffffff80041f7f>] do_ioctl+0x21/0x6b [<ffffffff8002ffcb>] vfs_ioctl+0x457/0x4b9 [<ffffffff800ba76b>] audit_syscall_entry+0x1a8/0x1d3 [<ffffffff8004c381>] sys_ioctl+0x59/0x78 [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Code: c7 43 3c 00 00 00 00 c7 43 40 00 00 00 00 41 8b 84 24 60 04 RIP [<ffffffff88223e5f>] :ixgbevf:ixgbevf_msix_clean_tx+0x58/0x2b2 RSP <ffffffff804b1ec8> CR2: ffff810078626d76 <0>Kernel panic - not syncing: Fatal exception Anything else I should try/test? Not much of a consolation, but it's a different crash (or in a different location anyway). $ gdb /usr/lib/debug/lib/modules/2.6.18-308.13.1.el5.bz862862_ixgbevf_crash/kernel/drivers/net/ixgbevf/ixgbevf.ko.debug (gdb) list *(ixgbevf_msix_clean_tx+0x58) 0x2e5f is in ixgbevf_msix_clean_tx (drivers/net/ixgbevf/ixgbevf_main.c:979). 974 return IRQ_HANDLED; 975 976 r_idx = find_first_bit(q_vector->txr_idx, adapter->num_tx_queues); 977 for (i = 0; i < q_vector->txr_count; i++) { 978 tx_ring = &(adapter->tx_ring[r_idx]); 979 tx_ring->total_bytes = 0; ^^^^^^^^^^^^^^^^^^^^ 980 tx_ring->total_packets = 0; 981 ixgbevf_clean_tx_irq(adapter, tx_ring); 982 r_idx = find_next_bit(q_vector->txr_idx, adapter->num_tx_queues, 983 r_idx + 1); (The patch attached in comment 7 was for ixgbevf_clean_tx_irq(), called on line 981 above, and most probably inlined -- line 981 is the only one call site.) Apparently the assignment to tx_ring on line 978 can produce bogus values. Now we've seen it cause a crash on the next line, and also a bit later, when the dereference occurs in ixgbevf_clean_tx_irq(). I do think we have the correct patch, just my first backport attempt was wrong. Clearly, if we expect "tx_ring" to be possibly invalid right at the beginning of ixgbevf_clean_tx_irq(), then it can be invalid with the exact same effort earlier, on lines 979 and 980 shown above. Looking at upstream Linux at commit 10cc1bdd (ie. not just the patch, but the entire "ixgbevf_main.c" file in that state), those "tx_ring" dereferences before the call to ixgbevf_clean_tx_irq() are *not* there in ixgbevf_msix_clean_tx(). In other words, in upstream the patch (commit 10cc1bdd) protected *all* dereferences against the adapter being down, while my backport in comment 7 failed to implement the same intent. I still think we should go with this patch, just backport it correctly. BTW upstream has since (as a matter of fact, not much later than 10cc1bdd) moved the TX cleanup to NAPI context from interrupt context, see commit fa71ae27. Obviously that's out of question for RHEL-5. I'll try to prepare a new backport and corresponding RPMs. What I missed was <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=93659763>. They did it for a different purpose, but in effect that commit removed the offending dereferences. Created attachment 629093 [details]
[1/2] ixgbevf: Prevent RX/TX statistics getting reset to zero
series build tested locally, brew in progress
(also clearing needinfo)
Crash also happens with the latest patch, still in ixgbevf_msix_clean_tx(): general protection fault: 0000 [1] SMP last sysfs file: /block/ram0/dev CPU 0 Modules linked in: ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy xen_vnif xen_vbd xen_balloon joydev ide_cd pcspkr cdrom e1000 i2c_piix4 ixgbevf i2c_core 8021q xen_platform_pci tpm_tis tpm tpm_bios serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 1981, comm: ifconfig Not tainted 2.6.18-308.16.1.el5.bz862862_ixgbevf_crash_2 #1 RIP: 0010:[<ffffffff881f8e26>] [<ffffffff881f8e26>] :ixgbevf:ixgbevf_msix_clean_tx+0x74/0x2a8 RSP: 0018:ffffffff804b1ec8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff0078626d3a33 RCX: 0000000000000000 RDX: ffff0078626d3a33 RSI: 0000000000000040 RDI: 0000000000000000 RBP: ffff81003d87e540 R08: ffff81003ff92f78 R09: ffff81003f483a10 R10: ffffffff8009a264 R11: 0000000000001002 R12: ffff81003f00b500 R13: 0000000000000000 R14: ffff81003f00b000 R15: ffff8100351e3b88 FS: 00002b50f767a5e0(0000) GS:ffffffff80431000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000003e2551dcfc CR3: 0000000034c04000 CR4: 00000000000006a0 Process ifconfig (pid: 1981, threadinfo ffff8100351e2000, task ffff81003f48e7a0) Stack: ffff81003f664780 0000000000000000 ffff81003f00b990 0000000000000246 ffff81003f00b500 0000000000000000 ffff81003d87e540 00000000000000c9 0000000000000000 ffff8100351e3b88 ffff8100351e3b88 ffffffff80010d91 Call Trace: <IRQ> [<ffffffff80010d91>] handle_IRQ_event+0x51/0xa6 [<ffffffff800be15f>] __do_IRQ+0xfb/0x15b [<ffffffff8006d4d1>] do_IRQ+0xe9/0xf7 [<ffffffff8005d615>] ret_from_intr+0x0/0xa <EOI> [<ffffffff800906e9>] __cond_resched+0x1c/0x44 [<ffffffff8006246f>] __sched_text_start+0x9f/0xbd0 [<ffffffff8006245b>] __sched_text_start+0x8b/0xbd0 [<ffffffff881f63d8>] :ixgbevf:ixgbevf_check_for_bit_vf+0xd/0x25 [<ffffffff881f6421>] :ixgbevf:ixgbevf_check_for_ack_vf+0xe/0x23 [<ffffffff800906e9>] __cond_resched+0x1c/0x44 [<ffffffff800630d5>] cond_resched+0x37/0x42 [<ffffffff800df3f4>] __kmalloc+0x3b/0x9f [<ffffffff800cf439>] __kzalloc+0x9/0x21 [<ffffffff802494fa>] qdisc_alloc+0x1f/0x99 [<ffffffff80249582>] qdisc_create_dflt+0xe/0x3f [<ffffffff80249623>] dev_activate+0x37/0xee [<ffffffff8023a89d>] dev_open+0x82/0x9c [<ffffffff802389a6>] dev_change_flags+0x5a/0x119 [<ffffffff8026d187>] devinet_ioctl+0x235/0x59c [<ffffffff8022f89a>] sock_ioctl+0x1c1/0x1e5 [<ffffffff80041f9e>] do_ioctl+0x21/0x6b [<ffffffff8002ffea>] vfs_ioctl+0x457/0x4b9 [<ffffffff800ba76b>] audit_syscall_entry+0x1a8/0x1d3 [<ffffffff8004c3a0>] sys_ioctl+0x59/0x78 [<ffffffff8005d28d>] tracesys+0xd5/0xe0 Code: 8b 6b 24 89 e8 48 c1 e0 05 48 03 43 30 44 0f b7 78 1a 44 89 RIP [<ffffffff881f8e26>] :ixgbevf:ixgbevf_msix_clean_tx+0x74/0x2a8 RSP <ffffffff804b1ec8> <0>Kernel panic - not syncing: Fatal exception Now that is bad. Reading symbols from /usr/lib/debug/lib/modules/2.6.18-308.16.1.el5.bz862862_ixgbevf_crash_2/kernel/drivers/net/ixgbevf/ixgbevf.ko.debug...done. (gdb) list *(ixgbevf_msix_clean_tx+0x74) 0x2e26 is in ixgbevf_msix_clean_tx (drivers/net/ixgbevf/ixgbevf_main.c:200). 195 unsigned int total_bytes = 0, total_packets = 0; 196 197 if (test_bit(__IXGBEVF_DOWN, &adapter->state)) 198 return true; 199 200 i = tx_ring->next_to_clean; ^^^^^^^^^^^^^^^^^^^^^^ 201 eop = tx_ring->tx_buffer_info[i].next_to_watch; 202 eop_desc = IXGBE_TX_DESC_ADV(*tx_ring, eop); 203 204 while ((eop_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)) && It means that either - the __IXGBEVF_DOWN doesn't get set correctly, or - the adapter is indeed up, but the "tx_ring" pointer is nonetheless invalid. ixgbevf_probe ixgbevf_assign_netdev_ops dev->open = ixgbevf_open ixgbevf_init_interrupt_scheme ixgbevf_alloc_queues adapter->tx_ring = kcalloc(...) ixgbevf_open ixgbevf_setup_all_tx_resources ... ixgbevf_up_complete clear_bit(__IXGBEVF_DOWN, ...) ixgbevf_request_irq ixgbevf_request_msix_irqs request_irq(..., ixgbevf_msix_clean_tx, ...) ixgbevf_irq_enable Seems sane. With kernel 2.6.18-344.el5: Unable to handle kernel NULL pointer dereference at 000000000000001a RIP: [<ffffffff88196f1b>] :ixgbevf:ixgbevf_msix_clean_tx+0x7b/0x2a2 PGD 351a4067 PUD 3d270067 PMD 0 Oops: 0000 [1] SMP last sysfs file: /block/ram0/dev CPU 0 Modules linked in: ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy xen_vnif joydev xen_vbd xen_balloon i2c_piix4 i2c_core ide_cd xen_platform_pci serio_raw e1000 cdrom pcspkr tpm_tis tpm tpm_bios ixgbevf 8021q dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 1983, comm: ifconfig Not tainted 2.6.18-344.el5 #1 RIP: 0010:[<ffffffff88196f1b>] [<ffffffff88196f1b>] :ixgbevf:ixgbevf_msix_clean_tx+0x7b/0x2a2 RSP: 0018:ffffffff804b3ec8 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff81000078626d RCX: 0000000000000000 RDX: ffff81003ec38dc0 RSI: 0000000000000040 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000040 R09: ffff810001738380 R10: ffffffff800a1be8 R11: 0000000000001002 R12: ffff81003e856500 R13: 0000000000000000 R14: ffff81003e856000 R15: ffff8100357d5b88 FS: 00002ba2749c05e0(0000) GS:ffffffff80434000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000000000001a CR3: 000000003d317000 CR4: 00000000000006a0 Process ifconfig (pid: 1983, threadinfo ffff8100357d4000, task ffff81003f3e07e0) Stack: ffff81003ec38dc0 0000000000000000 ffff81003e856990 0000000000000246 ffff81003e856500 0000000000000000 ffff81003cc10140 00000000000000c1 0000000000000000 ffff8100357d5b88 ffff8100357d5b88 ffffffff80010da9 Call Trace: <IRQ> [<ffffffff80010da9>] handle_IRQ_event+0x51/0xa6 [<ffffffff800be877>] __do_IRQ+0xfb/0x15b [<ffffffff8006d47d>] do_IRQ+0xe9/0xf7 [<ffffffff8005d625>] ret_from_intr+0x0/0xa <EOI> [<ffffffff80090bac>] __cond_resched+0x1c/0x44 [<ffffffff800624b2>] __sched_text_start+0xe2/0xbd0 [<ffffffff8006245b>] __sched_text_start+0x8b/0xbd0 [<ffffffff88194428>] :ixgbevf:ixgbevf_check_for_bit_vf+0xd/0x25 [<ffffffff8819444e>] :ixgbevf:ixgbevf_check_for_msg_vf+0xe/0x23 [<ffffffff881945f5>] :ixgbevf:ixgbevf_read_posted_mbx+0x31/0x67 [<ffffffff881944bc>] :ixgbevf:ixgbevf_obtain_mbx_lock_vf+0x13/0x1f [<ffffffff881944da>] :ixgbevf:ixgbevf_read_mbx_vf+0x12/0x5f [<ffffffff8819405c>] :ixgbevf:ixgbevf_write_msg_read_ack+0x2d/0x36 [<ffffffff800bef3e>] request_irq+0xb0/0xd6 [<ffffffff80090bac>] __cond_resched+0x1c/0x44 [<ffffffff800630d5>] cond_resched+0x37/0x42 [<ffffffff800dfc8f>] __kmalloc+0x3b/0x9f [<ffffffff800cfc6a>] __kzalloc+0x9/0x21 [<ffffffff8024b61a>] qdisc_alloc+0x1f/0x99 [<ffffffff8024b6a2>] qdisc_create_dflt+0xe/0x3f [<ffffffff8024b75f>] dev_activate+0x37/0xee [<ffffffff8023c8ee>] dev_open+0x82/0x9c [<ffffffff8023a9d1>] dev_change_flags+0x5a/0x119 [<ffffffff8026f3ca>] devinet_ioctl+0x235/0x59c [<ffffffff80231830>] sock_ioctl+0x1c1/0x1e5 [<ffffffff8004248b>] do_ioctl+0x21/0x6b [<ffffffff800304d3>] vfs_ioctl+0x457/0x4b9 [<ffffffff800bae83>] audit_syscall_entry+0x1a8/0x1d3 [<ffffffff8004c88f>] sys_ioctl+0x59/0x78 [<ffffffff8005d29e>] tracesys+0xd5/0xdf Code: 44 0f b7 78 1a 44 89 f8 48 c1 e0 04 48 03 43 08 c7 44 24 1c RIP [<ffffffff88196f1b>] :ixgbevf:ixgbevf_msix_clean_tx+0x7b/0x2a2 RSP <ffffffff804b3ec8> CR2: 000000000000001a <0>Kernel panic - not syncing: Fatal exception clearing needinfo Are VLANs involved / important somehow? (In reply to comment #0) > 8021q: adding VLAN 0 to HW filter on device __tmp1145796886 > general protection fault: 0000 [1] SMP > <EOI> [<ffffffff881f8113>] :ixgbevf:ixgbevf_set_vfta_vf+0x0/0x2e "ixgbevf_set_vfta_vf - Set/Unset vlan filter table address" I'm uncertain my question makes sense at all, but does the crash reproduce "without VLAN"? There are some VLAN-related changes after summer 2012 in upstream, but I have no idea if they're related. (2ddc7fe1, 5d9a533b). Anyway I can't make any sense of the stack traces. Is an interrupt injected by the hypervisor in a wrong spot, or what?... (In reply to comment #26) > Are VLANs involved / important somehow? > > I'm uncertain my question makes sense at all, but does the crash reproduce > "without VLAN"? > I don't use VLANs when testing this. I use the "ethX" device directly, so the crash happens when I do: "ifconfig __tmp1145796886 <ip> up". It's the VF interface, not a VLAN. Note that on some boots the VF is properly called ethX, and then when I do "ifconfig ethX <ip> up" I *dont* get the crash. The crash only happens when the VF has the name "__tmp1145796886" .. it *seems* the rename stuff is related somehow. I'm able to freely reboot the server for a couple of weeks more.. so it's easy to test stuff now. Anything else I should try? Thanks. We can't reproduce this in our lab. If possible I'd like to avoid manually instrumenting the entire ixgbevf driver and exchange RPMs. So, let's see if we can get a bird's eye view of the control flow with systemtap. Please download the RHEL-5 systemtap docs from <https://access.redhat.com/knowledge/docs/Red_Hat_Enterprise_Linux/>. Setting it up is described in <https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/5/html/SystemTap_Beginners_Guide/using-systemtap.html>. (1) Please set up systemtap in the HVM guest that tends to crash. (2) Please create a systemtap script (eg. "ixgbevf.stap") with the following contents, in the guest: probe module("ixgbevf").function("*").call { printf("%s -> %s\n", thread_indent(1), probefunc()) } probe module("ixgbevf").function("*").return { printf("%s <- %s\n", thread_indent(-1), probefunc()) } (3) Blacklist the ixgbevf module in the guest. (4) Recreate the situation (first guest boot after host boot IIRC) where the guest would normally crash after loading ixgbevf and bringing up the interface. Putting it differently, I think, please reboot the host and boot the guest, with ixgbevf blacklisted. (5) Once the guest has booted, log in to it with ssh (through PV-on-HVM netfront or an emulated NIC), and modprobe ixgbevf manually. It is my understanding that at this point there's no crash yet. The point of the manual modprobe is to avoid an automatic ifup during the boot process (which would crash). (6) Unbind the passthru ixgbevf from the module, inside the guest: echo DOMAIN:BUS:SLOT.FUNC > /sys/bus/pci/drivers/ixgbevf/unbind (7) In the host, configure "screen" as follows: ~/.screenrc: logfile /root/screen.log logfile flush 1 Then start "screen" and press C-a H. (Capital H.) (8) In "screen", connect to the serial console of the guest from within the host, with "virsh console" or "xm console". Then start the systemtap script from step (2) over the serial console, inside "screen". stap -v ixgbevf.stap Some warning messages may be displayed, but the command should not exit. (9) Using the ssh connection to the guest, rebind the passthru ixgbevf device to the ixgbevf module. I'm using <http://lxr.linux.no/#linux+v3.6.5/Documentation/ABI/testing/sysfs-bus-pci> as cheatsheet: echo DOMAIN:BUS:SLOT.FUNC > /sys/bus/pci/drivers/ixgbevf/bind or echo 1 > /sys/bus/pci/rescan This should produce some systemtap output on the serial console (which is all saved in /root/screen.log). (10) Bring up the ixgbevf interface in the guest. This should procude a lot of systemtap output on the serial console (saved in /root/screen.log) and ultimately crash (with a stack trace). Force off the guest (if it hasn't been automatically destroyed at this point). The "virsh console" command from step (8) will exit, but screen will continue to record pty traffic. (11) In the host, make a copy of /root/screen.log (eg. to "crash.log"), and then truncate the original (do not use "mv"): >| /root/screen.log (12) Reboot the guest, and repeat steps (5), (6), (8), (9), (10). In my understanding the guest doesn't crash this time and the ixgbevf interface is brought up correctly. (13) Type C-c in the "screen" window to interrupt stap, C-] to detach from the guest's serial console, then exit "screen". Rename /root/screen.log to "success.log". (14) Please upload "crash.log" and "success.log". Thanks! Sorry for the delayed reply. I'm having problems reproducing the problem with this method.. another problem: after i re-bind the pci device to ixgbevf driver stap on the serial console starts to get flooded with continuous calls to ixgbevf_clean_rxonly(), ixgbevf_clean_rx_irq() and ixgbevf_alloc_rx_buffers() .. and the serial console goes very slow/unresponsive. Maybe I should try with the "original" method where I'm seeing the crash often.. so by letting the rhel5 guest autoload the ixgbevf driver (but not bring up the eth-interface) at boot time. Now I got it to crash using my original method and while running the stap script.. I get the kernel crash/stacktrace on the console, and nothing on stap.. it crashes so quickly. And I mean running "ifconfig -a" or "ethtool -i <ixgbevf-eth-interface>" I get output from the stap script (calls to ixgbevf_get_stats()) , but when I run "ifconfig __tmp1534769776 0.0.0.0 up" the rhel5 guest kernel crashes immediately and there's nothing more on the stap screen.. so the guest kernel crashes before stap manages to print any call info. (Just in case it might help, a new feature in systemtap version 2.0 is connection to the kernel panic_notifier. If the kernel crashes while a stap job is running, this notifier lets stap emit any last-gasp messages that were buffered but not yet punted to userspace. (There is no official RH build of stap 2.0 for RHEL5, but it's not too hard to build.) FC6 bug about udev eth interface rename creating __tmp names: https://bugzilla.redhat.com/show_bug.cgi?id=208740 Anyway it seems I'm only getting the rhel5 guest kernel crash when the interface name is __tmp1534769776 or something like that.. no idea why rename stuff affects the ixgbevf driver. Pasi, can you perhaps reproduce the problem in a bare-metal RHEL-5 installation? Like booting the machine and using the ixgbevf interface without any hypervisor or guests? Also, in both guest cases (ie. crash and success), can you please set "ignore_loglevel" on the guest kernel command line, capture the output on the virtual serial console, and upload both logs? I tried to look a bit more at the tx_ring value that causes the crash. It is calculated by ixgbevf_msix_clean_tx() for ixgbevf_clean_tx_irq() like this: r_idx = find_first_bit(q_vector->txr_idx, adapter->num_tx_queues); for (i = 0; i < q_vector->txr_count; i++) { tx_ring = &(adapter->tx_ring[r_idx]); tx_ring->total_bytes = 0; tx_ring->total_packets = 0; ixgbevf_clean_tx_irq(adapter, tx_ring); r_idx = find_next_bit(q_vector->txr_idx, adapter->num_tx_queues, r_idx + 1); } Extending comment 20: > ixgbevf_probe > ixgbevf_assign_netdev_ops > dev->open = ixgbevf_open ixgbevf_sw_init -------------------> #1 > ixgbevf_init_interrupt_scheme > ixgbevf_alloc_queues > adapter->tx_ring = kcalloc(...) > > ixgbevf_open > ixgbevf_setup_all_tx_resources > ... /* * Map the Tx/Rx rings to the vectors we were allotted. * if request_irq will be called in this function map_rings * must be called *before* up_complete */ ixgbevf_map_rings_to_vectors ------> #2 map_vector_to_txq > ixgbevf_up_complete > clear_bit(__IXGBEVF_DOWN, ...) > ixgbevf_request_irq > ixgbevf_request_msix_irqs > request_irq(..., ixgbevf_msix_clean_tx, ...) > ixgbevf_irq_enable #2 (ie. ixgbevf_map_rings_to_vectors()) is responsible for setting up "q_vector": /** * ixgbevf_map_rings_to_vectors - Maps descriptor rings to vectors * @adapter: board private structure to initialize * * This function maps descriptor rings to the queue-specific vectors * we were allotted through the MSI-X enabling code. Ideally, we'd have * one vector per ring/queue, but on a constrained vector budget, we * group the rings as "efficiently" as possible. You would add new * mapping configurations in here. **/ Basically it distributes the receive and transmit queues over the available interrupt vectors. When a TX interrupt fires, the ixgbevf_msix_clean_tx() interrupt handler loops over all covered transmit queues. It expects to find exactly "q_vector->txr_count" bits set in the "q_vector->txr_idx" bitmap. The position of each set bit in that map designatex a tx_ring, which is then passed to ixgbevf_clean_tx_irq(). If the bitmap setup is wrong, that's an explanation. If the loop goes on for too long (fewer bits set in the map than indicated by "txr_count"), then find_first_bit() / find_next_bit() will return the size of the bitmap (ie. one greater than the position of the maximum possible bit), resulting in an invalid "tx_ring" pointer. So why would the bitmap setup in #2 go wrong? Now I'm suspecting #1 (ixgbevf_sw_init()), which is part of the probing, ie. precedes opening. This function explicitly depends on the MAC assignment to the VF: err = hw->mac.ops.reset_hw(hw); if (err) { dev_info(&pdev->dev, "PF still in reset state, assigning new address\n"); random_ether_addr(hw->mac.addr); } else { err = hw->mac.ops.init_hw(hw); if (err) { printk(KERN_ERR "init_shared_code failed: %d\n", err); goto out; } } Directly after host reboot, the first branch runs in the guest. (This could be confirmed by your dmesg as requested above.) ( Also, this could be the connection to the udev renaming problem, see Harald's comment 4: > clash in renaming interfaces... especially, if you have ifcfg with interface > names assigned to a HWADDR and you have interfaces without ifcfg files. > > udev does not know what to do with the "without" interfaces, when they > occupy the name of the "HWADDR" interfaces. So they are renamed. This can be > resolved by not naming the HWADDR interfaces in the "eth*" namespace. Let's say the ixgbevf interface is originally called eth1 in the guest (this is the name that the kernel assigns before udev runs.) Also suppose that udev has run at an earlier time (in another boot) in this guest, and created an ifcfg-eth1 file, with HWADDR stanza inside. After host reboot, random_ether_addr() in the first guest branch above produces a MAC address that the guest's udev has never seen before. So there's again an "eth1" interface from the kernel, but it's MAC differs from the HWADDR that was captured earlier in the ifcfg-eth1 file. So udev renames the interface to __tmpXXX. ) I think something goes wrong in/after the probing when the first branch runs ("PF still in reset state"). When the second branch runs (MAC already assigned to vf), everything's fine. I tried to compare ixgbevf_sw_init() with its upstream Linux counterpart; upstream commit 1a0d6ae5 ("rename dev_hw_addr_random and remove redundant second") seems relevant. (Possibly 0ac1e8ce too.) These affect the error handling in/around ixgbevf_sw_init(). I'm going to have to look into those more deeply. However if 1a0d6ae5 indeed plays a role, then this problem is not Xen specific; that's why I'm asking for the bare metal ixgbevf test on the same hardware. ... I might be wrong about which branch runs when. Actually 1a0d6ae5 doesn't seem to change what happens when the first branch runs -- without the patch, a stale non-zero value is returned by ixgbevf_sw_init(), but since it's ignored its staleness makes no difference. Also if init_hw() succeeds in the second branch, no observable change. 1a0d6ae5 changes the handling of init_hw() failure in the second branch ("init_shared_code failed"). With the patch that condition makes ixgbevf_probe() bail out with an error (because the latter no longer ignores ixgbevf_sw_init() errors), and that of course would prevent ixgbevf_open(). Andy, would you have any remarks perhaps? Thanks! Created attachment 653948 [details] [v1] tarball of broken out debug patches for ixgbevf_main.c (64 pcs) To be applied as in: - take 2.6.18-308.20.1.el5 (most recent public Z-Stream build) - apply backport from comment 15 - apply backport from comment 7 - apply this series Locally build-tested. Created attachment 653952 [details] [v1] squashed debug series for ixgbevf_main.c same as in comment 43, for easier viewing in bugzilla Pasi, from an unrelated internal discussion I'm getting the idea that the setting fixed MAC addresses for VFs in rc.local with "ip set" might be good practice. Can you please check what happens at first guest boot after host boot if you place such settings in the guest? Thanks! Sorry for the delayed reply, I've been busy with other things. I'll try to get the requested things tested later this week! Thanks. Created attachment 663166 [details] rhel5 hvm guest kernel crash with 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_3 Attached is the rhel5 hvm guest kernel boot + crash log with the latest debug patched kernel. Btw it seems the issue often happens for the first VM launched after host reboot, but also after that.. on every third or every fourth reboot of the rhel5 hvm guest. that makes reproducing and testing easier, no need to reboot the physical server. I'm trying to set the VF MAC from dom0 before starting the VM: # ip link set dev eth0 vf 0 mac 00:01:02:03:05:01 Error: either "dev" is duplicate, or "vf" is a garbage. # ip link help Usage: ip link set DEVICE { up | down | arp { on | off } | dynamic { on | off } | multicast { on | off } | allmulticast { on | off } | promisc { on | off } | trailers { on | off } | txqueuelen PACKETS | name NEWNAME | address LLADDR | broadcast LLADDR | mtu MTU } ip link show [ DEVICE ] So hmm.. iproute in rhel5 doesn't support VFs at all? # ip link set eth0 vf 0 mac 00:de:ad:be:ef:01 Error: either "dev" is duplicate, or "vf" is a garbage. Same error with that syntax.. no "dev" in between there. (In reply to comment #50) > Attached is the rhel5 hvm guest kernel boot + crash log with the latest > debug patched kernel. > > Btw it seems the issue often happens for the first VM launched after host > reboot, but also after that.. on every third or every fourth reboot of the > rhel5 hvm guest. that makes reproducing and testing easier, no need to > reboot the physical server. Thank you. Can you please also upload a HVM guest boot log (using the same guest kernel) from when there is no crash? I'd like to compare the logs. Thank you. Also, (from comment #40) > > can you perhaps reproduce the problem in a bare-metal RHEL-5 installation? > Like booting the machine and using the ixgbevf interface without any > hypervisor or guests? Thus in total I'd like to look at three boot logs, each generated by the same kernel and with "ignore_loglevel": - without Xen, on the bare metal (whether it crashes or not), - in an HVM guest, crashing (already uploaded in comment 49), - in HVM guest, not crashing. Thanks! Created attachment 663204 [details] rhel5 hvm guest kernel log when it doesnt crash with 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_3 For the "no crash" log I did this in the HVM guest: ifconfig eth1 0.0.0.0 up ifconfig eth2 0.0.0.0 up ifconfig eth1 down ifconfig eth2 down poweroff Created attachment 663206 [details] rhel5 baremetal dmesg for 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_3 I disabled blacklisting of ixgbevf driver in dom0 and rebooted it to native/baremetal kernel. .. and baremetal didn't crash. Comparing the hvm-no-crash & hvm-crash logs (comment 55 and comment 49, resp.): No crash: e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection ixgbevf 0000:00:06.0: ixgbevf_probe() registered with name=eth1 ixgbevf 0000:00:07.0: ixgbevf_probe() registered with name=eth2 netfront: device eth3 has copying receive path. ixgbevf 0000:00:06.0: ixgbevf_open() entry ixgbevf 0000:00:06.0: ixgbevf_request_msix_irqs() requesting irq=201 for vector=0 with name=eth1-rx-0 ixgbevf 0000:00:06.0: ixgbevf_request_msix_irqs() requesting irq=209 for vector=1 with name=eth1-tx-0 ixgbevf 0000:00:06.0: ixgbevf_open() exit (OK) 8021q: adding VLAN 0 to HW filter on device eth1 [no crash] Crash: ixgbevf 0000:00:06.0: ixgbevf_probe() registered with name=eth0 ixgbevf 0000:00:07.0: ixgbevf_probe() registered with name=eth1 e1000: eth2: e1000_probe: Intel(R) PRO/1000 Network Connection netfront: device eth2 has copying receive path. ixgbevf 0000:00:06.0: ixgbevf_open() entry ixgbevf 0000:00:06.0: ixgbevf_request_msix_irqs() requesting irq=193 for vector=0 with name=__tmp703594194-rx-0 ixgbevf 0000:00:06.0: ixgbevf_request_msix_irqs() requesting irq=201 for vector=1 with name=__tmp703594194-tx-0 ixgbevf 0000:00:06.0: ixgbevf_open() exit (OK) 8021q: adding VLAN 0 to HW filter on device __tmp703594194 [...] [crash] The crash seems to depend on *when* the e1000 card is discovered. No crash: - e1000 is discovered first, gets eth0 - the VFs get eth1 and eth2, respectively, - netfront gets eth3 Crash: - The VFs are discovered first, they are named eth0 and eth1, respectively, - e1000 is then discovered, gets eth2 - netfront gets eth2 AGAIN (?!) - udev renames VF1 (eth0) to "__tmp703594194", sometime between ixgbevf_probe() and ixgbevf_open() After VLAN 0 is added, there are a bunch of ixgbevf_set_rx_mode() calls (no structural difference in the logs), which "is called whenever the multicast address list or the network interface flags are updated". This happens presumably inside "ifconfig" (that's the process the crash dump reports), and then we crash in the "VFs discovered first" case. This looks crazy. Regarding the crash itself: - The offending address we try to access is ffff810078626d5a. - The offending code is at ixgbevf_maybe_stop_tx+0xc: (gdb) list *(ixgbevf_maybe_stop_tx+0xc) 0x213d is in ixgbevf_maybe_stop_tx (drivers/net/ixgbevf/ixgbevf_main.c:3725). 3720 3721 /* not instrumented -- possibly called per-packet */ 3722 static int ixgbevf_maybe_stop_tx(struct net_device *netdev, 3723 struct ixgbevf_ring *tx_ring, int size) 3724 { 3725 if (likely(IXGBE_DESC_UNUSED(tx_ring) >= size)) 3726 return 0; 3727 return __ixgbevf_maybe_stop_tx(netdev, tx_ring, size); 3728 } 3729 I can see the value ffff810078626d3a in RSI and R15; the offending address is only 0x20 bytes above it. #define IXGBE_DESC_UNUSED(R) \ ((((R)->next_to_clean > (R)->next_to_use) ? 0 : (R)->count) + \ (R)->next_to_clean - (R)->next_to_use - 1) struct ixgbevf_ring { 0x00 struct ixgbevf_adapter *adapter; /* backlink */ 0x08 void *desc; /* descriptor ring memory */ 0x10 dma_addr_t dma; /* phys. address of descriptor ring */ 0x18 unsigned int size; /* length in bytes */ 0x1C unsigned int count; /* amount of descriptors */ 0x20 unsigned int next_to_use; The code tries to access tx_ring->next_to_use for comparison. Where does tx_ring come from? (gdb) list *(ixgbevf_xmit_frame+0xe8) 0x690e is in ixgbevf_xmit_frame (drivers/net/ixgbevf/ixgbevf_main.c:3761). 3756 3757 count += TXD_USE_COUNT(skb_headlen(skb)); 3758 for (f = 0; f < skb_shinfo(skb)->nr_frags; f++) 3759 count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size); 3760 3761 if (ixgbevf_maybe_stop_tx(netdev, tx_ring, count)) { 3762 adapter->tx_busy++; 3763 return NETDEV_TX_BUSY; 3764 } 3765 We're lucky. At this call site, tx_ring points to tx_ring = &adapter->tx_ring[r_idx]; with r_idx being constant 0. Now, checking the debug log again: ixgbevf 0000:00:06.0: ixgbevf_alloc_queues() tx_ring[0] @ ffff81003fed20c0 ixgbevf 0000:00:07.0: ixgbevf_alloc_queues() tx_ring[0] @ ffff81003fed2140 offending address: ffff810078626d5a It's clear that the offending address has *nothing* to do with the previously allocated tx_ring object. Therefore either the "adapter->tx_ring" field (which is a pointer) got corrupted, containing the RSI/R15 value (r_idx == 0), or the "adapter" variable points to garbage. The latter would only be possible if the network subsystem called ixgbevf_xmit_frame() on a corrupted netdev, or a different type of NIC. Something is seriously messed up. Udev is there exactly to mask the random order of NIC discovery. e1000 and netfront should not both claim "eth2". The tx_ring address should be correct. ( For simplicity, the renaming happens via - /etc/udev/rules.d/60-net.rules: ACTION=="add", SUBSYSTEM=="net", IMPORT{program}="/lib/udev/rename_device" SUBSYSTEM=="net", RUN+="/etc/sysconfig/network-scripts/net.hotplug" - "/lib/udev/rename_device" is a binary in the initscripts package; I could finally find the __tmp%d format string in its source. It is the program that scans /etc/sysconfig/network-scripts/ for HWADDR & co. ) I'm not sure if we can solve this problem in RHEL-5. RHEL-6 has a way to bind VFs to interface names in the guest persistently (see below), but RHEL-5 lacks any. The RHEL-5 HVM guest crash (ixgbevf corruption) is related to the renaming (influenced by NIC discovery order and MACs) that occurs between ixgbevf_probe() and ixgbevf_open() (= ifconfig). (The bare metal case (no crash) is consistent with this: the ixgbevf_sw_init() debug messages there differ from both HVM guest runs (crash & no crash). In the bare metal case the VFs got new random addresses and were discovered as last interfaces (eth4 to eth7); they didn't clash with anything known to udev previously and were not renamed.) RHEL-6 has seen changes in this area: - The RHEL-6 initscripts package does not try to rename interfaces itself, it just tries to figure out an interface name and suggest it to "udev". The udev program does the renaming. The __tmp%d pattern is gone, it's not there either in "rename_device.c" (initscripts) or udev. The udev rule was adapted, see <http://git.fedorahosted.org/cgit/initscripts.git/commit/?id=3e634c64>. - The RHEL-6 kernel has received a fix for bug 611418 (its duplicate bug 614786 is not public, related RHEL-6 kernel commit: 645373fb). The upstream counterpart is <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=c1f79426>. Basically, when a VF gets a random MAC, the NET_ADDR_RANDOM flag is set in the device's "addr_assign_type" word. This word is then exported for userspace via sysfs. The RHEL-5 kernel doesn't have this feature. - RHEL-6 udev (147+patches) knows to check for the "addr_assign_type" sysfs file, and for VFs it saves a persistent device name based on a different key: it doesn't use the MAC (which is unstable) but a "device path id" (for example, a PCI path). See bug 596464 and "udev-147-rule_gen.patch" in the udev SRPM. RHEL-5 udev is not so smart (095+patches). Unless someone has a better idea, I'm proposing CANTFIX for this BZ. Regarding workarounds: as far as I understand, Harald recommended one at the end of comment 4. Also, based on bug 520453 comment 0, I think one could manually add an udev rule like this (in the RHEL-5 guest): SUBSYSTEM=="net", ACTION=="add", KERNEL=="eth*", ID=="$PCIBUSID", NAME="ethX" This should establish a stable mapping from PCI BDF to ethX name, no matter the HWADDR. (I guess the corresponding ifcfg script should not contain any HWADDR entry either.) OTOH udev might still have to rename the interface (just not to __tmp%d), which may or may not have an effect on the crash. (I'd just like to confirm something with disassembly) (In reply to comment #59) > Regarding the crash itself: > > - The offending address we try to access is ffff810078626d5a. > > - The offending code is at ixgbevf_maybe_stop_tx+0xc: > > (gdb) list *(ixgbevf_maybe_stop_tx+0xc) > 0x213d is in ixgbevf_maybe_stop_tx > (drivers/net/ixgbevf/ixgbevf_main.c:3725). > 3720 > 3721 /* not instrumented -- possibly called per-packet */ > 3722 static int ixgbevf_maybe_stop_tx(struct net_device *netdev, > 3723 struct ixgbevf_ring *tx_ring, int size) > 3724 { > 3725 if (likely(IXGBE_DESC_UNUSED(tx_ring) >= size)) > 3726 return 0; > 3727 return __ixgbevf_maybe_stop_tx(netdev, tx_ring, size); > 3728 } > 3729 > > I can see the value ffff810078626d3a in RSI and R15; the offending address > is only 0x20 bytes above it. > > #define IXGBE_DESC_UNUSED(R) \ > ((((R)->next_to_clean > (R)->next_to_use) ? 0 : (R)->count) + \ > (R)->next_to_clean - (R)->next_to_use - 1) > > struct ixgbevf_ring { > 0x00 struct ixgbevf_adapter *adapter; /* backlink */ > 0x08 void *desc; /* descriptor ring memory */ > 0x10 dma_addr_t dma; /* phys. address of descriptor ring */ > 0x18 unsigned int size; /* length in bytes */ > 0x1C unsigned int count; /* amount of descriptors */ > 0x20 unsigned int next_to_use; > > The code tries to access tx_ring->next_to_use for comparison. $ objdump -d ixgbevf.ko 0000000000002131 <ixgbevf_maybe_stop_tx>: 2131: 41 54 push %r12 2133: 41 89 d4 mov %edx,%r12d 2136: 31 d2 xor %edx,%edx 2138: 55 push %rbp 2139: 48 89 fd mov %rdi,%rbp 213c: 53 push %rbx 213d: 8b 46 20 mov 0x20(%rsi),%eax 2140: 48 89 f3 mov %rsi,%rbx 2143: 39 46 24 cmp %eax,0x24(%rsi) 0x2131 + 0xc == 0x213d; plus field "next_to_clean" is at offset 0x24 in the struct. Pasi, would you care to test how a RHEL-5 HVM guest with passthru VFs (+ the debug kernel) works under RHEL-6 (or RHEL-5) kvm? I don't know whether such a test could give us any hints, but currently I have no other idea. Drew reminded me that we had had this test in mind originally. Thanks Laszlo Setting udev_log="debug" in the RHEL-5 HVM guest's "/etc/udev/udev.conf" might provide more info, if (hopefully) udev's debug messages interleaved on the serial console with the ixgbevf debug messages. ("ignore_loglevel console=tty console=ttyS0,115200" would make the serial line the main console in the guest (last console=... spec), redirecting all of the boot log there) Created attachment 672173 [details] rhel5 hvm guest kernel crash with 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_3 with udev info Happy New Year to everyone! I just added a rhel5 hvm guest crash log with udev_log="info". Hopefully that's enough/helpful.. (I tried with udev_log="debug" aswell, but it produces *so* much output that it took forever to get it transferred from the guest to dom0 over the serial console.. so booting the guest takes *long*. I can try to capture it again if needed..) Created attachment 672220 [details] rhel5 hvm guest kernel crash with 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_3 with udev debug Ok, now there's a log with udev_log="debug" aswell. It's 788 kB and over 14000 lines long.. now the interesting thing is with udev_log="debug" it seems udevd crashes with segfault 38 times.. grep for segfault and you'll see.. these segfauls don't happen with udev_log="info". I wonder if it's related to the "slowness" of the serial console and the huge amount of output.. There's the guest kernel crash in the end of the log. Happy New Year! These logfiles seem helpful, especially the info one. I have another theory now (well they're cheap, aren't they). I had actually been looking at one upstream kernel commit from 2007 before (RHEL-5 lacks it, RHEL-6 has it), and now my suspicion got stronger. (BTW the exact crashes are different between the info log and the debug log, but both are consistent with the theory.) In a nutshell I suspect ixgbevf functions being called on e1000 private data due to a race condition in interface renaming. Quoting comment 59: > ixgbevf 0000:00:06.0: ixgbevf_alloc_queues() tx_ring[0] @ ffff81003fed20c0 > ixgbevf 0000:00:07.0: ixgbevf_alloc_queues() tx_ring[0] @ ffff81003fed2140 > offending address: ffff810078626d5a > > It's clear that the offending address has *nothing* to do with the > previously allocated tx_ring object. Therefore either the > "adapter->tx_ring" field (which is a pointer) got corrupted, containing > the RSI/R15 value (r_idx == 0), or the "adapter" variable points to > garbage. The latter would only be possible if the network subsystem called ixgbevf_xmit_frame() on a corrupted netdev, or a different type of NIC. Elaborating on the theory: udev seems to work like a server program with an event loop. When an event fires, it runs helpers and utilities to handle the event. Several events (and their handlers) can be in progress simultaneously; this is confirmed by udevd[390]: main: initialize max_childs_running to 16 From the "linear" log in comment 66 it's not easy to see what log entry belongs to what event: events are interleaved. However udev is diligent to log all PIDs, which allows us to de-interleave. Consider the following two (de-interleaved) udev events. - PID [390] is the main udev daemon, - PID [973] is the ixgbevf helper/handler (first VF, at 0000:00:06.0, initial kernel name: eth0), - PID [984] is the e1000 helper/handler (initial kernel name: eth2). udevd[390]: udev_event_run: seq 520 forked, pid [973], 'add' 'net', 1 seconds old udevd-event[973]: wait_for_sysfs: file '/sys/class/net/eth0/address' appeared after 0 loops udevd-event[973]: run_program: '/lib/udev/rename_device' udevd-event[973]: run_program: Waiting 1 seconds for output of '/lib/udev/rename_device(981)' udevd-event[973]: run_program: Waiting 2 seconds for output of '/lib/udev/rename_device(981)' udevd-event[973]: run_program: '/lib/udev/rename_device' returned with status 0 udevd-event[973]: udev_rules_get_name: no node name set, will use kernel name 'eth0' udevd-event[973]: pass_env_to_socket: passed -1 bytes to socket '/org/kernel/udev/monitor', udevd-event[973]: run_program: '/lib/udev/udev_run_hotplugd' udevd-event[973]: run_program: '/lib/udev/udev_run_hotplugd' returned with status 0 udevd-event[973]: run_program: '/lib/udev/udev_run_devd' udevd-event[973]: run_program: '/lib/udev/udev_run_devd' returned with status 0 udevd-event[973]: run_program: '/etc/sysconfig/network-scripts/net.hotplug' udevd-event[973]: run_program: '/etc/sysconfig/network-scripts/net.hotplug' returned with status 0 udevd-event[973]: pass_env_to_socket: passed -1 bytes to socket '/org/freedesktop/hal/udev_event', udevd-event[973]: udev_event_run: seq 520 finished udevd[390]: udev_done: seq 520, pid [973] exit with 0, 17 seconds old udevd[390]: udev_event_run: seq 523 forked, pid [984], 'add' 'net', 0 seconds old udevd-event[984]: wait_for_sysfs: file '/sys/class/net/eth2/address' appeared after 0 loops udevd-event[984]: run_program: '/lib/udev/rename_device' udevd-event[984]: run_program: '/lib/udev/rename_device' (stdout) 'INTERFACE=eth0' udevd-event[984]: run_program: '/lib/udev/rename_device' (stdout) 'DEVPATH=/class/net/eth0' udevd-event[984]: run_program: '/lib/udev/rename_device' returned with status 0 udevd-event[984]: udev_rules_get_name: no node name set, will use kernel name 'eth2' udevd-event[984]: pass_env_to_socket: passed -1 bytes to socket '/org/kernel/udev/monitor', udevd-event[984]: run_program: '/lib/udev/udev_run_hotplugd' udevd-event[984]: run_program: '/lib/udev/udev_run_hotplugd' returned with status 0 udevd-event[984]: run_program: '/lib/udev/udev_run_devd' udevd-event[984]: run_program: '/lib/udev/udev_run_devd' returned with status 0 udevd-event[984]: run_program: '/etc/sysconfig/network-scripts/net.hotplug' udevd-event[984]: run_program: '/etc/sysconfig/network-scripts/net.hotplug' returned with status 0 udevd-event[984]: pass_env_to_socket: passed -1 bytes to socket '/org/freedesktop/hal/udev_event', udevd-event[984]: udev_event_run: seq 523 finished udevd[390]: udev_done: seq 523, pid [984] exit with 0, 15 seconds old (Again from comment 59: > Crash: > - The VFs are discovered first, they are named eth0 and eth1, > respectively, > - e1000 is then discovered, gets eth2 ) The "/lib/udev/rename_device" helper informs udev about having renamed eth2 (the e1000 NIC) to eth0, supplanting the "original" eth0 (ixgbevf). Now look at these two udev events in parallel, in the "linear" log from comment 66: $ egrep 'eth[02]|\[973\]|\[984\]|Bringing up interface|__tmp' \ crash-log-2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_3-with-udev-info.txt Probing by the respective kernel drivers: ixgbevf 0000:00:06.0: ixgbevf_probe() registered with name=eth0 e1000: eth2: e1000_probe: Intel(R) PRO/1000 Network Connection Probing done, udev runs ([973] is ixgbevf, [984] is e1000): udevd[390]: udev_event_run: seq 520 forked, pid [973], 'add' 'net', 1 seconds old udevd-event[973]: wait_for_sysfs: file '/sys/class/net/eth0/address' appeared after 0 loops udevd-event[973]: run_program: '/lib/udev/rename_device' udevd[390]: udev_event_run: seq 523 forked, pid [984], 'add' 'net', 0 seconds old udevd-event[984]: wait_for_sysfs: file '/sys/class/net/eth2/address' appeared after 0 loops udevd-event[973]: run_program: Waiting 1 seconds for output of '/lib/udev/rename_device(981)' udevd-event[973]: run_program: Waiting 2 seconds for output of '/lib/udev/rename_device(981)' udevd-event[984]: run_program: '/lib/udev/rename_device' udevd-event[973]: run_program: '/lib/udev/rename_device' returned with status 0 udevd-event[973]: udev_rules_get_name: no node name set, will use kernel name 'eth0' udevd-event[973]: pass_env_to_socket: passed -1 bytes to socket '/org/kernel/udev/monitor', udevd-event[984]: run_program: '/lib/udev/rename_device' (stdout) 'INTERFACE=eth0' udevd-event[973]: run_program: '/lib/udev/udev_run_hotplugd' udevd-event[973]: run_program: '/lib/udev/udev_run_hotplugd' returned with status 0 udevd-event[973]: run_program: '/lib/udev/udev_run_devd' udevd-event[973]: run_program: '/lib/udev/udev_run_devd' returned with status 0 udevd-event[984]: run_program: '/lib/udev/rename_device' (stdout) 'DEVPATH=/class/net/eth0' udevd-event[984]: run_program: '/lib/udev/rename_device' returned with status 0 udevd-event[984]: udev_rules_get_name: no node name set, will use kernel name 'eth2' udevd-event[984]: pass_env_to_socket: passed -1 bytes to socket '/org/kernel/udev/monitor', udevd-event[973]: run_program: '/etc/sysconfig/network-scripts/net.hotplug' udevd-event[973]: run_program: '/etc/sysconfig/network-scripts/net.hotplug' returned with status 0 udevd-event[973]: pass_env_to_socket: passed -1 bytes to socket '/org/freedesktop/hal/udev_event', udevd-event[984]: run_program: '/lib/udev/udev_run_hotplugd' udevd-event[984]: run_program: '/lib/udev/udev_run_hotplugd' returned with status 0 udevd-event[984]: run_program: '/lib/udev/udev_run_devd' udevd-event[984]: run_program: '/lib/udev/udev_run_devd' returned with status 0 udevd-event[973]: udev_event_run: seq 520 finished udevd-event[984]: run_program: '/etc/sysconfig/network-scripts/net.hotplug' udevd-event[984]: run_program: '/etc/sysconfig/network-scripts/net.hotplug' returned with status 0 udevd-event[984]: pass_env_to_socket: passed -1 bytes to socket '/org/freedesktop/hal/udev_event', udevd[390]: udev_done: seq 520, pid [973] exit with 0, 17 seconds old udevd-event[984]: udev_event_run: seq 523 finished udevd[390]: udev_done: seq 523, pid [984] exit with 0, 15 seconds old The two handlers are interleaved. Bringing up interface eth0: Determining IP information for eth0...ixgbevf 0000:00:06.0: ixgbevf_get_stats() entry / exit e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None 8021q: adding VLAN 0 to HW filter on device eth0 "eth0" already denotes the e1000 card here. rhel58x64hvm01.localdomain login: eth0: no IPv6 routers present ixgbevf 0000:00:06.0: ixgbevf_request_msix_irqs() requesting irq=193 for vector=0 with name=__tmp73526586-rx-0 ixgbevf 0000:00:06.0: ixgbevf_request_msix_irqs() requesting irq=201 for vector=1 with name=__tmp73526586-tx-0 8021q: adding VLAN 0 to HW filter on device __tmp73526586 The first ixgbevf has been renamed to "__tmp73526586". The crash comes from (gdb) list *(ixgbevf_xmit_frame+0x4b1) 0x6cd7 is in ixgbevf_xmit_frame (drivers/net/ixgbevf/ixgbevf_main.c:3549). 3544 len = min(skb_headlen(skb), total); 3545 while (len) { 3546 tx_buffer_info = &tx_ring->tx_buffer_info[i]; 3547 size = min(len, (unsigned int)IXGBE_MAX_DATA_PER_TXD); 3548 3549 tx_buffer_info->length = size; 3550 tx_buffer_info->mapped_as_page = false; 3551 tx_buffer_info->dma = pci_map_single(adapter->pdev, 3552 skb->data + offset, 3553 size, PCI_DMA_TODEVICE); However the "tx_ring->tx_buffer_info" array has been allocated in ixgbevf_setup_tx_resources() with vzalloc(), whose return value is checked. ixgbevf 0000:00:06.0: ixgbevf_setup_tx_resources() entry tx_ring=ffff810001910640 ixgbevf 0000:00:06.0: ixgbevf_setup_tx_resources() exit (OK) This appears consistent with ixgbevf_xmit_frame() being invoked on a bogus net_device structure. The upstream commit I have in mind is <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=7f988eab>. Again, RHEL-6 has it, RHEL-5 lacks it, and the patch modifies dev_change_name(). (In upstream the relevant section of dev_change_name() was further modified by 72c9528b net: Introduce dev_get_by_name_rcu() which introduced a problem, and then fixed up by 372b2312 net: use hlist_del_rcu() in dev_change_name() But I think those (ie. the dev_get_by_name_rcu() feature) are not important for us. Another commit related to synchronization is c91f6df2 sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name however that too added a new feature.) RHEL-5 call tree: do_rename() [initscripts-8.45.42/src/rename_device.c] ioctl(SIOCSIFNAME) dev_ifsioc() [kernel/net/core/dev.c] dev_change_name() hlist_del() dev_name_hash() hlist_add_head() In RHEL-5, dev_change_names() updates the "dev_name_head" hash without properly locking it with "dev_base_lock" against pure readers. Unfortunately I can't identify the exact concurrent reader (mentioned as the caller of __dev_get_by_name() in the 7f988eab), but I think it could be related to something that "ifconfig" does. I think we should try a test kernel with 7f988eab backported. (FWIW, __dev_get_by_name() seems to be called from a lot of places (ie. "pure readers") without taking "dev_base_lock". That is, the locking protocol could be violated from both sides!) (In reply to comment #70) > > I think we should try a test kernel with 7f988eab backported. > > Good analysis! Yeah, I can try that kernel. Created attachment 674634 [details]
[1/2] instrument dev_change_name()
<this line added to keep git-send-bugzilla happy>
---
net/core/dev.c | 24 ++++++++++++++++++++----
1 files changed, 20 insertions(+), 4 deletions(-)
Created attachment 674635 [details]
[2/2] [NET]: Take dev_base_lock when moving device name hash list entry
When we added name-based hashing the dev_base_lock was designated as the
lock to take when changing the name hash list. Unfortunately, because
it was a preexisting lock that just happened to be taken in the right
spots we neglected to take it in dev_change_name.
The race can affect calles of __dev_get_by_name that do so without taking
the RTNL. They may end up walking down the wrong hash chain and end up
missing the device that they're looking for.
RHEL-5 backport of upstream Linux commit 7f988eab.
---
net/core/dev.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
Created attachment 674871 [details] rhel5 hvm guest kernel crash with 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_4 Unfortunately the latest patches didn't help :( serial log from the crashing guest attached. Thanks! Something to check later: re-enabling the udev info logging and correlating those messages with the new dev_change_name() debug log. Namely, dev_change_name: 'eth2' -> 'eth0' enter dev_change_name: exit (EEXIST) This is the first attempt to rename the e1000 NIC (originally namede eth2) to eth0, based on the ifcfg-eth0 script. However 0000:00:06.0 (ixgbevf#0) occupies that name, so dev_change_name: 'eth0' -> '__tmp1686290279' enter dev_change_name: exit (class_device_rename=0) that's first "moved out of the way", dev_change_name: 'eth2' -> 'eth0' enter dev_change_name: exit (class_device_rename=0) and then we rename the e1000 card successfully (eth2 is free again). dev_change_name: 'eth2' -> 'eth0' enter dev_change_name: exit (EEXIST) Here udev tries to rename the netfront NIC (which has come up under the now-free eth2 name) to eth0 (because netfront has the same MAC as e1000). dev_change_name: 'eth0' -> 'eth0' enter dev_change_name: exit (EEXIST) A curious attempt. Maybe trying to move eth0 (e1000 currently) out of the way, but the new name for e1000 can't be anything else, because there's a MAC match in ifcfg-eth0? Created attachment 674973 [details] rhel5 hvm guest kernel crash with 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_4 with udev _log info (In reply to comment #79) > > Something to check later: re-enabling the udev info logging and correlating > those messages with the new dev_change_name() debug log. > Serial console log with udev_log="info" added. This time dom0 was stock rhel 5.9, but that didn't seem to help this issue. (VF PCI passthrough to HVM guest now works OK with the stock rhel 5.9 Xen). > > A curious attempt. Maybe trying to move eth0 (e1000 currently) out of the > way, but the new name for e1000 can't be anything else, because there's a > MAC match in ifcfg-eth0? > Yes, there's a MAC-match in ifcfg-eth0 .. Created attachment 675360 [details]
rhel5.9 hvm guest kernel crash with 2.6.18-348.el5
Out of curiosity I just tried stock rhel 5.9 x64 in the hvm guest, and 2.6.18-348.el5 crashes aswell. serial console crash log attached. From live debugging on Pasi's machine -- the situation leading to the crash is established by infighting between kudzu and udev: - kudzu generates "alias ethX modname" lines into /etc/modprobe.conf, - the modules seem to be loaded in that order, - in case of "alias eth0 e1000" in "modprobe.conf", plus a matching MAC in "ifcfg-eth0", everything works, - if, however "modprobe.conf" aliases ixgbevf with eth0, then udev will rename the eth2 e1000 card, based on "ifcfg-eth0", to eth0, and eth0 (the first ixgbevf) to __tmpXXXXX, - bringing up __tmpXXXX with ifconfig crashes the VM. <lersek> but I think I'm finding a way to work around the crash reliably <lersek> ie prevent the renaming <lersek> Putting the following in /etc/modprobe.conf: <lersek> ---- <lersek> alias eth0 e1000 <lersek> alias eth1 ixgbevf <lersek> alias eth2 ixgbevf <lersek> ---- <lersek> *AND* <lersek> ifcfg-eth0 actually matches the MAC of the e1000 card <lersek> if this is all set <lersek> then e1000 is always probed first <lersek> no renaming happens <lersek> and all three interfaces work <lersek> testing the fifth boot with this <lersek> yeah 5th boot succeeded as well <lersek> so <lersek> let's see what happens when eth0 conflict between /etc/modprobe.conf and ifcfg-eth0 <lersek> testing <lersek> ---------- <lersek> alias eth0 ixgbevf <lersek> alias eth1 ixgbevf <lersek> alias eth2 e1000 <lersek> ---------- <pasik> ok <pasik> that would be quite a simple workaround :) <pasik> assuming kudzu won't mess it up on next reboot.. <lersek> "__tmp468605363" reestablished immediately :) If I kill the e1000 eth0 device and remove its driver module, then rename the __tmpXXXXX vf0 to its original name (eth0), then it can be brought up correctly! ifconfig eth0 down rmmod e1000 ip link set __tmp1404086329 name eth0 ifconfig eth0 up # no crash Something during ixgbevf probing remembers the eth0 name or its interface index or whatever, and it can be brought up only with that name. In the following test, eth1 is a correctly working, running ixgbevf (it's vf1). ifconfig eth1 down ip link set eth1 name ABCDEFGHIJKLMNO # same length ifconfig ABCDEFGHIJKLMNO up # CRASH I repeated the renaming test (with the same ABCDEFGHIJKLMNO name) in comment 87 on my workstation, with a tg3 card. No crash. I repeated the renaming test in comment 87 inside the VM, on the e1000 card (which, like ixgbevf, supports HW filtering of VLANs). No crash. The problem seems specific to ixgbevf. Great idea from Pasi -- it turns out not to be a match/mismatch question, it points to a buffer overflow dependent on name length. I repeated the test in comment 86, but the final name before bringup was "eth9". (I also repeated the test with "ZZZZ" -- length: 4 chars). No crash in either case. The first name / length that crashes is "x123456789" (10). (In reply to comment #88) > I repeated the renaming test (with the same ABCDEFGHIJKLMNO name) in comment > 87 on my workstation, with a tg3 card. No crash. > > I repeated the renaming test in comment 87 inside the VM, on the e1000 card > (which, like ixgbevf, supports HW filtering of VLANs). No crash. > > The problem seems specific to ixgbevf. Does this happen on the host and the guest or just the guest? (Is the host even running RHEL5?) The host is running RHEL-5.9, x86_64, xen dom0, and indeed the long ixgbevf name crashes the dom0 kernel as well. (start pinging host in another xterm) [root@dom0 ~]# ethtool -i x123456789 driver: ixgbevf version: 2.1.0-k firmware-version: N/A bus-info: 0000:03:10.2 [root@dom0 ~]# sync [root@dom0 ~]# ifconfig x123456789 up (no more echo replies & ssh connection freezes). I'll ask Pasi to reboot the host with the bare metal RHEL-5.9 kernel. It crashes the bare metal kernel as well. [root@dom0 ~]# uname -a Linux dom0.localdomain 2.6.18-348.el5 #1 SMP Wed Nov 28 21:22:00 EST 2012 x86_64 x86_64 x86_64 GNU/Linux [root@dom0 ~]# ethtool -i eth5 driver: ixgbevf version: 2.1.0-k firmware-version: N/A bus-info: 0000:03:10.2 [root@dom0 ~]# ip link set eth5 name x123456789 [root@dom0 ~]# ethtool -i x123456789 driver: ixgbevf version: 2.1.0-k firmware-version: N/A bus-info: 0000:03:10.2 [root@dom0 ~]# sync [root@dom0 ~]# ifconfig x123456789 up # crashed (Oh earlier I single-stepped ifconfig with gdb; the crash is triggered when ifconfig sets the (IFF_UP | IFF_RUNNING) flags with the SIOCSIFFLAGS ioctl on the netdev with the too long name.) What's odd is that this is way less than IFNAMSIZ (which is 16). Can I get on the system and and use serial console to debug? (In reply to comment #95) > > Can I get on the system and and use serial console to debug? > Sure. I'll fix you an access to it. I'd like to grab a vmcore from the guest, using xend in the host (this is one situation where virtualization is helpful, since kexec/kdump are not involved; the host saves the guest's memory), then search the vmcore for the interface name. Analyzing the vmcore. The offending interface name in this case was "__tmp1650898050". Important parts of the ixgbevf debug log and the crash are as follows. Ring allocation at ixgbevf_open() time: ixgbevf 0000:00:06.0: ixgbevf_setup_tx_resources() entry tx_ring=ffff81003ee0c9c0 ixgbevf 0000:00:06.0: ixgbevf_setup_rx_resources() entry rx_ring=ffff81003dc77cc0 Faulting instruction RIP, registers and stack: Pid: 1767, comm: ifconfig Not tainted 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_4 #1 RIP: 0010:[<ffffffff8846913d>] [<ffffffff8846913d>] :ixgbevf:ixgbevf_maybe_stop_tx+0xc/0x7b RDX: 0000000000000000 RSI: ffff0078626d3a30 RDI: ffff81003815a000 Stack: ffff81003815a000 ffff810033f364c0 0000000000000000 ffffffff8846d90e ffff81003815a000 ffff81003815a500 00000000804b1e68 ffffffff883dd9f4 0000000100000000 ffffffff883db3b0 ffff81003815a000 000000003f4ecc00 Call Trace: <IRQ> [<ffffffff8846d90e>] :ixgbevf:ixgbevf_xmit_frame+0xe8/0x8b1 This call stack corresponds exactly to what's described in comment 59 and comment 62 (verified with gdb and "objdump -d".) Starting the crash utility: $ crash rhel58x64hvm01.vmcore \ /usr/lib/debug/lib/modules/2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_4/vmlinux Loading the ixgbevf symbols: crash> mod -s ixgbevf \ /usr/lib/debug/lib/modules/2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_4/kernel/drivers/net/ixgbevf/ixgbevf.ko.debug The first value on the stack (ffff81003815a000) corresponds to the "struct net_device *netdev" parameter of ixgbevf_maybe_stop_tx(): crash> struct net_device ffff81003815a000 struct net_device { name = "__tmp1650898050", [...] priv = 0xffff81003815a500, and the many function pointers check out, they point to valid ixgbevf functions. "priv" points at an ixgbevf_adapter structure (function pointers check out inside again). crash> struct ixgbevf_adapter ffff81003815a500 struct ixgbevf_adapter { tx_ring = 0xffff0078626d3a30, rx_ring = 0xffff81003dc77cc0, Note that rx_ring ("ffff81003dc77cc0") matches the originally allocated area, printed by ixgbevf_setup_rx_resources() at the top. The tx_ring value ("ffff0078626d3a30") differs from the initially allocated value; it's bogus. Also, the tx_ring value ("ffff0078626d3a30") can be found in RSI, which is consistent with the crash RIP (see comment 59 and comment 62). So where could this tx_ring value come from? Let's take the five lowest bytes (the ten least significant nibbles), ie. "78626d3a30", and try to interpret them as an ASCII string. Since we have little endian representation (lowest address holds least significant byte), we have to mirror the string: 78626d3a30 --> \x30 \x3a \x6d \x62 \x78 $ printf '%b\n' '\x30\x3a\x6d\x62\x78' 0:mbx This is related to the following two lines in ixgbevf_request_msix_irqs(): sprintf(adapter->name[vector], "%s:mbx", netdev->name); err = request_irq(adapter->msix_entries[vector].vector, &ixgbevf_msix_mbx, 0, adapter->name[vector], netdev); The first statement formats the mailbox interrupt *name* for the VF's mailbox interrupt. The second line requests an interrupt with that name. Somehow the formatted vector name tramples over the "tx_ring" field. The leading "0" in "0" is the trailing "0" in "__tmp1650898050": __tmp1650898050:mbx ^^^^^ 0:mbx The fields directly preceding "tx_ring" in the "ixgbevf_adapter" structure are also garbled: struct ixgbevf_adapter { [...] q_vector = {0xffff8100392dbd80, 0xffff8100392db980}, name = {"__tmp1650898050-rx-0\000\000\000\000", "__tmp1650898050-tx-0\000\000\000\000"}, itr_setting = 829451636, eitr_low = 13622, eitr_high = 14384, tx_ring = 0xffff0078626d3a30, Notice how there are two elements in "q_vector" (MAX_MSIX_Q_VECTORS == 2), same for "name" (MAX_MSIX_COUNT == 2). The following fields are only ever set like this in the code, in file "ixgbevf_main.c": field dumped value as ASCII str value should be ---------- ------------ ------------ --------------- itr_setting 829451636, "tmp1" adapter->itr_setting = 1; eitr_low 13622, "65" adapter->eitr_low = 10; eitr_high 14384, "08" adapter->eitr_high = 20; tx_ring 0xffff0078626d3a30 "0:mbx" (as allocated) Reading the first three fields together yields "tmp16508": __tmp1650898050:mbx ^^^^^^^^ tmp16508 This happens (probably) because the subscript called "vector" is out of bounds (ie. it equals 2) in the statement sprintf(adapter->name[vector], "%s:mbx", netdev->name); in ixgbevf_request_msix_irqs(). "vector" is out of bounds due to the preceding loop: /* Decrement for Other and TCP Timer vectors */ q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS; /* ... */ for (vector = 0; vector < q_vectors; vector++) { /* ... */ } /* ... */ sprintf(adapter->name[vector], "%s:mbx", netdev->name); err = request_irq(adapter->msix_entries[vector].vector, &ixgbevf_msix_mbx, 0, adapter->name[vector], netdev); My debug patch logs the value of q_vectors, right between the assignment to it and the loop: ixgbevf 0000:00:06.0: ixgbevf_request_msix_irqs() entry num_msix_vectors=3 q_vectors=2 Further, from "drivers/net/ixgbevf/ixgbevf.h", #define OTHER_VECTOR 1 #define NON_Q_VECTORS (OTHER_VECTOR) hence "adapter->num_msix_vectors" is 3. I'll check out upstream and the number-of-vectors logic. ixgbevf_probe() ixgbevf_init_interrupt_scheme() ixgbevf_set_interrupt_capability() ixgbevf_acquire_msix_vectors() The last function in the call chain is the only place setting "adapter->num_msix_vectors"; my debug patch logs the following ("ixgbevf 0000:00:06.0" prefix stripped): ixgbevf_set_interrupt_capability() entry ixgbevf_set_interrupt_capability() num_rx_queues=1 num_tx_queues=1 v_budget=3 ixgbevf_set_interrupt_capability() v_budget alloc OK ixgbevf_acquire_msix_vectors() entry: vectors=3 ixgbevf_acquire_msix_vectors() acquiring vectors=3 ixgbevf_acquire_msix_vectors() success ixgbevf_acquire_msix_vectors(): allocated 3 MSI-X interrupts ixgbevf_acquire_msix_vectors() exit ixgbevf_set_interrupt_capability() exit (0) This logic seems sane, especially the comment & code in ixgbevf_set_interrupt_capability() that determines the "budget": /* * It's easy to be greedy for MSI-X vectors, but it really * doesn't do us much good if we have a lot more vectors * than CPU's. So let's be conservative and only ask for * (roughly) twice the number of vectors as there are CPU's. */ v_budget = min(adapter->num_rx_queues + adapter->num_tx_queues, (int)(num_online_cpus() * 2)) + NON_Q_VECTORS; So, we have - one tx queue with a dedicated MSI-X interrupt, - one rx queue -||-, - and a non-queue vector (the mailbox vector) with the third MSI-X interrupt. I think the logic is sane. What's wrong is the name formatting / name storage for the mbx vector. Note that even with shorter (ie. non-crashing) interface names, we *can* overwrite part of itr_setting / eitr_low / eitr_high in the ixgbevf_adapter structure. However as far as I can see (it's really not my cup of tea), those are mostly performance related settings, so it's not easy to notice when they are garbled. However if the iface name is long enough, we overwrite a pointer. This is the problem [drivers/net/ixgbevf/ixgbevf.h]: #define MAX_MSIX_Q_VECTORS 2 #define MAX_MSIX_COUNT 2 #define MIN_MSIX_Q_VECTORS 2 #define MIN_MSIX_COUNT (MIN_MSIX_Q_VECTORS + NON_Q_VECTORS) struct ixgbevf_adapter { char name[MAX_MSIX_COUNT][IFNAMSIZ + 9]; "MAX_MSIX_COUNT" should be defined similarly to "MIN_MSIX_COUNT", that is, #define MAX_MSIX_COUNT 2 (MIN_MSIX_Q_VECTORS + NON_Q_VECTORS) Grepping the "kernel/drivers/net/ixgbevf" directory for MAX_MSIX_COUNT, the only hits are ixgbevf.h:181:#define MAX_MSIX_COUNT 2 ixgbevf.h:195: char name[MAX_MSIX_COUNT][IFNAMSIZ + 9]; So this should be safe. Try again?: (from previous comment) #define MAX_MSIX_COUNT 2 (MIN_MSIX_Q_VECTORS + NON_Q_VECTORS) meaning, you have a typo & you wanted to delete the '2' above ? Upstream fixed the error "inadvertently", in <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=fa71ae27>. We'll have to fix this in a RHEL-5 specific way; I should write a patch for comment 101. Regarding RHEL-6, Andy already backported upstream commit fa71ae27 as commit 24a9e7c2be6700312f18ac9952abdddb1964e3dc Author: Andy Gospodarek <gospo> Date: Wed Oct 3 00:44:40 2012 -0400 [netdrv] ixgbevf: Move Tx clean-up into NAPI context (bug 819105). (In reply to comment #102) > Try again?: (from previous comment) > #define MAX_MSIX_COUNT 2 (MIN_MSIX_Q_VECTORS + NON_Q_VECTORS) > > meaning, you have a typo & you wanted to delete the '2' above ? Ah sorry, I was fighting some history back&forth in Firefox while typing that comment, and I either forgot to update macro expansion after the copy-pasta, or Firefox tricked me. So, what I'm suggesting is: #define MAX_MSIX_COUNT (MAX_MSIX_Q_VECTORS + NON_Q_VECTORS) Created attachment 675802 [details]
ixgbevf: allocate room for the mailbox MSI-X interrupt's name (proposed patch)
Without this change, the
sprintf(adapter->name[vector], "%s:mbx", netdev->name);
statement in ixgbevf_request_msix_irqs() tramples over
- adapter->itr_setting,
- adapter->eitr_low,
- adapter->eitr_high,
- adapter->tx_ring,
dependent on the net_device (interface) name length. For short names the
problem is (apparently) not catastrophic, because the first three fields
above seem to influence performance only. The fourth field ("tx_ring") is
a pointer however, and a long enough iface name (in our testing: at least
10 non-NUL chars) will cause a crash.
Upstream fixed the problem on the side with commit fa71ae27, hence the
patch is RHEL-5 specific.
(Andy backported fa71ae27 to RHEL-6 (commit 24a9e7c2).)
Signed-off-by: Laszlo Ersek <lersek>
drivers/net/ixgbevf/ixgbevf.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Nice find, Laszlo! I was looking at this before I needed to step away for lunch and was quite sure the interrupt name was related. Mostly because I booted with pci=nomsi and found that the system failed differently *and* that it was only after the name was longer than 10 characters that it became a problem. So it doesn't look like I can bring up an ixgbevf-based interface when booting with pci=nomsi on the kernel command-line: # ifconfig eth2 up Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: [<ffffffff8825dbfa>] :ixgbevf:ixgbevf_open+0x34e/0x48b PGD 3e0d6067 PUD 3c8c0067 PMD 0 Oops: 0000 [1] SMP last sysfs file: /block/ram0/dev CPU 0 Modules linked in: lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy xen_vnif xen_vbd joydev xen_balloon ixgbevf xen_platform_pci ide_cd cdrom i2c_piix4 i2c_core e1000 serio_raw pcspkr tpm_tis tpm tpm_bios 8021q dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd Pid: 1987, comm: ifconfig Not tainted 2.6.18-348.el5 #1 RIP: 0010:[<ffffffff8825dbfa>] [<ffffffff8825dbfa>] :ixgbevf:ixgbevf_open+0x34e/0x48b RSP: 0018:ffff81003e899d78 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff81003db985b0 RCX: ffff81003db985b0 RDX: 0000000000000000 RSI: ffffffff8825d7c4 RDI: ffff81003db985b8 RBP: ffff81003db98500 R08: ffff81003db98000 R09: 0000000000000032 R10: 00000000000104c2 R11: ffffffff8825a4c8 R12: 0000000000000000 R13: ffff81003db98000 R14: 0000000000000000 R15: 0000000000000000 FS: 00002b25b77d25e0(0000) GS:ffffffff80435000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000003c821000 CR4: 00000000000006a0 Process ifconfig (pid: 1987, threadinfo ffff81003e898000, task ffff81003dcd5040) Stack: ffffffff00000000 ffff81003db98000 0000000000001043 0000000000001002 0000000000000000 ffff81003db98000 0000000000000000 ffffffff8023c8fa ffff81003db98000 ffffffff8023aa07 ffff81003e899e18 00000000ffffffed Call Trace: [<ffffffff8023c8fa>] dev_open+0x58/0x9c [<ffffffff8023aa07>] dev_change_flags+0x5a/0x119 [<ffffffff8026f5f0>] devinet_ioctl+0x235/0x59c [<ffffffff80231866>] sock_ioctl+0x1c1/0x1e5 [<ffffffff8004247e>] do_ioctl+0x21/0x6b [<ffffffff800304da>] vfs_ioctl+0x457/0x4b9 [<ffffffff800baee0>] audit_syscall_entry+0x1a8/0x1d3 [<ffffffff8004c882>] sys_ioctl+0x59/0x78 [<ffffffff8005d29e>] tracesys+0xd5/0xdf Code: 0f b7 38 e8 e9 12 e6 f7 85 c0 41 89 c4 74 52 41 8d 5e ff eb RIP [<ffffffff8825dbfa>] :ixgbevf:ixgbevf_open+0x34e/0x48b RSP <ffff81003e899d78> CR2: 0000000000000000 <0>Kernel panic - not syncing: Fatal exception We should fix this while we are in there fixing the original problem. Regarding comment 107 (crash with pci=nomsi), we've run into that in the RHEL-6 guest sibling bug before (as part of debugging): see bug 849223 comment 26. We discussed this on IRC with Andy, and (if I understand correctly) he's OK with (possibly) filing a separate BZ for the pci=nomsi problem (a BZ that's unrelated to Xen, or virtualization in general). So I'm going to post this patch now to RHKL. Thanks everyone for the help! (In reply to comment #108) > Regarding comment 107 (crash with pci=nomsi), we've run into that in the > RHEL-6 guest sibling bug before (as part of debugging): see bug 849223 > comment 26. > > We discussed this on IRC with Andy, and (if I understand correctly) he's OK > with (possibly) filing a separate BZ for the pci=nomsi problem (a BZ that's > unrelated to Xen, or virtualization in general). > > So I'm going to post this patch now to RHKL. Thanks everyone for the help! Per SRIOV spec, VFs can *only* use/generate MSI or MSI-X interrupts. Thus, booting a host or guest with pci=nomsi results in no VF support. We recently went through an interesting 'exercise' when a vendor's VF generated an unexpected signal interrupt instead of MSI(-X) interrupt; not a pleasant outcome! ;-) A backport of the following upstream patch will resolve the problem in comment #107. commit e45dd5fe654b2f8e52ea9f8dacfe2b2444c87c07 Author: Jakub Kicinski <jakub.kicinski> Date: Tue Nov 13 04:03:16 2012 +0000 ixgbevf: make sure probe fails on MSI-X enable error This driver cannot work without MSI-X interrupts so there is no mechanism to fall back to. I can confirm the patch in comment #105 fixes the bug. When running kernel 2.6.18-348.el5.bz862862_ixgbevf_mbx_name I'm able to run "ifconfig __tmp503635005 0.0.0.0 up" and it works fine, no crash. # grep tmp /proc/interrupts 185: 4819 PCI-MSI-X __tmp503635005-rx-0 193: 147 PCI-MSI-X __tmp503635005-tx-0 201: 12 PCI-MSI-X __tmp503635005:mbx Thanks a lot for all the debugging and persistence to get this fixed! Pasi, thank you very much for checking the many test packages and ultimately permitting us access the box directly; it's been crucial. Laszlo Patch(es) available in kernel-2.6.18-357.el5 You can download this test kernel (or newer) from http://people.redhat.com/plougher/el5/ Detailed testing feedback is always welcomed. If you require guidance regarding testing, please ask the bug assignee. [root@ibm-x3650m4-03 ~]# modprobe -r ixgbe [root@ibm-x3650m4-03 ~]# lsmod |grep ixgbe [root@ibm-x3650m4-03 ~]# modprobe ixgbe max_vfs=2 [root@ibm-x3650m4-03 ~]# lspci |grep 82599 1b:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01) 1b:00.1 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01) 1b:10.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) 1b:10.1 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) 1b:10.2 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) 1b:10.3 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01) [root@ibm-x3650m4-03 ~]# ethtool -i eth14 driver: ixgbevf version: 2.1.0-k firmware-version: N/A bus-info: 0000:1b:10.0 [root@ibm-x3650m4-03 ~]# ifconfig eth14 up [root@ibm-x3650m4-03 ~]# ethtool eth14 Settings for eth14: Supported ports: [ ] Supported link modes: 10000baseT/Full Supports auto-negotiation: No Advertised link modes: Not reported Advertised auto-negotiation: No Speed: 10000Mb/s Duplex: Full Port: Unknown! (255) PHYAD: 0 Transceiver: Unknown! Auto-negotiation: off Current message level: 0x00000007 (7) Link detected: yes [root@ibm-x3650m4-03 ~]# ifconfig eth14 down [root@ibm-x3650m4-03 ~]# ip link set eth14 name x123456789 [root@ibm-x3650m4-03 ~]# ethtool -i x123456789 driver: ixgbevf version: 2.1.0-k firmware-version: N/A bus-info: 0000:1b:10.0 [root@ibm-x3650m4-03 ~]# ethtool x123456789 Settings for x123456789: Supported ports: [ ] Supported link modes: 10000baseT/Full Supports auto-negotiation: No Advertised link modes: Not reported Advertised auto-negotiation: No Speed: 10000Mb/s Duplex: Full Port: Unknown! (255) PHYAD: 0 Transceiver: Unknown! Auto-negotiation: off Current message level: 0x00000007 (7) Link detected: no [root@ibm-x3650m4-03 ~]# ifconfig x123456789 up [root@ibm-x3650m4-03 ~]# no panic found, set it to verified. Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHSA-2013-1348.html |