Bug 862862

Summary: long ixgbevf interface name crashes kernel
Product: Red Hat Enterprise Linux 5 Reporter: Pasi Karkkainen <pasik>
Component: kernelAssignee: Laszlo Ersek <lersek>
Status: CLOSED ERRATA QA Contact: Yulong Pei <ypei>
Severity: high Docs Contact:
Priority: urgent    
Version: 5.8CC: agospoda, bfan, ccui, ddutile, dhoward, drjones, fche, harald, jfeeney, jwest, knoel, kzhang, leiwang, lersek, lmiksik, plougher, rhod, sforsber, wshi, ypei
Target Milestone: rcKeywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: kernel-2.6.18-353.el5 Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-09-30 23:34:44 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 514490, 924134    
Attachments:
Description Flags
[2/2] ixgbevf: Fix panic when loading driver
none
[1/2] ixgbevf: Prevent RX/TX statistics getting reset to zero
none
[v1] tarball of broken out debug patches for ixgbevf_main.c (64 pcs)
none
[v1] squashed debug series for ixgbevf_main.c
none
rhel5 hvm guest kernel crash with 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_3
none
rhel5 hvm guest kernel log when it doesnt crash with 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_3
none
rhel5 baremetal dmesg for 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_3
none
rhel5 hvm guest kernel crash with 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_3 with udev info
none
rhel5 hvm guest kernel crash with 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_3 with udev debug
none
[1/2] instrument dev_change_name()
none
[2/2] [NET]: Take dev_base_lock when moving device name hash list entry
none
rhel5 hvm guest kernel crash with 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_4
none
rhel5 hvm guest kernel crash with 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_4 with udev _log info
none
rhel5.9 hvm guest kernel crash with 2.6.18-348.el5
none
ixgbevf: allocate room for the mailbox MSI-X interrupt's name (proposed patch) none

Description Pasi Karkkainen 2012-10-03 19:07:27 UTC
Description of problem:

Hardware/Software:
- Dell R510 server.
- RHEL 5.8 x64 Xen dom0.
- Intel 82599EB dual-port Direct Attach (DA) SFP+ SR-IOV server NIC.
- VT-x / VT-d (IOMMU) / SR-IOV enabled in BIOS.
- "dom0_mem=2048M iommu=1" options for Xen in grub.
- "pci_pt_e820_access=on" options for dom0 Linux kernel in grub.
- "max_vfs=2" options in modprobe.conf for ixgbe driver.
- "ixgbevf" driver blacklisted in dom0.
- Resulting 4 Virtual Functions (VFs) "hidden" in dom0 using xen pciback.
- Physical Function (PF) eth-devices are configured "up" in dom0.

Now install RHEL 5.8 x64 HVM guest VM. Configure PCI passthru to assign one or two SR-IOV VFs to the VM. Reboot the physical server. Start the RHEL 5.8 VM. 
Log into the VM, and notice the first VF ethernet device has a weird name, it's not ethX, but instead it's for example: "__tmp1145796886". Execute "ifconfig __tmp1145796886 up" and the rhel5.8 guest kernel crashes immediately. Restart the VM, and now the VF has a normal ethX name, and the VF works OK/normally without crashes. 

So the crash usually only happens during the *first* time the VM is started after physical server reboot. Restart the VM once and it works OK afterwards.



Version-Release number of selected component (if applicable):

rhel 5.8 x64 hvm guest kernel:
Linux version 2.6.18-308.13.1.el5 (mockbuild.bos.redhat.com) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-52)) #1 SMP Thu Jul 26 05:45:09 EDT 2012


How reproducible:
Often during the first time when the rhel5.8 guest VM is started after physical server reboot, but not always. The problem usually doesn't happen anymore after you restart/reboot the guest VM once.


Steps to Reproduce:
1. Set up rhel 5.8 x64 Xen dom0 with Intel 82599 10 Gbit/s SR-IOV NIC.
2. Reboot the physical server.
3. Start rhel 5.8 x64 HVM guest VM with SR-IOV VF device passed through.
4. Notice the VF eth-interface is called for example "__tmp1145796886" inside the rhel 5.8 guest, not ethX.
5. Execute "ifconfig __tmp1145796886 up" in the rhel 5.8 HVM guest.
  
Actual results:
The guest kernel crashes immediately.

Expected results:
The VF eth-interface goes 'UP' and the VM doesn't crash.


Additional info from inside the rhel 5.8 guest VM:

# dmesg | grep -i ixgbe
ixgbevf: Intel(R) 10 Gigabit PCI Express Virtual Function Network Driver - version 2.1.0-k


# ifconfig __tmp1145796886

__tmp1145796886 Link encap:Ethernet  HWaddr FE:D7:DE:5D:F4:61  
          BROADCAST MULTICAST  MTU:1500  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:0 (0.0 b)  TX bytes:0 (0.0 b)

# ethtool -i __tmp1145796886
driver: ixgbevf
version: 2.1.0-k
firmware-version: N/A
bus-info: 0000:00:06.0

# lspci -vvv

00:06.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
        Subsystem: Intel Corporation Device 7a11
        Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
        Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
        Latency: 64
        Region 0: Memory at f4020000 (64-bit, non-prefetchable) [size=16K]
        Region 3: Memory at f4024000 (64-bit, non-prefetchable) [size=16K]
        Capabilities: [70] MSI-X: Enable+ Count=3 Masked-
                Vector table: BAR=3 offset=00000000
                PBA: BAR=3 offset=00002000
        Kernel driver in use: ixgbevf
        Kernel modules: ixgbevf


# ifconfig __tmp1145796886 up
<the guest kernel crashes immediately>

Crash log/stacktrace from the VM serial console (xm console):


8021q: adding VLAN 0 to HW filter on device __tmp1145796886
general protection fault: 0000 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:00.0/class
CPU 0 
Modules linked in: ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery 
asus_acpi acpi_memhotplug ac parport_pc lp parport floppy xen_vnif xen_vbd joydev xen_balloon xen_platform_pci i2c_piix4 i2c_core e1000 ixgb
evf ide_cd pcspkr tpm_tis tpm 8021q serio_raw cdrom tpm_bios dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror 
dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 1823, comm: ifconfig Not tainted 2.6.18-308.13.1.el5 #1
RIP: 0010:[<ffffffff881fae5f>]  [<ffffffff881fae5f>] :ixgbevf:ixgbevf_msix_clean_tx+0x58/0x2a2
RSP: 0018:ffffffff804b1ec8  EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff0078626d3a36 RCX: 0000000000000000
RDX: ffff81003ccc7780 RSI: 0000000000000040 RDI: 0000000000000000
RBP: ffff81003dff2bc0 R08: ffffffff80323028 R09: 0000000000000046
R10: 0000000000000001 R11: ffffffff881f8113 R12: ffff810037f46500
R13: 0000000000000000 R14: ffff810035139c18 R15: ffff810035139c18
FS:  00002ae3846c65e0(0000) GS:ffffffff80431000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000055d7218 CR3: 000000003509b000 CR4: 00000000000006a0
Process ifconfig (pid: 1823, threadinfo ffff810035138000, task ffff8100019547a0)
Stack:  ffff81003ccc7780 0000000000000000 0000000000000046 ffffffff80566f00
 ffffffff80448f90 0000000000000000 ffff81003dff2bc0 00000000000000c1
 0000000000000000 ffff810035139c18 ffff810035139c18 ffffffff80010d91

Call Trace:
 <IRQ>  [<ffffffff80010d91>] handle_IRQ_event+0x51/0xa6
 [<ffffffff800be15f>] __do_IRQ+0xfb/0x15b
 [<ffffffff8006d4d1>] do_IRQ+0xe9/0xf7
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff881f8113>] :ixgbevf:ixgbevf_set_vfta_vf+0x0/0x2e
 [<ffffffff881f83b8>] :ixgbevf:ixgbevf_read_v2p_mailbox+0x10/0x23
 [<ffffffff881f83d8>] :ixgbevf:ixgbevf_check_for_bit_vf+0xd/0x25
 [<ffffffff881f83fe>] :ixgbevf:ixgbevf_check_for_msg_vf+0xe/0x23
 [<ffffffff881f85fe>] :ixgbevf:ixgbevf_write_mbx_vf+0x23/0x79
 [<ffffffff881f852f>] :ixgbevf:ixgbevf_write_posted_mbx+0x17/0x5c
 [<ffffffff881f813c>] :ixgbevf:ixgbevf_set_vfta_vf+0x29/0x2e
 [<ffffffff881ad35a>] :8021q:vlan_device_event+0x50/0x1b2
 [<ffffffff800675cb>] notifier_call_chain+0x20/0x32
 [<ffffffff8023a75f>] dev_open+0x96/0x9c
 [<ffffffff80238854>] dev_change_flags+0x5a/0x119
 [<ffffffff8026d035>] devinet_ioctl+0x235/0x59c
 [<ffffffff8022f748>] sock_ioctl+0x1c1/0x1e5
 [<ffffffff80041f7f>] do_ioctl+0x21/0x6b
 [<ffffffff8002ffcb>] vfs_ioctl+0x457/0x4b9
 [<ffffffff800ba76b>] audit_syscall_entry+0x1a8/0x1d3
 [<ffffffff8004c381>] sys_ioctl+0x59/0x78
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0


Code: 8b 6b 24 c7 43 3c 00 00 00 00 c7 43 40 00 00 00 00 4d 8b b4 
RIP  [<ffffffff881fae5f>] :ixgbevf:ixgbevf_msix_clean_tx+0x58/0x2a2
 RSP <ffffffff804b1ec8>
 <0>Kernel panic - not syncing: Fatal exception

Comment 2 Laszlo Ersek 2012-10-08 09:43:11 UTC
Pasi,

please upload a full guest boot log. I'm interested in any udev messages.

See
http://www.redhat.com/archives/libvir-list/2011-August/msg00937.html
http://www.redhat.com/archives/libvir-list/2011-August/msg00937.html

There's a difference in how RHEL-5 and RHEL-6 udev identify virtual functions persistently.

Harald,

what could be the reason of an eth interface being named "__tmp1145796886"?

Thank you both.

Comment 3 Laszlo Ersek 2012-10-08 09:47:19 UTC
Hi Laine,

did the discussion in <http://www.redhat.com/archives/libvir-list/2011-August/msg00937.html> lead anywhere? Is there a way in RHEL-5 libvirt to assign a persistent MAC to a VF?

Thanks.

Comment 4 Harald Hoyer 2012-10-09 15:52:41 UTC
(In reply to comment #2)
> Pasi,
> 
> please upload a full guest boot log. I'm interested in any udev messages.
> 
> See
> http://www.redhat.com/archives/libvir-list/2011-August/msg00937.html
> http://www.redhat.com/archives/libvir-list/2011-August/msg00937.html
> 
> There's a difference in how RHEL-5 and RHEL-6 udev identify virtual
> functions persistently.
> 
> Harald,
> 
> what could be the reason of an eth interface being named "__tmp1145796886"?
> 
> Thank you both.

clash in renaming interfaces... especially, if you have ifcfg with interface names assigned to a HWADDR and you have interfaces without ifcfg files.

udev does not know what to do with the "without" interfaces, when they occupy the name of the "HWADDR" interfaces. So they are renamed. This can be resolved by not naming the HWADDR interfaces in the "eth*" namespace.

Comment 5 Laszlo Ersek 2012-10-09 15:57:57 UTC
Thanks. So it's the consequence of the varying VF MAC and probably unrelated to the crash.

Comment 6 Laszlo Ersek 2012-10-10 15:06:25 UTC
(In reply to comment #0)

> rhel 5.8 x64 hvm guest kernel:
> Linux version 2.6.18-308.13.1.el5 (mockbuild.bos.redhat.com)
> (gcc version 4.1.2 20080704 (Red Hat 4.1.2-52)) #1 SMP Thu Jul 26 05:45:09

> general protection fault: 0000 [1] SMP 

> RIP: 0010:[<ffffffff881fae5f>]  [<ffffffff881fae5f>]
> :ixgbevf:ixgbevf_msix_clean_tx+0x58/0x2a2

>  <EOI>  [<ffffffff881f8113>] :ixgbevf:ixgbevf_set_vfta_vf+0x0/0x2e
>  [<ffffffff881f83b8>] :ixgbevf:ixgbevf_read_v2p_mailbox+0x10/0x23
>  [<ffffffff881f83d8>] :ixgbevf:ixgbevf_check_for_bit_vf+0xd/0x25
>  [<ffffffff881f83fe>] :ixgbevf:ixgbevf_check_for_msg_vf+0xe/0x23
>  [<ffffffff881f85fe>] :ixgbevf:ixgbevf_write_mbx_vf+0x23/0x79
>  [<ffffffff881f852f>] :ixgbevf:ixgbevf_write_posted_mbx+0x17/0x5c
>  [<ffffffff881f813c>] :ixgbevf:ixgbevf_set_vfta_vf+0x29/0x2e
>  [<ffffffff881ad35a>] :8021q:vlan_device_event+0x50/0x1b2
>  [<ffffffff800675cb>] notifier_call_chain+0x20/0x32
>  [<ffffffff8023a75f>] dev_open+0x96/0x9c
>  [<ffffffff80238854>] dev_change_flags+0x5a/0x119
>  [<ffffffff8026d035>] devinet_ioctl+0x235/0x59c
>  [<ffffffff8022f748>] sock_ioctl+0x1c1/0x1e5
>  [<ffffffff80041f7f>] do_ioctl+0x21/0x6b
>  [<ffffffff8002ffcb>] vfs_ioctl+0x457/0x4b9

$ gdb /usr/lib/debug/lib/modules/2.6.18-308.13.1.el5/kernel/drivers/net/ixgbevf/ixgbevf.ko.debug 

(gdb) list *(ixgbevf_msix_clean_tx+0x58)

0x2e5f is in ixgbevf_msix_clean_tx (drivers/net/ixgbevf/ixgbevf_main.c:197).

192             union ixgbe_adv_tx_desc *tx_desc, *eop_desc;
193             struct ixgbevf_tx_buffer *tx_buffer_info;
194             unsigned int i, eop, count = 0;
195             unsigned int total_bytes = 0, total_packets = 0;
196
197             i = tx_ring->next_to_clean;
                    ^^^^^^^^^^^^^^^^^^^^^^

198             eop = tx_ring->tx_buffer_info[i].next_to_watch;
199             eop_desc = IXGBE_TX_DESC_ADV(*tx_ring, eop);
200
201             while ((eop_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)) &&

Backport candidate:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=10cc1bdd

Comment 7 Laszlo Ersek 2012-10-10 15:26:38 UTC
Created attachment 624960 [details]
[2/2] ixgbevf: Fix panic when loading driver


RHEL-5 backport of the following upstream Linux patch:

    commit 10cc1bdd5ef65f60f570b594c4c066d763c128fb
    Author: Alexander Duyck <alexander.h.duyck>
    Date:   Mon Jul 16 23:44:48 2012 +0000

        ixgbevf: Fix panic when loading driver

        This patch addresses a kernel panic seen when setting up the
        interface. Specifically we see a NULL pointer dereference on the
        Tx descriptor cleanup path when enabling interrupts.  This change
        corrects that so it cannot occur.

        Signed-off-by: Alexander Duyck <alexander.h.duyck>
        Acked-by: Greg Rose <gregory.v.rose>
        Tested-by: Sibai Li <sibai.li>
        Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher>
        Signed-off-by: David S. Miller <davem>
---
 drivers/net/ixgbevf/ixgbevf_main.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

Comment 8 Laszlo Ersek 2012-10-10 15:29:12 UTC
RHEL-6 seems to lack the patch as well, but before thinking about cloning this BZ for RHEL-6, we must see if the patch solves the RHEL-5 crash. I'll prepare a 2.6.18-308.13.1.el5 based RPM.

Comment 11 Pasi Karkkainen 2012-10-14 17:47:00 UTC
Unfortunately it seems the patch didn't help.. I still get this crash:


Unable to handle kernel paging request at ffff810078626d76 RIP: 
 [<ffffffff88223e5f>] :ixgbevf:ixgbevf_msix_clean_tx+0x58/0x2b2
PGD 10063 PUD 0
Oops: 0002 [1] SMP
last sysfs file: /block/ram0/dev
CPU 0
Modules linked in: ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button ba
ttery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy xen_vnif joydev xen_vbd xen_balloon i2c_piix4 i2c_core xen_platform_pc
i ixgbevf e1000 pcspkr tpm_tis tpm tpm_bios serio_raw ide_cd cdrom 8021q dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot
dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 1965, comm: ifconfig Not tainted 2.6.18-308.13.1.el5.bz862862_ixgbevf_crash #1
RIP: 0010:[<ffffffff88223e5f>]  [<ffffffff88223e5f>] :ixgbevf:ixgbevf_msix_clean_tx+0x58/0x2b2
RSP: 0018:ffffffff804b1ec8  EFLAGS: 00010286
RAX: 0000000000000000 RBX: ffff810078626d3a RCX: 0000000000000000
RDX: ffff81003f678380 RSI: 0000000000000040 RDI: 0000000000000000
RBP: ffff81003d319140 R08: ffff81003ff92f78 R09: ffff81003f772410
R10: ffffffff8009a264 R11: 0000000000001002 R12: ffff81003f037500
R13: 0000000000000000 R14: ffff8100357f3ca8 R15: ffff8100357f3ca8
FS:  00002acbff6f05e0(0000) GS:ffffffff80431000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffff810078626d76 CR3: 0000000037095000 CR4: 00000000000006a0
Process ifconfig (pid: 1965, threadinfo ffff8100357f2000, task ffff81003e72b7e0)
Stack:  ffff81003f678380 0000000000000000 ffff81003f037990 0000000000000246
 ffff81003f037500 0000000000000000 ffff81003d319140 00000000000000c1
 0000000000000000 ffff8100357f3ca8 ffff8100357f3ca8 ffffffff80010d91
Call Trace:
 <IRQ>  [<ffffffff80010d91>] handle_IRQ_event+0x51/0xa6
 [<ffffffff800be15f>] __do_IRQ+0xfb/0x15b
 [<ffffffff8006d4d1>] do_IRQ+0xe9/0xf7
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff80249398>] qdisc_alloc+0xf/0x99
 [<ffffffff80249430>] qdisc_create_dflt+0xe/0x3f
 [<ffffffff802494d1>] dev_activate+0x37/0xee
 [<ffffffff8023a74b>] dev_open+0x82/0x9c
 [<ffffffff80238854>] dev_change_flags+0x5a/0x119
 [<ffffffff8026d035>] devinet_ioctl+0x235/0x59c
 [<ffffffff8022f748>] sock_ioctl+0x1c1/0x1e5
 [<ffffffff80041f7f>] do_ioctl+0x21/0x6b
 [<ffffffff8002ffcb>] vfs_ioctl+0x457/0x4b9
 [<ffffffff800ba76b>] audit_syscall_entry+0x1a8/0x1d3
 [<ffffffff8004c381>] sys_ioctl+0x59/0x78
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0


Code: c7 43 3c 00 00 00 00 c7 43 40 00 00 00 00 41 8b 84 24 60 04 
RIP  [<ffffffff88223e5f>] :ixgbevf:ixgbevf_msix_clean_tx+0x58/0x2b2
 RSP <ffffffff804b1ec8>
CR2: ffff810078626d76
 <0>Kernel panic - not syncing: Fatal exception

Comment 12 Pasi Karkkainen 2012-10-17 18:58:19 UTC
Anything else I should try/test?

Comment 13 Laszlo Ersek 2012-10-17 20:00:24 UTC
Not much of a consolation, but it's a different crash (or in a different location anyway).

$ gdb /usr/lib/debug/lib/modules/2.6.18-308.13.1.el5.bz862862_ixgbevf_crash/kernel/drivers/net/ixgbevf/ixgbevf.ko.debug

(gdb) list *(ixgbevf_msix_clean_tx+0x58)

0x2e5f is in ixgbevf_msix_clean_tx (drivers/net/ixgbevf/ixgbevf_main.c:979).

974                     return IRQ_HANDLED;
975
976             r_idx = find_first_bit(q_vector->txr_idx,
                                       adapter->num_tx_queues);
977             for (i = 0; i < q_vector->txr_count; i++) {
978                     tx_ring = &(adapter->tx_ring[r_idx]);
979                     tx_ring->total_bytes = 0;
                        ^^^^^^^^^^^^^^^^^^^^
980                     tx_ring->total_packets = 0;
981                     ixgbevf_clean_tx_irq(adapter, tx_ring);
982                     r_idx = find_next_bit(q_vector->txr_idx,
                                              adapter->num_tx_queues,
983                                           r_idx + 1);

(The patch attached in comment 7 was for ixgbevf_clean_tx_irq(), called on line 981 above, and most probably inlined -- line 981 is the only one call site.)

Apparently the assignment to tx_ring on line 978 can produce bogus values. Now we've seen it cause a crash on the next line, and also a bit later, when the dereference occurs in ixgbevf_clean_tx_irq().

I do think we have the correct patch, just my first backport attempt was wrong. Clearly, if we expect "tx_ring" to be possibly invalid right at the beginning of ixgbevf_clean_tx_irq(), then it can be invalid with the exact same effort earlier, on lines 979 and 980 shown above.

Looking at upstream Linux at commit 10cc1bdd (ie. not just the patch, but the entire "ixgbevf_main.c" file in that state), those "tx_ring" dereferences before the call to ixgbevf_clean_tx_irq() are *not* there in ixgbevf_msix_clean_tx(). In other words, in upstream the patch (commit 10cc1bdd) protected *all* dereferences against the adapter being down, while my backport in comment 7 failed to implement the same intent.

I still think we should go with this patch, just backport it correctly.

BTW upstream has since (as a matter of fact, not much later than 10cc1bdd) moved the TX cleanup to NAPI context from interrupt context, see commit fa71ae27. Obviously that's out of question for RHEL-5.

I'll try to prepare a new backport and corresponding RPMs.

Comment 14 Laszlo Ersek 2012-10-17 20:10:20 UTC
What I missed was <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=93659763>. They did it for a different purpose, but in effect that commit removed the offending dereferences.

Comment 15 Laszlo Ersek 2012-10-17 21:36:34 UTC
Created attachment 629093 [details]
[1/2] ixgbevf: Prevent RX/TX statistics getting reset to zero

series build tested locally, brew in progress

(also clearing needinfo)

Comment 18 Pasi Karkkainen 2012-10-18 09:32:47 UTC
Crash also happens with the latest patch, still in ixgbevf_msix_clean_tx():

general protection fault: 0000 [1] SMP
last sysfs file: /block/ram0/dev
CPU 0
Modules linked in: ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy xen_vnif xen_vbd xen_balloon joydev ide_cd pcspkr cdrom e1000 i2c_piix4 ixgbevf i2c_core 8021q xen_platform_pci tpm_tis tpm tpm_bios serio_raw dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 1981, comm: ifconfig Not tainted 2.6.18-308.16.1.el5.bz862862_ixgbevf_crash_2 #1
RIP: 0010:[<ffffffff881f8e26>]  [<ffffffff881f8e26>] :ixgbevf:ixgbevf_msix_clean_tx+0x74/0x2a8
RSP: 0018:ffffffff804b1ec8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff0078626d3a33 RCX: 0000000000000000
RDX: ffff0078626d3a33 RSI: 0000000000000040 RDI: 0000000000000000
RBP: ffff81003d87e540 R08: ffff81003ff92f78 R09: ffff81003f483a10
R10: ffffffff8009a264 R11: 0000000000001002 R12: ffff81003f00b500
R13: 0000000000000000 R14: ffff81003f00b000 R15: ffff8100351e3b88
FS:  00002b50f767a5e0(0000) GS:ffffffff80431000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000003e2551dcfc CR3: 0000000034c04000 CR4: 00000000000006a0
Process ifconfig (pid: 1981, threadinfo ffff8100351e2000, task ffff81003f48e7a0)
Stack:  ffff81003f664780 0000000000000000 ffff81003f00b990 0000000000000246
 ffff81003f00b500 0000000000000000 ffff81003d87e540 00000000000000c9
 0000000000000000 ffff8100351e3b88 ffff8100351e3b88 ffffffff80010d91
Call Trace:
 <IRQ>  [<ffffffff80010d91>] handle_IRQ_event+0x51/0xa6
 [<ffffffff800be15f>] __do_IRQ+0xfb/0x15b
 [<ffffffff8006d4d1>] do_IRQ+0xe9/0xf7
 [<ffffffff8005d615>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff800906e9>] __cond_resched+0x1c/0x44
 [<ffffffff8006246f>] __sched_text_start+0x9f/0xbd0
 [<ffffffff8006245b>] __sched_text_start+0x8b/0xbd0
 [<ffffffff881f63d8>] :ixgbevf:ixgbevf_check_for_bit_vf+0xd/0x25
 [<ffffffff881f6421>] :ixgbevf:ixgbevf_check_for_ack_vf+0xe/0x23
 [<ffffffff800906e9>] __cond_resched+0x1c/0x44
 [<ffffffff800630d5>] cond_resched+0x37/0x42
 [<ffffffff800df3f4>] __kmalloc+0x3b/0x9f
 [<ffffffff800cf439>] __kzalloc+0x9/0x21
 [<ffffffff802494fa>] qdisc_alloc+0x1f/0x99
 [<ffffffff80249582>] qdisc_create_dflt+0xe/0x3f
 [<ffffffff80249623>] dev_activate+0x37/0xee
 [<ffffffff8023a89d>] dev_open+0x82/0x9c
 [<ffffffff802389a6>] dev_change_flags+0x5a/0x119
 [<ffffffff8026d187>] devinet_ioctl+0x235/0x59c
 [<ffffffff8022f89a>] sock_ioctl+0x1c1/0x1e5
 [<ffffffff80041f9e>] do_ioctl+0x21/0x6b
 [<ffffffff8002ffea>] vfs_ioctl+0x457/0x4b9
 [<ffffffff800ba76b>] audit_syscall_entry+0x1a8/0x1d3
 [<ffffffff8004c3a0>] sys_ioctl+0x59/0x78
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0


Code: 8b 6b 24 89 e8 48 c1 e0 05 48 03 43 30 44 0f b7 78 1a 44 89
RIP  [<ffffffff881f8e26>] :ixgbevf:ixgbevf_msix_clean_tx+0x74/0x2a8
 RSP <ffffffff804b1ec8>
 <0>Kernel panic - not syncing: Fatal exception

Comment 19 Laszlo Ersek 2012-10-18 12:52:05 UTC
Now that is bad.

Reading symbols from /usr/lib/debug/lib/modules/2.6.18-308.16.1.el5.bz862862_ixgbevf_crash_2/kernel/drivers/net/ixgbevf/ixgbevf.ko.debug...done.

(gdb) list *(ixgbevf_msix_clean_tx+0x74)

0x2e26 is in ixgbevf_msix_clean_tx (drivers/net/ixgbevf/ixgbevf_main.c:200).

195             unsigned int total_bytes = 0, total_packets = 0;
196
197             if (test_bit(__IXGBEVF_DOWN, &adapter->state))
198                     return true;
199
200             i = tx_ring->next_to_clean;
                    ^^^^^^^^^^^^^^^^^^^^^^

201             eop = tx_ring->tx_buffer_info[i].next_to_watch;
202             eop_desc = IXGBE_TX_DESC_ADV(*tx_ring, eop);
203
204             while ((eop_desc->wb.status & cpu_to_le32(IXGBE_TXD_STAT_DD)) &&

It means that either
- the __IXGBEVF_DOWN doesn't get set correctly, or
- the adapter is indeed up, but the "tx_ring" pointer is nonetheless invalid.

Comment 20 Laszlo Ersek 2012-10-18 13:13:04 UTC
ixgbevf_probe
  ixgbevf_assign_netdev_ops
    dev->open = ixgbevf_open
  ixgbevf_init_interrupt_scheme
    ixgbevf_alloc_queues
      adapter->tx_ring = kcalloc(...)

ixgbevf_open
  ixgbevf_setup_all_tx_resources
  ...
  ixgbevf_up_complete
    clear_bit(__IXGBEVF_DOWN, ...)
  ixgbevf_request_irq
    ixgbevf_request_msix_irqs
      request_irq(..., ixgbevf_msix_clean_tx, ...)
  ixgbevf_irq_enable

Seems sane.

Comment 24 Pasi Karkkainen 2012-10-18 17:28:38 UTC
With kernel 2.6.18-344.el5:

Unable to handle kernel NULL pointer dereference at 000000000000001a RIP:
 [<ffffffff88196f1b>] :ixgbevf:ixgbevf_msix_clean_tx+0x7b/0x2a2
PGD 351a4067 PUD 3d270067 PMD 0
Oops: 0000 [1] SMP
last sysfs file: /block/ram0/dev
CPU 0
Modules linked in: ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy xen_vnif joydev xen_vbd xen_balloon i2c_piix4 i2c_core ide_cd xen_platform_pci serio_raw e1000 cdrom pcspkr tpm_tis tpm tpm_bios ixgbevf 8021q dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 1983, comm: ifconfig Not tainted 2.6.18-344.el5 #1
RIP: 0010:[<ffffffff88196f1b>]  [<ffffffff88196f1b>] :ixgbevf:ixgbevf_msix_clean_tx+0x7b/0x2a2
RSP: 0018:ffffffff804b3ec8  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff81000078626d RCX: 0000000000000000
RDX: ffff81003ec38dc0 RSI: 0000000000000040 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000040 R09: ffff810001738380
R10: ffffffff800a1be8 R11: 0000000000001002 R12: ffff81003e856500
R13: 0000000000000000 R14: ffff81003e856000 R15: ffff8100357d5b88
FS:  00002ba2749c05e0(0000) GS:ffffffff80434000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000000000000001a CR3: 000000003d317000 CR4: 00000000000006a0
Process ifconfig (pid: 1983, threadinfo ffff8100357d4000, task ffff81003f3e07e0)
Stack:  ffff81003ec38dc0 0000000000000000 ffff81003e856990 0000000000000246
 ffff81003e856500 0000000000000000 ffff81003cc10140 00000000000000c1
 0000000000000000 ffff8100357d5b88 ffff8100357d5b88 ffffffff80010da9
Call Trace:
 <IRQ>  [<ffffffff80010da9>] handle_IRQ_event+0x51/0xa6
 [<ffffffff800be877>] __do_IRQ+0xfb/0x15b
 [<ffffffff8006d47d>] do_IRQ+0xe9/0xf7
 [<ffffffff8005d625>] ret_from_intr+0x0/0xa
 <EOI>  [<ffffffff80090bac>] __cond_resched+0x1c/0x44
 [<ffffffff800624b2>] __sched_text_start+0xe2/0xbd0
 [<ffffffff8006245b>] __sched_text_start+0x8b/0xbd0
 [<ffffffff88194428>] :ixgbevf:ixgbevf_check_for_bit_vf+0xd/0x25
 [<ffffffff8819444e>] :ixgbevf:ixgbevf_check_for_msg_vf+0xe/0x23
 [<ffffffff881945f5>] :ixgbevf:ixgbevf_read_posted_mbx+0x31/0x67
 [<ffffffff881944bc>] :ixgbevf:ixgbevf_obtain_mbx_lock_vf+0x13/0x1f
 [<ffffffff881944da>] :ixgbevf:ixgbevf_read_mbx_vf+0x12/0x5f
 [<ffffffff8819405c>] :ixgbevf:ixgbevf_write_msg_read_ack+0x2d/0x36
 [<ffffffff800bef3e>] request_irq+0xb0/0xd6
 [<ffffffff80090bac>] __cond_resched+0x1c/0x44
 [<ffffffff800630d5>] cond_resched+0x37/0x42
 [<ffffffff800dfc8f>] __kmalloc+0x3b/0x9f
 [<ffffffff800cfc6a>] __kzalloc+0x9/0x21
 [<ffffffff8024b61a>] qdisc_alloc+0x1f/0x99
 [<ffffffff8024b6a2>] qdisc_create_dflt+0xe/0x3f
 [<ffffffff8024b75f>] dev_activate+0x37/0xee
 [<ffffffff8023c8ee>] dev_open+0x82/0x9c
 [<ffffffff8023a9d1>] dev_change_flags+0x5a/0x119
 [<ffffffff8026f3ca>] devinet_ioctl+0x235/0x59c
 [<ffffffff80231830>] sock_ioctl+0x1c1/0x1e5
 [<ffffffff8004248b>] do_ioctl+0x21/0x6b
 [<ffffffff800304d3>] vfs_ioctl+0x457/0x4b9
 [<ffffffff800bae83>] audit_syscall_entry+0x1a8/0x1d3
 [<ffffffff8004c88f>] sys_ioctl+0x59/0x78
 [<ffffffff8005d29e>] tracesys+0xd5/0xdf


Code: 44 0f b7 78 1a 44 89 f8 48 c1 e0 04 48 03 43 08 c7 44 24 1c
RIP  [<ffffffff88196f1b>] :ixgbevf:ixgbevf_msix_clean_tx+0x7b/0x2a2
 RSP <ffffffff804b3ec8>
CR2: 000000000000001a
 <0>Kernel panic - not syncing: Fatal exception

Comment 25 Laszlo Ersek 2012-10-18 23:37:38 UTC
clearing needinfo

Comment 26 Laszlo Ersek 2012-10-21 06:27:20 UTC
Are VLANs involved / important somehow?

(In reply to comment #0)

> 8021q: adding VLAN 0 to HW filter on device __tmp1145796886
> general protection fault: 0000 [1] SMP 

>  <EOI>  [<ffffffff881f8113>] :ixgbevf:ixgbevf_set_vfta_vf+0x0/0x2e

"ixgbevf_set_vfta_vf - Set/Unset vlan filter table address"

I'm uncertain my question makes sense at all, but does the crash reproduce "without VLAN"?

There are some VLAN-related changes after summer 2012 in upstream, but I have no idea if they're related. (2ddc7fe1, 5d9a533b).

Anyway I can't make any sense of the stack traces. Is an interrupt injected by the hypervisor in a wrong spot, or what?...

Comment 27 Pasi Karkkainen 2012-10-21 10:37:06 UTC
(In reply to comment #26)
> Are VLANs involved / important somehow?
> 
> I'm uncertain my question makes sense at all, but does the crash reproduce
> "without VLAN"?
> 

I don't use VLANs when testing this. I use the "ethX" device directly, so the crash happens when I do: "ifconfig __tmp1145796886 <ip> up". It's the VF interface, not a VLAN.

Note that on some boots the VF is properly called ethX, and then when I do "ifconfig ethX <ip> up" I *dont* get the crash.

The crash only happens when the VF has the name "__tmp1145796886" .. it *seems* the rename stuff is related somehow.

Comment 33 Pasi Karkkainen 2012-11-01 12:35:28 UTC
I'm able to freely reboot the server for a couple of weeks more.. so it's easy to test stuff now. Anything else I should try? 

Thanks.

Comment 34 Laszlo Ersek 2012-11-01 15:13:35 UTC
We can't reproduce this in our lab. If possible I'd like to avoid manually instrumenting the entire ixgbevf driver and exchange RPMs. So, let's see if we can get a bird's eye view of the control flow with systemtap.

Please download the RHEL-5 systemtap docs from <https://access.redhat.com/knowledge/docs/Red_Hat_Enterprise_Linux/>. Setting it up is described in <https://access.redhat.com/knowledge/docs/en-US/Red_Hat_Enterprise_Linux/5/html/SystemTap_Beginners_Guide/using-systemtap.html>.

(1) Please set up systemtap in the HVM guest that tends to crash.

(2) Please create a systemtap script (eg. "ixgbevf.stap") with the following contents, in the guest:

  probe module("ixgbevf").function("*").call
  {
    printf("%s -> %s\n", thread_indent(1), probefunc())
  }

  probe module("ixgbevf").function("*").return
  {
    printf("%s <- %s\n", thread_indent(-1), probefunc())
  }

(3) Blacklist the ixgbevf module in the guest.

(4) Recreate the situation (first guest boot after host boot IIRC) where the guest would normally crash after loading ixgbevf and bringing up the interface. 

Putting it differently, I think, please reboot the host and boot the guest, with ixgbevf blacklisted.

(5) Once the guest has booted, log in to it with ssh (through PV-on-HVM netfront or an emulated NIC), and modprobe ixgbevf manually. It is my understanding that at this point there's no crash yet.

The point of the manual modprobe is to avoid an automatic ifup during the boot process (which would crash).

(6) Unbind the passthru ixgbevf from the module, inside the guest:

  echo DOMAIN:BUS:SLOT.FUNC > /sys/bus/pci/drivers/ixgbevf/unbind

(7) In the host, configure "screen" as follows:

  ~/.screenrc:

  logfile /root/screen.log
  logfile flush 1

Then start "screen" and press C-a H. (Capital H.)

(8) In "screen", connect to the serial console of the guest from within the host, with "virsh console" or "xm console". 

Then start the systemtap script from step (2) over the serial console, inside "screen".

  stap -v ixgbevf.stap

Some warning messages may be displayed, but the command should not exit.

(9) Using the ssh connection to the guest, rebind the passthru ixgbevf device to the ixgbevf module. I'm using <http://lxr.linux.no/#linux+v3.6.5/Documentation/ABI/testing/sysfs-bus-pci> as cheatsheet:

  echo DOMAIN:BUS:SLOT.FUNC > /sys/bus/pci/drivers/ixgbevf/bind

or

  echo 1 > /sys/bus/pci/rescan

This should produce some systemtap output on the serial console (which is all saved in /root/screen.log).

(10) Bring up the ixgbevf interface in the guest. This should procude a lot of systemtap output on the serial console (saved in /root/screen.log) and ultimately crash (with a stack trace).

Force off the guest (if it hasn't been automatically destroyed at this point). The "virsh console" command from step (8) will exit, but screen will continue to record pty traffic.

(11) In the host, make a copy of /root/screen.log (eg. to "crash.log"), and then truncate the original (do not use "mv"):

  >| /root/screen.log

(12) Reboot the guest, and repeat steps (5), (6), (8), (9), (10). In my understanding the guest doesn't crash this time and the ixgbevf interface is brought up correctly.

(13) Type C-c in the "screen" window to interrupt stap,  C-] to detach from the guest's serial console, then exit "screen". Rename /root/screen.log to "success.log".

(14) Please upload "crash.log" and "success.log".

Thanks!

Comment 35 Pasi Karkkainen 2012-11-27 20:05:16 UTC
Sorry for the delayed reply. I'm having problems reproducing the problem with this method.. 

another problem: after i re-bind the pci device to ixgbevf driver stap on the serial console starts to get flooded with continuous calls to ixgbevf_clean_rxonly(), ixgbevf_clean_rx_irq() and ixgbevf_alloc_rx_buffers() .. and the serial console goes very slow/unresponsive.

Maybe I should try with the "original" method where I'm seeing the crash often.. so by letting the rhel5 guest autoload the ixgbevf driver (but not bring up the eth-interface) at boot time.

Comment 36 Pasi Karkkainen 2012-11-27 20:28:36 UTC
Now I got it to crash using my original method and while running the stap script.. I get the kernel crash/stacktrace on the console, and nothing on stap.. it crashes so quickly.

Comment 37 Pasi Karkkainen 2012-11-27 20:30:45 UTC
And I mean running "ifconfig -a" or "ethtool -i <ixgbevf-eth-interface>" I get output from the stap script (calls to ixgbevf_get_stats()) , but when I run "ifconfig __tmp1534769776 0.0.0.0 up" the rhel5 guest kernel crashes immediately and there's nothing more on the stap screen.. so the guest kernel crashes before stap manages to print any call info.

Comment 38 Frank Ch. Eigler 2012-11-27 20:36:18 UTC
(Just in case it might help, a new feature in systemtap version 2.0 is connection to the kernel panic_notifier.  If the kernel crashes while a stap job is running, this notifier lets stap emit any last-gasp messages that were buffered but not yet punted to userspace.  (There is no official RH build of stap 2.0 for RHEL5, but it's not too hard to build.)

Comment 39 Pasi Karkkainen 2012-11-27 21:04:21 UTC
FC6 bug about udev eth interface rename creating __tmp names:
https://bugzilla.redhat.com/show_bug.cgi?id=208740

Anyway it seems I'm only getting the rhel5 guest kernel crash when the interface name is __tmp1534769776 or something like that.. no idea why rename stuff affects the ixgbevf driver.

Comment 40 Laszlo Ersek 2012-11-28 14:18:38 UTC
Pasi,

can you perhaps reproduce the problem in a bare-metal RHEL-5 installation? Like booting the machine and using the ixgbevf interface without any hypervisor or guests?

Also, in both guest cases (ie. crash and success), can you please set "ignore_loglevel" on the guest kernel command line, capture the output on the virtual serial console, and upload both logs?


I tried to look a bit more at the tx_ring value that causes the crash. It is calculated by ixgbevf_msix_clean_tx() for ixgbevf_clean_tx_irq() like this:

	r_idx = find_first_bit(q_vector->txr_idx, adapter->num_tx_queues);
	for (i = 0; i < q_vector->txr_count; i++) {
		tx_ring = &(adapter->tx_ring[r_idx]);
		tx_ring->total_bytes = 0;
		tx_ring->total_packets = 0;
		ixgbevf_clean_tx_irq(adapter, tx_ring);
		r_idx = find_next_bit(q_vector->txr_idx, adapter->num_tx_queues,
				      r_idx + 1);
	}

Extending comment 20:

> ixgbevf_probe
>   ixgbevf_assign_netdev_ops
>     dev->open = ixgbevf_open

    ixgbevf_sw_init -------------------> #1

>   ixgbevf_init_interrupt_scheme
>     ixgbevf_alloc_queues
>       adapter->tx_ring = kcalloc(...)
> 
> ixgbevf_open
>   ixgbevf_setup_all_tx_resources
>   ...

    /*
    * Map the Tx/Rx rings to the vectors we were allotted.
    * if request_irq will be called in this function map_rings
    * must be called *before* up_complete
    */
    ixgbevf_map_rings_to_vectors ------> #2
      map_vector_to_txq

>   ixgbevf_up_complete
>     clear_bit(__IXGBEVF_DOWN, ...)
>   ixgbevf_request_irq
>     ixgbevf_request_msix_irqs
>       request_irq(..., ixgbevf_msix_clean_tx, ...)
>   ixgbevf_irq_enable


#2 (ie. ixgbevf_map_rings_to_vectors()) is responsible for setting up "q_vector":

/**
 * ixgbevf_map_rings_to_vectors - Maps descriptor rings to vectors
 * @adapter: board private structure to initialize
 *
 * This function maps descriptor rings to the queue-specific vectors
 * we were allotted through the MSI-X enabling code.  Ideally, we'd have
 * one vector per ring/queue, but on a constrained vector budget, we
 * group the rings as "efficiently" as possible.  You would add new
 * mapping configurations in here.
 **/

Basically it distributes the receive and transmit queues over the available interrupt vectors. When a TX interrupt fires, the ixgbevf_msix_clean_tx() interrupt handler loops over all covered transmit queues. It expects to find exactly "q_vector->txr_count" bits set in the "q_vector->txr_idx" bitmap. The position of each set bit in that map designatex a tx_ring, which is then passed to ixgbevf_clean_tx_irq().

If the bitmap setup is wrong, that's an explanation. If the loop goes on for too long (fewer bits set in the map than indicated by "txr_count"), then find_first_bit() / find_next_bit() will return the size of the bitmap (ie. one greater than the position of the maximum possible bit), resulting in an invalid "tx_ring" pointer.

So why would the bitmap setup in #2 go wrong?

Now I'm suspecting #1 (ixgbevf_sw_init()), which is part of the probing, ie. precedes opening. This function explicitly depends on the MAC assignment to the VF:

	err = hw->mac.ops.reset_hw(hw);
	if (err) {
		dev_info(&pdev->dev,
		         "PF still in reset state, assigning new address\n");
		random_ether_addr(hw->mac.addr);
	} else {
		err = hw->mac.ops.init_hw(hw);
		if (err) {
			printk(KERN_ERR "init_shared_code failed: %d\n", err);
			goto out;
		}
	}

Directly after host reboot, the first branch runs in the guest. (This could be confirmed by your dmesg as requested above.)

(
Also, this could be the connection to the udev renaming problem, see Harald's comment 4:

> clash in renaming interfaces... especially, if you have ifcfg with interface
> names assigned to a HWADDR and you have interfaces without ifcfg files.
> 
> udev does not know what to do with the "without" interfaces, when they
> occupy the name of the "HWADDR" interfaces. So they are renamed. This can be
> resolved by not naming the HWADDR interfaces in the "eth*" namespace.

Let's say the ixgbevf interface is originally called eth1 in the guest (this is the name that the kernel assigns before udev runs.) Also suppose that udev has run at an earlier time (in another boot) in this guest, and created an ifcfg-eth1 file, with HWADDR stanza inside.

After host reboot, random_ether_addr() in the first guest branch above produces a MAC address that the guest's udev has never seen before. So there's again an "eth1" interface from the kernel, but it's MAC differs from the HWADDR that was captured earlier in the ifcfg-eth1 file. So udev renames the interface to __tmpXXX.
)

I think something goes wrong in/after the probing when the first branch runs ("PF still in reset state"). When the second branch runs (MAC already assigned to vf), everything's fine.

I tried to compare ixgbevf_sw_init() with its upstream Linux counterpart; upstream commit 1a0d6ae5 ("rename dev_hw_addr_random and remove redundant second") seems relevant. (Possibly 0ac1e8ce too.) These affect the error handling in/around ixgbevf_sw_init().

I'm going to have to look into those more deeply. However if 1a0d6ae5 indeed plays a role, then this problem is not Xen specific; that's why I'm asking for the bare metal ixgbevf test on the same hardware.

... I might be wrong about which branch runs when. Actually 1a0d6ae5 doesn't seem to change what happens when the first branch runs -- without the patch, a stale non-zero value is returned by ixgbevf_sw_init(), but since it's ignored its staleness makes no difference. Also if init_hw() succeeds in the second branch, no observable change.

1a0d6ae5 changes the handling of init_hw() failure in the second branch ("init_shared_code failed"). With the patch that condition makes ixgbevf_probe() bail out with an error (because the latter no longer ignores ixgbevf_sw_init() errors), and that of course would prevent ixgbevf_open().


Andy, would you have any remarks perhaps? Thanks!

Comment 43 Laszlo Ersek 2012-11-29 03:14:39 UTC
Created attachment 653948 [details]
[v1] tarball of broken out debug patches for ixgbevf_main.c (64 pcs)

To be applied as in:
- take 2.6.18-308.20.1.el5 (most recent public Z-Stream build)
- apply backport from comment 15
- apply backport from comment 7
- apply this series

Locally build-tested.

Comment 44 Laszlo Ersek 2012-11-29 03:17:52 UTC
Created attachment 653952 [details]
[v1] squashed debug series for ixgbevf_main.c

same as in comment 43, for easier viewing in bugzilla

Comment 47 Laszlo Ersek 2012-12-10 16:44:21 UTC
Pasi,

from an unrelated internal discussion I'm getting the idea that the setting fixed MAC addresses for VFs in rc.local with "ip set" might be good practice.

Can you please check what happens at first guest boot after host boot if you place such settings in the guest? Thanks!

Comment 48 Pasi Karkkainen 2012-12-12 16:51:03 UTC
Sorry for the delayed reply, I've been busy with other things. I'll try to get the requested things tested later this week! Thanks.

Comment 49 Pasi Karkkainen 2012-12-13 20:09:40 UTC
Created attachment 663166 [details]
rhel5 hvm guest kernel crash with 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_3

Comment 50 Pasi Karkkainen 2012-12-13 20:12:30 UTC
Attached is the rhel5 hvm guest kernel boot + crash log with the latest debug patched kernel. 

Btw it seems the issue often happens for the first VM launched after host reboot, but also after that.. on every third or every fourth reboot of the rhel5 hvm guest. that makes reproducing and testing easier, no need to reboot the physical server.

Comment 51 Pasi Karkkainen 2012-12-13 20:31:48 UTC
I'm trying to set the VF MAC from dom0 before starting the VM:

# ip link set dev eth0 vf 0 mac 00:01:02:03:05:01
Error: either "dev" is duplicate, or "vf" is a garbage.

# ip link help
Usage: ip link set DEVICE { up | down |
                             arp { on | off } |
                             dynamic { on | off } |
                             multicast { on | off } |
                             allmulticast { on | off } |
                             promisc { on | off } |
                             trailers { on | off } |
                             txqueuelen PACKETS |
                             name NEWNAME |
                             address LLADDR | broadcast LLADDR |
                             mtu MTU }
       ip link show [ DEVICE ]


So hmm.. iproute in rhel5 doesn't support VFs at all?

Comment 52 Pasi Karkkainen 2012-12-13 20:46:58 UTC
# ip link set eth0 vf 0 mac 00:de:ad:be:ef:01
Error: either "dev" is duplicate, or "vf" is a garbage.

Same error with that syntax.. no "dev" in between there.

Comment 53 Laszlo Ersek 2012-12-13 20:57:56 UTC
(In reply to comment #50)
> Attached is the rhel5 hvm guest kernel boot + crash log with the latest
> debug patched kernel. 
> 
> Btw it seems the issue often happens for the first VM launched after host
> reboot, but also after that.. on every third or every fourth reboot of the
> rhel5 hvm guest. that makes reproducing and testing easier, no need to
> reboot the physical server.

Thank you. Can you please also upload a HVM guest boot log (using the same guest kernel) from when there is no crash? I'd like to compare the logs.

Thank you.

Comment 54 Laszlo Ersek 2012-12-13 21:02:13 UTC
Also,

(from comment #40)
>
> can you perhaps reproduce the problem in a bare-metal RHEL-5 installation?
> Like booting the machine and using the ixgbevf interface without any
> hypervisor or guests?

Thus in total I'd like to look at three boot logs, each generated by the same kernel and with "ignore_loglevel":
- without Xen, on the bare metal (whether it crashes or not),
- in an HVM guest, crashing (already uploaded in comment 49),
- in HVM guest, not crashing.

Thanks!

Comment 55 Pasi Karkkainen 2012-12-13 21:17:49 UTC
Created attachment 663204 [details]
rhel5 hvm guest kernel log when it doesnt crash with 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_3

Comment 56 Pasi Karkkainen 2012-12-13 21:24:06 UTC
For the "no crash" log I did this in the HVM guest:
ifconfig eth1 0.0.0.0 up
ifconfig eth2 0.0.0.0 up
ifconfig eth1 down
ifconfig eth2 down
poweroff

Comment 57 Pasi Karkkainen 2012-12-13 21:38:39 UTC
Created attachment 663206 [details]
rhel5 baremetal dmesg for 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_3

I disabled blacklisting of ixgbevf driver in dom0 and rebooted it to native/baremetal kernel.

Comment 58 Pasi Karkkainen 2012-12-13 21:41:59 UTC
.. and baremetal didn't crash.

Comment 59 Laszlo Ersek 2012-12-13 23:27:26 UTC
Comparing the hvm-no-crash & hvm-crash logs (comment 55 and comment 49,
resp.):

No crash:

  e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
  ixgbevf 0000:00:06.0: ixgbevf_probe() registered with name=eth1
  ixgbevf 0000:00:07.0: ixgbevf_probe() registered with name=eth2
  netfront: device eth3 has copying receive path.
  ixgbevf 0000:00:06.0: ixgbevf_open() entry
  ixgbevf 0000:00:06.0: ixgbevf_request_msix_irqs() requesting irq=201 for
                        vector=0 with name=eth1-rx-0
  ixgbevf 0000:00:06.0: ixgbevf_request_msix_irqs() requesting irq=209 for
                        vector=1 with name=eth1-tx-0
  ixgbevf 0000:00:06.0: ixgbevf_open() exit (OK)
  8021q: adding VLAN 0 to HW filter on device eth1
  [no crash]

Crash:

  ixgbevf 0000:00:06.0: ixgbevf_probe() registered with name=eth0
  ixgbevf 0000:00:07.0: ixgbevf_probe() registered with name=eth1
  e1000: eth2: e1000_probe: Intel(R) PRO/1000 Network Connection
  netfront: device eth2 has copying receive path.
  ixgbevf 0000:00:06.0: ixgbevf_open() entry
  ixgbevf 0000:00:06.0: ixgbevf_request_msix_irqs() requesting irq=193 for
                        vector=0 with name=__tmp703594194-rx-0
  ixgbevf 0000:00:06.0: ixgbevf_request_msix_irqs() requesting irq=201 for
                        vector=1 with name=__tmp703594194-tx-0
  ixgbevf 0000:00:06.0: ixgbevf_open() exit (OK)
  8021q: adding VLAN 0 to HW filter on device __tmp703594194
  [...]
  [crash]

The crash seems to depend on *when* the e1000 card is discovered.

No crash:
- e1000 is discovered first, gets eth0
- the VFs get eth1 and eth2, respectively,
- netfront gets eth3

Crash:
- The VFs are discovered first, they are named eth0 and eth1, respectively,
- e1000 is then discovered, gets eth2
- netfront gets eth2 AGAIN (?!)
- udev renames VF1 (eth0) to "__tmp703594194", sometime between
  ixgbevf_probe() and ixgbevf_open()

After VLAN 0 is added, there are a bunch of ixgbevf_set_rx_mode() calls (no
structural difference in the logs), which "is called whenever the multicast
address list or the network interface flags are updated". This happens
presumably inside "ifconfig" (that's the process the crash dump reports),
and then we crash in the "VFs discovered first" case.

This looks crazy.

Regarding the crash itself:

- The offending address we try to access is ffff810078626d5a.

- The offending code is at ixgbevf_maybe_stop_tx+0xc:

(gdb) list *(ixgbevf_maybe_stop_tx+0xc)
0x213d is in ixgbevf_maybe_stop_tx
                          (drivers/net/ixgbevf/ixgbevf_main.c:3725).
3720
3721    /* not instrumented -- possibly called per-packet */
3722    static int ixgbevf_maybe_stop_tx(struct net_device *netdev,
3723                                 struct ixgbevf_ring *tx_ring, int size)
3724    {
3725            if (likely(IXGBE_DESC_UNUSED(tx_ring) >= size))
3726                    return 0;
3727            return __ixgbevf_maybe_stop_tx(netdev, tx_ring, size);
3728    }
3729

I can see the value ffff810078626d3a in RSI and R15; the offending address
is only 0x20 bytes above it.

#define IXGBE_DESC_UNUSED(R) \
	((((R)->next_to_clean > (R)->next_to_use) ? 0 : (R)->count) + \
	(R)->next_to_clean - (R)->next_to_use - 1)

struct ixgbevf_ring {
0x00   	struct ixgbevf_adapter *adapter;  /* backlink */
0x08	void *desc;		/* descriptor ring memory */
0x10	dma_addr_t dma;		/* phys. address of descriptor ring */
0x18	unsigned int size;	/* length in bytes */
0x1C	unsigned int count;	/* amount of descriptors */
0x20	unsigned int next_to_use;

The code tries to access tx_ring->next_to_use for comparison. Where does
tx_ring come from?

(gdb) list *(ixgbevf_xmit_frame+0xe8)
0x690e is in ixgbevf_xmit_frame (drivers/net/ixgbevf/ixgbevf_main.c:3761).
3756
3757            count += TXD_USE_COUNT(skb_headlen(skb));
3758            for (f = 0; f < skb_shinfo(skb)->nr_frags; f++)
3759                count += TXD_USE_COUNT(skb_shinfo(skb)->frags[f].size);
3760
3761            if (ixgbevf_maybe_stop_tx(netdev, tx_ring, count)) {
3762                    adapter->tx_busy++;
3763                    return NETDEV_TX_BUSY;
3764            }
3765

We're lucky. At this call site, tx_ring points to

	tx_ring = &adapter->tx_ring[r_idx];

with r_idx being constant 0. Now, checking the debug log again:

ixgbevf 0000:00:06.0: ixgbevf_alloc_queues() tx_ring[0] @ ffff81003fed20c0
ixgbevf 0000:00:07.0: ixgbevf_alloc_queues() tx_ring[0] @ ffff81003fed2140
offending address:                                        ffff810078626d5a

It's clear that the offending address has *nothing* to do with the
previously allocated tx_ring object. Therefore either the "adapter->tx_ring"
field (which is a pointer) got corrupted, containing the RSI/R15 value
(r_idx == 0), or the "adapter" variable points to garbage. The latter would
only be possible if the network subsystem called ixgbevf_xmit_frame() on a
corrupted netdev, or a different type of NIC.

Something is seriously messed up. Udev is there exactly to mask the random
order of NIC discovery. e1000 and netfront should not both claim "eth2". The
tx_ring address should be correct.

Comment 60 Laszlo Ersek 2012-12-13 23:31:34 UTC
(

For simplicity, the renaming happens via

- /etc/udev/rules.d/60-net.rules:

ACTION=="add", SUBSYSTEM=="net", IMPORT{program}="/lib/udev/rename_device"
SUBSYSTEM=="net", RUN+="/etc/sysconfig/network-scripts/net.hotplug"

- "/lib/udev/rename_device" is a binary in the initscripts package; I could finally find the __tmp%d format string in its source. It is the program that scans /etc/sysconfig/network-scripts/ for HWADDR & co.

)

Comment 61 Laszlo Ersek 2012-12-14 02:41:43 UTC
I'm not sure if we can solve this problem in RHEL-5. RHEL-6 has a way to
bind VFs to interface names in the guest persistently (see below), but
RHEL-5 lacks any.

The RHEL-5 HVM guest crash (ixgbevf corruption) is related to the renaming
(influenced by NIC discovery order and MACs) that occurs between
ixgbevf_probe() and ixgbevf_open() (= ifconfig).

(The bare metal case (no crash) is consistent with this: the
ixgbevf_sw_init() debug messages there differ from both HVM guest runs
(crash & no crash). In the bare metal case the VFs got new random addresses
and were discovered as last interfaces (eth4 to eth7); they didn't clash
with anything known to udev previously and were not renamed.)

RHEL-6 has seen changes in this area:

- The RHEL-6 initscripts package does not try to rename interfaces itself,
it just tries to figure out an interface name and suggest it to "udev". The
udev program does the renaming. The __tmp%d pattern is gone, it's not there
either in "rename_device.c" (initscripts) or udev. The udev rule was
adapted, see
<http://git.fedorahosted.org/cgit/initscripts.git/commit/?id=3e634c64>.

- The RHEL-6 kernel has received a fix for bug 611418 (its duplicate bug
614786 is not public, related RHEL-6 kernel commit: 645373fb). The upstream
counterpart is
<http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=c1f79426>.

Basically, when a VF gets a random MAC, the NET_ADDR_RANDOM flag is set in
the device's "addr_assign_type" word. This word is then exported for
userspace via sysfs. The RHEL-5 kernel doesn't have this feature.

- RHEL-6 udev (147+patches) knows to check for the "addr_assign_type" sysfs
file, and for VFs it saves a persistent device name based on a different
key: it doesn't use the MAC (which is unstable) but a "device path id" (for
example, a PCI path). See bug 596464 and "udev-147-rule_gen.patch" in the
udev SRPM. RHEL-5 udev is not so smart (095+patches).

Unless someone has a better idea, I'm proposing CANTFIX for this BZ.


Regarding workarounds: as far as I understand, Harald recommended one at the
end of comment 4.

Also, based on bug 520453 comment 0, I think one could manually add an udev
rule like this (in the RHEL-5 guest):

SUBSYSTEM=="net", ACTION=="add", KERNEL=="eth*", ID=="$PCIBUSID", NAME="ethX"

This should establish a stable mapping from PCI BDF to ethX name, no matter
the HWADDR. (I guess the corresponding ifcfg script should not contain any
HWADDR entry either.)

OTOH udev might still have to rename the interface (just not to __tmp%d),
which may or may not have an effect on the crash.

Comment 62 Laszlo Ersek 2012-12-14 03:22:33 UTC
(I'd just like to confirm something with disassembly)

(In reply to comment #59)

> Regarding the crash itself:
> 
> - The offending address we try to access is ffff810078626d5a.
> 
> - The offending code is at ixgbevf_maybe_stop_tx+0xc:
> 
> (gdb) list *(ixgbevf_maybe_stop_tx+0xc)
> 0x213d is in ixgbevf_maybe_stop_tx
>                           (drivers/net/ixgbevf/ixgbevf_main.c:3725).
> 3720
> 3721    /* not instrumented -- possibly called per-packet */
> 3722    static int ixgbevf_maybe_stop_tx(struct net_device *netdev,
> 3723                                 struct ixgbevf_ring *tx_ring, int size)
> 3724    {
> 3725            if (likely(IXGBE_DESC_UNUSED(tx_ring) >= size))
> 3726                    return 0;
> 3727            return __ixgbevf_maybe_stop_tx(netdev, tx_ring, size);
> 3728    }
> 3729
> 
> I can see the value ffff810078626d3a in RSI and R15; the offending address
> is only 0x20 bytes above it.
> 
> #define IXGBE_DESC_UNUSED(R) \
> 	((((R)->next_to_clean > (R)->next_to_use) ? 0 : (R)->count) + \
> 	(R)->next_to_clean - (R)->next_to_use - 1)
> 
> struct ixgbevf_ring {
> 0x00   	struct ixgbevf_adapter *adapter;  /* backlink */
> 0x08	void *desc;		/* descriptor ring memory */
> 0x10	dma_addr_t dma;		/* phys. address of descriptor ring */
> 0x18	unsigned int size;	/* length in bytes */
> 0x1C	unsigned int count;	/* amount of descriptors */
> 0x20	unsigned int next_to_use;
> 
> The code tries to access tx_ring->next_to_use for comparison.

$ objdump -d ixgbevf.ko

0000000000002131 <ixgbevf_maybe_stop_tx>:
    2131:       41 54                   push   %r12
    2133:       41 89 d4                mov    %edx,%r12d
    2136:       31 d2                   xor    %edx,%edx
    2138:       55                      push   %rbp
    2139:       48 89 fd                mov    %rdi,%rbp
    213c:       53                      push   %rbx
    213d:       8b 46 20                mov    0x20(%rsi),%eax
    2140:       48 89 f3                mov    %rsi,%rbx
    2143:       39 46 24                cmp    %eax,0x24(%rsi)

0x2131 + 0xc == 0x213d; plus field "next_to_clean" is at offset 0x24 in the struct.

Comment 63 Laszlo Ersek 2012-12-14 11:03:46 UTC
Pasi,

would you care to test how a RHEL-5 HVM guest with passthru VFs (+ the debug kernel) works under RHEL-6 (or RHEL-5) kvm?

I don't know whether such a test could give us any hints, but currently I have no other idea. Drew reminded me that we had had this test in mind originally.

Thanks
Laszlo

Comment 64 Laszlo Ersek 2012-12-21 14:18:55 UTC
Setting

udev_log="debug"

in the RHEL-5 HVM guest's "/etc/udev/udev.conf" might provide more info, if (hopefully) udev's debug messages interleaved on the serial console with the ixgbevf debug messages.

Comment 65 Laszlo Ersek 2012-12-21 14:20:16 UTC
("ignore_loglevel console=tty console=ttyS0,115200" would make the serial line the main console in the guest (last console=... spec), redirecting all of the boot log there)

Comment 66 Pasi Karkkainen 2013-01-03 19:25:28 UTC
Created attachment 672173 [details]
rhel5 hvm guest kernel crash with 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_3 with udev info

Comment 67 Pasi Karkkainen 2013-01-03 19:28:08 UTC
Happy New Year to everyone!

I just added a rhel5 hvm guest crash log with udev_log="info". Hopefully that's enough/helpful..

(I tried with udev_log="debug" aswell, but it produces *so* much output that it took forever to get it transferred from the guest to dom0 over the serial console.. so booting the guest takes *long*. I can try to capture it again if needed..)

Comment 68 Pasi Karkkainen 2013-01-03 21:21:16 UTC
Created attachment 672220 [details]
rhel5 hvm guest kernel crash with 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_3 with udev debug

Comment 69 Pasi Karkkainen 2013-01-03 21:24:07 UTC
Ok, now there's a log with udev_log="debug" aswell. It's 788 kB and over 14000 lines long.. 

now the interesting thing is with udev_log="debug" it seems udevd crashes with segfault 38 times.. grep for segfault and you'll see.. these segfauls don't happen with udev_log="info". I wonder if it's related to the "slowness" of the serial console and the huge amount of output..

There's the guest kernel crash in the end of the log.

Comment 70 Laszlo Ersek 2013-01-04 16:23:56 UTC
Happy New Year!

These logfiles seem helpful, especially the info one. I have another theory
now (well they're cheap, aren't they). I had actually been looking at one
upstream kernel commit from 2007 before (RHEL-5 lacks it, RHEL-6 has it),
and now my suspicion got stronger.

(BTW the exact crashes are different between the info log and the debug log,
but both are consistent with the theory.)

In a nutshell I suspect ixgbevf functions being called on e1000 private data
due to a race condition in interface renaming. Quoting comment 59:

> ixgbevf 0000:00:06.0: ixgbevf_alloc_queues() tx_ring[0] @ ffff81003fed20c0
> ixgbevf 0000:00:07.0: ixgbevf_alloc_queues() tx_ring[0] @ ffff81003fed2140
> offending address:                                        ffff810078626d5a
>
> It's clear that the offending address has *nothing* to do with the
> previously allocated tx_ring object. Therefore either the
> "adapter->tx_ring" field (which is a pointer) got corrupted, containing
> the RSI/R15 value (r_idx == 0), or the "adapter" variable points to
> garbage. The latter would only be possible if the network subsystem called
  ixgbevf_xmit_frame() on a corrupted netdev, or a different type of NIC.

Elaborating on the theory:

udev seems to work like a server program with an event loop. When an event
fires, it runs helpers and utilities to handle the event. Several events
(and their handlers) can be in progress simultaneously; this is confirmed by

  udevd[390]: main: initialize max_childs_running to 16

From the "linear" log in comment 66 it's not easy to see what log entry
belongs to what event: events are interleaved. However udev is diligent to
log all PIDs, which allows us to de-interleave.

Consider the following two (de-interleaved) udev events.
- PID [390] is the main udev daemon,
- PID [973] is the ixgbevf helper/handler (first VF, at 0000:00:06.0,
                                           initial kernel name: eth0),
- PID [984] is the e1000 helper/handler (initial kernel name: eth2).

  udevd[390]: udev_event_run: seq 520 forked, pid [973], 'add' 'net', 1 seconds old
  udevd-event[973]: wait_for_sysfs: file '/sys/class/net/eth0/address' appeared after 0 loops
  udevd-event[973]: run_program: '/lib/udev/rename_device'
  udevd-event[973]: run_program: Waiting 1 seconds for output of '/lib/udev/rename_device(981)'
  udevd-event[973]: run_program: Waiting 2 seconds for output of '/lib/udev/rename_device(981)'
  udevd-event[973]: run_program: '/lib/udev/rename_device' returned with status 0
  udevd-event[973]: udev_rules_get_name: no node name set, will use kernel name 'eth0'
  udevd-event[973]: pass_env_to_socket: passed -1 bytes to socket '/org/kernel/udev/monitor',
  udevd-event[973]: run_program: '/lib/udev/udev_run_hotplugd'
  udevd-event[973]: run_program: '/lib/udev/udev_run_hotplugd' returned with status 0
  udevd-event[973]: run_program: '/lib/udev/udev_run_devd'
  udevd-event[973]: run_program: '/lib/udev/udev_run_devd' returned with status 0
  udevd-event[973]: run_program: '/etc/sysconfig/network-scripts/net.hotplug'
  udevd-event[973]: run_program: '/etc/sysconfig/network-scripts/net.hotplug' returned with status 0
  udevd-event[973]: pass_env_to_socket: passed -1 bytes to socket '/org/freedesktop/hal/udev_event',
  udevd-event[973]: udev_event_run: seq 520 finished
  udevd[390]: udev_done: seq 520, pid [973] exit with 0, 17 seconds old

  udevd[390]: udev_event_run: seq 523 forked, pid [984], 'add' 'net', 0 seconds old
  udevd-event[984]: wait_for_sysfs: file '/sys/class/net/eth2/address' appeared after 0 loops
  udevd-event[984]: run_program: '/lib/udev/rename_device'
  udevd-event[984]: run_program: '/lib/udev/rename_device' (stdout) 'INTERFACE=eth0'
  udevd-event[984]: run_program: '/lib/udev/rename_device' (stdout) 'DEVPATH=/class/net/eth0'
  udevd-event[984]: run_program: '/lib/udev/rename_device' returned with status 0
  udevd-event[984]: udev_rules_get_name: no node name set, will use kernel name 'eth2'
  udevd-event[984]: pass_env_to_socket: passed -1 bytes to socket '/org/kernel/udev/monitor',
  udevd-event[984]: run_program: '/lib/udev/udev_run_hotplugd'
  udevd-event[984]: run_program: '/lib/udev/udev_run_hotplugd' returned with status 0
  udevd-event[984]: run_program: '/lib/udev/udev_run_devd'
  udevd-event[984]: run_program: '/lib/udev/udev_run_devd' returned with status 0
  udevd-event[984]: run_program: '/etc/sysconfig/network-scripts/net.hotplug'
  udevd-event[984]: run_program: '/etc/sysconfig/network-scripts/net.hotplug' returned with status 0
  udevd-event[984]: pass_env_to_socket: passed -1 bytes to socket '/org/freedesktop/hal/udev_event',
  udevd-event[984]: udev_event_run: seq 523 finished
  udevd[390]: udev_done: seq 523, pid [984] exit with 0, 15 seconds old

(Again from comment 59:

> Crash:
> - The VFs are discovered first, they are named eth0 and eth1,
>   respectively,
> - e1000 is then discovered, gets eth2
)

The "/lib/udev/rename_device" helper informs udev about having renamed eth2
(the e1000 NIC) to eth0, supplanting the "original" eth0 (ixgbevf).

Now look at these two udev events in parallel, in the "linear" log from
comment 66:

$ egrep 'eth[02]|\[973\]|\[984\]|Bringing up interface|__tmp' \
  crash-log-2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_3-with-udev-info.txt

Probing by the respective kernel drivers:

  ixgbevf 0000:00:06.0: ixgbevf_probe() registered with name=eth0
  e1000: eth2: e1000_probe: Intel(R) PRO/1000 Network Connection

Probing done, udev runs ([973] is ixgbevf, [984] is e1000):

  udevd[390]: udev_event_run: seq 520 forked, pid [973], 'add' 'net', 1 seconds old
  udevd-event[973]: wait_for_sysfs: file '/sys/class/net/eth0/address' appeared after 0 loops
  udevd-event[973]: run_program: '/lib/udev/rename_device'
  udevd[390]: udev_event_run: seq 523 forked, pid [984], 'add' 'net', 0 seconds old
  udevd-event[984]: wait_for_sysfs: file '/sys/class/net/eth2/address' appeared after 0 loops
  udevd-event[973]: run_program: Waiting 1 seconds for output of '/lib/udev/rename_device(981)'
  udevd-event[973]: run_program: Waiting 2 seconds for output of '/lib/udev/rename_device(981)'
  udevd-event[984]: run_program: '/lib/udev/rename_device'
  udevd-event[973]: run_program: '/lib/udev/rename_device' returned with status 0
  udevd-event[973]: udev_rules_get_name: no node name set, will use kernel name 'eth0'
  udevd-event[973]: pass_env_to_socket: passed -1 bytes to socket '/org/kernel/udev/monitor',
  udevd-event[984]: run_program: '/lib/udev/rename_device' (stdout) 'INTERFACE=eth0'
  udevd-event[973]: run_program: '/lib/udev/udev_run_hotplugd'
  udevd-event[973]: run_program: '/lib/udev/udev_run_hotplugd' returned with status 0
  udevd-event[973]: run_program: '/lib/udev/udev_run_devd'
  udevd-event[973]: run_program: '/lib/udev/udev_run_devd' returned with status 0
  udevd-event[984]: run_program: '/lib/udev/rename_device' (stdout) 'DEVPATH=/class/net/eth0'
  udevd-event[984]: run_program: '/lib/udev/rename_device' returned with status 0
  udevd-event[984]: udev_rules_get_name: no node name set, will use kernel name 'eth2'
  udevd-event[984]: pass_env_to_socket: passed -1 bytes to socket '/org/kernel/udev/monitor',
  udevd-event[973]: run_program: '/etc/sysconfig/network-scripts/net.hotplug'
  udevd-event[973]: run_program: '/etc/sysconfig/network-scripts/net.hotplug' returned with status 0
  udevd-event[973]: pass_env_to_socket: passed -1 bytes to socket '/org/freedesktop/hal/udev_event',
  udevd-event[984]: run_program: '/lib/udev/udev_run_hotplugd'
  udevd-event[984]: run_program: '/lib/udev/udev_run_hotplugd' returned with status 0
  udevd-event[984]: run_program: '/lib/udev/udev_run_devd'
  udevd-event[984]: run_program: '/lib/udev/udev_run_devd' returned with status 0
  udevd-event[973]: udev_event_run: seq 520 finished
  udevd-event[984]: run_program: '/etc/sysconfig/network-scripts/net.hotplug'
  udevd-event[984]: run_program: '/etc/sysconfig/network-scripts/net.hotplug' returned with status 0
  udevd-event[984]: pass_env_to_socket: passed -1 bytes to socket '/org/freedesktop/hal/udev_event',
  udevd[390]: udev_done: seq 520, pid [973] exit with 0, 17 seconds old
  udevd-event[984]: udev_event_run: seq 523 finished
  udevd[390]: udev_done: seq 523, pid [984] exit with 0, 15 seconds old

The two handlers are interleaved.

  Bringing up interface eth0:
  Determining IP information for eth0...ixgbevf 0000:00:06.0: ixgbevf_get_stats() entry / exit
  e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
  8021q: adding VLAN 0 to HW filter on device eth0

"eth0" already denotes the e1000 card here.

  rhel58x64hvm01.localdomain login: eth0: no IPv6 routers present
  ixgbevf 0000:00:06.0: ixgbevf_request_msix_irqs() requesting irq=193 for vector=0 with name=__tmp73526586-rx-0
  ixgbevf 0000:00:06.0: ixgbevf_request_msix_irqs() requesting irq=201 for vector=1 with name=__tmp73526586-tx-0
  8021q: adding VLAN 0 to HW filter on device __tmp73526586

The first ixgbevf has been renamed to "__tmp73526586".

The crash comes from

(gdb) list *(ixgbevf_xmit_frame+0x4b1)
0x6cd7 is in ixgbevf_xmit_frame (drivers/net/ixgbevf/ixgbevf_main.c:3549).
3544            len = min(skb_headlen(skb), total);
3545            while (len) {
3546                    tx_buffer_info = &tx_ring->tx_buffer_info[i];
3547                    size = min(len, (unsigned int)IXGBE_MAX_DATA_PER_TXD);
3548
3549                    tx_buffer_info->length = size;
3550                    tx_buffer_info->mapped_as_page = false;
3551                    tx_buffer_info->dma = pci_map_single(adapter->pdev,
3552                                                         skb->data + offset,
3553                                                         size, PCI_DMA_TODEVICE);

However the "tx_ring->tx_buffer_info" array has been allocated in
ixgbevf_setup_tx_resources() with vzalloc(), whose return value is checked.

  ixgbevf 0000:00:06.0: ixgbevf_setup_tx_resources() entry tx_ring=ffff810001910640
  ixgbevf 0000:00:06.0: ixgbevf_setup_tx_resources() exit (OK)

This appears consistent with ixgbevf_xmit_frame() being invoked on a bogus
net_device structure.


The upstream commit I have in mind is
<http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=7f988eab>.
Again, RHEL-6 has it, RHEL-5 lacks it, and the patch modifies
dev_change_name().

(In upstream the relevant section of dev_change_name() was further modified
by

  72c9528b net: Introduce dev_get_by_name_rcu()

which introduced a problem, and then fixed up by

  372b2312 net: use hlist_del_rcu() in dev_change_name()

But I think those (ie. the dev_get_by_name_rcu() feature) are not important
for us. Another commit related to synchronization is

  c91f6df2 sockopt: Change getsockopt() of SO_BINDTODEVICE to return an
           interface name

however that too added a new feature.)

RHEL-5 call tree:

do_rename() [initscripts-8.45.42/src/rename_device.c]
  ioctl(SIOCSIFNAME)
    dev_ifsioc() [kernel/net/core/dev.c]
      dev_change_name()
        hlist_del()
        dev_name_hash()
        hlist_add_head()

In RHEL-5, dev_change_names() updates the "dev_name_head" hash without
properly locking it with "dev_base_lock" against pure readers.

Unfortunately I can't identify the exact concurrent reader (mentioned as the
caller of __dev_get_by_name() in the 7f988eab), but I think it could be
related to something that "ifconfig" does.

I think we should try a test kernel with 7f988eab backported.

(FWIW, __dev_get_by_name() seems to be called from a lot of places (ie.
"pure readers") without taking "dev_base_lock". That is, the locking
protocol could be violated from both sides!)

Comment 72 Pasi Karkkainen 2013-01-07 15:30:05 UTC
(In reply to comment #70)
> 
> I think we should try a test kernel with 7f988eab backported.
> 
> 

Good analysis! Yeah, I can try that kernel.

Comment 74 Laszlo Ersek 2013-01-08 09:18:42 UTC
Created attachment 674634 [details]
[1/2] instrument dev_change_name()


<this line added to keep git-send-bugzilla happy>
---
 net/core/dev.c |   24 ++++++++++++++++++++----
 1 files changed, 20 insertions(+), 4 deletions(-)

Comment 75 Laszlo Ersek 2013-01-08 09:19:03 UTC
Created attachment 674635 [details]
[2/2] [NET]: Take dev_base_lock when moving device name hash list entry


When we added name-based hashing the dev_base_lock was designated as the
lock to take when changing the name hash list.  Unfortunately, because
it was a preexisting lock that just happened to be taken in the right
spots we neglected to take it in dev_change_name.

The race can affect calles of __dev_get_by_name that do so without taking
the RTNL.  They may end up walking down the wrong hash chain and end up
missing the device that they're looking for.

RHEL-5 backport of upstream Linux commit 7f988eab.
---
 net/core/dev.c |    3 +++
 1 files changed, 3 insertions(+), 0 deletions(-)

Comment 77 Pasi Karkkainen 2013-01-08 14:16:42 UTC
Created attachment 674871 [details]
rhel5 hvm guest kernel crash with 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_4

Comment 78 Pasi Karkkainen 2013-01-08 14:18:08 UTC
Unfortunately the latest patches didn't help :( serial log from the crashing guest attached.

Comment 79 Laszlo Ersek 2013-01-08 15:19:03 UTC
Thanks!

Something to check later: re-enabling the udev info logging and correlating those messages with the new dev_change_name() debug log. Namely,

    dev_change_name: 'eth2' -> 'eth0' enter
    dev_change_name: exit (EEXIST)

This is the first attempt to rename the e1000 NIC (originally namede eth2) to eth0, based on the ifcfg-eth0 script. However 0000:00:06.0 (ixgbevf#0) occupies that name, so

    dev_change_name: 'eth0' -> '__tmp1686290279' enter
    dev_change_name: exit (class_device_rename=0)

that's first "moved out of the way",

    dev_change_name: 'eth2' -> 'eth0' enter
    dev_change_name: exit (class_device_rename=0)

and then we rename the e1000 card successfully (eth2 is free again).

    dev_change_name: 'eth2' -> 'eth0' enter
    dev_change_name: exit (EEXIST)

Here udev tries to rename the netfront NIC (which has come up under the now-free eth2 name) to eth0 (because netfront has the same MAC as e1000).

    dev_change_name: 'eth0' -> 'eth0' enter
    dev_change_name: exit (EEXIST)

A curious attempt. Maybe trying to move eth0 (e1000 currently) out of the way, but the new name for e1000 can't be anything else, because there's a MAC match in ifcfg-eth0?

Comment 80 Pasi Karkkainen 2013-01-08 17:17:29 UTC
Created attachment 674973 [details]
rhel5 hvm guest kernel crash with 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_4 with udev _log info

Comment 81 Pasi Karkkainen 2013-01-08 17:20:23 UTC
(In reply to comment #79)
> 
> Something to check later: re-enabling the udev info logging and correlating
> those messages with the new dev_change_name() debug log.
>

Serial console log with udev_log="info" added. 

This time dom0 was stock rhel 5.9, but that didn't seem to help this issue.
(VF PCI passthrough to HVM guest now works OK with the stock rhel 5.9 Xen).

> 
> A curious attempt. Maybe trying to move eth0 (e1000 currently) out of the
> way, but the new name for e1000 can't be anything else, because there's a
> MAC match in ifcfg-eth0?
>

Yes, there's a MAC-match in ifcfg-eth0 ..

Comment 83 Pasi Karkkainen 2013-01-09 09:47:45 UTC
Created attachment 675360 [details]
rhel5.9 hvm guest kernel crash with 2.6.18-348.el5

Comment 84 Pasi Karkkainen 2013-01-09 09:48:46 UTC
Out of curiosity I just tried stock rhel 5.9 x64 in the hvm guest, and 2.6.18-348.el5 crashes aswell. serial console crash log attached.

Comment 85 Laszlo Ersek 2013-01-09 12:37:37 UTC
From live debugging on Pasi's machine -- the situation leading to the crash is established by infighting between kudzu and udev:
- kudzu generates "alias ethX modname" lines into /etc/modprobe.conf,
- the modules seem to be loaded in that order,
- in case of "alias eth0 e1000" in "modprobe.conf", plus a matching MAC in "ifcfg-eth0", everything works,
- if, however "modprobe.conf" aliases ixgbevf with eth0, then udev will rename the eth2 e1000 card, based on "ifcfg-eth0", to eth0, and eth0 (the first ixgbevf) to __tmpXXXXX,
- bringing up __tmpXXXX with ifconfig crashes the VM.

<lersek> but I think I'm finding a way to work around the crash reliably
<lersek> ie prevent the renaming
<lersek> Putting the following in /etc/modprobe.conf:
<lersek> ----
<lersek> alias eth0 e1000
<lersek> alias eth1 ixgbevf
<lersek> alias eth2 ixgbevf
<lersek> ----
<lersek> *AND*
<lersek> ifcfg-eth0 actually matches the MAC of the e1000 card
<lersek> if this is all set
<lersek> then e1000 is always probed first
<lersek> no renaming happens
<lersek> and all three interfaces work
<lersek> testing the fifth boot with this
<lersek> yeah 5th boot succeeded as well
<lersek> so
<lersek> let's see what happens when eth0 conflict between /etc/modprobe.conf and ifcfg-eth0
<lersek> testing
<lersek> ----------
<lersek> alias eth0 ixgbevf
<lersek> alias eth1 ixgbevf
<lersek> alias eth2 e1000
<lersek> ----------
<pasik> ok
<pasik> that would be quite a simple workaround :)
<pasik> assuming kudzu won't mess it up on next reboot..
<lersek> "__tmp468605363" reestablished immediately :)

Comment 86 Laszlo Ersek 2013-01-09 13:02:31 UTC
If I kill the e1000 eth0 device and remove its driver module, then rename the __tmpXXXXX vf0 to its original name (eth0), then it can be brought up correctly!

ifconfig eth0 down
rmmod e1000
ip link set __tmp1404086329 name eth0
ifconfig eth0 up
# no crash

Something during ixgbevf probing remembers the eth0 name or its interface index or whatever, and it can be brought up only with that name.

Comment 87 Laszlo Ersek 2013-01-09 13:05:13 UTC
In the following test, eth1 is a correctly working, running ixgbevf (it's vf1).

ifconfig eth1 down
ip link set eth1 name ABCDEFGHIJKLMNO # same length
ifconfig ABCDEFGHIJKLMNO up
# CRASH

Comment 88 Laszlo Ersek 2013-01-09 13:16:39 UTC
I repeated the renaming test (with the same ABCDEFGHIJKLMNO name) in comment 87 on my workstation, with a tg3 card. No crash.

I repeated the renaming test in comment 87 inside the VM, on the e1000 card (which, like ixgbevf, supports HW filtering of VLANs). No crash.

The problem seems specific to ixgbevf.

Comment 89 Laszlo Ersek 2013-01-09 13:21:55 UTC
Great idea from Pasi -- it turns out not to be a match/mismatch question, it points to a buffer overflow dependent on name length. I repeated the test in comment 86, but the final name before bringup was "eth9". (I also repeated the test with "ZZZZ" -- length: 4 chars). No crash in either case.

Comment 90 Laszlo Ersek 2013-01-09 13:50:53 UTC
The first name / length that crashes is "x123456789" (10).

Comment 91 Andy Gospodarek 2013-01-09 14:37:22 UTC
(In reply to comment #88)
> I repeated the renaming test (with the same ABCDEFGHIJKLMNO name) in comment
> 87 on my workstation, with a tg3 card. No crash.
> 
> I repeated the renaming test in comment 87 inside the VM, on the e1000 card
> (which, like ixgbevf, supports HW filtering of VLANs). No crash.
> 
> The problem seems specific to ixgbevf.

Does this happen on the host and the guest or just the guest?  (Is the host even running RHEL5?)

Comment 92 Laszlo Ersek 2013-01-09 14:45:37 UTC
The host is running RHEL-5.9, x86_64, xen dom0, and indeed the long ixgbevf name crashes the dom0 kernel as well.

(start pinging host in another xterm)

[root@dom0 ~]# ethtool -i x123456789
driver: ixgbevf
version: 2.1.0-k
firmware-version: N/A
bus-info: 0000:03:10.2
[root@dom0 ~]# sync
[root@dom0 ~]# ifconfig x123456789 up

(no more echo replies & ssh connection freezes).

I'll ask Pasi to reboot the host with the bare metal RHEL-5.9 kernel.

Comment 93 Laszlo Ersek 2013-01-09 14:55:53 UTC
It crashes the bare metal kernel as well.

[root@dom0 ~]# uname -a
Linux dom0.localdomain 2.6.18-348.el5 #1 SMP Wed Nov 28 21:22:00 EST 2012 x86_64 x86_64 x86_64 GNU/Linux

[root@dom0 ~]# ethtool -i eth5
driver: ixgbevf
version: 2.1.0-k
firmware-version: N/A
bus-info: 0000:03:10.2

[root@dom0 ~]# ip link set eth5 name x123456789

[root@dom0 ~]# ethtool -i x123456789
driver: ixgbevf
version: 2.1.0-k
firmware-version: N/A
bus-info: 0000:03:10.2

[root@dom0 ~]# sync
[root@dom0 ~]# ifconfig x123456789 up

# crashed

Comment 94 Laszlo Ersek 2013-01-09 15:07:16 UTC
(Oh earlier I single-stepped ifconfig with gdb; the crash is triggered when ifconfig sets the (IFF_UP | IFF_RUNNING) flags with the SIOCSIFFLAGS ioctl on the netdev with the too long name.)

Comment 95 Andy Gospodarek 2013-01-09 15:13:50 UTC
What's odd is that this is way less than IFNAMSIZ (which is 16).

Can I get on the system and and use serial console to debug?

Comment 96 Pasi Karkkainen 2013-01-09 15:19:36 UTC
(In reply to comment #95)
> 
> Can I get on the system and and use serial console to debug?
>

Sure. I'll fix you an access to it.

Comment 97 Laszlo Ersek 2013-01-09 15:26:04 UTC
I'd like to grab a vmcore from the guest, using xend in the host (this is one situation where virtualization is helpful, since kexec/kdump are not involved; the host saves the guest's memory), then search the vmcore for the interface name.

Comment 98 Laszlo Ersek 2013-01-09 17:41:24 UTC
Analyzing the vmcore. The offending interface name in this case was
"__tmp1650898050". Important parts of the ixgbevf debug log and the crash
are as follows.

Ring allocation at ixgbevf_open() time:

  ixgbevf 0000:00:06.0: ixgbevf_setup_tx_resources() entry
                        tx_ring=ffff81003ee0c9c0
  ixgbevf 0000:00:06.0: ixgbevf_setup_rx_resources() entry
                        rx_ring=ffff81003dc77cc0

Faulting instruction RIP, registers and stack:

  Pid: 1767, comm: ifconfig
  Not tainted 2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_4 #1
  RIP: 0010:[<ffffffff8846913d>]  [<ffffffff8846913d>]
       :ixgbevf:ixgbevf_maybe_stop_tx+0xc/0x7b

  RDX: 0000000000000000 RSI: ffff0078626d3a30 RDI: ffff81003815a000

  Stack:  ffff81003815a000 ffff810033f364c0 0000000000000000 ffffffff8846d90e
   ffff81003815a000 ffff81003815a500 00000000804b1e68 ffffffff883dd9f4
   0000000100000000 ffffffff883db3b0 ffff81003815a000 000000003f4ecc00

  Call Trace:
   <IRQ>  [<ffffffff8846d90e>] :ixgbevf:ixgbevf_xmit_frame+0xe8/0x8b1

This call stack corresponds exactly to what's described in comment 59 and
comment 62 (verified with gdb and "objdump -d".)

Starting the crash utility:

  $ crash rhel58x64hvm01.vmcore \
  /usr/lib/debug/lib/modules/2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_4/vmlinux

Loading the ixgbevf symbols:

  crash> mod -s ixgbevf \
  /usr/lib/debug/lib/modules/2.6.18-308.20.1.el5.bz862862_ixgbevf_crash_4/kernel/drivers/net/ixgbevf/ixgbevf.ko.debug

The first value on the stack (ffff81003815a000) corresponds to the "struct
net_device *netdev" parameter of ixgbevf_maybe_stop_tx():

  crash> struct net_device ffff81003815a000

  struct net_device {
    name = "__tmp1650898050", 
    [...]
    priv = 0xffff81003815a500, 

and the many function pointers check out, they point to valid ixgbevf
functions.

"priv" points at an ixgbevf_adapter structure (function pointers check out
inside again).

  crash> struct ixgbevf_adapter ffff81003815a500

  struct ixgbevf_adapter {
    tx_ring = 0xffff0078626d3a30, 
    rx_ring = 0xffff81003dc77cc0, 

Note that rx_ring ("ffff81003dc77cc0") matches the originally allocated
area, printed by ixgbevf_setup_rx_resources() at the top. The tx_ring value
("ffff0078626d3a30") differs from the initially allocated value; it's
bogus. 

Also, the tx_ring value ("ffff0078626d3a30") can be found in RSI, which is
consistent with the crash RIP (see comment 59 and comment 62).

So where could this tx_ring value come from? Let's take the five lowest
bytes (the ten least significant nibbles), ie. "78626d3a30", and try to
interpret them as an ASCII string. Since we have little endian
representation (lowest address holds least significant byte), we have to
mirror the string: 78626d3a30 --> \x30 \x3a \x6d \x62 \x78

  $ printf '%b\n' '\x30\x3a\x6d\x62\x78'
  0:mbx

This is related to the following two lines in ixgbevf_request_msix_irqs():

    sprintf(adapter->name[vector], "%s:mbx", netdev->name);
    err = request_irq(adapter->msix_entries[vector].vector,
              &ixgbevf_msix_mbx, 0, adapter->name[vector], netdev);

The first statement formats the mailbox interrupt *name* for the VF's
mailbox interrupt. The second line requests an interrupt with that name.

Somehow the formatted vector name tramples over the "tx_ring" field. The
leading "0" in "0" is the trailing "0" in "__tmp1650898050":

  __tmp1650898050:mbx
                ^^^^^
                0:mbx

Comment 99 Laszlo Ersek 2013-01-09 18:15:20 UTC
The fields directly preceding "tx_ring" in the "ixgbevf_adapter" structure are also garbled:

  struct ixgbevf_adapter {
    [...]
    q_vector = {0xffff8100392dbd80, 0xffff8100392db980}, 
    name = {"__tmp1650898050-rx-0\000\000\000\000", "__tmp1650898050-tx-0\000\000\000\000"}, 
    itr_setting = 829451636, 
    eitr_low = 13622, 
    eitr_high = 14384, 
    tx_ring = 0xffff0078626d3a30, 

Notice how there are two elements in "q_vector" (MAX_MSIX_Q_VECTORS == 2), same for "name" (MAX_MSIX_COUNT == 2). The following fields are only ever set like this in the code, in file "ixgbevf_main.c":

  field       dumped value        as ASCII str  value should be
  ----------  ------------        ------------  ---------------
  itr_setting 829451636,          "tmp1"        adapter->itr_setting = 1;
  eitr_low    13622,              "65"          adapter->eitr_low = 10;
  eitr_high   14384,              "08"          adapter->eitr_high = 20;
  tx_ring     0xffff0078626d3a30  "0:mbx"       (as allocated)

Reading the first three fields together yields "tmp16508":

  __tmp1650898050:mbx
    ^^^^^^^^
    tmp16508

This happens (probably) because the subscript called "vector" is out of bounds (ie. it equals 2) in the statement

  sprintf(adapter->name[vector], "%s:mbx", netdev->name);

in ixgbevf_request_msix_irqs(). "vector" is out of bounds due to the preceding loop:

	/* Decrement for Other and TCP Timer vectors */
	q_vectors = adapter->num_msix_vectors - NON_Q_VECTORS;

	/* ... */

	for (vector = 0; vector < q_vectors; vector++) {
		/* ... */
	}

	/* ... */

	sprintf(adapter->name[vector], "%s:mbx", netdev->name);
	err = request_irq(adapter->msix_entries[vector].vector,
			  &ixgbevf_msix_mbx, 0, adapter->name[vector], netdev);

My debug patch logs the value of q_vectors, right between the assignment to it and the loop:

  ixgbevf 0000:00:06.0: ixgbevf_request_msix_irqs() entry
                        num_msix_vectors=3 q_vectors=2

Further, from "drivers/net/ixgbevf/ixgbevf.h",

#define OTHER_VECTOR 1
#define NON_Q_VECTORS (OTHER_VECTOR)

hence "adapter->num_msix_vectors" is 3.

I'll check out upstream and the number-of-vectors logic.

Comment 100 Laszlo Ersek 2013-01-09 18:36:22 UTC
ixgbevf_probe()
  ixgbevf_init_interrupt_scheme()
    ixgbevf_set_interrupt_capability()
      ixgbevf_acquire_msix_vectors()

The last function in the call chain is the only place setting "adapter->num_msix_vectors"; my debug patch logs the following ("ixgbevf 0000:00:06.0" prefix stripped):

ixgbevf_set_interrupt_capability() entry
ixgbevf_set_interrupt_capability() num_rx_queues=1 num_tx_queues=1 v_budget=3
ixgbevf_set_interrupt_capability() v_budget alloc OK
ixgbevf_acquire_msix_vectors() entry: vectors=3
ixgbevf_acquire_msix_vectors() acquiring vectors=3
ixgbevf_acquire_msix_vectors() success
ixgbevf_acquire_msix_vectors(): allocated 3 MSI-X interrupts
ixgbevf_acquire_msix_vectors() exit
ixgbevf_set_interrupt_capability() exit (0)

This logic seems sane, especially the comment & code in ixgbevf_set_interrupt_capability() that determines the "budget":

	/*
	 * It's easy to be greedy for MSI-X vectors, but it really
	 * doesn't do us much good if we have a lot more vectors
	 * than CPU's.  So let's be conservative and only ask for
	 * (roughly) twice the number of vectors as there are CPU's.
	 */
	v_budget = min(adapter->num_rx_queues + adapter->num_tx_queues,
		       (int)(num_online_cpus() * 2)) + NON_Q_VECTORS;

So, we have
- one tx queue with a dedicated MSI-X interrupt,
- one rx queue -||-,
- and a non-queue vector (the mailbox vector) with the third MSI-X interrupt.

I think the logic is sane. What's wrong is the name formatting / name storage for the mbx vector.

Note that even with shorter (ie. non-crashing) interface names, we *can* overwrite part of itr_setting / eitr_low / eitr_high in the ixgbevf_adapter structure. However as far as I can see (it's really not my cup of tea), those are mostly performance related settings, so it's not easy to notice when they are garbled. However if the iface name is long enough, we overwrite a pointer.

Comment 101 Laszlo Ersek 2013-01-09 18:46:02 UTC
This is the problem [drivers/net/ixgbevf/ixgbevf.h]:

  #define MAX_MSIX_Q_VECTORS 2
  #define MAX_MSIX_COUNT 2

  #define MIN_MSIX_Q_VECTORS 2
  #define MIN_MSIX_COUNT (MIN_MSIX_Q_VECTORS + NON_Q_VECTORS)

  struct ixgbevf_adapter {
          char name[MAX_MSIX_COUNT][IFNAMSIZ + 9];


"MAX_MSIX_COUNT" should be defined similarly to "MIN_MSIX_COUNT", that is,

  #define MAX_MSIX_COUNT 2 (MIN_MSIX_Q_VECTORS + NON_Q_VECTORS)

Grepping the "kernel/drivers/net/ixgbevf" directory for MAX_MSIX_COUNT, the only hits are

ixgbevf.h:181:#define MAX_MSIX_COUNT 2
ixgbevf.h:195:  char name[MAX_MSIX_COUNT][IFNAMSIZ + 9];

So this should be safe.

Comment 102 Don Dutile (Red Hat) 2013-01-09 18:58:32 UTC
Try again?: (from previous comment)
  #define MAX_MSIX_COUNT 2 (MIN_MSIX_Q_VECTORS + NON_Q_VECTORS)

meaning, you have a typo & you wanted to delete the '2' above ?

Comment 103 Laszlo Ersek 2013-01-09 19:02:53 UTC
Upstream fixed the error "inadvertently", in <http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git;a=commitdiff;h=fa71ae27>. We'll have to fix this in a RHEL-5 specific way; I should write a patch for comment 101.

Regarding RHEL-6, Andy already backported upstream commit fa71ae27 as

commit 24a9e7c2be6700312f18ac9952abdddb1964e3dc
Author: Andy Gospodarek <gospo>
Date:   Wed Oct 3 00:44:40 2012 -0400

    [netdrv] ixgbevf: Move Tx clean-up into NAPI context

(bug 819105).

Comment 104 Laszlo Ersek 2013-01-09 19:05:41 UTC
(In reply to comment #102)
> Try again?: (from previous comment)
>   #define MAX_MSIX_COUNT 2 (MIN_MSIX_Q_VECTORS + NON_Q_VECTORS)
> 
> meaning, you have a typo & you wanted to delete the '2' above ?

Ah sorry, I was fighting some history back&forth in Firefox while typing that comment, and I either forgot to update macro expansion after the copy-pasta, or Firefox tricked me. So, what I'm suggesting is:

  #define MAX_MSIX_COUNT (MAX_MSIX_Q_VECTORS + NON_Q_VECTORS)

Comment 105 Laszlo Ersek 2013-01-09 19:31:38 UTC
Created attachment 675802 [details]
ixgbevf: allocate room for the mailbox MSI-X interrupt's name (proposed patch)

Without this change, the

  sprintf(adapter->name[vector], "%s:mbx", netdev->name);

statement in ixgbevf_request_msix_irqs() tramples over

- adapter->itr_setting,
- adapter->eitr_low,
- adapter->eitr_high,
- adapter->tx_ring,

dependent on the net_device (interface) name length. For short names the
problem is (apparently) not catastrophic, because the first three fields
above seem to influence performance only. The fourth field ("tx_ring") is
a pointer however, and a long enough iface name (in our testing: at least
10 non-NUL chars) will cause a crash.

Upstream fixed the problem on the side with commit fa71ae27, hence the
patch is RHEL-5 specific.

(Andy backported fa71ae27 to RHEL-6 (commit 24a9e7c2).)

Signed-off-by: Laszlo Ersek <lersek>

 drivers/net/ixgbevf/ixgbevf.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Comment 106 Andy Gospodarek 2013-01-09 19:41:01 UTC
Nice find, Laszlo!  I was looking at this before I needed to step away for lunch and was quite sure the interrupt name was related.  Mostly because I booted with pci=nomsi and found that the system failed differently *and* that it was only after the name was longer than 10 characters that it became a problem.

Comment 107 Andy Gospodarek 2013-01-09 20:42:24 UTC
So it doesn't look like I can bring up an ixgbevf-based interface when booting with pci=nomsi on the kernel command-line:

# ifconfig eth2 up 
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: 
 [<ffffffff8825dbfa>] :ixgbevf:ixgbevf_open+0x34e/0x48b
PGD 3e0d6067 PUD 3c8c0067 PMD 0 
Oops: 0000 [1] SMP 
last sysfs file: /block/ram0/dev
CPU 0 
Modules linked in: lockd sunrpc ip_conntrack_netbios_ns ipt_REJECT xt_state ip_conntrack nfnetlink iptable_filter ip_tables ip6t_REJECT xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 xfrm_nalgo crypto_api dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy xen_vnif xen_vbd joydev xen_balloon ixgbevf xen_platform_pci ide_cd cdrom i2c_piix4 i2c_core e1000 serio_raw pcspkr tpm_tis tpm tpm_bios 8021q dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ata_piix libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 1987, comm: ifconfig Not tainted 2.6.18-348.el5 #1
RIP: 0010:[<ffffffff8825dbfa>]  [<ffffffff8825dbfa>] :ixgbevf:ixgbevf_open+0x34e/0x48b
RSP: 0018:ffff81003e899d78  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff81003db985b0 RCX: ffff81003db985b0
RDX: 0000000000000000 RSI: ffffffff8825d7c4 RDI: ffff81003db985b8
RBP: ffff81003db98500 R08: ffff81003db98000 R09: 0000000000000032
R10: 00000000000104c2 R11: ffffffff8825a4c8 R12: 0000000000000000
R13: ffff81003db98000 R14: 0000000000000000 R15: 0000000000000000
FS:  00002b25b77d25e0(0000) GS:ffffffff80435000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000003c821000 CR4: 00000000000006a0
Process ifconfig (pid: 1987, threadinfo ffff81003e898000, task ffff81003dcd5040)
Stack:  ffffffff00000000 ffff81003db98000 0000000000001043 0000000000001002
 0000000000000000 ffff81003db98000 0000000000000000 ffffffff8023c8fa
 ffff81003db98000 ffffffff8023aa07 ffff81003e899e18 00000000ffffffed
Call Trace:
 [<ffffffff8023c8fa>] dev_open+0x58/0x9c
 [<ffffffff8023aa07>] dev_change_flags+0x5a/0x119
 [<ffffffff8026f5f0>] devinet_ioctl+0x235/0x59c
 [<ffffffff80231866>] sock_ioctl+0x1c1/0x1e5
 [<ffffffff8004247e>] do_ioctl+0x21/0x6b
 [<ffffffff800304da>] vfs_ioctl+0x457/0x4b9
 [<ffffffff800baee0>] audit_syscall_entry+0x1a8/0x1d3
 [<ffffffff8004c882>] sys_ioctl+0x59/0x78
 [<ffffffff8005d29e>] tracesys+0xd5/0xdf


Code: 0f b7 38 e8 e9 12 e6 f7 85 c0 41 89 c4 74 52 41 8d 5e ff eb 
RIP  [<ffffffff8825dbfa>] :ixgbevf:ixgbevf_open+0x34e/0x48b
 RSP <ffff81003e899d78>
CR2: 0000000000000000
 <0>Kernel panic - not syncing: Fatal exception

We should fix this while we are in there fixing the original problem.

Comment 108 Laszlo Ersek 2013-01-09 21:05:09 UTC
Regarding comment 107 (crash with pci=nomsi), we've run into that in the RHEL-6 guest sibling bug before (as part of debugging): see bug 849223 comment 26.

We discussed this on IRC with Andy, and (if I understand correctly) he's OK with (possibly) filing a separate BZ for the pci=nomsi problem (a BZ that's unrelated to Xen, or virtualization in general).

So I'm going to post this patch now to RHKL. Thanks everyone for the help!

Comment 109 Don Dutile (Red Hat) 2013-01-09 21:28:29 UTC
(In reply to comment #108)
> Regarding comment 107 (crash with pci=nomsi), we've run into that in the
> RHEL-6 guest sibling bug before (as part of debugging): see bug 849223
> comment 26.
> 
> We discussed this on IRC with Andy, and (if I understand correctly) he's OK
> with (possibly) filing a separate BZ for the pci=nomsi problem (a BZ that's
> unrelated to Xen, or virtualization in general).
> 
> So I'm going to post this patch now to RHKL. Thanks everyone for the help!

Per SRIOV spec, VFs can *only* use/generate MSI or MSI-X interrupts.
Thus, booting a host or guest with pci=nomsi results in no VF support.

We recently went through an interesting 'exercise' when a vendor's VF generated an unexpected signal interrupt instead of MSI(-X) interrupt;  not a pleasant outcome! ;-)

Comment 110 Andy Gospodarek 2013-01-09 21:41:09 UTC
A backport of the following upstream patch will resolve the problem in comment #107.

commit e45dd5fe654b2f8e52ea9f8dacfe2b2444c87c07
Author: Jakub Kicinski <jakub.kicinski>
Date:   Tue Nov 13 04:03:16 2012 +0000

    ixgbevf: make sure probe fails on MSI-X enable error
    
    This driver cannot work without MSI-X interrupts
    so there is no mechanism to fall back to.

Comment 115 Pasi Karkkainen 2013-01-13 16:00:00 UTC
I can confirm the patch in comment #105 fixes the bug. When running kernel 2.6.18-348.el5.bz862862_ixgbevf_mbx_name I'm able to run "ifconfig __tmp503635005 0.0.0.0 up" and it works fine, no crash.

# grep tmp /proc/interrupts
185:       4819       PCI-MSI-X  __tmp503635005-rx-0
193:        147       PCI-MSI-X  __tmp503635005-tx-0
201:         12       PCI-MSI-X  __tmp503635005:mbx

Thanks a lot for all the debugging and persistence to get this fixed!

Comment 116 Laszlo Ersek 2013-01-14 11:28:30 UTC
Pasi,

thank you very much for checking the many test packages and ultimately permitting us access the box directly; it's been crucial.

Laszlo

Comment 124 Phillip Lougher 2013-05-22 00:06:34 UTC
Patch(es) available in kernel-2.6.18-357.el5
You can download this test kernel (or newer) from http://people.redhat.com/plougher/el5/
Detailed testing feedback is always welcomed.
If you require guidance regarding testing, please ask the bug assignee.

Comment 125 Yulong Pei 2013-07-04 10:52:57 UTC
[root@ibm-x3650m4-03 ~]# modprobe -r ixgbe
[root@ibm-x3650m4-03 ~]# lsmod |grep ixgbe
[root@ibm-x3650m4-03 ~]# modprobe ixgbe max_vfs=2

[root@ibm-x3650m4-03 ~]# lspci |grep 82599
1b:00.0 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
1b:00.1 Ethernet controller: Intel Corporation 82599EB 10-Gigabit SFI/SFP+ Network Connection (rev 01)
1b:10.0 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
1b:10.1 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
1b:10.2 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)
1b:10.3 Ethernet controller: Intel Corporation 82599 Ethernet Controller Virtual Function (rev 01)

[root@ibm-x3650m4-03 ~]# ethtool -i eth14
driver: ixgbevf
version: 2.1.0-k
firmware-version: N/A
bus-info: 0000:1b:10.0

[root@ibm-x3650m4-03 ~]# ifconfig eth14 up
[root@ibm-x3650m4-03 ~]# ethtool eth14
Settings for eth14:
	Supported ports: [ ]
	Supported link modes:   10000baseT/Full 
	Supports auto-negotiation: No
	Advertised link modes:  Not reported
	Advertised auto-negotiation: No
	Speed: 10000Mb/s
	Duplex: Full
	Port: Unknown! (255)
	PHYAD: 0
	Transceiver: Unknown!
	Auto-negotiation: off
	Current message level: 0x00000007 (7)
	Link detected: yes

[root@ibm-x3650m4-03 ~]# ifconfig eth14 down
[root@ibm-x3650m4-03 ~]# ip link set eth14 name x123456789
[root@ibm-x3650m4-03 ~]# ethtool -i x123456789
driver: ixgbevf
version: 2.1.0-k
firmware-version: N/A
bus-info: 0000:1b:10.0
[root@ibm-x3650m4-03 ~]# ethtool x123456789
Settings for x123456789:
	Supported ports: [ ]
	Supported link modes:   10000baseT/Full 
	Supports auto-negotiation: No
	Advertised link modes:  Not reported
	Advertised auto-negotiation: No
	Speed: 10000Mb/s
	Duplex: Full
	Port: Unknown! (255)
	PHYAD: 0
	Transceiver: Unknown!
	Auto-negotiation: off
	Current message level: 0x00000007 (7)
	Link detected: no
[root@ibm-x3650m4-03 ~]# ifconfig x123456789 up
[root@ibm-x3650m4-03 ~]# 

no panic found, set it to verified.

Comment 127 errata-xmlrpc 2013-09-30 23:34:44 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-1348.html