Bug 503851 - Host crash when running through NIC with LRO - tun
Summary: Host crash when running through NIC with LRO - tun
Keywords:
Status: CLOSED DUPLICATE of bug 483646
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.4
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Herbert Xu
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2009-06-03 01:22 UTC by Mark Wagner
Modified: 2010-05-21 10:58 UTC

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-05-21 10:58:56 UTC
Target Upstream Version:
Embargoed:


Attachments
Patch to use GRO instead of LRO in be2net (10.19 KB, patch)
2009-07-17 17:25 UTC, Subbu Seetharaman

Description Mark Wagner 2009-06-03 01:22:16 UTC
Description of problem:
When sending traffic to a guest that is bridged to a NIC running with LRO enabled, the host crashes.


Version-Release number of selected component (if applicable):
kernel 2.6.18-150

How reproducible:
Every time

Steps to Reproduce:
1. Configure a KVM guest to use an LRO-capable NIC
2. Bring up the guest
3. Try to ssh to the guest from an external box over a link that supports LRO
  
Actual results:
Host crashes

Expected results:
It shouldn't crash the host

Additional info:
Kernel BUG at drivers/net/tun.c:487
invalid opcode: 0000 [1] SMP 
last sysfs file: /devices/pci0000:00/0000:00:00.0/irq
CPU 15 
Modules linked in: iptable_raw iptable_nat ip_nat ip_conntrack nfnetlink iptable_filter ip_tables x_tables tun bridge ipv6 xfrm_nalgo crypto_api autofs4 hidp rfcomm l2cap bluetooth sunrpc cpufreq_ondemand acpi_cpufreq freq_table dm_multipath scsi_dh video hwmon backlight sbs i2c_ec button battery asus_acpi acpi_memhotplug ac parport_pc lp parport ksm(U) kvm_intel(U) kvm(U) joydev sr_mod cdrom shpchp igb i2c_i801 sg i2c_core ixgbe serio_raw pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 8482, comm: qemu-kvm Tainted: G      2.6.18-151.el5 #1
RIP: 0010:[<ffffffff884ce7ab>]  [<ffffffff884ce7ab>] :tun:tun_chr_readv+0x2b1/0x3a6
RSP: 0018:ffff81031e71de48  EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff81031e71de98 RCX: 0000000041551405
RDX: ffff8101b0a32580 RSI: ffff81031e71de9e RDI: ffff81031e71de92
RBP: 0000000000010ff6 R08: 0000000000000000 R09: 0000000000000001
R10: ffff81031e71de94 R11: 0000000000000048 R12: ffff8101b0e0fa80
R13: ffff8101bc091d00 R14: 0000000000000000 R15: ffff81031e71def8
FS:  00002afbce153fc0(0000) GS:ffff81033fcbf0c0(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00002b714c976000 CR3: 00000001aee46000 CR4: 00000000000026e0
Process qemu-kvm (pid: 8482, threadinfo ffff81031e71c000, task ffff81033d5c07e0)
Stack:  ffff8101bd60bb00 ffff8101bb220e80 0000000000000000 ffff81033d5c07e0
 ffffffff8008cd53 ffff8101bc091d28 ffff8101bc091d28 ffff8101bca56ed0
 0000010000420000 0000000000000000 0000563412005452 0000000000000000
Call Trace:
 [<ffffffff8008cd53>] default_wake_function+0x0/0xe
 [<ffffffff884ce8ba>] :tun:tun_chr_read+0x1a/0x1f
 [<ffffffff8000bd4d>] vfs_read+0xcb/0x171
 [<ffffffff800121f6>] sys_read+0x45/0x6e
 [<ffffffff8005e28d>] tracesys+0xd5/0xe0


Code: 0f 0b 68 90 f4 4c 88 c2 e7 01 f6 42 0a 08 74 0c 80 4c 24 41 
RIP  [<ffffffff884ce7ab>] :tun:tun_chr_readv+0x2b1/0x3a6
 RSP <ffff81031e71de48>
 <0>Kernel panic - not syncing: Fatal exception
 <0>Rebooting in 10 seconds..



And here's the BUG:


                        if (sinfo->gso_type & SKB_GSO_TCPV4)
                                gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV4;
                        else if (sinfo->gso_type & SKB_GSO_TCPV6) 
                                gso.gso_type = VIRTIO_NET_HDR_GSO_TCPV6;
                        else
                                BUG();

And here's some quick debugging:

tun_put_user: skb: hdr_len 66 gso_size 256, gso_type 0
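
Editorial sketch of why the debug line above leads to the BUG(): a packet with a non-zero gso_size but gso_type 0 matches neither the TCPv4 nor the TCPv6 branch. The user-space code below only mirrors the quoted branch logic. Every sketch_/SKETCH_ name is hypothetical, the constant values are illustrative stand-ins rather than values copied from the RHEL5 headers, and the outer gso_size check is an assumption about the context surrounding the excerpt.

```c
#include <stdint.h>

/* Illustrative stand-ins for the kernel's SKB_GSO_* flags (assumed values). */
enum {
    SKETCH_SKB_GSO_TCPV4 = 1 << 0,
    SKETCH_SKB_GSO_TCPV6 = 1 << 4
};

/* Illustrative stand-ins for VIRTIO_NET_HDR_GSO_* (assumed values). */
enum {
    SKETCH_VNET_GSO_NONE  = 0,
    SKETCH_VNET_GSO_TCPV4 = 1,
    SKETCH_VNET_GSO_TCPV6 = 4
};

/* Hypothetical, reduced view of the two fields printed in the debug line. */
struct sketch_shinfo {
    uint16_t gso_size;
    uint16_t gso_type;
};

/* Signature seen in the debug output: segment size set, but no GSO type. */
static int sketch_looks_like_lro(const struct sketch_shinfo *sinfo)
{
    return sinfo->gso_size != 0 && sinfo->gso_type == 0;
}

/*
 * Mirrors the quoted branch. Returns 0 and stores the virtio type in *out,
 * or returns -1 (leaving *out untouched) where the quoted code calls BUG().
 */
static int sketch_map_gso_type(const struct sketch_shinfo *sinfo, uint8_t *out)
{
    if (sinfo->gso_size == 0) {
        /* Not a GSO packet: the quoted branch would not be reached. */
        *out = SKETCH_VNET_GSO_NONE;
        return 0;
    }
    if (sinfo->gso_type & SKETCH_SKB_GSO_TCPV4)
        *out = SKETCH_VNET_GSO_TCPV4;
    else if (sinfo->gso_type & SKETCH_SKB_GSO_TCPV6)
        *out = SKETCH_VNET_GSO_TCPV6;
    else
        return -1; /* gso_size 256, gso_type 0 lands here */
    return 0;
}
```

With the values from the debug line (gso_size 256, gso_type 0), sketch_map_gso_type() takes the final else branch, which corresponds to the BUG() in the quoted code.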

Comment 1 Herbert Xu 2009-06-17 07:54:04 UTC
Suggested remedy: Convert all LRO drivers to GRO.

Comment 2 Herbert Xu 2009-06-29 03:01:45 UTC
Here's the list of remaining LRO drivers in RHEL5.  Do we have hardware for these so I can test it after converting them to GRO?

enic
ehea
mlx4
benet
s2io

If no hardware is available, I recommend that we disable LRO on these by default.

Comment 3 Andy Gospodarek 2009-06-29 13:06:26 UTC
I have enic, benet (be2net), and s2io.  Someone has access to ehea since patches get posted and there are mlx4 cards floating around.

We could certainly try and disable LRO, but I think we should really be carrying these patches too since I think I'm seeing reports of a panic with GRO and bridging too.

http://people.redhat.com/agospoda/rhel5/0049-lro-add-check-to-warn-if-forwarding-on-devices-that.patch
http://people.redhat.com/agospoda/rhel5/0131-tun-fix-LRO-crash.patch

Comment 4 Herbert Xu 2009-06-29 14:03:26 UTC
1. I certainly am not against carrying those patches if they're already upstream.

2. Can you point me to the GRO crashes that you saw?

3. If I give you GRO patches for enic, benet, s2io could you test them for me?

Thanks!

Comment 5 Andy Gospodarek 2009-06-29 14:21:46 UTC
(In reply to comment #4)
> 1. I certainly am not against carrying those patches if they're already
> upstream.

Of course they are.  Look at them -- they will be quite familiar. :)

> 2. Can you point me to the GRO crashes that you saw?
> 

It was in Issue-Tracker last week, but I told the person to open a bug and it looks like someone did:

https://bugzilla.redhat.com/show_bug.cgi?id=507189

> 3. If I give you GRO patches for enic, benet, s2io could you test them for me?

I was planning to mail the be2net-based card to Westford today, but there is a chance it could be ready in a day or two for testing up there.

You do know that none of those drivers support GRO upstream, right?

Comment 6 Herbert Xu 2009-06-29 14:48:32 UTC
(In reply to comment #5)
>
> Of course they are.  Look at them -- they will be quite familiar. :)

Please post them for RHEL5 then.

> > 2. Can you point me to the GRO crashes that you saw?
> > 
> 
> It was in Issue-Tracker last week, but I told the person to open a bug and it
> looks like someone did:
> 
> https://bugzilla.redhat.com/show_bug.cgi?id=507189

Which is already fixed in RHEL5.
 
> > 3. If I give you GRO patches for enic, benet, s2io could you test them for me?
> 
> I was planning to mail the be2net-based card to Westford today, but there is a
> chance it could be ready in a day or two for testing up there.
> 
> You do know that none of those drivers support GRO upstream, right?  

Well you could test the upstream patches too if you have the time :)

Comment 7 Andy Gospodarek 2009-06-29 15:01:12 UTC
(In reply to comment #6)
> (In reply to comment #5)
> >
> > Of course they are.  Look at them -- they will be quite familiar. :)
> 
> Please post them for RHEL5 then.
> 

If you are going to get rid of all LRO, then there is really no need (especially since 507189 isn't what I thought it might be).

> > > 2. Can you point me to the GRO crashes that you saw?
> > > 
> > 
> > It was in Issue-Tracker last week, but I told the person to open a bug and it
> > looks like someone did:
> > 
> > https://bugzilla.redhat.com/show_bug.cgi?id=507189
> 
> Which is already fixed in RHEL5.
> 

I see that now -- I didn't look at the bug before I sent it to you since I only had the issue-tracker ticket open from Friday and it was not linked to the bz at that point.  Glad it is resolved.

> > > 3. If I give you GRO patches for enic, benet, s2io could you test them for me?
> > 
> > I was planning to mail the be2net-based card to Westford today, but there is a
> > chance it could be ready in a day or two for testing up there.
> > 
> > You do know that none of those drivers support GRO upstream, right?  
> 
> Well you could test the upstream patches too if you have the time :)  

I can probably try them, but I would like to put these cards in the mail today, so be2net will probably not get tested for a day or two.

Comment 8 Andy Gospodarek 2009-06-29 15:10:08 UTC
Mark, can you give my test kernels a try?  They have two patches which will probably help with cards still stuck using LRO.

http://people.redhat.com/agospoda/#rhel5

Comment 9 Subbu Seetharaman 2009-07-01 11:14:22 UTC
be2net has no GRO upstream yet.  If Herbert has a patch, I can test it immediately.  If no one has a GRO patch yet, we can work on that.

Thanks.

Subbu

Comment 10 Mark Wagner 2009-07-01 13:57:36 UTC
Andy, the only card on the list that I have access to is the s2io.  My card is a PCI-e x4 so full GRO performance will.  Also, given our machine configs, it will take a week or two to get the systems reconfigured to do this.

Comment 11 Subbu Seetharaman 2009-07-16 12:57:01 UTC
We have a GRO port of be2net ready now.  Will it help if we submit a patch for GRO?

Comment 12 Herbert Xu 2009-07-16 13:17:40 UTC
Yes please submit it upstream.  Thanks!

Comment 13 Subbu Seetharaman 2009-07-17 17:25:11 UTC
Created attachment 354179
Patch to use GRO instead of LRO in be2net

Upstream patch could not be tested today due to a disk failure.  Will be doing it shortly.  Patch against el5.158 driver source is attached.  Limited testing has been done.  More testing to follow.

Subbu

Comment 14 Subbu Seetharaman 2009-07-21 17:13:02 UTC
Upstream patch to replace LRO with GRO in be2net was submitted today.

Subbu

Comment 15 Herbert Xu 2010-05-21 10:58:56 UTC
As we already have the following patch in RHEL5, I think we can close this bug.

commit d6543abe29bb59e0a6109d7f4c13384bfdf96d21
Author: Andy Gospodarek <gospo>
Date:   Thu Sep 10 16:26:48 2009 -0400

    [net] bridge: fix LRO crash with tun

*** This bug has been marked as a duplicate of bug 483646 ***

