Bug 1040128

Summary: IPv6 very slow speed on some computers
Product: [Fedora] Fedora
Reporter: Enrique V. Bonet Esteban <enrique.bonet>
Component: kernel
Assignee: Neil Horman <nhorman>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: high
Docs Contact:
Priority: unspecified
Version: 19
CC: enrique.bonet, gansalmon, itamar, jonathan, kernel-maint, madhu.chinakonda, michele, nhorman
Target Milestone: ---
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: kernel-3.12.8-300.fc20
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2014-01-20 03:04:16 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Attachments (description, then flags):
Output of lspci and ethtool commands: none
Output of the iperf command (IPv4 and IPv6): none
Computer description and iperf test result: none
SLAAC capture traffic: none
patch to ensure route expiration is set: none

Description Enrique V. Bonet Esteban 2013-12-10 19:18:45 UTC
Description of problem:

I updated the kernel on my computers a few days ago to
kernel-3.11.9-100.fc18.x86_64 and today to kernel-3.11.10-100.fc18.x86_64.
Two of the computers work correctly on IPv4, but on IPv6 the speed is very
slow.

I have set up the following test:

Computer A: Red Hat Enterprise Linux 6.5 with a 1 Gbit connection and IPv6
address X:X:X:222::2.
Computer B1, B2, ...: Personal computers with IPv6 and a 1 Gbit connection.

All computers connected to the same switch.

I run the command iperf -V -s on the Red Hat server.

If I run the command iperf -V -c X:X:X:222::2 on the computers one after
another, most obtain a speed of 900-1000 Mbits/sec, but two computers obtain
only 60-70 Mbits/sec.

If I reverse the test* for the two slow computers, the speed is
900-1000 Mbits/sec.

* Computer B executes iperf -V -s and computer A executes iperf -V -c.
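
A minimal sketch of how the fast and slow results described above could be compared automatically. The helper names iperf_mbits and classify_rate and the 500 Mbits/sec threshold are illustrative assumptions, not part of the original report; the parsing assumes the iperf 2 summary line format shown elsewhere in this bug.

```shell
# Extract the Mbits/sec figure from iperf 2 client output read on stdin.
# Assumes the summary line ends in "<number> Mbits/sec", as in this report.
iperf_mbits() {
    awk '/Mbits\/sec/ { rate = $(NF - 1) } END { if (rate != "") print rate }'
}

# Hypothetical helper: print "slow" when the rate is below a threshold
# (second argument, default 500 Mbits/sec), otherwise "ok".
classify_rate() {
    awk -v rate="$1" -v min="${2:-500}" \
        'BEGIN { print ((rate + 0 < min + 0) ? "slow" : "ok") }'
}

# Usage (needs an iperf server started with "iperf -V -s"):
#   rate=$(iperf -V -c X:X:X:222::2 -t 10 | iperf_mbits)
#   classify_rate "$rate"
```

Running the same two helpers on the IPv4 and IPv6 results of each client gives a quick per-machine comparison without reading the full iperf banner.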

The kernels work fine in motherboards:

Asus P6T WS PRO
Intel DP35DP
Intel S3210SH
Dell PowerEdge 840

But fail in motherboards:

Asus P8P67 LE
Asus P8H77-V LE

Version-Release number of selected component (if applicable):

kernel-3.11.9-100 and kernel-3.11.10-100

How reproducible:

Install these kernels on a computer with an Asus P8P67 LE or Asus P8H77-V LE
motherboard and execute the command:

iperf -V -c <IPv6 iperf server address>

The speed is very slow.

Steps to Reproduce:
1. Install kernel 3.11.9-100 or 3.11.10-100 on an Asus P8P67 LE or Asus
P8H77-V LE motherboard.
2. Execute the command iperf -V -c <IPv6 iperf server address> over a 1 Gbit
connection.
3. The speed is 60-70 Mbits/sec.

Actual results:

The command iperf -V -c <IPv6 iperf server address> reports 60-70 Mbits/sec.

Expected results:

The command iperf -V -c <IPv6 iperf server address> reports 900-1000 Mbits/sec.

Additional info:

On IPv4 the computers work fine and iperf -c <IPv4 iperf server address>
reports 900-1000 Mbits/sec on all computers.

Comment 1 Enrique V. Bonet Esteban 2013-12-20 12:22:39 UTC
I have updated the systems to Fedora 19 (kernel 3.11.10-200.fc19.x86_64) and
the problem remains.

Comment 2 Enrique V. Bonet Esteban 2013-12-23 09:01:40 UTC
I have updated the kernel to version 3.12.5-200.fc19.x86_64 and the problem
remains.

Comment 3 Michele Baldessari 2014-01-02 20:34:03 UTC
Can we get an lspci from the two bad computers? Also the output of
ip a
ethtool -i <interface>
ethtool -k <interface>

Thanks,
Michele

Comment 4 Enrique V. Bonet Esteban 2014-01-03 15:04:42 UTC
Created attachment 844978 [details]
Output of lspci and ethtool commands

Comment 5 Enrique V. Bonet Esteban 2014-01-03 15:05:55 UTC
The new kernel 3.12.6-200.fc19.x86_64 did not solve the problem.

Comment 6 Neil Horman 2014-01-03 15:18:03 UTC
do you have a binary tcpdump capture of the iperf session (and possibly the ipv4 variant of the session for comparison)?

Comment 7 Enrique V. Bonet Esteban 2014-01-03 21:26:22 UTC
Created attachment 845088 [details]
Output of the iperf command (IPv4 and IPv6)

Hi,

I ran the iperf commands; their output is:

[root@amparo ~]# iperf -c 147.156.223.157 -t 1
------------------------------------------------------------
Client connecting to 147.156.223.157, TCP port 5001
TCP window size: 22.9 KByte (default)
------------------------------------------------------------
[  3] local 147.156.222.34 port 42647 connected with 147.156.223.157 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   112 MBytes   939 Mbits/sec
[root@amparo ~]# iperf -V -c 2001:720:1014:222::2 -t 1
------------------------------------------------------------
Client connecting to 2001:720:1014:222::2, TCP port 5001
TCP window size: 22.7 KByte (default)
------------------------------------------------------------
[  3] local 2001:720:1014:222:f66d:4ff:fe09:8938 port 34195 connected with 2001:720:1014:222::2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  8.38 MBytes  70.2 Mbits/sec

I captured the traffic with:

tcpdump -i p5p1 -w IPv4 host 147.156.223.157
tcpdump -i p5p1 -w IPv6 host 2001:720:1014:222::2

And the files IPv4 and IPv6 are zipped in the data.zip file.

Thanks,

Enrique

Comment 8 Michele Baldessari 2014-01-04 16:58:14 UTC
IPv6 traffic seems to "pause" every ~0.2s, hence the bad performance.

You mention that:
"""
The kernels work fine in motherboards:
Asus P6T WS PRO
Intel DP35DP
Intel S3210SH
Dell PowerEdge 840

But fail in motherboards:

Asus P8P67 LE
Asus P8H77-V LE
"""

Do I understand correctly that it is the same r8169 card in all those motherboards? Or do the other motherboards have different NICs? (In which
case the issue would seem to be more r8169 related)
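
One way to look for pauses like the ~0.2 s ones described in this comment is to print the capture with epoch timestamps (tcpdump -tt) and flag large gaps between consecutive packets. This is only a sketch: the helper name find_gaps and the 0.1 s default threshold are assumptions chosen for illustration.

```shell
# Print inter-packet gaps longer than a threshold in seconds (default 0.1).
# Reads lines on stdin whose first field is an epoch timestamp, as printed
# by "tcpdump -tt -n -r <capture file>".
find_gaps() {
    awk -v min="${1:-0.1}" '
        NR > 1 && ($1 - prev) > min {
            printf "%.3f gap before packet %d\n", $1 - prev, NR
        }
        { prev = $1 }
    '
}

# Usage (capture file name "IPv6" as written by the tcpdump command above):
#   tcpdump -tt -n -r IPv6 | find_gaps 0.1
```

Regularly spaced gaps in the IPv6 capture, with none in the IPv4 capture, would match the behaviour reported here.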

Comment 9 Enrique V. Bonet Esteban 2014-01-04 20:53:41 UTC
Hi Michele,

No, only the Asus P6T WS PRO, the Asus P8P67 LE and the ASUS P8H77-V LE have
similar NICs:

Asus P6T WS PRO -> Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express
Gigabit Ethernet Controller (rev 02)
Asus P8P67 LE -> Realtek Semiconductor Co., Ltd. RTL8111/8168 PCI Express
Gigabit Ethernet controller (rev 09)
Asus P8H77-V LE -> Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express
Gigabit Ethernet Controller (rev 06)

But the computer that has an Asus P6T WS PRO motherboard (slopez) works fine:

[root@slopez ~]# iperf -V -c 2001:720:1014:222::2 -t 1
------------------------------------------------------------
Client connecting to 2001:720:1014:222::2, TCP port 5001
TCP window size: 22.7 KByte (default)
------------------------------------------------------------
[  3] local 2001:720:1014:88:22cf:30ff:fef1:a3df port 36165 connected with 2001:720:1014:222::2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   110 MBytes   922 Mbits/sec

That is why I listed all the motherboard models I had tested. I apologize if that confused you.

I think that the problem is the driver, but I don't understand why it works
fine on slopez (Asus P6T WS PRO)... the NIC revision?

Thanks,

Enrique

Comment 10 Michele Baldessari 2014-01-04 21:45:17 UTC
Hi Enrique,

Ah, I see now. Odd indeed. Could you expand on which boxes you ran the tests
between, and what the results were? We may well find that a single NIC or box
is slowing down the IPv6 test (either the client or the server could be
slowing down the IPv6 run). At least I'd hope so; otherwise the plot thickens
quite a bit.

thanks,
Michele

Comment 11 Enrique V. Bonet Esteban 2014-01-04 23:34:40 UTC
Created attachment 845610 [details]
Computer description and iperf test result

Hi Michele,

The server always is mirror.uv.es (2001:720:1014:222::2), a S5520HC motherboard
with Red Hat Enterprise 6.5 x86_64. Its NICs are Intel Corporation 82575EB
Gigabit Network Connection (rev 02) and driver igb.

I attached a ZIP with two files:

TarjetaRed.pdf -> Table with the name of the clients, motherboard, NIC and
driver.
ResIperf.txt -> Output of running the iperf command on the clients.

Note that the iperf server is always mirror.uv.es and the test is run
sequentially on the clients.

If it's any help, the kernels prior to 3.11.9 worked properly (I can't remember
the exact version).

Thanks,

Enrique

Comment 12 Michele Baldessari 2014-01-05 10:38:32 UTC
Hi Enrique,

OK, so to recap: the iperf server is always mirror.uv.es
(2001:720:1014:222::2) with igb.

The tests with the clients are:

r8169 - Asus P6T WS PRO
[root@slopez ~]# iperf -V -c 2001:720:1014:222::2 -t 1
[  3]  0.0- 1.0 sec   110 MBytes   925 Mbits/sec

e1000 - Intel S3210SH
[root@crash ~]# iperf -V -c 2001:720:1014:222::2 -t 1
[  3]  0.0- 1.0 sec  71.6 MBytes   600 Mbits/sec

e1000e - Intel DP35DP
[root@crunch1 ~]# iperf -V -c 2001:720:1014:222::2 -t 1
[  3]  0.0- 1.0 sec   111 MBytes   927 Mbits/sec

e1000e - Intel DP35DP
[root@crunch2 ~]# iperf -V -c 2001:720:1014:222::2 -t 1
[  3]  0.0- 1.0 sec   111 MBytes   928 Mbits/sec

tg3 - Dell PE 840
[root@smagris1 ~]# iperf -V -c 2001:720:1014:222::2 -t 1
[  3]  0.0- 1.0 sec  96.4 MBytes   808 Mbits/sec

tg3 - Dell PE 840
[root@smagris2 ~]# iperf -V -c 2001:720:1014:222::2 -t 1
[  3]  0.0- 1.0 sec  99.2 MBytes   832 Mbits/sec

tg3 - Dell PE 840
[root@smagris3 ~]# iperf -V -c 2001:720:1014:222::2 -t 1
[  3]  0.0- 1.0 sec  97.6 MBytes   818 Mbits/sec

r8169 - Asus P8P67 LE
[root@amparo ~]# iperf -V -c 2001:720:1014:222::2 -t 1
[  3]  0.0- 1.0 sec  7.75 MBytes  62.7 Mbits/sec

r8169 - Asus P8H77-V LE 
[root@cordon3] Could not be tested, but is a slow one

Since you mentioned that some previous kernel used to work correctly,
my recommendation would be to start bisecting a bit.
Start by installing a 3.10.x kernel and take it from there.
If you need some help with bisecting, let me know (there are many guides around).

hth,
Michele
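
As a rough illustration of the bisecting suggested above: each bisect step halves the range of candidate commits, so the number of kernels to build and test grows only logarithmically. The helper name bisect_steps is hypothetical, the result is an estimate rather than what git itself will report, and the tags in the comments are examples that assume a clone of the stable kernel tree.

```shell
# Estimate how many build-and-test rounds a bisection needs for a range
# of N candidate commits, by halving (rounding up) until one is left.
bisect_steps() {
    n=$1
    steps=0
    while [ "$n" -gt 1 ]; do
        n=$(( (n + 1) / 2 ))
        steps=$(( steps + 1 ))
    done
    echo "$steps"
}

# Typical manual flow in a kernel git tree (one build and iperf test per step):
#   git bisect start v3.11.9 v3.10   # bad revision first, then good (example tags)
#   git bisect good                  # or "git bisect bad", after each test
#   git bisect reset                 # when finished
```

For a range of 25 commits, bisect_steps 25 prints 5, so roughly five kernel builds would be enough to isolate a single change.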

Comment 13 Enrique V. Bonet Esteban 2014-01-05 12:53:11 UTC
Hi Michele,

I have downloaded and installed the kernel-3.9.5-301.fc19.x86_64, the first
Fedora 19 kernel.

When I run the iperf command, the output is:

[root@amparo ~]# uname -a
Linux amparo 3.9.5-301.fc19.x86_64 #1 SMP Tue Jun 11 19:39:38 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@amparo ~]# iperf -V -c 2001:720:1014:222::2 -t 1
------------------------------------------------------------
Client connecting to 2001:720:1014:222::2, TCP port 5001
TCP window size: 22.7 KByte (default)
------------------------------------------------------------
[  3] local 2001:720:1014:222:f66d:4ff:fe09:8938 port 45826 connected with 2001:720:1014:222::2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   111 MBytes   927 Mbits/sec

It works fine!

The last kernel that works fine is the one before
kernel-3.11.9-200.fc19.x86_64 (I think that it is
kernel-3.11.8-200.fc19.x86_64), but I cannot find it on the mirrors.

Can you provide me with this kernel version? Is there a repository of old
kernel versions? I would like to download and install versions 3.11.8 and
3.11.9 to tell you exactly which version fails.

Thanks,

Enrique

Comment 14 Michele Baldessari 2014-01-05 18:32:22 UTC
Hi Enrique,

you can find all the builds in koji. Specifically the kernel ones are here:
http://koji.fedoraproject.org/koji/packageinfo?packageID=8

3.11.8 for fc19 is here: http://koji.fedoraproject.org/koji/buildinfo?buildID=478117

regards,
Michele

Comment 15 Enrique V. Bonet Esteban 2014-01-05 19:33:23 UTC
Hi Michele,

I didn't know about this repository, thanks.

I downloaded and installed kernel versions 3.11.8 and 3.11.9 and ran the iperf
command. The outputs are:

[root@amparo ~]# uname -a
Linux amparo 3.11.8-200.fc19.x86_64 #1 SMP Wed Nov 13 16:29:59 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@amparo ~]# iperf -V -c 2001:720:1014:222::2 -t 1
------------------------------------------------------------
Client connecting to 2001:720:1014:222::2, TCP port 5001
TCP window size: 22.7 KByte (default)
------------------------------------------------------------
[  3] local 2001:720:1014:222:f66d:4ff:fe09:8938 port 49638 connected with 2001:720:1014:222::2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   111 MBytes   928 Mbits/sec

It works fine!

[root@amparo ~]# uname -a
Linux amparo 3.11.9-200.fc19.x86_64 #1 SMP Wed Nov 20 21:22:24 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@amparo ~]# iperf -V -c 2001:720:1014:222::2 -t 1
------------------------------------------------------------
Client connecting to 2001:720:1014:222::2, TCP port 5001
TCP window size: 22.7 KByte (default)
------------------------------------------------------------
[  3] local 2001:720:1014:222:f66d:4ff:fe09:8938 port 49989 connected with 2001:720:1014:222::2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  8.25 MBytes  68.9 Mbits/sec

It does not work fine!

The problem appears in kernel 3.11.9; kernel 3.11.8 works properly.

I think that we should focus on the r8169 module, although it works fine on
one computer, but you know the kernel better.

Regards,

Enrique

Comment 16 Michele Baldessari 2014-01-05 20:04:49 UTC
Hi Enrique,

Interesting. This makes the set of candidate changesets much smaller:
$ git log --oneline v3.11.8..v3.11.9  
56a766f media: sh_vou: almost forever loop in sh_vou_try_fmt_vid_out()
1e7c2cd usbcore: set lpm_capable field for LPM capable root hubs
a35bbad usb: fail on usb_hub_create_port_device() errors
b2c2f76 usb: fix cleanup after failure in hub_configure()
c03642e backlight: atmel-pwm-bl: fix deferred probe from __init
be85221 misc: atmel_pwm: add deferred-probing support
b53ef13 iwlwifi: pcie: add new SKUs for 7000 & 3160 NIC series
57b0a9d perf: Fix perf ring buffer memory ordering
747b007 drm/i915/dp: workaround BIOS eDP bpp clamping issue
33e3df4 tracing: Fix potential out-of-bounds in trace_get_user()
0e5f119 ALSA: hda - hdmi: Fix reported channel map on common default layouts
01535e4 USB: add new zte 3g-dongle's pid to option.c
e55433c hyperv-fb: add pci stub
583d159 Thermal: x86_pkg_temp: change spin lock
edd6447 xen-netback: transition to CLOSED when removing a VIF
5fe1417 xen-netback: Handle backend state transitions in a more robust way
4e9728a ipv6: reset dst.expires value when clearing expire flag
1731edc ipv6: ip6_dst_check needs to check for expired dst_entries
2ce4f60 tcp: gso: fix truesize tracking
16eb627 cxgb3: Fix length calculation in write_ofld_wr() on 32-bit architectures
6f54c27 xen-netback: use jiffies_64 value to calculate credit timeout
1527a1e virtio-net: correctly handle cpu hotplug notifier during resuming
6047108 net: flow_dissector: fail on evil iph->ihl
e697716 net: sctp: do not trigger BUG_ON in sctp_cmd_delete_tcb
51ce609 net/mlx4_core: Fix call to __mlx4_unregister_mac

In terms of Fedora specific patches we have:
e1db685 Add patch to fix rhel5.9 KVM guests (rhbz 967652)
0daff16 Add bugzilla/upstream-status notes to 24hz audio patch
4c2b97b Add patch to fix crash from slab when using md-raid mirrors (rhbz 1031086)
59378ff Add patches from Pierre Ossman to fix 24Hz/24p radeon audio (rhbz 1010679)
0b654a6 Add patch to fix ALX phy issues after resume (rhbz 1011362)
09060dc Fix ipv6 sit panic with packet size > mtu (from Michele Baldessari) (rbhz 1015905)
67ce21f CVE-2013-4563: net: large udp packet over IPv6 over UFO-enabled device with TBF qdisc panic (rhbz 1030015 1030017)

The more likely changes are:
1) 4e9728a ipv6: reset dst.expires value when clearing expire flag
2) 1731edc ipv6: ip6_dst_check needs to check for expired dst_entries
3) 2ce4f60 tcp: gso: fix truesize tracking
4) 09060dc Fix ipv6 sit panic with packet size > mtu (from Michele Baldessari) (rbhz 1015905)
5) 67ce21f CVE-2013-4563: net: large udp packet over IPv6 over UFO-enabled device with TBF qdisc panic (rhbz 1030015 1030017)

5) Is UDP-only so it should not affect this BZ.
4) is composed of
9037c3579a277f3a23ba476664629fda8c35f7c4 "ip6_output: fragment outgoing reassembled skb properly"
6aafeef03b9d9ecf255f3a80ed85ee070260e1ae "netfilter: push reasm skb through instead of original frag skbs"
3) Hits ipv4-only code
2) I don't see how this could be relevant
1) I don't see how this could be relevant

So my hunch now would be that 4) is a good candidate.
Although it does not explain yet why only one system seems to be affected

Do you have any netfilter rules on the amparo box? Any rules in general I mean.
Would you be able to compile a kernel without the two commits of 4) and see if the problem goes away?

(Note: I'll be flying the next couple of days so I'll resume looking at this mid-week)

regards,
Michele

Comment 17 Neil Horman 2014-01-06 17:18:17 UTC
I'd check your routing table, to see if the dst entry to ::2 is expiring.  1731edc could be causing previously unexpiring dst entries to expire properly now, requiring the need for a new neighbor solicitation every .2 seconds or so, which may introduce delays.
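
The hypothesis above (a fresh neighbor solicitation every 0.2 seconds or so) could be checked by watching for ICMPv6 neighbor solicitations while iperf runs. This is a sketch under stated assumptions: the names NS_FILTER and count_ns are made up for illustration, the filter assumes no IPv6 extension headers sit before the ICMPv6 header, and the counting relies on tcpdump's usual "neighbor solicitation" wording in its text output.

```shell
# Capture filter for ICMPv6 neighbor solicitations (ICMPv6 type 135).
# ip6[40] is the first byte after the fixed 40-byte IPv6 header.
NS_FILTER='icmp6 and ip6[40] == 135'

# Count neighbor solicitations in tcpdump text output read on stdin.
# "|| true" keeps the exit status at 0 when grep finds no match.
count_ns() {
    grep -c 'neighbor solicitation' || true
}

# Usage while the iperf test runs (interface name p5p1 taken from this report):
#   timeout 10 tcpdump -n -i p5p1 "$NS_FILTER" | count_ns
```

A count of only a handful over a ten-second run would argue against repeated neighbor discovery as the source of the delay.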

Comment 18 Enrique V. Bonet Esteban 2014-01-06 19:30:49 UTC
Hi,

I will compile the kernel this week, when I have some time. 

The netfilter rules are:

[root@amparo ~]# ip6tables -L
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
ACCEPT     all      anywhere             anywhere             state RELATED,ESTABLISHED
ACCEPT     ipv6-icmp    anywhere             anywhere            
ACCEPT     all      anywhere             anywhere            
ACCEPT     tcp      anywhere             anywhere             state NEW tcp dpt:ssh
ACCEPT     udp      anywhere             anywhere             state NEW udp dpt:ipp
ACCEPT     udp      anywhere             ff02::fb/128         state NEW udp dpt:mdns
ACCEPT     tcp      anywhere             anywhere             state NEW tcp dpt:ipp
ACCEPT     udp      anywhere             anywhere             state NEW udp dpt:ipp
REJECT     all      anywhere             anywhere             reject-with icmp6-adm-prohibited

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         
REJECT     all      anywhere             anywhere             reject-with icmp6-adm-prohibited

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         

I didn't find anything strange.

The IPv6 routing table is:

[root@amparo ~]# ip -6 route
2001:720:1014:222::/64 dev p5p1  proto kernel  metric 256  expires 2591996sec
fe80::/64 dev vmnet1  proto kernel  metric 256 
fe80::/64 dev vmnet8  proto kernel  metric 256 
fe80::/64 dev p5p1  proto kernel  metric 256 
default via fe80::20e:d6ff:feb7:400 dev p5p1  proto ra  metric 1024  expires 1796sec

It seems correct, but I ran a test. I ran this shell loop:

while true; do ip -6 route >> ip6route.txt; echo "==========" >> ip6route.txt; sleep 1; done

And I ran iperf -V -c 2001:720:1014:222::2 -t 10 in another shell.

The file "ip6route.txt" contains:

...
2001:720:1014:222::2 dev p5p1  metric 0 
    cache  expires -4379514sec
2001:720:1014:222::/64 dev p5p1  proto kernel  metric 256  expires 2591812sec
fe80::/64 dev vmnet1  proto kernel  metric 256 
fe80::/64 dev vmnet8  proto kernel  metric 256 
fe80::/64 dev p5p1  proto kernel  metric 256 
default via fe80::20e:d6ff:feb7:400 dev p5p1  proto ra  metric 1024  expires 1612sec
==========
2001:720:1014:222::2 dev p5p1  metric 0 
    cache  expires -4379515sec
2001:720:1014:222::/64 dev p5p1  proto kernel  metric 256  expires 2591811sec
fe80::/64 dev vmnet1  proto kernel  metric 256 
fe80::/64 dev vmnet8  proto kernel  metric 256 
fe80::/64 dev p5p1  proto kernel  metric 256 
default via fe80::20e:d6ff:feb7:400 dev p5p1  proto ra  metric 1024  expires 1611sec
...

The cache expires value is negative. Is this correct? This happens on all
computers on my network.

Thanks,

Enrique
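
The one-second samples collected in this comment can be reduced to just the cached route's expires values, which makes it easier to see how the value changes from sample to sample. A minimal sketch, assuming the "ip -6 route" output layout shown above (destination line followed by an indented "cache" line); the helper name cache_expires is hypothetical.

```shell
# Print the "expires" value, in seconds, of each cached route entry for one
# destination. Reads "ip -6 route" style output on stdin, where the cache
# details are on the line right after the destination line.
cache_expires() {
    awk -v dst="$1" '
        $1 == dst { want = 1; next }
        want && $1 == "cache" {
            for (i = 2; i < NF; i++)
                if ($i == "expires") {
                    v = $(i + 1)
                    sub(/sec$/, "", v)
                    print v
                }
        }
        { want = 0 }
    '
}

# Usage on the file written by the sampling loop:
#   cache_expires 2001:720:1014:222::2 < ip6route.txt
```

On the two samples quoted above this prints -4379514 and then -4379515, one value per sample.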

Comment 19 Neil Horman 2014-01-06 20:28:30 UTC
It's not right, but I don't think it's a catastrophic problem (the problem is in iproute2, in its treatment of an unsigned kernel value as signed).  That said, it makes it very difficult to determine exactly what the lifetime of that cached route is (you'll note it counts up instead of down).  IIRC, that route is cloned from the gateway route with the expiration time of 2591812.  That said, it seems odd to me that they have different expiration times.  Looking at the source, it seems we can update the expiration time if we are updating path MTU information and the /proc/sys/net/ipv6/route/mtu_expires value is set to something very low.  It doesn't appear that you have MTU information set, but I could be wrong.  Can you do 3 things:

1) Check your value for /proc/sys/net/ipv6/route/mtu_expires
2) Back out commit 1731edc, and test that kernel, as that seems to be the most likely reason we're dropping dst entries that have had their expirations reduced quickly.
3) Augment the function rt6_check_expired to log a message when a dst entry is expired?

Comment 20 Enrique V. Bonet Esteban 2014-01-08 17:03:43 UTC
Hi Neil,

The value of /proc/sys/net/ipv6/route/mtu_expires is 600. Other computers
have the same value.

I don't know how to back out patch 1731edc, but I applied patch-3.11.9 to kernel
3.11 and modified the source code of net/ipv6/route.c as follows:

...
static struct dst_entry *ip6_dst_check(struct dst_entry *dst,u32 cookie)
{
...

/* if (rt6_check_expired(rt))
      return NULL; */

   return dst;
}
...

And I run the iperf command:

[root@amparo ~]# iperf -V -c 2001:720:1014:222::2 -t 1
------------------------------------------------------------
Client connecting to 2001:720:1014:222::2, TCP port 5001
TCP window size: 22.7 KByte (default)
------------------------------------------------------------
[  3] local 2001:720:1014:222:f66d:4ff:fe09:8938 port 53408 connected with 2001:720:1014:222::2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  40.2 MBytes   337 Mbits/sec

[root@amparo ~]# iperf -V -c 2001:720:1014:222::2 -t 10
------------------------------------------------------------
Client connecting to 2001:720:1014:222::2, TCP port 5001
TCP window size: 22.7 KByte (default)
------------------------------------------------------------
[  3] local 2001:720:1014:222:f66d:4ff:fe09:8938 port 53409 connected with 2001:720:1014:222::2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec   946 MBytes   794 Mbits/sec

It has improved...

"3) Augment the function rt6_check_expireds to log a message when a dst entry
is expired?"

Can you help me? What function can I use to write to /var/log/messages or
another file?

Thanks,

Enrique

Comment 21 Neil Horman 2014-01-08 19:35:01 UTC
The 600 value is good; that's the default value, and it suggests that MTU updates are not at fault here.

The removal of the rt6_check_expired call was sufficient for the tests I suggested in 2 and 3.  It indicates that commit e3bc10bd95d7fcc3f2ac690c6ff22833ea6781d6
is causing this problem (that's the upstream sha1 for the "ipv6: ip6_dst_check needs to check for expired dst_entries" fix).  The fix itself isn't wrong, but seems to be uncovering either another bug or a misconfiguration in your network.  What would really be useful here is a reproducer.  How do you assign global IPv6 addresses in your network?  Do you use SLAAC, DHCPv6 or manual assignment?  Do you have a tcpdump that shows some router advertisements or DHCPv6 transactions from your network that I can look over?

Comment 22 Neil Horman 2014-01-09 17:49:17 UTC
Hey, just FYI, I've received a similar report to this one, and have an internal reproducer for it. I'll update this bz when I have some results from that investigation.

Comment 23 Neil Horman 2014-01-09 18:22:59 UTC
I think I see the problem: ip6_rt_copy needs to call rt6_update_expires.

Comment 24 Enrique V. Bonet Esteban 2014-01-09 19:13:47 UTC
Created attachment 847770 [details]
SLAAC capture traffic

Hi Neil,

I read your comments 22 and 23, but I am answering your comment 21.

The protocol used to assign global IPv6 addresses in our network is SLAAC.
I attached a ZIP file with the tcpdump capture; the filter used is:

tcpdump -i p5p1 -w slaac ether <my ethernet MAC> and ip6

Besides, I modified the function rt6_check_expired() as follows:

static bool rt6_check_expired(const struct rt6_info *rt)
{
        static unsigned long int call=0,expire=0;

        if ((++call%1000)==0)
                printk(KERN_DEBUG "r6_check_expired called %lu times\n",call);

        if (rt->rt6i_flags & RTF_EXPIRES) {
                if (time_after(jiffies, rt->dst.expires))
                {
                        if ((++expire%1000)==0)
                                printk(KERN_DEBUG "r6_check_expired suceed %lu times\n",expire);

                        return true;
                }
        } else if (rt->dst.from) {
                return rt6_check_expired((struct rt6_info *) rt->dst.from);
        }
        return false;
}

If I uncomment, in the ip6_dst_check() function, the code:

if (rt6_check_expired(rt))
      return NULL;

And run the iperf -V -c 2001:720:1014:222::2

The output of the dmesg command is:

[   83.299503] r6_check_expired called 1000 times
[   83.301892] r6_check_expired called 2000 times
[   83.303622] r6_check_expired suceed 1000 times
[   83.303819] r6_check_expired called 3000 times
...
[   84.093472] r6_check_expired called 46000 times
[   84.336506] r6_check_expired suceed 17000 times
[   84.336740] r6_check_expired called 47000 times
[   84.338378] r6_check_expired called 48000 times

More than 48000 calls and 17000 expirations, and the transfer rate is 62.5
Mbits/sec.

If I comment out the code, the dmesg output has no r6_check_expired lines,
so the function is rarely called, and the transfer rate is 926 Mbits/sec.

I think that the number of calls to the rt6_check_expired() function is the
problem.

Best regards,

Enrique
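
The dmesg counters from this comment can be condensed into a single ratio of expired checks to total checks. This is only a sketch: the helper name expired_ratio is hypothetical, it matches the exact message text printed by the debug patch above (including its "suceed" spelling), and because the patch logs only every 1000th event the ratio is approximate.

```shell
# Summarise the debug counters printed by the patched rt6_check_expired().
# Reads dmesg output on stdin and keeps the last value of each counter.
expired_ratio() {
    awk '
        /r6_check_expired called/ { called = $(NF - 1) }
        /r6_check_expired suceed/ { expired = $(NF - 1) }
        END {
            if (called > 0)
                printf "%d/%d (%.1f%%)\n", expired, called, 100 * expired / called
        }
    '
}

# Usage:
#   dmesg | expired_ratio
```

On the last lines quoted above this prints 17000/48000 (35.4%), that is, roughly one check in three finds the route already expired.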

Comment 25 Enrique V. Bonet Esteban 2014-01-09 19:56:11 UTC
Hi Neil,

I think that the solution is, as you said in comment 23, to call
rt6_update_expires() in ip6_rt_copy().

I changed the ip6_rt_copy() function as follows:

static struct rt6_info *ip6_rt_copy(struct rt6_info *ort,
                                    const struct in6_addr *dest)
{
        struct net *net = dev_net(ort->dst.dev);
        struct rt6_info *rt = ip6_dst_alloc(net, ort->dst.dev, 0,
                                            ort->rt6i_table);

        if (rt) {
...
                rt->rt6i_table = ort->rt6i_table;
/* The following line is added */
                rt6_update_expires(rt, net->ipv6.sysctl.ip6_rt_mtu_expires);
        }
        return rt;
}

And the iperf command returns:

[root@amparo ~]# iperf -V -c 2001:720:1014:222::2 -t 1
------------------------------------------------------------
Client connecting to 2001:720:1014:222::2, TCP port 5001
TCP window size: 22.7 KByte (default)
------------------------------------------------------------
[  3] local 2001:720:1014:222:f66d:4ff:fe09:8938 port 39961 connected with 2001:720:1014:222::2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec   111 MBytes   926 Mbits/sec

The rt6_check_expired() function is called more than 97000 times, but it
never reports an expired route.

Could this be the solution?

Best regards,

Enrique

Comment 26 Neil Horman 2014-01-09 20:05:01 UTC
It could be, but I'm not sure yet.  I tried the same thing with success, but then I attempted another fix, which I felt was more correct (changing the condition under which we call rt6_set_from in rt6_check_expires).  That modified the flow such that the cloned route sets the from pointer in the dst_entry and clears the cloned expires flag, so the cloned dst_entry should use the parent (the from pointer's dst information).  I've validated that's happening, but just the same we're not getting better performance, despite the parent route not expiring.  I'm looking some more.

Comment 27 Neil Horman 2014-01-09 20:18:13 UTC
Actually, your patch is also wrong by way of the use of net->ipv6.sysctl.ip6_rt_mtu_expires.  If we were going to preserve the expiration of this route, we would set it to the expiration of the parent route that we are copying from.

Comment 28 Neil Horman 2014-01-13 21:36:55 UTC
Created attachment 849636 [details]
patch to ensure route expiration is set

Hey, I think this is going to be our fix. I'm still not convinced that we don't just need to set the from pointer on the copied route, but I'm still looking into that.  Could you please test this and confirm that it fixes the problem?  Thanks!

Comment 29 Enrique V. Bonet Esteban 2014-01-14 12:17:33 UTC
Hi Neil,

Your patch solves the problem. I applied the patch to the 3.11.9 kernel, rebooted
and ran the iperf -V command. The output is:

[root@amparo ~]# iperf -V -c 2001:720:1014:222::2 -t 1
------------------------------------------------------------
Client connecting to 2001:720:1014:222::2, TCP port 5001
TCP window size: 22.7 KByte (default)
------------------------------------------------------------
[  3] local 2001:720:1014:222:f66d:4ff:fe09:8938 port 39545 connected with 2001:720:1014:222::2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0- 1.0 sec  50.4 MBytes   421 Mbits/sec
[root@amparo ~]# iperf -V -c 2001:720:1014:222::2 -t 10
------------------------------------------------------------
Client connecting to 2001:720:1014:222::2, TCP port 5001
TCP window size: 22.7 KByte (default)
------------------------------------------------------------
[  3] local 2001:720:1014:222:f66d:4ff:fe09:8938 port 39546 connected with 2001:720:1014:222::2 port 5001
[ ID] Interval       Transfer     Bandwidth
[  3]  0.0-10.0 sec  1.00 GBytes   860 Mbits/sec

It works fine!

Just a comment. I first applied the 3.11.9 general patches:

patch -p1 < ../patch-3.11.9

And then your patch:

patch -p1 < ../route.patch (your patch)

And the output was:

patching file net/ipv6/route.c
Hunk #1 succeeded at 1859 (offset -52 lines).

Are you going to apply this patch on a new kernel version?

Thanks,

Enrique

Comment 30 Neil Horman 2014-01-14 15:15:29 UTC
I'm going to post this upstream, and if it's accepted, yes, I'll backport it to Fedora.

Comment 31 Neil Horman 2014-01-14 15:22:48 UTC
Grr, looks like someone beat me to it.  Upstream commit 24f5b855e17df7e355eacd6c4a12cc4d6a6c9ff0 is what we need.

Comment 32 Neil Horman 2014-01-14 15:31:50 UTC
OK, I've committed the backport to the tree; the next f19 kernel should have it fixed.

Comment 33 Fedora Update System 2014-01-16 22:53:02 UTC
kernel-3.12.8-300.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/kernel-3.12.8-300.fc20

Comment 34 Fedora Update System 2014-01-16 22:54:12 UTC
kernel-3.12.8-200.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/kernel-3.12.8-200.fc19

Comment 35 Fedora Update System 2014-01-18 04:25:47 UTC
Package kernel-3.12.8-300.fc20:
* should fix your issue,
* was pushed to the Fedora 20 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.12.8-300.fc20'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-1062/kernel-3.12.8-300.fc20
then log in and leave karma (feedback).

Comment 36 Fedora Update System 2014-01-20 03:04:16 UTC
kernel-3.12.8-200.fc19 has been pushed to the Fedora 19 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 37 Fedora Update System 2014-01-20 03:07:27 UTC
kernel-3.12.8-300.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.