Bug 1580217

Summary: [ovn]ipv6 load balancer for layer4 on logical router doesn't work
Product: Red Hat Enterprise Linux 7
Component: openvswitch
Version: 7.5
Hardware: x86_64
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: haidong li <haili>
Assignee: Mark Michelson <mmichels>
QA Contact: haidong li <haili>
CC: atelang, atragler, kfida, lmanasko, mmichels, pvauter, tredaelli
Target Milestone: rc
Fixed In Version: openvswitch-2.9.0-69.el7fdn
Doc Type: If docs needed, set a value
Last Closed: 2018-11-05 14:59:03 UTC
Type: Bug

Description haidong li 2018-05-21 03:47:01 UTC
Description of problem:
ipv6 load balancer for layer4 on logical router doesn't work

Version-Release number of selected component (if applicable):
openvswitch-2.9.0-36.el7fdp.x86_64
openvswitch-selinux-extra-policy-1.0-3.el7fdp.noarch
openvswitch-ovn-common-2.9.0-36.el7fdp.x86_64
openvswitch-ovn-host-2.9.0-36.el7fdp.x86_64
openvswitch-ovn-central-2.9.0-36.el7fdp.x86_64

How reproducible:
Every time

Steps to Reproduce:
In my IPv6 environment, the layer 3 load balancer (VIP without a port) works well, but layer 4 (VIP with a port) does not.

TOPO: every switch has two guests connected


 hv1_vm00----s2---------r1----------s3----hv0_vm01
             |                        |
             |                        |
          hv1_vm01                 hv0_vm00
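
Rough sketch of how the load balancer side of this topology was set up (a reconstruction from the output below and from comment 4, not my exact command history; in my original test the rows were created directly in the database):

ovn-nbctl lb-add lb0 300::1 2001:db8:103::11,2001:db8:103::12
ovn-nbctl lb-add lb0 [300::1]:8000 [2001:db8:103::11]:80,[2001:db8:103::12]:80
ovn-nbctl lr-lb-add r1 lb0
ovn-nbctl set logical_router_port r1_s2 options:redirect-chassis="hv1"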


[root@dell-per730-19 ovn]# ovn-nbctl list load_balancer
_uuid               : 685ad133-ff9a-4f6a-a7e4-f63d7ad07792
external_ids        : {}
name                : ""
protocol            : []
vips                : {"30.0.0.2"="172.16.103.11,172.16.103.12", "30.0.0.2:8000"="172.16.103.11:80,172.16.103.12:80"}

_uuid               : a7b0f293-8897-43bd-ada5-61b67382ce45
external_ids        : {}
name                : ""
protocol            : []
vips                : {"300::1"="2001:db8:103::11,2001:db8:103::12", "[300::1]:8000"="[2001:db8:103::11]:80,[2001:db8:103::12]:80"}

_uuid               : f0e8d873-50ca-4715-ac15-b0cf1eb2f9a1
external_ids        : {}
name                : ""
protocol            : []
vips                : {"30.0.0.1"="172.16.103.11,172.16.103.12", "30.0.0.1:8000"="172.16.103.11:80,172.16.103.12:80"}
[root@dell-per730-19 ovn]# ovn-nbctl show
switch 184b6840-32ad-4a05-aedf-f6e2f25d7ff8 (s3)
    port s3_r1
        type: router
        addresses: ["00e:ad:ff:01:03 172.16.103.1 2001b8:103::1"]
        router-port: r1_s3
    port hv0_vm01_vnet1
        addresses: ["00e:ad:00:01:01 172.16.103.12 2001b8:103::12"]
    port hv0_vm00_vnet1
        addresses: ["00e:ad:00:00:01 172.16.103.11 2001b8:103::11"]
switch ea195969-cfc3-4d67-97ce-e4e853b5e3a4 (s2)
    port hv1_vm01_vnet1
        addresses: ["00e:ad:01:01:01 172.16.102.12 2001b8:102::12"]
    port hv1_vm00_vnet1
        addresses: ["00e:ad:01:00:01 172.16.102.11 2001b8:102::11"]
    port s2_r1
        type: router
        addresses: ["00e:ad:ff:01:02 172.16.102.1 2001b8:102::1"]
        router-port: r1_s2
router 51b6a0d4-8388-493a-9751-929179780b1b (r1)
    port r1_s3
        mac: "00e:ad:ff:01:03"
        networks: ["172.16.103.1/24", "2001b8:103::1/64"]
    port r1_s2
        mac: "00e:ad:ff:01:02"
        networks: ["172.16.102.1/24", "2001b8:102::1/64"]
[root@dell-per730-19 ovn]#  ovn-nbctl get logical_router r1 load_balancer
[a7b0f293-8897-43bd-ada5-61b67382ce45]
[root@dell-per730-19 ovn]#
[root@dell-per730-19 ovn]# ovn-sbctl show
Chassis "hv0"
    hostname: "dell-per730-49.rhts.eng.pek2.redhat.com"
    Encap geneve
        ip: "20.0.0.26"
        options: {csum="true"}
    Port_Binding "hv0_vm00_vnet1"
    Port_Binding "hv0_vm01_vnet1"
Chassis "hv1"
    hostname: "dell-per730-19.rhts.eng.pek2.redhat.com"
    Encap geneve
        ip: "20.0.0.25"
        options: {csum="true"}
    Port_Binding "hv1_vm01_vnet1"
    Port_Binding "hv1_vm00_vnet1"
    Port_Binding "cr-r1_s2"
[root@dell-per730-19 ovn]#
[root@dell-per730-19 ovn]# virsh list
 Id    Name                           State
----------------------------------------------------
 9     hv1_vm00                       running
 10    hv1_vm01                       running

[root@dell-per730-19 ovn]# virsh console hv1_vm00
Connected to domain hv1_vm00
Escape character is ^]

[root@localhost ~]# ping6 300::1
PING 300::1(300::1) 56 data bytes
64 bytes from 2001:db8:103::11: icmp_seq=1 ttl=63 time=2.46 ms
64 bytes from 2001:db8:103::11: icmp_seq=2 ttl=63 time=0.584 ms

--- 300::1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.584/1.526/2.469/0.943 ms
[root@localhost ~]# curl -g [2001:db8:103::11]:80         <-------------curl succeeds via the IP of hv0_vm00
i am vm1
[root@localhost ~]# curl -g [300::1]:8000                    <-------------hangs


Additional info:
1. No such issue with the IPv4 load balancer.
2. No such issue if I instead attach the load balancer to a logical switch, as shown below:

[root@dell-per730-19 ~]# ovn-nbctl lr-lb-list r1
UUID                                    LB                  PROTO      VIP              IPs
a7b0f293-8897-43bd-ada5-61b67382ce45                        tcp/udp    300::1           2001:db8:103::11,2001:db8:103::12
                                                            (null)     [300::1]:8000    [2001:db8:103::11]:80,[2001:db8:103::12]:80
[root@dell-per730-19 ~]# ovn-nbctl lr-lb-del r1
[root@dell-per730-19 ~]# ovn-nbctl lr-lb-list r1
[root@dell-per730-19 ~]# ovn-nbctl ls-lb-add s2 a7b0f293-8897-43bd-ada5-61b67382ce45
[root@dell-per730-19 ~]# ovn-nbctl ls-lb-list s2
UUID                                    LB                  PROTO      VIP              IPs
a7b0f293-8897-43bd-ada5-61b67382ce45                        tcp/udp    300::1           2001:db8:103::11,2001:db8:103::12
                                                            (null)     [300::1]:8000    [2001:db8:103::11]:80,[2001:db8:103::12]:80
[root@dell-per730-19 ~]# virsh console hv1_vm00
Connected to domain hv1_vm00
Escape character is ^]

[root@localhost ~]# ping6 300::1
PING 300::1(300::1) 56 data bytes
64 bytes from 300::1: icmp_seq=1 ttl=63 time=1.97 ms

--- 300::1 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 1.974/1.974/1.974/0.000 ms
[root@localhost ~]# curl -g [300::1]:8000 
i am vm2
[root@localhost ~]# curl -g [300::1]:8000 
i am vm1
[root@localhost ~]#

Comment 2 Mark Michelson 2018-06-20 18:58:01 UTC
I tried to reproduce this and I was unable to. When I set up an IPv6 load balancer with a port, it worked as expected. I noticed something suspicious in the output of `ovn-nbctl lr-lb-list`:

[root@dell-per730-19 ~]# ovn-nbctl ls-lb-list s2
UUID                                    LB                  PROTO      VIP              IPs
a7b0f293-8897-43bd-ada5-61b67382ce45                        tcp/udp    300::1           2001:db8:103::11,2001:db8:103::12
                                                            (null)     [300::1]:8000    [2001:db8:103::11]:80,[2001:db8:103::12]:80

Notice how the PROTO is "(null)" for the VIP with a port number. When I run the command on my machine, it looks like this:

[vagrant@central ~]$ sudo ovn-nbctl lr-lb-list ro0
UUID                                    LB                  PROTO      VIP                               IPs
ad707ab6-3f78-4547-9e50-c8a0e1d8bb2d    lb0                 tcp        [fd0f:f07:71c6:b050::100]:8000    [fd0f:0f07:71c6:af56::194]:8000,[fd0f:0f07:71c6:af56::195]:8000
                                                            tcp/udp    fd0f:f07:71c6:b050::100           fd0f:0f07:71c6:af56::194

Notice that the PROTO is "tcp" for the VIP with a port number. The way I created my load balancer was to issue the following commands:

ovn-nbctl lb-add lb0 fd0f:0f07:71c6:b050::100 fd0f:0f07:71c6:af56::194
ovn-nbctl lb-add lb0 [fd0f:0f07:71c6:b050::100]:8000 [fd0f:0f07:71c6:af56::194]:8000,[fd0f:0f07:71c6:af56::195]:8000
ovn-nbctl lr-lb-add ro0 lb0

Notice that I did not specify a protocol, but it defaulted to "tcp". Did you create your load balancers this way? Or did you add them directly to the database? If you add them directly to the database and you specify "tcp" as the protocol, does this issue still occur?
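
For reference, if the rows were added directly to the database, one way to set the protocol afterwards is through ovn-nbctl's generic database commands (a sketch; the UUID is the one from your earlier output):

ovn-nbctl set load_balancer a7b0f293-8897-43bd-ada5-61b67382ce45 protocol=tcp
ovn-nbctl list load_balancer a7b0f293-8897-43bd-ada5-61b67382ce45

lb-add also accepts an explicit protocol as a final argument, e.g. "ovn-nbctl lb-add lb0 [300::1]:8000 [2001:db8:103::11]:80 tcp".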

Comment 3 haidong li 2018-06-21 09:41:03 UTC
Yes, I added them directly to the database. But the issue still exists in my environment after I set the TCP protocol or use the commands you mentioned:

[root@hp-dl380g9-04 ovn]# ovn-nbctl lb-add lb0 300::1 2001:db8:103::11,2001:db8:103::12
[root@hp-dl380g9-04 ovn]# ovn-nbctl lb-add lb0 [300::1]:8000 [2001:db8:103::11]:80,[2001:db8:103::12]:80
[root@hp-dl380g9-04 ovn]# ovn-nbctl lr-lb-add r1 lb0
[root@hp-dl380g9-04 ovn]# ovn-nbctl lr-lb-list r1
UUID                                    LB                  PROTO      VIP              IPs
22bdef9d-dc3d-45e0-8055-c68fd2f0cd73    lb0                 tcp/udp    300::1           2001:db8:103::11,2001:db8:103::12
                                                            tcp        [300::1]:8000    [2001:db8:103::11]:80,[2001:db8:103::12]:80
[root@hp-dl380g9-04 ovn]# ovn-nbctl lb-list 
UUID                                    LB                  PROTO      VIP              IPs
34b7145f-0d91-45e8-b3ad-42922d1a8b38                        tcp/udp    30.0.0.2         172.16.103.11,172.16.103.12
                                                            (null)     30.0.0.2:8000    172.16.103.11:80,172.16.103.12:80
6529a06b-6e5b-4c12-8aca-9ea7798a906d                        tcp/udp    300::1           2001:db8:103::11,2001:db8:103::12
                                                            tcp        [300::1]:8000    [2001:db8:103::11]:80,[2001:db8:103::12]:80
2e6e7e49-b0af-444e-95ad-8cad040b6483                        tcp/udp    30.0.0.1         172.16.103.11,172.16.103.12
                                                            (null)     30.0.0.1:8000    172.16.103.11:80,172.16.103.12:80
22bdef9d-dc3d-45e0-8055-c68fd2f0cd73    lb0                 tcp/udp    300::1           2001:db8:103::11,2001:db8:103::12
                                                            tcp        [300::1]:8000    [2001:db8:103::11]:80,[2001:db8:103::12]:80
[root@hp-dl380g9-04 ovn]# virsh console hv1_vm00
Connected to domain hv1_vm00
Escape character is ^]

[root@localhost ~]# ping6 300::1
PING 300::1(300::1) 56 data bytes
64 bytes from 2001:db8:103::11: icmp_seq=1 ttl=63 time=1.51 ms
64 bytes from 2001:db8:103::11: icmp_seq=2 ttl=63 time=0.590 ms

--- 300::1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.590/1.050/1.510/0.460 ms
[root@localhost ~]# curl -g [300::1]:8000 >> log3.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:--  0:00:04 --:--:--     0
[root@localhost ~]# curl -g [2001:db8:103::11]:80
i am vm1
[root@localhost ~]# curl -g [2001:db8:103::12]:80
i am vm2

By the way, if convenient, could you please log in to the machines I used and check the configuration? The password is redhat:
hp-dl380g9-04.rhts.eng.pek2.redhat.com
hp-dl388g8-09.rhts.eng.pek2.redhat.com

Comment 4 Mark Michelson 2018-06-21 20:45:15 UTC
I figured out how to reproduce this locally.

In my setup, on my logical router, I set options:chassis="central". In your setup, you set options:redirect-chassis="hv1" on the r1_s2 logical router port.

I changed my configuration to use redirect-chassis on the logical router port and now I have the same problem. I will look into why this is happening and report back when I have a fix.
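
Concretely, the difference between the two setups is roughly this (a sketch; the names are the ones from our respective environments):

# gateway-router style (my setup): the whole router is bound to one chassis
ovn-nbctl set logical_router ro0 options:chassis="central"
# distributed-router style (your setup): only the router port is redirected to a gateway chassis
ovn-nbctl set logical_router_port r1_s2 options:redirect-chassis="hv1"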

Comment 5 Mark Michelson 2018-06-26 18:43:03 UTC
I figured out the problem and have created a fix locally.

The issue is that there is a rule for un-DNATting return traffic from the load balancer destination that does not get installed when using IPv6. The fix is to install this rule for IPv6. I have submitted this patch for review upstream: https://patchwork.ozlabs.org/patch/935066/
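
As a quick sanity check once the fix is in (my own suggestion for verifying, not part of the patch): the rule in question belongs to the router's egress un-DNAT stage (lr_out_undnat), so dumping the logical flows for r1 should show IPv6 entries there alongside the IPv4 ones, for example:

ovn-sbctl lflow-list r1 | grep lr_out_undnat

Before the fix, no flows matching the IPv6 backend addresses (2001:db8:103::11 and ::12) show up in that stage.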

Comment 6 Mark Michelson 2018-07-10 14:39:23 UTC
This has been committed upstream in OVS master and OVS 2.9

Comment 7 Mark Michelson 2018-07-10 14:42:25 UTC
On second inspection, it turns out this is committed to master but not to 2.9. I am putting this back into POST until it is committed to the upstream 2.9 branch. I have sent an e-mail to Ben Pfaff requesting the backport.

Comment 8 Mark Michelson 2018-07-12 12:16:04 UTC
This is now backported to the 2.9 branch as well.

Comment 11 haidong li 2018-10-10 06:52:53 UTC
This issue is verified on the latest version:

[root@localhost ~]# curl -g [300::1]:8000 >> log3.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100     9  100     9    0     0    461      0 --:--:-- --:--:-- --:--:--   500
[root@localhost ~]# echo $?
0
[root@localhost ~]# logout


Red Hat Enterprise Linux Server 7.5 (Maipo)
Kernel 3.10.0-862.el7.x86_64 on an x86_64

localhost login: 
spawn virsh console hv1_vm00
Connected to domain hv1_vm00
Escape character is ^]


Red Hat Enterprise Linux Server 7.5 (Maipo)
Kernel 3.10.0-862.el7.x86_64 on an x86_64

localhost login: root

Password: 
Last login: Tue Oct  9 10:30:13 on ttyS0
[root@localhost ~]# curl -g [300::1]:8000 >> log3.txt
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100     9  100     9    0     0   1428      0 --:--:-- --:--:-- --:--:--  1800

job link:
https://beaker.engineering.redhat.com/jobs/2911725

Comment 13 errata-xmlrpc 2018-11-05 14:59:03 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3500