Bug 1562731

Summary: OVN L3 HA: when creating 2 routers, both are scheduled to the same controller node
Product: Red Hat OpenStack
Reporter: Eran Kuris <ekuris>
Component: python-networking-ovn
Assignee: Daniel Alvarez Sanchez <dalvarez>
Status: CLOSED ERRATA
QA Contact: Eran Kuris <ekuris>
Severity: high
Docs Contact:
Priority: medium
Version: 13.0 (Queens)
CC: amuller, apevec, bcafarel, dalvarez, jschluet, lhh, majopela, nyechiel, samccann
Target Milestone: beta
Keywords: Triaged
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: python-networking-ovn-4.0.1-0.20180420150809.c7c16d4.el7ost
Doc Type: Bug Fix
Doc Text:
The L3 HA scheduler did not take the priorities of the nodes into consideration. As a result, all gateways were hosted by the same node and the load was not distributed across candidates. This fix implements an algorithm that selects the least loaded node when scheduling a gateway router. Gateway ports are now scheduled on the least loaded network node, which distributes the load evenly across the nodes.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-06-27 13:49:35 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Eran Kuris 2018-04-02 09:39:26 UTC
Description of problem:
The least-loaded scheduler does not work.
Router scheduling does not work correctly with OVN L3 HA.
We count every entry in each port's gateway chassis list instead of counting only the highest-priority one.
We should weight the load by priority.


(overcloud) [root@controller-0 ~]#  ovn-nbctl --db=tcp:172.17.1.15:6641 lrp-get-gateway-chassis lrp-6042c7e2-79b3-4925-b606-b86c6dc1e824
lrp-6042c7e2-79b3-4925-b606-b86c6dc1e824_942750fc-cec5-4a9f-aeb5-6dfddf9be3be     3
lrp-6042c7e2-79b3-4925-b606-b86c6dc1e824_113644ed-b3c6-47f2-9488-984d37936c97     2
lrp-6042c7e2-79b3-4925-b606-b86c6dc1e824_a34f57de-09d3-4c1f-b56b-270eb850537a     1
(overcloud) [root@controller-0 ~]#  ovn-nbctl --db=tcp:172.17.1.15:6641 lrp-get-gateway-chassis lrp-284190ed-ff6a-438b-b9ee-a843f13edbd6
lrp-284190ed-ff6a-438b-b9ee-a843f13edbd6_942750fc-cec5-4a9f-aeb5-6dfddf9be3be     3
lrp-284190ed-ff6a-438b-b9ee-a843f13edbd6_113644ed-b3c6-47f2-9488-984d37936c97     2
lrp-284190ed-ff6a-438b-b9ee-a843f13edbd6_a34f57de-09d3-4c1f-b56b-270eb850537a     1
(overcloud) [root@controller-0 ~]# ovn-nbctl --db=tcp:172.17.1.15:6641 show 
switch 31990f00-c41e-466e-9070-bf3760b58926 (neutron-7b8f0751-6907-408a-8997-89747009fd09) (aka net-64-2)
    port 6a9c85b2-8a8e-470b-b50f-7ae7c3380b03
        type: localport
        addresses: ["fa:16:3e:85:ae:47 10.0.2.2"]
    port a0cc0b12-70d5-46c9-8e00-e76e970c711f
        addresses: ["fa:16:3e:42:d6:89 10.0.2.8"]
    port 580a8d2c-eaa0-48f0-a7e8-8c379abb8b29
        type: router
        router-port: lrp-580a8d2c-eaa0-48f0-a7e8-8c379abb8b29
switch 7bb30649-71dc-405f-9220-37f7f80f855f (neutron-88236779-29ef-46aa-bc6b-80d8f0f15b45) (aka nova)
    port 2ae28cbb-8ced-4158-ac3a-7f43cf520ee7
        type: localport
        addresses: ["fa:16:3e:18:b4:cd"]
    port 6042c7e2-79b3-4925-b606-b86c6dc1e824
        type: router
        router-port: lrp-6042c7e2-79b3-4925-b606-b86c6dc1e824
    port 284190ed-ff6a-438b-b9ee-a843f13edbd6
        type: router
        router-port: lrp-284190ed-ff6a-438b-b9ee-a843f13edbd6
    port provnet-88236779-29ef-46aa-bc6b-80d8f0f15b45
        type: localnet
        addresses: ["unknown"]
switch 26f1fe62-b330-47a6-8527-0d098a2239ac (neutron-6484b473-5e68-440e-9d90-a53e42fe9dc2) (aka net-64-3)
    port 783de96f-ed69-4d3f-83a3-afa2560a7e02
        type: router
        router-port: lrp-783de96f-ed69-4d3f-83a3-afa2560a7e02
    port d12c0cd5-b818-484a-ac0f-70222b15b0cd
        addresses: ["fa:16:3e:cb:69:c1 10.0.3.9"]
    port 53afc813-7488-47fe-ba2d-9047577e9ce3
        addresses: ["fa:16:3e:33:c2:e6 10.0.3.10"]
    port accbd0cb-be25-4f96-8e5b-59e3f473871d
        type: localport
        addresses: ["fa:16:3e:04:03:20 10.0.3.2"]
router ed8829a4-4206-4410-983d-df2e88790121 (neutron-9b83b3ff-e802-4e2a-8c36-1918b6355c7a) (aka Router_eNet_2)
    port lrp-6042c7e2-79b3-4925-b606-b86c6dc1e824
        mac: "fa:16:3e:0a:22:a5"
        networks: ["10.0.0.220/24"]
        gateway chassis: [113644ed-b3c6-47f2-9488-984d37936c97 a34f57de-09d3-4c1f-b56b-270eb850537a 942750fc-cec5-4a9f-aeb5-6dfddf9be3be]
router 0769eb6f-60ed-451a-af57-8ea56c257fda (neutron-cb989bd4-f821-46b4-b556-b499dd64d5c7) (aka Router_eNet)
    port lrp-284190ed-ff6a-438b-b9ee-a843f13edbd6
        mac: "fa:16:3e:53:26:19"
        networks: ["10.0.0.214/24"]
        gateway chassis: [a34f57de-09d3-4c1f-b56b-270eb850537a 113644ed-b3c6-47f2-9488-984d37936c97 942750fc-cec5-4a9f-aeb5-6dfddf9be3be]
    port lrp-783de96f-ed69-4d3f-83a3-afa2560a7e02
        mac: "fa:16:3e:0c:8e:28"
        networks: ["10.0.3.1/24"]
    port lrp-580a8d2c-eaa0-48f0-a7e8-8c379abb8b29
        mac: "fa:16:3e:c3:0a:b0"
        networks: ["10.0.2.1/24"]
    nat 1801d558-fe18-4015-96c7-6998160c64f5
        external ip: "10.0.0.218"
        logical ip: "10.0.3.9"
        type: "dnat_and_snat"
    nat 46c19fad-c450-490f-8255-66bb3c1f715f
        external ip: "10.0.0.214"
        logical ip: "10.0.2.0/24"
        type: "snat"
    nat b81c0ac9-6e19-4beb-88aa-3c1e120fe680
        external ip: "10.0.0.215"
        logical ip: "10.0.2.8"
        type: "dnat_and_snat"
    nat dce146ff-354b-4340-9607-49ee78d33be9
        external ip: "10.0.0.214"
        logical ip: "10.0.3.0/24"
        type: "snat"
(overcloud) [root@controller-0 ~]# ovn-sbctl --db=tcp:172.17.1.15:6642 show 
Chassis "113644ed-b3c6-47f2-9488-984d37936c97"
    hostname: "controller-2.localdomain"
    Encap geneve
        ip: "172.17.2.21"
        options: {csum="true"}
Chassis "50bcbcc8-7f24-4383-9636-81c833ccc345"
    hostname: "compute-1.localdomain"
    Encap geneve
        ip: "172.17.2.18"
        options: {csum="true"}
    Port_Binding "d12c0cd5-b818-484a-ac0f-70222b15b0cd"
Chassis "942750fc-cec5-4a9f-aeb5-6dfddf9be3be"
    hostname: "controller-0.localdomain"
    Encap geneve
        ip: "172.17.2.20"
        options: {csum="true"}
    Port_Binding "cr-lrp-6042c7e2-79b3-4925-b606-b86c6dc1e824"
    Port_Binding "cr-lrp-284190ed-ff6a-438b-b9ee-a843f13edbd6"
Chassis "a34f57de-09d3-4c1f-b56b-270eb850537a"
    hostname: "controller-1.localdomain"
    Encap geneve
        ip: "172.17.2.10"
        options: {csum="true"}
Chassis "0407dcee-65c0-48c3-be8f-e6d7997c7613"
    hostname: "compute-0.localdomain"
    Encap geneve
        ip: "172.17.2.14"
        options: {csum="true"}
    Port_Binding "a0cc0b12-70d5-46c9-8e00-e76e970c711f"


Version-Release number of selected component (if applicable):
(overcloud) [root@controller-0 ~]# rpm -qa | grep ovn
puppet-ovn-12.3.1-0.20180221062110.4b16f7c.el7ost.noarch
openvswitch-ovn-central-2.9.0-3.el7fdp.x86_64
novnc-0.6.1-1.el7ost.noarch
openvswitch-ovn-common-2.9.0-3.el7fdp.x86_64
openvswitch-ovn-host-2.9.0-3.el7fdp.x86_64
python-networking-ovn-metadata-agent-4.0.0-0.20180220131809.329d6d8.el7ost.noarch
python-networking-ovn-4.0.0-0.20180220131809.329d6d8.el7ost.noarch
openstack-nova-novncproxy-17.0.1-0.20180302144923.9ace6ed.el7ost.noarch
(overcloud) [root@controller-0 ~]# cat /etc/yum.repos.d/latest-installed 
13   -p 2018-03-20.2


How reproducible:
always

Steps to Reproduce:
1. Create 2 networks.
2. Create 2 routers.
3. Check with the OVN commands where the routers are scheduled.
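
Step 3 can be scripted. The following is a minimal sketch, assuming the `ovn-nbctl lrp-get-gateway-chassis` output format shown in this report (one `<port>_<chassis>` name followed by a priority per line) and assuming that neither the port name nor the chassis name contains an underscore. The function names are hypothetical helpers for illustration, not part of OVN or networking-ovn.

```python
def parse_gateway_chassis(output):
    """Parse `ovn-nbctl lrp-get-gateway-chassis` output into tuples.

    Each useful line looks like "<port>_<chassis>   <priority>".
    Lines that do not match (shell prompts, blanks) are skipped.
    """
    entries = []
    for line in output.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[1].isdigit():
            continue
        # Assumption: the last underscore separates port from chassis.
        port, _, chassis = parts[0].rpartition("_")
        entries.append((port, chassis, int(parts[1])))
    return entries


def active_chassis(output):
    """Return the chassis with the highest priority, or None if empty."""
    entries = parse_gateway_chassis(output)
    if not entries:
        return None
    return max(entries, key=lambda entry: entry[2])[1]
```

Running `active_chassis` on the output captured for each router port and comparing the results shows whether two routers ended up with the same highest-priority chassis.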

Comment 2 Miguel Angel Ajo 2018-04-02 10:57:59 UTC
When I looked at the scheduler, I confirmed that it is using the least-loaded scheduler and that it is not working as expected.

As far as I understand from the code, it lists all the chassis that have a gateway router port scheduled, but it does not take the priorities of the gateway chassis into account.

We should make sure the priority is used in the calculation; otherwise, all the chassis (master or backup) are weighted equally.
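
To illustrate the difference described in this comment, here is a minimal sketch. It is not the networking-ovn implementation: the data layout (a dict mapping each router port to a list of `(chassis, priority)` pairs) and all function names are assumptions made for the example. The sample data mirrors the two router ports from the description, with hostnames standing in for chassis UUIDs.

```python
from collections import Counter


def naive_load(bindings):
    """Count every chassis listed for a port, whatever its priority.

    With three candidates on every port, all chassis get the same count,
    so this measure cannot tell master from backup.
    """
    load = Counter()
    for gateways in bindings.values():
        for chassis, _priority in gateways:
            load[chassis] += 1
    return load


def active_load(bindings, candidates):
    """Count only the highest-priority chassis of each port."""
    load = Counter({chassis: 0 for chassis in candidates})
    for gateways in bindings.values():
        if gateways:
            active, _priority = max(gateways, key=lambda g: g[1])
            load[active] += 1
    return load


def pick_least_loaded(bindings, candidates):
    """Pick the candidate hosting the fewest active gateway ports.

    Ties are broken by name so the result is deterministic.
    """
    load = active_load(bindings, candidates)
    return min(candidates, key=lambda chassis: (load[chassis], chassis))


# Hypothetical sample mirroring the two ports in the description.
BINDINGS = {
    "lrp-6042c7e2": [("controller-0", 3), ("controller-2", 2), ("controller-1", 1)],
    "lrp-284190ed": [("controller-0", 3), ("controller-2", 2), ("controller-1", 1)],
}
CANDIDATES = ["controller-0", "controller-1", "controller-2"]
```

With this sample, `naive_load` reports a count of 2 for every controller, while `active_load` reports 2 for controller-0 and 0 for the others, so `pick_least_loaded` avoids controller-0 for the next router.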

Comment 4 Jon Schlueter 2018-04-19 15:27:31 UTC
Merged on master and proposed to stable/queens.

Comment 11 Eran Kuris 2018-05-01 10:28:45 UTC
Fix verified:
cat /etc/yum.repos.d/latest-installed 
13   -p 2018-04-26.3
(overcloud) [root@controller-0 ~]# rpm -qa |grep python-networking-ovn
python-networking-ovn-4.0.1-0.20180420150809.c7c16d4.el7ost.noarch
python-networking-ovn-metadata-agent-4.0.1-0.20180420150809.c7c16d4.el7ost.noarch

I created 3 routers and verified that each one was scheduled on a different controller node.
It now looks like the priorities of the gateway chassis are taken into account. I also ran a connectivity check against each router's external interface and against an instance attached to a router.
router fa4d44f5-669a-41ce-a0f3-51b127aaf1c0 (neutron-f7df49ce-69be-4fac-b476-22de9ece4cd1) (aka Router_eNet)
    port lrp-d80d1f0e-a7e2-45bf-854d-6d87246aae48
        mac: "fa:16:3e:e4:71:9f"
        networks: ["10.0.0.217/24"]
        gateway chassis: [37601a52-d66a-4eac-be13-b9f93095ebf1 21762b93-5d6c-4684-ac52-6018d9d35217 95b77591-d3e9-4a79-b7b6-1e817c4faa48]
router c8ac6efa-395d-48d0-906f-1bc4404070a9 (neutron-94c445b3-5912-455f-9708-e90c8ab50b73) (aka Router_eNet_2)
    port lrp-781c85b7-b59b-4f6e-a7f4-e6f5228f55bf
        mac: "fa:16:3e:4f:60:ed"
        networks: ["10.0.0.214/24"]
        gateway chassis: [37601a52-d66a-4eac-be13-b9f93095ebf1 21762b93-5d6c-4684-ac52-6018d9d35217 95b77591-d3e9-4a79-b7b6-1e817c4faa48]
router 33d15737-a35b-4251-9f0b-672a3f52071c (neutron-c7f7ddc8-2a7c-4ecc-8c46-17713a39b9ca) (aka Router_eNet_3)
    port lrp-95eceb8e-a07f-4868-9c97-388e7da2a3e8
        mac: "fa:16:3e:71:c4:0c"
        networks: ["10.0.0.211/24"]
        gateway chassis: [21762b93-5d6c-4684-ac52-6018d9d35217 37601a52-d66a-4eac-be13-b9f93095ebf1 95b77591-d3e9-4a79-b7b6-1e817c4faa48]
(overcloud) [root@controller-0 ~]# ovn-nbctl lrp-get-gateway-chassis lrp-d80d1f0e-a7e2-45bf-854d-6d87246aae48
lrp-d80d1f0e-a7e2-45bf-854d-6d87246aae48_21762b93-5d6c-4684-ac52-6018d9d35217     3
lrp-d80d1f0e-a7e2-45bf-854d-6d87246aae48_37601a52-d66a-4eac-be13-b9f93095ebf1     2
lrp-d80d1f0e-a7e2-45bf-854d-6d87246aae48_95b77591-d3e9-4a79-b7b6-1e817c4faa48     1
(overcloud) [root@controller-0 ~]# ovn-nbctl lrp-get-gateway-chassis lrp-781c85b7-b59b-4f6e-a7f4-e6f5228f55bf
lrp-781c85b7-b59b-4f6e-a7f4-e6f5228f55bf_95b77591-d3e9-4a79-b7b6-1e817c4faa48     3
lrp-781c85b7-b59b-4f6e-a7f4-e6f5228f55bf_37601a52-d66a-4eac-be13-b9f93095ebf1     2
lrp-781c85b7-b59b-4f6e-a7f4-e6f5228f55bf_21762b93-5d6c-4684-ac52-6018d9d35217     1
(overcloud) [root@controller-0 ~]# ovn-nbctl lrp-get-gateway-chassis lrp-95eceb8e-a07f-4868-9c97-388e7da2a3e8
lrp-95eceb8e-a07f-4868-9c97-388e7da2a3e8_37601a52-d66a-4eac-be13-b9f93095ebf1     3
lrp-95eceb8e-a07f-4868-9c97-388e7da2a3e8_21762b93-5d6c-4684-ac52-6018d9d35217     2
lrp-95eceb8e-a07f-4868-9c97-388e7da2a3e8_95b77591-d3e9-4a79-b7b6-1e817c4faa48     1
(overcloud) [root@controller-0 ~]# ping 10.0.0.211
PING 10.0.0.211 (10.0.0.211) 56(84) bytes of data.
64 bytes from 10.0.0.211: icmp_seq=1 ttl=254 time=2.50 ms
64 bytes from 10.0.0.211: icmp_seq=2 ttl=254 time=0.458 ms
--- 10.0.0.211 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1006ms
rtt min/avg/max/mdev = 0.458/1.483/2.508/1.025 ms

(overcloud) [root@controller-0 ~]# ping 10.0.0.214
PING 10.0.0.214 (10.0.0.214) 56(84) bytes of data.
64 bytes from 10.0.0.214: icmp_seq=1 ttl=254 time=0.741 ms
64 bytes from 10.0.0.214: icmp_seq=2 ttl=254 time=0.239 ms
^C
--- 10.0.0.214 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1000ms
rtt min/avg/max/mdev = 0.239/0.490/0.741/0.251 ms

(overcloud) [root@controller-0 ~]# ping 10.0.0.217
PING 10.0.0.217 (10.0.0.217) 56(84) bytes of data.
64 bytes from 10.0.0.217: icmp_seq=1 ttl=254 time=0.969 ms
64 bytes from 10.0.0.217: icmp_seq=2 ttl=254 time=0.397 ms
^C
--- 10.0.0.217 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1003ms
rtt min/avg/max/mdev = 0.397/0.683/0.969/0.286 ms
# ovn-sbctl show 
Chassis "21762b93-5d6c-4684-ac52-6018d9d35217"
    hostname: "controller-1.localdomain"
    Encap geneve
        ip: "172.17.2.16"
        options: {csum="true"}
    Port_Binding "cr-lrp-d80d1f0e-a7e2-45bf-854d-6d87246aae48"
Chassis "95b77591-d3e9-4a79-b7b6-1e817c4faa48"
    hostname: "controller-0.localdomain"
    Encap geneve
        ip: "172.17.2.13"
        options: {csum="true"}
    Port_Binding "cr-lrp-781c85b7-b59b-4f6e-a7f4-e6f5228f55bf"
Chassis "37601a52-d66a-4eac-be13-b9f93095ebf1"
    hostname: "controller-2.localdomain"
    Encap geneve
        ip: "172.17.2.12"
        options: {csum="true"}
    Port_Binding "cr-lrp-95eceb8e-a07f-4868-9c97-388e7da2a3e8"
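
The distribution shown in the `ovn-sbctl show` output above can also be checked programmatically. This is a minimal sketch, assuming the output layout seen in this report (a `hostname:` line under each `Chassis`, and `Port_Binding "cr-lrp-..."` lines for the gateway ports bound there). The function name is a hypothetical helper, not an OVN API.

```python
import re


def gateway_ports_by_host(sbctl_show):
    """Map each chassis hostname to the cr-lrp ports bound to it."""
    ports = {}
    host = None
    for raw in sbctl_show.splitlines():
        line = raw.strip()
        if line.startswith("Chassis "):
            # New chassis block: forget the previous hostname.
            host = None
            continue
        match = re.match(r'hostname: "?([^"]+)"?$', line)
        if match:
            host = match.group(1)
            ports.setdefault(host, [])
            continue
        match = re.match(r'Port_Binding "?(cr-lrp-[^"]+)"?$', line)
        if match and host is not None:
            ports[host].append(match.group(1))
    return ports
```

An even spread means every controller hostname maps to roughly the same number of `cr-lrp-` ports, which is one each in the verification run above.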

Comment 13 errata-xmlrpc 2018-06-27 13:49:35 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:2086