Bug 1762777
| Summary: | [OVN] HA Chassis failover won't work if there's stale chassis entries | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | Daniel Alvarez Sanchez <dalvarez> |
| Component: | ovn2.11 | Assignee: | Numan Siddique <nusiddiq> |
| Status: | CLOSED ERRATA | QA Contact: | Jianlin Shi <jishi> |
| Severity: | urgent | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | FDP 20.A | CC: | 33186108, ctrautma, fhallal, jishi, kfida, nusiddiq, qding |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | Unspecified | ||
| OS: | Unspecified | ||
| Last Closed: | 2020-01-21 17:02:44 UTC | Type: | Bug |
The fix is available in OVN2.11-2.11.1-19

reproduced on ovn2.11-2.11.1-8.el7fdp.x86_64:

    [root@dell-per740-12 bz1762777]# rpm -qa | grep -E "openvswitch|ovn"
    openvswitch2.11-2.11.0-35.el7fdp.x86_64
    openvswitch-selinux-extra-policy-1.0-14.el7fdp.noarch
    ovn2.11-2.11.1-8.el7fdp.x86_64
    ovn2.11-host-2.11.1-8.el7fdp.x86_64
    ovn2.11-central-2.11.1-8.el7fdp.x86_64

start ovn on server:

    [root@dell-per740-12 bz1762777]# bash -x rep.sh
    + systemctl start openvswitch
    + systemctl start ovn-northd
    + ovn-nbctl set-connection ptcp:6641
    + ovn-sbctl set-connection ptcp:6642
    + ovs-vsctl set open . external_ids:system-id=hv1 external_ids:ovn-remote=tcp:20.0.30.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.30.25
    + systemctl restart ovn-controller
    + ovn-nbctl lr-add lr1
    + ovn-nbctl lrp-add lr1 lr1-ls1 00:de:ad:ff:01:03 192.168.111.254/24 3000::a/64
    + ovn-nbctl ha-chassis-group-add hagrp1
    + ovn-nbctl ha-chassis-group-add-chassis hagrp1 hv0 30
    + ovn-nbctl ha-chassis-group-add-chassis hagrp1 hv1 20
    ++ ovn-nbctl --bare --columns _uuid find ha_chassis_group name=hagrp1
    + hagrp1_uuid=45bc12c5-75f4-4de3-8425-883bb990874e
    + ovn-nbctl set Logical_Router_Port lr1-ls1 ha-chassis-group=45bc12c5-75f4-4de3-8425-883bb990874e

start on client:

    [root@hp-dl380pg8-12 bz1762777]# bash -x rep.sh
    + systemctl start openvswitch
    + ovs-vsctl set open . external_ids:system-id=hv0 external_ids:ovn-remote=tcp:20.0.30.25:6642 external_ids:ovn-encap-type=geneve external_ids:ovn-encap-ip=20.0.30.26
    + systemctl restart ovn-controller

    [root@hp-dl380pg8-12 bz1762777]# ovs-vsctl show
    90e4c87f-34d3-46b4-96e0-f852798d29cb
        Bridge br-int
            fail_mode: secure
            Port br-int
                Interface br-int
                    type: internal
            Port "ovn-hv1-0"
                Interface "ovn-hv1-0"
                    type: geneve
                    options: {csum="true", key=flow, remote_ip="20.0.30.25"}
                    bfd_status: {diagnostic="No Diagnostic", flap_count="1", forwarding="true", remote_diagnostic="No Diagnostic", remote_state=up, state=up}
        ovs_version: "2.11.0"

on server:

    [root@dell-per740-12 bz1762777]# ovn-sbctl list ha_chassis
    _uuid        : 968086f6-c440-459b-b57f-a01a42b2b679
    chassis      : 49d3fa31-7392-47f0-832a-bb15b8df3504
    external_ids : {chassis-name="hv1"}
    priority     : 20

    _uuid        : d5deba9e-12a7-4b61-92f8-60f6991bc8eb
    chassis      : 3c99aea8-c155-41ef-a39a-b2748c21c27c
    external_ids : {chassis-name="hv0"}
    priority     : 30

    [root@dell-per740-12 bz1762777]# ovn-sbctl list port_binding
    _uuid            : e60747ac-18bc-4c3b-bbaf-19e47d973a7d
    chassis          : 3c99aea8-c155-41ef-a39a-b2748c21c27c   <=== the port is bound to chassis hv0(client)
    datapath         : 66b56d95-62b7-4174-91fe-762459b1c833
    encap            : []
    external_ids     : {}
    gateway_chassis  : []
    ha_chassis_group : c49b48c8-c695-4f11-a0d4-b1dba7307853
    logical_port     : "cr-lr1-ls1"
    mac              : ["00:de:ad:ff:01:03 192.168.111.254/24 3000::a/64"]
    nat_addresses    : []
    options          : {distributed-port="lr1-ls1"}
    parent_port      : []
    tag              : []
    tunnel_key       : 2
    type             : chassisredirect
    virtual_parent   : []

    _uuid            : 5319836c-4b19-42a9-9437-9b4b7a3d5d31
    chassis          : []
    datapath         : 66b56d95-62b7-4174-91fe-762459b1c833
    encap            : []
    external_ids     : {}
    gateway_chassis  : []
    ha_chassis_group : []
    logical_port     : "lr1-ls1"
    mac              : ["00:de:ad:ff:01:03 192.168.111.254/24 3000::a/64"]
    nat_addresses    : []
    options          : {}
    parent_port      : []
    tag              : []
    tunnel_key       : 1
    type             : patch
    virtual_parent   : []

stop ovn-controller on client:

    [root@hp-dl380pg8-12 bz1762777]# systemctl stop ovn-controller

on server:

    [root@dell-per740-12 bz1762777]# ovn-sbctl list ha_chassis
    _uuid        : 968086f6-c440-459b-b57f-a01a42b2b679
    chassis      : 49d3fa31-7392-47f0-832a-bb15b8df3504
    external_ids : {chassis-name="hv1"}
    priority     : 20

    _uuid        : d5deba9e-12a7-4b61-92f8-60f6991bc8eb
    chassis      : []
    external_ids : {chassis-name="hv0"}
    priority     : 30   <=== chassis of hv0 is null

    [root@dell-per740-12 bz1762777]# ovn-sbctl list port_binding
    _uuid            : e60747ac-18bc-4c3b-bbaf-19e47d973a7d
    chassis          : []   <==== chassis becomes null, not bound to hv1, fail over doesn't work
    datapath         : 66b56d95-62b7-4174-91fe-762459b1c833
    encap            : []
    external_ids     : {}
    gateway_chassis  : []
    ha_chassis_group : c49b48c8-c695-4f11-a0d4-b1dba7307853
    logical_port     : "cr-lr1-ls1"
    mac              : ["00:de:ad:ff:01:03 192.168.111.254/24 3000::a/64"]
    nat_addresses    : []
    options          : {distributed-port="lr1-ls1"}
    parent_port      : []
    tag              : []
    tunnel_key       : 2
    type             : chassisredirect
    virtual_parent   : []

    _uuid            : 5319836c-4b19-42a9-9437-9b4b7a3d5d31
    chassis          : []
    datapath         : 66b56d95-62b7-4174-91fe-762459b1c833
    encap            : []
    external_ids     : {}
    gateway_chassis  : []
    ha_chassis_group : []
    logical_port     : "lr1-ls1"
    mac              : ["00:de:ad:ff:01:03 192.168.111.254/24 3000::a/64"]
    nat_addresses    : []
    options          : {}
    parent_port      : []
    tag              : []
    tunnel_key       : 1
    type             : patch
    virtual_parent   : []

Verified on ovn2.11-2.11.1-24.el7fdp.x86_64:

    [root@dell-per740-12 ovn]# rpm -qa | grep -E "openvswitch|ovn"
    openvswitch2.11-2.11.0-35.el7fdp.x86_64
    ovn2.11-2.11.1-24.el7fdp.x86_64
    openvswitch-selinux-extra-policy-1.0-14.el7fdp.noarch
    ovn2.11-central-2.11.1-24.el7fdp.x86_64
    ovn2.11-host-2.11.1-24.el7fdp.x86_64

    [root@dell-per740-12 bz1762777]# ovn-sbctl list ha_chassis
    _uuid        : 7f693870-01ab-474c-9574-cdcb35cce22f
    chassis      : 814b3332-a72c-408f-86b3-6e0de0863dee
    external_ids : {chassis-name="hv1"}
    priority     : 20

    _uuid        : 3c121701-7a67-4cf3-aa2c-476d0a90af4d
    chassis      : 0e926138-e86c-427c-87f7-39c2417bcb73
    external_ids : {chassis-name="hv0"}
    priority     : 30

    [root@dell-per740-12 bz1762777]# ovn-sbctl list port_binding
    _uuid            : bb61994c-7482-48d6-8da0-0de71513e24e
    chassis          : []
    datapath         : 94e30e1f-20aa-4d95-9678-4182184e4089
    encap            : []
    external_ids     : {}
    gateway_chassis  : []
    ha_chassis_group : []
    logical_port     : "lr1-ls1"
    mac              : ["00:de:ad:ff:01:03 192.168.111.254/24 3000::a/64"]
    nat_addresses    : []
    options          : {}
    parent_port      : []
    tag              : []
    tunnel_key       : 1
    type             : patch
    virtual_parent   : []

    _uuid            : 4472beb7-70cd-47b4-8aee-7159e26b11fb
    chassis          : 0e926138-e86c-427c-87f7-39c2417bcb73   <=== port bound to hv0
    datapath         : 94e30e1f-20aa-4d95-9678-4182184e4089
    encap            : []
    external_ids     : {}
    gateway_chassis  : []
    ha_chassis_group : aece3c53-c5c4-46ef-8473-f8927e0d7ddf
    logical_port     : "cr-lr1-ls1"
    mac              : ["00:de:ad:ff:01:03 192.168.111.254/24 3000::a/64"]
    nat_addresses    : []
    options          : {distributed-port="lr1-ls1"}
    parent_port      : []
    tag              : []
    tunnel_key       : 2
    type             : chassisredirect
    virtual_parent   : []

stop ovn-controller on hv0:

    [root@hp-dl380pg8-12 bz1762777]# systemctl stop ovn-controller

    [root@dell-per740-12 bz1762777]# ovn-sbctl list ha_chassis
    _uuid        : 7f693870-01ab-474c-9574-cdcb35cce22f
    chassis      : 814b3332-a72c-408f-86b3-6e0de0863dee
    external_ids : {chassis-name="hv1"}
    priority     : 20

    _uuid        : 3c121701-7a67-4cf3-aa2c-476d0a90af4d
    chassis      : []   <=== chassis for hv0 is null
    external_ids : {chassis-name="hv0"}
    priority     : 30

    [root@dell-per740-12 bz1762777]# ovn-sbctl list port_binding
    _uuid            : bb61994c-7482-48d6-8da0-0de71513e24e
    chassis          : []
    datapath         : 94e30e1f-20aa-4d95-9678-4182184e4089
    encap            : []
    external_ids     : {}
    gateway_chassis  : []
    ha_chassis_group : []
    logical_port     : "lr1-ls1"
    mac              : ["00:de:ad:ff:01:03 192.168.111.254/24 3000::a/64"]
    nat_addresses    : []
    options          : {}
    parent_port      : []
    tag              : []
    tunnel_key       : 1
    type             : patch
    virtual_parent   : []

    _uuid            : 4472beb7-70cd-47b4-8aee-7159e26b11fb
    chassis          : 814b3332-a72c-408f-86b3-6e0de0863dee   <=== port bound to hv1, fail over works
    datapath         : 94e30e1f-20aa-4d95-9678-4182184e4089
    encap            : []
    external_ids     : {}
    gateway_chassis  : []
    ha_chassis_group : aece3c53-c5c4-46ef-8473-f8927e0d7ddf
    logical_port     : "cr-lr1-ls1"
    mac              : ["00:de:ad:ff:01:03 192.168.111.254/24 3000::a/64"]
    nat_addresses    : []
    options          : {distributed-port="lr1-ls1"}
    parent_port      : []
    tag              : []
    tunnel_key       : 2
    type             : chassisredirect
    virtual_parent   : []

also verified on rhel8 version:

    [root@hp-dl380pg8-12 bz1762777]# rpm -qa | grep -E "openvswitch|ovn"
    kernel-kernel-networking-openvswitch-ovn-basic-1.0-14.noarch
    openvswitch-selinux-extra-policy-1.0-19.el8fdp.noarch
    ovn2.11-host-2.11.1-24.el8fdp.x86_64
    kernel-kernel-networking-openvswitch-ovn-common-1.0-6.noarch
    ovn2.11-2.11.1-24.el8fdp.x86_64
    ovn2.11-central-2.11.1-24.el8fdp.x86_64
    openvswitch2.11-2.11.0-35.el8fdp.x86_64

    [root@hp-dl380pg8-12 bz1762777]# ovn-sbctl list ha_chassis
    _uuid        : 5bb2e325-d37a-404a-91af-be61b129f61c
    chassis      : 5b48c1e4-41c1-42f1-a451-c064e396653f
    external_ids : {chassis-name="hv0"}
    priority     : 30

    _uuid        : a2d63741-ac8b-4be3-9ef9-da186a0e8d6d
    chassis      : e00214c3-9d46-4d4b-9901-9b4cfbd11352
    external_ids : {chassis-name="hv1"}
    priority     : 20

    [root@hp-dl380pg8-12 bz1762777]# ovn-sbctl list port_binding
    _uuid            : 8ef09c14-c7ee-41c1-ad8e-0bc8046b9e6f
    chassis          : []
    datapath         : 23d40b53-6bdc-49e3-b270-84125fe5733e
    encap            : []
    external_ids     : {}
    gateway_chassis  : []
    ha_chassis_group : []
    logical_port     : "lr1-ls1"
    mac              : ["00:de:ad:ff:01:03 192.168.111.254/24 3000::a/64"]
    nat_addresses    : []
    options          : {}
    parent_port      : []
    tag              : []
    tunnel_key       : 1
    type             : patch
    virtual_parent   : []

    _uuid            : 43b0083f-b4db-4471-ae38-c07c5aa1ecc9
    chassis          : 5b48c1e4-41c1-42f1-a451-c064e396653f   <==== port bind to hv0
    datapath         : 23d40b53-6bdc-49e3-b270-84125fe5733e
    encap            : []
    external_ids     : {}
    gateway_chassis  : []
    ha_chassis_group : 815ee734-34f4-49a9-8689-a626c0a60940
    logical_port     : "cr-lr1-ls1"
    mac              : ["00:de:ad:ff:01:03 192.168.111.254/24 3000::a/64"]
    nat_addresses    : []
    options          : {distributed-port="lr1-ls1"}
    parent_port      : []
    tag              : []
    tunnel_key       : 2
    type             : chassisredirect
    virtual_parent   : []

stop ovn-controller on hv0:

    [root@dell-per740-12 bz1762777]# systemctl stop ovn-controller

    [root@hp-dl380pg8-12 bz1762777]# ovn-sbctl list ha_chassis
    _uuid        : 5bb2e325-d37a-404a-91af-be61b129f61c
    chassis      : []
    external_ids : {chassis-name="hv0"}
    priority     : 30

    _uuid        : a2d63741-ac8b-4be3-9ef9-da186a0e8d6d
    chassis      : e00214c3-9d46-4d4b-9901-9b4cfbd11352
    external_ids : {chassis-name="hv1"}
    priority     : 20

    [root@hp-dl380pg8-12 bz1762777]# ovn-sbctl list port_binding
    _uuid            : 8ef09c14-c7ee-41c1-ad8e-0bc8046b9e6f
    chassis          : []
    datapath         : 23d40b53-6bdc-49e3-b270-84125fe5733e
    encap            : []
    external_ids     : {}
    gateway_chassis  : []
    ha_chassis_group : []
    logical_port     : "lr1-ls1"
    mac              : ["00:de:ad:ff:01:03 192.168.111.254/24 3000::a/64"]
    nat_addresses    : []
    options          : {}
    parent_port      : []
    tag              : []
    tunnel_key       : 1
    type             : patch
    virtual_parent   : []

    _uuid            : 43b0083f-b4db-4471-ae38-c07c5aa1ecc9
    chassis          : e00214c3-9d46-4d4b-9901-9b4cfbd11352   <=== port bound to hv1, fail over works
    datapath         : 23d40b53-6bdc-49e3-b270-84125fe5733e
    encap            : []
    external_ids     : {}
    gateway_chassis  : []
    ha_chassis_group : 815ee734-34f4-49a9-8689-a626c0a60940
    logical_port     : "cr-lr1-ls1"
    mac              : ["00:de:ad:ff:01:03 192.168.111.254/24 3000::a/64"]
    nat_addresses    : []
    options          : {distributed-port="lr1-ls1"}
    parent_port      : []
    tag              : []
    tunnel_key       : 2
    type             : chassisredirect
    virtual_parent   : []

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0190

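The verification above comes down to reading the `chassis` column of the `cr-lr1-ls1` Port_Binding row before and after stopping ovn-controller on hv0. A minimal sketch of that check follows; it assumes `ovn-sbctl list port_binding` output is piped on stdin, and the function name `bound_chassis` is hypothetical, not part of OVN.

```shell
#!/bin/sh
# Hypothetical helper (not an OVN tool): print the chassis column of the
# Port_Binding row whose logical_port equals $1.
# Input: `ovn-sbctl list port_binding` output on stdin.
bound_chassis() {
    awk -v port="$1" '
        # Emit the buffered record if it is the requested port, then reset.
        function emit() {
            if (seen && lport == port) print chassis
            seen = 0; chassis = ""; lport = ""
        }
        $1 == "_uuid"        { emit(); seen = 1 }          # a new row starts
        $1 == "chassis"      { chassis = $3 }              # "[]" means unbound
        $1 == "logical_port" { lport = $3; gsub(/"/, "", lport) }
        END                  { emit() }
    '
}
```

On a node that can reach the southbound database this could be run as `ovn-sbctl list port_binding | bound_chassis cr-lr1-ls1`. An output of `[]` after hv0 goes down corresponds to the broken behaviour reproduced on 2.11.1-8, while the UUID of the hv1 chassis corresponds to a working failover.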
We observed that when there is an HA Chassis entry with an empty chassis column, like:

    # ovn-sbctl list ha_chassis
    _uuid        : 4b114ad4-5425-4b66-8644-7d3d2ff3176b
    chassis      : []
    external_ids : {chassis-name="f69c10c5-1f4b-4112-a429-d318a058a17f"}
    priority     : 1

none of the ports that were bound to that chassis (in the example, "f69c10c5-1f4b-4112-a429-d318a058a17f") will be claimed by the other HA Chassis in the HA Chassis group they belong to.

This situation can happen easily if the CMS doesn't remove the chassis from the HA Chassis group when ovn-controller shuts down and deletes its chassis entry on some node. If the CMS is down and doesn't process the chassis removal, HA won't kick in.

Moreover, if ovn-controller dies ungracefully and the chassis entry is stale in the SB database, the CMS won't detect anything either. Even though the Port_Binding entries will then get their chassis column set to empty, they won't move to the HA Chassis with the next-highest priority.
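To spot the condition described here, one can look for HA_Chassis rows whose `chassis` column is `[]`. The sketch below is only an illustration: it assumes `ovn-sbctl list ha_chassis` output on stdin, and the helper name `stale_ha_chassis` is made up for this example.

```shell
#!/bin/sh
# Hypothetical helper (not an OVN tool): print the chassis-name of every
# HA_Chassis row whose chassis column is empty ("[]").
# Input: `ovn-sbctl list ha_chassis` output on stdin.
stale_ha_chassis() {
    awk '
        # Emit the buffered record if its chassis column was empty, then reset.
        function emit() {
            if (seen && chassis == "[]") print name
            seen = 0; chassis = ""; name = ""
        }
        $1 == "_uuid"        { emit(); seen = 1 }   # a new row starts
        $1 == "chassis"      { chassis = $3 }
        $1 == "external_ids" {
            # chassis-name=" is 14 characters long; drop it and the closing quote
            if (match($0, /chassis-name="[^"]*"/))
                name = substr($0, RSTART + 14, RLENGTH - 15)
        }
        END                  { emit() }
    '
}
```

For instance, `ovn-sbctl list ha_chassis | stale_ha_chassis` would print `f69c10c5-1f4b-4112-a429-d318a058a17f` for the entry shown in the description. The helper only reports such rows; it does not change the database.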