Bug 1957025
Summary: | [ovn] ARP broadcasts and duplicate mac address issues happen due to unexpected openflow rules before ovn-controller is fully up | | |
---|---|---|---|
Product: | Red Hat Enterprise Linux Fast Datapath | Reporter: | ffernand <ffernand> |
Component: | ovn2.13 | Assignee: | Numan Siddique <nusiddiq> |
Status: | CLOSED ERRATA | QA Contact: | ying xu <yinxu> |
Severity: | unspecified | Priority: | high |
Version: | FDP 20.I | Keywords: | CustomerScenariosInitiative, Triaged |
Hardware: | Unspecified | OS: | Unspecified |
Fixed In Version: | ovn2.13-20.12.0-116 ovn-2021-21.03.0-32 | Type: | Bug |
Last Closed: | 2021-06-21 14:44:39 UTC | CC: | averi, ctrautma, dcbw, dceara, ggrimaux, jbeaudoi, jiji, jishi, mmichels, nusiddiq, ralongi |
Description
ffernand
2021-05-04 21:25:40 UTC
()[root@controller-0 /]# ovn-sbctl find port_binding logical_port="cr-lrp-d796ca6d-dd92-49c2-9f4b-e94413f2fb8d"
_uuid               : 42d7a1c3-8f7c-4a03-bedd-bda5b6a9b209
chassis             : cdda0ab3-ec8f-4932-b3db-a6c4c573d06d
datapath            : 46ef0b93-ca0b-488b-99be-34617fdcb9ca
encap               : []
external_ids        : {}
gateway_chassis     : []
ha_chassis_group    : 69af9edb-0e1f-4b1f-a982-abfb2b029bdd
logical_port        : cr-lrp-d796ca6d-dd92-49c2-9f4b-e94413f2fb8d
mac                 : ["fa:16:3e:61:a1:98 10.2.113.250/24 2620:52:4:230d::3d9/64"]
nat_addresses       : []
options             : {distributed-port=lrp-d796ca6d-dd92-49c2-9f4b-e94413f2fb8d}
parent_port         : []
tag                 : []
tunnel_key          : 3
type                : chassisredirect
up                  : true
virtual_parent      : []

2021-05-04T16:38:33.338Z|00034|binding|INFO|cr-lrp-d796ca6d-dd92-49c2-9f4b-e94413f2fb8d: Claiming fa:16:3e:61:a1:98 10.2.113.250/24 2620:52:4:230d::3d9/64
2021-05-04T16:38:33.338Z|00035|binding|INFO|Changing chassis for lport cr-lrp-8732c441-8155-4fe6-bf64-da68e871c211 from 466e1018-1a42-4d16-907a-d1fc62910ea3 to 651b4ea3-f4ea-42b3-bc11-1ec03154aebb.
()[root@controller-0 /]# ovn-nbctl lrp-get-gateway-chassis lrp-d796ca6d-dd92-49c2-9f4b-e94413f2fb8d
lrp-d796ca6d-dd92-49c2-9f4b-e94413f2fb8d_466e1018-1a42-4d16-907a-d1fc62910ea3     5   <-- controller-2
lrp-d796ca6d-dd92-49c2-9f4b-e94413f2fb8d_66758056-d693-4a33-89af-dd15d08b05fa     4
lrp-d796ca6d-dd92-49c2-9f4b-e94413f2fb8d_98cc5a8e-c116-4f82-8c67-b1242888cb09     3
lrp-d796ca6d-dd92-49c2-9f4b-e94413f2fb8d_651b4ea3-f4ea-42b3-bc11-1ec03154aebb     2   <-- networker-1
lrp-d796ca6d-dd92-49c2-9f4b-e94413f2fb8d_4f2021dd-ee45-4677-aa17-854aa0d228ca     1

()[root@controller-0 /]# ovn-sbctl --column _uuid,hostname list chassis 466e1018-1a42-4d16-907a-d1fc62910ea3
_uuid               : cdda0ab3-ec8f-4932-b3db-a6c4c573d06d
hostname            : controller-2.osp-002.prod.iad2.dc.redhat.com

()[root@controller-0 /]# ovn-sbctl --column _uuid,hostname list chassis 651b4ea3-f4ea-42b3-bc11-1ec03154aebb
_uuid               : 45ebb384-c348-425a-b3c4-9369e54e11a1
hostname            : networker-1.osp-002.prod.iad2.dc.redhat.com

()[root@controller-0 /]#

**Important** follow-up on this issue:

After consulting with one of my OVN gurus, I learned that the OpenFlow rules should not be present on the node after a power cycle. That was definitely the case, BUT it turned out that the br-int bridge did not have fail-mode set to secure [0]. Because of that, the default 'NORMAL' action rule was present, and that is what caused the "chaos": default bridge operations were being performed on all of its ports:

[root@networker-0 openvswitch]# ovs-ofctl dump-flows br-int
 cookie=0x0, duration=1611.902s, table=0, n_packets=857734338, n_bytes=55326381337, priority=0 actions=NORMAL

Further inspection is needed to understand why the br-int bridge on some of the nodes was not set to fail-mode secure.
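As a rough illustration (not part of the original report), the check described above, looking for a NORMAL action in the `ovs-ofctl dump-flows br-int` output, could be scripted along these lines. The function name `find_normal_flows` and the `SAMPLE` constant are hypothetical; the sample text is the flow line quoted in this comment.

```python
import re


def find_normal_flows(dump_flows_output: str) -> list:
    """Return the flow lines whose action list contains NORMAL.

    Expects text in the format printed by `ovs-ofctl dump-flows <bridge>`.
    Lines without an `actions=` field (such as a reply header) are skipped.
    """
    hits = []
    for line in dump_flows_output.splitlines():
        match = re.search(r"actions=(\S+)", line)
        # Split on commas and compare whole tokens, so only a bare NORMAL
        # action counts, not a substring of some other action.
        if match and "NORMAL" in match.group(1).split(","):
            hits.append(line.strip())
    return hits


# The flow line quoted in the comment above.
SAMPLE = (
    " cookie=0x0, duration=1611.902s, table=0, n_packets=857734338, "
    "n_bytes=55326381337, priority=0 actions=NORMAL\n"
)

if __name__ == "__main__":
    for flow in find_normal_flows(SAMPLE):
        print("unexpected NORMAL flow:", flow)
```

On a bridge managed by ovn-controller with fail-mode secure, one would expect this to return an empty list; a non-empty result on br-int is the symptom described in this bug.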
This workaround fixed the problem in the cluster:

[stack@director1 ~]$ for i in `OS_CLOUD=undercloud openstack server list| grep control | awk '{print $8}' | sed 's/ctlplane=//'`; do echo $i ; \
    ssh $i sudo ovs-vsctl set-fail-mode br-int secure ; done

[0]: http://www.openvswitch.org/support/dist-docs/ovs-vsctl.8.txt ==> Controller Failure Settings

If OVN creates br-int (default) then it sets fail-mode=secure:
static const struct ovsrec_bridge *
create_br_int(struct ovsdb_idl_txn *ovs_idl_txn,
              const struct ovsrec_open_vswitch_table *ovs_table)
{
<snip>
    struct ovsrec_bridge *bridge;
    bridge = ovsrec_bridge_insert(ovs_idl_txn);
    ovsrec_bridge_set_name(bridge, bridge_name);
>>> ovsrec_bridge_set_fail_mode(bridge, "secure");
    ovsrec_bridge_set_ports(bridge, &port, 1);
but if that bridge already exists when ovn-controller starts, I don't think it will enforce fail-mode=secure on the existing bridge.
Is there any way that br-int would have been created by something (even ovn-controller) and then fail-mode got reset, then when ovn-controller restarts it doesn't touch the bridge?
(In reply to Dan Williams from comment #4)

Thanks, Dan. Just as OVN creates br-int with fail-mode secure when needed, I think that setting should be checked and enforced on an existing bridge as well, regardless. Is that an okay approach?

static const struct ovsrec_bridge *
process_br_int(struct ovsdb_idl_txn *ovs_idl_txn,
               const struct ovsrec_bridge_table *bridge_table,
               const struct ovsrec_open_vswitch_table *ovs_table)
{
    const struct ovsrec_bridge *br_int = get_br_int(bridge_table, ovs_table);
    if (!br_int) {
        br_int = create_br_int(ovs_idl_txn, ovs_table);
    } else {    /* <=== */
        ...
        ovsrec_bridge_set_fail_mode(br_int, "secure");
        ...
    }           /* <---- br_int for OVN must always be used with fail mode "secure" */
    ...
    return br_int;
}

(In reply to ffernand from comment #5)

I would agree; if OVN needs fail-mode=secure, then it should probably enforce that on the integration bridge.

(In reply to Dan Williams from comment #6)

And perhaps warn if it isn't set, since this should be an exceptional condition that the admin should look into?

Changes posted upstream at:
https://patchwork.ozlabs.org/project/ovn/patch/20210507012226.1504699-1-flavio@flaviof.com/

Changes v2 posted upstream at:
https://patchwork.ozlabs.org/project/ovn/patch/20210507174947.1879798-1-flavio@flaviof.com/

Fix is merged upstream. I'm assigning this BZ to Numan so he can help me create a downstream RPM for ovn 2.13 that includes the fix.
https://github.com/ovn-org/ovn/commit/9cc334bc1a036a93cc1a541513d48f4df6933e9b
https://github.com/ovn-org/ovn/commit/be65a461ce134c1e874b65add402f7c2744f29f5

I talked to Numan. This bug is addressed by having ovn-controller set fail_mode to secure if it is not set. Since it is hard to set up an environment like OSP, Numan said we can reproduce the issue simply by deleting the fail_mode after ovn-controller has started.
Reproduced on version:

# rpm -qa|grep ovn
ovn-2021-21.03.0-21.el8fdp.x86_64
ovn-2021-central-21.03.0-21.el8fdp.x86_64
ovn-2021-host-21.03.0-21.el8fdp.x86_64

# ovs-vsctl show
aa1481e5-933e-4276-be67-f56d95703c54
    Bridge br-int
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
    ovs_version: "2.15.1"
[root@dell-per740-54 load_balance]# ovs-vsctl clear bridge br-int fail_mode    <-- delete fail_mode
[root@dell-per740-54 load_balance]# ovs-vsctl show
aa1481e5-933e-4276-be67-f56d95703c54
    Bridge br-int                                                              <-- fail_mode no longer shown
        Port br-int
            Interface br-int
                type: internal
    ovs_version: "2.15.1"
[root@dell-per740-54 load_balance]# ovs-vsctl get bridge br-int fail_mode      <-- empty
[]

Verified on version:

# rpm -qa|grep ovn
ovn2.13-host-20.12.0-118.el8fdp.x86_64
ovn2.13-20.12.0-118.el8fdp.x86_64
ovn2.13-central-20.12.0-118.el8fdp.x86_64

ovn-2021-21.03.0-40.el8fdp.x86_64
ovn-2021-central-21.03.0-40.el8fdp.x86_64
ovn-2021-host-21.03.0-40.el8fdp.x86_64

# ovs-vsctl get bridge br-int fail_mode
secure
[root@dell-per730-19 ~]# ovs-vsctl clear bridge br-int fail_mode
[root@dell-per730-19 ~]# ovs-vsctl get bridge br-int fail_mode    <-- fail_mode is still set even after we clear it
secure
[root@dell-per730-19 ~]# systemctl restart ovn-controller
[root@dell-per730-19 ~]# ovs-vsctl get bridge br-int fail_mode
secure
[root@dell-per730-19 ~]# ovs-vsctl show
e9fe1808-a871-435a-8b77-7d31b8a9c00d
    Bridge br-int
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
    Bridge br-provider
        Port veth0_c0_p
            Interface veth0_c0_p
        Port br-provider
            Interface br-provider
                type: internal
        Port enp4s0d1
            Interface enp4s0d1
    ovs_version: "2.15.1"
[root@dell-per730-19 ~]# ovs-vsctl clear bridge br-int fail_mode
[root@dell-per730-19 ~]#
[root@dell-per730-19 ~]# ovs-vsctl show
e9fe1808-a871-435a-8b77-7d31b8a9c00d
    Bridge br-int
        fail_mode: secure
        Port br-int
            Interface br-int
                type: internal
    Bridge br-provider
        Port veth0_c0_p
            Interface veth0_c0_p
        Port br-provider
            Interface br-provider
                type: internal
        Port enp4s0d1
            Interface enp4s0d1
    ovs_version: "2.15.1"

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (ovn2.13 bug fix and enhancement update), and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2507

*** Bug 1942085 has been marked as a duplicate of this bug. ***
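The verification in this report boils down to reading `fail_mode` on br-int and expecting `secure`. A minimal sketch of that check follows, assuming the output format shown in the transcripts above (`secure` when set, `[]` when the column is empty). The helper names `fail_mode_is_secure` and `read_fail_mode` are hypothetical and are not part of OVN or Open vSwitch.

```python
import subprocess


def fail_mode_is_secure(get_output: str) -> bool:
    """Interpret the output of `ovs-vsctl get bridge <bridge> fail_mode`.

    Per the transcripts in this report, an unset column prints as `[]`
    and a set one prints the bare value, for example `secure`.
    """
    return get_output.strip() == "secure"


def read_fail_mode(bridge: str = "br-int") -> str:
    """Run the same `ovs-vsctl get` command used in the verification steps.

    Requires Open vSwitch on the host; raises CalledProcessError on failure.
    """
    result = subprocess.run(
        ["ovs-vsctl", "get", "bridge", bridge, "fail_mode"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

On a host with Open vSwitch installed, `fail_mode_is_secure(read_fail_mode())` would be expected to return True once the fixed ovn-controller is running, including shortly after `ovs-vsctl clear bridge br-int fail_mode`, since the verification above shows the value being restored.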