Bug 1948472
Summary: | OVN controllers on Edge sites fail to register - Transaction causes multiple rows in "Encap" table to have identical values | |||
---|---|---|---|---|
Product: | Red Hat OpenStack | Reporter: | Bernard Cafarelli <bcafarel> | |
Component: | documentation | Assignee: | Greg Rakauskas <gregraka> | |
Status: | CLOSED CURRENTRELEASE | QA Contact: | RHOS Documentation Team <rhos-docs> | |
Severity: | urgent | Docs Contact: | ||
Priority: | high | |||
Version: | 16.1 (Train) | CC: | alex.bron, apevec, astupnik, cfontain, cmilleta, cmuresan, dalvarez, egarciar, ffernand, gregraka, jamsmith, jmelvin, jveiraca, lhh, lmartins, majopela, nalmond, pveiga, rheslop, rsafrono, scohen | |
Target Milestone: | async | Keywords: | Triaged | |
Target Release: | 16.1 (Train on RHEL 8.2) | |||
Hardware: | Unspecified | |||
OS: | Unspecified | |||
Whiteboard: | ||||
Fixed In Version: | Doc Type: | If docs needed, set a value | ||
Doc Text: | Story Points: | --- | ||
Clone Of: | ||||
: | 2298873 | Environment: | |
Last Closed: | 2022-09-27 21:02:15 UTC | Type: | Bug | |
Regression: | --- | Mount Type: | --- | |
Documentation: | --- | CRM: | ||
Verified Versions: | Category: | --- | ||
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
Cloudforms Team: | --- | Target Upstream Version: | ||
Embargoed: | ||||
Bug Depends On: | 1788336, 1946835, 2002099 | |||
Bug Blocks: | 2298873, 2299409 |
Description
Bernard Cafarelli
2021-04-12 09:21:43 UTC
Note that because the workaround drops the chassis, it breaks commands such as "openstack network agent list".

The initially mentioned patch https://patchwork.ozlabs.org/project/openvswitch/patch/20200525152821.19838-1-dalvarez@redhat.com/ is present in openvswitch2.13-2.13.0-79.5.el8fdp.x86_64, which is included in 16.1.4 and in the lab deployed here:

    $ cat /etc/rhosp-release
    Red Hat OpenStack Platform release 16.1.4 GA (Train)
    $ rpm -q openvswitch2.13
    openvswitch2.13-2.13.0-79.5.el8fdp.x86_64

After a full redeploy of the overcloud (including the central site), we did not see this error again on Edge sites. The OVN database dumps gave no definitive clues, but the most probable cause is a partial redeploy, that is, edge sites redeployed on the same nodes.

While the current lab is fixed, similar issues can and will happen on deployed clouds: either when scaling down and then up on the same nodes (as was probably the case here), or when nodes are taken out, intentionally or not. We should have a clear, tested, and documented way to handle this operation, and also push to have bug 1946835 fixed.

Current draft of needed steps:
* If the node is accessible before scale down (planned operations), shut it down gracefully first (not via ironic) and confirm that the relevant OVN agents are no longer listed.
* For an unplanned scale down, the procedure needs manual steps.

First, delete the relevant chassis from the OVN DB with:

    ovn-sbctl chassis-del <chassis-id>

If needed, that ID can be found from the "Encap" error lines with:

    ovn-sbctl list encap |grep -a3 <IP address from ovn-controller.log>

Once these chassis are removed, check the Chassis_Private table:

    ovn-sbctl find Chassis_Private chassis="[]"

Remove any entries reported with:

    ovn-sbctl destroy Chassis_Private <listed_id>

Once this is done, "openstack network agent list" should run properly and show the expected list.

Hello.
We have successfully used the workaround described in comment #8 to resolve a similar problem in a customer's deployment. "neutron agent-list" was still broken, but that could be caused by other OVN problems, so we have reported a separate bug, #1975264.

Short notes about the steps taken:
- The chassis_name value from output [2] should be used as the argument for command [1].
- The name value from output [3] should be used as the argument for command [4].
- tripleo_ovn_controller and tripleo_ovn_metadata_agent must be restarted on affected nodes (nodes that have the specified errors in their logs) after the OVN DB is changed.

[1] ovn-sbctl chassis-del <chassis-id>
[2] ovn-sbctl list encap |grep -a3 <IP address from ovn-controller.log>
[3] ovn-sbctl find Chassis_Private chassis="[]"
[4] ovn-sbctl destroy Chassis_Private <listed_id>

Regards,
Alex.

*** Bug 2049763 has been marked as a duplicate of this bug. ***

Hi,

The RHOSP 16.1, 16.2, and 17.0 networking guides have been updated. Customers can see these changes here:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/networking_guide/neutron-troubleshoot_rhosp-network#fix-ovn-controller-fail-edge_neutron-troubleshoot
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html/networking_guide/neutron-troubleshoot_rhosp-network#fix-ovn-controller-fail-edge_neutron-troubleshoot
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/17.0/html/networking_guide/neutron-troubleshoot_rhosp-network#fix-ovn-controller-fail-edge_neutron-troubleshoot

--Greg
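
For reference, the manual cleanup discussed in this bug (look up the stale chassis by the Encap IP, delete it, then destroy Chassis_Private rows whose chassis reference is empty) could be scripted roughly as follows. This is only a sketch and has not been run against a real deployment. The function names and the OVN_SBCTL and DRY_RUN variables are hypothetical. It also assumes that "ovn-sbctl list" and "ovn-sbctl find" print records as "column : value" lines separated by blank lines.

```shell
# Sketch only: helper names, OVN_SBCTL and DRY_RUN are hypothetical, not part of OVN.
# Override OVN_SBCTL if ovn-sbctl has to be reached through a wrapper command.
OVN_SBCTL="${OVN_SBCTL:-ovn-sbctl}"
# DRY_RUN=1 prints the destructive commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Print the chassis_name of every Encap record whose ip equals $1.
# Assumes blank-line-separated records made of "column : value" lines.
chassis_for_ip() {
    $OVN_SBCTL list encap | awk -v ip="$1" '
        function emit() { if (rip == ip && name != "") print name; name = ""; rip = "" }
        NF == 0              { emit(); next }
        $1 == "chassis_name" { name = $3; gsub(/"/, "", name) }
        $1 == "ip"           { rip = $3;  gsub(/"/, "", rip) }
        END                  { emit() }
    '
}

# Print the name of every Chassis_Private record with an empty chassis reference.
orphan_private_names() {
    $OVN_SBCTL find Chassis_Private 'chassis=[]' |
        awk '$1 == "name" { n = $3; gsub(/"/, "", n); print n }'
}

# Delete the chassis that owns the Encap IP, then destroy orphaned Chassis_Private rows.
cleanup_stale_chassis() {
    for chassis in $(chassis_for_ip "$1"); do
        run $OVN_SBCTL chassis-del "$chassis"
    done
    for name in $(orphan_private_names); do
        run $OVN_SBCTL destroy Chassis_Private "$name"
    done
}

# Afterwards, restart tripleo_ovn_controller and tripleo_ovn_metadata_agent
# on the affected nodes, as noted in the comments above.
```

With the default DRY_RUN=1, calling cleanup_stale_chassis with the IP address taken from ovn-controller.log only prints the commands it would run. In that mode the orphan check reads the database as it is now, so Chassis_Private rows that would only become orphaned after chassis-del are not listed until the deletion has really happened.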