Bug 1948472
| Summary: | OVN controllers on Edge sites fail to register - Transaction causes multiple rows in "Encap" table to have identical values | | |
|---|---|---|---|
| Product: | Red Hat OpenStack | Reporter: | Bernard Cafarelli <bcafarel> |
| Component: | documentation | Assignee: | Greg Rakauskas <gregraka> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | RHOS Documentation Team <rhos-docs> |
| Severity: | urgent | Docs Contact: | |
| Priority: | high | | |
| Version: | 16.1 (Train) | CC: | alex.bron, apevec, astupnik, cfontain, cmilleta, cmuresan, dalvarez, egarciar, ffernand, gregraka, jamsmith, jmelvin, jveiraca, lhh, lmartins, majopela, nalmond, pveiga, rheslop, rsafrono, scohen |
| Target Milestone: | async | Keywords: | Triaged |
| Target Release: | 16.1 (Train on RHEL 8.2) | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 2298873 (view as bug list) | Environment: | |
| Last Closed: | 2022-09-27 21:02:15 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1788336, 1946835, 2002099 | | |
| Bug Blocks: | 2298873, 2299409 | | |
Note that as the workaround drops the chassis, it breaks things like `openstack network agent list`.

The initially mentioned patch https://patchwork.ozlabs.org/project/openvswitch/patch/20200525152821.19838-1-dalvarez@redhat.com/ is present in openvswitch2.13-2.13.0-79.5.el8fdp.x86_64, which is included in 16.1.4 and in the lab deployed here:

```
$ cat /etc/rhosp-release
Red Hat OpenStack Platform release 16.1.4 GA (Train)
$ rpm -q openvswitch2.13
openvswitch2.13-2.13.0-79.5.el8fdp.x86_64
```

After a full redeploy of the overcloud (including the central site), we did not see this error happen again on Edge sites. Looking at OVN database dumps did not yield definitive clues, but the most probable cause was a partial redeploy, that is, edge sites redeployed on the same nodes.

While the current lab is fixed, similar issues can and will happen on deployed clouds: either when scaling down and back up on the same nodes (as was probably the case here), or when nodes are taken out, intentionally or not. We should have a clear, tested, documented way to handle this operation, and also push to have bug 1946835 fixed.

Current draft of the needed steps:

* If the node is accessible before scale down (planned operations), the node should be gracefully shut down first (not via Ironic), confirming that the relevant OVN agents are no longer listed.
* For an unplanned scale down, the procedure needs manual steps, as follows.

First, the relevant chassis should be deleted from the OVN DB with:

```
ovn-sbctl chassis-del <chassis-id>
```

If needed, that ID can be found from the "Encap" error lines with:

```
ovn-sbctl list encap | grep -a3 <IP address from ovn-controller.log>
```

Once these chassis are removed, the Chassis_Private table should be checked:

```
ovn-sbctl find Chassis_private chassis="[]"
```

Any entries reported should be removed with:

```
ovn-sbctl destroy Chassis_Private <listed_id>
```

Once this is done, `openstack network agent list` should run properly and show the expected list.

Hello. We have successfully used the workaround described in comment #8 to resolve a similar problem in a customer's deployment. `neutron agent-list` was still broken, but that could be caused by other OVN problems, so we have reported separate bug #1975264.

Short notes about the steps taken:

- The chassis_name value from output [2] should be used as the argument for command [1].
- The name value from output [3] should be used as the argument for command [4].
- tripleo_ovn_controller and tripleo_ovn_metadata_agent on the affected nodes (nodes that have the specified errors in their logs) must be restarted after the OVN DB is changed (see the restart sketch after this comment thread).

[1] `ovn-sbctl chassis-del <chassis-id>`
[2] `ovn-sbctl list encap | grep -a3 <IP address from ovn-controller.log>`
[3] `ovn-sbctl find Chassis_private chassis="[]"`
[4] `ovn-sbctl destroy Chassis_Private <listed_id>`

Regards, Alex.

*** Bug 2049763 has been marked as a duplicate of this bug. ***

Hi,

The RHOSP 16.1, 16.2, and 17.0 networking guides have been updated. Customers can see these changes here:

https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.1/html/networking_guide/neutron-troubleshoot_rhosp-network#fix-ovn-controller-fail-edge_neutron-troubleshoot
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/16.2/html/networking_guide/neutron-troubleshoot_rhosp-network#fix-ovn-controller-fail-edge_neutron-troubleshoot
https://access.redhat.com/documentation/en-us/red_hat_openstack_platform/17.0/html/networking_guide/neutron-troubleshoot_rhosp-network#fix-ovn-controller-fail-edge_neutron-troubleshoot

--Greg
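Regarding the restart step in Alex's notes: a minimal sketch of the commands. The systemd unit names are an assumption based on typical TripleO naming and are not stated in this bug; verify them on your nodes before use.

```bash
# Run on each affected compute node after the OVN SB DB has been fixed.
# Unit names are assumed from common TripleO naming; check with:
#   systemctl list-units 'tripleo_ovn*'
sudo systemctl restart tripleo_ovn_controller.service
sudo systemctl restart tripleo_ovn_metadata_agent.service
```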
On a 16.1.4 DCN lab deployment, we cannot create VMs on one DCN site. Checking the logs on the compute nodes there, this is neutron/ovn-metadata-agent.log:

```
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.transaction [-] Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/connection.py", line 128, in run
    txn.results.put(txn.do_commit())
  File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 86, in do_commit
    command.run_idl(txn)
  File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/command.py", line 168, in run_idl
    record = self.api.lookup(self.table, self.record)
  File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/__init__.py", line 172, in lookup
    return self._lookup(table, record)
  File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/__init__.py", line 215, in _lookup
    row = idlutils.row_by_value(self, rl.table, rl.column, record)
  File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/idlutils.py", line 130, in row_by_value
    raise RowNotFound(table=table, col=column, match=match)
ovsdbapp.backend.ovs_idl.idlutils.RowNotFound: Cannot find Chassis_Private with name=9ec08d48-23a3-447e-9da7-71a171d38ac0
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command [-] Error executing command: ovsdbapp.backend.ovs_idl.idlutils.RowNotFound: Cannot find Chassis_Private with name=9ec08d48-23a3-447e-9da7-71a171d38ac0
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command Traceback (most recent call last):
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/api.py", line 111, in transaction
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     yield self._nested_txns_map[cur_thread_id]
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command KeyError: 140241704471744
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command During handling of the above exception, another exception occurred:
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command Traceback (most recent call last):
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/command.py", line 42, in execute
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     t.add(self)
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     next(self.gen)
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/api.py", line 119, in transaction
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     del self._nested_txns_map[cur_thread_id]
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/api.py", line 69, in __exit__
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     self.result = self.commit()
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 62, in commit
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     raise result.ex
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/connection.py", line 128, in run
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     txn.results.put(txn.do_commit())
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 86, in do_commit
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     command.run_idl(txn)
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/command.py", line 168, in run_idl
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     record = self.api.lookup(self.table, self.record)
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/__init__.py", line 172, in lookup
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     return self._lookup(table, record)
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/__init__.py", line 215, in _lookup
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     row = idlutils.row_by_value(self, rl.table, rl.column, record)
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/idlutils.py", line 130, in row_by_value
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     raise RowNotFound(table=table, col=column, match=match)
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command ovsdbapp.backend.ovs_idl.idlutils.RowNotFound: Cannot find Chassis_Private with name=9ec08d48-23a3-447e-9da7-71a171d38ac0
```

And openvswitch/ovn-controller.log:

```
2021-04-12T09:14:48.994Z|04753|ovsdb_idl|WARN|Dropped 56170 log messages in last 60 seconds (most recently, 0 seconds ago) due to excessive rate
2021-04-12T09:14:48.994Z|04754|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Encap\" table to have identical values (geneve and \"10.14.2.7\") for index on columns \"type\" and \"ip\".  First row, with UUID 3973cad5-eb8a-4f29-85c3-c105d861c0e0, was inserted by this transaction.  Second row, with UUID f06b71a8-4162-475b-8542-d27db3a9097a, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}
2021-04-12T09:15:48.993Z|04755|ovsdb_idl|WARN|Dropped 55709 log messages in last 60 seconds (most recently, 0 seconds ago) due to excessive rate
2021-04-12T09:15:48.993Z|04756|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Encap\" table to have identical values (geneve and \"10.14.2.7\") for index on columns \"type\" and \"ip\".  First row, with UUID f06b71a8-4162-475b-8542-d27db3a9097a, existed in the database before this transaction and was not modified by the transaction.  Second row, with UUID f81c3e8c-8c24-41bc-95a1-3a1ced147ebb, was inserted by this transaction.","error":"constraint violation"}
2021-04-12T09:16:48.993Z|04757|ovsdb_idl|WARN|Dropped 55070 log messages in last 60 seconds (most recently, 0 seconds ago) due to excessive rate
2021-04-12T09:16:48.993Z|04758|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Encap\" table to have identical values (geneve and \"10.14.2.7\") for index on columns \"type\" and \"ip\".  First row, with UUID 87a8ee7a-a0fb-4600-a05a-8f6af6a609ba, was inserted by this transaction.  Second row, with UUID f06b71a8-4162-475b-8542-d27db3a9097a, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}
```

Apparently, if its hostname changes, ovn-controller registers another chassis entry (which includes another Encap entry). This should be the relevant fix: https://patchwork.ozlabs.org/project/openvswitch/patch/20200525152821.19838-1-dalvarez@redhat.com/

The workaround when this happens is to drop the chassis linked to that IP, with steps similar to:

```
ovn-sbctl list encap | grep -a3 <IP address from ovn-controller.log>
ovn-sbctl chassis-del <chassis-id>
```

and then restart tripleo_ovn_controller and tripleo_ovn_metadata_agent on the nodes.
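Putting the workaround above and the Chassis_Private cleanup from the comments into one place: a minimal, untested sketch. Assumptions not taken from this bug: the script runs where ovn-sbctl can reach the southbound DB, STALE_IP is the tunnel IP from the "constraint violation" messages (the example value below is taken from the logs above), and the chassis name still has to be copied manually from the encap listing.

```bash
#!/bin/bash
# Sketch of the manual cleanup for a stale edge chassis (see the draft
# steps and Alex's notes earlier in this bug). Untested; adapt before use.
set -euo pipefail

# Tunnel IP reported in the "Encap ... identical values" errors
# (example value from the ovn-controller.log excerpt above).
STALE_IP="10.14.2.7"

# Step [2]: show the Encap rows around that IP; note the chassis_name.
ovn-sbctl list encap | grep -a3 "${STALE_IP}"

# Step [1]: delete the stale chassis found above (placeholder value;
# fill in the chassis_name from the grep output).
CHASSIS_ID="<chassis_name from the output above>"
ovn-sbctl chassis-del "${CHASSIS_ID}"

# Steps [3] and [4]: delete any Chassis_Private rows left pointing at no
# chassis. --bare/--columns are standard ovsdb ctl output options.
for row in $(ovn-sbctl --bare --columns=name find Chassis_private chassis='[]'); do
    ovn-sbctl destroy Chassis_Private "${row}"
done

# Finally, restart tripleo_ovn_controller and tripleo_ovn_metadata_agent
# on the affected nodes (see the restart sketch earlier in this bug).
```

The chassis-del step is deliberately left with a placeholder: matching the Encap row for the logged IP to its chassis_name is the one unavoidably manual step in this procedure.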