Bug 1948472 - OVN controllers on Edge sites fail to register - Transaction causes multiple rows in "Encap" table to have identical values
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: documentation
Version: 16.1 (Train)
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: urgent
Target Milestone: async
Target Release: 16.1 (Train on RHEL 8.2)
Assignee: Greg Rakauskas
QA Contact: RHOS Documentation Team
URL:
Whiteboard:
Duplicates: 2049763
Depends On: 1788336 1946835 2002099
Blocks:
Reported: 2021-04-12 09:21 UTC by Bernard Cafarelli
Modified: 2023-01-20 10:15 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-09-27 21:02:15 UTC
Target Upstream Version:
Embargoed:


Links
System                             ID        Private  Priority  Status  Summary  Last Updated
Red Hat Issue Tracker              OSP-2299  0        None      None    None     2021-11-17 09:37:23 UTC
Red Hat Knowledge Base (Solution)  6551871   0        None      None    None     2021-12-01 14:28:43 UTC

Description Bernard Cafarelli 2021-04-12 09:21:43 UTC
On a 16.1.4 DCN lab deployment, we cannot create VMs on one DCN site. On the compute nodes there, neutron/ovn-metadata-agent.log shows:

2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.transaction [-] Traceback (most recent call last):                                                                              
  File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/connection.py", line 128, in run                                                                                           
    txn.results.put(txn.do_commit())
  File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 86, in do_commit                                                                                     
    command.run_idl(txn)
  File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/command.py", line 168, in run_idl                                                                                          
    record = self.api.lookup(self.table, self.record)
  File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/__init__.py", line 172, in lookup                                                                                          
    return self._lookup(table, record)
  File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/__init__.py", line 215, in _lookup                                                                                         
    row = idlutils.row_by_value(self, rl.table, rl.column, record)
  File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/idlutils.py", line 130, in row_by_value                                                                                    
    raise RowNotFound(table=table, col=column, match=match)
ovsdbapp.backend.ovs_idl.idlutils.RowNotFound: Cannot find Chassis_Private with name=9ec08d48-23a3-447e-9da7-71a171d38ac0                                                                    

2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command [-] Error executing command: ovsdbapp.backend.ovs_idl.idlutils.RowNotFound: Cannot find Chassis_Private with name=9ec08d48-23a3-447e-9da7-71a171d38ac0
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command Traceback (most recent call last):                                                                                      
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/api.py", line 111, in transaction                                     
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     yield self._nested_txns_map[cur_thread_id]                                                                          
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command KeyError: 140241704471744
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command During handling of the above exception, another exception occurred:                                                     
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command Traceback (most recent call last):                                                                                      
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/command.py", line 42, in execute                      
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     t.add(self)
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib64/python3.6/contextlib.py", line 88, in __exit__                                                       
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     next(self.gen)
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/api.py", line 119, in transaction                                     
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     del self._nested_txns_map[cur_thread_id]                                                                            
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/api.py", line 69, in __exit__                                         
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     self.result = self.commit()                                                                                         
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 62, in commit                   
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     raise result.ex
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/connection.py", line 128, in run                      
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     txn.results.put(txn.do_commit())                                                                                    
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/transaction.py", line 86, in do_commit                
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     command.run_idl(txn)
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/command.py", line 168, in run_idl                     
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     record = self.api.lookup(self.table, self.record)                                                                   
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/__init__.py", line 172, in lookup                     
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     return self._lookup(table, record)                                                                                  
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/__init__.py", line 215, in _lookup                    
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     row = idlutils.row_by_value(self, rl.table, rl.column, record)                                                      
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command   File "/usr/lib/python3.6/site-packages/ovsdbapp/backend/ovs_idl/idlutils.py", line 130, in row_by_value               
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command     raise RowNotFound(table=table, col=column, match=match)                                                             
2021-04-12 09:14:09.814 64258 ERROR ovsdbapp.backend.ovs_idl.command ovsdbapp.backend.ovs_idl.idlutils.RowNotFound: Cannot find Chassis_Private with name=9ec08d48-23a3-447e-9da7-71a171d38ac0


And openvswitch/ovn-controller.log shows:

2021-04-12T09:14:48.994Z|04753|ovsdb_idl|WARN|Dropped 56170 log messages in last 60 seconds (most recently, 0 seconds ago) due to excessive rate                                             
2021-04-12T09:14:48.994Z|04754|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Encap\" table to have identical values (geneve and \"10.14.2.7\") for index on columns \"type\" and \"ip\".  First row, with UUID 3973cad5-eb8a-4f29-85c3-c105d861c0e0, was inserted by this transaction.  Second row, with UUID f06b71a8-4162-475b-8542-d27db3a9097a, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}                                                                      
2021-04-12T09:15:48.993Z|04755|ovsdb_idl|WARN|Dropped 55709 log messages in last 60 seconds (most recently, 0 seconds ago) due to excessive rate                                             
2021-04-12T09:15:48.993Z|04756|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Encap\" table to have identical values (geneve and \"10.14.2.7\") for index on columns \"type\" and \"ip\".  First row, with UUID f06b71a8-4162-475b-8542-d27db3a9097a, existed in the database before this transaction and was not modified by the transaction.  Second row, with UUID f81c3e8c-8c24-41bc-95a1-3a1ced147ebb, was inserted by this transaction.","error":"constraint violation"}
2021-04-12T09:16:48.993Z|04757|ovsdb_idl|WARN|Dropped 55070 log messages in last 60 seconds (most recently, 0 seconds ago) due to excessive rate                                             
2021-04-12T09:16:48.993Z|04758|ovsdb_idl|WARN|transaction error: {"details":"Transaction causes multiple rows in \"Encap\" table to have identical values (geneve and \"10.14.2.7\") for index on columns \"type\" and \"ip\".  First row, with UUID 87a8ee7a-a0fb-4600-a05a-8f6af6a609ba, was inserted by this transaction.  Second row, with UUID f06b71a8-4162-475b-8542-d27db3a9097a, existed in the database before this transaction and was not modified by the transaction.","error":"constraint violation"}  


Apparently, if ovn-controller replaces the hostname it will register another chassis entry (which includes another encap entry). This should be the relevant fix:
https://patchwork.ozlabs.org/project/openvswitch/patch/20200525152821.19838-1-dalvarez@redhat.com/


The workaround when this happens is to delete the chassis linked to that IP, with steps similar to:
ovn-sbctl list encap |grep -a3 <IP address from ovn-controller.log>
ovn-sbctl chassis-del <chassis-id>
Then restart tripleo_ovn_controller and tripleo_ovn_metadata_agent on the nodes.
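
The IP address needed for the first command can be pulled out of the "transaction error" lines rather than read by eye. The following is a minimal sketch, assuming the log lines have the same shape as the ovn-controller.log excerpt above; the function name encap_conflict_ips is made up for illustration.

```shell
# Print each distinct IP reported in "identical values" Encap errors.
# Reads ovn-controller.log content on stdin.
encap_conflict_ips() {
    # The message looks like: identical values (geneve and \"10.14.2.7\")
    # Skip the tunnel type and the escaped quote, then capture the address.
    sed -n 's/.*identical values ([^ ]* and [^0-9a-fA-F]*\([0-9a-fA-F.:]*\).*/\1/p' |
        sort -u
}
```

It could be run on a compute node as "encap_conflict_ips < ovn-controller.log" (use the actual location of the log on the host), and each address printed can then be passed to the "ovn-sbctl list encap" lookup.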

Comment 1 Bernard Cafarelli 2021-04-12 14:17:00 UTC
Note that as the workaround drops the chassis, this breaks things like "openstack network agent list"

Comment 2 Bernard Cafarelli 2021-04-12 15:50:59 UTC
The initially mentioned patch https://patchwork.ozlabs.org/project/openvswitch/patch/20200525152821.19838-1-dalvarez@redhat.com/ is present in openvswitch2.13-2.13.0-79.5.el8fdp.x86_64, which is included in 16.1.4 and the deployed lab here:

$ cat /etc/rhosp-release
Red Hat OpenStack Platform release 16.1.4 GA (Train)
$ rpm -q openvswitch2.13
openvswitch2.13-2.13.0-79.5.el8fdp.x86_64

Comment 8 Bernard Cafarelli 2021-04-14 12:33:21 UTC
After a full redeploy of the overcloud (including the central site), we did not see this error again on Edge sites. The OVN database dumps gave no definitive clues, but the most probable cause was a partial redeploy, that is, edge sites redeployed on the same nodes.

While the current lab is fixed, similar issues can and will happen on deployed clouds: either when scaling down and up on the same nodes (as was probably the case here), or when nodes are taken out, intentionally or not. We should have a clear, tested, and documented way to handle this operation, and also push to have bug 1946835 fixed.

Current draft of needed steps:
* if the node is accessible before scale down (planned operations), shut it down gracefully first (not via ironic) and confirm that the relevant OVN agents are no longer listed
* for an unplanned scale down, the procedure needs manual steps

First the relevant chassis should be deleted from OVN db with:
  ovn-sbctl chassis-del <chassis-id>
If needed, that ID can be found from the "Encap" error lines with:
  ovn-sbctl list encap |grep -a3 <IP address from ovn-controller.log>

Once these chassis are removed, the Chassis_Private table should be checked:
  ovn-sbctl find Chassis_private chassis="[]"
Any entries reported should be removed with:
  ovn-sbctl destroy Chassis_Private <listed_id>

Once this is done, "openstack network agent list" should run properly and show the expected list.
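
The manual steps above can be sketched as a few shell functions. This is only a sketch under stated assumptions: the SBCTL variable and the function names are invented here, and the orphaned rows are selected by _uuid (with the --bare and --columns options) instead of copying the ID by hand. Test it on a lab before relying on it.

```shell
# Command used to reach the OVN southbound DB. SBCTL is an assumed knob so
# the command can be wrapped (for example, run inside a container).
SBCTL=${SBCTL:-ovn-sbctl}

# Step 1: delete the stale chassis by its ID.
delete_chassis() {
    $SBCTL chassis-del "$1"
}

# Step 2: list Chassis_Private rows that no longer point at a chassis.
# With --bare, rows are separated by blank lines, one _uuid per row.
orphaned_chassis_private() {
    $SBCTL --bare --columns=_uuid find Chassis_Private 'chassis=[]'
}

# Step 3: destroy every orphaned row reported by step 2.
destroy_orphans() {
    orphaned_chassis_private | while read -r row; do
        if [ -n "$row" ]; then
            $SBCTL destroy Chassis_Private "$row"
        fi
    done
}
```

A typical sequence would be "delete_chassis <chassis-id>" followed by "destroy_orphans", then the "openstack network agent list" check.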

Comment 12 Alex Stupnikov 2021-06-23 11:30:15 UTC
Hello.

We successfully used the workaround described in comment #8 to resolve a similar problem in a customer's deployment. "neutron agent-list" was still broken, but that could be caused by other OVN problems, so we reported a separate bug, #1975264.

Short notes about steps taken:

- chassis_name value from output [2] should be used as an argument for command [1]
- name value from output [3] should be used as an argument for command [4]
- tripleo_ovn_controller and tripleo_ovn_metadata_agent on affected nodes (nodes that have the specified errors in their logs) must be restarted after the OVN DB is changed

[1]
  ovn-sbctl chassis-del <chassis-id>
[2]
  ovn-sbctl list encap |grep -a3 <IP address from ovn-controller.log>
[3]
  ovn-sbctl find Chassis_private chassis="[]"
[4]
  ovn-sbctl destroy Chassis_Private <listed_id>
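
The first note (chassis_name from output [2] feeds command [1]) can be scripted instead of read from the grep context. A minimal sketch, assuming "ovn-sbctl list encap" prints one "column : value" pair per line with chassis_name listed before ip in each record; the function name chassis_for_encap_ip is hypothetical.

```shell
# Print the chassis_name of every Encap record whose ip matches $1.
# Reads "ovn-sbctl list encap" output on stdin.
chassis_for_encap_ip() {
    awk -v ip="$1" '
        # Remember the most recent chassis_name, without its quotes.
        $1 == "chassis_name" { name = $3; gsub(/"/, "", name) }
        # When the ip line of the same record matches, print that name.
        $1 == "ip" { v = $3; gsub(/"/, "", v); if (v == ip) print name }
    '
}
```

For example, "ovn-sbctl list encap | chassis_for_encap_ip 10.14.2.7" would print the value to pass to chassis-del.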

Regards, Alex.

Comment 17 Terry Wilson 2022-02-15 14:06:58 UTC
*** Bug 2049763 has been marked as a duplicate of this bug. ***

