Description of problem:

The call to the ovsdbapp command that changes a component in the OVN DB fails with KeyError: UUID('x-y-z'). This happens right after:

ERROR (Thread-3) [ovsdbapp.backend.ovs_idl.vlog] ssl:127.0.0.1:6641: no response to inactivity probe after 5 seconds, disconnecting (reconnect:123)

It seems that the connection is disrupted and the ovsdbapp internal state is corrupted. It happens only once, right after restarting ovirt-provider-ovn, which uses ovsdbapp to communicate with the OVN database. The OVN network is actually created even though we get the KeyError.

Version-Release number of selected component (if applicable):
python3-ovsdbapp-0.17.5-1.el8.noarch
ovn2.13-20.12.0-85.el8s.x86_64
ovn2.13-central-20.12.0-85.el8s.x86_64
openvswitch2.13-2.13.0-115.el8s.x86_64
python3-openvswitch2.13-2.13.0-115.el8s.x86_64

How reproducible:
100%

Steps to Reproduce:
1. systemctl restart ovirt-provider-ovn
2. Run https://gist.github.com/almusil/696ae7f51624bcf9e5e00c4de8323c32

Actual results:
Fails with KeyError: UUID('x-y-z')

Expected results:
Should pass
Thanks for providing the environment. I did a git bisect on python-ovs and was able to track down the patch that introduced the regression:

https://github.com/openvswitch/ovs/commit/c39751e44539a014e642bcd930cb9e3a33af1805

I'm switching the component to the OVS team.
Created attachment 1804661 [details]
Use a persistent connection

I've been investigating as well, and I have some concerns with the ovirt usage of ovsdbapp.

1) It looks like the intention is for NeutronApi's context manager to create a new connection to ovsdb for each request. This is in general a bad idea. python-ovs, which ovsdbapp uses, downloads a copy of the database into memory on every connection and then operates on that. The whole point of using ovsdbapp instead of ovn-nbctl etc. is that it can create a long-lived connection and avoid dumping the db over a socket for every request.

2) ovn_connection.connect() does:

    ovsidl = ...
    return ovsidl, OvnNbApiIdlImpl(Connection(idl=ovsidl, timeout=100))

The Connection is assigned to OvnNbApiIdlImpl.ovsdb_connection, which is a *class variable*, i.e. global. OvnNbApiIdlImpl derives from ovsdbapp.backend.ovs_idl.Backend, which does:

    def __init__(self, connection, start=True, **kwargs):
        super(Backend, self).__init__(**kwargs)
        if start:
            self.start_connection(connection)

    @classmethod
    def start_connection(cls, connection):
        try:
            if cls.ovsdb_connection is None:
                cls.ovsdb_connection = connection
                cls.ovsdb_connection.start()
        except Exception as e:
            connection_exception = exceptions.OvsdbConnectionUnavailable(
                db_schema=cls.schema, error=e)
            LOG.exception(connection_exception)
            raise connection_exception

As you can see, if ovsdb_connection is already set, the connection object passed in is ignored. This means that on subsequent calls of connect(), the ovsidl being returned is not the idl stored on OvnNbApiIdlImpl.ovsdb_connection, which is what all of the code actually uses. So when NeutronApi.__exit__ is called and calls self.ovsidl.close(), it is closing a connection that was never actually initiated or used. You could just access self.idl.ovsdb_connection.idl in __exit__() (and self.idl.tables in are_security_groups_supported), but really, __exit__ shouldn't do anything at all.
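The class-variable behavior described above can be demonstrated with a minimal, self-contained sketch. This is not ovsdbapp's actual code; the Connection and Backend classes here are stand-ins that keep only the `if cls.ovsdb_connection is None` logic from start_connection():

```python
class Connection:
    """Stand-in for ovsdbapp's Connection: just tracks whether start() ran."""
    def __init__(self, name):
        self.name = name
        self.started = False

    def start(self):
        self.started = True


class Backend:
    # Class variable: shared by every instance, effectively a global.
    ovsdb_connection = None

    def __init__(self, connection):
        self.start_connection(connection)

    @classmethod
    def start_connection(cls, connection):
        # Only the *first* connection ever passed in is stored and started;
        # all later ones are silently ignored.
        if cls.ovsdb_connection is None:
            cls.ovsdb_connection = connection
            cls.ovsdb_connection.start()


conn1 = Connection("conn-1")
conn2 = Connection("conn-2")
first = Backend(conn1)
second = Backend(conn2)

print(Backend.ovsdb_connection.name)  # conn-1
print(conn2.started)                  # False: the second connection was never started
```

This is exactly why closing the ovsidl returned by a second connect() call closes a connection that was never used: the class attribute still points at the first one.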
Closing the connection on every request defeats the purpose of ovsdbapp. Basically, connect() could just be called once at the beginning of the app and the connection left up. I'm attaching a patch that shows the changes I would make (and that seems to work with the reproducer script, though I typed it in manually on the test VM since we were not using a git repo there, so there could be typos in the patch).
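The "connect once, reuse everywhere" pattern suggested above can be sketched as a lazy, thread-safe singleton. The `_Api` class below is a hypothetical stand-in for the real OvnNbApiIdlImpl; in the actual provider the body of get_api() would call ovn_connection.connect() once:

```python
import threading

class _Api:
    """Hypothetical stand-in for the long-lived ovsdbapp API object."""
    def __init__(self):
        self.connected = True

_api = None
_lock = threading.Lock()

def get_api():
    """Return the single long-lived API object, creating it on first use.

    Every request handler calls this instead of opening (and later
    closing) its own connection, so the db is dumped over the socket
    only once, at startup.
    """
    global _api
    with _lock:
        if _api is None:
            _api = _Api()  # real code: connect to the OVN NB db here
        return _api
```

With this shape there is no per-request __exit__ cleanup at all; the connection simply stays up for the lifetime of the process.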
Thank you Terry, I have applied the suggested change and so far it looks good. We will do some testing to verify it works without any regression. I'll lower the severity; even though we have this workaround, the original bug should nevertheless be fixed.
Based on comment #4 and comment #5, reassigning this to python-ovsdbapp. If you think this issue should still be fixed/improved in OVS, please re-assign it back to OVS, describing the desired outcome.
Thanks,
fbl
(In reply to Flavio Leitner from comment #6)
> Based on the comment#4 and comment#5, reassigning this to python-ovsdbapp.
> If you think this issue still should be fixed/improved in OVS, please
> re-assign back to OVS with the describing the desired outcome.
> Thanks,
> fbl

I think comments 4 and 5 were aimed at the ovirt project, as a workaround for the failure at the first transaction after reconnecting. Based on comment 3, I still believe the problem is in python-ovs and that it should be fixed there.
There has been a KeyError that we've been getting in certain cases because python-ovs allows transactions to be sent before it has started monitoring the DB. So if you send a transaction that creates an object immediately after connecting, we can call get_insert_uuid() to get the object and then try to look it up via idl.tables['Port_Group'].rows[insert_uuid], and fail because the initial db dump hasn't happened yet.

This patch up for review, which mirrors a change made to the C IDL years ago, should fix that:

https://patchwork.ozlabs.org/project/openvswitch/patch/20210901161526.237479-1-twilson@redhat.com/

With ovsdbapp, we always worked around the issue by waiting for the seqno to change, which, before we added support for clustered dbs and requesting the _Server DB, was always the initial monitor response; that no longer works.

@fleitner if you get a chance to review/merge that and then get it downstream, I believe we can close this issue (and several others). It should port down to 2.15, and then, with a couple of extra lines for tracking IDL_S_MONITORING, back to 2.13 I think.
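The seqno-based workaround described above can be sketched with a simplified model. FakeIdl below is not python-ovs; it only mimics the one property that matters here, namely that Idl.change_seqno advances once run() has processed the initial monitor reply. The helper refuses to let the caller transact until that has happened:

```python
import time

class FakeIdl:
    """Toy model of ovs.db.idl.Idl: change_seqno advances once the
    initial db dump has been processed (here, after three run() calls)."""
    def __init__(self):
        self.change_seqno = 0
        self._ticks = 0

    def run(self):
        self._ticks += 1
        if self._ticks >= 3:  # pretend the initial monitor reply arrived
            self.change_seqno = 1


def wait_for_initial_dump(idl, timeout=5.0):
    """Spin idl.run() until change_seqno advances past its starting value.

    Only after this returns is it safe to create a row and look it up
    in idl.tables[...].rows, since the local replica now exists.
    """
    deadline = time.monotonic() + timeout
    start = idl.change_seqno
    while idl.change_seqno == start:
        if time.monotonic() > deadline:
            raise TimeoutError("no initial monitor reply")
        idl.run()


idl = FakeIdl()
wait_for_initial_dump(idl)
print(idl.change_seqno)  # 1
```

The breakage mentioned above is that with clustered dbs the first seqno change can come from the _Server DB rather than the monitored DB, so "seqno changed" no longer guarantees the initial dump has landed; hence the fix belongs in python-ovs itself.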
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (openvswitch2.13 update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2022:0051