Bug 1980269

Summary: [python-ovs] KeyError raised after OVS inactivity probe disconnect
Product: Red Hat Enterprise Linux Fast Datapath
Reporter: Ales Musil <amusil>
Component: openvswitch2.13
Assignee: Timothy Redaelli <tredaelli>
Status: CLOSED ERRATA
QA Contact: qding
Severity: medium
Priority: high
Version: FDP 21.A
CC: cgoncalves, ctrautma, dholler, egarciar, fleitner, jhsiao, jlibosva, mburman, michele, mperina, nusiddiq, pmannidi, qding, ralongi, tredaelli, twilson
Flags: fleitner: needinfo-
Fixed In Version: openvswitch2.13-2.13.0-133.el8fdp
Last Closed: 2022-01-10 16:49:55 UTC
Type: Bug
Bug Blocks: 1940824    
Attachments:
Use a persistent connection

Description Ales Musil 2021-07-08 08:34:49 UTC
Description of problem:
A call to an ovsdbapp command that changes a component in the OVN DB fails with
KeyError: UUID('x-y-z'). This happens right after:
ERROR   (Thread-3) [ovsdbapp.backend.ovs_idl.vlog] ssl:127.0.0.1:6641: no response to inactivity probe after 5 seconds, disconnecting (reconnect:123)

It seems that the connection is disrupted and the ovsdbapp internal state
is corrupted. It happens only once, right after restarting ovirt-provider-ovn,
which uses ovsdbapp to communicate with the OVN database. The OVN network is
actually created even though we get the KeyError.


Version-Release number of selected component (if applicable):
python3-ovsdbapp-0.17.5-1.el8.noarch
ovn2.13-20.12.0-85.el8s.x86_64
ovn2.13-central-20.12.0-85.el8s.x86_64
openvswitch2.13-2.13.0-115.el8s.x86_64
python3-openvswitch2.13-2.13.0-115.el8s.x86_64

How reproducible:
100%

Steps to Reproduce:
1. systemctl restart ovirt-provider-ovn
2. Run https://gist.github.com/almusil/696ae7f51624bcf9e5e00c4de8323c32


Actual results:
Fails with KeyError: UUID('x-y-z')

Expected results:
Should pass

Comment 3 Jakub Libosvar 2021-07-15 16:33:26 UTC
Thanks for providing the environment. I did a git bisect on python-ovs and was able to track it down to the patch that introduced the regression:

https://github.com/openvswitch/ovs/commit/c39751e44539a014e642bcd930cb9e3a33af1805

I'm switching the component to the OVS team.

Comment 4 Terry Wilson 2021-07-22 20:58:10 UTC
Created attachment 1804661 [details]
Use a persistent connection

I've been investigating as well, and I have some concerns with the ovirt usage of ovsdbapp.

1) It looks like the intention is for NeutronApi's context manager to create a new connection to ovsdb for each request.

This in general is a bad idea. python-ovs, which ovsdbapp uses, downloads a copy of the database into memory on every connection and then operates on that. The whole point of using ovsdbapp instead of ovn-nbctl etc. is that it can create a long-lived connection and avoid dumping the db over a socket for every request.
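
For reference, a minimal sketch of the intended long-lived pattern (the class and helper names follow ovsdbapp's ovs_idl backend and OVN NB schema module; the endpoint and timeout are assumptions taken from this report, not a drop-in fix):

    from ovsdbapp.backend.ovs_idl import connection
    from ovsdbapp.schema.ovn_northbound import impl_idl

    # Connect once at application startup; the in-memory copy of the
    # database is then downloaded a single time and kept up to date.
    ovsidl = connection.OvsdbIdl.from_server('ssl:127.0.0.1:6641',
                                             'OVN_Northbound')
    api = impl_idl.OvnNbApiIdlImpl(connection.Connection(idl=ovsidl,
                                                         timeout=100))
    # ... reuse `api` for every request instead of reconnecting.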

2) ovn_connection.connect() does:

    ovsidl = ...
    return ovsidl, OvnNbApiIdlImpl(Connection(idl=ovsidl, timeout=100))

The Connection is assigned to OvnNbApiIdlImpl.ovsdb_connection, which is a *class variable*, i.e. global. OvnNbApiIdlImpl derives from ovsdbapp.backend.ovs_idl.Backend, which does:

    def __init__(self, connection, start=True, **kwargs):
        super(Backend, self).__init__(**kwargs)
        if start:
            self.start_connection(connection)

    @classmethod
    def start_connection(cls, connection):
        try:
            if cls.ovsdb_connection is None:
                cls.ovsdb_connection = connection
                cls.ovsdb_connection.start()
        except Exception as e:
            connection_exception = exceptions.OvsdbConnectionUnavailable(
                db_schema=cls.schema, error=e)
            LOG.exception(connection_exception)
            raise connection_exception


As you can see, if ovsdb_connection is already set, the connection object passed in is ignored. This means that on subsequent calls of connect(), the ovsidl being returned is not the idl stored on OvnNbApiIdlImpl.ovsdb_connection, which is what all of the code actually uses. So when NeutronApi.__exit__ is called and invokes self.ovsidl.close(), it closes a connection that was never actually initiated or used.
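
A hypothetical illustration of the consequence (idl_a/idl_b are invented names; the Connection/OvnNbApiIdlImpl usage mirrors the snippet above):

    api_a = OvnNbApiIdlImpl(Connection(idl=idl_a, timeout=100))  # starts idl_a
    api_b = OvnNbApiIdlImpl(Connection(idl=idl_b, timeout=100))  # idl_b ignored
    # Both api_a and api_b share the class attribute ovsdb_connection and
    # therefore talk through idl_a; closing idl_b tears down a connection
    # that was never started.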

You could just access self.idl.ovsdb_connection.idl in __exit__() (and self.idl.tables in are_security_groups_supported). But really, __exit__ shouldn't do anything at all: closing the connection on every request defeats the purpose of ovsdbapp. Basically, connect() could just be called once at the beginning of the app and left up. Attaching a patch that shows the changes I would make (and that seem to work with the reproducer script, though I manually typed it in on the test VM since we were not using a git repo there, so there could be typos in the patch).
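
In other words, something along these lines in the provider (a hypothetical sketch; the attached patch is the authoritative version):

    _ovsidl = _nb_api = None

    def get_nb_api():
        # Create the connection lazily, once, and reuse it for all requests.
        global _ovsidl, _nb_api
        if _nb_api is None:
            _ovsidl, _nb_api = ovn_connection.connect()
        return _nb_api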

Comment 5 Ales Musil 2021-07-26 05:08:33 UTC
Thank you Terry, I have applied the suggested change and so far it looks good. We will do some testing to make sure it works without any regression.
I'll lower the severity; even though we have this workaround, the original bug should nevertheless be fixed.

Comment 6 Flavio Leitner 2021-08-02 18:14:51 UTC
Based on comment #4 and comment #5, reassigning this to python-ovsdbapp.
If you think this issue should still be fixed/improved in OVS, please reassign it back to OVS with a description of the desired outcome.
Thanks,
fbl

Comment 7 Jakub Libosvar 2021-08-03 09:38:22 UTC
(In reply to Flavio Leitner from comment #6)
> Based on comment #4 and comment #5, reassigning this to python-ovsdbapp.
> If you think this issue should still be fixed/improved in OVS, please
> reassign it back to OVS with a description of the desired outcome.
> Thanks,
> fbl

I think comments 4 and 5 were about the ovirt project working around the failure at the first transaction after reconnecting. I still believe the problem is in python-ovs, based on comment 3, and it should get fixed there.

Comment 15 Terry Wilson 2021-09-01 20:16:47 UTC
There has been a KeyError that we've been getting in certain cases because python-ovs allows transactions to be sent before it has started monitoring the DB. So if you send a transaction that creates an object immediately after connecting, we can call get_insert_uuid() to get the object's UUID, then try to look it up via idl.tables['Port_Group'].rows[insert_uuid] and fail because the initial DB dump hasn't happened yet (sketched below).
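
At the python-ovs level the failing sequence looks roughly like this (a sketch using ovs.db.idl names; 'Port_Group' is the table from the report above):

    txn = ovs.db.idl.Transaction(idl)
    tmp = txn.insert(idl.tables['Port_Group'])
    txn.commit_block()                        # the server applies the insert
    real_uuid = txn.get_insert_uuid(tmp.uuid)
    # KeyError: the local replica has no rows until the initial dump arrives.
    row = idl.tables['Port_Group'].rows[real_uuid]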

The patch up for review at https://patchwork.ozlabs.org/project/openvswitch/patch/20210901161526.237479-1-twilson@redhat.com/ should fix that; it mirrors a change made to the C IDL years ago. With ovsdbapp, we always worked around the issue by waiting for the seqno to change, which, before support for clustered DBs and requesting the _Server DB was added, was always the initial monitor response. That workaround no longer works.
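
The old workaround, roughly (a sketch modeled on ovsdbapp's idlutils.wait_for_change helper; a real implementation would also track the deadline and raise on expiry):

    import ovs.poller

    def wait_for_change(idl, timeout):
        # Block until change_seqno moves, i.e. until the Idl has processed
        # at least one monitor update from the server.
        seqno = idl.change_seqno
        while idl.change_seqno == seqno and not idl.run():
            p = ovs.poller.Poller()
            idl.wait(p)
            p.timer_wait(timeout * 1000)
            p.block()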

@fleitner if you get a chance to review/merge that and then get it downstream, I believe we can close this issue (and several others). It should port down to 2.15 as-is and, with a couple of extra lines for tracking IDL_S_MONITORING, back to 2.13 as well, I think.

Comment 22 errata-xmlrpc 2022-01-10 16:49:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (openvswitch2.13 update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0051