Bug 1980269 - [python-ovs] KeyError raised after ovs probe inactivity
Summary: [python-ovs] KeyError raised after ovs probe inactivity
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux Fast Datapath
Classification: Red Hat
Component: openvswitch2.13
Version: FDP 21.A
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Timothy Redaelli
QA Contact: qding
URL:
Whiteboard:
Depends On:
Blocks: 1940824
 
Reported: 2021-07-08 08:34 UTC by Ales Musil
Modified: 2022-01-10 16:50 UTC
CC List: 16 users

Fixed In Version: openvswitch2.13-2.13.0-133.el8fdp
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2022-01-10 16:49:55 UTC
Target Upstream Version:
Embargoed:
fleitner: needinfo-


Attachments
Use a persistent connection (2.40 KB, patch)
2021-07-22 20:58 UTC, Terry Wilson
no flags


Links
Red Hat Issue Tracker FD-1428 (last updated 2021-08-09 05:24:52 UTC)
Red Hat Product Errata RHBA-2022:0051 (last updated 2022-01-10 16:50:00 UTC)

Description Ales Musil 2021-07-08 08:34:49 UTC
Description of problem:
The call to an ovsdbapp command that changes a component in the OVN DB fails with
KeyError: UUID('x-y-z'). This happens right after:
ERROR   (Thread-3) [ovsdbapp.backend.ovs_idl.vlog] ssl:127.0.0.1:6641: no response to inactivity probe after 5 seconds, disconnecting (reconnect:123)

It seems that the connection is disrupted and the ovsdbapp internal state
is corrupted. It only happens once, right after restarting ovirt-provider-ovn,
which uses ovsdbapp to communicate with the OVN database. The OVN network is actually created even though we get the KeyError.


Version-Release number of selected component (if applicable):
python3-ovsdbapp-0.17.5-1.el8.noarch
ovn2.13-20.12.0-85.el8s.x86_64
ovn2.13-central-20.12.0-85.el8s.x86_64
openvswitch2.13-2.13.0-115.el8s.x86_64
python3-openvswitch2.13-2.13.0-115.el8s.x86_64

How reproducible:
100%

Steps to Reproduce:
1. systemctl restart ovirt-provider-ovn
2. Run https://gist.github.com/almusil/696ae7f51624bcf9e5e00c4de8323c32


Actual results:
Fails with KeyError: UUID('x-y-z')

Expected results:
Should pass

Comment 3 Jakub Libosvar 2021-07-15 16:33:26 UTC
Thanks for providing the environment. I did a git bisect on python-ovs and was able to track down the patch that introduced the regression:

https://github.com/openvswitch/ovs/commit/c39751e44539a014e642bcd930cb9e3a33af1805

I'm switching the component to the OVS team.

Comment 4 Terry Wilson 2021-07-22 20:58:10 UTC
Created attachment 1804661 [details]
Use a persistent connection

I've been investigating as well, and I have some concerns with the ovirt usage of ovsdbapp.

1) It looks like the intention is for NeutronApi's context manager to create a new connection to ovsdb for each request.

This in general is a bad idea. python-ovs, which ovsdbapp uses, downloads a copy of the database into memory on every connection and then operates on that. The whole point of using ovsdbapp instead of ovn-nbctl etc. is that it can create a long-lived connection and avoid dumping the db over a socket for every request.
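
For illustration, a minimal sketch of the long-lived pattern ovsdbapp is built around (the connection string, schema name, and 'net1' are only example values, not taken from the ovirt code):

    from ovsdbapp.backend.ovs_idl import connection
    from ovsdbapp.schema.ovn_northbound import impl_idl

    OVN_NB = 'ssl:127.0.0.1:6641'  # example connection string

    # Connect once at startup: python-ovs dumps the DB into memory here
    # and then keeps that copy up to date through the monitor session.
    idl = connection.OvsdbIdl.from_server(OVN_NB, 'OVN_Northbound')
    api = impl_idl.OvnNbApiIdlImpl(connection.Connection(idl=idl, timeout=100))

    # Reuse 'api' for every request instead of reconnecting each time.
    api.ls_add('net1').execute(check_error=True)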

2) ovn_connection.connect() does:

    ovsidl = ...
    return ovsidl, OvnNbApiIdlImpl(Connection(idl=ovsidl, timeout=100))

The Connection is assigned to OvnNbApiIdlImpl.ovsdb_connection, which is a *class variable*, i.e. global. OvnNbApiIdlImpl derives from ovsdbapp.backend.ovs_idl.Backend, which does:


    def __init__(self, connection, start=True, **kwargs):
        super(Backend, self).__init__(**kwargs)
        if start:
            self.start_connection(connection)

    @classmethod
    def start_connection(cls, connection):
        try:
            if cls.ovsdb_connection is None:
                cls.ovsdb_connection = connection
                cls.ovsdb_connection.start()
        except Exception as e:
            connection_exception = exceptions.OvsdbConnectionUnavailable(
                db_schema=cls.schema, error=e)
            LOG.exception(connection_exception)
            raise connection_exception


As you can see, if ovsdb_connection is already set, it ignores the connection object passed in. This means that on subsequent calls of connect(), the ovsidl being returned is not the idl stored on OvnNbApiIdlImpl.ovsdb_connection--which is what all of the code actually uses. So when NeutronApi.__exit__ is called and calls self.ovsidl.close(), it is closing a connection that was never actually initiated or used.

You could just access self.idl.ovsdb_connection.idl in __exit__() (and self.idl.tables in are_security_groups_supported). But really, __exit__ shouldn't do anything at all: closing the connection on every request defeats the purpose of ovsdbapp. Basically, connect() could just be called once at the beginning of the app and left up. Attaching a patch that shows the changes I would make (and that seem to work with the reproducer script--though I manually typed it in on the test VM since we were not using a git repo there, so there could be typos in the patch).
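
Roughly, the shape of that change (a sketch only, not the attached patch; the ovn_connection/NeutronApi names are from ovirt-provider-ovn, the caching logic here is just assumed):

    # ovn_connection.connect(), sketched: build the connection once and
    # hand back the same API object on every later call.
    _nb_api = None

    def connect():
        global _nb_api
        if _nb_api is None:
            ovsidl = ...  # built exactly as before
            _nb_api = OvnNbApiIdlImpl(Connection(idl=ovsidl, timeout=100))
        return _nb_api

    # NeutronApi.__exit__, sketched: stop closing the connection per request.
    def __exit__(self, exc_type, exc_value, traceback):
        pass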

Comment 5 Ales Musil 2021-07-26 05:08:33 UTC
Thank you Terry, I have applied the suggested change and so far it looks good. We will do some testing to make sure it works without any regression.
I'll lower the severity; even though we have this workaround, the original bug should still be fixed.

Comment 6 Flavio Leitner 2021-08-02 18:14:51 UTC
Based on comment #4 and comment #5, reassigning this to python-ovsdbapp.
If you think this issue should still be fixed/improved in OVS, please re-assign it back to OVS with a description of the desired outcome.
Thanks,
fbl

Comment 7 Jakub Libosvar 2021-08-03 09:38:22 UTC
(In reply to Flavio Leitner from comment #6)
> Based on comment #4 and comment #5, reassigning this to python-ovsdbapp.
> If you think this issue should still be fixed/improved in OVS, please
> re-assign it back to OVS with a description of the desired outcome.
> Thanks,
> fbl

I think comments 4 and 5 were for the oVirt project to work around the failure at the first transaction after reconnecting. Based on comment 3, I still believe the problem is in python-ovs and it should get fixed there.

Comment 15 Terry Wilson 2021-09-01 20:16:47 UTC
There has been a KeyError that we've been getting in certain cases because python-ovs allows transactions to be sent before it has started monitoring the DB. So if you send a transaction that creates an object immediately after connecting, we can call get_insert_uuid() to get the object's UUID, then try to look it up via idl.tables['Port_Group'].rows[insert_uuid] and fail because the initial DB dump hasn't happened yet.
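
Sketched at the python-ovs level (the table, column value, and the bare commit_block() call are illustrative; 'idl' is assumed to be an already constructed ovs.db.idl.Idl), the failing sequence looks like:

    import ovs.db.idl

    # 'idl' has just connected but has NOT yet processed the initial
    # monitor reply, so idl.tables[...].rows is still empty.
    txn = ovs.db.idl.Transaction(idl)
    row = txn.insert(idl.tables['Port_Group'])   # provisional UUID
    row.name = 'pg-example'                      # hypothetical value
    txn.commit_block()                           # server applies the insert
    real_uuid = txn.get_insert_uuid(row.uuid)
    # KeyError(UUID(...)): the local table copy has no rows yet because
    # the initial DB dump still hasn't arrived.
    pg = idl.tables['Port_Group'].rows[real_uuid]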

The patch up for review at https://patchwork.ozlabs.org/project/openvswitch/patch/20210901161526.237479-1-twilson@redhat.com/ should fix that; it mirrors a change made to the C IDL years ago. With ovsdbapp, we always worked around the issue by waiting for the seqno to change, which, before we added support for clustered DBs and requesting the _Server DB, was always the initial monitor response; that workaround no longer works.

@fleitner if you get a chance to review/merge that and then get it downstream, I believe we can close this issue (and several others). It should port down to 2.15, and then, with a couple of lines added to track IDL_S_MONITORING, back to 2.13, I think.

Comment 22 errata-xmlrpc 2022-01-10 16:49:55 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (openvswitch2.13 update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2022:0051

