Bug 1546719 - OVS errors in ovs_idl.connection thread not handled properly
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Red Hat OpenStack
Classification: Red Hat
Component: python-ovsdbapp
Version: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Priority: medium
Severity: high
Target Milestone: z9
Target Release: 13.0 (Queens)
Assignee: Terry Wilson
QA Contact: Eran Kuris
URL:
Whiteboard:
Depends On:
Blocks: 1757512
Reported: 2018-02-19 12:11 UTC by Marcin Mirecki
Modified: 2019-12-13 16:27 UTC (History)

Fixed In Version: python-ovsdbapp-0.10.4-1.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1757512 (view as bug list)
Environment:
Last Closed: 2019-12-10 14:26:46 UTC
Target Upstream Version:
Embargoed:


Links
System ID Private Priority Status Summary Last Updated
OpenStack gerrit 564624 0 'None' MERGED Ensure timeout on queueing transaction 2020-02-28 11:10:21 UTC
OpenStack gerrit 678004 0 'None' MERGED Ensure timeout on queueing transaction 2020-02-28 11:10:20 UTC

Description Marcin Mirecki 2018-02-19 12:11:53 UTC
The ovs_idl.connection thread in ovsdbapp does not handle errors.
If an exception is raised anywhere in the thread, the thread dies.
This happens, for example, when creating an lsp with Unicode characters in its name: an exception is raised in ovs.db.idl.run(), which is called here:
    https://github.com/openstack/ovsdbapp/blob/master/ovsdbapp/backend/ovs_idl/connection.py#L95
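
The failure mode described above can be reproduced without OVS at all. The following is only a minimal sketch, not ovsdbapp code: FakeIdl, worker, and demo are hypothetical names, and the ASCII encoding step merely stands in for whatever fails inside ovs.db.idl.run() when it sees a non-ASCII name.

```python
import queue
import threading


class FakeIdl:
    """Hypothetical stand-in for the IDL; run() fails on non-ASCII input."""

    def __init__(self, updates):
        self.updates = list(updates)

    def run(self):
        if self.updates:
            # Raises UnicodeEncodeError for a name such as "lsp-\u017c".
            self.updates.pop(0).encode("ascii")


def worker(idl, txns, results):
    """Connection-style loop with no error handling (the reported problem)."""
    while True:
        idl.run()          # any exception here ends the thread
        txn = txns.get()
        if txn is None:    # sentinel used only by this sketch
            return
        results.append(txn)


def demo():
    txns = queue.Queue()
    results = []
    thread = threading.Thread(
        target=worker,
        args=(FakeIdl(["lsp-\u017c"]), txns, results),
        daemon=True,
    )
    thread.start()
    thread.join(timeout=5)   # returns once the thread has died
    txns.put("add-port")     # queued, but nobody is left to process it
    return thread.is_alive(), results
```

Calling demo() in this sketch returns a dead thread and an empty result list: the transaction is accepted into the queue but never runs, and the caller gets no error.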

Comment 1 Terry Wilson 2018-04-11 20:40:47 UTC
If ovs.db.idl.run() raises an Exception, it means that the server has most likely sent us a message that ovsdbapp can't handle. run() is called here outside of a Transaction, so there is no way to pass the Exception back to the other thread like we do for Exceptions in do_commit().

I think the only thing we could really do would be to log the exception and continue, though I'm not sure this would be good. If we've gotten an exception when updating our in-memory copy of the DB, that means we are now no longer reflecting what is in the database. Messages that we have sent to the DB have modified it, but those changes will not be reflected when we examine idl.tables[table].rows, etc.

Another option would be to force a reconnect, but that would cause the whole database to be dumped back into memory, including whatever message most likely caused the Exception in the first place. I'm just not sure that an Exception in idl.run() is actually recoverable. We can log as much information as we can, but I think that stopping might actually be the best action to take in this exceptional instance, perhaps shutting down the thread more cleanly. But ovsdbapp is designed to allow a txn to be queued even if we aren't currently connected (it will be run upon connection), so we still wouldn't be notifying the caller that anything had happened until they time out.
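
To illustrate the distinction drawn in this comment, here is a rough sketch of how an exception raised while committing a transaction can be handed back to the calling thread through a per-transaction result queue, and why a transaction queued after the worker has stopped only surfaces as a timeout. The Txn class and worker function are hypothetical and only loosely modeled on the behavior described here; they are not the ovsdbapp implementation.

```python
import queue
import threading


class Txn:
    """Hypothetical transaction carrying its own one-slot result queue."""

    def __init__(self, func):
        self.func = func
        self.results = queue.Queue(1)

    def do_commit(self):
        return self.func()

    def result(self, timeout):
        # Raises queue.Empty if the worker never answers in time.
        outcome = self.results.get(timeout=timeout)
        if isinstance(outcome, Exception):
            raise outcome        # re-raise in the calling thread
        return outcome


def worker(txns):
    while True:
        txn = txns.get()
        if txn is None:          # sentinel used only by this sketch
            return
        try:
            txn.results.put(txn.do_commit())
        except Exception as ex:
            # Inside a transaction there is a caller to report to.
            txn.results.put(ex)
```

An error raised outside do_commit() has no Txn object to attach to, which is the situation described for run(). Once the worker has exited, a caller waiting on result() learns nothing until its own timeout expires.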

Comment 2 Terry Wilson 2018-04-26 20:50:14 UTC
Since it isn't really possible to recover (the in-memory copy of the database will be out of sync if we've failed on a read from the database in idl.run()), I've added a patch to log the exception and to ensure that we eventually time out when queueing a transaction.
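
A rough sketch of the approach described in this comment (log the exception, stop, and make sure queueing a transaction eventually times out) might look like the following. This is not the merged patch: Connection, QueueTimeout, run_once, and the one-slot bounded queue are assumptions made for illustration.

```python
import logging
import queue
import threading

LOG = logging.getLogger(__name__)


class QueueTimeout(Exception):
    """Hypothetical error raised when a transaction cannot be queued in time."""


class Connection:
    def __init__(self, run_once, timeout):
        self.run_once = run_once      # stand-in for the idl.run() step
        self.timeout = timeout
        self.txns = queue.Queue(1)    # bounded queue (an assumption)
        self.thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self.thread.start()

    def _loop(self):
        while True:
            try:
                self.run_once()
            except Exception:
                # The in-memory copy may be out of sync: log and stop.
                LOG.exception("Unhandled error in connection thread")
                return
            try:
                txn = self.txns.get(timeout=0.05)
            except queue.Empty:
                continue
            txn()

    def queue_txn(self, txn):
        try:
            # Block for at most self.timeout instead of forever.
            self.txns.put(txn, timeout=self.timeout)
        except queue.Full:
            raise QueueTimeout(
                "could not queue transaction within %s seconds" % self.timeout
            ) from None
```

With a run_once that raises, the thread logs the traceback and exits. The first queue_txn() call still succeeds because the queue has a free slot; the next one raises QueueTimeout after the configured timeout rather than blocking indefinitely.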

Comment 20 Eran Kuris 2019-11-04 13:27:37 UTC
According to Terry, this bug should be verified by functional test - impl_idl.
It looks like it failed in the latest run.
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-13-dsvm-functional-rhos/482/consoleFull

failed QA according to CI functional test failures.

Comment 24 Jakub Libosvar 2019-11-05 14:14:49 UTC
(In reply to Eran Kuris from comment #20)
> According to Terry, this bug should be verified by functional test - impl_idl.
> It looks like it failed in the latest run.
> https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-13-dsvm-functional-rhos/482/consoleFull
> 
> failed QA according to CI functional test failures.

Moving back to ON_QA as this is ovsdbapp and the failure above is from networking-ovn and related to backported port groups.
