Bug 1546719

Summary: OVS errors in ovs_idl.connection thread not handled properly
Product: Red Hat OpenStack
Reporter: Marcin Mirecki <mmirecki>
Component: python-ovsdbapp
Assignee: Terry Wilson <twilson>
Status: CLOSED CURRENTRELEASE
QA Contact: Eran Kuris <ekuris>
Severity: high
Docs Contact:
Priority: medium
Version: 13.0 (Queens)
CC: amuller, danken, ekuris, jlibosva, jschluet, njohnston, shdunne, twilson
Target Milestone: z9
Keywords: TestOnly, Triaged, ZStream
Target Release: 13.0 (Queens)
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: python-ovsdbapp-0.10.4-1.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1757512
Environment:
Last Closed: 2019-12-10 14:26:46 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1757512

Description Marcin Mirecki 2018-02-19 12:11:53 UTC
The ovsdbapp ovs_idl.connection thread does not handle errors.
If an exception is raised anywhere in the thread, the thread dies.
This happens, for example, when creating an lsp with Unicode characters in its name: an exception is raised in ovs.db.idl.run(), which is called here:
    https://github.com/openstack/ovsdbapp/blob/master/ovsdbapp/backend/ovs_idl/connection.py#L95
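
A minimal sketch of the failure mode, using only the Python standard library. FakeIdl, connection_loop and demo are hypothetical names made up for illustration; they are not ovsdbapp or ovs code, and the ASCII check merely simulates a run() call that raises on a name it cannot handle.

```python
import queue
import threading


class FakeIdl:
    """Hypothetical stand-in for an IDL object (not the real ovs.db.idl)."""

    def __init__(self):
        self.pending = queue.Queue()

    def run(self):
        # Simulated processing step: fails on names that are not pure ASCII.
        name = self.pending.get()
        name.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input


def connection_loop(idl):
    # No try/except here: the first exception from run() ends the thread.
    while True:
        idl.run()


def demo():
    idl = FakeIdl()
    thread = threading.Thread(target=connection_loop, args=(idl,), daemon=True)
    thread.start()
    idl.pending.put("port-1")       # processed normally
    idl.pending.put("port-\u00e9")  # makes run() raise inside the thread
    thread.join(timeout=5)
    # The caller gets no exception; it can only observe that the thread is gone.
    return thread.is_alive()
```

Calling demo() prints the traceback from the dead thread to stderr, but nothing is raised in the calling thread.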

Comment 1 Terry Wilson 2018-04-11 20:40:47 UTC
If ovs.db.idl.run() raises an Exception, it means that the server has most likely sent us a message that ovsdbapp can't handle. run() is called here outside of a Transaction, so there is no way to pass the Exception back to the other thread like we do for Exceptions in do_commit().

I think the only thing we could really do would be to log the exception and continue, though I'm not sure this would be good. If we've gotten an exception when updating our in-memory copy of the DB, that means we are now no longer reflecting what is in the database. Messages that we have sent to the DB have modified it, but those changes will not be reflected when we examine idl.tables[table].rows, etc.

Another option would be to force a reconnect, but that would cause the whole database to be dumped back into memory, including whatever message most likely caused the Exception in the first place. I'm just not sure that an Exception in idl.run() is actually recoverable. We can log as much information as we can, but I think that stopping might actually be the best action to take in this exceptional instance, perhaps while shutting down the thread more cleanly. But ovsdbapp is designed to allow a txn to be queued even if we aren't currently connected (it will be run upon connection), so we still wouldn't be notifying the caller that anything had happened until they time out.
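
For contrast, a rough sketch of how an exception raised while committing a transaction can be handed back to the calling thread. Txn, its results queue and worker are hypothetical names, assumed for illustration only; this is not the ovsdbapp implementation. The point is that the hand-off needs a per-transaction object to carry the result, which a bare idl.run() call does not have.

```python
import queue
import threading


class Txn:
    """Hypothetical transaction that carries its own result queue."""

    def __init__(self, func, timeout=5):
        self.func = func
        self.timeout = timeout
        self.results = queue.Queue(1)

    def do_commit(self):
        return self.func()

    def result(self):
        # Called from the submitting thread: re-raise what the worker caught.
        outcome = self.results.get(timeout=self.timeout)
        if isinstance(outcome, Exception):
            raise outcome
        return outcome


def worker(txns):
    # Runs in the connection thread; None is a sentinel that stops the loop.
    while True:
        txn = txns.get()
        if txn is None:
            return
        try:
            txn.results.put(txn.do_commit())
        except Exception as exc:
            # The transaction gives the exception a way back to its caller.
            txn.results.put(exc)
```

A caller would put a Txn on the shared queue and then call txn.result(), which either returns the commit value or raises the exception caught in the worker.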

Comment 2 Terry Wilson 2018-04-26 20:50:14 UTC
Since it isn't really possible to recover (the in-memory copy of the database will be out of sync if we've failed on a read from the database in idl.run()), I've added a patch to log the exception and to ensure that we eventually time out when queueing a transaction.
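
The approach described above could look roughly like the following sketch. It is not the actual patch: the Connection class, run_once callback and maxsize parameter are hypothetical, and the built-in TimeoutError stands in for whatever exception the library raises. The idea is that the thread logs the exception and stops, and a bounded queue with a put timeout makes callers fail after the timeout instead of blocking forever.

```python
import logging
import queue
import threading

LOG = logging.getLogger(__name__)


class Connection:
    """Hypothetical connection sketch (not the ovsdbapp class)."""

    def __init__(self, run_once, timeout, maxsize=1):
        self.run_once = run_once  # stand-in for a call such as idl.run()
        self.timeout = timeout
        self.txns = queue.Queue(maxsize)  # bounded, so put() can time out
        self.thread = threading.Thread(target=self._loop, daemon=True)

    def start(self):
        self.thread.start()

    def _loop(self):
        while True:
            try:
                self.run_once()
            except Exception:
                # Not recoverable: log with traceback and stop the thread.
                LOG.exception("Unhandled error in connection thread, stopping")
                return
            try:
                txn = self.txns.get(timeout=0.01)
            except queue.Empty:
                continue
            txn()

    def queue_txn(self, txn):
        # Queueing works even when the thread is not running; once the
        # queue is full and nobody drains it, the caller times out.
        try:
            self.txns.put(txn, timeout=self.timeout)
        except queue.Full:
            raise TimeoutError(
                "transaction not queued within %s seconds" % self.timeout
            ) from None
```

With maxsize=1, the first transaction queued after the thread has died is still accepted, and the next one raises TimeoutError once the timeout expires.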

Comment 20 Eran Kuris 2019-11-04 13:27:37 UTC
According to Terry, this bug should be verified by the impl_idl functional test.
It looks like that test failed in the latest run:
https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-13-dsvm-functional-rhos/482/consoleFull

Failed QA, based on the CI functional test failures.

Comment 24 Jakub Libosvar 2019-11-05 14:14:49 UTC
(In reply to Eran Kuris from comment #20)
> According to Terry, this bug should be verified by functional test - impl_idl.
> It looks like it failed in the latest run.
> https://rhos-qe-jenkins.rhev-ci-vms.eng.rdu2.redhat.com/view/DFG/view/network/view/networking-ovn/job/DFG-network-networking-ovn-13-dsvm-functional-rhos/482/consoleFull
> 
> failed QA according to CI functional test failures.

Moving back to ON_QA: this bug is against ovsdbapp, while the failure above comes from networking-ovn and is related to the backported port groups.