Bug 1563581

Summary: Unhandled connection exception in python-openvswitch/ovsdbapp code
Product: Red Hat OpenStack
Reporter: Mor <mkalfon>
Component: python-ovsdbapp
Assignee: Terry Wilson <twilson>
Status: CLOSED ERRATA
QA Contact: Toni Freger <tfreger>
Severity: medium
Docs Contact:
Priority: medium
Version: 13.0 (Queens)
CC: atragler, danken, dholler, jschluet, lmartins, mduarted, mkalfon, mmirecki, twilson
Target Milestone: ---
Keywords: Triaged, ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: python-ovsdbapp-0.10.4-2.el7ost
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2020-03-10 11:52:51 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
    logs
    Don't reuse poller object

Description Mor 2018-04-04 08:41:44 UTC
Created attachment 1417147 [details]
logs

Description of problem:
During automation failure analysis I noticed this error in ovirt-provider-ovn.log:
    2018-04-02 18:15:46,262 root From: 10.35.161.249:53708 Request: GET /v2.0/
    2018-04-02 18:15:46,262 auth.plugins.ovirt.sso Connecting to oVirt engine's SSO module: https://network-ge-2.scl.lab.tlv.redhat.com:443/ovirt-engine/sso/oauth/token-info
    2018-04-02 18:15:46,262 auth.plugins.ovirt.sso Connecting to oVirt engine's SSO module: https://network-ge-2.scl.lab.tlv.redhat.com:443/ovirt-engine/sso/oauth/token-info
    2018-04-02 18:15:46,264 requests.packages.urllib3.connectionpool Starting new HTTPS connection (1): network-ge-2.scl.lab.tlv.redhat.com
    2018-04-02 18:15:46,264 requests.packages.urllib3.connectionpool Starting new HTTPS connection (1): network-ge-2.scl.lab.tlv.redhat.com
    2018-04-02 18:15:46,311 requests.packages.urllib3.connectionpool "POST /ovirt-engine/sso/oauth/token-info HTTP/1.1" 200 735
    2018-04-02 18:15:46,311 requests.packages.urllib3.connectionpool "POST /ovirt-engine/sso/oauth/token-info HTTP/1.1" 200 735
    2018-04-02 18:15:46,443 root Uncaught exception
    Traceback (most recent call last):
      File "/usr/share/ovirt-provider-ovn/ovirt_provider_ovn.py", line 62, in run_with_except_hook
        run_original(*args2, **kwargs2)
      File "/usr/lib64/python2.7/threading.py", line 765, in run
        self.__target(*self.__args, **self.__kwargs)
      File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/connection.py", line 90, in run
        self.idl.wait(self.poller)
      File "/usr/lib/python2.7/site-packages/ovs/db/idl.py", line 295, in wait
        self._session.wait(poller)
      File "/usr/lib/python2.7/site-packages/ovs/jsonrpc.py", line 520, in wait
        self.rpc.wait(poller)
      File "/usr/lib/python2.7/site-packages/ovs/jsonrpc.py", line 215, in wait
        self.stream.run_wait(poller)
    AttributeError: 'NoneType' object has no attribute 'run_wait'
    2018-04-02 18:15:46,446 root Irrecoverable error. Exiting!

An unhandled exception is raised in the ovsdbapp or python-openvswitch package while trying to connect to the OVN northbound database.

Version-Release number of selected component (if applicable):
RHV 4.2.2.6-0.1.el7
openvswitch-2.9.0-1.el7fdp.x86_64
python-openvswitch-2.9.0-1.el7fdp.noarch
openvswitch-ovn-central-2.9.0-1.el7fdp.x86_64
openvswitch-ovn-common-2.9.0-1.el7fdp.x86_64
python-ovsdbapp-0.6.0-1.el7ev.noarch

How reproducible:
Reproduced only once - might be related to timing.

Steps to Reproduce:
Test failed doing the following steps:
1. Configure ovirt-provider-ovn to handle SSO authentication.
2. Restart ovirt-provider-ovn.
3. Restart ovirt-engine.

Actual results:
Error in description.

Expected results:
No error should appear.

Additional info:

Comment 2 Dominik Holler 2018-04-04 09:09:06 UTC
The situation in which the issue happens might be related to bug 1563574.

Comment 4 Terry Wilson 2018-10-31 20:30:51 UTC
ovirt is doing some really weird stuff with monkey-patching the threading module here: https://github.com/oVirt/ovirt-provider-ovn/blob/master/provider/ovirt_provider_ovn.py#L71
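
For reference, the traceback above shows the shape of that patch (run_with_except_hook wrapping run_original): Thread.__init__ is replaced so that every thread's run() gets wrapped in a hook that reports uncaught exceptions. A minimal sketch of that classic recipe, not the exact ovirt-provider-ovn code at the link, looks like this:

    import sys
    import threading

    def install_thread_excepthook():
        """Wrap every Thread's run() so uncaught exceptions reach sys.excepthook."""
        init_original = threading.Thread.__init__

        def init(self, *args, **kwargs):
            init_original(self, *args, **kwargs)
            run_original = self.run

            def run_with_except_hook(*args2, **kwargs2):
                try:
                    run_original(*args2, **kwargs2)
                except Exception:
                    # report the error for this thread instead of dying silently
                    sys.excepthook(*sys.exc_info())

            self.run = run_with_except_hook

        threading.Thread.__init__ = init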

The poller object is set in Connection.__init__ and is never assigned to again (https://github.com/openstack/ovsdbapp/blob/master/ovsdbapp/backend/ovs_idl/connection.py#L85), so I don't really see how self.poller can ever actually be None; calling self.poller = poller.Poller() can never return None. I definitely don't like ovirt replacing Thread.__init__, which ovsdbapp uses.

We could probably work around whatever you are doing by not saving the Poller object as self.poller and just creating a new one each loop, as in the attached patch (sketched below). Could you try this out and see if it fixes things for you?
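
Roughly, the idea is the difference between reusing one Poller stored on self and building a fresh one per iteration. A minimal sketch of the per-iteration approach, with the loop reduced to its essentials (the function name, loop condition and timeout value are illustrative, not ovsdbapp's actual code):

    from ovs import poller

    def run_loop(idl, timeout_sec=5):
        """Drive an ovs.db.idl.Idl with a fresh Poller on every pass."""
        while True:
            p = poller.Poller()               # new Poller each iteration, nothing reused
            idl.wait(p)                       # the IDL registers the fds/timers it needs
            p.timer_wait(timeout_sec * 1000)  # wake up periodically even if idle (ms)
            p.block()                         # sleep until something is ready
            idl.run()                         # process any updates from ovsdb-server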

Comment 5 Terry Wilson 2018-10-31 20:31:43 UTC
Created attachment 1499614 [details]
Don't reuse poller object

Comment 6 Dominik Holler 2018-11-02 11:08:17 UTC
Terry, thank you very much for picking this issue!

(In reply to Terry Wilson from comment #4)
> ovirt is doing some really weird stuff with monkey-patching the threading
> module here:
> https://github.com/oVirt/ovirt-provider-ovn/blob/master/provider/
> ovirt_provider_ovn.py#L71
> 
> The poller object is set at Connection.__init__ and is never assigned to
> again,
> https://github.com/openstack/ovsdbapp/blob/master/ovsdbapp/backend/ovs_idl/
> connection.py#L85 so I don't really see how self.poller can ever actually
> equal None, since calling self.poller = poller.Poller() can never return
> None. 

I understood the error:
      File "/usr/lib/python2.7/site-packages/ovs/jsonrpc.py", line 215, in wait
        self.stream.run_wait(poller)
    AttributeError: 'NoneType' object has no attribute 'run_wait'
    2018-04-02 18:15:46,446 root Irrecoverable error. Exiting!

to mean that self.stream is None. This might happen because Connection.wait() is called after Connection.close().


> I definitely don't like ovirt replacing Thread.__init__ which ovsdbapp
> uses.
> 

Which implementation do you suggest to catch the exceptions of ovsdbapp's threads?

> We could probably work around whatever you are doing by not saving the
> Poller object as self.poller and just creating a new one each loop, ala the
> attached patch. Could you try this out and see if it fixes things for you?

Comment 7 Terry Wilson 2018-11-06 15:41:19 UTC
Yeah, I didn't read the Exception carefully enough. :(

So, in 0.13.0, we added a try/except around the idl.run() call in Connection.run(). I think just moving the try to the top of the loop would allow it to at least retry when it hits the error. On an initial look, I didn't see where Connection.wait() would be called after Connection.close(), but Connection.__disconnect can also set stream to None. So if there is any kind of connection error/reconnection, there could be a case where the Connection.run() thread calls wait() before the reconnect happens. If that is what is happening, the retry logic should work around the issue.
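
A rough sketch of what moving the try would look like, with the loop boiled down to its essentials; this is not the actual ovsdbapp diff, and the logger, timeout and force_reconnect() recovery step are illustrative assumptions:

    import logging

    from ovs import poller

    LOG = logging.getLogger(__name__)

    def run_loop_with_retry(idl, timeout_sec=5):
        """Keep driving the IDL; any python-ovs error is logged and retried
        instead of killing the thread (0.13.0 only wrapped the idl.run() call)."""
        while True:
            try:
                p = poller.Poller()
                idl.wait(p)                       # where the AttributeError surfaced
                p.timer_wait(timeout_sec * 1000)
                p.block()
                idl.run()
            except Exception:
                LOG.exception('Error in OVSDB session; retrying')
                idl.force_reconnect()             # assumed recovery step; real code may differ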

I've also considered writing a threadless version that is just run in the application's main loop. For tcp/ssl connections where the ovsdb Connection.inactivity_probe timeout is set, the run loop would need to guarantee that it completes in less time than the inactivity_probe value, or it would cause a reconnect. Does this sound like something that would be useful to ovirt?

Comment 8 Dan Kenigsberg 2018-11-07 10:55:24 UTC
Mor has left the company.

Terry, are you saying that retry logic is already in? Or does ovsdbapp panic on error? (It is a library; it should not.)

I would not recommend changing the API or adding a new function before I understand the current state. There's nothing ovirt-ish about this bug; it should interest OSP, too, I believe.

Comment 9 Terry Wilson 2018-11-07 16:34:56 UTC
(In reply to Dan Kenigsberg from comment #8)
> Terry, do you say that a retry logic is already in? Or does ovsdbapp panic
> on error? (it is a library, it should not)

The retry logic currently happens if there is an exception in idl.run(). I can move the try up a little bit to make the retry happen if there is an exception in wait()/other python-ovs code as well.

If the retry failed, it might still raise an exception that would need to be caught, but as your code has already figured out, that gets a little tricky with the python threading module.

> I would not recommend changing the API or adding new function before I
> understand the current state. There's noting ovirt-ish about this bug, it
> should interest OSP, too, I believe.

If I did it, it would just be a separate Connection class or something that you would just call Connection.run() once per loop in your main thread. Just something I've been thinking about for a while. python-ovs is made to be very non-blocking, event-loop style. ovsdbapp kind of wrapped it to make it look more synchronous due to historical reasons with neutron usage, hence the kinda odd Connection thread and queue of length one message passing system.
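
As an illustration of that pattern (not ovsdbapp's actual Connection class), the synchronous-looking wrapper amounts to a background thread driving the event loop plus a queue of length one for handing over work. The class and names below are invented for the sketch, and the real code also needs a way to wake a blocked poller when work is queued (ovsdbapp uses an alert file descriptor for that):

    import threading
    try:
        import queue              # Python 3
    except ImportError:
        import Queue as queue     # Python 2

    class SyncWrapper(object):
        """Hide an event-loop style API behind blocking calls."""

        def __init__(self, run_once):
            self._run_once = run_once               # one event-loop pass, e.g. run/wait/block
            self._txns = queue.Queue(maxsize=1)     # at most one pending transaction
            t = threading.Thread(target=self._loop)
            t.daemon = True
            t.start()

        def _loop(self):
            while True:
                self._run_once()                    # assumed to block until something happens
                try:
                    txn, result_q = self._txns.get_nowait()
                except queue.Empty:
                    continue
                result_q.put(txn())                 # run the work, hand back the result

        def execute(self, txn):
            """Block the caller until the background thread has run txn."""
            result_q = queue.Queue(maxsize=1)
            self._txns.put((txn, result_q))         # blocks if another txn is already queued
            return result_q.get()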

Anyway, I'll write a patch moving the try. It would be based off of master, and should be backportable. What would you all need to be able to test it?

Comment 10 Dan Kenigsberg 2018-11-19 13:15:28 UTC
> What would you all need to be able to test it?

Our CI system is always happy to consume an RPM from copr.

Comment 11 Terry Wilson 2019-03-19 12:05:17 UTC
Dan, this is fixed in version python-ovsdbapp-0.14.0-0.20190121205004.0d44123.el7ost, but apparently didn't make it into queens. Where do you all normally pull packages from?

Comment 13 Dan Kenigsberg 2019-03-19 12:40:13 UTC
(In reply to Terry Wilson from comment #11)
> Dan, this is fixed in version
> python-ovsdbapp-0.14.0-0.20190121205004.0d44123.el7ost, but apparently
> didn't make it into queens. Where do you all normally pull packages from?

Miguel can answer this more authoritatively.

Comment 14 Miguel Duarte Barroso 2019-03-19 13:42:42 UTC
> > Dan, this is fixed in version
> > python-ovsdbapp-0.14.0-0.20190121205004.0d44123.el7ost, but apparently
> > didn't make it into queens. Where do you all normally pull packages from?

> Miguel can answer this more authoritatively.

We cross-tag against selected versions.

I'll test this against build [0], which includes the fix. If all goes well, we'll update our tag.

[0] - https://cbs.centos.org/koji/buildinfo?buildID=25350

Comment 15 Miguel Duarte Barroso 2019-03-19 13:46:46 UTC
Downstream, I can't find a build of 0.15.X for el7, and I understand that brew build [0] features the fix, making it the ideal candidate for the tag.

That would make us ship 0.15.0 on ovirt (upstream) and 0.14.0 on RHV (downstream).

Any objections?

[0] - https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=832837

Comment 16 Dan Kenigsberg 2019-03-20 11:00:42 UTC
(In reply to Miguel Duarte Barroso from comment #15)
> downstream I can't find a build of 0.15.X for el7, and I understand that
> brew build [0] features the fix,
> making it the ideal candidate for the tag.
> 
> That would make us ship 0.15.0 on ovirt (upstream) and 0.14.0 on RHV
> (downstream).
> 
> Any objections ?


I can live with this, but QE must concur. Note that we probably need an RHV BZ for this change, and IMHO it could and should wait until after the GA of RHV-4.3.

Comment 17 Jon Schlueter 2019-04-08 16:00:25 UTC
For OSP 13, we end up pulling from the upstream stable/queens branch (but we work from what RDO Queens Trunk is testing).

Comment 23 errata-xmlrpc 2020-03-10 11:52:51 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0769