Created attachment 1417147 [details]
logs

Description of problem:

During automation failure analysis I noticed this error in ovirt-provider-ovn.log:

2018-04-02 18:15:46,262 root From: 10.35.161.249:53708 Request: GET /v2.0/
2018-04-02 18:15:46,262 auth.plugins.ovirt.sso Connecting to oVirt engine's SSO module: https://network-ge-2.scl.lab.tlv.redhat.com:443/ovirt-engine/sso/oauth/token-info
2018-04-02 18:15:46,262 auth.plugins.ovirt.sso Connecting to oVirt engine's SSO module: https://network-ge-2.scl.lab.tlv.redhat.com:443/ovirt-engine/sso/oauth/token-info
2018-04-02 18:15:46,264 requests.packages.urllib3.connectionpool Starting new HTTPS connection (1): network-ge-2.scl.lab.tlv.redhat.com
2018-04-02 18:15:46,264 requests.packages.urllib3.connectionpool Starting new HTTPS connection (1): network-ge-2.scl.lab.tlv.redhat.com
2018-04-02 18:15:46,311 requests.packages.urllib3.connectionpool "POST /ovirt-engine/sso/oauth/token-info HTTP/1.1" 200 735
2018-04-02 18:15:46,311 requests.packages.urllib3.connectionpool "POST /ovirt-engine/sso/oauth/token-info HTTP/1.1" 200 735
2018-04-02 18:15:46,443 root Uncaught exception
Traceback (most recent call last):
  File "/usr/share/ovirt-provider-ovn/ovirt_provider_ovn.py", line 62, in run_with_except_hook
    run_original(*args2, **kwargs2)
  File "/usr/lib64/python2.7/threading.py", line 765, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib/python2.7/site-packages/ovsdbapp/backend/ovs_idl/connection.py", line 90, in run
    self.idl.wait(self.poller)
  File "/usr/lib/python2.7/site-packages/ovs/db/idl.py", line 295, in wait
    self._session.wait(poller)
  File "/usr/lib/python2.7/site-packages/ovs/jsonrpc.py", line 520, in wait
    self.rpc.wait(poller)
  File "/usr/lib/python2.7/site-packages/ovs/jsonrpc.py", line 215, in wait
    self.stream.run_wait(poller)
AttributeError: 'NoneType' object has no attribute 'run_wait'
2018-04-02 18:15:46,446 root Irrecoverable error. Exiting!
Unhandled exception in the ovsdbapp or python-openvswitch package while trying to connect to the north DB.

Version-Release number of selected component (if applicable):
RHV 4.2.2.6-0.1.el7
openvswitch-2.9.0-1.el7fdp.x86_64
python-openvswitch-2.9.0-1.el7fdp.noarch
openvswitch-ovn-central-2.9.0-1.el7fdp.x86_64
openvswitch-ovn-common-2.9.0-1.el7fdp.x86_64
python-ovsdbapp-0.6.0-1.el7ev.noarch

How reproducible:
Reproduced only once - might be related to timing.

Steps to Reproduce:
The test failed while doing the following steps:
1. Configure ovirt-provider-ovn to handle SSO authentication.
2. Restart ovirt-provider-ovn.
3. Restart ovirt-engine.

Actual results:
Error in description.

Expected results:
No error should appear.

Additional info:
The situation in which the issue happens might be related to bug 1563574.
oVirt is doing some really weird stuff with monkey-patching the threading module here:
https://github.com/oVirt/ovirt-provider-ovn/blob/master/provider/ovirt_provider_ovn.py#L71

The poller object is set in Connection.__init__ and is never assigned to again (https://github.com/openstack/ovsdbapp/blob/master/ovsdbapp/backend/ovs_idl/connection.py#L85), so I don't really see how self.poller can ever actually equal None, since calling self.poller = poller.Poller() can never return None.

I definitely don't like ovirt replacing Thread.__init__, which ovsdbapp uses. We could probably work around whatever you are doing by not saving the Poller object as self.poller and just creating a new one each loop, à la the attached patch. Could you try this out and see if it fixes things for you?
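For context, the Thread.__init__ replacement being discussed follows a well-known pattern for catching exceptions in worker threads on Python 2 (which had no threading.excepthook). This is a minimal, hypothetical reconstruction of that pattern — install_thread_excepthook and handler are illustrative names, not ovirt-provider-ovn's actual code:

```python
import threading


def install_thread_excepthook(handler):
    """Monkey-patch Thread.__init__ so every thread's run() is wrapped,
    routing uncaught exceptions to `handler` instead of losing them."""
    init_original = threading.Thread.__init__

    def init(self, *args, **kwargs):
        init_original(self, *args, **kwargs)
        run_original = self.run  # bound method captured before shadowing

        def run_with_except_hook(*args2, **kwargs2):
            try:
                run_original(*args2, **kwargs2)
            except Exception as exc:
                handler(exc)

        # Instance attribute shadows the class's run(); the thread
        # bootstrap calls self.run(), so the wrapper is what executes.
        self.run = run_with_except_hook

    threading.Thread.__init__ = init
```

The side effect Terry objects to: the patch applies to every Thread in the process, including the ones ovsdbapp creates internally, which is how the `run_with_except_hook` frame ends up in ovsdbapp's traceback.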
Created attachment 1499614 [details] Don't reuse poller object
Terry, thank you very much for picking up this issue!

(In reply to Terry Wilson from comment #4)
> ovirt is doing some really weird stuff with monkey-patching the threading
> module here:
> https://github.com/oVirt/ovirt-provider-ovn/blob/master/provider/
> ovirt_provider_ovn.py#L71
>
> The poller object is set at Connection.__init__ and is never assigned to
> again,
> https://github.com/openstack/ovsdbapp/blob/master/ovsdbapp/backend/ovs_idl/
> connection.py#L85 so I don't really see how self.poller can ever actually
> equal None, since calling self.poller = poller.Poller() can never return
> None.

I understood the error:

  File "/usr/lib/python2.7/site-packages/ovs/jsonrpc.py", line 215, in wait
    self.stream.run_wait(poller)
AttributeError: 'NoneType' object has no attribute 'run_wait'
2018-04-02 18:15:46,446 root Irrecoverable error. Exiting!

in the way that self.stream is None. This might happen because Connection.wait() is called after Connection.close().

> I definitely don't like ovirt replacing Thread.__init__ which ovsdbapp
> uses.

Which implementation do you suggest for catching the exceptions of ovsdbapp's threads?

> We could probably work around whatever you are doing by not saving the
> Poller object as self.poller and just creating a new one each loop, ala the
> attached patch. Could you try this out and see if it fixes things for you?
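The failure mode described here can be modeled with a couple of stub classes (a hypothetical sketch — FakeSession/FakeStream only stand in for ovs.jsonrpc's session and stream objects, they are not the real API): once close() nulls out the stream, any later wait() raises exactly the AttributeError from the log.

```python
class FakeStream:
    def run_wait(self, poller):
        pass  # the real stream registers its fd with the poller


class FakeSession:
    def __init__(self):
        self.stream = FakeStream()

    def wait(self, poller):
        # No None check, mirroring the code path in the traceback:
        # jsonrpc.py line 215, self.stream.run_wait(poller).
        self.stream.run_wait(poller)

    def close(self):
        self.stream = None  # a wait() racing past this point blows up
```

In the real code the wait() call comes from ovsdbapp's Connection thread, so the window between close()/disconnect and the next loop iteration is where the race lives.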
Yeah, I didn't read the exception carefully enough. :(

So, in 0.13.0, we added a try/except around the idl.run() call in Connection.run(). I think just moving the try to the top of the loop would allow this to at least retry when it gets the error. On initial look, I didn't see where Connection.wait() would be called after Connection.close(), but Connection.__disconnect can also set stream to None. So if there is any kind of connection error/reconnection, it looks like there could be a case where the Connection.run() thread could call wait() before the reconnect happens. If that is happening, the retry logic should work around the issue.

I've also considered writing a threadless version that is just run in the application's main loop. For tcp/ssl connections where the ovsdb Connection.inactivity_probe timeout is set, the run loop would need to guarantee that it completes in less time than the inactivity_probe value, or it would cause a reconnect. Does this sound like something that would be useful to oVirt?
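The "move the try to the top of the loop" idea can be sketched like this (a hedged sketch, not ovsdbapp's actual Connection.run() — run_loop, reconnect, and max_retries are illustrative names): with the try covering both idl.run() and idl.wait(), an AttributeError from a torn-down stream triggers a retry instead of killing the thread.

```python
def run_loop(idl, poller, reconnect, max_retries=3):
    """Retry-at-top-of-loop variant: exceptions from either run() or
    wait() fall through to the reconnect path."""
    retries = 0
    while retries < max_retries:
        try:
            idl.run()         # process pending ovsdb updates
            idl.wait(poller)  # may raise if the stream was nulled out
            return True
        except AttributeError:
            retries += 1
            reconnect()  # hypothetical hook re-establishing the session
    return False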
Mor has left the company.

Terry, are you saying that retry logic is already in? Or does ovsdbapp panic on error? (It is a library; it should not.) I would not recommend changing the API or adding a new function before I understand the current state.

There's nothing oVirt-ish about this bug; it should interest OSP, too, I believe.
(In reply to Dan Kenigsberg from comment #8)
> Terry, do you say that a retry logic is already in? Or does ovsdbapp panic
> on error? (it is a library, it should not)

The retry logic currently happens if there is an exception in idl.run(). I can move the try up a little bit to make the retry happen if there is an exception in wait()/other python-ovs code as well. If the retry failed, it might still raise an exception that would need to be caught, but as your code has already figured out, that gets a little tricky with the Python threading module.

> I would not recommend changing the API or adding new function before I
> understand the current state. There's noting ovirt-ish about this bug, it
> should interest OSP, too, I believe.

If I did it, it would just be a separate Connection class or something; you would call Connection.run() once per loop in your main thread. Just something I've been thinking about for a while. python-ovs is made to be very non-blocking, event-loop style. ovsdbapp kind of wrapped it to make it look more synchronous, due to historical reasons with neutron usage, hence the kinda odd Connection thread and queue-of-length-one message passing system.

Anyway, I'll write a patch moving the try. It would be based off of master, and should be backportable. What would you all need to be able to test it?
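The threadless idea being floated could look something like this (purely a hypothetical sketch — ThreadlessConnection and step() are illustrative names, not a proposed or existing ovsdbapp API): the application drives the IDL from its own main loop, and the only hard requirement is that each loop iteration finishes faster than the ovsdb inactivity_probe interval, or the server drops the connection.

```python
import time


class ThreadlessConnection:
    """Drive the IDL from the caller's main loop instead of a
    background thread. step() reports whether the caller is keeping
    up with the inactivity-probe deadline."""

    def __init__(self, idl, inactivity_probe=5.0):
        self.idl = idl
        self.inactivity_probe = inactivity_probe
        self._last = time.monotonic()

    def step(self):
        now = time.monotonic()
        elapsed = now - self._last
        self._last = now
        self.idl.run()  # let the IDL process pending messages
        # False means the loop stalled long enough to risk a reconnect.
        return elapsed < self.inactivity_probe
```

This keeps all IDL activity on one thread, which sidesteps the wait()-after-close race entirely, at the cost of making the application responsible for loop latency.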
> What would you all need to be able to test it?

Our CI system is always happy to consume an RPM from Copr.
Dan, this is fixed in version python-ovsdbapp-0.14.0-0.20190121205004.0d44123.el7ost, but apparently didn't make it into queens. Where do you all normally pull packages from?
(In reply to Terry Wilson from comment #11)
> Dan, this is fixed in version
> python-ovsdbapp-0.14.0-0.20190121205004.0d44123.el7ost, but apparently
> didn't make it into queens. Where do you all normally pull packages from?

Miguel can answer this more authoritatively.
> > Dan, this is fixed in version
> > python-ovsdbapp-0.14.0-0.20190121205004.0d44123.el7ost, but apparently
> > didn't make it into queens. Where do you all normally pull packages from?
> Miguel can answer this more authoritatively.

We cross-tag against selected versions. I'll test this against build [0], which includes the fix. If all goes well, we'll update our tag.

[0] - https://cbs.centos.org/koji/buildinfo?buildID=25350
Downstream, I can't find a build of 0.15.X for el7, and I understand that brew build [0] features the fix, making it the ideal candidate for the tag.

That would make us ship 0.15.0 on oVirt (upstream) and 0.14.0 on RHV (downstream).

Any objections?

[0] - https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=832837
(In reply to Miguel Duarte Barroso from comment #15)
> downstream I can't find a build of 0.15.X for el7, and I understand that
> brew build [0] features the fix,
> making it the ideal candidate for the tag.
>
> That would make us ship 0.15.0 on ovirt (upstream) and 0.14.0 on RHV
> (downstream).
>
> Any objections ?

I can live with this, but QE must concur. Note that we probably need an RHV BZ for this change, and IMHO it could and should wait until after the GA of RHV-4.3.
For OSP 13 we end up pulling from the upstream stable/queens branch (but we work from what RDO Queens Trunk is testing).
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:0769