Bug 1563581
Summary: Unhandled connection exception in python-openvswitch/ovsdbapp code

Product: Red Hat OpenStack
Component: python-ovsdbapp
Version: 13.0 (Queens)
Reporter: Mor <mkalfon>
Assignee: Terry Wilson <twilson>
QA Contact: Toni Freger <tfreger>
Status: CLOSED ERRATA
Severity: medium
Priority: medium
Keywords: Triaged, ZStream
CC: atragler, danken, dholler, jschluet, lmartins, mduarted, mkalfon, mmirecki, twilson
Target Milestone: ---
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Fixed In Version: python-ovsdbapp-0.10.4-2.el7ost
Last Closed: 2020-03-10 11:52:51 UTC
Type: Bug
Attachments: 1499614 ("Don't reuse poller object")
Description (Mor, 2018-04-04 08:41:44 UTC):
The situation during which the issue happens might be related to bug 1563574.

Terry Wilson (comment #4):

ovirt is doing some really weird stuff with monkey-patching the threading module here:
https://github.com/oVirt/ovirt-provider-ovn/blob/master/provider/ovirt_provider_ovn.py#L71

The poller object is set in Connection.__init__ and is never assigned to again (https://github.com/openstack/ovsdbapp/blob/master/ovsdbapp/backend/ovs_idl/connection.py#L85), so I don't really see how self.poller can ever actually equal None, since calling self.poller = poller.Poller() can never return None.

I definitely don't like ovirt replacing Thread.__init__, which ovsdbapp uses. We could probably work around whatever you are doing by not saving the Poller object as self.poller and just creating a new one each loop, a la the attached patch. Could you try this out and see if it fixes things for you?

Created attachment 1499614 [details]
Don't reuse poller object
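The attached patch itself isn't shown in this report, but the idea Terry describes can be sketched roughly like this. FakePoller and the loop body are illustrative stand-ins, not ovsdbapp's real code; the real ovs.poller.Poller blocks on file descriptors.

```python
class FakePoller:
    """Stand-in for ovs.poller.Poller; the real poller blocks on fds."""
    def block(self):
        pass  # the real implementation waits here until an fd is ready

class Connection:
    """Sketch of the workaround: instead of storing one Poller on self in
    __init__ and reusing it forever, build a fresh Poller at the top of
    every loop iteration, so nothing external can leave a stale one behind."""

    def __init__(self):
        self.pollers = []  # kept only so the sketch can show distinct objects

    def run(self, iterations=3):
        for _ in range(iterations):
            poller = FakePoller()  # fresh object each time around the loop
            self.pollers.append(poller)
            # idl.run(), idl.wait(poller), transaction handling would go here
            poller.block()

conn = Connection()
conn.run()
print(len(conn.pollers), len(set(map(id, conn.pollers))))  # 3 3
```

Each iteration gets its own Poller object, so even if some other code clears or replaces a stored attribute, the loop never reuses it.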
Terry, thank you very much for picking up this issue!

(In reply to Terry Wilson from comment #4)
> ovirt is doing some really weird stuff with monkey-patching the threading
> module here:
> https://github.com/oVirt/ovirt-provider-ovn/blob/master/provider/ovirt_provider_ovn.py#L71
>
> The poller object is set at Connection.__init__ and is never assigned to
> again, so I don't really see how self.poller can ever actually equal None,
> since calling self.poller = poller.Poller() can never return None.

I read the error:

    File "/usr/lib/python2.7/site-packages/ovs/jsonrpc.py", line 215, in wait
      self.stream.run_wait(poller)
    AttributeError: 'NoneType' object has no attribute 'run_wait'
    2018-04-02 18:15:46,446 root Irrecoverable error. Exiting!

as meaning that self.stream is None. This might happen because Connection.wait() is called after Connection.close().

> I definitely don't like ovirt replacing Thread.__init__ which ovsdbapp
> uses.

Which implementation do you suggest for catching exceptions from ovsdbapp's threads?

> We could probably work around whatever you are doing by not saving the
> Poller object as self.poller and just creating a new one each loop, ala the
> attached patch. Could you try this out and see if it fixes things for you?

Terry Wilson:

Yeah, I didn't read the exception carefully enough. :( So, in 0.13.0 we add a try/except around the idl.run() call in Connection.run(). I think just moving the try to the top of the loop would allow this to at least retry when it gets the error. On an initial look, I didn't see where Connection.wait() would be called after Connection.close(), but Connection.__disconnect can also set stream to None. So if there is any kind of connection error/reconnection, there could be a case where the Connection.run() thread calls wait() before the reconnect happens.
If that is happening, the retry logic should work around the issue. I've also considered writing a threadless version that just runs in the application's main loop. For tcp/ssl connections where the ovsdb Connection.inactivity_probe timeout is set, the run loop would need to guarantee that it completes in less time than the inactivity_probe value, or it would cause a reconnect. Does this sound like something that would be useful to ovirt?

Dan Kenigsberg (comment #8):

Mor has left the company. Terry, do you say that retry logic is already in? Or does ovsdbapp panic on error? (It is a library; it should not.) I would not recommend changing the API or adding new functions before I understand the current state. There's nothing ovirt-ish about this bug; it should interest OSP, too, I believe.

Terry Wilson:

(In reply to Dan Kenigsberg from comment #8)
> Terry, do you say that a retry logic is already in? Or does ovsdbapp panic
> on error? (it is a library, it should not)

The retry logic currently happens if there is an exception in idl.run(). I can move the try up a little bit to make the retry happen if there is an exception in wait() or other python-ovs code as well. If the retry failed, it might still raise an exception that would need to be caught, but as your code has already figured out, that gets a little tricky with the python threading module.

> I would not recommend changing the API or adding new function before I
> understand the current state.

If I did it, it would just be a separate Connection class or something, where you would call Connection.run() once per loop in your main thread. Just something I've been thinking about for a while. python-ovs is made to be very non-blocking, event-loop style. ovsdbapp wrapped it to make it look more synchronous, due to historical reasons with neutron usage, hence the somewhat odd Connection thread and queue-of-length-one message-passing system.
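A minimal sketch of what "moving the try to the top of the loop" buys. FlakyIdl and run_loop are hypothetical stand-ins for ovs.db.idl.Idl and Connection.run, not the real code; FlakyIdl.wait() fails once, mimicking stream being None mid-reconnect.

```python
class FlakyIdl:
    """Hypothetical stand-in for ovs.db.idl.Idl whose wait() fails once,
    mimicking self.stream being None in the middle of a reconnect."""
    def __init__(self):
        self.calls = 0

    def run(self):
        pass  # the real idl.run() processes a batch of ovsdb messages

    def wait(self, poller):
        self.calls += 1
        if self.calls == 1:
            # mimics: AttributeError: 'NoneType' object has no attribute 'run_wait'
            raise AttributeError("'NoneType' object has no attribute 'run_wait'")

def run_loop(idl, iterations=3):
    """With the try at the TOP of the loop, an exception anywhere in the
    body (run(), wait(), transaction handling) counts as one retry instead
    of killing the connection thread."""
    retries = 0
    for _ in range(iterations):
        try:
            idl.run()
            idl.wait(poller=None)  # the real code passes an ovs Poller here
        except Exception:
            retries += 1  # the real code would log and force a reconnect
    return retries

idl = FlakyIdl()
print(run_loop(idl))  # 1: the single failed wait() survived as a retry
```

With the try wrapping only idl.run(), the same AttributeError in wait() would propagate out of the thread, which is exactly the "Irrecoverable error. Exiting!" the reporter saw.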
Anyway, I'll write a patch moving the try. It would be based off of master, and should be backportable. What would you all need to be able to test it?

> What would you all need to be able to test it?

Our CI system is always happy to consume an RPM from copr.
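The threadless variant Terry floats earlier could look roughly like this sketch. ThreadlessConnection and run_once are hypothetical names, not anything that shipped; the point is the constraint that one iteration must finish inside the inactivity_probe window, or the tcp/ssl connection would be probed and reconnected.

```python
import time

class ThreadlessConnection:
    """Hypothetical sketch of a threadless Connection: the application
    calls run_once() from its own main loop instead of ovsdbapp owning a
    thread. The whole body must finish faster than inactivity_probe, or
    the ovsdb server would consider the client dead and reconnect."""

    def __init__(self, inactivity_probe=5.0):
        self.inactivity_probe = inactivity_probe
        self.slow_iterations = 0

    def run_once(self, work):
        start = time.monotonic()
        work()  # idl.run()/wait() and queued transactions would go here
        elapsed = time.monotonic() - start
        if elapsed >= self.inactivity_probe:
            self.slow_iterations += 1  # would trigger a reconnect for tcp/ssl
        return elapsed

conn = ThreadlessConnection(inactivity_probe=0.05)
for _ in range(3):
    conn.run_once(lambda: None)              # fast iterations: fine
conn.run_once(lambda: time.sleep(0.06))      # too slow: would cause a reconnect
print(conn.slow_iterations)  # 1
```

This is the trade-off Terry describes: the API gets simpler (no thread, no queue), but the caller's main loop inherits the timing obligation.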
Terry Wilson (comment #11):

Dan, this is fixed in version python-ovsdbapp-0.14.0-0.20190121205004.0d44123.el7ost, but apparently it didn't make it into queens. Where do you all normally pull packages from?

(In reply to Terry Wilson from comment #11)
> Dan, this is fixed in version
> python-ovsdbapp-0.14.0-0.20190121205004.0d44123.el7ost, but apparently
> didn't make it into queens. Where do you all normally pull packages from?

Miguel can answer this more authoritatively.

> Miguel can answer this more authoritatively.

We cross-tag against selected versions. I'll test this against build [0], which includes the fix. If all goes well, we'll update our tag.

[0] - https://cbs.centos.org/koji/buildinfo?buildID=25350

Miguel Duarte Barroso (comment #15):

Downstream, I can't find a build of 0.15.X for el7, and I understand that brew build [0] features the fix, making it the ideal candidate for the tag. That would make us ship 0.15.0 on ovirt (upstream) and 0.14.0 on RHV (downstream). Any objections?

[0] - https://brewweb.engineering.redhat.com/brew/buildinfo?buildID=832837

(In reply to Miguel Duarte Barroso from comment #15)
> That would make us ship 0.15.0 on ovirt (upstream) and 0.14.0 on RHV
> (downstream).
>
> Any objections?

I can live with this, but QE must concur. Note that we probably need a RHV BZ for this change, and IMHO it could and should wait until after GA of RHV-4.3.

For OSP 13, we end up pulling from the upstream stable/queens branch (but we work from what RDO Queens Trunk is testing).

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.
For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:0769