Created attachment 1426608 [details]
engine.log

Description of problem:
All the open requests to the host are terminated once one of the requests times out, even if connectivity to the host has already been restored.

The scenario I debugged:
1. Stop the vdsm.
2. While the vdsm is down, GetAllVmStatsVDSCommand is sent (several times) and gets a ConnectException. The request is registered to the JsonRpcClient.tracker.
3. Start the vdsm.
4. Run several SetupNetworks commands.

3 minutes after GetAllVmStatsVDSCommand failed, ResponseTracker.loop starts the timeout treatment. As part of the treatment, all the open requests to the host are terminated. If a SetupNetworks command is running at that moment, it is terminated.

BTW, the issue was partially solved for async vds commands in patch https://gerrit.ovirt.org/#/c/90189
In case of an immediate ConnectException the request is not registered to the JsonRpcClient.tracker.

Version-Release number of selected component (if applicable):

How reproducible:
20%

Steps to Reproduce:
1. Stop the vdsm.
2. Wait for the host to become non-responsive.
3. Start the vdsm.
4. Wait for the host to become up.
5. Run several setup networks (or other vds commands) one after the other for 3 minutes.

Actual results:
~3 minutes after stopping the vdsm, all the vds commands are terminated although the host is up.

Expected results:
The vds commands should finish successfully.

Additional info:
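To make the reported behaviour easier to discuss, here is a minimal sketch of the difference between terminating every open request to a host and terminating only the request that timed out. It is a simplified model, not the actual vdsm-jsonrpc-java code: the class TrackerSketch, its methods, and the request ids are all hypothetical, and the 3-minute timeout is taken from the description above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of a response tracker; NOT the real vdsm-jsonrpc-java code.
class TrackerSketch {
    // An open request: its target host and the time it was registered (ms).
    static final class Pending {
        final String host;
        final long registeredAtMs;

        Pending(String host, long registeredAtMs) {
            this.host = host;
            this.registeredAtMs = registeredAtMs;
        }
    }

    private final Map<String, Pending> open = new HashMap<>();
    private final long timeoutMs;

    TrackerSketch(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    void register(String requestId, String host, long nowMs) {
        open.put(requestId, new Pending(host, nowMs));
    }

    private boolean timedOut(Pending p, long nowMs) {
        return nowMs - p.registeredAtMs >= timeoutMs;
    }

    // Reported behaviour: a single timed-out request takes down every
    // open request to the same host, including ones that just started.
    List<String> expireWholeHost(long nowMs) {
        Set<String> hosts = new HashSet<>();
        for (Pending p : open.values()) {
            if (timedOut(p, nowMs)) {
                hosts.add(p.host);
            }
        }
        List<String> terminated = new ArrayList<>();
        for (Map.Entry<String, Pending> e : open.entrySet()) {
            if (hosts.contains(e.getValue().host)) {
                terminated.add(e.getKey());
            }
        }
        return removeAndSort(terminated);
    }

    // Expected behaviour: only requests that actually timed out are terminated.
    List<String> expireOnlyTimedOut(long nowMs) {
        List<String> terminated = new ArrayList<>();
        for (Map.Entry<String, Pending> e : open.entrySet()) {
            if (timedOut(e.getValue(), nowMs)) {
                terminated.add(e.getKey());
            }
        }
        return removeAndSort(terminated);
    }

    // Drops the given ids from the open set and returns them in sorted order.
    private List<String> removeAndSort(List<String> ids) {
        open.keySet().removeAll(ids);
        Collections.sort(ids);
        return ids;
    }

    int openCount() {
        return open.size();
    }
}
```

With a 180000 ms timeout, a GetAllVmStats request registered at t=0 and a SetupNetworks request registered at t=170000 on the same host, calling expireWholeHost(180000) terminates both, while expireOnlyTimedOut(180000) leaves SetupNetworks open. The second variant is only one possible way to get the expected results; it is not necessarily what the eventual fix does.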
Created attachment 1426609 [details] vdsm.log
Created attachment 1426610 [details] engine2.log
Created attachment 1426611 [details] vdsm2.log
This is nothing new: the way connections are closed within vdsm-jsonrpc-java has existed from the beginning, and the issue it causes is not related to the nonblocking thread changes in oVirt 4.2. Also, this change is quite dangerous and we really need to verify all possible regressions, so moving to 4.2.4.
Could you please suggest verification steps?
I don't think we have anything other than what's mentioned in Description, right Ravi?
There are no specific steps except for the ones in the Description. You should not see any errors in the logs regarding SetupNetworks.
3 of 3 - no termination caught
This bugzilla is included in oVirt 4.2.4 release, published on June 26th 2018. Since the problem described in this bug report should be resolved in oVirt 4.2.4 release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.