Description of problem:
SetupNetworks fails on the connectivity check.

2016-12-22 11:55:30,509 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 36960, 0, 0) at 0x268c5f0>: unexpected eof (betterAsyncore:113)

MainProcess|jsonrpc/5::ERROR::2016-12-22 11:24:35,347::supervdsmServer::97::SuperVdsm.ServerCallback::(wrapper) Error in setupNetworks
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsmServer", line 95, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 261, in setupNetworks
    _setup_networks(networks, bondings, options)
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 282, in _setup_networks
    netswitch.setup(networks, bondings, options, in_rollback)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch.py", line 132, in setup
    _setup_legacy(legacy_nets, legacy_bonds, options, in_rollback)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch.py", line 155, in _setup_legacy
    connectivity.check(options)
  File "/usr/lib/python2.7/site-packages/vdsm/network/connectivity.py", line 46, in check
    'connectivity check failed')
ConfigNetworkError: (10, 'connectivity check failed')

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20161219154406.gitf36001c.el7.centos.noarch
vdsm-4.18.999-1219.gite8c2926.el7.centos.x86_64
NetworkManager-1.4.0-13.el7_3.x86_64

Steps to Reproduce:
1. host_network_api/sync_test case 02

Actual results:
SetupNetworks fails

Expected results:
SetupNetworks succeeds

Additional info:
2016-12-22 11:22:28,427+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (default task-13) [hosts_syncAction_1845d67a-7ab3-4aff] Exception:
org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Vds timeout occured
    at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.FutureVDSCommand.get(FutureVDSCommand.java:70) [vdsbroker.jar:]
    at org.ovirt.engine.core.bll.network.host.HostSetupNetworkPoller.getValue(HostSetupNetworkPoller.java:56) [bll.jar:]
    at org.ovirt.engine.core.bll.network.host.HostSetupNetworkPoller.poll(HostSetupNetworkPoller.java:41) [bll.jar:]
Created attachment 1234712 [details]
vdsm, supervdsm, engine and tcpdump logs

tcpdump was captured on the engine while setupNetworks was sent.
The difference is that we get an RST frame when NetworkManager is running, and we do not get it without it. The RST frame causes the engine to throw:

2016-12-22 10:40:53,550+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (default task-13) [hosts_syncAction_7c7fb25f-841c-44d1] Exception:
org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Vds timeout occured
    at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:]
    at org.ovirt.engine.core.vdsbroker.vdsbroker.FutureVDSCommand.get(FutureVDSCommand.java:70) [vdsbroker.jar:]
    at org.ovirt.engine.core.bll.network.host.HostSetupNetworkPoller.getValue(HostSetupNetworkPoller.java:56) [bll.jar:]
    at org.ovirt.engine.core.bll.network.host.HostSetupNetworkPoller.poll(HostSetupNetworkPoller.java:41) [bll.jar:]

This exception is badly handled, and it makes setupNetworks fail. In my opinion we need to fix the Poller so that it can handle this exception. Based on this I am changing the team.
I'd rather disagree with the analysis that the problem is related to the code in HostSetupNetworkPoller. It knows nothing about frames received by the layer that manages the TCP connection; it doesn't even know that a connection exists. The only logic it contains is whether the poll succeeded or not within the given period of time. I'd expect the underlying layer to maintain the TCP connection and renew it if needed (e.g. upon getting an RST frame).

Looks like JsonRpcClient reports status 5022, which is translated to VDSNetworkException on the engine side (BrokerCommandBase.proceedProxyReturnValue), even though the "Host.ping" method (what JsonRpcVdsServer.timeBoundPoll sends) implies resetting the connection upon a network failure.

My patch [1] should work around the problem, but I'd rather avoid merging it. IMHO the treatment should be in the vdsm-jsonrpc-java client code.

[1] - https://gerrit.ovirt.org/69108
(In reply to Yevgeny Zaspitsky from comment #3)
> I'd rather disagree with the analysis that the problem is related to the
> code in HostSetupNetworkPoller. It knows nothing about frames being received
> by the layer that manages the TCP connection, it even doesn't know of a
> connection existence. The only logic that it includes, is whether the poll
> succeed or not in the given period of time. I'd expect the underlying layer
> to maintain the TCP connection and renew that if needed (e.g. upon getting
> RST frame).

Correct, and the poller should have no knowledge about it.

> Looks like JsonRpcClient reports status 5022, which is being translated to
> VDSNetworkException on the engine side
> (BrokerCommandBase.proceedProxyReturnValue), even though "Host.ping" method
> (what JsonRpcVdsServer.timeBoundPoll sends) means resetting the connection
> upon a network failure.
>
> My patch [1] should work around the problem, but I'd rather avoid merging
> it. IMHO the treatment should be in vdsm-jsonrpc-java-client code.

The jsonrpc code is quite stupid and doesn't know the context it is running in. It needs to propagate any network-level issues to the higher layer. When NetworkManager is running we know that TCP behaves differently and we get an RST frame. In this situation there is nothing else to do except propagate the difference. Since the behaviour is different, we need to handle it differently in the higher layer. I think that we should go with [1] if it fixes the handling for this specific use case.

> [1] - https://gerrit.ovirt.org/69108
(In reply to Piotr Kliczewski from comment #4)
> I think that we should go with [1] if it fixes issue handling for this
> specific use case.
>
> [1] - https://gerrit.ovirt.org/69108

The bug doesn't reproduce anymore (Meni tried to reproduce it), so it would be hard to prove that [1] fixes it. On the other hand, if you're in favor of [1], please give it a score ;-).
(In reply to Yevgeny Zaspitsky from comment #5)
> The bug doesn't reproduce anymore (Meni tried to reproduce), so it'd be hard
> to prove that [1] fixes it.
> On the other hand if you're in favor of [1], please give a score for that
> ;-) .

Maybe the issue is related to a specific NetworkManager version. Are we testing with the same version?
Piotr, maybe you can help us emulate an RST over the jsonrpc socket. I'd like to modify Vdsm to finish setupNetworks successfully, but RST the connection slightly before that.
To send an RST on a TCP connection, enable the SO_LINGER option with a zero timeout and then close the socket. We would need to hack the code to do it. Google says we could do:

    soc.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))
    soc.close()
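For reference, here is a minimal self-contained demo of the trick (my own sketch, not Vdsm code; it assumes Python 3 and uses a throwaway loopback server). Packing `'ii'` as `(1, 0)` sets `l_onoff=1, l_linger=0`, so `close()` aborts the connection with an RST instead of the normal FIN handshake, and the peer's next `recv()` fails with ECONNRESET:

```python
import socket
import struct
import threading

def rst_server(ports, ready):
    """Accept one connection and close it so the peer sees a TCP RST."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.bind(('127.0.0.1', 0))  # any free port
    srv.listen(1)
    ports.append(srv.getsockname()[1])
    ready.set()
    conn, _ = srv.accept()
    # l_onoff=1, l_linger=0: close() aborts the connection, sending RST.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER,
                    struct.pack('ii', 1, 0))
    conn.close()
    srv.close()

ports, ready = [], threading.Event()
server = threading.Thread(target=rst_server, args=(ports, ready))
server.start()
ready.wait()

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(('127.0.0.1', ports[0]))
server.join()  # server side has closed; the RST has been sent on loopback
try:
    client.recv(1024)   # a clean FIN shutdown would return b'' here
    outcome = 'eof'
except OSError:         # ECONNRESET: the peer aborted with RST
    outcome = 'reset'
client.close()
print(outcome)
```

Something similar could be injected into Vdsm's socket handling right after setupNetworks completes to reproduce the engine-side failure on demand.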
Meni, could you please tell whether the environment (host side: NetworkManager or other packages) changed between the time the bug was opened and the attempt to reproduce it? If yes, could you please tell what changed?
I don't know. I'm pretty sure that NM was the same version.
I am closing the bug as it no longer reproduces, so there is no point in trying to verify the change. The fix that was applied is proper handling (ignoring) of the exception that was seen in the log.
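The shape of that fix, tolerating a transient network error while the poll deadline has not yet expired, can be illustrated with a small sketch. This is a language-neutral illustration only (the engine side is Java); the names `NetworkError` and `poll_until_up` are hypothetical, not the engine's actual API:

```python
import time

class NetworkError(Exception):
    """Stand-in for a transient network failure such as VDSNetworkException."""

def poll_until_up(check, timeout=90.0, interval=3.0, clock=time.monotonic):
    # Poll check() until it reports success or the overall timeout expires.
    # A transient NetworkError (e.g. the RST seen while the host reconfigures
    # its networking) is swallowed and the poll simply retries, instead of
    # failing the whole setupNetworks flow.
    deadline = clock() + timeout
    while clock() < deadline:
        try:
            if check():
                return True
        except NetworkError:
            pass  # connection reset mid-setup: ignore and retry
        time.sleep(interval)
    return False
```

The key point is that only the overall deadline fails the operation; individual failed probes do not.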