Bug 1408175 - SetupNetworks fails if connection to Vdsm is RST while it takes place
Summary: SetupNetworks fails if connection to Vdsm is RST while it takes place
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: BLL.Infra
Version: 4.1.0
Hardware: x86_64
OS: Linux
high
high
Target Milestone: ovirt-4.1.0-beta
: 4.1.0.2
Assignee: Yevgeny Zaspitsky
QA Contact: Meni Yakove
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2016-12-22 11:24 UTC by Meni Yakove
Modified: 2017-01-23 11:25 UTC (History)
5 users (show)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-01-19 15:00:53 UTC
oVirt Team: Network
Embargoed:
rule-engine: ovirt-4.1+


Attachments (Terms of Use)
vdsm, supervdsm, engine and tcpdump logs (2.18 MB, application/zip)
2016-12-22 11:25 UTC, Meni Yakove
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 69108 0 master MERGED engine: avoid PollVDSCommand to fail upon VDSNetworkException 2020-07-31 11:10:31 UTC
oVirt gerrit 70737 0 ovirt-engine-4.1 MERGED engine: avoid PollVDSCommand to fail upon VDSNetworkException 2020-07-31 11:10:31 UTC

Description Meni Yakove 2016-12-22 11:24:34 UTC
Description of problem:
SetupNetworks fails on connectivity check.

2016-12-22 11:55:30,509 ERROR (JsonRpc (StompReactor)) [vds.dispatcher] SSL error receiving from <yajsonrpc.betterAsyncore.Dispatcher connected ('::1', 36960, 0, 0) at 0x268c5f0>: unexpected eof (better
Asyncore:113)

MainProcess|jsonrpc/5::ERROR::2016-12-22 11:24:35,347::supervdsmServer::97::SuperVdsm.ServerCallback::(wrapper) Error in setupNetworks
Traceback (most recent call last):
  File "/usr/share/vdsm/supervdsmServer", line 95, in wrapper
    res = func(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 261, in setupNetworks
    _setup_networks(networks, bondings, options)
  File "/usr/lib/python2.7/site-packages/vdsm/network/api.py", line 282, in _setup_networks
    netswitch.setup(networks, bondings, options, in_rollback)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch.py", line 132, in setup
    _setup_legacy(legacy_nets, legacy_bonds, options, in_rollback)
  File "/usr/lib/python2.7/site-packages/vdsm/network/netswitch.py", line 155, in _setup_legacy
    connectivity.check(options)
  File "/usr/lib/python2.7/site-packages/vdsm/network/connectivity.py", line 46, in check
    'connectivity check failed')
ConfigNetworkError: (10, 'connectivity check failed')


Version-Release number of selected component (if applicable):
ovirt-engine-4.2.0-0.0.master.20161219154406.gitf36001c.el7.centos.noarch
vdsm-4.18.999-1219.gite8c2926.el7.centos.x86_64
NetworkManager-1.4.0-13.el7_3.x86_64


Steps to Reproduce:
1. host_network_api/sync_test case 02


Actual results:
SetupNetwork fails

Expected results:
SetupNetwork succeeded 

Additional info:
2016-12-22 11:22:28,427+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (default task-13) [hosts_syncAction_1845d67a-7ab3-4aff] Exception:
                               │                  │ org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Vds timeout occured
                              org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:]
                               org.ovirt.engine.core.vdsbroker.vdsbroker.FutureVDSCommand.get(FutureVDSCommand.java:70) [vdsbroker.jar:]
                               org.ovirt.engine.core.bll.network.host.HostSetupNetworkPoller.getValue(HostSetupNetworkPoller.java:56) [bll.jar:]
                               org.ovirt.engine.core.bll.network.host.HostSetupNetworkPoller.poll(HostSetupNetworkPoller.java:41) [bll.jar:]

Comment 1 Meni Yakove 2016-12-22 11:25:54 UTC
Created attachment 1234712 [details]
vdsm, supervdsm, engine and tcpdump logs

tcpdump was captured on engine while setupNetworks was sent.

Comment 2 Piotr Kliczewski 2016-12-22 12:12:37 UTC
The difference is that we get RST frame when NetworkManager is running and we do not get it without it.

The RST frame cause engine to throw:

2016-12-22 10:40:53,550+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.PollVDSCommand] (default task-13) [hosts_syncAction_7c7fb25f-841c-44d1] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Vds timeout occured
	at org.ovirt.engine.core.vdsbroker.vdsbroker.BrokerCommandBase.proceedProxyReturnValue(BrokerCommandBase.java:188) [vdsbroker.jar:]
	at org.ovirt.engine.core.vdsbroker.vdsbroker.FutureVDSCommand.get(FutureVDSCommand.java:70) [vdsbroker.jar:]
	at org.ovirt.engine.core.bll.network.host.HostSetupNetworkPoller.getValue(HostSetupNetworkPoller.java:56) [bll.jar:]
	at org.ovirt.engine.core.bll.network.host.HostSetupNetworkPoller.poll(HostSetupNetworkPoller.java:41) [bll.jar:]

which is badly handled and it makes the setupNetwork to fail.

In my opinion we need to fix Poller to be able to handle this exception. Based on this I am changing a team.

Comment 3 Yevgeny Zaspitsky 2016-12-25 18:36:18 UTC
I'd rather disagree with the analysis that the problem is related to the code in HostSetupNetworkPoller. It knows nothing about frames being received by the layer that manages the TCP connection, it even doesn't know of a connection existence. The only logic that it includes, is whether the poll succeed or not in the given period of time. I'd expect the underlying layer to maintain the TCP connection and renew that if needed (e.g. upon getting RST frame).

Looks like JsonRpcClient reports status 5022, which is being translated to VDSNetworkException on the engine side (BrokerCommandBase.proceedProxyReturnValue), even though "Host.ping" method (what JsonRpcVdsServer.timeBoundPoll sends) means resetting the connection upon a network failure.

My patch [1] should work around the problem, but I'd rather avoid merging it. IMHO the treatment should be in vdsm-jsonrpc-java-client code.

[1] - https://gerrit.ovirt.org/69108

Comment 4 Piotr Kliczewski 2017-01-02 08:17:32 UTC
(In reply to Yevgeny Zaspitsky from comment #3)
> I'd rather disagree with the analysis that the problem is related to the
> code in HostSetupNetworkPoller. It knows nothing about frames being received
> by the layer that manages the TCP connection, it even doesn't know of a
> connection existence. The only logic that it includes, is whether the poll
> succeed or not in the given period of time. I'd expect the underlying layer
> to maintain the TCP connection and renew that if needed (e.g. upon getting
> RST frame).

Correct and the poller should have no knowledge about it.

> 
> Looks like JsonRpcClient reports status 5022, which is being translated to
> VDSNetworkException on the engine side
> (BrokerCommandBase.proceedProxyReturnValue), even though "Host.ping" method
> (what JsonRpcVdsServer.timeBoundPoll sends) means resetting the connection
> upon a network failure.
> 
> My patch [1] should work around the problem, but I'd rather avoid merging
> it. IMHO the treatment should be in vdsm-jsonrpc-java-client code.

Jsonrpc code is quite stupid and it doesn't know the context it is running in. It needs to propagate any network level issues to the higher layer. When network manager is running we know that TCP behaves differently and we get RST frame. In this situation there is nothing else to do except to propagate the difference. Since the behaviour is different we need to handle it differently in higher layer.

I think that we should go with [1] if it fixes issue handling for this specific use case.

> 
> [1] - https://gerrit.ovirt.org/69108

Comment 5 Yevgeny Zaspitsky 2017-01-02 10:04:52 UTC
(In reply to Piotr Kliczewski from comment #4)
> I think that we should go with [1] if it fixes issue handling for this
> specific use case.
> 
> > 
> > [1] - https://gerrit.ovirt.org/69108

The bug doesn't reproduce anymore (Meni tried to reproduce), so it'd be hard to prove that [1] fixes it.
On the other hand if you're in favor of [1], please give a score for that ;-) .

Comment 6 Piotr Kliczewski 2017-01-02 10:43:04 UTC
(In reply to Yevgeny Zaspitsky from comment #5)

> 
> The bug doesn't reproduce anymore (Meni tried to reproduce), so it'd be hard
> to prove that [1] fixes it.
> On the other hand if you're in favor of [1], please give a score for that
> ;-) .

Maybe the issue is related to specific NetworkManager version. Are we testing with the same version?

Comment 7 Dan Kenigsberg 2017-01-02 11:38:30 UTC
Piotr, Maybe you can help us emulate a RST over the jsonrpc socket. I'd like to modify Vdsm to finish setupNetwork successfully, but RST the connection slightly before that.

Comment 8 Piotr Kliczewski 2017-01-02 11:55:24 UTC
To send an RST on a TCP connection, set the SO_LINGER option to true with a zero timeout, then close the socket.

We would need to hack the code to do it. Google says that we could do:

soc.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack('ii', 1, 0))
soc.close()

Comment 9 Yevgeny Zaspitsky 2017-01-02 16:33:46 UTC
Meni, could you please tell if the environment (host side: NetworkManager or other packages) has been changed in the period between opening the bug and the attempt of reproducing it? If yes, could you please tell what has been changed?

Comment 10 Meni Yakove 2017-01-04 08:39:50 UTC
I don't know.
I'm pretty sure that NM was the same version.

Comment 11 Yevgeny Zaspitsky 2017-01-19 15:00:53 UTC
I close the bug as it doesn't reproduce, so there is no point in trying to verify the change.
The fix that was applied is a proper handling (ignoring) of the exception that was seen in the log.


Note You need to log in before you can comment on or make changes to this bug.