Bug 1377161

Summary: ovirt-ha-agent should talk to vdsm via loopback address or other socket family.
Product: Red Hat Enterprise Virtualization Manager
Reporter: Germano Veit Michel <gveitmic>
Component: ovirt-hosted-engine-ha
Assignee: Martin Sivák <msivak>
Status: CLOSED CURRENTRELEASE
QA Contact: Artyom <alukiano>
Severity: urgent
Priority: urgent
Version: 4.0.3
CC: amarchuk, dfediuck, fromani, gklein, gveitmic, lsurette, mkalinin, msivak, rgolan, ykaul, ylavi
Target Milestone: ovirt-4.0.4
Target Release: 4.0.4
Keywords: TestOnly, Triaged
Hardware: Unspecified
OS: Unspecified
oVirt Team: SLA
Type: Bug
Last Closed: 2016-10-17 13:09:39 UTC
Bug Depends On: 1358530

Description Germano Veit Michel 2016-09-19 02:31:08 UTC
Description of problem:

ovirt-ha-agent connects to vdsm via the management interface IP address. A network failure may bring the management interface down (if it is not a VM network -> no bridge -> link down), so a network failure causes ovirt-ha-agent to time out talking to vdsm. This breaks HA failover, as the agent gets stuck restarting and trying to connect to vdsm.

I believe ha-agent and vdsm should talk via the loopback device (localhost/127.0.0.1), which should never go down, via another address family such as AF_UNIX, or via something else that is more reliable.
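
To make this concrete, here is a minimal sketch using only the Python standard library (this is not the agent's actual code, which goes through vdsm's client library; the AF_UNIX path shown is purely hypothetical, and 54321 is vdsm's TCP port as seen in the lsof output below):

import socket

VDSM_PORT = 54321  # vdsm's TCP port, as seen in the lsof output below

def connect_via_hostname():
    # Current behaviour: the hostname resolves to the management IP,
    # which disappears when the management interface goes down.
    return socket.create_connection((socket.gethostname(), VDSM_PORT), timeout=5)

def connect_via_loopback():
    # Proposed behaviour: the loopback interface never goes down.
    return socket.create_connection(("127.0.0.1", VDSM_PORT), timeout=5)

def connect_via_unix_socket(path="/tmp/example-vdsm.sock"):
    # Alternative: AF_UNIX does not depend on any NIC at all.
    # The path is hypothetical; vdsm does not necessarily expose one.
    s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    s.settimeout(5)
    s.connect(path)
    return s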

Version-Release number of selected component (if applicable):
vdsm-4.18.11-1.el7ev.x86_64
ovirt-hosted-engine-ha-2.0.3-1.el7ev.noarch

How reproducible:
100%

Steps to Reproduce:
1. Unplug the network cable of the management network (if it is not a VM network).
2. The IP address of the interface goes down with it (removed from the FIB).
3. ha-agent gets stuck trying to talk to vdsm.
4. ha-agent does not realize the broker is no longer pinging the gateway. The score is not reduced by 1600 points.
5. ha-agent does not shut down the HE VM.
6. Other hosts cannot start the HE VM because the lock is still held.

As an alternative reproducer, if the management network is a VM network (bridged), just run 'ip link set dev ovirtmgmt down'; the management IP address disappears with it, and vdsm and ovirt-ha-agent lose contact.
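
A quick way to confirm that only the management-IP path is broken while vdsm itself is still up is a small reachability check like the one below (a sketch using only the Python standard library; 192.168.7.22 is the example management IP from this report and 54321 is vdsm's port as seen in the lsof output further down):

import socket
import sys

VDSM_PORT = 54321

def reachable(host, port=VDSM_PORT, timeout=3):
    # Return True if a TCP connection to host:port can be established.
    try:
        s = socket.create_connection((host, port), timeout=timeout)
        s.close()
        return True
    except socket.error:
        return False

if __name__ == "__main__":
    mgmt_ip = sys.argv[1] if len(sys.argv) > 1 else "192.168.7.22"
    print("management IP reachable:", reachable(mgmt_ip))
    print("loopback reachable:", reachable("127.0.0.1"))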

Actual results:
- Host score not lowered
- HE does not fail over to another host.

Expected results:
- HE fails over

Additional info:

ha-agent should connect to vdsm via 127.0.0.1, so that when the management IP address is removed the agent does not get stuck; the host score can then be reduced, the HE VM stopped, and another host can start it.

192.168.7.22 is the management interface IP address for the host:

$ cat sos_commands/process/lsof_-b_M_-n_-l | grep ovirt-ha | grep TCP
ovirt-ha- 19064             36   14u     IPv4              39445         0t0        TCP 192.168.7.22:45752->192.168.7.22:54321 (ESTABLISHED)
ovirt-ha- 19064 24267       36   14u     IPv4              39445         0t0        TCP 192.168.7.22:45752->192.168.7.22:54321 (ESTABLISHED)

Broker pinging gateway:

Thread-2039::INFO::2016-09-18 14:41:40,815::ping::52::ping.Ping::(action) Successfully pinged 192.168.7.254

Management network goes down:

Thread-2039::WARNING::2016-09-18 14:42:00,846::ping::48::ping.Ping::(action) Failed to ping 192.168.7.254

At the same time, the agent fails to talk to vdsm and goes into a loop, restarting:

MainThread::INFO::2016-09-18 14:41:44,867::states::421::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Engine vm running on localhost
MainThread::INFO::2016-09-18 14:41:44,870::hosted_engine::612::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm) Initializing VDSM
MainThread::INFO::2016-09-18 14:41:48,394::hosted_engine::639::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Connecting the storage
MainThread::INFO::2016-09-18 14:41:48,394::storage_server::218::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Connecting storage server
MainThread::INFO::2016-09-18 14:41:51,856::storage_server::232::ovirt_hosted_engine_ha.lib.storage_server.StorageServer::(connect_storage_server) Refreshing the storage domain
MainThread::INFO::2016-09-18 14:41:51,921::hosted_engine::666::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_storage_images) Preparing images
MainThread::INFO::2016-09-18 14:41:51,922::image::126::ovirt_hosted_engine_ha.lib.image.Image::(prepare_images) Preparing images
MainThread::INFO::2016-09-18 14:41:53,926::util::194::ovirt_hosted_engine_ha.lib.image.Image::(connect_vdsm_json_rpc) Waiting for VDSM to reply
MainThread::INFO::2016-09-18 14:41:55,930::util::194::ovirt_hosted_engine_ha.lib.image.Image::(connect_vdsm_json_rpc) Waiting for VDSM to reply
MainThread::INFO::2016-09-18 14:41:57,934::util::194::ovirt_hosted_engine_ha.lib.image.Image::(connect_vdsm_json_rpc) Waiting for VDSM to reply
.....
MainThread::INFO::2016-09-18 14:51:28,027::util::194::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(connect_vdsm_json_rpc) Waiting for VDSM to reply
MainThread::ERROR::2016-09-18 14:51:28,032::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Couldnt  connect to VDSM within 240 seconds' - trying to restart agent
MainThread::WARNING::2016-09-18 14:51:33,038::agent::208::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Restarting agent, attempt '1'

And there it goes again...

MainThread::INFO::2016-09-18 14:51:33,086::hosted_engine::612::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_initialize_vdsm) Initializing VDSM
MainThread::INFO::2016-09-18 14:51:35,169::util::194::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(connect_vdsm_json_rpc) Waiting for VDSM to reply
MainThread::INFO::2016-09-18 14:51:37,173::util::194::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(connect_vdsm_json_rpc) Waiting for VDSM to reply
MainThread::INFO::2016-09-18 14:51:39,176::util::194::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(connect_vdsm_json_rpc) Waiting for VDSM to reply
MainThread::INFO::2016-09-18 14:51:41,179::util::194::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(connect_vdsm_json_rpc) Waiting for VDSM to reply
MainThread::INFO::2016-09-18 14:51:43,182::util::194::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(connect_vdsm_json_rpc) Waiting for VDSM to reply
MainThread::INFO::2016-09-18 14:51:45,186::util::194::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(connect_vdsm_json_rpc) Waiting for VDSM to reply
MainThread::INFO::2016-09-18 14:51:47,189::util::194::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(connect_vdsm_json_rpc) Waiting for VDSM to reply

Comment 2 Roy Golan 2016-09-19 06:20:40 UTC
- Given there is no network, sanlock should expire, because we write the timestamp to the whiteboard, so I'd expect sanlock to free (kill) the resource (the VM).
Can you output `sanlock log_dump` when this happens?

- We clearly see a ping fail; how come we didn't lower the score? Can you run this with debug log level?

Comment 3 Yaniv Kaul 2016-09-19 06:22:05 UTC
(In reply to Roy Golan from comment #2)
> - Given there is no network, sanlock should expire, because we write the
> timestamp to the whiteboard, so I'd expect sanlock to free (kill) the
> resource (the VM).
> Can you output `sanlock log_dump` when this happens?

sanlock may be on FC, or accessed via an interface other than the mgmt one, no?

> 
> - We clearly see a ping fail; how come we didn't lower the score? Can you
> run this with debug log level?

Comment 4 Germano Veit Michel 2016-09-19 06:26:20 UTC
Hi Roy,

First of all thank you for the prompt response.

It's Fibre Channel storage; sanlock will not expire.

And there is also the case where storage might be accessed via a different NIC and network.

It did not lower the score because the agent is just looping (restarting and dying).

I can easily reproduce this in my env. Just put the ovirtmgmt network administratively down. Provided you reach your NFS/iSCSI storage via a different network you will see the exact same problem.

To me, the main problem is this: the IP address is attached to an interface that can go down when a cable is unplugged.

$ cat sos_commands/process/lsof_-b_M_-n_-l | grep ovirt-ha | grep TCP
ovirt-ha- 19064             36   14u     IPv4              39445         0t0        TCP 192.168.7.22:45752->192.168.7.22:54321 (ESTABLISHED)
ovirt-ha- 19064 24267       36   14u     IPv4              39445         0t0        TCP 192.168.7.22:45752->192.168.7.22:54321 (ESTABLISHED)

Unplugging a cable should not make vdsm and ha-agent stop talking.

If you need more info, please let me know.

Cheers,
Germano

Comment 6 Germano Veit Michel 2016-09-19 06:41:35 UTC
This change should fix it:

Changing 'socket.gethostname()' to 'localhost' in the vdsm lib the agent is using.

https://gerrit.ovirt.org/#/c/63308/2/lib/vdsm/jsonrpcvdscli.py

Please confirm.
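
For illustration only, the direction of that patch amounts to something like the sketch below (names simplified; this is not the real jsonrpcvdscli API, it only shows where the host choice enters the picture). Note that 'localhost' may resolve to ::1 first, which matches the IPv6 loopback connection observed in the next comment.

import socket

VDSM_PORT = 54321

def _vdsm_address(use_loopback=True):
    # Schematic version of the change: instead of socket.gethostname(),
    # which resolves to the management IP and vanishes when the interface
    # goes down, default to the loopback.
    host = "localhost" if use_loopback else socket.gethostname()
    return (host, VDSM_PORT)

def connect_to_vdsm():
    # Hypothetical helper, not the real vdsm client code; it only shows
    # the effect of the host choice.
    host, port = _vdsm_address()
    return socket.create_connection((host, port), timeout=5)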

Comment 7 Germano Veit Michel 2016-09-19 06:54:44 UTC
After applying that fix, it seems to fall back to using the IPv6 loopback:

ovirt-ha- 1503            36    8u     IPv6              78992       0t0        TCP [::1]:54156->[::1]:54321 (ESTABLISHED)
ovirt-ha- 1503 1912       36    8u     IPv6              78992       0t0        TCP [::1]:54156->[::1]:54321 (ESTABLISHED)

Looks good.

Comment 8 Roy Golan 2016-09-19 07:27:06 UTC
(In reply to Germano Veit Michel from comment #6)
> This change should fix it:
> 
> Changing 'socket.gethostname()' to 'localhost' in the vdsm lib the agent is
> using.
> 
> https://gerrit.ovirt.org/#/c/63308/2/lib/vdsm/jsonrpcvdscli.py
> 
> Please confirm.

Yes, it prevented other problems with the client, but it will also resolve the problem of the interface going down.

One more problem (that may be masked by this fix) is that we keep looping and not lowering the score, even though the ping failed, which doesn't look right. Martin?

Comment 9 Martin Sivák 2016-09-19 07:27:26 UTC
Guys, I think the same issue will happen to MOM and vdsClient tools. We use the default address the vdsm library provides to us and this should be fixed there as Germano found out.

Francesco: can you please confirm this? It might affect any vdsm client, not just HE (what about supervdsm btw?)

Comment 10 Yaniv Kaul 2016-09-19 07:30:39 UTC
(In reply to Martin Sivák from comment #9)
> Guys, I think the same issue will happen to MOM and vdsClient tools. We use
> the default address the vdsm library provides to us and this should be fixed
> there as Germano found out.
> 
> Francesco: can you please confirm this? It might affect any vdsm client, not
> just HE (what about supervdsm btw?)

An added bonus is that we can probably skip SSL in this case, but I assume it requires some further changes.
It would be nice to move to sockets.

Comment 11 Martin Sivák 2016-09-19 07:32:17 UTC
> One more problem (that may be masked by this fix) is that we keep
> looping and not lowering the score, even though the ping failed, which
> doesn't look right. Martin?

The ping monitor tries to reach the network gateway; it is not related to the vdsm socket and should still fail. We keep looping because we never get to the actual score calculation (and we no longer publish any updates). Other nodes should stop seeing updates from this host and act accordingly, though.

On the other hand, nobody will kill the local VM if sanlock is still fine.

Comment 13 Martin Sivák 2016-09-19 08:33:59 UTC
Ah, I see the localhost patch was already merged to the necessary branches. This requires no code change to hosted engine and is therefore test only.

Comment 14 Francesco Romani 2016-09-19 11:04:43 UTC
(In reply to Martin Sivák from comment #9)
> Guys, I think the same issue will happen to MOM and vdsClient tools. We use
> the default address the vdsm library provides to us and this should be fixed
> there as Germano found out.
> 
> Francesco: can you please confirm this? It might affect any vdsm client, not
> just HE (what about supervdsm btw?)

jsonrpcvdscli is indeed the recommended way to talk to Vdsm, so yes, it will affect each and every client.
Vdsm talks to Supervdsm using a UNIX domain socket, so that part should be fine.
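
As a minimal illustration of why such a channel is immune to NIC/link state (the socket path below is made up for the example and is not the real supervdsm socket):

import os
import socket
import threading
import time

SOCK_PATH = "/tmp/example-unix.sock"  # illustrative path, not the real supervdsm socket

def serve_once():
    # A UNIX domain socket lives in the filesystem and is handled entirely
    # inside the kernel; removing an IP address or unplugging a cable
    # cannot break it.
    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)
    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(SOCK_PATH)
    server.listen(1)
    conn, _ = server.accept()
    conn.sendall(b"pong")
    conn.close()
    server.close()

if __name__ == "__main__":
    threading.Thread(target=serve_once, daemon=True).start()
    time.sleep(0.5)  # give the server time to bind
    client = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    client.connect(SOCK_PATH)
    print(client.recv(4))  # b'pong', regardless of network interface state
    client.close()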

Comment 15 Artyom 2016-09-21 14:55:09 UTC
Verified on vdsm-4.18.13-1.el7ev.x86_64 and ovirt-hosted-engine-ha-2.0.4-1.el7ev.noarch

I was not able to check it on FC, so I checked it on NFS.

1) Configure two interfaces on the host and use one as the management network
2) Bring down the management interface with "ip link set dev ovirtmgmt down"
3) Check that vdsm still succeeds in talking to ovirt-ha-agent
4) Block the connection to the gateway via iptables and check that ovirt-ha-agent drops the host score to 1800
5) Check lsof:
# lsof | grep ovirt-ha | grep TCP
ovirt-ha- 119122                   vdsm    8u     IPv6            1342661       0t0        TCP localhost:52498->localhost:54321 (ESTABLISHED)
ovirt-ha- 119122                   vdsm   12u     IPv6            1362041       0t0        TCP localhost:52964->localhost:54321 (ESTABLISHED)
ovirt-ha- 119122 119531            vdsm    8u     IPv6            1342661       0t0        TCP localhost:52498->localhost:54321 (ESTABLISHED)
ovirt-ha- 119122 119531            vdsm   12u     IPv6            1362041       0t0        TCP localhost:52964->localhost:54321 (ESTABLISHED)
ovirt-ha- 119122 120865            vdsm    8u     IPv6            1342661       0t0        TCP localhost:52498->localhost:54321 (ESTABLISHED)
ovirt-ha- 119122 120865            vdsm   12u     IPv6            1362041       0t0        TCP localhost:52964->localhost:54321 (ESTABLISHED)