1378310 – Continuously restarting ha-agent breaks failover

Bug 1378310 - Continuously restarting ha-agent breaks failover

Summary: Continuously restarting ha-agent breaks failover

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Red Hat Enterprise Virtualization Manager
Classification:	Red Hat
Component:	ovirt-hosted-engine-ha
Sub Component:
Version:	4.0.3
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	ovirt-4.2.0
Target Release:	4.2.0
Assignee:	Martin Sivák
QA Contact:	Nikolai Sednev
Docs Contact:
URL:
Whiteboard:	hosted-engine
Depends On:	1393902 1525907
Blocks:	1388122
TreeView+	depends on / blocked

Reported:	2016-09-22 05:33 UTC by Germano Veit Michel
Modified:	2022-05-16 06:41 UTC (History)
CC List:	16 users (show)
Fixed In Version:
Doc Type:	If docs needed, set a value
Doc Text:
Clone Of:
Clones:	1388122 (view as bug list)
Environment:
Last Closed:	2018-05-15 17:32:29 UTC
oVirt Team:	SLA
Target Upstream Version:
Embargoed:
Flags:	lsvaty: testing_plan_complete-

Attachments	(Terms of Use)
Screenshot from 2017-12-20 15-15-45.png (460.58 KB, image/png) 2017-12-20 13:22 UTC, Nikolai Sednev	no flags	Details
View All

Links
System	ID	Priority	Status	Summary	Last Updated
Red Hat Product Errata	RHBA-2018:1472	None	None	None	2018-05-15 17:33:55 UTC
oVirt gerrit	80456	'None'	MERGED	init: Tuned restart procedure for the hosted-engine agent	2020-11-27 16:43:42 UTC
oVirt gerrit	80474	'None'	MERGED	logging: Loop execution time is useful thing and should be logged as info.	2020-11-27 16:43:43 UTC
oVirt gerrit	80475	'None'	MERGED	agent: Removed retries	2020-11-27 16:43:43 UTC
oVirt gerrit	80476	'None'	MERGED	storage: Removed storage re-initialization on failure	2020-11-27 16:43:43 UTC
oVirt gerrit	80521	'None'	MERGED	agent: Monitoring loop should raise exception on failure	2020-11-27 16:44:07 UTC
oVirt gerrit	80522	'None'	MERGED	storage: Storage validation should raise a specific expection.	2020-11-27 16:43:42 UTC

Description Germano Veit Michel 2016-09-22 05:33:58 UTC

Description of problem:

1. Hosted Engine Storage on Fibre Channel
2. Host is running HE
3. loses network
4. score down by 1600 points
5. ha-agent decides to shutdown HE VM
6. broker does not responding to agent (notify state_transition)
7. agent restarts, goto 6. (agent talks fine to broker on restart, but times out on state_transition)
~~~
Result: HostedEngine VM not shutdown, stuck forever on this host, which does not have network.

I thought it was related, but BZ 1359059 is supposed to be fixed here:
$ cat installed-rpms | grep engine-ha 
ovirt-hosted-engine-ha-2.0.3-1.el7ev.noarch

MainThread::INFO::2016-09-21
10:47:08,718::states::128::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(score)
Penalizing score by 1600 due to gateway status

MainThread::INFO::2016-09-21
10:47:08,720::hosted_engine::461::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(start_monitoring)
Current state EngineUp (score: 1800)

MainThread::ERROR::2016-09-21
10:47:18,770::states::413::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume)
Host <HOSTNAME> (id 2) score is significantly better
than local score, shutting down VM on this host

MainThread::INFO::2016-09-21
10:47:18,772::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify)
Trying: notify time=1474426038.77 type=state_transition
detail=EngineUp-EngineStop hostname='<HOSTNAME>'

<30 seconds silence>

MainThread::ERROR::2016-09-21
10:47:48,802::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate)
Connection closed: Connection timed out

Meanwhile, in the broker:

Thread-624275::INFO::2016-09-21
10:47:16,268::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
Connection closed
Thread-624276::INFO::2016-09-21
10:47:16,268::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
Connection established
Thread-624276::INFO::2016-09-21
10:47:16,269::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
Connection closed
Thread-300607::INFO::2016-09-21
10:47:16,282::mem_free::53::mem_free.MemFree::(action) memFree: 247721
Thread-624277::INFO::2016-09-21
10:47:19,729::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
Connection established
Thread-624277::INFO::2016-09-21
10:47:19,767::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
Connection closed
Thread-624278::INFO::2016-09-21
10:47:19,768::listener::134::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(setup)
Connection established
Thread-624278::INFO::2016-09-21
10:47:19,769::listener::186::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle)
Connection closed

So the agent restarts... and it's talking fine to the broker:

MainThread::INFO::2016-09-21 10:55:39,353::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor ping, options {'addr': '192.168.7.254'}
MainThread::INFO::2016-09-21 10:55:39,358::brokerlink::140::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Success, id 140626423492496
MainThread::INFO::2016-09-21 10:55:39,358::brokerlink::129::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(start_monitor) Starting monitor mgmt-bridge, options {'use_ssl': 'true', 'bridge_name': 'ovirtmgmt', 'address': '0'}

Refreshes the shared storage metadata and, suprise:

MainThread::INFO::2016-09-21 11:19:08,576::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1474427948.58 type=state_transition detail=StartState-ReinitializeFSM hostname='<HOSTNAME>'
MainThread::ERROR::2016-09-21 11:19:38,607::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out
MainThread::ERROR::2016-09-21 11:19:38,617::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': '<HOSTNAME>'}: Connection timed out' - trying to restart agent

Restarts again... Same thing.

The agent keeps looping forever, it never tries to shutdown the HE VM again.

When network comes up, agent and broker talk notifications again.

MainThread::INFO::2016-09-21 11:29:50,541::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1474428590.54 type=state_transition detail=ReinitializeFSM-EngineStarting hostname='<HOSTNAME>'
MainThread::INFO::2016-09-21 11:29:50,622::brokerlink::120::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Success, was notification of state_transition (ReinitializeFSM-EngineStarting) sent? sent

FYI: 
BZ 1359059 is supposed to be fixed here:
$ cat installed-rpms | grep engine-ha 
ovirt-hosted-engine-ha-2.0.3-1.el7ev.noarch

BZ 1377161 should not be related as they are talking via file.

Issues:
1. looks like the broker was somehow stuck due to the network down. Since it uses a file socket (broker.socket) it shouldn't be related to the network down, but the log timing say it is. Stuck related to the ping failure? Or BZ 1359059 fix was not 100%?
2. ha-agent first priority on failover is to stop the HE VM and free sanlock, not sending emails or updating the shared storage metadata. ha-agent should not keep restarting when failing to do "non essential things". This breaks the failover mechanism. If VM was stopped on first try (instead of restarting to send an email?), the HE VM would have failed over. If the ha-agent shutdown the engine and later looped somewhere else, it would be fine, as the HE would have failed over, in critical situations that's probably good enough.
3. Why don't we restart the broker as well? It might be hung somewhere.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.0.3-1.el7ev.noarch

How reproducible:
100% on customer site.

Actual results:
HostedEngine stuck in one host, without network, forever

Expected results:
HostedEngine failover

Comment 2 Martin Sivák 2016-09-22 07:32:59 UTC

Can we please get the debug log from both broker and agent? Was that a new clean deployment or an upgraded one?

The fact that you actually get a timeout indicates that the fix is working (it was stuck forever before). My guess is that we query DNS somewhere and that causes the issue.

Comment 3 Martin Sivák 2016-09-22 07:48:01 UTC

One more question.. what was the configured email address to receive the notifications and the smtp server?

Comment 4 Germano Veit Michel 2016-09-23 02:42:47 UTC

Hi Martin,

Thanks for looking into this.

Regarding our talk yesterday, I am afraid we are back to the drawing board...

$ cat broker.conf 
[email]
smtp-server = localhost
smtp-port = 25
source-email = root@localhost
destination-emails = root@localhost

[notify]
state_transition = maintenance|start|stop|migrate|up|down

-> from the conf_volume and also confirmed in the hosts

I'll try to find some time to reproduce this within the next hours. 
Also uploading the logs now.

Comment 6 Germano Veit Michel 2016-09-23 02:51:49 UTC

(In reply to Martin Sivák from comment #2)
> Can we please get the debug log from both broker and agent? Was that a new
> clean deployment or an upgraded one?
> 

Brand new. Logs attached.

Comment 8 Germano Veit Michel 2016-09-27 23:11:46 UTC

Hi Martin,

Can't this be solved by stopping the 'restart ha-agent' during HE shutdown?

1. Wouldn't it make the whole mechanism more reliable? If it's detected that the HE should be shut down due to lower score, then that's probably what should be done. If something else is failing, then after the HE is down it can re-enable the 'restart' thing to try to fix it.

AFAIK, in this case, even if it doesn't push the host status to the shared storage, the HE VM would eventually fail over, right?

2. Any ideas on what exactly to debug at customer side? I'm thinking of creating some debug package to log more information on what exactly is going on.

3. Do you still think this could be related to DNS? They mail server is localhost and the host itself is in /etc/hosts.

Thanks,

Comment 11 Martin Sivák 2016-09-29 14:18:42 UTC

Just so we have the info here. We were discussing this with Germano and a test without notifications is needed. The Python smtplib tries to resolve its own hostname in https://hg.python.org/cpython/file/2.7/Lib/smtplib.py#l265 and the reverse lookup might be doing the DNS call.

Comment 12 Germano Veit Michel 2016-10-03 01:26:20 UTC

We are now waiting for the results of the proposed test.

Martin, wouldn't one of these two changes make sense?
1. skip mail notifications in case of network related score penalty, or allow them to fail without restarting the agent.
2. disable restarting the agent after it's found that HE should be shutdown.

Comment 13 Germano Veit Michel 2016-10-07 06:57:58 UTC

Martin, 

We changed, in notifications.py (broker)

def notify(**kwargs):
<...>
return send_email(smtp_config, email_body)

to 

return True

So send_mail is not called within the broker.


Broker stops pinging at 10:36

Thread-10::INFO::2016-10-05 10:35:04,373::ping::52::ping.Ping::(action) Successfully pinged 192.168.7.254
Thread-10::INFO::2016-10-05 10:36:04,461::ping::52::ping.Ping::(action) Successfully pinged 192.168.7.254
Thread-10::WARNING::2016-10-05 10:36:24,468::ping::48::ping.Ping::(action) Failed to ping 192.168.7.254
Thread-10::WARNING::2016-10-05 10:36:34,474::ping::48::ping.Ping::(action) Failed to ping 192.168.7.254

But Same results from agent side:

MainThread::ERROR::2016-10-05 10:46:57,348::states::413::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(consume) Host XXX (id 1) score is significantly better than local score, shutting down VM on this host

MainThread::INFO::2016-10-05 10:46:57,351::brokerlink::111::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(notify) Trying: notify time=1475635617.35 type=state_transition detail=EngineUp-EngineStop hostname='XXX'

MainThread::ERROR::2016-10-05 10:47:27,381::brokerlink::279::ovirt_hosted_engine_ha.lib.brokerlink.BrokerLink::(_communicate) Connection closed: Connection timed out
MainThread::ERROR::2016-10-05 10:47:27,381::agent::205::ovirt_hosted_engine_ha.agent.agent.Agent::(_run_agent) Error: 'Failed to start monitor <type 'type'>, options {'hostname': 'rhv-prod-h03.etanah.ptgkl.gov.my'}: Connection timed out' - trying to restart agent

So the problem is somewhere else?

Thanks,
Germano

Comment 19 Germano Veit Michel 2016-10-31 01:13:07 UTC

Hi Martin,

The new logs came in. This time it did not even lower the score as the agent timed out before there. Nevertheless, the error is very, very similar to the other occurrences.

From what I can see, the issue is caused by this, in the broker:

def notify(**kwargs):
    ....
    logger.debug("nofity: %s" % (repr(kwargs),))   <--- typo, 'nofity'
    ....
    heconf = config.Config(logger=logger)
    heconf.refresh_local_conf_file(  <----------------- taking too long
        localcopy_filename=constants.NOTIFY_CONF_FILE,
        archive_fname=constants.NOTIFY_CONF_FILE_ARCHIVE_FNAME,
    )
    ....

During "network cable unplug" refresh_local_conf_file takes too long, way more than the 30 seconds the agent waits. So the agent times out.

Look at this:

Broker receives the notification request from the agent:

Thread-5::DEBUG::2016-10-28 18:25:52,951::listener::163::ovirt_hosted_engine_ha.broker.listener.ConnectionHandler::(handle) Input: notify time=1477650352.95 type=state_transition detail=EngineUp-EngineUpBadHealth hostname='XXX'
Thread-5::DEBUG::2016-10-28 18:25:52,952::notifications::50::ovirt_hosted_engine_ha.broker.notifications.Notifications::(notify) nofity: {'hostname': 'XXX', 'type': 'state_transition', 'detail': 'EngineUp-EngineUpBadHealth', 'time': '1477650352.95'}

12 minutes later...

Thread-5::DEBUG::2016-10-28 18:37:25,918::config::251::ovirt_hosted_engine_ha.broker.notifications.Notifications.config::(refresh_local_conf_file) Reading 'broker.conf' from '/rhev/data-center/mnt/blockSD/ccd8c49a-11ec-4078-bc33-71f07c536f2e/images/ad0be349-ce6c-41e2-9c25-cc9c2ecc829b/656d4a6f-b68a-4dbc-aba2-9ae4d1dfcb65'
Thread-5::DEBUG::2016-10-28 18:37:25,918::heconflib::69::ovirt_hosted_engine_ha.broker.notifications.Notifications.config::(_dd_pipe_tar) executing: 'sudo -u vdsm dd if=/rhev/data-center/mnt/blockSD/ccd8c49a-11ec-4078-bc33-71f07c536f2e/images/ad0be349-ce6c-41e2-9c25-cc9c2ecc829b/656d4a6f-b68a-4dbc-aba2-9ae4d1dfcb65 bs=4k'
Thread-5::DEBUG::2016-10-28 18:37:25,918::heconflib::70::ovirt_hosted_engine_ha.broker.notifications.Notifications.config::(_dd_pipe_tar) executing: 'tar -tvf -'

I also noticed it's taking roughly the same time for this function to get the vm.conf. And it looks very similar.

One thing that I think it related is that the ISO domain went down when the network was unplugged. I don't know at which step in refresh_local_conf_file we reload storage domain ids on vdsm (if we do that), but the whole thing moved on at the exact time vdsm gave up on the ISO domain.

Thread-12::ERROR::2016-10-28 18:37:25,918::monitor::484::Storage.Monitor::(_pathChecked) Error checking path /rhev/data-center/mnt/192.168.3.122:_volume1_iso__domain/d58998e0-a388-4c80-b6b6-0a0676e74c5d/dom_md/metadata
Traceback (most recent call last):
  File "/usr/share/vdsm/storage/monitor.py", line 482, in _pathChecked
    delay = result.delay()
  File "/usr/lib/python2.7/site-packages/vdsm/storage/check.py", line 367, in delay
    raise exception.MiscFileReadException(self.path, self.rc, self.err)
MiscFileReadException: Internal file read failure: (u'/rhev/data-center/mnt/192.168.3.122:_volume1_iso__domain/d58998e0-a388-4c80-b6b6-0a0676e74c5d/dom_md/metadata', 1, bytearray(b"/usr/bin/dd: failed to open \'/rhev/data-center/mnt/192.168.3.122:_volume1_iso__domain/d58998e0-a388-4c80-b6b6-0a0676e74c5d/dom_md/metadata\': Input/output error\n"))
jsonrpc.Executor/3::DEBUG::2016-10-28 18:37:25,925::hsm::2367::Storage.HSM::(__prefetchDomains) Found SD uuids: (u'd58998e0-a388-4c80-b6b6-0a0676e74c5d',)
jsonrpc.Executor/3::DEBUG::2016-10-28 18:37:25,926::hsm::2427::Storage.HSM::(connectStorageServer) knownSDs: {ccd8c49a-11ec-4078-bc33-71f07c536f2e: storage.blockSD.findDomain, 4462059e-90bf-4c29-8d9f-715cb326d752: storage.blockSD.findDomain, d58998e0-a388-4c80-b6b6-0a0676e74c5d: storage.nfsSD.findDomain}
jsonrpc.Executor/3::INFO::2016-10-28 18:37:25,927::logUtils::52::dispatcher::(wrapper) Run and protect: connectStorageServer, Return response: {'statuslist': [{'status': 0, 'id': u'1e835544-1ac3-4f2f-a31e-4b7a6bccac6f'}]}

Was the broker waiting on vdsm?

I'm uploading the logs for you. For reference, here are the times:

Thread-6::INFO::2016-10-28 18:12:17,569::ping::52::ping.Ping::(action) Successfully pinged 192.168.7.254
<Network outage start>
Thread-6::WARNING::2016-10-28 18:13:27,658::ping::48::ping.Ping::(action) Failed to ping 192.168.7.254
<Network outage stop>
Thread-6::INFO::2016-10-28 18:34:51,763::ping::52::ping.Ping::(action) Successfully pinged 192.168.7.254

Thanks

Comment 21 Martin Sivák 2016-10-31 10:20:42 UTC

Hi Germano, this is entirely possible! I will dig into the config refresh routine to see what can be done. Is the customer using NFS, iSCSI, FC or Gluster?

Comment 22 Germano Veit Michel 2016-11-01 00:12:40 UTC

(In reply to Martin Sivák from comment #21)
> Hi Germano, this is entirely possible! I will dig into the config refresh
> routine to see what can be done. Is the customer using NFS, iSCSI, FC or
> Gluster?

Storage is FC, ISO Domain is NFS. When network goes down, so does ISO Domain.

Comment 23 Yanir Quinn 2016-11-08 10:15:59 UTC

2 attempts to reproduce this issue by blocking storage communication (using iptables):

1.block communication to nfs port on the storage server.

  -vm was down and after restoring communication brought up properly. 

2.isolated the iso domain to a different IP  and blocked only the communication   to that IP (nfs/tcp).

 -agent got to a point where :  (getEngineVMOVF) Extracting Engine VM OVF from   the OVF_STORE    and didn't move from there
after restoring connectivity the agent recovered and vm was up and running.

Comment 24 Nikolai Sednev 2016-11-08 10:52:28 UTC

(In reply to Yanir from comment #23)
> 2 attempts to reproduce this issue by blocking storage communication (using
> iptables):
> 
> 1.block communication to nfs port on the storage server.
> 
>   -vm was down and after restoring communication brought up properly. 
> 
> 2.isolated the iso domain to a different IP  and blocked only the
> communication   to that IP (nfs/tcp).
> 
>  -agent got to a point where :  (getEngineVMOVF) Extracting Engine VM OVF
> from   the OVF_STORE    and didn't move from there
> after restoring connectivity the agent recovered and vm was up and running.

I think that "network down" here is meant for disconnected physically NIC with no link up, which might be preventing ha-agent from getting to broker on host.
iptables won't simulate real network disconnection, as they're blocking only specific range of IPs or ports, and TCP/IP stack continues working on that interface.

I'd be trying to reproduce this issue with disabling NIC and then bringing it back on line.

Comment 25 Yaniv Kaul 2016-11-08 10:53:37 UTC

(In reply to Nikolai Sednev from comment #24)
> (In reply to Yanir from comment #23)
> > 2 attempts to reproduce this issue by blocking storage communication (using
> > iptables):
> > 
> > 1.block communication to nfs port on the storage server.
> > 
> >   -vm was down and after restoring communication brought up properly. 
> > 
> > 2.isolated the iso domain to a different IP  and blocked only the
> > communication   to that IP (nfs/tcp).
> > 
> >  -agent got to a point where :  (getEngineVMOVF) Extracting Engine VM OVF
> > from   the OVF_STORE    and didn't move from there
> > after restoring connectivity the agent recovered and vm was up and running.
> 
> I think that "network down" here is meant for disconnected physically NIC
> with no link up, which might be preventing ha-agent from getting to broker
> on host.
> iptables won't simulate real network disconnection, as they're blocking only
> specific range of IPs or ports, and TCP/IP stack continues working on that
> interface.

The network can be disconnected on the other side of the path...
A drop (vs. block) should simulate it well enough.

> 
> I'd be trying to reproduce this issue with disabling NIC and then bringing
> it back on line.

Comment 26 Nikolai Sednev 2016-11-08 10:57:58 UTC

(In reply to Yaniv Kaul from comment #25)
> (In reply to Nikolai Sednev from comment #24)
> > (In reply to Yanir from comment #23)
> > > 2 attempts to reproduce this issue by blocking storage communication (using
> > > iptables):
> > > 
> > > 1.block communication to nfs port on the storage server.
> > > 
> > >   -vm was down and after restoring communication brought up properly. 
> > > 
> > > 2.isolated the iso domain to a different IP  and blocked only the
> > > communication   to that IP (nfs/tcp).
> > > 
> > >  -agent got to a point where :  (getEngineVMOVF) Extracting Engine VM OVF
> > > from   the OVF_STORE    and didn't move from there
> > > after restoring connectivity the agent recovered and vm was up and running.
> > 
> > I think that "network down" here is meant for disconnected physically NIC
> > with no link up, which might be preventing ha-agent from getting to broker
> > on host.
> > iptables won't simulate real network disconnection, as they're blocking only
> > specific range of IPs or ports, and TCP/IP stack continues working on that
> > interface.
> 
> The network can be disconnected on the other side of the path...
> A drop (vs. block) should simulate it well enough.
> 
> > 
> > I'd be trying to reproduce this issue with disabling NIC and then bringing
> > it back on line.

Totally possible, network might get lost with several possible scenarios. Even erasing the IP address from NIC might work.

Comment 27 Yanir Quinn 2016-11-08 11:19:13 UTC

(In reply to Nikolai Sednev from comment #26)
> (In reply to Yaniv Kaul from comment #25)
> > (In reply to Nikolai Sednev from comment #24)
> > > (In reply to Yanir from comment #23)
> > > > 2 attempts to reproduce this issue by blocking storage communication (using
> > > > iptables):
> > > > 
> > > > 1.block communication to nfs port on the storage server.
> > > > 
> > > >   -vm was down and after restoring communication brought up properly. 
> > > > 
> > > > 2.isolated the iso domain to a different IP  and blocked only the
> > > > communication   to that IP (nfs/tcp).
> > > > 
> > > >  -agent got to a point where :  (getEngineVMOVF) Extracting Engine VM OVF
> > > > from   the OVF_STORE    and didn't move from there
> > > > after restoring connectivity the agent recovered and vm was up and running.
> > > 
> > > I think that "network down" here is meant for disconnected physically NIC
> > > with no link up, which might be preventing ha-agent from getting to broker
> > > on host.
> > > iptables won't simulate real network disconnection, as they're blocking only
> > > specific range of IPs or ports, and TCP/IP stack continues working on that
> > > interface.
> > 
> > The network can be disconnected on the other side of the path...
> > A drop (vs. block) should simulate it well enough.
> > 
> > > 
> > > I'd be trying to reproduce this issue with disabling NIC and then bringing
> > > it back on line.
> 
> Totally possible, network might get lost with several possible scenarios.
> Even erasing the IP address from NIC might work.

blocking using iptables was done by using drop rule (on nfs port) on the storage server side ,
e.g :
sudo iptables -I OUTPUT -d <IP> -p tcp --dport=2049 -j DROP
/
sudo iptables -I INPUT -d <IP> -p tcp --dport=2049 -j DROP

Comment 28 Germano Veit Michel 2016-11-09 01:38:41 UTC

(In reply to Nikolai Sednev from comment #24)
> I think that "network down" here is meant for disconnected physically NIC
> with no link up, which might be preventing ha-agent from getting to broker
> on host.

AFAICS, the ha-agent and ha-broker talk via AF_UNIX.
And this problem also reproduces running packages where BZ1377161 is fixed.

From what I can see the problem is not between the agent and the broker, but between broker and vdsm (refresh_local_conf_file). The communication channel between the broker and vdsm is just fine. But vdsm simply doesn't answer the broker because it's too worried about the unreachable ISO SD. So everything times out and the agent restarts.

> iptables won't simulate real network disconnection, as they're blocking only
> specific range of IPs or ports, and TCP/IP stack continues working on that
> interface.

Right, iptables is different from cable unplug as the interface link doesn't go down and the Host IP is not removed from the FIB. But I don't think this is relevant here.

The HE SD is over FC, which is never disconnected. Then when network goes down it looks like ha-agent hangs on refreshing it's local conf files because vdsm is "busy" on the ISO Domain - which is now unreachable - (and probably reproduces on any other SD that depends on the network). This is similar to your attempt #2. Attempt #1 is unrelated to this problem. If the ISO Domain (or possibly any other iSCSI/NFS domain) is unreachable due to the network down, it doesn't really seem to matter if the disconnect is local or in a different point in the network.

Comment 29 Germano Veit Michel 2016-11-18 07:21:53 UTC

Hi,

Any progress on this? Is more data required?

Thanks

Comment 30 Martin Sivák 2016-11-18 08:35:42 UTC

Hi Germano, we think we know all we needed and we even have reproducing steps. We will need some time to fix this though as the storage interaction is a bit tricky.

Comment 31 Germano Veit Michel 2016-11-20 23:10:22 UTC

Hi Martin, looking at the "Depends On" bug I understand the amount of work involved is substantial.

Thank you!

Comment 32 Martin Sivák 2016-11-23 13:44:32 UTC

Moving to 4.2 as we need to cooperate with the storage team on this.

Comment 33 Nir Soffer 2017-05-22 08:09:07 UTC

Martin, can you explain what is the storage related issues blocking you here?

Comment 34 Martin Sivák 2017-05-22 08:20:39 UTC

The issue here is that a lost connection to an unrelated NFS ISO domain causes the monitoring loop (we call repoStats iirc) to stale and wait for the NFS timeout before deciding we have an issue. This is related to our request for domain specific status check verb.

We obviously also lose the connection to the hosted engine related storage in case of total network failure, but that is nothing we can solve in software.

Comment 35 Nir Soffer 2017-05-24 07:28:00 UTC

(In reply to Martin Sivák from comment #34)
> The issue here is that a lost connection to an unrelated NFS ISO domain
> causes the monitoring loop (we call repoStats iirc) to stale and wait for
> the NFS timeout before deciding we have an issue.

repoStat cannot block (I looked again at the code now).

It must be some other request accessing storage, or hosted engine agent accessing
storage directly (like bug 1332813?).

> This is related to our
> request for domain specific status check verb.

I'm not sure this is really needed, but I posted this:
https://gerrit.ovirt.org/77224

Comment 36 Sandro Bonazzola 2017-06-07 12:09:21 UTC

Any update?

Comment 42 Nikolai Sednev 2017-12-13 12:33:32 UTC

I think that unless https://bugzilla.redhat.com/show_bug.cgi?id=1524119 is not properly verified, this bug should remain open, that is because we won't get proper migration of SHE-VM from ha-host with failed network interface to another best-score ha-host.

Comment 43 Nikolai Sednev 2017-12-20 13:20:34 UTC

1. Hosted Engine Storage on Fibre Channel
2. Host is running HE
3. loses network
4. score down by 1600 points
5. ha-agent decides to shutdown HE VM
6. broker does not responding to agent (notify state_transition)
7. agent restarts, goto 6. (agent talks fine to broker on restart, but times out on state_transition)

After step 3 I clearly see from second ha-host purple-vds2 that score on first ha-host which was running SHE-VM has changed to 1800 instead of 1600 as was reported. Then after a few moments it has dropped to 0.

[root@purple-vds2 ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : purple-vds1.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 1800
stopped                            : False
Local maintenance                  : False
crc32                              : c155f4ad
local_conf_timestamp               : 73843
Host timestamp                     : 73839
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=73839 (Wed Dec 20 15:09:12 2017)
        host-id=1
        score=1800
        vm_conf_refresh_time=73843 (Wed Dec 20 15:09:16 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : purple-vds2.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 744c0e8a
local_conf_timestamp               : 73748
Host timestamp                     : 73745
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=73745 (Wed Dec 20 15:09:23 2017)
        host-id=2
        score=3400
        vm_conf_refresh_time=73748 (Wed Dec 20 15:09:27 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False
[root@purple-vds2 ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : purple-vds1.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : cc791af5
local_conf_timestamp               : 73924
Host timestamp                     : 73899
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=73899 (Wed Dec 20 15:10:11 2017)
        host-id=1
        score=0
        vm_conf_refresh_time=73924 (Wed Dec 20 15:10:37 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=AgentStopped
        stopped=True


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : purple-vds2.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"reason": "vm not running on this host", "health": "bad", "vm": "down", "detail": "unknown"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 0c7ebc95
local_conf_timestamp               : 73854
Host timestamp                     : 73850
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=73850 (Wed Dec 20 15:11:09 2017)
        host-id=2
        score=3400
        vm_conf_refresh_time=73854 (Wed Dec 20 15:11:13 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineDown
        stopped=False

Then failover tarted shifting SHE-VM to second ha-host vds2:
[root@purple-vds2 ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : purple-vds1.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : cc791af5
local_conf_timestamp               : 73924
Host timestamp                     : 73899
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=73899 (Wed Dec 20 15:10:11 2017)
        host-id=1
        score=0
        vm_conf_refresh_time=73924 (Wed Dec 20 15:10:37 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=AgentStopped
        stopped=True


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : purple-vds2.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"reason": "failed liveliness check", "health": "bad", "vm": "up", "detail": "Up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 5a9b55b8
local_conf_timestamp               : 73982
Host timestamp                     : 73978
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=73978 (Wed Dec 20 15:13:16 2017)
        host-id=2
        score=3400
        vm_conf_refresh_time=73982 (Wed Dec 20 15:13:21 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineStarting
        stopped=False

On first ha-host I've seen error message as appears in attachment.


Finally SHE-VM got started on second ha-host vds2:
[root@purple-vds2 ~]# hosted-engine --vm-status


--== Host 1 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : False
Hostname                           : purple-vds1.qa.lab.tlv.redhat.com
Host ID                            : 1
Engine status                      : unknown stale-data
Score                              : 0
stopped                            : True
Local maintenance                  : False
crc32                              : cc791af5
local_conf_timestamp               : 73924
Host timestamp                     : 73899
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=73899 (Wed Dec 20 15:10:11 2017)
        host-id=1
        score=0
        vm_conf_refresh_time=73924 (Wed Dec 20 15:10:37 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=AgentStopped
        stopped=True


--== Host 2 status ==--

conf_on_shared_storage             : True
Status up-to-date                  : True
Hostname                           : purple-vds2.qa.lab.tlv.redhat.com
Host ID                            : 2
Engine status                      : {"health": "good", "vm": "up", "detail": "Up"}
Score                              : 3400
stopped                            : False
Local maintenance                  : False
crc32                              : 52b8c1b3
local_conf_timestamp               : 74141
Host timestamp                     : 74137
Extra metadata (valid at timestamp):
        metadata_parse_version=1
        metadata_feature_version=1
        timestamp=74137 (Wed Dec 20 15:15:55 2017)
        host-id=2
        score=3400
        vm_conf_refresh_time=74141 (Wed Dec 20 15:15:59 2017)
        conf_on_shared_storage=True
        maintenance=False
        state=EngineUp
        stopped=False

Moving to verified as reported issue had not been reproduced any more.

Tested on these components:
rhvm-appliance-4.2-20171207.0.el7.noarch
ovirt-hosted-engine-setup-2.2.2-1.el7ev.noarch
ovirt-hosted-engine-ha-2.2.2-1.el7ev.noarch
vdsm-4.20.9.3-1.el7ev.x86_64

Comment 44 Nikolai Sednev 2017-12-20 13:22:01 UTC

Created attachment 1370467 [details]
Screenshot from 2017-12-20 15-15-45.png

Comment 48 errata-xmlrpc 2018-05-15 17:32:29 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:1472

Comment 49 Franta Kust 2019-05-16 13:07:18 UTC

BZ<2>Jira Resync

Note You need to log in before you can comment on or make changes to this bug.