Description of problem:

Apparently a "Heartbeat exceeded" during connectStorageServer causes the engine to move perfectly working hosts to Non-Operational, causing mass migrations for nothing.

1. RHV-M loses connection to a host
2. Connects again
3. Among other things, does a connectStorageServer
4. Something disrupts the connection again; connectStorageServer fails with a Heartbeat Exceeded exception
5. The engine understands it as a failure to connect to the Pool and moves the host to Non-Operational
6. Unnecessary mass migrations

I'm not 100% sure on 5, but it's the only thing I can see. Everything else looks fine.

Also:
* VDSM answers are all timely; repoStats before and after the connectStorageServer are the same. No Storage Domain was problematic or unreported.
* In the specific case below, the OS running ovirt-engine had severe problems, causing the Heartbeat exceeded errors. But it can also be caused by a poor network.
* See 3 examples:

2017-08-28 03:42:19,675+05 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (org.ovirt.thread.pool-6-thread-41) [] Could not connect host 'dch11bl13' to pool 'DR': Heartbeat exceeded
2017-08-28 03:42:36,867+05 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (DefaultQuartzScheduler2) [db65f77] START, SetVdsStatusVDSCommand(HostName = dch11bl13, SetVdsStatusVDSCommandParameters:{runAsync='true', hostId='2a02097f-7026-4d63-93d7-3db474b1f889', status='NonOperational', nonOperationalReason='STORAGE_DOMAIN_UNREACHABLE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 226305ec
2017-08-28 03:42:39,329+05 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler2) [db65f77] EVENT_ID: VDS_SET_NONOPERATIONAL_DOMAIN(522), Correlation ID: db65f77, Job ID: 3f389dbf-c73b-45ad-8db2-fa5f6d9b475d, Call Stack: null, Custom Event ID: -1, Message: Host dch11bl13 cannot access the Storage Domain(s) <UNKNOWN> attached to the Data Center DR-CHENNAI. Setting Host state to Non-Operational.

-----------------------

2017-08-28 03:52:13,035+05 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (org.ovirt.thread.pool-6-thread-21) [] Could not connect host 'dch12bl01' to pool 'DR-CHENNAI': Heartbeat exceeded
2017-08-28 03:52:17,750+05 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (DefaultQuartzScheduler10) [17e08558] START, SetVdsStatusVDSCommand(HostName = dch12bl01, SetVdsStatusVDSCommandParameters:{runAsync='true', hostId='d7191d33-899d-49ea-9309-d51da80bc5eb', status='NonOperational', nonOperationalReason='STORAGE_DOMAIN_UNREACHABLE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 5cd84c52

-----------------------

2017-08-28 03:54:28,801+05 ERROR [org.ovirt.engine.core.bll.InitVdsOnUpCommand] (org.ovirt.thread.pool-6-thread-43) [] Could not connect host 'dch12bl04' to pool 'DR-CHENNAI': Heartbeat exceeded
2017-08-28 03:54:32,618+05 INFO [org.ovirt.engine.core.vdsbroker.SetVdsStatusVDSCommand] (DefaultQuartzScheduler3) [737e71c3] START, SetVdsStatusVDSCommand(HostName = dch12bl04, SetVdsStatusVDSCommandParameters:{runAsync='true', hostId='289abce1-5898-4be8-b582-dfe82905adf6', status='NonOperational', nonOperationalReason='STORAGE_DOMAIN_UNREACHABLE', stopSpmFailureLogged='false', maintenanceReason='null'}), log id: 444cf9a5

Version-Release number of selected component (if applicable):
ovirt-engine-4.1.2

How reproducible:
Unsure
It seems to be a ConnectStoragePool in initializeStorage() in InitVdsOnUpCommand. Does it really need to do this every single time a connection is re-established? These hosts were already connected to the Pool. A sketch of the suspected flow follows below.
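To make the suspected path concrete, here is a minimal, self-contained sketch of the failure mode as described in this report. It is an illustration only, not the actual ovirt-engine code: apart from InitVdsOnUpCommand/initializeStorage(), ConnectStoragePool and the STORAGE_DOMAIN_UNREACHABLE reason seen in the logs above, every name is a simplified stand-in.

// Illustrative sketch: a single transient "Heartbeat exceeded" during the
// one-shot ConnectStoragePool call is treated like a real storage failure.
public class ConnectOnUpSketch {

    // Stand-in for the engine's VDSNetworkException ("Heartbeat exceeded").
    static class HeartbeatExceededException extends RuntimeException {
        HeartbeatExceededException() { super("Heartbeat exceeded"); }
    }

    enum HostStatus { UP, NON_OPERATIONAL }

    HostStatus status = HostStatus.UP;

    // Stand-in for InitVdsOnUpCommand.initializeStorage(): one
    // ConnectStoragePool attempt, re-issued on every reconnect.
    void initializeStorage(Runnable connectStoragePool) {
        try {
            connectStoragePool.run();
        } catch (HeartbeatExceededException e) {
            // A transient engine<->host hiccup lands here and is handled
            // exactly like STORAGE_DOMAIN_UNREACHABLE: the host goes NonOp,
            // which is what triggers the mass migrations reported above.
            status = HostStatus.NON_OPERATIONAL;
        }
    }

    public static void main(String[] args) {
        ConnectOnUpSketch host = new ConnectOnUpSketch();
        host.initializeStorage(() -> { throw new HeartbeatExceededException(); });
        System.out.println(host.status); // NON_OPERATIONAL
    }
}

At this layer a dropped heartbeat is indistinguishable from a genuine storage connectivity failure, which is the crux of this report.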
The real question is why we got heartbeat exceeded. I suspect it's due to RHV-M's terrible state. I think moving a host to Non-Operational on heartbeat exceeded makes sense, so I'm not sure what we should look at here.
(In reply to Yaniv Kaul from comment #3)
> The real question is why we got heartbeat exceeded. I suspect it's due to
> RHV-M's terrible state.

Yes, this is from that hung (softlockups) RHV-M.

> I think moving a host to Non-Operational on heartbeat exceeded makes
> sense, so I'm not sure what we should look at here.

It doesn't look like a good idea to me. What if only RHV-M has network problems (switch port, remote site WAN, VPN etc.)? It can trigger mass migrations on a perfectly working cluster. IMO Non-Operational should be triggered based on received data/returns, not exceptions. But this is your decision.
Checking the logs I see that "heartbeat exceeded" is very common in the log, starting with:

2017-08-27 00:09:13,644+05 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (DefaultQuartzScheduler1) [] Command 'GetAllVmStatsVDSCommand(HostName = dch12bl13.cbec.gov.in, VdsIdVDSCommandParametersBase:{runAsync='true', hostId='e31634bf-d4a0-47e8-ba81-d49c280acaa3'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded

I see many hosts being unable to communicate. It could be either a too-small heartbeat timeout value (for busier environments) or another issue that we need to investigate. Please provide one or more failing host logs.
Piotr,

The network and the hosts are all fine here; this is not about any bug in jsonrpc. The problem is not even the engine, but the (virtual) machine the engine is running on. It was completely broken and locked up: Java and postgres processes stuck on CPUs, not running, softlockups everywhere, IO not happening, etc. This triggered some very erratic behavior (corner cases) which ended up with undesirable results. This BZ is about one of them: should a Heartbeat Exceeded during ConnectStoragePool move a host to Non-Operational? During other commands it does not do this.

So suppose a manager<->host connection is dropped because of a network problem (quite common). The manager connects again, and during ConnectStoragePool the communication problem happens again, triggering Heartbeat exceeded. Should the host move to Non-Operational? If yes, feel free to close the BZ.
Please provide steps to reproduce the issue.
Reducing severity to high, as this happens only if the engine is running inside a VM and this VM was frozen (based on the above description).
From an exception handling perspective it looks like it needs to be fixed. I understand that the critical piece is the handling of ConnectStoragePool. Let me ask Allon what he thinks about such changes.
(In reply to Martin Perina from comment #12)
> Reducing severity to high, as this happens only if the engine is running
> inside a VM and this VM was frozen (based on the above description).

Hmm, I think it can happen on network issues too. As long as the Heartbeat exception is raised during a connectStoragePool, the same can happen. One just needs bad luck.

(In reply to Piotr Kliczewski from comment #13)
> From an exception handling perspective it looks like it needs to be fixed.
> I understand that the critical piece is the handling of
> ConnectStoragePool. Let me ask Allon what he thinks about such changes.

Yup, that's why I raised the BZ. Thanks for following up.
(In reply to Piotr Kliczewski from comment #13)
> From an exception handling perspective it looks like it needs to be fixed.
> I understand that the critical piece is the handling of
> ConnectStoragePool. Let me ask Allon what he thinks about such changes.

If a host cannot connect to the storage pool (i.e. it does not start monitoring all the relevant storage domains), it should indeed be NonOp.

Piotr - can you elaborate on what you think should be fixed here? To me, this looks like the correct behavior.
(In reply to Allon Mureinik from comment #15)
> If a host cannot connect to the storage pool (i.e. it does not start
> monitoring all the relevant storage domains), it should indeed be NonOp.
> Piotr - can you elaborate on what you think should be fixed here?
> To me, this looks like the correct behavior.

Allon, as Germano described above, there is a customer running the engine in a VM which gets frozen from time to time. The issue is that it was frozen during the connect storage command. The failure was detected and handled by moving the host to NonOp (the host was healthy), and it triggered unneeded migrations.

What do you think about retrying the call, or performing a simple check whether the host is accessible? This should mitigate the unneeded migrations. A rough sketch of the retry idea follows below.
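A minimal sketch of the retry idea, for illustration only; the helper name, attempt count and backoff parameter are invented here, and nothing in it is the actual engine implementation:

import java.util.concurrent.Callable;

public class RetryingConnectSketch {

    // Retry a call a few times before letting the failure propagate.
    // In practice the caught type would be the engine's VDSNetworkException.
    static <T> T retryOnHeartbeat(Callable<T> call, int attempts, long backoffMs)
            throws Exception {
        Exception last = null;
        for (int i = 0; i < attempts; i++) {
            try {
                return call.call();
            } catch (Exception e) {
                last = e;
                Thread.sleep(backoffMs); // give a transient hiccup time to clear
            }
        }
        throw last; // only a persistent failure reaches the NonOp handling
    }

    public static void main(String[] args) throws Exception {
        // Pretend the first attempt hits "Heartbeat exceeded", the second succeeds.
        final int[] calls = {0};
        String result = retryOnHeartbeat(() -> {
            if (calls[0]++ == 0) throw new RuntimeException("Heartbeat exceeded");
            return "connected";
        }, 3, 100);
        System.out.println(result); // connected -> host stays Up, no migrations
    }
}

With something like this, a transient "Heartbeat exceeded" on the first attempt would be absorbed, and only a persistent failure would move the host to NonOp.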
(In reply to Piotr Kliczewski from comment #16)
> Allon, as Germano described above, there is a customer running the engine
> in a VM which gets frozen from time to time. The issue is that it was
> frozen during the connect storage command.

Well, that's the least of our concerns, right? A hanging RHV-M should not happen that frequently.

What concerns me is that a bad network (an intermittent issue) between just the engine and VDSM can cause the same thing, even if the host is completely healthy and all networks are fine between all hosts - just the manager having network issues like a bad NIC, or a VPN in case it's on a different site. Then we can immediately move hosts to Non-Operational even if the manager reconnects instantly afterwards (the fencing countdown is never reached, so the host is never fenced), right Piotr?
(In reply to Germano Veit Michel from comment #17)
> Well, that's the least of our concerns, right? A hanging RHV-M should not
> happen that frequently.
>
> What concerns me is that a bad network (an intermittent issue) between
> just the engine and VDSM can cause the same thing, even if the host is
> completely healthy and all networks are fine between all hosts - just the
> manager having network issues like a bad NIC, or a VPN in case it's on a
> different site. Then we can immediately move hosts to Non-Operational even
> if the manager reconnects instantly afterwards (the fencing countdown is
> never reached, so the host is never fenced), right Piotr?

Now I understand the question, thanks.

I don't think retrying is the right way to go. Although ConnectStoragePool is relatively innocuous, the same treatment would apply to numerous other commands you won't want to retry. I'd expect the host monitoring subsystem to catch the heartbeat exceeded error and move the host to NonResponding (which it is, frankly).
When there is a network issue we do start the fencing flow, and it seems that ConnectStoragePool is different. If fencing is not triggered it is a bug and we need to fix it, but it should be.

When I rewrote the transport I provided retry logic, but it is disabled in the db. It was not tested in real life, and VDSM calls are not idempotent, so enabling it could cause some issues. We could provide deduplication functionality in VDSM, so we could move our guarantees from "at most once" to "at least once". A sketch of the deduplication idea follows below.
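For illustration only, a generic sketch of the deduplication idea: replay a cached result when a retried request id is seen again, so a non-idempotent verb executes at most once even under "at least once" delivery. VDSM itself is Python and has no such mechanism today; the request-id scheme here is an assumption.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class DedupSketch {

    // Results of already-executed requests, keyed by a client-supplied id.
    private final Map<String, Object> completed = new ConcurrentHashMap<>();

    // Execute the verb at most once per request id; replay the cached
    // result for any retried duplicate instead of re-running the call.
    Object handle(String requestId, Supplier<Object> verb) {
        return completed.computeIfAbsent(requestId, id -> verb.get());
    }

    public static void main(String[] args) {
        DedupSketch server = new DedupSketch();
        // First delivery executes; the retried duplicate is answered from cache.
        System.out.println(server.handle("req-1", () -> "pool connected"));
        System.out.println(server.handle("req-1", () -> "executed twice?!")); // never runs
    }
}

A real implementation would also need result eviction and persistence across restarts; this only shows the core idea.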
(In reply to Piotr Kliczewski from comment #19)
> When there is a network issue we do start the fencing flow, and it seems
> that ConnectStoragePool is different. If fencing is not triggered it is a
> bug and we need to fix it, but it should be.
>
> When I rewrote the transport I provided retry logic, but it is disabled in
> the db. It was not tested in real life, and VDSM calls are not idempotent,
> so enabling it could cause some issues. We could provide deduplication
> functionality in VDSM, so we could move our guarantees from "at most once"
> to "at least once".

So what's the next step here?
According to Allon, the exception handling for connectStoragePool is correct even though it causes the issue. We have 2 ways of handling it:

1. Retry connectStoragePool, which at the moment is a one-time operation, and when there is a networking issue the failure can have significant consequences.
2. Increase the heartbeat timeout to the point where the issues no longer occur.

Solution 2 is the short-term option before we are able to do 1 as part of this BZ. Allon, do you agree?
(In reply to Piotr Kliczewski from comment #21)
> According to Allon, the exception handling for connectStoragePool is
> correct even though it causes the issue. We have 2 ways of handling it:
>
> 1. Retry connectStoragePool, which at the moment is a one-time operation,
> and when there is a networking issue the failure can have significant
> consequences.
> 2. Increase the heartbeat timeout to the point where the issues no longer
> occur.
>
> Solution 2 is the short-term option before we are able to do 1 as part of
> this BZ. Allon, do you agree?

Agreed.
Heartbeat interval change is tracked by BZ #1513360.
We will increase the heartbeat in 4.2 as a part of BZ1513360, but as mentioned above the proper fix for this issue is in the storage part, where connectStoragePool should be retried when connection issues are raised.
(In reply to Martin Perina from comment #24)
> We will increase the heartbeat in 4.2 as a part of BZ1513360, but as
> mentioned above the proper fix for this issue is in the storage part,
> where connectStoragePool should be retried when connection issues are
> raised.

Agreed. With the increased timeout and monitoring improvements, this is not a 4.2 item.
We made improvements to fix the main issue. If there are still bugs around this, please open a new report with logs.