Bug 1451703 - Guest not responding in RHV Manager UI
Summary: Guest not responding in RHV Manager UI
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.0.7
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.1.3
Target Release: ---
Assignee: Francesco Romani
QA Contact: Jiri Belka
URL:
Whiteboard:
Depends On: 1419856
Blocks:
 
Reported: 2017-05-17 10:27 UTC by Marian Jankular
Modified: 2020-12-14 09:33 UTC (History)
CC List: 30 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1419856
Environment:
Last Closed: 2017-07-06 07:32:08 UTC
oVirt Team: Virt
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHEA-2017:1696 0 normal SHIPPED_LIVE VDSM bug fix and enhancement update 4.1.3 2017-07-06 11:25:09 UTC
oVirt gerrit 73133 0 'None' MERGED virt: host: stats: do not replace HostMonitor 2021-02-16 00:29:37 UTC
oVirt gerrit 73534 0 'None' MERGED virt: periodic: support for exclusive operations 2021-02-16 00:29:36 UTC
oVirt gerrit 74492 0 'None' MERGED virt: periodic: HostMonitor: set not-discardable 2021-02-16 00:29:37 UTC
oVirt gerrit 75476 0 'None' MERGED virt: periodic: support for exclusive operations 2021-02-16 00:29:37 UTC
oVirt gerrit 75477 0 'None' MERGED periodic: docs: document the 'exclusive' parameter 2021-02-16 00:29:36 UTC
oVirt gerrit 75478 0 'None' MERGED periodic: rename: _step -> _reschedule 2021-02-16 00:29:37 UTC
oVirt gerrit 75479 0 'None' MERGED virt: periodic: expose the `discard` flag 2021-02-16 00:29:37 UTC
oVirt gerrit 75480 0 'None' MERGED virt: host: stats: do not replace HostMonitor 2021-02-16 00:29:37 UTC
oVirt gerrit 75481 0 'None' MERGED virt: periodic: HostMonitor: set not-discardable 2021-02-16 00:29:37 UTC

Description Marian Jankular 2017-05-17 10:27:08 UTC
+++ This bug was initially created as a clone of Bug #1419856 +++

Hi,

I submitted an issue on the Ovirt users forum and have been asked to raise a bug report. http://lists.ovirt.org/pipermail/users/2017-January/079334.html

Host server: Dell PowerEdge R815 (40 cores and 768GB memory)
Storage: Dell EqualLogic (Firmware V8.1.4)
OS: CentOS 7.3 (although the same thing happens on 7.2)
Ovirt: 4.0.6.3-1

We have several Ovirt clusters. Two of the hosts (in separate clusters) are showing as up in Hosted Engine but the guests running on them are showing as Not Responding. I can connect to the guests via ssh, etc. but can’t interact with them from the Ovirt GUI. It was fine on Saturday (28th Jan) morning, but it looks like something happened on Sunday morning around 07:14, as we suddenly see the following in engine.log on one host:

2017-01-29 07:14:26,952 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [53ca8dc5] VM 'd0aa990f-e6aa-4e79-93ce-011fe1372fb0'(lnd-ion-lindev-01) moved from 'Up' --> 'NotResponding'
2017-01-29 07:14:27,069 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [53ca8dc5] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM lnd-ion-lindev-01 is not responding.
2017-01-29 07:14:27,070 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [53ca8dc5] VM '788bfc0e-1712-469e-9a0a-395b8bb3f369'(lnd-ion-windev-02) moved from 'Up' --> 'NotResponding'
2017-01-29 07:14:27,088 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [53ca8dc5] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM lnd-ion-windev-02 is not responding.
2017-01-29 07:14:27,089 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [53ca8dc5] VM 'd7eaa4ec-d65e-45c0-bc4f-505100658121'(lnd-ion-windev-04) moved from 'Up' --> 'NotResponding'
2017-01-29 07:14:27,103 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [53ca8dc5] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM lnd-ion-windev-04 is not responding.
2017-01-29 07:14:27,104 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [53ca8dc5] VM '5af875ad-70f9-4f49-9640-ee2b9927348b'(lnd-anv9-sup1) moved from 'Up' --> 'NotResponding'
2017-01-29 07:14:27,121 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [53ca8dc5] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM lnd-anv9-sup1 is not responding.
2017-01-29 07:14:27,121 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [53ca8dc5] VM 'b3b7c5f3-0b5b-4d8f-9cc8-b758cc1ce3b9'(lnd-db-dev-03) moved from 'Up' --> 'NotResponding'
2017-01-29 07:14:27,136 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [53ca8dc5] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM lnd-db-dev-03 is not responding.
2017-01-29 07:14:27,137 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [53ca8dc5] VM '6c0a6e17-47c3-4464-939b-e83984dbeaa6'(lnd-db-dev-04) moved from 'Up' --> 'NotResponding'
2017-01-29 07:14:27,167 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [53ca8dc5] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM lnd-db-dev-04 is not responding.
2017-01-29 07:14:27,168 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [53ca8dc5] VM 'ab15bb08-1244-4dc1-a4f1-f6e94246aa23'(lnd-ion-lindev-05) moved from 'Up' --> 'NotResponding'


Checking the vdsm logs this morning on the hosts I see a lot of the following messages:

jsonrpc.Executor/0::WARNING::2017-01-30 09:34:15,989::vm::4890::virt.vm::(_setUnresponsiveIfTimeout) vmId=`ab15bb08-1244-4dc1-a4f1-f6e94246aa23`::monitor became unresponsive (command timeout, age=94854.48)
jsonrpc.Executor/0::WARNING::2017-01-30 09:34:15,990::vm::4890::virt.vm::(_setUnresponsiveIfTimeout) vmId=`20a51347-ef08-47a9-9982-32b2047991e1`::monitor became unresponsive (command timeout, age=94854.48)
jsonrpc.Executor/0::WARNING::2017-01-30 09:34:15,991::vm::4890::virt.vm::(_setUnresponsiveIfTimeout) vmId=`2cd8698d-a0f9-43b7-9a89-92a93e920eb7`::monitor became unresponsive (command timeout, age=94854.49)
jsonrpc.Executor/0::WARNING::2017-01-30 09:34:15,992::vm::4890::virt.vm::(_setUnresponsiveIfTimeout) vmId=`5af875ad-70f9-4f49-9640-ee2b9927348b`::monitor became unresponsive (command timeout, age=94854.49)

and

vdsm.Scheduler::WARNING::2017-01-30 09:36:36,444::periodic::212::virt.periodic.Operation::(_dispatch) could not run <VmDispatcher operation=<class 'vdsm.virt.periodic.DriveWatermarkMonitor'> at 0x295bd50>, executor queue full
vdsm.Scheduler::WARNING::2017-01-30 09:36:38,446::periodic::212::virt.periodic.Operation::(_dispatch) could not run <VmDispatcher operation=<class 'vdsm.virt.periodic.DriveWatermarkMonitor'> at 0x295bd50>, executor queue full
vdsm.Scheduler::WARNING::2017-01-30 09:36:38,627::periodic::212::virt.periodic.Operation::(_dispatch) could not run <vdsm.virt.sampling.HostMonitor object at 0x295bdd0>, executor queue full
vdsm.Scheduler::WARNING::2017-01-30 09:36:38,707::periodic::212::virt.periodic.Operation::(_dispatch) could not run <vdsm.virt.sampling.VMBulkSampler object at 0x295ba90>, executor queue full
vdsm.Scheduler::WARNING::2017-01-30 09:36:38,929::periodic::212::virt.periodic.Operation::(_dispatch) could not run <VmDispatcher operation=<class 'vdsm.virt.periodic.BlockjobMonitor'> at 0x295ba10>, executor queue full
vdsm.Scheduler::WARNING::2017-01-30 09:36:40,450::periodic::212::virt.periodic.Operation::(_dispatch) could not run <VmDispatcher operation=<class 'vdsm.virt.periodic.DriveWatermarkMonitor'> at 0x295bd50>, executor queue full
vdsm.Scheduler::WARNING::2017-01-30 09:36:42,451::periodic::212::virt.periodic.Operation::(_dispatch) could not run <VmDispatcher operation=<class 'vdsm.virt.periodic.DriveWatermarkMonitor'> at 0x295bd50>, executor queue full
vdsm.Scheduler::WARNING::2017-01-30 09:36:44,452::periodic::212::virt.periodic.Operation::(_dispatch) could not run <VmDispatcher operation=<class 'vdsm.virt.periodic.DriveWatermarkMonitor'> at 0x295bd50>, executor queue full

I’ve also attached logs from the time period for one of the hosts in question. This host is in a single-node DC and cluster with iSCSI shared storage. I’ve had to make the time window on the logs quite small to fit within the mail size limit. Let me know if you need anything more specific.

Many Thanks,
Mark

--- Additional comment from Nir Soffer on 2017-02-07 04:08:36 EST ---

Francesco, can you take a look at this?

--- Additional comment from Pavel Gashev on 2017-02-07 04:35:03 EST ---

Executor state:

vdsm.Scheduler::DEBUG::2017-01-29 07:13:44,015::executor::137::Executor::(_worker_discarded) Too many workers (limit=30), not adding more
vdsm.Scheduler::DEBUG::2017-01-29 07:13:44,015::executor::140::Executor::(_worker_discarded) executor state: count=30 workers=set([
<Worker name=periodic/238 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2c82dd0>,
<Worker name=periodic/245 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2af7890>,
<Worker name=periodic/260 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff158321090>,
<Worker name=periodic/242 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x299f9d0>,
<Worker name=periodic/228 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1e0394a10>,
<Worker name=periodic/254 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2af2090>,
<Worker name=periodic/256 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1c81e1250>,
<Worker name=periodic/241 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2a0f310>,
<Worker name=periodic/237 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1e026ca90>,
<Worker name=periodic/243 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2e381d0>,
<Worker name=periodic/231 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1e020c310>,
<Worker name=periodic/233 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x28ddb50>,
<Worker name=periodic/244 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2b77890>,
<Worker name=periodic/257 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1e0390b90>,
<Worker name=periodic/240 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2d35b50>,
<Worker name=periodic/259 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff158339a10>,
<Worker name=periodic/258 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x28ddc50>,
<Worker name=periodic/251 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1cc21e210>,
<Worker name=periodic/249 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2bd8cd0>,
<Worker name=periodic/235 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2af2d10>,
<Worker name=periodic/253 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2c82f50>,
<Worker name=periodic/246 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2af7d90>,
<Worker name=periodic/250 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1c819c5d0>,
<Worker name=periodic/255 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1e0103e50>,
<Worker name=periodic/239 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1e0390ed0>,
<Worker name=periodic/252 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x28ddf10>,
<Worker name=periodic/247 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x29b4750>,
<Worker name=periodic/261 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2c7e290>,
<Worker name=periodic/248 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1c4190f90>,
<Worker name=periodic/236 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1c8206fd0>])

--- Additional comment from Francesco Romani on 2017-02-07 04:47:31 EST ---

The issue here is that the HostMonitor is blocking and exhausting the executor resources. We need to know why HostMonitor blocks.

First thing: are the storage domains responsive?

--- Additional comment from Mark on 2017-02-07 05:07:35 EST ---

Hi,

Throughout the problem period both the host and storage domains remained in an active and 'up' state in hosted-engine. Using ssh, etc I could also directly connect to the guests in question.

Thanks,
Mark

--- Additional comment from Francesco Romani on 2017-02-07 06:00:52 EST ---

Unfortunately we only have vague hints in the provided logs.

It seems storage was not responding well; UpdateVolumes was slow:

vdsm.Scheduler::DEBUG::2017-01-29 07:06:50,195::executor::254::Executor::(_discard) Worker discarded: <Worker name=periodic/234 running Task(callable=<UpdateVolumes vm=580d810b-fd9b-45e8-bd4f-a8c462b4e4c1 at 0x7ff1e01edc10>, timeout=30.0) discarded at 0x2bf81d0>

but it seems the disruption started even earlier than that:

vdsm.Scheduler::DEBUG::2017-01-29 06:56:49,261::executor::254::Executor::(_discard) Worker discarded: <Worker name=periodic/220 running Task(callable=<UpdateVolumes vm=580d810b-fd9b-45e8-bd4f-a8c462b4e4c1 at 0x2a0f910>, timeout=30.0) discarded at 0x29b4350>
vdsm.Scheduler::DEBUG::2017-01-29 06:56:49,262::executor::196::Executor::(__init__) Starting worker periodic/234

Do you have the metrics enabled?
(/etc/vdsm/vdsm.conf, [metrics] section, enabled = true)
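
A quick way to check that flag from a Python shell on the host, sketched under the assumption that vdsm is installed and exposes the usual vdsm.config module; it uses the same accessor the sampling code uses, config.getboolean('metrics', 'enabled'):

--- cut here ---
# Sketch only: print whether metrics collection is enabled, using the same
# config accessor as vdsm/virt/sampling.py (assumes vdsm is installed and
# the usual vdsm.config module path).
from vdsm.config import config

print(config.getboolean('metrics', 'enabled'))
--- cut here ---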

--- Additional comment from Mark on 2017-02-07 07:03:45 EST ---

Hi,

[root@uk1-ion-ovm-08 ~]# cat /etc/vdsm/vdsm.conf
[vars]
ssl = true

[addresses]
management_port = 54321

[root@uk1-ion-ovm-08 ~]#

We've just left that as default.

Any other logs that may be helpful?

I bounced the VDSMD service, the guests recovered and the monitor and queue full messages also cleared. However, we did keep getting intermittent “Guest x Not Responding “ messages being communicated by the Hosted Engine, in most cases the guests would actually almost immediately recover though. The odd occasion would result in guests staying “Not Responding” and me bouncing the VDSMD service again. The Host had a memory load of around 85% (out of 768GB) and a CPU load of around 65% (48 cores). I have since added another host to that cluster and spread the guests between the two hosts. This seems to have totally cleared the messages (at least for the last 5 days anyway).

Nothing else changed on that cluster other than adding a host and spreading the guests between them (same shared storage, number of storage domains, etc).

Thanks,
Mark

--- Additional comment from Pavel Gashev on 2017-02-07 08:38:24 EST ---

Please note that all worker slots are used by discarded workers. Why are the discarded workers not removed?

--- Additional comment from Francesco Romani on 2017-02-07 08:57:12 EST ---

(In reply to Pavel Gashev from comment #7)
> Please note that all worker slots are used by discarded workers. Why the
> discarded workers are not removed?

Because we can't do that safely (or at all) from Vdsm, due to constraints of either the Vdsm architecture or the Python runtime.

We need to wait for the lower layer to time out, or to unblock.

The best thing we can do is to park the discarded worker and replenish the worker pool. But we can't replenish the pool indefinitely, to avoid leaking workers.
This is why we have a cap on the worker pool size.
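
To illustrate the park-and-replenish behaviour, here is a minimal sketch; the names are made up and this is not the real vdsm executor:

--- cut here ---
# Sketch only (made-up names): a worker stuck in a blocking call cannot be
# killed from Python, so it is merely marked as discarded and left parked
# in the pool; a replacement thread is started only while the pool is under
# a fixed cap.
import threading

class TinyExecutor(object):

    def __init__(self, max_workers=30):
        self._max_workers = max_workers
        self._lock = threading.Lock()
        self._workers = set()      # includes parked (discarded) workers

    def _spawn(self):
        worker = threading.Thread(target=self._worker_loop)
        worker.daemon = True
        self._workers.add(worker)
        worker.start()

    def _worker_loop(self):
        try:
            pass  # placeholder: take tasks from a queue and run them
        finally:
            # a discarded worker leaves the pool only once its blocking
            # call finally returns and the thread exits
            with self._lock:
                self._workers.discard(threading.current_thread())

    def discard_current_and_replenish(self):
        # called by the scheduler when a task exceeds its timeout
        with self._lock:
            if len(self._workers) < self._max_workers:
                self._spawn()
            # else: "Too many workers (limit=30), not adding more";
            # from this point periodic operations are dropped
            # ("executor queue full" in the log above)
--- cut here ---

Once the cap is reached, every periodic dispatch is skipped, which is exactly the "executor queue full" warning flood seen in the vdsm log.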

--- Additional comment from Francesco Romani on 2017-02-07 09:06:28 EST ---

(In reply to Mark from comment #6)
> Hi,
> 
> [root@uk1-ion-ovm-08 ~]# cat /etc/vdsm/vdsm.conf
> [vars]
> ssl = true
> 
> [addresses]
> management_port = 54321
> 
> [root@uk1-ion-ovm-08 ~]#
> 
> We've just left that as default.
> 
> Any other logs that may be helpful?

Yes, see below.

> I bounced the VDSMD service, the guests recovered and the monitor and queue
> full messages also cleared. However, we did keep getting intermittent “Guest
> x Not Responding “ messages being communicated by the Hosted Engine, in most
> cases the guests would actually almost immediately recover though. The odd
> occasion would result in guests staying “Not Responding” and me bouncing the
> VDSMD service again.

This makes sense. From the logs we see that:
1. for some reason, HostMonitoring is very slow, or blocking.
   Please note that another suspect is the UpdateVolumes operation; more on that below.
2. eventually the worker threads time out and are replaced, but once you reach the worker pool limit no more worker threads get started
3. no more monitoring tasks can be executed
4. guest stats get stale
5. guests are reported unresponsive because of the stale stats (a minimal sketch of steps 4-5 follows below)
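
A minimal sketch of steps 4-5; the names and the timeout value are assumptions for illustration, not the actual vdsm code:

--- cut here ---
# Sketch only: once periodic monitoring stops running, the last sample just
# gets older, and when its age exceeds the timeout the VM is reported as
# NotResponding even though the guest itself is healthy.
import time

MONITOR_TIMEOUT = 60.0   # assumed value, for illustration only

class VmMonitorState(object):

    def __init__(self):
        self._last_sample = time.time()
        self.responsive = True

    def on_sample(self):
        # runs from a periodic executor task; once the worker pool is
        # exhausted this never fires again and _last_sample goes stale
        self._last_sample = time.time()
        self.responsive = True

    def set_unresponsive_if_timeout(self):
        age = time.time() - self._last_sample
        if age > MONITOR_TIMEOUT:
            # corresponds to "monitor became unresponsive
            # (command timeout, age=...)" in the vdsm log above
            self.responsive = False
--- cut here ---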

Now, you can tune Vdsm to use a bigger worker pool, but this only delays the problem.

The real fix is to learn why HostMonitor blocks. It is surprising behaviour, because HostMonitor mostly consumes stats from procfs, so it should hardly ever block. The only dangerous path is if metrics are enabled - and it seems they are not.

From the available logs we only see the issue after it has already begun; we need a bigger time window into the past to catch how the disruption starts (i.e. when and why workers start to get discarded).

It may also help to share the CPU usage of Vdsm.

If you can do that, please provide the output of 'ps -auxw' just before restarting Vdsm. We are looking for processes stuck in I/O (D state).

[1] UpdateVolumes polls the storage subsystem to get up-to-date information about the actual volume size. *If* the storage is _very_ slow to respond, this could explain what happens.
Please check the NFS/iSCSI timeouts.

> The Host had a memory load of around 85% (out of 768GB)
> and a CPU load of around 65% (48 cores). I have since added another host to
> that cluster and spread the guests between the two hosts. This seems to have
> totally cleared the messages (at least for the last 5 days anyway).

With the information available now, I don't think it's related to resource usage. There is something unresponsive in the chain, but this could happen under light or heavy load.

--- Additional comment from Pavel Gashev on 2017-02-07 10:45:35 EST ---

I can confirm that the issue is reproducible when some storage is not responding. 

It makes sense to pause periodic HostMonitor when it's blocked. Otherwise it blocks other periodic tasks.

--- Additional comment from Francesco Romani on 2017-02-08 03:20:06 EST ---

From the system perspective, if storage is not responding, it is correct to report guests as unresponsive; otherwise we are hiding problems.

We are aware of the interference between the periodic operations in case of resource exhaustion and we are evaluating deeper fixes.
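
For reference, the gerrit patches linked above mention "support for exclusive operations" and making HostMonitor not-discardable; a minimal sketch of the "exclusive operation" idea, with hypothetical names:

--- cut here ---
# Sketch only (hypothetical names): an "exclusive" periodic operation skips
# a new dispatch while the previous run is still in flight, so a blocked
# HostMonitor occupies at most one worker instead of consuming a fresh one
# on every period.
import threading

class ExclusiveOperation(object):

    def __init__(self, func):
        self._func = func
        self._running = threading.Lock()

    def __call__(self):
        # non-blocking acquire: if the previous run has not finished yet,
        # skip this cycle instead of queueing another worker on it
        if not self._running.acquire(False):
            return
        try:
            self._func()
        finally:
            self._running.release()
--- cut here ---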

--- Additional comment from Francesco Romani on 2017-02-09 10:38:14 EST ---

Please note that the only way forward at the moment is to learn why HostMonitor is blocking. To do that we must capture the logs from before the worker threads start to get discarded.

Otherwise we just don't have enough data to tackle the issue.

--- Additional comment from Pavel Gashev on 2017-02-20 07:34 EST ---

Please find attached output of `grep -e ^vdsm.Scheduler -e ::executor:: vdsm.log`.

Please note that HostMonitor workers start being discarded from the following log lines onward, when several HostMonitor tasks run simultaneously.

vdsm.Scheduler::DEBUG::2017-02-18 09:22:18,978::executor::254::Executor::(_discard) Worker discarded: <Worker name=periodic/4272 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x382bdd0> at 0x382be10>, timeout=7.5) discarded at 0x39d7d90>
vdsm.Scheduler::DEBUG::2017-02-18 09:22:18,980::executor::140::Executor::(_worker_discarded) executor state: count=5 workers=set([<Worker name=periodic/4274 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x382bdd0> at 0x382be10>, timeout=7.5) at 0x7f8af0310b90>, <Worker name=periodic/4290 waiting at 0x7f8a742c0b90>, <Worker name=periodic/4272 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x382bdd0> at 0x382be10>, timeout=7.5) discarded at 0x39d7d90>, <Worker name=periodic/4289 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x382bdd0> at 0x382be10>, timeout=7.5) at 0x7f8b00280a90>, <Worker name=periodic/4273 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x382bdd0> at 0x382be10>, timeout=7.5) at 0x7f8a3824b650>])

Let me know timeframe if you need detailed logs.

--- Additional comment from Francesco Romani on 2017-02-22 11:19:30 EST ---

(In reply to Pavel Gashev from comment #13)
> Created attachment 1255675 [details]
> scheduler.vdsm.log.gz
> 
> Please find attached output of `grep -e ^vdsm.Scheduler -e ::executor::
> vdsm.log`.
> 
> Please note that HostMonitor starts discarding since the following log lines
> when several HostMonitor tasks run simultaneously.
> 
> vdsm.Scheduler::DEBUG::2017-02-18
> 09:22:18,978::executor::254::Executor::(_discard) Worker discarded: <Worker
> name=periodic/4272 running Task(callable=<Operation
> action=<vdsm.virt.sampling.HostMonitor object at 0x382bdd0> at 0x382be10>,
> timeout=7.5) discarded at 0x39d7d90>
> vdsm.Scheduler::DEBUG::2017-02-18
> 09:22:18,980::executor::140::Executor::(_worker_discarded) executor state:
> count=5 workers=set([<Worker name=periodic/4274 running
> Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at
> 0x382bdd0> at 0x382be10>, timeout=7.5) at 0x7f8af0310b90>, <Worker
> name=periodic/4290 waiting at 0x7f8a742c0b90>, <Worker name=periodic/4272
> running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor
> object at 0x382bdd0> at 0x382be10>, timeout=7.5) discarded at 0x39d7d90>,
> <Worker name=periodic/4289 running Task(callable=<Operation
> action=<vdsm.virt.sampling.HostMonitor object at 0x382bdd0> at 0x382be10>,
> timeout=7.5) at 0x7f8b00280a90>, <Worker name=periodic/4273 running
> Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at
> 0x382bdd0> at 0x382be10>, timeout=7.5) at 0x7f8a3824b650>])
> 
> Let me know timeframe if you need detailed logs.

Unfortunately this doesn't add the information we need.

We know that, for some reason, either the HostMonitor or the UpdateVolume operation is blocking, or taking too long to complete (more likely the former).
*This* in turn makes the executor exhaust its resources, so, after some time, no more periodic operations can be done.

We also know that those operations *do not* eventually unblock, so the periodic executor never goes back into service.

Your extract further confirms that, but we need to know why the UpdateVolume and/or HostMonitor operations start to fail.

We need a much longer timespan in the past, I'd say the full 24h before;
the corresponding sar/system logs (journal or /var/log/messages) would also greatly help.

Best would be the complete Vdsm log from restart to executor full.

--- Additional comment from Pavel Gashev on 2017-02-24 08:10:53 EST ---

UpdateVolume was blocked by overloaded storage. It's a slow storage. It was overloaded for a few days in a row. It's acceptable that VMs are "not responding" when the storage is overloaded. I have one hour of logs from when performance was restored.

The storage performance was restored around 2017-02-18 09:22:09. All UpdateVolume tasks were "discarded". The issue with HostMonitor started just after that.

Actually, it's easy to reproduce the issue with the following script. 

--- cut here ---
#!/usr/bin/python

import os
import threading
from vdsm.virt.sampling import HostSample

def testHostSample():
    sample = HostSample(os.getpid())
    print sample

t1 = threading.Thread(target=testHostSample)
t2 = threading.Thread(target=testHostSample)
t3 = threading.Thread(target=testHostSample)

t1.start()
t2.start()
t3.start()

t1.join()
t2.join()
t3.join()
--- cut here ---

It works with one thread. It works with two threads. It deadlocks with three or more threads.

Quick fix patch:

--- sampling.py.orig    2016-12-14 09:42:27.000000000 +0000
+++ sampling.py 2017-02-24 13:08:28.148220674 +0000
@@ -48,6 +48,8 @@
 _METRICS_ENABLED = config.getboolean('metrics', 'enabled')
 
 
+InterfaceSample_lock = threading.Lock()
+
 class InterfaceSample(object):
     """
     A network interface sample.
@@ -91,6 +93,7 @@
                     raise
 
     def __init__(self, link):
+      with InterfaceSample_lock:
         ifid = link.name
         self.rx = self.readIfaceStat(ifid, 'rx_bytes')
         self.tx = self.readIfaceStat(ifid, 'tx_bytes')
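
For reference, the same serialization idea as a standalone sketch; the helper below is an assumption standing in for readIfaceStat, and the real class in vdsm/virt/sampling.py reads more counters:

--- cut here ---
# Standalone restatement of the workaround above, not the actual vdsm code:
# serialize interface sampling with one module-level lock so that concurrent
# HostSample threads never read the counters at the same time.
import threading

_interface_sample_lock = threading.Lock()

def _read_iface_stat(ifid, stat):
    # assumed helper standing in for readIfaceStat; the real method also
    # tolerates devices that disappear while being read
    with open('/sys/class/net/%s/statistics/%s' % (ifid, stat)) as f:
        return int(f.read())

class InterfaceSample(object):

    def __init__(self, ifid):
        with _interface_sample_lock:
            self.rx = _read_iface_stat(ifid, 'rx_bytes')
            self.tx = _read_iface_stat(ifid, 'tx_bytes')
--- cut here ---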

--- Additional comment from Francesco Romani on 2017-02-27 05:07:58 EST ---

(In reply to Pavel Gashev from comment #15)
> UpdateVolume was blocked by overloaded storage. It's a slow storage. It was
> overloaded few days in a row. It's acceptable that VMs are "not responding"
> when the storage is overloaded. I have an one hour logs when performance was
> restored. 

Ok, this fully makes sense.
It may be surprising, but Vdsm is actually doing the right thing here.
*Some* storage flows may need the updated volume information to work correctly, so if such information is not available, the system is in a degraded state - hence the "Not responding" state.

 
> The storage performance restored around 2017-02-18 09:22:09. All
> UpdateVolume "was discarded". The issue with HostMonitor started just after
> that.
> 
> Actually, it's easy to reproduce the issue with the following script. 
> 
> --- cut here ---
> #!/usr/bin/python
> 
> import os
> import threading
> from vdsm.virt.sampling import HostSample
> 
> def testHostSample():
>     sample = HostSample(os.getpid())
>     print sample
> 
> t1 = threading.Thread(target=testHostSample)
> t2 = threading.Thread(target=testHostSample)
> t3 = threading.Thread(target=testHostSample)
> 
> t1.start()
> t2.start()
> t3.start()
> 
> t1.join()
> t2.join()
> t3.join()
> --- cut here ---
> 
> It works with one thread. It works with two thread. It dead locks with three
> and more threads.
> 
> Quick fix patch:
> 
> --- sampling.py.orig    2016-12-14 09:42:27.000000000 +0000
> +++ sampling.py 2017-02-24 13:08:28.148220674 +0000
> @@ -48,6 +48,8 @@
>  _METRICS_ENABLED = config.getboolean('metrics', 'enabled')
>  
>  
> +InterfaceSample_lock = threading.Lock()
> +
>  class InterfaceSample(object):
>      """
>      A network interface sample.
> @@ -91,6 +93,7 @@
>                      raise
>  
>      def __init__(self, link):
> +      with InterfaceSample_lock:
>          ifid = link.name
>          self.rx = self.readIfaceStat(ifid, 'rx_bytes')
>          self.tx = self.readIfaceStat(ifid, 'tx_bytes')

This is interesting information; I will now investigate this and post a patch. It indeed looks related to the main issue, and we may have a bug in the lower layers of Vdsm. It will help in the aforementioned scenario by reducing the number of blocked HostMonitors.

--- Additional comment from eberman on 2017-04-05 05:53:24 EDT ---

Is there a workaround or configuration tweak for RHEL 7.3?

--- Additional comment from Francesco Romani on 2017-04-13 06:52:52 EDT ---

(In reply to eberman from comment #17)
> Is there a workaround .configuration tweaks for RHEL 7.3 ?

The underlying issue is still under investigation. It seems to be a bug in, or a misuse of, the netlink API. I've merged mitigation patches on the Vdsm side.

--- Additional comment from Jiri Belka on 2017-05-04 03:01:36 EDT ---

I see no clearly written reproduction steps (#10, #11). What are the steps for verification?

--- Additional comment from Francesco Romani on 2017-05-08 04:41:33 EDT ---

(In reply to Jiri Belka from comment #19)
> I see no clearly written reproduction steps (#10, #11). What are the steps
> for verification?

Bug reproduction steps

1. let the tested host run at least five (5) VMs.
2. once the system is stable, simulate loss of connectivity to the host on the interface(s) used by the VMs (simplest way: unplug the network cable)
3. wait at least 30 minutes
4. (if the host has other network access) observe the error in the logs
5. restore the network connectivity
6. check the previous logs for the error

Of course, once it is fixed we should not see the error(s) anymore in steps #4 and #6.

--- Additional comment from Francesco Romani on 2017-05-08 04:49:18 EDT ---

(In reply to Francesco Romani from comment #20)
> (In reply to Jiri Belka from comment #19)
> > I see no clearly written reproduction steps (#10, #11). What are the steps
> > for verification?
> 
> Bug reproduction steps
> 
> 1. let the tested host run at least five (5) VMs.
> 2. once the system is stable, simulate loss of connectivity to the host on
> the interace(s) used by the VMs (simpler way: unplug the network cable)
> 3. wait at least 30 minutes
> 4. (if the host has other network access) observe the error in the logs
> 5. restore the network connectivity
> 6. check the previous logs for the error
> 
> Of course once it is fixed we should see the error(s) anymore in the steps
> #4 and #6

Discard this, it is for a different bug (https://bugzilla.redhat.com/show_bug.cgi?id=1443654)

Reproduction steps

1. let the tested host run at least five (5) VMs.
2. once the system is stable, simulate lockup in the netlink interface [***]
3. wait at least 10 minutes
4. check the running VMs are marked unresponsive
5. restore the netlink interface responsiveness [***], check after 2-3 minutes the running VMs are marked responsive again

Problem is, I don't know how to trigger the netlink interface lockup (https://bugzilla.redhat.com/show_bug.cgi?id=1419856#c16)

Will investigate, and mark this as CodeChange if I can't come up with a solution.

--- Additional comment from Francesco Romani on 2017-05-08 04:59:30 EDT ---

(In reply to Francesco Romani from comment #21)
> (In reply to Francesco Romani from comment #20)
> > (In reply to Jiri Belka from comment #19)
> > > I see no clearly written reproduction steps (#10, #11). What are the steps
> > > for verification?
> Reproduction steps
> 
> 1. let the tested host run at least five (5) VMs.
> 2. once the system is stable, simulate lockup in the netlink interface [***]
> 3. wait at least 10 minutes
> 4. check the running VMs are marked unresponsive
> 5. restore the netlink interface responsiveness [***], check after 2-3
> minutes the running VMs are marked responsive again
> 
> Problem is, I don't know how to trigger the netlink interface lockup
> (https://bugzilla.redhat.com/show_bug.cgi?id=1419856#c16)
> 
> Will investigate and marke CodeChange if I can't come up with a solution


a quick way could be
1. rebind-mount slowfs (https://github.com/nirs/slowfs) over /var/log/core 
2. configure slowfs with a very long delay (60 minutes or so)

this should make HostMonitor block much like the netlink issue did.

--- Additional comment from Jiri Belka on 2017-05-11 12:08:05 EDT ---

> > 1. let the tested host run at least five (5) VMs.
> > 2. once the system is stable, simulate lockup in the netlink interface [***]
> > 3. wait at least 10 minutes
> > 4. check the running VMs are marked unresponsive
> > 5. restore the netlink interface responsiveness [***], check after 2-3
> > minutes the running VMs are marked responsive again
> > 

> a quick way could be
> 1. rebind-mount slowfs (https://github.com/nirs/slowfs) over /var/log/core 
> 2. configure slowfs with a very long delay (60 minutes or so)
> 
> this should make HostMonitor block much like the netlink issue did.

I ran slowfs over /var/log/core with '1800' in slowfs.cfg and observed no change at all. The verification steps are unclear; please provide exact steps for verification.

--- Additional comment from Michal Skrivanek on 2017-05-12 03:43:38 EDT ---

Jiri, comment #15?
Pavel, if it is still reproducible for you, can you give it a try in your setup? As far as we can tell it is fixed, but we can't really verify it on a real deployment. If not, the changes are included in 4.1.2 either way.

--- Additional comment from Jiri Belka on 2017-05-15 08:27:11 EDT ---

(In reply to Michal Skrivanek from comment #24)
> Jiri, comment #15?
> Pavel, if it is still reproducible for you, can you give it a try in your
> setup? As far as we can tell it is fixed, but can't really verify on a real
> deployment. If not, the changes are included in 4.1.2 either way.

I could not get any visible change while using the script from #15, i.e. I have not seen any 'Not Responding' state at all. Thus I'm unable to reproduce this issue.

Comment 8 Tomas Jelinek 2017-06-21 09:22:43 UTC
It has already been verified on Bug #1419856.

Comment 11 Francesco Romani 2017-06-28 08:07:58 UTC
This internal bug doesn't need extra documentation.

Comment 13 errata-xmlrpc 2017-07-06 07:32:08 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2017:1696

