Bug 1419856 - Ovirt 4.0.6 guests 'Not Responding'
Summary: Ovirt 4.0.6 guests 'Not Responding'
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: 4.18.21
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ovirt-4.1.2
Target Release: ---
Assignee: Francesco Romani
QA Contact: Jiri Belka
URL:
Whiteboard:
Duplicates: 1449967 (view as bug list)
Depends On:
Blocks: 1451703
 
Reported: 2017-02-07 09:04 UTC by Mark
Modified: 2017-11-14 05:15 UTC (History)
12 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Cloned to: 1451703 (view as bug list)
Environment:
Last Closed: 2017-05-23 08:22:12 UTC
oVirt Team: Virt
pax: needinfo-
tjelinek: ovirt-4.1?
tjelinek: planning_ack?
tjelinek: devel_ack+
lsvaty: testing_ack+


Attachments
vdsm and engine logs from the time period in question (4.08 MB, application/zip)
2017-02-07 09:04 UTC, Mark
no flags Details
scheduler.vdsm.log.gz (13.79 KB, application/x-gzip)
2017-02-20 12:34 UTC, Pavel Gashev
no flags Details


Links
System ID Private Priority Status Summary Last Updated
Red Hat Bugzilla 1512547 0 unspecified CLOSED HostMonitor stops collecting data when TooManyTasks occurs 2021-02-22 00:41:40 UTC
oVirt gerrit 73133 0 master MERGED virt: host: stats: do not replace HostMonitor 2021-01-22 12:39:05 UTC
oVirt gerrit 73534 0 master MERGED virt: periodic: support for exclusive operations 2021-01-22 12:39:05 UTC
oVirt gerrit 74492 0 master MERGED virt: periodic: HostMonitor: set not-discardable 2021-01-22 12:39:05 UTC
oVirt gerrit 75476 0 ovirt-4.1 MERGED virt: periodic: support for exclusive operations 2021-01-22 12:38:25 UTC
oVirt gerrit 75477 0 ovirt-4.1 MERGED periodic: docs: document the 'exclusive' parameter 2021-01-22 12:38:25 UTC
oVirt gerrit 75478 0 ovirt-4.1 MERGED periodic: rename: _step -> _reschedule 2021-01-22 12:38:25 UTC
oVirt gerrit 75479 0 ovirt-4.1 MERGED virt: periodic: expose the `discard` flag 2021-01-22 12:38:25 UTC
oVirt gerrit 75480 0 ovirt-4.1 MERGED virt: host: stats: do not replace HostMonitor 2021-01-22 12:38:26 UTC
oVirt gerrit 75481 0 ovirt-4.1 MERGED virt: periodic: HostMonitor: set not-discardable 2021-01-22 12:39:07 UTC

Internal Links: 1512547

Description Mark 2017-02-07 09:04:06 UTC
Created attachment 1248316 [details]
vdsm and engine logs from the time period in question

Hi,

I submitted an issue on the Ovirt users forum and have been asked to raise a bug report. http://lists.ovirt.org/pipermail/users/2017-January/079334.html

Host server: Dell PowerEdge R815 (40 cores and 768GB memory)
Storage: Dell Equallogic (Firmware V8.1.4)
OS: Centos 7.3 (although the same thing happens on 7.2)
Ovirt: 4.0.6.3-1

We have several Ovirt clusters. Two of the hosts (in separate clusters) are showing as up in Hosted Engine, but the guests running on them are showing as Not Responding. I can connect to the guests via ssh, etc., but can't interact with them from the Ovirt GUI. It was fine on Saturday (28th Jan) morning, but it looks like something happened on Sunday morning around 07:14, as we suddenly see the following in engine.log on one host:

2017-01-29 07:14:26,952 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [53ca8dc5] VM 'd0aa990f-e6aa-4e79-93ce-011fe1372fb0'(lnd-ion-lindev-01) moved from 'Up' --> 'NotResponding'
2017-01-29 07:14:27,069 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [53ca8dc5] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM lnd-ion-lindev-01 is not responding.
2017-01-29 07:14:27,070 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [53ca8dc5] VM '788bfc0e-1712-469e-9a0a-395b8bb3f369'(lnd-ion-windev-02) moved from 'Up' --> 'NotResponding'
2017-01-29 07:14:27,088 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [53ca8dc5] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM lnd-ion-windev-02 is not responding.
2017-01-29 07:14:27,089 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [53ca8dc5] VM 'd7eaa4ec-d65e-45c0-bc4f-505100658121'(lnd-ion-windev-04) moved from 'Up' --> 'NotResponding'
2017-01-29 07:14:27,103 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [53ca8dc5] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM lnd-ion-windev-04 is not responding.
2017-01-29 07:14:27,104 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [53ca8dc5] VM '5af875ad-70f9-4f49-9640-ee2b9927348b'(lnd-anv9-sup1) moved from 'Up' --> 'NotResponding'
2017-01-29 07:14:27,121 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [53ca8dc5] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM lnd-anv9-sup1 is not responding.
2017-01-29 07:14:27,121 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [53ca8dc5] VM 'b3b7c5f3-0b5b-4d8f-9cc8-b758cc1ce3b9'(lnd-db-dev-03) moved from 'Up' --> 'NotResponding'
2017-01-29 07:14:27,136 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [53ca8dc5] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM lnd-db-dev-03 is not responding.
2017-01-29 07:14:27,137 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [53ca8dc5] VM '6c0a6e17-47c3-4464-939b-e83984dbeaa6'(lnd-db-dev-04) moved from 'Up' --> 'NotResponding'
2017-01-29 07:14:27,167 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler1) [53ca8dc5] Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM lnd-db-dev-04 is not responding.
2017-01-29 07:14:27,168 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler1) [53ca8dc5] VM 'ab15bb08-1244-4dc1-a4f1-f6e94246aa23'(lnd-ion-lindev-05) moved from 'Up' --> 'NotResponding'


Checking the vdsm logs this morning on the hosts I see a lot of the following messages:

jsonrpc.Executor/0::WARNING::2017-01-30 09:34:15,989::vm::4890::virt.vm::(_setUnresponsiveIfTimeout) vmId=`ab15bb08-1244-4dc1-a4f1-f6e94246aa23`::monitor became unresponsive (command timeout, age=94854.48)
jsonrpc.Executor/0::WARNING::2017-01-30 09:34:15,990::vm::4890::virt.vm::(_setUnresponsiveIfTimeout) vmId=`20a51347-ef08-47a9-9982-32b2047991e1`::monitor became unresponsive (command timeout, age=94854.48)
jsonrpc.Executor/0::WARNING::2017-01-30 09:34:15,991::vm::4890::virt.vm::(_setUnresponsiveIfTimeout) vmId=`2cd8698d-a0f9-43b7-9a89-92a93e920eb7`::monitor became unresponsive (command timeout, age=94854.49)
jsonrpc.Executor/0::WARNING::2017-01-30 09:34:15,992::vm::4890::virt.vm::(_setUnresponsiveIfTimeout) vmId=`5af875ad-70f9-4f49-9640-ee2b9927348b`::monitor became unresponsive (command timeout, age=94854.49)

and

vdsm.Scheduler::WARNING::2017-01-30 09:36:36,444::periodic::212::virt.periodic.Operation::(_dispatch) could not run <VmDispatcher operation=<class 'vdsm.virt.periodic.DriveWatermarkMonitor'> at 0x295bd50>, executor queue full
vdsm.Scheduler::WARNING::2017-01-30 09:36:38,446::periodic::212::virt.periodic.Operation::(_dispatch) could not run <VmDispatcher operation=<class 'vdsm.virt.periodic.DriveWatermarkMonitor'> at 0x295bd50>, executor queue full
vdsm.Scheduler::WARNING::2017-01-30 09:36:38,627::periodic::212::virt.periodic.Operation::(_dispatch) could not run <vdsm.virt.sampling.HostMonitor object at 0x295bdd0>, executor queue full
vdsm.Scheduler::WARNING::2017-01-30 09:36:38,707::periodic::212::virt.periodic.Operation::(_dispatch) could not run <vdsm.virt.sampling.VMBulkSampler object at 0x295ba90>, executor queue full
vdsm.Scheduler::WARNING::2017-01-30 09:36:38,929::periodic::212::virt.periodic.Operation::(_dispatch) could not run <VmDispatcher operation=<class 'vdsm.virt.periodic.BlockjobMonitor'> at 0x295ba10>, executor queue full
vdsm.Scheduler::WARNING::2017-01-30 09:36:40,450::periodic::212::virt.periodic.Operation::(_dispatch) could not run <VmDispatcher operation=<class 'vdsm.virt.periodic.DriveWatermarkMonitor'> at 0x295bd50>, executor queue full
vdsm.Scheduler::WARNING::2017-01-30 09:36:42,451::periodic::212::virt.periodic.Operation::(_dispatch) could not run <VmDispatcher operation=<class 'vdsm.virt.periodic.DriveWatermarkMonitor'> at 0x295bd50>, executor queue full
vdsm.Scheduler::WARNING::2017-01-30 09:36:44,452::periodic::212::virt.periodic.Operation::(_dispatch) could not run <VmDispatcher operation=<class 'vdsm.virt.periodic.DriveWatermarkMonitor'> at 0x295bd50>, executor queue full

I've also attached logs from the time period in question for one of the hosts. This host is in a single-node DC and cluster with iSCSI shared storage. I've had to make the time window on the logs quite small to fit within the mail size limit. Let me know if you need anything more specific.

Many Thanks,
Mark

Comment 1 Nir Soffer 2017-02-07 09:08:36 UTC
Francesco, can you take a look at this?

Comment 2 Pavel Gashev 2017-02-07 09:35:03 UTC
Executor state:

vdsm.Scheduler::DEBUG::2017-01-29 07:13:44,015::executor::137::Executor::(_worker_discarded) Too many workers (limit=30), not adding more
vdsm.Scheduler::DEBUG::2017-01-29 07:13:44,015::executor::140::Executor::(_worker_discarded) executor state: count=30 workers=set([
<Worker name=periodic/238 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2c82dd0>,
<Worker name=periodic/245 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2af7890>,
<Worker name=periodic/260 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff158321090>,
<Worker name=periodic/242 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x299f9d0>,
<Worker name=periodic/228 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1e0394a10>,
<Worker name=periodic/254 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2af2090>,
<Worker name=periodic/256 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1c81e1250>,
<Worker name=periodic/241 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2a0f310>,
<Worker name=periodic/237 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1e026ca90>,
<Worker name=periodic/243 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2e381d0>,
<Worker name=periodic/231 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1e020c310>,
<Worker name=periodic/233 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x28ddb50>,
<Worker name=periodic/244 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2b77890>,
<Worker name=periodic/257 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1e0390b90>,
<Worker name=periodic/240 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2d35b50>,
<Worker name=periodic/259 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff158339a10>,
<Worker name=periodic/258 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x28ddc50>,
<Worker name=periodic/251 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1cc21e210>,
<Worker name=periodic/249 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2bd8cd0>,
<Worker name=periodic/235 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2af2d10>,
<Worker name=periodic/253 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2c82f50>,
<Worker name=periodic/246 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2af7d90>,
<Worker name=periodic/250 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1c819c5d0>,
<Worker name=periodic/255 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1e0103e50>,
<Worker name=periodic/239 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1e0390ed0>,
<Worker name=periodic/252 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x28ddf10>,
<Worker name=periodic/247 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x29b4750>,
<Worker name=periodic/261 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x2c7e290>,
<Worker name=periodic/248 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1c4190f90>,
<Worker name=periodic/236 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x295bdd0> at 0x295be10>, timeout=7.5) discarded at 0x7ff1c8206fd0>])

Comment 3 Francesco Romani 2017-02-07 09:47:31 UTC
The issue here is that the HostMonitor is blocking and exhausting the executor resources. We need to know why HostMonitor blocks.

First thing: are the storage domain responsive?

Comment 4 Mark 2017-02-07 10:07:35 UTC
Hi,

Throughout the problem period both the host and storage domains remained in an active and 'up' state in hosted-engine. Using ssh, etc I could also directly connect to the guests in question.

Thanks,
Mark

Comment 5 Francesco Romani 2017-02-07 11:00:52 UTC
Unfortunately we only have vague hints in the provided logs.

It seems storage was not responding well, UpdateVolumes was slow:

vdsm.Scheduler::DEBUG::2017-01-29 07:06:50,195::executor::254::Executor::(_discard) Worker discarded: <Worker name=periodic/234 running Task(callable=<UpdateVolumes vm=580d810b-fd9b-45e8-bd4f-a8c462b4e4c1 at 0x7ff1e01edc10>, timeout=30.0) discarded at 0x2bf81d0>

but it seems the disruption started even earlier than that:

vdsm.Scheduler::DEBUG::2017-01-29 06:56:49,261::executor::254::Executor::(_discard) Worker discarded: <Worker name=periodic/220 running Task(callable=<UpdateVolumes vm=580d810b-fd9b-45e8-bd4f-a8c462b4e4c1 at 0x2a0f910>, timeout=30.0) discarded at 0x29b4350>
vdsm.Scheduler::DEBUG::2017-01-29 06:56:49,262::executor::196::Executor::(__init__) Starting worker periodic/234

Do you have the metrics enabled?
(/etc/vdsm/vdsm.conf, [metrics] section, enable = true)

Comment 6 Mark 2017-02-07 12:03:45 UTC
Hi,

[root@uk1-ion-ovm-08 ~]# cat /etc/vdsm/vdsm.conf
[vars]
ssl = true

[addresses]
management_port = 54321

[root@uk1-ion-ovm-08 ~]#

We've just left that as default.

Any other logs that may be helpful?

I bounced the vdsmd service, the guests recovered, and the monitor and queue-full messages also cleared. However, we did keep getting intermittent "Guest x Not Responding" messages from the Hosted Engine; in most cases the guests would recover almost immediately, though. On the odd occasion guests would stay "Not Responding" and I would bounce the vdsmd service again. The host had a memory load of around 85% (out of 768GB) and a CPU load of around 65% (48 cores). I have since added another host to that cluster and spread the guests between the two hosts. This seems to have totally cleared the messages (at least for the last 5 days anyway).

Nothing else changed on that cluster other than adding a host and spreading the guests between them (same shared storage, number of storage domains, etc).

Thanks,
Mark

Comment 7 Pavel Gashev 2017-02-07 13:38:24 UTC
Please note that all worker slots are used by discarded workers. Why are the discarded workers not removed?

Comment 8 Francesco Romani 2017-02-07 13:57:12 UTC
(In reply to Pavel Gashev from comment #7)
> Please note that all worker slots are used by discarded workers. Why are
> the discarded workers not removed?

Because we can't do that safely (or at all) from Vdsm, due to constraints of either the Vdsm architecture or the Python runtime.

We need to wait for the lower layer to time out, or unlock.

The best we can do is park the discarded worker and replenish the worker pool. But we can't replenish the pool indefinitely, to avoid leaking workers.
This is why we have a cap on the worker pool size.

Comment 9 Francesco Romani 2017-02-07 14:06:28 UTC
(In reply to Mark from comment #6)
> Hi,
> 
> [root@uk1-ion-ovm-08 ~]# cat /etc/vdsm/vdsm.conf
> [vars]
> ssl = true
> 
> [addresses]
> management_port = 54321
> 
> [root@uk1-ion-ovm-08 ~]#
> 
> We've just left that as default.
> 
> Any other logs that may be helpful?

Yes, see below.

> I bounced the VDSMD service, the guests recovered and the monitor and queue
> full messages also cleared. However, we did keep getting intermittent “Guest
> x Not Responding “ messages being communicated by the Hosted Engine, in most
> cases the guests would actually almost immediately recover though. The odd
> occasion would result in guests staying “Not Responding” and me bouncing the
> VDSMD service again.

This makes sense. From the logs we see that:
1. for some reason, HostMonitoring is very slow, or blocking.
   Please note that another suspect is the UpdateVolumes operation; more on that below.
2. eventually, the worker threads time out and are replaced, but you reach the worker pool limit, so no more worker threads get started
3. no more monitoring tasks can be executed
4. guest stats get stale
5. guests are reported unresponsive because of the stale stats

Now, you can tune vdsm to use a bigger worker pool, but this only delays the problem.

The real fix is to learn why HostMonitor blocks. It is surprising behaviour, because HostMonitor mostly consumes stats from procfs, so it should hardly ever block. The only dangerous path is if metrics are enabled - and it seems they are not.
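As a rough illustration (a hedged sketch, not vdsm's actual HostSample code; Linux-only, field layout per proc(5)), host sampling of this kind boils down to reading procfs pseudo-files, which normally return almost instantly:

```python
def read_cpu_ticks():
    """Parse the aggregate cpu line of /proc/stat:
    'cpu  user nice system idle ...' (values in clock ticks)."""
    with open('/proc/stat') as f:
        fields = f.readline().split()
    return dict(zip(('user', 'nice', 'system', 'idle'),
                    (int(v) for v in fields[1:5])))

def read_meminfo():
    """Parse /proc/meminfo lines of the form 'MemTotal: 16384256 kB'."""
    info = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, value = line.split(':', 1)
            info[key] = int(value.split()[0])   # numeric part, usually in kB
    return info
```

Since these reads touch no network or disk I/O path, a HostMonitor that blocks for hours points at something else in the sampling chain.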

From the available logs we only see the issue after it has already begun; we need a bigger time window into the past to catch how the disruption starts (= when and why workers start to get discarded).

It may also help to share Vdsm's CPU usage.

If you can, please provide the output of 'ps auxw' just before restarting Vdsm. We are looking for processes stuck in I/O (D state).

[1] UpdateVolumes polls the storage subsystem to get up-to-date information about the actual volume size. *If* the storage is _very_ slow to respond, this could explain what happens.
Please check the NFS/iSCSI timeouts
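For reference, the same D-state check can be done by reading procfs directly, equivalent to scanning the STAT column of `ps auxw` (a sketch; field layout per proc(5), Linux-only):

```python
import glob

def d_state_processes():
    """Return PIDs of processes in uninterruptible sleep (state 'D'),
    typically tasks blocked on unresponsive storage I/O."""
    stuck = []
    for path in glob.glob('/proc/[0-9]*/stat'):
        try:
            with open(path) as f:
                data = f.read()
        except OSError:                      # process exited while scanning
            continue
        # /proc/<pid>/stat is "pid (comm) state ..."; the comm field may
        # contain spaces or parentheses, so split on the *last* ')'.
        pid, rest = data.split(' ', 1)
        state = rest.rsplit(')', 1)[1].split()[0]
        if state == 'D':
            stuck.append(int(pid))
    return stuck
```

A non-empty result that persists across several samples is a strong hint of stuck storage I/O.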

> The Host had a memory load of around 85% (out of 768GB)
> and a CPU load of around 65% (48 cores). I have since added another host to
> that cluster and spread the guests between the two hosts. This seems to have
> totally cleared the messages (at least for the last 5 days anyway).

With the information available now, I don't think it's related to resource usage. Something in the chain is unresponsive, but this could happen under light or heavy load.

Comment 10 Pavel Gashev 2017-02-07 15:45:35 UTC
I can confirm that the issue is reproducible when some storage is not responding. 

It makes sense to pause periodic HostMonitor when it's blocked. Otherwise it blocks other periodic tasks.

Comment 11 Francesco Romani 2017-02-08 08:20:06 UTC
From the system perspective, if storage is not responding, it is correct to report guests as unresponsive; otherwise we are hiding problems.

We are aware of the interference between the periodic operations in case of resource exhaustion and we are evaluating deeper fixes.

Comment 12 Francesco Romani 2017-02-09 15:38:14 UTC
Please note that the only way forward at the moment is to learn why HostMonitor is blocking. To do that, we must monitor the logs before the worker threads start to get discarded.

Otherwise we just don't have enough data to tackle the issue.

Comment 13 Pavel Gashev 2017-02-20 12:34:13 UTC
Created attachment 1255675 [details]
scheduler.vdsm.log.gz

Please find attached output of `grep -e ^vdsm.Scheduler -e ::executor:: vdsm.log`.

Please note that HostMonitor starts discarding at the following log lines, when several HostMonitor tasks run simultaneously.

vdsm.Scheduler::DEBUG::2017-02-18 09:22:18,978::executor::254::Executor::(_discard) Worker discarded: <Worker name=periodic/4272 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x382bdd0> at 0x382be10>, timeout=7.5) discarded at 0x39d7d90>
vdsm.Scheduler::DEBUG::2017-02-18 09:22:18,980::executor::140::Executor::(_worker_discarded) executor state: count=5 workers=set([<Worker name=periodic/4274 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x382bdd0> at 0x382be10>, timeout=7.5) at 0x7f8af0310b90>, <Worker name=periodic/4290 waiting at 0x7f8a742c0b90>, <Worker name=periodic/4272 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x382bdd0> at 0x382be10>, timeout=7.5) discarded at 0x39d7d90>, <Worker name=periodic/4289 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x382bdd0> at 0x382be10>, timeout=7.5) at 0x7f8b00280a90>, <Worker name=periodic/4273 running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at 0x382bdd0> at 0x382be10>, timeout=7.5) at 0x7f8a3824b650>])

Let me know the timeframe if you need detailed logs.

Comment 14 Francesco Romani 2017-02-22 16:19:30 UTC
(In reply to Pavel Gashev from comment #13)
> Created attachment 1255675 [details]
> scheduler.vdsm.log.gz
> 
> Please find attached output of `grep -e ^vdsm.Scheduler -e ::executor::
> vdsm.log`.
> 
> Please note that HostMonitor starts discarding since the following log lines
> when several HostMonitor tasks run simultaneously.
> 
> vdsm.Scheduler::DEBUG::2017-02-18
> 09:22:18,978::executor::254::Executor::(_discard) Worker discarded: <Worker
> name=periodic/4272 running Task(callable=<Operation
> action=<vdsm.virt.sampling.HostMonitor object at 0x382bdd0> at 0x382be10>,
> timeout=7.5) discarded at 0x39d7d90>
> vdsm.Scheduler::DEBUG::2017-02-18
> 09:22:18,980::executor::140::Executor::(_worker_discarded) executor state:
> count=5 workers=set([<Worker name=periodic/4274 running
> Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at
> 0x382bdd0> at 0x382be10>, timeout=7.5) at 0x7f8af0310b90>, <Worker
> name=periodic/4290 waiting at 0x7f8a742c0b90>, <Worker name=periodic/4272
> running Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor
> object at 0x382bdd0> at 0x382be10>, timeout=7.5) discarded at 0x39d7d90>,
> <Worker name=periodic/4289 running Task(callable=<Operation
> action=<vdsm.virt.sampling.HostMonitor object at 0x382bdd0> at 0x382be10>,
> timeout=7.5) at 0x7f8b00280a90>, <Worker name=periodic/4273 running
> Task(callable=<Operation action=<vdsm.virt.sampling.HostMonitor object at
> 0x382bdd0> at 0x382be10>, timeout=7.5) at 0x7f8a3824b650>])
> 
> Let me know timeframe if you need detailed logs.

Unfortunately this doesn't add the information we need.

We know that, for some reason, either the HostMonitor or the UpdateVolume operation is blocking, or taking too long to complete (more likely the former).
*This* in turn makes the executor exhaust its resources, so, after some time, no more periodic operations can be done.

We also know that those operations *do not* eventually unblock, so the periodic executor never goes back in service.

Your extract further confirms that, but we need to know why the UpdateVolume and/or HostMonitor operations start to fail.

We need a much longer timespan into the past, I'd say the full 24h before;
the corresponding sar/system logs (journal or /var/log/messages) would also help greatly.

Best would be the complete Vdsm log from restart to executor full.

Comment 15 Pavel Gashev 2017-02-24 13:10:53 UTC
UpdateVolume was blocked by overloaded storage. It's a slow storage that was overloaded a few days in a row. It's acceptable that VMs are "not responding" when the storage is overloaded. I have one hour of logs from when performance was restored.

Storage performance was restored around 2017-02-18 09:22:09. All the UpdateVolume tasks were "discarded". The issue with HostMonitor started just after that.

Actually, it's easy to reproduce the issue with the following script. 

--- cut here ---
#!/usr/bin/python

import os
import threading
from vdsm.virt.sampling import HostSample

def testHostSample():
    sample = HostSample(os.getpid())
    print sample

t1 = threading.Thread(target=testHostSample)
t2 = threading.Thread(target=testHostSample)
t3 = threading.Thread(target=testHostSample)

t1.start()
t2.start()
t3.start()

t1.join()
t2.join()
t3.join()
--- cut here ---

It works with one thread. It works with two threads. It deadlocks with three or more threads.

Quick fix patch:

--- sampling.py.orig    2016-12-14 09:42:27.000000000 +0000
+++ sampling.py 2017-02-24 13:08:28.148220674 +0000
@@ -48,6 +48,8 @@
 _METRICS_ENABLED = config.getboolean('metrics', 'enabled')
 
 
+InterfaceSample_lock = threading.Lock()
+
 class InterfaceSample(object):
     """
     A network interface sample.
@@ -91,6 +93,7 @@
                     raise
 
     def __init__(self, link):
+      with InterfaceSample_lock:
         ifid = link.name
         self.rx = self.readIfaceStat(ifid, 'rx_bytes')
         self.tx = self.readIfaceStat(ifid, 'tx_bytes')

Comment 16 Francesco Romani 2017-02-27 10:07:58 UTC
(In reply to Pavel Gashev from comment #15)
> UpdateVolume was blocked by overloaded storage. It's a slow storage. It was
> overloaded few days in a row. It's acceptable that VMs are "not responding"
> when the storage is overloaded. I have an one hour logs when performance was
> restored. 

Ok, this fully makes sense.
It may be surprising, but Vdsm is actually doing the right thing here.
*Some* storage flows may need the updated volume information to work correctly, so if such information is not available, the system is in a degraded state - hence the "Not responding" state.

 
> The storage performance restored around 2017-02-18 09:22:09. All
> UpdateVolume "was discarded". The issue with HostMonitor started just after
> that.
> 
> Actually, it's easy to reproduce the issue with the following script. 
> 
> --- cut here ---
> #!/usr/bin/python
> 
> import os
> import threading
> from vdsm.virt.sampling import HostSample
> 
> def testHostSample():
>     sample = HostSample(os.getpid())
>     print sample
> 
> t1 = threading.Thread(target=testHostSample)
> t2 = threading.Thread(target=testHostSample)
> t3 = threading.Thread(target=testHostSample)
> 
> t1.start()
> t2.start()
> t3.start()
> 
> t1.join()
> t2.join()
> t3.join()
> --- cut here ---
> 
> It works with one thread. It works with two threads. It deadlocks with three
> or more threads.
> 
> Quick fix patch:
> 
> --- sampling.py.orig    2016-12-14 09:42:27.000000000 +0000
> +++ sampling.py 2017-02-24 13:08:28.148220674 +0000
> @@ -48,6 +48,8 @@
>  _METRICS_ENABLED = config.getboolean('metrics', 'enabled')
>  
>  
> +InterfaceSample_lock = threading.Lock()
> +
>  class InterfaceSample(object):
>      """
>      A network interface sample.
> @@ -91,6 +93,7 @@
>                      raise
>  
>      def __init__(self, link):
> +      with InterfaceSample_lock:
>          ifid = link.name
>          self.rx = self.readIfaceStat(ifid, 'rx_bytes')
>          self.tx = self.readIfaceStat(ifid, 'tx_bytes')

This is interesting information; I will investigate and post a patch. It indeed looks related to the main issue, and we may have a bug in the lower layers of Vdsm. It will help in the aforementioned scenario by reducing the number of blocked HostMonitors.

Comment 17 eberman 2017-04-05 09:53:24 UTC
Is there a workaround or configuration tweak for RHEL 7.3?

Comment 18 Francesco Romani 2017-04-13 10:52:52 UTC
(In reply to eberman from comment #17)
> Is there a workaround or configuration tweak for RHEL 7.3?

The underlying issue is still under investigation. It seems to be a bug in, or misuse of, the netlink API. I've merged mitigation patches on the Vdsm side.

Comment 19 Jiri Belka 2017-05-04 07:01:36 UTC
I see no clearly written reproduction steps (#10, #11). What are the steps for verification?

Comment 20 Francesco Romani 2017-05-08 08:41:33 UTC
(In reply to Jiri Belka from comment #19)
> I see no clearly written reproduction steps (#10, #11). What are the steps
> for verification?

Bug reproduction steps

1. let the tested host run at least five (5) VMs.
2. once the system is stable, simulate loss of connectivity to the host on the interface(s) used by the VMs (simplest way: unplug the network cable)
3. wait at least 30 minutes
4. (if the host has other network access) observe the error in the logs
5. restore the network connectivity
6. check the previous logs for the error

Of course, once it is fixed we should no longer see the error(s) in steps #4 and #6

Comment 21 Francesco Romani 2017-05-08 08:49:18 UTC
(In reply to Francesco Romani from comment #20)
> (In reply to Jiri Belka from comment #19)
> > I see no clearly written reproduction steps (#10, #11). What are the steps
> > for verification?
> 
> Bug reproduction steps
> 
> 1. let the tested host run at least five (5) VMs.
> 2. once the system is stable, simulate loss of connectivity to the host on
> the interface(s) used by the VMs (simplest way: unplug the network cable)
> 3. wait at least 30 minutes
> 4. (if the host has other network access) observe the error in the logs
> 5. restore the network connectivity
> 6. check the previous logs for the error
> 
> Of course, once it is fixed we should no longer see the error(s) in steps
> #4 and #6

Discard this, it is for a different bug (https://bugzilla.redhat.com/show_bug.cgi?id=1443654)

Reproduction steps

1. let the tested host run at least five (5) VMs.
2. once the system is stable, simulate lockup in the netlink interface [***]
3. wait at least 10 minutes
4. check the running VMs are marked unresponsive
5. restore the netlink interface responsiveness [***]; check that after 2-3 minutes the running VMs are marked responsive again

Problem is, I don't know how to trigger the netlink interface lockup (https://bugzilla.redhat.com/show_bug.cgi?id=1419856#c16)

I will investigate and mark this as CodeChange if I can't come up with a solution.

Comment 22 Francesco Romani 2017-05-08 08:59:30 UTC
(In reply to Francesco Romani from comment #21)
> (In reply to Francesco Romani from comment #20)
> > (In reply to Jiri Belka from comment #19)
> > > I see no clearly written reproduction steps (#10, #11). What are the steps
> > > for verification?
> Reproduction steps
> 
> 1. let the tested host run at least five (5) VMs.
> 2. once the system is stable, simulate lockup in the netlink interface [***]
> 3. wait at least 10 minutes
> 4. check the running VMs are marked unresponsive
> 5. restore the netlink interface responsiveness [***], check after 2-3
> minutes the running VMs are marked responsive again
> 
> Problem is, I don't know how to trigger the netlink interface lockup
> (https://bugzilla.redhat.com/show_bug.cgi?id=1419856#c16)
> 
> Will investigate and marke CodeChange if I can't come up with a solution


a quick way could be
1. mount slowfs (https://github.com/nirs/slowfs) over /var/log/core
2. configure slowfs with a very long delay (60 minutes or so)

this should make HostMonitor block much like the netlink issue did.

Comment 23 Jiri Belka 2017-05-11 16:08:05 UTC
> > 1. let the tested host run at least five (5) VMs.
> > 2. once the system is stable, simulate lockup in the netlink interface [***]
> > 3. wait at least 10 minutes
> > 4. check the running VMs are marked unresponsive
> > 5. restore the netlink interface responsiveness [***], check after 2-3
> > minutes the running VMs are marked responsive again
> > 

> a quick way could be
> 1. mount slowfs (https://github.com/nirs/slowfs) over /var/log/core
> 2. configure slowfs with a very long delay (60 minutes or so)
> 
> this should make HostMonitor block much like the netlink issue did.

I ran slowfs over /var/log/core with '1800' in slowfs.cfg and observed no change at all. The verification steps are unclear; please provide exact steps for verification.

Comment 24 Michal Skrivanek 2017-05-12 07:43:38 UTC
Jiri, comment #15?
Pavel, if it is still reproducible for you, can you give it a try in your setup? As far as we can tell it is fixed, but can't really verify on a real deployment. If not, the changes are included in 4.1.2 either way.

Comment 25 Jiri Belka 2017-05-15 12:27:11 UTC
(In reply to Michal Skrivanek from comment #24)
> Jiri, comment #15?
> Pavel, if it is still reproducible for you, can you give it a try in your
> setup? As far as we can tell it is fixed, but can't really verify on a real
> deployment. If not, the changes are included in 4.1.2 either way.

I could not see any visible change while using the script from #15, i.e. I did not see any 'Not responding' state at all. Thus I'm unable to reproduce this issue.

Comment 26 Tomas Jelinek 2017-05-18 07:12:17 UTC
*** Bug 1449967 has been marked as a duplicate of this bug. ***

Comment 27 Jiri Belka 2017-05-19 12:04:41 UTC
ovirt-engine-4.1.2.2-0.1.el7.noarch

We were unable to reproduce this issue in ovirt-engine-4.1.2.2-0.1.el7.noarch, where the fix is included. Moving to VERIFIED, as SanityOnly was done and the patch is included in the current build. If you encounter this issue in the future, please reopen.

