Bug 1660452 - Executor queue can get full if vm.destroy takes some time to complete
Summary: Executor queue can get full if vm.destroy takes some time to complete
Keywords:
Status: CLOSED DUPLICATE of bug 1660451
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: vdsm
Version: 4.2.7
Hardware: All
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ovirt-4.3.0
Target Release: ---
Assignee: Dan Kenigsberg
QA Contact: Lukas Svaty
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2018-12-18 10:58 UTC by nijin ashok
Modified: 2020-08-03 15:30 UTC
CC List: 4 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-12-19 00:13:17 UTC
oVirt Team: Infra
Target Upstream Version:
lsvaty: testing_plan_complete-



Description nijin ashok 2018-12-18 10:58:32 UTC
Description of problem:

In vm.destroy, we send libvirt.VIR_DOMAIN_DESTROY_GRACEFUL, which delivers SIGTERM to the qemu-kvm process. This can take some time if the VM has a lot of pending I/O to flush. During this time the qemu monitor is not accessible for the domain, so the periodic tasks that access this VM's qemu monitor get stuck. With a single VM whose vm.destroy runs for 30 seconds, the executor queue can reach 20+ threads, all serving the stuck tasks.
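
For context, a minimal sketch of the graceful destroy path described above, using the libvirt Python bindings (this is not vdsm's actual code; the helper name and URI are illustrative):

===
import libvirt

def destroy_gracefully(vm_uuid, uri='qemu:///system'):
    """Illustrative helper: gracefully destroy a domain by UUID."""
    conn = libvirt.open(uri)
    try:
        dom = conn.lookupByUUIDString(vm_uuid)
        # VIR_DOMAIN_DESTROY_GRACEFUL makes libvirt send SIGTERM instead of
        # SIGKILL; qemu-kvm may keep running for a while to flush pending
        # I/O, and its monitor is unavailable for the whole duration.
        dom.destroyFlags(libvirt.VIR_DOMAIN_DESTROY_GRACEFUL)
    finally:
        conn.close()
===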

The issue was observed in a customer environment when the customer tried to shut down a VM while it was migrating.

===
2018-12-10 16:49:02,370+0900 INFO  (jsonrpc/6) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:,49558, flow_id=2d076393-b68f-4504-9309-9a8b6004e947, vmId=a6d98c62-fd66-44bc-aa2c-2b52321a9f00 (api:46)
2018-12-10 16:49:02,370+0900 INFO  (jsonrpc/6) [virt.vm] (vmId='a6d98c62-fd66-44bc-aa2c-2b52321a9f00') Release VM resources (vm:5215)
2018-12-10 16:49:02,370+0900 INFO  (jsonrpc/6) [virt.vm] (vmId='a6d98c62-fd66-44bc-aa2c-2b52321a9f00') Stopping connection (guestagent:438)
2018-12-10 16:49:02,371+0900 INFO  (jsonrpc/6) [virt.vm] (vmId='a6d98c62-fd66-44bc-aa2c-2b52321a9f00') _destroyVmGraceful attempt #0 (vm:5252)
===

The executor queue then keeps growing until it reaches the limit of 30, with every worker serving the periodic tasks of this VM. Once it reaches 30, all the VMs on the host go into "non-responding" status because no worker thread is available.
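
To illustrate the starvation, a simplified model (my own sketch, not vdsm's executor implementation) in which each blocked periodic call ties up a worker and the pool refuses to grow past a hard limit:

===
import threading

MAX_WORKERS = 30          # matches the limit seen in the log
workers = []

def submit(task):
    # Simplified: one worker per task, capped at MAX_WORKERS.
    if len(workers) >= MAX_WORKERS:
        print("Too many workers (limit=%d), not adding more" % MAX_WORKERS)
        return False
    worker = threading.Thread(target=task, daemon=True)
    workers.append(worker)
    worker.start()
    return True

# Stand-in for the unresponsive qemu monitor of the VM being destroyed.
blocked = threading.Event()

# 30+ periodic samplings of that VM block their workers...
for _ in range(35):
    submit(blocked.wait)
# ...after which nothing is left to monitor the other VMs.
===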

===
2018-12-10 16:49:04,271+0900 WARN  (vdsm.Scheduler) [Executor] executor state: count=6
2018-12-10 16:49:14,896+0900 WARN  (vdsm.Scheduler) [Executor] executor state: count=10
2018-12-10 16:49:39,338+0900 WARN  (vdsm.Scheduler) [Executor] executor state: count=17
2018-12-10 16:49:48,325+0900 WARN  (vdsm.Scheduler) [Executor] executor state: count=22
2018-12-10 16:49:50,863+0900 WARN  (vdsm.Scheduler) [Executor] executor state: count=27 
2018-12-10 16:49:58,838+0900 WARN  (vdsm.Scheduler) [Executor] executor state: count=30
2018-12-10 16:49:59,842+0900 WARN  (vdsm.Scheduler) [Executor] Too many workers (limit=30), not adding more (executor:210)

2018-12-10 16:49:52,362+0900 WARN  (jsonrpc/3) [virt.vm] (vmId='35f7d64b-071f-4432-97aa-3b10c048e6be') monitor became unresponsive (command timeout, age=60.9699999988) (vm:6151)
2018-12-10 16:49:52,364+0900 WARN  (jsonrpc/3) [virt.vm] (vmId='1e8d5f9c-77f2-4659-a64c-fbf857081a6d') monitor became unresponsive (command timeout, age=60.9699999988) (vm:6151)
===

Once the VM is down, all the threads are released and the executor queue returns to normal.

===
2018-12-10 16:50:02,305+0900 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call VM.destroy succeeded in 59.93 seconds (__init__:573)
2018-12-10 16:50:02,307+0900 INFO  (periodic/19625) [Executor] Worker was discarded (executor:305)
2018-12-10 16:50:02,307+0900 INFO  (periodic/19623) [Executor] Worker was discarded (executor:305)
2018-12-10 16:50:02,321+0900 INFO  (periodic/19624) [Executor] Worker was discarded (executor:305)
2018-12-10 16:50:02,388+0900 INFO  (periodic/19609) [Executor] Worker was discarded (executor:305)
2018-12-10 16:50:02,390+0900 INFO  (periodic/19632) [Executor] Worker was discarded (executor:305)
2018-12-10 16:50:02,403+0900 INFO  (periodic/19636) [Executor] Worker was discarded (executor:305)
2018-12-10 16:50:02,403+0900 INFO  (periodic/19638) [Executor] Worker was discarded (executor:305)
2018-12-10 16:50:02,405+0900 INFO  (periodic/19639) [Executor] Worker was discarded (executor:305)
===

I was able to reproduce this issue by setting some (unusual) values for disk caching and limiting I/O on the storage, so that a large amount of dirty pages would still need flushing while the VM was being destroyed and vm.destroy would be delayed. The destroy took 30 seconds and the queue went to 23 with only a single VM running on the host.

From my test environment.

===
2018-12-18 09:56:37,726+0530 INFO  (jsonrpc/5) [api.virt] START destroy(gracefulAttempts=1) from=::ffff:10.65.177.57,49360, flow_id=7d78f081, vmId=0a58e3f8-aead-4b1d-9d83-5ed4c93619ee (api:46)
2018-12-18 09:56:45,495+0530 WARN  (vdsm.Scheduler) [Executor] executor state: count=8
2018-12-18 09:56:59,499+0530 WARN  (vdsm.Scheduler) [Executor] executor state: count=17
2018-12-18 09:57:03,499+0530 WARN  (vdsm.Scheduler) [Executor] executor state: count=21
2018-12-18 09:57:07,494+0530 WARN  (vdsm.Scheduler) [Executor] executor state: count=23

2018-12-18 09:57:07,756+0530 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call VM.destroy failed (error 42) in 30.03 seconds (__init__:573)
2018-12-18 09:57:07,756+0530 INFO  (qgapoller/8) [Executor] Worker was discarded (executor:305)
2018-12-18 09:57:07,757+0530 INFO  (periodic/32) [Executor] Worker was discarded (executor:305)
2018-12-18 09:57:07,759+0530 INFO  (periodic/43) [Executor] Worker was discarded (executor:305)
2018-12-18 09:57:07,780+0530 INFO  (periodic/40) [Executor] Worker was discarded (executor:305)
===


Version-Release number of selected component (if applicable):

vdsm-4.20.39.1-1.el7ev.x86_64


How reproducible:

100%

Steps to Reproduce:

1. In my VM with 3 GB of RAM, I set the sysctl values below to accumulate a large amount of dirty pages.

vm.dirty_background_ratio = 60
vm.dirty_ratio = 80
vm.dirty_expire_centisecs = 36000

This was done only to build up a large cache, since I don't have a VM with a huge amount of RAM to work with; the values can be applied as sketched below.
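
A small sketch of applying the values above from inside the guest (equivalent to sysctl -w; assumes root):

===
# Sysctl values used to build up a large dirty page cache in the guest.
SYSCTLS = {
    "vm/dirty_background_ratio": "60",
    "vm/dirty_ratio": "80",
    "vm/dirty_expire_centisecs": "36000",
}

for name, value in SYSCTLS.items():
    with open("/proc/sys/" + name, "w") as f:
        f.write(value)
===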

2. Migrate the VM and, during the migration, shut down the VM.

3. Check the vdsm log; the executor queue can be seen filling up.

Actual results:

The executor queue fills up if vm.destroy takes some time to complete.

Expected results:

I think the periodic monitoring should be disabled while a VM is being destroyed (see the sketch below).
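
A hypothetical sketch of what I mean (the 'destroying' flag and the dispatch/update_stats names below are made up for illustration; they are not vdsm APIs):

===
def runnable_vms(all_vms):
    # Skip VMs whose destroy is in progress; their qemu monitor will not
    # answer, so periodic calls would only block executor workers.
    return [vm for vm in all_vms
            if not getattr(vm, 'destroying', False)]   # 'destroying' is a made-up flag

def periodic_dispatch(all_vms, executor):
    for vm in runnable_vms(all_vms):
        executor.dispatch(lambda vm=vm: vm.update_stats())   # made-up API
===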

Additional info:

Comment 3 Ryan Barry 2018-12-19 00:13:17 UTC

*** This bug has been marked as a duplicate of bug 1660451 ***

