Bug 1427782 - improve vdsm logging to detect storage issues
Summary: improve vdsm logging to detect storage issues
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm
Classification: oVirt
Component: Core
Version: ---
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ovirt-4.1.2
Target Release: 4.19.11
Assignee: Nir Soffer
QA Contact: Israel Pinto
URL:
Whiteboard:
Depends On: 1430847
Blocks:
 
Reported: 2017-03-01 08:24 UTC by Michael Burman
Modified: 2017-05-23 08:18 UTC
10 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-05-23 08:18:23 UTC
oVirt Team: Storage
Embargoed:
rule-engine: ovirt-4.1+


Attachments
engine log (262.77 KB, application/x-gzip)
2017-03-01 08:24 UTC, Michael Burman
vdsm log from one of the hosts (280.76 KB, application/x-gzip)
2017-03-01 09:27 UTC, Michael Burman
libvirt in debug (3.15 MB, application/x-gzip)
2017-03-06 13:01 UTC, Michael Burman
vdsm log (897.47 KB, application/x-gzip)
2017-03-07 05:57 UTC, Michael Burman
screenshot (113.11 KB, image/png)
2017-03-07 05:57 UTC, Michael Burman
logs in debug (12.59 MB, application/x-gzip)
2017-03-07 13:01 UTC, Michael Burman
qemu logs (5.52 KB, application/x-gzip)
2017-03-08 12:51 UTC, Michael Burman
sanlock (23.19 KB, application/x-gzip)
2017-03-08 17:55 UTC, Michael Burman


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 73556 0 master MERGED storage.asyncutils: Introduce the asyncutils module 2020-12-22 00:57:56 UTC
oVirt gerrit 73592 0 master MERGED check: Warn about blocked checker 2020-12-22 00:57:56 UTC
oVirt gerrit 75100 0 master MERGED tests: Mark slow tests in storage_asyncutils_test.py 2020-12-22 00:58:28 UTC
oVirt gerrit 75250 0 ovirt-4.1 MERGED storage.asyncutils: Introduce the asyncutils module 2020-12-22 00:58:28 UTC
oVirt gerrit 75251 0 ovirt-4.1 MERGED tests: Mark slow tests in storage_asyncutils_test.py 2020-12-22 00:57:56 UTC
oVirt gerrit 75252 0 ovirt-4.1 MERGED check: Warn about blocked checker 2020-12-22 00:57:56 UTC
oVirt gerrit 75261 0 ovirt-4.1 MERGED tests: Complete FakeLogger helper 2020-12-22 00:57:57 UTC
oVirt gerrit 75538 0 ovirt-4.1 MERGED tests: Increase timeouts on time sensitive test 2020-12-22 00:57:55 UTC

Description Michael Burman 2017-03-01 08:24:55 UTC
Created attachment 1258601 [details]
engine log

Description of problem:
VM moved from 'Up' --> 'NotResponding' with EVENT_ID: VM_NOT_RESPONDING(126).

VMs change state from 'Up' to 'NotResponding' from time to time.


2017-03-01 09:57:09,249+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler8) [4d8f4e67] VM '9c80dc47-0c96-493a-89d5-935be3cd59f8'(VM4) moved from 'Up' --> 'NotResponding'
2017-03-01 09:57:09,417+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler8) [4d8f4e67] EVENT_ID: VM_NOT_RESPONDING(126), Correlation ID: null, Call Stack: null
, Custom Event ID: -1, Message: VM VM4 is not responding.
2017-03-01 09:57:09,418+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler8) [4d8f4e67] VM '3f18b745-f9d8-44fa-a1c3-ee7bacafecf8'(VM1) moved from 'Up' --> 'NotResponding'
2017-03-01 09:57:09,438+02 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler8) [4d8f4e67] EVENT_ID: VM_NOT_RESPONDING(126), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM VM1 is not responding.


Version-Release number of selected component (if applicable):
rhevm-4.1.1.2-0.1.el7.noarch
vdsm-4.19.6-1.el7ev.x86_64

How reproducible:
No idea; it happens from time to time and I can't tell what triggers it.

Steps to Reproduce:
1. Run VM/s on latest 4.1.1.2

Actual results:
Sometimes VMs move from Up to NotResponding with no visible reason.

Comment 1 Tomas Jelinek 2017-03-01 09:15:27 UTC
Please also provide VDSM logs from the time it happens.

Can you please explain what is happening on the host on which the VMs are running during this time? Is it heavily loaded? Or the network? Are some migrations running?

Comment 2 Michael Burman 2017-03-01 09:23:34 UTC
Tomas, 

It happens on all hosts, not on a specific one.
In my setup, for example, it happens on 3 hosts.
Like I wrote, I can't explain what is going on: nothing special, no load, no network traffic, nothing. It just happens from time to time and not on a specific host.

Comment 3 Michael Burman 2017-03-01 09:27:37 UTC
Created attachment 1258616 [details]
vdsm log from one of the hosts

Comment 4 Michael Burman 2017-03-01 11:53:17 UTC
It's happening on 4.0.7 as well.

Comment 5 Yaniv Kaul 2017-03-02 09:03:25 UTC
Perhaps libvirt in debug would reveal something?

Comment 6 Michael Burman 2017-03-06 13:01:06 UTC
I'm not sure whether the libvirt debug log revealed anything.
I will attach a vdsm debug log later as well.

Comment 7 Michael Burman 2017-03-06 13:01:47 UTC
Created attachment 1260398 [details]
libvirt in debug

Comment 8 Michael Burman 2017-03-07 05:57:18 UTC
Created attachment 1260688 [details]
vdsm log

Comment 9 Michael Burman 2017-03-07 05:57:44 UTC
Created attachment 1260689 [details]
screenshot

Comment 10 Yaniv Kaul 2017-03-07 11:59:39 UTC
Anyone looking at this?

Comment 11 Milan Zamazal 2017-03-07 12:37:46 UTC
I can see in vdsm.log that two VMs are Up and nothing special, except for a "monitor became unresponsive" warning for one of the VMs at one moment.

Michael, could you please provide all the debug logs (engine, vdsm, libvirt) from the same time range, not from mutually different days, in order to check what is or is not happening when the error occurs?

Comment 12 Michael Burman 2017-03-07 13:00:23 UTC
(In reply to Milan Zamazal from comment #11)
> I can see in vdsm.log that two VMs are UP and nothing special except for
> "monitor became unresponsive" for one of the VMs at one moment.
> 
> Michael, could you please provide all the debug logs (engine, vdsm, libvirt)
> from the same time range, not from mutually different days, in order to
> check what is or is not happening when the error occurs?

OK

Attaching logs from today, it just happened, so the relevant time is: 
engine 14:52:56 (VM1)
engine 14:52:57 (VM5)

vdsm 14:52:56 (VM1)
vdsm 14:52:57 (VM5)

libvirt 12:52:5 (VM1)
libvirt 12:52:5 (VM5)

Comment 13 Michael Burman 2017-03-07 13:01:28 UTC
Created attachment 1260795 [details]
logs in debug

Comment 14 Milan Zamazal 2017-03-08 11:32:11 UTC
Thank you, Michael, for the logs. Do you also have QEMU logs for the VMs from that time? And to be sure, are you running an up-to-date libvirt version?

Looking into the Vdsm log, the problem seems to be that a getAllVmStats call blocks:

  2017-03-07 14:52:55,551+0200 DEBUG (jsonrpc/3) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmStats' in bridge with {} (__init__:532)
  2017-03-07 14:52:55,552+0200 WARN  (jsonrpc/3) [virt.vm] (vmId='3f18b745-f9d8-44fa-a1c3-ee7bacafecf8') monitor became unresponsive (command timeout, age=65.7899999991) (vm:5008)
  2017-03-07 14:52:55,553+0200 WARN  (jsonrpc/3) [virt.vm] (vmId='78ae390d-ffba-4116-b3f7-5fdf39cbdd62') monitor became unresponsive (command timeout, age=65.7899999991) (vm:5008)
  2017-03-07 14:52:55,554+0200 DEBUG (jsonrpc/3) [jsonrpc.JsonRpcServer] Return 'Host.getAllVmStats' in bridge with (suppressed) (__init__:557)
  2017-03-07 14:52:55,555+0200 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:515)
  2017-03-07 14:53:02,197+0200 DEBUG (jsonrpc/2) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmStats' in bridge with {} (__init__:532)
  2017-03-07 14:53:02,198+0200 WARN  (jsonrpc/2) [virt.vm] (vmId='3f18b745-f9d8-44fa-a1c3-ee7bacafecf8') monitor became unresponsive (command timeout, age=72.4299999997) (vm:5008)
  2017-03-07 14:53:02,199+0200 WARN  (jsonrpc/2) [virt.vm] (vmId='78ae390d-ffba-4116-b3f7-5fdf39cbdd62') monitor became unresponsive (command timeout, age=72.4399999995) (vm:5008)
  2017-03-07 14:53:02,200+0200 DEBUG (jsonrpc/2) [jsonrpc.JsonRpcServer] Return 'Host.getAllVmStats' in bridge with (suppressed) (__init__:557)
  2017-03-07 14:53:02,201+0200 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.01 seconds (__init__:515)

Also:

  2017-03-07 14:52:29,202+0200 WARN  (periodic/3) [virt.sampling.StatsCache] dropped stale old sample: sampled 4921538.480000 stored 4921553.480000 (sampling:442)
  2017-03-07 14:52:29,202+0200 DEBUG (periodic/3) [virt.sampling.VMBulkSampler] sampled timestamp 4921538.48 elapsed 24.440 acquired True domains all (sampling:508)
  2017-03-07 14:52:29,203+0200 DEBUG (periodic/3) [Executor] Worker was discarded (executor:277)

And:

  2017-03-07 14:52:12,270+0200 DEBUG (vdsm.Scheduler) [Executor] Worker discarded: <Worker name=periodic/3 running Task(callable=<Operation action=<vdsm.virt.sampling.VMBulkSampler object at 0x7f852c065950> at 0x7f852c065b50>, timeout=7.5) discarded task#=19989 at 0x38c1d90> (executor:320)

So it looks like a problem with the getAllDomainStats libvirt call.  Interesting things might be happening in libvirt around 14:52:04,770+0200.  Here's an excerpt from the libvirt log at the time:

  2017-03-07 12:52:04.766+0000: 30196: debug : virThreadJobSet:96 : Thread 30196 (virNetServerHandleJob) is now running job remoteDispatchConnectGetAllDomainStats
  2017-03-07 12:52:04.766+0000: 30196: debug : virConnectGetAllDomainStats:11501 : conn=0x7f8d50000af0, stats=0x0, retStats=0x7f8d64869b10, flags=0x0
  2017-03-07 12:52:04.766+0000: 30196: debug : virAccessManagerCheckConnect:218 : manager=0x7f8d75345ca0(name=stack) driver=QEMU perm=3
  2017-03-07 12:52:04.766+0000: 30196: debug : virAccessManagerCheckConnect:218 : manager=0x7f8d75345d90(name=none) driver=QEMU perm=3
  2017-03-07 12:52:04.766+0000: 30196: debug : virAccessManagerCheckDomain:234 : manager=0x7f8d75345ca0(name=stack) driver=QEMU domain=0x7f8d48018fd0 perm=1
  2017-03-07 12:52:04.766+0000: 30196: debug : virAccessManagerCheckDomain:234 : manager=0x7f8d75345d90(name=none) driver=QEMU domain=0x7f8d48018fd0 perm=1
  2017-03-07 12:52:04.766+0000: 30196: debug : virAccessManagerCheckDomain:234 : manager=0x7f8d75345ca0(name=stack) driver=QEMU domain=0x7f8d50014470 perm=1
  2017-03-07 12:52:04.766+0000: 30196: debug : virAccessManagerCheckDomain:234 : manager=0x7f8d75345d90(name=none) driver=QEMU domain=0x7f8d50014470 perm=1
  2017-03-07 12:52:04.766+0000: 30196: debug : virAccessManagerCheckDomain:234 : manager=0x7f8d75345ca0(name=stack) driver=QEMU domain=0x7f8d5401a370 perm=1
  2017-03-07 12:52:04.766+0000: 30196: debug : virAccessManagerCheckDomain:234 : manager=0x7f8d75345d90(name=none) driver=QEMU domain=0x7f8d5401a370 perm=1
  2017-03-07 12:52:04.766+0000: 30196: debug : qemuDomainObjBeginJobInternal:2945 : Starting job: query (vm=0x7f8d4801c2a0 name=VM3, current job=none async=none)
  2017-03-07 12:52:04.766+0000: 30196: debug : qemuDomainObjBeginJobInternal:2986 : Started job: query (async=none vm=0x7f8d4801c2a0 name=VM3)
  2017-03-07 12:52:04.766+0000: 30196: debug : virCgroupGetValueStr:822 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d13\x2dVM3.scope/cpuacct.usage
  2017-03-07 12:52:04.766+0000: 30196: debug : virFileClose:102 : Closed fd 26
  2017-03-07 12:52:04.766+0000: 30196: debug : virCgroupGetValueStr:822 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d13\x2dVM3.scope/cpuacct.stat
  2017-03-07 12:52:04.766+0000: 30196: debug : virFileClose:102 : Closed fd 26
  2017-03-07 12:52:04.766+0000: 30196: debug : qemuGetProcessInfo:1443 : Got status for 19518/19522 user=9683 sys=5138 cpu=1 rss=150369
  2017-03-07 12:52:04.766+0000: 30196: debug : virFileClose:102 : Closed fd 26
  2017-03-07 12:52:04.767+0000: 30196: debug : qemuDomainObjEnterMonitorInternal:3209 : Entering monitor (mon=0x7f8d48015c10 vm=0x7f8d4801c2a0 name=VM3)
  2017-03-07 12:52:04.767+0000: 30196: debug : qemuMonitorGetAllBlockStatsInfo:2122 : ret_stats=0x7f8d648698b8, backing=0
  2017-03-07 12:52:04.767+0000: 30196: debug : qemuMonitorGetAllBlockStatsInfo:2124 : mon:0x7f8d48015c10 vm:0x7f8d4801c2a0 json:1 fd:32
  2017-03-07 12:52:04.767+0000: 30196: info : qemuMonitorSend:1009 : QEMU_MONITOR_SEND_MSG: mon=0x7f8d48015c10 msg={"execute":"query-blockstats","id":"libvirt-3895"}
   fd=-1
  2017-03-07 12:52:04.767+0000: 30192: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x7f8d48015c10 buf={"execute":"query-blockstats","id":"libvirt-3895"}
   len=52 ret=52 errno=11
  2017-03-07 12:52:04.768+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d48015c10 buf={"return": [{"device": "drive-ide0-1-0", "stats": {"flush_total_time_ns": 0, "wr_highest_offset": 0, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 106122, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 152, "invalid_flush_operations": 0, "account_failed": false, "idle_time_ns": 19450308525146, "rd_operations": 4, "invalid_wr_operations": 0, "invalid_rd_operations": 0}}, {"device": "drive-scsi0-0-0-0", "parent": {"stats": {"flush_total_time_ns": 0, "wr_highest_offset": 4237303808, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, " len=1023
  2017-03-07 12:52:04.768+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d48015c10 buf={"return": [{"device": "drive-ide0-1-0", "stats": {"flush_total_time_ns": 0, "wr_highest_offset": 0, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 106122, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 152, "invalid_flush_operations": 0, "account_failed": false, "idle_time_ns": 19450308525146, "rd_operations": 4, "invalid_wr_operations": 0, "invalid_rd_operations": 0}}, {"device": "drive-scsi0-0-0-0", "parent": {"stats": {"flush_total_time_ns": 0, "wr_highest_offset": 4237303808, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, "invalid_wr_operations": 0, "invalid_rd_operations": 0}, "node-name": "#block073"}, "stats": {"flush_total_time_ns": 27153799, "wr_highest_offset": 4237303808, "wr_total_time_ns": 9645370174, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 9303040, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": true, "rd_total_time_ns": 13148691297, "flush_operations": 437, "wr_operations": 1322, "rd_merged": 0, "rd_bytes": 136903680, "invalid_flush_operations": 0, "account_failed": true, "idle_time_ns": 83125945116, "rd_operations": 7906, "invalid_wr_operations": 0, "invalid_rd_operations": 0}, "node-name": "#block168"}], "id": "libvirt-3895"}
   len=1714
  2017-03-07 12:52:04.768+0000: 30196: debug : qemuMonitorBlockStatsUpdateCapacity:2161 : stats=0x7f8d4c000a80, backing=0
  2017-03-07 12:52:04.768+0000: 30196: debug : qemuMonitorBlockStatsUpdateCapacity:2163 : mon:0x7f8d48015c10 vm:0x7f8d4801c2a0 json:1 fd:32
  2017-03-07 12:52:04.768+0000: 30196: info : qemuMonitorSend:1009 : QEMU_MONITOR_SEND_MSG: mon=0x7f8d48015c10 msg={"execute":"query-block","id":"libvirt-3896"}
   fd=-1
  2017-03-07 12:52:04.768+0000: 30192: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x7f8d48015c10 buf={"execute":"query-block","id":"libvirt-3896"}
   len=47 ret=47 errno=11
  2017-03-07 12:52:04.769+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d48015c10 buf={"return": [{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 4294967296, "filename": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/c32c2f0f-cab8-4e80-ade1-1acb97a6a7bf/ca6f44b7-a796-4eeb-8ce8-f613bda5f862", "format": "raw", "actual-size": 1562066944, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block168", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/c32c2f0f-cab8-4e80-ade1-1acb97a6a7bf/ca6f44b7-a796-4eeb-8ce8-f613bda5f862", "encryption_key_missing":  len=1023
  2017-03-07 12:52:04.769+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d48015c10 buf={"return": [{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 4294967296, "filename": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/c32c2f0f-cab8-4e80-ade1-1acb97a6a7bf/ca6f44b7-a796-4eeb-8ce8-f613bda5f862", "format": "raw", "actual-size": 1562066944, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block168", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/c32c2f0f-cab8-4e80-ade1-1acb97a6a7bf/ca6f44b7-a796-4eeb-8ce8-f613bda5f862", "encryption_key_missing": false}, "type": "unknown"}], "id": "libvirt-3896"}
   len=1075
  2017-03-07 12:52:04.769+0000: 30196: debug : qemuDomainObjExitMonitorInternal:3232 : Exited monitor (mon=0x7f8d48015c10 vm=0x7f8d4801c2a0 name=VM3)
  2017-03-07 12:52:04.769+0000: 30196: info : virObjectNew:202 : OBJECT_NEW: obj=0x7f8d4c002370 classname=virDomain
  2017-03-07 12:52:04.769+0000: 30196: debug : qemuDomainObjEndJob:3140 : Stopping job: query (async=none vm=0x7f8d4801c2a0 name=VM3)

Stats were gathered from VM3, so far so good.

  2017-03-07 12:52:04.769+0000: 30196: debug : qemuDomainObjBeginJobInternal:2945 : Starting job: query (vm=0x7f8d50001170 name=VM5, current job=none async=none)
  2017-03-07 12:52:04.769+0000: 30196: debug : qemuDomainObjBeginJobInternal:2986 : Started job: query (async=none vm=0x7f8d50001170 name=VM5)
  2017-03-07 12:52:04.769+0000: 30196: debug : virCgroupGetValueStr:822 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d12\x2dVM5.scope/cpuacct.usage
  2017-03-07 12:52:04.769+0000: 30196: debug : virFileClose:102 : Closed fd 26
  2017-03-07 12:52:04.769+0000: 30196: debug : virCgroupGetValueStr:822 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d12\x2dVM5.scope/cpuacct.stat
  2017-03-07 12:52:04.769+0000: 30196: debug : virFileClose:102 : Closed fd 26
  2017-03-07 12:52:04.769+0000: 30196: debug : qemuGetProcessInfo:1443 : Got status for 19248/19252 user=9729 sys=5147 cpu=0 rss=150234
  2017-03-07 12:52:04.769+0000: 30196: debug : virFileClose:102 : Closed fd 26
  2017-03-07 12:52:04.769+0000: 30196: debug : qemuDomainObjEnterMonitorInternal:3209 : Entering monitor (mon=0x7f8d50004b70 vm=0x7f8d50001170 name=VM5)
  2017-03-07 12:52:04.769+0000: 30196: debug : qemuMonitorGetAllBlockStatsInfo:2122 : ret_stats=0x7f8d648698b8, backing=0
  2017-03-07 12:52:04.769+0000: 30196: debug : qemuMonitorGetAllBlockStatsInfo:2124 : mon:0x7f8d50004b70 vm:0x7f8d50001170 json:1 fd:27
  2017-03-07 12:52:04.769+0000: 30196: info : qemuMonitorSend:1009 : QEMU_MONITOR_SEND_MSG: mon=0x7f8d50004b70 msg={"execute":"query-blockstats","id":"libvirt-3932"}
   fd=-1
  2017-03-07 12:52:04.769+0000: 30192: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x7f8d50004b70 buf={"execute":"query-blockstats","id":"libvirt-3932"}
   len=52 ret=52 errno=11
  2017-03-07 12:52:04.770+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d50004b70 buf={"return": [{"device": "drive-ide0-1-0", "stats": {"flush_total_time_ns": 0, "wr_highest_offset": 0, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 105997, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 152, "invalid_flush_operations": 0, "account_failed": false, "idle_time_ns": 19565630138926, "rd_operations": 4, "invalid_wr_operations": 0, "invalid_rd_operations": 0}}, {"device": "drive-scsi0-0-0-0", "parent": {"stats": {"flush_total_time_ns": 0, "wr_highest_offset": 4237303808, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, " len=1023
  2017-03-07 12:52:04.770+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d50004b70 buf={"return": [{"device": "drive-ide0-1-0", "stats": {"flush_total_time_ns": 0, "wr_highest_offset": 0, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 105997, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 152, "invalid_flush_operations": 0, "account_failed": false, "idle_time_ns": 19565630138926, "rd_operations": 4, "invalid_wr_operations": 0, "invalid_rd_operations": 0}}, {"device": "drive-scsi0-0-0-0", "parent": {"stats": {"flush_total_time_ns": 0, "wr_highest_offset": 4237303808, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, "invalid_wr_operations": 0, "invalid_rd_operations": 0}, "node-name": "#block090"}, "stats": {"flush_total_time_ns": 37264545, "wr_highest_offset": 4237303808, "wr_total_time_ns": 27561117289, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 10212352, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": true, "rd_total_time_ns": 13016446057, "flush_operations": 481, "wr_operations": 1466, "rd_merged": 0, "rd_bytes": 139013120, "invalid_flush_operations": 0, "account_failed": true, "idle_time_ns": 12369012028, "rd_operations": 8055, "invalid_wr_operations": 0, "invalid_rd_operations": 0}, "node-name": "#block176"}], "id": "libvirt-3932"}
   len=1716
  2017-03-07 12:52:04.770+0000: 30196: debug : qemuMonitorBlockStatsUpdateCapacity:2161 : stats=0x7f8d4c002a60, backing=0
  2017-03-07 12:52:04.770+0000: 30196: debug : qemuMonitorBlockStatsUpdateCapacity:2163 : mon:0x7f8d50004b70 vm:0x7f8d50001170 json:1 fd:27
  2017-03-07 12:52:04.770+0000: 30196: info : qemuMonitorSend:1009 : QEMU_MONITOR_SEND_MSG: mon=0x7f8d50004b70 msg={"execute":"query-block","id":"libvirt-3933"}
   fd=-1
  2017-03-07 12:52:04.770+0000: 30192: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x7f8d50004b70 buf={"execute":"query-block","id":"libvirt-3933"}
   len=47 ret=47 errno=11

But now the call gets stuck for VM5 and continues only 15 seconds later:

  2017-03-07 12:52:19.765+0000: 30192: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x7f8d48015c10 buf={"execute":"query-blockstats","id":"libvirt-3897"}
   len=52 ret=52 errno=11
  2017-03-07 12:52:19.766+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d48015c10 buf={"return": [{"device": "drive-ide0-1-0", "stats": {"flush_total_time_ns": 0, "wr_highest_offset": 0, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 106122, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 152, "invalid_flush_operations": 0, "account_failed": false, "idle_time_ns": 19465306883106, "rd_operations": 4, "invalid_wr_operations": 0, "invalid_rd_operations": 0}}, {"device": "drive-scsi0-0-0-0", "parent": {"stats": {"flush_total_time_ns": 0, "wr_highest_offset": 4237303808, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, " len=1023
  2017-03-07 12:52:19.766+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d48015c10 buf={"return": [{"device": "drive-ide0-1-0", "stats": {"flush_total_time_ns": 0, "wr_highest_offset": 0, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 106122, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 152, "invalid_flush_operations": 0, "account_failed": false, "idle_time_ns": 19465306883106, "rd_operations": 4, "invalid_wr_operations": 0, "invalid_rd_operations": 0}}, {"device": "drive-scsi0-0-0-0", "parent": {"stats": {"flush_total_time_ns": 0, "wr_highest_offset": 4237303808, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, "invalid_wr_operations": 0, "invalid_rd_operations": 0}, "node-name": "#block073"}, "stats": {"flush_total_time_ns": 27153799, "wr_highest_offset": 4237303808, "wr_total_time_ns": 9645370174, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 9303040, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": true, "rd_total_time_ns": 13148691297, "flush_operations": 437, "wr_operations": 1322, "rd_merged": 0, "rd_bytes": 136903680, "invalid_flush_operations": 0, "account_failed": true, "idle_time_ns": 98124303092, "rd_operations": 7906, "invalid_wr_operations": 0, "invalid_rd_operations": 0}, "node-name": "#block168"}], "id": "libvirt-3897"}
   len=1714
  2017-03-07 12:52:19.766+0000: 30192: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x7f8d48015c10 buf={"execute":"query-block","id":"libvirt-3898"}
   len=47 ret=47 errno=11
  2017-03-07 12:52:19.767+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d48015c10 buf={"return": [{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 4294967296, "filename": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/c32c2f0f-cab8-4e80-ade1-1acb97a6a7bf/ca6f44b7-a796-4eeb-8ce8-f613bda5f862", "format": "raw", "actual-size": 1562066944, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block168", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/c32c2f0f-cab8-4e80-ade1-1acb97a6a7bf/ca6f44b7-a796-4eeb-8ce8-f613bda5f862", "encryption_key_missing":  len=1023
  2017-03-07 12:52:19.767+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d48015c10 buf={"return": [{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 4294967296, "filename": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/c32c2f0f-cab8-4e80-ade1-1acb97a6a7bf/ca6f44b7-a796-4eeb-8ce8-f613bda5f862", "format": "raw", "actual-size": 1562066944, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block168", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/c32c2f0f-cab8-4e80-ade1-1acb97a6a7bf/ca6f44b7-a796-4eeb-8ce8-f613bda5f862", "encryption_key_missing": false}, "type": "unknown"}], "id": "libvirt-3898"}
   len=1075

Gets stuck for another 10 seconds:

  2017-03-07 12:52:29.161+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d50004b70 buf={"return": [{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 4294967296, "filename": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/77506a50-ed71-4500-8ab5-4ddd7f318ad5/a1546bcb-46ab-4c84-85df-b12ecc512a0e", "format": "raw", "actual-size": 1561993216, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block176", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/77506a50-ed71-4500-8ab5-4ddd7f318ad5/a1546bcb-46ab-4c84-85df-b12ecc512a0e", "encryption_key_missing":  len=1023
  2017-03-07 12:52:29.161+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d50004b70 buf={"return": [{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 4294967296, "filename": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/77506a50-ed71-4500-8ab5-4ddd7f318ad5/a1546bcb-46ab-4c84-85df-b12ecc512a0e", "format": "raw", "actual-size": 1561993216, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block176", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/77506a50-ed71-4500-8ab5-4ddd7f318ad5/a1546bcb-46ab-4c84-85df-b12ecc512a0e", "encryption_key_missing": false}, "type": "unknown"}], "id": "libvirt-3933"}
   len=1075
  2017-03-07 12:52:29.161+0000: 30196: debug : qemuDomainObjExitMonitorInternal:3232 : Exited monitor (mon=0x7f8d50004b70 vm=0x7f8d50001170 name=VM5)
  2017-03-07 12:52:29.161+0000: 30196: info : virObjectNew:202 : OBJECT_NEW: obj=0x7f8d4c002660 classname=virDomain
  2017-03-07 12:52:29.161+0000: 30196: debug : qemuDomainObjEndJob:3140 : Stopping job: query (async=none vm=0x7f8d50001170 name=VM5)
  2017-03-07 12:52:29.161+0000: 30196: debug : qemuDomainObjBeginJobInternal:2945 : Starting job: query (vm=0x7f8d54001910 name=VM1, current job=query async=none)
  2017-03-07 12:52:29.161+0000: 30196: debug : qemuDomainObjBeginJobInternal:2968 : Waiting for job (vm=0x7f8d54001910 name=VM1)
  2017-03-07 12:52:29.194+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d54017b00 buf={"return": [{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 4294967296, "filename": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/d2a0f96b-6586-4569-be7d-40e3b74c68f2/9b50a4cc-68da-45e1-991d-7fd8efdc4b68", "format": "raw", "actual-size": 1611055104, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block143", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/d2a0f96b-6586-4569-be7d-40e3b74c68f2/9b50a4cc-68da-45e1-991d-7fd8efdc4b68", "encryption_key_missing":  len=1023
  2017-03-07 12:52:29.194+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d54017b00 buf={"return": [{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 4294967296, "filename": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/d2a0f96b-6586-4569-be7d-40e3b74c68f2/9b50a4cc-68da-45e1-991d-7fd8efdc4b68", "format": "raw", "actual-size": 1611055104, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block143", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/d2a0f96b-6586-4569-be7d-40e3b74c68f2/9b50a4cc-68da-45e1-991d-7fd8efdc4b68", "encryption_key_missing": false}, "type": "unknown"}], "id": "libvirt-4550"}
   len=1075
  2017-03-07 12:52:29.194+0000: 30196: debug : qemuDomainObjBeginJobInternal:2986 : Started job: query (async=none vm=0x7f8d54001910 name=VM1)
  2017-03-07 12:52:29.194+0000: 30196: debug : virCgroupGetValueStr:822 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d11\x2dVM1.scope/cpuacct.usage
  2017-03-07 12:52:29.194+0000: 30196: debug : virFileClose:102 : Closed fd 26
  2017-03-07 12:52:29.194+0000: 30196: debug : virCgroupGetValueStr:822 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d11\x2dVM1.scope/cpuacct.stat
  2017-03-07 12:52:29.194+0000: 30196: debug : virFileClose:102 : Closed fd 26
  2017-03-07 12:52:29.194+0000: 30196: debug : qemuGetProcessInfo:1443 : Got status for 16078/16082 user=10175 sys=4435 cpu=1 rss=148656
  2017-03-07 12:52:29.194+0000: 30196: debug : virFileClose:102 : Closed fd 26
  2017-03-07 12:52:29.195+0000: 30196: debug : qemuDomainObjEnterMonitorInternal:3209 : Entering monitor (mon=0x7f8d54017b00 vm=0x7f8d54001910 name=VM1)
  2017-03-07 12:52:29.195+0000: 30196: debug : qemuMonitorGetAllBlockStatsInfo:2122 : ret_stats=0x7f8d648698b8, backing=0
  2017-03-07 12:52:29.195+0000: 30196: debug : qemuMonitorGetAllBlockStatsInfo:2124 : mon:0x7f8d54017b00 vm:0x7f8d54001910 json:1 fd:33
  2017-03-07 12:52:29.195+0000: 30196: info : qemuMonitorSend:1009 : QEMU_MONITOR_SEND_MSG: mon=0x7f8d54017b00 msg={"execute":"query-blockstats","id":"libvirt-4551"}
   fd=-1
  2017-03-07 12:52:29.195+0000: 30192: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x7f8d54017b00 buf={"execute":"query-blockstats","id":"libvirt-4551"}
   len=52 ret=52 errno=11
  2017-03-07 12:52:29.197+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d54017b00 buf={"return": [{"device": "drive-ide0-1-0", "stats": {"flush_total_time_ns": 0, "wr_highest_offset": 0, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, "invalid_wr_operations": 0, "invalid_rd_operations": 0}}, {"device": "drive-scsi0-0-0-0", "parent": {"stats": {"flush_total_time_ns": 0, "wr_highest_offset": 4104044544, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, "invalid_wr_operations": 0, "invalid_rd_ len=1023
  2017-03-07 12:52:29.197+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d54017b00 buf={"return": [{"device": "drive-ide0-1-0", "stats": {"flush_total_time_ns": 0, "wr_highest_offset": 0, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, "invalid_wr_operations": 0, "invalid_rd_operations": 0}}, {"device": "drive-scsi0-0-0-0", "parent": {"stats": {"flush_total_time_ns": 0, "wr_highest_offset": 4104044544, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, "invalid_wr_operations": 0, "invalid_rd_operations": 0}, "node-name": "#block001"}, "stats": {"flush_total_time_ns": 96504023, "wr_highest_offset": 4104044544, "wr_total_time_ns": 325011539158, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 22859776, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": true, "rd_total_time_ns": 0, "flush_operations": 2104, "wr_operations": 4001, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": true, "idle_time_ns": 7459570, "rd_operations": 0, "invalid_wr_operations": 0, "invalid_rd_operations": 0}, "node-name": "#block143"}], "id": "libvirt-4551"}
   len=1654
  2017-03-07 12:52:29.197+0000: 30196: debug : qemuMonitorBlockStatsUpdateCapacity:2161 : stats=0x7f8d4c002710, backing=0
  2017-03-07 12:52:29.197+0000: 30196: debug : qemuMonitorBlockStatsUpdateCapacity:2163 : mon:0x7f8d54017b00 vm:0x7f8d54001910 json:1 fd:33
  2017-03-07 12:52:29.197+0000: 30196: info : qemuMonitorSend:1009 : QEMU_MONITOR_SEND_MSG: mon=0x7f8d54017b00 msg={"execute":"query-block","id":"libvirt-4552"}
   fd=-1
  2017-03-07 12:52:29.197+0000: 30192: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x7f8d54017b00 buf={"execute":"query-block","id":"libvirt-4552"}
   len=47 ret=47 errno=11
  2017-03-07 12:52:29.200+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d54017b00 buf={"return": [{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 4294967296, "filename": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/d2a0f96b-6586-4569-be7d-40e3b74c68f2/9b50a4cc-68da-45e1-991d-7fd8efdc4b68", "format": "raw", "actual-size": 1611055104, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block143", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/d2a0f96b-6586-4569-be7d-40e3b74c68f2/9b50a4cc-68da-45e1-991d-7fd8efdc4b68", "encryption_key_missing":  len=1023
  2017-03-07 12:52:29.200+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d54017b00 buf={"return": [{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 4294967296, "filename": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/d2a0f96b-6586-4569-be7d-40e3b74c68f2/9b50a4cc-68da-45e1-991d-7fd8efdc4b68", "format": "raw", "actual-size": 1611055104, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block143", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/d2a0f96b-6586-4569-be7d-40e3b74c68f2/9b50a4cc-68da-45e1-991d-7fd8efdc4b68", "encryption_key_missing": false}, "type": "unknown"}], "id": "libvirt-4552"}
   len=1075
  2017-03-07 12:52:29.201+0000: 30198: debug : virThreadJobSet:96 : Thread 30198 (virNetServerHandleJob) is now running job remoteDispatchConnectSupportsFeature
  2017-03-07 12:52:29.201+0000: 30198: debug : virConnectSupportsFeature:72 : conn=0x7f8d50000af0, feature=9
  2017-03-07 12:52:29.201+0000: 30198: debug : virAccessManagerCheckConnect:218 : manager=0x7f8d75345ca0(name=stack) driver=QEMU perm=0
  2017-03-07 12:52:29.201+0000: 30198: debug : virAccessManagerCheckConnect:218 : manager=0x7f8d75345d90(name=none) driver=QEMU perm=0
  2017-03-07 12:52:29.201+0000: 30198: debug : virThreadJobClear:121 : Thread 30198 (virNetServerHandleJob) finished job remoteDispatchConnectSupportsFeature with ret=0
  2017-03-07 12:52:29.201+0000: 30196: debug : qemuDomainObjExitMonitorInternal:3232 : Exited monitor (mon=0x7f8d54017b00 vm=0x7f8d54001910 name=VM1)
  2017-03-07 12:52:29.201+0000: 30196: info : virObjectNew:202 : OBJECT_NEW: obj=0x7f8d4c000930 classname=virDomain
  2017-03-07 12:52:29.201+0000: 30196: debug : qemuDomainObjEndJob:3140 : Stopping job: query (async=none vm=0x7f8d54001910 name=VM1)
  2017-03-07 12:52:29.201+0000: 30196: debug : virDomainFree:565 : dom=0x7f8d4c002370, (VM: name=VM3, uuid=eed925e6-f19c-44a7-abdf-84c608c3bbf3)
  2017-03-07 12:52:29.201+0000: 30196: info : virObjectUnref:261 : OBJECT_DISPOSE: obj=0x7f8d4c002370
  2017-03-07 12:52:29.201+0000: 30196: debug : virDomainDispose:316 : release domain 0x7f8d4c002370 VM3 eed925e6-f19c-44a7-abdf-84c608c3bbf3
  2017-03-07 12:52:29.201+0000: 30196: debug : virDomainFree:565 : dom=0x7f8d4c002660, (VM: name=VM5, uuid=78ae390d-ffba-4116-b3f7-5fdf39cbdd62)
  2017-03-07 12:52:29.201+0000: 30196: info : virObjectUnref:261 : OBJECT_DISPOSE: obj=0x7f8d4c002660
  2017-03-07 12:52:29.201+0000: 30196: debug : virDomainDispose:316 : release domain 0x7f8d4c002660 VM5 78ae390d-ffba-4116-b3f7-5fdf39cbdd62
  2017-03-07 12:52:29.201+0000: 30196: debug : virDomainFree:565 : dom=0x7f8d4c000930, (VM: name=VM1, uuid=3f18b745-f9d8-44fa-a1c3-ee7bacafecf8)
  2017-03-07 12:52:29.201+0000: 30196: info : virObjectUnref:261 : OBJECT_DISPOSE: obj=0x7f8d4c000930
  2017-03-07 12:52:29.201+0000: 30196: debug : virDomainDispose:316 : release domain 0x7f8d4c000930 VM1 3f18b745-f9d8-44fa-a1c3-ee7bacafecf8
  2017-03-07 12:52:29.201+0000: 30196: debug : virThreadJobClear:121 : Thread 30196 (virNetServerHandleJob) finished job remoteDispatchConnectGetAllDomainStats with ret=0

Finally finished. There's at least one successful getAllDomainStats call in the libvirt log in the meantime, but it queries only VM3. So it seems that the getAllDomainStats libvirt call blocks occasionally, and other calls made in the meantime don't provide information about some VMs (probably those not yet processed by the stuck call). I don't know what could cause the blocking; maybe we could see something in the QEMU logs?
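
(For illustration only: a minimal Python sketch of the mechanism implied by the "monitor became unresponsive (command timeout, age=...)" warnings above. The names StatsSample, MONITOR_TIMEOUT and monitor_response are invented for this sketch and are not the actual vdsm code: the periodic sampler stores a timestamped sample per VM, and when the sample's age exceeds a timeout, getAllVmStats reports the monitor as unresponsive, which the engine maps to NotResponding.)

  # Illustrative sketch, not the actual vdsm code.
  import time

  MONITOR_TIMEOUT = 60.0  # seconds; assumed threshold for this sketch


  class StatsSample:
      """Last bulk-stats sample for one VM and the time it was taken."""
      def __init__(self, stats, timestamp):
          self.stats = stats
          self.timestamp = timestamp


  def monitor_response(sample, now=None):
      """Return (0, age) if the sample is fresh, (-1, age) if the monitor
      looks unresponsive; -1 is what the engine maps to NotResponding."""
      now = time.monotonic() if now is None else now
      age = now - sample.timestamp
      if age > MONITOR_TIMEOUT:
          return -1, age   # corresponds to the "monitor became unresponsive" WARNs
      return 0, age


  # Example: a sample taken ~66 seconds ago is reported as unresponsive.
  stale = StatsSample(stats={}, timestamp=time.monotonic() - 65.79)
  print(monitor_response(stale))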

Comment 15 Michael Burman 2017-03-08 12:50:22 UTC
Hello Milan

Thank you for looking into this.
I'm running with 
libvirt-client-2.0.0-10.el7_3.5.x86_64
libvirt-daemon-2.0.0-10.el7_3.5.x86_64

I can't understand either what is causing this. The thing is that it happens on all of my setups (I have 4 different ones), both on 4.0.4 and 4.1.1.3,
on different servers (8 different HWs), and it happens every few hours. Maybe it is somehow related to our lab/storage, because I was asking other QE people around and it happens on their environments as well.
From the logs I don't see the main/source issue for this.

Comment 16 Michael Burman 2017-03-08 12:51:17 UTC
Created attachment 1261265 [details]
qemu logs

Comment 17 Milan Zamazal 2017-03-08 13:32:53 UTC
Thank you for the QEMU logs; unfortunately there is nothing there.

So all I'm able to say is that libvirt apparently blocks from time to time on retrieving storage stats, which is most likely what eventually results in the temporary NotResponding status of some VMs. It indeed looks like some problem with the storage in your lab.

Comment 18 Michal Skrivanek 2017-03-08 14:57:27 UTC
Tal, can you think of any way to troubleshoot this further, perhaps at the storage access level?

Comment 19 Adam Litke 2017-03-08 16:34:31 UTC
I took a look at the storage domain stats in the vdsm log (lines containing 'Run and protect: repoStats, Return response:') and the 'delay' field never showed a read latency greater than 0.0281469 seconds.  This suggests that the storage remained accessible throughout.

The only interesting thing I found was in libvirtd.log:

2017-03-06 06:23:12.530+0000: 27478: error : virNetSocketReadWire:1623 : End of file while reading data: Input/output error
2017-03-06 06:28:46.997+0000: 27478: error : virNetSocketReadWire:1623 : End of file while reading data: Input/output error

but these could be nothing.

It looks like you are doing live migration. Does this happen without live migration? Does it happen if you do not use VM leases?

Comment 20 Adam Litke 2017-03-08 16:54:04 UTC
Nir suggested I look at the lastCheck values in the vdsm log, so I used the following pipeline to get a sorted histogram of lastCheck values:

grep -io "'lastCheck': '[[:digit:]]*\.[[:digit:]]*'" vdsm.log | awk '{print $2}' | tr "'" " " | awk -F . '{print $1}' | sort -rn  | uniq -c

fields are: <nr occurrences> <lastCheck value>

      4  36
      4  32
      8  31
      4  29
      4  27
      8  25
     12  21
      4  20
      4  19
     12  16
      4  13
      4  11
      4  10
     94  9
     98  8
    243  7
   1490  6
    463  5
    426  4
    580  3
    597  2
   3184  1
    981  0

lastCheck values above 10s mean problematic storage (perhaps not performant enough to handle your use case).
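
(For reference, a rough Python equivalent of the pipeline above, assuming the same "'lastCheck': '<seconds>'" formatting in vdsm.log; the function name last_check_histogram is just for this sketch.)

  # Rough Python equivalent of the shell pipeline above (illustrative only).
  import re
  from collections import Counter

  PATTERN = re.compile(r"'lastCheck': '(\d+)\.\d+'")

  def last_check_histogram(path):
      counts = Counter()
      with open(path, errors="replace") as f:
          for line in f:
              for value in PATTERN.findall(line):
                  counts[int(value)] += 1
      # Same shape as the shell output: <nr occurrences> <lastCheck value>,
      # highest lastCheck first.
      for value in sorted(counts, reverse=True):
          print(f"{counts[value]:7d}  {value}")

  # last_check_histogram("vdsm.log")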

Comment 21 Nir Soffer 2017-03-08 17:41:16 UTC
Can you attach sanlock.log? If we have storage issues, sanlock will most likely complain about them.

Comment 22 Michael Burman 2017-03-08 17:46:23 UTC
Thank you Adam, we will investigate our NFS storage domain then; it is possible that we are facing some bad performance on the storage.

My use case is actually pretty simple: I have some VMs just running, nothing special.
Some of them with leases and some not.
Yes, I did some live migrations, but the warnings appear a long time after the migration has ended.

I will attach the sanlock.log in a few minutes.

Comment 23 Michael Burman 2017-03-08 17:54:29 UTC
2017-03-08 18:53:02+0200 727729 [15793]: s1 renewal error -202 delta_length 10 last_success 727698
2017-03-08 18:53:20+0200 727746 [15793]: b5effbba aio collect RD 0x7f3cb40008c0:0x7f3cb40008d0:0x7f3cb93f5000 result 1048576:0 other free
2017-03-08 18:53:20+0200 727746 [15855]: 68f39e69 aio collect RD 0x7f3cac0008c0:0x7f3cac0008d0:0x7f3cc806f000 result 1048576:0 other free
2017-03-08 18:53:20+0200 727746 [15849]: 4638f290 aio collect RD 0x7f3cb00008c0:0x7f3cb00008d0:0x7f3cc8273000 result 1048576:0 other free

The errors in the log are at the same time the warnings appear on the engine.

Comment 24 Michael Burman 2017-03-08 17:55:01 UTC
Created attachment 1261359 [details]
sanlock

Comment 25 Nir Soffer 2017-03-09 12:40:13 UTC
The attached patch does not resolve the storage issue (we cannot work with such
storage), but it adds a warning when a storage domain checker is blocked for 10
seconds. This will make it trivial to detect storage issues by grepping for
warnings in vdsm logs.
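
(Sketch of the idea only, not the merged gerrit patches: each path checker records when its current check started, and the check loop warns when a check has been running longer than a threshold. PathChecker, BLOCKED_WARNING and warn_if_blocked are names made up for this illustration.)

  # Illustrative sketch; see the linked gerrit patches for the real code.
  import logging
  import time

  log = logging.getLogger("storage.check")
  BLOCKED_WARNING = 10.0  # seconds, matching the warnings shown in comment 32


  class PathChecker:
      def __init__(self, path):
          self.path = path
          self.check_started = None   # monotonic start time of a running check

      def start_check(self):
          self.check_started = time.monotonic()

      def finish_check(self):
          self.check_started = None

      def warn_if_blocked(self, now):
          if self.check_started is not None:
              delay = now - self.check_started
              if delay >= BLOCKED_WARNING:
                  log.warning("Checker %r is blocked for %.2f seconds",
                              self.path, delay)


  def watchdog(checkers):
      """One pass over all checkers, called periodically by the check loop."""
      now = time.monotonic()
      for checker in checkers:
          checker.warn_if_blocked(now)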

Comment 26 Nir Soffer 2017-03-09 12:52:24 UTC
(In reply to Milan Zamazal from comment #17)
> So all I'm able to say is that libvirt apparently blocks from time to time
> on retrieving storage stats, which is most likely the thing that eventually
> results in temporary NotResponding status of some VMs. It indeed looks like
> some problem with storage in your lab.

Milan, why does libvirt/qemu have to block when getting storage stats? This smells
like a design issue in qemu. Qemu should collect stats while doing real I/O, and
report the collected stats without blocking. This is the way vdsm reports storage
stats, and the same goes for sanlock: you can get stats about lease renewals in a
non-blocking way, even if sanlock threads are blocked on inaccessible storage.

Can you open a libvirt/qemu bug about this?
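
(To illustrate the non-blocking pattern described here, a minimal sketch; CachedStats is an invented name, not vdsm or sanlock code: a background thread collects stats and may block on storage, while readers always get the last cached sample and its age immediately.)

  # Minimal sketch of non-blocking stats reporting (invented example).
  import threading
  import time


  class CachedStats:
      def __init__(self, collect):
          self._collect = collect          # callable that may block on storage
          self._lock = threading.Lock()
          self._stats = None
          self._timestamp = None

      def start(self, interval=15.0):
          self._interval = interval
          t = threading.Thread(target=self._loop, daemon=True)
          t.start()

      def _loop(self):
          while True:
              stats = self._collect()      # only this thread can block
              with self._lock:
                  self._stats = stats
                  self._timestamp = time.monotonic()
              time.sleep(self._interval)

      def get(self):
          """Never blocks: returns (last stats, age in seconds) or (None, None)."""
          with self._lock:
              if self._timestamp is None:
                  return None, None
              return self._stats, time.monotonic() - self._timestamp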

Comment 27 Milan Zamazal 2017-03-09 17:15:50 UTC
Done: https://bugzilla.redhat.com/1430847

Comment 28 Nir Soffer 2017-03-09 17:32:20 UTC
Looks like this bug depends on bug 1430847.

Comment 29 Michal Skrivanek 2017-03-10 14:49:41 UTC
Changing the scope to logging improvements.
Do not expect a real solution any time soon; it's indeed an architectural limitation in qemu and libvirt.

Comment 30 Nir Soffer 2017-04-03 21:53:52 UTC
2 patches are merged, 1 test fix pending.

Tal, do you want to target this to 4.1.2?

Comment 31 Tal Nisan 2017-04-04 08:03:27 UTC
(In reply to Nir Soffer from comment #30)
> 2 patches are merged, 1 test fix pending.
> 
> Tal, do you want to target this to 4.1.2?

Yes please

Comment 32 Israel Pinto 2017-05-17 06:53:49 UTC
Verified with:
Red Hat Virtualization Manager Version: 4.1.2.1-0.1.el7
Host:
OS Version:RHEL - 7.3 - 7.el7
Kernel Version:3.10.0 - 514.16.1.el7.x86_64
KVM Version:2.6.0 - 28.el7_3.9
LIBVIRT Version:libvirt-2.0.0-10.el7_3.5
VDSM Version:vdsm-4.19.12-1.el7ev

Steps:
Set latency on the SPM host with 'tc' and check the vdsm log for the WARN message.

 
vdsm log:
2017-05-17 09:45:36,589+0300 WARN  (check/loop) [storage.check] Checker '/dev/d0e0d37d-23f5-4c94-bb76-fb7453c07f10/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:45:37,261+0300 WARN  (check/loop) [storage.check] Checker '/dev/25f23b38-42f9-4fec-af32-4ab3b8d743eb/metadata' is blocked for 9.99 seconds (check:279)
2017-05-17 09:45:37,823+0300 WARN  (check/loop) [storage.check] Checker '/dev/5449c95e-b387-4d19-8f58-a22000608f3c/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:46:13,086+0300 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__1/bd80429f-b1fd-4a5d-bac6-64f025df4ed3/dom_md/metadata' is blocked for 10.04 seconds (check:279)
2017-05-17 09:46:43,045+0300 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__1/bd80429f-b1fd-4a5d-bac6-64f025df4ed3/dom_md/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:47:13,045+0300 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__1/bd80429f-b1fd-4a5d-bac6-64f025df4ed3/dom_md/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:47:16,585+0300 WARN  (check/loop) [storage.check] Checker '/dev/d0e0d37d-23f5-4c94-bb76-fb7453c07f10/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:47:23,048+0300 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__1/bd80429f-b1fd-4a5d-bac6-64f025df4ed3/dom_md/metadata' is blocked for 20.00 seconds (check:279)
2017-05-17 09:47:42,035+0300 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__2/d71ee3d7-c1ab-4645-8f75-29ff79315d1c/dom_md/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:47:42,126+0300 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__0/a629d603-00a0-439f-b632-2c7a795092da/dom_md/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:47:53,051+0300 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__1/bd80429f-b1fd-4a5d-bac6-64f025df4ed3/dom_md/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:48:12,033+0300 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__2/d71ee3d7-c1ab-4645-8f75-29ff79315d1c/dom_md/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:48:12,124+0300 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__0/a629d603-00a0-439f-b632-2c7a795092da/dom_md/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:48:42,030+0300 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__2/d71ee3d7-c1ab-4645-8f75-29ff79315d1c/dom_md/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:48:47,262+0300 WARN  (check/loop) [storage.check] Checker '/dev/25f23b38-42f9-4fec-af32-4ab3b8d743eb/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:49:12,033+0300 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__2/d71ee3d7-c1ab-4645-8f75-29ff79315d1c/dom_md/metadata' is blocked for 9.99 seconds (check:279)
2017-05-17 09:49:23,049+0300 WARN  (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__1/bd80429f-b1fd-4a5d-bac6-64f025df4ed3/dom_md/metadata' is blocked for 10.00 seconds (check:279)
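
(A small helper, assuming the same WARN format as above, to summarize how often each checker was reported blocked and the worst delay seen; summarize_blocked is an invented name for this sketch.)

  # Summarize "Checker ... is blocked for N seconds" warnings in a vdsm.log.
  import re
  from collections import Counter

  BLOCKED = re.compile(
      r"Checker u?'(?P<path>[^']+)' is blocked for (?P<secs>[\d.]+) seconds")

  def summarize_blocked(log_path):
      counts = Counter()
      worst = {}
      with open(log_path, errors="replace") as f:
          for line in f:
              m = BLOCKED.search(line)
              if m:
                  p = m.group("path")
                  counts[p] += 1
                  worst[p] = max(worst.get(p, 0.0), float(m.group("secs")))
      for p, n in counts.most_common():
          print(f"{n:4d} warnings, worst {worst[p]:.2f}s: {p}")

  # summarize_blocked("vdsm.log")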

