Created attachment 1258601 [details]
engine log

Description of problem:
VMs move from 'Up' to 'NotResponding' with EVENT_ID: VM_NOT_RESPONDING(126) from time to time.

2017-03-01 09:57:09,249+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler8) [4d8f4e67] VM '9c80dc47-0c96-493a-89d5-935be3cd59f8'(VM4) moved from 'Up' --> 'NotResponding'
2017-03-01 09:57:09,417+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler8) [4d8f4e67] EVENT_ID: VM_NOT_RESPONDING(126), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM VM4 is not responding.
2017-03-01 09:57:09,418+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (DefaultQuartzScheduler8) [4d8f4e67] VM '3f18b745-f9d8-44fa-a1c3-ee7bacafecf8'(VM1) moved from 'Up' --> 'NotResponding'
2017-03-01 09:57:09,438+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler8) [4d8f4e67] EVENT_ID: VM_NOT_RESPONDING(126), Correlation ID: null, Call Stack: null, Custom Event ID: -1, Message: VM VM1 is not responding.

Version-Release number of selected component (if applicable):
rhevm-4.1.1.2-0.1.el7.noarch
vdsm-4.19.6-1.el7ev.x86_64

How reproducible:
Intermittent; the trigger is not understood.

Steps to Reproduce:
1. Run VMs on the latest 4.1.1.2.

Actual results:
VMs sometimes move from Up to NotResponding with no visible reason.
Please also provide VDSM logs from the time it happens. Can you explain what is happening on the host the VMs are running on at that time? Is it heavily loaded? Or the network? Are any migrations running?
Tomas, it happens on all hosts, not on a specific one. In my setup it happens on 3 hosts, for example. As I wrote, I can't explain what is going on; nothing special is happening: no load, no network traffic, nothing. It just happens from time to time, and not on a specific host.
Created attachment 1258616 [details] vdsm log from one of the hosts
It's happening on 4.0.7 as well.
Perhaps libvirt in debug would reveal something?
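For reference, libvirt debug logging is typically enabled in /etc/libvirt/libvirtd.conf; the filter values below are a common suggestion, not necessarily the exact ones used for the attached logs:

```
# /etc/libvirt/libvirtd.conf (restart libvirtd afterwards)
log_filters="1:qemu 1:libvirt 3:object 3:json 3:event 1:util"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
```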
Not sure whether libvirt in debug revealed anything. I will attach vdsm in debug later as well.
Created attachment 1260398 [details] libvirt in debug
Created attachment 1260688 [details] vdsm log
Created attachment 1260689 [details] screenshot
Anyone looking at this?
I can see in vdsm.log that the two VMs are UP and nothing special happens, except for "monitor became unresponsive" for one of the VMs at one moment. Michael, could you please provide all the debug logs (engine, vdsm, libvirt) from the same time range, rather than from different days, so we can check what is or is not happening when the error occurs?
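For context, the NotResponding state originates in Vdsm's sampling age check: each VM tracks when its QEMU monitor last answered, and once that age exceeds a timeout, Vdsm reports a negative monitorResponse, which the engine maps to NotResponding. A minimal sketch of that check, assuming a 60-second threshold; the class and names here are illustrative, not Vdsm's actual code:

```python
import time

MONITOR_TIMEOUT = 60  # assumed threshold in seconds; vdsm's value is configurable


class MonitorWatch:
    """Sketch of the staleness check behind the
    'monitor became unresponsive (command timeout, age=...)' warning."""

    def __init__(self):
        self._last_sample = time.monotonic()

    def sample_succeeded(self):
        # called whenever a stats call for this VM's monitor returns
        self._last_sample = time.monotonic()

    def monitor_response(self):
        # age = time since the monitor last answered; a negative
        # monitorResponse is what the engine turns into NotResponding
        age = time.monotonic() - self._last_sample
        if age > MONITOR_TIMEOUT:
            return -1, age
        return 0, age
```

This matches the shape of the warnings above: the reported age (65.79s, 72.43s) keeps growing between getAllVmStats calls because no fresh sample arrives.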
(In reply to Milan Zamazal from comment #11) > I can see in vdsm.log that two VMs are UP and nothing special except for > "monitor became unresponsive" for one of the VMs at one moment. > > Michael, could you please provide all the debug logs (engine, vdsm, libvirt) > from the same time range, not from mutually different days, in order to > check what is or is not happening when the error occurs?

OK. Attaching logs from today; it just happened, so the relevant times are:

engine 14:52:56 (VM1)
engine 14:52:57 (VM5)
vdsm 14:52:56 (VM1)
vdsm 14:52:57 (VM5)
libvirt 12:52:5 (VM1)
libvirt 12:52:5 (VM5)

(libvirt logs timestamps in UTC, hence the two-hour offset.)
Created attachment 1260795 [details] logs in debug
Thank you, Michael, for the logs. Do you also have QEMU logs for the VMs from that time? And to be sure, are you running an up-to-date libvirt version?

Looking into the Vdsm log, the getAllVmStats calls themselves return quickly, but the stats sampling behind them is blocked, so the monitor age keeps growing:

2017-03-07 14:52:55,551+0200 DEBUG (jsonrpc/3) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmStats' in bridge with {} (__init__:532)
2017-03-07 14:52:55,552+0200 WARN (jsonrpc/3) [virt.vm] (vmId='3f18b745-f9d8-44fa-a1c3-ee7bacafecf8') monitor became unresponsive (command timeout, age=65.7899999991) (vm:5008)
2017-03-07 14:52:55,553+0200 WARN (jsonrpc/3) [virt.vm] (vmId='78ae390d-ffba-4116-b3f7-5fdf39cbdd62') monitor became unresponsive (command timeout, age=65.7899999991) (vm:5008)
2017-03-07 14:52:55,554+0200 DEBUG (jsonrpc/3) [jsonrpc.JsonRpcServer] Return 'Host.getAllVmStats' in bridge with (suppressed) (__init__:557)
2017-03-07 14:52:55,555+0200 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:515)
2017-03-07 14:53:02,197+0200 DEBUG (jsonrpc/2) [jsonrpc.JsonRpcServer] Calling 'Host.getAllVmStats' in bridge with {} (__init__:532)
2017-03-07 14:53:02,198+0200 WARN (jsonrpc/2) [virt.vm] (vmId='3f18b745-f9d8-44fa-a1c3-ee7bacafecf8') monitor became unresponsive (command timeout, age=72.4299999997) (vm:5008)
2017-03-07 14:53:02,199+0200 WARN (jsonrpc/2) [virt.vm] (vmId='78ae390d-ffba-4116-b3f7-5fdf39cbdd62') monitor became unresponsive (command timeout, age=72.4399999995) (vm:5008)
2017-03-07 14:53:02,200+0200 DEBUG (jsonrpc/2) [jsonrpc.JsonRpcServer] Return 'Host.getAllVmStats' in bridge with (suppressed) (__init__:557)
2017-03-07 14:53:02,201+0200 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.01 seconds (__init__:515)

Also:

2017-03-07 14:52:29,202+0200 WARN (periodic/3) [virt.sampling.StatsCache] dropped stale old sample: sampled 4921538.480000 stored 4921553.480000 (sampling:442)
2017-03-07 14:52:29,202+0200 DEBUG (periodic/3)
[virt.sampling.VMBulkSampler] sampled timestamp 4921538.48 elapsed 24.440 acquired True domains all (sampling:508)
2017-03-07 14:52:29,203+0200 DEBUG (periodic/3) [Executor] Worker was discarded (executor:277)

And:

2017-03-07 14:52:12,270+0200 DEBUG (vdsm.Scheduler) [Executor] Worker discarded: <Worker name=periodic/3 running Task(callable=<Operation action=<vdsm.virt.sampling.VMBulkSampler object at 0x7f852c065950> at 0x7f852c065b50>, timeout=7.5) discarded task#=19989 at 0x38c1d90> (executor:320)

So it looks like a problem with the getAllDomainStats libvirt call. Interesting things might happen in libvirt around 14:52:04,770+0200 (12:52:04 UTC in the libvirt log). Here's an excerpt from the libvirt log at that time:

2017-03-07 12:52:04.766+0000: 30196: debug : virThreadJobSet:96 : Thread 30196 (virNetServerHandleJob) is now running job remoteDispatchConnectGetAllDomainStats
2017-03-07 12:52:04.766+0000: 30196: debug : virConnectGetAllDomainStats:11501 : conn=0x7f8d50000af0, stats=0x0, retStats=0x7f8d64869b10, flags=0x0
2017-03-07 12:52:04.766+0000: 30196: debug : virAccessManagerCheckConnect:218 : manager=0x7f8d75345ca0(name=stack) driver=QEMU perm=3
2017-03-07 12:52:04.766+0000: 30196: debug : virAccessManagerCheckConnect:218 : manager=0x7f8d75345d90(name=none) driver=QEMU perm=3
2017-03-07 12:52:04.766+0000: 30196: debug : virAccessManagerCheckDomain:234 : manager=0x7f8d75345ca0(name=stack) driver=QEMU domain=0x7f8d48018fd0 perm=1
2017-03-07 12:52:04.766+0000: 30196: debug : virAccessManagerCheckDomain:234 : manager=0x7f8d75345d90(name=none) driver=QEMU domain=0x7f8d48018fd0 perm=1
2017-03-07 12:52:04.766+0000: 30196: debug : virAccessManagerCheckDomain:234 : manager=0x7f8d75345ca0(name=stack) driver=QEMU domain=0x7f8d50014470 perm=1
2017-03-07 12:52:04.766+0000: 30196: debug : virAccessManagerCheckDomain:234 : manager=0x7f8d75345d90(name=none) driver=QEMU domain=0x7f8d50014470 perm=1
2017-03-07 12:52:04.766+0000: 30196: debug : virAccessManagerCheckDomain:234 : manager=0x7f8d75345ca0(name=stack)
driver=QEMU domain=0x7f8d5401a370 perm=1 2017-03-07 12:52:04.766+0000: 30196: debug : virAccessManagerCheckDomain:234 : manager=0x7f8d75345d90(name=none) driver=QEMU domain=0x7f8d5401a370 perm=1 2017-03-07 12:52:04.766+0000: 30196: debug : qemuDomainObjBeginJobInternal:2945 : Starting job: query (vm=0x7f8d4801c2a0 name=VM3, current job=none async=none) 2017-03-07 12:52:04.766+0000: 30196: debug : qemuDomainObjBeginJobInternal:2986 : Started job: query (async=none vm=0x7f8d4801c2a0 name=VM3) 2017-03-07 12:52:04.766+0000: 30196: debug : virCgroupGetValueStr:822 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d13\x2dVM3.scope/cpuacct.usage 2017-03-07 12:52:04.766+0000: 30196: debug : virFileClose:102 : Closed fd 26 2017-03-07 12:52:04.766+0000: 30196: debug : virCgroupGetValueStr:822 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d13\x2dVM3.scope/cpuacct.stat 2017-03-07 12:52:04.766+0000: 30196: debug : virFileClose:102 : Closed fd 26 2017-03-07 12:52:04.766+0000: 30196: debug : qemuGetProcessInfo:1443 : Got status for 19518/19522 user=9683 sys=5138 cpu=1 rss=150369 2017-03-07 12:52:04.766+0000: 30196: debug : virFileClose:102 : Closed fd 26 2017-03-07 12:52:04.767+0000: 30196: debug : qemuDomainObjEnterMonitorInternal:3209 : Entering monitor (mon=0x7f8d48015c10 vm=0x7f8d4801c2a0 name=VM3) 2017-03-07 12:52:04.767+0000: 30196: debug : qemuMonitorGetAllBlockStatsInfo:2122 : ret_stats=0x7f8d648698b8, backing=0 2017-03-07 12:52:04.767+0000: 30196: debug : qemuMonitorGetAllBlockStatsInfo:2124 : mon:0x7f8d48015c10 vm:0x7f8d4801c2a0 json:1 fd:32 2017-03-07 12:52:04.767+0000: 30196: info : qemuMonitorSend:1009 : QEMU_MONITOR_SEND_MSG: mon=0x7f8d48015c10 msg={"execute":"query-blockstats","id":"libvirt-3895"} fd=-1 2017-03-07 12:52:04.767+0000: 30192: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x7f8d48015c10 buf={"execute":"query-blockstats","id":"libvirt-3895"} len=52 ret=52 errno=11 2017-03-07 12:52:04.768+0000: 30192: 
info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d48015c10 buf={"return": [{"device": "drive-ide0-1-0", "stats": {"flush_total_time_ns": 0, "wr_highest_offset": 0, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 106122, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 152, "invalid_flush_operations": 0, "account_failed": false, "idle_time_ns": 19450308525146, "rd_operations": 4, "invalid_wr_operations": 0, "invalid_rd_operations": 0}}, {"device": "drive-scsi0-0-0-0", "parent": {"stats": {"flush_total_time_ns": 0, "wr_highest_offset": 4237303808, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, " len=1023 2017-03-07 12:52:04.768+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d48015c10 buf={"return": [{"device": "drive-ide0-1-0", "stats": {"flush_total_time_ns": 0, "wr_highest_offset": 0, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 106122, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 152, "invalid_flush_operations": 0, "account_failed": false, "idle_time_ns": 19450308525146, "rd_operations": 4, "invalid_wr_operations": 0, "invalid_rd_operations": 0}}, {"device": "drive-scsi0-0-0-0", "parent": {"stats": {"flush_total_time_ns": 0, "wr_highest_offset": 4237303808, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, 
"wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, "invalid_wr_operations": 0, "invalid_rd_operations": 0}, "node-name": "#block073"}, "stats": {"flush_total_time_ns": 27153799, "wr_highest_offset": 4237303808, "wr_total_time_ns": 9645370174, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 9303040, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": true, "rd_total_time_ns": 13148691297, "flush_operations": 437, "wr_operations": 1322, "rd_merged": 0, "rd_bytes": 136903680, "invalid_flush_operations": 0, "account_failed": true, "idle_time_ns": 83125945116, "rd_operations": 7906, "invalid_wr_operations": 0, "invalid_rd_operations": 0}, "node-name": "#block168"}], "id": "libvirt-3895"} len=1714 2017-03-07 12:52:04.768+0000: 30196: debug : qemuMonitorBlockStatsUpdateCapacity:2161 : stats=0x7f8d4c000a80, backing=0 2017-03-07 12:52:04.768+0000: 30196: debug : qemuMonitorBlockStatsUpdateCapacity:2163 : mon:0x7f8d48015c10 vm:0x7f8d4801c2a0 json:1 fd:32 2017-03-07 12:52:04.768+0000: 30196: info : qemuMonitorSend:1009 : QEMU_MONITOR_SEND_MSG: mon=0x7f8d48015c10 msg={"execute":"query-block","id":"libvirt-3896"} fd=-1 2017-03-07 12:52:04.768+0000: 30192: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x7f8d48015c10 buf={"execute":"query-block","id":"libvirt-3896"} len=47 ret=47 errno=11 2017-03-07 12:52:04.769+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d48015c10 buf={"return": [{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 
4294967296, "filename": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/c32c2f0f-cab8-4e80-ade1-1acb97a6a7bf/ca6f44b7-a796-4eeb-8ce8-f613bda5f862", "format": "raw", "actual-size": 1562066944, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block168", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/c32c2f0f-cab8-4e80-ade1-1acb97a6a7bf/ca6f44b7-a796-4eeb-8ce8-f613bda5f862", "encryption_key_missing": len=1023 2017-03-07 12:52:04.769+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d48015c10 buf={"return": [{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 4294967296, "filename": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/c32c2f0f-cab8-4e80-ade1-1acb97a6a7bf/ca6f44b7-a796-4eeb-8ce8-f613bda5f862", "format": "raw", "actual-size": 1562066944, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block168", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/c32c2f0f-cab8-4e80-ade1-1acb97a6a7bf/ca6f44b7-a796-4eeb-8ce8-f613bda5f862", "encryption_key_missing": false}, "type": "unknown"}], "id": "libvirt-3896"} len=1075 2017-03-07 12:52:04.769+0000: 30196: debug : 
qemuDomainObjExitMonitorInternal:3232 : Exited monitor (mon=0x7f8d48015c10 vm=0x7f8d4801c2a0 name=VM3) 2017-03-07 12:52:04.769+0000: 30196: info : virObjectNew:202 : OBJECT_NEW: obj=0x7f8d4c002370 classname=virDomain 2017-03-07 12:52:04.769+0000: 30196: debug : qemuDomainObjEndJob:3140 : Stopping job: query (async=none vm=0x7f8d4801c2a0 name=VM3) Stats got gathered from VM3, so far good. 2017-03-07 12:52:04.769+0000: 30196: debug : qemuDomainObjBeginJobInternal:2945 : Starting job: query (vm=0x7f8d50001170 name=VM5, current job=none async=none) 2017-03-07 12:52:04.769+0000: 30196: debug : qemuDomainObjBeginJobInternal:2986 : Started job: query (async=none vm=0x7f8d50001170 name=VM5) 2017-03-07 12:52:04.769+0000: 30196: debug : virCgroupGetValueStr:822 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d12\x2dVM5.scope/cpuacct.usage 2017-03-07 12:52:04.769+0000: 30196: debug : virFileClose:102 : Closed fd 26 2017-03-07 12:52:04.769+0000: 30196: debug : virCgroupGetValueStr:822 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d12\x2dVM5.scope/cpuacct.stat 2017-03-07 12:52:04.769+0000: 30196: debug : virFileClose:102 : Closed fd 26 2017-03-07 12:52:04.769+0000: 30196: debug : qemuGetProcessInfo:1443 : Got status for 19248/19252 user=9729 sys=5147 cpu=0 rss=150234 2017-03-07 12:52:04.769+0000: 30196: debug : virFileClose:102 : Closed fd 26 2017-03-07 12:52:04.769+0000: 30196: debug : qemuDomainObjEnterMonitorInternal:3209 : Entering monitor (mon=0x7f8d50004b70 vm=0x7f8d50001170 name=VM5) 2017-03-07 12:52:04.769+0000: 30196: debug : qemuMonitorGetAllBlockStatsInfo:2122 : ret_stats=0x7f8d648698b8, backing=0 2017-03-07 12:52:04.769+0000: 30196: debug : qemuMonitorGetAllBlockStatsInfo:2124 : mon:0x7f8d50004b70 vm:0x7f8d50001170 json:1 fd:27 2017-03-07 12:52:04.769+0000: 30196: info : qemuMonitorSend:1009 : QEMU_MONITOR_SEND_MSG: mon=0x7f8d50004b70 msg={"execute":"query-blockstats","id":"libvirt-3932"} fd=-1 2017-03-07 12:52:04.769+0000: 
30192: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x7f8d50004b70 buf={"execute":"query-blockstats","id":"libvirt-3932"} len=52 ret=52 errno=11 2017-03-07 12:52:04.770+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d50004b70 buf={"return": [{"device": "drive-ide0-1-0", "stats": {"flush_total_time_ns": 0, "wr_highest_offset": 0, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 105997, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 152, "invalid_flush_operations": 0, "account_failed": false, "idle_time_ns": 19565630138926, "rd_operations": 4, "invalid_wr_operations": 0, "invalid_rd_operations": 0}}, {"device": "drive-scsi0-0-0-0", "parent": {"stats": {"flush_total_time_ns": 0, "wr_highest_offset": 4237303808, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, " len=1023 2017-03-07 12:52:04.770+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d50004b70 buf={"return": [{"device": "drive-ide0-1-0", "stats": {"flush_total_time_ns": 0, "wr_highest_offset": 0, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 105997, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 152, "invalid_flush_operations": 0, "account_failed": false, "idle_time_ns": 19565630138926, "rd_operations": 4, "invalid_wr_operations": 0, "invalid_rd_operations": 0}}, {"device": 
"drive-scsi0-0-0-0", "parent": {"stats": {"flush_total_time_ns": 0, "wr_highest_offset": 4237303808, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, "invalid_wr_operations": 0, "invalid_rd_operations": 0}, "node-name": "#block090"}, "stats": {"flush_total_time_ns": 37264545, "wr_highest_offset": 4237303808, "wr_total_time_ns": 27561117289, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 10212352, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": true, "rd_total_time_ns": 13016446057, "flush_operations": 481, "wr_operations": 1466, "rd_merged": 0, "rd_bytes": 139013120, "invalid_flush_operations": 0, "account_failed": true, "idle_time_ns": 12369012028, "rd_operations": 8055, "invalid_wr_operations": 0, "invalid_rd_operations": 0}, "node-name": "#block176"}], "id": "libvirt-3932"} len=1716 2017-03-07 12:52:04.770+0000: 30196: debug : qemuMonitorBlockStatsUpdateCapacity:2161 : stats=0x7f8d4c002a60, backing=0 2017-03-07 12:52:04.770+0000: 30196: debug : qemuMonitorBlockStatsUpdateCapacity:2163 : mon:0x7f8d50004b70 vm:0x7f8d50001170 json:1 fd:27 2017-03-07 12:52:04.770+0000: 30196: info : qemuMonitorSend:1009 : QEMU_MONITOR_SEND_MSG: mon=0x7f8d50004b70 msg={"execute":"query-block","id":"libvirt-3933"} fd=-1 2017-03-07 12:52:04.770+0000: 30192: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x7f8d50004b70 buf={"execute":"query-block","id":"libvirt-3933"} len=47 ret=47 errno=11 But now the call gets stuck for VM5 and continues only 15 seconds later: 2017-03-07 12:52:19.765+0000: 30192: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x7f8d48015c10 
buf={"execute":"query-blockstats","id":"libvirt-3897"} len=52 ret=52 errno=11 2017-03-07 12:52:19.766+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d48015c10 buf={"return": [{"device": "drive-ide0-1-0", "stats": {"flush_total_time_ns": 0, "wr_highest_offset": 0, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 106122, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 152, "invalid_flush_operations": 0, "account_failed": false, "idle_time_ns": 19465306883106, "rd_operations": 4, "invalid_wr_operations": 0, "invalid_rd_operations": 0}}, {"device": "drive-scsi0-0-0-0", "parent": {"stats": {"flush_total_time_ns": 0, "wr_highest_offset": 4237303808, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, " len=1023 2017-03-07 12:52:19.766+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d48015c10 buf={"return": [{"device": "drive-ide0-1-0", "stats": {"flush_total_time_ns": 0, "wr_highest_offset": 0, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 106122, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 152, "invalid_flush_operations": 0, "account_failed": false, "idle_time_ns": 19465306883106, "rd_operations": 4, "invalid_wr_operations": 0, "invalid_rd_operations": 0}}, {"device": "drive-scsi0-0-0-0", "parent": {"stats": {"flush_total_time_ns": 0, 
"wr_highest_offset": 4237303808, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, "invalid_wr_operations": 0, "invalid_rd_operations": 0}, "node-name": "#block073"}, "stats": {"flush_total_time_ns": 27153799, "wr_highest_offset": 4237303808, "wr_total_time_ns": 9645370174, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 9303040, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": true, "rd_total_time_ns": 13148691297, "flush_operations": 437, "wr_operations": 1322, "rd_merged": 0, "rd_bytes": 136903680, "invalid_flush_operations": 0, "account_failed": true, "idle_time_ns": 98124303092, "rd_operations": 7906, "invalid_wr_operations": 0, "invalid_rd_operations": 0}, "node-name": "#block168"}], "id": "libvirt-3897"} len=1714 2017-03-07 12:52:19.766+0000: 30192: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x7f8d48015c10 buf={"execute":"query-block","id":"libvirt-3898"} len=47 ret=47 errno=11 2017-03-07 12:52:19.767+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d48015c10 buf={"return": [{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 4294967296, "filename": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/c32c2f0f-cab8-4e80-ade1-1acb97a6a7bf/ca6f44b7-a796-4eeb-8ce8-f613bda5f862", "format": "raw", "actual-size": 1562066944, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": 
"#block168", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/c32c2f0f-cab8-4e80-ade1-1acb97a6a7bf/ca6f44b7-a796-4eeb-8ce8-f613bda5f862", "encryption_key_missing": len=1023 2017-03-07 12:52:19.767+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d48015c10 buf={"return": [{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 4294967296, "filename": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/c32c2f0f-cab8-4e80-ade1-1acb97a6a7bf/ca6f44b7-a796-4eeb-8ce8-f613bda5f862", "format": "raw", "actual-size": 1562066944, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block168", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/c32c2f0f-cab8-4e80-ade1-1acb97a6a7bf/ca6f44b7-a796-4eeb-8ce8-f613bda5f862", "encryption_key_missing": false}, "type": "unknown"}], "id": "libvirt-3898"} len=1075 Gets stuck for another 10 seconds: 2017-03-07 12:52:29.161+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d50004b70 buf={"return": [{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": 
false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 4294967296, "filename": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/77506a50-ed71-4500-8ab5-4ddd7f318ad5/a1546bcb-46ab-4c84-85df-b12ecc512a0e", "format": "raw", "actual-size": 1561993216, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block176", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/77506a50-ed71-4500-8ab5-4ddd7f318ad5/a1546bcb-46ab-4c84-85df-b12ecc512a0e", "encryption_key_missing": len=1023 2017-03-07 12:52:29.161+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d50004b70 buf={"return": [{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 4294967296, "filename": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/77506a50-ed71-4500-8ab5-4ddd7f318ad5/a1546bcb-46ab-4c84-85df-b12ecc512a0e", "format": "raw", "actual-size": 1561993216, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block176", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/77506a50-ed71-4500-8ab5-4ddd7f318ad5/a1546bcb-46ab-4c84-85df-b12ecc512a0e", "encryption_key_missing": false}, "type": "unknown"}], "id": 
"libvirt-3933"} len=1075 2017-03-07 12:52:29.161+0000: 30196: debug : qemuDomainObjExitMonitorInternal:3232 : Exited monitor (mon=0x7f8d50004b70 vm=0x7f8d50001170 name=VM5) 2017-03-07 12:52:29.161+0000: 30196: info : virObjectNew:202 : OBJECT_NEW: obj=0x7f8d4c002660 classname=virDomain 2017-03-07 12:52:29.161+0000: 30196: debug : qemuDomainObjEndJob:3140 : Stopping job: query (async=none vm=0x7f8d50001170 name=VM5) 2017-03-07 12:52:29.161+0000: 30196: debug : qemuDomainObjBeginJobInternal:2945 : Starting job: query (vm=0x7f8d54001910 name=VM1, current job=query async=none) 2017-03-07 12:52:29.161+0000: 30196: debug : qemuDomainObjBeginJobInternal:2968 : Waiting for job (vm=0x7f8d54001910 name=VM1) 2017-03-07 12:52:29.194+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d54017b00 buf={"return": [{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 4294967296, "filename": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/d2a0f96b-6586-4569-be7d-40e3b74c68f2/9b50a4cc-68da-45e1-991d-7fd8efdc4b68", "format": "raw", "actual-size": 1611055104, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block143", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/d2a0f96b-6586-4569-be7d-40e3b74c68f2/9b50a4cc-68da-45e1-991d-7fd8efdc4b68", "encryption_key_missing": len=1023 2017-03-07 12:52:29.194+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d54017b00 buf={"return": 
[{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 4294967296, "filename": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/d2a0f96b-6586-4569-be7d-40e3b74c68f2/9b50a4cc-68da-45e1-991d-7fd8efdc4b68", "format": "raw", "actual-size": 1611055104, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block143", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/d2a0f96b-6586-4569-be7d-40e3b74c68f2/9b50a4cc-68da-45e1-991d-7fd8efdc4b68", "encryption_key_missing": false}, "type": "unknown"}], "id": "libvirt-4550"} len=1075 2017-03-07 12:52:29.194+0000: 30196: debug : qemuDomainObjBeginJobInternal:2986 : Started job: query (async=none vm=0x7f8d54001910 name=VM1) 2017-03-07 12:52:29.194+0000: 30196: debug : virCgroupGetValueStr:822 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d11\x2dVM1.scope/cpuacct.usage 2017-03-07 12:52:29.194+0000: 30196: debug : virFileClose:102 : Closed fd 26 2017-03-07 12:52:29.194+0000: 30196: debug : virCgroupGetValueStr:822 : Get value /sys/fs/cgroup/cpu,cpuacct/machine.slice/machine-qemu\x2d11\x2dVM1.scope/cpuacct.stat 2017-03-07 12:52:29.194+0000: 30196: debug : virFileClose:102 : Closed fd 26 2017-03-07 12:52:29.194+0000: 30196: debug : qemuGetProcessInfo:1443 : Got status for 16078/16082 user=10175 sys=4435 cpu=1 rss=148656 2017-03-07 12:52:29.194+0000: 30196: debug : virFileClose:102 : Closed fd 26 2017-03-07 12:52:29.195+0000: 30196: debug : qemuDomainObjEnterMonitorInternal:3209 
: Entering monitor (mon=0x7f8d54017b00 vm=0x7f8d54001910 name=VM1) 2017-03-07 12:52:29.195+0000: 30196: debug : qemuMonitorGetAllBlockStatsInfo:2122 : ret_stats=0x7f8d648698b8, backing=0 2017-03-07 12:52:29.195+0000: 30196: debug : qemuMonitorGetAllBlockStatsInfo:2124 : mon:0x7f8d54017b00 vm:0x7f8d54001910 json:1 fd:33 2017-03-07 12:52:29.195+0000: 30196: info : qemuMonitorSend:1009 : QEMU_MONITOR_SEND_MSG: mon=0x7f8d54017b00 msg={"execute":"query-blockstats","id":"libvirt-4551"} fd=-1 2017-03-07 12:52:29.195+0000: 30192: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x7f8d54017b00 buf={"execute":"query-blockstats","id":"libvirt-4551"} len=52 ret=52 errno=11 2017-03-07 12:52:29.197+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d54017b00 buf={"return": [{"device": "drive-ide0-1-0", "stats": {"flush_total_time_ns": 0, "wr_highest_offset": 0, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, "invalid_wr_operations": 0, "invalid_rd_operations": 0}}, {"device": "drive-scsi0-0-0-0", "parent": {"stats": {"flush_total_time_ns": 0, "wr_highest_offset": 4104044544, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, "invalid_wr_operations": 0, "invalid_rd_ len=1023 2017-03-07 12:52:29.197+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d54017b00 buf={"return": [{"device": "drive-ide0-1-0", 
"stats": {"flush_total_time_ns": 0, "wr_highest_offset": 0, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, "invalid_wr_operations": 0, "invalid_rd_operations": 0}}, {"device": "drive-scsi0-0-0-0", "parent": {"stats": {"flush_total_time_ns": 0, "wr_highest_offset": 4104044544, "wr_total_time_ns": 0, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 0, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": false, "rd_total_time_ns": 0, "flush_operations": 0, "wr_operations": 0, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": false, "rd_operations": 0, "invalid_wr_operations": 0, "invalid_rd_operations": 0}, "node-name": "#block001"}, "stats": {"flush_total_time_ns": 96504023, "wr_highest_offset": 4104044544, "wr_total_time_ns": 325011539158, "failed_wr_operations": 0, "failed_rd_operations": 0, "wr_merged": 0, "wr_bytes": 22859776, "timed_stats": [], "failed_flush_operations": 0, "account_invalid": true, "rd_total_time_ns": 0, "flush_operations": 2104, "wr_operations": 4001, "rd_merged": 0, "rd_bytes": 0, "invalid_flush_operations": 0, "account_failed": true, "idle_time_ns": 7459570, "rd_operations": 0, "invalid_wr_operations": 0, "invalid_rd_operations": 0}, "node-name": "#block143"}], "id": "libvirt-4551"} len=1654 2017-03-07 12:52:29.197+0000: 30196: debug : qemuMonitorBlockStatsUpdateCapacity:2161 : stats=0x7f8d4c002710, backing=0 2017-03-07 12:52:29.197+0000: 30196: debug : qemuMonitorBlockStatsUpdateCapacity:2163 : mon:0x7f8d54017b00 vm:0x7f8d54001910 json:1 fd:33 2017-03-07 12:52:29.197+0000: 30196: info : qemuMonitorSend:1009 : QEMU_MONITOR_SEND_MSG: mon=0x7f8d54017b00 
msg={"execute":"query-block","id":"libvirt-4552"} fd=-1 2017-03-07 12:52:29.197+0000: 30192: info : qemuMonitorIOWrite:534 : QEMU_MONITOR_IO_WRITE: mon=0x7f8d54017b00 buf={"execute":"query-block","id":"libvirt-4552"} len=47 ret=47 errno=11 2017-03-07 12:52:29.200+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d54017b00 buf={"return": [{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 4294967296, "filename": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/d2a0f96b-6586-4569-be7d-40e3b74c68f2/9b50a4cc-68da-45e1-991d-7fd8efdc4b68", "format": "raw", "actual-size": 1611055104, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block143", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/d2a0f96b-6586-4569-be7d-40e3b74c68f2/9b50a4cc-68da-45e1-991d-7fd8efdc4b68", "encryption_key_missing": len=1023 2017-03-07 12:52:29.200+0000: 30192: info : qemuMonitorIOProcess:429 : QEMU_MONITOR_IO_PROCESS: mon=0x7f8d54017b00 buf={"return": [{"io-status": "ok", "device": "drive-ide0-1-0", "locked": false, "removable": true, "tray_open": false, "type": "unknown"}, {"io-status": "ok", "device": "drive-scsi0-0-0-0", "locked": false, "removable": false, "inserted": {"iops_rd": 0, "detect_zeroes": "off", "image": {"virtual-size": 4294967296, "filename": 
"/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/d2a0f96b-6586-4569-be7d-40e3b74c68f2/9b50a4cc-68da-45e1-991d-7fd8efdc4b68", "format": "raw", "actual-size": 1611055104, "dirty-flag": false}, "iops_wr": 0, "ro": false, "node-name": "#block143", "backing_file_depth": 0, "drv": "raw", "iops": 0, "bps_wr": 0, "write_threshold": 0, "encrypted": false, "bps": 0, "bps_rd": 0, "cache": {"no-flush": false, "direct": true, "writeback": true}, "file": "/rhev/data-center/74dd8b69-7211-4914-9a5f-3d64162eacfd/4638f290-c6b3-45d1-b860-d4665f3d92cc/images/d2a0f96b-6586-4569-be7d-40e3b74c68f2/9b50a4cc-68da-45e1-991d-7fd8efdc4b68", "encryption_key_missing": false}, "type": "unknown"}], "id": "libvirt-4552"} len=1075 2017-03-07 12:52:29.201+0000: 30198: debug : virThreadJobSet:96 : Thread 30198 (virNetServerHandleJob) is now running job remoteDispatchConnectSupportsFeature 2017-03-07 12:52:29.201+0000: 30198: debug : virConnectSupportsFeature:72 : conn=0x7f8d50000af0, feature=9 2017-03-07 12:52:29.201+0000: 30198: debug : virAccessManagerCheckConnect:218 : manager=0x7f8d75345ca0(name=stack) driver=QEMU perm=0 2017-03-07 12:52:29.201+0000: 30198: debug : virAccessManagerCheckConnect:218 : manager=0x7f8d75345d90(name=none) driver=QEMU perm=0 2017-03-07 12:52:29.201+0000: 30198: debug : virThreadJobClear:121 : Thread 30198 (virNetServerHandleJob) finished job remoteDispatchConnectSupportsFeature with ret=0 2017-03-07 12:52:29.201+0000: 30196: debug : qemuDomainObjExitMonitorInternal:3232 : Exited monitor (mon=0x7f8d54017b00 vm=0x7f8d54001910 name=VM1) 2017-03-07 12:52:29.201+0000: 30196: info : virObjectNew:202 : OBJECT_NEW: obj=0x7f8d4c000930 classname=virDomain 2017-03-07 12:52:29.201+0000: 30196: debug : qemuDomainObjEndJob:3140 : Stopping job: query (async=none vm=0x7f8d54001910 name=VM1) 2017-03-07 12:52:29.201+0000: 30196: debug : virDomainFree:565 : dom=0x7f8d4c002370, (VM: name=VM3, uuid=eed925e6-f19c-44a7-abdf-84c608c3bbf3) 
2017-03-07 12:52:29.201+0000: 30196: info : virObjectUnref:261 : OBJECT_DISPOSE: obj=0x7f8d4c002370
2017-03-07 12:52:29.201+0000: 30196: debug : virDomainDispose:316 : release domain 0x7f8d4c002370 VM3 eed925e6-f19c-44a7-abdf-84c608c3bbf3
2017-03-07 12:52:29.201+0000: 30196: debug : virDomainFree:565 : dom=0x7f8d4c002660, (VM: name=VM5, uuid=78ae390d-ffba-4116-b3f7-5fdf39cbdd62)
2017-03-07 12:52:29.201+0000: 30196: info : virObjectUnref:261 : OBJECT_DISPOSE: obj=0x7f8d4c002660
2017-03-07 12:52:29.201+0000: 30196: debug : virDomainDispose:316 : release domain 0x7f8d4c002660 VM5 78ae390d-ffba-4116-b3f7-5fdf39cbdd62
2017-03-07 12:52:29.201+0000: 30196: debug : virDomainFree:565 : dom=0x7f8d4c000930, (VM: name=VM1, uuid=3f18b745-f9d8-44fa-a1c3-ee7bacafecf8)
2017-03-07 12:52:29.201+0000: 30196: info : virObjectUnref:261 : OBJECT_DISPOSE: obj=0x7f8d4c000930
2017-03-07 12:52:29.201+0000: 30196: debug : virDomainDispose:316 : release domain 0x7f8d4c000930 VM1 3f18b745-f9d8-44fa-a1c3-ee7bacafecf8
2017-03-07 12:52:29.201+0000: 30196: debug : virThreadJobClear:121 : Thread 30196 (virNetServerHandleJob) finished job remoteDispatchConnectGetAllDomainStats with ret=0

Finally finished. There's at least one successful getAllDomainStats call in the meantime in the libvirt log, but it queries only VM3. So it seems that the getAllDomainStats libvirt call blocks occasionally, and other calls made in the meantime don't provide information about some VMs (probably those not yet processed in the stuck call). I don't know what could cause the blocking; maybe we could see something in the QEMU logs?
Hello Milan,

Thank you for looking into this. I'm running with:

libvirt-client-2.0.0-10.el7_3.5.x86_64
libvirt-daemon-2.0.0-10.el7_3.5.x86_64

I can't understand what is causing this either. The thing is that it happens on all of my setups (I have 4 different ones), both on 4.0.4 and 4.1.1.3, and on different servers (8 different HWs), and it happens every few hours. Maybe it is somehow related to our lab/storage, because I asked other QE people around and it happens on their environments as well. From the logs I don't see the root cause of this.
Created attachment 1261265 [details] qemu logs
Thank you for the QEMU logs; unfortunately, there is nothing there. So all I can say is that libvirt apparently blocks from time to time on retrieving storage stats, which is most likely what eventually results in the temporary NotResponding status of some VMs. It indeed looks like a problem with the storage in your lab.
Tal, can you think of any way to further troubleshoot this, perhaps at the storage access level?
I took a look at the storage domain stats in the vdsm log (lines containing 'Run and protect: repoStats, Return response:') and the 'delay' field never showed read latency greater than 0.0281469 seconds. This suggests that the storage remained accessible the whole time.

The only interesting thing I found was in libvirtd.log:

2017-03-06 06:23:12.530+0000: 27478: error : virNetSocketReadWire:1623 : End of file while reading data: Input/output error
2017-03-06 06:28:46.997+0000: 27478: error : virNetSocketReadWire:1623 : End of file while reading data: Input/output error

but these could be nothing.

It looks like you are doing live migration? Does this happen without live migration? Does it happen if you do not use VM leases?
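For anyone repeating that check, here is a rough Python equivalent of grepping the 'delay' field. This is a sketch, not vdsm code: the function name is mine, and it assumes the quoted-string 'delay' format shown in the repoStats responses above.

```python
import re


def max_repostats_delay(lines):
    """Return the largest 'delay' value (seconds) seen in repoStats lines.

    Assumes the vdsm log format where the response contains entries such
    as 'delay': '0.0281469' (the value is quoted as a string).
    """
    delays = []
    for line in lines:
        if "repoStats" in line:
            delays.extend(
                float(v) for v in re.findall(r"'delay': '([0-9.]+)'", line))
    return max(delays, default=0.0)


sample = [
    "Run and protect: repoStats, Return response: {'delay': '0.0281469'}",
    "Run and protect: repoStats, Return response: {'delay': '0.0003112'}",
]
print(max_repostats_delay(sample))  # largest delay seen in the sample
```

If the maximum stays well below a second, storage reads are being served promptly even while the engine reports VMs as not responding.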
Nir suggested I look at the lastCheck values in the vdsm log, so I used the following pipeline to get a sorted histogram of lastCheck values:

grep -io "'lastCheck': '[[:digit:]]*\.[[:digit:]]*'" vdsm.log | awk '{print $2}' | tr "'" " " | awk -F . '{print $1}' | sort -rn | uniq -c

Fields are: <nr occurrences> <lastCheck value>

   4 36
   4 32
   8 31
   4 29
   4 27
   8 25
  12 21
   4 20
   4 19
  12 16
   4 13
   4 11
   4 10
  94 9
  98 8
 243 7
1490 6
 463 5
 426 4
 580 3
 597 2
3184 1
 981 0

lastCheck values above 10s mean problematic storage (perhaps not performant enough to handle your use case).
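The same histogram can be computed without the shell pipeline; here is a rough Python equivalent. The function names are mine, and it assumes the same quoted lastCheck format as the grep above.

```python
import re
from collections import Counter


def lastcheck_histogram(lines):
    """Count occurrences of each whole-second lastCheck value,
    mirroring the shell pipeline above."""
    counts = Counter()
    for line in lines:
        for m in re.finditer(r"'lastCheck': '(\d+)\.\d+'", line):
            counts[int(m.group(1))] += 1
    return counts


def print_histogram(counts):
    # Same shape as `sort -rn | uniq -c`:
    # <nr occurrences> <lastCheck value>, highest value first.
    for value in sorted(counts, reverse=True):
        print("%7d %d" % (counts[value], value))
```

Feeding it the vdsm.log lines and checking for keys above 10 flags the same problematic domains the pipeline does.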
Can you attach sanlock.log? if we have storage issues, sanlock will most likely complain about them.
Thank you Adam, we will investigate our NFS storage domain then; it is possible that we are facing some bad performance on the storage. My use case is actually pretty simple: I have some VMs just running, nothing special. Some of them have leases and some don't. Yes, I did some live migrations, but the warnings appear long after the migration has ended. I will attach the sanlock.log in a few minutes.
2017-03-08 18:53:02+0200 727729 [15793]: s1 renewal error -202 delta_length 10 last_success 727698
2017-03-08 18:53:20+0200 727746 [15793]: b5effbba aio collect RD 0x7f3cb40008c0:0x7f3cb40008d0:0x7f3cb93f5000 result 1048576:0 other free
2017-03-08 18:53:20+0200 727746 [15855]: 68f39e69 aio collect RD 0x7f3cac0008c0:0x7f3cac0008d0:0x7f3cc806f000 result 1048576:0 other free
2017-03-08 18:53:20+0200 727746 [15849]: 4638f290 aio collect RD 0x7f3cb00008c0:0x7f3cb00008d0:0x7f3cc8273000 result 1048576:0 other free

The errors in the log occur at the same time the warnings appear on the engine.
Created attachment 1261359 [details] sanlock
The attached patch does not resolve the storage issue (we cannot work with such storage), but it adds warnings when a storage domain checker is blocked for 10 seconds. This will make it trivial to detect storage issues by grepping for warnings in the vdsm logs.
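To illustrate the idea behind the patch (this is a sketch, not the actual vdsm change; all names here are mine): track when each domain check was issued and warn once the outstanding check passes the 10-second threshold.

```python
import logging
import time

log = logging.getLogger("storage.check")

# Threshold matching the 10-second warning described above.
BLOCK_WARN_INTERVAL = 10.0


class CheckTimer:
    """Illustrative sketch only: track how long a storage domain check
    has been outstanding and warn when it exceeds BLOCK_WARN_INTERVAL."""

    def __init__(self, path):
        self.path = path
        self.started = None  # monotonic timestamp of the current check

    def start(self):
        # Called when a check is issued against the domain's metadata.
        self.started = time.monotonic()

    def maybe_warn(self):
        # Returns the blocked time if over the threshold, else None.
        if self.started is None:
            return None
        blocked = time.monotonic() - self.started
        if blocked >= BLOCK_WARN_INTERVAL:
            log.warning("Checker %r is blocked for %.2f seconds",
                        self.path, blocked)
            return blocked
        return None
```

Grepping the resulting WARN lines then gives a direct timeline of storage stalls, which is exactly what the verification below relies on.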
(In reply to Milan Zamazal from comment #17)
> So all I'm able to say is that libvirt apparently blocks from time to time
> on retrieving storage stats, which is most likely the thing that eventually
> results in temporary NotResponding status of some VMs. It indeed looks like
> some problem with storage in your lab.

Milan, why does libvirt/qemu have to block when getting storage stats? This smells like a design issue in qemu. QEMU should collect stats while doing real I/O and report the collected stats without blocking. This is the way vdsm reports storage stats, and the same goes for sanlock: you can get stats about lease renewals in a non-blocking way, even if sanlock threads are blocked on inaccessible storage. Can you open a libvirt/qemu bug about this?
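The non-blocking pattern described here can be sketched as follows (class and method names are mine, purely illustrative): the I/O path updates a shared snapshot as work happens, and readers return the latest snapshot immediately instead of issuing a query that may block on storage.

```python
import threading
import time


class StatsCollector:
    """Sketch of the non-blocking stats-reporting pattern described
    above: stats are accumulated on the I/O path, and readers get the
    last known snapshot without touching storage."""

    def __init__(self):
        self._lock = threading.Lock()
        self._snapshot = {"reads": 0, "last_update": None}

    def record_read(self):
        # Called from the thread doing the real I/O.
        with self._lock:
            self._snapshot["reads"] += 1
            self._snapshot["last_update"] = time.monotonic()

    def get_stats(self):
        # Never blocks on storage: only takes a short in-memory lock
        # and returns a copy of the last known snapshot.
        with self._lock:
            return dict(self._snapshot)
```

With this design a stuck I/O thread leaves the snapshot stale (visible via last_update) rather than blocking every caller that asks for stats, which is the behavior vdsm and sanlock already provide.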
Done: https://bugzilla.redhat.com/1430847
Looks like this bug depends on bug 1430847.
Changing the scope to logging improvements. Do not expect a real solution any time soon; it's indeed an architectural limitation in qemu and libvirt.
2 patches are merged, 1 test fix pending. Tal, do you want to target this to 4.1.2?
(In reply to Nir Soffer from comment #30) > 2 patches are merged, 1 test fix pending. > > Tal, do you want to target this to 4.1.2? Yes please
Verified with:

Red Hat Virtualization Manager Version: 4.1.2.1-0.1.el7

Host:
OS Version: RHEL - 7.3 - 7.el7
Kernel Version: 3.10.0 - 514.16.1.el7.x86_64
KVM Version: 2.6.0 - 28.el7_3.9
LIBVIRT Version: libvirt-2.0.0-10.el7_3.5
VDSM Version: vdsm-4.19.12-1.el7ev

Step: Set latency on the SPM host with 'tc' and check the vdsm log for the WARN message.

vdsm log:

2017-05-17 09:45:36,589+0300 WARN (check/loop) [storage.check] Checker '/dev/d0e0d37d-23f5-4c94-bb76-fb7453c07f10/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:45:37,261+0300 WARN (check/loop) [storage.check] Checker '/dev/25f23b38-42f9-4fec-af32-4ab3b8d743eb/metadata' is blocked for 9.99 seconds (check:279)
2017-05-17 09:45:37,823+0300 WARN (check/loop) [storage.check] Checker '/dev/5449c95e-b387-4d19-8f58-a22000608f3c/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:46:13,086+0300 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__1/bd80429f-b1fd-4a5d-bac6-64f025df4ed3/dom_md/metadata' is blocked for 10.04 seconds (check:279)
2017-05-17 09:46:43,045+0300 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__1/bd80429f-b1fd-4a5d-bac6-64f025df4ed3/dom_md/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:47:13,045+0300 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__1/bd80429f-b1fd-4a5d-bac6-64f025df4ed3/dom_md/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:47:16,585+0300 WARN (check/loop) [storage.check] Checker '/dev/d0e0d37d-23f5-4c94-bb76-fb7453c07f10/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:47:23,048+0300 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__1/bd80429f-b1fd-4a5d-bac6-64f025df4ed3/dom_md/metadata' is blocked for 20.00 seconds (check:279)
2017-05-17 09:47:42,035+0300 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__2/d71ee3d7-c1ab-4645-8f75-29ff79315d1c/dom_md/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:47:42,126+0300 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__0/a629d603-00a0-439f-b632-2c7a795092da/dom_md/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:47:53,051+0300 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__1/bd80429f-b1fd-4a5d-bac6-64f025df4ed3/dom_md/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:48:12,033+0300 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__2/d71ee3d7-c1ab-4645-8f75-29ff79315d1c/dom_md/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:48:12,124+0300 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__0/a629d603-00a0-439f-b632-2c7a795092da/dom_md/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:48:42,030+0300 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__2/d71ee3d7-c1ab-4645-8f75-29ff79315d1c/dom_md/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:48:47,262+0300 WARN (check/loop) [storage.check] Checker '/dev/25f23b38-42f9-4fec-af32-4ab3b8d743eb/metadata' is blocked for 10.00 seconds (check:279)
2017-05-17 09:49:12,033+0300 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__2/d71ee3d7-c1ab-4645-8f75-29ff79315d1c/dom_md/metadata' is blocked for 9.99 seconds (check:279)
2017-05-17 09:49:23,049+0300 WARN (check/loop) [storage.check] Checker u'/rhev/data-center/mnt/glusterSD/gluster-server02.qa.lab.tlv.redhat.com:_virt__local__ge2__volume__1/bd80429f-b1fd-4a5d-bac6-64f025df4ed3/dom_md/metadata' is blocked for 10.00 seconds (check:279)