Created attachment 1312717 [details]
engine.log, vdsm.log, supervdsm.log

Description of problem:
When blocking the connection between the host and storage, a running VM doesn't switch to paused mode.

Version-Release number of selected component:
ovirt-engine-4.2.0-0.0.master.20170802210223.git338a2bc.el7.centos.noarch
vdsm-4.20.2-41.gitd789a5e.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run a dd command on a running VM.
2. (While the command is running) block the connection to the storage domain.

Actual results:
VM switches states between: unknown, up, not responding, paused (back and forth).

Expected results:
The VM should pause.

Additional info:
VM is not defined as HA.
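For reference, the connection block in step 2 is done with iptables, dropping outgoing traffic from the host to the storage server. A minimal sketch of how such a rule could be driven from a test script - the storage address and the helper function are hypothetical, not taken from the actual test framework:

```python
import subprocess

STORAGE_IP = "10.35.0.10"  # hypothetical storage server address


def iptables_cmd(ip, block=True):
    """Build an iptables command that drops (or restores) outgoing
    traffic from this host to the given storage address."""
    action = "-A" if block else "-D"
    return ["iptables", action, "OUTPUT", "-d", ip, "-j", "DROP"]


# On the host (as root):
#   subprocess.check_call(iptables_cmd(STORAGE_IP))         # block
#   subprocess.check_call(iptables_cmd(STORAGE_IP, False))  # restore
```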
Can you verify it's not a QEMU/libvirt bug? What is the status of the VM in libvirt?
- back and forth between what states? all four you mentioned?
- point to a specific time frame and entity
- describe the storage setup and how you blocked the connection
Created attachment 1313602 [details]
engine, vdsm, libvirt logs

(In reply to Yaniv Kaul from comment #1)
> Can you verify it's not a QEMU/libvirt bug? What is the status of the VM in
> libvirt?

Below is the list of statuses the engine displayed. During that time I sampled the status of the VM in libvirt; the test began around 11:28, and only at around 12:00 did the status change from 'running' to 'paused' in libvirt (and it remained that way).

(In reply to Michal Skrivanek from comment #2)
> - back and forth between what states? all four you mentioned?

It goes like this:
Not responding -> Unknown -> Up* -> Migrating -> Not responding -> Unknown -> Not responding -> Migrating -> Not responding -> Unknown -> Up* -> Unknown -> Up -> Migrating -> ... -> Paused** -> Not responding -> Paused -> Not responding -> Unknown (with a message "vm has been paused") -> Unknown (without any message at all)

* It shows that the VM is up and running on the same host that has no connection to the storage.
** Up until this point the VM was reported as running by libvirt; from this moment on, libvirt reported the VM as paused.

> - point to a specific time frame and entity

(Attached new logs.)
Connection blockage starts at around 2017-08-14 11:27.
VM name: vm_nfs_1411250142

> - describe the storage setup and how you blocked the connection

Engine (not hosted), 3 hosts, 3 NFS SDs (on the same storage server), 3 iSCSI SDs (on the same storage server), and 3 GlusterFS SDs (again, all 3 on the same storage server).

The VM's disk was on NFS.

The connection was blocked using iptables (blocking the outgoing connection from host to storage).
(In reply to Natalie Gavrielov from comment #3)
> Created attachment 1313602 [details]
> engine,vdsm, libvirt logs
> 
> (In reply to Yaniv Kaul from comment #1)
> > Can you verify it's not a QEMU/libvirt bug? What is the status of the VM in
> > libvirt?
> 
> There is the list of statuses engine displayed below, during that time I
> sampled the status of the VM in libvirt, the test began around 11:28 and
> only at around 12 the status changed from 'running' to 'paused' in libvirt
> (and remained that way).
> 
> (In reply to Michal Skrivanek from comment #2)
> > - back and forth between what states? all four you mentioned?
> 
> It goes like this:
> Not responding -> Unknown -> Up* -> Migrating -> Not responding -> Unknown
> -> Not responding -> Migrating -> Not responding -> Unknown -> Up* ->
> Unknown -> Up -> Migrating -> ... -> Paused** -> Not responding -> Paused ->
> Not responding -> Unknown (with a message "vm has been paused") -> Unknown
> (without a message at all)
> 
> * It shows that the vm is up and running on the same host that doesn't have
> connection to the storage.
> ** Up until this point the vm was reported as running by libvirt - since
> this moment on, libvirt reported vm is paused
> 
> > - point to a specific time frame and entity
> (attached new logs)
> Connection blockage starts at around 2017-08-14 11:27

The engine log ends at 2017-08-14 02:55:47 at night. Can you please provide the next logs, which contain the specific time frame?

> vm name: vm_nfs_1411250142
> 
> > - describe the storage setup and how you blocked the connection
> engine (not hosted), 3 hosts, 3 nfs SD (on the same storage server), 3 iscsi
> SD (on the same storage server), and 3 glusterFS (again, all 3 on the same
> storage server).
> 
> The VM's disk was on NFS.
> 
> The connection was blocked using iptables (block outgoing connection from
> host to storage)
Created attachment 1324928 [details]
engine.log
There are several issues, but it seems the core problem lies in infra. The sequence goes like this:

Engine detects:
2017-08-14 11:27:57,854+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [536b9c5e] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_1 command GetStatsVDS failed: Heartbeat exceeded

The corresponding side on vdsm says:
2017-08-14 11:28:26,929+0300 WARN (JsonRpc (StompReactor)) [vds.dispatcher] unhandled write event (betterAsyncore:160)
2017-08-14 11:28:26,934+0300 WARN (JsonRpc (StompReactor)) [vds.dispatcher] unhandled close event (betterAsyncore:160)

This causes the VM state to be NotResponding. Before the host starts VdsNotRespondingTreatment, the storage is detected as non-operational, so the host status is set to NonOperational and the VMs to Unknown. Soft fencing restarts vdsm, which recovers the VMs, starts responding (after reporting "Recovering from crash or initializing"), and reports an accurate VM status (Up). Immediately afterwards the engine attempts to reconnect storage, which fails with the same heartbeat exception. The host is set to NonOperational. That now triggers a migration to another host (since the VM was Up), but that's not really significant. Eventually, after the vdsm timeout on the storage monitor, the VM is again reported as NotResponding, while the heartbeat-exceeded network exception triggers another round of soft fencing.

Ultimately the core issue is the network exception, which IMO should not happen.

Piotr, please take a look.
Checking the logs I see the following sequence:

1. Vdsm responds to VM.getStats:
2017-08-14 11:27:53,254+0300 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call VM.getStats succeeded in 0.00 seconds (__init__:604)

2. Possibly mom connecting to vdsm:
2017-08-14 11:27:53,522+0300 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:51780 (protocoldetector:74)

3. Last RPC response, possibly to mom:
2017-08-14 11:27:55,011+0300 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:604)

4. Engine times out waiting for the response to VM.getStats:
2017-08-14 11:27:57,855+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler6) [536b9c5e] Command 'GetStatsVDSCommand(HostName = host_mixed_1, VdsIdAndVdsVDSCommandParametersBase:{hostId='1546ec3c-2eb4-4ff0-a981-1c7e74cf3c37', vds='Host[host_mixed_1,1546ec3c-2eb4-4ff0-a981-1c7e74cf3c37]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded

5. Engine closes the connection, which ends up being logged on the vdsm side as:
2017-08-14 11:28:26,929+0300 WARN (JsonRpc (StompReactor)) [vds.dispatcher] unhandled write event (betterAsyncore:160)
2017-08-14 11:28:26,934+0300 WARN (JsonRpc (StompReactor)) [vds.dispatcher] unhandled close event (betterAsyncore:160)

My question is whether we see any spikes in cpu or network utilization. I am looking for a reason why the VM.getStats response did not arrive at the engine on time even though it was sent from vdsm. I wonder whether we are facing network congestion or a compute issue caused by vdsm. Is vdsm running in a VM or on bare metal?
But isn't that exception raised too early for a heartbeat timeout? If the getStats is the one which failed - since it's the last engine-vdsm communication - then it's just 11:27:53 on the vdsm side and the engine network exception at 11:27:57. And IIUC the timeout for heartbeat is 30s now?
We do not see when the command/verb was called; we do not log it. The interval was tested before and we did not see any issues with it. There have been no changes in this code for a long time, so for the moment I assume that vdsm is to blame here.
It would help to get the engine logs in debug so we could better understand timing from the engine perspective.
I'm not saying it's small. But doesn't it mean it's a call from 30 seconds before that? The log doesn't say when it starts, but it does log the duration when it finishes (0s here), so is it wrong to assume it was called at that time?
The log does not say how long it takes for a frame to arrive and come back. We only know how long it took to execute it on the vdsm side. We know that the message was processed at 2017-08-14 11:27:53,254+0300 and that it took 0s, but we do not know how long it took to get there and how long it took for the engine to receive the response. If the network was congested or the cpu occupied enough, the steps that we do not know about could potentially take longer than 30 secs. That is why we need engine logs in debug to know the specific numbers.
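To illustrate the timing question being debated: the heartbeat mechanism only tracks when the last frame arrived from the peer; it knows nothing about when a request was issued or how long it spent in transit. A minimal sketch of such a watchdog - the class and the 30 s constant are illustrative, not vdsm's actual implementation:

```python
import time

HEARTBEAT_TIMEOUT = 30  # seconds, per the discussion above


class HeartbeatMonitor:
    """Flags 'heartbeat exceeded' once no frame has arrived from the
    peer for longer than the timeout."""

    def __init__(self, timeout=HEARTBEAT_TIMEOUT, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = clock()

    def on_frame(self):
        # Any incoming frame (response or heartbeat) resets the timer.
        self.last_seen = self.clock()

    def exceeded(self):
        return self.clock() - self.last_seen > self.timeout
```

Note that a frame logged as processed in 0 s on the vdsm side only resets such a timer once it actually reaches the engine; transit delays are invisible to both logs, which is why debug-level timing from the engine side is needed.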
Could you please provide the debug logs?
(In reply to Piotr Kliczewski from comment #7)
> I wonder whether we face network congestion or compute issue caused by vdsm.
> Is vdsm running in a vm or bare metal?

VM.

(In reply to Tomas Jelinek from comment #13)
> Could you please provide the debug logs?

I tried running the scenario described again so I can provide debug logs, with build:
ovirt-engine-4.2.0-0.0.master.20170904111624.git842d79b.el7.centos.noarch
Now the bug won't reproduce; it works as expected - the VM switches state from running to paused when blocking the connection between the host and the storage domain.
I guess these logs will be useless. I'll try to find an environment with the same build to see if it will reproduce.
Created attachment 1333227 [details]
engine.log, vdsm.log DEBUG level

(In reply to Natalie Gavrielov from comment #14)
> (In reply to Tomas Jelinek from comment #13)
> > Could you please provide the debug logs?
> 
> I tried running the scenario described again so I can provide debug logs,
> with build:
> ovirt-engine-4.2.0-0.0.master.20170904111624.git842d79b.el7.centos.noarch
> Now the bug won't recreate, it works as expected - VM switches state from
> running to paused when blocking the connection between the host and storage
> domain..
> guess these logs will be useless.. I'll try to find an environment with the
> same build to see if it will reproduce.

Apparently when I tried reproducing this I used iscsi instead of nfs :0
Now I've attached debug logs for the engine and vdsm.

Builds used:
ovirt-engine-4.2.0-0.0.master.20170929123516.git007c392.el7.centos.noarch
vdsm-4.20.3-121.git77235c7.el7.centos.x86_64

It's an environment with 3 hosts. The connection block to the storage domain started at around 2017-10-02 15:00, and during the "switch states" described in comment 3 there were a few attempts to migrate the VM; here is one of them:

2017-10-02 15:03:14,871+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-90) [49fc1b2e] Failed to migrate VM 'vm_TestCaseBlockedConnecti_0214574984'
Based on the logs I see a heartbeat issue:

2017-10-02 15:03:14,742+03 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-63) [] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded

which was caused by a vdsm crash:

2017-10-02 15:03:13,673+0300 DEBUG (periodic/1) [storage.TaskManager.Task] (Task='e86cb403-e3a0-4017-ac8a-9392e906412b') ref 0 aborting False (task:999)
2017-10-02 15:03:46,874+0300 INFO (MainThread) [vds] (PID: 28763) I am the actual vdsm 4.20.3-121.git77235c7.el7.centos storage-jenkins-ge4-vdsm1.scl.lab.tlv.redhat.com (3.10.0-693.el7.x86_64) (vdsmd:148)

Btw, I see many blocked tasks which could be caused by the famous libvirt lock.
Based on the above, adding a dependency on BZ1496517 and targeting 4.2.
The underlying BZ1497197 was verified and released in 7.4.z within an October 19th batch, so moving to ON_QA.
Created attachment 1359161 [details]
logs: vdsm, engine

Performed the scenario described in comment 0; now the VM switches state to "Not Responding" (showing a question mark icon), and it does not switch to "VM paused" at any point, as it should.

Builds:
ovirt-engine-4.2.0-0.0.master.20171124141652.gita5b4f9b.el7.centos.noarch
vdsm-4.20.8-29.git6ceb389.el7.centos.x86_64
(In reply to Natalie Gavrielov from comment #19)
> Created attachment 1359161 [details]
> logs: vdsm, engine
> 
> Performed the scenario described in comment 0, now the VM switches state to
> "Not Responding" (showing question mark icon), it does not switch state to
> "VM paused" at any point as should be.
> 
> Builds:
> ovirt-engine-4.2.0-0.0.master.20171124141652.gita5b4f9b.el7.centos.noarch
> vdsm-4.20.8-29.git6ceb389.el7.centos.x86_64

Are you sure it's not a libvirt/QEMU bug?
(In reply to Natalie Gavrielov from comment #14)
> (In reply to Piotr Kliczewski from comment #7)
> 
> > I wonder whether we face network congestion or compute issue caused by vdsm.
> > Is vdsm running in a vm or bare metal?
> 
> VM.

Can you try to reproduce on bare metal? Is the nested VM perhaps choked on resources?

> (In reply to Tomas Jelinek from comment #13)
> > Could you please provide the debug logs?
> 
> I tried running the scenario described again so I can provide debug logs,
> with build:
> ovirt-engine-4.2.0-0.0.master.20170904111624.git842d79b.el7.centos.noarch

This is quite an ancient build?

> Now the bug won't recreate, it works as expected - VM switches state from
> running to paused when blocking the connection between the host and storage
> domain..
> guess these logs will be useless.. I'll try to find an environment with the
> same build to see if it will reproduce.
(In reply to Natalie Gavrielov from comment #19)
> Created attachment 1359161 [details]
> logs: vdsm, engine
> 
> Performed the scenario described in comment 0, now the VM switches state to
> "Not Responding" (showing question mark icon), it does not switch state to
> "VM paused" at any point as should be.
> 
> Builds:
> ovirt-engine-4.2.0-0.0.master.20171124141652.gita5b4f9b.el7.centos.noarch
> vdsm-4.20.8-29.git6ceb389.el7.centos.x86_64

Are you sure that you are using libvirt-python-3.2.0-3.el7_4.1, which should contain the fix for BZ1481022 in RHEL 7.4.z?
(In reply to Yaniv Kaul from comment #21)
> (In reply to Natalie Gavrielov from comment #14)
> > (In reply to Piotr Kliczewski from comment #7)
> Can you try to reproduce on bare-metal? Is the nested VM perhaps choked on
> resources?

I'm not sure resources have anything to do with it; when I run the same test with iscsi storage it works fine (same environment). Looking for a bare-metal env in the meantime.

> > (In reply to Tomas Jelinek from comment #13)
> > > Could you please provide the debug logs?
> > 
> > I tried running the scenario described again so I can provide debug logs,
> > with build:
> > ovirt-engine-4.2.0-0.0.master.20170904111624.git842d79b.el7.centos.noarch
> 
> This is quite an ancient build?

I have a nested env with rhvm-4.2.0-0.5.master.el7.noarch; it behaves the same.

(In reply to Martin Perina from comment #22)
> Are you sure that you are using libvirt-python-3.2.0-3.el7_4.1 which should
> contain the fix for BZ1481022 in RHEL 7.4.z?

On the nested env I'm working on, with builds:
rhvm-4.2.0-0.5.master.el7.noarch
vdsm-4.20.8-1.el7ev.x86_64
libvirt-python-3.2.0-3.el7_4.1.x86_64
it still happens.
(In reply to Natalie Gavrielov from comment #19)
> Created attachment 1359161 [details]
> logs: vdsm, engine
> 
> Performed the scenario described in comment 0, now the VM switches state to
> "Not Responding" (showing question mark icon), it does not switch state to
> "VM paused" at any point as should be.

This statement seems to be incorrect:

2017-11-26 15:25:24,749+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-3) [] VM 'aa091767-2da3-4c7d-b45c-26dd2fff3f48'(vm_TestCaseBlockedConnecti_2614572831) moved from 'Up' --> 'Paused'

Please specify exactly which VM(s) we are talking about and at which point, because I see a couple of state switches to 'Paused'.
Created attachment 1362839 [details]
logs: vdsm, engine

Bare-metal env, engine with 2 hosts.

From engine.log:
2017-12-04 22:53:49,281+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-73) [] VM '457c50e6-c794-4d50-bf42-a443f926f8b9'(test-vm1) moved from 'Up' --> 'NotResponding'

During the entire test the VM was in NotResponding state. Removed the rule from iptables at 04/12/17 23:26:14. Then this came out in engine.log:

2017-12-04 23:26:22,448+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-1) [] VM '457c50e6-c794-4d50-bf42-a443f926f8b9'(test-vm1) moved from 'NotResponding' --> 'Paused'
2017-12-04 23:26:32,294+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-2) [] VM '457c50e6-c794-4d50-bf42-a443f926f8b9'(test-vm1) moved from 'Paused' --> 'Up'

It seems that the VM becomes paused only when the connection is restored.

Builds:
ovirt-engine-4.2.0-0.0.master.20171124141652.gita5b4f9b.el7.centos.noarch
vdsm-4.20.7-1.el7.centos.x86_64
libvirt-python-3.2.0-3.el7_4.1.x86_64
I've looked at the latest logs, and from the infra point of view the host statuses are fine:

1. At 2017-12-04 22:44:05,678+02 VM 'test-vm1' was started on host 'h1'
2. At 2017-12-04 22:53:49,281+02 VM 'test-vm1' was moved to NotResponding
3. At 2017-12-04 22:59:32,085+02 host 'h1' was moved to NonOperational
4. At 2017-12-04 23:00:00,301+02 engine failed to autorecover host 'h1'
5. At 2017-12-04 23:03:38,881+02 data center 'DC1' moved to NonOperational as host 'h1' is the SPM
6. Several attempts to recover host 'h1', all unsuccessful
7. At 2017-12-04 23:26:22,448+02 VM 'test-vm' moved from NotResponding to Paused
8. At 2017-12-04 23:26:32,294+02 VM 'test-vm' moved from Paused to Up
9. At 2017-12-04 23:26:35,565+02 host 'h1' was moved back to Up

So from the infra point of view, host status changes are working fine. Tomas, could someone from the virt team take a look at whether the VM statuses also changed as expected?
Based on the recent logs I see that VM monitoring was not run between 2017-12-04 22:53:49,281+02 and 2017-12-04 23:26:22,448+02. The question is why the VMs were not updated properly.
Similar behaviour has been tested with gluster storage, with a problematic though slightly different result: cycling of VM statuses that ends with a pause "due to unknown storage error" instead of the expected "I/O error". Here are the engine reports for this scenario:

08:50:13 (vm_gluster) moved from 'Up' --> 'NotResponding'
08:50:42 vm_gluster moved from 'NotResponding' --> 'Up'
08:50:43 (vm_gluster) moved from 'Up' --> 'NotResponding'
08:50:48 (vm_gluster) moved from 'NotResponding' --> 'Paused'
VM vm_gluster has been paused.
VM vm_gluster has been paused due to unknown storage error.
08:50:59 (vm_gluster) moved from 'Paused' --> 'NotResponding'
08:51:13 (vm_gluster) moved from 'NotResponding' --> 'Paused'
VM vm_gluster has been paused.
VM vm_gluster has been paused due to unknown storage error.

This bug prevents testing of the VM Resume Behaviour features with NFS and gluster. Only iSCSI can be tested during this phase.
There are no obvious errors anywhere (other than what's expected). It does look like the VM stats call is still getting stuck - it looks like the call never returns. Since the heartbeat now works, the engine side supposedly just hangs and no further monitoring cycles are scheduled - that would be quite bad.

Unfortunately these frequent API calls are not being logged by default. Please enable debug log level on vdsm and the engine, and provide libvirt.log as well.
(In reply to Michal Skrivanek from comment #29)
> There are no obvious errors anywhere (other than what's expected)
> It does look like the VM stats call is still getting stuck. It looks like
> the call never returns. Since the heartbeat now works the engine side
> supposedly just hangs and no further monitoring cycles are scheduled - that
> would be quite bad.

Don't forget that we have a higher timeout which kicks in when a response does not arrive on time. It has been in place since xmlrpc, where we had a new connection for every call. Now we maintain the connection and use heartbeats to detect any issues.

> Unfortunately these frequent API calls are not being logged by default
> Please enable debug log level on vdsm and engine, and provide also
> libvirt.log
Could you please explain how to set DEBUG log level for engine.log and vdsm.log?
Created attachment 1365632 [details]
logs, DEBUG level: engine, vdsm, libvirt

(In reply to Michal Skrivanek from comment #29)
> There are no obvious errors anywhere (other than what's expected)
> It does look like the VM stats call is still getting stuck. It looks like
> the call never returns. Since the heartbeat now works the engine side
> supposedly just hangs and no further monitoring cycles are scheduled - that
> would be quite bad.
> 
> Unfortunately these frequent API calls are not being logged by default
> Please enable debug log level on vdsm and engine, and provide also
> libvirt.log

Builds used for testing:
ovirt-engine-4.2.0-0.5.master.el7.noarch
vdsm-4.20.8-1.el7ev.x86_64

Connection was blocked at around 16:40.
Connection was restored at around 16:55.
VM name: vm_TestCaseBlockedConnecti_1016333518
(In reply to Natalie Gavrielov from comment #32)

The libvirt log seems to be from a wrong time - please check the timezone.

Other than that, I can see the host responds in a correct way, with monitorResponse -1 signifying that libvirt periodic monitoring is not responding. We need libvirt logs from the right period - but since it's NFS it could be just the timeout - check the NFS mount parameters as well.
Created attachment 1365999 [details]
libvirt logs

(In reply to Michal Skrivanek from comment #33)
> (In reply to Natalie Gavrielov from comment #32)
> libvirt log seems to be from a wrong time - please check timezone
> 
> other than that - I can see the host responds in a correct way, with
> monitorResponse -1 signifying that libvirt periodic monitoring is not
> responding.
> Need libvirt logs from the right period - but since it's NFS it could be
> just the timeout - check NFS mount parameters as well.

Attached libvirt logs (for all 3 hosts) with the relevant time frame.
Can you please specify exactly which NFS mount parameters to check?
(In reply to Natalie Gavrielov from comment #34)
> Attached libvirt logs (for all 3 hosts) with the relevant time frame.
> Can you please specify exactly which parameters for NFS mount to check?

I see it's stuck on the libvirt side until the last call at:
2017-12-10 14:56:38.707+0000: 28509: error : qemuDomainObjBeginJobInternal:3839 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)

In the vdsm log it starts reacting again at 2017-12-10 16:56:49,008+0200 and is reported as Up by vdsm (via a VM status event), but the VM is shut down shortly after, at 2017-12-10 16:56:59,502+0200 (vdsm API request). You did that manually from the engine, as logged in engine.log:

2017-12-10 16:56:59,463+02 INFO [org.ovirt.engine.core.bll.StopVmCommand] (default task-2) [vms_syncAction_671d3681-85d0-4ec9] Running command: StopVmCommand(StopVmType = NORMAL, StopReason = null, VmId = 2cd65ab8-6294-493e-a1fb-580c4475f041, QuotaId = null) internal: false. Entities affected : ID: 2cd65ab8-6294-493e-a1fb-580c4475f041 Type: VMAction group STOP_VM with role type USER

There is one more report right after the Up event is sent - at 2017-12-10 16:56:50,393+0200 vdsm sends monitorResponse -1 again as a response to Host.getAllVmStats. It may be due to the last monitoring cycle still not getting any data (the Up event is sent just 0.8s earlier). This shouldn't cause much of an issue other than a one-time blip in the VM status.

On the engine side I see the VM was NotResponding during the outage - which seems to be correct - then reported as Up (2017-12-10 16:56:49,021+02), then again NotResponding (2017-12-10 16:56:50,400+02), and then, after the shutdown, Down (2017-12-10 16:57:02,784+02).

I do not see any problem with the VM status handling.
As for NFS, you should check the timeout values after which the kernel actually reports an I/O error. The default is 6m IIRC, but it really depends on your client and server parameters. If you believe the underlying storage should have returned an I/O error during your test period, this should be looked at by the libvirt team. In that case please supply the parameters and the journal log.
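The "it depends on your parameters" point can be made concrete. Per nfs(5), the client waits `timeo` tenths of a second before retransmitting, and declares a major timeout after `retrans` retries. A rough sketch of the arithmetic, assuming the doubling-on-retry behavior described in nfs(5) and ignoring the per-retry cap:

```python
def nfs_major_timeout(timeo_ds, retrans):
    """Seconds until the first major timeout: the initial wait is
    timeo (in tenths of a second), and each of the `retrans`
    retransmissions doubles the previous wait."""
    wait = timeo_ds / 10.0
    total = 0.0
    for _ in range(retrans + 1):
        total += wait
        wait *= 2
    return total
```

With the common TCP defaults timeo=600 and retrans=2 this gives 60 + 120 + 240 = 420 seconds, i.e. about 7 minutes - in the same ballpark as the "6m" figure above, and very different from the near-immediate error you get on FC.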
When you access a file on that NFS _on the host_, you should get EIO after that timeout. That would verify your NFS is set up correctly, and then the VM should behave the same - only when it differs in behavior is it a valid (likely libvirt) bug.
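The host-side check suggested above can be scripted: try to read a file on the NFS mount and see whether the kernel eventually returns EIO. A minimal sketch - the probe function and the example path are illustrative, not part of any existing tooling:

```python
import errno


def probe_eio(path):
    """Read one byte from `path`; return True if the kernel reports
    EIO (what qemu would also see), False if the read succeeds."""
    try:
        with open(path, "rb") as f:
            f.read(1)
        return False
    except OSError as e:
        if e.errno == errno.EIO:
            return True
        raise


# On the host, against a file on the blocked NFS mount, e.g.:
#   probe_eio("/rhev/data-center/mnt/<server>:<export>/<sd-uuid>/dom_md/metadata")
```

Note that with a hard mount the read may block indefinitely rather than raise, so running the probe under a timeout is advisable.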
Natalie, any updates or can we close this bug?
(In reply to Michal Skrivanek from comment #37)
> when you access a file on that nfs _on host_, you should get EIO after that
> timeout. That would verify your NFS is set up correctly and then VM should
> behave the same - only hen it differs in behavior it's a valid (likely
> libvirt) bug

Hi Michal,
There is no access to the storage from the host, so I don't see the point of checking whether the NFS is set up correctly if we have no communication with it - it is as if it doesn't exist after blocking the connection.
The behavior should be exactly like running the VM with a disk on a block-type storage domain - the VM should enter the PAUSED state.
Raz, what do you mean by no access to the storage from the host? That's what Natalie is trying to test IIUC, and for that I asked to confirm the behavior when similar access is done directly by the host, without qemu involvement. Or do you mean there is just no storage access by the guest? Then indeed the VM doesn't pause, but then this whole bug is just a mistake - obviously the VM doesn't pause when there is no storage access whatsoever.

"The behavior should be exactly like running the VM with disk on block type storage domain - the VM should enter PAUSED state" - not exactly. The VM behavior should match the host behavior... so the VM pauses iff EIO is returned for backing storage access by the host. With the default NFS settings we use, that's very different from e.g. FC.
Thanks for clarifying, Michal.

Is there any way to overcome the differences between the storage types?

To summarize this bug: we are looking for consistency no matter what the storage type is - the VM should not care where its disk resides (i.e. NFS, iSCSI).

One thing I do not really understand is that you mentioned that the EIO is returned from the storage server. If the connection between the host and the storage is blocked, how can the host receive any return code from it?

I hope the request/expected result is clear from our POV.
(In reply to Raz Tamir from comment #41)
> Thanks for clarifying Michal,
> 
> Is there any way to overcome the differences between the storage types?
> 
> To summarize this bug, we are looking for consistency no matter what the
> storage type is - the VM should not care where it's disk resides on (i.e.
> NFS, iSCSI).

The VM indeed does not care, but the timing of the EIO would be different for different storage backends in any case, due to transport limitations/configuration.

> One thing I do not really understand is that you mentioned that the EIO is
> returned from the storage server.
> If the connection between the host and the storage is blocked, how can the
> host, receive any RC from it?

Sorry, maybe that wasn't clear in my text. I meant that the transport returns EIO, and that does depend on the specifics of each transport. So e.g. for FC, where the stack basically immediately detects that the path is wrong, the EIO is immediate too. Whereas on IP-based transports, when there is an issue somewhere along the way and packets are silently dropped, it takes time. For NFS that time is defined by the client.
Created attachment 1379222 [details]
engine.log, vdsm.log, libvirtd.log, /var/log/messages, mount output and UI snapshots of nfs-paused-vm and iscsi-paused-vm

I tested the following:
1. Write from the host to the NFS storage (simple dd).
2. Write from a VM that's running on the host to a disk that is on NFS storage.
3. While the file writing takes place, block the connection to the storage (using iptables).

Result: the command line just freezes for a very long time (a lot longer than the default 10 minutes we know).

/var/log/messages shows the following:

Jan 9 11:39:07 storage-jenkins-ge4-vdsm3 kernel: nfs: server yellow-vdsb.qa.lab.tlv.redhat.com not responding, timed out

Jan 9 11:33:20 storage-jenkins-ge4-vdsm3 vdsm[594]: WARN Worker blocked: <Worker name=periodic/2 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x2ec7d50> at 0x2ec7d90> timeout=15, duration=15 at 0x46e0450> task#=15626 at 0x2fa63d0>, traceback (with #012 decoded to newlines):
File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap
  self.__bootstrap_inner()
File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner
  self.run()
File: "/usr/lib64/python2.7/threading.py", line 765, in run
  self.__target(*self.__args, **self.__kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run
  ret = func(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run
  self._execute_task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task
  task()
File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__
  self._callable()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 213, in __call__
  self._func()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 578, in __call__
  stats = hostapi.get_stats(self._cif, self._samples.stats())
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 62, in get_stats
  avail, commit = _memUsageInfo(cif)
File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 121, in _memUsageInfo
  mem_info = v.memory_info()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 802, in memory_info
  dom_stats = self._dom.memoryStats()
File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
  ret = attr(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
  ret = f(*args, **kwargs)
File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
  return func(inst, *args, **kwargs)
File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 1449, in memoryStats
  ret = libvirtmod.virDomainMemoryStats(self._o)

Jan 9 11:33:35 storage-jenkins-ge4-vdsm3 vdsm[594]: WARN Worker blocked: <Worker name=periodic/2 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x2ec7d50> at 0x2ec7d90> timeout=15, duration=30 at 0x46e0450> task#=15626 at 0x2fa63d0> (same traceback as above)

------------------------------------------------------------------------------------

Later on, I restarted libvirtd (something that has to do with libvirt logging, not this issue) and retested the original scenario described in comment 0; the connection block started at ~17:30.

Result:

libvirtd.log:
2018-01-09 15:30:38.414+0000: 9083: error : qemuDomainObjBeginJobInternal:3839 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
2018-01-09 15:52:26.276+0000: 9083: error : qemuDomainObjBeginJobInternal:3839 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
{"timestamp": {"seconds": 1515513153, "microseconds": 425983}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "nospace": false, "__com.redhat_reason": "eio", "node-name": "#block140", "reason": "Input/output error", "operation": "read", "action": "stop"}}
2018-01-09 15:52:33.433+0000: 9076: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f9a20023a60 event={"timestamp": {"seconds": 1515513153, "microseconds": 425983}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "nospace": false, "__com.redhat_reason": "eio", "node-name": "#block140", "reason": "Input/output error", "operation": "read", "action": "stop"}}

vdsm.log.1:
2018-01-09 17:30:59,314+0200 ERROR (monitor/b7f7132) [storage.Monitor] Error checking domain b7f7132e-1f8a-48bf-ad00-2905c71f18d3 (monitor:424)
2018-01-09 17:30:59,329+0200 ERROR (monitor/1dd8b62) [storage.Monitor] Error checking domain 1dd8b623-b91f-4a40-bdb6-631ce1715663 (monitor:424)
2018-01-09 17:30:59,332+0200 ERROR (monitor/4e28206) [storage.Monitor] Error checking domain 4e282067-fcf7-421c-b28c-ed36a8eee0c8 (monitor:424)
2018-01-09 17:30:59,334+0200 ERROR (monitor/b25ad09) [storage.Monitor] Error checking domain b25ad098-d8c9-46ed-90fe-367428c5a44d (monitor:424)
2018-01-09 17:40:01,101+0200 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge4__nfs__1/4e282067-fcf7-421c-b28c-ed36a8eee0c8/dom_md/metadata (monitor:498)
2018-01-09 17:40:01,114+0200 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge4__nfs__0/1dd8b623-b91f-4a40-bdb6-631ce1715663/dom_md/metadata (monitor:498)
2018-01-09 17:40:01,372+0200 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge4__nfs__2/b25ad098-d8c9-46ed-90fe-367428c5a44d/dom_md/metadata (monitor:498)
2018-01-09 17:53:03,493+0200 ERROR (libvirt/events) [vds] Error running VM callback (clientIF:654)

engine.log:
2018-01-09 17:30:48,701+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-24) [] VM 
'b4fcb39a-d1a3-449a-aba3-518c072168e5'(vm_TestCaseBlockedConnecti_0917235822) moved from 'Up' --> 'NotResponding'
2018-01-09 17:52:33,488+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-8) [] VM 'b4fcb39a-d1a3-449a-aba3-518c072168e5'(vm_TestCaseBlockedConnecti_0917235822) moved from 'NotResponding' --> 'Paused'
2018-01-09 17:52:33,550+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-8) [] EVENT_ID: VM_PAUSED_EIO(145), VM vm_TestCaseBlockedConnecti_0917235822 has been paused due to storage I/O problem.
2018-01-09 17:52:39,242+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-35) [] VM 'b4fcb39a-d1a3-449a-aba3-518c072168e5'(vm_TestCaseBlockedConnecti_0917235822) moved from 'Paused' --> 'NotResponding' <---- why does it go back to not responding?

webadmin, Compute -> Virtual Machines:
The little icon on the left shows a question mark. When hovering over this icon, the following message appears: "Not Responding VM has been paused due to a storage I/O error". The VM status column on the right shows "Not Responding".
When performing the same test with iSCSI storage, the icon is not a question mark but a pause icon; hovering over it shows the message "Paused VM has been paused due to a storage I/O error", and the status column shows "Paused". Snapshots attached.

Attached: engine.log, vdsm.log, libvirtd.log, /var/log/messages, mount output, and 2 snapshots of the paused VMs (iSCSI and NFS)
This is similar to Bug 665820. To understand what's happening at the QEMU level, please see https://bugzilla.redhat.com/show_bug.cgi?id=665820#c11 (and other comments there). TL;DR: we can't know we got stuck until some time after getting stuck (the comments in the BZ above explain why). And we can't think of a way to implement a "detect system call got stuck on NFS" feature in QEMU that doesn't require major architectural surgery, with a considerable risk of introducing nasty bugs.
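To illustrate the constraint (purely illustrative Python, not QEMU or vdsm code): a thread blocked inside a syscall cannot report anything itself, so the only workable detection is an external watchdog that notices a stale heartbeat. That is essentially what produces the vdsm "Worker blocked" warnings quoted earlier, and why the detection is always after the fact.

```python
# Illustrative sketch: a thread stuck in a blocking call cannot announce
# that it is stuck, so detection must come from a separate watchdog that
# notices the worker's heartbeat has gone stale.
import threading
import time


class WatchedWorker:
    def __init__(self, timeout):
        self.timeout = timeout          # seconds of silence before "stuck"
        self.last_beat = time.monotonic()
        self.stuck = False

    def run(self, blocking_call):
        self.last_beat = time.monotonic()
        blocking_call()                 # may hang forever on dead storage
        self.last_beat = time.monotonic()

    def check(self):
        # Runs on the watchdog thread; the worker itself may never return.
        self.stuck = (time.monotonic() - self.last_beat) > self.timeout
        return self.stuck


worker = WatchedWorker(timeout=0.1)
t = threading.Thread(target=worker.run,
                     args=(lambda: time.sleep(0.5),),  # stand-in for a stuck read()
                     daemon=True)
t.start()
time.sleep(0.3)          # the "syscall" is still blocked at this point
print(worker.check())    # -> True: the worker is flagged as blocked
```

Note that the watchdog can only flag the hang; it cannot unblock the thread, which is why QEMU would need the major architectural surgery mentioned above to act on such a detection.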
This has nothing to do (as far as I understand) with oVirt and isn't a new behavior. Until we can come up with a better answer, deferring this to 4.2.2
It would be best to rerun those tests on 7.5 and note the differences in behavior for each underlying storage type. I do not see anything else we could do in the 4.2 timeframe.
The tests were repeated with rhv-release-4.2.2-4-001.noarch and RHEL 7.5.

NFS (the same behavior as in previous versions):
1. Run a VM with an NFS disk.
2. Block the storage: <iptables -I INPUT -s yellow-vdsb.qa.lab.tlv.redhat.com -j DROP>
Result: the VM is in the Unknown state. Attached logs_nfs_vm.tar.gz.

Gluster (the behavior changed; the states still cycle at first, but then stabilize):
1. Run a VM with a disk on Gluster.
2. Block the storage:
iptables -I INPUT -s 10.35.83.240 -j REJECT
iptables -I INPUT -s 10.35.83.241 -j REJECT
iptables -I INPUT -s 10.35.83.242 -j REJECT
Result: "has been paused due to unknown storage error", "VM golden_env_mixed_virtio_3_0 is not responding", "has been paused due to storage I/O problem". Attached logs_gluster.tar.gz.
Created attachment 1405396 [details] logs_nfs_vm.tar.gz
Created attachment 1405397 [details] logs_gluster.tar.gz
With Gluster the behavior is inconsistent. I attach engine and vdsm logs in gluster_vm_continue_scenario.tar.gz.
Scenario: run a Gluster VM (not HA) with the KILL resume behavior. After the storage is blocked, the VM shows "has been paused due to unknown storage error" (no state cycling). Then, before the storage was unblocked, VM golden_env_mixed_virtio_3_0 went down with an error. Exit message: "Down because paused for too long". Strange, inconsistent behavior.
Created attachment 1405401 [details] gluster_vm_continue_scenario.tar.gz
Just to note that for iSCSI and FCP the behavior is correct: the VM is paused with an I/O error while the storage is blocked. The iSCSI storage is blocked with iptables; the FCP storage can be blocked with <echo "offline" > /sys/block/sdd/device/state> for the connected LUN.
Martin, can we move it to qemu-kvm?
I guess so, but I wanted to document some basic behavioral patterns of the whole system for different storage backends.
(In reply to Yaniv Kaul from comment #53)
> Martin, can we move it to qemu-kvm?

I don't see why. QEMU behaves correctly; it is (again) NFS v3 that has the issue. Maybe different mount options might help here (using hard mounts instead of soft ones). I personally would expect so, as a soft mount returns a write error, whereas a hard mount "queues" the I/O. Have you tried using hard mounts instead of soft ones?

So to me this looks more like a storage issue that can't be fixed completely due to the nature of NFS. Adding Ademar for his view on this.
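For reference, the soft vs. hard distinction in nfs(5) terms, as illustrative /etc/fstab lines (the server, export path, and option values here are made up; RHV generates its own mount options, so managed mounts should not be hand-edited):

```
# soft: give up after `retrans` retransmissions, each timing out after
# `timeo` tenths of a second, and return an I/O error (EIO) to the caller.
server:/export  /mnt/soft  nfs  soft,timeo=600,retrans=6,vers=3  0 0

# hard: retry indefinitely; the application stays blocked in the syscall
# until the server responds again, and no I/O error is ever surfaced.
server:/export  /mnt/hard  nfs  hard,vers=3  0 0
```

This matches the behavior discussed here: only a soft mount eventually surfaces the EIO that lets QEMU pause the guest, while a hard mount keeps the process blocked indefinitely.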
(In reply to Martin Tessun from comment #55) > Adding Ademar for his view on this. I added a comment before (comment #44) explaining that when QEMU gets stuck in a system call (read() on a non-responding NFS server, for example), there's not much to be done and the whole process freezes. We have a (very old) RFE about this behavior, see Bug 665820. But it requires some serious reengineering of QEMU to be implemented: we need a separate monitor thread or process that watches for stuck syscalls. To get more details of what's happening at the QEMU level, please see https://bugzilla.redhat.com/show_bug.cgi?id=665820#c11 (and other comments there). I recommend closing this BZ as a duplicate of Bug 665820.
(In reply to Polina from comment #48)
> Created attachment 1405396 [details]
> logs_nfs_vm.tar.gz

Polina, I reviewed this one and it's again the same problem. I'm assuming you're using the same settings as Natalie in comment #43: you have soft mounts with a 6-minute timeout, but your logs cover only circa 5 minutes after blocking the access. It also takes 5 minutes to declare the host non-operational (because of the default storage monitoring timeouts), so you won't EVER get an ioerror in this setup.

I also reviewed the logs from comment #49 (gluster):
16:31:12 VM is deemed not responsive (stuck monitoring calls)
16:31:44.691 it apparently continues to run (possibly it still was; it's just the libvirt stats calls being stuck) and a guest agent response (Guest's message heartbeat) triggers an Up event. This may be a bug and we may want to _not_ send an Up event in case of monitorResponse == -1. It may also be a bogus message, since it seems to be real close to the next event. It would be good to verify this behavior several times.
16:31:44.714 IOError is reported
16:31:56/57 VM moves back to NotResponding - that's probably a bug in the monitoring code related to the IOError "respond" earlier. @Arik please look at that

To sum up:
- FCP and iSCSI work ok, per comment #52
- NFSv3 with default mount options works "as expected" (though not desired) - the VM is not responding for 5 minutes and then the host becomes non-operational.
- NFSv4 not tested? It should behave the same with default options though.
- Gluster - seems to work reasonably ok

To get useful data for NFSv3 it needs to be retested with different timeouts, and I guess NFSv4 should be tested independently.
Hi Michal, for rhvm-4.2.3-0.1.el7.noarch, libvirt-3.9.0-14.el7_5.2.x86_64:
Gluster seems to be ok - after the Gluster storage is blocked, the VM is Unknown but soon turns to I/O-error Paused, so I was able to verify the Resume Behavior feature for Gluster storage.
For NFS I tested with NFSv4. I can't catch the I/O error for the VM: it turns to Not Responding, and then the host becomes Non-Operational. If the environment can be configured to work around this, could you please explain how to configure it?
adding back needinfo from comment #57 - Arik, can you please check that? Polina, it's discussed earlier in this bug with Natalie. Just set the timeout to <5 mins.
This is neither a blocker nor an exception, yet it is targeted for 4.2.3?
(In reply to Michal Skrivanek from comment #57)
> Also reviewed logs from comment #49 (gluster):
> 16:31:12 VM is deemed not responsive (stuck monitoring calls)
> 16:31:44.691 it apparently continues to run (possibly it still was, it's
> just libvirt stats calls being stuck) and a guest agent response (Guest's
> message heartbeat) triggers an Up event. This may be a bug and we may want to
> _not_ send an Up event in case of monitorResponse == -1. It may also be a bogus
> message since it seems to be real close to the next event. Would be good to
> verify this behavior several times.
> 16:31:44.714 IOError is reported
> 16:31:56/57 VM moves back to NotResponding - that's probably a bug in
> monitoring code related to the IOError "respond" earlier. @Arik please look at
> that

The engine sets the NotResponding state only when VDSM reports monitorResponse < 0. The engine doesn't have much choice in that case: when it gets a report from VDSM indicating the VM is NotResponding, the engine should honor this report and switch the state of the VM accordingly. If this state is incorrect, VDSM should not report it.
The resolution is converging, but there is still the case of NFS, which was not properly verified by QE yet. Moving to 4.2.4 then.

Arik, the problem with that is that status events like Paused/IOError are received through the lifecycle callback, which may work even when stats do not. I agree we should honor monitorResponse. Maybe the issue is due to that "Up" from the guest agent, but the lifecycle event should be reflected in the engine somehow. I.e. if I have a VM in NotResponding, I still want to process lifecycle changes reported by VDSM events. Again, this should probably be tested by QE several times, to see whether this behavior is consistent and to understand that guest agent "Up" report.
(In reply to Michal Skrivanek from comment #62)
> Arik, problem with that is that status events like Paused/IOError is
> received through lifecycle callback which may work even when stats do not. I
> agree we should honor monitoringResponse. Maybe the issue is due to that
> "Up" from guest agent), but the lifecycle event should be reflected in
> engine somehow.
> I.e. if I have a VM in NotResponding I still want to process lifecycle
> changes reported by VDSM events.

I agree, and the engine does that - that's why the status of the VM changed from NotResponding to Up, right? Later, the VM state was changed to NotResponding because that was the next input from VDSM.

The real advantage of the engine, when it comes to monitoring, is that it generally knows more than VDSM about the overall state of the system - it knows the status of other hosts, of other VMs running on different hosts, of storage domains, etc. On the other hand, VDSM is smarter when it comes to host-specific things. Sure, VDSM can tell the engine "Listen, the status of the VM is a bit complicated: I can't get its stats, but I got a paused/IOError lifecycle event and the guest agent seems to work just fine", and then the engine would get all this data and either confuse the user by displaying all that information or decide what the state of the VM is in light of all that information.

However, IMO VDSM should be the component that decides the status of the VM running on its host. If the stats do not work but there's an indication from the guest agent that the VM is running, there's no point in telling the engine that the VM is NotResponding. If the indication from the guest agent means that the VM is running and usable, VDSM should report the VM as Up, and maybe send another event and produce an audit log saying that something is wrong with getting its stats. 
(In reply to Arik from comment #63) > (In reply to Michal Skrivanek from comment #62) > > Arik, problem with that is that status events like Paused/IOError is > > received through lifecycle callback which may work even when stats do not. I > > agree we should honor monitoringResponse. Maybe the issue is due to that > > "Up" from guest agent), but the lifecycle event should be reflected in > > engine somehow. > > I.e. if I have a VM in NotResponding I still want to process lifecycle > > changes reported by VDSM events. > > I agree and the engine does that - that's why the status of the VM changed > from NotResponing to Up, right? Later, the VM state was changed to > NotResponding because that was the next input from VDSM. > > The real advantage of the engine, when it comes to monitoring, is that it > generally knows more than VDSM about the overall state of the system - he > knows what's the status of other hosts, of other VMs running on different > hosts, of storage domains, etc. On the other hand, VDSM is smarter when it > comes to host-specific things. Sure, VDSM can tell the engine "Listen, the > status of the VM is a bit complicated: I can't get its stats but I got a > paused/IOError lifecycle event and the guest agent seems to work just fine" > and then the engine would get all this data and either confuse the user by > displaying all that information or decide what's the state of the VM in > light of all that information. > > However, IMO VDSM should be the component that decides what's the status of > the VM that is running on its host. If the stats do not work but there's an > indication from the guest agent that the VM is running - there's no point in > sending the engine that the VM is NotResponding. If the indication from the > guest agent means that the VM is running and is usable VDSM should report > the VM as Up, and maybe send another event and would produce an audit log > saying that something is wrong with getting its stats. 
Unless some operations cannot be done on that VM, like migration. In that case, NotResponding can actually be a better state. > > I think this bz should move to VDSM.
(In reply to Arik from comment #64) > > However, IMO VDSM should be the component that decides what's the status of > > the VM that is running on its host. If the stats do not work but there's an > > indication from the guest agent that the VM is running - there's no point in > > sending the engine that the VM is NotResponding. too bad this is done by monitorResponse==-1 in stats instead of actual state. It's asynchronous, and confusing in any case. > > If the indication from the > > guest agent means that the VM is running and is usable VDSM should report > > the VM as Up, and maybe send another event and would produce an audit log > > saying that something is wrong with getting its stats. > > Unless some operations cannot be done on that VM, like migration. In that > case, NotResponding can actually be a better state. this is actually an example where the opposite would work better:) if you can still instruct VM to migrate away when its stats do not work it's worth a try to initiate the migration:) > > I think this bz should move to VDSM. not sure. Of course it would be all great if we refactor everything, but that's probably not feasible. The code on vdsm side modifying the state reported by libvirt is quite awful already
(In reply to Michal Skrivanek from comment #65)
> > Unless some operations cannot be done on that VM, like migration. In that
> > case, NotResponding can actually be a better state.
>
> this is actually an example where the opposite would work better:) if you
> can still instruct VM to migrate away when its stats do not work it's worth
> a try to initiate the migration:)

So on the other extreme, if everything is enabled in that state, e.g. migration, connecting a console, snapshotting, etc., then maybe it is wrong to interpret monitorResponse==-1 as the NotResponding state. Maybe we should use the state reported by the lifecycle events and have an additional indication that stats are not being received.

> > > > I think this bz should move to VDSM.
>
> not sure. Of course it would be all great if we refactor everything, but
> that's probably not feasible. The code on vdsm side modifying the state
> reported by libvirt is quite awful already

Maybe I'm missing something then. IIUC, the problem we are discussing is the flipping of the VM state between Paused and NotResponding. The state of the VM in libvirt is Paused, but we fail to get the stats, so the engine interprets monitorResponse==-1 as NotResponding on every statistics cycle.

Assuming that we still want to set the state to NotResponding when getting monitorResponse==-1, the desired result in the discussed scenario would be:
1. if the engine gets monitorResponse==-1 and no VM state change, set NotResponding.
2. if the engine gets a VM state change, set the VM state accordingly.
3. if the engine gets monitorResponse==-1 soon after, keep the VM state.
So we'll end up with, e.g., Up -> NotResponding -> Paused and no more switches to NotResponding. Am I right?

But then you say "VDSM code is too complex for us to implement such logic there, so let's move it to the engine that is easier for us to change" - I agree with that statement, but I'm afraid it will not be trivial to add such hacks on the engine side as well.
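The three numbered rules above can be sketched as a small state tracker (illustrative Python only; the actual engine-side logic is Java, and every name here is invented for the example):

```python
# Illustrative sketch of the proposed rules; not actual oVirt engine code.
class VmStatusTracker:
    def __init__(self, status='Up'):
        self.status = status
        # True while the latest definite information is a lifecycle event
        # that arrived after stats stopped working.
        self.lifecycle_fresh = False

    def on_lifecycle(self, new_status):
        # Rule 2: an explicit state change from VDSM always wins.
        self.status = new_status
        self.lifecycle_fresh = True

    def on_stats(self, monitor_response):
        if monitor_response >= 0:
            # Stats work again; a future -1 becomes meaningful once more.
            self.lifecycle_fresh = False
        elif not self.lifecycle_fresh:
            # Rule 1: stale stats and no newer lifecycle information.
            self.status = 'NotResponding'
        # Rule 3: otherwise keep the state set by the lifecycle event.


vm = VmStatusTracker()
vm.on_stats(-1)
print(vm.status)            # NotResponding
vm.on_lifecycle('Paused')
vm.on_stats(-1)
print(vm.status)            # Paused: no flip back to NotResponding
```

Under these rules the example sequence ends at Up -> NotResponding -> Paused with no further flipping while the stats calls remain stuck.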
(In reply to Arik from comment #66)
> Maybe I'm missing something then.
> IIUC the problem we are discussing is about the flipping state of the VM
> between Paused and NotResponding. The state of the VM in libvirt is Paused
> but we fail to get the stats so the engine interprets monitorResponse==-1 as
> NotResponding on every statistics cycle.
>
> Assuming that we still want to set the state to NotResponding when getting
> monitorResponse==-1, the desired result in the discussed scenario would be:
> 1. if the engine gets monitorResponse==-1 and no VM state change, set
> NotResponding.
> 2. if the engine gets VM state change, set the VM state accordingly.
> 3. if the engine gets monitorResponse==-1 soon after, keep the VM state.
> So we'll end up with, e.g. Up -> NotResponding -> Paused and no more switches
> to NotResponding. Am I right?

Maybe we can say that the NotResponding state is only relevant for VMs which are running (guest CPU is running), and not change to/from it in any other case.

I see 2 issues in comment #57. First, vdsm itself sent "Up" (from a guest agent report) while monitorResponse was -1. If we consider guest agent reports part of statistics (which they indeed are), then this event shouldn't have been sent (this has to be fixed in vdsm, because in the engine you cannot tell whether an Up came from libvirt or from the guest agent).

The second issue is what you're talking about: we move to Paused while monitorResponse is -1 (that would be ok on its own) and then back to NotResponding. I say that if we ignore monitorResponse when the state is not-running (Paused), then it's ok and the VM stays Paused.

> But then you say "VDSM code is too complex for us to implement such logic
> there so let's move it to the engine that is easier for us to change" - I
> agree with that statement, but I'm afraid it will not be trivial to add such
> hacks on the engine side as well.

:) no, it's more about the paradigm of doing as little modification from libvirt as possible in vdsm. 
We already have some, and it's quite difficult to follow due to the async nature of reports from the guest agent, the libvirt event loop, libvirt stats, and internal code. There's no NotResponding state at all in vdsm.
(In reply to Michal Skrivanek from comment #67) > maybe we can say that NotResponding state is only relevant for VMs which are > running (guest CPU is running) and not change to/from it in any other case. > I see 2 issues in comment #57 - first that vdsm itself sent "Up" (from guest > agent report) while monitorResponse was -1. If we consider guest agent > reports part of statistics (which they indeed are) then this event shoudln't > have been sent (this has to be fixed in vdsm because you cannot tell if Up > is from libvirt or guest agent in engine) Well, we can theoretically enhance the VDSM->engine API to reflect the source of the information, I thought about this, that's why I consider such an engine-side solution 'non-trivial' and agree that it should be done in VDSM :) > Second issue is what you're talking about, that we move to Paused while > monitorResponse is -1 (that would be ok on its own) and then back to > NotResponding - I say that if we ignore monitorResponse when state is > not-running(Paused) then it's ok, VM stays Paused. I'm concerned about the following scenario: We have a running VM => its state is Up. Then, the VM is being paused => its state is Paused. Then, the VM is resumed => its state is Up but the engine doesn't process this event. Later, all stats fail thus we get monitorResponse == -1, so the VM stays in status Paused (until the next status transition) although it is actually running in libvirt. So ignoring (on the engine side) the monitorResponse==-1 when the 'known' state of the VM on the engine is Paused would only be feasible if we are sure the stats call failed because the VM was paused and succeed otherwise. > > > But then you say "VDSM code is too complex for us to implement such logic > > there so let's move it to the engine that is easier for us to change" - I > > agree with that statement, but I'm afraid it will not be trivial to add such > > hacks on the engine side as well. 
> > :) no, it's more about the paradigm of doing as little modification from
> libvirt as possible in vdsm. We already have some and it's quite difficult
> to follow due to async nature of reports from guest agent, libvirt event
> loop, libvirt stats, and internal code. There's no NotResponding state at
> all in vdsm

Right, but you know, 'best practices' should be taken with caution: when the only source of truth is libvirt, then yeah, there's no need to override the data we get from it or convert it to another format and so on; this makes perfect sense to me. But when VDSM is provided with data from different sources and can produce better, more accurate and simpler information, I think it should.
(In reply to Arik from comment #68)
> I'm concerned about the following scenario:
> We have a running VM => its state is Up.
> Then, the VM is being paused => its state is Paused.
> Then, the VM is resumed => its state is Up but the engine doesn't process
> this event.
> Later, all stats fail thus we get monitorResponse == -1, so the VM stays in
> status Paused (until the next status transition) although it is actually
> running in libvirt.
>
> So ignoring (on the engine side) the monitorResponse==-1 when the 'known'
> state of the VM on the engine is Paused would only be feasible if we are
> sure the stats call failed because the VM was paused and succeeds otherwise.

Perhaps I wasn't clear. I meant that we would ignore the processing of monitorResponse==-1 (i.e. not move to NotResponding after the defined interval) when the VM status at that time is "guest CPU is not running" (i.e. Down, Paused, ResumingState, Migration Destination; realistically, Paused is the only one you would ever see in processing).

This wouldn't necessarily apply to the situation where the VM is Paused and you get monitorResponse>0, though I believe the coverage by VM events and full stats is good enough.
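That guard could be expressed as a simple predicate (illustrative Python; the state names follow the comment above and are not actual engine constants):

```python
# Illustrative sketch: only consider a NotResponding transition while the
# guest CPU is running; for any "CPU not running" state, stale stats leave
# the current status untouched.
CPU_NOT_RUNNING = {'Down', 'Paused', 'ResumingState', 'MigratingTo'}

def on_stale_stats(current_status):
    """Handle monitorResponse == -1 persisting past the grace interval."""
    if current_status in CPU_NOT_RUNNING:
        return current_status        # e.g. a Paused VM stays Paused
    return 'NotResponding'


print(on_stale_stats('Paused'))      # Paused
print(on_stale_stats('Up'))          # NotResponding
```

This directly prevents the Paused -> NotResponding flip discussed in this thread, while a genuinely running VM with stuck stats is still reported as NotResponding.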
The problem is still reproduced on rhv-release-4.2.4-2-001.noarch
(In reply to Polina from comment #70) > The problem is still reproduced on rhv-release-4.2.4-2-001.noarch Just noting that this is expected as only the engine side was merged.
There is still the problem of the virDomainMemoryStats libvirt calls from VDSM freezing, which should be handled somehow.
Verification on the downstream package failed.

rhv-release-4.2.5-1-001.noarch
vdsm-4.20.33-1.el7ev.x86_64
libvirt-3.9.0-14.el7_5.6.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64

Hi, the described scenario is still reproducible in the latest build.
Also tested on a nightly build - the same behavior: rhvm-4.2.5.1_SNAPSHOT-72.g1639ffa.0.scratch.master.el7ev.noarch
(In reply to Polina from comment #73) > verification on downstream package failed. > > rhv-release-4.2.5-1-001.noarch > vdsm-4.20.33-1.el7ev.x86_64 > libvirt-3.9.0-14.el7_5.6.x86_64 > qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64 > > Hi, the described scenario still reproducible in the last build. Ok, which one exactly? Which storage type(s)? There are quite a few different scenarios in previous comments. Also please add logs, it could be a different issue.
Hi, in the attached logs (vdsm with full debug) you can see the states of two VMs: the NFS VM golden_env_mixed_virtio_1_0 and the Gluster VM golden_env_mixed_virtio_3_0. Expected for both VMs: the VM should be paused with an I/O error while the storage is blocked. Actually, the VMs are in the not-responding state.

1. For the NFS VM:
753 2018-07-15 09:55:50,947+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-68) [] EVENT_ID: USER_RUN_VM(32), VM golden_env_mixed_virtio_1_0 started on Host host_mixed_1
Block the NFS storage with the iptables command.
906 2018-07-15 10:01:53,637+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [] VM '3855687f-a362-48a9-8d5a-b1dc3ff22dd5'(golden_env_mixed_virtio_1_0) moved from 'Up' --> 'NotResponding'
The VM remains in the not-responding state. At some point the engine tries to migrate it (not expected - it is not an HA VM) and fails.
1099 2018-07-15 10:06:21,068+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-355) [1b18e31c] Failed to migrate VM 'golden_env_mixed_virtio_1_0'

2. For the Gluster VM:
1083 2018-07-15 10:03:11,388+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-49) [] EVENT_ID: USER_RUN_VM(32), VM golden_env_mixed_virtio_3_0 started on Host host_mixed_3
Block the replica-3 Gluster storage.
1103 2018-07-15 10:07:12,218+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [] VM '2914e97d-c6c7-4a94-a84e-8bd7a4d8cc78'(golden_env_mixed_virtio_3_0) moved from 'Up' --> 'NotResponding'
Then it flips to the Up state:
1112 2018-07-15 10:08:12,581+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-69) [] VM '2914e97d-c6c7-4a94-a84e-8bd7a4d8cc78'(golden_env_mixed_virtio_3_0) moved from 'NotResponding' --> 'Up'

Then the engine tries to migrate this VM, which fails. As a result, the VM is down:
2018-07-15 10:11:38,625+03 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-0) [] Migration of VM 'golden_env_mixed_virtio_3_0' to host 'host_mixed_1' failed: VM destroyed during the startup.
2018-07-15 10:11:38,646+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-9) [] VM '2914e97d-c6c7-4a94-a84e-8bd7a4d8cc78'(golden_env_mixed_virtio_3_0) moved from 'MigratingFrom' --> 'Down'

Version: rhv-release-4.2.5-3-001.noarch
Created attachment 1458964 [details] logs_engine_vdsm.tar.gz
The problem is the same in both cases: apparently the getAllDomainStats/domainListGetStats calls in bulk sampling take a long time, resulting in not having sufficiently recent VM stats and in monitorResponse being set to -1.
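In other words, monitorResponse is derived from the freshness of the last successful bulk-stats sample; roughly (illustrative sketch with an assumed threshold and invented names, not real vdsm internals):

```python
# Illustrative sketch: report -1 when the last successful bulk-stats sample
# for a VM is older than the allowed interval, even though the domain
# itself may be perfectly fine. Threshold and names are assumptions.
import time

STATS_MAX_AGE = 15.0  # seconds; hypothetical staleness threshold

def monitor_response(last_sample_ts, now=None):
    """Return -1 when the last successful stats sample is too old."""
    if now is None:
        now = time.monotonic()
    return -1 if (now - last_sample_ts) > STATS_MAX_AGE else 0


print(monitor_response(last_sample_ts=0.0, now=20.0))   # -1: stats are stale
print(monitor_response(last_sample_ts=10.0, now=20.0))  # 0: fresh enough
```

This is why a single stuck getAllDomainStats call is enough to make an otherwise healthy VM appear as NotResponding: the freshness check cannot distinguish a stuck call from a genuinely unresponsive domain.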
After further inspection, the behavior looks mostly expected. The VMs get into the non-responsive state on I/O errors and may get back out of it with GlusterFS due to some self-healing capabilities of the file system. The hosts become non-operational due to storage problems, and the engine tries to migrate VMs away from such hosts. The VM going down may be related to a known QEMU bug: https://bugzilla.redhat.com/1548448. To confirm all that, we need the related logs. Polina, could you please provide:
- GlusterFS logs
- libvirt logs
- QEMU logs
Moving to 4.2.6. There is no clear bug here anymore, just waiting for missing information from QE to confirm that the recently observed behavior corresponds to actual system conditions.
Created attachment 1459744 [details] scenario_nfs_vm_block_storage attached logs for both scenarios - nfs and gluster - scenario_nfs_vm_block_storage.tar.gz ,scenario_gluster_vm_block_storage.tar.gz
Created attachment 1459745 [details] scenario_gluster_vm_block_storage
Regarding the VM crash on migration with GlusterFS, I can see in the QEMU log that it's a QEMU problem:

2018-07-18 15:46:12.293+0000: initiating migration
Unexpected error in raw_apply_lock_bytes() at block/file-posix.c:642:
2018-07-18T15:46:15.338010Z qemu-kvm: Failed to lock byte 100
2018-07-18 15:46:15.610+0000: shutting down, reason=crashed

And in libvirt:

2018-07-18 15:46:15.405+0000: 22018: error : qemuMonitorIO:719 : internal error: End of file from qemu monitor
2018-07-18 15:46:15.405+0000: 22018: debug : qemuMonitorIO:767 : Error on monitor internal error: End of file from qemu monitor
2018-07-18 15:46:15.405+0000: 22018: debug : qemuMonitorIO:788 : Triggering EOF callback
2018-07-18 15:46:15.405+0000: 22018: debug : qemuProcessHandleMonitorEOF:290 : Received EOF on 0x7f324801f560 'golden_env_mixed_virtio_3_0'
...
2018-07-18 15:46:15.405+0000: 22018: debug : virNetlinkEventCallback:707 : dispatching to max 0 clients, called from event watch 9
2018-07-18 15:46:15.405+0000: 22018: debug : virNetlinkEventCallback:720 : event not handled.
2018-07-18 15:46:15.405+0000: 24918: debug : qemuProcessEventHandler:4818 : vm=0x7f324801f560, event=6
2018-07-18 15:46:15.405+0000: 24918: debug : qemuProcessKill:6339 : vm=0x7f324801f560 name=golden_env_mixed_virtio_3_0 pid=468 flags=0x1
2018-07-18 15:46:15.405+0000: 24918: debug : virProcessKillPainfully:355 : vpid=468 force=1
2018-07-18 15:46:15.440+0000: 22083: debug : udevHandleOneDevice:1677 : udev action: 'change'
2018-07-18 15:46:15.441+0000: 22083: debug : udevGetDeviceProperty:149 : Found property key 'DRIVER' value '<null>' for device with sysname 'kvm'
2018-07-18 15:46:15.441+0000: 22083: debug : udevGetDeviceProperty:149 : Found property key 'SUBSYSTEM' value 'misc' for device with sysname 'kvm'
2018-07-18 15:46:15.441+0000: 22083: debug : udevGetDeviceType:1349 : Could not determine device type for device with sysfs name 'kvm'
2018-07-18 15:46:15.441+0000: 22083: debug : udevAddOneDevice:1541 : Discarding device -1 0x7f31f4001650 /sys/devices/virtual/misc/kvm
2018-07-18 15:46:15.607+0000: 24918: info : virObjectRef:388 : OBJECT_REF: obj=0x7f31f01ceea0
2018-07-18 15:46:15.607+0000: 24918: debug : qemuDomainObjBeginJobInternal:4625 : Starting job: destroy (vm=0x7f324801f560 name=golden_env_mixed_virtio_3_0, current job=none async=migration out)
2018-07-18 15:46:15.607+0000: 24918: debug : qemuDomainObjBeginJobInternal:4666 : Started job: destroy (async=migration out vm=0x7f324801f560
...
2018-07-18 15:46:15.609+0000: 24918: debug : processMonitorEOFEvent:4788 : Monitor connection to 'golden_env_mixed_virtio_3_0' closed without SHUTDOWN event; assuming the domain crashed

It seems such a crash hasn't been reported yet. Polina, if the bug is reproducible on your setup, could you please file a QEMU bug?
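For what it's worth, the crash signature above can be recognized mechanically when triaging similar reports. The following is a minimal, hypothetical sketch (not part of any oVirt or libvirt tooling); the marker strings are taken verbatim from the log fragments in this comment:

```python
# Sketch: classify a QEMU domain shutdown from log excerpts.
# Marker strings come from the qemu and libvirtd log lines quoted above;
# a real log scanner would also have to deal with rotation, encodings, etc.

CRASH_MARKERS = (
    "shutting down, reason=crashed",                                # qemu log
    "closed without SHUTDOWN event; assuming the domain crashed",   # libvirtd log
)

def classify_shutdown(log_text: str) -> str:
    """Return 'crashed' if any crash marker is present, else 'clean'."""
    for marker in CRASH_MARKERS:
        if marker in log_text:
            return "crashed"
    return "clean"
```

A clean guest-initiated shutdown would instead log a SHUTDOWN event before the monitor EOF, so neither marker would match.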
Regarding the VM status changes, I can see that:
- With NFS, the VM switches to NotResponding, and this is the last status change in the provided Engine log.
- With GlusterFS, there are two VMs; each of them switches to NotResponding and then, after a while, back to Up. The GlusterFS logs don't report anything after the second failure.

In both cases, the status changes to NotResponding seem to be caused by stuck libvirt calls. The VMs switch back to Up with GlusterFS once libvirt starts reporting VM stats again. Unless there is some more or less regular status flipping or obviously incorrect reporting, I'm afraid we can't do anything about that. Polina, could you please clarify whether some of the currently observed behavior is clearly broken?
Created attachment 1470506 [details] qemu_libvirt_vdsm_engine_for_gluster_vm.tar.gz

Hi, I'm not sure I understand the last question - the currently observed behavior is broken. The expected behavior is that the VM is Paused due to an I/O storage error while the storage is blocked. The actual behavior: for the NFS VM - Not Responding; for Gluster - 'NotResponding' --> 'Up'. Both are problematic. Regarding the requested qemu.log for the gluster VM, please see the attached qemu_libvirt_vdsm_engine_for_gluster_vm.tar.gz.
And is it an oVirt problem or a libvirt/QEMU problem? If the VM gets paused, libvirt reports it as paused, and Vdsm keeps reporting it as Up, then it may be an oVirt bug. If the VM doesn't get paused, or libvirt reports it as running, then it's a libvirt/QEMU problem. From the provided logs, it looks like the VMs are either reported as running by libvirt, or the libvirt calls are stuck and Vdsm reacts correctly in both cases. Polina, could you please clarify that?
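The triage rule above can be written down explicitly. This is a hypothetical decision helper that only illustrates the reasoning in this comment, not actual Vdsm or Engine code; the state names are the ones used throughout this thread:

```python
def blame(qemu_paused: bool, libvirt_reported: str, vdsm_reported: str) -> str:
    """Decide which component to suspect, per the rule in the comment:
    - VM paused, libvirt says paused, but Vdsm still says Up  -> oVirt bug
    - VM never pauses, or libvirt keeps saying running        -> libvirt/QEMU bug
    - otherwise the stack agrees and Vdsm reacted correctly   -> no bug
    """
    if qemu_paused and libvirt_reported == "paused" and vdsm_reported == "Up":
        return "oVirt"
    if not qemu_paused or libvirt_reported == "running":
        return "libvirt/QEMU"
    return "none"
```

Per the logs in this thread, the observed combinations fall into the second branch, which is why the reports were redirected to libvirt/QEMU.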
(In reply to Polina from comment #85)
> Created attachment 1470506 [details]
> qemu_libvirt_vdsm_engine_for_gluster_vm.tar.gz
>
> Hi, I'm not sure I understand the last question - The currently observed
> behavior is broken.
> The expected behavior is the VM Paused due to I/O storage error while the
> storage is blocked.

We're going in circles a bit. Yes, indeed the expected behavior is that the VM gets Paused on an I/O error. The only question is _when_ there is an I/O storage error. If no I/O error is reported, you do not get a paused VM.

> The actual behavior:
> for nfs VM - Not Responging.

Is an I/O error being reported in the guest? On the host? I doubt it; we have discussed several times that the current timeout values will never report an I/O error before the host gets fenced instead. So this is not relevant to the "expected behavior" above.

> for Gluster - 'NotResponding' --> 'Up'. Both are problematic

Is an I/O error being reported in the guest? On the host? If not, then this is again not relevant to this bug. I do not see anything in the logs indicating an I/O error was reported. It could be a problem of, e.g., gluster - that it is supposed to report an I/O error and currently does not - but that needs a separate gluster/qemu bug, with a separate and more concrete reproduction and logs. I do not see anything wrong with the feature, nor anything we can do about it; moving back to ON_QA.
Two new bugs were filed for libvirt:
https://bugzilla.redhat.com/show_bug.cgi?id=1609701
https://bugzilla.redhat.com/show_bug.cgi?id=1609561

I can't verify this bug since the end-to-end behavior didn't change: the user blocks the storage, and the VM is not moved to the Paused state due to an I/O error; instead, it stays in the Not Responding state. Regarding the missing reporting from qemu/libvirt, BZ 1609701 was filed.

The workaround for NFS VMs that used to produce the I/O error pause state (timeout=1s and retries=2 for the SD) does not work in the current version. When I change these parameters on the SD and then run the VM, the VM is paused immediately, even before the storage is blocked.

Engine enters the migration flow, which must not be started here at all: this is a non-HA VM in the Not Responding state and the host is in the Non-Operational state, so the migration has no chance to succeed and should not be started. We see that Engine keeps returning to the failed migration in a loop (see the log fragment below). For the gluster VM, this failed migration process finishes with a QEMU crash (BZ 1609561).

2018-07-29 16:01:15,203+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-77) [] EVENT_ID: USER_RUN_VM(32), VM golden_env_mixed_virtio_1_1 started on Host host_mixed_2
2018-07-29 16:03:16,469+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] VM 'cd15b8d4-29cf-41b4-860c-27a38f993fa1'(golden_env_mixed_virtio_1_1) moved from 'Up' --> 'NotResponding'
2018-07-29 16:03:16,606+03 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] EVENT_ID: VM_NOT_RESPONDING(126), VM golden_env_mixed_virtio_1_1 is not responding.
2018-07-29 16:07:36,727+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-17887) [7c4836a2] Failed to migrate VM 'golden_env_mixed_virtio_1_1'
2018-07-29 16:16:15,525+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-18032) [7fa0ad53] Failed to migrate VM 'golden_env_mixed_virtio_1_1'
2018-07-29 16:25:17,320+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-18209) [23d38dca] Failed to migrate VM 'golden_env_mixed_virtio_1_1'
2018-07-29 16:35:18,053+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-18413) [30931f20] Failed to migrate VM 'golden_env_mixed_virtio_1_1'
2018-07-29 16:45:20,125+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-18614) [c648901] Failed to migrate VM 'golden_env_mixed_virtio_1_1'
2018-07-29 16:55:13,448+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-18811) [6d8cbb62] Failed to migrate VM 'golden_env_mixed_virtio_1_1'
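The roughly ten-minute retry loop is visible in the timestamps of the repeated SetNonOperationalVdsCommand errors. A small, hypothetical helper (not Engine code; it just parses the log format shown above) that extracts the intervals between consecutive failed-migration errors:

```python
import re
from datetime import datetime

# Matches Engine log lines such as:
# 2018-07-29 16:07:36,727+03 ERROR [...SetNonOperationalVdsCommand] (...) [...] Failed to migrate VM 'x'
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d+\S* ERROR .*Failed to migrate VM"
)

def migration_retry_gaps(log_lines):
    """Return the gaps, in whole minutes, between consecutive failed-migration errors."""
    times = [
        datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
        for line in log_lines
        if (m := LINE_RE.match(line))
    ]
    return [
        round((b - a).total_seconds() / 60)
        for a, b in zip(times, times[1:])
    ]
```

Running it over the fragment above yields gaps of roughly 9-10 minutes, consistent with a periodic retry rather than a one-off failure.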
According to comment 88 and the open "depends on" bugs, we can't verify this bug.
We are dealing with too many problems here:
- Regarding migrations, it's unrelated; please file a separate bug.
- Regarding the QEMU crash on the attempted migration, thank you for reporting the QEMU bug. With respect to this bug, it's unrelated, so no further action is needed here.
- Regarding NFS pausing the VM immediately with the altered settings, I'd try a somewhat higher timeout, let's say 5s rather than 1s.
- Regarding the VM not being paused, is the guest OS still running? If so, then it's correctly reported as Up. If it should be paused, I'd say it's a problem not directly related to this bug and should be handled in the bug you reported against libvirt. Using different timeout settings as suggested above might help to verify the regular behavior.
- Regarding the not responding state, IIRC the last time I inspected your reports, the not responding state was only temporary until switching back to Up. It was due to stuck libvirt calls. Such things may happen with broken storage; we can't expect everything to work smoothly in a broken environment.

I believe that all the reported oVirt problems within the scope of the original report have been fixed. At the moment, I don't have any further report of broken oVirt behavior at hand that I could work on within this bug. Could you attempt verification with NFS again, perhaps asking the storage people for help with the settings?
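One note on the timeout suggestion: the NFS client expresses its timeout in tenths of a second (the `timeo` mount option, per nfs(5)) together with a retransmission count (`retrans`), so "timeout=1s and retries=2" corresponds to `timeo=10,retrans=2`, and the suggested 5s to `timeo=50,retrans=2`. A tiny hypothetical helper for converting and sanity-checking the values (the oVirt storage domain UI may label these fields differently; this is only an illustration of the unit conversion):

```python
def nfs_mount_opts(timeout_seconds: float, retrans: int) -> str:
    """Build NFS client mount options; timeo is in tenths of a second per nfs(5)."""
    timeo = int(timeout_seconds * 10)
    if timeo < 1:
        raise ValueError("timeout too small: timeo must be at least 1 decisecond")
    if retrans < 0:
        raise ValueError("retrans must be non-negative")
    return f"timeo={timeo},retrans={retrans}"
```

Mixing up seconds and deciseconds here is an easy way to end up with a far more aggressive timeout than intended, which could explain a VM pausing even before the storage is blocked.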
(In reply to meital avital from comment #89)
> According to comment 88 and to depends on we can't verify this bug

According to comment #90, things work as expected within the scope of the original report. What resolution do you suggest, then?
Regarding migration, a bug was filed: https://bugzilla.redhat.com/show_bug.cgi?id=1610354
Adding back needinfo on Meital for comment #91, and adding back needinfo on Polina for the other aspects of comment #90. Only the NFS item remains, I believe; bugs have been opened for the others. Unless you'd like to open a bug on that "Regarding the not responding state" item - up to you.
Since 4.2.5 has been released, this is now done.