Bug 1481022 - When blocking connection between host and NFS storage, a running VM doesn't switch to paused mode
Status: NEW
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.2.0
Hardware: Unspecified   OS: Unspecified
Priority: high   Severity: high
Target Milestone: ovirt-4.2.2
Target Release: ---
Assigned To: Martin Perina
QA Contact: Polina
Keywords: Automation
Depends On: 1496517 1497197
Blocks: rhv_turn_off_autoresume_of_paused_VMs
 
Reported: 2017-08-13 10:28 EDT by Natalie Gavrielov
Modified: 2018-02-18 03:56 EST (History)
11 users (show)

See Also:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed:
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Virt
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
rule-engine: ovirt-4.2+


Attachments
engine.log, vdsm.log, supervdsm.log (3.61 MB, application/zip)
2017-08-13 10:28 EDT, Natalie Gavrielov
engine, vdsm, libvirt logs (17.69 MB, application/zip)
2017-08-15 07:44 EDT, Natalie Gavrielov
engine.log (1.50 MB, application/zip)
2017-09-12 10:22 EDT, Natalie Gavrielov
engine.log, vdsm.log DEBUG level (18.90 MB, application/zip)
2017-10-02 08:38 EDT, Natalie Gavrielov
logs: vdsm, engine (4.13 MB, application/zip)
2017-11-26 10:17 EST, Natalie Gavrielov
logs: vdsm, engine (2.35 MB, application/zip)
2017-12-04 17:00 EST, Natalie Gavrielov
logs, DEBUG level: engine, vdsm, libvirt (5.35 MB, application/zip)
2017-12-10 10:11 EST, Natalie Gavrielov
libvirt logs (13.67 MB, application/zip)
2017-12-11 08:23 EST, Natalie Gavrielov
engine.log, vdsm.log, libvirtd.log, /var/log/messages, mount output and UI snapshots of nfs-paused-vm and iscsi-paused-vm (4.36 MB, application/zip)
2018-01-09 13:15 EST, Natalie Gavrielov

Description Natalie Gavrielov 2017-08-13 10:28:55 EDT
Created attachment 1312717 [details]
engine.log, vdsm.log, supervdsm.log

Description of problem:
When blocking connection between host and storage, a running VM doesn't switch to paused mode

Version-Release number of selected component:
ovirt-engine-4.2.0-0.0.master.20170802210223.git338a2bc.el7.centos.noarch
vdsm-4.20.2-41.gitd789a5e.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run a dd command on a running VM.
2. While the command is running, block the connection to the storage domain.

Actual results:
The VM cycles back and forth between the states Unknown, Up, Not Responding and Paused.

Expected results:
For the VM to pause.

Additional info:
VM is not defined as HA
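
For reference, a minimal shell sketch of the reproduction steps above (the storage server address and file names are placeholders, not taken from this environment):

  # inside the guest: keep the disk busy with writes
  dd if=/dev/zero of=/tmp/dd_test.img bs=1M count=4096 oflag=direct

  # on the host: drop outgoing traffic to the NFS server (placeholder IP)
  iptables -I OUTPUT -d 10.0.0.10 -j DROP

  # later, to restore connectivity, delete the same rule
  iptables -D OUTPUT -d 10.0.0.10 -j DROP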
Comment 1 Yaniv Kaul 2017-08-14 01:15:57 EDT
Can you verify it's not a QEMU/libvirt bug? What is the status of the VM in libvirt?
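
For example, the VM state as libvirt sees it can be sampled on the host with virsh (the VM name below is a placeholder):

  virsh -r list --all                      # -r: read-only connection
  virsh -r domstate --reason <vm-name>     # current state plus the reason libvirt recorded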
Comment 2 Michal Skrivanek 2017-08-14 02:59:54 EDT
- back and forth between what states? all four you mentioned?
- point to a specific time frame and entity
- describe the storage setup and how you blocked the connection
Comment 3 Natalie Gavrielov 2017-08-15 07:44 EDT
Created attachment 1313602 [details]
engine,vdsm, libvirt logs

(In reply to Yaniv Kaul from comment #1)
> Can you verify it's not a QEMU/libvirt bug? What is the status of the VM in
> libvirt?

The list of statuses the engine displayed is below. During that time I sampled the status of the VM in libvirt: the test began around 11:28, and only at around 12:00 did the status change from 'running' to 'paused' in libvirt (and it remained that way).

(In reply to Michal Skrivanek from comment #2)
> - back and forth between what states? all four you mentioned?

It goes like this:
Not responding -> Unknown -> Up* -> Migrating -> Not responding -> Unknown -> Not responding -> Migrating -> Not responding -> Unknown -> Up* -> Unknown -> Up -> Migrating -> ... -> Paused** -> Not responding -> Paused -> Not responding -> Unknown (with a message "vm has been paused") -> Unknown (without a message at all)

* The engine shows the VM as up and running on the same host that has no connection to the storage.
** Up to this point the VM was reported as running by libvirt; from this moment on, libvirt reported the VM as paused.

> - point to a specific time frame and entity
(attached new logs)
Connection blockage starts at around 2017-08-14 11:27
vm name: vm_nfs_1411250142

> - describe the storage setup and how you blocked the connection
engine (not hosted), 3 hosts, 3 NFS SDs (on the same storage server), 3 iSCSI SDs (on the same storage server), and 3 GlusterFS SDs (again, all 3 on the same storage server).

The VM's disk was on NFS.

The connection was blocked using iptables (block outgoing connection from host to storage)
Comment 4 Tomas Jelinek 2017-09-04 10:43:18 EDT
(In reply to Natalie Gavrielov from comment #3)
> Created attachment 1313602 [details]
> engine,vdsm, libvirt logs
> 
> (In reply to Yaniv Kaul from comment #1)
> > Can you verify it's not a QEMU/libvirt bug? What is the status of the VM in
> > libvirt?
> 
> There is the list of statuses engine displayed below, during that time I
> sampled the status of the VM in libvirt, the test began around 11:28 and
> only at around 12 the status changed from 'running' to 'paused' in libvirt
> (and remained that way).  
> 
> (In reply to Michal Skrivanek from comment #2)
> > - back and forth between what states? all four you mentioned?
> 
> It goes like this:
> Not responding -> Unknown -> Up* -> Migrating -> Not responding -> Unknown
> -> Not responding -> Migrating -> Not responding -> Unknown -> Up* ->
> Unknown -> Up -> Migrating -> ... -> Paused** -> Not responding -> Paused ->
> Not responding -> Unknown (with a message "vm has been paused") -> Unknown
> (without a message at all)
> 
> * It shows that the vm is up and running on the same host that doesn't have
> connection to the storage.
> ** Up until this point the vm was reported as running by libvirt - since
> this moment on, libvirt reported vm is paused 
> 
> > - point to a specific time frame and entity
> (attached new logs)
> Connection blockage starts at around 2017-08-14 11:27

The engine log ends at 2017-08-14 02:55:47, during the night. Can you please provide the subsequent logs, which contain the specific timeframe?

> vm name: vm_nfs_1411250142
> 
> > - describe the storage setup and how you blocked the connection
> engine (not hosted), 3 hosts, 3 nfs SD (on the same storage server), 3 iscsi
> SD (on the same storage server), and 3 glusterFS (again, all 3 on the same
> storage server).
> 
> The VM's disk was on NFS.
> 
> The connection was blocked using iptables (block outgoing connection from
> host to storage)
Comment 5 Natalie Gavrielov 2017-09-12 10:22 EDT
Created attachment 1324928 [details]
engine.log
Comment 6 Michal Skrivanek 2017-09-18 06:44:14 EDT
there are several issues, but it seems the core problem lies in infra. 
The sequence goes like this:
engine detects 
2017-08-14 11:27:57,854+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [536b9c5e] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10
,802), VDSM host_mixed_1 command GetStatsVDS failed: Heartbeat exceeded

corresponding side on vdsm says:
2017-08-14 11:28:26,929+0300 WARN  (JsonRpc (StompReactor)) [vds.dispatcher] unhandled write event (betterAsyncore:160)
2017-08-14 11:28:26,934+0300 WARN  (JsonRpc (StompReactor)) [vds.dispatcher] unhandled close event (betterAsyncore:160)

This causes the VM state to become NotResponding. Before the host starts VdsNotRespondingTreatment, the storage is detected as non-operational, so the host status is set to NonOperational and the VMs to Unknown.
Soft fencing restarts vdsm, which recovers the VMs, starts responding (after reporting "Recovering from crash or initializing") and reports an accurate VM status (Up).
Immediately afterwards the engine attempts to reconnect the storage, which fails with the same heartbeat exception. The host is set to NonOperational. That now triggers a migration to another host (since the VM was Up), but that's not really significant. Eventually, after the vdsm timeout on the storage monitor, it is again reported as NotResponding, while that "heartbeat exceeded" network exception triggers another round of soft fencing.

Ultimately the core issue is the network exception, which IMO should not happen.

Piotr, please take a look.
Comment 7 Piotr Kliczewski 2017-09-18 07:47:27 EDT
Checking the logs I see the following sequence:

1. Vdsm responds to VM.getStats - 2017-08-14 11:27:53,254+0300 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call VM.getStats succeeded in 0.00 seconds (__init__:604)

2. Possibly mom connecting to vdsm 2017-08-14 11:27:53,522+0300 INFO  (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:51780 (protocoldetector:74)

3. Last rpc response possibly to mom 2017-08-14 11:27:55,011+0300 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:604)

4. Engine timeouts with lack of response to VM.getStats 2017-08-14 11:27:57,855+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler6) [536b9c5e] Command 'GetStatsVDSCommand(HostName = host_mixed_1, VdsIdAndVdsVDSCommandParametersBase:{hostId='1546ec3c-2eb4-4ff0-a981-1c7e74cf3c37', vds='Host[host_mixed_1,1546ec3c-2eb4-4ff0-a981-1c7e74cf3c37]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded

5. The engine closes the connection, which ends up logging this on the vdsm side: 2017-08-14 11:28:26,929+0300 WARN  (JsonRpc (StompReactor)) [vds.dispatcher] unhandled write event (betterAsyncore:160)
2017-08-14 11:28:26,934+0300 WARN  (JsonRpc (StompReactor)) [vds.dispatcher] unhandled close event (betterAsyncore:160)

My question is whether we see any spikes in CPU or network utilization. I am looking for a reason why the VM.getStats response did not arrive at the engine on time even though it was sent from vdsm.

I wonder whether we are facing network congestion or a compute issue caused by vdsm. Is vdsm running in a VM or on bare metal?
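
A few host-side commands that could help answer the utilization question (a sketch; sar requires the sysstat package):

  top -b -n 1 | head -20    # snapshot of CPU load and the busiest processes
  sar -u 1 5                # CPU utilization, five one-second samples
  sar -n DEV 1 5            # per-interface network throughput
  vmstat 1 5                # run queue, memory and I/O wait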
Comment 8 Michal Skrivanek 2017-09-18 09:35:39 EDT
But isn't that exception about the heartbeat being exceeded too early? If getStats is the one which failed - since it's the last engine-vdsm communication - then it's just 11:27:53 on the vdsm side and the engine network exception at 11:27:57.
And IIUC the heartbeat timeout is 30s now?
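
If it helps, the configured heartbeat interval can be read on the engine side; the option name below is my assumption about the relevant key, so please double-check it:

  engine-config -g vdsHeartbeatInSeconds    # assumed key name; expected default is 30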
Comment 9 Piotr Kliczewski 2017-09-18 09:56:52 EDT
We do not see when the command/verb was called; we do not log it. The interval was tested before and we did not see any issues with it. There have been no changes in this code for a long time, so for the moment I assume that vdsm is to blame here.
Comment 10 Piotr Kliczewski 2017-09-18 09:57:41 EDT
It would help to get the engine logs in debug so we could better understand timing from the engine perspective.
Comment 11 Michal Skrivanek 2017-09-18 10:00:06 EDT
I'm not saying it's small. But doesn't it mean it's about a call from 30 seconds before that? The log doesn't say when it starts, but it does log the duration when it finishes (0s here), so is it wrong to assume it was called at that time?
Comment 12 Piotr Kliczewski 2017-09-18 10:06:20 EDT
The log does not say how long it takes for a frame to arrive and come back. We only know how long it took to execute it on the vdsm side. We know that the message was processed at 2017-08-14 11:27:53,254+0300 and it took 0s, but we do not know how long it took to get there and how long it took for the engine to receive the response. If the network was congested or the CPU busy enough, the steps that we do not know about could potentially take longer than 30 secs. That is why we need engine logs in debug to know the specific numbers.
Comment 13 Tomas Jelinek 2017-09-20 03:17:51 EDT
Could you please provide the debug logs?
Comment 14 Natalie Gavrielov 2017-09-28 11:22:18 EDT
(In reply to Piotr Kliczewski from comment #7)

> I wonder whether we face network congestion or compute issue caused by vdsm.
> Is vdsm running in a vm or bare metal?

VM.

(In reply to Tomas Jelinek from comment #13)
> Could you please provide the debug logs?

I tried running the scenario described again so I can provide debug logs, with build:
ovirt-engine-4.2.0-0.0.master.20170904111624.git842d79b.el7.centos.noarch
Now the bug won't reproduce; it works as expected - the VM switches state from running to paused when blocking the connection between the host and the storage domain.
I guess these logs will be useless. I'll try to find an environment with the same build to see if it will reproduce.
Comment 15 Natalie Gavrielov 2017-10-02 08:38 EDT
Created attachment 1333227 [details]
engine.log, vdsm.log DEBUG level

(In reply to Natalie Gavrielov from comment #14)
> (In reply to Tomas Jelinek from comment #13)
> > Could you please provide the debug logs?
> 
> I tried running the scenario described again so I can provide debug logs,
> with build:
> ovirt-engine-4.2.0-0.0.master.20170904111624.git842d79b.el7.centos.noarch
> Now the bug won't recreate, it works as expected - VM switches state from
> running to paused when blocking the connection between the host and storage
> domain..
> guess these logs will be useless.. I'll try to find an environment with the
> same build to see if it will reproduce.

Apparently, when I tried reproducing this I used iSCSI instead of NFS.
Now I've attached debug logs for engine and vdsm.
Builds used:
ovirt-engine-4.2.0-0.0.master.20170929123516.git007c392.el7.centos.noarch
vdsm-4.20.3-121.git77235c7.el7.centos.x86_64

It's an environment with 3 hosts.

Blocking the connection to the storage domain started at around 2017-10-02 15:00,
and during the state switching described in comment 3 there were a few attempts to migrate the VM; here is one of them:
 2017-10-02 15:03:14,871+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-90) [49fc1b2e] Failed to migrate VM 'vm_TestCaseBlockedConnecti_0214574984'
Comment 16 Piotr Kliczewski 2017-10-02 09:31:07 EDT
Based on the logs I see a heartbeat issue:

2017-10-02 15:03:14,742+03 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-63) [] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded

which was caused by a vdsm crash:

2017-10-02 15:03:13,673+0300 DEBUG (periodic/1) [storage.TaskManager.Task] (Task='e86cb403-e3a0-4017-ac8a-9392e906412b') ref 0 aborting False (task:999)
2017-10-02 15:03:46,874+0300 INFO  (MainThread) [vds] (PID: 28763) I am the actual vdsm 4.20.3-121.git77235c7.el7.centos storage-jenkins-ge4-vdsm1.scl.lab.tlv.redhat.com (3.10.0-693.el7.x86_64) (vdsmd:148)

By the way, I see many blocked tasks, which could be caused by the famous libvirt lock.
Comment 17 Martin Perina 2017-10-06 09:04:47 EDT
Based on the above, adding a dependency on BZ1496517 and targeting 4.2.
Comment 18 Martin Perina 2017-11-21 14:22:54 EST
The underlying BZ1497197 was verified and released in 7.4.z within an October 19th batch, so moving to ON_QA.
Comment 19 Natalie Gavrielov 2017-11-26 10:17 EST
Created attachment 1359161 [details]
logs: vdsm, engine

Performed the scenario described in comment 0. Now the VM switches to the "Not Responding" state (showing a question mark icon); it does not switch to the "VM paused" state at any point, as it should.

Builds:
ovirt-engine-4.2.0-0.0.master.20171124141652.gita5b4f9b.el7.centos.noarch
vdsm-4.20.8-29.git6ceb389.el7.centos.x86_64
Comment 20 Yaniv Kaul 2017-12-03 11:07:19 EST
(In reply to Natalie Gavrielov from comment #19)
> Created attachment 1359161 [details]
> logs: vdsm, engine
> 
> Performed the scenario described in comment 0, now the VM switches state to
> "Not Responding" (showing question mark icon), it does not switch state to
> "VM paused" at any point as should be.
> 
> Builds:
> ovirt-engine-4.2.0-0.0.master.20171124141652.gita5b4f9b.el7.centos.noarch
> vdsm-4.20.8-29.git6ceb389.el7.centos.x86_64

Are you sure it's not a libvirt/QEMU bug?
Comment 21 Yaniv Kaul 2017-12-03 11:10:28 EST
(In reply to Natalie Gavrielov from comment #14)
> (In reply to Piotr Kliczewski from comment #7)
> 
> > I wonder whether we face network congestion or compute issue caused by vdsm.
> > Is vdsm running in a vm or bare metal?
> 
> VM.

Can you try to reproduce on bare-metal? Is the nested VM perhaps choked on resources?

> 
> (In reply to Tomas Jelinek from comment #13)
> > Could you please provide the debug logs?
> 
> I tried running the scenario described again so I can provide debug logs,
> with build:
> ovirt-engine-4.2.0-0.0.master.20170904111624.git842d79b.el7.centos.noarch

This is quite an ancient build?

> Now the bug won't recreate, it works as expected - VM switches state from
> running to paused when blocking the connection between the host and storage
> domain..
> guess these logs will be useless.. I'll try to find an environment with the
> same build to see if it will reproduce.
Comment 22 Martin Perina 2017-12-04 03:45:44 EST
(In reply to Natalie Gavrielov from comment #19)
> Created attachment 1359161 [details]
> logs: vdsm, engine
> 
> Performed the scenario described in comment 0, now the VM switches state to
> "Not Responding" (showing question mark icon), it does not switch state to
> "VM paused" at any point as should be.
> 
> Builds:
> ovirt-engine-4.2.0-0.0.master.20171124141652.gita5b4f9b.el7.centos.noarch
> vdsm-4.20.8-29.git6ceb389.el7.centos.x86_64

Are you sure that you are using libvirt-python-3.2.0-3.el7_4.1 which should contain the fix for BZ1481022 in RHEL 7.4.z?
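
For the record, the installed versions can be confirmed on the host with a quick query:

  rpm -q libvirt-python vdsm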
Comment 23 Natalie Gavrielov 2017-12-04 04:42:47 EST
(In reply to Yaniv Kaul from comment #21)
> (In reply to Natalie Gavrielov from comment #14)
> > (In reply to Piotr Kliczewski from comment #7)
> Can you try to reproduce on bare-metal? Is the nested VM perhaps choked on
> resources?
I'm not sure resources have anything to do with it; when I run the same test with iSCSI storage it works fine (same environment).
Looking for a bare-metal env in the meantime.
 
> > (In reply to Tomas Jelinek from comment #13)
> > > Could you please provide the debug logs?
> > 
> > I tried running the scenario described again so I can provide debug logs,
> > with build:
> > ovirt-engine-4.2.0-0.0.master.20170904111624.git842d79b.el7.centos.noarch
> 
> This is quite an ancient build?
I have a nested env with rhvm-4.2.0-0.5.master.el7.noarch; it behaves the same.

(In reply to Martin Perina from comment #22)
> (In reply to Natalie Gavrielov from comment #19)
> > Created attachment 1359161 [details]
> > logs: vdsm, engine
> > 
> > Performed the scenario described in comment 0, now the VM switches state to
> > "Not Responding" (showing question mark icon), it does not switch state to
> > "VM paused" at any point as should be.
> > 
> > Builds:
> > ovirt-engine-4.2.0-0.0.master.20171124141652.gita5b4f9b.el7.centos.noarch
> > vdsm-4.20.8-29.git6ceb389.el7.centos.x86_64
> 
> Are you sure that you are using libvirt-python-3.2.0-3.el7_4.1 which should
> contain the fix for BZ1481022 in RHEL 7.4.z?
On the nested env I'm working on, with builds:
rhvm-4.2.0-0.5.master.el7.noarch 
vdsm-4.20.8-1.el7ev.x86_64
libvirt-python-3.2.0-3.el7_4.1.x86_64
It still happens.
Comment 24 Piotr Kliczewski 2017-12-04 06:21:46 EST
(In reply to Natalie Gavrielov from comment #19)
> Created attachment 1359161 [details]
> logs: vdsm, engine
> 
> Performed the scenario described in comment 0, now the VM switches state to
> "Not Responding" (showing question mark icon), it does not switch state to
> "VM paused" at any point as should be.

This statement does not seem to be correct:

2017-11-26 15:25:24,749+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-3) [] VM 'aa091767-2da3-4c7d-b45c-26dd2fff3f48'(vm_TestCaseBlockedConnecti_2614572831) moved from 'Up' --> 'Paused'

Please specify exactly which VM(s) we are talking about and at which point, because I see a couple of state switches to 'Paused'.
Comment 25 Natalie Gavrielov 2017-12-04 17:00 EST
Created attachment 1362839 [details]
logs: vdsm, engine

Bare-metal env, engine with 2 hosts.
from engine.log:
2017-12-04 22:53:49,281+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-73) [] VM '457c50e6-c794-4d50-bf42-a443f926f8b9'(test-vm1) moved from 'Up' --> 'NotResponding'

During the entire test, VM was in NotResponding state.
Removed the rule from iptables at 04/12/17 23:26:14

Then, this came out in the engine.log:
2017-12-04 23:26:22,448+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-1) [] VM '457c50e6-c794-4d50-bf42-a443f926f8b9'(test-vm1) moved from 'NotResponding' --> 'Paused'
2017-12-04 23:26:32,294+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-2) [] VM '457c50e6-c794-4d50-bf42-a443f926f8b9'(test-vm1) moved from 'Paused' --> 'Up'

It seems that the VM becomes Paused only when the connection is restored.

Builds:
ovirt-engine-4.2.0-0.0.master.20171124141652.gita5b4f9b.el7.centos.noarch
vdsm-4.20.7-1.el7.centos.x86_64
libvirt-python-3.2.0-3.el7_4.1.x86_64
Comment 26 Martin Perina 2017-12-05 04:21:05 EST
I've looked at the latest logs and from the infra point of view the host statuses are fine:

1. At 2017-12-04 22:44:05,678+02 VM 'test-vm1' was started on host 'h1'

2. At 2017-12-04 22:53:49,281+02 VM 'test-vm1' was moved to NotResponding

3. At 2017-12-04 22:59:32,085+02 host 'h1' was moved to NonOperational

4. At 2017-12-04 23:00:00,301+02 engine failed to autorecover host 'h1'

5. At 2017-12-04 23:03:38,881+02 data center 'DC1' moved to NonOperational as host 'h1' is the SPM

6. Several attempts to recover host 'h1', all unsuccessful

7. At 2017-12-04 23:26:22,448+02 VM 'test-vm' moved from NotResponding to Paused

8. At 2017-12-04 23:26:32,294+02 VM 'test-vm' moved from Paused to Up

9. At 2017-12-04 23:26:35,565+02 host 'h1' was moved back to Up


So from the infra point of view, host status changes are working fine. Tomas, could someone from the virt team take a look at whether the VM statuses also changed as expected?
Comment 27 Piotr Kliczewski 2017-12-05 04:30:47 EST
Based on the recent logs I see that VM monitoring did not run between 2017-12-04 22:53:49,281+02 and 2017-12-04 23:26:22,448+02. The question is why the VMs were not updated properly.
Comment 28 Polina 2017-12-05 08:49:40 EST
Similar behaviour has also been tested with Gluster storage, with a problematic though slightly different result: the VM cycles through statuses and finally pauses "due to unknown storage error" instead of the expected "I/O error".

Here are the engine reports for this scenario:
      08:50:13  (vm_gluster) moved from 'Up' --> 'NotResponding'
      08:50:42  (vm_gluster) moved from 'NotResponding' --> 'Up'
      08:50:43  (vm_gluster) moved from 'Up' --> 'NotResponding'
      08:50:48  (vm_gluster) moved from 'NotResponding' --> 'Paused'
                 VM vm_gluster has been paused.
                 VM vm_gluster has been paused due to unknown storage error.
      08:50:59  (vm_gluster) moved from 'Paused' --> 'NotResponding'
      08:51:13  (vm_gluster) moved from 'NotResponding' --> 'Paused'
                 VM vm_gluster has been paused.
                 VM vm_gluster has been paused due to unknown storage error.

This bug prevents testing the VM Resume Behaviour features with NFS and Gluster. Only iSCSI can be tested during this phase.
Comment 29 Michal Skrivanek 2017-12-06 05:14:38 EST
There are no obvious errors anywhere (other than what's expected).
It does look like the VM stats call is still getting stuck - it looks like the call never returns. Since the heartbeat now works, the engine side supposedly just hangs and no further monitoring cycles are scheduled - that would be quite bad.

Unfortunately these frequent API calls are not logged by default.
Please enable debug log level on vdsm and the engine, and also provide libvirt.log.
Comment 30 Piotr Kliczewski 2017-12-06 09:33:00 EST
(In reply to Michal Skrivanek from comment #29)
> There are no obvious errors anywhere (other than what's expected)
> It does look like the VM stats call is still getting stuck. It looks like
> the call never returns. Since the heartbeat now works the engine side
> supposedly just hangs and no further monitoring cycles are scheduled - that
> would be quite bad.

Don't forget that we have a higher timeout which kicks in when a response does not arrive on time. It has been in place since xmlrpc, where we had a new connection for every call. Now we maintain the connection and use heartbeats to detect any issues.

> 
> Unfortunately these frequent API calls are not being logged by default
> Please enable debug log level on vdsm and engine, and provide also
> libvirt.log
Comment 31 Polina 2017-12-07 06:01:14 EST
Could you please explain how to set the DEBUG log level for engine.log and vdsm.log?
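
Not an authoritative answer, but on the vdsm side the log level is usually raised via its logger configuration; the sketch below assumes the standard file layout, and the exact section names may differ per version. The engine side is configured in the WildFly logging subsystem and is best taken from the oVirt documentation.

  # on the host: switch vdsm loggers from INFO to DEBUG, then restart vdsm
  sed -i 's/level *= *INFO/level = DEBUG/g' /etc/vdsm/logger.conf
  systemctl restart vdsmd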
Comment 32 Natalie Gavrielov 2017-12-10 10:11 EST
Created attachment 1365632 [details]
logs, DEBUG level: engine, vdsm, libvirt

(In reply to Michal Skrivanek from comment #29)
> There are no obvious errors anywhere (other than what's expected)
> It does look like the VM stats call is still getting stuck. It looks like
> the call never returns. Since the heartbeat now works the engine side
> supposedly just hangs and no further monitoring cycles are scheduled - that
> would be quite bad.
> 
> Unfortunately these frequent API calls are not being logged by default
> Please enable debug log level on vdsm and engine, and provide also
> libvirt.log

Builds used for testing:
ovirt-engine-4.2.0-0.5.master.el7.noarch
vdsm-4.20.8-1.el7ev.x86_64

The connection was blocked at around 16:40.
The connection was restored at around 16:55.
VM name: vm_TestCaseBlockedConnecti_1016333518
Comment 33 Michal Skrivanek 2017-12-11 04:18:22 EST
(In reply to Natalie Gavrielov from comment #32)
The libvirt log seems to be from the wrong time - please check the timezone.

Other than that, I can see the host responds in a correct way, with monitorResponse -1 signifying that libvirt periodic monitoring is not responding.
We need libvirt logs from the right period - but since it's NFS it could be just the timeout - check the NFS mount parameters as well.
Comment 34 Natalie Gavrielov 2017-12-11 08:23 EST
Created attachment 1365999 [details]
libvirt logs

(In reply to Michal Skrivanek from comment #33)
> (In reply to Natalie Gavrielov from comment #32)
> libvirt log seems to be from a wrong time - please check timezone
> 
> other than that - I can see the host responds in a correct way, with
> monitorResponse -1 signifying that libvirt periodic monitoring is not
> responding. 
> Need libvirt logs from the right period - but since it's NFS it could be
> just the timeout - check NFS mount parameters as well.

Attached libvirt logs (for all 3 hosts) with the relevant time frame.
Can you please specify exactly which parameters for NFS mount to check?
Comment 35 Michal Skrivanek 2017-12-11 11:39:28 EST
(In reply to Natalie Gavrielov from comment #34)

> Attached libvirt logs (for all 3 hosts) with the relevant time frame.
> Can you please specify exactly which parameters for NFS mount to check?

I see it's stuck in libvirt side until the last call at 2017-12-10 14:56:38.707+0000: 28509: error : qemuDomainObjBeginJobInternal:3839 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)

In the vdsm log it starts reacting again at 2017-12-10 16:56:49,008+0200 and is reported as Up by vdsm (via a VM status event), but the VM is shut down shortly after, at 2017-12-10 16:56:59,502+0200 (vdsm API request). You did that manually from the engine, as logged in engine.log:
2017-12-10 16:56:59,463+02 INFO  [org.ovirt.engine.core.bll.StopVmCommand] (default task-2) [vms_syncAction_671d3681-85d0-4ec9] Running command: StopVmCommand(StopVmType = NORMAL, StopReason = null, VmId = 2cd65ab8-6294-493e-a1fb-580c4475f041, QuotaId = null) internal: false. Entities affected :  ID: 2cd65ab8-6294-493e-a1fb-580c4475f041 Type: VMAction group STOP_VM with role type USER
There is one more report right after the Up event is sent - at 2017-12-10 16:56:50,393+0200 vdsm sends monitorResponse -1 again as a response to Host.getAllVmStats. It may be due to the last monitoring cycle still not getting any data (the Up event is sent just 0.8s earlier). This shouldn't cause much of an issue other than a one-time blip in the VM status.

On the engine side I see the VM was not responding during the outage - which seems to be correct - then reported as Up (2017-12-10 16:56:49,021+02), then again NotResponding (2017-12-10 16:56:50,400+02), and then, after the shutdown, Down (2017-12-10 16:57:02,784+02).

I do not see any problem with VM status handling.
Comment 36 Michal Skrivanek 2017-12-11 11:43:00 EST
As for NFS, you should check the timeout values before the kernel actually reports an I/O error. The default is 6m IIRC, but it really depends on your client and server parameters - if you believe the underlying storage should have returned an I/O error during your test period, this should be looked at by the libvirt team. In that case please supply the parameters and the journal log.
Comment 37 Michal Skrivanek 2017-12-12 08:15:00 EST
When you access a file on that NFS mount _on the host_, you should get EIO after that timeout. That would verify your NFS is set up correctly, and then the VM should behave the same - only when it differs in behavior is it a valid (likely libvirt) bug.
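
A sketch of how the mount parameters and the host-side check could be done (the path below is a placeholder; which options matter - hard vs. soft, timeo, retrans - depends on the setup):

  # show the effective NFS mount options (timeo, retrans, hard/soft)
  nfsstat -m

  # read directly from the blocked mount on the host, bypassing the page cache;
  # with a soft mount this should eventually fail with an I/O error,
  # with a hard mount it will keep retrying
  dd if=/rhev/data-center/mnt/<server>:<export>/<sd-id>/dom_md/metadata of=/dev/null bs=4k count=1 iflag=direct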
Comment 38 Yaniv Kaul 2018-01-02 08:36:22 EST
Natalie, any updates or can we close this bug?
Comment 39 Raz Tamir 2018-01-08 07:21:30 EST
(In reply to Michal Skrivanek from comment #37)
> when you access a file on that nfs _on host_, you should get EIO after that
> timeout. That would verify your NFS is set up correctly and then VM should
> behave the same - only hen it differs in behavior it's a valid (likely
> libvirt) bug

Hi Michal,

There is no access to the storage from the host, so I don't see the point of checking whether the NFS is set up correctly if we don't have communication to it - it is as if it doesn't exist after blocking the connection.

The behavior should be exactly like running the VM with a disk on a block-type storage domain - the VM should enter the PAUSED state.
Comment 40 Michal Skrivanek 2018-01-08 12:36:37 EST
Raz, what do you mean by no access to the storage from the host? That's what Natalie is trying to test IIUC, and for that I asked to confirm the behavior when similar access is done directly by the host, without qemu involvement.

Or you mean there is just no storage access by the guest? Then indeed the VM doesn't pause, but then this whole bug is just a mistake - obviously the VM doesn't pause when there is no storage access whatsoever.

"The behavior should be exactly like running the VM with disk on block type storage domain - the VM should enter PAUSED state" - not exactly. The VM behavior should match the host behavior...so the VM Pauses iff EIO is returned for backing storage access by the host. With default NFS settings we use that's very different from e.g. FC
Comment 41 Raz Tamir 2018-01-09 03:23:08 EST
Thanks for clarifying Michal,

Is there any way to overcome the differences between the storage types?

To summarize this bug, we are looking for consistency no matter what the storage type is - the VM should not care where its disk resides (i.e. NFS, iSCSI).

One thing I do not really understand is that you mentioned that the EIO is returned from the storage server.
If the connection between the host and the storage is blocked, how can the host receive any return code from it?

I hope the request/expected result is clear from our POV
Comment 42 Michal Skrivanek 2018-01-09 03:29:53 EST
(In reply to Raz Tamir from comment #41)
> Thanks for clarifying Michal,
> 
> Is there any way to overcome the differences between the storage types?
> 
> To summarize this bug, we are looking for consistency no matter what the
> storage type is - the VM should not care where it's disk resides on (i.e.
> NFS, iSCSI).

The VM indeed does not care, but the timing of EIO would be different for different storage backends in any case due to transport limitations/configuration

> One thing I do not really understand is that you mentioned that the EIO is
> returned from the storage server.
> If the connection between the host and the storage is blocked, how can the
> host, receive any RC from it?

Sorry, maybe that wasn't clear in my text. I meant the transport returns EIO, and that does depend on the specifics of each transport. So e.g. for FC, where the stack basically immediately detects that the path is wrong, the EIO is immediate too. Whereas on IP-based transports, when there is an issue somewhere along the way and packets are silently dropped... it takes time. For NFS that time is defined by the client.
Comment 43 Natalie Gavrielov 2018-01-09 13:15 EST
Created attachment 1379222 [details]
engine.log, vdsm.log, libvirtd.log, /var/log/messages, mount output and a UI snapshots of nfs-paused-vm and iscsi-paused-vm

I tested the following:
1. Write from the host to the NFS storage (a simple dd).
2. Write from a VM that's running on the host to a disk that is on NFS storage.
3. While the file writing takes place, block the connection to the storage (using iptables).

Result:
The command line just freezes for a long, long time (a lot longer than the default 10 minutes we know about).
/var/log/messages shows the following:
Jan  9 11:39:07 storage-jenkins-ge4-vdsm3 kernel: nfs: server yellow-vdsb.qa.lab.tlv.redhat.com not responding, timed out
Jan  9 11:33:20 storage-jenkins-ge4-vdsm3 vdsm[594]: WARN Worker blocked: <Worker name=periodic/2 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x2ec7d50> at 0x2ec7d90> timeout=15, duration=15 at 0x46e0450> task#=15626 at 0x2fa63d0>, traceback:#012File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap#012  self.__bootstrap_inner()#012File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner#012  self.run()#012File: "/usr/lib64/python2.7/threading.py", line 765, in run#012  self.__target(*self.__args, **self.__kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run#012  ret = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run#012  self._execute_task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012  task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012  self._callable()#012File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 213, in __call__#012  self._func()#012File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 578, in __call__#012  stats = hostapi.get_stats(self._cif, self._samples.stats())#012File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 62, in get_stats#012  avail, commit = _memUsageInfo(cif)#012File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 121, in _memUsageInfo#012  mem_info = v.memory_info()#012File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 802, in memory_info#012  dom_stats = self._dom.memoryStats()#012File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f#012  ret = attr(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper#012  ret = f(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper#012  return func(inst, *args, **kwargs)#012File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 1449, in memoryStats#012  ret = libvirtmod.virDomainMemoryStats(self._o)
Jan  9 11:33:35 storage-jenkins-ge4-vdsm3 vdsm[594]: WARN Worker blocked: <Worker name=periodic/2 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x2ec7d50> at 0x2ec7d90> timeout=15, duration=30 at 0x46e0450> task#=15626 at 0x2fa63d0>, traceback:#012File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap#012  self.__bootstrap_inner()#012File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner#012  self.run()#012File: "/usr/lib64/python2.7/threading.py", line 765, in run#012  self.__target(*self.__args, **self.__kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run#012  ret = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run#012  self._execute_task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012  task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012  self._callable()#012File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 213, in __call__#012  self._func()#012File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 578, in __call__#012  stats = hostapi.get_stats(self._cif, self._samples.stats())#012File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 62, in get_stats#012  avail, commit = _memUsageInfo(cif)#012File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 121, in _memUsageInfo#012  mem_info = v.memory_info()#012File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 802, in memory_info#012  dom_stats = self._dom.memoryStats()#012File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f#012  ret = attr(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper#012  ret = f(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper#012  return func(inst, *args, **kwargs)#012File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 1449, in memoryStats#012  ret = libvirtmod.virDomainMemoryStats(self._o)

------------------------------------------------------------------------------------

Later on, I restarted libvirtd (something that has to do with libvirt logging, not this issue)
and retested the original scenario described in comment 0; blocking the connection started at ~17:30.
Result:
libvirtd.log:
2018-01-09 15:30:38.414+0000: 9083: error : qemuDomainObjBeginJobInternal:3839 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)

2018-01-09 15:52:26.276+0000: 9083: error : qemuDomainObjBeginJobInternal:3839 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
{"timestamp": {"seconds": 1515513153, "microseconds": 425983}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "nospace": false, "__com.redhat_reason": "eio", "node-name": "#block140", "reason": "Input/output error", "operation": "read", "action": "stop"}}
2018-01-09 15:52:33.433+0000: 9076: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f9a20023a60 event={"timestamp": {"seconds": 1515513153, "microseconds": 425983}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "nospace": false, "__com.redhat_reason": "eio", "node-name": "#block140", "reason": "Input/output error", "operation": "read", "action": "stop"}}



vdsm.log.1:
2018-01-09 17:30:59,314+0200 ERROR (monitor/b7f7132) [storage.Monitor] Error checking domain b7f7132e-1f8a-48bf-ad00-2905c71f18d3 (monitor:424)
2018-01-09 17:30:59,329+0200 ERROR (monitor/1dd8b62) [storage.Monitor] Error checking domain 1dd8b623-b91f-4a40-bdb6-631ce1715663 (monitor:424)
2018-01-09 17:30:59,332+0200 ERROR (monitor/4e28206) [storage.Monitor] Error checking domain 4e282067-fcf7-421c-b28c-ed36a8eee0c8 (monitor:424)
2018-01-09 17:30:59,334+0200 ERROR (monitor/b25ad09) [storage.Monitor] Error checking domain b25ad098-d8c9-46ed-90fe-367428c5a44d (monitor:424)

2018-01-09 17:40:01,101+0200 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge4__nfs__1/4e282067-fcf7-421c-b28c-ed36a8eee0c8/dom_md/metadata (monitor:498)
2018-01-09 17:40:01,114+0200 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge4__nfs__0/1dd8b623-b91f-4a40-bdb6-631ce1715663/dom_md/metadata (monitor:498)
2018-01-09 17:40:01,372+0200 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge4__nfs__2/b25ad098-d8c9-46ed-90fe-367428c5a44d/dom_md/metadata (monitor:498)

2018-01-09 17:53:03,493+0200 ERROR (libvirt/events) [vds] Error running VM callback (clientIF:654)


engine.log:
2018-01-09 17:30:48,701+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-24) [] VM 'b4fcb39a-d1a3-449a-aba3-518c072168e5'(vm_TestCaseBlockedConnecti_0917235822) moved from 'Up' --> 'NotResponding'
2018-01-09 17:52:33,488+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-8) [] VM 'b4fcb39a-d1a3-449a-aba3-518c072168e5'(vm_TestCaseBlockedConnecti_0917235822) moved from 'NotResponding' --> 'Paused'
2018-01-09 17:52:33,550+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-8) [] EVENT_ID: VM_PAUSED_EIO(145), VM vm_TestCaseBlockedConnecti_0917235822 has been paused due to storage I/O problem.
2018-01-09 17:52:39,242+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-35) [] VM 'b4fcb39a-d1a3-449a-aba3-518c072168e5'(vm_TestCaseBlockedConnecti_0917235822) moved from 'Paused' --> 'NotResponding'          <---- why does it go back to not responding?

webadmin Compute -> Virtual Machines
The little icon on the left shows a question mark.
When hovering over this icon, the following message appears: 
"Not Responding
VM has been paused due to a storage I/O error"

VM status on the right column shows "Not Responding"

When performing the same test with iSCSI storage, the icon is not a question mark but a pause icon; when hovering over it, the following message appears:
"Paused
VM has been paused due to a storage I/O error"
and the status column on the right shows "Paused"
snapshots attached

Attached:
engine.log
vdsm.log
libvirtd.log
/var/log/messages
mount output
2 snapshots of the paused VMs (iscsi and nfs)
Comment 44 Ademar Reis 2018-01-10 16:46:27 EST
This is similar to Bug 665820. To understand what's happening at the QEMU level, please see https://bugzilla.redhat.com/show_bug.cgi?id=665820#c11 (and other comments there).

TL;DR: we can't know we got stuck until some time after getting stuck (the comments in the BZ above explain why). And we can't think of a way to implement a "detect system call got stuck on NFS" feature in QEMU that doesn't require major architectural surgery, with a considerable risk of introducing nasty bugs.
Comment 45 Yaniv Kaul 2018-01-29 03:48:45 EST
This has nothing to do (as far as I understand) with oVirt and isn't a new behavior. Until we can come up with a better answer, deferring this to 4.2.2
