Bug 1481022 - When blocking connection between host and NFS storage, a running VM doesn't switch to paused mode
Summary: When blocking connection between host and NFS storage, a running VM doesn't s...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-engine
Classification: oVirt
Component: Backend.Core
Version: 4.2.0
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.2.5
Assignee: Milan Zamazal
QA Contact: Polina
URL:
Whiteboard:
Depends On: 1496517 1497197 1600599 1609561 1609701
Blocks: rhv_turn_off_autoresume_of_paused_VMs
 
Reported: 2017-08-13 14:28 UTC by Natalie Gavrielov
Modified: 2018-08-15 17:43 UTC
CC List: 12 users

Fixed In Version: ovirt-engine-4.2.4
Doc Type: Bug Fix
Doc Text:
When a storage problem occurred, the VM status could sometimes be reported very unreliably, e.g. flipping frequently between NonResponsive and Up. This has been improved, and the reported VM status during storage errors is now more stable.
Clone Of:
Environment:
Last Closed: 2018-08-03 11:24:56 UTC
oVirt Team: Virt
Embargoed:
rule-engine: ovirt-4.2+


Attachments
engine.log, vdsm.log, supervdsm.log (3.61 MB, application/zip)
2017-08-13 14:28 UTC, Natalie Gavrielov
no flags Details
engine,vdsm, libvirt logs (17.69 MB, application/zip)
2017-08-15 11:44 UTC, Natalie Gavrielov
no flags Details
engine.log (1.50 MB, application/zip)
2017-09-12 14:22 UTC, Natalie Gavrielov
no flags Details
engine.log, vdsm.log DEBUG level (18.90 MB, application/zip)
2017-10-02 12:38 UTC, Natalie Gavrielov
no flags Details
logs: vdsm, engine (4.13 MB, application/zip)
2017-11-26 15:17 UTC, Natalie Gavrielov
no flags Details
logs: vdsm, engine (2.35 MB, application/zip)
2017-12-04 22:00 UTC, Natalie Gavrielov
no flags Details
logs, DEBUG level: engine, vdsm, libvirt (5.35 MB, application/zip)
2017-12-10 15:11 UTC, Natalie Gavrielov
no flags Details
libvirt logs (13.67 MB, application/zip)
2017-12-11 13:23 UTC, Natalie Gavrielov
no flags Details
engine.log, vdsm.log, libvirtd.log, /var/log/messages, mount output and UI snapshots of nfs-paused-vm and iscsi-paused-vm (4.36 MB, application/zip)
2018-01-09 18:15 UTC, Natalie Gavrielov
no flags Details
logs_nfs_vm.tar.gz (408.15 KB, application/x-gzip)
2018-03-07 14:55 UTC, Polina
no flags Details
logs_gluster.tar.gz (651.19 KB, application/x-gzip)
2018-03-07 14:56 UTC, Polina
no flags Details
gluster_vm_continue_scenario.tar.gz (1.24 MB, application/x-gzip)
2018-03-07 15:17 UTC, Polina
no flags Details
logs_engine_vdsm.tar.gz (4.42 MB, application/x-gzip)
2018-07-15 07:50 UTC, Polina
no flags Details
scenario_nfs_vm_block_storage (1.50 MB, application/x-gzip)
2018-07-18 15:53 UTC, Polina
no flags Details
scenario_gluster_vm_block_storage (2.90 MB, application/x-gzip)
2018-07-18 15:53 UTC, Polina
no flags Details
qemu_libvirt_vdsm_engine_for_gluster_vm.tar.gz (1.22 MB, application/x-gzip)
2018-07-25 12:04 UTC, Polina
no flags Details


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 90787 0 'None' MERGED core: set vm as not-responsive only when guest cpu is running 2020-09-11 18:04:42 UTC
oVirt gerrit 91452 0 'None' MERGED core: set vm as not-responsive only when guest cpu is running 2020-09-11 18:04:42 UTC
oVirt gerrit 91981 0 'None' MERGED virt: Enable monitor response only if the VM is monitorable 2020-09-11 18:04:41 UTC
oVirt gerrit 92544 0 'None' MERGED virt: Retrieve VM memory info from system again 2020-09-11 18:04:43 UTC
oVirt gerrit 92601 0 'None' MERGED virt: Retrieve VM memory info from system again 2020-09-11 18:04:44 UTC

Description Natalie Gavrielov 2017-08-13 14:28:55 UTC
Created attachment 1312717 [details]
engine.log, vdsm.log, supervdsm.log

Description of problem:
When blocking connection between host and storage, a running VM doesn't switch to paused mode

Version-Release number of selected component:
ovirt-engine-4.2.0-0.0.master.20170802210223.git338a2bc.el7.centos.noarch
vdsm-4.20.2-41.gitd789a5e.el7.centos.x86_64

How reproducible:
100%

Steps to Reproduce:
1. Run a dd command inside a running VM.
2. While the command is running, block the connection to the storage domain.

Actual results:
The VM switches back and forth between the states: Unknown, Up, Not Responding, and Paused.

Expected results:
The VM should pause.

Additional info:
The VM is not defined as highly available (HA).

Comment 1 Yaniv Kaul 2017-08-14 05:15:57 UTC
Can you verify it's not a QEMU/libvirt bug? What is the status of the VM in libvirt?

Comment 2 Michal Skrivanek 2017-08-14 06:59:54 UTC
- back and forth between what states? all four you mentioned?
- point to a specific time frame and entity
- describe the storage setup and how you blocked the connection

Comment 3 Natalie Gavrielov 2017-08-15 11:44:25 UTC
Created attachment 1313602 [details]
engine,vdsm, libvirt logs

(In reply to Yaniv Kaul from comment #1)
> Can you verify it's not a QEMU/libvirt bug? What is the status of the VM in
> libvirt?

Below is the list of statuses the engine displayed. During that time I sampled the status of the VM in libvirt; the test began around 11:28, and only at around 12:00 did the status change from 'running' to 'paused' in libvirt (where it then remained).

(In reply to Michal Skrivanek from comment #2)
> - back and forth between what states? all four you mentioned?

It goes like this:
Not responding -> Unknown -> Up* -> Migrating -> Not responding -> Unknown -> Not responding -> Migrating -> Not responding -> Unknown -> Up* -> Unknown -> Up -> Migrating -> ... -> Paused** -> Not responding -> Paused -> Not responding -> Unknown (with a message "vm has been paused") -> Unknown (without a message at all)

* The engine shows the VM as up and running on the same host that has no connection to the storage.
** Up to this point the VM was reported as running by libvirt; from this moment on, libvirt reported the VM as paused.

> - point to a specific time frame and entity
(attached new logs)
The connection blockage starts at around 2017-08-14 11:27.
VM name: vm_nfs_1411250142

> - describe the storage setup and how you blocked the connection
Engine (not hosted), 3 hosts, 3 NFS SDs (on the same storage server), 3 iSCSI SDs (on the same storage server), and 3 GlusterFS SDs (again, all 3 on the same storage server).

The VM's disk was on NFS.

The connection was blocked using iptables (blocking outgoing connections from the host to the storage).

Comment 4 Tomas Jelinek 2017-09-04 14:43:18 UTC
(In reply to Natalie Gavrielov from comment #3)
> Created attachment 1313602 [details]
> engine,vdsm, libvirt logs
> 
> (In reply to Yaniv Kaul from comment #1)
> > Can you verify it's not a QEMU/libvirt bug? What is the status of the VM in
> > libvirt?
> 
> There is the list of statuses engine displayed below, during that time I
> sampled the status of the VM in libvirt, the test began around 11:28 and
> only at around 12 the status changed from 'running' to 'paused' in libvirt
> (and remained that way).  
> 
> (In reply to Michal Skrivanek from comment #2)
> > - back and forth between what states? all four you mentioned?
> 
> It goes like this:
> Not responding -> Unknown -> Up* -> Migrating -> Not responding -> Unknown
> -> Not responding -> Migrating -> Not responding -> Unknown -> Up* ->
> Unknown -> Up -> Migrating -> ... -> Paused** -> Not responding -> Paused ->
> Not responding -> Unknown (with a message "vm has been paused") -> Unknown
> (without a message at all)
> 
> * It shows that the vm is up and running on the same host that doesn't have
> connection to the storage.
> ** Up until this point the vm was reported as running by libvirt - since
> this moment on, libvirt reported vm is paused 
> 
> > - point to a specific time frame and entity
> (attached new logs)
> Connection blockage starts at around 2017-08-14 11:27

The engine log ends at 2017-08-14 02:55:47, during the night. Can you please provide the subsequent logs that contain the specific timeframe?

> vm name: vm_nfs_1411250142
> 
> > - describe the storage setup and how you blocked the connection
> engine (not hosted), 3 hosts, 3 nfs SD (on the same storage server), 3 iscsi
> SD (on the same storage server), and 3 glusterFS (again, all 3 on the same
> storage server).
> 
> The VM's disk was on NFS.
> 
> The connection was blocked using iptables (block outgoing connection from
> host to storage)

Comment 5 Natalie Gavrielov 2017-09-12 14:22:24 UTC
Created attachment 1324928 [details]
engine.log

Comment 6 Michal Skrivanek 2017-09-18 10:44:14 UTC
There are several issues, but it seems the core problem lies in infra.
The sequence goes like this:
The engine detects:
2017-08-14 11:27:57,854+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (DefaultQuartzScheduler6) [536b9c5e] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10
,802), VDSM host_mixed_1 command GetStatsVDS failed: Heartbeat exceeded

The corresponding side on vdsm says:
2017-08-14 11:28:26,929+0300 WARN  (JsonRpc (StompReactor)) [vds.dispatcher] unhandled write event (betterAsyncore:160)
2017-08-14 11:28:26,934+0300 WARN  (JsonRpc (StompReactor)) [vds.dispatcher] unhandled close event (betterAsyncore:160)

This causes the VM state to become NotResponding. Before the host starts VdsNotRespondingTreatment, the storage is detected as non-operational, so the host status is set to NonOperational and the VMs to Unknown.
Soft fencing restarts vdsm, which recovers the VMs, starts responding (after reporting "Recovering from crash or initializing"), and reports an accurate VM status (Up).
Immediately afterwards the engine attempts to reconnect the storage, which fails with the same heartbeat exception. The host is set to NonOperational. That now triggers a migration to another host (since the VM was Up), but that's not really significant. Eventually, after the vdsm timeout on the storage monitor, the VM is again reported as NotResponding, while the heartbeat-exceeded network exception triggers another round of soft fencing.

Ultimately the core issue is the network exception, which IMO should not happen.

Piotr, please take a look.

Comment 7 Piotr Kliczewski 2017-09-18 11:47:27 UTC
Checking the logs, I see the following sequence:

1. Vdsm responds to VM.getStats - 2017-08-14 11:27:53,254+0300 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call VM.getStats succeeded in 0.00 seconds (__init__:604)

2. Possibly MOM connecting to vdsm: 2017-08-14 11:27:53,522+0300 INFO  (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::1:51780 (protocoldetector:74)

3. The last RPC response, possibly to MOM: 2017-08-14 11:27:55,011+0300 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getAllVmStats succeeded in 0.00 seconds (__init__:604)

4. The engine times out waiting for the VM.getStats response: 2017-08-14 11:27:57,855+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetStatsVDSCommand] (DefaultQuartzScheduler6) [536b9c5e] Command 'GetStatsVDSCommand(HostName = host_mixed_1, VdsIdAndVdsVDSCommandParametersBase:{hostId='1546ec3c-2eb4-4ff0-a981-1c7e74cf3c37', vds='Host[host_mixed_1,1546ec3c-2eb4-4ff0-a981-1c7e74cf3c37]'})' execution failed: VDSGenericException: VDSNetworkException: Heartbeat exceeded

5. The engine closes the connection, which ends up logged on the vdsm side: 2017-08-14 11:28:26,929+0300 WARN  (JsonRpc (StompReactor)) [vds.dispatcher] unhandled write event (betterAsyncore:160)
2017-08-14 11:28:26,934+0300 WARN  (JsonRpc (StompReactor)) [vds.dispatcher] unhandled close event (betterAsyncore:160)

My question is whether we see any spikes in CPU or network utilization. I am looking for a reason why the VM.getStats response did not arrive at the engine on time even though it was sent from vdsm.

I wonder whether we are facing network congestion or a compute issue caused by vdsm. Is vdsm running in a VM or on bare metal?

Comment 8 Michal Skrivanek 2017-09-18 13:35:39 UTC
But isn't that exception about the heartbeat being received too early? If getStats is the call that failed - since it's the last engine-vdsm communication - then it's just 11:27:53 on the vdsm side and the engine network exception is at 11:27:57.
And IIUC the heartbeat timeout is 30s now?

Comment 9 Piotr Kliczewski 2017-09-18 13:56:52 UTC
We do not see when the command/verb was called; we do not log it. The interval was tested before and we did not see any issues with it. There have been no changes in this code for a long time, so at the moment I assume that vdsm is to blame here.

Comment 10 Piotr Kliczewski 2017-09-18 13:57:41 UTC
It would help to get the engine logs at debug level so we could better understand the timing from the engine's perspective.

Comment 11 Michal Skrivanek 2017-09-18 14:00:06 UTC
I'm not saying it's small. But doesn't it mean it's for a call from 30 seconds before that? The log doesn't say when it starts, but it does log the duration when it finishes (0s here), so is it wrong to assume it was called at that time?

Comment 12 Piotr Kliczewski 2017-09-18 14:06:20 UTC
The log does not say how long it takes for a frame to arrive and come back. We only know how long it took to execute on the vdsm side. We know that the message was processed at 2017-08-14 11:27:53,254+0300 and it took 0s, but we do not know how long it took to get there and how long it took for the engine to receive the response. If the network was congested or the CPU busy enough, the steps that we do not know about could potentially take longer than 30 seconds. That is why we need engine logs at debug level to know the specific numbers.

Comment 13 Tomas Jelinek 2017-09-20 07:17:51 UTC
Could you please provide the debug logs?

Comment 14 Natalie Gavrielov 2017-09-28 15:22:18 UTC
(In reply to Piotr Kliczewski from comment #7)

> I wonder whether we face network congestion or compute issue caused by vdsm.
> Is vdsm running in a vm or bare metal?

VM.

(In reply to Tomas Jelinek from comment #13)
> Could you please provide the debug logs?

I tried running the scenario described again so I can provide debug logs, with build:
ovirt-engine-4.2.0-0.0.master.20170904111624.git842d79b.el7.centos.noarch
Now the bug won't reproduce; it works as expected - the VM switches from running to paused when the connection between the host and the storage domain is blocked.
I guess these logs will be useless. I'll try to find an environment with the same build to see if it will reproduce.

Comment 15 Natalie Gavrielov 2017-10-02 12:38:34 UTC
Created attachment 1333227 [details]
engine.log, vdsm.log DEBUG level

(In reply to Natalie Gavrielov from comment #14)
> (In reply to Tomas Jelinek from comment #13)
> > Could you please provide the debug logs?
> 
> I tried running the scenario described again so I can provide debug logs,
> with build:
> ovirt-engine-4.2.0-0.0.master.20170904111624.git842d79b.el7.centos.noarch
> Now the bug won't recreate, it works as expected - VM switches state from
> running to paused when blocking the connection between the host and storage
> domain..
> guess these logs will be useless.. I'll try to find an environment with the
> same build to see if it will reproduce.

Apparently, when I tried reproducing this I used iSCSI instead of NFS.
Now I've attached debug logs for engine and vdsm.
Builds used:
ovirt-engine-4.2.0-0.0.master.20170929123516.git007c392.el7.centos.noarch
vdsm-4.20.3-121.git77235c7.el7.centos.x86_64

It's an environment with 3 hosts.

The blocking of the connection to the storage domain started at around 2017-10-02 15:00,
and during the state switching described in comment 3 there were a few attempts to migrate the VM; here is one of them:
 2017-10-02 15:03:14,871+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (org.ovirt.thread.EE-ManagedThreadFactory-engine-Thread-90) [49fc1b2e] Failed to migrate VM 'vm_TestCaseBlockedConnecti_0214574984'

Comment 16 Piotr Kliczewski 2017-10-02 13:31:07 UTC
Based on the logs I see a heartbeat issue:

2017-10-02 15:03:14,742+03 DEBUG [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedThreadFactory-engineScheduled-Thread-63) [] Exception: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSNetworkException: VDSGenericException: VDSNetworkException: Heartbeat exceeded

which was caused by a vdsm crash:

2017-10-02 15:03:13,673+0300 DEBUG (periodic/1) [storage.TaskManager.Task] (Task='e86cb403-e3a0-4017-ac8a-9392e906412b') ref 0 aborting False (task:999)
2017-10-02 15:03:46,874+0300 INFO  (MainThread) [vds] (PID: 28763) I am the actual vdsm 4.20.3-121.git77235c7.el7.centos storage-jenkins-ge4-vdsm1.scl.lab.tlv.redhat.com (3.10.0-693.el7.x86_64) (vdsmd:148)

BTW, I see many blocked tasks, which could be caused by the famous libvirt lock.

Comment 17 Martin Perina 2017-10-06 13:04:47 UTC
Based on the above, adding a dependency on BZ1496517 and targeting 4.2.

Comment 18 Martin Perina 2017-11-21 19:22:54 UTC
The underlying BZ1497197 was verified and released in 7.4.z within the October 19th batch, so moving to ON_QA.

Comment 19 Natalie Gavrielov 2017-11-26 15:17:22 UTC
Created attachment 1359161 [details]
logs: vdsm, engine

I performed the scenario described in comment 0. Now the VM switches to "Not Responding" (showing a question-mark icon); it does not switch to "Paused" at any point, as it should.

Builds:
ovirt-engine-4.2.0-0.0.master.20171124141652.gita5b4f9b.el7.centos.noarch
vdsm-4.20.8-29.git6ceb389.el7.centos.x86_64

Comment 20 Yaniv Kaul 2017-12-03 16:07:19 UTC
(In reply to Natalie Gavrielov from comment #19)
> Created attachment 1359161 [details]
> logs: vdsm, engine
> 
> Performed the scenario described in comment 0, now the VM switches state to
> "Not Responding" (showing question mark icon), it does not switch state to
> "VM paused" at any point as should be.
> 
> Builds:
> ovirt-engine-4.2.0-0.0.master.20171124141652.gita5b4f9b.el7.centos.noarch
> vdsm-4.20.8-29.git6ceb389.el7.centos.x86_64

Are you sure it's not a libvirt/QEMU bug?

Comment 21 Yaniv Kaul 2017-12-03 16:10:28 UTC
(In reply to Natalie Gavrielov from comment #14)
> (In reply to Piotr Kliczewski from comment #7)
> 
> > I wonder whether we face network congestion or compute issue caused by vdsm.
> > Is vdsm running in a vm or bare metal?
> 
> VM.

Can you try to reproduce on bare-metal? Is the nested VM perhaps choked on resources?

> 
> (In reply to Tomas Jelinek from comment #13)
> > Could you please provide the debug logs?
> 
> I tried running the scenario described again so I can provide debug logs,
> with build:
> ovirt-engine-4.2.0-0.0.master.20170904111624.git842d79b.el7.centos.noarch

This is quite an ancient build?

> Now the bug won't recreate, it works as expected - VM switches state from
> running to paused when blocking the connection between the host and storage
> domain..
> guess these logs will be useless.. I'll try to find an environment with the
> same build to see if it will reproduce.

Comment 22 Martin Perina 2017-12-04 08:45:44 UTC
(In reply to Natalie Gavrielov from comment #19)
> Created attachment 1359161 [details]
> logs: vdsm, engine
> 
> Performed the scenario described in comment 0, now the VM switches state to
> "Not Responding" (showing question mark icon), it does not switch state to
> "VM paused" at any point as should be.
> 
> Builds:
> ovirt-engine-4.2.0-0.0.master.20171124141652.gita5b4f9b.el7.centos.noarch
> vdsm-4.20.8-29.git6ceb389.el7.centos.x86_64

Are you sure that you are using libvirt-python-3.2.0-3.el7_4.1 which should contain the fix for BZ1481022 in RHEL 7.4.z?

Comment 23 Natalie Gavrielov 2017-12-04 09:42:47 UTC
(In reply to Yaniv Kaul from comment #21)
> (In reply to Natalie Gavrielov from comment #14)
> > (In reply to Piotr Kliczewski from comment #7)
> Can you try to reproduce on bare-metal? Is the nested VM perhaps choked on
> resources?
I'm not sure resources have anything to do with it; when I run the same test with iSCSI storage it works fine (same environment).
Looking for a bare-metal env in the meantime.
 
> > (In reply to Tomas Jelinek from comment #13)
> > > Could you please provide the debug logs?
> > 
> > I tried running the scenario described again so I can provide debug logs,
> > with build:
> > ovirt-engine-4.2.0-0.0.master.20170904111624.git842d79b.el7.centos.noarch
> 
> This is quite an ancient build?
I have a nested env with rhvm-4.2.0-0.5.master.el7.noarch; it behaves the same.

(In reply to Martin Perina from comment #22)
> (In reply to Natalie Gavrielov from comment #19)
> > Created attachment 1359161 [details]
> > logs: vdsm, engine
> > 
> > Performed the scenario described in comment 0, now the VM switches state to
> > "Not Responding" (showing question mark icon), it does not switch state to
> > "VM paused" at any point as should be.
> > 
> > Builds:
> > ovirt-engine-4.2.0-0.0.master.20171124141652.gita5b4f9b.el7.centos.noarch
> > vdsm-4.20.8-29.git6ceb389.el7.centos.x86_64
> 
> Are you sure that you are using libvirt-python-3.2.0-3.el7_4.1 which should
> contain the fix for BZ1481022 in RHEL 7.4.z?
On the nested env I'm working on, with builds:
rhvm-4.2.0-0.5.master.el7.noarch 
vdsm-4.20.8-1.el7ev.x86_64
libvirt-python-3.2.0-3.el7_4.1.x86_64
It still happens.

Comment 24 Piotr Kliczewski 2017-12-04 11:21:46 UTC
(In reply to Natalie Gavrielov from comment #19)
> Created attachment 1359161 [details]
> logs: vdsm, engine
> 
> Performed the scenario described in comment 0, now the VM switches state to
> "Not Responding" (showing question mark icon), it does not switch state to
> "VM paused" at any point as should be.

This statement seems to be incorrect:

2017-11-26 15:25:24,749+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-3) [] VM 'aa091767-2da3-4c7d-b45c-26dd2fff3f48'(vm_TestCaseBlockedConnecti_2614572831) moved from 'Up' --> 'Paused'

Please specify exactly which VM(s) we are talking about and at which point, because I see a couple of state switches to 'Paused'.

Comment 25 Natalie Gavrielov 2017-12-04 22:00:59 UTC
Created attachment 1362839 [details]
logs: vdsm, engine

Bare-metal env, engine with 2 hosts.
from engine.log:
2017-12-04 22:53:49,281+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-73) [] VM '457c50e6-c794-4d50-bf42-a443f926f8b9'(test-vm1) moved from 'Up' --> 'NotResponding'

During the entire test, the VM was in the NotResponding state.
I removed the rule from iptables at 04/12/17 23:26:14.

Then this appeared in engine.log:
2017-12-04 23:26:22,448+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-1) [] VM '457c50e6-c794-4d50-bf42-a443f926f8b9'(test-vm1) moved from 'NotResponding' --> 'Paused'
2017-12-04 23:26:32,294+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-2) [] VM '457c50e6-c794-4d50-bf42-a443f926f8b9'(test-vm1) moved from 'Paused' --> 'Up'

It seems that the VM becomes paused only when the connection is restored.

Builds:
ovirt-engine-4.2.0-0.0.master.20171124141652.gita5b4f9b.el7.centos.noarch
vdsm-4.20.7-1.el7.centos.x86_64
libvirt-python-3.2.0-3.el7_4.1.x86_64

Comment 26 Martin Perina 2017-12-05 09:21:05 UTC
I've looked at the latest logs, and from the infra point of view the host statuses are fine:

1. At 2017-12-04 22:44:05,678+02 VM 'test-vm1' was started on host 'h1'

2. At 2017-12-04 22:53:49,281+02 VM 'test-vm1' was moved to NotResponding

3. At 2017-12-04 22:59:32,085+02 host 'h1' was moved to NonOperational

4. At 2017-12-04 23:00:00,301+02 engine failed to autorecover host 'h1'

5. At 2017-12-04 23:03:38,881+02 data center 'DC1' move to NonOperational as host 'h1' is an SPM

6. Several attempts to recover host 'h1', all unsuccessful

7. At 2017-12-04 23:26:22,448+02 VM 'test-vm' moved from NotResponding to Paused

8. At 2017-12-04 23:26:32,294+02 VM 'test-vm' moved from Paused to Up

9. At 2017-12-04 23:26:35,565+02 host 'h1' was moved back to Up


So from the infra point of view, host status changes are working fine. Tomas, could someone from the virt team take a look at whether the VM statuses also changed as expected?

Comment 27 Piotr Kliczewski 2017-12-05 09:30:47 UTC
Based on the recent logs, I see that VM monitoring was not run between 2017-12-04 22:53:49,281+02 and 2017-12-04 23:26:22,448+02. The question is why the VMs were not updated properly.

Comment 28 Polina 2017-12-05 13:49:40 UTC
Similar behaviour has also been tested with Gluster storage, with a problematic though slightly different result: a cycling of VM statuses that finishes with a pause "due to unknown storage error" instead of the expected "I/O error".

Here are the engine reports for this scenario:
      08:50:13 (vm_gluster) moved from 'Up' --> 'NotResponding'

      08:50:42  vm_gluster moved from 'NotResponding' --> 'Up'

      08:50:43 (vm_gluster) moved from 'Up' --> 'NotResponding'

      08:50:48 (vm_gluster) moved from 'NotResponding' --> 'Paused'

                 VM vm_gluster has been paused.
                 VM vm_gluster has been paused due to unknown storage error
      08:50:59  (vm_gluster) moved from 'Paused' --> 'NotResponding'

      08:51:13  (vm_gluster) moved from 'NotResponding' --> 'Paused'
                 VM vm_gluster has been paused.
                 VM vm_gluster has been paused due to unknown storage error.

This bug prevents testing of the VM Resume Behaviour features with NFS and Gluster. Only iSCSI could be tested during this phase.

Comment 29 Michal Skrivanek 2017-12-06 10:14:38 UTC
There are no obvious errors anywhere (other than what's expected).
It does look like the VM stats call is still getting stuck; it looks like the call never returns. Since the heartbeat now works, the engine side supposedly just hangs and no further monitoring cycles are scheduled - that would be quite bad.

Unfortunately these frequent API calls are not logged by default.
Please enable the debug log level on vdsm and the engine, and also provide libvirt.log.

Comment 30 Piotr Kliczewski 2017-12-06 14:33:00 UTC
(In reply to Michal Skrivanek from comment #29)
> There are no obvious errors anywhere (other than what's expected)
> It does look like the VM stats call is still getting stuck. It looks like
> the call never returns. Since the heartbeat now works the engine side
> supposedly just hangs and no further monitoring cycles are scheduled - that
> would be quite bad.

Don't forget that we have a higher timeout which is raised when a response does not arrive on time. It has been in place since XML-RPC, where we had a new connection for every call. Now we maintain the connection and use heartbeats to detect any issues.

> 
> Unfortunately these frequent API calls are not being logged by default
> Please enable debug log level on vdsm and engine, and provide also
> libvirt.log

Comment 31 Polina 2017-12-07 11:01:14 UTC
Could you please explain how to set the DEBUG log level for engine.log and vdsm.log?

Comment 32 Natalie Gavrielov 2017-12-10 15:11:09 UTC
Created attachment 1365632 [details]
logs, DEBUG level: engine, vdsm, libvirt

(In reply to Michal Skrivanek from comment #29)
> There are no obvious errors anywhere (other than what's expected)
> It does look like the VM stats call is still getting stuck. It looks like
> the call never returns. Since the heartbeat now works the engine side
> supposedly just hangs and no further monitoring cycles are scheduled - that
> would be quite bad.
> 
> Unfortunately these frequent API calls are not being logged by default
> Please enable debug log level on vdsm and engine, and provide also
> libvirt.log

Builds used for testing:
ovirt-engine-4.2.0-0.5.master.el7.noarch
vdsm-4.20.8-1.el7ev.x86_64

The connection was blocked at around 16:40.
The connection was restored at around 16:55.
VM name: vm_TestCaseBlockedConnecti_1016333518

Comment 33 Michal Skrivanek 2017-12-11 09:18:22 UTC
(In reply to Natalie Gavrielov from comment #32)
The libvirt log seems to be from the wrong time - please check the timezone.

Other than that, I can see the host responds in a correct way, with monitorResponse -1 signifying that libvirt periodic monitoring is not responding.
We need libvirt logs from the right period - but since it's NFS it could be just the timeout - check the NFS mount parameters as well.

Comment 34 Natalie Gavrielov 2017-12-11 13:23:08 UTC
Created attachment 1365999 [details]
libvirt logs

(In reply to Michal Skrivanek from comment #33)
> (In reply to Natalie Gavrielov from comment #32)
> libvirt log seems to be from a wrong time - please check timezone
> 
> other than that - I can see the host responds in a correct way, with
> monitorResponse -1 signifying that libvirt periodic monitoring is not
> responding. 
> Need libvirt logs from the right period - but since it's NFS it could be
> just the timeout - check NFS mount parameters as well.

Attached libvirt logs (for all 3 hosts) with the relevant time frame.
Can you please specify exactly which NFS mount parameters to check?

Comment 35 Michal Skrivanek 2017-12-11 16:39:28 UTC
(In reply to Natalie Gavrielov from comment #34)

> Attached libvirt logs (for all 3 hosts) with the relevant time frame.
> Can you please specify exactly which parameters for NFS mount to check?

I see it's stuck on the libvirt side until the last call at 2017-12-10 14:56:38.707+0000: 28509: error : qemuDomainObjBeginJobInternal:3839 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)

In the vdsm log it starts reacting again at 2017-12-10 16:56:49,008+0200 and is reported as Up by vdsm (via a VM status event), but the VM is shut down shortly after, at 2017-12-10 16:56:59,502+0200 (vdsm API request). You did that manually from the engine, as logged in engine.log:
2017-12-10 16:56:59,463+02 INFO  [org.ovirt.engine.core.bll.StopVmCommand] (default task-2) [vms_syncAction_671d3681-85d0-4ec9] Running command: StopVmCommand(StopVmType = NORMAL, StopReason = null, VmId = 2cd65ab8-6294-493e-a1fb-580c4475f041, QuotaId = null) internal: false. Entities affected :  ID: 2cd65ab8-6294-493e-a1fb-580c4475f041 Type: VMAction group STOP_VM with role type USER
There is one more report right after the Up event is sent - at 2017-12-10 16:56:50,393+0200 vdsm sends monitorResponse -1 again as a response to Host.getAllVmStats. It may be due to the last monitoring cycle still not getting any data (the Up event is sent just 0.8s earlier). This shouldn't cause much of an issue other than a one-time blip in the VM status.

On the engine side I see the VM was not responding during the outage - which seems to be correct - then reported as Up (2017-12-10 16:56:49,021+02), then again NotResponding (2017-12-10 16:56:50,400+02), and then, after shutdown, Down (2017-12-10 16:57:02,784+02).

I do not see any problem with VM status handling.

Comment 36 Michal Skrivanek 2017-12-11 16:43:00 UTC
As for NFS, you should check the timeout values before the kernel actually reports an I/O error. The default is 6 minutes IIRC, but it really depends on your client and server parameters - if you believe the underlying storage should have returned an I/O error during your test period, this should be looked at by the libvirt team. In that case please supply the parameters and the journal log.
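
For orientation, here is a rough sketch of where a "6 minutes" figure like that can come from. It assumes soft NFS mounts with timeo=600 (deciseconds) and retrans=6; those values are an assumption, not something confirmed in this bug, and the real behaviour also depends on the client's retry backoff, so treat the result only as an order-of-magnitude estimate.

    # Hypothetical helper, not part of vdsm: estimates roughly how long a soft
    # NFS mount waits before the kernel returns EIO to the caller.
    # Assumed values: timeo=600 deciseconds (60s per attempt), retrans=6 retries;
    # retry backoff is ignored, so this is only an approximation.
    def approx_eio_timeout_seconds(timeo_deciseconds=600, retrans=6):
        # timeo is expressed in tenths of a second; each of the retrans
        # retries waits roughly one timeo period.
        return (timeo_deciseconds / 10.0) * retrans

    print(approx_eio_timeout_seconds())        # ~360s, i.e. the "6 minutes" mentioned above
    print(approx_eio_timeout_seconds(300, 6))  # ~180s, below the "<5 mins" suggested later in comment 59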

Comment 37 Michal Skrivanek 2017-12-12 13:15:00 UTC
When you access a file on that NFS _on the host_, you should get EIO after that timeout. That would verify your NFS is set up correctly, and then the VM should behave the same - only when it differs in behavior is it a valid (likely libvirt) bug.

Comment 38 Yaniv Kaul 2018-01-02 13:36:22 UTC
Natalie, any updates or can we close this bug?

Comment 39 Raz Tamir 2018-01-08 12:21:30 UTC
(In reply to Michal Skrivanek from comment #37)
> when you access a file on that nfs _on host_, you should get EIO after that
> timeout. That would verify your NFS is set up correctly and then VM should
> behave the same - only hen it differs in behavior it's a valid (likely
> libvirt) bug

Hi Michal,

There is no access to the storage from the host, so I don't see the point of checking whether the NFS is set up correctly if we have no communication with it - it is as if it doesn't exist after blocking the connection.

The behavior should be exactly like running the VM with a disk on a block-type storage domain - the VM should enter the PAUSED state.

Comment 40 Michal Skrivanek 2018-01-08 17:36:37 UTC
Raz, what do you mean by no access to the storage from the host? That's what Natalie is trying to test, IIUC, and for that I asked to confirm the behavior when similar access is done directly by the host without QEMU involvement.

Or you mean there is just no storage access by the guest? Then indeed the VM doesn't pause, but then this whole bug is just a mistake - obviously the VM doesn't pause when there is no storage access whatsoever.

"The behavior should be exactly like running the VM with disk on block type storage domain - the VM should enter PAUSED state" - not exactly. The VM behavior should match the host behavior...so the VM Pauses iff EIO is returned for backing storage access by the host. With default NFS settings we use that's very different from e.g. FC

Comment 41 Raz Tamir 2018-01-09 08:23:08 UTC
Thanks for clarifying Michal,

Is there any way to overcome the differences between the storage types?

To summarize this bug, we are looking for consistency no matter what the storage type is - the VM should not care where its disk resides (i.e. NFS, iSCSI).

One thing I do not really understand is that you mentioned that the EIO is returned from the storage server.
If the connection between the host and the storage is blocked, how can the host receive any return code from it?

I hope the request/expected result is clear from our POV

Comment 42 Michal Skrivanek 2018-01-09 08:29:53 UTC
(In reply to Raz Tamir from comment #41)
> Thanks for clarifying Michal,
> 
> Is there any way to overcome the differences between the storage types?
> 
> To summarize this bug, we are looking for consistency no matter what the
> storage type is - the VM should not care where it's disk resides on (i.e.
> NFS, iSCSI).

The VM indeed does not care, but the timing of EIO would be different for different storage backends in any case due to transport limitations/configuration

> One thing I do not really understand is that you mentioned that the EIO is
> returned from the storage server.
> If the connection between the host and the storage is blocked, how can the
> host, receive any RC from it?

Sorry, maybe that wasn't clear in my text. I meant the transport returns EIO, and that does depend on the specifics of each transport. So e.g. for FC, where the stack basically immediately detects that the path is wrong, the EIO is immediate too. Whereas on IP-based transports, when there is an issue somewhere along the way and packets are silently dropped... it takes time. For NFS that time is defined by the client.

Comment 43 Natalie Gavrielov 2018-01-09 18:15:23 UTC
Created attachment 1379222 [details]
engine.log, vdsm.log, libvirtd.log, /var/log/messages, mount output and a UI snapshots of nfs-paused-vm and iscsi-paused-vm

I tested the following:
1. Write from the host to the NFS storage (simple dd).
2. Write from a VM that's running on the host to a disk that is on NFS storage.
3. While the file writing takes place, block the connection to the storage (using iptables).

Result:
The command line just freezes for a very long time (a lot longer than the default 10 minutes we know of).
/var/log/messages shows the following:
Jan  9 11:39:07 storage-jenkins-ge4-vdsm3 kernel: nfs: server yellow-vdsb.qa.lab.tlv.redhat.com not responding, timed out
Jan  9 11:33:20 storage-jenkins-ge4-vdsm3 vdsm[594]: WARN Worker blocked: <Worker name=periodic/2 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x2ec7d50> at 0x2ec7d90> timeout=15, duration=15 at 0x46e0450> task#=15626 at 0x2fa63d0>, traceback:#012File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap#012  self.__bootstrap_inner()#012File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner#012  self.run()#012File: "/usr/lib64/python2.7/threading.py", line 765, in run#012  self.__target(*self.__args, **self.__kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run#012  ret = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run#012  self._execute_task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012  task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012  self._callable()#012File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 213, in __call__#012  self._func()#012File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 578, in __call__#012  stats = hostapi.get_stats(self._cif, self._samples.stats())#012File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 62, in get_stats#012  avail, commit = _memUsageInfo(cif)#012File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 121, in _memUsageInfo#012  mem_info = v.memory_info()#012File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 802, in memory_info#012  dom_stats = self._dom.memoryStats()#012File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f#012  ret = attr(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper#012  ret = f(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper#012  return func(inst, *args, **kwargs)#012File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 1449, in memoryStats#012  ret = libvirtmod.virDomainMemoryStats(self._o)
Jan  9 11:33:35 storage-jenkins-ge4-vdsm3 vdsm[594]: WARN Worker blocked: <Worker name=periodic/2 running <Task <Operation action=<vdsm.virt.sampling.HostMonitor object at 0x2ec7d50> at 0x2ec7d90> timeout=15, duration=30 at 0x46e0450> task#=15626 at 0x2fa63d0>, traceback:#012File: "/usr/lib64/python2.7/threading.py", line 785, in __bootstrap#012  self.__bootstrap_inner()#012File: "/usr/lib64/python2.7/threading.py", line 812, in __bootstrap_inner#012  self.run()#012File: "/usr/lib64/python2.7/threading.py", line 765, in run#012  self.__target(*self.__args, **self.__kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/concurrent.py", line 194, in run#012  ret = func(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 301, in _run#012  self._execute_task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 315, in _execute_task#012  task()#012File: "/usr/lib/python2.7/site-packages/vdsm/executor.py", line 391, in __call__#012  self._callable()#012File: "/usr/lib/python2.7/site-packages/vdsm/virt/periodic.py", line 213, in __call__#012  self._func()#012File: "/usr/lib/python2.7/site-packages/vdsm/virt/sampling.py", line 578, in __call__#012  stats = hostapi.get_stats(self._cif, self._samples.stats())#012File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 62, in get_stats#012  avail, commit = _memUsageInfo(cif)#012File: "/usr/lib/python2.7/site-packages/vdsm/host/api.py", line 121, in _memUsageInfo#012  mem_info = v.memory_info()#012File: "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 802, in memory_info#012  dom_stats = self._dom.memoryStats()#012File: "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f#012  ret = attr(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper#012  ret = f(*args, **kwargs)#012File: "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper#012  return func(inst, *args, **kwargs)#012File: "/usr/lib64/python2.7/site-packages/libvirt.py", line 1449, in memoryStats#012  ret = libvirtmod.virDomainMemoryStats(self._o)

------------------------------------------------------------------------------------

Later on, I restarted libvirtd (something that has to do with libvirt logging, not this issue)
and retested the original scenario described in comment 0; the connection blocking started at ~17:30.
Result:
libvirtd.log:
2018-01-09 15:30:38.414+0000: 9083: error : qemuDomainObjBeginJobInternal:3839 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)

2018-01-09 15:52:26.276+0000: 9083: error : qemuDomainObjBeginJobInternal:3839 : Timed out during operation: cannot acquire state change lock (held by remoteDispatchConnectGetAllDomainStats)
{"timestamp": {"seconds": 1515513153, "microseconds": 425983}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "nospace": false, "__com.redhat_reason": "eio", "node-name": "#block140", "reason": "Input/output error", "operation": "read", "action": "stop"}}
2018-01-09 15:52:33.433+0000: 9076: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_EVENT: mon=0x7f9a20023a60 event={"timestamp": {"seconds": 1515513153, "microseconds": 425983}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "nospace": false, "__com.redhat_reason": "eio", "node-name": "#block140", "reason": "Input/output error", "operation": "read", "action": "stop"}}



vdsm.log.1:
2018-01-09 17:30:59,314+0200 ERROR (monitor/b7f7132) [storage.Monitor] Error checking domain b7f7132e-1f8a-48bf-ad00-2905c71f18d3 (monitor:424)
2018-01-09 17:30:59,329+0200 ERROR (monitor/1dd8b62) [storage.Monitor] Error checking domain 1dd8b623-b91f-4a40-bdb6-631ce1715663 (monitor:424)
2018-01-09 17:30:59,332+0200 ERROR (monitor/4e28206) [storage.Monitor] Error checking domain 4e282067-fcf7-421c-b28c-ed36a8eee0c8 (monitor:424)
2018-01-09 17:30:59,334+0200 ERROR (monitor/b25ad09) [storage.Monitor] Error checking domain b25ad098-d8c9-46ed-90fe-367428c5a44d (monitor:424)

2018-01-09 17:40:01,101+0200 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge4__nfs__1/4e282067-fcf7-421c-b28c-ed36a8eee0c8/dom_md/metadata (monitor:498)
2018-01-09 17:40:01,114+0200 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge4__nfs__0/1dd8b623-b91f-4a40-bdb6-631ce1715663/dom_md/metadata (monitor:498)
2018-01-09 17:40:01,372+0200 ERROR (check/loop) [storage.Monitor] Error checking path /rhev/data-center/mnt/yellow-vdsb.qa.lab.tlv.redhat.com:_Storage__NFS_storage__jenkins__ge4__nfs__2/b25ad098-d8c9-46ed-90fe-367428c5a44d/dom_md/metadata (monitor:498)

2018-01-09 17:53:03,493+0200 ERROR (libvirt/events) [vds] Error running VM callback (clientIF:654)


engine.log:
2018-01-09 17:30:48,701+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-24) [] VM 'b4fcb39a-d1a3-449a-aba3-518c072168e5'(vm_TestCaseBlockedConnecti_0917235822) moved from 'Up' --> 'NotResponding'
2018-01-09 17:52:33,488+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-8) [] VM 'b4fcb39a-d1a3-449a-aba3-518c072168e5'(vm_TestCaseBlockedConnecti_0917235822) moved from 'NotResponding' --> 'Paused'
2018-01-09 17:52:33,550+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-8) [] EVENT_ID: VM_PAUSED_EIO(145), VM vm_TestCaseBlockedConnecti_0917235822 has been paused due to storage I/O problem.
2018-01-09 17:52:39,242+02 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-35) [] VM 'b4fcb39a-d1a3-449a-aba3-518c072168e5'(vm_TestCaseBlockedConnecti_0917235822) moved from 'Paused' --> 'NotResponding'          <---- why does it go back to not responding?

webadmin Compute -> Virtual Machines
The little icon on the left shows a question mark.
When hovering over this icon, the following message appears: 
"Not Responding
VM has been paused due to a storage I/O error"

VM status on the right column shows "Not Responding"

When performing the same test with iSCSI storage, the icon is not a question mark but a paused icon; when hovering over it, the following message appears:
"Paused
VM has been paused due to a storage I/O error"
and the status column on the right shows "Paused"
snapshots attached

Attached:
engine.log
vdsm.log
libvirtd.log
/var/log/messages
mount output
2 snapshots of the paused VMs (iscsi and nfs)

Comment 44 Ademar Reis 2018-01-10 21:46:27 UTC
This is similar to Bug 665820. To understand what's happening at the QEMU level, please see https://bugzilla.redhat.com/show_bug.cgi?id=665820#c11 (and other comments there).

TL;DR: we can't know we got stuck until some time after getting stuck (the comments in the BZ above explain why). And we can't think of a way to implement a "detect system call got stuck on NFS" feature in QEMU that doesn't require major architectural surgery, with a considerable risk of introducing nasty bugs.

Comment 45 Yaniv Kaul 2018-01-29 08:48:45 UTC
This has nothing to do (as far as I understand) with oVirt and isn't a new behavior. Until we can come up with a better answer, deferring this to 4.2.2

Comment 46 Michal Skrivanek 2018-03-01 13:38:04 UTC
It would be best to rerun those tests on 7.5 and note down the differences in behavior on each underlying storage. I do not see anything else we could do in the 4.2 timeframe.

Comment 47 Polina 2018-03-07 14:55:04 UTC
The tests were repeated with rhv-release-4.2.2-4-001.noarch and RHEL 7.5.

NFS (the same behavior as in previous versions):

   1. Run a VM with an NFS disk.
   2. Block the storage: <iptables -I INPUT -s yellow-vdsb.qa.lab.tlv.redhat.com -j DROP>
   Result:
          The VM is in the Unknown state.
          Attached logs_nfs_vm.tar.gz.

Gluster (the behavior changed; we do see a cycling of states, but then it stabilizes):

   1. Run a VM with a disk on Gluster.
   2. Block the storage:
      iptables -I INPUT -s 10.35.83.240 -j REJECT
      iptables -I INPUT -s 10.35.83.241 -j REJECT
      iptables -I INPUT -s 10.35.83.242 -j REJECT

   result:
          has been paused due to unknown storage error.
          VM golden_env_mixed_virtio_3_0 is not responding.
          has been paused due to storage I/O problem.
          attached logs_gluster.tar.gz

Comment 48 Polina 2018-03-07 14:55:44 UTC
Created attachment 1405396 [details]
logs_nfs_vm.tar.gz

Comment 49 Polina 2018-03-07 14:56:21 UTC
Created attachment 1405397 [details]
logs_gluster.tar.gz

Comment 50 Polina 2018-03-07 15:17:13 UTC
With Gluster the behavior is inconsistent. I attach engine and vdsm logs in gluster_vm_continue_scenario.tar.gz.

I ran a Gluster VM (not HA) with the KILL resume behavior. After blocking the storage, the VM "has been paused due to unknown storage error" (no cycling).
Then, before the storage was unblocked, the VM golden_env_mixed_virtio_3_0 went down with an error. Exit message: Down because paused for too long.

Strange, inconsistent behavior.

Comment 51 Polina 2018-03-07 15:17:47 UTC
Created attachment 1405401 [details]
gluster_vm_continue_scenario.tar.gz

Comment 52 Polina 2018-03-07 18:46:56 UTC
Just to note that for iSCSI and FCP the behavior is correct - the VM is paused with an I/O error while the storage is blocked.

The iSCSI storage blocking is done with iptables. The FCP storage can be blocked with the command <echo "offline" > /sys/block/sdd/device/state> for the connected LUN ID.

Comment 53 Yaniv Kaul 2018-03-08 13:26:41 UTC
Martin, can we move it to qemu-kvm?

Comment 54 Michal Skrivanek 2018-03-08 14:45:10 UTC
I guess so, but I wanted to document some basic behavioral patterns of the whole system for different storage backends

Comment 55 Martin Tessun 2018-03-15 09:18:45 UTC
(In reply to Yaniv Kaul from comment #53)
> Martin, can we move it to qemu-kvm?

I don't see why. QEMU behaves correctly; it is (again) NFSv3 that has the issue. Maybe different mount options might help here (using hard mounts instead of soft ones).

I personally would expect so, as soft mounts return a write error, whereas a hard mount would "queue" the I/O.

Have you tried using hard mounts instead of soft ones?

So to me this looks more like a storage issue that can't be fixed completely due to the nature of NFS.

Adding Ademar for his view on this.

Comment 56 Ademar Reis 2018-03-15 16:43:49 UTC
(In reply to Martin Tessun from comment #55)
> Adding Ademar for his view on this.

I added a comment before (comment #44) explaining that when QEMU gets stuck in a system call (read() on a non-responding NFS server, for example), there's not much to be done and the whole process freezes.

We have a (very old) RFE about this behavior, see Bug 665820. But it requires some serious reengineering of QEMU to be implemented: we need a separate monitor thread or process that watches for stuck syscalls.

To get more details of what's happening at the QEMU level, please see https://bugzilla.redhat.com/show_bug.cgi?id=665820#c11 (and other comments there).

I recommend closing this BZ as a duplicate of Bug 665820.

Comment 57 Michal Skrivanek 2018-04-13 12:18:50 UTC
(In reply to Polina from comment #48)
> Created attachment 1405396 [details]
> logs_nfs_vm.tar.gz

Polina,
I reviewed this one and it's again the same problem - I'm now assuming you're using the same settings as Natalie in comment #43 - you have soft mounts with a 6-minute timeout, but your logs cover only circa 5 minutes after blocking the access. It also takes 5 minutes to declare the host non-operational (because of the default storage monitoring timeouts), so you won't EVER get an I/O error in this setup.


I also reviewed the logs from comment #49 (Gluster):
16:31:12 VM is deemed not responsive (stuck monitoring calls)
16:31:44.691 it apparently continues to run (possibly it still was; it's just the libvirt stats calls being stuck) and a guest agent response (Guest's message heartbeat) triggers an Up event. This may be a bug and we may want to _not_ send an Up event in case of monitorResponse == -1. It may also be a bogus message, since it seems to be really close to the next event. It would be good to verify this behavior several times.
16:31:44.714 IOError is reported
16:31:56/57 the VM moves back to NotResponding - that's probably a bug in the monitoring code related to the IOError "response" earlier. @Arik, please look at that.



To sum up:
- FCP and iSCSI work ok, per comment #52
- NFSv3 with default mount options works "as expected" (though not desired) - VM is not responding for 5 minutes and then host becomes non-operational.
- NFSv4 not tested? It should behave the same with default options though.
- Gluster - seems to work reasonably ok

To get useful data for NFSv3 it needs to be retested with different timeouts, and I guess NFSv4 should be tested independently.

Comment 58 Polina 2018-04-15 16:17:51 UTC
Hi Michal,
for rhvm-4.2.3-0.1.el7.noarch, libvirt-3.9.0-14.el7_5.2.x86_64:

Gluster seems to be OK - after the Gluster storage is blocked the VM is Unknown, but it soon turns to I/O-error paused, so I was able to verify the Resume Behavior feature for Gluster storage.

For NFS:
I tested with NFSv4. I can't get the I/O error for the VM; it turns to Not Responding, and then the host becomes NonOperational.
If the environment settings can be configured to work around this phenomenon, could you please explain how to configure them?

Comment 59 Michal Skrivanek 2018-04-16 08:58:26 UTC
Adding back the needinfo from comment #57 - Arik, can you please check that?

Polina, it's discussed earlier in this bug with Natalie. Just set the timeout to less than 5 minutes.

Comment 60 Yaniv Kaul 2018-04-24 11:00:52 UTC
This is neither a blocker nor an exception, yet it is targeted for 4.2.3?

Comment 61 Arik 2018-04-24 12:31:47 UTC
(In reply to Michal Skrivanek from comment #57)
> Also reviewd logs from comment #49 (gluster):
> 16:31:12 VM is deemed not responsive (stuck monitoring calls)
> 16:31:44.691 it apparently continues to run (possibly it still was, it's
> just libvirt stats calls being stuck) and guest agent respond(Guest's
> message heartbeat) triggers Up event. This may be a bug and we may want to
> _not_ send Up event in case of monitorResponse == -1. It may also be a bogus
> message since it seems to be real close to the next event. Would be good to
> verify this behavior several times.
> 16:31:44.714 IOError is reported
> 16:31:56/57 VM moves back to NotResponding - that's probably a bug in
> monitoring code related to a IOError "respond" earlier. @Arik please look at
> that

The engine sets the NotResponding state only when VDSM reports monitorResponse < 0.

The engine doesn't have much choice in that case: when it gets a report from VDSM that indicates the VM is NotResponding, the engine should honor this report and switch the state of the VM accordingly. If this state is incorrect, VDSM should not report it.
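
As an aside, a minimal illustration of the mapping described above (a simplified model only, not the actual VmAnalyzer code):

    # Simplified model of the behaviour described in this comment: the engine
    # keeps the state reported by VDSM unless the stats carry
    # monitorResponse < 0, in which case it overrides it with NotResponding.
    def engine_view_of_vm(reported_state, monitor_response):
        if monitor_response < 0:
            return "NotResponding"
        return reported_state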

Comment 62 Michal Skrivanek 2018-04-24 18:05:42 UTC
The resolution is converging, but there is still the case of NFS, which has not been properly verified by QE yet. Moving to 4.2.4 then.

Arik, the problem with that is that status events like Paused/IOError are received through the lifecycle callback, which may work even when stats do not. I agree we should honor monitorResponse. Maybe the issue is due to that "Up" from the guest agent, but the lifecycle event should be reflected in the engine somehow.
I.e., if I have a VM in NotResponding, I still want to process lifecycle changes reported by VDSM events.

Again, this should probably be tested by QE several times to see whether this behavior is consistent and to understand that guest agent "Up" report.

Comment 63 Arik 2018-04-25 07:58:57 UTC
(In reply to Michal Skrivanek from comment #62)
> Arik, problem with that is that status events like Paused/IOError is
> received through lifecycle callback which may work even when stats do not. I
> agree we should honor monitoringResponse. Maybe the issue is due to that
> "Up" from guest agent), but the lifecycle event should be reflected in
> engine somehow.
> I.e. if I have a VM in NotResponding I still want to process lifecycle
> changes reported by VDSM events.

I agree, and the engine does that - that's why the status of the VM changed from NotResponding to Up, right? Later, the VM state was changed to NotResponding because that was the next input from VDSM.

The real advantage of the engine, when it comes to monitoring, is that it generally knows more than VDSM about the overall state of the system - it knows the status of other hosts, of other VMs running on different hosts, of storage domains, etc. On the other hand, VDSM is smarter when it comes to host-specific things. Sure, VDSM can tell the engine "Listen, the status of the VM is a bit complicated: I can't get its stats but I got a paused/IOError lifecycle event and the guest agent seems to work just fine", and then the engine would get all this data and either confuse the user by displaying all that information or decide what the state of the VM is in light of all that information.

However, IMO VDSM should be the component that decides what the status of the VM running on its host is. If the stats do not work but there's an indication from the guest agent that the VM is running, there's no point in telling the engine that the VM is NotResponding. If the indication from the guest agent means that the VM is running and usable, VDSM should report the VM as Up, maybe send another event, and produce an audit log saying that something is wrong with getting its stats.

I think this bz should move to VDSM.

Comment 64 Arik 2018-04-25 08:12:09 UTC
(In reply to Arik from comment #63)
> (In reply to Michal Skrivanek from comment #62)
> > Arik, problem with that is that status events like Paused/IOError is
> > received through lifecycle callback which may work even when stats do not. I
> > agree we should honor monitoringResponse. Maybe the issue is due to that
> > "Up" from guest agent), but the lifecycle event should be reflected in
> > engine somehow.
> > I.e. if I have a VM in NotResponding I still want to process lifecycle
> > changes reported by VDSM events.
> 
> I agree and the engine does that - that's why the status of the VM changed
> from NotResponing to Up, right? Later, the VM state was changed to
> NotResponding because that was the next input from VDSM. 
> 
> The real advantage of the engine, when it comes to monitoring, is that it
> generally knows more than VDSM about the overall state of the system - he
> knows what's the status of other hosts, of other VMs running on different
> hosts, of storage domains, etc. On the other hand, VDSM is smarter when it
> comes to host-specific things. Sure, VDSM can tell the engine "Listen, the
> status of the VM is a bit complicated: I can't get its stats but I got a
> paused/IOError lifecycle event and the guest agent seems to work just fine"
> and then the engine would get all this data and either confuse the user by
> displaying all that information or decide what's the state of the VM in
> light of all that information.
> 
> However, IMO VDSM should be the component that decides what's the status of
> the VM that is running on its host. If the stats do not work but there's an
> indication from the guest agent that the VM is running - there's no point in
> sending the engine that the VM is NotResponding. If the indication from the
> guest agent means that the VM is running and is usable VDSM should report
> the VM as Up, and maybe send another event and would produce an audit log
> saying that something is wrong with getting its stats.

Unless some operations cannot be done on that VM, like migration. In that case, NotResponding can actually be a better state.

> 
> I think this bz should move to VDSM.

Comment 65 Michal Skrivanek 2018-04-25 18:17:07 UTC
(In reply to Arik from comment #64)

> > However, IMO VDSM should be the component that decides what's the status of
> > the VM that is running on its host. If the stats do not work but there's an
> > indication from the guest agent that the VM is running - there's no point in
> > sending the engine that the VM is NotResponding. 

Too bad this is signaled by monitorResponse==-1 in the stats instead of an actual state. It's asynchronous, and confusing in any case.

> > If the indication from the
> > guest agent means that the VM is running and is usable VDSM should report
> > the VM as Up, and maybe send another event and would produce an audit log
> > saying that something is wrong with getting its stats.
> 
> Unless some operations cannot be done on that VM, like migration. In that
> case, NotResponding can actually be a better state.

This is actually an example where the opposite would work better :) If you can still instruct the VM to migrate away while its stats do not work, it's worth trying to initiate the migration :)

> > I think this bz should move to VDSM.

Not sure. Of course it would all be great if we refactored everything, but that's probably not feasible. The code on the vdsm side that modifies the state reported by libvirt is quite awful already.

Comment 66 Arik 2018-04-26 06:45:12 UTC
(In reply to Michal Skrivanek from comment #65)
> > Unless some operations cannot be done on that VM, like migration. In that
> > case, NotResponding can actually be a better state.
> 
> this is actually an example where the opposite would work better:) if you
> can still instruct VM to migrate away when its stats do not work it's worth
> a try to initiate the migration:)

So on the other extreme, if everything is enabled in that state - e.g., migration, connecting a console, snapshotting, etc. - then maybe it is wrong to interpret monitorResponse==-1 as a NotResponding state. Maybe we should use the state reported by the lifecycle events and have an additional indication that we are not getting stats.

> 
> > > I think this bz should move to VDSM.
> 
> not sure. Of course it would be all great if we refactor everything, but
> that's probably not feasible. The code on vdsm side modifying the state
> reported by libvirt is quite awful already

Maybe I'm missing something then.
IIUC the problem we are discussing is the VM's state flipping between Paused and NotResponding. The state of the VM in libvirt is Paused, but we fail to get the stats, so the engine interprets monitorResponse==-1 as NotResponding on every statistics cycle.

Assuming that we still want to set the state to NotResponding when getting monitorResponse==-1, the desired result in the discussed scenario would be:
1. if the engine gets monitorResponse==-1 and no VM state change, set NotResponding.
2. if the engine gets a VM state change, set the VM state accordingly.
3. if the engine gets monitorResponse==-1 soon after, keep the VM state.
So we'll end up with, e.g., Up -> NotResponding -> Paused and no more switches to NotResponding (see the sketch below). Am I right?
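
A minimal sketch of those three rules, written as Python pseudologic for illustration only (the real engine code is Java in VmAnalyzer, and these names are hypothetical):

  # Illustration of rules 1-3 above; not the actual engine (Java) code.
  def next_engine_status(current_status, state_change, monitor_response):
      if state_change is not None:
          # Rule 2: an explicit state change from VDSM always wins.
          return state_change
      if monitor_response == -1:
          if current_status == 'Up':
              # Rule 1: stats are stale and no state change -> NotResponding.
              return 'NotResponding'
          # Rule 3: keep a recently reported state such as Paused
          # (the time window is omitted for brevity).
          return current_status
      return current_status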

But then you say "VDSM code is too complex for us to implement such logic there, so let's move it to the engine, which is easier for us to change" - I agree with that statement, but I'm afraid it will not be trivial to add such hacks on the engine side either.

Comment 67 Michal Skrivanek 2018-04-27 11:14:12 UTC
(In reply to Arik from comment #66)
> Maybe I'm missing something then.
> IIUC the problem we are discussing is about the flipping state of the VM
> between Paused and NotResponding. The state of the VM in libvirt is Paused
> but we fail to get the stats so the engine interprets monitorResponse==-1 as
> NotResponding on every statistics cycle.
> 
> Assuming that we still want to set the state to NotResponding when getting
> monitorResponse==-1, the desired result in the discussed scenario would be:
> 1. if the engine gets monitorResponse==-1 and no VM state change, set
> NotReponding.
> 2. if the engine gets VM state change, set the VM state accordingly.
> 3. if the engine gets monitorResponse==-1 soon after, keep the VM state.
> So we'll end up with, e.g. Up -> NotReponding -> Paused and no more switches
> to NotReponsing. Am I right?

Maybe we can say that the NotResponding state is only relevant for VMs which are running (guest CPU is running) and not change to/from it in any other case. I see 2 issues in comment #57 - first, that vdsm itself sent "Up" (from a guest agent report) while monitorResponse was -1. If we consider guest agent reports part of the statistics (which they indeed are), then this event shouldn't have been sent (this has to be fixed in vdsm, because in the engine you cannot tell whether the Up comes from libvirt or from the guest agent).
The second issue is the one you're talking about: we move to Paused while monitorResponse is -1 (which would be OK on its own) and then back to NotResponding. I say that if we ignore monitorResponse when the state is not-running (Paused), then it's fine - the VM stays Paused, roughly as in the sketch below.
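
Roughly, the vdsm-side guard could look like this (an illustrative sketch with made-up names, not vdsm's actual code):

  # Illustrative sketch; not vdsm's actual code. The idea: only report a
  # monitorResponse of -1 while the guest CPU is supposed to be running.
  GUEST_CPU_RUNNING_STATES = frozenset(['Up'])

  def annotate_stats(stats, vm_state, stats_are_stale):
      if stats_are_stale and vm_state in GUEST_CPU_RUNNING_STATES:
          stats['monitorResponse'] = '-1'
      else:
          stats['monitorResponse'] = '0'
      return stats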

> But then you say "VDSM code is too complex for us to implement such logic
> there so let's move it to the engine that is easier for us to change" - I
> agree with that statement, but I'm afraid it will not be trivial to add such
> hacks on the engine side as well.

:) No, it's more about the paradigm of modifying the state coming from libvirt as little as possible in vdsm. We already have some of that and it's quite difficult to follow due to the asynchronous nature of reports from the guest agent, the libvirt event loop, libvirt stats, and internal code. There's no NotResponding state at all in vdsm.

Comment 68 Arik 2018-04-27 20:35:04 UTC
(In reply to Michal Skrivanek from comment #67) 
> maybe we can say that NotResponding state is only relevant for VMs which are
> running (guest CPU is running) and not change to/from it in any other case.
> I see 2 issues in comment #57 - first that vdsm itself sent "Up" (from guest
> agent report) while monitorResponse was -1. If we consider guest agent
> reports part of statistics (which they indeed are) then this event shoudln't
> have been sent (this has to be fixed in vdsm because you cannot tell if Up
> is from libvirt or guest agent in engine)

Well, we could theoretically enhance the VDSM->engine API to reflect the source of the information. I thought about this; that's why I consider such an engine-side solution 'non-trivial' and agree that it should be done in VDSM :)

> Second issue is what you're talking about, that we move to Paused while
> monitorResponse is -1 (that would be ok on its own) and then back to
> NotResponding - I say that if we ignore monitorResponse when state is
> not-running(Paused) then it's ok, VM stays Paused.

I'm concerned about the following scenario:
We have a running VM => its state is Up.
Then, the VM is being paused => its state is Paused.
Then, the VM is resumed => its state is Up but the engine doesn't process this event.
Later, all stats fail and we get monitorResponse == -1, so the VM stays in status Paused (until the next status transition) even though it is actually running in libvirt.

So ignoring (on the engine side) monitorResponse==-1 when the 'known' state of the VM in the engine is Paused would only be feasible if we were sure the stats call failed because the VM was paused and would succeed otherwise.

> 
> > But then you say "VDSM code is too complex for us to implement such logic
> > there so let's move it to the engine that is easier for us to change" - I
> > agree with that statement, but I'm afraid it will not be trivial to add such
> > hacks on the engine side as well.
> 
> :) no, it's more about the paradigm of doing as little modification from
> libvirt as possible in vdsm. We already have some and it's quite difficult
> to follow due to async nature of reports from guest agent, libvirt event
> loop, libvirt stats, and internal code. There's no NotResponding state at
> all in vdsm

Right, but you know, 'best practices' should be taken with caution. When the only source of truth is libvirt then yeah - there's no need to override the data we get from it, convert it to another format, and so on; that makes perfect sense to me. But when VDSM is provided with data from different sources and can produce better - more accurate and simpler - information, I think it should.

Comment 69 Michal Skrivanek 2018-04-30 12:01:40 UTC
(In reply to Arik from comment #68)
> I'm concerned about the following scenario:
> We have a running VM => its state is Up.
> Then, the VM is being paused => its state is Paused.
> Then, the VM is resumed => its state is Up but the engine doesn't process
> this event.
> Later, all stats fail thus we get monitorResponse == -1, so the VM stays in
> status Paused (until the next status transition) although it is actually
> running in libvirt.
> 
> So ignoring (on the engine side) the monitorResponse==-1 when the 'known'
> state of the VM on the engine is Paused would only be feasible if we are
> sure the stats call failed because the VM was paused and succeed otherwise.

Perhaps I wasn't clear - I meant that we would ignore the processing of monitorResponse==-1 (i.e. not move to NotResponding after the defined interval) when the VM status at that time is "guest CPU is not running" (i.e. Down, Paused, ResumingState, Migration Destination - you could say Paused is realistically the only one you would ever see in processing).
This wouldn't necessarily apply to the situation where the VM is Paused and you get monitorResponse > 0, though I believe the coverage by VM events and full stats is good enough.
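
For illustration, the engine-side check described above could be sketched as follows (Python pseudologic with hypothetical names; the merged fix is Java code in the engine):

  # Illustration only; status names follow the comment above.
  GUEST_CPU_NOT_RUNNING = frozenset(
      ['Down', 'Paused', 'ResumingState', 'MigrationDestination'])

  def should_mark_not_responding(status, monitor_response, stale_seconds, limit):
      if status in GUEST_CPU_NOT_RUNNING:
          # Stats are not expected to be meaningful here, so never flip
          # these states to NotResponding.
          return False
      return monitor_response == -1 and stale_seconds >= limit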

Comment 70 Polina 2018-06-03 13:48:52 UTC
The problem still reproduces on rhv-release-4.2.4-2-001.noarch.

Comment 71 Arik 2018-06-03 14:27:18 UTC
(In reply to Polina from comment #70)
> The problem is still reproduced on rhv-release-4.2.4-2-001.noarch

Just noting that this is expected as only the engine side was merged.

Comment 72 Milan Zamazal 2018-06-25 11:05:06 UTC
There is still the problem of virDomainMemoryStats libvirt calls from Vdsm freezing, which needs to be handled in some way.
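
For illustration only, one generic way to keep such a blocking call from freezing a sampling thread is a timeout guard like the sketch below; this is not vdsm's actual executor machinery, and the helper name is made up:

  # Sketch only: vdsm's real sampling uses its own executor with worker
  # replacement; this just shows the general shape of a timeout guard.
  from concurrent.futures import ThreadPoolExecutor, TimeoutError

  _pool = ThreadPoolExecutor(max_workers=4)

  def memory_stats_with_timeout(dom, timeout=5.0):
      """Return dom.memoryStats(), or None if the call doesn't finish in time.

      Note that the blocked worker thread cannot be interrupted; it stays
      stuck on the hung libvirt call, the caller just stops waiting for it.
      """
      future = _pool.submit(dom.memoryStats)
      try:
          return future.result(timeout=timeout)
      except TimeoutError:
          return None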

Comment 73 Polina 2018-07-09 20:03:16 UTC
Verification on the downstream package failed.

rhv-release-4.2.5-1-001.noarch
vdsm-4.20.33-1.el7ev.x86_64
libvirt-3.9.0-14.el7_5.6.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64

Hi, the described scenario is still reproducible in the latest build.

Comment 74 Polina 2018-07-10 14:54:34 UTC
Also tested on the nightly build - the same behavior:

rhvm-4.2.5.1_SNAPSHOT-72.g1639ffa.0.scratch.master.el7ev.noarch

Comment 75 Michal Skrivanek 2018-07-11 03:58:37 UTC
(In reply to Polina from comment #73)
> verification on downstream package failed.
> 
> rhv-release-4.2.5-1-001.noarch
> vdsm-4.20.33-1.el7ev.x86_64
> libvirt-3.9.0-14.el7_5.6.x86_64
> qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64
> 
> Hi, the described scenario still reproducible in the last build.

OK, which one exactly? Which storage type(s)? There are quite a few different scenarios in the previous comments. Also, please add logs; it could be a different issue.

Comment 76 Polina 2018-07-15 07:46:19 UTC
Hi, in the attached logs (vdsm with full debug) you can see the states of two VMs: the NFS VM golden_env_mixed_virtio_1_0 and the Gluster VM golden_env_mixed_virtio_3_0.
The expected behavior for both VMs: the VM must be paused with an I/O error while the storage is blocked.
Actually, the VMs are in the not-responding state.

1. For the NFS VM:
753 2018-07-15 09:55:50,947+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-68) [] EVENT_ID: USER_RUN_VM(32), VM golden_env_mixed_virtio_1_0 started on Host host_mixed_1

Block the NFS storage with an iptables command.

906 2018-07-15 10:01:53,637+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-94) [] VM '3855687f-a362-48a9-8d5a-b1dc3ff22dd5'(golden_env_mixed_virtio_1_0) moved from 'Up' --> 'NotResponding'

The VM remains in the not-responding state. At some point the engine tries to migrate it (not expected - it is not an HA VM) and fails.

1099 2018-07-15 10:06:21,068+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-355) [1b18e31c] Failed to migrate VM 'golden_env_mixed_virtio_1_0'


2. For the Gluster VM:
1083 2018-07-15 10:03:11,388+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-49) [] EVENT_ID: USER_RUN_VM(32), VM golden_env_mixed_virtio_3_0 started on Host host_mixed_3

Block the replica-3 Gluster storage.

1103 2018-07-15 10:07:12,218+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-39) [] VM '2914e97d-c6c7-4a94-a84e-8bd7a4d8cc78'(golden_env_mixed_virtio_3_0) moved from 'Up' --> 'NotResponding'
Then it flips to the Up state:
1112 2018-07-15 10:08:12,581+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-69) [] VM '2914e97d-c6c7-4a94-a84e-8bd7a4d8cc78'(golden_env_mixed_virtio_3_0) moved from 'NotResponding' --> 'Up'

Then the engine tries to migrate this VM, which fails. As a result, the VM is down:

2018-07-15 10:11:38,625+03 ERROR [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-0) [] Migration of VM 'golden_env_mixed_virtio_3_0' to host 'host_mixed_1' failed: VM destroyed during the startup.
2018-07-15 10:11:38,646+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-9) [] VM '2914e97d-c6c7-4a94-a84e-8bd7a4d8cc78'(golden_env_mixed_virtio_3_0) moved from 'MigratingFrom' --> 'Down'

version rhv-release-4.2.5-3-001.noarch

Comment 77 Polina 2018-07-15 07:50:25 UTC
Created attachment 1458964 [details]
logs_engine_vdsm.tar.gz

Comment 78 Milan Zamazal 2018-07-16 16:19:44 UTC
The problem is the same in both cases -- apparently the getAllDomainStats/domainListGetStats calls in bulk sampling take a long time, resulting in VM stats that are not sufficiently recent, which sets monitorResponse to -1.
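
In other words, the staleness check behaves roughly like this simplified sketch (identifiers and the threshold are illustrative assumptions, not vdsm's exact code or configuration):

  # Simplified illustration; names and the threshold are assumptions.
  import time

  STATS_MAX_AGE = 40.0  # seconds -- assumed value; the real one is configurable

  def monitor_response(last_sample_timestamp):
      """-1 means 'stats too old', which the engine shows as NotResponding."""
      age = time.monotonic() - last_sample_timestamp
      return -1 if age > STATS_MAX_AGE else 0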

Comment 79 Milan Zamazal 2018-07-17 07:37:58 UTC
After further inspection the behavior looks mostly expected. The VMs get into a non-responsive state on I/O errors and, with GlusterFS, may get back out of it thanks to some self-healing capabilities of the file system. The hosts become non-operational due to storage problems and the engine tries to migrate VMs off such hosts. The VM going down may be related to a known QEMU bug: https://bugzilla.redhat.com/1548448. To confirm all of that we need the related logs.

Polina, could you please provide:

- GlusterFS logs
- libvirt logs
- QEMU logs

Comment 80 Milan Zamazal 2018-07-17 10:33:29 UTC
Moving to 4.2.6. There is no clear bug here anymore, just waiting for missing information from QE to confirm that the recently observed behavior corresponds to actual system conditions.

Comment 81 Polina 2018-07-18 15:53:01 UTC
Created attachment 1459744 [details]
scenario_nfs_vm_block_storage

Attached are logs for both scenarios - NFS and Gluster: scenario_nfs_vm_block_storage.tar.gz and scenario_gluster_vm_block_storage.tar.gz.

Comment 82 Polina 2018-07-18 15:53:55 UTC
Created attachment 1459745 [details]
scenario_gluster_vm_block_storage

Comment 83 Milan Zamazal 2018-07-23 13:27:59 UTC
Regarding the VM crash on migration with GlusterFS, I can see in the QEMU log that it's a QEMU problem:

  2018-07-18 15:46:12.293+0000: initiating migration
  Unexpected error in raw_apply_lock_bytes() at block/file-posix.c:642:
  2018-07-18T15:46:15.338010Z qemu-kvm: Failed to lock byte 100
  2018-07-18 15:46:15.610+0000: shutting down, reason=crashed

And in libvirt:

  2018-07-18 15:46:15.405+0000: 22018: error : qemuMonitorIO:719 : internal error: End of file from qemu monitor
  2018-07-18 15:46:15.405+0000: 22018: debug : qemuMonitorIO:767 : Error on monitor internal error: End of file from qemu monitor
  2018-07-18 15:46:15.405+0000: 22018: debug : qemuMonitorIO:788 : Triggering EOF callback
  2018-07-18 15:46:15.405+0000: 22018: debug : qemuProcessHandleMonitorEOF:290 : Received EOF on 0x7f324801f560 'golden_env_mixed_virtio_3_0'
  ...
  2018-07-18 15:46:15.405+0000: 22018: debug : virNetlinkEventCallback:707 : dispatching to max 0 clients, called from event watch 9
  2018-07-18 15:46:15.405+0000: 22018: debug : virNetlinkEventCallback:720 : event not handled.
  2018-07-18 15:46:15.405+0000: 24918: debug : qemuProcessEventHandler:4818 : vm=0x7f324801f560, event=6
  2018-07-18 15:46:15.405+0000: 24918: debug : qemuProcessKill:6339 : vm=0x7f324801f560 name=golden_env_mixed_virtio_3_0 pid=468 flags=0x1
  2018-07-18 15:46:15.405+0000: 24918: debug : virProcessKillPainfully:355 : vpid=468 force=1
  2018-07-18 15:46:15.440+0000: 22083: debug : udevHandleOneDevice:1677 : udev action: 'change'
  2018-07-18 15:46:15.441+0000: 22083: debug : udevGetDeviceProperty:149 : Found property key 'DRIVER' value '<null>' for device with sysname 'kvm'
  2018-07-18 15:46:15.441+0000: 22083: debug : udevGetDeviceProperty:149 : Found property key 'SUBSYSTEM' value 'misc' for device with sysname 'kvm'
  2018-07-18 15:46:15.441+0000: 22083: debug : udevGetDeviceType:1349 : Could not determine device type for device with sysfs name 'kvm'
  2018-07-18 15:46:15.441+0000: 22083: debug : udevAddOneDevice:1541 : Discarding device -1 0x7f31f4001650 /sys/devices/virtual/misc/kvm
  2018-07-18 15:46:15.607+0000: 24918: info : virObjectRef:388 : OBJECT_REF: obj=0x7f31f01ceea0
  2018-07-18 15:46:15.607+0000: 24918: debug : qemuDomainObjBeginJobInternal:4625 : Starting job: destroy (vm=0x7f324801f560 name=golden_env_mixed_virtio_3_0, current job=none async=migration out)
  2018-07-18 15:46:15.607+0000: 24918: debug : qemuDomainObjBeginJobInternal:4666 : Started job: destroy (async=migration out vm=0x7f324801f560 
  ...
  2018-07-18 15:46:15.609+0000: 24918: debug : processMonitorEOFEvent:4788 : Monitor connection to 'golden_env_mixed_virtio_3_0' closed without SHUTDOWN event; assuming the domain crashed

It seems such a crash hasn't been reported yet. Polina, if the bug is reproducible on your setup, could you please file a QEMU bug?

Comment 84 Milan Zamazal 2018-07-23 14:02:16 UTC
Regarding VM status changes, I can see that:

- With NFS, the VM switches to NotResponding and this is the last status change in the provided Engine log.

- With GlusterFS, there are two VMs, each of them switches to NotResponding and then after a while to Up. GlusterFS logs don't report anything after the second failure.

In both cases the cause of the status changes to NotResponding seems to be stuck libvirt calls. The VMs switch back to Up with GlusterFS once libvirt starts reporting VM stats again. Unless there is some more or less regular status flipping or obviously incorrect reporting, I'm afraid we can't do anything about that.

Polina, could you please clarify whether some of the currently observed behavior is clearly broken?

Comment 85 Polina 2018-07-25 12:04:32 UTC
Created attachment 1470506 [details]
qemu_libvirt_vdsm_engine_for_gluster_vm.tar.gz

Hi, I'm not sure I understand the last question - the currently observed behavior is broken.
The expected behavior is that the VM is paused due to an I/O storage error while the storage is blocked.
The actual behavior:
 for the NFS VM - NotResponding;
 for the Gluster VM - 'NotResponding' --> 'Up'. Both are problematic.

About the requested qemu.log for the Gluster VM, please see the attached qemu_libvirt_vdsm_engine_for_gluster_vm.tar.gz.

Comment 86 Milan Zamazal 2018-07-25 12:19:11 UTC
And is it an oVirt problem or a libvirt/QEMU problem? If the VM gets paused, libvirt reports it as paused, and Vdsm keeps reporting it as Up, then it may be an oVirt bug. If the VM doesn't get paused, or libvirt reports it as running, then it's a libvirt/QEMU problem. From the provided logs it looks like the VMs are either reported as running by libvirt or the libvirt calls are stuck, and Vdsm reacts correctly in both cases.

Polina, could you please clarify that?

Comment 87 Michal Skrivanek 2018-07-26 12:44:01 UTC
(In reply to Polina from comment #85)
> Created attachment 1470506 [details]
> qemu_libvirt_vdsm_engine_for_gluster_vm.tar.gz
> 
> Hi, I'm not sure I understand the last question - The currently observed
> behavior is broken.
> The expected behavior is the VM Paused due to I/O storage error while the
> storage is blocked.

We're going in circles a bit. Yes, indeed the expected behavior is that the VM gets Paused on an I/O error. The only question is _when_ there is an I/O storage error. If no I/O error is reported, you do not get a paused VM.

> The actual behavior:
>  for nfs VM  - Not Responging.

Is an I/O error being reported in the guest? On the host? I doubt it; we have discussed several times that the current timeout values won't ever report an I/O error before the host gets fenced instead. So -> not relevant to the "expected behavior" above.

>  for Gluster - 'NotResponding' --> 'Up'. Both are problematic

Is an I/O error being reported in the guest? On the host? If not, then this is again not relevant to this bug.

I do not see anything in the logs indicating that an I/O error was reported.

It could be a problem with, e.g., Gluster - that it is supposed to report an I/O error and currently does not - but that needs a separate gluster/qemu bug, with a separate and more concrete reproduction and logs.


I do not see anything wrong with the feature, nor anything we can do about it; moving back to ON_QA.

Comment 88 Polina 2018-07-31 07:36:54 UTC
Two new bugs have been filed for libvirt:
https://bugzilla.redhat.com/show_bug.cgi?id=1609701
https://bugzilla.redhat.com/show_bug.cgi?id=1609561 

I can't verify the bug since the end result didn't change - the user blocks the storage and the VM is not moved to the paused state due to an I/O error; instead, it is in the Not Responding state.

Regarding qemu/libvirt not reporting the error, BZ 1609701 has been filed.

The workaround for NFS VMs that used to produce the I/O-error pause state (timeout=1s and retries=2 for the SD) is not working in the current version. I change these parameters for the SD and then run the VM; the VM is paused immediately, even before the storage is blocked.

The engine enters a migration flow that must not be started here at all: it is a non-HA VM in the Not-Responding state and the host is in the Non-Operational state. The migration has no chance to succeed and should not be started. We can see the engine returning to the failed migration in a loop (see the log fragment below).

For the Gluster VM, this failed migration process finishes with a QEMU crash (BZ 1609561).

2018-07-29 16:01:15,203+03 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-77) [] EVENT_ID: USER_RUN_VM(32), VM golden_env_mixed_virtio_1_1 started on Host host_mixed_2
2018-07-29 16:03:16,469+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] VM 'cd15b8d4-29cf-41b4-860c-27a38f993fa1'(golden_env_mixed_virtio_1_1) moved from 'Up' --> 'NotResponding'
2018-07-29 16:03:16,606+03 WARN  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-87) [] EVENT_ID: VM_NOT_RESPONDING(126), VM golden_env_mixed_virtio_1_1 is not responding.
2018-07-29 16:07:36,727+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-17887) [7c4836a2] Failed to migrate VM 'golden_env_mixed_virtio_1_1'
2018-07-29 16:16:15,525+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-18032) [7fa0ad53] Failed to migrate VM 'golden_env_mixed_virtio_1_1'
2018-07-29 16:25:17,320+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-18209) [23d38dca] Failed to migrate VM 'golden_env_mixed_virtio_1_1'
2018-07-29 16:35:18,053+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-18413) [30931f20] Failed to migrate VM 'golden_env_mixed_virtio_1_1'
2018-07-29 16:45:20,125+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-18614) [c648901] Failed to migrate VM 'golden_env_mixed_virtio_1_1'
2018-07-29 16:55:13,448+03 ERROR [org.ovirt.engine.core.bll.SetNonOperationalVdsCommand] (EE-ManagedThreadFactory-engine-Thread-18811) [6d8cbb62] Failed to migrate VM 'golden_env_mixed_virtio_1_1'

Comment 89 meital avital 2018-07-31 07:45:15 UTC
According to comment 88 and to the 'Depends On' bugs, we can't verify this bug.

Comment 90 Milan Zamazal 2018-07-31 08:35:06 UTC
We are dealing with too many problems here:

- Regarding migrations, it's unrelated, please file a separate bug.

- Regarding QEMU crash on the attempted migration, thank you for reporting the QEMU bug. With respect to this bug, it's unrelated, so no further action needed here.

- Regarding NFS pausing the VM immediately with altered settings, I'd try a somewhat higher timeout, say 5s rather than 1s.

- Regarding the VM not being paused, is the guest OS still running? If so, then it is reported correctly as Up. If it should be paused, I'd say it's a problem not directly related to this bug and should be handled in the bug you reported against libvirt. Using different timeout settings, as suggested above, might help verify the regular behavior.

- Regarding the not responding state, IIRC the last time I inspected your reports, the not responding state was only temporary until switching back to Up. It was due to stuck libvirt calls. Such things may happen with broken storage; we can't expect everything to work smoothly in a broken environment.

I believe that all the reported oVirt problems within the scope of the original report have been fixed. At the moment, I don't have any further report of broken oVirt behavior at hand that I could work on within this bug. Could you attempt verification with NFS again, perhaps asking storage people for help with the settings?

Comment 91 Michal Skrivanek 2018-07-31 13:56:41 UTC
(In reply to meital avital from comment #89)
> According to comment 88 and to depends on we can't verify this bug

According to comment #90, things work as expected within the scope of the original report.
What resolution do you suggest, then?

Comment 92 Polina 2018-07-31 14:15:40 UTC
About the migration - a bug has been filed: https://bugzilla.redhat.com/show_bug.cgi?id=1610354

Comment 93 Michal Skrivanek 2018-07-31 14:32:15 UTC
Adding back needinfo on Meital for comment #91.
Adding back needinfo on Polina for the other aspects of comment #90. Only the NFS item remains, I believe; other bugs have been opened. Unless you'd like to open a bug on the "Regarding not responding state" item - up to you.

Comment 94 Michal Skrivanek 2018-08-03 11:24:56 UTC
Since 4.2.5 is released, this is now done.

