Bug 1526025
Summary: VM configured with resume_behavior='AUTO_RESUME' is not resumed if it was previously manually attempted for resume

| Field | Value |
|---|---|
| Product | [oVirt] vdsm |
| Component | Core |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | high |
| Version | 4.20.19 |
| Target Milestone | ovirt-4.2.8 |
| Fixed In Version | v4.20.45 |
| Hardware | x86_64 |
| OS | Linux |
| Reporter | Polina <pagranat> |
| Assignee | Milan Zamazal <mzamazal> |
| QA Contact | Polina <pagranat> |
| CC | bugs, michal.skrivanek, mtessun, mzamazal, pagranat, tjelinek |
| Flags | rule-engine: ovirt-4.2+; mtessun: planning_ack+; rule-engine: devel_ack+; rule-engine: testing_ack+ |
| Type | Bug |
| Doc Type | Bug Fix |
| oVirt Team | Virt |
| Last Closed | 2019-01-22 10:23:23 UTC |
| Bug Depends On | 1566153, 1569614, 1582122, 1612943, 1634758 |

Doc Text: When a VM with auto-resume behavior had been paused due to storage problems, it might not be resumed once the storage problems disappeared. This has been fixed, and resuming a VM from a paused state caused by I/O errors is now more robust.
Description (Polina, 2017-12-14 16:12:58 UTC)
Created attachment 1368078 [details]
See the scenario from line 3911 in engine.log.
*** Bug 1526024 has been marked as a duplicate of this bug. ***

What happens:

1. At 17:33 the storage went down and two VMs were paused:
   `2017-12-14 17:33:44,317+02: VM_PAUSED_EIO(145), VM 2 has been paused due to storage I/O problem.`
2. At 17:34 a manual resume was attempted on one VM (15a07e75-da5b-4803-9059-c021e7906d73):
   `2017-12-14 17:34:50,909+02 INFO [org.ovirt.engine.core.vdsbroker.ResumeVDSCommand]`
3. At 17:37 the storage was fixed, and the VM on which no manual resume had been attempted was auto-resumed (in vdsm.log):
   `Trying to resume VM 7e2277dc-deb9-417d-bc00-b125af5d1deb after EIO`

The other VM was not auto-resumed because its state was not VIR_DOMAIN_PAUSED_IOERROR. It seems the state was cleared when the manual resume was attempted.

---

I can't reproduce the bug, but I can see the following in the provided libvirtd.log, in this order:

```
2017-12-14 15:34:50.922+0000: 1495: info : qemuMonitorIOWrite:536 : QEMU_MONITOR_IO_WRITE: mon=0x7f11740262a0 buf={"execute":"cont","id":"libvirt-161"}
...
2017-12-14 15:34:50.924+0000: 1495: info : qemuMonitorIOProcess:431 : QEMU_MONITOR_IO_PROCESS: mon=0x7f11740262a0 buf={"timestamp": {"seconds": 1513265690, "microseconds": 924058}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "nospace": false, "__com.redhat_reason": "eio", "node-name": "#block184", "reason": "Input/output error", "operation": "write", "action": "stop"}}
<... repeated many times ...>
...
2017-12-14 15:34:50.945+0000: 1495: info : qemuMonitorIOProcess:431 : QEMU_MONITOR_IO_PROCESS: mon=0x7f11740262a0 buf={"timestamp": {"seconds": 1513265690, "microseconds": 945339}, "event": "RESUME"}
...
2017-12-14 15:34:50.955+0000: 1495: info : qemuMonitorIOProcess:431 : QEMU_MONITOR_IO_PROCESS: mon=0x7f11740262a0 buf={"timestamp": {"seconds": 1513265690, "microseconds": 955191}, "event": "STOP"}
<... no more BLOCK_IO_ERROR ...>
```

So it looks like a problem of event ordering, either in QEMU or in libvirt's processing. (See the event-listener sketch below for one way to observe this ordering.)

---

(In reply to Polina from comment #0)

> Steps to Reproduce:
> 1. Have running HA VM with Resume Behavior = 'AutoResume'. Block the storage -> VM Paused since I/O error.

Which VM? Always make sure you specify which VM you're talking about.

> 2. while this state try to run the VM.

Try how? Please always specify exactly how you are trying to do things manually and, if possible, note down the exact time and/or the corresponding log line.

> As result sometimes the VM just stay

Sometimes? Please do not use vague terms; enumerate all the attempts, with details as per the previous comment.

> in the same I/O Pause. And sometimes the I/O error Pause is changed to Pause.

It should still be in EIO if the reason for the pause is still an I/O error. Note down (time, log line) when it changed to a pause without EIO, and please note what was different in that attempt (timing? did anything happen on the storage so that it actually may have resumed for a while? other actions on the system?).

> it doesn't affect the next behavior - the same in both cases.

Then the above wouldn't be significant? Do you have examples of both cases in the same run? If so, please note down the VM ids.

> In the attached engine.log please see the scenario from line 3911.
> 3. unblock the storage after while => the VM remains Paused instead of being Resumed.

"After a while" needs better qualification. Again, please correlate that to the logs and/or note the exact time when the connection was restored and how. What was the time of any other manual action in the GUI?

> Additional info: Trying to manually start the VM after this scenario brings firstly error: Launching VM Failed, and then it is started.

"After this scenario": what exactly did you try to do? With which VM? Where does the mentioned error happen? Again, please note down the time or log line. You say it's reproducible 100%, so please do reproduce and add the individual details. Otherwise: the testing we did didn't reveal any problem and all VMs resumed fine, so we would have to close this as WORKSFORME.
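A minimal sketch, assuming the libvirt-python bindings and a local qemu:///system connection: a standalone listener that prints lifecycle and I/O-error events in arrival order, which is one way to observe the BLOCK_IO_ERROR/RESUME/STOP ordering discussed above. This is not vdsm's actual event-handling code.

```python
# Sketch: print libvirt lifecycle and I/O-error events in arrival order.
# Assumes libvirt-python and a local qemu:///system connection; not vdsm code.
import threading

import libvirt


def lifecycle_cb(conn, dom, event, detail, opaque):
    # For VIR_DOMAIN_EVENT_SUSPENDED, `detail` carries the pause reason;
    # a late RESUME followed by a plain STOP is how the IOERROR detail
    # gets lost in the scenario discussed in this bug.
    print('lifecycle: vm=%s event=%d detail=%d' % (dom.name(), event, detail))


def io_error_cb(conn, dom, src_path, dev_alias, action, reason, opaque):
    # `reason` is a string such as 'eio' or 'enospc'.
    print('io error: vm=%s dev=%s action=%d reason=%s'
          % (dom.name(), dev_alias, action, reason))


libvirt.virEventRegisterDefaultImpl()
conn = libvirt.open('qemu:///system')
conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
                            lifecycle_cb, None)
conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_IO_ERROR_REASON,
                            io_error_cb, None)


def run_loop():
    while True:
        libvirt.virEventRunDefaultImpl()


loop_thread = threading.Thread(target=run_loop)
loop_thread.daemon = True
loop_thread.start()

threading.Event().wait()  # block forever; Ctrl-C to exit
```

Running something like this while blocking and unblocking the storage should make the event reordering visible without any oVirt machinery in between.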
Created attachment 1395839 [details]
engine&vdsm&vm_xml&screen_print
Hi, I reproduced the scenario and will try to describe it again. All logs are attached, including the VM XML and a UI screenshot.

1. Create an HA VM with a disk on an iSCSI storage domain, ResumeBehavior = 'AutoResume' (the XML from 'GET vms' is attached).
2. Block the iSCSI storage on the host => the VM goes into I/O error pause (line 24566 in engine.log).
3. While the VM is in I/O error pause, click Run in the Web UI. At this moment we can see what happens in engine.log:

```
24597: 2018-02-14 11:10:03,013+02 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-92836) [] vds 'host_mixed_2' reported domain '5c71f25a-20cf-40f7-9e74-6059218413fd:iscsi_0' as in problem, attempting to move the vds to status NonOperational
24603: 2018-02-14 11:10:03,244+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-92837) [57536105] EVENT_ID: VDS_SET_NONOPERATIONAL_DOMAIN(522), Host host_mixed_2 cannot access the Storage Domain(s) iscsi_0 attached to the Data Center golden_env_mixed. Setting Host state to Non-Operational.
```

At line 24605 the migrate command (initiated by the system) is running:

```
2018-02-14 11:10:04,789+02 INFO [org.ovirt.engine.core.bll.MigrateVmCommand] (EE-ManagedThreadFactory-engine-Thread-92838) [58d5a2cb] Running command: MigrateVmCommand internal: true. Entities affected : ID: 0cded25e-63ef-43ed-996c-9cfc1934d37a Type: VMAction group MIGRATE_VM with role type USER
```

The migration failed (line 24621), and at this moment the VM moved to Paused (not I/O error anymore; the status change is also seen in the UI).

4. Unblock the iSCSI storage (the host is activated, line 24663).

Result: the VM remains in Pause forever.

Created attachment 1395851 [details]
logs_second_reproduce.tar.gz
There is a second reproduction where the VM (name=golden_env_mixed_virtio_2_1) remains in I/O error pause (no migration attempts are seen in engine.log) and is not resumed after the storage is back. Logs are in logs_second_reproduce.tar.gz.

The same type of HA VM on an iSCSI SD, the same steps. While the VM is in I/O error pause (line 25773), click Run:

```
25785 2018-02-14 13:01:58,030+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-95513) [c2a5f4e3-e2c9-4b38-ae07-cb3eb9eceab2] EVENT_ID: USER_RESUME_VM(40), VM golden_env_mixed_virtio_2_1 was resumed by admin@internal-authz (Host: host_mixed_3).
```

When the storage is back, the host is activated:

```
25819 2018-02-14 13:05:29,382+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-95597) [] Domain '5c71f25a-20cf-40f7-9e74-6059218413fd:iscsi_0' recovered from problem. vds: 'host_mixed_3'
25820 2018-02-14 13:05:29,382+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-95597) [] Domain '5c71f25a-20cf-40f7-9e74-6059218413fd:iscsi_0' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
```

But the VM remains in I/O error pause.

The problem is 100% reproducible. I just don't understand why in the first case the engine tries to migrate the VM, which changes the state from I/O error pause to plain Pause, while in the second case it doesn't and the VM remains in I/O error pause. In both cases we have VMs remaining in a Paused state and not auto-resumed after the storage is back.
Created attachment 1395853 [details]
logs_second_reproduce.tar.gz
Please see comments 7 & 8. If you still have a problem reproducing it, let's be in touch on IRC; I'll reproduce again and share my environment with you.

---

(In reply to Polina from comment #8)

Thanks for the updated logs.

> The problem is 100% reproducible. I just don't understand why in the first case engine tries to migrate the VM which causes the changing state from I/O error Pause to Pause, and in the second case it doesn't, and VM remain in I/O Pause. in both cases we have VMs remaining in Paused state and not auto-resumed after the storage is back

It does that because of the non-operational case, due to the storage domain monitoring reporting issues. But it shouldn't have anything to do with you clicking on Run or not if that VM was already Paused. Are you sure it was not just a coincidence that the SD monitoring happened to kick in when you clicked?

---

Seems like a timing issue due to the resume:

```
2018-02-14 11:07:51,276+0200 INFO (jsonrpc/7) [api.virt] START cont() from=::ffff:10.35.161.243,60322, flow_id=36f984b6-152d-4842-9454-6662b0934960 (api:46)
2018-02-14 11:07:51,287+0200 INFO (libvirt/events) [virt.vm] (vmId='0cded25e-63ef-43ed-996c-9cfc1934d37a') abnormal vm stop device virtio-disk0 error eio (vm:4934)
2018-02-14 11:07:51,302+0200 INFO (jsonrpc/7) [virt.vm] (vmId='0cded25e-63ef-43ed-996c-9cfc1934d37a') CPU running: continue (vm:5942)
2018-02-14 11:07:51,322+0200 INFO (jsonrpc/7) [api.virt] FINISH cont return={'status': {'message': 'Done', 'code': 0}} from=::ffff:10.35.161.243,60322, flow_id=36f984b6-152d-4842-9454-6662b0934960 (api:52)
2018-02-14 11:07:51,324+0200 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call VM.cont succeeded in 0.05 seconds (__init__:573)
2018-02-14 11:07:51,388+0200 INFO (libvirt/events) [virt.vm] (vmId='0cded25e-63ef-43ed-996c-9cfc1934d37a') CPU stopped: onIOError (vm:5942)
2018-02-14 11:07:51,408+0200 INFO (libvirt/events) [virt.vm] (vmId='0cded25e-63ef-43ed-996c-9cfc1934d37a') CPU running: onResume (vm:5942)
2018-02-14 11:07:51,411+0200 INFO (libvirt/events) [virt.vm] (vmId='0cded25e-63ef-43ed-996c-9cfc1934d37a') CPU running: onResume (vm:5942)
2018-02-14 11:07:51,414+0200 INFO (libvirt/events) [virt.vm] (vmId='0cded25e-63ef-43ed-996c-9cfc1934d37a') CPU stopped: onSuspend (vm:5942)
```

That's ~2 minutes before the migration was initiated. I wonder whether the VM was actually running or not, and whether the status in Vdsm and in Engine was accurate between 11:07:51 and 11:10:04.

---

Pausing iSCSI VMs on I/O errors is currently not working for me (the errors propagate to the guest instead), and NFS and gluster behave very differently, so I can't test the situation. But the logs here show what could be the problem. Polina, could you please test https://gerrit.ovirt.org/87766 on your setup, to see whether it fixes the problem? If not, please attach the corresponding Vdsm logs. Please make sure you have DEBUG logging enabled in /etc/vdsm/logger.conf for all loggers, especially logger_virt; the relevant stanza is shown below.
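For reference, the logger_virt stanza as Polina later quotes it in this thread (the exact file contents may vary per installation; this shows the shape of the configuration, not a verified excerpt from this particular setup):

```ini
# /etc/vdsm/logger.conf (excerpt): enable DEBUG for the virt logger
[logger_virt]
level=DEBUG
handlers=logthread
qualname=virt
propagate=0
```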
Created attachment 1397575 [details]
vm.py
Hi Milan, I tried to apply your patch:

```python
if detail == libvirt.VIR_DOMAIN_EVENT_SUSPENDED_IOERROR:
    try:
        self._pause_code = self._readPauseCode()
    except libvirt.libvirtError as e:
        self.log.warning(
            "Couldn't retrieve pause code from libvirt: %s", e)
```

to my setup, but because I work with downstream (rhv-release-4.2.2-1-001.noarch) the code is very different, and I don't see where to insert it. I attach my vm.py.

I attach engine.log and vdsm.log with DEBUG enabled for all loggers. The steps:

1. Block the storage (the VM is paused with an I/O error):

```
2018-02-18 16:27:15,531+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-45) [] EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_2_0 has been paused due to storage I/O problem.
```

2. Click on Run:

```
2018-02-18 16:27:44,693+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-162) [71975762-a83f-448e-a572-3428e8de921a] EVENT_ID: USER_RESUME_VM(40), VM golden_env_mixed_virtio_2_0 was resumed by admin@internal-authz (Host: host_mixed_1).
```

3. Unblock the storage:

```
2018-02-18 16:28:36,360+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-181) [] Domain '5c71f25a-20cf-40f7-9e74-6059218413fd:iscsi_0' recovered from problem. vds: 'host_mixed_1'
```
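For context on the fragment quoted above, a minimal sketch of the idea behind the patch, assuming a vdsm-style lifecycle handler. The class shape, `_dom`, `log`, and the 'EIO'/'NOERR' pause-code strings here are illustrative stand-ins, not vdsm's exact 4.2.2 code: on a suspend event carrying the IOERROR detail, the pause reason is re-read from libvirt instead of being trusted from the (possibly reordered) event stream alone.

```python
# Sketch of the patch's idea; names and structure are illustrative, not vdsm's.
import libvirt


class Vm(object):

    def __init__(self, dom, log):
        self._dom = dom          # a libvirt.virDomain
        self.log = log           # a logging.Logger
        self._pause_code = None

    def _readPauseCode(self):
        # Re-query libvirt for the current (state, reason) pair instead of
        # relying on events that may arrive out of order.
        state, reason = self._dom.state()
        if (state == libvirt.VIR_DOMAIN_PAUSED and
                reason == libvirt.VIR_DOMAIN_PAUSED_IOERROR):
            return 'EIO'
        return 'NOERR'

    def _onSuspend(self, detail):
        # The quoted patch: only for IOERROR suspends, refresh the pause code.
        if detail == libvirt.VIR_DOMAIN_EVENT_SUSPENDED_IOERROR:
            try:
                self._pause_code = self._readPauseCode()
            except libvirt.libvirtError as e:
                self.log.warning(
                    "Couldn't retrieve pause code from libvirt: %s", e)
```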
Created attachment 1397590 [details]
engine.log & vdsm_debug.log
Created attachment 1397820 [details]
patched vm.py from 4.2.2
Hi Polina, attaching the patched vm.py you can use for testing.
Moving back to POST, since we don't know yet whether the patches are enough to fix the problem.

---

Polina, could you please check whether the problem is still present in 4.2.2 RC3?

---

The problem still happens on rhv-release-4.2.2-3-001.noarch with the attached vm.py. The vdsm & engine logs from the last run are attached.

Created attachment 1404236 [details]
vdsm and engine logs
I'm sorry, I messed up the RC version. Could you please recheck once again, with 4.2.2-4?

---

(In reply to Milan Zamazal from comment #23)
> I'm sorry, I messed up the RC version. Could you please recheck once again, with 4.2.2-4?

In 4.2.2-4 the behavior has changed for HA VMs with a lease and for HA VMs without a lease configured with Resume Behavior=KILL: in those cases the VM is restarted on another host. The problem still happens with HA VMs without a lease that are configured with Resume Behavior=AUTO_RESUME. In that case the VM remains forever in the I/O Error Paused state, even though I have three hosts in the environment in state Up. If its host moves to the Non-Operational state, the engine tries to migrate the VM, but the attempt fails and the VM remains in the I/O error paused state.

---

Thank you Polina for testing. I can't reproduce the problem with AUTO_RESUME, so could you please provide vdsm.log (with DEBUG enabled) from the failed attempt?

Created attachment 1405190 [details]
engine_vdsm_v4.2.2-4.tar.gz
(In reply to Milan Zamazal from comment #25)
> Thank you Polina for testing. I can't reproduce the problem with AUTO_RESUME, so could you please provide vdsm.log (with DEBUG enabled) from the failed attempt?

Hi Milan, please see the attached engine_vdsm_v4.2.2-4.tar.gz. In engine.log the scenario starts from this line:

```
2018-03-07 10:07:41,156+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-3) [] EVENT_ID: USER_RUN_VM(32), VM golden_env_mixed_virtio_2_0 started on Host host_mixed_3
```

The HA VM with the default AUTO_RESUME configuration remains forever in the Paused state after the storage is unblocked (while we have three hosts in the Up state).

---

Thank you Polina for the log, but it's missing DEBUG messages. Could you please provide a log of the failure with DEBUG messages? Also, what version of libvirt is installed on the system?

---

```
rpm -qa | grep libvirt
libvirt-daemon-driver-qemu-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-3.9.0-13.el7.x86_64
libvirt-libs-3.9.0-13.el7.x86_64
libvirt-daemon-driver-nodedev-3.9.0-13.el7.x86_64
libvirt-lock-sanlock-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-scsi-3.9.0-13.el7.x86_64
libvirt-daemon-3.9.0-13.el7.x86_64
libvirt-daemon-driver-secret-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-mpath-3.9.0-13.el7.x86_64
libvirt-daemon-config-nwfilter-3.9.0-13.el7.x86_64
libvirt-daemon-driver-interface-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-core-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-rbd-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-13.el7.x86_64
libvirt-3.9.0-13.el7.x86_64
libvirt-client-3.9.0-13.el7.x86_64
libvirt-daemon-config-network-3.9.0-13.el7.x86_64
libvirt-daemon-driver-lxc-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-disk-3.9.0-13.el7.x86_64
libvirt-daemon-driver-network-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-iscsi-3.9.0-13.el7.x86_64
libvirt-daemon-kvm-3.9.0-13.el7.x86_64
libvirt-daemon-driver-nwfilter-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-logical-3.9.0-13.el7.x86_64
libvirt-python-3.9.0-1.el7.x86_64
```

---

OK, I've got the same version. I still need your log with DEBUG messages.

---

Hi Milan, see the attached debug logs in logs_debug.tar. In vdsm.log start looking from 2018-03-10 13:38:28,776+0200; in engine.log from:

```
2018-03-10 13:38:50,735+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-10) [] EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_2_0 has been paused due to storage I/O problem.
```

The unblock command (iptables -D INPUT 1) was run on the host at Sat Mar 10 13:41:14 IST 2018. The VM golden_env_mixed_virtio_2_0 (HA with AUTO_RESUME) remains Paused forever.

Created attachment 1406589 [details]
logs_debug.tar
Thank you Polina for the log. Now I can see the problem: the libvirt suspend event reports VIR_DOMAIN_EVENT_SUSPENDED_PAUSED rather than VIR_DOMAIN_EVENT_SUSPENDED_IOERROR (it reports the latter on my installation, as expected). Polina, do you have a corresponding libvirt debug log?

---

Polina, could you please also check, in the failing situation, whether the VM starts being reported as paused on I/O error, sooner or later, after you restart Vdsm on the given host?

---

Milan, is this on track for 4.2.2? If not, please defer.

---

Since this bug is not even fully understood yet, deferring it.

---

Polina, could you please answer my comment 33 and comment 34?

Created attachment 1417095 [details]
libvirt, vdsm, qemu, engine log
(In reply to Milan Zamazal from comment #33)
> Thank you Polina for the log. Now I can see the problem is that libvirt suspend event reports VIR_DOMAIN_EVENT_SUSPENDED_PAUSED rather than VIR_DOMAIN_EVENT_SUSPENDED_IOERROR (it reports the latter on my installation, as expected).
> Polina, do you have a corresponding libvirt debug log?

Hi, the required logs are attached. In the attached tar, please see the scenario from line 866 in engine.log:

```
866 2018-04-04 10:33:37,323+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-30) [] EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_2_0 has been paused due to storage I/O problem.
```

(In reply to Milan Zamazal from comment #34)
> Polina, could you please also check, in the failing situation, whether the VM will start being reported as paused on I/O error, sooner or later, after you restart Vdsm on the given host?

Hi, I restarted vdsmd on the host where the VM is paused. The VM remained paused. The scenario is always reproducible. If it could help, we can synchronize on IRC and I'll share my environment for you to look into the problem.

---

Thank you Polina for the logs. The weird order of events is generated in QEMU, perhaps due to locking, depending on timing. I guess the EIO info is lost on a resume reported between the error events and the stop event. I further guess that EIO is also missing in the VM state info for the same reason, but let's check. This should be fixed at the QEMU/libvirt level; I'm not sure whether we can make a reasonable workaround in Vdsm in the meantime.

---

I checked that the IOERROR pause code is ultimately lost after resume-pause and filed a libvirt bug (https://bugzilla.redhat.com/1566153). I also checked that the diskErrors() libvirt call still returns libvirt.VIR_DOMAIN_DISK_ERROR_UNSPEC on the given drive, which might possibly be used as a workaround in case a libvirt/QEMU fix is unavailable. (A sketch of such a check appears below.)

---

Seems ready now. Can you please update the spec for qemu-kvm-rhev-2.10.0-21.el7_5.4?

---

Patch posted. The required QEMU version is updated in the referred patch.

---

Hi, the libvirt bug (https://bugzilla.redhat.com/1566153) is verified, yet the scenario is still reproducible in exactly the same way. Tested on:

rhv-release-4.2.5-5-001.noarch
vdsm-4.20.35-1.el7ev.x86_64
libvirt-3.9.0-14.el7_5.6.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64

Please let me know if you want to see it reproduced in the environment; I'll run the scenario. It is 100% reproducible.

---

Polina, could you please provide Vdsm, Engine and libvirt logs from the current run?

---

Not a blocker for 4.2.5, postponing to 4.2.6 for clarification.
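A sketch of the diskErrors()-based check mentioned above, assuming libvirt-python and a local qemu:///system connection. The helper name is illustrative (the VM name is taken from this report): even when the PAUSED/IOERROR state reason has been lost, a drive that still cannot complete I/O shows up with a non-NONE disk error code.

```python
# Sketch: detect lingering disk I/O errors even when the pause reason is lost.
# The helper name is made up; assumes libvirt-python and qemu:///system.
import libvirt


def vm_has_disk_errors(dom):
    """Return True if any drive of the domain still reports an I/O error."""
    # diskErrors() maps disk targets to error codes,
    # e.g. {'vda': libvirt.VIR_DOMAIN_DISK_ERROR_UNSPEC}.
    errors = dom.diskErrors()
    return any(code != libvirt.VIR_DOMAIN_DISK_ERROR_NONE
               for code in errors.values())


conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('golden_env_mixed_virtio_2_0')  # VM name from this report
if dom.state()[0] == libvirt.VIR_DOMAIN_PAUSED and vm_has_disk_errors(dom):
    print('paused VM still has disk errors; treat as an EIO pause')
```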
Created attachment 1472078 [details]
engine_qemu_libvirt_vdsm.tar.gz
Adding logs for version rhv-release-4.2.5-5-001.noarch. Please see from line 2416 in engine.log:

```
2018-08-01 15:37:47,713+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-8) [] EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_2_0 has been paused due to storage I/O problem.
```
The order of events in QEMU looks fixed, but now there seems to be a problem with libvirt events. According to the Vdsm logs, libvirt sends a spurious RESUME event after all the I/O error events. The libvirt log doesn't show such an event coming from QEMU, so it looks like libvirt either created a fake event or sent an old RESUME event at that time. libvirt doesn't log sufficient information about the events sent to Vdsm (or I don't decipher it correctly) and the Vdsm log is missing virt debug logs, so I can't be 100% sure, but from the apparent matches the observation above is likely to be correct.

Polina, in order to create an accurate bug report on libvirt, I need the following information:

- Is it still 100% reproducible on your setup?
- Please describe as accurately as possible the steps you take when reproducing the bug and what you observe, preferably with corresponding times (I think this is a different bug than the one originally identified, so some observed details may have changed).
- Provide Engine, Vdsm, libvirt and QEMU logs as above, but with logger_virt (and all common loggers) level set to DEBUG (please always use DEBUG level in all the common loggers when providing Vdsm logs!).
- Please confirm the Vdsm, libvirt, QEMU, and OS versions used in the corresponding run.

---

Created attachment 1473381 [details]
engine, vdsm, libvirt, qemu logs in the link

Hi, please see the description of the steps below and all the logs attached in resume_bug.log.tar.gz under the link https://drive.google.com/drive/u/1/folders/100hYM2sBbp0ckaBONhskq9YDdfln0QlA?ogsrc=32

All loggers in /etc/vdsm/logger.conf are in DEBUG mode, including logger_virt:

```
[logger_virt]
level=DEBUG
handlers=logthread
qualname=virt
propagate=0
```

1. Edit VM golden_env_mixed_virtio_2_0, associated with an iSCSI disk, to be an HA VM with storage_error_resume_behaviour=auto_resume. Run the VM on host_mixed_1 (line 65030 in the attached engine.log):

```
2018-08-05 09:57:07,353+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-100) [19493bac] EVENT_ID: USER_RUN_VM(32), VM golden_env_mixed_virtio_2_0 started on Host host_mixed_1
```

2. On the host, block the storage with the command 'iptables -I INPUT -s xtremio-iscsi1.scl.lab.tlv.redhat.com -j DROP' and wait until the VM is paused due to a storage I/O error:

```
65054 2018-08-05 10:02:29,412+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-12) [] EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_2_0 has been paused due to storage I/O problem.
```

3. In the web UI, click Run to run the VM:

```
65062 2018-08-05 10:04:58,742+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-11166) [ea01058f-baaa-4412-8510-073f0309e790] EVENT_ID: USER_RESUME_VM(40), VM golden_env_mixed_virtio_2_0 was resumed by admin@internal-authz (Host: host_mixed_1).
```

4. On the host, unblock the storage with the command 'iptables -D INPUT 1':

```
65068 2018-08-05 10:06:09,074+03 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-11194) [] Domain '37a5d799-4bd2-4387-8fc7-6087aff4582a:iscsi_0' recovered from problem. vds: 'host_mixed_1'
```

Expected result: after unblocking the storage, the VM is resumed.
Actual result: the VM stays paused forever.
Versions:

rhv-release-4.2.5-6-001.noarch
libvirt-3.9.0-14.el7_5.6.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.5.x86_64
vdsm-client-4.20.35-1.el7ev.noarch

---

Thank you Polina, libvirt bug reported: BZ 1612943.

---

https://bugzilla.redhat.com/1612943 is apparently not going to be fixed before 4.2.7 is out, so moving to 4.2.8.

---

Works as expected. Verified on:

ovirt-engine-4.2.8.1-0.1.el7ev.noarch
libvirt-4.5.0-10.el7_6.3.x86_64
vdsm-4.20.45-1.el7ev.x86_64

```
2018-12-16 11:49:13,369+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-12) [] VM '92424156-ecd2-4505-a9de-925f91a258ee'(golden_env_mixed_virtio_2_0) moved from 'Up' --> 'Paused'
2018-12-16 11:49:13,578+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-12) [] EVENT_ID: VM_PAUSED(1,025), VM golden_env_mixed_virtio_2_0 has been paused.
2018-12-16 11:49:13,786+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-12) [] EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_2_0 has been paused due to storage I/O problem.
2018-12-16 11:50:13,741+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-362) [f35257b8-357d-421b-8e71-d32c1acf859c] EVENT_ID: USER_RESUME_VM(40), VM golden_env_mixed_virtio_2_0 was resumed by admin@internal-authz (Host: host_mixed_1).
2018-12-16 11:51:50,488+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-70) [] VM '92424156-ecd2-4505-a9de-925f91a258ee'(golden_env_mixed_virtio_2_0) moved from 'Paused' --> 'Up'
2018-12-16 11:51:50,724+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-70) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM golden_env_mixed_virtio_2_0 has recovered from paused back to up.
2018-12-16 11:51:50,891+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-70) [] EVENT_ID: USER_RUN_VM(32), VM golden_env_mixed_virtio_2_0 started on Host host_mixed_1
```

---

This bugzilla is included in the oVirt 4.2.8 release, published on January 22nd 2019. Since the problem described in this bug report should be resolved in oVirt 4.2.8, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.