Bug 1526025
Summary: VM configured with resume_behavior='AUTO_RESUME' is not resumed if it was previously manually attempted for resume

| Field | Value |
|---|---|
| Product | [oVirt] vdsm |
| Component | Core |
| Status | CLOSED CURRENTRELEASE |
| Severity | high |
| Priority | high |
| Version | 4.20.19 |
| Target Milestone | ovirt-4.2.8 |
| Fixed In Version | v4.20.45 |
| Hardware | x86_64 |
| OS | Linux |
| Reporter | Polina <pagranat> |
| Assignee | Milan Zamazal <mzamazal> |
| QA Contact | Polina <pagranat> |
| CC | bugs, michal.skrivanek, mtessun, mzamazal, pagranat, tjelinek |
| Flags | rule-engine: ovirt-4.2+; mtessun: planning_ack+; rule-engine: devel_ack+; rule-engine: testing_ack+ |
| Type | Bug |
| Doc Type | Bug Fix |
| oVirt Team | Virt |
| Last Closed | 2019-01-22 10:23:23 UTC |
| Bug Depends On | 1566153, 1569614, 1582122, 1612943, 1634758 |

Doc Text: When a VM with auto-resume behavior had been paused due to storage problems, it might not be resumed once the storage problems disappeared. This has been fixed, and resuming a VM from a paused state caused by I/O errors is now more robust.
Description (Polina, 2017-12-14 16:12:58 UTC)
Created attachment 1368078 [details]
See the scenario from line 3911 in engine.log.
*** Bug 1526024 has been marked as a duplicate of this bug. ***

What happens:

1. At 17:33 the storage went down and two VMs were paused:
   `2017-12-14 17:33:44,317+02: VM_PAUSED_EIO(145), VM 2 has been paused due to storage I/O problem.`
2. At 17:34 a manual resume was attempted on one VM (15a07e75-da5b-4803-9059-c021e7906d73):
   `2017-12-14 17:34:50,909+02 INFO [org.ovirt.engine.core.vdsbroker.ResumeVDSCommand]`
3. At 17:37 the storage was fixed, and the VM on which no manual resume had been attempted was auto-resumed (in vdsm.log):
   `Trying to resume VM 7e2277dc-deb9-417d-bc00-b125af5d1deb after EIO`

The other VM was not auto-resumed because its state was not VIR_DOMAIN_PAUSED_IOERROR. It seems the state was cleared when the manual resume was attempted.

---

I can't reproduce the bug, but I can see the following in the provided libvirtd.log, in this order:

```
2017-12-14 15:34:50.922+0000: 1495: info : qemuMonitorIOWrite:536 : QEMU_MONITOR_IO_WRITE: mon=0x7f11740262a0 buf={"execute":"cont","id":"libvirt-161"}
...
2017-12-14 15:34:50.924+0000: 1495: info : qemuMonitorIOProcess:431 : QEMU_MONITOR_IO_PROCESS: mon=0x7f11740262a0 buf={"timestamp": {"seconds": 1513265690, "microseconds": 924058}, "event": "BLOCK_IO_ERROR", "data": {"device": "drive-virtio-disk0", "nospace": false, "__com.redhat_reason": "eio", "node-name": "#block184", "reason": "Input/output error", "operation": "write", "action": "stop"}}
<... repeated many times ...>
...
2017-12-14 15:34:50.945+0000: 1495: info : qemuMonitorIOProcess:431 : QEMU_MONITOR_IO_PROCESS: mon=0x7f11740262a0 buf={"timestamp": {"seconds": 1513265690, "microseconds": 945339}, "event": "RESUME"}
...
2017-12-14 15:34:50.955+0000: 1495: info : qemuMonitorIOProcess:431 : QEMU_MONITOR_IO_PROCESS: mon=0x7f11740262a0 buf={"timestamp": {"seconds": 1513265690, "microseconds": 955191}, "event": "STOP"}
<... no more BLOCK_IO_ERROR ...>
```

So it looks like a problem of event ordering, either in QEMU or in libvirt's processing. (See the event-listener sketch below for one way to observe this ordering.)

---

(In reply to Polina from comment #0)

> Steps to Reproduce:
> 1. Have running HA VM with Resume Behavior = 'AutoResume'. Block the storage -> VM Paused since I/O error.

Which VM? Always make sure you specify which VM you're talking about.

> 2. while this state try to run the VM.

Try how? Please always specify exactly how you are trying to do things manually and, if possible, note down the exact time and/or the corresponding log line.

> As result sometimes the VM just stay

Sometimes? Please do not use vague terms; enumerate all the attempts, with details as per the previous comment.

> in the same I/O Pause. And sometimes the I/O error Pause is changed to Pause.

It should still be in EIO if the reason for the pause is still an I/O error. Note down (time, log line) when it changed to a pause without EIO, and please note what was different in that attempt (timing? did anything happen on the storage so that it actually may have resumed for a while? other actions on the system?).

> it doesn't affect the next behavior - the same in both cases.

Then the above wouldn't be significant? Do you have examples of both cases in the same run? If so, please note down the VM ids.

> In the attached engine.log please see the scenario from line 3911.
> 3. unblock the storage after while => the VM remains Paused instead of being Resumed.

"After a while" needs better qualification. Again, please correlate that to the logs and/or note the exact time when the connection was restored and how. What was the time of any other manual action in the GUI?

> Additional info: Trying to manually start the VM after this scenario brings firstly error: Launching VM Failed, and then it is started.

"After this scenario": what exactly did you try to do? With which VM? Where does the mentioned error happen? Again, please note down the time or log line. You say it's reproducible 100%, so please do reproduce and add the individual details. Otherwise: the testing we did didn't reveal any problem and all VMs resumed fine, so we would have to close this as WORKSFORME.
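A minimal sketch, assuming the libvirt-python bindings and a local qemu:///system connection: a standalone listener that prints lifecycle and I/O-error events in arrival order, which is one way to observe the BLOCK_IO_ERROR/RESUME/STOP ordering discussed above. This is not vdsm's actual event-handling code.

```python
# Sketch: print libvirt lifecycle and I/O-error events in arrival order.
# Assumes libvirt-python and a local qemu:///system connection; not vdsm code.
import threading

import libvirt


def lifecycle_cb(conn, dom, event, detail, opaque):
    # For VIR_DOMAIN_EVENT_SUSPENDED, `detail` carries the pause reason;
    # a late RESUME followed by a plain STOP is how the IOERROR detail
    # gets lost in the scenario discussed in this bug.
    print('lifecycle: vm=%s event=%d detail=%d' % (dom.name(), event, detail))


def io_error_cb(conn, dom, src_path, dev_alias, action, reason, opaque):
    # `reason` is a string such as 'eio' or 'enospc'.
    print('io error: vm=%s dev=%s action=%d reason=%s'
          % (dom.name(), dev_alias, action, reason))


libvirt.virEventRegisterDefaultImpl()
conn = libvirt.open('qemu:///system')
conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_LIFECYCLE,
                            lifecycle_cb, None)
conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_IO_ERROR_REASON,
                            io_error_cb, None)


def run_loop():
    while True:
        libvirt.virEventRunDefaultImpl()


loop_thread = threading.Thread(target=run_loop)
loop_thread.daemon = True
loop_thread.start()

threading.Event().wait()  # block forever; Ctrl-C to exit
```

Running something like this while blocking and unblocking the storage should make the event reordering visible without any oVirt machinery in between.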
Created attachment 1395839 [details]
engine&vdsm&vm_xml&screen_print
Hi, I reproduced the scenario and will try to describe it again. All logs are attached, including the VM XML and a UI screenshot.

1. Create an HA VM with a disk on an iSCSI storage domain, ResumeBehavior = 'AutoResume' (the XML from 'GET vms' is attached).
2. Block the iSCSI storage on the host => the VM goes into I/O error pause (line 24566 in engine.log).
3. While the VM is in I/O error pause, click Run in the Web UI. At this moment we can see what happens in engine.log:

```
24597: 2018-02-14 11:10:03,013+02 WARN [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-92836) [] vds 'host_mixed_2' reported domain '5c71f25a-20cf-40f7-9e74-6059218413fd:iscsi_0' as in problem, attempting to move the vds to status NonOperational
24603: 2018-02-14 11:10:03,244+02 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-92837) [57536105] EVENT_ID: VDS_SET_NONOPERATIONAL_DOMAIN(522), Host host_mixed_2 cannot access the Storage Domain(s) iscsi_0 attached to the Data Center golden_env_mixed. Setting Host state to Non-Operational.
```

At line 24605 the migrate command (initiated by the system) is running:

```
2018-02-14 11:10:04,789+02 INFO [org.ovirt.engine.core.bll.MigrateVmCommand] (EE-ManagedThreadFactory-engine-Thread-92838) [58d5a2cb] Running command: MigrateVmCommand internal: true. Entities affected : ID: 0cded25e-63ef-43ed-996c-9cfc1934d37a Type: VMAction group MIGRATE_VM with role type USER
```

The migration failed (line 24621), and at this moment the VM moved to Paused (not I/O error anymore; the status change is also seen in the UI).

4. Unblock the iSCSI storage (the host is activated, line 24663).

Result: the VM remains in Pause forever.

Created attachment 1395851 [details]
logs_second_reproduce.tar.gz
There is a second reproduction where the VM (name=golden_env_mixed_virtio_2_1) remains in I/O error pause (no migration attempts are seen in engine.log) and is not resumed after the storage is back. Logs are in logs_second_reproduce.tar.gz.

The same type of HA VM on an iSCSI SD, the same steps. While the VM is in I/O error pause (line 25773), click Run:

```
25785 2018-02-14 13:01:58,030+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-95513) [c2a5f4e3-e2c9-4b38-ae07-cb3eb9eceab2] EVENT_ID: USER_RESUME_VM(40), VM golden_env_mixed_virtio_2_1 was resumed by admin@internal-authz (Host: host_mixed_3).
```

When the storage is back, the host is activated:

```
25819 2018-02-14 13:05:29,382+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-95597) [] Domain '5c71f25a-20cf-40f7-9e74-6059218413fd:iscsi_0' recovered from problem. vds: 'host_mixed_3'
25820 2018-02-14 13:05:29,382+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-95597) [] Domain '5c71f25a-20cf-40f7-9e74-6059218413fd:iscsi_0' has recovered from problem. No active host in the DC is reporting it as problematic, so clearing the domain recovery timer.
```

But the VM remains in I/O error pause.

The problem is 100% reproducible. I just don't understand why in the first case the engine tries to migrate the VM, which changes the state from I/O error pause to plain Pause, while in the second case it doesn't and the VM remains in I/O error pause. In both cases we have VMs remaining in a Paused state and not auto-resumed after the storage is back.
Created attachment 1395853 [details]
logs_second_reproduce.tar.gz
Please see comments 7 & 8. If you still have a problem reproducing it, let's be in touch on IRC; I'll reproduce again and share my environment with you.

---

(In reply to Polina from comment #8)

Thanks for the updated logs.

> The problem is 100% reproducible. I just don't understand why in the first case engine tries to migrate the VM which causes the changing state from I/O error Pause to Pause, and in the second case it doesn't, and VM remain in I/O Pause. in both cases we have VMs remaining in Paused state and not auto-resumed after the storage is back

It does that because of the non-operational case, due to the storage domain monitoring reporting issues. But it shouldn't have anything to do with you clicking on Run or not if that VM was already Paused. Are you sure it was not just a coincidence that the SD monitoring happened to kick in when you clicked?

---

Seems like a timing issue due to the resume:

```
2018-02-14 11:07:51,276+0200 INFO (jsonrpc/7) [api.virt] START cont() from=::ffff:10.35.161.243,60322, flow_id=36f984b6-152d-4842-9454-6662b0934960 (api:46)
2018-02-14 11:07:51,287+0200 INFO (libvirt/events) [virt.vm] (vmId='0cded25e-63ef-43ed-996c-9cfc1934d37a') abnormal vm stop device virtio-disk0 error eio (vm:4934)
2018-02-14 11:07:51,302+0200 INFO (jsonrpc/7) [virt.vm] (vmId='0cded25e-63ef-43ed-996c-9cfc1934d37a') CPU running: continue (vm:5942)
2018-02-14 11:07:51,322+0200 INFO (jsonrpc/7) [api.virt] FINISH cont return={'status': {'message': 'Done', 'code': 0}} from=::ffff:10.35.161.243,60322, flow_id=36f984b6-152d-4842-9454-6662b0934960 (api:52)
2018-02-14 11:07:51,324+0200 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call VM.cont succeeded in 0.05 seconds (__init__:573)
2018-02-14 11:07:51,388+0200 INFO (libvirt/events) [virt.vm] (vmId='0cded25e-63ef-43ed-996c-9cfc1934d37a') CPU stopped: onIOError (vm:5942)
2018-02-14 11:07:51,408+0200 INFO (libvirt/events) [virt.vm] (vmId='0cded25e-63ef-43ed-996c-9cfc1934d37a') CPU running: onResume (vm:5942)
2018-02-14 11:07:51,411+0200 INFO (libvirt/events) [virt.vm] (vmId='0cded25e-63ef-43ed-996c-9cfc1934d37a') CPU running: onResume (vm:5942)
2018-02-14 11:07:51,414+0200 INFO (libvirt/events) [virt.vm] (vmId='0cded25e-63ef-43ed-996c-9cfc1934d37a') CPU stopped: onSuspend (vm:5942)
```

That's ~2 minutes before the migration was initiated. I wonder whether the VM was actually running or not, and whether the status in Vdsm and in Engine was accurate between 11:07:51 and 11:10:04.

---

Pausing iSCSI VMs on I/O errors is currently not working for me (the errors propagate to the guest instead), and NFS and gluster behave very differently, so I can't test the situation. But the logs here show what could be the problem. Polina, could you please test https://gerrit.ovirt.org/87766 on your setup, to see whether it fixes the problem? If not, please attach the corresponding Vdsm logs. Please make sure you have DEBUG logging enabled in /etc/vdsm/logger.conf for all loggers, especially logger_virt; the relevant stanza is shown below.
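For reference, the logger_virt stanza as Polina later quotes it in this thread (the exact file contents may vary per installation; this shows the shape of the configuration, not a verified excerpt from this particular setup):

```ini
# /etc/vdsm/logger.conf (excerpt): enable DEBUG for the virt logger
[logger_virt]
level=DEBUG
handlers=logthread
qualname=virt
propagate=0
```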
Created attachment 1397575 [details]
vm.py
Hi Milan, I tried to apply your patch:

```python
if detail == libvirt.VIR_DOMAIN_EVENT_SUSPENDED_IOERROR:
    try:
        self._pause_code = self._readPauseCode()
    except libvirt.libvirtError as e:
        self.log.warning(
            "Couldn't retrieve pause code from libvirt: %s", e)
```

to my setup, but because I work with downstream (rhv-release-4.2.2-1-001.noarch) the code is very different, and I don't see where to insert it. I attach my vm.py.

I attach engine.log and vdsm.log with DEBUG enabled for all loggers. The steps:

1. Block the storage (the VM is paused with an I/O error):

```
2018-02-18 16:27:15,531+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-45) [] EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_2_0 has been paused due to storage I/O problem.
```

2. Click on Run:

```
2018-02-18 16:27:44,693+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-162) [71975762-a83f-448e-a572-3428e8de921a] EVENT_ID: USER_RESUME_VM(40), VM golden_env_mixed_virtio_2_0 was resumed by admin@internal-authz (Host: host_mixed_1).
```

3. Unblock the storage:

```
2018-02-18 16:28:36,360+02 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-181) [] Domain '5c71f25a-20cf-40f7-9e74-6059218413fd:iscsi_0' recovered from problem. vds: 'host_mixed_1'
```
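For context on the fragment quoted above, a minimal sketch of the idea behind the patch, assuming a vdsm-style lifecycle handler. The class shape, `_dom`, `log`, and the 'EIO'/'NOERR' pause-code strings here are illustrative stand-ins, not vdsm's exact 4.2.2 code: on a suspend event carrying the IOERROR detail, the pause reason is re-read from libvirt instead of being trusted from the (possibly reordered) event stream alone.

```python
# Sketch of the patch's idea; names and structure are illustrative, not vdsm's.
import libvirt


class Vm(object):

    def __init__(self, dom, log):
        self._dom = dom          # a libvirt.virDomain
        self.log = log           # a logging.Logger
        self._pause_code = None

    def _readPauseCode(self):
        # Re-query libvirt for the current (state, reason) pair instead of
        # relying on events that may arrive out of order.
        state, reason = self._dom.state()
        if (state == libvirt.VIR_DOMAIN_PAUSED and
                reason == libvirt.VIR_DOMAIN_PAUSED_IOERROR):
            return 'EIO'
        return 'NOERR'

    def _onSuspend(self, detail):
        # The quoted patch: only for IOERROR suspends, refresh the pause code.
        if detail == libvirt.VIR_DOMAIN_EVENT_SUSPENDED_IOERROR:
            try:
                self._pause_code = self._readPauseCode()
            except libvirt.libvirtError as e:
                self.log.warning(
                    "Couldn't retrieve pause code from libvirt: %s", e)
```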
Created attachment 1397590 [details]
engine.log & vdsm_debug.log
Created attachment 1397820 [details]
patched vm.py from 4.2.2
Hi Polina, attaching the patched vm.py you can use for testing.
Moving back to POST, since we don't know yet whether the patches are enough to fix the problem.

---

Polina, could you please check whether the problem is still present in 4.2.2 RC3?

---

The problem still happens on rhv-release-4.2.2-3-001.noarch with the attached vm.py. The vdsm & engine logs from the last run are attached.

Created attachment 1404236 [details]
vdsm and engine logs
I'm sorry, I messed up the RC version. Could you please recheck once again, with 4.2.2-4?

---

(In reply to Milan Zamazal from comment #23)
> I'm sorry, I messed up the RC version. Could you please recheck once again, with 4.2.2-4?

In 4.2.2-4 the behavior has changed for HA VMs with a lease and for HA VMs without a lease configured with Resume Behavior=KILL: in those cases the VM is restarted on another host. The problem still happens with HA VMs without a lease that are configured with Resume Behavior=AUTO_RESUME. In that case the VM remains forever in the I/O Error Paused state, even though I have three hosts in the environment in state Up. If its host moves to the Non-Operational state, the engine tries to migrate the VM, but the attempt fails and the VM remains in the I/O error paused state.

---

Thank you Polina for testing. I can't reproduce the problem with AUTO_RESUME, so could you please provide vdsm.log (with DEBUG enabled) from the failed attempt?

Created attachment 1405190 [details]
engine_vdsm_v4.2.2-4.tar.gz
(In reply to Milan Zamazal from comment #25)
> Thank you Polina for testing. I can't reproduce the problem with AUTO_RESUME, so could you please provide vdsm.log (with DEBUG enabled) from the failed attempt?

Hi Milan, please see the attached engine_vdsm_v4.2.2-4.tar.gz. In engine.log the scenario starts from this line:

```
2018-03-07 10:07:41,156+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-3) [] EVENT_ID: USER_RUN_VM(32), VM golden_env_mixed_virtio_2_0 started on Host host_mixed_3
```

The HA VM with the default AUTO_RESUME configuration remains forever in the Paused state after the storage is unblocked (while we have three hosts in the Up state).

---

Thank you Polina for the log, but it's missing DEBUG messages. Could you please provide a log of the failure with DEBUG messages? Also, what version of libvirt is installed on the system?

---

```
rpm -qa | grep libvirt
libvirt-daemon-driver-qemu-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-3.9.0-13.el7.x86_64
libvirt-libs-3.9.0-13.el7.x86_64
libvirt-daemon-driver-nodedev-3.9.0-13.el7.x86_64
libvirt-lock-sanlock-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-scsi-3.9.0-13.el7.x86_64
libvirt-daemon-3.9.0-13.el7.x86_64
libvirt-daemon-driver-secret-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-mpath-3.9.0-13.el7.x86_64
libvirt-daemon-config-nwfilter-3.9.0-13.el7.x86_64
libvirt-daemon-driver-interface-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-core-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-rbd-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-gluster-3.9.0-13.el7.x86_64
libvirt-3.9.0-13.el7.x86_64
libvirt-client-3.9.0-13.el7.x86_64
libvirt-daemon-config-network-3.9.0-13.el7.x86_64
libvirt-daemon-driver-lxc-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-disk-3.9.0-13.el7.x86_64
libvirt-daemon-driver-network-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-iscsi-3.9.0-13.el7.x86_64
libvirt-daemon-kvm-3.9.0-13.el7.x86_64
libvirt-daemon-driver-nwfilter-3.9.0-13.el7.x86_64
libvirt-daemon-driver-storage-logical-3.9.0-13.el7.x86_64
libvirt-python-3.9.0-1.el7.x86_64
```

---

OK, I've got the same version. I still need your log with DEBUG messages.

---

Hi Milan, see the attached debug logs in logs_debug.tar. In vdsm.log start looking from 2018-03-10 13:38:28,776+0200; in engine.log from:

```
2018-03-10 13:38:50,735+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-10) [] EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_2_0 has been paused due to storage I/O problem.
```

The unblock command (iptables -D INPUT 1) was run on the host at Sat Mar 10 13:41:14 IST 2018. The VM golden_env_mixed_virtio_2_0 (HA with AUTO_RESUME) remains Paused forever.

Created attachment 1406589 [details]
logs_debug.tar
Thank you Polina for the log. Now I can see the problem: the libvirt suspend event reports VIR_DOMAIN_EVENT_SUSPENDED_PAUSED rather than VIR_DOMAIN_EVENT_SUSPENDED_IOERROR (it reports the latter on my installation, as expected). Polina, do you have a corresponding libvirt debug log?

---

Polina, could you please also check, in the failing situation, whether the VM starts being reported as paused on I/O error, sooner or later, after you restart Vdsm on the given host?

---

Milan, is this on track for 4.2.2? If not, please defer.

---

Since this bug is not even fully understood yet, deferring it.

---

Polina, could you please answer my comment 33 and comment 34?

Created attachment 1417095 [details]
libvirt, vdsm, qemu, engine log
(In reply to Milan Zamazal from comment #33)
> Thank you Polina for the log. Now I can see the problem is that libvirt suspend event reports VIR_DOMAIN_EVENT_SUSPENDED_PAUSED rather than VIR_DOMAIN_EVENT_SUSPENDED_IOERROR (it reports the latter on my installation, as expected).
> Polina, do you have a corresponding libvirt debug log?

Hi, the required logs are attached. In the attached tar, please see the scenario from line 866 in engine.log:

```
866 2018-04-04 10:33:37,323+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-30) [] EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_2_0 has been paused due to storage I/O problem.
```

(In reply to Milan Zamazal from comment #34)
> Polina, could you please also check, in the failing situation, whether the VM will start being reported as paused on I/O error, sooner or later, after you restart Vdsm on the given host?

Hi, I restarted vdsmd on the host where the VM is paused. The VM remained paused. The scenario is always reproducible. If it could help, we can synchronize on IRC and I'll share my environment for you to look into the problem.

---

Thank you Polina for the logs. The weird order of events is generated in QEMU, perhaps due to locking, depending on timing. I guess the EIO info is lost on a resume reported between the error events and the stop event. I further guess that EIO is also missing in the VM state info for the same reason, but let's check. This should be fixed at the QEMU/libvirt level; I'm not sure whether we can make a reasonable workaround in Vdsm in the meantime.

---

I checked that the IOERROR pause code is ultimately lost after resume-pause and filed a libvirt bug (https://bugzilla.redhat.com/1566153). I also checked that the diskErrors() libvirt call still returns libvirt.VIR_DOMAIN_DISK_ERROR_UNSPEC on the given drive, which might possibly be used as a workaround in case a libvirt/QEMU fix is unavailable. (A sketch of such a check appears below.)

---

Seems ready now. Can you please update the spec for qemu-kvm-rhev-2.10.0-21.el7_5.4?

---

Patch posted. The required QEMU version is updated in the referred patch.

---

Hi, the libvirt bug (https://bugzilla.redhat.com/1566153) is verified, yet the scenario is still reproducible in exactly the same way. Tested on:

rhv-release-4.2.5-5-001.noarch
vdsm-4.20.35-1.el7ev.x86_64
libvirt-3.9.0-14.el7_5.6.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64

Please let me know if you want to see it reproduced in the environment; I'll run the scenario. It is 100% reproducible.

---

Polina, could you please provide Vdsm, Engine and libvirt logs from the current run?

---

Not a blocker for 4.2.5, postponing to 4.2.6 for clarification.
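A sketch of the diskErrors()-based check mentioned above, assuming libvirt-python and a local qemu:///system connection. The helper name is illustrative (the VM name is taken from this report): even when the PAUSED/IOERROR state reason has been lost, a drive that still cannot complete I/O shows up with a non-NONE disk error code.

```python
# Sketch: detect lingering disk I/O errors even when the pause reason is lost.
# The helper name is made up; assumes libvirt-python and qemu:///system.
import libvirt


def vm_has_disk_errors(dom):
    """Return True if any drive of the domain still reports an I/O error."""
    # diskErrors() maps disk targets to error codes,
    # e.g. {'vda': libvirt.VIR_DOMAIN_DISK_ERROR_UNSPEC}.
    errors = dom.diskErrors()
    return any(code != libvirt.VIR_DOMAIN_DISK_ERROR_NONE
               for code in errors.values())


conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('golden_env_mixed_virtio_2_0')  # VM name from this report
if dom.state()[0] == libvirt.VIR_DOMAIN_PAUSED and vm_has_disk_errors(dom):
    print('paused VM still has disk errors; treat as an EIO pause')
```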
Created attachment 1472078 [details]
engine_qemu_libvirt_vdsm.tar.gz
Adding logs for version rhv-release-4.2.5-5-001.noarch. Please see from line 2416 in engine.log:

```
2018-08-01 15:37:47,713+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-8) [] EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_2_0 has been paused due to storage I/O problem.
```
The order of events in QEMU looks fixed, but now there seems to be a problem with libvirt events. According to the Vdsm logs, libvirt sends a spurious RESUME event after all the I/O error events. The libvirt log doesn't show such an event coming from QEMU, so it looks like libvirt either created a fake event or sent an old RESUME event at that time. libvirt doesn't log sufficient information about the events sent to Vdsm (or I don't decipher it correctly) and the Vdsm log is missing virt debug logs, so I can't be 100% sure, but from the apparent matches the observation above is likely to be correct.

Polina, in order to create an accurate bug report on libvirt, I need the following information:

- Is it still 100% reproducible on your setup?
- Please describe as accurately as possible the steps you take when reproducing the bug and what you observe, preferably with corresponding times (I think this is a different bug than the one originally identified, so some observed details may have changed).
- Provide Engine, Vdsm, libvirt and QEMU logs as above, but with logger_virt (and all common loggers) level set to DEBUG (please always use DEBUG level in all the common loggers when providing Vdsm logs!).
- Please confirm the Vdsm, libvirt, QEMU, and OS versions used in the corresponding run.

---

Created attachment 1473381 [details]
engine, vdsm, libvirt, qemu logs in the link

Hi, please see the description of the steps below and all the logs attached in resume_bug.log.tar.gz under the link https://drive.google.com/drive/u/1/folders/100hYM2sBbp0ckaBONhskq9YDdfln0QlA?ogsrc=32

All loggers in /etc/vdsm/logger.conf are in DEBUG mode, including logger_virt:

```
[logger_virt]
level=DEBUG
handlers=logthread
qualname=virt
propagate=0
```

1. Edit VM golden_env_mixed_virtio_2_0, associated with an iSCSI disk, to be an HA VM with storage_error_resume_behaviour=auto_resume. Run the VM on host_mixed_1 (line 65030 in the attached engine.log):

```
2018-08-05 09:57:07,353+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-100) [19493bac] EVENT_ID: USER_RUN_VM(32), VM golden_env_mixed_virtio_2_0 started on Host host_mixed_1
```

2. On the host, block the storage with the command 'iptables -I INPUT -s xtremio-iscsi1.scl.lab.tlv.redhat.com -j DROP' and wait until the VM is paused due to a storage I/O error:

```
65054 2018-08-05 10:02:29,412+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-12) [] EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_2_0 has been paused due to storage I/O problem.
```

3. In the web UI, click Run to run the VM:

```
65062 2018-08-05 10:04:58,742+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-11166) [ea01058f-baaa-4412-8510-073f0309e790] EVENT_ID: USER_RESUME_VM(40), VM golden_env_mixed_virtio_2_0 was resumed by admin@internal-authz (Host: host_mixed_1).
```

4. On the host, unblock the storage with the command 'iptables -D INPUT 1':

```
65068 2018-08-05 10:06:09,074+03 INFO [org.ovirt.engine.core.vdsbroker.irsbroker.IrsProxy] (EE-ManagedThreadFactory-engine-Thread-11194) [] Domain '37a5d799-4bd2-4387-8fc7-6087aff4582a:iscsi_0' recovered from problem. vds: 'host_mixed_1'
```

Expected result: after unblocking the storage, the VM is resumed.
Actual result: the VM stays paused forever.
Versions:

rhv-release-4.2.5-6-001.noarch
libvirt-3.9.0-14.el7_5.6.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.5.x86_64
vdsm-client-4.20.35-1.el7ev.noarch

---

Thank you Polina, libvirt bug reported: BZ 1612943.

---

https://bugzilla.redhat.com/1612943 is apparently not going to be fixed before 4.2.7 is out, so moving to 4.2.8.

---

Works as expected. Verified on:

ovirt-engine-4.2.8.1-0.1.el7ev.noarch
libvirt-4.5.0-10.el7_6.3.x86_64
vdsm-4.20.45-1.el7ev.x86_64

```
2018-12-16 11:49:13,369+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-12) [] VM '92424156-ecd2-4505-a9de-925f91a258ee'(golden_env_mixed_virtio_2_0) moved from 'Up' --> 'Paused'
2018-12-16 11:49:13,578+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-12) [] EVENT_ID: VM_PAUSED(1,025), VM golden_env_mixed_virtio_2_0 has been paused.
2018-12-16 11:49:13,786+02 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-12) [] EVENT_ID: VM_PAUSED_EIO(145), VM golden_env_mixed_virtio_2_0 has been paused due to storage I/O problem.
2018-12-16 11:50:13,741+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-362) [f35257b8-357d-421b-8e71-d32c1acf859c] EVENT_ID: USER_RESUME_VM(40), VM golden_env_mixed_virtio_2_0 was resumed by admin@internal-authz (Host: host_mixed_1).
2018-12-16 11:51:50,488+02 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (EE-ManagedThreadFactory-engineScheduled-Thread-70) [] VM '92424156-ecd2-4505-a9de-925f91a258ee'(golden_env_mixed_virtio_2_0) moved from 'Paused' --> 'Up'
2018-12-16 11:51:50,724+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-70) [] EVENT_ID: VM_RECOVERED_FROM_PAUSE_ERROR(196), VM golden_env_mixed_virtio_2_0 has recovered from paused back to up.
2018-12-16 11:51:50,891+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engineScheduled-Thread-70) [] EVENT_ID: USER_RUN_VM(32), VM golden_env_mixed_virtio_2_0 started on Host host_mixed_1
```

---

This bugzilla is included in the oVirt 4.2.8 release, published on January 22nd 2019. Since the problem described in this bug report should be resolved in oVirt 4.2.8, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.