Bug 1601212
| Summary: | qemu coredumps on block-commit | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 7 | Reporter: | Elad <ebenahar> |
| Component: | qemu-kvm-rhev | Assignee: | Kevin Wolf <kwolf> |
| Status: | CLOSED ERRATA | QA Contact: | Gu Nini <ngu> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | 7.5 | CC: | aefrat, areis, bugs, chayang, coli, dfediuck, ebenahar, famz, fsun, fweimer, gveitmic, gwatson, juzhang, kwolf, michen, mtessun, ngu, pkrempa, qcheng, ratamir, slopezpa, tburke, tnisan, virt-maint, vkuznets |
| Target Milestone: | pre-dev-freeze | Keywords: | Automation, Regression, ZStream |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | qemu-kvm-rhev-2.12.0-16.el7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| Cloned to: | 1631161, 1637976 (view as bug list) | Environment: | |
| Last Closed: | 2018-11-01 11:13:00 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | Storage | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1583424, 1609137, 1610929, 1631161, 1637976 | | |
| Attachments: | | | |
Some clarifications:
- The bug reproduces only when there is a file system on the VM disks with some content written to files on it. So in the steps to reproduce, step 3, create a file system on the disks after attaching them to the VM.
- Correction: in the steps to reproduce, step 7, snapshot 1 has to be deleted instead of snapshot 2.

Using the following steps I wasn't able to reproduce:
1. Create a VM and install RHEL 7.5 (using a downloaded ISO)
2. Create a template
3. Create a VM from the template
4. Create 4 disks on block storage
5. Create a file system on all disks
6. Write data to all disks
7. Create a snapshot (s1)
8. Write data to all disks
9. Create a snapshot (s2)
10. Write data to all disks
11. Create a snapshot (s3)
12. Write data to all disks
13. Live merge s2
==> Works as expected. I proceeded and successfully live merged the other two snapshots, s1 and then s3.

Managed to reproduce with the following steps:
1. Create a VM with a 1 GB preallocated disk
2. Create a template from the VM in step 1
3. Create a new VM [WORK_VM] from the template in step 2
4. Add 2 preallocated 1 GB disks to WORK_VM
5. Create 3 snapshots of WORK_VM [snap1, snap2, snap3]
6. Run WORK_VM on an !!HSM host!!
7. Remove snapshot snap2
Reproduces about 90% of the time.

Elad, can you please attach libvirt.log and the qemu log?

Created attachment 1473618 [details]
logs with libvirtd.log

vdsm.log:
2018-08-06 15:32:39,946+0300 ERROR (jsonrpc/1) [virt.vm] (vmId='d7c105cf-06bd-4d79-a21d-3ef09ac94fd4') Live merge failed (job: 09de6148-74e6-4451-957d-f2dff045edfd) (vm:5978)
Traceback (most recent call last):
File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5976, in merge
bandwidth, flags)
File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
ret = attr(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
ret = f(*args, **kwargs)
File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
return func(inst, *args, **kwargs)
File "/usr/lib64/python2.7/site-packages/libvirt.py", line 708, in blockCommit
if ret == -1: raise libvirtError ('virDomainBlockCommit() failed', dom=self)
libvirtError: Unable to read from monitor: Connection reset by peer
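
For reference, a minimal sketch of the libvirt Python call that fails in the traceback above. The domain name and volume paths are illustrative placeholders, not values from this bug; the flags value matches the flags=8 (VIR_DOMAIN_BLOCK_COMMIT_RELATIVE) seen in the vdsm merge log in the description below.

```python
# Sketch of the virDomainBlockCommit() call vdsm issues; names and paths
# are illustrative placeholders.
import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("WORK_VM")  # hypothetical domain name

# Merge the 'top' volume down into 'base' for disk vdc. libvirt translates
# this into the QMP 'block-commit' command quoted in the analysis below.
dom.blockCommit(
    "vdc",                   # disk to commit
    "/path/to/base-volume",  # illustrative base volume path
    "/path/to/top-volume",   # illustrative top volume path
    bandwidth=0,
    flags=libvirt.VIR_DOMAIN_BLOCK_COMMIT_RELATIVE,  # flags=8 in vdsm.log
)
```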
libvirtd.log
Peter, I suspect that there is a libvirt bug that causes this error. Can you please take a look at the libvirtd.log that Elad just added?

Elad, does this bug occur also in 4.1? Can you please check what was the last libvirt version with which the flow worked properly?

I've looked at the logs and found the following: libvirtd was instructed to do a block-commit operation. We've sent the following command to qemu:

2018-08-06 12:09:38.887+0000: 6674: info : qemuMonitorSend:1076 : QEMU_MONITOR_SEND_MSG: mon=0x7f86d8027cc0 msg={"execute":"block-commit","arguments":{"device":"drive-ua-b3fae703-ab91-4364-9efa-4240e7977835","top":"/rhev/data-center/mnt/blockSD/ca5552b7-305a-4667-9688-a42c180806ff/images/b3fae703-ab91-4364-9efa-4240e7977835/e6d99e41-3c8a-4f84-8e50-225c4e15e1b8","base":"/rhev/data-center/mnt/blockSD/ca5552b7-305a-4667-9688-a42c180806ff/images/b3fae703-ab91-4364-9efa-4240e7977835/12f6a3a3-b7bc-41c8-aa1f-2741348c928d","backing-file":"12f6a3a3-b7bc-41c8-aa1f-2741348c928d"},"id":"libvirt-153"}^M

... and never received a reply. All subsequent APIs then got errors, as there was an active API call. Note that the block-commit operation is asynchronous once the command is acknowledged by qemu (which it was not):

2018-08-06 12:10:10.248+0000: 6675: warning : qemuDomainObjBeginJobInternal:4709 : Cannot start job (modify, none) for domain vm_TestCase6052_0614450184; current job is (modify, none) owned by (6674 remoteDispatchDomainBlockCommit, 0 <null>) for (31s, 0s)
[....]
2018-08-06 12:17:18.055+0000: 6675: warning : qemuDomainObjBeginJobInternal:4709 : Cannot start job (query, none) for domain vm_TestCase6052_0614450184; current job is (modify, none) owned by (6674 remoteDispatchDomainBlockCommit, 0 <null>) for (459s, 0s)

So the process was stuck for ~20 minutes before being killed:

2018-08-06 12:32:27.858+0000: 6675: error : virProcessKillPainfully:401 : Failed to terminate process 723 with SIGTERM: Device or resource busy

We then killed the process with SIGKILL, which was successful. The following reply is a result of terminating the qemu process; closing the monitor finalizes the existing block-commit API call with an error:

2018-08-06 12:32:38.011+0000: 6667: error : qemuMonitorIORead:610 : Unable to read from monitor: Connection reset by peer

The VM log file 'vm_TestCase6052_0614450184.log' contains the following entry from libvirtd:

2018-08-06 12:32:38.329+0000: shutting down, reason=destroyed

This means that the process was terminated by the virDomainDestroy API on the user's request. I suspect that the reason qemu did not respond lies in the storage, but with this data I can't provide any more insight. The VM was then terminated on request, and the reported error message is expected in this scenario since the VM got stuck. For further debugging, it would be beneficial to collect a backtrace of all threads of the qemu process that got stuck during the block commit, and to move this issue to the qemu team to investigate.
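
As a sketch of the backtrace collection suggested above: assuming gdb is available on the host, something along these lines could capture per-thread backtraces of the stuck QEMU process. The helper is hypothetical; 723 is the PID from the log above.

```python
# Hypothetical helper: attach gdb in batch mode to a stuck QEMU process,
# dump a backtrace of every thread, then detach.
import subprocess

def qemu_thread_backtraces(pid):
    out = subprocess.check_output(
        ["gdb", "-p", str(pid), "-batch", "-ex", "thread apply all bt"],
        stderr=subprocess.STDOUT,
    )
    return out.decode("utf-8", "replace")

# Example: print(qemu_thread_backtraces(723))
```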
Ademar, can you please take a look?

(In reply to Eyal Shenitzky from comment #9)
> Ademar,
> Can you please take a look?

The current information doesn't ring any bells. Please clone this BZ or transfer it to the QEMU component so it can be prioritized. Thanks.

*** Bug 1613834 has been marked as a duplicate of this bug. ***
*** Bug 1610929 has been marked as a duplicate of this bug. ***
*** Bug 1607130 has been marked as a duplicate of this bug. ***

Upstream patches by Kevin: https://lists.gnu.org/archive/html/qemu-block/2018-09/msg00160.html

Fix included in qemu-kvm-rhev-2.12.0-16.el7

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3443

Vitaly, I opened this bug to track the migration issue on rhel7. Could you help close this bug as WONTFIX as we talked in email?

(In reply to Qinghua Cheng from comment #42)
> Vitaly, I opened this bug to track the migration issue on rhel7. Could you
> help close this bug as WONTFIX as we talked in email?

Please ignore comment 42, sorry, wrong bug.
Created attachment 1458963 [details]
logs

Description of problem:
Live merge fails on vdsm

Version-Release number of selected component (if applicable):
vdsm-4.20.34-1.el7ev.x86_64
libvirt-3.9.0-14.el7_5.6.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64
kernel 3.10.0-862.6.3.el7.x86_64
rhvm-4.2.5.1-0.1.el7ev.noarch

How reproducible:
Always, only on block-based domains

Steps to Reproduce:
1. Create a VM from a template as a thin copy with a disk on an iSCSI domain
2. Start the VM
3. Create and attach to the VM 4 disks on iSCSI
4. Stop the VM, create a snapshot (snap0) with all disks and configuration, and start the VM
5. Stop the VM, create a snapshot (snap1) with all disks and configuration, and start the VM
6. Stop the VM, create a snapshot (snap2) with all disks and configuration, and start the VM
7. Delete snapshot snap2

Actual results:
Live merge fails in vdsm:

2018-07-13 04:08:12,086+0300 INFO (jsonrpc/6) [virt.vm] (vmId='dd9bb2fa-5c29-45db-842f-240950242ff8') Starting merge with jobUUID='236a891a-d4e9-4ba7-b04e-bf79c1872d2a', original chain=169c1684-2b1a-449d-95c9-dac50ca4ddea < 6795ce45-acdc-4fea-8a98-b9ede7327da0 < 6d8c0aaa-845a-4304-b589-04703addb151 < 56caa3a2-de7c-4fed-91c1-85ee529f57ad (top), disk='vdc', base='vdc[2]', top='vdc[1]', bandwidth=0, flags=8 (vm:5972)
2018-07-13 04:08:12,086+0300 DEBUG (jsonrpc/0) [storage.TaskManager.Task] (Task='0fec0b65-7906-46d9-8d76-cfaef6570f81') ref 0 aborting False (task:1002)
2018-07-13 04:08:12,440+0300 INFO (jsonrpc/7) [api.virt] START merge(drive={'imageID': '12daeed0-ff05-4597-a95c-14bf7555d6a7', 'volumeID': '6b13e352-1e60-4bbc-a8e7-655c7876c63f', 'domainID': '838925c0-5f93-4c2c-9193-8adebf695962', 'poolID': '2056646c-85f3-11e8-b0a8-001a4a168bfc'}, baseVolUUID='2e7d3507-ae3a-4e2b-87b1-2a6322882f4f', topVolUUID='3a9af59b-f768-4abe-9b57-02a08de404bb', bandwidth='0', jobUUID='1f3eb4d2-537e-4766-a3c1-6dc0d2e65624') from=::ffff:10.46.16.252,36174, flow_id=snapshots_delete_fbfba248-02a4-48de, vmId=dd9bb2fa-5c29-45db-842f-240950242ff8 (api:46)
2018-07-13 04:08:12,921+0300 ERROR (jsonrpc/3) [virt.vm] (vmId='dd9bb2fa-5c29-45db-842f-240950242ff8') Error getting block job info (vm:5838)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5836, in queryBlockJobs
    liveInfo = self._dom.blockJobInfo(drive.name, 0)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 783, in blockJobInfo
    if ret is None: raise libvirtError ('virDomainGetBlockJobInfo() failed', dom=self)
libvirtError: Unable to read from monitor: Connection reset by peer
2018-07-13 04:08:12,921+0300 INFO (libvirt/events) [virt.vm] (vmId='dd9bb2fa-5c29-45db-842f-240950242ff8') underlying process disconnected (vm:1056)

engine.log:

2018-07-13 04:08:13,496+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-4) [snapshots_delete_fbfba248-02a4-48de] Failed in 'MergeVDS' method
2018-07-13 04:08:13,502+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-commandCoordinator-Thread-4) [snapshots_delete_fbfba248-02a4-48de] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command MergeVDS failed: General Exception: ('Unable to read from monitor: Connection reset by peer',)
2018-07-13 04:08:13,502+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-4) [snapshots_delete_fbfba248-02a4-48de] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand' return value 'StatusOnlyReturn [status=Status [code=100, message=General Exception: ('Unable to read from monitor: Connection reset by peer',)]]'
2018-07-13 04:08:13,502+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-4) [snapshots_delete_fbfba248-02a4-48de] HostName = host_mixed_2
2018-07-13 04:08:13,503+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-4) [snapshots_delete_fbfba248-02a4-48de] Command 'MergeVDSCommand(HostName = host_mixed_2, MergeVDSCommandParameters:{hostId='f72ef032-f9d3-4e4c-ae85-0ff6803d11ae', vmId='dd9bb2fa-5c29-45db-842f-240950242ff8', storagePoolId='2056646c-85f3-11e8-b0a8-001a4a168bfc', storageDomainId='838925c0-5f93-4c2c-9193-8adebf695962', imageGroupId='0141d39e-b7b7-47ba-b557-14a5be96ebbd', imageId='56caa3a2-de7c-4fed-91c1-85ee529f57ad', baseImageId='6795ce45-acdc-4fea-8a98-b9ede7327da0', topImageId='6d8c0aaa-845a-4304-b589-04703addb151', bandwidth='0'})' execution failed: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = General Exception: ('Unable to read from monitor: Connection reset by peer',), code = 100
2018-07-13 04:08:13,503+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-4) [snapshots_delete_fbfba248-02a4-48de] FINISH, MergeVDSCommand, log id: 27f45c42
2018-07-13 04:08:13,503+03 ERROR [org.ovirt.engine.core.bll.MergeCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-4) [snapshots_delete_fbfba248-02a4-48de] Engine exception thrown while sending merge command: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = General Exception: ('Unable to read from monitor: Connection reset by peer',), code = 100 (Failed with error GeneralException and code 100)
    at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:118) [bll.jar:]
    at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:2045) [bll.jar:]
    at org.ovirt.engine.core.bll.MergeCommand.executeCommand(MergeCommand.java:45) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1133) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1285) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1934) [bll.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:202) [utils.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInRequired(TransactionSupport.java:137) [utils.jar:]
    at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:105) [utils.jar:]
    at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1345) [bll.jar:]
    at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:400) [bll.jar:]

Expected results:
Live merge should succeed

Additional info:
Marking as a regression as this is a tier 1 test case that passed up until this build.
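
For completeness, a minimal sketch of the blockJobInfo() poll that raises the error in the vdsm traceback above, again via the libvirt Python bindings; the domain and disk names are illustrative placeholders.

```python
# Sketch of the block job poll vdsm performs in queryBlockJobs(); names
# are illustrative placeholders.
import libvirt

conn = libvirt.open("qemu:///system")
dom = conn.lookupByName("my_vm")  # hypothetical domain name

info = dom.blockJobInfo("vdc", 0)  # empty dict when no job is active
if info:
    print("type=%(type)s bandwidth=%(bandwidth)s cur=%(cur)s end=%(end)s" % info)
```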