Bug 1601212

Summary: qemu coredumps on block-commit
Product: Red Hat Enterprise Linux 7 Reporter: Elad <ebenahar>
Component: qemu-kvm-rhevAssignee: Kevin Wolf <kwolf>
Status: CLOSED ERRATA QA Contact: Gu Nini <ngu>
Severity: high Docs Contact:
Priority: high    
Version: 7.5CC: aefrat, areis, bugs, chayang, coli, dfediuck, ebenahar, famz, fsun, fweimer, gveitmic, gwatson, juzhang, kwolf, michen, mtessun, ngu, pkrempa, qcheng, ratamir, slopezpa, tburke, tnisan, virt-maint, vkuznets
Target Milestone: pre-dev-freezeKeywords: Automation, Regression, ZStream
Target Release: ---   
Hardware: x86_64   
OS: Unspecified   
Whiteboard:
Fixed In Version: qemu-kvm-rhev-2.12.0-16.el7 Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
: 1631161 1637976 (view as bug list) Environment:
Last Closed: 2018-11-01 11:13:00 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Storage RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1583424, 1609137, 1610929, 1631161, 1637976    
Attachments:
Description Flags
logs
none
logs with libvirtd.log none

Description Elad 2018-07-15 07:49:46 UTC
Created attachment 1458963 [details]
logs

Description of problem:
Live merge fails on vdsm

Version-Release number of selected component (if applicable):
vdsm-4.20.34-1.el7ev.x86_64
libvirt-3.9.0-14.el7_5.6.x86_64
qemu-kvm-rhev-2.10.0-21.el7_5.4.x86_64
kernel 3.10.0-862.6.3.el7.x86_64
rhvm-4.2.5.1-0.1.el7ev.noarch

How reproducible:
Always, only on block based domain


Steps to Reproduce:
1. Create a VM from a template as thin copy with disk on iSCSI domain
2. Start the VM
3. Create and attach to the VM 4 disks on iSCSI
4. Stop the VM, create a snapshot (snap0) with all disks and conf and start the VM
5. Stop the VM, create a snapshot (snap1) with all disks and conf and start the VM
6. Stop the VM, create a snapshot (snap2) with all disks and conf and start the VM
7. Delete snapshot snap2 


Actual results:
Live merge fails in vdsm:

2018-07-13 04:08:12,086+0300 INFO  (jsonrpc/6) [virt.vm] (vmId='dd9bb2fa-5c29-45db-842f-240950242ff8') Starting merge with jobUUID='236a891a-d4e9-4ba7-b04e-bf79c1872d2a', original chain=169c1684-2b1a-449d-95c9-d
ac50ca4ddea < 6795ce45-acdc-4fea-8a98-b9ede7327da0 < 6d8c0aaa-845a-4304-b589-04703addb151 < 56caa3a2-de7c-4fed-91c1-85ee529f57ad (top), disk='vdc', base='vdc[2]', top='vdc[1]', bandwidth=0, flags=8 (vm:5972)
2018-07-13 04:08:12,086+0300 DEBUG (jsonrpc/0) [storage.TaskManager.Task] (Task='0fec0b65-7906-46d9-8d76-cfaef6570f81') ref 0 aborting False (task:1002)
2018-07-13 04:08:12,440+0300 INFO  (jsonrpc/7) [api.virt] START merge(drive={'imageID': '12daeed0-ff05-4597-a95c-14bf7555d6a7', 'volumeID': '6b13e352-1e60-4bbc-a8e7-655c7876c63f', 'domainID': '838925c0-5f93-4c2c
-9193-8adebf695962', 'poolID': '2056646c-85f3-11e8-b0a8-001a4a168bfc'}, baseVolUUID='2e7d3507-ae3a-4e2b-87b1-2a6322882f4f', topVolUUID='3a9af59b-f768-4abe-9b57-02a08de404bb', bandwidth='0', jobUUID='1f3eb4d2-537
e-4766-a3c1-6dc0d2e65624') from=::ffff:10.46.16.252,36174, flow_id=snapshots_delete_fbfba248-02a4-48de, vmId=dd9bb2fa-5c29-45db-842f-240950242ff8 (api:46)
2018-07-13 04:08:12,921+0300 ERROR (jsonrpc/3) [virt.vm] (vmId='dd9bb2fa-5c29-45db-842f-240950242ff8') Error getting block job info (vm:5838)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5836, in queryBlockJobs
    liveInfo = self._dom.blockJobInfo(drive.name, 0)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 783, in blockJobInfo
    if ret is None: raise libvirtError ('virDomainGetBlockJobInfo() failed', dom=self)
libvirtError: Unable to read from monitor: Connection reset by peer
2018-07-13 04:08:12,921+0300 INFO  (libvirt/events) [virt.vm] (vmId='dd9bb2fa-5c29-45db-842f-240950242ff8') underlying process disconnected (vm:1056)



engine.log:


2018-07-13 04:08:13,496+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-4) [snapshots_delete_fbfba248-02a4-48de] Failed in 'MergeVDS' metho
d
2018-07-13 04:08:13,502+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-commandCoordinator-Thread-4) [snapshots_delete_fbfba248-02a4-48de] EVENT_ID: VDS_B
ROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_2 command MergeVDS failed: General Exception: ('Unable to read from monitor: Connection reset by peer',)
2018-07-13 04:08:13,502+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-4) [snapshots_delete_fbfba248-02a4-48de] Command 'org.ovirt.engine.
core.vdsbroker.vdsbroker.MergeVDSCommand' return value 'StatusOnlyReturn [status=Status [code=100, message=General Exception: ('Unable to read from monitor: Connection reset by peer',)]]'
2018-07-13 04:08:13,502+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-4) [snapshots_delete_fbfba248-02a4-48de] HostName = host_mixed_2
2018-07-13 04:08:13,503+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-4) [snapshots_delete_fbfba248-02a4-48de] Command 'MergeVDSCommand(H
ostName = host_mixed_2, MergeVDSCommandParameters:{hostId='f72ef032-f9d3-4e4c-ae85-0ff6803d11ae', vmId='dd9bb2fa-5c29-45db-842f-240950242ff8', storagePoolId='2056646c-85f3-11e8-b0a8-001a4a168bfc', storageDomainId='838925c0-5f93-4c2c-9193-8adebf695962', imageGroupId='0141d39e-b7b7-47ba-b557-14a5be96ebbd', imageId='56caa3a2-de7c-4fed-91c1-85ee529f57ad', baseImageId='6795ce45-acdc-4fea-8a98-b9ede7327da0', topImageId='6d8c0aaa-845a-4304-b589-04703addb151', bandwidth='0'})' execution failed: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = General Exception: ('Unable to read from monitor: Connection reset by peer',), code = 100
2018-07-13 04:08:13,503+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-4) [snapshots_delete_fbfba248-02a4-48de] FINISH, MergeVDSCommand, log id: 27f45c42
2018-07-13 04:08:13,503+03 ERROR [org.ovirt.engine.core.bll.MergeCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-4) [snapshots_delete_fbfba248-02a4-48de] Engine exception thrown while sending merge command: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = General Exception: ('Unable to read from monitor: Connection reset by peer',), code = 100 (Failed with error GeneralException and code 100)
        at org.ovirt.engine.core.bll.VdsHandler.handleVdsResult(VdsHandler.java:118) [bll.jar:]
        at org.ovirt.engine.core.bll.VDSBrokerFrontendImpl.runVdsCommand(VDSBrokerFrontendImpl.java:33) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runVdsCommand(CommandBase.java:2045) [bll.jar:]
        at org.ovirt.engine.core.bll.MergeCommand.executeCommand(MergeCommand.java:45) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeWithoutTransaction(CommandBase.java:1133) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeActionInTransactionScope(CommandBase.java:1285) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.runInTransaction(CommandBase.java:1934) [bll.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInNewTransaction(TransactionSupport.java:202) [utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInRequired(TransactionSupport.java:137) [utils.jar:]
        at org.ovirt.engine.core.utils.transaction.TransactionSupport.executeInScope(TransactionSupport.java:105) [utils.jar:]
        at org.ovirt.engine.core.bll.CommandBase.execute(CommandBase.java:1345) [bll.jar:]
        at org.ovirt.engine.core.bll.CommandBase.executeAction(CommandBase.java:400) [bll.jar:]


Expected results:
Live merge should succeed


Additional info:
Marking as a regression as this is a tier1 test case that passed up until this build.

Comment 1 Elad 2018-07-20 11:50:48 UTC
Some clarifications:

- The bug reproduces only when there are file system on the VM disks with some content written to files on them. So in steps to reproduce, step3, create FS on the disks after attaching them to the VM.
- Correction: in steps to reproduce, step7, snapshot 1 has to be deleted, instead of snapshot2

Comment 2 Ala Hino 2018-07-22 14:08:05 UTC
Using the following steps I wasn't able to reproduce:

1. Create a VM and install rhel 7.5 (using an iso that was downloaded)
2. Create a template
4. Create a VM from the template
5. Create 4 disks on a block storage
6. Create FS on all disks
7. Write data to all disks
8. Create a snapshot (s1)
9. Write data to all disks
10. Create a snapshot (s2)
11. Write data to all disks
12. Create a snapshot (s3)
13. Write data to all disks
14. Live merge s2

==> Works as expected.

I proceeded and successfully live merged that other two snapshots, s1 and then s2.

Comment 3 Eyal Shenitzky 2018-08-01 10:48:02 UTC
Managed to reproduce with the following steps:

1. Create a VM with 1gb preallocated disk
2. Create a template from the VM in step 1
3. Create a new VM [WORK_VM]from the template in step 2
4. Add 2 preallocated 1gb disks to WORK_VM
5. Create 3 snapshots to WORK_VM [snap1, snap2, snap3]
6. Run WORK_VM on an !!HSM host!!
7. Remove snapshot snap2

Reproduce at 90%

Comment 4 Eyal Shenitzky 2018-08-06 07:18:13 UTC
Elad, Can you please attach libvirt.log and qemu log

Comment 5 Elad 2018-08-06 14:15:52 UTC
Created attachment 1473618 [details]
logs with libvirtd.log

vdsm.log

2018-08-06 15:32:39,946+0300 ERROR (jsonrpc/1) [virt.vm] (vmId='d7c105cf-06bd-4d79-a21d-3ef09ac94fd4') Live merge failed (job: 09de6148-74e6-4451-957d-f2dff045edfd) (vm:5978)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 5976, in merge
    bandwidth, flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 708, in blockCommit
    if ret == -1: raise libvirtError ('virDomainBlockCommit() failed', dom=self)
libvirtError: Unable to read from monitor: Connection reset by peer

libvirtd.log

Comment 6 Eyal Shenitzky 2018-08-07 05:02:51 UTC
Peter,

I suspect that there is a Libvirt bug that causes this error.

Can you please take a look at the libvirtd.log that Elad just added?

Comment 7 Eyal Shenitzky 2018-08-08 11:00:40 UTC
Elad, is this bug occurs also in 4.1?
Can you please check what was the last libvirt version that the flow worked properly?

Comment 8 Peter Krempa 2018-08-08 11:22:38 UTC
I've looked at the logs and found following:

libvirtd was instructed to do a block-commit operation. We've sent the following command to qemu:

2018-08-06 12:09:38.887+0000: 6674: info : qemuMonitorSend:1076 : QEMU_MONITOR_SEND_MSG: mon=0x7f86d8027cc0 msg={"execute":"block-commit","arguments":{"device":"drive-ua-b3fae703-ab91-4364-9efa-4240e7977835","top":"/rhev/data-center/mnt/blockSD/ca5552b7-305a-4667-9688-a42c180806ff/images/b3fae703-ab91-4364-9efa-4240e7977835/e6d99e41-3c8a-4f84-8e50-225c4e15e1b8","base":"/rhev/data-center/mnt/blockSD/ca5552b7-305a-4667-9688-a42c180806ff/images/b3fae703-ab91-4364-9efa-4240e7977835/12f6a3a3-b7bc-41c8-aa1f-2741348c928d","backing-file":"12f6a3a3-b7bc-41c8-aa1f-2741348c928d"},"id":"libvirt-153"}^M

... and never received a reply.

All subsequent APIs then got errors as there was an active API call. Note that block-commit operation is asynchronous after the command is acknowledged by qemu (which it was not):

2018-08-06 12:10:10.248+0000: 6675: warning : qemuDomainObjBeginJobInternal:4709 : Cannot start job (modify, none) for domain vm_TestCase6052_0614450184; current job is (modify, none) owned by (6674 remoteDispatchDomainBlockCommit, 0 <null>) for (31s, 0s)

[....]

2018-08-06 12:17:18.055+0000: 6675: warning : qemuDomainObjBeginJobInternal:4709 : Cannot start job (query, none) for domain       vm_TestCase6052_0614450184; current job is (modify, none) owned by (6674 remoteDispatchDomainBlockCommit, 0 <null>) for (459s, 0s)

So the process was stuck for ~20 minutes before being killed:

2018-08-06 12:32:27.858+0000: 6675: error : virProcessKillPainfully:401 : Failed to terminate process 723 with SIGTERM: Device or resource busy

We then killed the process with SIGKILL which was successful.

The following reply is a result of terminating the qemu process. Closing the monitor finalizes the existing block-commit API call with an error:

11278 2018-08-06 12:32:38.011+0000: 6667: error : qemuMonitorIORead:610 : Unable to read from monitor: Connection reset by peer


VM log file 'vm_TestCase6052_0614450184.log' contains the following entry from libvirtd:

2018-08-06 12:32:38.329+0000: shutting down, reason=destroyed

This means that the process was terminated by the virDomainDestroy API on user's request.

I suspect that the problem why qemu did not respond is in the storage, but with this data I can't provide any more insight. The VM was then terminated on request and the error message which was reported is expected in this scenario since the VM got stuck.

For further debugging it would be beneficial to collect a backtrace of all threads of the qemu process which got stuck during block commit and move this issue to investigate to the qemu team.

Comment 9 Eyal Shenitzky 2018-08-08 12:31:45 UTC
Ademar, 
Can you please take a look?

Comment 10 Ademar Reis 2018-08-08 14:51:57 UTC
(In reply to Eyal Shenitzky from comment #9)
> Ademar, 
> Can you please take a look?

The current information doesn't ring any bells. Please clone this BZ or transfer it to the QEMU component so it can be prioritized. Thanks.

Comment 13 Eyal Shenitzky 2018-08-12 06:26:39 UTC
*** Bug 1613834 has been marked as a duplicate of this bug. ***

Comment 14 Eyal Shenitzky 2018-08-14 06:21:31 UTC
*** Bug 1610929 has been marked as a duplicate of this bug. ***

Comment 25 Peter Krempa 2018-08-30 17:08:13 UTC
*** Bug 1607130 has been marked as a duplicate of this bug. ***

Comment 28 Ademar Reis 2018-09-07 16:25:57 UTC
Upstream patches by Kevin: https://lists.gnu.org/archive/html/qemu-block/2018-09/msg00160.html

Comment 32 Miroslav Rezanina 2018-09-17 08:51:00 UTC
Fix included in qemu-kvm-rhev-2.12.0-16.el7

Comment 41 errata-xmlrpc 2018-11-01 11:13:00 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2018:3443

Comment 42 Qinghua Cheng 2019-10-22 08:50:02 UTC
Vitaly, I opened this bug to track the migration issue on rhel7. Could you help close this bug as WONTFIX as we talked in email?

Comment 43 Qinghua Cheng 2019-10-22 08:53:15 UTC
(In reply to Qinghua Cheng from comment #42)
> Vitaly, I opened this bug to track the migration issue on rhel7. Could you
> help close this bug as WONTFIX as we talked in email?

Please ignore  Comment 42, sorry wrong bug.