Bug 1711643

Summary:

qemu aborts in blockCommit: qemu-kvm: block.c:3486: bdrv_replace_node: Assertion `!({ _Static_assert(!(sizeof(*&from->in_flight) > 8), "not expecting: " "sizeof(*&from->in_flight) > ATOMIC_REG_SIZE"); __atomic_load_n(&from->in_flight, 0); })' failed.

Product:

Red Hat Enterprise Linux 7

Reporter:

Avihai <aefrat>

Component:

qemu-kvm-rhev

Assignee:

Kevin Wolf <kwolf>

Status:

CLOSED ERRATA

QA Contact:

aihua liang <aliang>

Severity:

high

Docs Contact:

Priority:

high

Version:

7.7

CC:

aliang, bcholler, bugs, coli, dhuertas, gwatson, jinzhao, juzhang, kwolf, mkalinin, mtessun, nsoffer, virt-maint

Target Milestone:

Target Release:

---

Hardware:

Unspecified

OS:

Unspecified

Whiteboard:

Fixed In Version:

qemu-kvm-rhev-2.12.0-35.el7

Doc Type:

If docs needed, set a value

Doc Text:

Story Points:

---

Clone Of:

Clones:

1750322 (view as bug list)

Environment:

Last Closed:

2020-03-31 14:34:48 UTC

Type:

Bug

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

Storage

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
engine and vdsm logs	none
qemu_vm_TestCase6052_1906050852_log	none
vm xml extracted from engine.log	none

Description Avihai 2019-05-19 08:11:10 UTC

Created attachment 1570817 [details]
engine and vdsm logs

Description of problem:
Removing VM snapshot failed with :

Engine:
2019-05-19 06:13:18,292+03 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommandCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [snapshots_delete_aebc6b48-6dd9-4ced] Comma
nd 'RemoveSnapshotSingleDiskLive' (id: 'd0fb453f-7e58-4236-880e-e8929b7a7777') waiting on child command id: '7d205c62-99a7-43ba-8798-e41006da751e' type:'Merge' to complete
2019-05-19 06:13:18,297+03 INFO  [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommandCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [snapshots_delete_aebc6b48-6dd9-4ced] Comma
nd 'RemoveSnapshotSingleDiskLive' (id: 'df4e7c70-360c-4fe2-b72a-3ed2634b2e9f') waiting on child command id: '4b4d5091-ea4c-4657-a9de-ec047eee0e64' type:'Merge' to complete
2019-05-19 06:13:18,726+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-5) [snapshots_delete_aebc6b48-6dd9-4ced] Failed in 'MergeVDS' method
2019-05-19 06:13:18,731+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-commandCoordinator-Thread-5) [snapshots_delete_aebc6b48-6dd9-4ced] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_3 command MergeVDS failed: General Exception: ("VM 'e74bec08-cb6e-4931-8775-f8a49b62eaf8' was not defined yet or was undefined",)
2019-05-19 06:13:18,732+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-5) [snapshots_delete_aebc6b48-6dd9-4ced] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand' return value 'StatusOnlyReturn [status=Status [code=100, message=General Exception: ("VM 'e74bec08-cb6e-4931-8775-f8a49b62eaf8' was not defined yet or was undefined",)]]'
2019-05-19 06:13:18,732+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-5) [snapshots_delete_aebc6b48-6dd9-4ced] HostName = host_mixed_3
2019-05-19 06:13:18,732+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-5) [snapshots_delete_aebc6b48-6dd9-4ced] Command 'MergeVDSCommand(HostName = host_mixed_3, MergeVDSCommandParameters:{hostId='5f79683a-f0b3-4cf8-9730-7e431b662f83', vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8', storagePoolId='b26523f2-afcf-4675-b660-9faec6d709b3', storageDomainId='193e46d4-bb94-48e4-b5a5-236d34372920', imageGroupId='4700cd22-a5ce-4e39-af35-894519c71bca', imageId='705a6902-2b36-4883-9482-427885c5c84b', baseImageId='bc1d2b00-54ca-4550-933c-f2ee6ed132ca', topImageId='21e975df-872c-4930-8c81-358b6821eef9', bandwidth='0'})' execution failed: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = General Exception: ("VM 'e74bec08-cb6e-4931-8775-f8a49b62eaf8' was not defined yet or was undefined",), code = 100
2019-05-19 06:13:18,732+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-5) [snapshots_delete_aebc6b48-6dd9-4ced] FINISH, MergeVDSCommand, log id: dfef71b
2019-05-19 06:13:18,732+03 ERROR [org.ovirt.engine.core.bll.MergeCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-5) [snapshots_delete_aebc6b48-6dd9-4ced] Engine exception thrown while sending merge command: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = General Exception: ("VM 'e74bec08-cb6e-4931-8775-f8a49b62eaf8' was not defined yet or was undefined",), code = 100 (Failed with error GeneralException and code 100)

2019-05-19 06:13:18,771+03 ERROR [org.ovirt.engine.core.bll.MergeCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-5) [snapshots_delete_aebc6b48-6dd9-4ced] Transaction rolled-back for command 'org.ovi
rt.engine.core.bll.MergeCommand'.
2019-05-19 06:13:19,220+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-11) [] VM 'e74bec08-cb6e-4931-8775-f8a49b62eaf8' was reported as Down on VDS '5f79683a-f0b3-4cf8-9
730-7e431b662f83'(host_mixed_3)
2019-05-19 06:13:19,220+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-11) [] START, DestroyVDSCommand(HostName = host_mixed_3, DestroyVmVDSCommandParameters:{hostId='5f79683a-f0b3-4cf8-9730-7e431b662f83', vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 36bec028
2019-05-19 06:13:19,225+03 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-11) [] FINISH, DestroyVDSCommand, log id: 36bec028
2019-05-19 06:13:19,225+03 INFO  [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-11) [] VM 'e74bec08-cb6e-4931-8775-f8a49b62eaf8'(vm_TestCase6052_1906050852) moved from 'Up' --> 'Down'
2019-05-19 06:13:19,237+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-11) [] EVENT_ID: VM_DOWN_ERROR(119), VM vm_TestCase6052_1906050852 is down with error. Exit message: Lost connection with qemu process.

VDSM:
2019-05-19 06:13:18,683+0300 ERROR (jsonrpc/5) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Live merge failed (job: 6e010748-f6d6-49fb-9479-49d44b738737) (vm:6020)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 6018, in merge
    bandwidth, flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 719, in blockCommit
    if ret == -1: raise libvirtError ('virDomainBlockCommit() failed', dom=self)
libvirtError: Unable to read from monitor: Connection reset by peer


Version-Release number of selected component (if applicable):
ovirt-engine-4.2.8.7-0.1.el7ev.noarch
vdsm-4.20.49-3.el7ev.x86_64
libvirt-4.5.0-10.el7_6.9.x86_64
qemu-img-rhev-2.12.0-18.el7_6.4.x86_64


How reproducible:
Happened once so far on glusterFS only

Steps to Reproduce- from automation log:
06:05:08 2019-05-19 06:05:08,520 INFO       Test Setup   3: Creating VM vm_TestCase6052_1906050852
06:05:08 2019-05-19 06:05:08,725 INFO       Test Setup   4: [class] Description: clone vm from a pre-defined template golden_mixed_virtio_template with {'vol_sparse': True, 'name': 'vm_TestCase6052_1906050852', 'clone': False, 'cluster': 'golden_env_mixed_1', 'storagedomain': 'test_gluster_1', 'display_type': 'spice', 'timeout': 900, 'virtio_scsi': True, 'vol_format': 'cow', 'os_type': 'RHEL6x64', 'type': 'desktop', 'wait': True}
06:05:20 2019-05-19 06:05:20,769 INFO       Test Setup   5: Starting VM vm_TestCase6052_1906050852
06:05:20 2019-05-19 06:05:20,772 INFO       Test Setup   6: [class] Start VM vm_TestCase6052_1906050852 with {'wait_for_ip': True, 'pause': False, 'use_cloud_init': False, 'timeout': 600, 'wait_for_status': 'up'}
06:06:16 2019-05-19 06:06:16,291 INFO       Test Setup   7: Creating disks with filesystem and attach to VM vm_TestCase6052_1906050852
06:07:34 2019-05-19 06:07:34,438 INFO 300: storage/rhevmtests.storage.storage_remove_snapshots.test_live_merge.TestCase6052.test_basic_snapshot_deletion_with_io[glusterfs]
06:07:34 2019-05-19 06:07:34,439 INFO STORAGE: GLUSTERFS
06:07:34 2019-05-19 06:07:34,875 INFO       Test Step   8: Creating files on vm's 'vm_TestCase6052_1906050852' disks
06:07:35 2019-05-19 06:07:34,950 INFO       Test Step   9: Creating file /mount-point_1906065223/test_file_0
06:07:36 2019-05-19 06:07:36,619 INFO       Test Step  10: Creating file /mount-point_1906070525/test_file_0
06:07:38 2019-05-19 06:07:38,231 INFO       Test Step  11: Creating file /mount-point_1906071909/test_file_0
06:07:39 2019-05-19 06:07:39,898 INFO       Test Step  12: Creating file /mount-point_1906073210/test_file_0
06:07:42 2019-05-19 06:07:42,176 INFO       Test Step  13: Creating snapshot of vm vm_TestCase6052_1906050852
06:07:48 2019-05-19 06:07:48,185 INFO       Test Step  14: Before snapshot: 6 volumes
06:07:48 2019-05-19 06:07:48,343 INFO       Test Step  15: Stop vm vm_TestCase6052_1906050852 with {'async': 'true'}
06:08:00 2019-05-19 06:08:00,066 INFO       Test Step  16: Adding new snapshot to vm vm_TestCase6052_1906050852
06:08:00 2019-05-19 06:08:00,069 INFO       Test Step  17: Add snapshot to VM vm_TestCase6052_1906050852 with {'description': 'snapshot_6052_glusterfs_0', 'wait': True}
06:08:18 2019-05-19 06:08:18,908 INFO       Test Step  18: Start VMs with {'vm_list': ['vm_TestCase6052_1906050852'], 'wait_for_ip': True, 'wait_for_status': 'powering_up', 'max_workers': 2}
06:09:18 2019-05-19 06:09:18,301 INFO       Test Step  19: After snapshot: 11 volumes
06:09:18 2019-05-19 06:09:18,367 INFO       Test Step  20: Creating files on vm's 'vm_TestCase6052_1906050852' disks
06:09:18 2019-05-19 06:09:18,441 INFO       Test Step  21: Creating file /mount-point_1906065223/test_file_1
06:09:20 2019-05-19 06:09:20,184 INFO       Test Step  22: Creating file /mount-point_1906070525/test_file_1
06:09:21 2019-05-19 06:09:21,848 INFO       Test Step  23: Creating file /mount-point_1906071909/test_file_1
06:09:23 2019-05-19 06:09:23,434 INFO       Test Step  24: Creating file /mount-point_1906073210/test_file_1
06:09:25 2019-05-19 06:09:25,559 INFO       Test Step  25: Creating snapshot of vm vm_TestCase6052_1906050852
06:09:31 2019-05-19 06:09:31,464 INFO       Test Step  26: Before snapshot: 11 volumes
06:09:31 2019-05-19 06:09:31,617 INFO       Test Step  27: Stop vm vm_TestCase6052_1906050852 with {'async': 'true'}
06:09:43 2019-05-19 06:09:43,759 INFO       Test Step  28: Adding new snapshot to vm vm_TestCase6052_1906050852
06:09:43 2019-05-19 06:09:43,761 INFO       Test Step  29: Add snapshot to VM vm_TestCase6052_1906050852 with {'description': 'snapshot_6052_glusterfs_1', 'wait': True}
06:10:14 2019-05-19 06:10:14,210 INFO       Test Step  30: Start VMs with {'vm_list': ['vm_TestCase6052_1906050852'], 'wait_for_ip': True, 'wait_for_status': 'powering_up', 'max_workers': 2}
06:11:14 2019-05-19 06:11:14,248 INFO       Test Step  31: After snapshot: 16 volumes
06:11:14 2019-05-19 06:11:14,319 INFO       Test Step  32: Creating files on vm's 'vm_TestCase6052_1906050852' disks
06:11:14 2019-05-19 06:11:14,399 INFO       Test Step  33: Creating file /mount-point_1906065223/test_file_2
06:11:16 2019-05-19 06:11:16,265 INFO       Test Step  34: Creating file /mount-point_1906070525/test_file_2
06:11:17 2019-05-19 06:11:17,892 INFO       Test Step  35: Creating file /mount-point_1906071909/test_file_2
06:11:19 2019-05-19 06:11:19,517 INFO       Test Step  36: Creating file /mount-point_1906073210/test_file_2
06:11:21 2019-05-19 06:11:21,809 INFO       Test Step  37: Creating snapshot of vm vm_TestCase6052_1906050852
06:11:28 2019-05-19 06:11:27,953 INFO       Test Step  38: Before snapshot: 16 volumes
06:11:28 2019-05-19 06:11:28,108 INFO       Test Step  39: Stop vm vm_TestCase6052_1906050852 with {'async': 'true'}
06:11:40 2019-05-19 06:11:40,463 INFO       Test Step  40: Adding new snapshot to vm vm_TestCase6052_1906050852
06:11:40 2019-05-19 06:11:40,465 INFO       Test Step  41: Add snapshot to VM vm_TestCase6052_1906050852 with {'description': 'snapshot_6052_glusterfs_2', 'wait': True}
06:12:02 2019-05-19 06:12:02,446 INFO       Test Step  42: Start VMs with {'vm_list': ['vm_TestCase6052_1906050852'], 'wait_for_ip': True, 'wait_for_status': 'powering_up', 'max_workers': 2}
06:13:00 2019-05-19 06:13:00,384 INFO       Test Step  43: After snapshot: 21 volumes
06:13:08 2019-05-19 06:13:08,155 INFO       Test Step  44: Before live merge: 21 volumes
06:13:08 2019-05-19 06:13:08,324 INFO       Test Step  45: Performing continues IO on VM vm_TestCase6052_1906050852
06:13:08 2019-05-19 06:13:08,326 INFO       Test Step  46: Removing snapshot 'snapshot_6052_glusterfs_0' of vm vm_TestCase6052_1906050852
06:13:08 2019-05-19 06:13:08,330 INFO       Test Step  47: Remove vm vm_TestCase6052_1906050852 snapshot with {'description': 'snapshot_6052_glusterfs_0', 'timeout': 2400, 'wait': True}
06:53:15 2019-05-19 06:53:15,074 ERROR Result: FAILED

Actual results:
Removing VM snapshot failed.

Expected results:


Additional info:
Same automated test ran on 4 storage flavors(ISCSI,NFS,FCP,GLUSTER) and failed only on gluster.

Comment 1 Avihai 2019-05-19 08:16:08 UTC

There is no 'version' 4.2 so It's left empty for now.
Regression as this issue was not seen on this test till now.(rhv-4.2.10-1) which include the new vdsm-4.20.49-3.el7ev.x86_64 with some sensitive bug fixes.

Comment 2 Avihai 2019-05-19 10:44:04 UTC

Also this was seen qemu log of relevant vm_TestCase6052_1906050852.log, VM crashed with assert below.
Attaching vm_TestCase6052_1906050852.log.
Libvirt log of that time was not available.

2019-05-19T03:12:12.993666Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: CPU 1 [socket-id: 1, core-id: 0, thread-id: 0], CPU 2 [socket-id: 2, core-id: 0, thread-id: 0], CPU 3 [socket-id: 3, core-id: 0, thread-id: 0], CPU 4 [socket-id: 4, core-id: 0, thread-id: 0], CPU 5 [socket-id: 5, core-id: 0, thread-id: 0], CPU 6 [socket-id: 6, core-id: 0, thread-id: 0], CPU 7 [socket-id: 7, core-id: 0, thread-id: 0], CPU 8 [socket-id: 8, core-id: 0, thread-id: 0], CPU 9 [socket-id: 9, core-id: 0, thread-id: 0], CPU 10 [socket-id: 10, core-id: 0, thread-id: 0], CPU 11 [socket-id: 11, core-id: 0, thread-id: 0], CPU 12 [socket-id: 12, core-id: 0, thread-id: 0], CPU 13 [socket-id: 13, core-id: 0, thread-id: 0], CPU 14 [socket-id: 14, core-id: 0, thread-id: 0], CPU 15 [socket-id: 15, core-id: 0, thread-id: 0]
2019-05-19T03:12:12.993703Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
qemu-kvm: block.c:3486: bdrv_replace_node: Assertion `!({ _Static_assert(!(sizeof(*&from->in_flight) > 8), "not expecting: " "sizeof(*&from->in_flight) > ATOMIC_REG_SIZE"); __atomic_load_n(&from->in_flight, 0); })' failed.
2019-05-19 03:13:18.148+0000: shutting down, reason=crashed

Comment 3 Avihai 2019-05-19 10:44:55 UTC

Created attachment 1570830 [details]
qemu_vm_TestCase6052_1906050852_log

Comment 4 Avihai 2019-05-19 15:54:02 UTC

ran same TestCase6052 on glusterfs with same ENV - issue did not reproduce.

Comment 5 Avihai 2019-05-19 15:56:34 UTC

Maybe this is a reproduction of Bug 1601212.

Comment 6 Nir Soffer 2019-05-19 16:35:07 UTC

(In reply to Avihai from comment #2)
 
> qemu-kvm: block.c:3486: bdrv_replace_node: Assertion `!({
> _Static_assert(!(sizeof(*&from->in_flight) > 8), "not expecting: "
> "sizeof(*&from->in_flight) > ATOMIC_REG_SIZE");
> __atomic_load_n(&from->in_flight, 0); })' failed.
> 2019-05-19 03:13:18.148+0000: shutting down, reason=crashed

assert in runtime must be a qemu bug, and It does not look like bug 1601212 to me.

Kevin, what do you think?

Comment 7 Nir Soffer 2019-05-19 23:14:37 UTC

Move context seen in the logs from attachment 1570817 [details]:

We have 5 concurrent merge operations:

$ grep 'START merge' *.log

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,191+0300 INFO  (jsonrpc/5) [api.virt] START merge(drive={u'imageID': u'4700cd22-a5ce-4e39-af35-894519c71bca', u'volumeID': u'705a6902-2b36-4883-9482-427885c5c84b', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'bc1d2b00-54ca-4550-933c-f2ee6ed132ca', topVolUUID=u'21e975df-872c-4930-8c81-358b6821eef9', bandwidth=u'0', jobUUID=u'6e010748-f6d6-49fb-9479-49d44b738737') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,227+0300 INFO  (jsonrpc/0) [api.virt] START merge(drive={u'imageID': u'6b08223e-df65-473b-85a3-55fce42c18a4', u'volumeID': u'07441e1f-e4fe-46ce-bc59-f41cea15d2e4', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'e87a1c47-1117-421c-8200-5bef3d03ff1a', topVolUUID=u'c755441e-2602-424a-bc0b-beb099f0d987', bandwidth=u'0', jobUUID=u'3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,228+0300 INFO  (jsonrpc/6) [api.virt] START merge(drive={u'imageID': u'3c678812-3ac8-4e9c-801e-e9fc9bd376d9', u'volumeID': u'87fd8fc6-255a-416f-8eca-f8c1ede93f63', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'22d4d570-ecd1-40f2-b585-8f0e6a7635cc', topVolUUID=u'4f6af503-c703-44a0-afb2-67581bc5b6f8', bandwidth=u'0', jobUUID=u'e55c7454-e221-4e8e-a075-f945becd2277') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,274+0300 INFO  (jsonrpc/2) [api.virt] START merge(drive={u'imageID': u'8979b329-cb70-47b9-9ee9-83716713fd50', u'volumeID': u'21577de7-2583-4c96-a409-5e5cce9ac1ef', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'6346a581-33dd-420b-aae2-8765ea1477dd', topVolUUID=u'a8aae32a-d705-4dcb-a8f6-a3320c673635', bandwidth=u'0', jobUUID=u'aae8cb12-56ba-415c-90e0-de218304e8b5') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,275+0300 INFO  (jsonrpc/1) [api.virt] START merge(drive={u'imageID': u'ddd06cff-2643-4c7a-bd39-16c649ffbb3d', u'volumeID': u'f7973684-d3fe-4ecc-8305-0920d965e593', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'85683cc5-0b81-4ecd-b7d4-f6cc409c8679', topVolUUID=u'2c646920-b4e5-407b-942f-c9c72cb89f30', bandwidth=u'0', jobUUID=u'ce2eb173-0e7c-4188-a79b-4b67ec7de629') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46)


Merge job 6e010748-f6d6-49fb-9479-49d44b738737 - failed

$ grep 6e010748-f6d6-49fb-9479-49d44b738737 *.log

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,191+0300 INFO  (jsonrpc/5) [api.virt] START merge(drive={u'imageID': u'4700cd22-a5ce-4e39-af35-894519c71bca', u'volumeID': u'705a6902-2b36-4883-9482-427885c5c84b', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'bc1d2b00-54ca-4550-933c-f2ee6ed132ca', topVolUUID=u'21e975df-872c-4930-8c81-358b6821eef9', bandwidth=u'0', jobUUID=u'6e010748-f6d6-49fb-9479-49d44b738737') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:17,070+0300 INFO  (jsonrpc/5) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Starting merge with jobUUID=u'6e010748-f6d6-49fb-9479-49d44b738737', original chain=798d3d8a-f034-4af7-ac6a-c7a46107c632 < bc1d2b00-54ca-4550-933c-f2ee6ed132ca < 21e975df-872c-4930-8c81-358b6821eef9 < 6cdf5a03-075d-40b9-9eb2-27a2da38e519 < 705a6902-2b36-4883-9482-427885c5c84b (top), disk='vda', base='vda[3]', top='vda[2]', bandwidth=0, flags=8 (vm:6014)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:18,683+0300 ERROR (jsonrpc/5) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Live merge failed (job: 6e010748-f6d6-49fb-9479-49d44b738737) (vm:6020)


Merge job 3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38 - succeeded

$ grep 3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38 *.log

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,227+0300 INFO  (jsonrpc/0) [api.virt] START merge(drive={u'imageID': u'6b08223e-df65-473b-85a3-55fce42c18a4', u'volumeID': u'07441e1f-e4fe-46ce-bc59-f41cea15d2e4', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'e87a1c47-1117-421c-8200-5bef3d03ff1a', topVolUUID=u'c755441e-2602-424a-bc0b-beb099f0d987', bandwidth=u'0', jobUUID=u'3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,323+0300 INFO  (jsonrpc/0) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Starting merge with jobUUID=u'3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38', original chain=e87a1c47-1117-421c-8200-5bef3d03ff1a < c755441e-2602-424a-bc0b-beb099f0d987 < cb52ff05-4dc2-4da1-b6c7-0ca1d1764f34 < 07441e1f-e4fe-46ce-bc59-f41cea15d2e4 (top), disk='vdb', base='vdb[3]', top='vdb[2]', bandwidth=0, flags=8 (vm:6014)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:16,425+0300 INFO  (jsonrpc/0) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Libvirt job 3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38 was terminated (vm:5894)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:16,426+0300 INFO  (jsonrpc/0) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Starting cleanup thread for job: 3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38 (vm:5902)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:16,531+0300 INFO  (merge/e74bec08) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Synchronizing volume chain after live merge (job 3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38) (vm:6360)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:16,729+0300 INFO  (merge/e74bec08) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Synchronization completed (job 3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38) (vm:6369)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:17,012+0300 INFO  (jsonrpc/1) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Cleanup thread <vdsm.virt.vm.LiveMergeCleanupThread object at 0x7fc9c41673d0> successfully completed, untracking job 3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38 (base=e87a1c47-1117-421c-8200-5bef3d03ff1a, top=c755441e-2602-424a-bc0b-beb099f0d987) (vm:5845)


Merge job e55c7454-e221-4e8e-a075-f945becd2277 - succeeded

$ grep e55c7454-e221-4e8e-a075-f945becd2277 *.log

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,228+0300 INFO  (jsonrpc/6) [api.virt] START merge(drive={u'imageID': u'3c678812-3ac8-4e9c-801e-e9fc9bd376d9', u'volumeID': u'87fd8fc6-255a-416f-8eca-f8c1ede93f63', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'22d4d570-ecd1-40f2-b585-8f0e6a7635cc', topVolUUID=u'4f6af503-c703-44a0-afb2-67581bc5b6f8', bandwidth=u'0', jobUUID=u'e55c7454-e221-4e8e-a075-f945becd2277') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,573+0300 INFO  (jsonrpc/6) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Starting merge with jobUUID=u'e55c7454-e221-4e8e-a075-f945becd2277', original chain=22d4d570-ecd1-40f2-b585-8f0e6a7635cc < 4f6af503-c703-44a0-afb2-67581bc5b6f8 < af5e3be7-93b9-4fda-9281-7a0b9e053c11 < 87fd8fc6-255a-416f-8eca-f8c1ede93f63 (top), disk='vdc', base='vdc[3]', top='vdc[2]', bandwidth=0, flags=8 (vm:6014)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,240+0300 INFO  (jsonrpc/2) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Libvirt job e55c7454-e221-4e8e-a075-f945becd2277 was terminated (vm:5894)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,240+0300 INFO  (jsonrpc/2) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Starting cleanup thread for job: e55c7454-e221-4e8e-a075-f945becd2277 (vm:5902)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,380+0300 INFO  (merge/e74bec08) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Synchronizing volume chain after live merge (job e55c7454-e221-4e8e-a075-f945becd2277) (vm:6360)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,721+0300 INFO  (merge/e74bec08) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Synchronization completed (job e55c7454-e221-4e8e-a075-f945becd2277) (vm:6369)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,781+0300 INFO  (jsonrpc/0) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Cleanup thread <vdsm.virt.vm.LiveMergeCleanupThread object at 0x7fc9e4d8fcd0> successfully completed, untracking job e55c7454-e221-4e8e-a075-f945becd2277 (base=22d4d570-ecd1-40f2-b585-8f0e6a7635cc, top=4f6af503-c703-44a0-afb2-67581bc5b6f8) (vm:5845)


Merge job aae8cb12-56ba-415c-90e0-de218304e8b5 - succeeded

$ grep aae8cb12-56ba-415c-90e0-de218304e8b5 *.log

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,274+0300 INFO  (jsonrpc/2) [api.virt] START merge(drive={u'imageID': u'8979b329-cb70-47b9-9ee9-83716713fd50', u'volumeID': u'21577de7-2583-4c96-a409-5e5cce9ac1ef', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'6346a581-33dd-420b-aae2-8765ea1477dd', topVolUUID=u'a8aae32a-d705-4dcb-a8f6-a3320c673635', bandwidth=u'0', jobUUID=u'aae8cb12-56ba-415c-90e0-de218304e8b5') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:13,032+0300 INFO  (jsonrpc/2) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Starting merge with jobUUID=u'aae8cb12-56ba-415c-90e0-de218304e8b5', original chain=6346a581-33dd-420b-aae2-8765ea1477dd < a8aae32a-d705-4dcb-a8f6-a3320c673635 < b32eb034-18c4-4da6-b45c-eff6f00a08c7 < 21577de7-2583-4c96-a409-5e5cce9ac1ef (top), disk='sda', base='sda[3]', top='sda[2]', bandwidth=0, flags=8 (vm:6014)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,243+0300 INFO  (jsonrpc/2) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Libvirt job aae8cb12-56ba-415c-90e0-de218304e8b5 was terminated (vm:5894)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,243+0300 INFO  (jsonrpc/2) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Starting cleanup thread for job: aae8cb12-56ba-415c-90e0-de218304e8b5 (vm:5902)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,413+0300 INFO  (merge/e74bec08) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Synchronizing volume chain after live merge (job aae8cb12-56ba-415c-90e0-de218304e8b5) (vm:6360)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,748+0300 INFO  (merge/e74bec08) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Synchronization completed (job aae8cb12-56ba-415c-90e0-de218304e8b5) (vm:6369)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,811+0300 INFO  (jsonrpc/0) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Cleanup thread <vdsm.virt.vm.LiveMergeCleanupThread object at 0x7fc9e4e94890> successfully completed, untracking job aae8cb12-56ba-415c-90e0-de218304e8b5 (base=6346a581-33dd-420b-aae2-8765ea1477dd, top=a8aae32a-d705-4dcb-a8f6-a3320c673635) (vm:5845)


Merge job ce2eb173-0e7c-4188-a79b-4b67ec7de629 - started - no log after start

$ grep ce2eb173-0e7c-4188-a79b-4b67ec7de629 *.log

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,275+0300 INFO  (jsonrpc/1) [api.virt] START merge(drive={u'imageID': u'ddd06cff-2643-4c7a-bd39-16c649ffbb3d', u'volumeID': u'f7973684-d3fe-4ecc-8305-0920d965e593', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'85683cc5-0b81-4ecd-b7d4-f6cc409c8679', topVolUUID=u'2c646920-b4e5-407b-942f-c9c72cb89f30', bandwidth=u'0', jobUUID=u'ce2eb173-0e7c-4188-a79b-4b67ec7de629') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:16,585+0300 INFO  (jsonrpc/1) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Starting merge with jobUUID=u'ce2eb173-0e7c-4188-a79b-4b67ec7de629', original chain=85683cc5-0b81-4ecd-b7d4-f6cc409c8679 < 2c646920-b4e5-407b-942f-c9c72cb89f30 < 95314255-1e34-4d10-a145-3aeb68ce8a88 < f7973684-d3fe-4ecc-8305-0920d965e593 (top), disk='sdb', base='sdb[3]', top='sdb[2]', bandwidth=0, flags=8 (vm:6014)


Checking how the merged volumes were created:


$ grep "START createVolume.\+volUUID=u'21e975df-872c-4930-8c81-358b6821eef9" *.log

hypervisor-lynx23_vdsm.log:2019-05-19 06:08:00,422+0300 INFO  (jsonrpc/2) [vdsm.api] START createVolume(sdUUID=u'193e46d4-bb94-48e4-b5a5-236d34372920', spUUID=u'b26523f2-afcf-4675-b660-9faec6d709b3', imgUUID=u'4700cd22-a5ce-4e39-af35-894519c71bca', size=u'10737418240', volFormat=4, preallocate=2, diskType=u'DATA', volUUID=u'21e975df-872c-4930-8c81-358b6821eef9', desc=u'', srcImgUUID=u'4700cd22-a5ce-4e39-af35-894519c71bca', srcVolUUID=u'bc1d2b00-54ca-4550-933c-f2ee6ed132ca', initialSize=None) from=::ffff:10.46.16.252,38534, flow_id=snapshots_create_f70e48d0-067d-4683, task_id=4518f253-17a6-4850-b47f-6c82d9c651df (api:46)

$ grep "START createVolume.\+volUUID=u'c755441e-2602-424a-bc0b-beb099f0d987" *.log

hypervisor-lynx23_vdsm.log:2019-05-19 06:08:01,166+0300 INFO  (jsonrpc/6) [vdsm.api] START createVolume(sdUUID=u'193e46d4-bb94-48e4-b5a5-236d34372920', spUUID=u'b26523f2-afcf-4675-b660-9faec6d709b3', imgUUID=u'6b08223e-df65-473b-85a3-55fce42c18a4', size=u'1073741824', volFormat=4, preallocate=2, diskType=u'DATA', volUUID=u'c755441e-2602-424a-bc0b-beb099f0d987', desc=u'', srcImgUUID=u'6b08223e-df65-473b-85a3-55fce42c18a4', srcVolUUID=u'e87a1c47-1117-421c-8200-5bef3d03ff1a', initialSize=None) from=::ffff:10.46.16.252,38534, flow_id=snapshots_create_f70e48d0-067d-4683, task_id=93740b2d-29f4-4bb9-af8d-cb15ff99e278 (api:46)

$ grep "START createVolume.\+volUUID=u'4f6af503-c703-44a0-afb2-67581bc5b6f8" *.log

hypervisor-lynx23_vdsm.log:2019-05-19 06:08:00,792+0300 INFO  (jsonrpc/7) [vdsm.api] START createVolume(sdUUID=u'193e46d4-bb94-48e4-b5a5-236d34372920', spUUID=u'b26523f2-afcf-4675-b660-9faec6d709b3', imgUUID=u'3c678812-3ac8-4e9c-801e-e9fc9bd376d9', size=u'1073741824', volFormat=4, preallocate=2, diskType=u'DATA', volUUID=u'4f6af503-c703-44a0-afb2-67581bc5b6f8', desc=u'', srcImgUUID=u'3c678812-3ac8-4e9c-801e-e9fc9bd376d9', srcVolUUID=u'22d4d570-ecd1-40f2-b585-8f0e6a7635cc', initialSize=None) from=::ffff:10.46.16.252,38534, flow_id=snapshots_create_f70e48d0-067d-4683, task_id=2f335380-40fb-4a93-8ed3-8f1fd1d3803d (api:46)

$ grep "START createVolume.\+volUUID=u'a8aae32a-d705-4dcb-a8f6-a3320c673635" *.log

hypervisor-lynx23_vdsm.log:2019-05-19 06:08:00,975+0300 INFO  (jsonrpc/1) [vdsm.api] START createVolume(sdUUID=u'193e46d4-bb94-48e4-b5a5-236d34372920', spUUID=u'b26523f2-afcf-4675-b660-9faec6d709b3', imgUUID=u'8979b329-cb70-47b9-9ee9-83716713fd50', size=u'1073741824', volFormat=4, preallocate=2, diskType=u'DATA', volUUID=u'a8aae32a-d705-4dcb-a8f6-a3320c673635', desc=u'', srcImgUUID=u'8979b329-cb70-47b9-9ee9-83716713fd50', srcVolUUID=u'6346a581-33dd-420b-aae2-8765ea1477dd', initialSize=None) from=::ffff:10.46.16.252,38534, flow_id=snapshots_create_f70e48d0-067d-4683, task_id=c2b98326-f169-47c0-af99-3b9175c8606e (api:46)

$ grep "START createVolume.\+volUUID=u'2c646920-b4e5-407b-942f-c9c72cb89f30" *.log

hypervisor-lynx23_vdsm.log:2019-05-19 06:08:00,604+0300 INFO  (jsonrpc/3) [vdsm.api] START createVolume(sdUUID=u'193e46d4-bb94-48e4-b5a5-236d34372920', spUUID=u'b26523f2-afcf-4675-b660-9faec6d709b3', imgUUID=u'ddd06cff-2643-4c7a-bd39-16c649ffbb3d', size=u'1073741824', volFormat=4, preallocate=2, diskType=u'DATA', volUUID=u'2c646920-b4e5-407b-942f-c9c72cb89f30', desc=u'', srcImgUUID=u'ddd06cff-2643-4c7a-bd39-16c649ffbb3d', srcVolUUID=u'85683cc5-0b81-4ecd-b7d4-f6cc409c8679', initialSize=None) from=::ffff:10.46.16.252,38534, flow_id=snapshots_create_f70e48d0-067d-4683, task_id=23595231-f8b9-4344-9908-99476f919f58 (api:46)


Errors:

$ grep ERROR *vdsm.log
hypervisor-lynx24_vdsm.log:2019-05-19 06:13:18,683+0300 ERROR (jsonrpc/5) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Live merge failed (job: 6e010748-f6d6-49fb-9479-49d44b738737) (vm:6020)

hypervisor-lynx24_vdsm.log:2019-05-19 06:13:18,709+0300 ERROR (jsonrpc/5) [api] FINISH merge error=VM 'e74bec08-cb6e-4931-8775-f8a49b62eaf8' was not defined yet or was undefined (api:132)


I don't see anything that may be related to bug 1700623. Looks like normal
merge from vdsm point of view, failed because qemu crashed.

Comment 8 Kevin Wolf 2019-05-20 14:40:47 UTC

(In reply to Nir Soffer from comment #6)
> assert in runtime must be a qemu bug, and It does not look like bug 1601212
> to me.
> 
> Kevin, what do you think?

Yes, definitely a QEMU bug. Either some caller of bdrv_replace_node() forgets to drain the involved nodes, or we do indeed have another drain bug that was not covered by the fixes for bug 1601212 (because the version used here should already contain the fixes, if I understand correctly - otherwise, the bug would seem quite relevant).

Do we have a QEMU backtrace somewhere?

Can you help me translate the VDSM log entries to QEMU? I think live merge will be a commit block job. If so, are we committing the top of the backing chain to one of its backing files (active commit), or are we committing somewhere in the middle of the backing chain? You mention that there are multiple live merge operations going on. Are they all for the same image chain of the same VM?

Comment 9 Nir Soffer 2019-05-20 17:11:42 UTC

Avihai, did you collect qemu core dump or backtrace from hypervisor-lynx24?
(see comment 8)

Comment 10 Avihai 2019-05-20 17:40:12 UTC

(In reply to Nir Soffer from comment #9)
> Avihai, did you collect qemu core dump or backtrace from hypervisor-lynx24?
> (see comment 8)

No, unfortunately, the environment was already taken& rebuilt by someone else right after I tried to reproduce the issue on the same ENV- which did not reproduce.
I will try again, is there something that can be done with current logs?

Comment 11 Nir Soffer 2019-05-20 19:21:30 UTC

(In reply to Kevin Wolf from comment #8)
> Can you help me translate the VDSM log entries to QEMU?

I can translate to libvirt. We have a vm with 5 disks, and we
start blockCommit job per disk merging nodes name[2] to name[3]
(using libvirt notation), not active layer commit.

In the failed job both top and base volumes are qcow2, in other
jobs the base volume is qcow2 or raw-sparse file.

Bellow a more clear description of the state when we start the merge:


## Merge job 6e010748-f6d6-49fb-9479-49d44b738737 - failed

Date:           2019-05-19 06:13:12,191+0300 INFO
jobUUID:        6e010748-f6d6-49fb-9479-49d44b738737
imageID:        4700cd22-a5ce-4e39-af35-894519c71bca
volumeID:       705a6902-2b36-4883-9482-427885c5c84b
topVolUUID:     21e975df-872c-4930-8c81-358b6821eef9
baseVolUUID:    bc1d2b00-54ca-4550-933c-f2ee6ed132ca
type:           file (glusterfs/fuse)

original chain (active first):

705a6902-2b36-4883-9482-427885c5c84b
6cdf5a03-075d-40b9-9eb2-27a2da38e519
21e975df-872c-4930-8c81-358b6821eef9  <-- top
bc1d2b00-54ca-4550-933c-f2ee6ed132ca  <-- base
798d3d8a-f034-4af7-ac6a-c7a46107c632

blockCommit arguments:

disk:           vda
top:            vda[2]
base:           vda[3]
flags:          8 (VIR_DOMAIN_BLOCK_COMMIT_RELATIVE)


## Merge job 3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38 - succeeded

Date:           2019-05-19 06:13:12,227+0300
jobUUID:        3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38
imageID:        6b08223e-df65-473b-85a3-55fce42c18a4
volumeID:       07441e1f-e4fe-46ce-bc59-f41cea15d2e4
topVolUUID:     c755441e-2602-424a-bc0b-beb099f0d987
baseVolUUID:    e87a1c47-1117-421c-8200-5bef3d03ff1a
type:           file (glusterfs/fuse)

original chain (active first):

07441e1f-e4fe-46ce-bc59-f41cea15d2e4
cb52ff05-4dc2-4da1-b6c7-0ca1d1764f34
c755441e-2602-424a-bc0b-beb099f0d987  <-- top
e87a1c47-1117-421c-8200-5bef3d03ff1a  <-- base (qcow2)

blockCommit arguments:

disk:           vdb
top:            vdb[2]
base:           vdb[3]
flags:          8


## Merge job e55c7454-e221-4e8e-a075-f945becd2277 - succeeded

Date:           2019-05-19 06:13:12,228+0300
jobUUID:        e55c7454-e221-4e8e-a075-f945becd2277
imageID:        3c678812-3ac8-4e9c-801e-e9fc9bd376d9
volumeID:       87fd8fc6-255a-416f-8eca-f8c1ede93f63
topVolUUID:     4f6af503-c703-44a0-afb2-67581bc5b6f8
baseVolUUID:    22d4d570-ecd1-40f2-b585-8f0e6a7635cc
type:           file (glusterfs/fuse)

original chain (active first):

87fd8fc6-255a-416f-8eca-f8c1ede93f63
af5e3be7-93b9-4fda-9281-7a0b9e053c11
4f6af503-c703-44a0-afb2-67581bc5b6f8  <-- top
22d4d570-ecd1-40f2-b585-8f0e6a7635cc  <-- base (raw sparse)

blockCommit arguments:

disk:           vdc
top:            vdc[2]
base:           vdc[3]
flags:          8


## Merge job aae8cb12-56ba-415c-90e0-de218304e8b5 - succeeded

Date:           2019-05-19 06:13:12,274+0300
jobUUID:        aae8cb12-56ba-415c-90e0-de218304e8b5
imageID:        8979b329-cb70-47b9-9ee9-83716713fd50
volumeID:       21577de7-2583-4c96-a409-5e5cce9ac1ef
topVolUUID:     a8aae32a-d705-4dcb-a8f6-a3320c673635
baseVolUUID:    6346a581-33dd-420b-aae2-8765ea1477dd
type:           file (glusterfs/fuse)

original chain (active first):

21577de7-2583-4c96-a409-5e5cce9ac1ef
b32eb034-18c4-4da6-b45c-eff6f00a08c7
a8aae32a-d705-4dcb-a8f6-a3320c673635  <- top
6346a581-33dd-420b-aae2-8765ea1477dd  <- base (qcow2)

blockCommit arguments:

disk:           sda
top:            sda[2]
base:           sda[3]
flags:          8


## Merge job ce2eb173-0e7c-4188-a79b-4b67ec7de629 - started - no log after start

Date:           2019-05-19 06:13:12,275+0300
jobUUID:        ce2eb173-0e7c-4188-a79b-4b67ec7de629
imageID:        ddd06cff-2643-4c7a-bd39-16c649ffbb3d
volumeID:       f7973684-d3fe-4ecc-8305-0920d965e593
topVolUUID:     2c646920-b4e5-407b-942f-c9c72cb89f30
baseVolUUID:    85683cc5-0b81-4ecd-b7d4-f6cc409c8679
type:           file (glusterfs/fuse)

original chain (active first):

f7973684-d3fe-4ecc-8305-0920d965e593
95314255-1e34-4d10-a145-3aeb68ce8a88
2c646920-b4e5-407b-942f-c9c72cb89f30  <-- top
85683cc5-0b81-4ecd-b7d4-f6cc409c8679  <-- base (raw sparse)

blockCommit arguments:

disk:           sdb
top:            sdb[2]
base:           sdb[3]
flags:          8

Comment 12 Nir Soffer 2019-05-20 19:41:15 UTC

Created attachment 1571374 [details]
vm xml extracted from engine.log

Kevin, please check the domain xml for more info the configuration of this vm.

Comment 13 Avihai 2019-05-21 03:21:51 UTC

I ran the same test over the night for an additional 12 iterations and the issue did not reproduce.
Issue looks very hard to reproduce.

Comment 14 Kevin Wolf 2019-05-21 09:24:25 UTC

As we don't have a backtrace, we need to guess from the information that we do have. bdrv_replace_node() is called in three different contexts:

1. Start and completion of the commit block job. This is the most likely one here because we know that commit jobs were running.
2. Start and completion of the mirror block job (including commit of the active layer). You said the jobs in question are not mirror jobs, so we can probably rule this out.
3. Taking an external snapshot. This wasn't mentioned anywhere. Can we rule this out as well?

If we assume the assertion was triggered from a commit job, it could be either on start or completion. We have one job that has failed and one job for which we only have a start message, but nothing after this. Do we have more information about these than just "Starting merge" and "Live merge failed"? In particular, do we have a more concrete error message for the failed one? Does a QEMU crash result in failure, or does the failure message mean that the job "successfully" failed without crashing and the crash happened only later? Is the start message printed before starting the job or only when the QMP command to start the job returned success?

I really wish we had a QMP log available each time I'm dealing with a bug reported on the VDSM level. Maybe we can improve something for the future there?

Comment 15 Nir Soffer 2019-05-21 13:26:56 UTC

(In reply to Kevin Wolf from comment #14)
> As we don't have a backtrace, we need to guess from the information that we
> do have. bdrv_replace_node() is called in three different contexts:
> 
> 1. Start and completion of the commit block job. This is the most likely one
> here because we know that commit jobs were running.
> 2. Start and completion of the mirror block job (including commit of the
> active layer). You said the jobs in question are not mirror jobs, so we can
> probably rule this out.
> 3. Taking an external snapshot. This wasn't mentioned anywhere. Can we rule
> this out as well?

Yes, when we preform live merge engine takes locks and will not let a user
preform snapshots.

> If we assume the assertion was triggered from a commit job, it could be
> either on start or completion. We have one job that has failed and one job
> for which we only have a start message, but nothing after this. Do we have
> more information about these than just "Starting merge" and "Live merge
> failed"? In particular, do we have a more concrete error message for the
> failed one? 

This is what we have for the failing one (thread jsonrpc/5):

2019-05-19 06:13:17,070+0300 INFO  (jsonrpc/5) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Starting merge with jobUUID=u'6e010748-f6d6-49fb-9479-49d44b738737', original chain=798d3d8a-f034-4af7-ac6a-c7a46107c632 < bc1d2b00-54ca-4550-933c-f2ee6ed132ca < 21e975df-872c-4930-8c81-358b6821eef9 < 6cdf5a03-075d-40b9-9eb2-27a2da38e519 < 705a6902-2b36-4883-9482-427885c5c84b (top), disk='vda', base='vda[3]', top='vda[2]', bandwidth=0, flags=8 (vm:6014)

2019-05-19 06:13:18,683+0300 ERROR (jsonrpc/5) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Live merge failed (job: 6e010748-f6d6-49fb-9479-49d44b738737) (vm:6020)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 6018, in merge
    bandwidth, flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 719, in blockCommit
    if ret == -1: raise libvirtError ('virDomainBlockCommit() failed', dom=self)
libvirtError: Unable to read from monitor: Connection reset by peer

This looks like qemu crash from libvirt point of view.

And on the same time we get these events from libvirt (thread libvirt/events):

2019-05-19 06:13:18,697+0300 INFO  (libvirt/events) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') underlying process disconnected (vm:1063)

2019-05-19 06:13:18,699+0300 INFO  (libvirt/events) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Release VM resources (vm:5279)

2019-05-19 06:13:18,699+0300 INFO  (libvirt/events) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Stopping connection (guestagent:442)

So the crash is likely a result of calling virtDomainBlockCommit().

> Does a QEMU crash result in failure, or does the failure message
> mean that the job "successfully" failed without crashing and the crash
> happened only later?

Looks like qemu aborts, no successful failure :-)

> Is the start message printed before starting the job or
> only when the QMP command to start the job returned success?

We don't have any libvirt logs.

> I really wish we had a QMP log available each time I'm dealing with a bug
> reported on the VDSM level. Maybe we can improve something for the future
> there?

Vdsm communicate with libvirt, and libvirt debug log is disabled by default, so
we lack minimal info.

I think we should have libivrt INFO level logs, with one message for every change
in system state. Not sure how if this can be configured and who is responsible
for this configuration.

Comment 16 Kevin Wolf 2019-05-21 13:41:35 UTC

Looks like I have a reliable reproducer even on upstream. The trick I used is to have a throttled backing file so that the guest boots very slowly and the top node of the commit operation always has some active I/O:

gdb --args x86_64-softmmu/qemu-system-x86_64 \
    -object throttle-group,x-bps-read=4096,id=throttle0 \
    -blockdev file,filename=/home/kwolf/images/hd.img,node-name=base \
    -blockdev throttle,throttle-group=throttle0,file=base,node-name=throttled \
    -drive if=none,file=/tmp/overlay.qcow2,backing.node-name=backing,backing.backing=throttled,node-name=overlay \
    -device ide-hd,drive=overlay -qmp stdio -monitor vc

{"QMP": {"version": {"qemu": {"micro": 50, "minor": 0, "major": 4}, "package": "v4.0.0-600-gc423a6af59"}, "capabilities": ["oob"]}}
{'execute':'qmp_capabilities'}
{"return": {}}
{'execute':'block-commit', 'arguments': {'device': 'none0', 'top-node': 'backing'}}
{"timestamp": {"seconds": 1558445691, "microseconds": 749010}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "none0"}}
qemu-system-x86_64: block.c:4024: bdrv_replace_node: Zusicherung »!atomic_read(&from->in_flight)« nicht erfüllt.
Thread 1 "qemu-system-x86" received signal SIGABRT, Aborted.

(gdb) bt
#0  0x00007ffff596157f in raise () at /lib64/libc.so.6
#1  0x00007ffff594b895 in abort () at /lib64/libc.so.6
#2  0x00007ffff594b769 in _nl_load_domain.cold.0 () at /lib64/libc.so.6
#3  0x00007ffff5959a26 in .annobin_assert.c_end () at /lib64/libc.so.6
#4  0x0000555555b82f6d in bdrv_replace_node (from=0x55555667d410, to=0x55555751b280, errp=0x7fffffffd2d0) at block.c:4024
#5  0x0000555555b82fee in bdrv_append (bs_new=bs_new@entry=0x55555751b280, bs_top=bs_top@entry=0x55555667d410, errp=errp@entry=0x7fffffffd328) at block.c:4101
#6  0x0000555555bd84e7 in commit_start
    (job_id=job_id@entry=0x0, bs=bs@entry=0x5555565f9800, base=base@entry=0x5555565f0400, top=top@entry=0x55555667d410, creation_flags=creation_flags@entry=0, speed=speed@entry=0, on_error=BLOCKDEV_ON_ERROR_REPORT, backing_file_str=0x0, filter_node_name=0x0, errp=0x7fffffffd410) at block/commit.c:306
#7  0x00005555559949b9 in qmp_block_commit
    (has_job_id=<optimized out>, job_id=0x0, device=<optimized out>, has_base_node=<optimized out>, base_node=0x0, has_base=<optimized out>, base=0x0, has_top_node=true, top_node=0x555557a295f0 "backing", has_top=false, top=0x0, has_backing_file=false, backing_file=0x0, has_speed=false, speed=0, has_filter_node_name=false, filter_node_name=0x0, has_auto_finalize=false, auto_finalize=false, has_auto_dismiss=false, auto_dismiss=false, errp=0x7fffffffd4e8) at blockdev.c:3400
#8  0x0000555555b3345f in qmp_marshal_block_commit (args=<optimized out>, ret=<optimized out>, errp=0x7fffffffd5d8) at qapi/qapi-commands-block-core.c:409
#9  0x0000555555c2cd6c in do_qmp_dispatch (errp=0x7fffffffd5d0, allow_oob=<optimized out>, request=<optimized out>, cmds=0x55555638f290 <qmp_commands>) at qapi/qmp-dispatch.c:131
#10 0x0000555555c2cd6c in qmp_dispatch (cmds=0x55555638f290 <qmp_commands>, request=<optimized out>, allow_oob=<optimized out>) at qapi/qmp-dispatch.c:174
#11 0x000055555586f5d2 in monitor_qmp_dispatch (mon=0x55555653c690, req=<optimized out>) at /home/kwolf/source/qemu/monitor.c:4141
#12 0x0000555555875779 in monitor_qmp_bh_dispatcher (data=<optimized out>) at /home/kwolf/source/qemu/monitor.c:4224
#13 0x0000555555c72a4e in aio_bh_call (bh=0x55555649abc0) at util/async.c:118
#14 0x0000555555c72a4e in aio_bh_poll (ctx=ctx@entry=0x55555649a1f0) at util/async.c:118
#15 0x0000555555c76000 in aio_dispatch (ctx=0x55555649a1f0) at util/aio-posix.c:460
#16 0x0000555555c7292e in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
#17 0x00007ffff7eb506d in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#18 0x0000555555c75218 in glib_pollfds_poll () at util/main-loop.c:213
#19 0x0000555555c75218 in os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:236
#20 0x0000555555c75218 in main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:512
#21 0x000055555599df59 in main_loop () at vl.c:1959
#22 0x0000555555821474 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4623

Comment 19 CongLi 2019-05-24 02:58:32 UTC

(In reply to Kevin Wolf from comment #16)
> Looks like I have a reliable reproducer even on upstream. The trick I used
> is to have a throttled backing file so that the guest boots very slowly and
> the top node of the commit operation always has some active I/O:
> 
> gdb --args x86_64-softmmu/qemu-system-x86_64 \
>     -object throttle-group,x-bps-read=4096,id=throttle0 \
>     -blockdev file,filename=/home/kwolf/images/hd.img,node-name=base \
>     -blockdev
> throttle,throttle-group=throttle0,file=base,node-name=throttled \
>     -drive
> if=none,file=/tmp/overlay.qcow2,backing.node-name=backing,backing.
> backing=throttled,node-name=overlay \
>     -device ide-hd,drive=overlay -qmp stdio -monitor vc
> 
> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 0, "major": 4},
> "package": "v4.0.0-600-gc423a6af59"}, "capabilities": ["oob"]}}
> {'execute':'qmp_capabilities'}
> {"return": {}}
> {'execute':'block-commit', 'arguments': {'device': 'none0', 'top-node':
> 'backing'}}
> {"timestamp": {"seconds": 1558445691, "microseconds": 749010}, "event":
> "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "none0"}}
> qemu-system-x86_64: block.c:4024: bdrv_replace_node: Zusicherung
> »!atomic_read(&from->in_flight)« nicht erfüllt.
> Thread 1 "qemu-system-x86" received signal SIGABRT, Aborted.
> 
> (gdb) bt
> #0  0x00007ffff596157f in raise () at /lib64/libc.so.6
> #1  0x00007ffff594b895 in abort () at /lib64/libc.so.6
> #2  0x00007ffff594b769 in _nl_load_domain.cold.0 () at /lib64/libc.so.6
> #3  0x00007ffff5959a26 in .annobin_assert.c_end () at /lib64/libc.so.6
> #4  0x0000555555b82f6d in bdrv_replace_node (from=0x55555667d410,
> to=0x55555751b280, errp=0x7fffffffd2d0) at block.c:4024
> #5  0x0000555555b82fee in bdrv_append (bs_new=bs_new@entry=0x55555751b280,
> bs_top=bs_top@entry=0x55555667d410, errp=errp@entry=0x7fffffffd328) at
> block.c:4101
> #6  0x0000555555bd84e7 in commit_start
>     (job_id=job_id@entry=0x0, bs=bs@entry=0x5555565f9800,
> base=base@entry=0x5555565f0400, top=top@entry=0x55555667d410,
> creation_flags=creation_flags@entry=0, speed=speed@entry=0,
> on_error=BLOCKDEV_ON_ERROR_REPORT, backing_file_str=0x0,
> filter_node_name=0x0, errp=0x7fffffffd410) at block/commit.c:306
> #7  0x00005555559949b9 in qmp_block_commit
>     (has_job_id=<optimized out>, job_id=0x0, device=<optimized out>,
> has_base_node=<optimized out>, base_node=0x0, has_base=<optimized out>,
> base=0x0, has_top_node=true, top_node=0x555557a295f0 "backing",
> has_top=false, top=0x0, has_backing_file=false, backing_file=0x0,
> has_speed=false, speed=0, has_filter_node_name=false, filter_node_name=0x0,
> has_auto_finalize=false, auto_finalize=false, has_auto_dismiss=false,
> auto_dismiss=false, errp=0x7fffffffd4e8) at blockdev.c:3400
> #8  0x0000555555b3345f in qmp_marshal_block_commit (args=<optimized out>,
> ret=<optimized out>, errp=0x7fffffffd5d8) at
> qapi/qapi-commands-block-core.c:409
> #9  0x0000555555c2cd6c in do_qmp_dispatch (errp=0x7fffffffd5d0,
> allow_oob=<optimized out>, request=<optimized out>, cmds=0x55555638f290
> <qmp_commands>) at qapi/qmp-dispatch.c:131
> #10 0x0000555555c2cd6c in qmp_dispatch (cmds=0x55555638f290 <qmp_commands>,
> request=<optimized out>, allow_oob=<optimized out>) at
> qapi/qmp-dispatch.c:174
> #11 0x000055555586f5d2 in monitor_qmp_dispatch (mon=0x55555653c690,
> req=<optimized out>) at /home/kwolf/source/qemu/monitor.c:4141
> #12 0x0000555555875779 in monitor_qmp_bh_dispatcher (data=<optimized out>)
> at /home/kwolf/source/qemu/monitor.c:4224
> #13 0x0000555555c72a4e in aio_bh_call (bh=0x55555649abc0) at util/async.c:118
> #14 0x0000555555c72a4e in aio_bh_poll (ctx=ctx@entry=0x55555649a1f0) at
> util/async.c:118
> #15 0x0000555555c76000 in aio_dispatch (ctx=0x55555649a1f0) at
> util/aio-posix.c:460
> #16 0x0000555555c7292e in aio_ctx_dispatch (source=<optimized out>,
> callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
> #17 0x00007ffff7eb506d in g_main_context_dispatch () at
> /lib64/libglib-2.0.so.0
> #18 0x0000555555c75218 in glib_pollfds_poll () at util/main-loop.c:213
> #19 0x0000555555c75218 in os_host_main_loop_wait (timeout=<optimized out>)
> at util/main-loop.c:236
> #20 0x0000555555c75218 in main_loop_wait (nonblocking=<optimized out>) at
> util/main-loop.c:512
> #21 0x000055555599df59 in main_loop () at vl.c:1959
> #22 0x0000555555821474 in main (argc=<optimized out>, argv=<optimized out>,
> envp=<optimized out>) at vl.c:4623


Hi Aihua,

Could you please have a try ?

Thanks.

Comment 20 aihua liang 2019-05-24 10:17:39 UTC

Hit this issue when create a snapshot file from base that located on nfs server.

Test env:
  kernel version:3.10.0-1044.el7.x86_64
  qemu-kvm-rhev version:qemu-kvm-rhev-2.12.0-29.el7.x86_64

Test steps:
 (nfs server)#cat /etc/exports
/mnt/nfs *(rw,no_root_squash,sync)

            #ls /mnt/nfs
rhel77-64-virtio.qcow2


(nfs client)
  1.#mount 10.66.144.83:/mnt/nfs /mnt/gluster/

  2.Start guest with qemu cmds:
      /usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1' \
    -machine pc  \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2  \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1-20190123-032240-rOoB4cgD,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control  \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor-20190123-032240-rOoB4cgD,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=id8Ec4Bn  \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/serial-serial0-20190123-032240-rOoB4cgD,server,nowait \
    -device isa-serial,chardev=serial_id_serial0  \
    -chardev socket,id=seabioslog_id_20190123-032240-rOoB4cgD,path=/var/tmp/seabios-20190123-032240-rOoB4cgD,server,nowait \
    -device isa-debugcon,chardev=seabioslog_id_20190123-032240-rOoB4cgD,iobase=0x402 \
    -device ich9-usb-ehci1,id=usb1,addr=0x1d.7,multifunction=on,bus=pci.0 \
    -device ich9-usb-uhci1,id=usb1.0,multifunction=on,masterbus=usb1.0,addr=0x1d.0,firstport=0,bus=pci.0 \
    -device ich9-usb-uhci2,id=usb1.1,multifunction=on,masterbus=usb1.0,addr=0x1d.2,firstport=2,bus=pci.0 \
    -device ich9-usb-uhci3,id=usb1.2,multifunction=on,masterbus=usb1.0,addr=0x1d.4,firstport=4,bus=pci.0 \
    -object iothread,id=iothread0 \
    -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/mnt/gluster/rhel77-64-virtio.qcow2 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bus=pci.0,iothread=iothread0 \
    -drive id=drive_data1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/root/test.qcow2 \
    -device virtio-blk-pci,id=data1,drive=drive_data1,bus=pci.0,iothread=iothread0 \
    -device virtio-net-pci,mac=9a:39:3a:3b:3c:3d,id=id1JNQsL,vectors=4,netdev=idVpZZ6A,bus=pci.0  \
    -netdev tap,id=idVpZZ6A,vhost=on \
    -m 8192  \
    -smp 2,maxcpus=2,cores=1,threads=1,sockets=2  \
    -cpu 'Penryn',+kvm_pv_unhalt \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1  \
    -vnc :0  \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm \
    -monitor stdio \
    -qmp tcp:0:3000,server,nowait \

  3.Create snapshots:
      { "execute": "blockdev-snapshot-sync", "arguments": { "device": "drive_image1", "snapshot-file": "/root/sn1", "format": "qcow2" } }
Ncat: Connection reset by peer.

  After step3, qemu coredump with info:
    (qemu) Formatting '/root/sn1', fmt=qcow2 size=21474836480 backing_file=/mnt/gluster/rhel77-64-virtio.qcow2 backing_fmt=qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
qemu-kvm: block.c:3508: bdrv_replace_node: Assertion `!({ _Static_assert(!(sizeof(*&from->in_flight) > 8), "not expecting: " "sizeof(*&from->in_flight) > ATOMIC_REG_SIZE"); __atomic_load_n(&from->in_flight, 0); })' failed.
bug.txt: line 36: 28438 Aborted                 (core dumped) /usr/libexec/qemu-kvm -name 'avocado-vt-vm1' -machine pc -nodefaults -device VGA,bus=pci.0,addr=0x2 -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1-20190123-032240-rOoB4cgD,server,nowait -mon chardev=qmp_id_qmpmonitor1,mode=control -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor-20190123-032240-rOoB4cgD,server,nowait -mon chardev=qmp_id_catch_monitor,mode=control -device pvpanic,ioport=0x505,id=id8Ec4Bn -chardev socket,id=serial_id_serial0,path=/var/tmp/serial-serial0-20190123-032240-rOoB4cgD,server,nowait -device isa-serial,chardev=serial_id_serial0 -chardev socket,id=seabioslog_id_20190123-032240-rOoB4cgD,path=/var/tmp/seabios-20190123-032240-rOoB4cgD,server,nowait -device isa-debugcon,chardev=seabioslog_id_20190123-032240-rOoB4cgD,iobase=0x402 -device ich9-usb-ehci1,id=usb1,addr=0x1d.7,multifunction=on,bus=pci.0 -device ich9-usb-uhci1,id=usb1.0,multifunction=on,masterbus=usb1.0,addr=0x1d.0,firstport=0,bus=pci.0 -device ich9-usb-uhci2,id=usb1.1,multifunction=on,masterbus=usb1.0,addr=0x1d.2,firstport=2,bus=pci.0 -device ich9-usb-uhci3,id=usb1.2,multifunction=on,masterbus=usb1.0,addr=0x1d.4,firstport=4,bus=pci.0 -object iothread,id=iothread0 -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/mnt/gluster/rhel77-64-virtio.qcow2 -device virtio-blk-pci,id=image1,drive=drive_image1,bus=pci.0,iothread=iothread0 -drive id=drive_data1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/root/test.qcow2 -device virtio-blk-pci,id=data1,drive=drive_data1,bus=pci.0,iothread=iothread0 -device virtio-net-pci,mac=9a:39:3a:3b:3c:3d,id=id1JNQsL,vectors=4,netdev=idVpZZ6A,bus=pci.0 -netdev tap,id=idVpZZ6A,vhost=on -m 8192 -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 -cpu 'Penryn',+kvm_pv_unhalt -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 -vnc :0 -rtc base=utc,clock=host,driftfix=slew -boot order=cdn,once=c,menu=off,strict=off -enable-kvm -monitor stdio -qmp tcp:0:3000,server,nowait

 (gdb) bt
#0  0x00007f575c026337 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f575c027a28 in __GI_abort () at abort.c:90
#2  0x00007f575c01f156 in __assert_fail_base (fmt=0x7f575c17abe0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x562fe1c0d028 "!({ _Static_assert(!(sizeof(*&from->in_flight) > 8), \"not expecting: \" \"sizeof(*&from->in_flight) > ATOMIC_REG_SIZE\"); __atomic_load_n(&from->in_flight, 0); })", file=file@entry=0x562fe1b71ccc "block.c", line=line@entry=3508, function=function@entry=0x562fe1c0dce0 <__PRETTY_FUNCTION__.31797> "bdrv_replace_node") at assert.c:92
#3  0x00007f575c01f202 in __GI___assert_fail (assertion=assertion@entry=0x562fe1c0d028 "!({ _Static_assert(!(sizeof(*&from->in_flight) > 8), \"not expecting: \" \"sizeof(*&from->in_flight) > ATOMIC_REG_SIZE\"); __atomic_load_n(&from->in_flight, 0); })", file=file@entry=0x562fe1b71ccc "block.c", line=line@entry=3508, function=function@entry=0x562fe1c0dce0 <__PRETTY_FUNCTION__.31797> "bdrv_replace_node") at assert.c:101
#4  0x0000562fe19c72ca in bdrv_replace_node (from=0x562fe458e000, to=<optimized out>, errp=0x7ffd20a3cfa0) at block.c:3508
#5  0x0000562fe19c756e in bdrv_append (bs_new=0x562fe4888800, bs_top=0x562fe458e000, errp=errp@entry=0x7ffd20a3d010) at block.c:3579
#6  0x0000562fe186522d in external_snapshot_prepare (common=0x562fe458a550, errp=0x7ffd20a3d088) at blockdev.c:1831
#7  0x0000562fe1867a09 in qmp_transaction (dev_list=dev_list@entry=0x7ffd20a3d100, has_props=has_props@entry=false, props=<optimized out>, 
    props@entry=0x0, errp=errp@entry=0x7ffd20a3d1b8) at blockdev.c:2440
#8  0x0000562fe1867c24 in qmp_blockdev_snapshot_sync (errp=0x7ffd20a3d1b8, action=0x7ffd20a3d0f0) at blockdev.c:1267
#9  0x0000562fe1867c24 in qmp_blockdev_snapshot_sync (has_device=<optimized out>, device=<optimized out>, has_node_name=<optimized out>, node_name=<optimized out>, snapshot_file=<optimized out>, has_snapshot_node_name=<optimized out>, snapshot_node_name=0x0, has_format=true, format=0x562fe5a57d78 "qcow2", has_mode=false, mode=NEW_IMAGE_MODE_EXISTING, errp=errp@entry=0x7ffd20a3d1b8) at blockdev.c:1295
#10 0x0000562fe18745e1 in qmp_marshal_blockdev_snapshot_sync (args=<optimized out>, ret=<optimized out>, errp=0x7ffd20a3d268) at qapi/qapi-commands-block-core.c:313
#11 0x0000562fe1a9216a in qmp_dispatch (errp=0x7ffd20a3d260, request=0x7ffd20a3d260, cmds=<optimized out>) at qapi/qmp-dispatch.c:111
#12 0x0000562fe1a9216a in qmp_dispatch (cmds=<optimized out>, request=request@entry=0x562fe4b60400) at qapi/qmp-dispatch.c:160
#13 0x0000562fe178a151 in monitor_qmp_dispatch_one (req_obj=<optimized out>) at /usr/src/debug/qemu-2.12.0/monitor.c:4102
#14 0x0000562fe178a3b5 in monitor_qmp_bh_dispatcher (data=<optimized out>) at /usr/src/debug/qemu-2.12.0/monitor.c:4160
#15 0x0000562fe1a9c0b1 in aio_bh_poll (bh=0x562fe45127b0) at util/async.c:90
#16 0x0000562fe1a9c0b1 in aio_bh_poll (ctx=ctx@entry=0x562fe4408500) at util/async.c:118
#17 0x0000562fe1a9f160 in aio_dispatch (ctx=0x562fe4408500) at util/aio-posix.c:440
#18 0x0000562fe1a9bf8e in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
#19 0x00007f57626eb049 in g_main_context_dispatch (context=0x562fe43eb600) at gmain.c:3175
#20 0x00007f57626eb049 in g_main_context_dispatch (context=context@entry=0x562fe43eb600) at gmain.c:3828
#21 0x0000562fe1a9e457 in main_loop_wait () at util/main-loop.c:215
#22 0x0000562fe1a9e457 in main_loop_wait (timeout=<optimized out>) at util/main-loop.c:238
#23 0x0000562fe1a9e457 in main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:497
#24 0x0000562fe173e1f7 in main () at vl.c:2013
#25 0x0000562fe173e1f7 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4777

Comment 21 aihua liang 2019-05-24 10:51:28 UTC

BTW, i can reproduce it occasionally. So the reproduce rate is low.

Comment 22 Kevin Wolf 2019-05-24 12:06:30 UTC

(In reply to aihua liang from comment #21)
> BTW, i can reproduce it occasionally. So the reproduce rate is low.

With my command line from comment 16 (with a throttle node above base), it's a 100% reproducer for me.

Comment 23 Kevin Wolf 2019-06-03 09:59:25 UTC

I posted a fix upstream: https://lists.gnu.org/archive/html/qemu-block/2019-05/msg00709.html

However, I'm afraid that this is too late for 7.7 unless you can justify a blocker.

Comment 38 aihua liang 2019-09-19 08:06:13 UTC

Can reproduce it on qemu-kvm-rhev-2.12.0-29.el7.x86_64 with Kevin's steps in comment 16.
 
 Reproduce steps:
   1.Create images sn1 ~ sn5 
     for i in range (1,6)
     #qemu-img create -f qcow2 /root/sn$i 20G

   2.Start guest with qemu cmds:
      /usr/libexec/qemu-kvm \
    -object throttle-group,x-bps-read=4096,id=throttle0 \
    -blockdev file,filename=/home/kvm_autotest_root/images/rhel77-64-virtio-scsi.qcow2,node-name=file_base \
    -blockdev qcow2,file=file_base,node-name=base \
    -blockdev throttle,throttle-group=throttle0,file=base,node-name=throttled \
    -blockdev file,filename=/root/sn1,node-name=root_sn1 \
    -blockdev qcow2,file=root_sn1,node-name=sn1,backing=throttled \
    -blockdev file,filename=/root/sn2,node-name=root_sn2 \
    -blockdev qcow2,file=root_sn2,node-name=sn2,backing=sn1 \
    -blockdev file,filename=/root/sn3,node-name=root_sn3 \
    -blockdev qcow2,file=root_sn3,node-name=sn3,backing=sn2 \
    -blockdev file,filename=/root/sn4,node-name=root_sn4 \
    -blockdev qcow2,file=root_sn4,node-name=sn4,backing=sn3 \
    -blockdev file,filename=/root/sn5,node-name=root_sn5 \
    -blockdev qcow2,file=root_sn5,node-name=sn5,backing=sn4 \
    -device ide-hd,drive=sn5 \
    -m 8192  \
    -smp 2,maxcpus=2,cores=1,threads=1,sockets=2  \
    -cpu 'Penryn',+kvm_pv_unhalt \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm \
    -vnc :0 \
    -monitor stdio \
    -qmp tcp:0:3000,server,nowait \

  3.During guest boot, do block commit
    {'execute':'block-commit', 'arguments': {'device': 'sn5', 'top-node': 'sn4','job-id':'j1'}}
{"timestamp": {"seconds": 1568879801, "microseconds": 692671}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j1"}}
Connection closed by foreign host.

  After step3:
    (qemu) qemu-kvm: block.c:3508: bdrv_replace_node: Assertion `!({ _Static_assert(!(sizeof(*&from->in_flight) > 8), "not expecting: " "sizeof(*&from->in_flight) > ATOMIC_REG_SIZE"); __atomic_load_n(&from->in_flight, 0); })' failed.
tat.txt: Line 26: 30297 Aborted               (Coredump)/usr/libexec/qemu-kvm -object throttle-group,x-bps-read=4096,id=throttle0 -blockdev file,filename=/home/kvm_autotest_root/images/rhel77-64-virtio-scsi.qcow2,node-name=file_base...

Comment 39 aihua liang 2019-09-19 10:18:05 UTC

Verified on qemu-kvm-rhev-2.12.0-35.el7.x86_64, the issue has been fixed, but there's a new issue.
   
   1.Create images sn1 ~ sn5 
     for i in range (1,6)
     #qemu-img create -f qcow2 /root/sn$i 20G

   2.Start guest with qemu cmds:
      /usr/libexec/qemu-kvm \
    -object throttle-group,x-bps-read=4096,id=throttle0 \
    -blockdev file,filename=/home/kvm_autotest_root/images/rhel77-64-virtio-scsi.qcow2,node-name=file_base \
    -blockdev qcow2,file=file_base,node-name=base \
    -blockdev throttle,throttle-group=throttle0,file=base,node-name=throttled \
    -blockdev file,filename=/root/sn1,node-name=root_sn1 \
    -blockdev qcow2,file=root_sn1,node-name=sn1,backing=throttled \
    -blockdev file,filename=/root/sn2,node-name=root_sn2 \
    -blockdev qcow2,file=root_sn2,node-name=sn2,backing=sn1 \
    -blockdev file,filename=/root/sn3,node-name=root_sn3 \
    -blockdev qcow2,file=root_sn3,node-name=sn3,backing=sn2 \
    -blockdev file,filename=/root/sn4,node-name=root_sn4 \
    -blockdev qcow2,file=root_sn4,node-name=sn4,backing=sn3 \
    -blockdev file,filename=/root/sn5,node-name=root_sn5 \
    -blockdev qcow2,file=root_sn5,node-name=sn5,backing=sn4 \
    -device ide-hd,drive=sn5 \
    -m 8192  \
    -smp 2,maxcpus=2,cores=1,threads=1,sockets=2  \
    -cpu 'Penryn',+kvm_pv_unhalt \
    -rtc base=utc,clock=host,driftfix=slew  \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm \
    -vnc :0 \
    -monitor stdio \
    -qmp tcp:0:3000,server,nowait \

  3.During guest boot, do block commit from sn4 to base.
    {'execute':'block-commit', 'arguments': {'device': 'sn5', 'top-node': 'sn4','job-id':'j1'}}
{"timestamp": {"seconds": 1568881186, "microseconds": 588417}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j1"}}
{"timestamp": {"seconds": 1568881186, "microseconds": 646349}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j1"}}
{"return": {}}
{"timestamp": {"seconds": 1568881187, "microseconds": 347398}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "j1"}}
{"timestamp": {"seconds": 1568881187, "microseconds": 347460}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "j1"}}
{"timestamp": {"seconds": 1568881187, "microseconds": 347739}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "j1", "len": 21474836480, "offset": 21474836480, "speed": 0, "type": "commit"}}
{"timestamp": {"seconds": 1568881187, "microseconds": 347800}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "j1"}}
{"timestamp": {"seconds": 1568881187, "microseconds": 347847}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "j1"}}

   4. Do block commit from sn5 to base.
    {'execute':'block-commit', 'arguments': {'device': 'sn5','job-id':'j1'}}
Connection closed by foreign host.
  
   After step4, qemu coredump with info:
     (qemu) qemu-kvm: block/throttle.c:214: throttle_co_drain_end: Assertion `tgm->io_limits_disabled' failed.
tat.txt: line 26: 30790 Aborted               (Coredump)/usr/libexec/qemu-kvm -object throttle-group,x-bps-read=4096,id=throttle0 -blockdev file,filename=/home/kvm_autotest_root/images/rhel77-64-virtio-scsi.qcow2,node-name=file_base -blockdev qcow2,file=file_base,node-name=base -blockdev throttle,throttle-group=throttle0,file=base,node-name=throttled ...

   (gdb) bt
#0  0x00007fbc6d54c377 in raise () at /lib64/libc.so.6
#1  0x00007fbc6d54da68 in abort () at /lib64/libc.so.6
#2  0x00007fbc6d545196 in __assert_fail_base () at /lib64/libc.so.6
#3  0x00007fbc6d545242 in  () at /lib64/libc.so.6
#4  0x0000558a1c13c921 in throttle_co_drain_end (bs=<optimized out>) at block/throttle.c:214
#5  0x0000558a1c12771e in bdrv_drain_invoke_entry (opaque=0x558a20ef0ae0) at block/io.c:197
#6  0x0000558a1c1c68ba in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>)
    at util/coroutine-ucontext.c:116
#7  0x00007fbc6d55e180 in __start_context () at /lib64/libc.so.6
#8  0x00007ffd321a3970 in  ()
#9  0x0000000000000000 in  ()

Set this bug's status to "Verified" and track the new issue by a new bug.

Comment 42 errata-xmlrpc 2020-03-31 14:34:48 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1216