+++ This bug was initially created as a clone of Bug #1796124 +++

Description of problem:

Attempted the following: the VM is running and writing to disk, a VM snapshot is performed, the disk is extended, then the snapshot is removed, then another snapshot is taken. During snapshot removal the following issues occur:

vdsm.log

2020-01-28 18:22:51,039+0000 INFO (libvirt/events) [virt.vm] (vmId='de5803b9-d6aa-45a7-bba4-432fe53b7f7e') CPU stopped: onSuspend (vm:6062)
2020-01-28 18:22:51,042+0000 WARN (libvirt/events) [virt.vm] (vmId='de5803b9-d6aa-45a7-bba4-432fe53b7f7e') device vda out of space (vm:3997)
2020-01-28 18:22:51,043+0000 INFO (libvirt/events) [virt.vm] (vmId='de5803b9-d6aa-45a7-bba4-432fe53b7f7e') abnormal vm stop device ua-624caba8-2d6d-4596-bf7d-e5ba50b4632c error enospc (vm:5075)
2020-01-28 18:22:51,043+0000 INFO (libvirt/events) [virt.vm] (vmId='de5803b9-d6aa-45a7-bba4-432fe53b7f7e') CPU stopped: onIOError (vm:6062)
2020-01-28 18:22:51,048+0000 INFO (libvirt/events) [virt.vm] (vmId='de5803b9-d6aa-45a7-bba4-432fe53b7f7e') Requesting extension for volume 3f7b06fe-0d5e-4c5e-8c70-b908f1e18bd2 on domain 7ff38788-4421-4029-b23c-6159f155c9f2 (apparent: 1073741824, capacity: 107374182400, allocated: 1078198272, physical: 1073741824 threshold_state: exceeded) (vm:1288)
2020-01-28 18:22:51,048+0000 INFO (libvirt/events) [vdsm.api] START sendExtendMsg(spUUID='d8bdc838-93f3-4833-9774-e41ad0881f9e', volDict={'newSize': 2147483648, 'domainID': u'7ff38788-4421-4029-b23c-6159f155c9f2', 'name': 'vda', 'poolID': 'd8bdc838-93f3-4833-9774-e41ad0881f9e', 'clock': <Clock(total=0.00*, extend-volume=0.00*)>, 'internal': False, 'volumeID': u'3f7b06fe-0d5e-4c5e-8c70-b908f1e18bd2', 'imageID': u'624caba8-2d6d-4596-bf7d-e5ba50b4632c'}, newSize=2147483648, callbackFunc=<bound method Vm.__afterVolumeExtension of <vdsm.virt.vm.Vm object at 0x7fc333410110>>) from=internal, task_id=507d3975-b6d5-4674-b26f-f77ed96f2d73 (api:48)

Resulting in:

engine.log

2020-01-28 19:37:28,871-05 WARN [org.ovirt.engine.core.vdsbroker.libvirt.VmDevicesConverter] (EE-ManagedExecutorService-commandCoordinator-Thread-10) [e11a6f96-3e61-4566-98b2-f3c280755c5f] unmanaged disk with path '/var/run/vdsm/payload/f13ca8ef-d81b-4aaa-bec0-a31f69ca1421.0f6e773a0c690fd566c4718f0d76cdb2.img' is ignored
2020-01-28 19:37:28,878-05 ERROR [org.ovirt.engine.core.bll.MergeStatusCommand] (EE-ManagedExecutorService-commandCoordinator-Thread-10) [e11a6f96-3e61-4566-98b2-f3c280755c5f] Failed to live merge.
Top volume 22cd84b2-0809-4365-a629-1ac5f928ac7c is still in qemu chain [dd848b17-889d-449e-980f-80caab017565, 22cd84b2-0809-4365-a629-1ac5f928ac7c, acebaabc-a96b-407f-8408-bb59d193910c]
2020-01-28 19:37:30,834-05 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-89) [e11a6f96-3e61-4566-98b2-f3c280755c5f] Command id: '63a21e84-0187-496f-8964-581644da0b2d failed child command status for step 'MERGE_STATUS'
2020-01-28 19:37:30,834-05 INFO [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommandCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-89) [e11a6f96-3e61-4566-98b2-f3c280755c5f] Command 'RemoveSnapshotSingleDiskLive' id: '63a21e84-0187-496f-8964-581644da0b2d' child commands '[e0e206bc-eae3-42b8-a39d-665395f5f539, fe5081a9-8f06-4fe0-8ec6-c707636a2a42, 311507c6-541e-405e-8c95-7507db6df3fa]' executions were completed, status 'FAILED'
2020-01-28 19:37:31,885-05 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-31) [e11a6f96-3e61-4566-98b2-f3c280755c5f] Merging of snapshot '643a447d-4837-404d-9772-da28b1a544a4' images 'acebaabc-a96b-407f-8408-bb59d193910c'..'22cd84b2-0809-4365-a629-1ac5f928ac7c' failed. Images have been marked illegal and can no longer be previewed or reverted to. Please retry Live Merge on the snapshot to complete the operation.
2020-01-28 19:37:31,897-05 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-31) [e11a6f96-3e61-4566-98b2-f3c280755c5f] Ending command 'org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand' with failure.
2020-01-28 19:37:31,916-05 INFO [org.ovirt.engine.core.bll.ConcurrentChildCommandsExecutionCallback] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-31) [e11a6f96-3e61-4566-98b2-f3c280755c5f] Command 'RemoveSnapshot' id: '56b00a6c-9260-4d3d-982f-5aab7be4c8be' child commands '[63a21e84-0187-496f-8964-581644da0b2d]' executions were completed, status 'FAILED'
2020-01-28 19:37:33,003-05 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-68) [e11a6f96-3e61-4566-98b2-f3c280755c5f] Ending command 'org.ovirt.engine.core.bll.snapshots.RemoveSnapshotCommand' with failure.
2020-01-28 19:37:33,052-05 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-68) [e11a6f96-3e61-4566-98b2-f3c280755c5f] EVENT_ID: USER_REMOVE_SNAPSHOT_FINISHED_FAILURE(357), Failed to delete snapshot 'VM_JMeter_Snapshot_42588' for VM 'VM_SnapshotBug_1'.

Logs are ovirt log collector and separate tar of relevant vdsms[1]
[1] https://drive.google.com/open?id=1Pi-EPyPVwVh6r6ya8ikMCDphgJyhNKB6

Version-Release number of selected component (if applicable):
Version 4.3.8.1-0.1.master.el7

How reproducible:
reproduces

Steps to Reproduce:
1. Create VM with an existing snapshot
2. Take VM snapshot
3. IO activity forces extend
4. Wait 5 mins
5. Remove Snapshot

Actual results:
Failed to live merge.
Top volume 22cd84b2-0809-4365-a629-1ac5f928ac7c is still in qemu chain [dd848b17-889d-449e-980f-80caab017565, 22cd84b2-0809-4365-a629-1ac5f928ac7c, acebaabc-a96b-407f-8408-bb59d193910c]
2020-01-28 19:37:30,834-05 ERROR [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-89) [e11a6f96-3e61-4566-98b2-f3c280755c5f] Command id: '63a21e84-0187-496f-8964-581644da0b2d failed child command status for step 'MERGE_STATUS'

Expected results:
Successful live merge / VM snapshot removal

Additional info:
Logs are ovirt log collector and separate tar of relevant vdsms[1]
[1] https://drive.google.com/open?id=1Pi-EPyPVwVh6r6ya8ikMCDphgJyhNKB6
I have reviewed the logs: we get the libvirt event and then we do the XML dump, and the volume is still reported in the chain. So I am not sure if it is the same issue.

2020-01-29 20:51:19,914+0100 INFO (jsonrpc/5) [api.virt] START merge(drive={u'imageID': u'1b8e0d7f-7596-45de-aeb2-6a14a8623788', u'volumeID': u'fd66ff43-b8d4-4ef5-8535-219ed736b56d', u'domainID': u'899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd', u'poolID': u'224c9107-4439-4b45-812f-6c88453cb8aa'}, baseVolUUID=u'359babf4-b0d8-48a9-8969-8ad98f636390', topVolUUID=u'fd66ff43-b8d4-4ef5-8535-219ed736b56d', bandwidth=u'0', jobUUID=u'183a508e-ee64-456d-890d-8b1510a727a6') from=::ffff:172.26.8.101,57294, flow_id=f2e7f2cb-968c-46e4-8e1b-8594098c4146, vmId=d50fc3b9-f27b-436e-b204-38e943f3c8a8 (api:48)
2020-01-29 20:51:19,959+0100 INFO (jsonrpc/4) [api.virt] START merge(drive={u'imageID': u'42f4e07b-62a9-43d9-a96a-b41b9cbd6c10', u'volumeID': u'7d221386-f395-45a9-8a0f-195aa9dd2f2c', u'domainID': u'899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd', u'poolID': u'224c9107-4439-4b45-812f-6c88453cb8aa'}, baseVolUUID=u'6898a359-8a58-4c20-842e-d7b46ab19db5', topVolUUID=u'7d221386-f395-45a9-8a0f-195aa9dd2f2c', bandwidth=u'0', jobUUID=u'f49afb45-62de-4767-8404-da2d6d2babb3') from=::ffff:172.26.8.101,57294, flow_id=f2e7f2cb-968c-46e4-8e1b-8594098c4146, vmId=d50fc3b9-f27b-436e-b204-38e943f3c8a8 (api:48)
2020-01-29 20:51:20,188+0100 INFO (jsonrpc/5) [virt.vm] (vmId='d50fc3b9-f27b-436e-b204-38e943f3c8a8') Starting merge with jobUUID=u'183a508e-ee64-456d-890d-8b1510a727a6', original chain=359babf4-b0d8-48a9-8969-8ad98f636390 < fd66ff43-b8d4-4ef5-8535-219ed736b56d (top), disk='vda', base='vda[1]', top=None, bandwidth=0, flags=12 (vm:5899)
2020-01-29 20:51:20,236+0100 INFO (jsonrpc/4) [virt.vm] (vmId='d50fc3b9-f27b-436e-b204-38e943f3c8a8') Starting merge with jobUUID=u'f49afb45-62de-4767-8404-da2d6d2babb3', original chain=6898a359-8a58-4c20-842e-d7b46ab19db5 < 7d221386-f395-45a9-8a0f-195aa9dd2f2c (top), disk='vdb', base='vdb[1]', top=None, bandwidth=0, flags=12 (vm:5899)
2020-01-29 20:51:20,601+0100 DEBUG (libvirt/events) [virt.vm] (vmId='d50fc3b9-f27b-436e-b204-38e943f3c8a8') unhandled libvirt event (event_name=Unknown id 8, args=('/rhev/data-center/mnt/blockSD/899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd/images/42f4e07b-62a9-43d9-a96a-b41b9cbd6c10/7d221386-f395-45a9-8a0f-195aa9dd2f2c', 4, 1, 8)) (clientIF:691)
2020-01-29 20:51:21,015+0100 DEBUG (libvirt/events) [virt.vm] (vmId='d50fc3b9-f27b-436e-b204-38e943f3c8a8') unhandled libvirt event (event_name=Unknown id 8, args=('/rhev/data-center/mnt/blockSD/899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd/images/1b8e0d7f-7596-45de-aeb2-6a14a8623788/fd66ff43-b8d4-4ef5-8535-219ed736b56d', 4, 1, 8)) (clientIF:691)
2020-01-29 20:51:35,737+0100 INFO (periodic/2) [virt.vm] (vmId='d50fc3b9-f27b-436e-b204-38e943f3c8a8') Cleanup thread <vdsm.virt.vm.LiveMergeCleanupThread object at 0x7eff5071f190> successfully completed, untracking job f49afb45-62de-4767-8404-da2d6d2babb3 (base=6898a359-8a58-4c20-842e-d7b46ab19db5, top=7d221386-f395-45a9-8a0f-195aa9dd2f2c) (vm:5734)
2020-01-29 20:51:50,738+0100 INFO (periodic/1) [virt.vm] (vmId='d50fc3b9-f27b-436e-b204-38e943f3c8a8') Cleanup thread <vdsm.virt.vm.LiveMergeCleanupThread object at 0x7eff5016f650> successfully completed, untracking job 183a508e-ee64-456d-890d-8b1510a727a6 (base=359babf4-b0d8-48a9-8969-8ad98f636390, top=fd66ff43-b8d4-4ef5-8535-219ed736b56d) (vm:5734)

engine.log:

2020-01-29 20:52:25,482+01 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DumpXmlsVDSCommand]
(EE-ManagedThreadFactory-commandCoordinator-Thread-4) [f2e7f2cb-968c-46e4-8e1b-8594098c4146] FINISH, DumpXmlsVDSCommand, return: {d50fc3b9-f27b-436e-b204-38e943f3c8a8=<domain type='kvm' id='4'> ... <ovirt-vm:device devtype="disk" name="vda"> <ovirt-vm:imageID>1b8e0d7f-7596-45de-aeb2-6a14a8623788</ovirt-vm:imageID> <ovirt-vm:poolID>224c9107-4439-4b45-812f-6c88453cb8aa</ovirt-vm:poolID> <ovirt-vm:volumeID>fd66ff43-b8d4-4ef5-8535-219ed736b56d</ovirt-vm:volumeID> ... <ovirt-vm:volumeChain> <ovirt-vm:volumeChainNode> <ovirt-vm:domainID>899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd</ovirt-vm:domainID> <ovirt-vm:imageID>1b8e0d7f-7596-45de-aeb2-6a14a8623788</ovirt-vm:imageID> <ovirt-vm:leaseOffset type="int">141557760</ovirt-vm:leaseOffset> <ovirt-vm:leasePath>/dev/899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd/leases</ovirt-vm:leasePath> <ovirt-vm:path>/rhev/data-center/mnt/blockSD/899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd/images/1b8e0d7f-7596-45de-aeb2-6a14a8623788/fd66ff43-b8d4-4ef5-8535-219ed736b56d</ovirt-vm:path> <ovirt-vm:volumeID>fd66ff43-b8d4-4ef5-8535-219ed736b56d</ovirt-vm:volumeID> </ovirt-vm:volumeChainNode> <ovirt-vm:volumeChainNode> <ovirt-vm:domainID>899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd</ovirt-vm:domainID> <ovirt-vm:imageID>1b8e0d7f-7596-45de-aeb2-6a14a8623788</ovirt-vm:imageID> <ovirt-vm:leaseOffset type="int">144703488</ovirt-vm:leaseOffset> <ovirt-vm:leasePath>/dev/899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd/leases</ovirt-vm:leasePath> <ovirt-vm:path>/rhev/data-center/mnt/blockSD/899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd/images/1b8e0d7f-7596-45de-aeb2-6a14a8623788/359babf4-b0d8-48a9-8969-8ad98f636390</ovirt-vm:path> <ovirt-vm:volumeID>359babf4-b0d8-48a9-8969-8ad98f636390</ovirt-vm:volumeID> </ovirt-vm:volumeChainNode> </ovirt-vm:volumeChain> ... <ovirt-vm:device devtype="disk" name="vdb"> <ovirt-vm:domainID>899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd</ovirt-vm:domainID> <ovirt-vm:guestName>/dev/vdb</ovirt-vm:guestName> <ovirt-vm:imageID>42f4e07b-62a9-43d9-a96a-b41b9cbd6c10</ovirt-vm:imageID> <ovirt-vm:poolID>224c9107-4439-4b45-812f-6c88453cb8aa</ovirt-vm:poolID> <ovirt-vm:volumeID>7d221386-f395-45a9-8a0f-195aa9dd2f2c</ovirt-vm:volumeID> ... <ovirt-vm:volumeChain> <ovirt-vm:volumeChainNode> <ovirt-vm:domainID>899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd</ovirt-vm:domainID> <ovirt-vm:imageID>42f4e07b-62a9-43d9-a96a-b41b9cbd6c10</ovirt-vm:imageID> <ovirt-vm:leaseOffset type="int">149946368</ovirt-vm:leaseOffset> <ovirt-vm:leasePath>/dev/899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd/leases</ovirt-vm:leasePath> <ovirt-vm:path>/rhev/data-center/mnt/blockSD/899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd/images/42f4e07b-62a9-43d9-a96a-b41b9cbd6c10/7d221386-f395-45a9-8a0f-195aa9dd2f2c</ovirt-vm:path> <ovirt-vm:volumeID>7d221386-f395-45a9-8a0f-195aa9dd2f2c</ovirt-vm:volumeID> </ovirt-vm:volumeChainNode> <ovirt-vm:volumeChainNode> <ovirt-vm:domainID>899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd</ovirt-vm:domainID> <ovirt-vm:imageID>42f4e07b-62a9-43d9-a96a-b41b9cbd6c10</ovirt-vm:imageID> <ovirt-vm:leaseOffset type="int">145752064</ovirt-vm:leaseOffset> <ovirt-vm:leasePath>/dev/899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd/leases</ovirt-vm:leasePath> <ovirt-vm:path>/rhev/data-center/mnt/blockSD/899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd/images/42f4e07b-62a9-43d9-a96a-b41b9cbd6c10/6898a359-8a58-4c20-842e-d7b46ab19db5</ovirt-vm:path> <ovirt-vm:volumeID>6898a359-8a58-4c20-842e-d7b46ab19db5</ovirt-vm:volumeID> </ovirt-vm:volumeChainNode> </ovirt-vm:volumeChain> ... 
2020-01-29 20:52:25,511+01 ERROR [org.ovirt.engine.core.bll.MergeStatusCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-4) [f2e7f2cb-968c-46e4-8e1b-8594098c4146] Failed to live merge. Top volume 7d221386-f395-45a9-8a0f-195aa9dd2f2c is still in qemu chain [6898a359-8a58-4c20-842e-d7b46ab19db5, 7d221386-f395-45a9-8a0f-195aa9dd2f2c]
Based on:

2020-01-29 20:51:20,601+0100 DEBUG (libvirt/events) [virt.vm] (vmId='d50fc3b9-f27b-436e-b204-38e943f3c8a8') unhandled libvirt event (event_name=Unknown id 8, args=('/rhev/data-center/mnt/blockSD/899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd/images/42f4e07b-62a9-43d9-a96a-b41b9cbd6c10/7d221386-f395-45a9-8a0f-195aa9dd2f2c', 4, 1, 8)) (clientIF:691)

Event_id=8 is VIR_DOMAIN_EVENT_ID_BLOCK_JOB, delivered through virConnectDomainEventBlockJobCallback:

typedef void (*virConnectDomainEventBlockJobCallback) (virConnectPtr conn,
                                                       virDomainPtr dom,
                                                       const char * disk,
                                                       int type,
                                                       int status,
                                                       void * opaque)

So the event arguments decode as:

disk: /rhev/data-center/mnt/blockSD/899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd/images/42f4e07b-62a9-43d9-a96a-b41b9cbd6c10/7d221386-f395-45a9-8a0f-195aa9dd2f2c
type: 4 (virDomainBlockJobType) VIR_DOMAIN_BLOCK_JOB_TYPE_ACTIVE_COMMIT = 4 (0x4)
status: 1 (virConnectDomainEventBlockJobStatus) VIR_DOMAIN_BLOCK_JOB_FAILED = 1 (0x1)
opaque: 8

So if I am right, the issue is not caused by monitoring, but by the commit operation itself failing. Is that a correct assumption?
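For reference, this is roughly how such block job events can be received and decoded with the libvirt Python bindings (a minimal standalone sketch, not vdsm's code; the connection URI is an assumption):

~~~
import libvirt

# virConnectDomainEventBlockJobStatus / virDomainBlockJobType values
BLOCK_JOB_STATUS = {0: "COMPLETED", 1: "FAILED", 2: "CANCELED", 3: "READY"}
BLOCK_JOB_TYPE = {0: "UNKNOWN", 1: "PULL", 2: "COPY", 3: "COMMIT", 4: "ACTIVE_COMMIT"}

def block_job_cb(conn, dom, disk, job_type, status, opaque):
    # 'disk' is the path seen in the vdsm log, e.g. /rhev/data-center/.../<volume>
    print("VM %s: block job on %s type=%s status=%s" % (
        dom.name(), disk,
        BLOCK_JOB_TYPE.get(job_type, job_type),
        BLOCK_JOB_STATUS.get(status, status)))

libvirt.virEventRegisterDefaultImpl()
conn = libvirt.open("qemu:///system")
conn.domainEventRegisterAny(None, libvirt.VIR_DOMAIN_EVENT_ID_BLOCK_JOB,
                            block_job_cb, None)
while True:
    libvirt.virEventRunDefaultImpl()
~~~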
Yes, that is a correct assumption. To better understand why it failed, I'd need to see the libvirtd debug log, which would contain the QMP conversation with qemu, where qemu would tell us what happened in the error message.
We do not have any debug logs yet, but we enabled the debug logs in our reproducer. I hope we will hit the issue soon. One important note is that we run snapshot removals for 3 VMs in parallel. When the issue happened it affected all three VMs. All of them failed to merge.
Peter, could you please suggest a debug log filter for libvirt, so that we don't have to log everything? Thanks.
I'm not sure which libvirt version you are using, but the defaults documented in https://wiki.libvirt.org/page/DebugLogs are usually pretty good for debugging qemu issues. That means that if libvirt-4.4 or later is used, use the following filter setting:

log_filters="3:remote 4:event 3:util.json 3:rpc 1:*"

I personally use a somewhat more aggressive setting. That one should also work well for blockjobs:

log_filters="3:remote 4:event 3:util.json 3:util.object 3:util.dbus 3:util.udev 3:node_device 3:rpc 3:access.accessmanager 1:*"
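For example, following the wiki page above, the filter goes into /etc/libvirt/libvirtd.conf together with a file output, and libvirtd has to be restarted afterwards (a sketch; the log file path is just the customary location):

~~~
# /etc/libvirt/libvirtd.conf
log_filters="3:remote 4:event 3:util.json 3:rpc 1:*"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"
~~~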
(In reply to Roman Hodain from comment #4)
> Based on:
> 2020-01-29 20:51:20,601+0100 DEBUG (libvirt/events) [virt.vm]
> (vmId='d50fc3b9-f27b-436e-b204-38e943f3c8a8') unhandled libvirt event
> (event_name=Unknown id 8,
> args=('/rhev/data-center/mnt/blockSD/899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd/
> images/42f4e07b-62a9-43d9-a96a-b41b9cbd6c10/7d221386-f395-45a9-8a0f-
> 195aa9dd2f2c', 4, 1, 8)) (clientIF:691)
> 
> Event_id=8 VIR_DOMAIN_EVENT_ID_BLOCK_JOB
> 
> virConnectDomainEventBlockJobCallback:
> typedef void (*virConnectDomainEventBlockJobCallback) (virConnectPtr conn,
> virDomainPtr dom,
> const char * disk,
> int type,
> int status,
> void * opaque)
> 
> disk:
> /rhev/data-center/mnt/blockSD/899e0bb2-6eb9-4fdd-a6fa-1956362bd3fd/images/
> 42f4e07b-62a9-43d9-a96a-b41b9cbd6c10/7d221386-f395-45a9-8a0f-195aa9dd2f2c
> type: 4 (virDomainBlockJobType)
> VIR_DOMAIN_BLOCK_JOB_TYPE_ACTIVE_COMMIT = 4 (0x4)
> status: 1 (virConnectDomainEventBlockJobStatus)
> VIR_DOMAIN_BLOCK_JOB_FAILED = 1 (0x1)
> opaque: 8
> 
> So if I am right the issue is not because of monitoring, but because the
> commit operation failed.

In this case blockCommit has failed, but this is ACTIVE_COMMIT (commit of the active layer), while in the backup use case it is always COMMIT, which is a different flow.

The issue we see is:
- Starting merge 1 for a VM
- Starting merge 2 for the same VM
- The second merge detects that the first merge job is gone 65 milliseconds after we started the merge.
- We complete the merge without removing the volume because it is still in the chain.
- Engine continues to the next step, and fails because the volume is in the chain.

We don't know why the job disappeared so quickly. We have 2 options:
- Libvirt does not report the job for some time after blockCommit
- The job failed very quickly, in less than 65 milliseconds

With the attached patches we log block job events at INFO level, so when we reproduce this (or a customer hits this issue again) we will have more information and can decide if vdsm or libvirt needs a change.
We got the debug logs from the customer. There are two disks attached to the VM, so we do two merges together. One finishes OK and the second one fails with an error:

engine.log:

2020-02-03 16:01:16,244+01 INFO [org.ovirt.engine.core.bll.MergeCommandCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [c2854904-0644-4e4d-b2fb-5b96f1c1150c] Merge command (jobId = 06076430-ec86-4e89-9620-e1cc273dab11) has completed for images '8ff7b504-cd6a-4fee-aa01-b6d05d9de252'..'00fe4900-bd5b-477f-97d6-49529fba6279'
2020-02-03 16:01:16,248+01 INFO [org.ovirt.engine.core.bll.MergeCommandCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [c2854904-0644-4e4d-b2fb-5b96f1c1150c] Merge command (jobId = 7570fdd6-f733-42dc-84bf-f1be590fbfdf) has completed for images '36d1eae7-1aa7-40fa-8714-2483c542f4d8'..'ebe43b12-8088-4c56-931f-aa2279f71f06'
...
2020-02-03 16:01:17,373+01 ERROR [org.ovirt.engine.core.bll.MergeStatusCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-1) [c2854904-0644-4e4d-b2fb-5b96f1c1150c] Failed to live merge. Top volume 00fe4900-bd5b-477f-97d6-49529fba6279 is still in qemu chain [e8bc5187-3441-4f6b-a040-9427e94c444f, 8ff7b504-cd6a-4fee-aa01-b6d05d9de252, 00fe4900-bd5b-477f-97d6-49529fba6279]

So volume 00fe4900-bd5b-477f-97d6-49529fba6279 was not actually merged.

vdsm.log.1.xz:

Vdsm reports two libvirt events, each for one of the disks. The first one fails on block commit and the second one does not.

2020-02-03 15:59:54,349+0100 DEBUG (libvirt/events) [virt.vm] (vmId='c38641c7-08f3-445f-afad-a0e4b58ac3d1') unhandled libvirt event (event_name=Unknown id 8, args=('/rhev/data-center/mnt/blockSD/e5697f4c-e58a-4d54-845c-7b0f4f6d4d2d/images/0529f3ad-b646-4087-a8bf-67869b4abcd7/e8bc5187-3441-4f6b-a040-9427e94c444f', 3, 1, 8)) (clientIF:691)   <----- Failed
2020-02-03 15:59:55,414+0100 DEBUG (libvirt/events) [virt.vm] (vmId='c38641c7-08f3-445f-afad-a0e4b58ac3d1') unhandled libvirt event (event_name=Unknown id 8, args=('/rhev/data-center/mnt/blockSD/b04da7bc-3846-4d51-b532-aaa3b0b17df1/images/9b74f3a5-62d6-45c8-94d9-b5347cd73c43/82d70b25-bc02-4fb9-9fe7-ec97fb1d6042', 3, 0, 8)) (clientIF:691)
2020-02-03 15:59:56,879+0100 DEBUG (periodic/2) [virt.vm] (vmId='c38641c7-08f3-445f-afad-a0e4b58ac3d1') Checking job 06076430-ec86-4e89-9620-e1cc273dab11 (vm:5724)
2020-02-03 15:59:56,880+0100 INFO (periodic/2) [virt.vm] (vmId='c38641c7-08f3-445f-afad-a0e4b58ac3d1') Libvirt job 06076430-ec86-4e89-9620-e1cc273dab11 was terminated (vm:5783)
2020-02-03 15:59:56,880+0100 INFO (periodic/2) [virt.vm] (vmId='c38641c7-08f3-445f-afad-a0e4b58ac3d1') Starting cleanup thread for job: 06076430-ec86-4e89-9620-e1cc273dab11 (vm:5791)
...
2020-02-03 15:59:56,882+0100 INFO (periodic/2) [virt.vm] (vmId='c38641c7-08f3-445f-afad-a0e4b58ac3d1') Libvirt job 7570fdd6-f733-42dc-84bf-f1be590fbfdf was terminated (vm:5783)
2020-02-03 15:59:56,882+0100 INFO (periodic/2) [virt.vm] (vmId='c38641c7-08f3-445f-afad-a0e4b58ac3d1') Starting cleanup thread for job: 7570fdd6-f733-42dc-84bf-f1be590fbfdf (vm:5791)
...
2020-02-03 15:59:57,199+0100 INFO (merge/c38641c7) [virt.vm] (vmId='c38641c7-08f3-445f-afad-a0e4b58ac3d1') Synchronizing volume chain after live merge (job 7570fdd6-f733-42dc-84bf-f1be590fbfdf) (vm:6206)
2020-02-03 15:59:57,218+0100 DEBUG (merge/c38641c7) [virt.vm] (vmId='c38641c7-08f3-445f-afad-a0e4b58ac3d1') vdsm chain: ['82d70b25-bc02-4fb9-9fe7-ec97fb1d6042', 'ebe43b12-8088-4c56-931f-aa2279f71f06', '36d1eae7-1aa7-40fa-8714-2483c542f4d8'], libvirt chain: ['36d1eae7-1aa7-40fa-8714-2483c542f4d8', '82d70b25-bc02-4fb9-9fe7-ec97fb1d6042'] (vm:5982)
2020-02-03 15:59:57,250+0100 INFO (merge/c38641c7) [virt.vm] (vmId='c38641c7-08f3-445f-afad-a0e4b58ac3d1') Synchronizing volume chain after live merge (job 06076430-ec86-4e89-9620-e1cc273dab11) (vm:6206)
2020-02-03 15:59:57,264+0100 DEBUG (merge/c38641c7) [virt.vm] (vmId='c38641c7-08f3-445f-afad-a0e4b58ac3d1') vdsm chain: ['e8bc5187-3441-4f6b-a040-9427e94c444f', '00fe4900-bd5b-477f-97d6-49529fba6279', '8ff7b504-cd6a-4fee-aa01-b6d05d9de252'], libvirt chain: ['8ff7b504-cd6a-4fee-aa01-b6d05d9de252', '00fe4900-bd5b-477f-97d6-49529fba6279', 'e8bc5187-3441-4f6b-a040-9427e94c444f'] (vm:5982)

libvirt.log:

The libvirt log reports BLOCK_JOB_ERROR.

2020-02-03 14:59:54.347+0000: 7363: debug : qemuMonitorJSONIOProcessLine:197 : Line [{"timestamp": {"seconds": 1580741994, "microseconds": 347849}, "event": "BLOCK_JOB_ERROR", "data": {"device": "drive-ua-0529f3ad-b646-4087-a8bf-67869b4abcd7", "operation": "write", "action": "report"}}]
2020-02-03 14:59:54.348+0000: 7363: info : qemuMonitorJSONIOProcessLine:212 : QEMU_MONITOR_RECV_EVENT: mon=0x7f537c0523e0 event={"timestamp": {"seconds": 1580741994, "microseconds": 347849}, "event": "BLOCK_JOB_ERROR", "data": {"device": "drive-ua-0529f3ad-b646-4087-a8bf-67869b4abcd7", "operation": "write", "action": "report"}}
2020-02-03 14:59:54.348+0000: 7363: debug : qemuMonitorJSONIOProcessEvent:158 : mon=0x7f537c0523e0 obj=0x55705461b210
2020-02-03 14:59:54.348+0000: 7363: debug : qemuMonitorEmitEvent:1325 : mon=0x7f537c0523e0 event=BLOCK_JOB_ERROR

As this comes from the monitor, I think it was reported by qemu, but I am not sure whether this is the reason for the failure or what could cause this report. The qemu logs are empty.
I think the relevant report from qemu regarding the block job mentioned above is: 2020-02-03 14:59:54.348+0000: 7363: debug : qemuMonitorJSONIOProcessLine:197 : Line [{"timestamp": {"seconds": 1580741994, "microseconds": 348121}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "drive-ua-0529f3ad-b646-4087-a8bf-67869b4abcd7", "len": 53687091200, "offset": 38497484800, "speed": 0, "type": "commit", "error": "No space left on device"}}]
So could it be that we just filled the volume and we do not extend it because the monitoring is disabled? Do you know off the top of your head, Nir?
I just got a customer ticket that seems to be hitting the same problem described here.

PROBLEM: qemu reports "No space left on device" and the commit/merge fails; the engine progresses to MERGE_STATUS and fails as the volume is still there. IMHO it could be a race between extend and commit, see below. The merge is also internal, 89925ca4 (top) into e75e565f (base).

1. SIZES:
~~~
'uuid': u'e75e565f-68a8-4e7e-98df-03b3d0de98d5',
'apparentsize': '5368709120',
'capacity': '64424509440',
'truesize': '5368709120',
~~~
~~~
'uuid': u'89925ca4-d49a-4cab-84ef-0a5c00681a54',
'apparentsize': '4294967296',
'capacity': '64424509440',
'truesize': '4294967296',
~~~

2. VDSM asks libvirt to do blockcommit:
https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vm.py#L5321

2020-02-05 10:25:04,221+0100 INFO (jsonrpc/6) [virt.vm] (vmId='336611e0-7630-427d-89a2-1f683bebc9fa') Starting merge with jobUUID=u'6d918676-0017-43e3-ac29-c579f5c3597a', original chain=8f4882db-a5f7-451e-b403-a7c91fb5161a < 18ad1cfd-17a3-4cb4-8c6d-00d656ba8289 < 36ed054a-15f1-4d1c-82dc-73751a0253af < 9458d950-5304-4737-a452-9ebdb0cbe8ae < 8bf9c60a-d98e-4a37-b765-ce5bb66bbf9b < 5a17ab9b-8b32-4cb8-bfd0-a42df9506667 < 7e852897-fc77-42f3-8e22-92e7de529424 < 4458e029-7364-4f60-b661-2c4542682be5 < 21391d90-4c00-438e-a629-ea4474b642fc < 7980530f-17ff-4e8c-872b-e13c1dc36ca1 < e75e565f-68a8-4e7e-98df-03b3d0de98d5 < 89925ca4-d49a-4cab-84ef-0a5c00681a54 < 758add53-2aa2-41a6-a00a-8dc502e7dbdc < 6624b6fb-366d-41d8-b5e0-d7cc29af3f85 (top), disk='vdb', base='vdb[3]', top='vdb[2]', bandwidth=0, flags=8 (vm:5898)

3. VDSM then asks the SPM to extend the base volume:
https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vm.py#L5347

apparentsize of top = 4294967296
apparentsize of base = 5368709120
[irs] volume_utilization_chunk_mb = 2048
newSize = 4294967296 + 5368709120 + 2048*1024*1024 = 11811160064

2020-02-05 10:25:04,458+0100 INFO (jsonrpc/6) [vdsm.api] START sendExtendMsg(spUUID='59fc7670-0090-006e-0176-0000000000f2', volDict={'newSize': 11811160064, 'domainID': u'333b37f3-f24c-4595-9129-5b742c297be5', 'name': 'vdb', 'poolID': '59fc7670-0090-006e-0176-0000000000f2', 'clock': <Clock(total=0.00*, extend-volume=0.00*)>, 'internal': True, 'volumeID': u'e75e565f-68a8-4e7e-98df-03b3d0de98d5', 'imageID': u'aeb3e0cf-04aa-46f8-9475-9eadff36d195'}, newSize=11811160064, callbackFunc=<bound method Vm.__afterVolumeExtension of <vdsm.virt.vm.Vm object at 0x7fcd500c9d50>>) from=::ffff:172.16.14.10,56642, flow_id=97b4c089-8533-4e38-9997-034b9fc7b963, task_id=9d7fb41b-5985-45a6-a323-122832a52800 (api:48)

4. VDSM starts refreshing the volume:

2020-02-05 10:25:06,514+0100 INFO (mailbox-hsm/3) [storage.LVM] Refreshing LVs (vg=333b37f3-f24c-4595-9129-5b742c297be5, lvs=(u'e75e565f-68a8-4e7e-98df-03b3d0de98d5',)) (lvm:1438)

5. qemu tells libvirt that the job finished with enospc:

[UTC TIME] 2020-02-05 09:25:06.617+0000: 21135: debug : qemuMonitorJSONIOProcessLine:197 : Line [{"timestamp": {"seconds": 1580894706, "microseconds": 617116}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "drive-ua-aeb3e0cf-04aa-46f8-9475-9eadff36d195", "len": 64424509440, "offset": 9763225600, "speed": 0, "type": "commit", "error": "No space left on device"}}]

6. VDSM sees that the libvirt job terminated (which will trigger the engine to proceed to MERGE_STATUS and fail with the volume still in the chain):

2020-02-05 10:25:07,801+0100 INFO (periodic/2) [virt.vm] (vmId='336611e0-7630-427d-89a2-1f683bebc9fa') Libvirt job 6d918676-0017-43e3-ac29-c579f5c3597a was terminated (vm:5782)

7. Refresh completes (from item 4):

2020-02-05 10:25:07,998+0100 INFO (mailbox-hsm/3) [vdsm.api] FINISH refreshVolume return=None from=internal, task_id=3e5b857b-445b-41a5-8f1f-f280541cc43b (api:54)
2020-02-05 10:25:07,999+0100 INFO (mailbox-hsm/3) [vdsm.api] FINISH getVolumeSize return={'truesize': '11811160064', 'apparentsize': '11811160064'} from=internal, task_id=932df045-d343-422b-959c-29046535b54a (api:54)
2020-02-05 10:25:08,000+0100 INFO (mailbox-hsm/3) [virt.vm] (vmId='336611e0-7630-427d-89a2-1f683bebc9fa') Extend volume e75e565f-68a8-4e7e-98df-03b3d0de98d5 completed <Clock(total=3.54, extend-volume=2.05, refresh-volume=1.49)> (vm:1400)

8. Customer tries again, and it works (the base already has enough size?).

* Isn't there a problem here of concurrent blockcommit and volume extend? The blockcommit starts before the extend request, and both run in parallel.
* Is it possible that qemu needs the extended space to do the commit before it's actually extended?
* Shouldn't the extension be done before the commit, waiting for it to complete and refreshing the LV before starting the libvirt job?
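For reference, the newSize computation in item 3 can be checked directly (a quick sketch; the byte values are the apparentsize values and the sendExtendMsg newSize from the logs above):

~~~
top_apparent = 4294967296        # apparentsize of top (89925ca4)
base_apparent = 5368709120       # apparentsize of base (e75e565f)
chunk = 2048 * 1024 * 1024       # volume_utilization_chunk_mb = 2048 MiB, in bytes

new_size = base_apparent + top_apparent + chunk
assert new_size == 11811160064   # matches newSize in the sendExtendMsg call
~~~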
Thanks for pointing this out, Germano, it really seems to be relevant. I checked the last logs we have, and the base volume (which I wrongly believed was a raw disk) is being extended while the merge is happening:

2020-02-03 15:59:51,062+0100 INFO (jsonrpc/5) [virt.vm] (vmId='c38641c7-08f3-445f-afad-a0e4b58ac3d1') Starting merge with jobUUID=u'06076430-ec86-4e89-9620-e1cc273dab11', original chain=8ff7b504-cd6a-4fee-aa01-b6d05d9de252 < 00fe4900-bd5b-477f-97d6-49529fba6279 < e8bc5187-3441-4f6b-a040-9427e94c444f (top), disk='vdb', base='vdb[2]', top='vdb[1]', bandwidth=0, flags=8 (vm:5899)
[...]
2020-02-03 15:59:51,086+0100 DEBUG (jsonrpc/5) [virt.vm] (vmId='c38641c7-08f3-445f-afad-a0e4b58ac3d1') Extension info for drive vdb volume e8bc5187-3441-4f6b-a040-9427e94c444f: BlockInfo(capacity=53687091200L, allocation=2752512L, physical=2147483648L) (vm:1205)
2020-02-03 15:59:51,086+0100 DEBUG (jsonrpc/5) [virt.vm] (vmId='c38641c7-08f3-445f-afad-a0e4b58ac3d1') Requesting an extension for the volume: {'newSize': 9663676416, 'domainID': u'e5697f4c-e58a-4d54-845c-7b0f4f6d4d2d', 'name': 'vdb', 'poolID': '224c9107-4439-4b45-812f-6c88453cb8aa', 'clock': <Clock(total=0.00*, extend-volume=0.00*)>, 'internal': True, 'volumeID': u'8ff7b504-cd6a-4fee-aa01-b6d05d9de252', 'imageID': u'0529f3ad-b646-4087-a8bf-67869b4abcd7'} (vm:1361)

Kevin, can you confirm that this scenario (concurrent extension of the base COW disk and merge with the top disk, with the merge being faster than the extension) can result in the ENOSPC error we see? Thanks!
(In reply to Vojtech Juranek from comment #22)
> Kevin, can you confirm this scenario (concurrent extension of the base COW
> disk and merge with top disk, merge being faster than extension) can result
> into ENOSPC error we see?

Oh, I think I see the problem. :-(

In theory, VDSM should try to always extend the volume in time so that we never hit an ENOSPC error. We know that it doesn't always manage to do that, so it configures the VM with werror=enospc (or is it even werror=stop?) so that instead of seeing an error, the guest is just stopped. VDSM can then extend the volume while the guest is paused and resume the guest afterwards.

Block jobs are independent of guest devices, and stopping the guest wouldn't help them anyway. So they have their own error handling options, which need to be set to 'enospc' so that then the block job can be automatically paused when it hits ENOSPC, VDSM can extend the volume and then continue the job.

The only problem is that block-commit doesn't even expose this error handling option in QMP. Consequently, VDSM also can't have implemented managing the error case properly.

I think this means that we need changes in QEMU, libvirt and VDSM.
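For context, the guest-device half of what Kevin describes is what libvirt exposes as the disk error_policy; error_policy='enospace' corresponds to qemu's werror=enospc. A minimal sketch of such a disk definition (the source and target values are placeholders, not the exact XML RHV generates):

~~~
<disk type='block' device='disk'>
  <!-- error_policy='enospace': on ENOSPC the guest is paused instead of seeing
       an I/O error, so management can extend the LV and resume it.
       Block jobs are not covered by this policy. -->
  <driver name='qemu' type='qcow2' error_policy='enospace'/>
  <source dev='/rhev/data-center/mnt/blockSD/SD_UUID/images/IMG_UUID/VOL_UUID'/>
  <target dev='vda' bus='virtio'/>
</disk>
~~~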
(In reply to Vojtech Juranek from comment #22)
> Thanks for pointing this out Germano, it really seems to be relevant. I
> checked the last logs we have and the base volume (which I wrongly believed
> is raw disk) is being extended while merge is happening:

Exactly. And this extend is not a normal threshold extension on the active layer. This extend is from merge():
https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vm.py#L5347

I find it very weird that VDSM first asks libvirt to do the blockcommit and then triggers the extend. Considering it's extending to the maximum possible size of the merge (top+base+extension+chunk), and taking into account what Kevin said, would it be more logical to first extend, refresh the LV, and then start the blockcommit with the new extended size already seen?

The ticket I attached to this case already hit this a few times. And I don't remember seeing these before; maybe something else changed that makes it easier to hit now?
(In reply to Germano Veit Michel from comment #24) > Exactly. And this extend is not a normal threshold extension on the active > layer. This extend is from merge() > https://github.com/oVirt/vdsm/blob/master/lib/vdsm/virt/vm.py#L5347 Oh, right. I didn't even consider allocating everything upfront. This is obviously much easier than actually implementing the dynamic resizing that I mentioned, so we should probably focus on fixing this approach (which will hopefully require changes only in VDSM). > Considering its extending to the maximum possible size of the merge > (top+base+extension+chunk), and taking into accout what Kevin said, would it > be more logical to first extend, refresh LV and then start the blockcommit > with the new extended size already seen? Indeed, that would make more sense to me.
(In reply to Germano Veit Michel from comment #19)

Germano, I agree this is the issue.

Vdsm is assuming that the base disk contains enough free space, so an asynchronous extend would be quick enough. If the merge failed because the extend was not fast enough, trying the merge again would succeed, as we see in the logs from this case.

This assumption in vdsm was probably a bad idea, but it will be hard to change this to do:
- extend the disk, wait for completion
- start merge

since the job reporting code assumes that the merge job is started immediately, and cannot handle the state when we don't have a libvirt job (it will try to finalize the merge).

Fixing this requires:
- adding a new (preparing) state to the merge job, reporting the job in the preparing step before the libvirt block job is started
- handling the case when vdsm is restarted during this state
- handling extend of an internal volume - sending one extend message and waiting for a reply is not enough. We need to use the same logic used for the active volume (monitor the volume and send new extend messages until the operation is done).
- handling a timeout extending the disk (e.g. SPM is down)
- keeping extend state during the merge to avoid unneeded extending if extending was interrupted
Roman, I suspect the reason for the common merge failures is the incorrect vdsm configuration we already discussed. According to our previous discussion, the hosts in this case use this configuration:

[irs]
volume_utilization_percent = 80
volume_utilization_chunk_mb = 2048

This wrong configuration will delay the volume extend until only 20% free space is available in the last chunk. For example, if the initial volume size is 2 GiB, the next extend will occur only when free space is 0.4 GiB. If the VM is writing quickly, the disk may be almost full when we take a snapshot, so when we merge we may have very little available space in the base disk. If storage is fast, the merge may fill the base disk before we manage to extend the disk (typically 4-6 seconds).

The recommended configuration is:

[irs]
volume_utilization_percent = 20
volume_utilization_chunk_mb = 2048

This will extend the disk when free space is 80% of 2 GiB (1.6 GiB). With these settings a base disk is likely to have enough disk space when we start a merge, so an async extend will be good enough.

To make sure we don't have this issue, can we get /etc/vdsm/vdsm.conf and the contents of /etc/vdsm/vdsm.conf.d/*.conf from all hosts in this DC?
(In reply to Nir Soffer from comment #27) > - handle extend of internal volume - sending one extend message and waiting > for reply > is not enough. We need to use the same logic used for the active volume > (monitor > volume and send new extend message until the operation is done). If I (and Germano) read the code correctly, VDSM is currently extending the base image (the target of the commit job) to the maximum size immediately, so I think actively monitoring the image afterwards is not necessary to fix this bug. Of course, if you want to stop extending the image to the maximum right away and increase the size only step by step like you do for the active image, we would need to implement this kind of monitoring. As I said in comment 23, this will require changes in QEMU and probably libvirt, too. I'm working on the necessary extensions in QEMU so you can actually implement this, but I would prefer if we handle this as an optimisation that is separate from this bug.
(In reply to Kevin Wolf from comment #29)

I'm not suggesting to monitor the write threshold and extend the disk gradually. What I mean is that sending one extend request to the maximum size and waiting for a reply is not enough.

Regarding smarter extend (that we can work on later), I think we can do:
- Find how much free space we have in the base image (e.g. using qemu-img map, using depth=0)
- Find how much data will be written by the merge in the top image (e.g. qemu-img map)
- Extend to the actual required size (instead of base size + top size)

Is this possible using qemu-img map, or maybe we can extend qemu-img measure to report the required size for blockCommit?

Note that we support only merge of consecutive images. The base volume is always the parent of the top volume.
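A rough sketch of the estimation idea using qemu-img map as suggested above (a hypothetical helper, not vdsm code; it counts extents allocated in the top image itself, i.e. depth 0 in the JSON output, and ignores qcow2 metadata overhead):

~~~
import json
import subprocess

def top_allocated_bytes(top_path):
    """Estimate the data a commit will copy from the top volume into its base:
    the sum of data extents allocated in the top image itself (depth == 0)."""
    out = subprocess.check_output([
        "qemu-img", "map", "--output", "json", "-f", "qcow2", top_path])
    extents = json.loads(out)
    return sum(e["length"] for e in extents
               if e["depth"] == 0 and e["data"])
~~~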
Hi Nir,

Roman is OoO, so I'm answering your query. From what I see, almost all hosts have set:
--------
volume_utilization_percent = 80
volume_utilization_chunk_mb = 2048
--------
and a few of them:
--------
volume_utilization_percent = 50
volume_utilization_chunk_mb = 2048
--------
(In reply to Jaroslav Spanko from comment #31)

Hi Jaroslav, let's normalize all hosts to the recommended settings:

[irs]
volume_utilization_percent = 20
volume_utilization_chunk_mb = 2048

I assume the value of 80 was a misunderstanding of the way this configuration works. If the value of 80 was intended to avoid too-early extends, we can use 50 (the default). But using a higher value will lead to more failed merges and more VM pauses when disks are not extended quickly enough.

I expect to see many fewer failed merges 1-2 days after vdsm is restarted with this change. Note that vdsm must be restarted to enable this change.

Note that /etc/vdsm/vdsm.conf should not be modified. The recommended way to change host configuration is to install a drop-in file in:

/etc/vdsm/vdsm.conf.d/50-local.conf

with the changes above. The same file can be deployed on all hosts.
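Concretely, the drop-in file would contain exactly the settings above (same content on every host, followed by a vdsm restart):

~~~
# /etc/vdsm/vdsm.conf.d/50-local.conf
[irs]
volume_utilization_percent = 20
volume_utilization_chunk_mb = 2048
~~~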
Germano, can you check the configuration mentioned in comment 28 for the case related to comment 19?
Hi Nir,

Of course:

[irs]
volume_utilization_percent = 25
volume_utilization_chunk_mb = 2048

The file was last modified on Nov 26th and the snapshot was created after that.

Correct me if I'm wrong, but are you suggesting that if the base volume of the merge had been extended more aggressively when it was the active layer in the past, then there would have been more free space for the initial steps of the merge, giving VDSM more time to complete the extend fired by the merge?
(In reply to Germano Veit Michel from comment #34)
> Correct me if I'm wrong, but are you suggesting that if the base volume of
> the merge was extended more aggressively when it was the active layer in the
> past, then there would have more free space for the initial steps of the
> merge, giving VDSM more time to complete the extend fired by the merge?

Right. With:

[irs]
volume_utilization_percent = 25
volume_utilization_chunk_mb = 2048

a new snapshot will be an empty 2 GiB volume. Vdsm registers a high watermark event for allocation=512 MiB. When qemu reaches the threshold, vdsm gets an event, triggers the next extend to 4 GiB, and registers for the next high watermark event when allocation is 2.5 GiB.

So a volume is likely to have 0-3.5 GiB free space with this setting and is less likely to fail with ENOSPC during merge.
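The numbers above are consistent with the next block threshold being set one "free fraction of a chunk" below the current physical size. A sketch of that arithmetic only (not necessarily vdsm's actual formula):

~~~
GiB = 1024**3

def next_threshold(physical, utilization_percent=25, chunk_mb=2048):
    # extend is requested once allocation crosses this value, leaving
    # (100 - utilization_percent)% of a chunk still free
    free = chunk_mb * 1024**2 * (100 - utilization_percent) // 100
    return physical - free

assert next_threshold(2 * GiB) == 512 * 1024**2   # 0.5 GiB, as in the comment
assert next_threshold(4 * GiB) == int(2.5 * GiB)  # 2.5 GiB after extending to 4 GiB
~~~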
(In reply to Nir Soffer from comment #30) > (In reply to Kevin Wolf from comment #29) > I'm not suggesting to monitor write threshold and extend the disk > gradually. What I mean is that sending one extend request to maximum size > and waiting for reply is not enough. Oh, ok. Why would that be? Can the first request report success, but the volume wasn't resized anyway? > Regarding smarter extend (that we can work on later), I think we can do: > > - Find how much free space we have in the base image > (e.g. using qemu-img map, using depth=0) > - Find how much data will be written by the merge in the top image > (.e.g qmeu-img map) > - Extend to the actual required size (instead of base size + top size) This works as long as we are talking about committing down from a read-only intermediate image to its backing file. When committing the active image, you obviously can't tell beforehand how much space you will need because it depends on how much new space the guest will allocate while the operation is running. > Is this possible using qemu-img map, or maybe we can extend qemu-img measure > to report the required size for blockCommit? Intuitively this sounds like a job for 'qemu-img measure'. As long as we don't have support for it, I guess 'qemu-img map' and generously estimating the metadata overhead would do the job. > Note that we support only merge of consecutive images. Base volume is always > parent of top volume. You mean that you use commit only to remove a single image from the backing chain, never two or more images at once? Ok, that's good to know, though it shouldn't make much of a difference from QEMU's perspective.
(In reply to Kevin Wolf from comment #36)
> (In reply to Nir Soffer from comment #30)
> > (In reply to Kevin Wolf from comment #29)
> > I'm not suggesting to monitor write threshold and extend the disk
> > gradually. What I mean is that sending one extend request to maximum size
> > and waiting for reply is not enough.
> 
> Oh, ok. Why would that be? Can the first request report success, but the
> volume wasn't resized anyway?

Resizing is done on the SPM host. The role can move to another host just when you asked to extend, so I'm not sure sending one request is enough. When monitoring volumes we send one request every 2 seconds and the code knows how to ignore duplicate requests.

So this is not a trivial change of moving code around.
For some reason we have 2 bugs about the same thing. I'll add more info in this bug about how it can be tested.

The issue that we tried to fix here is not waiting until the base volume is extended before starting the commit. If the base volume does not have enough free space for the data in the top volume, and the commit is fast, the commit will fail with ENOSPC.

The issue can happen both in an internal merge and in an active merge. Live merge may be more tricky to reproduce, but we need to test both flows.

To make the commit fail, we must have more data in the top volume than the amount of free space in the base volume. Typically we have 0.5g-1.5g free space in a volume. When free space reaches 0.5g, we extend the volume by 1g, and at this point free space is 1.5g.

## How to reproduce with older vdsm versions

In both flows we need to start with a VM that has a thin disk on block storage. If the VM has a preallocated disk this cannot be reproduced, since we don't extend preallocated disks.

Active merge flow

1. Create snapshot 1
   At this point the amount of free space in the previous volume is 0.5-1.5g.
2. Write 2g of data to the top volume
   dd if=/dev/urandom bs=1M count=2048 of=/root/data oflag=direct conv=fsync
   Wait until the command completes. Starting the merge before the command completes lowers the chance to reproduce the issue.
3. Start a delete snapshot operation for snapshot 1
   This will start committing data from the top volume to the base volume, and at the same time start extending the base volume.

Internal merge flow

1. Create snapshot 1
2. Write 2g of data to the top volume
   dd if=/dev/urandom bs=1M count=2048 of=/root/data oflag=direct conv=fsync
3. Create snapshot 2
4. Start a delete snapshot operation for snapshot 1
   This again starts a commit and an extend operation at the same time, and may fail in the same way.

Typically extend finishes in 2-6 seconds. If you have a strong server and fast storage, committing 2g of data can take 3-4 seconds. If you have an old server and poor storage, the commit can take 10-15 seconds and the chance to reproduce the error is very low.

To increase the chance to reproduce this:

1. Use a strong server and fast storage.
2. Make extend slower.
   There is no easy way to do this, but since extend happens on the SPM host, the SPM can do only 5 concurrent extends, and extend time is mostly constant, one way to make extend slower is to run many concurrent delete snapshot operations. The easiest way to do this is to delete a snapshot with many disks (I tested a VM with 14 disks) and test several VMs at the same time.
3. Run many tests.
   Running the test once is not likely to reproduce the issue; 10-20 runs are more likely to fail.

In summary, this is not a functional test that can reliably reproduce the issue and run periodically to ensure that we don't have a regression. This is more of a stress test.

## How to test the new version

Current vdsm (4.40.50.7) does not run extend and commit at the same time, so it can never break because commit was faster than extend. We want to check the same flows (internal merge, active merge), both with many disks (e.g. 10? 20?), to ensure that concurrent merges work correctly.

To make it easier to debug, I would create 4 tests:
1. internal merge, one disk
2. active merge, one disk
3. internal merge, 10 disks
4. active merge, 10 disks

The first 2 tests should run for every build. The last 2 tests may run less often, maybe once per release.

Additionally, we need a stress test running tests 3 and 4 with 5-10 VMs at the same time.
This is mostly a stress test for the SPM, which has to extend 50-100 volumes at the same time. This test probably belongs to the scale team, and they should have a test like this, developed about a year ago when we started to work on a related bug.
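For the record, the per-VM part of the flows above can be scripted; a rough sketch of the active merge flow using the oVirt Python SDK (engine URL, credentials and VM name are placeholders, and the 2g write still has to be done inside the guest between the steps):

~~~
import time
import ovirtsdk4 as sdk
import ovirtsdk4.types as types

# Placeholder engine URL and credentials.
conn = sdk.Connection(url="https://engine.example.com/ovirt-engine/api",
                      username="admin@internal", password="secret", insecure=True)

vms_service = conn.system_service().vms_service()
vm = vms_service.list(search="name=test-vm")[0]
snaps_service = vms_service.vm_service(vm.id).snapshots_service()

# 1. Create snapshot 1 and wait until it is ready.
snap = snaps_service.add(types.Snapshot(description="snapshot 1",
                                        persist_memorystate=False))
snap_service = snaps_service.snapshot_service(snap.id)
while snap_service.get().snapshot_status != types.SnapshotStatus.OK:
    time.sleep(5)

# 2. Write ~2g of data inside the guest (dd as described above), then:

# 3. Remove snapshot 1, which triggers the commit and the base volume extend.
snap_service.remove()
~~~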
# Test results

We did 3 runs with different configurations:

- run1 - ansible script (delete-snapshot.yml) creating 4 VMs, creating a snapshot in all VMs, writing data to some disks in all VMs, removing the snapshot in all VMs, and removing the VMs. In this run all the VMs are performing the same operation at the same time.
- run2 - python script (test.py), running 8 independent workers. Each worker is doing the same flow: creating a VM, creating a snapshot, writing data, removing the snapshot, and removing the VM. In this run the VMs are performing different operations at the same time, more like real world flows.
- run3 - same as run 2, with improved error handling in the test script, in case the engine returns an unexpected response.

In all runs, we used a test VM with:

- 3 qcow disks - writing 2G to each disk before removing the snapshot, so 2G of data are committed during the merge.
- 3 empty cow disks - these disks are extended before the merge, but the merge is very quick since there is no data to commit.
- 3 empty raw disks - these disks are not extended before the merge, and the merge is very quick because there is no data to merge.

The test VMs were cloned from the template, without depending on the template.

## Stats

We performed a large number of remove snapshot operations without any error in this flow.

-------------------------------------------------------------------------
remove snapshot operations
run    time     vms    start    extend[2]    commit    pivot
-------------------------------------------------------------------------
1      48:00    4      8030     5624         8030      8025[1]
2      17:27    8      6140     4298         6140      6140
3      3:45     8      1730     1211         1730      1730
-------------------------------------------------------------------------
total                  15900    11133        15900     15895

[1] There were no errors. The 5 missing pivots were probably the result of aborting the test after a few minutes without clearing the logs.
[2] raw disks are not extended during merge.

## Comparing extend times to a real system

Extend time depends on the number of concurrent extend requests and the I/O load on the SPM and on the host requesting the extend. We can see that in the test we had about twice the load compared with the user system.

-------------------------------------------------------
system     count    minimum    average    maximum
-------------------------------------------------------
test       11133    2.19       6.69       26.75
user[1]    1237     2.23       3.84       10.65
-------------------------------------------------------

[1] timing info from the user sosreport related to this bug.

## iostat

We recorded iostat for the multipath device used by the storage domain.

-------------------------------
iostat    average    maximum
-------------------------------
r/s       145.25     7663.00
w/s       86.50      2215.50
rMB/s     51.75      795.82
wMB/s     81.96      1429.28
%util     15.65      101.70
-------------------------------

## Errors

We did not have any failures during remove snapshot operations, but the stress test exposed issues in other parts of the system.

--------------
run    errors
--------------
1      1
2      116
3      6
--------------

### run1 errors

Looks like a dmsetup issue. We don't expect changes in multipath devices during this test.
2021-03-11 21:25:01,126+0200 ERROR (mpathhealth) [storage.mpathhealth] multipath health update failed (mpathhealth:98)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/mpathhealth.py", line 96, in _run
    self._update_status()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/mpathhealth.py", line 111, in _update_status
    for guid, paths in devicemapper.multipath_status().items():
  File "/usr/lib/python3.6/site-packages/vdsm/storage/devicemapper.py", line 183, in multipath_status
    for guid, paths in dmsetup.status(target="multipath"):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/dmsetup.py", line 31, in status
    lines = run_status(target=target).decode("utf-8").splitlines()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/dmsetup.py", line 45, in run_status
    return supervdsm.getProxy().dmsetup_run_status(target)
  File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 56, in __call__
    return callMethod()
  File "/usr/lib/python3.6/site-packages/vdsm/common/supervdsm.py", line 54, in <lambda>
    **kwargs)
  File "<string>", line 2, in dmsetup_run_status
  File "/usr/lib64/python3.6/multiprocessing/managers.py", line 772, in _callmethod
    raise convert_to_error(kind, result)
vdsm.common.cmdutils.Error: Command ['/usr/sbin/dmsetup', 'status', '--target', 'multipath'] failed with rc=1 out=b'360002ac000000000000000ba00021f6b: 0 4294967296 multipath 2 0 1 0 1 1 A 0 6 2 8:48 A 1 0 1 8:80 A 1 0 1 8:112 A 1 0 1 8:144 A 2 0 1 8:176 A 2 0 1 8:208 A 2 0 1 \n360002ac0000000000000085500021f6b: 0 6442450944 multipath 2 0 1 0 1 1 A 0 6 2 8:32 A 1 0 1 8:64 A 1 0 1 8:96 A 1 0 1 8:128 A 2 0 1 8:160 A 2 0 1 8:192 A 2 0 1 \n' err=b'Device does not exist.\nCommand failed.\n'

### run2 errors

Looks like a race: the extend request completed after the VM was removed. This is an expected condition that should not log an error. This error appears 94 times.

2021-03-14 07:02:47,828+0200 ERROR (mailbox-hsm/2) [storage.TaskManager.Task] (Task='554b9a57-0967-424b-ac05-26d1f967b957') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1481, in after_volume_extension
    self._update_drive_volume_size(drive, volSize)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1504, in _update_drive_volume_size
    self.drive_monitor.set_threshold(drive, volsize.apparentsize)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/drivemonitor.py", line 121, in set_threshold
    self._vm._dom.setBlockThreshold(drive.name, threshold)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 50, in __getattr__
    % self.vmid)
vdsm.virt.virdomain.NotConnectedError: VM '1a18f0ab-689d-4268-8e89-472d454c01a5' was not defined yet or was undefined

Looking at the log around the next error, this is an issue with managing prepared volumes. The measure command prepared the volume, but a copy_data job that ended just at this point tore down the volume. This error appears 12 times.
2021-03-14 06:12:51,906+0200 ERROR (jsonrpc/0) [storage.TaskManager.Task] (Task='31fbe49e-325d-41dd-b186-fa122e77d1ad') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-163>", line 2, in measure
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 3198, in measure
    is_block=vol.is_block()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/qemuimg.py", line 156, in measure
    out = _run_cmd(cmd)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/qemuimg.py", line 578, in _run_cmd
    raise cmdutils.Error(cmd, rc, out, err)
vdsm.common.cmdutils.Error: Command ['/usr/bin/qemu-img', 'measure', '--output', 'json', '-O', 'raw', 'json:{"file": {"driver": "host_device", "filename": "/rhev/data-center/mnt/blockSD/c917dcb1-fe03-44c4-a9d3-cdb8673b66b4/images/80f3ba99-b010-4ce5-b2e2-42de40ac5c68/bb976b27-3a9d-4e13-9135-c597fdfae854"}, "driver": "raw"}'] failed with rc=1 out=b'' err=b'qemu-img: Could not open \'json:{"file": {"driver": "host_device", "filename": "/rhev/data-center/mnt/blockSD/c917dcb1-fe03-44c4-a9d3-cdb8673b66b4/images/80f3ba99-b010-4ce5-b2e2-42de40ac5c68/bb976b27-3a9d-4e13-9135-c597fdfae854"}, "driver": "raw"}\': Could not open \'/rhev/data-center/mnt/blockSD/c917dcb1-fe03-44c4-a9d3-cdb8673b66b4/images/80f3ba99-b010-4ce5-b2e2-42de40ac5c68/bb976b27-3a9d-4e13-9135-c597fdfae854\': No such file or directory\n'

### run3 errors

This looks like an expected condition when resuming the VM after an extend. It should not be logged as an error, maybe as a debug message. Appears 2 times.

2021-03-15 02:24:15,280+0200 ERROR (mailbox-hsm/3) [virt.vm] (vmId='c07bd71b-d8bc-4ddd-82d7-fe14e5bd24ca') Cannot resume VM (vm:1518)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1508, in _resume_if_needed
    self.cont()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 1608, in cont
    self._underlyingCont()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 4014, in _underlyingCont
    self._dom.resume()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 2180, in resume
    if ret == -1: raise libvirtError ('virDomainResume() failed', dom=self)
libvirt.libvirtError: Requested operation is not valid: domain is already running

Same measure issue as in run2, 2 incidents.
2021-03-15 04:04:34,850+0200 ERROR (jsonrpc/4) [storage.Dispatcher] FINISH measure error=Command ['/usr/bin/qemu-img', 'measure', '--output', 'json', '-O', 'raw', 'json:{"file": {"driver": "host_device", "filename": "/rhev/data-center/mnt/blockSD/c917dcb1-fe03-44c4-a9d3-cdb8673b66b4/images/9d35f6f4-a65a-4fdf-930e-7acef8202a22/d263f8e6-8b31-41b0-b28c-146f02482b50"}, "driver": "raw"}'] failed with rc=1 out=b'' err=b'qemu-img: Could not open \'json:{"file": {"driver": "host_device", "filename": "/rhev/data-center/mnt/blockSD/c917dcb1-fe03-44c4-a9d3-cdb8673b66b4/images/9d35f6f4-a65a-4fdf-930e-7acef8202a22/d263f8e6-8b31-41b0-b28c-146f02482b50"}, "driver": "raw"}\': Could not open \'/rhev/data-center/mnt/blockSD/c917dcb1-fe03-44c4-a9d3-cdb8673b66b4/images/9d35f6f4-a65a-4fdf-930e-7acef8202a22/d263f8e6-8b31-41b0-b28c-146f02482b50\': No such file or directory\n' (dispatcher:87)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/dispatcher.py", line 74, in wrapper
    result = ctask.prepare(func, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 110, in wrapper
    return m(self, *a, **kw)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 1198, in prepare
    raise self.error
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-163>", line 2, in measure
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 3198, in measure
    is_block=vol.is_block()
  File "/usr/lib/python3.6/site-packages/vdsm/storage/qemuimg.py", line 156, in measure
    out = _run_cmd(cmd)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/qemuimg.py", line 578, in _run_cmd
    raise cmdutils.Error(cmd, rc, out, err)
vdsm.common.cmdutils.Error: Command ['/usr/bin/qemu-img', 'measure', '--output', 'json', '-O', 'raw', 'json:{"file": {"driver": "host_device", "filename": "/rhev/data-center/mnt/blockSD/c917dcb1-fe03-44c4-a9d3-cdb8673b66b4/images/9d35f6f4-a65a-4fdf-930e-7acef8202a22/d263f8e6-8b31-41b0-b28c-146f02482b50"}, "driver": "raw"}'] failed with rc=1 out=b'' err=b'qemu-img: Could not open \'json:{"file": {"driver": "host_device", "filename": "/rhev/data-center/mnt/blockSD/c917dcb1-fe03-44c4-a9d3-cdb8673b66b4/images/9d35f6f4-a65a-4fdf-930e-7acef8202a22/d263f8e6-8b31-41b0-b28c-146f02482b50"}, "driver": "raw"}\': Could not open \'/rhev/data-center/mnt/blockSD/c917dcb1-fe03-44c4-a9d3-cdb8673b66b4/images/9d35f6f4-a65a-4fdf-930e-7acef8202a22/d263f8e6-8b31-41b0-b28c-146f02482b50\': No such file or directory\n'

## SPM errors

Some create VM and create snapshot operations failed on the SPM.

This error fails the create snapshot operation - the engine sent a wrong capacity value. The error appears 54 times. We created 15900 snapshots during this test; I have no idea why the engine sent a wrong value 54 times (0.3%).

2021-03-12 17:12:21,873+0200 ERROR (tasks/8) [storage.Volume] Requested capacity 1073741824 < parent capacity 3221225472 (volume:1227)
...
2021-03-12 17:12:22,065+0200 ERROR (tasks/8) [storage.TaskManager.Task] (Task='4e2810cb-eb3d-4f61-8dd8-f13ce0aec4e0') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 350, in run
    return self.cmd(*self.argslist, **self.argsdict)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/securable.py", line 79, in wrapper
    return method(self, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sp.py", line 1904, in createVolume
    initial_size=initialSize, add_bitmaps=addBitmaps)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/sd.py", line 1013, in createVolume
    initial_size=initial_size, add_bitmaps=add_bitmaps)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/volume.py", line 1228, in create
    raise se.InvalidParameterException("capacity", capacity)
vdsm.storage.exception.InvalidParameterException: Invalid parameter: 'capacity=1073741824'

This looks like a cleanup failure for a non-existing volume. Not sure why this is logged as an error, and it is also logged twice for the same failure. The error appears 478(!) times.

2021-03-12 17:13:03,633+0200 ERROR (jsonrpc/7) [storage.TaskManager.Task] (Task='33aaca48-08b7-4f67-be5a-edc826f4e2b0') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-113>", line 2, in revertTask
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 2341, in revertTask
    return self.taskMng.revertTask(taskID=taskID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/taskManager.py", line 161, in revertTask
    t = self._getTask(taskID)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/taskManager.py", line 85, in _getTask
    raise se.UnknownTask(taskID)
vdsm.storage.exception.UnknownTask: Task id unknown: ('5ff89966-3b7d-46c1-a626-ce93f8c99668',)
2021-03-12 17:13:03,634+0200 ERROR (jsonrpc/7) [storage.Dispatcher] FINISH revertTask error=Task id unknown: ('5ff89966-3b7d-46c1-a626-ce93f8c99668',) (dispatcher:83)

The next one looks like an lvm issue - we have the lvm error in the error log. Appears 34 times.
2021-03-12 14:59:59,264+0200 ERROR (jsonrpc/1) [storage.TaskManager.Task] (Task='40725fb6-468a-45b3-8df9-c7fd629d2f10') Unexpected error (task:880)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/storage/task.py", line 887, in _run
    return fn(*args, **kargs)
  File "<decorator-gen-169>", line 2, in prepareImage
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 50, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/hsm.py", line 3282, in prepareImage
    imgPath = dom.activateVolumes(imgUUID, imgVolumes)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/blockSD.py", line 1398, in activateVolumes
    lvm.activateLVs(self.sdUUID, volUUIDs)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1735, in activateLVs
    _refreshLVs(vgName, active)
  File "/usr/lib/python3.6/site-packages/vdsm/storage/lvm.py", line 1773, in _refreshLVs
    raise se.LogicalVolumeRefreshError("%s failed" % list2cmdline(cmd))
vdsm.storage.exception.LogicalVolumeRefreshError: Cannot refresh Logical Volume: ('lvchange --refresh c917dcb1-fe03-44c4-a9d3-cdb8673b66b4/92e59ca2-fe4f-47f9-ab0b-481262102e25 failed',)

This seems to be an lvm issue - this is the error from lvm for this failure:

2021-03-12 14:59:59,263+0200 WARN (jsonrpc/1) [storage.LVM] All 1 tries have failed: cmd=['/sbin/lvm', 'lvchange', '--config', 'devices { preferred_names=["^/dev/mapper/"] ignore_suspended_devices=1 write_cache_state=0 disable_after_error_count=3 filter=["a|^/dev/mapper/360002ac000000000000000ba00021f6b$|^/dev/mapper/360002ac0000000000000085500021f6b$|", "r|.*|"] hints="none" obtain_device_list_from_udev=0 } global { locking_type=1 prioritise_write_locks=1 wait_for_locks=1 use_lvmetad=0 } backup { retain_min=50 retain_days=0 }', '--refresh', 'c917dcb1-fe03-44c4-a9d3-cdb8673b66b4/92e59ca2-fe4f-47f9-ab0b-481262102e25'] rc=5 err=[' device-mapper: resume ioctl on (253:12) failed: Invalid argument', ' Unable to resume c917dcb1--fe03--44c4--a9d3--cdb8673b66b4-92e59ca2--fe4f--47f9--ab0b--481262102e25 (253:12).', ' Failed to reactivate c917dcb1-fe03-44c4-a9d3-cdb8673b66b4/92e59ca2-fe4f-47f9-ab0b-481262102e25.', ' Releasing activation in critical section.', ' Releasing activation in critical section.', ' libdevmapper exiting with 1 device(s) still suspended.'] (lvm:538)

This seems to be an lvm caching issue. We cannot find the image to delete. Appears 55 times.

2021-03-12 15:01:00,616+0200 ERROR (jsonrpc/7) [storage.HSM] Empty or not found image 96806abd-2d1c-4b33-89ec-26af95a171ac in SD c917dcb1-fe03-44c4-a9d3-cdb8673b66b4.
...

## Links

The test files are maintained in this repo:
https://gitlab.com/nirs/ovirt-stress/-/tree/master/delete-snapshot
As we didn't find any regression and we didn't have any failures in remove snapshot operations during the manual flows and the stress test, this bug can be considered verified. Moving to VERIFIED.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Moderate: RHV RHEL Host (ovirt-host) 4.4.z [ovirt-4.4.5] security, bug fix, enhancement), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2021:1184
*** Bug 1913448 has been marked as a duplicate of this bug. ***