Bug 1711643
Summary: qemu aborts in blockCommit: qemu-kvm: block.c:3486: bdrv_replace_node: Assertion `!({ _Static_assert(!(sizeof(*&from->in_flight) > 8), "not expecting: " "sizeof(*&from->in_flight) > ATOMIC_REG_SIZE"); __atomic_load_n(&from->in_flight, 0); })' failed.

Product: Red Hat Enterprise Linux 7
Component: qemu-kvm-rhev
Version: 7.7
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Avihai <aefrat>
Assignee: Kevin Wolf <kwolf>
QA Contact: aihua liang <aliang>
CC: aliang, bcholler, bugs, coli, dhuertas, gwatson, jinzhao, juzhang, kwolf, mkalinin, mtessun, nsoffer, virt-maint
Target Milestone: rc
Hardware: Unspecified
OS: Unspecified
Fixed In Version: qemu-kvm-rhev-2.12.0-35.el7
Cloned to: 1750322 (view as bug list)
Last Closed: 2020-03-31 14:34:48 UTC
Type: Bug
oVirt Team: Storage
There is no 'version' 4.2, so it's left empty for now. Marked as a regression, as this issue was not seen in this test until now (rhv-4.2.10-1), which includes the new vdsm-4.20.49-3.el7ev.x86_64 with some sensitive bug fixes.

This was seen in the qemu log of the relevant VM (vm_TestCase6052_1906050852.log); the VM crashed with the assert below. Attaching vm_TestCase6052_1906050852.log. The libvirt log for that time was not available.

2019-05-19T03:12:12.993666Z qemu-kvm: warning: CPU(s) not present in any NUMA nodes: CPU 1 [socket-id: 1, core-id: 0, thread-id: 0], CPU 2 [socket-id: 2, core-id: 0, thread-id: 0], CPU 3 [socket-id: 3, core-id: 0, thread-id: 0], CPU 4 [socket-id: 4, core-id: 0, thread-id: 0], CPU 5 [socket-id: 5, core-id: 0, thread-id: 0], CPU 6 [socket-id: 6, core-id: 0, thread-id: 0], CPU 7 [socket-id: 7, core-id: 0, thread-id: 0], CPU 8 [socket-id: 8, core-id: 0, thread-id: 0], CPU 9 [socket-id: 9, core-id: 0, thread-id: 0], CPU 10 [socket-id: 10, core-id: 0, thread-id: 0], CPU 11 [socket-id: 11, core-id: 0, thread-id: 0], CPU 12 [socket-id: 12, core-id: 0, thread-id: 0], CPU 13 [socket-id: 13, core-id: 0, thread-id: 0], CPU 14 [socket-id: 14, core-id: 0, thread-id: 0], CPU 15 [socket-id: 15, core-id: 0, thread-id: 0]
2019-05-19T03:12:12.993703Z qemu-kvm: warning: All CPU(s) up to maxcpus should be described in NUMA config, ability to start up with partial NUMA mappings is obsoleted and will be removed in future
qemu-kvm: block.c:3486: bdrv_replace_node: Assertion `!({ _Static_assert(!(sizeof(*&from->in_flight) > 8), "not expecting: " "sizeof(*&from->in_flight) > ATOMIC_REG_SIZE"); __atomic_load_n(&from->in_flight, 0); })' failed.
2019-05-19 03:13:18.148+0000: shutting down, reason=crashed

Created attachment 1570830 [details]
qemu_vm_TestCase6052_1906050852_log
Ran the same TestCase6052 on glusterfs with the same environment; the issue did not reproduce. Maybe this is a reproduction of Bug 1601212.

(In reply to Avihai from comment #2)
> qemu-kvm: block.c:3486: bdrv_replace_node: Assertion `!({
> _Static_assert(!(sizeof(*&from->in_flight) > 8), "not expecting: "
> "sizeof(*&from->in_flight) > ATOMIC_REG_SIZE");
> __atomic_load_n(&from->in_flight, 0); })' failed.
> 2019-05-19 03:13:18.148+0000: shutting down, reason=crashed

An assert at runtime must be a qemu bug, and it does not look like bug 1601212 to me. Kevin, what do you think?

More context seen in the logs from attachment 1570817 [details]:

We have 5 concurrent merge operations:

$ grep 'START merge' *.log
hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,191+0300 INFO (jsonrpc/5) [api.virt] START merge(drive={u'imageID': u'4700cd22-a5ce-4e39-af35-894519c71bca', u'volumeID': u'705a6902-2b36-4883-9482-427885c5c84b', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'bc1d2b00-54ca-4550-933c-f2ee6ed132ca', topVolUUID=u'21e975df-872c-4930-8c81-358b6821eef9', bandwidth=u'0', jobUUID=u'6e010748-f6d6-49fb-9479-49d44b738737') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46)
hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,227+0300 INFO (jsonrpc/0) [api.virt] START merge(drive={u'imageID': u'6b08223e-df65-473b-85a3-55fce42c18a4', u'volumeID': u'07441e1f-e4fe-46ce-bc59-f41cea15d2e4', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'e87a1c47-1117-421c-8200-5bef3d03ff1a', topVolUUID=u'c755441e-2602-424a-bc0b-beb099f0d987', bandwidth=u'0', jobUUID=u'3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46)
hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,228+0300 INFO (jsonrpc/6)
[api.virt] START merge(drive={u'imageID': u'3c678812-3ac8-4e9c-801e-e9fc9bd376d9', u'volumeID': u'87fd8fc6-255a-416f-8eca-f8c1ede93f63', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'22d4d570-ecd1-40f2-b585-8f0e6a7635cc', topVolUUID=u'4f6af503-c703-44a0-afb2-67581bc5b6f8', bandwidth=u'0', jobUUID=u'e55c7454-e221-4e8e-a075-f945becd2277') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,274+0300 INFO (jsonrpc/2) [api.virt] START merge(drive={u'imageID': u'8979b329-cb70-47b9-9ee9-83716713fd50', u'volumeID': u'21577de7-2583-4c96-a409-5e5cce9ac1ef', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'6346a581-33dd-420b-aae2-8765ea1477dd', topVolUUID=u'a8aae32a-d705-4dcb-a8f6-a3320c673635', bandwidth=u'0', jobUUID=u'aae8cb12-56ba-415c-90e0-de218304e8b5') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,275+0300 INFO (jsonrpc/1) [api.virt] START merge(drive={u'imageID': u'ddd06cff-2643-4c7a-bd39-16c649ffbb3d', u'volumeID': u'f7973684-d3fe-4ecc-8305-0920d965e593', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'85683cc5-0b81-4ecd-b7d4-f6cc409c8679', topVolUUID=u'2c646920-b4e5-407b-942f-c9c72cb89f30', bandwidth=u'0', jobUUID=u'ce2eb173-0e7c-4188-a79b-4b67ec7de629') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46) Merge job 6e010748-f6d6-49fb-9479-49d44b738737 - failed $ grep 6e010748-f6d6-49fb-9479-49d44b738737 *.log hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,191+0300 INFO (jsonrpc/5) [api.virt] START merge(drive={u'imageID': 
u'4700cd22-a5ce-4e39-af35-894519c71bca', u'volumeID': u'705a6902-2b36-4883-9482-427885c5c84b', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'bc1d2b00-54ca-4550-933c-f2ee6ed132ca', topVolUUID=u'21e975df-872c-4930-8c81-358b6821eef9', bandwidth=u'0', jobUUID=u'6e010748-f6d6-49fb-9479-49d44b738737') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:17,070+0300 INFO (jsonrpc/5) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Starting merge with jobUUID=u'6e010748-f6d6-49fb-9479-49d44b738737', original chain=798d3d8a-f034-4af7-ac6a-c7a46107c632 < bc1d2b00-54ca-4550-933c-f2ee6ed132ca < 21e975df-872c-4930-8c81-358b6821eef9 < 6cdf5a03-075d-40b9-9eb2-27a2da38e519 < 705a6902-2b36-4883-9482-427885c5c84b (top), disk='vda', base='vda[3]', top='vda[2]', bandwidth=0, flags=8 (vm:6014) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:18,683+0300 ERROR (jsonrpc/5) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Live merge failed (job: 6e010748-f6d6-49fb-9479-49d44b738737) (vm:6020) Merge job 3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38 - succeeded $ grep 3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38 *.log hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,227+0300 INFO (jsonrpc/0) [api.virt] START merge(drive={u'imageID': u'6b08223e-df65-473b-85a3-55fce42c18a4', u'volumeID': u'07441e1f-e4fe-46ce-bc59-f41cea15d2e4', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'e87a1c47-1117-421c-8200-5bef3d03ff1a', topVolUUID=u'c755441e-2602-424a-bc0b-beb099f0d987', bandwidth=u'0', jobUUID=u'3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,323+0300 INFO (jsonrpc/0) [virt.vm] 
(vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Starting merge with jobUUID=u'3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38', original chain=e87a1c47-1117-421c-8200-5bef3d03ff1a < c755441e-2602-424a-bc0b-beb099f0d987 < cb52ff05-4dc2-4da1-b6c7-0ca1d1764f34 < 07441e1f-e4fe-46ce-bc59-f41cea15d2e4 (top), disk='vdb', base='vdb[3]', top='vdb[2]', bandwidth=0, flags=8 (vm:6014) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:16,425+0300 INFO (jsonrpc/0) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Libvirt job 3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38 was terminated (vm:5894) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:16,426+0300 INFO (jsonrpc/0) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Starting cleanup thread for job: 3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38 (vm:5902) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:16,531+0300 INFO (merge/e74bec08) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Synchronizing volume chain after live merge (job 3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38) (vm:6360) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:16,729+0300 INFO (merge/e74bec08) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Synchronization completed (job 3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38) (vm:6369) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:17,012+0300 INFO (jsonrpc/1) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Cleanup thread <vdsm.virt.vm.LiveMergeCleanupThread object at 0x7fc9c41673d0> successfully completed, untracking job 3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38 (base=e87a1c47-1117-421c-8200-5bef3d03ff1a, top=c755441e-2602-424a-bc0b-beb099f0d987) (vm:5845) Merge job e55c7454-e221-4e8e-a075-f945becd2277 - succeeded $ grep e55c7454-e221-4e8e-a075-f945becd2277 *.log hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,228+0300 INFO (jsonrpc/6) [api.virt] START merge(drive={u'imageID': u'3c678812-3ac8-4e9c-801e-e9fc9bd376d9', u'volumeID': u'87fd8fc6-255a-416f-8eca-f8c1ede93f63', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': 
u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'22d4d570-ecd1-40f2-b585-8f0e6a7635cc', topVolUUID=u'4f6af503-c703-44a0-afb2-67581bc5b6f8', bandwidth=u'0', jobUUID=u'e55c7454-e221-4e8e-a075-f945becd2277') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,573+0300 INFO (jsonrpc/6) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Starting merge with jobUUID=u'e55c7454-e221-4e8e-a075-f945becd2277', original chain=22d4d570-ecd1-40f2-b585-8f0e6a7635cc < 4f6af503-c703-44a0-afb2-67581bc5b6f8 < af5e3be7-93b9-4fda-9281-7a0b9e053c11 < 87fd8fc6-255a-416f-8eca-f8c1ede93f63 (top), disk='vdc', base='vdc[3]', top='vdc[2]', bandwidth=0, flags=8 (vm:6014) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,240+0300 INFO (jsonrpc/2) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Libvirt job e55c7454-e221-4e8e-a075-f945becd2277 was terminated (vm:5894) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,240+0300 INFO (jsonrpc/2) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Starting cleanup thread for job: e55c7454-e221-4e8e-a075-f945becd2277 (vm:5902) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,380+0300 INFO (merge/e74bec08) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Synchronizing volume chain after live merge (job e55c7454-e221-4e8e-a075-f945becd2277) (vm:6360) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,721+0300 INFO (merge/e74bec08) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Synchronization completed (job e55c7454-e221-4e8e-a075-f945becd2277) (vm:6369) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,781+0300 INFO (jsonrpc/0) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Cleanup thread <vdsm.virt.vm.LiveMergeCleanupThread object at 0x7fc9e4d8fcd0> successfully completed, untracking job e55c7454-e221-4e8e-a075-f945becd2277 (base=22d4d570-ecd1-40f2-b585-8f0e6a7635cc, 
top=4f6af503-c703-44a0-afb2-67581bc5b6f8) (vm:5845) Merge job aae8cb12-56ba-415c-90e0-de218304e8b5 - succeeded $ grep aae8cb12-56ba-415c-90e0-de218304e8b5 *.log hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,274+0300 INFO (jsonrpc/2) [api.virt] START merge(drive={u'imageID': u'8979b329-cb70-47b9-9ee9-83716713fd50', u'volumeID': u'21577de7-2583-4c96-a409-5e5cce9ac1ef', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'6346a581-33dd-420b-aae2-8765ea1477dd', topVolUUID=u'a8aae32a-d705-4dcb-a8f6-a3320c673635', bandwidth=u'0', jobUUID=u'aae8cb12-56ba-415c-90e0-de218304e8b5') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:13,032+0300 INFO (jsonrpc/2) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Starting merge with jobUUID=u'aae8cb12-56ba-415c-90e0-de218304e8b5', original chain=6346a581-33dd-420b-aae2-8765ea1477dd < a8aae32a-d705-4dcb-a8f6-a3320c673635 < b32eb034-18c4-4da6-b45c-eff6f00a08c7 < 21577de7-2583-4c96-a409-5e5cce9ac1ef (top), disk='sda', base='sda[3]', top='sda[2]', bandwidth=0, flags=8 (vm:6014) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,243+0300 INFO (jsonrpc/2) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Libvirt job aae8cb12-56ba-415c-90e0-de218304e8b5 was terminated (vm:5894) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,243+0300 INFO (jsonrpc/2) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Starting cleanup thread for job: aae8cb12-56ba-415c-90e0-de218304e8b5 (vm:5902) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,413+0300 INFO (merge/e74bec08) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Synchronizing volume chain after live merge (job aae8cb12-56ba-415c-90e0-de218304e8b5) (vm:6360) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,748+0300 INFO (merge/e74bec08) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') 
Synchronization completed (job aae8cb12-56ba-415c-90e0-de218304e8b5) (vm:6369) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:15,811+0300 INFO (jsonrpc/0) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Cleanup thread <vdsm.virt.vm.LiveMergeCleanupThread object at 0x7fc9e4e94890> successfully completed, untracking job aae8cb12-56ba-415c-90e0-de218304e8b5 (base=6346a581-33dd-420b-aae2-8765ea1477dd, top=a8aae32a-d705-4dcb-a8f6-a3320c673635) (vm:5845) Merge job ce2eb173-0e7c-4188-a79b-4b67ec7de629 - started - no log after start $ grep ce2eb173-0e7c-4188-a79b-4b67ec7de629 *.log hypervisor-lynx24_vdsm.log:2019-05-19 06:13:12,275+0300 INFO (jsonrpc/1) [api.virt] START merge(drive={u'imageID': u'ddd06cff-2643-4c7a-bd39-16c649ffbb3d', u'volumeID': u'f7973684-d3fe-4ecc-8305-0920d965e593', u'domainID': u'193e46d4-bb94-48e4-b5a5-236d34372920', u'poolID': u'b26523f2-afcf-4675-b660-9faec6d709b3'}, baseVolUUID=u'85683cc5-0b81-4ecd-b7d4-f6cc409c8679', topVolUUID=u'2c646920-b4e5-407b-942f-c9c72cb89f30', bandwidth=u'0', jobUUID=u'ce2eb173-0e7c-4188-a79b-4b67ec7de629') from=::ffff:10.46.16.252,40370, flow_id=snapshots_delete_aebc6b48-6dd9-4ced, vmId=e74bec08-cb6e-4931-8775-f8a49b62eaf8 (api:46) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:16,585+0300 INFO (jsonrpc/1) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Starting merge with jobUUID=u'ce2eb173-0e7c-4188-a79b-4b67ec7de629', original chain=85683cc5-0b81-4ecd-b7d4-f6cc409c8679 < 2c646920-b4e5-407b-942f-c9c72cb89f30 < 95314255-1e34-4d10-a145-3aeb68ce8a88 < f7973684-d3fe-4ecc-8305-0920d965e593 (top), disk='sdb', base='sdb[3]', top='sdb[2]', bandwidth=0, flags=8 (vm:6014) Checking how the merged volumes were created: $ grep "START createVolume.\+volUUID=u'21e975df-872c-4930-8c81-358b6821eef9" *.log hypervisor-lynx23_vdsm.log:2019-05-19 06:08:00,422+0300 INFO (jsonrpc/2) [vdsm.api] START createVolume(sdUUID=u'193e46d4-bb94-48e4-b5a5-236d34372920', spUUID=u'b26523f2-afcf-4675-b660-9faec6d709b3', 
imgUUID=u'4700cd22-a5ce-4e39-af35-894519c71bca', size=u'10737418240', volFormat=4, preallocate=2, diskType=u'DATA', volUUID=u'21e975df-872c-4930-8c81-358b6821eef9', desc=u'', srcImgUUID=u'4700cd22-a5ce-4e39-af35-894519c71bca', srcVolUUID=u'bc1d2b00-54ca-4550-933c-f2ee6ed132ca', initialSize=None) from=::ffff:10.46.16.252,38534, flow_id=snapshots_create_f70e48d0-067d-4683, task_id=4518f253-17a6-4850-b47f-6c82d9c651df (api:46) $ grep "START createVolume.\+volUUID=u'c755441e-2602-424a-bc0b-beb099f0d987" *.log hypervisor-lynx23_vdsm.log:2019-05-19 06:08:01,166+0300 INFO (jsonrpc/6) [vdsm.api] START createVolume(sdUUID=u'193e46d4-bb94-48e4-b5a5-236d34372920', spUUID=u'b26523f2-afcf-4675-b660-9faec6d709b3', imgUUID=u'6b08223e-df65-473b-85a3-55fce42c18a4', size=u'1073741824', volFormat=4, preallocate=2, diskType=u'DATA', volUUID=u'c755441e-2602-424a-bc0b-beb099f0d987', desc=u'', srcImgUUID=u'6b08223e-df65-473b-85a3-55fce42c18a4', srcVolUUID=u'e87a1c47-1117-421c-8200-5bef3d03ff1a', initialSize=None) from=::ffff:10.46.16.252,38534, flow_id=snapshots_create_f70e48d0-067d-4683, task_id=93740b2d-29f4-4bb9-af8d-cb15ff99e278 (api:46) $ grep "START createVolume.\+volUUID=u'4f6af503-c703-44a0-afb2-67581bc5b6f8" *.log hypervisor-lynx23_vdsm.log:2019-05-19 06:08:00,792+0300 INFO (jsonrpc/7) [vdsm.api] START createVolume(sdUUID=u'193e46d4-bb94-48e4-b5a5-236d34372920', spUUID=u'b26523f2-afcf-4675-b660-9faec6d709b3', imgUUID=u'3c678812-3ac8-4e9c-801e-e9fc9bd376d9', size=u'1073741824', volFormat=4, preallocate=2, diskType=u'DATA', volUUID=u'4f6af503-c703-44a0-afb2-67581bc5b6f8', desc=u'', srcImgUUID=u'3c678812-3ac8-4e9c-801e-e9fc9bd376d9', srcVolUUID=u'22d4d570-ecd1-40f2-b585-8f0e6a7635cc', initialSize=None) from=::ffff:10.46.16.252,38534, flow_id=snapshots_create_f70e48d0-067d-4683, task_id=2f335380-40fb-4a93-8ed3-8f1fd1d3803d (api:46) $ grep "START createVolume.\+volUUID=u'a8aae32a-d705-4dcb-a8f6-a3320c673635" *.log hypervisor-lynx23_vdsm.log:2019-05-19 06:08:00,975+0300 INFO 
(jsonrpc/1) [vdsm.api] START createVolume(sdUUID=u'193e46d4-bb94-48e4-b5a5-236d34372920', spUUID=u'b26523f2-afcf-4675-b660-9faec6d709b3', imgUUID=u'8979b329-cb70-47b9-9ee9-83716713fd50', size=u'1073741824', volFormat=4, preallocate=2, diskType=u'DATA', volUUID=u'a8aae32a-d705-4dcb-a8f6-a3320c673635', desc=u'', srcImgUUID=u'8979b329-cb70-47b9-9ee9-83716713fd50', srcVolUUID=u'6346a581-33dd-420b-aae2-8765ea1477dd', initialSize=None) from=::ffff:10.46.16.252,38534, flow_id=snapshots_create_f70e48d0-067d-4683, task_id=c2b98326-f169-47c0-af99-3b9175c8606e (api:46) $ grep "START createVolume.\+volUUID=u'2c646920-b4e5-407b-942f-c9c72cb89f30" *.log hypervisor-lynx23_vdsm.log:2019-05-19 06:08:00,604+0300 INFO (jsonrpc/3) [vdsm.api] START createVolume(sdUUID=u'193e46d4-bb94-48e4-b5a5-236d34372920', spUUID=u'b26523f2-afcf-4675-b660-9faec6d709b3', imgUUID=u'ddd06cff-2643-4c7a-bd39-16c649ffbb3d', size=u'1073741824', volFormat=4, preallocate=2, diskType=u'DATA', volUUID=u'2c646920-b4e5-407b-942f-c9c72cb89f30', desc=u'', srcImgUUID=u'ddd06cff-2643-4c7a-bd39-16c649ffbb3d', srcVolUUID=u'85683cc5-0b81-4ecd-b7d4-f6cc409c8679', initialSize=None) from=::ffff:10.46.16.252,38534, flow_id=snapshots_create_f70e48d0-067d-4683, task_id=23595231-f8b9-4344-9908-99476f919f58 (api:46) Errors: $ grep ERROR *vdsm.log hypervisor-lynx24_vdsm.log:2019-05-19 06:13:18,683+0300 ERROR (jsonrpc/5) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Live merge failed (job: 6e010748-f6d6-49fb-9479-49d44b738737) (vm:6020) hypervisor-lynx24_vdsm.log:2019-05-19 06:13:18,709+0300 ERROR (jsonrpc/5) [api] FINISH merge error=VM 'e74bec08-cb6e-4931-8775-f8a49b62eaf8' was not defined yet or was undefined (api:132) I don't see anything that may be related to bug 1700623. Looks like normal merge from vdsm point of view, failed because qemu crashed. (In reply to Nir Soffer from comment #6) > assert in runtime must be a qemu bug, and It does not look like bug 1601212 > to me. > > Kevin, what do you think? 
Yes, definitely a QEMU bug. Either some caller of bdrv_replace_node() forgets to drain the involved nodes, or we do indeed have another drain bug that was not covered by the fixes for bug 1601212 (because the version used here should already contain the fixes, if I understand correctly - otherwise, that bug would seem quite relevant). Do we have a QEMU backtrace somewhere?

Can you help me translate the VDSM log entries to QEMU? I think live merge will be a commit block job. If so, are we committing the top of the backing chain to one of its backing files (active commit), or are we committing somewhere in the middle of the backing chain? You mention that there are multiple live merge operations going on. Are they all for the same image chain of the same VM?

Avihai, did you collect a qemu core dump or backtrace from hypervisor-lynx24? (see comment 8)

(In reply to Nir Soffer from comment #9)
> Avihai, did you collect qemu core dump or backtrace from hypervisor-lynx24?
> (see comment 8)

No, unfortunately, the environment was already taken and rebuilt by someone else right after I tried to reproduce the issue on the same environment, which did not reproduce. I will try again; is there something that can be done with the current logs?

(In reply to Kevin Wolf from comment #8)
> Can you help me translate the VDSM log entries to QEMU?

I can translate to libvirt. We have a VM with 5 disks, and we start a blockCommit job per disk merging nodes name[2] to name[3] (using libvirt notation), not an active layer commit. In the failed job both top and base volumes are qcow2; in the other jobs the base volume is qcow2 or a raw-sparse file.
Below is a clearer description of the state when we start the merge:

## Merge job 6e010748-f6d6-49fb-9479-49d44b738737 - failed

Date: 2019-05-19 06:13:12,191+0300
jobUUID: 6e010748-f6d6-49fb-9479-49d44b738737
imageID: 4700cd22-a5ce-4e39-af35-894519c71bca
volumeID: 705a6902-2b36-4883-9482-427885c5c84b
topVolUUID: 21e975df-872c-4930-8c81-358b6821eef9
baseVolUUID: bc1d2b00-54ca-4550-933c-f2ee6ed132ca
type: file (glusterfs/fuse)
original chain (active first):
705a6902-2b36-4883-9482-427885c5c84b
6cdf5a03-075d-40b9-9eb2-27a2da38e519
21e975df-872c-4930-8c81-358b6821eef9 <-- top
bc1d2b00-54ca-4550-933c-f2ee6ed132ca <-- base
798d3d8a-f034-4af7-ac6a-c7a46107c632
blockCommit arguments:
disk: vda
top: vda[2]
base: vda[3]
flags: 8 (VIR_DOMAIN_BLOCK_COMMIT_RELATIVE)

## Merge job 3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38 - succeeded

Date: 2019-05-19 06:13:12,227+0300
jobUUID: 3edb6508-e4cd-4da8-8d1b-6ae1bfa9fb38
imageID: 6b08223e-df65-473b-85a3-55fce42c18a4
volumeID: 07441e1f-e4fe-46ce-bc59-f41cea15d2e4
topVolUUID: c755441e-2602-424a-bc0b-beb099f0d987
baseVolUUID: e87a1c47-1117-421c-8200-5bef3d03ff1a
type: file (glusterfs/fuse)
original chain (active first):
07441e1f-e4fe-46ce-bc59-f41cea15d2e4
cb52ff05-4dc2-4da1-b6c7-0ca1d1764f34
c755441e-2602-424a-bc0b-beb099f0d987 <-- top
e87a1c47-1117-421c-8200-5bef3d03ff1a <-- base (qcow2)
blockCommit arguments:
disk: vdb
top: vdb[2]
base: vdb[3]
flags: 8

## Merge job e55c7454-e221-4e8e-a075-f945becd2277 - succeeded

Date: 2019-05-19 06:13:12,228+0300
jobUUID: e55c7454-e221-4e8e-a075-f945becd2277
imageID: 3c678812-3ac8-4e9c-801e-e9fc9bd376d9
volumeID: 87fd8fc6-255a-416f-8eca-f8c1ede93f63
topVolUUID: 4f6af503-c703-44a0-afb2-67581bc5b6f8
baseVolUUID: 22d4d570-ecd1-40f2-b585-8f0e6a7635cc
type: file (glusterfs/fuse)
original chain (active first):
87fd8fc6-255a-416f-8eca-f8c1ede93f63
af5e3be7-93b9-4fda-9281-7a0b9e053c11
4f6af503-c703-44a0-afb2-67581bc5b6f8 <-- top
22d4d570-ecd1-40f2-b585-8f0e6a7635cc <-- base (raw sparse)
blockCommit arguments:
disk: vdc
top: vdc[2]
base: vdc[3]
flags: 8

## Merge job aae8cb12-56ba-415c-90e0-de218304e8b5 - succeeded

Date: 2019-05-19 06:13:12,274+0300
jobUUID: aae8cb12-56ba-415c-90e0-de218304e8b5
imageID: 8979b329-cb70-47b9-9ee9-83716713fd50
volumeID: 21577de7-2583-4c96-a409-5e5cce9ac1ef
topVolUUID: a8aae32a-d705-4dcb-a8f6-a3320c673635
baseVolUUID: 6346a581-33dd-420b-aae2-8765ea1477dd
type: file (glusterfs/fuse)
original chain (active first):
21577de7-2583-4c96-a409-5e5cce9ac1ef
b32eb034-18c4-4da6-b45c-eff6f00a08c7
a8aae32a-d705-4dcb-a8f6-a3320c673635 <-- top
6346a581-33dd-420b-aae2-8765ea1477dd <-- base (qcow2)
blockCommit arguments:
disk: sda
top: sda[2]
base: sda[3]
flags: 8

## Merge job ce2eb173-0e7c-4188-a79b-4b67ec7de629 - started - no log after start

Date: 2019-05-19 06:13:12,275+0300
jobUUID: ce2eb173-0e7c-4188-a79b-4b67ec7de629
imageID: ddd06cff-2643-4c7a-bd39-16c649ffbb3d
volumeID: f7973684-d3fe-4ecc-8305-0920d965e593
topVolUUID: 2c646920-b4e5-407b-942f-c9c72cb89f30
baseVolUUID: 85683cc5-0b81-4ecd-b7d4-f6cc409c8679
type: file (glusterfs/fuse)
original chain (active first):
f7973684-d3fe-4ecc-8305-0920d965e593
95314255-1e34-4d10-a145-3aeb68ce8a88
2c646920-b4e5-407b-942f-c9c72cb89f30 <-- top
85683cc5-0b81-4ecd-b7d4-f6cc409c8679 <-- base (raw sparse)
blockCommit arguments:
disk: sdb
top: sdb[2]
base: sdb[3]
flags: 8

Created attachment 1571374 [details]
vm xml extracted from engine.log
Kevin, please check the domain xml for more info on the configuration of this VM.
I ran the same test over the night for an additional 12 iterations and the issue did not reproduce. The issue looks very hard to reproduce.

As we don't have a backtrace, we need to guess from the information that we do have. bdrv_replace_node() is called in three different contexts:

1. Start and completion of the commit block job. This is the most likely one here because we know that commit jobs were running.
2. Start and completion of the mirror block job (including commit of the active layer). You said the jobs in question are not mirror jobs, so we can probably rule this out.
3. Taking an external snapshot. This wasn't mentioned anywhere. Can we rule this out as well?

If we assume the assertion was triggered from a commit job, it could be either on start or completion. We have one job that has failed and one job for which we only have a start message, but nothing after this. Do we have more information about these than just "Starting merge" and "Live merge failed"? In particular, do we have a more concrete error message for the failed one? Does a QEMU crash result in failure, or does the failure message mean that the job "successfully" failed without crashing and the crash happened only later? Is the start message printed before starting the job or only when the QMP command to start the job returned success?

I really wish we had a QMP log available each time I'm dealing with a bug reported on the VDSM level. Maybe we can improve something for the future there?

(In reply to Kevin Wolf from comment #14)
> 3. Taking an external snapshot. This wasn't mentioned anywhere. Can we rule
> this out as well?

Yes, when we perform live merge the engine takes locks and will not let a user perform snapshots.

> If we assume the assertion was triggered from a commit job, it could be
> either on start or completion. We have one job that has failed and one job
> for which we only have a start message, but nothing after this. Do we have
> more information about these than just "Starting merge" and "Live merge
> failed"? In particular, do we have a more concrete error message for the
> failed one?

This is what we have for the failing one (thread jsonrpc/5):

2019-05-19 06:13:17,070+0300 INFO (jsonrpc/5) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Starting merge with jobUUID=u'6e010748-f6d6-49fb-9479-49d44b738737', original chain=798d3d8a-f034-4af7-ac6a-c7a46107c632 < bc1d2b00-54ca-4550-933c-f2ee6ed132ca < 21e975df-872c-4930-8c81-358b6821eef9 < 6cdf5a03-075d-40b9-9eb2-27a2da38e519 < 705a6902-2b36-4883-9482-427885c5c84b (top), disk='vda', base='vda[3]', top='vda[2]', bandwidth=0, flags=8 (vm:6014)
2019-05-19 06:13:18,683+0300 ERROR (jsonrpc/5) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Live merge failed (job: 6e010748-f6d6-49fb-9479-49d44b738737) (vm:6020)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 6018, in merge
    bandwidth, flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 719, in blockCommit
    if ret == -1: raise libvirtError ('virDomainBlockCommit() failed', dom=self)
libvirtError: Unable to read from monitor: Connection reset by peer

This looks like a qemu crash
from libvirt's point of view. At the same time we get these events from libvirt (thread libvirt/events):

2019-05-19 06:13:18,697+0300 INFO (libvirt/events) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') underlying process disconnected (vm:1063)
2019-05-19 06:13:18,699+0300 INFO (libvirt/events) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Release VM resources (vm:5279)
2019-05-19 06:13:18,699+0300 INFO (libvirt/events) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Stopping connection (guestagent:442)

So the crash is likely a result of calling virDomainBlockCommit().

> Does a QEMU crash result in failure, or does the failure message
> mean that the job "successfully" failed without crashing and the crash
> happened only later?

Looks like qemu aborts; no successful failure :-)

> Is the start message printed before starting the job or
> only when the QMP command to start the job returned success?

We don't have any libvirt logs.

> I really wish we had a QMP log available each time I'm dealing with a bug
> reported on the VDSM level. Maybe we can improve something for the future
> there?

Vdsm communicates with libvirt, and libvirt's debug log is disabled by default, so we lack even minimal info. I think we should have libvirt INFO level logs, with one message for every change in system state. Not sure how this can be configured and who is responsible for this configuration.

Looks like I have a reliable reproducer even on upstream.
The trick I used is to have a throttled backing file so that the guest boots very slowly and the top node of the commit operation always has some active I/O:

gdb --args x86_64-softmmu/qemu-system-x86_64 \
    -object throttle-group,x-bps-read=4096,id=throttle0 \
    -blockdev file,filename=/home/kwolf/images/hd.img,node-name=base \
    -blockdev throttle,throttle-group=throttle0,file=base,node-name=throttled \
    -drive if=none,file=/tmp/overlay.qcow2,backing.node-name=backing,backing.backing=throttled,node-name=overlay \
    -device ide-hd,drive=overlay -qmp stdio -monitor vc

{"QMP": {"version": {"qemu": {"micro": 50, "minor": 0, "major": 4}, "package": "v4.0.0-600-gc423a6af59"}, "capabilities": ["oob"]}}
{'execute':'qmp_capabilities'}
{"return": {}}
{'execute':'block-commit', 'arguments': {'device': 'none0', 'top-node': 'backing'}}
{"timestamp": {"seconds": 1558445691, "microseconds": 749010}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "none0"}}
qemu-system-x86_64: block.c:4024: bdrv_replace_node: Zusicherung »!atomic_read(&from->in_flight)« nicht erfüllt.

Thread 1 "qemu-system-x86" received signal SIGABRT, Aborted.
(gdb) bt
#0  0x00007ffff596157f in raise () at /lib64/libc.so.6
#1  0x00007ffff594b895 in abort () at /lib64/libc.so.6
#2  0x00007ffff594b769 in _nl_load_domain.cold.0 () at /lib64/libc.so.6
#3  0x00007ffff5959a26 in .annobin_assert.c_end () at /lib64/libc.so.6
#4  0x0000555555b82f6d in bdrv_replace_node (from=0x55555667d410, to=0x55555751b280, errp=0x7fffffffd2d0) at block.c:4024
#5  0x0000555555b82fee in bdrv_append (bs_new=bs_new@entry=0x55555751b280, bs_top=bs_top@entry=0x55555667d410, errp=errp@entry=0x7fffffffd328) at block.c:4101
#6  0x0000555555bd84e7 in commit_start (job_id=job_id@entry=0x0, bs=bs@entry=0x5555565f9800, base=base@entry=0x5555565f0400, top=top@entry=0x55555667d410, creation_flags=creation_flags@entry=0, speed=speed@entry=0, on_error=BLOCKDEV_ON_ERROR_REPORT, backing_file_str=0x0, filter_node_name=0x0, errp=0x7fffffffd410) at block/commit.c:306
#7  0x00005555559949b9 in qmp_block_commit (has_job_id=<optimized out>, job_id=0x0, device=<optimized out>, has_base_node=<optimized out>, base_node=0x0, has_base=<optimized out>, base=0x0, has_top_node=true, top_node=0x555557a295f0 "backing", has_top=false, top=0x0, has_backing_file=false, backing_file=0x0, has_speed=false, speed=0, has_filter_node_name=false, filter_node_name=0x0, has_auto_finalize=false, auto_finalize=false, has_auto_dismiss=false, auto_dismiss=false, errp=0x7fffffffd4e8) at blockdev.c:3400
#8  0x0000555555b3345f in qmp_marshal_block_commit (args=<optimized out>, ret=<optimized out>, errp=0x7fffffffd5d8) at qapi/qapi-commands-block-core.c:409
#9  0x0000555555c2cd6c in do_qmp_dispatch (errp=0x7fffffffd5d0, allow_oob=<optimized out>, request=<optimized out>, cmds=0x55555638f290 <qmp_commands>) at qapi/qmp-dispatch.c:131
#10 0x0000555555c2cd6c in qmp_dispatch (cmds=0x55555638f290 <qmp_commands>, request=<optimized out>, allow_oob=<optimized out>) at qapi/qmp-dispatch.c:174
#11 0x000055555586f5d2 in monitor_qmp_dispatch (mon=0x55555653c690, req=<optimized out>) at /home/kwolf/source/qemu/monitor.c:4141
#12 0x0000555555875779 in monitor_qmp_bh_dispatcher (data=<optimized out>) at /home/kwolf/source/qemu/monitor.c:4224
#13 0x0000555555c72a4e in aio_bh_call (bh=0x55555649abc0) at util/async.c:118
#14 0x0000555555c72a4e in aio_bh_poll (ctx=ctx@entry=0x55555649a1f0) at util/async.c:118
#15 0x0000555555c76000 in aio_dispatch (ctx=0x55555649a1f0) at util/aio-posix.c:460
#16 0x0000555555c7292e in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
#17 0x00007ffff7eb506d in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
#18 0x0000555555c75218 in glib_pollfds_poll () at util/main-loop.c:213
#19 0x0000555555c75218 in os_host_main_loop_wait (timeout=<optimized out>) at util/main-loop.c:236
#20 0x0000555555c75218 in main_loop_wait (nonblocking=<optimized out>) at util/main-loop.c:512
#21 0x000055555599df59 in main_loop () at vl.c:1959
#22 0x0000555555821474 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4623

(In reply to Kevin Wolf from comment #16)
> Looks like I have a reliable reproducer even on upstream. The trick I used
> is to have a throttled backing file so that the guest boots very slowly and
> the top node of the commit operation always has some active I/O:
> [...]

Hi Aihua,

Could you please have a try? Thanks.

Hit this issue when creating a snapshot file from a base image located on an NFS server.
Test env:
kernel version: 3.10.0-1044.el7.x86_64
qemu-kvm-rhev version: qemu-kvm-rhev-2.12.0-29.el7.x86_64

Test steps:
(nfs server)
# cat /etc/exports
/mnt/nfs *(rw,no_root_squash,sync)
# ls /mnt/nfs
rhel77-64-virtio.qcow2

(nfs client)
1. # mount 10.66.144.83:/mnt/nfs /mnt/gluster/
2. Start guest with qemu cmds:
/usr/libexec/qemu-kvm \
    -name 'avocado-vt-vm1' \
    -machine pc \
    -nodefaults \
    -device VGA,bus=pci.0,addr=0x2 \
    -chardev socket,id=qmp_id_qmpmonitor1,path=/var/tmp/monitor-qmpmonitor1-20190123-032240-rOoB4cgD,server,nowait \
    -mon chardev=qmp_id_qmpmonitor1,mode=control \
    -chardev socket,id=qmp_id_catch_monitor,path=/var/tmp/monitor-catch_monitor-20190123-032240-rOoB4cgD,server,nowait \
    -mon chardev=qmp_id_catch_monitor,mode=control \
    -device pvpanic,ioport=0x505,id=id8Ec4Bn \
    -chardev socket,id=serial_id_serial0,path=/var/tmp/serial-serial0-20190123-032240-rOoB4cgD,server,nowait \
    -device isa-serial,chardev=serial_id_serial0 \
    -chardev socket,id=seabioslog_id_20190123-032240-rOoB4cgD,path=/var/tmp/seabios-20190123-032240-rOoB4cgD,server,nowait \
    -device isa-debugcon,chardev=seabioslog_id_20190123-032240-rOoB4cgD,iobase=0x402 \
    -device ich9-usb-ehci1,id=usb1,addr=0x1d.7,multifunction=on,bus=pci.0 \
    -device ich9-usb-uhci1,id=usb1.0,multifunction=on,masterbus=usb1.0,addr=0x1d.0,firstport=0,bus=pci.0 \
    -device ich9-usb-uhci2,id=usb1.1,multifunction=on,masterbus=usb1.0,addr=0x1d.2,firstport=2,bus=pci.0 \
    -device ich9-usb-uhci3,id=usb1.2,multifunction=on,masterbus=usb1.0,addr=0x1d.4,firstport=4,bus=pci.0 \
    -object iothread,id=iothread0 \
    -drive id=drive_image1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/mnt/gluster/rhel77-64-virtio.qcow2 \
    -device virtio-blk-pci,id=image1,drive=drive_image1,bus=pci.0,iothread=iothread0 \
    -drive id=drive_data1,if=none,snapshot=off,aio=threads,cache=none,format=qcow2,file=/root/test.qcow2 \
    -device virtio-blk-pci,id=data1,drive=drive_data1,bus=pci.0,iothread=iothread0 \
    -device virtio-net-pci,mac=9a:39:3a:3b:3c:3d,id=id1JNQsL,vectors=4,netdev=idVpZZ6A,bus=pci.0 \
    -netdev tap,id=idVpZZ6A,vhost=on \
    -m 8192 \
    -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 \
    -cpu 'Penryn',+kvm_pv_unhalt \
    -device usb-tablet,id=usb-tablet1,bus=usb1.0,port=1 \
    -vnc :0 \
    -rtc base=utc,clock=host,driftfix=slew \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm \
    -monitor stdio \
    -qmp tcp:0:3000,server,nowait \
3. Create snapshots:
{ "execute": "blockdev-snapshot-sync", "arguments": { "device": "drive_image1", "snapshot-file": "/root/sn1", "format": "qcow2" } }
Ncat: Connection reset by peer.

After step 3, qemu coredumps with:
(qemu) Formatting '/root/sn1', fmt=qcow2 size=21474836480 backing_file=/mnt/gluster/rhel77-64-virtio.qcow2 backing_fmt=qcow2 cluster_size=65536 lazy_refcounts=off refcount_bits=16
qemu-kvm: block.c:3508: bdrv_replace_node: Assertion `!({ _Static_assert(!(sizeof(*&from->in_flight) > 8), "not expecting: " "sizeof(*&from->in_flight) > ATOMIC_REG_SIZE"); __atomic_load_n(&from->in_flight, 0); })' failed.
bug.txt: line 36: 28438 Aborted (core dumped) /usr/libexec/qemu-kvm -name 'avocado-vt-vm1' -machine pc -nodefaults -device VGA,bus=pci.0,addr=0x2 ...

(gdb) bt
#0  0x00007f575c026337 in __GI_raise (sig=sig@entry=6) at ../nptl/sysdeps/unix/sysv/linux/raise.c:55
#1  0x00007f575c027a28 in __GI_abort () at abort.c:90
#2  0x00007f575c01f156 in __assert_fail_base (fmt=0x7f575c17abe0 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n", assertion=assertion@entry=0x562fe1c0d028 "!({ _Static_assert(!(sizeof(*&from->in_flight) > 8), \"not expecting: \" \"sizeof(*&from->in_flight) > ATOMIC_REG_SIZE\"); __atomic_load_n(&from->in_flight, 0); })", file=file@entry=0x562fe1b71ccc "block.c", line=line@entry=3508, function=function@entry=0x562fe1c0dce0 <__PRETTY_FUNCTION__.31797> "bdrv_replace_node") at assert.c:92
#3  0x00007f575c01f202 in __GI___assert_fail (assertion=assertion@entry=0x562fe1c0d028 "!({ _Static_assert(!(sizeof(*&from->in_flight) > 8), \"not expecting: \" \"sizeof(*&from->in_flight) > ATOMIC_REG_SIZE\"); __atomic_load_n(&from->in_flight, 0); })", file=file@entry=0x562fe1b71ccc "block.c", line=line@entry=3508, function=function@entry=0x562fe1c0dce0 <__PRETTY_FUNCTION__.31797> "bdrv_replace_node") at assert.c:101
#4  0x0000562fe19c72ca in bdrv_replace_node (from=0x562fe458e000, to=<optimized out>, errp=0x7ffd20a3cfa0) at block.c:3508
#5  0x0000562fe19c756e in bdrv_append (bs_new=0x562fe4888800, bs_top=0x562fe458e000, errp=errp@entry=0x7ffd20a3d010) at block.c:3579
#6  0x0000562fe186522d in external_snapshot_prepare (common=0x562fe458a550, errp=0x7ffd20a3d088) at blockdev.c:1831
#7  0x0000562fe1867a09 in qmp_transaction (dev_list=dev_list@entry=0x7ffd20a3d100, has_props=has_props@entry=false, props=<optimized out>, props@entry=0x0, errp=errp@entry=0x7ffd20a3d1b8) at blockdev.c:2440
#8  0x0000562fe1867c24 in qmp_blockdev_snapshot_sync (errp=0x7ffd20a3d1b8, action=0x7ffd20a3d0f0) at blockdev.c:1267
#9  0x0000562fe1867c24 in qmp_blockdev_snapshot_sync (has_device=<optimized out>, device=<optimized out>, has_node_name=<optimized out>, node_name=<optimized out>, snapshot_file=<optimized out>, has_snapshot_node_name=<optimized out>, snapshot_node_name=0x0, has_format=true, format=0x562fe5a57d78 "qcow2", has_mode=false, mode=NEW_IMAGE_MODE_EXISTING, errp=errp@entry=0x7ffd20a3d1b8) at blockdev.c:1295
#10 0x0000562fe18745e1 in qmp_marshal_blockdev_snapshot_sync (args=<optimized out>, ret=<optimized out>, errp=0x7ffd20a3d268) at qapi/qapi-commands-block-core.c:313
#11 0x0000562fe1a9216a in qmp_dispatch (errp=0x7ffd20a3d260, request=0x7ffd20a3d260, cmds=<optimized out>) at qapi/qmp-dispatch.c:111
#12 0x0000562fe1a9216a in qmp_dispatch (cmds=<optimized out>, request=request@entry=0x562fe4b60400) at qapi/qmp-dispatch.c:160
#13 0x0000562fe178a151 in monitor_qmp_dispatch_one (req_obj=<optimized out>) at /usr/src/debug/qemu-2.12.0/monitor.c:4102
#14 0x0000562fe178a3b5 in monitor_qmp_bh_dispatcher (data=<optimized out>) at /usr/src/debug/qemu-2.12.0/monitor.c:4160
#15 0x0000562fe1a9c0b1 in aio_bh_poll (bh=0x562fe45127b0) at util/async.c:90
#16 0x0000562fe1a9c0b1 in aio_bh_poll (ctx=ctx@entry=0x562fe4408500) at util/async.c:118
#17 0x0000562fe1a9f160 in aio_dispatch (ctx=0x562fe4408500) at util/aio-posix.c:440
#18 0x0000562fe1a9bf8e in aio_ctx_dispatch (source=<optimized out>, callback=<optimized out>, user_data=<optimized out>) at util/async.c:261
#19 0x00007f57626eb049 in g_main_context_dispatch (context=0x562fe43eb600) at gmain.c:3175
#20 0x00007f57626eb049 in g_main_context_dispatch (context=context@entry=0x562fe43eb600) at gmain.c:3828
#21 0x0000562fe1a9e457 in main_loop_wait () at util/main-loop.c:215
#22 0x0000562fe1a9e457 in main_loop_wait (timeout=<optimized out>) at util/main-loop.c:238
#23 0x0000562fe1a9e457 in main_loop_wait (nonblocking=nonblocking@entry=0) at util/main-loop.c:497
#24 0x0000562fe173e1f7 in main () at vl.c:2013
#25 0x0000562fe173e1f7 in main (argc=<optimized out>, argv=<optimized out>, envp=<optimized out>) at vl.c:4777

BTW, I can reproduce it only occasionally, so the reproduction rate is low.
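The blockdev-snapshot-sync command in step 3 above was sent over the -qmp tcp:0:3000 socket (via nc, per the "Ncat: Connection reset by peer" message). The same exchange can be scripted; a rough sketch, assuming the tcp:0:3000 socket from the command line above (the helper names are my own, not part of any QEMU tooling):

```python
import json
import socket

def qmp_message(command, arguments=None):
    """Build one QMP command as a newline-delimited JSON line."""
    msg = {"execute": command}
    if arguments is not None:
        msg["arguments"] = arguments
    return json.dumps(msg) + "\n"

def qmp_command(host, port, command, arguments=None):
    """Send a single command over a QMP TCP socket started with
    -qmp tcp:0:3000,server,nowait and return the first reply."""
    s = socket.create_connection((host, port))
    f = s.makefile("rw")
    json.loads(f.readline())                  # greeting banner: {"QMP": {...}}
    f.write(qmp_message("qmp_capabilities"))  # mandatory capability negotiation
    f.flush()
    json.loads(f.readline())                  # {"return": {}}
    f.write(qmp_message(command, arguments))
    f.flush()
    reply = json.loads(f.readline())
    s.close()
    return reply

# Hypothetical usage against the guest above (would trigger the crash here):
# qmp_command("localhost", 3000, "blockdev-snapshot-sync",
#             {"device": "drive_image1", "snapshot-file": "/root/sn1",
#              "format": "qcow2"})
```

Note that with a crashing qemu the final readline() may raise or return the abort mid-stream, which is exactly the "Connection reset by peer" seen above.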
(In reply to aihua liang from comment #21)
> BTW, i can reproduce it occasionally. So the reproduce rate is low.

With my command line from comment 16 (with a throttle node above base), it's a 100% reproducer for me. I posted a fix upstream:
https://lists.gnu.org/archive/html/qemu-block/2019-05/msg00709.html

However, I'm afraid that this is too late for 7.7 unless you can justify a blocker.

Can reproduce it on qemu-kvm-rhev-2.12.0-29.el7.x86_64 with Kevin's steps in comment 16.

Reproduce steps:
1. Create images sn1 ~ sn5:
# for i in $(seq 1 5); do qemu-img create -f qcow2 /root/sn$i 20G; done
2. Start guest with qemu cmds:
/usr/libexec/qemu-kvm \
    -object throttle-group,x-bps-read=4096,id=throttle0 \
    -blockdev file,filename=/home/kvm_autotest_root/images/rhel77-64-virtio-scsi.qcow2,node-name=file_base \
    -blockdev qcow2,file=file_base,node-name=base \
    -blockdev throttle,throttle-group=throttle0,file=base,node-name=throttled \
    -blockdev file,filename=/root/sn1,node-name=root_sn1 \
    -blockdev qcow2,file=root_sn1,node-name=sn1,backing=throttled \
    -blockdev file,filename=/root/sn2,node-name=root_sn2 \
    -blockdev qcow2,file=root_sn2,node-name=sn2,backing=sn1 \
    -blockdev file,filename=/root/sn3,node-name=root_sn3 \
    -blockdev qcow2,file=root_sn3,node-name=sn3,backing=sn2 \
    -blockdev file,filename=/root/sn4,node-name=root_sn4 \
    -blockdev qcow2,file=root_sn4,node-name=sn4,backing=sn3 \
    -blockdev file,filename=/root/sn5,node-name=root_sn5 \
    -blockdev qcow2,file=root_sn5,node-name=sn5,backing=sn4 \
    -device ide-hd,drive=sn5 \
    -m 8192 \
    -smp 2,maxcpus=2,cores=1,threads=1,sockets=2 \
    -cpu 'Penryn',+kvm_pv_unhalt \
    -rtc base=utc,clock=host,driftfix=slew \
    -boot order=cdn,once=c,menu=off,strict=off \
    -enable-kvm \
    -vnc :0 \
    -monitor stdio \
    -qmp tcp:0:3000,server,nowait \
3. During guest boot, do block commit:
{'execute':'block-commit', 'arguments': {'device': 'sn5', 'top-node': 'sn4','job-id':'j1'}}
{"timestamp": {"seconds": 1568879801, "microseconds": 692671}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j1"}}
Connection closed by foreign host.

After step 3:
(qemu) qemu-kvm: block.c:3508: bdrv_replace_node: Assertion `!({ _Static_assert(!(sizeof(*&from->in_flight) > 8), "not expecting: " "sizeof(*&from->in_flight) > ATOMIC_REG_SIZE"); __atomic_load_n(&from->in_flight, 0); })' failed.
tat.txt: Line 26: 30297 Aborted (Coredump) /usr/libexec/qemu-kvm -object throttle-group,x-bps-read=4096,id=throttle0 -blockdev file,filename=/home/kvm_autotest_root/images/rhel77-64-virtio-scsi.qcow2,node-name=file_base...

Verified on qemu-kvm-rhev-2.12.0-35.el7.x86_64, the issue has been fixed, but there's a new issue.

1. Create images sn1 ~ sn5:
# for i in $(seq 1 5); do qemu-img create -f qcow2 /root/sn$i 20G; done
2. Start guest with the same qemu cmds as in step 2 above.
3. During guest boot, do block commit from sn4 to base.
{'execute':'block-commit', 'arguments': {'device': 'sn5', 'top-node': 'sn4','job-id':'j1'}}
{"timestamp": {"seconds": 1568881186, "microseconds": 588417}, "event": "JOB_STATUS_CHANGE", "data": {"status": "created", "id": "j1"}}
{"timestamp": {"seconds": 1568881186, "microseconds": 646349}, "event": "JOB_STATUS_CHANGE", "data": {"status": "running", "id": "j1"}}
{"return": {}}
{"timestamp": {"seconds": 1568881187, "microseconds": 347398}, "event": "JOB_STATUS_CHANGE", "data": {"status": "waiting", "id": "j1"}}
{"timestamp": {"seconds": 1568881187, "microseconds": 347460}, "event": "JOB_STATUS_CHANGE", "data": {"status": "pending", "id": "j1"}}
{"timestamp": {"seconds": 1568881187, "microseconds": 347739}, "event": "BLOCK_JOB_COMPLETED", "data": {"device": "j1", "len": 21474836480, "offset": 21474836480, "speed": 0, "type": "commit"}}
{"timestamp": {"seconds": 1568881187, "microseconds": 347800}, "event": "JOB_STATUS_CHANGE", "data": {"status": "concluded", "id": "j1"}}
{"timestamp": {"seconds": 1568881187, "microseconds": 347847}, "event": "JOB_STATUS_CHANGE", "data": {"status": "null", "id": "j1"}}

4. Do block commit from sn5 to base:
{'execute':'block-commit', 'arguments': {'device': 'sn5','job-id':'j1'}}
Connection closed by foreign host.

After step 4, qemu coredumps with:
(qemu) qemu-kvm: block/throttle.c:214: throttle_co_drain_end: Assertion `tgm->io_limits_disabled' failed.
tat.txt: line 26: 30790 Aborted (Coredump) /usr/libexec/qemu-kvm -object throttle-group,x-bps-read=4096,id=throttle0 -blockdev file,filename=/home/kvm_autotest_root/images/rhel77-64-virtio-scsi.qcow2,node-name=file_base -blockdev qcow2,file=file_base,node-name=base -blockdev throttle,throttle-group=throttle0,file=base,node-name=throttled ...

(gdb) bt
#0  0x00007fbc6d54c377 in raise () at /lib64/libc.so.6
#1  0x00007fbc6d54da68 in abort () at /lib64/libc.so.6
#2  0x00007fbc6d545196 in __assert_fail_base () at /lib64/libc.so.6
#3  0x00007fbc6d545242 in () at /lib64/libc.so.6
#4  0x0000558a1c13c921 in throttle_co_drain_end (bs=<optimized out>) at block/throttle.c:214
#5  0x0000558a1c12771e in bdrv_drain_invoke_entry (opaque=0x558a20ef0ae0) at block/io.c:197
#6  0x0000558a1c1c68ba in coroutine_trampoline (i0=<optimized out>, i1=<optimized out>) at util/coroutine-ucontext.c:116
#7  0x00007fbc6d55e180 in __start_context () at /lib64/libc.so.6
#8  0x00007ffd321a3970 in ()
#9  0x0000000000000000 in ()

Set this bug's status to "Verified" and track the new issue with a new bug.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:1216
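On the logging gap raised earlier in the thread (vdsm talks to libvirt, and libvirt debug logging is off by default, so the QMP traffic was lost): libvirtd can keep an INFO-level log permanently. A minimal sketch of /etc/libvirt/libvirtd.conf settings; the values below are illustrative, not the configuration that was in use on these hosts:

```shell
# /etc/libvirt/libvirtd.conf -- illustrative values.
# Levels: 1=DEBUG, 2=INFO, 3=WARNING, 4=ERROR.
# Keep qemu/libvirt at INFO (this captures QMP monitor traffic at DEBUG only,
# so raise "1:qemu" instead when a full QMP log is needed).
log_filters="2:qemu 2:libvirt 4:object 4:json 4:event 4:util"
log_outputs="2:file:/var/log/libvirt/libvirtd.log"
```

With libvirt >= 4.4 (the reported hosts run 4.5.0) the same settings can also be changed at runtime via `virt-admin daemon-log-filters` and `virt-admin daemon-log-outputs`, without restarting libvirtd.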
Created attachment 1570817 [details]
engine and vdsm logs

Description of problem:
Removing VM snapshot failed with:

Engine:
2019-05-19 06:13:18,292+03 INFO [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommandCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [snapshots_delete_aebc6b48-6dd9-4ced] Command 'RemoveSnapshotSingleDiskLive' (id: 'd0fb453f-7e58-4236-880e-e8929b7a7777') waiting on child command id: '7d205c62-99a7-43ba-8798-e41006da751e' type:'Merge' to complete
2019-05-19 06:13:18,297+03 INFO [org.ovirt.engine.core.bll.snapshots.RemoveSnapshotSingleDiskLiveCommandCallback] (EE-ManagedThreadFactory-engineScheduled-Thread-31) [snapshots_delete_aebc6b48-6dd9-4ced] Command 'RemoveSnapshotSingleDiskLive' (id: 'df4e7c70-360c-4fe2-b72a-3ed2634b2e9f') waiting on child command id: '4b4d5091-ea4c-4657-a9de-ec047eee0e64' type:'Merge' to complete
2019-05-19 06:13:18,726+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-5) [snapshots_delete_aebc6b48-6dd9-4ced] Failed in 'MergeVDS' method
2019-05-19 06:13:18,731+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-commandCoordinator-Thread-5) [snapshots_delete_aebc6b48-6dd9-4ced] EVENT_ID: VDS_BROKER_COMMAND_FAILURE(10,802), VDSM host_mixed_3 command MergeVDS failed: General Exception: ("VM 'e74bec08-cb6e-4931-8775-f8a49b62eaf8' was not defined yet or was undefined",)
2019-05-19 06:13:18,732+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-5) [snapshots_delete_aebc6b48-6dd9-4ced] Command 'org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand' return value 'StatusOnlyReturn [status=Status [code=100, message=General Exception: ("VM 'e74bec08-cb6e-4931-8775-f8a49b62eaf8' was not defined yet or was undefined",)]]'
2019-05-19 06:13:18,732+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-5) [snapshots_delete_aebc6b48-6dd9-4ced] HostName = host_mixed_3
2019-05-19 06:13:18,732+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-5) [snapshots_delete_aebc6b48-6dd9-4ced] Command 'MergeVDSCommand(HostName = host_mixed_3, MergeVDSCommandParameters:{hostId='5f79683a-f0b3-4cf8-9730-7e431b662f83', vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8', storagePoolId='b26523f2-afcf-4675-b660-9faec6d709b3', storageDomainId='193e46d4-bb94-48e4-b5a5-236d34372920', imageGroupId='4700cd22-a5ce-4e39-af35-894519c71bca', imageId='705a6902-2b36-4883-9482-427885c5c84b', baseImageId='bc1d2b00-54ca-4550-933c-f2ee6ed132ca', topImageId='21e975df-872c-4930-8c81-358b6821eef9', bandwidth='0'})' execution failed: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = General Exception: ("VM 'e74bec08-cb6e-4931-8775-f8a49b62eaf8' was not defined yet or was undefined",), code = 100
2019-05-19 06:13:18,732+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.MergeVDSCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-5) [snapshots_delete_aebc6b48-6dd9-4ced] FINISH, MergeVDSCommand, log id: dfef71b
2019-05-19 06:13:18,732+03 ERROR [org.ovirt.engine.core.bll.MergeCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-5) [snapshots_delete_aebc6b48-6dd9-4ced] Engine exception thrown while sending merge command: org.ovirt.engine.core.common.errors.EngineException: EngineException: org.ovirt.engine.core.vdsbroker.vdsbroker.VDSErrorException: VDSGenericException: VDSErrorException: Failed to MergeVDS, error = General Exception: ("VM 'e74bec08-cb6e-4931-8775-f8a49b62eaf8' was not defined yet or was undefined",), code = 100 (Failed with error GeneralException and code 100)
2019-05-19 06:13:18,771+03 ERROR [org.ovirt.engine.core.bll.MergeCommand] (EE-ManagedThreadFactory-commandCoordinator-Thread-5) [snapshots_delete_aebc6b48-6dd9-4ced] Transaction rolled-back for command 'org.ovirt.engine.core.bll.MergeCommand'.
2019-05-19 06:13:19,220+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-11) [] VM 'e74bec08-cb6e-4931-8775-f8a49b62eaf8' was reported as Down on VDS '5f79683a-f0b3-4cf8-9730-7e431b662f83'(host_mixed_3)
2019-05-19 06:13:19,220+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-11) [] START, DestroyVDSCommand(HostName = host_mixed_3, DestroyVmVDSCommandParameters:{hostId='5f79683a-f0b3-4cf8-9730-7e431b662f83', vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8', secondsToWait='0', gracefully='false', reason='', ignoreNoVm='true'}), log id: 36bec028
2019-05-19 06:13:19,225+03 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.DestroyVDSCommand] (ForkJoinPool-1-worker-11) [] FINISH, DestroyVDSCommand, log id: 36bec028
2019-05-19 06:13:19,225+03 INFO [org.ovirt.engine.core.vdsbroker.monitoring.VmAnalyzer] (ForkJoinPool-1-worker-11) [] VM 'e74bec08-cb6e-4931-8775-f8a49b62eaf8'(vm_TestCase6052_1906050852) moved from 'Up' --> 'Down'
2019-05-19 06:13:19,237+03 ERROR [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (ForkJoinPool-1-worker-11) [] EVENT_ID: VM_DOWN_ERROR(119), VM vm_TestCase6052_1906050852 is down with error. Exit message: Lost connection with qemu process.
VDSM:
2019-05-19 06:13:18,683+0300 ERROR (jsonrpc/5) [virt.vm] (vmId='e74bec08-cb6e-4931-8775-f8a49b62eaf8') Live merge failed (job: 6e010748-f6d6-49fb-9479-49d44b738737) (vm:6020)
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/vdsm/virt/vm.py", line 6018, in merge
    bandwidth, flags)
  File "/usr/lib/python2.7/site-packages/vdsm/virt/virdomain.py", line 98, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/libvirtconnection.py", line 130, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python2.7/site-packages/vdsm/common/function.py", line 92, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 719, in blockCommit
    if ret == -1: raise libvirtError ('virDomainBlockCommit() failed', dom=self)
libvirtError: Unable to read from monitor: Connection reset by peer

Version-Release number of selected component (if applicable):
ovirt-engine-4.2.8.7-0.1.el7ev.noarch
vdsm-4.20.49-3.el7ev.x86_64
libvirt-4.5.0-10.el7_6.9.x86_64
qemu-img-rhev-2.12.0-18.el7_6.4.x86_64

How reproducible:
Happened once so far, on glusterFS only

Steps to Reproduce (from automation log):
06:05:08 2019-05-19 06:05:08,520 INFO Test Setup 3: Creating VM vm_TestCase6052_1906050852
06:05:08 2019-05-19 06:05:08,725 INFO Test Setup 4: [class] Description: clone vm from a pre-defined template golden_mixed_virtio_template with {'vol_sparse': True, 'name': 'vm_TestCase6052_1906050852', 'clone': False, 'cluster': 'golden_env_mixed_1', 'storagedomain': 'test_gluster_1', 'display_type': 'spice', 'timeout': 900, 'virtio_scsi': True, 'vol_format': 'cow', 'os_type': 'RHEL6x64', 'type': 'desktop', 'wait': True}
06:05:20 2019-05-19 06:05:20,769 INFO Test Setup 5: Starting VM vm_TestCase6052_1906050852
06:05:20 2019-05-19 06:05:20,772 INFO Test Setup 6: [class] Start VM vm_TestCase6052_1906050852 with {'wait_for_ip': True, 'pause': False, 'use_cloud_init': False, 'timeout': 600, 'wait_for_status': 'up'}
06:06:16
2019-05-19 06:06:16,291 INFO Test Setup 7: Creating disks with filesystem and attach to VM vm_TestCase6052_1906050852 06:07:34 2019-05-19 06:07:34,438 INFO 300: storage/rhevmtests.storage.storage_remove_snapshots.test_live_merge.TestCase6052.test_basic_snapshot_deletion_with_io[glusterfs] 06:07:34 2019-05-19 06:07:34,439 INFO STORAGE: GLUSTERFS 06:07:34 2019-05-19 06:07:34,875 INFO Test Step 8: Creating files on vm's 'vm_TestCase6052_1906050852' disks 06:07:35 2019-05-19 06:07:34,950 INFO Test Step 9: Creating file /mount-point_1906065223/test_file_0 06:07:36 2019-05-19 06:07:36,619 INFO Test Step 10: Creating file /mount-point_1906070525/test_file_0 06:07:38 2019-05-19 06:07:38,231 INFO Test Step 11: Creating file /mount-point_1906071909/test_file_0 06:07:39 2019-05-19 06:07:39,898 INFO Test Step 12: Creating file /mount-point_1906073210/test_file_0 06:07:42 2019-05-19 06:07:42,176 INFO Test Step 13: Creating snapshot of vm vm_TestCase6052_1906050852 06:07:48 2019-05-19 06:07:48,185 INFO Test Step 14: Before snapshot: 6 volumes 06:07:48 2019-05-19 06:07:48,343 INFO Test Step 15: Stop vm vm_TestCase6052_1906050852 with {'async': 'true'} 06:08:00 2019-05-19 06:08:00,066 INFO Test Step 16: Adding new snapshot to vm vm_TestCase6052_1906050852 06:08:00 2019-05-19 06:08:00,069 INFO Test Step 17: Add snapshot to VM vm_TestCase6052_1906050852 with {'description': 'snapshot_6052_glusterfs_0', 'wait': True} 06:08:18 2019-05-19 06:08:18,908 INFO Test Step 18: Start VMs with {'vm_list': ['vm_TestCase6052_1906050852'], 'wait_for_ip': True, 'wait_for_status': 'powering_up', 'max_workers': 2} 06:09:18 2019-05-19 06:09:18,301 INFO Test Step 19: After snapshot: 11 volumes 06:09:18 2019-05-19 06:09:18,367 INFO Test Step 20: Creating files on vm's 'vm_TestCase6052_1906050852' disks 06:09:18 2019-05-19 06:09:18,441 INFO Test Step 21: Creating file /mount-point_1906065223/test_file_1 06:09:20 2019-05-19 06:09:20,184 INFO Test Step 22: Creating file /mount-point_1906070525/test_file_1 
2019-05-19 06:09:21,848 INFO Test Step 23: Creating file /mount-point_1906071909/test_file_1
2019-05-19 06:09:23,434 INFO Test Step 24: Creating file /mount-point_1906073210/test_file_1
2019-05-19 06:09:25,559 INFO Test Step 25: Creating snapshot of vm vm_TestCase6052_1906050852
2019-05-19 06:09:31,464 INFO Test Step 26: Before snapshot: 11 volumes
2019-05-19 06:09:31,617 INFO Test Step 27: Stop vm vm_TestCase6052_1906050852 with {'async': 'true'}
2019-05-19 06:09:43,759 INFO Test Step 28: Adding new snapshot to vm vm_TestCase6052_1906050852
2019-05-19 06:09:43,761 INFO Test Step 29: Add snapshot to VM vm_TestCase6052_1906050852 with {'description': 'snapshot_6052_glusterfs_1', 'wait': True}
2019-05-19 06:10:14,210 INFO Test Step 30: Start VMs with {'vm_list': ['vm_TestCase6052_1906050852'], 'wait_for_ip': True, 'wait_for_status': 'powering_up', 'max_workers': 2}
2019-05-19 06:11:14,248 INFO Test Step 31: After snapshot: 16 volumes
2019-05-19 06:11:14,319 INFO Test Step 32: Creating files on vm's 'vm_TestCase6052_1906050852' disks
2019-05-19 06:11:14,399 INFO Test Step 33: Creating file /mount-point_1906065223/test_file_2
2019-05-19 06:11:16,265 INFO Test Step 34: Creating file /mount-point_1906070525/test_file_2
2019-05-19 06:11:17,892 INFO Test Step 35: Creating file /mount-point_1906071909/test_file_2
2019-05-19 06:11:19,517 INFO Test Step 36: Creating file /mount-point_1906073210/test_file_2
2019-05-19 06:11:21,809 INFO Test Step 37: Creating snapshot of vm vm_TestCase6052_1906050852
2019-05-19 06:11:27,953 INFO Test Step 38: Before snapshot: 16 volumes
2019-05-19 06:11:28,108 INFO Test Step 39: Stop vm vm_TestCase6052_1906050852 with {'async': 'true'}
2019-05-19 06:11:40,463 INFO Test Step 40: Adding new snapshot to vm vm_TestCase6052_1906050852
2019-05-19 06:11:40,465 INFO Test Step 41: Add snapshot to VM vm_TestCase6052_1906050852 with {'description': 'snapshot_6052_glusterfs_2', 'wait': True}
2019-05-19 06:12:02,446 INFO Test Step 42: Start VMs with {'vm_list': ['vm_TestCase6052_1906050852'], 'wait_for_ip': True, 'wait_for_status': 'powering_up', 'max_workers': 2}
2019-05-19 06:13:00,384 INFO Test Step 43: After snapshot: 21 volumes
2019-05-19 06:13:08,155 INFO Test Step 44: Before live merge: 21 volumes
2019-05-19 06:13:08,324 INFO Test Step 45: Performing continues IO on VM vm_TestCase6052_1906050852
2019-05-19 06:13:08,326 INFO Test Step 46: Removing snapshot 'snapshot_6052_glusterfs_0' of vm vm_TestCase6052_1906050852
2019-05-19 06:13:08,330 INFO Test Step 47: Remove vm vm_TestCase6052_1906050852 snapshot with {'description': 'snapshot_6052_glusterfs_0', 'timeout': 2400, 'wait': True}
2019-05-19 06:53:15,074 ERROR Result: FAILED

Actual results:
Removing the VM snapshot failed; qemu aborted on the bdrv_replace_node assertion.

Expected results:
The snapshot is removed (live merge completes) while the VM keeps running.

Additional info:
The same automated test ran on 4 storage flavors (iSCSI, NFS, FCP, Gluster) and failed only on Gluster.
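For context, the "Removing snapshot" step above is a live merge, which RHV drives through libvirt as a QEMU block-commit job; it is the completion of that job that calls bdrv_replace_node() and trips the assertion. A minimal sketch of the QMP message sequence involved (device name and image paths are hypothetical placeholders, not taken from this reproducer):

```python
import json

def live_merge_commands(device, base, top):
    """Build the QMP messages for a block-commit live merge.

    Starting the job copies data from 'top' down into 'base'; for an
    active-layer commit the job then reaches the 'ready' state and must
    be completed explicitly, and that pivot is where QEMU runs
    bdrv_replace_node() -- the function whose assertion fails here.
    """
    start = {
        "execute": "block-commit",
        "arguments": {"device": device, "base": base, "top": top},
    }
    complete = {
        "execute": "block-job-complete",
        "arguments": {"device": device},
    }
    return [json.dumps(start), json.dumps(complete)]

# Hypothetical drive name and image chain, for illustration only.
for msg in live_merge_commands("drive-virtio-disk0",
                               "/rhev/data-center/.../base.qcow2",
                               "/rhev/data-center/.../top.qcow2"):
    print(msg)
```

With libvirt the same operation is exposed as virDomainBlockCommit (virsh blockcommit), which emits this QMP sequence on the management application's behalf.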