Bug 2168769

| Field | Value |
|---|---|
| Summary | Save guest failed sometimes after restarting the service (mainly rhel6 guest) |
| Product | Red Hat Enterprise Linux 9 |
| Reporter | Meina Li <meili> |
| Component | libvirt |
| libvirt sub component | Storage |
| Assignee | Peter Krempa <pkrempa> |
| QA Contact | Meina Li <meili> |
| Status | CLOSED ERRATA |
| Severity | unspecified |
| Priority | high |
| CC | jdenemar, lmen, pkrempa, virt-maint, yicui |
| Version | 9.2 |
| Keywords | Automation, Regression, Triaged |
| Target Milestone | rc |
| Hardware | Unspecified |
| OS | Unspecified |
| Fixed In Version | libvirt-9.0.0-5.el9 |
| Last Closed | 2023-05-09 07:27:45 UTC |
| Type | Bug |
Looks like we don't dismiss the 'blockdev-create' jobs properly after the snapshots. After a restart, libvirt tries to clean them up, but they are already in use. I'm not sure what's up with the save job, though.

The log you've attached is not a debug log. Please include one which doesn't filter out the debug-level messages, e.g. use the filter:

3:remote 4:event 3:util.json 3:util.object 3:util.dbus 3:util.netlink 3:node_device 3:rpc 3:access 1:*

Created attachment 1943269 [details]
fill-libvirtd.log
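One way to apply the filter string suggested above is through the daemon's logging settings; a minimal sketch, assuming the modular virtqemud daemon (the file name and log path would differ for a monolithic libvirtd setup):

```
# /etc/libvirt/virtqemud.conf  (libvirtd.conf for the monolithic daemon)
log_filters="3:remote 4:event 3:util.json 3:util.object 3:util.dbus 3:util.netlink 3:node_device 3:rpc 3:access 1:*"
log_outputs="1:file:/var/log/libvirt/virtqemud.log"
```

Restart the daemon afterwards (e.g. `systemctl restart virtqemud`) so the new logging settings take effect.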
While I was not able to reproduce the issue, I'm fairly sure I know what's going on. After a recent commit, libvirt doesn't properly progress through the blockjob state transitions: the state of the 'blockdev-create' job gets stuck at 'pending' in libvirt's internal data, while in qemu the job properly transitions to 'concluded'. Libvirt no longer watches the job, so it doesn't dismiss it. After a restart of the daemon, libvirt tries to clean up the leftover blockjobs, and as part of the cleanup of the 'blockdev-create' job it removes the security labels from the images, which are still in use. QEMU then can't actually use the images, and that causes the migration failure.

Fixed upstream:

commit c433c2434c0459df98ed3355ef615e341acd9009
Author: Peter Krempa <pkrempa>
Date:   Fri Feb 10 17:16:43 2023 +0100

    qemu: blockjob: Handle 'pending' blockjob state only when we need it

    The 'pending' state needs to be handled by the blockjob code only when
    the snapshot code requests a block-commit without auto-finalization.
    If we always handle it we fail to properly remove the blockjob data
    for the 'blockdev-create' job as that also transitions through
    'pending' but we'd never update it once it reaches 'concluded' as the
    code already thinks that the job has finished and is no longer
    watching it.

    Introduce a 'processPending' property into block job data and set it
    only when we know that we need to process 'pending'.

    Fixes: 90d9bc9d74a5157167548b26c00b1a016655e295
    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=2168769
    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Pavel Hrdina <phrdina>

Pre-verified Version:
libvirt-9.0.0-5.el9.x86_64
qemu-kvm-7.2.0-8.el9.x86_64

Pre-verified Steps:
1. The related auto cases pass:
virtio_transitional_blk.rhel6_guest.default.with_virtio_blk.snapshot_test.virtio_transitional
backingchain.lifecycle_test.check_mirror_with_restart_libvirtd.blockcopy
2. The related manual cases also pass:
2.1 For the rhel6 guest with virtio-transitional: verify it with the steps in the Description.
2.2 For the backingchain case:

# virsh domstate rhel
running

# virsh blockcopy rhel vda /tmp/rhel.copy --transient-job --shallow
Block Copy started

# virsh dumpxml rhel | xmllint --xpath //disk -
<disk type="file" device="disk">
  <driver name="qemu" type="qcow2"/>
  <source file="/var/lib/libvirt/images/rhel.qcow2" index="1"/>
  <backingStore/>
  <mirror type="file" file="/tmp/rhel.copy" format="qcow2" job="copy">
    <format type="qcow2"/>
    <source file="/tmp/rhel.copy" index="2"/>
    <backingStore/>
  </mirror>
  <target dev="vda" bus="virtio"/>
  <alias name="virtio-disk0"/>
  <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
</disk>

# systemctl restart virtqemud

# virsh dumpxml rhel | xmllint --xpath //disk -
<disk type="file" device="disk">
  <driver name="qemu" type="qcow2"/>
  <source file="/var/lib/libvirt/images/rhel.qcow2" index="1"/>
  <backingStore/>
  <mirror type="file" file="/tmp/rhel.copy" format="qcow2" job="copy" ready="yes">
    <format type="qcow2"/>
    <source file="/tmp/rhel.copy" index="2"/>
    <backingStore/>
  </mirror>
  <target dev="vda" bus="virtio"/>
  <alias name="virtio-disk0"/>
  <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
</disk>

# virsh blockjob rhel vda --pivot

# virsh dumpxml rhel | xmllint --xpath //disk -
<disk type="file" device="disk">
  <driver name="qemu" type="qcow2"/>
  <source file="/tmp/rhel.copy" index="2"/>
  <backingStore/>
  <target dev="vda" bus="virtio"/>
  <alias name="virtio-disk0"/>
  <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
</disk>

Verified Version:
libvirt-8.0.0-5.module+el8.6.0+14480+c0a3aa0f.x86_64
qemu-kvm-6.2.0-11.module+el8.6.0+16538+01ea313d.6.x86_64

Verified Steps:

S1: Save the rhel6 guest after restarting the service
1. Prepare a running rhel6 seabios guest with a virtio-transitional model disk.
# virsh dumpxml rhel | xmllint --xpath //disk -
<disk type='file' device='disk' model='virtio-transitional'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/rhel6.qcow2'/>
  <backingStore/>
  <target dev='vda' bus='virtio'/>
  <alias name='virtio-disk0'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</disk>

2. Create snapshots for the rhel6 guest.

# for i in {1..3}; do virsh snapshot-create-as rhel s$i --disk-only; done
Domain snapshot s1 created
Domain snapshot s2 created
Domain snapshot s3 created

3. Restart the service.

# systemctl restart virtqemud

4. Save the guest and restore it.

# virsh save rhel rhel.save
Domain 'rhel' saved to rhel.save

# virsh restore rhel.save
Domain restored from rhel.save

S2: Check the mirror status of the guest after restarting the service.
1. Prepare a running guest.

# virsh domstate rhel
running

2. Do blockcopy for it.

# virsh blockcopy rhel vda /tmp/rhel.copy --transient-job --shallow
Block Copy started

# virsh dumpxml rhel | xmllint --xpath //disk -
<disk type="file" device="disk">
  <driver name="qemu" type="qcow2"/>
  <source file="/var/lib/libvirt/images/rhel.qcow2" index="1"/>
  <backingStore/>
  <mirror type="file" file="/tmp/rhel.copy" format="qcow2" job="copy">
    <format type="qcow2"/>
    <source file="/tmp/rhel.copy" index="2"/>
    <backingStore/>
  </mirror>
  <target dev="vda" bus="virtio"/>
  <alias name="virtio-disk0"/>
  <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
</disk>

3. Restart virtqemud, wait a while, then check the XML.
# systemctl restart virtqemud

# virsh dumpxml rhel | xmllint --xpath //disk -
<disk type="file" device="disk">
  <driver name="qemu" type="qcow2"/>
  <source file="/var/lib/libvirt/images/rhel.qcow2" index="1"/>
  <backingStore/>
  <mirror type="file" file="/tmp/rhel.copy" format="qcow2" job="copy" ready="yes">
    <format type="qcow2"/>
    <source file="/tmp/rhel.copy" index="2"/>
    <backingStore/>
  </mirror>
  <target dev="vda" bus="virtio"/>
  <alias name="virtio-disk0"/>
  <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
</disk>

4. Pivot the blockjob and check the status.

# virsh blockjob rhel vda --pivot

# virsh dumpxml rhel | xmllint --xpath //disk -
<disk type="file" device="disk">
  <driver name="qemu" type="qcow2"/>
  <source file="/tmp/rhel.copy" index="2"/>
  <backingStore/>
  <target dev="vda" bus="virtio"/>
  <alias name="virtio-disk0"/>
  <address type="pci" domain="0x0000" bus="0x04" slot="0x00" function="0x0"/>
</disk>

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (libvirt bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2023:2171
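The role of the 'processPending' flag introduced by the upstream fix can be illustrated with a small state-machine model. This is a simplified sketch in Python; the names (`BlockJob`, `on_qemu_state`, `pending_seen`) are made up for illustration and do not mirror libvirt's actual C data structures:

```python
# Simplified model of the state handling fixed upstream by commit
# c433c2434c04. All names here are illustrative, not libvirt's API.

class BlockJob:
    def __init__(self, name, process_pending=False):
        self.name = name
        # Set only when the caller (e.g. snapshot code doing block-commit
        # without auto-finalization) must act on the 'pending' state.
        self.process_pending = process_pending
        self.pending_seen = False
        self.watched = True
        self.dismissed = False

    def on_qemu_state(self, state):
        """Handle a job state notification coming from QEMU."""
        if state == "pending":
            if self.process_pending:
                self.pending_seen = True  # caller finalizes the job here
            # Without process_pending, 'pending' is simply ignored.
            # The pre-fix code instead treated it as final for every job
            # and stopped updating the job's data.
            return
        if state == "concluded":
            self.dismissed = True   # safe to drop the job's data now
            self.watched = False

# A 'blockdev-create' job never sets process_pending, so it is only
# considered finished (and dismissed) once QEMU reports 'concluded'.
job = BlockJob("blockdev-create")
for state in ("running", "pending", "concluded"):
    job.on_qemu_state(state)
assert job.dismissed and not job.watched
```

With the pre-fix behaviour, the `pending` notification would have ended tracking of the `blockdev-create` job, leaving it undismissed in QEMU; that is the leftover job the daemon then mishandles on restart.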
Created attachment 1943206 [details]
libvirtd.log

Description of problem:
Save guest failed sometimes after restarting the service (mainly rhel6 guest)

Version-Release number of selected component (if applicable):
libvirt-9.0.0-3.el9.x86_64
qemu-kvm-7.2.0-7.el9.x86_64

How reproducible:
100% in the auto test, roughly 50% in manual testing

Steps to Reproduce:
1. Prepare a running rhel6 seabios guest with a virtio-transitional model disk.

# virsh dumpxml rhel | xmllint --xpath //disk -
<disk type='file' device='disk' model='virtio-transitional'>
  <driver name='qemu' type='qcow2'/>
  <source file='/var/lib/libvirt/images/rhel6.qcow2'/>
  <backingStore/>
  <target dev='vda' bus='virtio'/>
  <alias name='virtio-disk0'/>
  <address type='pci' domain='0x0000' bus='0x04' slot='0x00' function='0x0'/>
</disk>

2. Create snapshots for the rhel6 guest.

# for i in {1..3}; do virsh snapshot-create-as rhel s$i --disk-only; done
Domain snapshot s1 created
Domain snapshot s2 created
Domain snapshot s3 created

3. Restart the service.

# systemctl restart virtqemud

4. Save the guest.

# virsh save rhel rhel.save
error: Failed to save domain 'rhel' to rhel.save
error: operation failed: job 'domain save' unexpectedly failed

Actual results:
The save in step 4 fails.

Expected results:
The save succeeds.

Additional info:
1) It is reliably reproduced by the libvirt auto case:
virtio_transitional_blk.rhel6_guest.default.with_virtio_blk.snapshot_test.virtio_transitional
2) Some machines have a 100% reproduction rate, while on others it may not reproduce.
3) Sometimes it can also be reproduced on the latest RHEL 9.
4) libvirtd.log:

2023-02-09 06:06:59.633+0000: 1465479: info : qemuMonitorSend:861 : QEMU_MONITOR_SEND_MSG: mon=0x7fe9dc0e3460 msg={"execute":"blockdev-del","arguments":{"node-name":"libvirt-4-storage"},"id":"libvirt-12"} fd=-1
2023-02-09 06:06:59.633+0000: 1465480: info : virObjectRef:400 : OBJECT_REF: obj=0x7fe9dc0e3460
2023-02-09 06:06:59.633+0000: 1465480: info : qemuMonitorIOWrite:366 : QEMU_MONITOR_IO_WRITE: mon=0x7fe9dc0e3460 buf={"execute":"blockdev-del","arguments":{"node-name":"libvirt-4-storage"},"id":"libvirt-12"} len=92 ret=92 errno=0
2023-02-09 06:06:59.633+0000: 1465480: info : virObjectRef:400 : OBJECT_REF: obj=0x7fe9dc0e3460
2023-02-09 06:06:59.633+0000: 1465480: info : virObjectUnref:378 : OBJECT_UNREF: obj=0x7fe9dc0e3460
2023-02-09 06:06:59.633+0000: 1465480: info : virObjectUnref:378 : OBJECT_UNREF: obj=0x7fe9dc0e3460
2023-02-09 06:06:59.635+0000: 1465480: info : virObjectRef:400 : OBJECT_REF: obj=0x7fe9dc0e3460
2023-02-09 06:06:59.635+0000: 1465480: info : qemuMonitorJSONIOProcessLine:208 : QEMU_MONITOR_RECV_REPLY: mon=0x7fe9dc0e3460 reply={"id": "libvirt-12", "error": {"class": "GenericError", "desc": "Block device libvirt-4-storage is in use"}}
2023-02-09 06:06:59.635+0000: 1465480: info : virObjectRef:400 : OBJECT_REF: obj=0x7fe9dc0e3460
2023-02-09 06:06:59.635+0000: 1465480: info : virObjectUnref:378 : OBJECT_UNREF: obj=0x7fe9dc0e3460
2023-02-09 06:06:59.635+0000: 1465480: info : virObjectUnref:378 : OBJECT_UNREF: obj=0x7fe9dc0e3460
2023-02-09 06:06:59.635+0000: 1465479: info : virObjectUnref:378 : OBJECT_UNREF: obj=0x7fe9dc0e3460
2023-02-09 06:06:59.635+0000: 1465479: info : virObjectRef:400 : OBJECT_REF: obj=0x7fe9dc0e3460
2023-02-09 06:06:59.635+0000: 1465479: error : qemuMonitorJSONCheckErrorFull:366 : internal error: unable to execute QEMU command 'blockdev-del': Block device libvirt-4-storage is in use