Created attachment 1775520 [details]
logs

Description of problem:

Hot-plugging an iSCSI lease (by editing the VM to be highly available) causes a libvirt failure that leads the VDSM host into an 'unresponsive' state.

If the customer has only one host running the VM and tries this scenario, they will fail to complete the VM reboot (which is required for the HA configuration to take effect) and will also be left with a failed VDSM host.

When the 'edit' of the VM is invoked (to make it HA), the engine fails with [1]:

  IRSGenericException: IRSErrorException: No such lease: 'lease=95888924-202e-447a-91cc-73f08049aa63'

Then libvirt fails with [2]:

  libvirt.libvirtError: End of file while reading data: Input/output error

The host then becomes unresponsive:

2021-04-25 13:05:04,733+03 WARN [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedThreadFactory-engine-Thread-5356) [] EVENT_ID: VDS_HOST_NOT_RESPONDING(9,027), Host host_mixed_1 is not responding. Host cannot be fenced automatically because power management for the host is disabled.

Eventually (after a few minutes) the VDSM host recovers:

2021-04-25 13:10:11,205+03 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-77) [5a9d3fd] EVENT_ID: VDS_DETECTED(13), Status of host host_mixed_1 was set to Up.

[1]:
2021-04-25 13:04:50,722+03 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetVmLeaseInfoVDSCommand] (default task-25) [vms_update_45f57b60-73da-481a] Command 'GetVmLeaseInfoVDSCommand( VmLeaseVDSParameters:{expectedEngineErrors='[NoSuchVmLeaseOnDomain]', storagePoolId='14625613-3e8e-439f-b488-1c65925a0465', ignoreFailoverLimit='false', leaseId='95888924-202e-447a-91cc-73f08049aa63', storageDomainId='9d3ad0b6-2eb3-436e-b4da-f1092b0dd599'})' execution failed: IRSGenericException: IRSErrorException: No such lease: 'lease=95888924-202e-447a-91cc-73f08049aa63'

[2]:
2021-04-25 13:04:59,253+0300 ERROR (jsonrpc/3) [api] FINISH hotplugLease error=Failed to hotplug lease: {'lease': <lease.Device sd_id=9d3ad0b6-2eb3-436e-b4da-f1092b0dd599, lease_id=95888924-202e-447a-91cc-73f08049aa63, path=/dev/9d3ad0b6-2eb3-436e-b4da-f1092b0dd599/xleases, offset=3145728 at 0x7faa835fbb48>, 'reason': 'End of file while reading data: Input/output error'} (api:131)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 3837, in hotplugLease
    self._dom.attachDevice(leaseXml)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 682, in attachDevice
    raise libvirtError('virDomainAttachDevice() failed')
libvirt.libvirtError: End of file while reading data: Input/output error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 124, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/API.py", line 410, in hotplugLease
    return self.vm.hotplugLease(lease)
  File "<decorator-gen-263>", line 2, in hotplugLease
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 101, in method
    return func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 3842, in hotplugLease
    raise exception.HotplugLeaseFailed(reason=str(e), lease=lease)
vdsm.common.exception.HotplugLeaseFailed: Failed to hotplug lease: {'lease': <lease.Device sd_id=9d3ad0b6-2eb3-436e-b4da-f1092b0dd599, lease_id=95888924-202e-447a-91cc-73f08049aa63, path=/dev/9d3ad0b6-2eb3-436e-b4da-f1092b0dd599/xleases, offset=3145728 at 0x7faa835fbb48>, 'reason': 'End of file while reading data: Input/output error'}

. . . .

2021-04-25 13:04:59,425+0300 ERROR (MainThread) [vds] Exception raised (vdsmd:167)
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/vdsmd.py", line 165, in run
    serve_clients(log)
  File "/usr/lib/python3.6/site-packages/vdsm/vdsmd.py", line 132, in serve_clients
    cif.prepareForShutdown()
  File "/usr/lib/python3.6/site-packages/vdsm/clientIF.py", line 391, in prepareForShutdown
    secret.clear()
  File "/usr/lib/python3.6/site-packages/vdsm/virt/secret.py", line 96, in clear
    for virsecret in con.listAllSecrets():
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 6074, in listAllSecrets
    raise libvirtError("virConnectListAllSecrets() failed")
libvirt.libvirtError: internal error: client socket is closed

Marking this as a 'regression' bug because:
- It works on rhv-4.4.5-11 (tested manually + automated test).
- It FAILS on rhv-release-4.4.6-6.
- According to the report portal, this test was also working fine on rhv-4.4.5-10.
- In between those versions we did not have a tier3 test suite run.

I also tried the following scenario: clone a VM from a template with an iSCSI lease (making the VM HA) and start it. This works as expected.

Version-Release number of selected component (if applicable):
rhv-release-4.4.6-6-001.noarch
libvirt-7.0.0-13.module+el8.4.0+10604+5608c2b4.x86_64

How reproducible:
Fully reproducible, both manually and with the automated test.

Steps to Reproduce:
1. Clone a VM out of a RHEL 8.3 template and run it.
2. Edit the running VM and make it an HA VM with an iSCSI lease (at this point I took the dumpxml of this VM). In the opened dialog box, do not mark the 'apply later' option and click 'OK'.

Actual results:
Multiple errors in the engine + VDSM logs.
The VDSM host fails and becomes 'unresponsive'.

Expected results:
There should not be any errors or failures.
The VDSM host should not fail.

Additional info:
Attaching engine + VDSM logs + VM dumpxml + libvirtd log.
The libvirtd log contains only 'info' level messages. Please configure logging using the log filters described on this page (an example snippet follows below):

https://www.libvirt.org/kbase/debuglogs.html#less-verbose-logging-for-qemu-vms

In addition, the error message hints that libvirtd has crashed. Can you please confirm that, and ideally collect the backtrace via coredumpctl if it indeed crashed?
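For reference, the kind of libvirtd.conf settings that page describes look like the following; treat the exact filter list as an example and use whatever the page currently recommends (libvirtd needs a restart afterwards):

# /etc/libvirt/libvirtd.conf -- example filter set, see the kbase page above
log_filters="1:qemu 1:libvirt 4:object 4:json 4:event 1:util"
log_outputs="1:file:/var/log/libvirt/libvirtd.log"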
I reproduced the issue once more and am attaching the VDSM + engine logs and the libvirt DEBUG logs again.

Looking at the system messages, I can see that libvirt indeed crashed:

Apr 27 09:46:38 storage-ge13-vdsm1 libvirtd[697822]: corrupted size vs. prev_size
Apr 27 09:46:38 storage-ge13-vdsm1 abrt-hook-ccpp[700800]: Process 697822 (libvirtd) of user 0 killed by SIGABRT - dumping core
Apr 27 09:46:39 storage-ge13-vdsm1 vdsm[698097]: WARN connection to libvirt broken. ecode: 1 edom: 7
Apr 27 09:46:39 storage-ge13-vdsm1 systemd[1]: libvirtd.service: Main process exited, code=killed, status=6/ABRT
Apr 27 09:46:39 storage-ge13-vdsm1 vdsm[698097]: CRIT taking calling process down.
Apr 27 09:46:39 storage-ge13-vdsm1 systemd[1]: libvirtd.service: Failed with result 'signal'.
Apr 27 09:46:39 storage-ge13-vdsm1 vdsm[698097]: ERROR FINISH hotplugLease error=Failed to hotplug lease: {'lease': <lease.Device sd_id=9d3ad0b6-2eb3-436e-b4da-f1092b0dd599, lease_id=676030b1-b795-4c74-9a1c-a766f8923837, path=/dev/9d3ad0b6-2eb3-436e-b4da-f1092b0dd599/xleases, offset=3145728 at 0x7f452405fa98>, 'reason': 'End of file while reading data: Input/output error'}
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 3837, in hotplugLease
    self._dom.attachDevice(leaseXml)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/virdomain.py", line 101, in f
    ret = attr(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/libvirtconnection.py", line 131, in wrapper
    ret = f(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/common/function.py", line 94, in wrapper
    return func(inst, *args, **kwargs)
  File "/usr/lib64/python3.6/site-packages/libvirt.py", line 682, in attachDevice
    raise libvirtError('virDomainAttachDevice() failed')
libvirt.libvirtError: End of file while reading data: Input/output error

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 124, in method
    ret = func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/API.py", line 410, in hotplugLease
    return self.vm.hotplugLease(lease)
  File "<decorator-gen-263>", line 2, in hotplugLease
  File "/usr/lib/python3.6/site-packages/vdsm/common/api.py", line 101, in method
    return func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/vdsm/virt/vm.py", line 3842, in hotplugLease
    raise exception.HotplugLeaseFailed(reason=str(e), lease=lease)
vdsm.common.exception.HotplugLeaseFailed: Failed to hotplug lease: {'lease': <lease.Device sd_id=9d3ad0b6-2eb3-436e-b4da-f1092b0dd599, lease_id=676030b1-b795-4c74-9a1c-a766f8923837, path=/dev/9d3ad0b6-2eb3-436e-b4da-f1092b0dd599/xleases, offset=3145728 at 0x7f452405fa98>, 'reason': 'End of file while reading data: Input/output error'}
Since it indeed crashed, please make sure to collect a backtrace of all threads from the coredump. Use the following approach:

1) Make sure that 'libvirt-debuginfo' is installed.
2) coredumpctl debug /usr/sbin/libvirtd <<< "thread apply all backtrace" > backtrace
3) Attach the 'backtrace' file.
Actually, I've noticed that 'abrt' collected the coredump. Could you please have a look at whether 'abrt' knows about the crash? That should be possible via 'abrt-cli', but my Fedora doesn't have that any more, so you might need to read the man page to figure out the proper invocation.
Attaching the output of the command:

[root@storage-ge13-vdsm1 ~]# abrt-cli info -d /var/tmp/abrt/ccpp-2021-04-27-09:46:38-697822 > report.txt

If you need additional artifacts, dumps, or anything else from this host, please contact me in private and I will give you access to it.
Created attachment 1778390 [details]
[root@storage-ge13-vdsm1 ~]# abrt-cli info -d /var/tmp/abrt/ccpp-2021-04-27-09:46:38-697822 > report.txt
Thank you. The report showed that the crash is an abort() in realloc(), thus caused by memory corruption, so any further data would not actually be relevant to the crash itself.

I've traced the problem to a bad refactor of memory allocation which causes subsequent heap corruption when attaching a lease.

Upstream patch proposed:
https://listman.redhat.com/archives/libvir-list/2021-May/msg00001.html
Fixed upstream:

commit 6dca93e4bcaf1906fb3379dd98fc8056a597c4cc
Author: Peter Krempa <pkrempa>
Date:   Sun May 2 11:45:22 2021 +0200

    conf: Fix heap corruption when hot-adding a lease

    Commit 28a86993162f7d2f ( v6.9.0-179-g28a8699316 ) incorrectly replaced
    VIR_EXPAND_N by g_renew.

    VIR_EXPAND_N has these two extra effects apart from reallocating memory:

    1) The newly allocated memory is zeroed out
    2) The number of elements in the array which is passed to VIR_EXPAND_N
       is increased.

    This comes into play when used with virDomainLeaseInsertPreAlloced,
    which expects that the array element count already includes the space
    for the added 'lease', by plainly just assigning to
    'leases[nleases - 1]'.

    Since g_renew does not increase the number of elements in the array,
    any existing code which calls virDomainLeaseInsertPreAlloced thus
    either overwrites a lease definition or corrupts the heap if there are
    no leases to start with.

    To preserve existing functionality we revert the code back to using
    VIR_EXPAND_N, which at this point doesn't return any value, so other
    commits don't need to be reverted.

    Resolves: https://bugzilla.redhat.com/show_bug.cgi?id=1953577

    Signed-off-by: Peter Krempa <pkrempa>
    Reviewed-by: Jiri Denemark <jdenemar>

v7.3.0-rc2-1-g6dca93e4bc
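To make the failure mode concrete, here is a small standalone C sketch. It is illustrative only, not libvirt source: the helpers merely mimic what the commit message says VIR_EXPAND_N and virDomainLeaseInsertPreAlloced do, and all names are hypothetical.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

typedef struct { char key[40]; } Lease;

/* Mimics VIR_EXPAND_N(leases, nleases, 1): grow the array, zero-fill the
 * new slot, and bump the caller-visible element count. */
static void expand_n(Lease **leases, size_t *nleases)
{
    Lease *tmp = realloc(*leases, (*nleases + 1) * sizeof(*tmp));
    if (!tmp)
        abort();
    memset(&tmp[*nleases], 0, sizeof(*tmp));
    *leases = tmp;
    (*nleases)++;
}

/* Mimics the buggy g_renew() replacement: grows the array but leaves the
 * element count untouched -- that missing increment is the bug. */
static void renew_only(Lease **leases, size_t *nleases)
{
    Lease *tmp = realloc(*leases, (*nleases + 1) * sizeof(*tmp));
    if (!tmp)
        abort();
    *leases = tmp;
}

/* Mimics virDomainLeaseInsertPreAlloced(): assumes the count already
 * includes the slot reserved for the new lease. */
static void insert_prealloced(Lease *leases, size_t nleases, const char *key)
{
    snprintf(leases[nleases - 1].key, sizeof(leases[nleases - 1].key), "%s", key);
}

int main(void)
{
    Lease *leases = NULL;
    size_t nleases = 0;

    /* Correct pattern: the new slot exists and the count points at it. */
    expand_n(&leases, &nleases);
    insert_prealloced(leases, nleases, "somekey");
    printf("ok: nleases=%zu key=%s\n", nleases, leases[0].key);

    /* Buggy pattern: with nbroken still 0 the insert would write to
     * broken[-1], i.e. into heap metadata -- the kind of corruption that
     * made libvirtd abort later inside malloc()/realloc(). */
    Lease *broken = NULL;
    size_t nbroken = 0;
    renew_only(&broken, &nbroken);
    /* insert_prealloced(broken, nbroken, "somekey");   <-- heap corruption */

    free(leases);
    free(broken);
    return 0;
}

With the buggy growth helper, a domain that has no leases yet would be corrupted on the very first hotplug, which matches the "corrupted size vs. prev_size" abort seen in the journal above.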
Created attachment 1780075 [details]
VM xml, device xml, full threads backtrace

Reproduced on libvirt-7.0.0-4.el9.x86_64:

# virsh attach-device avocado-vt-vm1 lease.xml
error: Disconnected from qemu:///system due to end of file
error: Failed to attach device from lease.xml
error: End of file while reading data: Input/output error

Backtrace:

(gdb) bt
#0  0x00007f6958821262 in raise () from /lib64/libc.so.6
#1  0x00007f695880a8a4 in abort () from /lib64/libc.so.6
#2  0x00007f6958863cd7 in __libc_message () from /lib64/libc.so.6
#3  0x00007f695886b95c in malloc_printerr () from /lib64/libc.so.6
#4  0x00007f695886c856 in unlink_chunk.constprop () from /lib64/libc.so.6
#5  0x00007f695886c9e0 in malloc_consolidate () from /lib64/libc.so.6
#6  0x00007f695886e8c0 in _int_malloc () from /lib64/libc.so.6
#7  0x00007f6958870551 in malloc () from /lib64/libc.so.6
#8  0x00007f695825a566 in yajl_buf_ensure_available (want=1, buf=0x7f6944005c90) at /usr/src/debug/yajl-2.1.0-16.el9.x86_64/src/yajl_buf.c:42
#9  yajl_buf_append (buf=0x7f6944005c90, data=0x7f695826016c, len=1) at /usr/src/debug/yajl-2.1.0-16.el9.x86_64/src/yajl_buf.c:73
#10 0x00007f695825eac2 in yajl_gen_map_open (g=g@entry=0x7f6944006db0) at /usr/src/debug/yajl-2.1.0-16.el9.x86_64/src/yajl_gen.c:303
#11 0x00007f6958e50290 in virJSONValueToStringOne (object=0x7f6944004a70, g=0x7f6944006db0) at ../src/util/virjson.c:1925
#12 0x00007f6958e59fc6 in virJSONValueToBuffer (object=object@entry=0x7f6944004a70, buf=buf@entry=0x7f69559dd600, pretty=pretty@entry=false) at ../src/util/virjson.c:2002
#13 0x00007f694ccbcf2c in qemuMonitorJSONCommandWithFd (mon=0x7f6924034020, cmd=0x7f6944004a70, scm_fd=-1, reply=0x7f69559dd6b0) at ../src/qemu/qemu_monitor_json.c:322
#14 0x00007f694ccc8327 in qemuMonitorJSONCommand (reply=0x7f69559dd6b0, cmd=0x7f6944004a70, mon=0x7f6924034020) at ../src/qemu/qemu_monitor_json.c:355
#15 qemuMonitorJSONGetObjectListPaths (mon=0x7f6924034020, path=<optimized out>, paths=0x7f69559dd710) at ../src/qemu/qemu_monitor_json.c:6498
#16 0x00007f694ccd1936 in qemuMonitorJSONGetDeviceAliases (mon=<optimized out>, aliases=0x7f69559dd760) at ../src/qemu/qemu_monitor_json.c:7669
#17 0x00007f694cc54583 in qemuDomainUpdateDeviceList (driver=0x5569831a9860, vm=0x5569830ee6a0, asyncJob=<optimized out>) at ../src/qemu/qemu_domain.c:7933
#18 0x00007f694cc6d183 in qemuDomainAttachDeviceLive (driver=0x5569831a9860, dev=0x7f694400cee0, vm=0x5569830ee6a0) at ../src/qemu/qemu_driver.c:7033
#19 qemuDomainAttachDeviceLiveAndConfig (flags=<optimized out>, xml=0x7f6944007dc0 " <lease>\n <lockspace>somearea</lockspace>\n <key>somekey</key>\n <target path='/some/lease/path' offset='1024'/>\n </lease>\n", driver=0x5569831a9860, vm=0x5569830ee6a0) at ../src/qemu/qemu_driver.c:7795
#20 qemuDomainAttachDeviceFlags (dom=<optimized out>, xml=0x7f6944007dc0 " <lease>\n <lockspace>somearea</lockspace>\n <key>somekey</key>\n <target path='/some/lease/path' offset='1024'/>\n </lease>\n", flags=<optimized out>) at ../src/qemu/qemu_driver.c:7841
#21 0x00007f6959049884 in virDomainAttachDevice (domain=domain@entry=0x7f6950005da0, xml=0x7f6944007dc0 " <lease>\n <lockspace>somearea</lockspace>\n <key>somekey</key>\n <target path='/some/lease/path' offset='1024'/>\n </lease>\n") at ../src/libvirt-domain.c:8339
#22 0x000055698276d689 in remoteDispatchDomainAttachDevice (server=0x5569830ee080, msg=0x556983435d10, args=0x7f6944004b60, rerr=0x7f69559dd9a0, client=<optimized out>) at src/remote/remote_daemon_dispatch_stubs.h:3660
#23 remoteDispatchDomainAttachDeviceHelper (server=0x5569830ee080, client=<optimized out>, msg=0x556983435d10, rerr=0x7f69559dd9a0, args=0x7f6944004b60, ret=0x0) at src/remote/remote_daemon_dispatch_stubs.h:3639
#24 0x00007f6958f58ab6 in virNetServerProgramDispatchCall (msg=0x556983435d10, client=0x556983246150, server=0x5569830ee080, prog=0x556983151010) at ../src/rpc/virnetserverprogram.c:428
#25 virNetServerProgramDispatch (prog=0x556983151010, server=0x5569830ee080, client=0x556983246150, msg=0x556983435d10) at ../src/rpc/virnetserverprogram.c:302
#26 0x00007f6958f5eb28 in virNetServerProcessMsg (msg=<optimized out>, prog=<optimized out>, client=<optimized out>, srv=0x5569830ee080) at ../src/rpc/virnetserver.c:137
#27 virNetServerHandleJob (jobOpaque=0x55698324c090, opaque=0x5569830ee080) at ../src/rpc/virnetserver.c:154
#28 0x00007f6958e96cf2 in virThreadPoolWorker (opaque=<optimized out>) at ../src/util/virthreadpool.c:163
#29 0x00007f6958e9de29 in virThreadHelper (data=<optimized out>) at ../src/util/virthread.c:233
#30 0x00007f6958108299 in start_thread () from /lib64/libpthread.so.0
#31 0x00007f69588e4863 in clone () from /lib64/libc.so.6
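For readability, the lease XML passed to virsh and visible in frames #19-#21 above (with the escaped whitespace expanded) is:

<lease>
  <lockspace>somearea</lockspace>
  <key>somekey</key>
  <target path='/some/lease/path' offset='1024'/>
</lease>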
Not reproduced on v7.3.0-85-g68e8fbe6b1:

➜ ~ virsh attach-device 1 ./lease.xml
Device attached successfully

➜ ~ virsh detach-device 1 ./lease.xml
Device detached successfully
Created attachment 1780257 [details]
Screenshot_rhv_reproduce.png

Reproduced with libvirt/RHV integration:

RHV env: Software Version: 4.4.6.5-0.17.el8ev
Registered host:
rhv-release-4.4.6-6-001.noarch
libvirt-7.0.0-13.module+el8.4.0+10604+5608c2b4.x86_64

Steps:
1. Prepare a running VM on RHV. Domain XML taken from the registered RHV host:

...
<disk type='block' device='disk' snapshot='no'>
  <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='native'/>
  <source dev='/rhev/data-center/mnt/blockSD/0ab0c1a3-8982-4421-ba41-932f60d1082f/images/4622f8a4-3357-470c-bd8e-dd7099c6f475/642b76c3-ff7c-4cd0-94d9-58c9219efbb4' index='1'>
    <seclabel model='dac' relabel='no'/>
  </source>
  <backingStore/>
  <target dev='sda' bus='scsi'/>
  <serial>4622f8a4-3357-470c-bd8e-dd7099c6f475</serial>
  <boot order='1'/>
  <alias name='ua-4622f8a4-3357-470c-bd8e-dd7099c6f475'/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
...

2. Edit the running VM and make it an HA VM with an iSCSI lease. In the opened dialog box, don't mark the 'apply later' option and click 'OK'.

Test result:
The registered RHV host becomes 'unresponsive' and automatically goes back to 'Up' after a few minutes.

For details please see the attachment: Screenshot_rhv_reproduce.png
Based on the steps in comment 17, I retested with the new libvirt version: libvirt-7.0.0-14.module+el8.4.0+10886+79296686.x86_64.

RHV env: Software Version: 4.4.6.5-0.17.el8ev
Registered host:
rhv-release-4.4.6-6-001.noarch
libvirt-7.0.0-14.module+el8.4.0+10886+79296686.x86_64

Steps:
1. Prepare a running VM on RHV. Domain XML taken from the registered RHV host:

...
<disk type='block' device='disk' snapshot='no'>
  <driver name='qemu' type='qcow2' cache='none' error_policy='stop' io='native'/>
  <source dev='/rhev/data-center/mnt/blockSD/0ab0c1a3-8982-4421-ba41-932f60d1082f/images/15c38688-b030-4ea2-8003-a039b69387ab/84082d1e-739a-4d13-9ab6-4805c1eb3ff9' index='1'>
    <seclabel model='dac' relabel='no'/>
  </source>
  <backingStore/>
  <target dev='sda' bus='scsi'/>
  <serial>15c38688-b030-4ea2-8003-a039b69387ab</serial>
  <boot order='1'/>
  <alias name='ua-15c38688-b030-4ea2-8003-a039b69387ab'/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
...

2. Edit the running VM and make it an HA VM with an iSCSI lease. In the opened dialog box, don't mark the 'apply later' option and click 'OK'.

Test result:
2.1 The registered RHV host stays in 'Up' status.
2.2 The lease is hotplugged to the running VM successfully:

...
<lease>
  <lockspace>0ab0c1a3-8982-4421-ba41-932f60d1082f</lockspace>
  <key>03cb0d4a-3c59-40d4-8ecd-d689fc0cf0eb</key>
  <target path='/dev/0ab0c1a3-8982-4421-ba41-932f60d1082f/xleases' offset='5242880'/>
</lease>
...

2.3 I can also hot-unplug the lease from the VM successfully by editing the running VM and unchecking the HA option (again without marking 'apply later', then clicking 'OK').

In summary, it works well from the perspective of the libvirt/RHV integration test, thanks.
Verified as in comment 16 on:
libvirt-7.0.0-14.module+el8.4.0+10886+79296686.x86_64
qemu-kvm-5.2.0-16.module+el8.4.0+10806+b7d97207.x86_64

[root@hhan83 ~]# virsh attach-device rhel lease.xml
Device attached successfully

[root@hhan83 ~]# virsh detach-device rhel lease.xml
Device detached successfully
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory (virt:av bug fix and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2021:2098