Description of problem:
-----------------------
Virtual Machines (VMs) created with the gfapi access mechanism are not highly available. VMs depend on the first/primary gluster volfile server to obtain the volfiles. When the first/primary server is down, VM start fails.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHV 4.1.5
glusterfs-3.8.4
vdsm-4.19.28-1
qemu-kvm-rhev-2.9.0-16

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Add 3 RHEL 7.4 nodes to a converged 4.1 cluster (with virt & gluster capability enabled)
2. Enable libgfapi on the engine (LibgfApiSupported=true); see the sketch after this report
3. Create a gluster replica 3 volume and use it as an RHV data domain (GlusterFS type); see the same sketch
4. Create VMs and install the OS
5. Move the first node (primary volfile server) to maintenance, stopping gluster services
6. Stop the VMs
7. Start the VMs

Actual results:
---------------
Unable to start VMs

Expected results:
-----------------
VMs should start even when the first node (primary volfile server) is unavailable

Additional info:
----------------
Error message as seen in the events when starting a VM in the absence of the first node (primary volfile server):

<snip>
VM appvm01 is down with error. Exit message: failed to initialize gluster connection (src=0x7f55080198c0 priv=0x7f550800be30): Transport endpoint is not connected.
</snip>
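For reference, steps 2 and 3 above were performed roughly as follows. This is a sketch only: the host names and brick paths are hypothetical, and the --cver qualifier for engine-config may vary with the engine build.

<snip>
# Step 2: enable libgfapi in the engine, then restart it to pick up the change
engine-config -s LibgfApiSupported=true --cver=4.1
systemctl restart ovirt-engine

# Step 3: create and start a replica 3 volume (hypothetical hosts/bricks)
gluster volume create vmstore replica 3 \
    host1:/gluster/brick1/vmstore \
    host2:/gluster/brick1/vmstore \
    host3:/gluster/brick1/vmstore
gluster volume start vmstore
</snip>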
Following is the observation from the XML definition of the VM. There are no additional volfile servers mentioned, so every time the VM starts, the QEMU process fetches the volfile from the primary volfile server (in this case, 10.70.36.73), and if it is unavailable, QEMU fails to start the VM with 'Transport endpoint is not connected'.

We should pass the additional mount options supplied with the GlusterFS storage domain, i.e. 'backup-volfile-servers', as fallback hosts, so that QEMU can also query those servers to fetch the volfiles.

<disk type='network' device='disk' snapshot='no'>
  <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads'/>
  <source protocol='gluster' name='vmstore/051c9cd5-807c-4131-97e7-db306a7b3142/images/98b106c6-b2b3-4a94-8178-b912242567a1/895a694a-11a7-4327-bc13-55ab08805cb3'>
    <host name='10.70.36.73' port='0'/>
  </source>
  <backingStore/>
  <target dev='sda' bus='scsi'/>
  <serial>98b106c6-b2b3-4a94-8178-b912242567a1</serial>
  <boot order='2'/>
  <alias name='scsi0-0-0-0'/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
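Libvirt's network disk syntax accepts multiple <host> elements under <source>, so one possible fix is for vdsm to append the storage domain's backup-volfile-servers as extra hosts when generating the domain XML. A minimal sketch of the resulting source element, assuming 10.70.36.74 and 10.70.36.75 are the configured backup volfile servers (hypothetical addresses), and assuming the libvirt/QEMU versions in this report translate the extra hosts into a gluster server list:

<snip>
<source protocol='gluster' name='vmstore/051c9cd5-807c-4131-97e7-db306a7b3142/images/98b106c6-b2b3-4a94-8178-b912242567a1/895a694a-11a7-4327-bc13-55ab08805cb3'>
  <host name='10.70.36.73' port='0'/>
  <!-- hypothetical fallback hosts, taken from the domain's backup-volfile-servers mount option -->
  <host name='10.70.36.74' port='0'/>
  <host name='10.70.36.75' port='0'/>
</source>
</snip>

With more than one host listed, gfapi can try each server in turn when fetching the volfile, so the loss of the first node would no longer block VM start.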
This issue breaks the high availability of virtual machines whenever VMs are stopped and started again, and affects Red Hat's hyperconverged product (RHHI 1.1).
*** Bug 1596600 has been marked as a duplicate of this bug. ***
No plans to enable libgfapi in RHHI-V for now. Closing this bug.