Description of problem:
-----------------------
Virtual Machines (VMs) created with the gfapi access mechanism are not highly available. VMs depend on the first/primary gluster volfile server to obtain the volfiles. When the first/primary server is down, starting the VMs fails.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHV 4.1.5
glusterfs-3.8.4
vdsm-4.19.28-1
qemu-kvm-rhev-2.9.0-16

How reproducible:
-----------------
Always

Steps to Reproduce:
-------------------
1. Add 3 RHEL 7.4 nodes to a converged 4.1 cluster (with virt & gluster capability enabled)
2. Enable libgfapi in the engine (LibgfApiSupported=true)
3. Create a gluster replica 3 volume and use it as an RHV data domain (GlusterFS type)
4. Create VMs and install an OS
5. Move the first node (primary volfile server) to maintenance, stopping gluster services
6. Stop the VMs
7. Start the VMs

Actual results:
---------------
Unable to start the VMs

Expected results:
-----------------
VMs should start even when the first node (primary volfile server) is unavailable

Additional info:
----------------
Error message as seen in the events when starting a VM in the absence of the first node (primary volfile server):

<snip>
VM appvm01 is down with error. Exit message: failed to initialize gluster connection (src=0x7f55080198c0 priv=0x7f550800be30): Transport endpoint is not connected.
</snip>
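For step 2 above, enabling libgfapi is done on the engine host with engine-config. A minimal sketch, assuming a 4.1 cluster level (the --cver value here is an assumption for this setup; check your cluster's compatibility version):

```shell
# Sketch: enable libgfapi support at the 4.1 cluster level, then restart
# the engine so the change takes effect.
engine-config -s LibgfApiSupported=true --cver=4.1
systemctl restart ovirt-engine
```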
Following is the observation from the XML definition of the VM. There are *no* additional volfile servers mentioned. So every time the VM starts, the QEMU process fetches the volfile from the primary volfile server (in this case, 10.70.36.73), and if it is unavailable, QEMU fails to start the VM with 'Transport endpoint is not connected'.

We should pass the additional mount options of the GlusterFS storage domain, i.e. 'backup-volfile-servers', as fallback hosts, so that QEMU can query those servers too when fetching volfiles.

<disk type='network' device='disk' snapshot='no'>
  <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads'/>
  <source protocol='gluster' name='vmstore/051c9cd5-807c-4131-97e7-db306a7b3142/images/98b106c6-b2b3-4a94-8178-b912242567a1/895a694a-11a7-4327-bc13-55ab08805cb3'>
    <host name='10.70.36.73' port='0'/>
  </source>
  <backingStore/>
  <target dev='sda' bus='scsi'/>
  <serial>98b106c6-b2b3-4a94-8178-b912242567a1</serial>
  <boot order='2'/>
  <alias name='scsi0-0-0-0'/>
  <address type='drive' controller='0' bus='0' target='0' unit='0'/>
</disk>
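For illustration, once libvirt accepts multiple gluster hosts for a disk source, the fallback volfile servers could be expressed as additional <host> elements. This is a sketch, not output from a real VM; the two extra IP addresses are hypothetical stand-ins for the other replica peers:

```xml
<source protocol='gluster' name='vmstore/051c9cd5-807c-4131-97e7-db306a7b3142/images/98b106c6-b2b3-4a94-8178-b912242567a1/895a694a-11a7-4327-bc13-55ab08805cb3'>
  <host name='10.70.36.73' port='0'/>
  <host name='10.70.36.74' port='0'/>  <!-- hypothetical peer address -->
  <host name='10.70.36.75' port='0'/>  <!-- hypothetical peer address -->
</source>
```

With such a definition, QEMU could fall back to the remaining servers for the volfile when the primary is down, mirroring what 'backup-volfile-servers' does for FUSE mounts.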
This issue breaks the high availability of virtual machines when VMs are stopped and started again, and affects Red Hat's hyperconverged product (RHHI 1.1).
Denis, can you take a look at this? Shouldn't multiple hosts have been used in the domain xml?
It is not yet supported by libvirt. Please check the following bug: https://bugzilla.redhat.com/show_bug.cgi?id=1465810
The bug is currently targeted for 7.5, so defer this one to 4.3?
(In reply to Michal Skrivanek from comment #5)
> the bug is currently targeted for 7.5, so defer this one to 4.3?

I think we need to work with the qemu/libvirt team to see if this bug can make it earlier than 7.5, as VM HA is a critical requirement for RHHI.

Denis, is it possible to choose an online host to pass to the domain XML while creating the XML from vdsm?
Manually or automatically? Atm VDSM just takes the first host from the list of hosts supplied by the engine. I think we can order that list in any way we like.
(In reply to Denis Chaplygin from comment #7)
> Manually or automatically?

Automatically

> Atm VDSM just takes the first host from the list of hosts, supplied by the
> engine. I think we can order that list in any way we like.

Instead of picking the first host, pick a host that's online?
VDSM has no idea about the state of other hosts. I can remove hosts which are in a "non UP" state on the engine side.
I investigated that issue and discovered that the host list is not sent by the engine within the libvirt XML during VM creation. Instead, it is sent only on storage domain mount and then cached by vdsm. During VM creation, vdsm takes that cached list and uses the first host from it, to work around libvirt bug [1].

Unfortunately, at run time vdsm has no idea whether the gluster bricks are alive or not. Even worse, a peer status call uses just the first host from the list mentioned above, and if it is not available, no status data is returned.

Thus, we could either modify the vdsm API and the engine libvirt builder to send host information during VM creation, or wait for the libvirt bug to be fixed and just remove that single-host workaround.

[1] https://bugzilla.redhat.com/show_bug.cgi?id=1465810
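The "pick an online host" idea discussed above could look roughly like this. This is a hypothetical sketch, not actual vdsm or engine code; the `Host` type and its status strings are made-up stand-ins for whatever host-state information the engine could supply:

```python
from dataclasses import dataclass

@dataclass
class Host:
    address: str
    status: str  # hypothetical states, e.g. "up", "maintenance", "non_responsive"

def pick_volfile_server(hosts):
    """Return the address of the first host reported as up.

    Falls back to the first host in the list, which is what the
    current single-host workaround effectively does."""
    for host in hosts:
        if host.status == "up":
            return host.address
    # No host is known to be up: keep today's behaviour.
    return hosts[0].address if hosts else None

# With the primary volfile server in maintenance, a live peer is chosen instead:
hosts = [Host("10.70.36.73", "maintenance"),
         Host("10.70.36.74", "up"),
         Host("10.70.36.75", "up")]
```

The key design point is that the ordering/filtering has to happen on the engine side, since (as noted above) vdsm itself has no view of the other hosts' states.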
(In reply to Denis Chaplygin from comment #10)
> Thus, we could either modify vdsm api and engine libvirt builder to send
> host information during vm creation or wait for libvirt bug to be fixed and
> just remove that single host workaround.
>
> [1] https://bugzilla.redhat.com/show_bug.cgi?id=1465810

Denis, these options need to be discussed with a wider audience. But how costly is it to modify the vdsm API and the engine libvirt builder to send host info during VM creation?
Is there a workaround for this, i.e. to manually set the proper gluster server somewhere? The only thing I could do was to put the gluster domain into maintenance, edit it, and change the host there. It gets pretty painful if we have to take down the whole gluster domain just to start a VM...
Closing this, as no action has been taken in a long time. Please reopen if required.
Things seem to be moving in qemu: https://bugzilla.redhat.com/show_bug.cgi?id=1465810

Time to reopen?
Users on ovirt-users report a 4x~5x performance improvement with libgfapi.