Bug 1484660 - [Tracker RHV] Virtual Machines are not highly available with gluster libgfapi access mechanism
Keywords:
Status: CLOSED DEFERRED
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhhi
Version: rhhi-1.1
Hardware: x86_64
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Target Release: ---
Assignee: Sahina Bose
QA Contact: SATHEESARAN
URL:
Whiteboard:
Duplicates: 1596600
Depends On: 1484227
Blocks:
 
Reported: 2017-08-24 05:34 UTC by SATHEESARAN
Modified: 2020-04-09 14:44 UTC
CC List: 3 users

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2019-07-10 07:01:36 UTC
Embargoed:



Description SATHEESARAN 2017-08-24 05:34:47 UTC
Description of problem:
-----------------------
Virtual machines (VMs) created with the gfapi access mechanism are not highly available.

VMs depend on the first/primary gluster volfile server to obtain the volfiles. When the first/primary server is down, the VMs fail to start.
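For comparison, the FUSE mount used by a GlusterFS storage domain already has a fallback for this via the 'backup-volfile-servers' mount option. A minimal sketch, assuming 10.70.36.74 and 10.70.36.75 are the other two nodes ( hypothetical addresses ) and /mnt/vmstore is an arbitrary mount point:

    # FUSE mount with fallback volfile servers; if 10.70.36.73 is down,
    # the volfile is fetched from one of the backup servers instead.
    mount -t glusterfs \
        -o backup-volfile-servers=10.70.36.74:10.70.36.75 \
        10.70.36.73:/vmstore /mnt/vmstore

The gfapi access path currently has no equivalent fallback, which is what this bug tracks.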

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHV 4.1.5
glusterfs-3.8.4
vdsm-4.19.28-1
qemu-kvm-rhev-2.9.0-16

How reproducible:
-----------------
Always

Steps to Reproduce:
--------------------
1. Add 3 RHEL 7.4 nodes to a converged 4.1 cluster ( with virt & gluster capability enabled )
2. Enable libgfapi on the engine ( LibgfApiSupported=true; see the sketch after these steps )
3. Create a gluster replica 3 volume, use it as RHV data domain (GlusterFS Type)
4. Create VMs and install OS
5. Move the first node ( primary volfile server ) to maintenance, stopping gluster services
6. Stop the VMs
7. Start the VMs
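
For step 2, a minimal sketch of enabling the option on the engine host; the cluster compatibility version passed to --cver is an assumption based on the 4.1 cluster above:

    # Enable libgfapi disk access for 4.1 clusters, then restart the engine
    engine-config -s LibgfApiSupported=true --cver=4.1
    systemctl restart ovirt-engine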

Actual results:
---------------
Unable to start VMs

Expected results:
-----------------
VMs should start even when the first node ( primary volfile server ) is unavailable

Additional info:
----------------
Error message as seen in the events, when starting a VM in the absence of the first node ( primary volfile server ):
<snip>
VM appvm01 is down with error. Exit message: failed to initialize gluster connection (src=0x7f55080198c0 priv=0x7f550800be30): Transport endpoint is not connected.
</snip>

Comment 1 SATHEESARAN 2017-08-24 05:48:22 UTC
The following is observed from the XML definition of the VM: there are no additional volfile servers mentioned. So every time the VM starts, the QEMU process gets the volfile from the primary volfile server ( in this case, 10.70.36.73 ), and if that server is unavailable, QEMU fails to start the VM stating 'Transport endpoint is not connected'.

We should pass the additional mount options provided with the GlusterFS storage domain, i.e. 'backup-volfile-servers', as fallback hosts, so that QEMU can also query those servers to fetch the volfiles ( see the sketch after the disk definition below ).


  <disk type='network' device='disk' snapshot='no'>
      <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads'/>
      <source protocol='gluster' name='vmstore/051c9cd5-807c-4131-97e7-db306a7b3142/images/98b106c6-b2b3-4a94-8178-b912242567a1/895a694a-11a7-4327-bc13-55ab08805cb3'>
        <host name='10.70.36.73' port='0'/>
      </source>
      <backingStore/>
      <target dev='sda' bus='scsi'/>
      <serial>98b106c6-b2b3-4a94-8178-b912242567a1</serial>
      <boot order='2'/>
      <alias name='scsi0-0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
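
A sketch of what the disk source could look like if the backup volfile servers were also passed down; the 10.70.36.74 and 10.70.36.75 addresses are assumptions for the other two nodes, and libvirt accepts more than one host element for a gluster network disk:

      <source protocol='gluster' name='vmstore/051c9cd5-807c-4131-97e7-db306a7b3142/images/98b106c6-b2b3-4a94-8178-b912242567a1/895a694a-11a7-4327-bc13-55ab08805cb3'>
        <!-- 24007 is the default gluster management port; any listed host can serve the volfile -->
        <host name='10.70.36.73' port='24007'/>
        <host name='10.70.36.74' port='24007'/>
        <host name='10.70.36.75' port='24007'/>
      </source>

With more than one host listed, QEMU can fetch the volfile from any of the remaining servers, so losing the first node would no longer block the VM from starting.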

Comment 2 SATHEESARAN 2017-08-24 05:48:39 UTC
This issue breaks the high availability of virtual machines when VMs are stopped and started again, and affects Red Hat's hyperconverged product ( RHHI 1.1 ).

Comment 5 Sahina Bose 2018-08-20 08:31:55 UTC
*** Bug 1596600 has been marked as a duplicate of this bug. ***

Comment 6 Sahina Bose 2019-07-10 07:01:36 UTC
No plans to enable libgfapi in RHHI-V for now. Closing this bug.

