Bug 1484660

Summary: [Tracker RHV] Virtual Machines are not highly available with gluster libgfapi access mechanism
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: SATHEESARAN <sasundar>
Component: rhhi
Assignee: Sahina Bose <sabose>
Status: CLOSED DEFERRED
QA Contact: SATHEESARAN <sasundar>
Severity: high
Docs Contact:
Priority: medium
Version: rhhi-1.1
CC: guillaume.pavese, kborup, rhs-bugs
Target Milestone: ---
Keywords: Tracking
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-07-10 07:01:36 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1484227
Bug Blocks:

Description SATHEESARAN 2017-08-24 05:34:47 UTC
Description of problem:
-----------------------
Virtual Machines (VMs) created with the gfapi access mechanism are not highly available.

VMs depend on the first/primary gluster volfile server to obtain the volfiles.
When the first/primary server is down, the VMs fail to start.

Version-Release number of selected component (if applicable):
-------------------------------------------------------------
RHV 4.1.5
glusterfs-3.8.4
vdsm-4.19.28-1
qemu-kvm-rhev-2.9.0-16

How reproducible:
-----------------
Always

Steps to Reproduce:
--------------------
1. Add 3 RHEL 7.4 nodes to a converged 4.1 cluster (with virt & gluster capabilities enabled)
2. Enable libgfapi in the engine (LibgfApiSupported=true)
3. Create a gluster replica 3 volume and use it as a RHV data domain (GlusterFS type)
4. Create VMs and install the OS
5. Move the first node (primary volfile server) to maintenance, stopping gluster services
6. Stop the VMs
7. Start the VMs

Actual results:
---------------
Unable to start VMs

Expected results:
-----------------
VMs should start even when the first node (primary volfile server) is unavailable

Additional info:
----------------
Error message seen in the events when starting a VM while the first node (primary volfile server) is down:
<snip>
VM appvm01 is down with error. Exit message: failed to initialize gluster connection (src=0x7f55080198c0 priv=0x7f550800be30): Transport endpoint is not connected.
</snip>

Comment 1 SATHEESARAN 2017-08-24 05:48:22 UTC
Following is the observation from the XML definition of the VM: there are no additional volfile servers mentioned. So every time the VM starts, the QEMU process gets the volfile from the primary volfile server (in this case, 10.70.36.73), and if it is unavailable, QEMU fails to start the VM stating 'Transport endpoint is not connected'.

We should pass the additional mount options configured on the GlusterFS storage domain, i.e. 'backup-volfile-servers', as fallback hosts, so that QEMU can also query those servers to fetch the volfiles (see the sketch after the disk definition below).


    <disk type='network' device='disk' snapshot='no'>
      <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads'/>
      <source protocol='gluster' name='vmstore/051c9cd5-807c-4131-97e7-db306a7b3142/images/98b106c6-b2b3-4a94-8178-b912242567a1/895a694a-11a7-4327-bc13-55ab08805cb3'>
        <host name='10.70.36.73' port='0'/>
      </source>
      <backingStore/>
      <target dev='sda' bus='scsi'/>
      <serial>98b106c6-b2b3-4a94-8178-b912242567a1</serial>
      <boot order='2'/>
      <alias name='scsi0-0-0-0'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>
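
For comparison, below is a minimal sketch of what the same disk definition could look like if the storage domain's backup-volfile-servers were passed through as additional gluster hosts (libvirt accepts more than one <host> element under <source protocol='gluster'>). The 10.70.36.74 and 10.70.36.75 addresses are only hypothetical stand-ins for the other two cluster nodes:

    <disk type='network' device='disk' snapshot='no'>
      <driver name='qemu' type='raw' cache='none' error_policy='stop' io='threads'/>
      <source protocol='gluster' name='vmstore/051c9cd5-807c-4131-97e7-db306a7b3142/images/98b106c6-b2b3-4a94-8178-b912242567a1/895a694a-11a7-4327-bc13-55ab08805cb3'>
        <!-- primary volfile server, as generated today -->
        <host name='10.70.36.73' port='0'/>
        <!-- hypothetical backup volfile servers: the remaining cluster nodes -->
        <host name='10.70.36.74' port='0'/>
        <host name='10.70.36.75' port='0'/>
      </source>
      <target dev='sda' bus='scsi'/>
      <address type='drive' controller='0' bus='0' target='0' unit='0'/>
    </disk>

With the extra hosts listed, gfapi could retry the volfile fetch against the remaining nodes when the primary is down, matching what the FUSE mount already gets via the backup-volfile-servers option.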

Comment 2 SATHEESARAN 2017-08-24 05:48:39 UTC
This issue breaks the high availability of virtual machines when VMs are stopped and started again, and it affects Red Hat's hyperconverged product (RHHI 1.1).

Comment 5 Sahina Bose 2018-08-20 08:31:55 UTC
*** Bug 1596600 has been marked as a duplicate of this bug. ***

Comment 6 Sahina Bose 2019-07-10 07:01:36 UTC
No plans to enable libgfapi in RHHI-V for now. Closing this bug.