Bug 1663367

Summary: [RHV-RHGS] Deleting 1TB image file, leads to errors in RHV
Product: [Red Hat Storage] Red Hat Gluster Storage
Component: sharding
Version: rhgs-3.4
Hardware: x86_64
OS: Linux
Status: CLOSED WORKSFORME
Severity: high
Priority: high
Reporter: SATHEESARAN <sasundar>
Assignee: Krutika Dhananjay <kdhananj>
QA Contact: SATHEESARAN <sasundar>
CC: bkunal, rcyriac, rhs-bugs, sabose, sankarshan, sasundar, storage-qa-internal
Doc Type: If docs needed, set a value
Type: Bug
Last Closed: 2019-01-21 10:59:09 UTC
Bug Blocks: 1663368

Description SATHEESARAN 2019-01-04 05:16:46 UTC
Description of problem:
-----------------------
When deleting a VM image file of size 1TB, a sequence of issues/errors is seen in RHV Manager: the SPM host goes non-operational and reboots, and sanlock errors are seen. A possible guess is that latency in the gluster storage domain is causing the problem.
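On a sharded volume a 1TB image is stored as many shard files, so a single delete fans out into many unlinks on the bricks, which is one plausible source of the latency guessed at above. A minimal sketch for counting shards on a brick follows; the brick path and the 64 MB shard block size are assumptions, not values taken from this report.

#!/usr/bin/env python3
# Hedged sketch: count shard files on one brick and estimate how many shards a
# fully written 1 TiB image expands into. Brick path and 64 MiB shard block
# size are assumptions, not values from this bug.
import os

BRICK = "/gluster_bricks/data/data"          # hypothetical brick path
shard_dir = os.path.join(BRICK, ".shard")    # the shard xlator stores shards here

print("shard files on this brick:", len(os.listdir(shard_dir)))
print("expected shards for a fully written 1 TiB image at 64 MiB/shard:",
      (1024 ** 4) // (64 * 1024 ** 2))       # = 16384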


Version-Release number of selected component (if applicable):
--------------------------------------------------------------
RHV 4.0
RHGS 3.4.2

How reproducible:
-----------------
Always

Steps to Reproduce:
--------------------
1. Create a gluster storage domain
2. Create a disk of size 1TB (either preallocate the disk, or thin-allocate it and write some data into the disk)
3. Delete the VM disk from the RHV Manager UI (a rough command-line equivalent is sketched below)
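
For reference, a minimal sketch that mimics steps 2 and 3 directly against the gluster FUSE mount and times the delete. The mount path is a hypothetical placeholder; the actual reproduction goes through the RHV Manager UI as described above.

#!/usr/bin/env python3
# Hedged sketch: create a thin-allocated image on the gluster mount, write data
# into it, then time its removal. The mount path is a hypothetical placeholder,
# not taken from this bug.
import os
import time

MOUNT = "/rhev/data-center/mnt/glusterSD/server:_data"   # hypothetical FUSE mount
IMAGE = os.path.join(MOUNT, "test-1tb.img")
SIZE = 1024 ** 4                  # 1 TiB logical size
FILL = 990 * 1024 ** 3            # fill ~990 GiB of the disk with data
CHUNK = b"\0" * (64 * 1024 ** 2)  # write in 64 MiB chunks

with open(IMAGE, "wb") as f:
    f.truncate(SIZE)              # thin-allocate (sparse) the full 1 TiB
    written = 0
    while written < FILL:
        f.write(CHUNK)
        written += len(CHUNK)

start = time.monotonic()
os.unlink(IMAGE)                  # step 3: the delete that triggers the errors
print("unlink took %.1f seconds" % (time.monotonic() - start))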

Actual results:
---------------
On the Hosts tab, the host with the SPM role goes inactive; the Events tab shows that a sanlock error has occurred and that the vdsm heartbeat was exceeded on that host, and the SPM host reboots. VMs running on the SPM host go to an unknown state.

Expected results:
-----------------
No errors and healthy VMs

Comment 1 Krutika Dhananjay 2019-01-04 05:32:12 UTC
Requesting volume-profile for the run where the host went unresponsive.
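
For whoever re-runs the test, the profile can be collected with the gluster volume profile commands around the delete. A minimal sketch, assuming a hypothetical volume name "data":

#!/usr/bin/env python3
# Hedged sketch: wrap the gluster volume profile commands around the delete so
# the per-brick FOP latencies for that run can be attached here. The volume
# name "data" is an assumption.
import subprocess

VOLUME = "data"   # hypothetical volume name

def gluster(*args):
    cmd = ["gluster", "volume", "profile", VOLUME, *args]
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

gluster("start")                  # start collecting per-brick statistics
input("Delete the 1TB disk from the RHV Manager UI now, then press Enter... ")
gluster("info")                   # dump latency and FOP counts for the run
gluster("stop")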

Comment 2 SATHEESARAN 2019-01-09 07:02:38 UTC
(In reply to Krutika Dhananjay from comment #1)
> Requesting volume-profile for the run where the host went unresponsive.

Hi Krutika,

We did a series of tests and here are the observations.

RHV                  RHGS                               Result
4.0.7          3.0 ( 3.8.4-18.6.el7rhgs )      Deleting the image causes problems
4.0.7          3.4.3 ( 3.12.2-35.el7rhgs )     Deleting the image causes problems
4.2.8          3.4.3 ( 3.12.2-35.el7rhgs )     No issues seen while deleting the images

We also tried to test RHGS 3.0 with RHV 4.2.8, but due to dependency problems we were unable to add the RHV 4.0 nodes to RHV Manager 4.2.8.
The tests were carried out with preallocated disk images and also with a 1TB thin-provisioned disk filled with data up to 990GB.

So the results suggest that it is the RHV 4.0 + RHGS combination that has the problem.

@Sahina, what do you think?

Comment 3 Sahina Bose 2019-01-09 08:54:12 UTC
(In reply to SATHEESARAN from comment #2)
> So the results suggest that it is the RHV 4.0 + RHGS combination that has the problem.

Is the issue seen with images created using an older version of RHV + RHGS?
Can these images be deleted successfully after updating to the latest versions of RHV & RHGS?

Comment 4 SATHEESARAN 2019-01-09 11:16:32 UTC
(In reply to Sahina Bose from comment #3)
> Is the issue seen with images created using an older version of RHV + RHGS?
> Can these images be deleted successfully after updating to the latest versions of RHV & RHGS?

The issue was seen with RHV 4.0 even after upgrading to the new gluster version.
But the RHV version wasn't upgraded to RHV 4.2.8. I will complete that part of the testing
and let you know the results.

Comment 5 SATHEESARAN 2019-01-21 10:59:09 UTC
This issue is seen with RHGS 3.0 & RHV 4.0.7.

After updating to the latest RHGS 3.4.2 (glusterfs-3.12.2-32.el7rhgs) and RHV 4.2.7,
this issue is no longer seen.

I have discussed this with Sahina, and I'm closing this bug as the issue is not seen with the latest gluster builds.