Bug 1579719

Summary: Geo-replication fails to kick off the scheduled daily geo-rep session when the same volume is used for two different sessions and one of them is destroyed.
Product: [oVirt] ovirt-engine Reporter: Sahina Bose <sabose>
Component: BLL.Gluster    Assignee: Sahina Bose <sabose>
Status: CLOSED CURRENTRELEASE QA Contact: SATHEESARAN <sasundar>
Severity: high Docs Contact:
Priority: high    
Version: 4.2.3.2    CC: ascerra, avishwan, bugs, rhs-bugs, sasundar
Target Milestone: ovirt-4.2.5    Flags: rule-engine: ovirt-4.2+
rule-engine: exception+
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version:    Doc Type: Bug Fix
Doc Text:
Cause: The lock was not released on a failed geo-replication start.
Consequence: Subsequent geo-replication-based DR syncs fail.
Fix: Ensure that commands are ended and locks are released when geo-replication fails to start.
Result: Multiple DR sync sessions can be set up and continue to run even when one of them fails. (A minimal illustrative sketch of this pattern follows the header fields below.)
Story Points: ---
Clone Of: 1554487 Environment:
Last Closed: 2018-07-31 15:29:17 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: Gluster RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1554487    
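
The Doc Text above describes the fix as ending the command and releasing the lock even when the geo-replication start fails. The following is a minimal Java sketch of that pattern only; it is not the actual ovirt-engine code, and the class, lock object and method names are illustrative assumptions.

    // A minimal sketch of the fix pattern described in the Doc Text.
    // NOT the actual ovirt-engine code; all names here are illustrative only.
    import java.util.concurrent.locks.Lock;
    import java.util.concurrent.locks.ReentrantLock;

    public class GeoRepStartSketch {
        // Stands in for the engine's per-volume lock (assumption).
        private final Lock volumeLock = new ReentrantLock();

        public void execute() {
            volumeLock.lock();
            try {
                startGeoRepSession(); // may throw, e.g. when the slave cluster has been destroyed
            } finally {
                // The fix: end the command and release the lock on failure as well as
                // success, so later scheduled syncs of this volume can still acquire it.
                endCommand();
                volumeLock.unlock();
            }
        }

        private void startGeoRepSession() {
            throw new IllegalStateException("Failed to start geo-replication session");
        }

        private void endCommand() {
            System.out.println("command ended, lock released");
        }
    }

With this shape, a failed start for a destroyed slave still ends with the lock released, so a scheduled sync of the same volume to a surviving slave is not blocked.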

Description Sahina Bose 2018-05-18 07:59:31 UTC
Description of problem:
I have two RHHI pods, each with a Gluster cluster containing three Gluster volumes running on RHV-M, and another Gluster cluster with three Gluster volumes not running on RHV-M.

1)
RHV-M for rhhi pod 1: on dell-per630-01.css.lab.eng.rdu2.redhat.com
  rhhi-engine1.css.lab.eng.rdu2.redhat.com 
Gluster volumes:
  data
  vmstore
  engine  

2)
RHV-M for rhhi pod 2: on dell-per630-05.css.lab.eng.rdu2.redhat.com
  rhhi-engine2.css.lab.eng.rdu2.redhat.com 
Gluster volumes:
  data
  vmstore
  engine

3)
Gluster cluster: on css-storinator-01.css.lab.eng.rdu2.redhat.com
  Gluster volumes:
  data
  vmstore
  engine

Geo-replication was configured using the data volume from dell-per630-05, replicating to the data volume of css-storinator-01.
(This geo-replication session was scheduled to kick off daily at 11:45.)


Another geo-replication session was configured using the same data volume from dell-per630-05, replicating to the data volume of dell-per630-01.
(This geo-replication session was scheduled to kick off daily at 1:35.)

Both geo-replication sessions had been synchronizing and geo-replicating successfully once a day for 10 days straight (2/24 - 3/5).

This was the case until we tore down RHHI pod 1 on dell-per630-01. When this happened, the geo-replication session between dell-per630-05 and dell-per630-01 turned into a Faulty state, which was expected behaviour.

The issue is that once geo-replication had tried and failed to start for the session that had just turned Faulty, the already scheduled and previously working geo-rep session between dell-per630-05 and css-storinator-01 (both currently up and running) stopped geo-replicating. That session did not turn Faulty; it simply stopped running at its scheduled time of 11:45 every day.
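
One way the reported symptom (no error, the scheduled sync simply never runs) could arise is if a per-volume lock taken by the failed start is never released, so the next scheduled sync on the same volume cannot take it and skips its run. The Java sketch below is purely illustrative of that failure mode under those assumptions; it is not the actual ovirt-engine implementation, and the class, method and volume names are made up.

    // Minimal, self-contained sketch of the suspected failure mode.
    // NOT the actual ovirt-engine code; names are hypothetical.
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Semaphore;

    public class GeoRepLockDemo {
        // One single-permit semaphore per master volume, standing in for the
        // engine's per-volume lock that both sync sessions have to take (assumption).
        private static final Map<String, Semaphore> VOLUME_LOCKS = new ConcurrentHashMap<>();

        private static Semaphore lockFor(String volume) {
            return VOLUME_LOCKS.computeIfAbsent(volume, v -> new Semaphore(1));
        }

        // Buggy pattern: the lock is taken, the start fails, and the lock is never released.
        static void startSessionBuggy(String volume) {
            Semaphore lock = lockFor(volume);
            lock.acquireUninterruptibly();
            // The failure escapes before release() is ever reached.
            throw new IllegalStateException("Failed to start geo-replication session");
        }

        // The scheduled daily sync only runs if it can take the volume lock;
        // otherwise it silently skips, which matches what the event log showed.
        static void scheduledSync(String volume, String destination) {
            Semaphore lock = lockFor(volume);
            if (!lock.tryAcquire()) {
                return; // no error, no event -- the session simply never kicks off
            }
            try {
                System.out.println("Syncing " + volume + " -> " + destination);
            } finally {
                lock.release();
            }
        }

        public static void main(String[] args) {
            try {
                startSessionBuggy("data");            // the session to the destroyed pod fails
            } catch (IllegalStateException ignored) { // the lock for 'data' is still held
            }
            scheduledSync("data", "css-storinator-01::data"); // prints nothing
        }
    }

Running main() prints nothing, because the lock left behind by the failed start blocks the later scheduled sync of the same volume.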

For reference, here are the geo-replication sessions in question. This output was taken after the geo-rep session between dell-per630-05 (192.168.50.21) and dell-per630-01 had been destroyed, hence its Faulty state.
 
MASTER NODE      MASTER VOL    MASTER BRICK         SLAVE USER    SLAVE                                                        SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
192.168.50.21    data          /rhgs/brick2/data    root          ssh://dell-per630-01.css.lab.eng.rdu2.redhat.com::data       N/A           Faulty     N/A             N/A                  
192.168.50.22    data          /rhgs/brick2/data    root          ssh://dell-per630-01.css.lab.eng.rdu2.redhat.com::data       N/A           Faulty     N/A             N/A                  
192.168.50.23    data          /rhgs/brick2/data    root          ssh://dell-per630-01.css.lab.eng.rdu2.redhat.com::data       N/A           Faulty     N/A             N/A                  
192.168.50.21    data          /rhgs/brick2/data    root          ssh://css-storinator-01.css.lab.eng.rdu2.redhat.com::data    N/A           Stopped    N/A             N/A                  
192.168.50.22    data          /rhgs/brick2/data    root          ssh://css-storinator-01.css.lab.eng.rdu2.redhat.com::data    N/A           Stopped    N/A             N/A                  
192.168.50.23    data          /rhgs/brick2/data    root          ssh://css-storinator-01.css.lab.eng.rdu2.redhat.com::data    N/A           Stopped    N/A             N/A  

The gluster volumes in the remaining clusters are all started and recognizing their peers.

I have also attached a screenshot of the RHV-M engine event log that shows the geo-replication runs.

We can see that a geo-replication session was successful on March 5th at 11:49. This was the geo-rep session between dell-per630-05 and css-storinator-01.

Then, at 1:35, we see "Failed to start geo-replication session". This was the session from dell-per630-05 to dell-per630-01, which had turned Faulty after the dell-per630-01 RHHI pod was destroyed.

As we move up the event log we see all the events from the next day (March 6th). There should be a geo-replication session that kicks off at 11:45 to geo-replicate from dell-per630-05 to css-storinator-01. This never happens; the log moves on to the next day without geo-rep ever kicking off.

As you can see in the second screenshot, there is still a remote Data Sync Setup scheduled for 11:45, and it never kicks off, even though it had run successfully for 11 days straight before the dell-per630-01 pod was destroyed.


Version-Release number of selected component (if applicable):
rhhi 1.1
gluster 3.3

How reproducible:
always

Steps to Reproduce:
See above

Actual results:
See above

Expected results:
When there are two geo-replication sessions sending data from the same volume to two different locations, a failure of one location should not affect the already existing geo-replication session to the other location.

Additional info:
I have included a third screenshot to show that the two geo-rep sessions had been successful the day before (March 4th); they had been running successfully once a day at those times since 2/24.

Comment 1 SATHEESARAN 2018-06-26 08:43:44 UTC
Unable to verify this bug, as hit with another bug - BZ 1595140 where having multiple geo-rep session with gluster volume failed.

This bug is blocked because of that issue.

On these grounds, moving this bug out to 4.2.5

Comment 2 Sahina Bose 2018-06-26 08:51:53 UTC
(In reply to SATHEESARAN from comment #1)
> Unable to verify this bug, as hit with another bug - BZ 1595140 where having
> multiple geo-rep session with gluster volume failed.
> 
> This bug is blocked because of that issue.
> 
> On these grounds, moving this bug out to 4.2.5

The Bug 1595140 can be resolved if you add all the gluster hosts to setup. So I think you can continue verification on this one?

Comment 3 SATHEESARAN 2018-07-11 10:41:07 UTC
(In reply to Sahina Bose from comment #2)
> (In reply to SATHEESARAN from comment #1)
> > Unable to verify this bug, as hit with another bug - BZ 1595140 where having
> > multiple geo-rep session with gluster volume failed.
> > 
> > This bug is blocked because of that issue.
> > 
> > On these grounds, moving this bug out to 4.2.5
> 
> The Bug 1595140 can be resolved if you add all the gluster hosts to setup.
> So I think you can continue verification on this one?

Thanks for that information. Yes, I can now proceed with this exception in place.

Comment 4 SATHEESARAN 2018-07-20 18:55:39 UTC
Tested with ovirt-4.2.5 and glusterfs-3.8.4-54.15 with the following steps.

1. Complete RHHI deployment and treat it as the primary site, PrimSite1. Select any volume; in this case I chose the 'data' volume.

2. Create 2 geo-rep sessions from this volume to 2 secondary sites, SecSite1 & SecSite2.

3. Create a remote sync for the storage domain.

4. Start geo-rep on one session. While the session is in progress, stop the volume; geo-rep will go Faulty. With this faulty geo-rep session, schedule a remote data sync to SecSite1 & SecSite2.

5. Even when one of the geo-rep sessions is Faulty, the other session worked as expected.

Before the remote sync:

MASTER NODE                                   MASTER VOL    MASTER BRICK                 SLAVE USER    SLAVE                         SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rhsqa-grafton1-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A           Faulty     N/A             N/A                  
rhsqa-grafton2-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A           Faulty     N/A             N/A                  
rhsqa-grafton3-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A           Faulty     N/A             N/A                  
rhsqa-grafton1-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       N/A           Stopped    N/A             N/A                  
rhsqa-grafton2-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       N/A           Stopped    N/A             N/A                  
rhsqa-grafton3-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       N/A           Stopped    N/A             N/A        

Remote sync worked despite the faulty session:
MASTER NODE                                   MASTER VOL    MASTER BRICK                 SLAVE USER    SLAVE                         SLAVE NODE     STATUS     CRAWL STATUS     LAST_SYNCED                  
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rhsqa-grafton1-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A            Faulty     N/A              N/A                          
rhsqa-grafton2-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A            Faulty     N/A              N/A                          
rhsqa-grafton3-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A            Faulty     N/A              N/A                          
rhsqa-grafton1-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       10.70.45.33    Passive    N/A              N/A                          
rhsqa-grafton2-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       10.70.45.32    Passive    N/A              N/A                          
rhsqa-grafton3-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       10.70.45.34    Active     History Crawl    2018-07-20 21:38:10

Comment 5 Sandro Bonazzola 2018-07-31 15:29:17 UTC
This bugzilla is included in the oVirt 4.2.5 release, published on July 30th 2018.

Since the problem described in this bug report should be
resolved in the oVirt 4.2.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.