Bug 1554487 - Geo-Replication failing to kick off geo-rep session daily, when the same volume is used for two different sessions and one gets destroyed.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rhhi
Version: rhhi-1.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHHI-V 1.5
Assignee: Sahina Bose
QA Contact: SATHEESARAN
URL:
Whiteboard:
Depends On: 1579719
Blocks: 1724792 1520836
 
Reported: 2018-03-12 18:40 UTC by Adam Scerra
Modified: 2019-06-28 16:04 UTC
CC List: 3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When a storage domain was configured to synchronize data to two secondary sites, and one of the sessions moved to a faulty state, the second session also failed to synchronize. This occurred because the locks for the faulty session were not released correctly on session failure, so the second session waited for locks to be released and could not synchronize. Locks are now released correctly when geo-replication sessions move to a faulty state, and this issue no longer occurs.
Clone Of:
Clones: 1579719 (view as bug list)
Environment:
Last Closed: 2018-11-08 05:38:31 UTC
Embargoed:


Attachments (Terms of Use)
Three screenshots listed in order as mentioned in bug report (620.89 KB, application/pdf)
2018-03-12 18:40 UTC, Adam Scerra


Links:
Red Hat Product Errata RHEA-2018:3523 (last updated 2018-11-08 05:39:12 UTC)

Description Adam Scerra 2018-03-12 18:40:57 UTC
Created attachment 1407358 [details]
Three screenshots listed in order as mentioned in bug report

Description of problem:
I have two RHHI pods, each with a Gluster cluster containing three Gluster volumes and running on RHV-M, and another Gluster cluster with three Gluster volumes not running on RHV-M:

1)
RHV-M for rhhi pod 1: on dell-per630-01.css.lab.eng.rdu2.redhat.com
  rhhi-engine1.css.lab.eng.rdu2.redhat.com 
Gluster volumes:
  data
  vmstore
  engine  

2)
RHV-M for rhhi pod 2: on dell-per630-05.css.lab.eng.rdu2.redhat.com
  rhhi-engine2.css.lab.eng.rdu2.redhat.com 
Gluster volumes:
  data
  vmstore
  engine

3)
Gluster cluster: on css-storinator-01.css.lab.eng.rdu2.redhat.com
  Gluster volumes:
  data
  vmstore
  engine

Geo-replication was configured using the data volume on dell-per630-05 to replicate to the data volume on css-storinator-01.
(This geo-replication session was kicked off daily at 11:45.)


Another geo-replication session was configured using the same data volume on dell-per630-05 to the data volume on dell-per630-01.
(This geo-replication session was kicked off daily at 1:35.)
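
For reference, two sessions from the same master volume to two different slave volumes are normally created with the standard gluster geo-replication CLI (or through the RHV-M UI in RHHI). A minimal sketch, assuming the usual root slave user and push-pem setup; the hostnames are the ones from this report:

# on a master node of the dell-per630-05 cluster
gluster volume geo-replication data css-storinator-01.css.lab.eng.rdu2.redhat.com::data create push-pem
gluster volume geo-replication data css-storinator-01.css.lab.eng.rdu2.redhat.com::data start

gluster volume geo-replication data dell-per630-01.css.lab.eng.rdu2.redhat.com::data create push-pem
gluster volume geo-replication data dell-per630-01.css.lab.eng.rdu2.redhat.com::data start

The daily 11:45 and 1:35 runs themselves were scheduled through the RHV-M remote Data Sync Setup, not through the CLI.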

These geo-replication sessions had been synchronizing successfully once a day for 10 days straight (2/24 - 3/5).

This was the case until we tore down our RHHI pod 1 on dell-per630-01. When this happened, the geo-replication session between dell-per630-05 and dell-per630-01 turned Faulty, which was expected behaviour.

The issue here is that once geo-replication had tried and failed for the session that had just turned Faulty, the already scheduled and previously working geo-rep session between dell-per630-05 and css-storinator-01 (both currently up and running) also stopped geo-replicating. That geo-rep session did not turn Faulty; it simply stopped running at 11:45 every day as it had been scheduled to do.

For reference, here are the geo-replication sessions in question. This output was captured after the geo-rep session between dell-per630-05 (192.168.50.21) and dell-per630-01 had been destroyed, hence its Faulty state.
 
MASTER NODE      MASTER VOL    MASTER BRICK         SLAVE USER    SLAVE                                                        SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
192.168.50.21    data          /rhgs/brick2/data    root          ssh://dell-per630-01.css.lab.eng.rdu2.redhat.com::data       N/A           Faulty     N/A             N/A                  
192.168.50.22    data          /rhgs/brick2/data    root          ssh://dell-per630-01.css.lab.eng.rdu2.redhat.com::data       N/A           Faulty     N/A             N/A                  
192.168.50.23    data          /rhgs/brick2/data    root          ssh://dell-per630-01.css.lab.eng.rdu2.redhat.com::data       N/A           Faulty     N/A             N/A                  
192.168.50.21    data          /rhgs/brick2/data    root          ssh://css-storinator-01.css.lab.eng.rdu2.redhat.com::data    N/A           Stopped    N/A             N/A                  
192.168.50.22    data          /rhgs/brick2/data    root          ssh://css-storinator-01.css.lab.eng.rdu2.redhat.com::data    N/A           Stopped    N/A             N/A                  
192.168.50.23    data          /rhgs/brick2/data    root          ssh://css-storinator-01.css.lab.eng.rdu2.redhat.com::data    N/A           Stopped    N/A             N/A  
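
For reference, the table above looks like the output of the standard status query, run on a master node of the dell-per630-05 cluster:

# list all geo-replication sessions for the 'data' master volume
gluster volume geo-replication data status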

The gluster volumes in the remaining clusters are all started and recognizing their peers.

I have also attached a screenshot of the RHV-M engine showing the event log of the geo-replication runs.

We can see that a geo-replication session was successful on March 5th at 11:49. This was the geo-rep session between dell-per630-05 and css-storinator-01.

Then, at 1:35, we see "Failed to start geo-replication session". This was the session from dell-per630-05 to dell-per630-01, which turned Faulty after the dell-per630-01 RHHI pod was destroyed.

As we move up the event log we see all the events from the next day (March 6th). A geo-replication session should kick off at 11:45 to replicate from dell-per630-05 to css-storinator-01. This never happens; the log moves on to the next day without geo-replication ever kicking off.

As you can see in the second screenshot, there is still a remote Data Sync Setup scheduled for 11:45 that never kicks off, even though it had run successfully for 11 days straight before the dell-per630-01 pod was destroyed.


Version-Release number of selected component (if applicable):
rhhi 1.1
gluster 3.3

How reproducible:
always

Steps to Reproduce:
See above

Actual results:
See above

Expected results:
When there are two geo-replication sessions sending data out to two different locations from the same volume, a failure of one of the locations should not affect an already existing geo-replication session to a different location.

Additional info:
I have included a third screenshot to show that the two geo-rep sessions had been successful the day before (March 4th); they had been running successfully once a day at those times since 2/24.

Comment 2 Sahina Bose 2018-03-14 09:59:05 UTC
This looks like a geo-replication issue. Can you take a look?

Comment 3 Aravinda VK 2018-03-14 10:17:54 UTC
Please share the log files from the Master nodes (/var/log/glusterfs/geo-replication).
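
One way to gather these from each master node (a sketch; the archive name is only an example):

# run on each master node; bundles the requested geo-replication logs into one archive
tar czf /tmp/$(hostname)-geo-replication-logs.tar.gz /var/log/glusterfs/geo-replication/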

Comment 4 Sahina Bose 2018-03-14 10:39:12 UTC
(In reply to Aravinda VK from comment #3)
> Please share the log files from the Master
> nodes(/var/log/glusterfs/geo-replication)

Looking again at the bug, this seems more like a scheduling issue rather than geo-replication. I'll take a look, and pass it on if I find issues.

Comment 9 SATHEESARAN 2018-07-20 18:56:19 UTC
Tested with ovirt-4.2.5 and glusterfs-3.8.4-54.15 with the following steps.

1. Complete the RHHI deployment and treat it as the primary site, PrimSite1. Select any volume; in this case I chose the 'data' volume.

2. Create two geo-rep sessions from this volume to two secondary sites, SecSite1 and SecSite2.

3. Create a remote sync for the storage domain.

4. Start geo-rep on one session. While the session is in progress, stop the volume so that geo-rep goes Faulty. With this Faulty geo-rep session in place, schedule a remote data sync to SecSite1 and SecSite2 (see the command sketch after these steps).

5. Even with one of the geo-rep sessions Faulty, the other session worked as expected.
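
The gist of step 4 at the CLI, as a sketch; it assumes the Faulty state was forced by stopping the slave volume of the first session (volume and host names as in the status output below):

# on the SecSite1 slave cluster: stopping the slave volume drives the first session Faulty
gluster volume stop sasvol1

# on a PrimSite1 master node: one session should now be Faulty while the other is unaffected
gluster volume geo-replication data status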

Before the remote sync:

MASTER NODE                                   MASTER VOL    MASTER BRICK                 SLAVE USER    SLAVE                         SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED          
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rhsqa-grafton1-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A           Faulty     N/A             N/A                  
rhsqa-grafton2-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A           Faulty     N/A             N/A                  
rhsqa-grafton3-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A           Faulty     N/A             N/A                  
rhsqa-grafton1-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       N/A           Stopped    N/A             N/A                  
rhsqa-grafton2-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       N/A           Stopped    N/A             N/A                  
rhsqa-grafton3-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       N/A           Stopped    N/A             N/A        

Remote sync worked despite the Faulty session:
MASTER NODE                                   MASTER VOL    MASTER BRICK                 SLAVE USER    SLAVE                         SLAVE NODE     STATUS     CRAWL STATUS     LAST_SYNCED                  
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rhsqa-grafton1-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A            Faulty     N/A              N/A                          
rhsqa-grafton2-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A            Faulty     N/A              N/A                          
rhsqa-grafton3-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A            Faulty     N/A              N/A                          
rhsqa-grafton1-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       10.70.45.33    Passive    N/A              N/A                          
rhsqa-grafton2-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       10.70.45.32    Passive    N/A              N/A                          
rhsqa-grafton3-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       10.70.45.34    Active     History Crawl    2018-07-20 21:38:10

Comment 13 errata-xmlrpc 2018-11-08 05:38:31 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2018:3523

