Created attachment 1407358 [details]
Three screenshots listed in order as mentioned in the bug report

Description of problem:

I have two RHHI pods, each with a gluster cluster containing three gluster volumes running on RHV-M, and another gluster cluster with three gluster volumes not running on RHV-M:

1) RHV-M for RHHI pod 1: on dell-per630-01.css.lab.eng.rdu2.redhat.com
   rhhi-engine1.css.lab.eng.rdu2.redhat.com
   Gluster volumes: data, vmstore, engine

2) RHV-M for RHHI pod 2: on dell-per630-05.css.lab.eng.rdu2.redhat.com
   rhhi-engine2.css.lab.eng.rdu2.redhat.com
   Gluster volumes: data, vmstore, engine

3) Gluster cluster: on css-storinator-01.css.lab.eng.rdu2.redhat.com
   Gluster volumes: data, vmstore, engine

Geo-replication was configured from the data volume on dell-per630-05 to the data volume on css-storinator-01 (this geo-replication session was kicked off daily at 11:45). Another geo-replication session was configured from the same data volume on dell-per630-05 to the data volume on dell-per630-01 (this session was kicked off daily at 1:35).

Both geo-replication sessions had been synchronizing successfully once a day for 10 days straight (2/24 - 3/5), until we tore down RHHI pod 1 on dell-per630-01. At that point the geo-replication session between dell-per630-05 and dell-per630-01 went into a faulty state, which was expected behaviour.

The issue is that once geo-replication had tried and failed for the session that had just turned faulty, the already scheduled and previously working geo-rep session between dell-per630-05 and css-storinator-01 (both still up and running) stopped geo-replicating. That session did not turn faulty; it simply stopped running at 11:45 every day as it had been scheduled to do.

For reference, here are the geo-replication sessions in question. This is after the fact, when the geo-rep session between dell-per630-05 (192.168.50.21) and dell-per630-01 had been destroyed, hence its state being Faulty.

MASTER NODE      MASTER VOL    MASTER BRICK         SLAVE USER    SLAVE                                                        SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
192.168.50.21    data          /rhgs/brick2/data    root          ssh://dell-per630-01.css.lab.eng.rdu2.redhat.com::data       N/A           Faulty     N/A             N/A
192.168.50.22    data          /rhgs/brick2/data    root          ssh://dell-per630-01.css.lab.eng.rdu2.redhat.com::data       N/A           Faulty     N/A             N/A
192.168.50.23    data          /rhgs/brick2/data    root          ssh://dell-per630-01.css.lab.eng.rdu2.redhat.com::data       N/A           Faulty     N/A             N/A
192.168.50.21    data          /rhgs/brick2/data    root          ssh://css-storinator-01.css.lab.eng.rdu2.redhat.com::data    N/A           Stopped    N/A             N/A
192.168.50.22    data          /rhgs/brick2/data    root          ssh://css-storinator-01.css.lab.eng.rdu2.redhat.com::data    N/A           Stopped    N/A             N/A
192.168.50.23    data          /rhgs/brick2/data    root          ssh://css-storinator-01.css.lab.eng.rdu2.redhat.com::data    N/A           Stopped    N/A             N/A

The gluster volumes in the remaining clusters are all started and recognize their peers.

I have also attached a screenshot of the RHV-M engine event log showing the geo-replication runs. We can see that a geo-replication session succeeded on March 5th at 11:49; this was the geo-rep session between dell-per630-05 and css-storinator-01. Then at 1:35 we see "Failed to start geo-replication session". This was the session from dell-per630-05 to dell-per630-01, which had turned faulty after the dell-per630-01 RHHI pod was destroyed.

Moving up the event log to the next day (March 6th), there should be a geo-replication session that kicks off at 11:45 to geo-rep dell-per630-05 to css-storinator-01. This never happens; the log moves on to the next day without geo-rep ever kicking off. As you can see in the second screenshot, a remote Data Sync Setup is still scheduled for 11:45 and never kicks off, even though it had run successfully for 11 days straight before the dell-per630-01 pod was destroyed.

Version-Release number of selected component (if applicable):
RHHI 1.1
gluster 3.3

How reproducible:
Always

Steps to Reproduce:
See above

Actual results:
See above

Expected results:
When there are two geo-replication sessions sending data from the same volume to two different locations, a failure of one location should not affect the already existing geo-replication session to the other location.

Additional info:
I have included a third screenshot to show that the two geo-rep sessions had been successful the day before (March 4th); they had been running once a day at those times successfully since 2/24.
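For reference, two sessions from the same master volume like the ones above can be created and checked with the standard gluster geo-replication CLI on a master node of the dell-per630-05 cluster. This is only a sketch; the sessions in this setup were actually created and scheduled through RHV-M, and the create options shown (push-pem) are an assumption:

  # Session 1: master 'data' volume -> css-storinator-01 'data' volume
  gluster volume geo-replication data css-storinator-01.css.lab.eng.rdu2.redhat.com::data create push-pem
  gluster volume geo-replication data css-storinator-01.css.lab.eng.rdu2.redhat.com::data start

  # Session 2: same master 'data' volume -> dell-per630-01 'data' volume
  gluster volume geo-replication data dell-per630-01.css.lab.eng.rdu2.redhat.com::data create push-pem
  gluster volume geo-replication data dell-per630-01.css.lab.eng.rdu2.redhat.com::data start

  # Check the state of every session for the master 'data' volume
  gluster volume geo-replication data status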
This looks like a geo-replication issue. Can you take a look?
Please share the log files from the Master nodes (/var/log/glusterfs/geo-replication).
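For example, one quick way to collect them, run on each master node (the archive name below is arbitrary):

  tar czf /tmp/geo-rep-logs-$(hostname -s).tar.gz /var/log/glusterfs/geo-replication/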
(In reply to Aravinda VK from comment #3)
> Please share the log files from the Master nodes (/var/log/glusterfs/geo-replication)

Looking again at the bug, this seems more like a scheduling issue than a geo-replication one. I'll take a look, and pass it on if I find issues.
Tested with ovirt-4.2.5 and glusterfs-3.8.4-54.15 with the following steps:

1. Complete RHHI deployment. Treat it as the primary site PrimSite1. Select any volume; in this case I chose the 'data' volume.
2. Create 2 geo-rep sessions from this volume to 2 secondary sites, SecSite1 & SecSite2.
3. Create a remote sync for the storage domain.
4. Start geo-rep on one session. While the session is in progress, stop the volume; geo-rep will go Faulty. With this faulty geo-rep session, schedule a remote data sync to SecSite1 & SecSite2.
5. Even with one of the geo-rep sessions faulty, the other session worked as expected.

Before the remote sync:

MASTER NODE                                   MASTER VOL    MASTER BRICK                 SLAVE USER    SLAVE                         SLAVE NODE    STATUS     CRAWL STATUS     LAST_SYNCED
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rhsqa-grafton1-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A           Faulty     N/A              N/A
rhsqa-grafton2-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A           Faulty     N/A              N/A
rhsqa-grafton3-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A           Faulty     N/A              N/A
rhsqa-grafton1-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       N/A           Stopped    N/A              N/A
rhsqa-grafton2-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       N/A           Stopped    N/A              N/A
rhsqa-grafton3-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       N/A           Stopped    N/A              N/A

Remote sync worked with the faulty session:

MASTER NODE                                   MASTER VOL    MASTER BRICK                 SLAVE USER    SLAVE                         SLAVE NODE     STATUS     CRAWL STATUS     LAST_SYNCED
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rhsqa-grafton1-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A            Faulty     N/A              N/A
rhsqa-grafton2-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A            Faulty     N/A              N/A
rhsqa-grafton3-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A            Faulty     N/A              N/A
rhsqa-grafton1-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       10.70.45.33    Passive    N/A              N/A
rhsqa-grafton2-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       10.70.45.32    Passive    N/A              N/A
rhsqa-grafton3-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       10.70.45.34    Active     History Crawl    2018-07-20 21:38:10
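For reference, the session states in the tables above can be queried on a master node with the standard geo-replication status commands (the volume and slave names are the ones used in this test; this is a sketch of the check, not a transcript of the test run):

  # All sessions for the master 'data' volume
  gluster volume geo-replication data status

  # A single session with per-brick detail, e.g. the healthy session to 10.70.45.32
  gluster volume geo-replication data 10.70.45.32::data status detail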
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHEA-2018:3523