Description of problem:

I have two RHHI pods, each with a gluster cluster containing three gluster volumes and running on RHV-M, and another gluster cluster with three gluster volumes not running on RHV-M.

1) RHV-M for RHHI pod 1: on dell-per630-01.css.lab.eng.rdu2.redhat.com
   rhhi-engine1.css.lab.eng.rdu2.redhat.com
   Gluster volumes: data, vmstore, engine

2) RHV-M for RHHI pod 2: on dell-per630-05.css.lab.eng.rdu2.redhat.com
   rhhi-engine2.css.lab.eng.rdu2.redhat.com
   Gluster volumes: data, vmstore, engine

3) Gluster cluster: on css-storinator-01.css.lab.eng.rdu2.redhat.com
   Gluster volumes: data, vmstore, engine

Geo-replication was configured from the data volume on dell-per630-05 to the data volume on css-storinator-01 (this geo-replication session was scheduled to kick off daily at 11:45). Another geo-replication session was configured from the same data volume on dell-per630-05 to the data volume on dell-per630-01 (this session was scheduled to kick off daily at 1:35).

These geo-replication sessions had synchronized and geo-replicated successfully once a day for 10 days straight (2/24 - 3/5), until we tore down RHHI pod 1 on dell-per630-01. When this happened, the geo-replication session between dell-per630-05 and dell-per630-01 turned Faulty, which was expected behaviour.

The issue is that once geo-replication had tried and failed for the session that had just turned Faulty, the already scheduled and previously working geo-rep session between dell-per630-05 and css-storinator-01 (both currently up and running) stopped geo-replicating. That session did not turn Faulty; it simply stopped running at 11:45 every day as it had been scheduled to do.

For reference, here are the geo-replication sessions in question. This is after the fact, when the geo-rep session between dell-per630-05 (192.168.50.21) and dell-per630-01 had been destroyed, hence its Faulty state:

MASTER NODE      MASTER VOL    MASTER BRICK         SLAVE USER    SLAVE                                                        SLAVE NODE    STATUS     CRAWL STATUS    LAST_SYNCED
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
192.168.50.21    data          /rhgs/brick2/data    root          ssh://dell-per630-01.css.lab.eng.rdu2.redhat.com::data       N/A           Faulty     N/A             N/A
192.168.50.22    data          /rhgs/brick2/data    root          ssh://dell-per630-01.css.lab.eng.rdu2.redhat.com::data       N/A           Faulty     N/A             N/A
192.168.50.23    data          /rhgs/brick2/data    root          ssh://dell-per630-01.css.lab.eng.rdu2.redhat.com::data       N/A           Faulty     N/A             N/A
192.168.50.21    data          /rhgs/brick2/data    root          ssh://css-storinator-01.css.lab.eng.rdu2.redhat.com::data    N/A           Stopped    N/A             N/A
192.168.50.22    data          /rhgs/brick2/data    root          ssh://css-storinator-01.css.lab.eng.rdu2.redhat.com::data    N/A           Stopped    N/A             N/A
192.168.50.23    data          /rhgs/brick2/data    root          ssh://css-storinator-01.css.lab.eng.rdu2.redhat.com::data    N/A           Stopped    N/A             N/A

The gluster volumes in the remaining clusters are all started and recognize their peers.

I have also attached a screenshot of the RHV-M engine event log showing the geo-replication runs. We can see that a geo-replication session completed successfully on March 5th at 11:49; this was the geo-rep session between dell-per630-05 and css-storinator-01. Then at 1:35 we see "Failed to start geo-replication session". This was the session from dell-per630-05 to dell-per630-01, which turned Faulty after the dell-per630-01 RHHI pod was destroyed.

Moving up the event log to the next day (March 6th), a geo-replication session from dell-per630-05 to css-storinator-01 should kick off at 11:45. This never happens; the log moves on to the following day without geo-rep ever kicking off. As the second screenshot shows, a remote Data Sync Setup is still scheduled for 11:45 and never kicks off, even though it had run successfully for 11 days straight before the dell-per630-01 pod was destroyed.

Version-Release number of selected component (if applicable):
rhhi 1.1
gluster 3.3

How reproducible:
Always

Steps to Reproduce:
See above.

Actual results:
See above.

Expected results:
When there are two geo-replication sessions sending data out to two different locations from the same volume, and one of the locations fails, the failure should not affect an already existing geo-replication session to a different location.

Additional info:
I have included a third screenshot to show that the two geo-rep sessions had been successful the day before (March 4th); they had been running once a day at those times successfully since 2/24.
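For reference, sessions like the two described above can also be created and inspected directly from the gluster CLI on the master cluster. The following is only a minimal sketch, assuming passwordless root SSH from the dell-per630-05 nodes to both slave clusters is already in place; in RHHI these steps are normally driven from the RHV-M UI, so the exact invocations below are illustrative rather than what the engine actually runs:

# Run once on a master node to generate the common pem keys used by geo-rep.
gluster system:: execute gsec_create

# Session 1: master 'data' volume -> 'data' volume on css-storinator-01
gluster volume geo-replication data \
    css-storinator-01.css.lab.eng.rdu2.redhat.com::data create push-pem

# Session 2: the same master 'data' volume -> 'data' volume on dell-per630-01
gluster volume geo-replication data \
    dell-per630-01.css.lab.eng.rdu2.redhat.com::data create push-pem

# The scheduled remote data sync in RHV-M starts and stops each session on
# its own; a manual run and status check of one session would look like:
gluster volume geo-replication data \
    css-storinator-01.css.lab.eng.rdu2.redhat.com::data start
gluster volume geo-replication data \
    css-storinator-01.css.lab.eng.rdu2.redhat.com::data status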
Unable to verify this bug, as I hit another bug - BZ 1595140 - where creating multiple geo-rep sessions from the same gluster volume failed.

This bug is blocked because of that issue.

On these grounds, moving this bug out to 4.2.5.
(In reply to SATHEESARAN from comment #1)
> Unable to verify this bug, as I hit another bug - BZ 1595140 - where
> creating multiple geo-rep sessions from the same gluster volume failed.
>
> This bug is blocked because of that issue.
>
> On these grounds, moving this bug out to 4.2.5.

Bug 1595140 can be resolved if you add all the gluster hosts to the setup. So I think you can continue verification of this one?
(In reply to Sahina Bose from comment #2)
> (In reply to SATHEESARAN from comment #1)
> > Unable to verify this bug, as I hit another bug - BZ 1595140 - where
> > creating multiple geo-rep sessions from the same gluster volume failed.
> >
> > This bug is blocked because of that issue.
> >
> > On these grounds, moving this bug out to 4.2.5.
>
> Bug 1595140 can be resolved if you add all the gluster hosts to the setup.
> So I think you can continue verification of this one?

Thanks for that information. Yes, I can now proceed with this exception in place.
Tested with ovirt-4.2.5 and glusterfs-3.8.4-54.15 with the following steps:

1. Complete the RHHI deployment and treat it as the primary site, PrimSite1. Select any volume; in this case I chose the 'data' volume.
2. Create 2 geo-rep sessions from this volume to 2 secondary sites, SecSite1 & SecSite2.
3. Create a remote sync for the storage domain.
4. Start geo-rep on one session. While the session is in progress, stop the volume; geo-rep will go Faulty. With this Faulty geo-rep session in place, schedule a remote data sync to SecSite1 & SecSite2 (see the CLI sketch after the status output below).
5. Even with one of the geo-rep sessions Faulty, the other session worked as expected.

Before the remote sync:

MASTER NODE                                   MASTER VOL    MASTER BRICK                 SLAVE USER    SLAVE                         SLAVE NODE     STATUS     CRAWL STATUS     LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rhsqa-grafton1-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A            Faulty     N/A              N/A
rhsqa-grafton2-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A            Faulty     N/A              N/A
rhsqa-grafton3-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A            Faulty     N/A              N/A
rhsqa-grafton1-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       N/A            Stopped    N/A              N/A
rhsqa-grafton2-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       N/A            Stopped    N/A              N/A
rhsqa-grafton3-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       N/A            Stopped    N/A              N/A

Remote sync worked with the Faulty session in place:

MASTER NODE                                   MASTER VOL    MASTER BRICK                 SLAVE USER    SLAVE                         SLAVE NODE     STATUS     CRAWL STATUS     LAST_SYNCED
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
rhsqa-grafton1-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A            Faulty     N/A              N/A
rhsqa-grafton2-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A            Faulty     N/A              N/A
rhsqa-grafton3-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.29::sasvol1    N/A            Faulty     N/A              N/A
rhsqa-grafton1-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       10.70.45.33    Passive    N/A              N/A
rhsqa-grafton2-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       10.70.45.32    Passive    N/A              N/A
rhsqa-grafton3-nic2.lab.eng.blr.redhat.com    data          /gluster_bricks/data/data    root          ssh://10.70.45.32::data       10.70.45.34    Active     History Crawl    2018-07-20 21:38:10
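For anyone re-running this verification from the gluster CLI rather than the RHV-M UI, here is a rough sketch of step 4. It uses the names shown in the tables above (master volume 'data', slave volumes 10.70.45.29::sasvol1 and 10.70.45.32::data) and assumes that "stop the volume" in step 4 means stopping the slave volume of the session under test, which is one way to force that session Faulty:

# On the master cluster: start one of the two sessions.
gluster volume geo-replication data 10.70.45.29::sasvol1 start

# On the SecSite1 slave cluster, while the sync is in progress
# (answer 'y' at the confirmation prompt):
gluster volume stop sasvol1

# Back on the master cluster: the first session should now be Faulty,
# while the second session can still be scheduled and run by RHV-M.
gluster volume geo-replication data 10.70.45.29::sasvol1 status
gluster volume geo-replication data 10.70.45.32::data status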
This bugzilla is included in the oVirt 4.2.5 release, published on July 30th 2018.

Since the problem described in this bug report should be resolved in the oVirt 4.2.5 release, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.