Bug 1142960 - dist-geo-rep: geo-rep status in one of rebooted node remains at "Stable(paused)" after session is resumed.
Summary: dist-geo-rep: geo-rep status in one of rebooted node remains at "Stable(paused)" after session is resumed.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: geo-replication
Version: rhgs-3.0
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: RHGS 3.0.3
Assignee: Kotresh HR
QA Contact: M S Vishwanath Bhat
URL:
Whiteboard:
Depends On:
Blocks: 1149982 1159195 1162694
 
Reported: 2014-09-17 16:22 UTC by M S Vishwanath Bhat
Modified: 2016-06-01 01:56 UTC
CC List: 13 users

Fixed In Version: glusterfs-3.6.0.31-1.el6rhs
Doc Type: Bug Fix
Doc Text:
Previously, when geo-replication was paused and a node was rebooted, the geo-replication status on that node remained in the "Stable(paused)" state even after the session was resumed, and a further geo-replication pause displayed a "Geo-rep already paused" message. With this fix, the status file no longer goes out of sync with the actual state of the geo-replication processes, and the geo-replication status on a rebooted node is updated correctly after the session is resumed.
Clone Of:
Clones: 1149982
Environment:
Last Closed: 2015-01-15 13:39:59 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2015:0038 0 normal SHIPPED_LIVE Red Hat Storage 3.0 enhancement and bug fix update #3 2015-01-15 18:35:28 UTC

Description M S Vishwanath Bhat 2014-09-17 16:22:59 UTC
Description of problem:
When you pause the geo-rep session, reboot one of the passive nodes, and resume the session after the node comes back online, the status of that node is stuck at "Stable(paused)" even after a long time. All the other machines have moved on to the Active/Passive state, except for the node that was rebooted.

Version-Release number of selected component (if applicable):
glusterfs-3.6.0.28-1.el6rhs.x86_64

How reproducible:
Hit only once. Haven't retried, but it seems like an easily reproducible issue.

Steps to Reproduce:
1. Create and start a geo-rep session between a 2*2 dist-rep master and a 2*2 dist-rep slave.
2. Now create some data and let it sync to slave.
3. Pause the session using the geo-rep pause command and check the status. It should be "Stable(paused)".
4. Reboot one of the passive nodes and wait for it to come back online.
5. Check the status. All nodes but the rebooted one should be in the "Stable(paused)" state, and the rebooted node should be in the "faulty(paused)" state.
6. Resume the session using the geo-rep resume command.
7. Check the status again (a command-level sketch of these steps follows this list).
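
For reference, a command-level sketch of the steps above. The volume names "master" and "slave" and the slave host "nirvana" are taken from this setup; the push-pem option and exact invocations are assumptions, not the exact commands used:

# 1. Create and start the geo-rep session (master and slave are assumed
#    to already exist as 2*2 dist-rep volumes).
gluster volume geo-replication master nirvana::slave create push-pem
gluster volume geo-replication master nirvana::slave start

# 3. Pause the session and confirm every node reports "Stable(paused)".
gluster volume geo-replication master nirvana::slave pause
gluster volume geo-replication master nirvana::slave status

# 4. Reboot one of the passive nodes (run on that node) and wait for it
#    to come back online.
reboot

# 6/7. Resume the session and check the status again.
gluster volume geo-replication master nirvana::slave resume
gluster volume geo-replication master nirvana::slave status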

Actual results:
The status of the rebooted node gets stuck at "Stable(paused)" while the other nodes' states go back to Active/Passive:

MASTER NODE                 MASTER VOL    MASTER BRICK      SLAVE               STATUS            CHECKPOINT STATUS    CRAWL STATUS           
---------------------------------------------------------------------------------------------------------------------------------------
ccr.blr.redhat.com          master        /bricks/brick0    nirvana::slave      Active            N/A                  Changelog Crawl        
metallica.blr.redhat.com    master        /bricks/brick1    acdc::slave         Passive           N/A                  N/A                    
beatles.blr.redhat.com      master        /bricks/brick3    rammstein::slave    Stable(Paused)    N/A                  N/A                    
pinkfloyd.blr.redhat.com    master        /bricks/brick2    led::slave          Active            N/A                  Changelog Crawl        




Expected results:
All nodes should have their status properly updated after the resume.


Additional info:
All the data got synced because the Active node was not paused (status was showing Active). Not sure what happens when an Active node gets rebooted.
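
(For completeness, a sketch of how the sync can be spot-checked. The client mount points and the use of a plain recursive diff are assumptions, not the exact verification used:)

# Mount the master and slave volumes on a client.
mount -t glusterfs ccr.blr.redhat.com:/master /mnt/master
mount -t glusterfs nirvana:/slave /mnt/slave

# Compare the trees; no output means the data sets match.
diff -r /mnt/master /mnt/slave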

Comment 2 M S Vishwanath Bhat 2014-09-19 15:03:32 UTC
I hit the same issue, with one node in the paused state and the other nodes in the Active/Passive state, even without a reboot. It happens intermittently, not every time.

Resume fails with the following error, saying that geo-rep is not paused on the following machines.

[root@pinkfloyd ~]# gluster v geo master acdc::slave resume
Staging failed on 10.70.43.127. Error: Geo-replication session between master and acdc::slave is not Paused.
Staging failed on beatles. Error: Geo-replication session between master and acdc::slave is not Paused.
Staging failed on metallica. Error: Geo-replication session between master and acdc::slave is not Paused.
geo-replication command failed


Then pause fails with the following message.

[root@pinkfloyd ~]# gluster v geo master acdc::slave pause
Geo-replication session between master and acdc::slave already Paused.
geo-replication command failed


This does not happen every time without a reboot. The only workaround I found was to stop and then restart geo-replication (see the sketch below).
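
As a sketch, the workaround amounts to the following (session names as in this report):

# Stop and restart the geo-rep session to clear the stale paused state.
# If stop fails because a node is unreachable, 'stop force' may be
# needed (assumption).
gluster volume geo-replication master acdc::slave stop
gluster volume geo-replication master acdc::slave start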

Comment 3 Kotresh HR 2014-10-16 05:24:55 UTC
Upstream Patch (Status: Merged):
http://review.gluster.org/#/c/8911/

Downstream Patch:
https://code.engineering.redhat.com/gerrit/#/c/34691/

Comment 7 shilpa 2014-11-21 12:01:59 UTC
Moving the bug to verified. Reference: c#5 and c#6

Comment 8 Shalaka 2015-01-09 09:33:54 UTC
Please review and sign off on the edited doc text.

Comment 9 Kotresh HR 2015-01-12 09:11:05 UTC
Doc text looks fine to me.

Comment 11 errata-xmlrpc 2015-01-15 13:39:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2015-0038.html

