Bug 1149982 - dist-geo-rep: geo-rep status on a rebooted node remains at "Stable(paused)" after the session is resumed.
Summary: dist-geo-rep: geo-rep status on a rebooted node remains at "Stable(paused)" after the session is resumed.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: geo-replication
Version: mainline
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Assignee: Kotresh HR
QA Contact:
URL:
Whiteboard:
Depends On: 1142960
Blocks: 1159195
 
Reported: 2014-10-07 06:13 UTC by Kotresh HR
Modified: 2015-05-14 17:35 UTC
CC: 11 users

Fixed In Version: glusterfs-3.7.0beta1
Clone Of: 1142960
: 1159195 (view as bug list)
Environment:
Last Closed: 2015-05-14 17:26:15 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Kotresh HR 2014-10-07 06:13:33 UTC
+++ This bug was initially created as a clone of Bug #1142960 +++

Description of problem:
When you pause the geo-rep session, reboot one of the passive nodes, and resume the session after the node comes back online, the status of that node is stuck at "Stable(paused)" even after a long time. All other machines have moved on to the Active/Passive state except for the node that was rebooted.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Hit only once. Haven't tried again, but it seems like an easily reproducible issue.

Steps to Reproduce:
1. Create and start a geo-rep session between a 2*2 dist-rep master and a 2*2 dist-rep slave.
2. Now create some data and let it sync to slave.
3. Pause the session using the geo-rep pause command and check the status. It should be "Stable(paused)". (Example commands are shown after this list.)
4. Now reboot one of the Passive nodes and wait for the node to come back online.
5. Check the status. All but the rebooted node should be in the "Stable(paused)" state, and the rebooted node should be in the "faulty(paused)" state.
6. Now resume the session using the geo-rep resume command.
7. Check the status again.
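
For reference, the pause, resume, and status operations in the steps above correspond to the following commands (the volume name "master" and the slave "acdc::slave" are taken from the output quoted later in this report; substitute your own names):

# gluster volume geo-replication master acdc::slave pause
# gluster volume geo-replication master acdc::slave status
# gluster volume geo-replication master acdc::slave resume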

Actual results:
Now the status of the rebooted node is stuck at "Stable(paused)", while the other nodes' states go back to "Active/Passive".

MASTER NODE                 MASTER VOL    MASTER BRICK      SLAVE               STATUS            CHECKPOINT STATUS    CRAWL STATUS           
---------------------------------------------------------------------------------------------------------------------------------------
ccr.blr.redhat.com          master        /bricks/brick0    nirvana::slave      Active            N/A                  Changelog Crawl        
metallica.blr.redhat.com    master        /bricks/brick1    acdc::slave         Passive           N/A                  N/A                    
beatles.blr.redhat.com      master        /bricks/brick3    rammstein::slave    Stable(Paused)    N/A                  N/A                    
pinkfloyd.blr.redhat.com    master        /bricks/brick2    led::slave          Active            N/A                  Changelog Crawl        




Expected results:
All nodes should have the proper status updated after resume.


Additional info:
All the data got synced because the Active node was not paused (the status was showing Active). Not sure what happens when an Active node gets rebooted.


I hit the same issue of one node being in the paused state and the other nodes in the Active/Passive state even without a reboot. It happens intermittently, not every time.

Resume fails with the following error, saying that geo-rep is not Paused on the following machines.

[root@pinkfloyd ~]# gluster v geo master acdc::slave resume
Staging failed on 10.70.43.127. Error: Geo-replication session between master and acdc::slave is not Paused.
Staging failed on beatles. Error: Geo-replication session between master and acdc::slave is not Paused.
Staging failed on metallica. Error: Geo-replication session between master and acdc::slave is not Paused.
geo-replication command failed


Then pause fails with the following message.

[root@pinkfloyd ~]# gluster v geo master acdc::slave pause
Geo-replication session between master and acdc::slave already Paused.
geo-replication command failed


This does not happen every time without a reboot. The only workaround I found was to stop and then restart geo-replication.
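
For reference, the stop/restart workaround maps to the standard geo-rep CLI (same volume and slave names assumed as in the commands above):

# gluster volume geo-replication master acdc::slave stop
# gluster volume geo-replication master acdc::slave start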

Comment 1 Anand Avati 2014-10-07 11:23:07 UTC
REVIEW: http://review.gluster.org/8911 (glusterd/geo-rep: Fix race in updating status file) posted (#1) for review on master by Kotresh HR (khiremat)

Comment 2 Anand Avati 2014-10-13 02:42:14 UTC
COMMIT: http://review.gluster.org/8911 committed in master by Venky Shankar (vshankar) 
------
commit 3b5b5042ec2f119e3ec807829c101c421e90e2da
Author: Kotresh HR <khiremat>
Date:   Fri Oct 3 17:35:47 2014 +0530

    glusterd/geo-rep: Fix race in updating status file
    
    When geo-rep is in paused state and a node in a cluster
    is rebooted, the geo-rep status goes to "faulty (Paused)"
    and no worker processes are started on that node yet. In
    this state, when geo-rep is resumed, there is a race in
    updating status file between glusterd and gsyncd itself
    as geo-rep is resumed first and then status is updated.
    glusterd tries to update to previous state and gsyncd
    tries to update it to "Initializing...(Paused)" on
    restart as it was paused previously. If gsyncd on restart
    wins, the state is always paused but the process is not
    actually paused. So the solution is for glusterd to update
    the status file and then resume.
    
    Change-Id: I348761a6e8c3ad2630c79833bc86587d062a8f92
    BUG: 1149982
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/8911
    Reviewed-by: Aravinda VK <avishwan>
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Venky Shankar <vshankar>
    Tested-by: Venky Shankar <vshankar>
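
To make the ordering described in the commit message concrete, here is a minimal, self-contained sketch of the race and the fix. It is written in Python purely for illustration; the function names and the status-file handling are hypothetical stand-ins, not the actual glusterd or gsyncd code.

# Illustrative sketch only: names and behaviour are simplified stand-ins,
# not actual glusterd/gsyncd code.

status = {"state": "Paused"}          # stands in for the on-disk status file

def gsyncd_restart():
    # A restarting worker first reads the status file, then later records its
    # startup state. If it read "Paused", it writes "Initializing...(Paused)".
    was_paused = "Paused" in status["state"]
    def write_startup_state():
        status["state"] = "Initializing...(Paused)" if was_paused else "Initializing..."
    return write_startup_state

def resume_before_fix():
    # Old order: resume first (the worker restarts and reads the still-paused
    # file), then glusterd updates the file. If the worker's write lands last,
    # the node keeps reporting "(Paused)" even though nothing is paused.
    worker_write = gsyncd_restart()
    status["state"] = "Active"        # glusterd restores the previous state
    worker_write()                    # ...and is clobbered by the worker

def resume_after_fix():
    # Fixed order: glusterd updates the status file first and only then
    # resumes, so the restarting worker never sees the stale "Paused" state.
    status["state"] = "Active"
    gsyncd_restart()()

resume_before_fix()
print(status["state"])                # Initializing...(Paused)  <- the bug
status["state"] = "Paused"            # reset the sketch
resume_after_fix()
print(status["state"])                # Initializing...          <- after the fix

In the actual change reviewed at http://review.gluster.org/8911, this reordering happens on the glusterd side, as described in the commit message above.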

Comment 3 Niels de Vos 2015-05-14 17:26:15 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.

glusterfs-3.7.0 has been announced on the Gluster mailing lists [1], and packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://thread.gmane.org/gmane.comp.file-systems.gluster.devel/10939
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user


