+++ This bug was initially created as a clone of Bug #1149982 +++
+++ This bug was initially created as a clone of Bug #1142960 +++

Description of problem:
When you pause a geo-rep session, reboot one of the passive nodes, and resume the session after the node comes back online, the status of that node is stuck at "Stable(Paused)" even after a long time. All other machines have moved on to the Active/Passive state except for the node that was rebooted.

Version-Release number of selected component (if applicable):
mainline

How reproducible:
Hit only once. Haven't tried again, but it seems like an easily reproducible issue.

Steps to Reproduce:
1. Create and start a geo-rep session between a 2*2 dist-rep master and a 2*2 dist-rep slave.
2. Create some data and let it sync to the slave.
3. Pause the session using the geo-rep pause command, and check the status. It should be "Stable(Paused)".
4. Reboot one of the passive nodes and wait for it to come back online.
5. Check the status. All but the rebooted node should be in the "Stable(Paused)" state; the rebooted node should be in the "faulty(Paused)" state.
6. Resume the session using the geo-rep resume command.
7. Check the status.

Actual results:
The status of the rebooted node is stuck at "Stable(Paused)" while the other nodes' states go back to Active/Passive:

MASTER NODE                 MASTER VOL    MASTER BRICK      SLAVE               STATUS            CHECKPOINT STATUS    CRAWL STATUS
---------------------------------------------------------------------------------------------------------------------------------------
ccr.blr.redhat.com          master        /bricks/brick0    nirvana::slave      Active            N/A                  Changelog Crawl
metallica.blr.redhat.com    master        /bricks/brick1    acdc::slave         Passive           N/A                  N/A
beatles.blr.redhat.com      master        /bricks/brick3    rammstein::slave    Stable(Paused)    N/A                  N/A
pinkfloyd.blr.redhat.com    master        /bricks/brick2    led::slave          Active            N/A                  Changelog Crawl

Expected results:
All nodes should have the proper status updated after resume.
Additional info:
All the data got synced because the Active node was not paused (its status was showing Active). Not sure what happens when an Active node gets rebooted.

I hit the same issue (one node in the paused state and the other nodes in the Active/Passive state) even without a reboot. It happens intermittently, not every time.

Resume then fails with the following error, saying geo-rep is not paused on those machines:

[root@pinkfloyd ~]# gluster v geo master acdc::slave resume
Staging failed on 10.70.43.127. Error: Geo-replication session between master and acdc::slave is not Paused.
Staging failed on beatles. Error: Geo-replication session between master and acdc::slave is not Paused.
Staging failed on metallica. Error: Geo-replication session between master and acdc::slave is not Paused.
geo-replication command failed

Then pause fails with the following message:

[root@pinkfloyd ~]# gluster v geo master acdc::slave pause
Geo-replication session between master and acdc::slave already Paused.
geo-replication command failed

This does not happen every time without a reboot. The only workaround I found was to stop and then restart geo-replication.
REVIEW: http://review.gluster.org/9021 (glusterd/geo-rep: Fix race in updating status file) posted (#1) for review on release-3.6 by Kotresh HR (khiremat)
COMMIT: http://review.gluster.org/9021 committed in release-3.6 by Venky Shankar (vshankar)

------
commit beedf68266f19ac77b77f2ec5f9533f3e63c159f
Author: Kotresh HR <khiremat>
Date:   Fri Oct 3 17:35:47 2014 +0530

    glusterd/geo-rep: Fix race in updating status file

    When geo-rep is in the paused state and a node in the cluster is
    rebooted, the geo-rep status goes to "faulty (Paused)" and no worker
    processes are started on that node yet. In this state, when geo-rep
    is resumed, there is a race between glusterd and gsyncd in updating
    the status file, because geo-rep is resumed first and the status is
    updated afterwards. glusterd tries to update it to the previous
    state, while gsyncd on restart tries to update it to
    "Initializing...(Paused)" because it was paused previously. If
    gsyncd wins on restart, the state is always paused but the process
    is not actually paused. So the solution is for glusterd to update
    the status file first and then resume.

    BUG: 1159195
    Change-Id: I4c06f42226db98f5a3c49b90f31ecf6cf2b6d0cb
    Reviewed-on: http://review.gluster.org/8911
    Signed-off-by: Kotresh HR <khiremat>
    Reviewed-on: http://review.gluster.org/9021
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Venky Shankar <vshankar>
    Tested-by: Venky Shankar <vshankar>
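The ordering problem the commit describes can be sketched in plain Python. This is a hypothetical illustration, not the actual glusterd/gsyncd code: the file name, states, and functions are invented, and the sketch simply models the case where the restarting worker's write lands last. The point is the ordering: if glusterd resumes first and updates the per-brick status file afterwards, a gsyncd worker restarting in between can write the stale paused state back on top of glusterd's update.

```python
import os
import tempfile

# Hypothetical model of the race from the commit message. The "status
# file" is a plain text file holding one brick's session state; both
# glusterd and a restarting gsyncd worker write to it.

def write_status(path, state):
    with open(path, "w") as f:
        f.write(state)

def read_status(path):
    with open(path) as f:
        return f.read()

def resume_buggy(path):
    # Buggy order: resume first, then update the status file. A worker
    # restarting in between still sees the session as paused, and its
    # write lands last, so the node looks paused forever.
    write_status(path, "Passive")                  # glusterd's update...
    write_status(path, "Initializing...(Paused)")  # ...clobbered by gsyncd

def resume_fixed(path):
    # Fixed order: glusterd rewrites the status file *before* resuming,
    # so a worker that restarts afterwards starts from the resumed
    # state and never writes "(Paused)" back.
    write_status(path, "Passive")

status = os.path.join(tempfile.mkdtemp(), "brick0.status")

write_status(status, "faulty(Paused)")  # state after the reboot
resume_buggy(status)
print(read_status(status))              # stuck at "Initializing...(Paused)"

write_status(status, "faulty(Paused)")
resume_fixed(status)
print(read_status(status))              # "Passive", as expected after resume
```

The fix in the patch is exactly this reordering: update the status file first, then trigger the resume.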
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.7.0, please open a new bug report.