Bug 1474012

Summary: [geo-rep]: Incorrect last sync "0" during history crawl after upgrade/stop-start
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rahul Hinduja <rhinduja>
Component: geo-replication
Assignee: Kotresh HR <khiremat>
Status: CLOSED ERRATA
QA Contact: Rochelle <rallan>
Severity: medium
Priority: unspecified
Version: rhgs-3.3
CC: csaba, rhs-bugs, sheggodu, storage-qa-internal
Target Release: RHGS 3.4.0
Hardware: x86_64
OS: Linux
Whiteboard: rebase
Fixed In Version: glusterfs-3.12.2-1
Doc Type: If docs needed, set a value
Clones: 1500346
Last Closed: 2018-09-04 06:34:19 UTC
Type: Bug
Bug Depends On: 1569490, 1575490, 1577862, 1611104
Bug Blocks: 1500346, 1500853, 1503134

Description Rahul Hinduja 2017-07-23 06:27:21 UTC
Description of problem:
=======================

Observed a scenario where the last sync became zero during history crawl after an upgrade/reboot. Before the upgrade started, the crawl status was "Changelog Crawl" with a last sync time of "2017-07-21 12:51:55". However, after upgrading and starting geo-rep, the last sync for a few workers was shown as "0"; the corresponding brick status files also show "0".

[root@dhcp42-79 ~]# gluster volume geo-replication master 10.70.41.209::slave status
 
MASTER NODE     MASTER VOL    MASTER BRICK       SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED                  
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.42.79     master        /rhs/brick1/b1     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55          
10.70.42.79     master        /rhs/brick2/b5     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55          
10.70.42.79     master        /rhs/brick3/b9     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55          
10.70.42.74     master        /rhs/brick1/b3     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A                          
10.70.42.74     master        /rhs/brick2/b7     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A                          
10.70.42.74     master        /rhs/brick3/b11    root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A                          
10.70.41.217    master        /rhs/brick1/b4     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A                          
10.70.41.217    master        /rhs/brick2/b8     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A                          
10.70.41.217    master        /rhs/brick3/b12    root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A                          
10.70.43.210    master        /rhs/brick1/b2     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A                          
10.70.43.210    master        /rhs/brick2/b6     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A                          
10.70.43.210    master        /rhs/brick3/b10    root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A                          
[root@dhcp42-79 ~]# 
[root@dhcp42-79 ~]# date
Sun Jul 23 11:04:25 IST 2017
[root@dhcp42-79 ~]#


[root@dhcp42-74 ~]# cd /var/lib/glusterd/geo-replication/master_10.70.41.209_slave/
[root@dhcp42-74 master_10.70.41.209_slave]# ls
brick_%2Frhs%2Fbrick1%2Fb3.status  brick_%2Frhs%2Fbrick2%2Fb7.status  brick_%2Frhs%2Fbrick3%2Fb11.status  gsyncd.conf  monitor.pid  monitor.status
[root@dhcp42-74 master_10.70.41.209_slave]# cat brick_%2Frhs%2Fbrick1%2Fb3.status
{"checkpoint_time": 0, "last_synced": 0, "checkpoint_completed": "N/A", "meta": 0, "failures": 0, "entry": 583, "slave_node": "10.70.41.202", "data": 2083, "worker_status": "Active", "crawl_status": "History Crawl", "checkpoint_completion_time": 0}[root@dhcp42-74 master_10.70.41.209_slave]# 
[root@dhcp42-74 master_10.70.41.209_slave]# cat brick_%2Frhs%2Fbrick2%2Fb7.status
{"checkpoint_time": 0, "last_synced": 0, "checkpoint_completed": "N/A", "meta": 0, "failures": 0, "entry": 584, "slave_node": "10.70.41.202", "data": 2059, "worker_status": "Active", "crawl_status": "History Crawl", "checkpoint_completion_time": 0}[root@dhcp42-74 master_10.70.41.209_slave]# 
[root@dhcp42-74 master_10.70.41.209_slave]# cat brick_%2Frhs%2Fbrick3%2Fb11.status
{"checkpoint_time": 0, "last_synced": 0, "checkpoint_completed": "N/A", "meta": 0, "failures": 0, "entry": 586, "slave_node": "10.70.41.202", "data": 2101, "worker_status": "Active", "crawl_status": "History Crawl", "checkpoint_completion_time": 0}[root@dhcp42-74 master_10.70.41.209_slave]# 
[root@dhcp42-74 master_10.70.41.209_slave]# cat monitor.status
Started[root@dhcp42-74 master_10.70.41.209_slave]# 
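
For reference, last_synced and checkpoint_time in these files are Unix epoch seconds, and 0 is the freshly-initialised default that the status CLI renders as "N/A". A minimal sketch to decode the field (hypothetical helper, not part of gluster; path taken from the session above):

    import json
    import time

    # brick status file path from the session above; adjust per brick
    STATUS_FILE = ("/var/lib/glusterd/geo-replication/master_10.70.41.209_slave/"
                   "brick_%2Frhs%2Fbrick1%2Fb3.status")

    with open(STATUS_FILE) as f:
        status = json.load(f)

    last_synced = status.get("last_synced", 0)
    if last_synced == 0:
        # 0 is the default value; the status CLI shows it as N/A
        print("last_synced: 0 (never recorded)")
    else:
        print("last_synced: " + time.strftime("%Y-%m-%d %H:%M:%S",
                                              time.localtime(last_synced)))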


The status remained the same for more than 10 minutes, until one batch synced:



MASTER NODE     MASTER VOL    MASTER BRICK       SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED                  
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.42.79     master        /rhs/brick1/b1     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55          
10.70.42.79     master        /rhs/brick2/b5     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55          
10.70.42.79     master        /rhs/brick3/b9     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55          
10.70.41.217    master        /rhs/brick1/b4     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A                          
10.70.41.217    master        /rhs/brick2/b8     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A                          
10.70.41.217    master        /rhs/brick3/b12    root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A                          
10.70.42.74     master        /rhs/brick1/b3     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A                          
10.70.42.74     master        /rhs/brick2/b7     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A                          
10.70.42.74     master        /rhs/brick3/b11    root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A                          
10.70.43.210    master        /rhs/brick1/b2     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A                          
10.70.43.210    master        /rhs/brick2/b6     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A                          
10.70.43.210    master        /rhs/brick3/b10    root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A                          
Sun Jul 23 11:14:50 IST 2017


Version-Release number of selected component (if applicable):
=============================================================

glusterfs-geo-replication-3.8.4-35.el7rhgs.x86_64


How reproducible:
=================

I remember seeing this only once before, after a stop/start. I have tried the upgrade twice and seen this once.

Steps to Reproduce:
===================

No specific steps; the systems were upgraded, and as part of the upgrade geo-replication was stopped and started.
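
Since there are no deterministic steps, one rough way to script a check for the regression is to snapshot last_synced from the brick status files, stop/start the session, and compare. A hypothetical sketch (volume names and paths taken from this setup; the script itself is not part of gluster):

    import glob
    import json
    import subprocess

    SESSION_DIR = "/var/lib/glusterd/geo-replication/master_10.70.41.209_slave"
    GEOREP_CMD = ["gluster", "volume", "geo-replication",
                  "master", "10.70.41.209::slave"]

    def snapshot():
        # map each brick status file to its recorded last_synced epoch
        result = {}
        for path in glob.glob(SESSION_DIR + "/brick_*.status"):
            with open(path) as f:
                result[path] = json.load(f).get("last_synced", 0)
        return result

    before = snapshot()
    subprocess.check_call(GEOREP_CMD + ["stop"])
    subprocess.check_call(GEOREP_CMD + ["start"])
    after = snapshot()

    for path, old in before.items():
        if old and not after.get(path, 0):
            print("last_synced reset to 0 after restart: " + path)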

Actual results:
===============

Last sync is "0" in the brick status file (rendered as "N/A" in the status output).


Expected results:
=================

Last sync should be what it was before geo-rep was stopped. It looks like the brick status file was overwritten with "0" as the last synced time.
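
The likely direction of a fix would be along these lines: when a worker (re)initialises its brick status file, merge the defaults with whatever is already on disk instead of overwriting it. A hypothetical sketch only (not gsyncd's actual code, and not the upstream patch referenced in comment 4):

    import json
    import os

    # freshly-initialised defaults, mirroring the fields seen in the files above
    DEFAULTS = {"last_synced": 0, "checkpoint_time": 0,
                "checkpoint_completion_time": 0, "checkpoint_completed": "N/A",
                "meta": 0, "data": 0, "entry": 0, "failures": 0}

    def init_status_file(path):
        # Seed the brick status file with defaults, but never clobber fields
        # (e.g. last_synced) that a previous run already recorded.
        existing = {}
        if os.path.exists(path):
            with open(path) as f:
                try:
                    existing = json.load(f)
                except ValueError:
                    pass  # empty/corrupt file: fall back to defaults only
        merged = dict(DEFAULTS)
        merged.update(existing)  # values already on disk win over defaults
        with open(path, "w") as f:
            json.dump(merged, f)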

Comment 4 Kotresh HR 2017-10-10 12:43:50 UTC
Upstream Patch:

https://review.gluster.org/18468  (master)

Comment 8 errata-xmlrpc 2018-09-04 06:34:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607