Bug 1474012

Summary: [geo-rep]: Incorrect last sync "0" during history crawl after upgrade/stop-start
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rahul Hinduja <rhinduja>
Component: geo-replication
Assignee: Kotresh HR <khiremat>
Status: CLOSED ERRATA
QA Contact: Rochelle <rallan>
Severity: medium
Priority: unspecified
Version: rhgs-3.3
CC: csaba, rhs-bugs, sheggodu, storage-qa-internal
Target Release: RHGS 3.4.0
Hardware: x86_64
OS: Linux
Whiteboard: rebase
Fixed In Version: glusterfs-3.12.2-1
Doc Type: If docs needed, set a value
Clones: 1500346
Last Closed: 2018-09-04 06:34:19 UTC
Type: Bug
Bug Depends On: 1569490, 1575490, 1577862, 1611104
Bug Blocks: 1500346, 1500853, 1503134

Description Rahul Hinduja 2017-07-23 06:27:21 UTC
Description of problem:
=======================

Observed a scenario where the last sync became zero during history crawl after an upgrade/reboot. Before the upgrade started, the crawl status was "Changelog Crawl" with a last sync time of "2017-07-21 12:51:55". However, after upgrading and starting geo-rep, the last sync for a few workers was shown as "0"; the corresponding brick status files also show "0".

[root@dhcp42-79 ~]# gluster volume geo-replication master 10.70.41.209::slave status
 
MASTER NODE     MASTER VOL    MASTER BRICK       SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED                  
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.42.79     master        /rhs/brick1/b1     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55          
10.70.42.79     master        /rhs/brick2/b5     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55          
10.70.42.79     master        /rhs/brick3/b9     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55          
10.70.42.74     master        /rhs/brick1/b3     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A                          
10.70.42.74     master        /rhs/brick2/b7     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A                          
10.70.42.74     master        /rhs/brick3/b11    root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A                          
10.70.41.217    master        /rhs/brick1/b4     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A                          
10.70.41.217    master        /rhs/brick2/b8     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A                          
10.70.41.217    master        /rhs/brick3/b12    root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A                          
10.70.43.210    master        /rhs/brick1/b2     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A                          
10.70.43.210    master        /rhs/brick2/b6     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A                          
10.70.43.210    master        /rhs/brick3/b10    root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A                          
[root@dhcp42-79 ~]# 
[root@dhcp42-79 ~]# date
Sun Jul 23 11:04:25 IST 2017
[root@dhcp42-79 ~]#


[root@dhcp42-74 ~]# cd /var/lib/glusterd/geo-replication/master_10.70.41.209_slave/
[root@dhcp42-74 master_10.70.41.209_slave]# ls
brick_%2Frhs%2Fbrick1%2Fb3.status  brick_%2Frhs%2Fbrick2%2Fb7.status  brick_%2Frhs%2Fbrick3%2Fb11.status  gsyncd.conf  monitor.pid  monitor.status
[root@dhcp42-74 master_10.70.41.209_slave]# cat brick_%2Frhs%2Fbrick1%2Fb3.status
{"checkpoint_time": 0, "last_synced": 0, "checkpoint_completed": "N/A", "meta": 0, "failures": 0, "entry": 583, "slave_node": "10.70.41.202", "data": 2083, "worker_status": "Active", "crawl_status": "History Crawl", "checkpoint_completion_time": 0}[root@dhcp42-74 master_10.70.41.209_slave]# 
[root@dhcp42-74 master_10.70.41.209_slave]# cat brick_%2Frhs%2Fbrick2%2Fb7.status
{"checkpoint_time": 0, "last_synced": 0, "checkpoint_completed": "N/A", "meta": 0, "failures": 0, "entry": 584, "slave_node": "10.70.41.202", "data": 2059, "worker_status": "Active", "crawl_status": "History Crawl", "checkpoint_completion_time": 0}[root@dhcp42-74 master_10.70.41.209_slave]# 
[root@dhcp42-74 master_10.70.41.209_slave]# cat brick_%2Frhs%2Fbrick3%2Fb11.status
{"checkpoint_time": 0, "last_synced": 0, "checkpoint_completed": "N/A", "meta": 0, "failures": 0, "entry": 586, "slave_node": "10.70.41.202", "data": 2101, "worker_status": "Active", "crawl_status": "History Crawl", "checkpoint_completion_time": 0}[root@dhcp42-74 master_10.70.41.209_slave]# 
[root@dhcp42-74 master_10.70.41.209_slave]# cat monitor.status
Started[root@dhcp42-74 master_10.70.41.209_slave]# 
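
For reference, last_synced and checkpoint_time in these files are Unix epoch seconds, and 0 is the freshly-initialised default that the status CLI renders as "N/A". A minimal sketch to decode the field (hypothetical helper, not part of gluster; path taken from the session above):

    import json
    import time

    # brick status file path from the session above; adjust per brick
    STATUS_FILE = ("/var/lib/glusterd/geo-replication/master_10.70.41.209_slave/"
                   "brick_%2Frhs%2Fbrick1%2Fb3.status")

    with open(STATUS_FILE) as f:
        status = json.load(f)

    last_synced = status.get("last_synced", 0)
    if last_synced == 0:
        # 0 is the default value; the status CLI shows it as N/A
        print("last_synced: 0 (never recorded)")
    else:
        print("last_synced: " + time.strftime("%Y-%m-%d %H:%M:%S",
                                              time.localtime(last_synced)))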


The status remained the same for more than 10 minutes, until one batch synced:



MASTER NODE     MASTER VOL    MASTER BRICK       SLAVE USER    SLAVE                  SLAVE NODE      STATUS     CRAWL STATUS     LAST_SYNCED                  
----------------------------------------------------------------------------------------------------------------------------------------------------
10.70.42.79     master        /rhs/brick1/b1     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55          
10.70.42.79     master        /rhs/brick2/b5     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55          
10.70.42.79     master        /rhs/brick3/b9     root          10.70.41.209::slave    10.70.41.209    Active     History Crawl    2017-07-21 12:51:55          
10.70.41.217    master        /rhs/brick1/b4     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A                          
10.70.41.217    master        /rhs/brick2/b8     root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A                          
10.70.41.217    master        /rhs/brick3/b12    root          10.70.41.209::slave    10.70.42.177    Passive    N/A              N/A                          
10.70.42.74     master        /rhs/brick1/b3     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A                          
10.70.42.74     master        /rhs/brick2/b7     root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A                          
10.70.42.74     master        /rhs/brick3/b11    root          10.70.41.209::slave    10.70.41.202    Active     History Crawl    N/A                          
10.70.43.210    master        /rhs/brick1/b2     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A                          
10.70.43.210    master        /rhs/brick2/b6     root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A                          
10.70.43.210    master        /rhs/brick3/b10    root          10.70.41.209::slave    10.70.41.194    Passive    N/A              N/A                          
Sun Jul 23 11:14:50 IST 2017


Version-Release number of selected component (if applicable):
=============================================================

glusterfs-geo-replication-3.8.4-35.el7rhgs.x86_64


How reproducible:
=================

I remember seeing this only once before, after a stop/start. I have tried the upgrade twice and seen this once.

Steps to Reproduce:
===================

No specific steps; the systems were upgraded, and as part of the upgrade geo-replication was stopped and started.
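
Since there are no deterministic steps, one rough way to script a check for the regression is to snapshot last_synced from the brick status files, stop/start the session, and compare. A hypothetical sketch (volume names and paths taken from this setup; the script itself is not part of gluster):

    import glob
    import json
    import subprocess

    SESSION_DIR = "/var/lib/glusterd/geo-replication/master_10.70.41.209_slave"
    GEOREP_CMD = ["gluster", "volume", "geo-replication",
                  "master", "10.70.41.209::slave"]

    def snapshot():
        # map each brick status file to its recorded last_synced epoch
        result = {}
        for path in glob.glob(SESSION_DIR + "/brick_*.status"):
            with open(path) as f:
                result[path] = json.load(f).get("last_synced", 0)
        return result

    before = snapshot()
    subprocess.check_call(GEOREP_CMD + ["stop"])
    subprocess.check_call(GEOREP_CMD + ["start"])
    after = snapshot()

    for path, old in before.items():
        if old and not after.get(path, 0):
            print("last_synced reset to 0 after restart: " + path)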

Actual results:
===============

Last sync is "0" in the brick status file (rendered as "N/A" in the status output).


Expected results:
=================

Last sync should be what it was before geo-rep was stopped. It looks like the brick status file was overwritten with "0" as the last synced time.
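
The likely direction of a fix would be along these lines: when a worker (re)initialises its brick status file, merge the defaults with whatever is already on disk instead of overwriting it. A hypothetical sketch only (not gsyncd's actual code, and not the upstream patch referenced in comment 4):

    import json
    import os

    # freshly-initialised defaults, mirroring the fields seen in the files above
    DEFAULTS = {"last_synced": 0, "checkpoint_time": 0,
                "checkpoint_completion_time": 0, "checkpoint_completed": "N/A",
                "meta": 0, "data": 0, "entry": 0, "failures": 0}

    def init_status_file(path):
        # Seed the brick status file with defaults, but never clobber fields
        # (e.g. last_synced) that a previous run already recorded.
        existing = {}
        if os.path.exists(path):
            with open(path) as f:
                try:
                    existing = json.load(f)
                except ValueError:
                    pass  # empty/corrupt file: fall back to defaults only
        merged = dict(DEFAULTS)
        merged.update(existing)  # values already on disk win over defaults
        with open(path, "w") as f:
            json.dump(merged, f)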

Comment 4 Kotresh HR 2017-10-10 12:43:50 UTC
Upstream Patch:

https://review.gluster.org/18468  (master)

Comment 8 errata-xmlrpc 2018-09-04 06:34:19 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607