Bug 1236554

Summary: [geo-rep]: Once the bricks are killed, the worker dies; after a few retries the worker comes back and the session becomes Active without the brick being online
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rahul Hinduja <rhinduja>
Component: geo-replication
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED WONTFIX
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.1
CC: avishwan, chrisw, csaba, khiremat, nlevinki, sarumuga, smohan
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-16 15:55:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1236546, 1239044, 1247882
Bug Blocks:

Description Rahul Hinduja 2015-06-29 12:20:30 UTC
Description of problem:
=======================

The geo-rep status shows one of the bricks as ACTIVE even though its corresponding brick process is not running. The brick process was killed with kill -9 and the session went Faulty, which is expected, but the worker retries and comes back online.

This is seen after the issue mentioned in bug id: 1236546

No brick process running from the node: georep3 for volume master:
==================================================================
[root@georep3 ~]# ps -eaf | grep glusterfsd | grep master
[root@georep3 ~]#

But the worker is running as:
=============================
[root@georep3 ~]# ps -eaf | grep gsyncd | grep feedback
root     27264 16706  0 19:40 ?        00:00:23 python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/rhs/brick1/b1 --path=/rhs/brick2/b2  -c /var/lib/glusterd/geo-replication/master_10.70.46.101_slave/gsyncd.conf --iprefix=/var :master --glusterd-uuid=932e669a-e61a-426b-8caf-d698a7ddb6f2 10.70.46.101::slave -N -p  --slave-id 868d5550-8bb6-4360-bfd5-40d2bd9b9adf --feedback-fd 13 --local-path /rhs/brick1/b1 --local-id .%2Frhs%2Fbrick1%2Fb1 --rpc-fd 10,9,7,11 --subvol-num 2 --resource-remote ssh://root@10.70.46.101:gluster://localhost:slave
[root@georep3 ~]#
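
For illustration, here is a minimal Python sketch of a monitor-style respawn loop. This is an assumption of how the restart behaviour could look, NOT gsyncd's actual code; BRICK_PATH, brick_is_alive(), spawn_worker() and the 10-second retry interval are all hypothetical:

#!/usr/bin/env python3
# Sketch: a monitor loop that respawns a per-brick worker after it dies.
# The reported behaviour matches a loop that respawns unconditionally;
# the brick_is_alive() guard below is the kind of check that would keep
# the session Faulty while the brick process is down.
import subprocess
import time

BRICK_PATH = "/rhs/brick1/b1"  # hypothetical brick served by this worker

def brick_is_alive(brick_path):
    """True if some glusterfsd process has this brick path on its cmdline."""
    out = subprocess.run(["pgrep", "-af", "glusterfsd"],
                         capture_output=True, text=True).stdout
    return brick_path in out

def spawn_worker(brick_path):
    """Placeholder for forking/exec'ing the gsyncd worker for one brick."""
    print("starting worker for %s" % brick_path)

while True:
    if not brick_is_alive(BRICK_PATH):
        # Without this guard the worker gets restarted anyway and the
        # geo-rep session shows Active with the brick offline.
        print("brick %s is down; not respawning" % BRICK_PATH)
    else:
        spawn_worker(BRICK_PATH)
    time.sleep(10)  # assumed retry interval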

Due to this, the geo-rep status shows the worker as Active and participating in syncing:
================================================================================

[root@georep3 ~]# gluster volume geo-replication status | grep georep3
georep3        master        /rhs/brick1/b1    root          ssh://10.70.46.101::slave    10.70.46.101    Active     Changelog Crawl    2015-06-29 18:11:28          
georep3        master        /rhs/brick2/b2    root          ssh://10.70.46.101::slave    N/A             Faulty     N/A                N/A                          
[root@georep3 ~]#
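
A quick cross-check can flag this mismatch on a node. Below is a hedged Python sketch, assuming the column layout of the status output pasted above (node, volume, brick, user, slave, slave node, status, ...) and that it runs on the affected node:

#!/usr/bin/env python3
# Sketch: flag geo-rep workers reported Active on this node whose brick
# has no running glusterfsd process. Column positions are assumed from
# the status output pasted above.
import socket
import subprocess

status = subprocess.run(["gluster", "volume", "geo-replication", "status"],
                        capture_output=True, text=True).stdout
procs = subprocess.run(["pgrep", "-af", "glusterfsd"],
                       capture_output=True, text=True).stdout

host = socket.gethostname()
for line in status.splitlines():
    fields = line.split()
    # Assumed columns: node, volume, brick, user, slave, slave node, status, ...
    if len(fields) >= 7 and fields[0] == host and "Active" in fields:
        brick = fields[2]
        if brick not in procs:
            print("MISMATCH: worker for %s is Active but no glusterfsd "
                  "process serves it" % brick)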


Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.7.1-5.el6rhs.x86_64



How reproducible:
=================
Tried once; will update after retrying bz 1236546.


Steps to Reproduce:
===================
As mentioned in bz: 1236546

Comment 5 Kotresh HR 2015-12-02 06:21:49 UTC
Rahul,

I believe the setup on which this bug was hit is invalid: NTP was not configured (BZ 1236546). Could you re-test and close this bug if it can't be reproduced?
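
For what it's worth, a minimal pre-test sanity check for the NTP precondition (a sketch; it assumes the classic ntpd tooling, where ntpstat exits 0 only when the clock is synchronised):

#!/usr/bin/env python3
# Sketch: verify the node's clock is NTP-synchronised before re-testing.
# 'ntpstat' exits 0 when synchronised, non-zero otherwise.
import subprocess

rc = subprocess.call(["ntpstat"])
print("clock synchronised" if rc == 0 else
      "clock NOT synchronised; configure NTP before re-testing")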