Bug 1236554

Summary: [geo-rep]: Once the bricks are killed, the worker dies; after a few retries the worker comes back and the session becomes Active without the brick being online
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Rahul Hinduja <rhinduja>
Component: geo-replication
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED WONTFIX
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: high
Docs Contact:
Priority: high
Version: rhgs-3.1
CC: avishwan, chrisw, csaba, khiremat, nlevinki, sarumuga, smohan
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-04-16 15:55:53 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1236546, 1239044, 1247882
Bug Blocks:

Description Rahul Hinduja 2015-06-29 12:20:30 UTC
Description of problem:
=======================

The geo-rep status shows one of the bricks as ACTIVE even though its corresponding brick process is not running. The brick process was killed with kill -9 and the session went Faulty, which is expected, but the worker retries and comes back online.

This is seen after the issue mentioned in bug id: 1236546

No brick process running from the node: georep3 for volume master:
==================================================================
[root@georep3 ~]# ps -eaf | grep glusterfsd | grep master
[root@georep3 ~]#

But the worker is running as:
=============================
[root@georep3 ~]# ps -eaf | grep gsyncd | grep feedback
root     27264 16706  0 19:40 ?        00:00:23 python /usr/libexec/glusterfs/python/syncdaemon/gsyncd.py --path=/rhs/brick1/b1 --path=/rhs/brick2/b2  -c /var/lib/glusterd/geo-replication/master_10.70.46.101_slave/gsyncd.conf --iprefix=/var :master --glusterd-uuid=932e669a-e61a-426b-8caf-d698a7ddb6f2 10.70.46.101::slave -N -p  --slave-id 868d5550-8bb6-4360-bfd5-40d2bd9b9adf --feedback-fd 13 --local-path /rhs/brick1/b1 --local-id .%2Frhs%2Fbrick1%2Fb1 --rpc-fd 10,9,7,11 --subvol-num 2 --resource-remote ssh://root@10.70.46.101:gluster://localhost:slave
[root@georep3 ~]#
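
For illustration, here is a minimal Python sketch of a monitor-style respawn loop. This is an assumption of how the restart behaviour could look, NOT gsyncd's actual code; BRICK_PATH, brick_is_alive(), spawn_worker() and the 10-second retry interval are all hypothetical:

#!/usr/bin/env python3
# Sketch: a monitor loop that respawns a per-brick worker after it dies.
# The reported behaviour matches a loop that respawns unconditionally;
# the brick_is_alive() guard below is the kind of check that would keep
# the session Faulty while the brick process is down.
import subprocess
import time

BRICK_PATH = "/rhs/brick1/b1"  # hypothetical brick served by this worker

def brick_is_alive(brick_path):
    """True if some glusterfsd process has this brick path on its cmdline."""
    out = subprocess.run(["pgrep", "-af", "glusterfsd"],
                         capture_output=True, text=True).stdout
    return brick_path in out

def spawn_worker(brick_path):
    """Placeholder for forking/exec'ing the gsyncd worker for one brick."""
    print("starting worker for %s" % brick_path)

while True:
    if not brick_is_alive(BRICK_PATH):
        # Without this guard the worker gets restarted anyway and the
        # geo-rep session shows Active with the brick offline.
        print("brick %s is down; not respawning" % BRICK_PATH)
    else:
        spawn_worker(BRICK_PATH)
    time.sleep(10)  # assumed retry interval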

Due to this, the geo-rep status shows the worker as Active and participating in syncing:
================================================================================

[root@georep3 ~]# gluster volume geo-replication status | grep georep3
georep3        master        /rhs/brick1/b1    root          ssh://10.70.46.101::slave    10.70.46.101    Active     Changelog Crawl    2015-06-29 18:11:28          
georep3        master        /rhs/brick2/b2    root          ssh://10.70.46.101::slave    N/A             Faulty     N/A                N/A                          
[root@georep3 ~]#
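
A quick cross-check can flag this mismatch on a node. Below is a hedged Python sketch, assuming the column layout of the status output pasted above (node, volume, brick, user, slave, slave node, status, ...) and that it runs on the affected node:

#!/usr/bin/env python3
# Sketch: flag geo-rep workers reported Active on this node whose brick
# has no running glusterfsd process. Column positions are assumed from
# the status output pasted above.
import socket
import subprocess

status = subprocess.run(["gluster", "volume", "geo-replication", "status"],
                        capture_output=True, text=True).stdout
procs = subprocess.run(["pgrep", "-af", "glusterfsd"],
                       capture_output=True, text=True).stdout

host = socket.gethostname()
for line in status.splitlines():
    fields = line.split()
    # Assumed columns: node, volume, brick, user, slave, slave node, status, ...
    if len(fields) >= 7 and fields[0] == host and "Active" in fields:
        brick = fields[2]
        if brick not in procs:
            print("MISMATCH: worker for %s is Active but no glusterfsd "
                  "process serves it" % brick)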


Version-Release number of selected component (if applicable):
=============================================================

glusterfs-3.7.1-5.el6rhs.x86_64



How reproducible:
=================
Tried once; will update after retrying bz 1236546.


Steps to Reproduce:
===================
As mentioned in bz: 1236546

Comment 5 Kotresh HR 2015-12-02 06:21:49 UTC
Rahul,

I believe the setup on which this bug was hit is invalid: NTP was not configured (BZ 1236546). Could you re-test and close this bug if it can't be reproduced?
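
For what it's worth, a minimal pre-test sanity check for the NTP precondition (a sketch; it assumes the classic ntpd tooling, where ntpstat exits 0 only when the clock is synchronised):

#!/usr/bin/env python3
# Sketch: verify the node's clock is NTP-synchronised before re-testing.
# 'ntpstat' exits 0 when synchronised, non-zero otherwise.
import subprocess

rc = subprocess.call(["ntpstat"])
print("clock synchronised" if rc == 0 else
      "clock NOT synchronised; configure NTP before re-testing")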