Bug 1062138

Summary: Dist-geo-rep : too many "connection to peer is broken" which resulted in failures in removing from slave.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Vijaykumar Koppad <vkoppad>
Component: geo-replication
Assignee: Bug Updates Notification Mailing List <rhs-bugs>
Status: CLOSED EOL
QA Contact: storage-qa-internal <storage-qa-internal>
Severity: high
Docs Contact:
Priority: medium
Version: 2.1
CC: avishwan, chrisw, csaba, david.macdonald, nlevinki
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2015-11-25 08:49:25 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions: ---
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments: sosreports of all the nodes of master and slave cluster (flags: none)

Description Vijaykumar Koppad 2014-02-06 09:41:15 UTC
Description of problem: The geo-rep session between master and slave keeps getting disconnected for unknown reasons. Each disconnection restarts gsyncd and kicks in a hybrid crawl. Since the hybrid crawl can't sync deletes and renames to the slave, these disconnections can become a major problem if the master is going through a lot of renames and deletes.

Logs from the geo-rep log file:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
[2014-02-06 14:43:56.596056] E [syncdutils(/bricks/brick11):223:log_raise_exception] <top>: connection to peer is broken
[2014-02-06 14:43:56.610052] E [resource(/bricks/brick11):204:errlog] Popen: command "ssh -oPasswordAuthentication=no -oStrictHostKeyChecking=no -i /var/lib/glusterd/geo-replication/secret.pem -oControlMaster=auto -S /tmp/gsyncd-aux-ssh-LtOvaT/0a2c0d8cd2752e32a15bade57111bc93.sock root.37.141 /nonexistent/gsyncd --session-owner f8c73824-3fa5-4439-bfbb-50760f8773c8 -N --listen --timeout 120 gluster://localhost:imaster" returned with 255, saying:
[2014-02-06 14:43:56.610382] E [resource(/bricks/brick11):207:logerr] Popen: ssh> [2014-02-06 09:04:26.161084] I [socket.c:3505:socket_init] 0-glusterfs: SSL support is NOT enabled
[2014-02-06 14:43:56.610667] E [resource(/bricks/brick11):207:logerr] Popen: ssh> [2014-02-06 09:04:26.161111] I [socket.c:3520:socket_init] 0-glusterfs: using system polling thread
[2014-02-06 14:43:56.610968] E [resource(/bricks/brick11):207:logerr] Popen: ssh> [2014-02-06 09:04:26.161735] I [socket.c:3505:socket_init] 0-glusterfs: SSL support is NOT enabled
[2014-02-06 14:43:56.611215] E [resource(/bricks/brick11):207:logerr] Popen: ssh> [2014-02-06 09:04:26.161752] I [socket.c:3520:socket_init] 0-glusterfs: using system polling thread
[2014-02-06 14:43:56.611485] E [resource(/bricks/brick11):207:logerr] Popen: ssh> [2014-02-06 09:04:26.353283] I [socket.c:2235:socket_event_handler] 0-transport: disconnecting now
[2014-02-06 14:43:56.611813] E [resource(/bricks/brick11):207:logerr] Popen: ssh> [2014-02-06 09:04:26.354623] I [cli-rpc-ops.c:5338:gf_cli_getwd_cbk] 0-cli: Received resp to getwd
[2014-02-06 14:43:56.612189] E [resource(/bricks/brick11):207:logerr] Popen: ssh> [2014-02-06 09:04:26.354680] I [input.c:36:cli_batch] 0-: Exiting with: 0
[2014-02-06 14:43:56.612432] E [resource(/bricks/brick11):207:logerr] Popen: ssh> Killed by signal 15.
[2014-02-06 14:43:56.613003] I [syncdutils(/bricks/brick11):192:finalize] <top>: exiting.
[2014-02-06 14:43:56.616291] E [syncdutils(/bricks/brick3):223:log_raise_exception] <top>: connection to peer is broken
[2014-02-06 14:43:56.617547] I [monitor(monitor):81:set_state] Monitor: new state: faulty
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>


Version-Release number of selected component (if applicable): glusterfs-3.4.0.59rhs-1


How reproducible: Doesn't happen every time.


Steps to Reproduce:
No definite steps; it can happen at any time.

1. Create and start a geo-rep session between a master (6x2) and a slave (6x2).
2. Keep creating and deleting files on the master.
3. Check the geo-rep logs for disconnection messages (see the sketch below).
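For step 3, a quick way to gauge how often the session drops is to scan the master-side geo-rep log for the error shown in the excerpt above. The helper below is a minimal sketch and is not part of gsyncd; the script, its name, and the log path you pass to it are assumptions, and only the "connection to peer is broken" string and the log line format are taken from the excerpt.

#!/usr/bin/env python
# count_disconnects.py (hypothetical helper, not shipped with glusterfs):
# count "connection to peer is broken" errors per brick in a geo-rep log
# to see how frequently the session to the slave is dropping.
import re
import sys
from collections import Counter

# Matches lines like:
# [2014-02-06 14:43:56.596056] E [syncdutils(/bricks/brick11):223:log_raise_exception] <top>: connection to peer is broken
DISCONNECT_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] E \[syncdutils\((?P<brick>[^)]+)\).*"
    r"connection to peer is broken"
)

def count_disconnects(path):
    counts = Counter()
    with open(path) as logfile:
        for line in logfile:
            match = DISCONNECT_RE.match(line)
            if match:
                counts[match.group("brick")] += 1
    return counts

if __name__ == "__main__":
    # Usage: python count_disconnects.py <master-side geo-rep log file>
    # (logs live under /var/log/glusterfs/geo-replication/ on the master)
    for brick, hits in count_disconnects(sys.argv[1]).most_common():
        print("%6d  %s" % (hits, brick))

A high per-brick count over a short window would confirm the "too many disconnections" behaviour described above without having to read through the full log.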

Actual results: Too many disconnections between master and slave.


Expected results: There shouldn't be so many disconnections without a reason.


Additional info:

Comment 1 Vijaykumar Koppad 2014-02-06 09:49:50 UTC
Created attachment 860080 [details]
sosreports of all the nodes of master and slave cluster

Comment 4 Aravinda VK 2015-11-25 08:49:25 UTC
Closing this bug since RHGS 2.1 release reached EOL. Required bugs are cloned to RHGS 3.1. Please re-open this issue if found again.
