Bug 846569

Summary: geo-rep session goes to faulty with logging "connection to peer is broken"
Product: [Community] GlusterFS
Reporter: Vijaykumar Koppad <vkoppad>
Component: geo-replication
Assignee: Venky Shankar <vshankar>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Version: mainline
CC: bbandari, gluster-bugs, rabhat, vshankar
Keywords: Triaged
Hardware: x86_64
OS: Linux
Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Last Closed: 2013-07-24 17:57:35 UTC
Type: Bug
Bug Blocks: 849304

Description Vijaykumar Koppad 2012-08-08 07:11:33 UTC
Description of problem: After starting a geo-rep session, the status shows faulty
and the log says "connection to peer is broken". Running the gsyncd binary by hand,
i.e. running /usr/local/libexec/gsyncd, prints
"Segmentation fault (core dumped)", but no core file is actually dumped.

Version-Release number of selected component (if applicable): GlusterFS 3.3 master [bfac66f129646bc78f1ed3a7dccb3010114e57aa]



How reproducible: Consistently


Steps to Reproduce:
1. Start a geo-rep session between master and slave.
2. Check the status.

  
Actual results: The status is faulty.


Expected results: The status should be OK.


Additional info:
Logs:

[2012-08-07 19:19:59.37766] I [syncdutils:148:finalize] <top>: exiting.
[2012-08-07 19:20:09.49307] I [monitor(monitor):80:monitor] Monitor: ------------------------------------------------------------
[2012-08-07 19:20:09.49647] I [monitor(monitor):81:monitor] Monitor: starting gsyncd worker
[2012-08-07 19:20:09.93501] I [gsyncd:388:main_i] <top>: syncing: gluster://localhost:master -> file:///root/geo
[2012-08-07 19:20:09.115250] E [syncdutils:179:log_raise_exception] <top>: connection to peer is broken
[2012-08-07 19:20:09.115509] E [resource:191:errlog] Popen: command "/usr/local/libexec/glusterfs/gsyncd --session-owner a9afadf5-c9d1-452c-883b-fd16f9f7a686 -N --listen --timeout 120 file:///root/geo" returned with -11
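
The "returned with -11" in the last log line can be decoded as follows. This is a minimal sketch, assuming the value comes from a Python subprocess-style return code, where a negative number -N means the child was terminated by signal N on POSIX. The helper name describe_returncode is hypothetical and not part of gsyncd.

```python
import signal


def describe_returncode(returncode: int) -> str:
    """Describe a child's return code using the POSIX Popen convention.

    Hypothetical helper for illustration; not part of gsyncd.
    """
    if returncode < 0:
        # A negative value -N means the child was killed by signal N.
        signum = -returncode
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    return f"exited with status {returncode}"


print(describe_returncode(-11))
```

Signal 11 is SIGSEGV on Linux, which matches the "Segmentation fault" seen when the wrapper is run by hand.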

Comment 1 Venky Shankar 2012-08-09 05:10:20 UTC
Initial analysis:

The gsyncd slave process failed during startup. The gsyncd wrapper (binary) relies on GF_* macros. Here is the tail of the backtrace (in ascending order of frame number), obtained by attaching gdb to the gsyncd wrapper.

----------------------------------------------------------------------------
function=0x7fbdbde4a3c0 "__glusterfs_this_location", line=124, level=GF_LOG_WARNING, 
    fmt=0x7fbdbde4a2b4 "pthread setspecific failed") at logging.c:442
#24358 0x00007fbdbde21809 in __glusterfs_this_location () at globals.c:124
#24359 0x00007fbdbddee4cc in _gf_log (domain=0x7fbdbde4a10b "", file=0x7fbdbde4a299 "globals.c", function=0x7fbdbde4a3c0 "__glusterfs_this_location", line=124, level=GF_LOG_WARNING, 
    fmt=0x7fbdbde4a2b4 "pthread setspecific failed") at logging.c:442
#24360 0x00007fbdbde21809 in __glusterfs_this_location () at globals.c:124
#24361 0x00007fbdbddee4cc in _gf_log (domain=0x7fbdbde4a10b "", file=0x7fbdbde4a299 "globals.c", function=0x7fbdbde4a3c0 "__glusterfs_this_location", line=124, level=GF_LOG_WARNING, 
    fmt=0x7fbdbde4a2b4 "pthread setspecific failed") at logging.c:442
#24362 0x00007fbdbde21809 in __glusterfs_this_location () at globals.c:124
#24363 0x00007fbdbde1d09c in __gf_calloc (nmemb=64, size=8, type=82) at mem-pool.c:112
#24364 0x00007fbdbde373a6 in runinit (runner=0x7fffc82abf40) at run.c:54
#24365 0x00000000004017f4 in invoke_gsyncd (argc=8, argv=0x7fffc82ad0e8) at gsyncd.c:114
#24366 0x0000000000402312 in main (argc=8, argv=0x7fffc82ad0e8) at gsyncd.c:331

-------------------------------------------------------------------------------

There's a recursive sequence of _gf_log and __glusterfs_this_location calls. This is triggered by __gf_calloc trying to access THIS, which is not valid in the gsyncd context. __glusterfs_this_location tries to log this failure (using gf_log), and gf_log in turn tries to access THIS. Hence the recursive sequence of these function calls, which at more than 24000 frames deep is consistent with the stack being exhausted and the process segfaulting.

Looks like commit ed4b76ba introduced some references to THIS in __gf_calloc.

As a quick workaround we could pass -DRUN_STANDALONE while compiling the gsyncd wrapper (Makefile changes). This would make GF_CALLOC expand to calloc() and the other GF_* macros expand to their relevant non-glusterfs calls.

The other fix would be to make THIS valid in the gsyncd context, which would involve some initialization steps (similar to what is done in the cli).
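
The recursion described above can be illustrated with a toy model. This is only a sketch of the call pattern, not the glusterfs code: the function names, the this_valid flag and the MAX_DEPTH limit are all assumptions made for illustration, with MAX_DEPTH standing in for the finite C stack.

```python
MAX_DEPTH = 100  # stand-in for the finite C stack; the real limit is far larger


class StackExhausted(Exception):
    """Toy equivalent of the crash caused by running out of stack."""


def this_location(this_valid, depth):
    """Model of the THIS lookup: it reports through the logger when THIS is unusable."""
    if depth > MAX_DEPTH:
        raise StackExhausted(depth)
    if not this_valid:
        log_message("pthread setspecific failed", this_valid, depth + 1)
    return "THIS"


def log_message(message, this_valid, depth):
    """Model of the logger: it needs THIS itself before it can emit anything."""
    if depth > MAX_DEPTH:
        raise StackExhausted(depth)
    this_location(this_valid, depth + 1)
    return message


def calloc_like(nmemb, size, this_valid):
    """Model of an allocator that consults THIS before allocating."""
    this_location(this_valid, 0)
    return bytearray(nmemb * size)


def survives(this_valid):
    """Return True if the allocation completes, False if the stack runs out."""
    try:
        calloc_like(64, 8, this_valid)
    except StackExhausted:
        return False
    return True
```

In this model, survives(False) corresponds to the current gsyncd wrapper, where the lookup and the logger keep calling each other, and survives(True) corresponds to the second fix, where THIS is initialized before any allocation. The first workaround avoids the lookup entirely by not going through the allocator wrapper at all.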

Comment 2 Csaba Henk 2012-08-27 14:20:26 UTC
*** Bug 851951 has been marked as a duplicate of this bug. ***