Bug 762227 (GLUSTER-495)

Summary: cluster/replicate segfaults on armv5tel.
Product: [Community] GlusterFS Reporter: Hraban Luyat <bubblboy>
Component: replicate Assignee: Vikas Gorur <vikas>
Status: CLOSED NOTABUG QA Contact:
Severity: high Docs Contact:
Priority: medium    
Version: mainline CC: anush, gluster-bugs
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: Type: ---
Regression: RTNR Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Attachments:
Log of a GDB session where the segfault is observed. (flags: none)

Description Hraban Luyat 2009-12-21 03:25:37 UTC
I forgot to mention that the segfault only occurs when accessing the volume. Starting glusterfs is no problem, and if you leave the mount alone nothing goes wrong. The breakpoint is only reached after running $ ls /mnt/test/.

Regards

Comment 1 Hraban Luyat 2009-12-21 03:55:11 UTC
*ouf*, mea culpa, this was totally my fault. I thought the return value of dict_get_ptr was never checked, so I did not check it either in my second patch for bug #762225, but I overlooked the if (ret != 0) checks that are there pretty much everywhere. All my apologies. This bug is invalid and so is the other patch; I will commit an updated patch asap!

also, I mixed up step and next in gdb... that gave it away, eventually.

blegh.
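
For context on the fix described above: the mistake was skipping the return-value check on dict_get_ptr before using the pointer it fills in. Below is a minimal sketch of the checked pattern, assuming the usual libglusterfs signature int dict_get_ptr (dict_t *this, char *key, void **ptr), which returns 0 on success; the function name and key below are made up for illustration and are not the actual replicate code.

        #include "dict.h"        /* libglusterfs */

        /* Illustrative only: check dict_get_ptr before touching the pointer.
         * Dereferencing 'value' without the ret check is the kind of mistake
         * that produced the segfault reported here. */
        static int
        use_xattr_value (dict_t *xattr)
        {
                void *value = NULL;
                int   ret   = -1;

                ret = dict_get_ptr (xattr, "trusted.afr.example", &value);
                if (ret != 0) {
                        /* key missing or lookup failed: value is unusable */
                        return -1;
                }

                /* only here is value known to be valid */
                return 0;
        }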

Comment 2 Hraban Luyat 2009-12-21 06:23:10 UTC
Hello all,

Yet another error on the armv5tel arch, ye good ole segfault this time. It has something to do with afr, but it all looks pretty confusing to me. A simple configuration (two local dirs replicated) already yields the error, consistently at the same place in the code. I ran it in GDB and printed some local variables that looked interesting as the code was approaching certain doom; hopefully that is of some use to somebody.

The error always occurs during the STACK_WIND call on line 2995 of fuse-bridge.c:

2989 
2990         dict = dict_new ();
2991         frame = create_frame (this, this->ctx->pool);
2992         frame->root->type = GF_OP_TYPE_FOP_REQUEST;
2993         xl = this->children->xlator;
2994 
2995         STACK_WIND (frame, fuse_first_lookup_cbk, xl, xl->fops->lookup,
2996                     &loc, dict);
2997         dict_unref (dict);
2998 
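
A note on why the backtrace lands on the STACK_WIND line: the macro itself does little beyond recording the callback and calling into the child translator's fop, so a crash reported "during" STACK_WIND usually means the faulting code is inside that child (afr's lookup here). A tiny self-contained model of that wind/callback dispatch follows, using made-up stand-in types rather than the real frame/xlator structures from stack.h:

        #include <stdio.h>

        /* Made-up stand-ins for glusterfs frames and translators. */
        struct frame  { void (*ret) (struct frame *f, int op_ret); };
        struct fops   { int  (*lookup) (struct frame *f, const char *path); };
        struct xlator { struct fops *fops; };

        static void
        lookup_cbk (struct frame *f, int op_ret)
        {
                printf ("callback reached, op_ret=%d\n", op_ret);
        }

        static int
        child_lookup (struct frame *f, const char *path)
        {
                /* if this function (afr's lookup in the real stack) touches a
                 * bad pointer, the crash appears to happen "in" the wind call */
                printf ("child lookup on %s\n", path);
                f->ret (f, 0);               /* unwind: invoke stored callback */
                return 0;
        }

        /* the essence of a wind: remember the callback, call the child's fop */
        static void
        wind (struct frame *f, void (*cbk) (struct frame *, int),
              struct xlator *xl, const char *path)
        {
                f->ret = cbk;
                xl->fops->lookup (f, path);
        }

        int
        main (void)
        {
                struct fops   fops = { .lookup = child_lookup };
                struct xlator xl   = { .fops = &fops };
                struct frame  f;

                wind (&f, lookup_cbk, &xl, "/");
                return 0;
        }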

Another interesting note: in a more complex setup (nufa over two replicate sets of two bricks each, with the bricks imported over the network, i.e. 4 transport/tcp imports) the error occurs in almost the same place, but not quite. Here is where it goes awry in that case:

2998 
2999         pthread_mutex_lock (&priv->first_call_mutex);
3000         {
3001                 while (priv->first_call) {
3002                         pthread_cond_wait (&priv->first_call_cond,
3003                                            &priv->first_call_mutex);
3004                 }                          
3005         }       
3006         pthread_mutex_unlock (&priv->first_call_mutex);
3007         

The program will consistently segfault during the call to pthread_cond_wait on line 3002, but it passes line 2995 without problems.
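
For what it's worth, the code around line 3002 follows the standard pthread wait pattern (lock the mutex, wait in a loop on the predicate, unlock), so the call itself should be sound as long as priv and the mutex/condvar it points to are valid and initialized. Here is a minimal self-contained version of that pattern, with made-up names standing in for the fuse private struct, that runs cleanly on its own (build with -pthread):

        #include <pthread.h>
        #include <stdio.h>

        /* Made-up stand-in for the fuse private struct; illustrative only. */
        struct priv {
                pthread_mutex_t first_call_mutex;
                pthread_cond_t  first_call_cond;
                int             first_call;
        };

        static void *
        first_lookup_done (void *arg)
        {
                struct priv *priv = arg;

                pthread_mutex_lock (&priv->first_call_mutex);
                priv->first_call = 0;
                pthread_cond_broadcast (&priv->first_call_cond);
                pthread_mutex_unlock (&priv->first_call_mutex);
                return NULL;
        }

        int
        main (void)
        {
                struct priv priv = {
                        .first_call_mutex = PTHREAD_MUTEX_INITIALIZER,
                        .first_call_cond  = PTHREAD_COND_INITIALIZER,
                        .first_call       = 1,
                };
                pthread_t t;

                pthread_create (&t, NULL, first_lookup_done, &priv);

                /* same shape as lines 2999-3006 of fuse-bridge.c */
                pthread_mutex_lock (&priv.first_call_mutex);
                while (priv.first_call)
                        pthread_cond_wait (&priv.first_call_cond,
                                           &priv.first_call_mutex);
                pthread_mutex_unlock (&priv.first_call_mutex);

                pthread_join (t, NULL);
                printf ("first call completed\n");
                return 0;
        }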

Any suggestions / pointers for debugging this further are greatly appreciated; I am pretty much clueless here.

Greetings,

Hraban Luyat