Bug 762227 (GLUSTER-495) - cluster/replicate segfaults on armv5tel.
Status: CLOSED NOTABUG
Alias: GLUSTER-495
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: All
OS: Linux
medium
high
Target Milestone: ---
Assignee: Vikas Gorur
Reported: 2009-12-21 06:23 UTC by Hraban Luyat
Modified: 2010-02-16 07:28 UTC (History)

Doc Type: Bug Fix
Regression: RTNR


Attachments
Log of a GDB session where the segfault is observed. (5.61 KB, text/plain)
2009-12-21 03:23 UTC, Hraban Luyat

Description Hraban Luyat 2009-12-21 03:25:37 UTC
I forgot to mention that the segfault only occurs when accessing the volume. Starting glusterfs is no problem and if you leave it alone nothing goes wrong. The breakpoint is only reached after a $ ls /mnt/test/.

Regards

Comment 1 Hraban Luyat 2009-12-21 03:55:11 UTC
*ouf*, mea culpa, this was totally my fault. I thought the return value of dict_get_ptr was never checked, so I did not check it either, but I overlooked the if (ret != 0) checks that follow it pretty much everywhere. That mistake is in my second patch for bug #762225. All my apologies. This bug is invalid, and so is the other patch; I will commit an updated patch asap!
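
For anyone landing here with a similar crash, the mistake described above can be sketched in a few lines of C. This is a minimal illustration only: toy_entry, toy_get_ptr and read_int_or_default are hypothetical stand-ins invented for this example, not the GlusterFS dict API. The only thing assumed to match is the convention of returning 0 on success and non-zero on failure while handing the value back through an out-parameter.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a dictionary entry; not the GlusterFS dict_t. */
struct toy_entry {
        const char *key;
        void       *ptr;
};

/*
 * Hypothetical "get pointer" helper: returns 0 and stores the value in
 * *out when the key exists, returns -1 and leaves *out untouched when
 * it does not.
 */
static int
toy_get_ptr (const struct toy_entry *entries, size_t count,
             const char *key, void **out)
{
        size_t i;

        assert (out != NULL);

        for (i = 0; i < count; i++) {
                if (strcmp (entries[i].key, key) == 0) {
                        *out = entries[i].ptr;
                        return 0;
                }
        }
        return -1;
}

/*
 * Reads an int stored under `key`. The lookup's return value is checked
 * before the pointer is dereferenced, so a missing key yields `fallback`
 * instead of a dereference of an unset pointer.
 */
static int
read_int_or_default (const struct toy_entry *entries, size_t count,
                     const char *key, int fallback)
{
        void *value = NULL;
        int   ret   = toy_get_ptr (entries, count, key, &value);

        if (ret != 0 || value == NULL)
                return fallback;

        return *(int *) value;
}
```

Dropping the ret check in read_int_or_default and dereferencing value unconditionally is the kind of change that only blows up once a key is actually missing, which would fit a crash that appears on first access to the volume rather than at startup.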

also, I mixed up step and next in gdb... that gave it away, eventually.

blegh.

Comment 2 Hraban Luyat 2009-12-21 06:23:10 UTC
Hello all,

Yet another error on the armv5tel arch, ye good ole Segfault this time. It has something to do with afr, but it all looks pretty confusing to me. A simple configuration (two local dirs replicated) already yields the error, consistently at the same place in the code. I ran it in GDB and printed some local variables that looked interesting as the code was approaching certain doom; hopefully the attached log is of some use to somebody.

The error always occurs during the STACK_WIND call on line 2995 of fuse-bridge.c:

2989 
2990         dict = dict_new ();
2991         frame = create_frame (this, this->ctx->pool);
2992         frame->root->type = GF_OP_TYPE_FOP_REQUEST;
2993         xl = this->children->xlator;
2994 
2995         STACK_WIND (frame, fuse_first_lookup_cbk, xl, xl->fops->lookup,
2996                     &loc, dict);
2997         dict_unref (dict);
2998 

Another interesting note: in a more complex setup (nufa over two replicated volumes, each made of two bricks imported over the net, i.e. 4 transport/tcp imports) the error occurs almost in the same place, but not quite. Here is where it goes awry in that case:

2998 
2999         pthread_mutex_lock (&priv->first_call_mutex);
3000         {
3001                 while (priv->first_call) {
3002                         pthread_cond_wait (&priv->first_call_cond,
3003                                            &priv->first_call_mutex);
3004                 }                          
3005         }       
3006         pthread_mutex_unlock (&priv->first_call_mutex);
3007         

The program will consistently segfault during the call to pthread_cond_wait on line 3002, but it passes line 2995 without problems.
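
For context on what that wait loop is doing, here is a minimal standalone sketch of the same mutex / condition-variable handshake. The struct and the toy_* functions are hypothetical names made up for this example, and the assumption that some other thread (presumably the lookup callback) clears first_call and signals the condition is mine, not something confirmed in this report.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for the private state used by the wait loop. */
struct toy_priv {
        pthread_mutex_t first_call_mutex;
        pthread_cond_t  first_call_cond;
        int             first_call;     /* 1 until the first lookup is done */
};

/* Signalling side: clear the flag under the mutex and wake all waiters. */
static void *
toy_first_lookup_done (void *arg)
{
        struct toy_priv *priv = arg;

        pthread_mutex_lock (&priv->first_call_mutex);
        priv->first_call = 0;
        pthread_cond_broadcast (&priv->first_call_cond);
        pthread_mutex_unlock (&priv->first_call_mutex);
        return NULL;
}

/* Waiting side: block until the flag is cleared, then return its value. */
static int
toy_wait_for_first_lookup (struct toy_priv *priv)
{
        int flag;

        pthread_mutex_lock (&priv->first_call_mutex);
        while (priv->first_call)
                pthread_cond_wait (&priv->first_call_cond,
                                   &priv->first_call_mutex);
        flag = priv->first_call;
        pthread_mutex_unlock (&priv->first_call_mutex);
        return flag;
}

/* Runs one handshake; returns 0 on success, -1 if the thread cannot start. */
static int
toy_run_handshake (void)
{
        struct toy_priv priv;
        pthread_t       signaller;
        int             flag;

        pthread_mutex_init (&priv.first_call_mutex, NULL);
        pthread_cond_init (&priv.first_call_cond, NULL);
        priv.first_call = 1;

        if (pthread_create (&signaller, NULL, toy_first_lookup_done, &priv) != 0)
                return -1;

        flag = toy_wait_for_first_lookup (&priv);
        pthread_join (signaller, NULL);

        pthread_cond_destroy (&priv.first_call_cond);
        pthread_mutex_destroy (&priv.first_call_mutex);
        return flag;
}
```

The point of the sketch is that pthread_cond_wait releases the mutex while it sleeps, so other threads keep running during that call. A fault reported "during" the wait may therefore originate in whichever thread is doing the real work at that moment, and it can be worth checking the backtraces of all threads rather than only the waiting one.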

Any suggestions / pointers for debugging this further are greatly appreciated; I am pretty much clueless here.

Greetings,

Hraban Luyat

