Bug 961682 - Rebalance Crash on Distribute Replicate Volume
Summary: Rebalance Crash on Distribute Replicate Volume
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: glusterfs
Version: 2.1
Hardware: Unspecified
OS: Unspecified
Severity: high
Priority: high
Target Milestone: ---
Assignee: krishnan parthasarathi
QA Contact: senaik
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-05-10 09:24 UTC by senaik
Modified: 2015-09-01 12:23 UTC

Fixed In Version: glusterfs-3.4.0.7rhs-1.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-23 22:35:30 UTC
Embargoed:



Description senaik 2013-05-10 09:24:41 UTC
Description of problem:
---------------------------
While running rebalance on a Distribute-Replicate volume, the rebalance process crashed.

Version-Release number of selected component (if applicable):
------------------------------------------------------------------ 
3.4.0.5rhs-1.el6rhs.x86_64

How reproducible:
-------------------- 

Steps to Reproduce:
------------------- 
- An earlier rebalance attempt had failed with:

Request received from non-privileged port. Failing request
[2013-05-10 07:34:23.057388] E [rpcsvc.c:519:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request
[2013-05-10 07:34:23.073047] E [rpcsvc.c:519:rpcsvc_handle_rpc_call] 0-glusterd: Request received from non-privileged port. Failing request

- To work around this, enabled requests from insecure ports in /etc/glusterfs/glusterd.vol:
  option rpc-auth-allow-insecure on

- Restarted glusterd 

1. Created a 2x2 distributed-replicate volume and started it
2. Mounted the volume and created some files
3. Added bricks and started rebalance

4. Checked rebalance status:
gluster v rebalance distribute-replicate status

Node         Rebalanced-files  size    scanned  failures  status  run time in secs
-----------  ----------------  ------  -------  --------  ------  ----------------
localhost    0                 0Bytes  0        0         failed  0.00
localhost    0                 0Bytes  0        0         failed  0.00
localhost    0                 0Bytes  0        0         failed  0.00
10.70.34.86  0                 0Bytes  0        0         failed  0.00

------------------------------------------------------------------------- 
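The reproduction steps above can be sketched as a CLI sequence. This is a hypothetical outline, not the reporter's exact commands: hostnames, brick paths, and the mount point are placeholders, and it assumes a running glusterd on each node.

```shell
# Create and start a 2x2 distributed-replicate volume
# (host1..host3 and the brick paths are placeholders)
gluster volume create distribute-replicate replica 2 \
    host1:/rhs/brick1/h1 host2:/rhs/brick1/h2 \
    host3:/rhs/brick1/h3 host1:/rhs/brick1/h4
gluster volume start distribute-replicate

# Mount the volume and create some files
mount -t glusterfs host1:/distribute-replicate /mnt/gv
for i in $(seq 1 100); do echo data > "/mnt/gv/file$i"; done

# Expand the volume by one replica pair, then rebalance
gluster volume add-brick distribute-replicate replica 2 \
    host3:/rhs/brick1/h5 host1:/rhs/brick1/h6
gluster volume rebalance distribute-replicate start
gluster volume rebalance distribute-replicate status
```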

patchset: git://git.gluster.com/glusterfs.git
signal received: 11
time of crash: 2013-05-10 07:44:01
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.4.0.5rhs
/lib64/libc.so.6[0x38ce432920]
/lib64/libc.so.6[0x38ce4b2483]
/lib64/libc.so.6(fnmatch+0x73)[0x38ce4b68e3]
/usr/lib64/libglusterfs.so.0(+0x52f71)[0x7fddf0a8bf71]
/usr/lib64/libglusterfs.so.0(+0x560d3)[0x7fddf0a8f0d3]
/usr/lib64/libglusterfs.so.0(dict_foreach+0x45)[0x7fddf0a4c565]
/usr/lib64/libglusterfs.so.0(xlator_options_validate_list+0x2f)[0x7fddf0a8ed5f]
/usr/lib64/libglusterfs.so.0(xlator_options_validate+0x39)[0x7fddf0a8edc9]
/usr/lib64/libglusterfs.so.0(glusterfs_graph_validate_options+0x2f)[0x7fddf0a7bf6f]
/usr/lib64/libglusterfs.so.0(glusterfs_graph_activate+0x1e)[0x7fddf0a7bffe]
/usr/sbin/glusterfs(glusterfs_process_volfp+0xeb)[0x404ffb]
/usr/sbin/glusterfs(mgmt_getspec_cbk+0x2eb)[0x40bbdb]
/usr/lib64/libgfrpc.so.0(rpc_clnt_handle_reply+0xa5)[0x7fddf08323d5]
/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x127)[0x7fddf0832fb7]




(gdb) bt
#0  0x00000038ce4b2483 in internal_fnmatch () from /lib64/libc.so.6
#1  0x00000038ce4b68e3 in fnmatch@@GLIBC_2.2.5 () from /lib64/libc.so.6
#2  0x00007fddf0a8bf71 in xlator_volume_option_get_list (vol_list=<value optimized out>, key=0x1953d00 "data-self-heal") at options.c:786
#3  0x00007fddf0a8f0d3 in xl_opt_validate (dict=0x7fddef253a04, key=0x1953d00 "data-self-heal", value=0x7fddef071abc, data=0x7fff2242b0a0)
    at options.c:832
#4  0x00007fddf0a4c565 in dict_foreach (dict=0x7fddef253a04, fn=0x7fddf0a8f090 <xl_opt_validate>, data=0x7fff2242b0a0) at dict.c:1109
#5  0x00007fddf0a8ed5f in xlator_options_validate_list (xl=<value optimized out>, options=<value optimized out>, vol_opt=<value optimized out>, 
    op_errstr=0x7fff2242b118) at options.c:871
#6  0x00007fddf0a8edc9 in xlator_options_validate (xl=0x1965700, options=0x7fddef253a04, op_errstr=0x7fff2242b118) at options.c:899
#7  0x00007fddf0a7bf6f in glusterfs_graph_validate_options (graph=<value optimized out>) at graph.c:267
#8  0x00007fddf0a7bffe in glusterfs_graph_activate (graph=0x1953ae0, ctx=0x1917010) at graph.c:470
#9  0x0000000000404ffb in glusterfs_process_volfp (ctx=0x1917010, fp=0x19538a0) at glusterfsd.c:1802
#10 0x000000000040bbdb in mgmt_getspec_cbk (req=<value optimized out>, iov=<value optimized out>, count=<value optimized out>, 
    myframe=0x7fddef68c7a4) at glusterfsd-mgmt.c:1583
#11 0x00007fddf08323d5 in rpc_clnt_handle_reply (clnt=0x1949c80, pollin=0x1952b10) at rpc-clnt.c:771
#12 0x00007fddf0832fb7 in rpc_clnt_notify (trans=<value optimized out>, mydata=0x1949cb0, event=<value optimized out>, data=<value optimized out>)
    at rpc-clnt.c:890
#13 0x00007fddf082e8e8 in rpc_transport_notify (this=<value optimized out>, event=<value optimized out>, data=<value optimized out>)
    at rpc-transport.c:497
#14 0x00007fdded0653a4 in socket_event_poll_in (this=0x194e830) at socket.c:2119
#15 0x00007fdded0654fd in socket_event_handler (fd=<value optimized out>, idx=<value optimized out>, data=0x194e830, poll_in=1, poll_out=0, 
    poll_err=0) at socket.c:2231
#16 0x00007fddf0a939b7 in event_dispatch_epoll_handler (event_pool=0x1932eb0) at event-epoll.c:384
#17 event_dispatch_epoll (event_pool=0x1932eb0) at event-epoll.c:445
#18 0x0000000000406776 in main (argc=31, argv=0x7fff2242c8f8) at glusterfsd.c:1943

  
Actual results:


Expected results:


Additional info:
gluster v info distribute-replicate
 
Volume Name: distribute-replicate
Type: Distributed-Replicate
Volume ID: 8cdd9b2b-7e92-4311-8f44-1615e21cf010
Status: Started
Number of Bricks: 3 x 2 = 6
Transport-type: tcp
Bricks:
Brick1: 10.70.34.85:/rhs/brick1/h1
Brick2: 10.70.34.86:/rhs/brick1/h2
Brick3: 10.70.34.105:/rhs/brick1/h3
Brick4: 10.70.34.85:/rhs/brick1/h4
Brick5: 10.70.34.105:/rhs/brick1/h5
Brick6: 10.70.34.85:/rhs/brick1/h6

Comment 3 shishir gowda 2013-05-11 05:47:30 UTC
Looks like corruption in the volume option list. In xlator_volume_option_get_list:

   cmp_key = opt[index].key[i];

but opt[4].key[1] is corrupted.

(gdb) p stub->vol_opt->given_opt[4]
$19 = {key = {0x7f29eb5d329b "system.posix_acl_access", 0x100000000 <Address 0x100000000 out of bounds>, 0x7f29eb5d32b3 "system.posix_acl_default", 
    0x100000000 <Address 0x100000000 out of bounds>}, type = 3948753612, min = 2.1219957909652723e-314, max = 6.907927992680449e-310, value = {
    0x100000000 <Address 0x100000000 out of bounds>, 0x7f29eb5d32f1 "gfid-req", 0x100000000 <Address 0x100000000 out of bounds>, 0x0, 0x0, 0x0, 0x0, 
    0xf5b89c3d00000067 <Address 0xf5b89c3d00000067 out of bounds>, 0x7472747368732e00 <Address 0x7472747368732e00 out of bounds>, 
    0x65746f6e2e006261 <Address 0x65746f6e2e006261 out of bounds>, 0x6975622e756e672e <Address 0x6975622e756e672e out of bounds>, 
    0x672e0064692d646c <Address 0x672e0064692d646c out of bounds>, 0x687361682e756e <Address 0x687361682e756e out of bounds>, 
    0x6d79736e79642e <Address 0x6d79736e79642e out of bounds>, 0x7274736e79642e <Address 0x7274736e79642e out of bounds>, 
    0x7265762e756e672e <Address 0x7265762e756e672e out of bounds>, 0x6e672e006e6f6973 <Address 0x6e672e006e6f6973 out of bounds>, 
    0x6f69737265762e75 <Address 0x6f69737265762e75 out of bounds>, 0x6c65722e00725f6e <Address 0x6c65722e00725f6e out of bounds>, 
    0x722e006e79642e61 <Address 0x722e006e79642e61 out of bounds>, 0x746c702e616c65 <Address 0x746c702e616c65 out of bounds>, 
    0x742e0074696e692e <Address 0x742e0074696e692e out of bounds>, 0x6e69662e00747865 <Address 0x6e69662e00747865 out of bounds>, 
    0x7461646f722e0069 <Address 0x7461646f722e0069 out of bounds>, 0x72665f68652e0061 <Address 0x72665f68652e0061 out of bounds>, 
    0x7264685f656d61 <Address 0x7264685f656d61 out of bounds>, 0x6d6172665f68652e <Address 0x6d6172665f68652e out of bounds>, 
    0x73726f74632e0065 <Address 0x73726f74632e0065 out of bounds>, 0x73726f74642e00 <Address 0x73726f74642e00 out of bounds>, 
    0x61642e0072636a2e <Address 0x61642e0072636a2e out of bounds>, 0x722e6c65722e6174 <Address 0x722e6c65722e6174 out of bounds>, 
    0x6d616e79642e006f <Address 0x6d616e79642e006f out of bounds>, 0x746f672e006369 <Address 0x746f672e006369 out of bounds>, 
    0x746c702e746f672e <Address 0x746c702e746f672e out of bounds>, 0x2e00617461642e00 <Address 0x2e00617461642e00 out of bounds>, 
    0x756e672e00737362 <Address 0x756e672e00737362 out of bounds>, 0x696c67756265645f <Address 0x696c67756265645f out of bounds>, 
    0x6b6e <Address 0x6b6e out of bounds>, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x70000000b <Address 0x70000000b out of bounds>, 
    0x2 <Address 0x2 out of bounds>, 0x190 <Address 0x190 out of bounds>, 0x190 <Address 0x190 out of bounds>, 0x24 <Address 0x24 out of bounds>, 
    0x0, 0x4 <Address 0x4 out of bounds>, 0x0, 0x6ffffff60000001e <Address 0x6ffffff60000001e out of bounds>, 0x2 <Address 0x2 out of bounds>, 
    0x1b8 <Address 0x1b8 out of bounds>, 0x1b8 <Address 0x1b8 out of bounds>, 0x2a0 <Address 0x2a0 out of bounds>, 0x3 <Address 0x3 out of bounds>, 
    0x8 <Address 0x8 out of bounds>, 0x0, 0xb00000028 <Address 0xb00000028 out of bounds>, 0x2 <Address 0x2 out of bounds>}, 
  default_value = 0x458 <Address 0x458 out of bounds>, description = 0x458 <Address 0x458 out of bounds>, validate = 2928}

(gdb) info reg
rax            0x0      0
rbx            0x7fff4bcad330   140734464971568
rcx            0x0      0
rdx            0x0      0
rsi            0x7fff4bcac970   140734464969072
rdi            0x100000000      4294967296
rbp            0x7f29f38bfa04   0x7f29f38bfa04
rsp            0x7fff4bcad2a0   0x7fff4bcad2a0
r8             0x2      2
r9             0x0      0
r10            0x6ea81e 7251998
r11            0x7f29f50fad30   139818181831984
r12            0x6ea810 7251984
r13            0x7f29f36ddabc   139818154449596
r14            0x6fc230 7324208
r15            0x7f29f38bfa04   139818156423684
rip            0x7f29f50fb0d3   0x7f29f50fb0d3 <xl_opt_validate+67>
eflags         0x246    [ PF ZF IF ]
cs             0x33     51
ss             0x2b     43
ds             0x0      0
es             0x0      0
fs             0x0      0
gs             0x0      0

Other options:
(gdb) p stub->vol_opt->given_opt[1]
$22 = {key = {0x7f29eb5d3082 "cache-posix-acl", 0x0, 0x0, 0x0}, type = GF_OPTION_TYPE_BOOL, min = 0, max = 0, value = {0x0 <repeats 64 times>}, 
  default_value = 0x7f29eb5d328e "false", description = 0x0, validate = GF_OPT_VALIDATE_BOTH}
(gdb) p stub->vol_opt->given_opt[2]
$23 = {key = {0x7f29eb5d3059 "md-cache-timeout", 0x0, 0x0, 0x0}, type = GF_OPTION_TYPE_INT, min = 0, max = 60, value = {0x0 <repeats 64 times>}, 
  default_value = 0x7f29eb5d3294 "1", description = 0x7f29eb5d3648 "Time period after which cache has to be refreshed", 
  validate = GF_OPT_VALIDATE_BOTH}
(gdb) p stub->vol_opt->given_opt[3]
$24 = {key = {0x7f29eb5d30a4 "force-readdirp", 0x0, 0x0, 0x0}, type = GF_OPTION_TYPE_BOOL, min = 0, max = 0, value = {0x0 <repeats 64 times>}, 
  default_value = 0x7f29eb5d3296 "true", 
  description = 0x7f29eb5d3680 "Convert all readdir requests to readdirplus to collect stat info on each entry.", validate = GF_OPT_VALIDATE_BOTH}

Comment 4 shishir gowda 2013-05-11 05:49:30 UTC
(gdb) p stub->vol_opt->given_opt[0]
$27 = {key = {0x7f29eb5d306a "cache-selinux", 0x0, 0x0, 0x0}, type = GF_OPTION_TYPE_BOOL, min = 0, max = 0, value = {0x0 <repeats 64 times>}, 
  default_value = 0x7f29eb5d328e "false", description = 0x0, validate = GF_OPT_VALIDATE_BOTH}

Comment 6 senaik 2013-05-14 10:56:53 UTC
Version : 
======= 
3.4.0.7rhs-1.el6rhs.x86_64

Could not reproduce the issue. Marking it as Verified.

Comment 7 Scott Haines 2013-09-23 22:35:30 UTC
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. 

For information on the advisory, and where to find the updated files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHBA-2013-1262.html

