Bug 762530 (GLUSTER-798) - Crash in dht_first_up_subvol
Summary: Crash in dht_first_up_subvol
Keywords:
Status: CLOSED WONTFIX
Alias: GLUSTER-798
Product: GlusterFS
Classification: Community
Component: distribute
Version: mainline
Hardware: All
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: shishir gowda
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-03-31 23:01 UTC by Vikas Gorur
Modified: 2015-12-01 16:45 UTC (History)
3 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description Vikas Gorur 2010-03-31 20:03:18 UTC
Log file for this session:

================================================================================
Version      : glusterfs 3.0.3 built on Mar 19 2010 11:55:08
git: v3.0.2-41-g029062c
Starting Time: 2010-03-31 17:17:10
Command line : /sbin/glusterfs --log-level=NORMAL --volfile=/etc/glusterfs/glusterfs.vol /mnt/backups/ 
PID          : 3263
System name  : Linux
Nodename     : backup0
Kernel Release : 2.6.30-2-686
Hardware Identifier: i686

Given volfile:
+------------------------------------------------------------------------------+
  1: 
  2: #####################################
  3: ###  GlusterFS Client Volume File  ##
  4: #####################################
  5: 
  6: #### CONFIG FILE RULES:
  7: ### "#" is comment character.
  8: ### - Config file is case sensitive
  9: ### - Options within a volume block can be in any order.
 10: ### - Spaces or tabs are used as delimitter within a line. 
 11: ### - Each option should end within a line.
 12: ### - Missing or commented fields will assume default values.
 13: ### - Blank/commented lines are allowed.
 14: ### - Sub-volumes should already be defined above before referring.
 15: 
 16: # drives
 17: volume backup1-sda1
 18:   type protocol/client
 19:   option transport-type tcp
 20:   option remote-host 209.163.224.123
 21:   option transport.socket.nodelay on
 22:   option remote-subvolume sda1
 23: end-volume
 24: 
 25: volume backup2-sda1
 26:   type protocol/client
 27:   option transport-type tcp
 28:   option remote-host 209.163.224.124
 29:   option transport.socket.nodelay on
 30:   option remote-subvolume sda1
 31: end-volume
 32: 
 33: volume backup3-sda1
 34:   type protocol/client
 35:   option transport-type tcp
 36:   option remote-host 209.163.224.125
 37:   option transport.socket.nodelay on
 38:   option remote-subvolume sda1
 39: end-volume
 40: 
 41: volume backup4-sda1
 42:   type protocol/client
 43:   option transport-type tcp
 44:   option remote-host 209.163.224.126
 45:   option transport.socket.nodelay on
 46:   option remote-subvolume sda1
 47: end-volume
 48: 
 49: # replication
 50: volume b-1-2-sda1
 51:   type cluster/replicate
 52:   subvolumes backup1-sda1 backup2-sda1
 53: end-volume
 54: 
 55: volume b-3-4-sda1
 56:   type cluster/replicate
 57:   subvolumes backup3-sda1 backup4-sda1
 58: end-volume
 59: 
 60: # striping
 61: volume distribute
 62:   type cluster/distribute
 63:   subvolumes b-1-2-sda1 b-3-4-sda1
 64: end-volume
 65: 
 66: #### Add readahead feature
 67: volume readahead
 68:   type performance/read-ahead
 69: #  option page-size 1MB     # unit in bytes
 70:   option page-count 4      # cache per file  = (page-count x page-size)
 71:   subvolumes distribute
 72: end-volume
 73: 
 74: #### Add IO-Cache feature
 75: volume iocache
 76:     type performance/io-cache
 77:     option cache-size `grep 'MemTotal' /proc/meminfo  | awk '{print $2 * 0.2 / 1024}' | cut -f1 -d.`MB
 78:     option cache-timeout 1
 79:     subvolumes readahead
 80: end-volume
 81: 
 82: #### Add quick-read feature
 83: volume quickread
 84:     type performance/quick-read
 85:     option cache-timeout 1
 86:     option max-file-size 64kB
 87:     subvolumes iocache
 88: end-volume
 89: 
 90: ### JG: we seem to need this, without this, rsync crashes
 91: #volume statprefetch
 92: #    type performance/stat-prefetch
 93: #    subvolumes quickread
 94: #end-volume
 95: 
 96: #### Add writeback feature
 97: #volume writeback
 98: #  type performance/write-behind
 99: #  option aggregate-size 1MB
100: #  option window-size 2MB
101: #  option flush-behind off
102: #  subvolumes quickread
103: #end-volume
104: 

+------------------------------------------------------------------------------+
[2010-03-31 17:17:10] N [glusterfsd.c:1396:main] glusterfs: Successfully started
[2010-03-31 17:17:10] N [client-protocol.c:6246:client_setvolume_cbk] backup2-sda1: Connected to 209.163.224.124:6996, attached to remote volume 'sda1'.
[2010-03-31 17:17:10] N [afr.c:2627:notify] b-1-2-sda1: Subvolume 'backup2-sda1' came back up; going online.
[2010-03-31 17:17:10] N [fuse-bridge.c:2942:fuse_init] glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.11
[2010-03-31 17:17:10] N [client-protocol.c:6246:client_setvolume_cbk] backup2-sda1: Connected to 209.163.224.124:6996, attached to remote volume 'sda1'.
[2010-03-31 17:17:10] N [afr.c:2627:notify] b-1-2-sda1: Subvolume 'backup2-sda1' came back up; going online.
[2010-03-31 17:17:10] N [client-protocol.c:6246:client_setvolume_cbk] backup1-sda1: Connected to 209.163.224.123:6996, attached to remote volume 'sda1'.
[2010-03-31 17:17:10] N [client-protocol.c:6246:client_setvolume_cbk] backup1-sda1: Connected to 209.163.224.123:6996, attached to remote volume 'sda1'.
[2010-03-31 17:17:10] N [client-protocol.c:6246:client_setvolume_cbk] backup3-sda1: Connected to 209.163.224.125:6996, attached to remote volume 'sda1'.
[2010-03-31 17:17:10] N [afr.c:2627:notify] b-3-4-sda1: Subvolume 'backup3-sda1' came back up; going online.
[2010-03-31 17:17:10] N [client-protocol.c:6246:client_setvolume_cbk] backup4-sda1: Connected to 209.163.224.126:6996, attached to remote volume 'sda1'.
[2010-03-31 17:17:10] N [client-protocol.c:6246:client_setvolume_cbk] backup3-sda1: Connected to 209.163.224.125:6996, attached to remote volume 'sda1'.
[2010-03-31 17:17:10] N [client-protocol.c:6246:client_setvolume_cbk] backup4-sda1: Connected to 209.163.224.126:6996, attached to remote volume 'sda1'.
[2010-03-31 17:55:02] E [client-protocol.c:415:client_ping_timer_expired] backup1-sda1: Server 209.163.224.123:6996 has not responded in the last 42 seconds, disconnecting.
[2010-03-31 17:55:02] E [client-protocol.c:415:client_ping_timer_expired] backup2-sda1: Server 209.163.224.124:6996 has not responded in the last 42 seconds, disconnecting.
[2010-03-31 17:55:02] E [client-protocol.c:415:client_ping_timer_expired] backup3-sda1: Server 209.163.224.125:6996 has not responded in the last 42 seconds, disconnecting.
[2010-03-31 17:55:02] E [client-protocol.c:415:client_ping_timer_expired] backup4-sda1: Server 209.163.224.126:6996 has not responded in the last 42 seconds, disconnecting.
[2010-03-31 17:55:02] E [saved-frames.c:165:saved_frames_unwind] backup1-sda1: forced unwinding frame type(1) op(LOOKUP)
[2010-03-31 17:55:02] E [saved-frames.c:165:saved_frames_unwind] backup1-sda1: forced unwinding frame type(2) op(PING)
[2010-03-31 17:55:02] N [client-protocol.c:6994:notify] backup1-sda1: disconnected
[2010-03-31 17:55:02] E [saved-frames.c:165:saved_frames_unwind] backup2-sda1: forced unwinding frame type(1) op(LOOKUP)
[2010-03-31 17:55:02] E [saved-frames.c:165:saved_frames_unwind] backup2-sda1: forced unwinding frame type(2) op(PING)
[2010-03-31 17:55:02] N [client-protocol.c:6994:notify] backup2-sda1: disconnected
[2010-03-31 17:55:02] E [afr.c:2657:notify] b-1-2-sda1: All subvolumes are down. Going offline until atleast one of them comes back up.
[2010-03-31 17:55:02] E [saved-frames.c:165:saved_frames_unwind] backup3-sda1: forced unwinding frame type(1) op(LOOKUP)
[2010-03-31 17:55:02] E [saved-frames.c:165:saved_frames_unwind] backup3-sda1: forced unwinding frame type(2) op(PING)
[2010-03-31 17:55:02] N [client-protocol.c:6994:notify] backup3-sda1: disconnected
[2010-03-31 17:55:02] E [saved-frames.c:165:saved_frames_unwind] backup4-sda1: forced unwinding frame type(1) op(LOOKUP)
[2010-03-31 17:55:02] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 1022: LOOKUP() / => -1 (Transport endpoint is not connected)
[2010-03-31 17:55:02] E [saved-frames.c:165:saved_frames_unwind] backup4-sda1: forced unwinding frame type(2) op(PING)
[2010-03-31 17:55:02] N [client-protocol.c:6994:notify] backup4-sda1: disconnected
[2010-03-31 17:55:02] E [afr.c:2657:notify] b-3-4-sda1: All subvolumes are down. Going offline until atleast one of them comes back up.
[2010-03-31 17:55:02] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 1023: LOOKUP() / => -1 (Transport endpoint is not connected)
[2010-03-31 17:55:02] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 1024: LOOKUP() / => -1 (Transport endpoint is not connected)
[2010-03-31 17:55:02] W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 1025: LOOKUP() / => -1 (Transport endpoint is not connected)
[2010-03-31 17:55:05] E [socket.c:762:socket_connect_finish] backup1-sda1: connection to 209.163.224.123:6996 failed (No route to host)
[2010-03-31 17:55:05] E [socket.c:762:socket_connect_finish] backup1-sda1: connection to 209.163.224.123:6996 failed (No route to host)
[2010-03-31 17:55:05] E [socket.c:762:socket_connect_finish] backup2-sda1: connection to 209.163.224.124:6996 failed (No route to host)
[2010-03-31 17:55:05] E [socket.c:762:socket_connect_finish] backup2-sda1: connection to 209.163.224.124:6996 failed (No route to host)
[2010-03-31 17:55:05] E [socket.c:762:socket_connect_finish] backup3-sda1: connection to 209.163.224.125:6996 failed (No route to host)
[2010-03-31 17:55:05] E [socket.c:762:socket_connect_finish] backup3-sda1: connection to 209.163.224.125:6996 failed (No route to host)
[2010-03-31 17:55:05] E [socket.c:762:socket_connect_finish] backup4-sda1: connection to 209.163.224.126:6996 failed (No route to host)
[2010-03-31 17:55:08] W [fuse-bridge.c:2352:fuse_statfs_cbk] glusterfs-fuse: 1026: ERR => -1 (Transport endpoint is not connected)
[2010-03-31 17:55:08] E [socket.c:762:socket_connect_finish] backup4-sda1: connection to 209.163.224.126:6996 failed (No route to host)

Comment 1 Vikas Gorur 2010-03-31 21:10:35 UTC
Disregard the log file. The core was not produced in the time frame that the log file covers.

Comment 2 Vikas Gorur 2010-03-31 23:01:24 UTC
Backtrace:

#0  0xb74f07e0 in dht_first_up_subvol (this=0x9f7f288) at dht-helper.c:200
#1  0xb74f0855 in dht_subvol_get_hashed (this=0x9f7f288, loc=0xb6c9e314) at dht-helper.c:219
#2  0xb7503f6b in dht_lookup (frame=0xb6301318, this=0x9f7f288, loc=0xb6c9e314, 
    xattr_req=0x9f80f28) at dht-common.c:814
#3  0xb7f2b585 in default_lookup (frame=0xb63012e8, this=0x9f7f898, loc=0xb6c9e314, 
    xattr_req=0x9f80f28) at defaults.c:63
#4  0xb74dc345 in ioc_lookup (frame=0xb63012a0, this=0x9f7fe58, loc=0xb6c9e314, 
    xattr_req=0x9f80f28) at io-cache.c:226
#5  0xb74d75f7 in qr_lookup (frame=0xb6301220, this=0x9f80418, loc=0xb6c9e314, 
    xattr_req=0x9f80f28) at quick-read.c:315
#6  0xb74ce401 in sp_lookup (frame=0xb6300f58, this=0x9f809f8, loc=0xb6c9e314, 
    xattr_req=0x9f80f28) at stat-prefetch.c:1087
#7  0xb74add78 in fuse_first_lookup (this=0x9f771d0) at fuse-bridge.c:3027
#8  0xb74ae19b in fuse_thread_proc (data=0x9f771d0) at fuse-bridge.c:3123
#9  0xb7ef9595 in start_thread () from /lib/i686/cmov/libpthread.so.0
#10 0xb7e7a18e in clone () from /lib/i686/cmov/libc.so.6

#0  0xb74f07e0 in dht_first_up_subvol (this=0x9f7f288) at dht-helper.c:200
200                             if (conf->subvolume_status[i]) {

(gdb) p *conf
$1 = {subvolume_lock = 0, subvolume_cnt = 8, subvolumes = 0xeeeeeeee, 
  subvolume_status = 0xeeeeeeee <Address 0xeeeeeeee out of bounds>, 
  file_layouts = 0xeeeeeeee, dir_layouts = 0x0, default_dir_layout = 0x0, 
  search_unhashed = 2, gen = 10, du_stats = 0x9fb8ac8, min_free_disk = 10, 
  disk_unit = 112 'p', refresh_interval = 0, unhashed_sticky_bit = _gf_false, 
  last_stat_fetch = {tv_sec = 0, tv_usec = 0}, layout_lock = 1, private = 0x1c0}

Comment 3 Amar Tumballi 2010-04-20 08:06:41 UTC
This is a race at process termination: 'fini()' was called while a lookup() call was still in progress.

This will be fixed in 3.1, where we need a proper 'fini()' for dynamic volumes.

Comment 4 shishir gowda 2010-08-12 06:41:46 UTC
In 3.1, fini() will only be called on detached old graphs, so the race condition between lookup() and fini() no longer arises.

