The configuration in which this crash was observed is not officially supported, hence the reduction in priority. Nevertheless, we need to look into this to see why the crash occurred. Deniss, can you paste the whole backtrace from the core? Run gdb on the core:

# gdb -c <core-file> <glusterfs-binary path>
(gdb) thread apply all bt full

Also, please mention what steps led to the crash and whether it is reproducible.
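For completeness, here is a minimal sequence for capturing a usable core the next time this hits; the core path and PID below are illustrative, not taken from this report:

# ulimit -c unlimited                                      (allow the crashing process to write a core file)
# echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern   (optional: name core files predictably)
...after the crash...
# gdb -c /tmp/core.glusterfsd.12345 /usr/sbin/glusterfsd
(gdb) thread apply all bt full                             (full backtrace of every thread, with locals)
(gdb) info sharedlibrary                                   (check whether the xlator .so files resolved symbols)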
Unfortunately that happened on a production system, so I quickly moved back to version 2.0.9 (which works for now) and all core files were deleted. Furthermore, all libs and binaries were stripped at link time by default. Sorry.
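If it resurfaces, an unstripped build is what makes a core actionable. Two options, hedged because they are assumptions about this particular setup (the Gentoo guess is based only on the hardened kernel string; package name and flags may differ):

# FEATURES="nostrip splitdebug" emerge glusterfs           (Portage: keep and split out debug symbols)
# CFLAGS="-g -O0" ./configure && make && make install      (source build with debug info, no stripping)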
Can you give us a little more info as to how you ran into this crash? What was happening when the crash occurred? Was it reproducible? If yes, what steps led to it?
Well, the story is really sad: I ran two nodes with two gluster v2.0.8 instances, with the config described in this ticket, for a year. Up to 30 GB of pictures are stored. When the 3.x series came out I tried each release in a test environment with my config, and version 3.0.4 seemed to be stable, so I started to upgrade production. And then the bad things began. I stopped the first node and upgraded the other to the new version, then upgraded the first one. When the second came up, iowait on both rose to 100% and errors like

W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not connected)

started to appear randomly. Then one of the nodes died with a general protection fault:

general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
CPU 1
Pid: 5570, comm: glusterfsd Not tainted 2.6.28-hardened-r9.dale.20100518 #4
RIP: 0010:[<ffffffff802fece3>] [<ffffffff802fece3>] 0xffffffff802fece3
RSP: 0018:ffff88004c071b78 EFLAGS: 00010206
RAX: 0000000000000028 RBX: 5000000000000000 RCX: 0000000000000000
RDX: ffff88004c071c98 RSI: 5000000000000000 RDI: ffffffff809790e2
RBP: ffff88004c071c58 R08: 00000000fffffffe R09: 0000000000000017
R10: ffffffff809794cd R11: 00000000809790a0 R12: ffffffff809790e2
R13: ffffffff809794a0 R14: ffffffff809790a0 R15: ffffffff809790e2
FS: 00006e46e1102910(0000) GS:ffff88007f804880(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000a97fd2e20e8 CR3: 000000004c013000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process glusterfsd (pid: 5570, threadinfo ffff88004c070000, task ffff88007e9d5550)
Stack:
 000000000000006b ffff88004c071c58 0000000000000001 ffffffff802fe7d0
 ffff88007e9d5788 ffffe20000fb0940 0000000000000000 ffffffff8040aa05
 0000000000000000 ffff8800472790e0 0000000000000000 ffffffff809794cd
Call Trace:
 [<ffffffff802fe7d0>] ? 0xffffffff802fe7d0
 [<ffffffff8040aa05>] ? 0xffffffff8040aa05
 [<ffffffff80261d80>] ? 0xffffffff80261d80
 [<ffffffff806b3b1b>] ? 0xffffffff806b3b1b
 [<ffffffff802fec8f>] ? 0xffffffff802fec8f
 [<ffffffff802f2a50>] ? 0xffffffff802f2a50
 [<ffffffff80261d5e>] ? 0xffffffff80261d5e
 [<ffffffff8024ac20>] ? 0xffffffff8024ac20
 [<ffffffff8026224a>] ? 0xffffffff8026224a
 [<ffffffff803131cd>] ? 0xffffffff803131cd
 [<ffffffff803126fd>] ? 0xffffffff803126fd
 [<ffffffff802b256d>] ? 0xffffffff802b256d
 [<ffffffff802b261f>] ? 0xffffffff802b261f
 [<ffffffff802a1bb2>] ? 0xffffffff802a1bb2
 [<ffffffff802a068d>] ? 0xffffffff802a068d
 [<ffffffff802a2c88>] ? 0xffffffff802a2c88
 [<ffffffff80299e06>] ? 0xffffffff80299e06
 [<ffffffff802b27d3>] ? 0xffffffff802b27d3
 [<ffffffff8020289b>] ? 0xffffffff8020289b
Code: c3 48 c7 c6 a0 90 97 80 48 c7 c7 b6 30 77 80 31 c0 e8 fd 41 3b 00 eb e0 41 54 48 85 f6 49 89 fc 55 53 48 89 f3 0f 84 b5 02 00 00 <48> 8b 46 08 48 c1 e8 3c 3c 03 0f 86 bd 00 00 00 8b 43 0c 31 d2
RIP [<ffffffff802fece3>] 0xffffffff802fece3 RSP <ffff88004c071b78>
Kernel panic - not syncing: Fatal exception

So I started to switch off translators (io-threads looked like the culprit). After each reboot the time on the nodes diverged, and this probably caused additional problems, since files are stat-ed constantly. Nothing helped, and I decided to take one node down to buy time for additional research in the test environment. But when the load increased, even the one-node config became unstable, with errors like

W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 1416: LOOKUP() / => -1 (Transport endpoint is not connected)

and crashes.
So I quickly rolled back to 2.0.9 (same config), and it works.
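One variable worth eliminating on both nodes is the clock drift mentioned above, since the reporter suspects the diverging clocks aggravated the constant stat() traffic. A minimal sketch, assuming the standard ntp tools are installed (the server name and init script path are placeholders):

# ntpdate pool.ntp.org       (one-shot sync before starting glusterfsd)
# /etc/init.d/ntpd start     (or keep the daemon running for continuous correction)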
After migrating from 2.0.8 to 3.0.4, glusterfs with preexisting data (~30 GB) crashes even when the second node is down. In the log:

================================================================================
Version      : glusterfs 3.0.4 built on May 17 2010 16:17:25
git: v3.0.4
Starting Time: 2010-05-18 12:15:07
Command line : /usr/sbin/glusterfsd --pid-file=/var/run/glusterfsd.pid --log-file=/var/log/glusterfs/glusterfsd.log --volfile=/etc/glusterfs/glusterfsd.vol /uploads
PID          : 10075
System name  : Linux
Nodename     : test1
Kernel Release : 2.6.28-hardened-r9.install.64.20100125
Hardware Identifier: x86_64

Given volfile:
+------------------------------------------------------------------------------+
  1: # file: /etc/glusterfs/glusterfs-server.vol
  2: #
  3: ### ###
  4: # PHOTO #
  5: ### ###
  6: volume posix_photo
  7:   type storage/posix
  8:   option directory /mnt/glusterfs_photo
  9: end-volume
 10:
 11: volume brick_photo
 12: #volume locks_photo
 13:   type features/locks
 14: #  option mandatory-locks on
 15:   subvolumes posix_photo
 16: end-volume
 17:
 18: #volume brick_photo
 19: #  type performance/io-threads
 20: #  option thread-count 1
 21: #  subvolumes locks_photo
 22: #end-volume
 23:
 24: volume remote_photo
 25:   type protocol/client
 26:   option transport-type tcp
 27:   option remote-host 10.0.0.2 # Change this
 28:   option remote-subvolume brick_photo
 29: end-volume
 30:
 31: volume replicate_photo
 32:   type cluster/replicate
 33:   option read-subvolume brick_photo
 34:   subvolumes brick_photo remote_photo
 35: end-volume
 36:
 37: ### ###
 38: # CDB #
 39: ### ###
 40: volume posix_cdb
 41:   type storage/posix
 42:   option directory /mnt/glusterfs_cdb
 43: end-volume
 44:
 45: #volume locks_cdb
 46: volume brick_cdb
 47:   type features/locks
 48: #  option mandatory-locks on
 49:   subvolumes posix_cdb
 50: end-volume
 51:
 52: #volume brick_cdb
 53: #  type performance/io-threads
 54: #  option thread-count 4
 55: #  subvolumes locks_cdb
 56: #end-volume
 57:
 58: volume remote_cdb
 59:   type protocol/client
 60:   option transport-type tcp
 61:   option remote-host 10.0.0.2 # Change this
 62:   option remote-subvolume brick_cdb
 63: end-volume
 64:
 65: volume replicate_cdb
 66:   type cluster/replicate
 67:   option read-subvolume brick_cdb
 68:   subvolumes brick_cdb remote_cdb
 69: end-volume
 70:
 71:
 72: #### Exporting volumes #####
 73: volume server
 74:   type protocol/server
 75: #  option transport.socket.bind-address
 76:   option transport-type tcp
 77:   option auth.addr.brick_cdb.allow 10.0.0.1,10.0.0.2
 78:   option auth.addr.replicate_cdb.allow 10.0.0.1,10.0.0.2
 79:   option auth.addr.brick_photo.allow 10.0.0.1,10.0.0.2
 80:   option auth.addr.replicate_photo.allow 10.0.0.1,10.0.0.2
 81:   subvolumes replicate_cdb brick_cdb replicate_photo brick_photo
 82: end-volume
 83:
 84: ##### PHOTO CLIENT #####
 85: #volume iothreads
 86: #  type performance/io-threads
 87: #  option thread-count 4
 88: #  subvolumes replicate_photo
 89: #end-volume
 90:
 91: #volume writebehind
 92: #  type performance/write-behind
 93: #  option cache-size 4MB
 94: #  subvolumes iothreads
 95: #end-volume
 96:
 97: #volume readahead
 98: #  type performance/read-ahead
 99: #  option page-count 4
100: #  subvolumes writebehind
101: #end-volume
102:
103: volume iocache
104:   type performance/io-cache
105:   option cache-size 256MB
106: #  subvolumes readahead
107:   subvolumes replicate_photo
108: end-volume
109:
110: # BUG http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=723
111: #volume quickread
112: #  type performance/quick-read
113: #  option cache-timeout 1
114: #  option max-file-size 64kB
115: #  subvolumes iocache
116: #end-volume
+------------------------------------------------------------------------------+
[2010-05-18 12:15:07] N [afr.c:2632:notify] replicate_photo: Subvolume 'brick_photo' came back up; going online.
[2010-05-18 12:15:07] N [fuse-bridge.c:2950:fuse_init] glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.10
[2010-05-18 12:15:07] N [afr.c:2632:notify] replicate_photo: Subvolume 'brick_photo' came back up; going online.
[2010-05-18 12:15:07] N [afr.c:2632:notify] replicate_cdb: Subvolume 'brick_cdb' came back up; going online.
[2010-05-18 12:15:07] N [afr.c:2632:notify] replicate_cdb: Subvolume 'brick_cdb' came back up; going online.
[2010-05-18 12:15:07] N [afr.c:2632:notify] replicate_photo: Subvolume 'brick_photo' came back up; going online.
[2010-05-18 12:15:07] N [afr.c:2632:notify] replicate_photo: Subvolume 'brick_photo' came back up; going online.
[2010-05-18 12:15:07] N [glusterfsd.c:1408:main] glusterfs: Successfully started
[2010-05-18 12:15:07] E [socket.c:762:socket_connect_finish] remote_photo: connection to failed (Connection refused)
[2010-05-18 12:15:07] E [socket.c:762:socket_connect_finish] remote_photo: connection to failed (Connection refused)
[2010-05-18 12:15:07] E [socket.c:762:socket_connect_finish] remote_cdb: connection to failed (Connection refused)
[2010-05-18 12:15:07] E [socket.c:762:socket_connect_finish] remote_cdb: connection to failed (Connection refused)
[2010-05-18 12:15:09] N [server-protocol.c:5852:mop_setvolume] server: accepted client from 10.0.0.1:1021
[2010-05-18 12:15:09] N [server-protocol.c:5852:mop_setvolume] server: accepted client from 10.0.0.1:1020
pending frames:
frame : type(1) op(LK)

patchset: v3.0.4
signal received: 11
time of crash: 2010-05-18 12:22:34
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.0.4
/lib/libc.so.6[0x703e069b7f60]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_lk_cbk+0xf8)[0x703e056e09b8]
/usr/lib64/glusterfs/3.0.4/xlator/cluster/replicate.so(afr_lk_cbk+0xd3)[0x703e058ff563]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/client.so(client_lk+0x276)[0x703e05b4d2a6]
/usr/lib64/glusterfs/3.0.4/xlator/cluster/replicate.so(afr_lk_cbk+0x21c)[0x703e058ff6ac]
/usr/lib64/glusterfs/3.0.4/xlator/features/locks.so(pl_lk+0x14e)[0x703e05d6390e]
/usr/lib64/glusterfs/3.0.4/xlator/cluster/replicate.so(afr_lk+0x1c1)[0x703e05902881]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_lk_resume+0xf5)[0x703e056e0b85]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_resolve_done+0x30)[0x703e056e8ff0]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_resolve_all+0xbd)[0x703e056e991d]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_resolve+0x8b)[0x703e056e985b]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_resolve_all+0xb6)[0x703e056e9916]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_resolve_fd+0x4c)[0x703e056e9a0c]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_resolve+0x4d)[0x703e056e981d]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_resolve_all+0x93)[0x703e056e98f3]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(resolve_and_resume+0x14)[0x703e056e99b4]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_lk+0xf3)[0x703e056dcc23]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(protocol_server_pollin+0x92)[0x703e056d9d12]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(notify+0x83)[0x703e056d9da3]
/usr/lib/libglusterfs.so.0(xlator_notify+0x43)[0x703e071122b3]
/usr/lib64/glusterfs/3.0.4/transport/socket.so(socket_event_handler+0xc8)[0x703e048a4798]
/usr/lib/libglusterfs.so.0[0x703e0712fa61]
/usr/sbin/glusterfsd(main+0x8d5)[0x1f569b65f75]
/lib/libc.so.6(__libc_start_main+0xe6)[0x703e069a4a56]
/usr/sbin/glusterfsd[0x1f569b644b9]
---------
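A note on reading the trace above: the frames carry symbol+offset pairs (e.g. server_lk_cbk+0xf8), so even without the core they can be mapped to source lines against an unstripped copy of the exact same 3.0.4 build. A sketch using two of the server.so frames from the trace:

# gdb /usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so
(gdb) info line *(server_lk_cbk+0xf8)   (maps the faulting callback frame to file:line)
(gdb) list *(server_lk+0xf3)            (shows the source around the entry to the lk path)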
Officially unsupported configuration (server-side replication). Deferring to post-3.1.
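For anyone landing here before 3.1: the supported shape of this setup keeps replicate on the client side, with each server exporting only its locked posix brick. A minimal sketch in the same volfile syntax (hostnames, paths, and volume names are placeholders, not taken from this report):

# server volfile: export the brick only
volume posix
  type storage/posix
  option directory /mnt/export
end-volume

volume brick
  type features/locks
  subvolumes posix
end-volume

volume server
  type protocol/server
  option transport-type tcp
  option auth.addr.brick.allow 10.0.0.*
  subvolumes brick
end-volume

# client volfile: connect to both servers and replicate there
volume node1
  type protocol/client
  option transport-type tcp
  option remote-host 10.0.0.1
  option remote-subvolume brick
end-volume

volume node2
  type protocol/client
  option transport-type tcp
  option remote-host 10.0.0.2
  option remote-subvolume brick
end-volume

volume replicate
  type cluster/replicate
  subvolumes node1 node2
end-volume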
*** This bug has been marked as a duplicate of bug 882 ***