The configuration in which this crash was observed is not officially supported, hence the reduction in priority. Nevertheless, we need to look into this to see why the crash occurred. Deniss, can you paste the whole backtrace from the core? Run gdb on the core:

# gdb -c <core-file> <glusterfs-binary path>
(gdb) thread apply all bt full

Also, please mention what steps led to the crash and whether it is reproducible.
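For completeness, here is a minimal sequence for capturing a usable core the next time this hits; the core path and PID below are illustrative, not taken from this report:

# ulimit -c unlimited                                      (allow the crashing process to write a core file)
# echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern   (optional: name core files predictably)
...after the crash...
# gdb -c /tmp/core.glusterfsd.12345 /usr/sbin/glusterfsd
(gdb) thread apply all bt full                             (full backtrace of every thread, with locals)
(gdb) info sharedlibrary                                   (check whether the xlator .so files resolved symbols)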
Unfortunately that happened on a production system, so I quickly moved back to version 2.0.9 (which works for now) and all core files were deleted. Furthermore, all libs and binaries were stripped at link time by default. Sorry.
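If it resurfaces, an unstripped build is what makes a core actionable. Two options, hedged because they are assumptions about this particular setup (the Gentoo guess is based only on the hardened kernel string; package name and flags may differ):

# FEATURES="nostrip splitdebug" emerge glusterfs           (Portage: keep and split out debug symbols)
# CFLAGS="-g -O0" ./configure && make && make install      (source build with debug info, no stripping)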
Can you give us a little more info as to how you ran into this crash? What was happening when the crash occurred? Was it reproducible? If yes, what steps led to it?
Well, the story is really sad: I ran two nodes with two gluster v2.0.8 instances, with the config described in this ticket, for a year. Up to 30 GB of pictures are stored. When the 3.x series came out I tried each release in a test environment with my config, and version 3.0.4 seemed to be stable, so I started to upgrade production. And then the bad things began. I stopped the first node and upgraded the other to the new version, then upgraded the first one. When the second came up, iowait on both rose to 100% and errors like

W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 2: LOOKUP() / => -1 (Transport endpoint is not connected)

started to appear randomly. Then one of the nodes died with a general protection fault:

general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
CPU 1
Pid: 5570, comm: glusterfsd Not tainted 2.6.28-hardened-r9.dale.20100518 #4
RIP: 0010:[<ffffffff802fece3>] [<ffffffff802fece3>] 0xffffffff802fece3
RSP: 0018:ffff88004c071b78 EFLAGS: 00010206
RAX: 0000000000000028 RBX: 5000000000000000 RCX: 0000000000000000
RDX: ffff88004c071c98 RSI: 5000000000000000 RDI: ffffffff809790e2
RBP: ffff88004c071c58 R08: 00000000fffffffe R09: 0000000000000017
R10: ffffffff809794cd R11: 00000000809790a0 R12: ffffffff809790e2
R13: ffffffff809794a0 R14: ffffffff809790a0 R15: ffffffff809790e2
FS: 00006e46e1102910(0000) GS:ffff88007f804880(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000a97fd2e20e8 CR3: 000000004c013000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process glusterfsd (pid: 5570, threadinfo ffff88004c070000, task ffff88007e9d5550)
Stack:
 000000000000006b ffff88004c071c58 0000000000000001 ffffffff802fe7d0
 ffff88007e9d5788 ffffe20000fb0940 0000000000000000 ffffffff8040aa05
 0000000000000000 ffff8800472790e0 0000000000000000 ffffffff809794cd
Call Trace:
 [<ffffffff802fe7d0>] ? 0xffffffff802fe7d0
 [<ffffffff8040aa05>] ? 0xffffffff8040aa05
 [<ffffffff80261d80>] ? 0xffffffff80261d80
 [<ffffffff806b3b1b>] ? 0xffffffff806b3b1b
 [<ffffffff802fec8f>] ? 0xffffffff802fec8f
 [<ffffffff802f2a50>] ? 0xffffffff802f2a50
 [<ffffffff80261d5e>] ? 0xffffffff80261d5e
 [<ffffffff8024ac20>] ? 0xffffffff8024ac20
 [<ffffffff8026224a>] ? 0xffffffff8026224a
 [<ffffffff803131cd>] ? 0xffffffff803131cd
 [<ffffffff803126fd>] ? 0xffffffff803126fd
 [<ffffffff802b256d>] ? 0xffffffff802b256d
 [<ffffffff802b261f>] ? 0xffffffff802b261f
 [<ffffffff802a1bb2>] ? 0xffffffff802a1bb2
 [<ffffffff802a068d>] ? 0xffffffff802a068d
 [<ffffffff802a2c88>] ? 0xffffffff802a2c88
 [<ffffffff80299e06>] ? 0xffffffff80299e06
 [<ffffffff802b27d3>] ? 0xffffffff802b27d3
 [<ffffffff8020289b>] ? 0xffffffff8020289b
Code: c3 48 c7 c6 a0 90 97 80 48 c7 c7 b6 30 77 80 31 c0 e8 fd 41 3b 00 eb e0 41 54 48 85 f6 49 89 fc 55 53 48 89 f3 0f 84 b5 02 00 00 <48> 8b 46 08 48 c1 e8 3c 3c 03 0f 86 bd 00 00 00 8b 43 0c 31 d2
RIP [<ffffffff802fece3>] 0xffffffff802fece3 RSP <ffff88004c071b78>
Kernel panic - not syncing: Fatal exception

So I started to switch off translators (io-threads looked like the culprit). After each reboot the time on the nodes diverged, and this probably caused additional problems, since files are stat-ed constantly. Nothing helped, and I decided to take one node down to buy time for additional research in the test environment. But when the load increased, even the one-node config became unstable, with errors like

W [fuse-bridge.c:722:fuse_attr_cbk] glusterfs-fuse: 1416: LOOKUP() / => -1 (Transport endpoint is not connected)

and crashes.
So I quickly rolled back to 2.0.9 (same config), and it works.
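One variable worth eliminating on both nodes is the clock drift mentioned above, since the reporter suspects the diverging clocks aggravated the constant stat() traffic. A minimal sketch, assuming the standard ntp tools are installed (the server name and init script path are placeholders):

# ntpdate pool.ntp.org       (one-shot sync before starting glusterfsd)
# /etc/init.d/ntpd start     (or keep the daemon running for continuous correction)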
After migrating from 2.0.8 to 3.0.4, glusterfs with preexisting data (~30 GB) crashes even when the second node is down. In the log:

================================================================================
Version      : glusterfs 3.0.4 built on May 17 2010 16:17:25
git: v3.0.4
Starting Time: 2010-05-18 12:15:07
Command line : /usr/sbin/glusterfsd --pid-file=/var/run/glusterfsd.pid --log-file=/var/log/glusterfs/glusterfsd.log --volfile=/etc/glusterfs/glusterfsd.vol /uploads
PID          : 10075
System name  : Linux
Nodename     : test1
Kernel Release : 2.6.28-hardened-r9.install.64.20100125
Hardware Identifier: x86_64

Given volfile:
+------------------------------------------------------------------------------+
  1: # file: /etc/glusterfs/glusterfs-server.vol
  2: #
  3: ### ###
  4: # PHOTO #
  5: ### ###
  6: volume posix_photo
  7:   type storage/posix
  8:   option directory /mnt/glusterfs_photo
  9: end-volume
 10:
 11: volume brick_photo
 12: #volume locks_photo
 13:   type features/locks
 14: #  option mandatory-locks on
 15:   subvolumes posix_photo
 16: end-volume
 17:
 18: #volume brick_photo
 19: #  type performance/io-threads
 20: #  option thread-count 1
 21: #  subvolumes locks_photo
 22: #end-volume
 23:
 24: volume remote_photo
 25:   type protocol/client
 26:   option transport-type tcp
 27:   option remote-host 10.0.0.2 # Change this
 28:   option remote-subvolume brick_photo
 29: end-volume
 30:
 31: volume replicate_photo
 32:   type cluster/replicate
 33:   option read-subvolume brick_photo
 34:   subvolumes brick_photo remote_photo
 35: end-volume
 36:
 37: ### ###
 38: # CDB #
 39: ### ###
 40: volume posix_cdb
 41:   type storage/posix
 42:   option directory /mnt/glusterfs_cdb
 43: end-volume
 44:
 45: #volume locks_cdb
 46: volume brick_cdb
 47:   type features/locks
 48: #  option mandatory-locks on
 49:   subvolumes posix_cdb
 50: end-volume
 51:
 52: #volume brick_cdb
 53: #  type performance/io-threads
 54: #  option thread-count 4
 55: #  subvolumes locks_cdb
 56: #end-volume
 57:
 58: volume remote_cdb
 59:   type protocol/client
 60:   option transport-type tcp
 61:   option remote-host 10.0.0.2 # Change this
 62:   option remote-subvolume brick_cdb
 63: end-volume
 64:
 65: volume replicate_cdb
 66:   type cluster/replicate
 67:   option read-subvolume brick_cdb
 68:   subvolumes brick_cdb remote_cdb
 69: end-volume
 70:
 71:
 72: #### Exporting volumes #####
 73: volume server
 74:   type protocol/server
 75: #  option transport.socket.bind-address
 76:   option transport-type tcp
 77:   option auth.addr.brick_cdb.allow 10.0.0.1,10.0.0.2
 78:   option auth.addr.replicate_cdb.allow 10.0.0.1,10.0.0.2
 79:   option auth.addr.brick_photo.allow 10.0.0.1,10.0.0.2
 80:   option auth.addr.replicate_photo.allow 10.0.0.1,10.0.0.2
 81:   subvolumes replicate_cdb brick_cdb replicate_photo brick_photo
 82: end-volume
 83:
 84: ##### PHOTO CLIENT #####
 85: #volume iothreads
 86: #  type performance/io-threads
 87: #  option thread-count 4
 88: #  subvolumes replicate_photo
 89: #end-volume
 90:
 91: #volume writebehind
 92: #  type performance/write-behind
 93: #  option cache-size 4MB
 94: #  subvolumes iothreads
 95: #end-volume
 96:
 97: #volume readahead
 98: #  type performance/read-ahead
 99: #  option page-count 4
100: #  subvolumes writebehind
101: #end-volume
102:
103: volume iocache
104:   type performance/io-cache
105:   option cache-size 256MB
106: #  subvolumes readahead
107:   subvolumes replicate_photo
108: end-volume
109:
110: # BUG http://bugs.gluster.com/cgi-bin/bugzilla3/show_bug.cgi?id=723
111: #volume quickread
112: #  type performance/quick-read
113: #  option cache-timeout 1
114: #  option max-file-size 64kB
115: #  subvolumes iocache
116: #end-volume
+------------------------------------------------------------------------------+
[2010-05-18 12:15:07] N [afr.c:2632:notify] replicate_photo: Subvolume 'brick_photo' came back up; going online.
[2010-05-18 12:15:07] N [fuse-bridge.c:2950:fuse_init] glusterfs-fuse: FUSE inited with protocol versions: glusterfs 7.13 kernel 7.10
[2010-05-18 12:15:07] N [afr.c:2632:notify] replicate_photo: Subvolume 'brick_photo' came back up; going online.
[2010-05-18 12:15:07] N [afr.c:2632:notify] replicate_cdb: Subvolume 'brick_cdb' came back up; going online.
[2010-05-18 12:15:07] N [afr.c:2632:notify] replicate_cdb: Subvolume 'brick_cdb' came back up; going online.
[2010-05-18 12:15:07] N [afr.c:2632:notify] replicate_photo: Subvolume 'brick_photo' came back up; going online.
[2010-05-18 12:15:07] N [afr.c:2632:notify] replicate_photo: Subvolume 'brick_photo' came back up; going online.
[2010-05-18 12:15:07] N [glusterfsd.c:1408:main] glusterfs: Successfully started
[2010-05-18 12:15:07] E [socket.c:762:socket_connect_finish] remote_photo: connection to failed (Connection refused)
[2010-05-18 12:15:07] E [socket.c:762:socket_connect_finish] remote_photo: connection to failed (Connection refused)
[2010-05-18 12:15:07] E [socket.c:762:socket_connect_finish] remote_cdb: connection to failed (Connection refused)
[2010-05-18 12:15:07] E [socket.c:762:socket_connect_finish] remote_cdb: connection to failed (Connection refused)
[2010-05-18 12:15:09] N [server-protocol.c:5852:mop_setvolume] server: accepted client from 10.0.0.1:1021
[2010-05-18 12:15:09] N [server-protocol.c:5852:mop_setvolume] server: accepted client from 10.0.0.1:1020
pending frames:
frame : type(1) op(LK)

patchset: v3.0.4
signal received: 11
time of crash: 2010-05-18 12:22:34
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.0.4
/lib/libc.so.6[0x703e069b7f60]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_lk_cbk+0xf8)[0x703e056e09b8]
/usr/lib64/glusterfs/3.0.4/xlator/cluster/replicate.so(afr_lk_cbk+0xd3)[0x703e058ff563]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/client.so(client_lk+0x276)[0x703e05b4d2a6]
/usr/lib64/glusterfs/3.0.4/xlator/cluster/replicate.so(afr_lk_cbk+0x21c)[0x703e058ff6ac]
/usr/lib64/glusterfs/3.0.4/xlator/features/locks.so(pl_lk+0x14e)[0x703e05d6390e]
/usr/lib64/glusterfs/3.0.4/xlator/cluster/replicate.so(afr_lk+0x1c1)[0x703e05902881]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_lk_resume+0xf5)[0x703e056e0b85]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_resolve_done+0x30)[0x703e056e8ff0]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_resolve_all+0xbd)[0x703e056e991d]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_resolve+0x8b)[0x703e056e985b]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_resolve_all+0xb6)[0x703e056e9916]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_resolve_fd+0x4c)[0x703e056e9a0c]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_resolve+0x4d)[0x703e056e981d]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_resolve_all+0x93)[0x703e056e98f3]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(resolve_and_resume+0x14)[0x703e056e99b4]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(server_lk+0xf3)[0x703e056dcc23]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(protocol_server_pollin+0x92)[0x703e056d9d12]
/usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so(notify+0x83)[0x703e056d9da3]
/usr/lib/libglusterfs.so.0(xlator_notify+0x43)[0x703e071122b3]
/usr/lib64/glusterfs/3.0.4/transport/socket.so(socket_event_handler+0xc8)[0x703e048a4798]
/usr/lib/libglusterfs.so.0[0x703e0712fa61]
/usr/sbin/glusterfsd(main+0x8d5)[0x1f569b65f75]
/lib/libc.so.6(__libc_start_main+0xe6)[0x703e069a4a56]
/usr/sbin/glusterfsd[0x1f569b644b9]
---------
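A note on reading the trace above: the frames carry symbol+offset pairs (e.g. server_lk_cbk+0xf8), so even without the core they can be mapped to source lines against an unstripped copy of the exact same 3.0.4 build. A sketch using two of the server.so frames from the trace:

# gdb /usr/lib64/glusterfs/3.0.4/xlator/protocol/server.so
(gdb) info line *(server_lk_cbk+0xf8)   (maps the faulting callback frame to file:line)
(gdb) list *(server_lk+0xf3)            (shows the source around the entry to the lk path)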
Officially unsupported configuration (server-side replication). Deferring to post-3.1.
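For anyone landing here before 3.1: the supported shape of this setup keeps replicate on the client side, with each server exporting only its locked posix brick. A minimal sketch in the same volfile syntax (hostnames, paths, and volume names are placeholders, not taken from this report):

# server volfile: export the brick only
volume posix
  type storage/posix
  option directory /mnt/export
end-volume

volume brick
  type features/locks
  subvolumes posix
end-volume

volume server
  type protocol/server
  option transport-type tcp
  option auth.addr.brick.allow 10.0.0.*
  subvolumes brick
end-volume

# client volfile: connect to both servers and replicate there
volume node1
  type protocol/client
  option transport-type tcp
  option remote-host 10.0.0.1
  option remote-subvolume brick
end-volume

volume node2
  type protocol/client
  option transport-type tcp
  option remote-host 10.0.0.2
  option remote-subvolume brick
end-volume

volume replicate
  type cluster/replicate
  subvolumes node1 node2
end-volume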
*** This bug has been marked as a duplicate of bug 882 ***