Thanks for taking the time to report this bug, Antoine. We recently saw this ourselves (bug 762614) and a patch has gone into the repository. *** This bug has been marked as a duplicate of bug 882 ***
I have a setup here where a client (desktop002) will have its client glusterfs process crash completely whenever an application tries to acquire a lock on the mounted filesystem, but only if one master is down. The setup is a bit peculiar, as I'm doing some testing: I have two masters (called "rhea" and "desktop002") and a single client ("desktop002", yes, it's also a master). If I mount the export on desktop002 while the daemon is not started on desktop002, any attempt to create a lock on the partition crashes the client:

anarcat@desktop002:~$ /home/anarcat/dist/locktest
type=0, whence=0, start=0, len=0, pid=134514297
fnctl(F_GETLK): Software caused connection abort

Once glusterfsd is started on desktop002, the lock is created properly:

anarcat@desktop002:~$ /home/anarcat/dist/locktest
type=0, whence=0, start=0, len=0, pid=134514297
type=2, whence=0, start=0, len=0, pid=134514297
type=1, whence=0, start=0, len=0, pid=134514297
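For reference, here is a minimal sketch of what such a lock tester presumably does; this is an assumption inferred from the output above, not the actual program (which is linked in the requirements list below). It prints the struct flock, queries the lock table with F_GETLK, then takes and releases a write lock. In the failing run above, it is the F_GETLK call that comes back with "Software caused connection abort":

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>

/* Print the struct flock the way the output above shows it. */
static void dump(struct flock *fl)
{
    printf("type=%d, whence=%d, start=%ld, len=%ld, pid=%ld\n",
           fl->l_type, fl->l_whence, (long) fl->l_start,
           (long) fl->l_len, (long) fl->l_pid);
}

int main(int argc, char **argv)
{
    struct flock fl;
    int fd = open(argc > 1 ? argv[1] : "lockfile", O_RDWR | O_CREAT, 0644);

    if (fd < 0) { perror("open"); return 1; }

    memset(&fl, 0, sizeof(fl));   /* whole-file query, l_type 0 = F_RDLCK */
    dump(&fl);
    if (fcntl(fd, F_GETLK, &fl) < 0) {
        perror("fcntl(F_GETLK)");  /* the call that fails when one master is down */
        return 1;
    }
    dump(&fl);                     /* kernel sets l_type to F_UNLCK (2) if free */

    fl.l_type = F_WRLCK;           /* now take and release a write lock */
    if (fcntl(fd, F_SETLK, &fl) < 0) { perror("fcntl(F_SETLK)"); return 1; }
    dump(&fl);                     /* still F_WRLCK (1) */

    fl.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &fl);
    close(fd);
    return 0;
}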
So to reproduce this, you need the following:

* Debian lenny
* glusterfs backported from squeeze (3.0.3-1~koumbit50+1 here)
* a lock testing program (firefox 3.0 works fine, but you can also use the one provided here: https://answers.launchpad.net/ubuntu/+source/firefox/+question/5562)

This is the server configuration:

## file auto generated by /usr/bin/glusterfs-volgen (export.vol)
# Cmd line:
# $ /usr/bin/glusterfs-volgen --raid 1 --name test rhea:/test desktop002:/test

volume posix1
    type storage/posix
    option directory /test
end-volume

volume locks1
    type features/locks
    subvolumes posix1
end-volume

volume brick1
    type performance/io-threads
    option thread-count 8
    subvolumes locks1
end-volume

volume server-tcp
    type protocol/server
    option transport-type tcp
    option auth.addr.brick1.allow *
    option transport.socket.listen-port 6996
    option transport.socket.nodelay on
    subvolumes brick1
end-volume

And this is the client configuration:

## file auto generated by /usr/bin/glusterfs-volgen (mount.vol)
# Cmd line:
# $ /usr/bin/glusterfs-volgen --raid 1 --name srv rhea:/srv/home desktop002:/srv/home

# RAID 1
# TRANSPORT-TYPE tcp
volume rhea-1
    type protocol/client
    option transport-type tcp
    option remote-host rhea
    option transport.socket.nodelay on
    option transport.remote-port 6996
    option remote-subvolume brick1
end-volume

volume desktop002-1
    type protocol/client
    option transport-type tcp
    option remote-host desktop002
    option transport.socket.nodelay on
    option transport.remote-port 6996
    option remote-subvolume brick1
end-volume

volume mirror-0
    type cluster/replicate
    subvolumes rhea-1 desktop002-1
end-volume

volume writebehind
    type performance/write-behind
    option cache-size 4MB
    subvolumes mirror-0
end-volume

volume readahead
    type performance/read-ahead
    option page-count 4
    subvolumes writebehind
end-volume

volume iocache
    type performance/io-cache
    option cache-size `grep 'MemTotal' /proc/meminfo | awk '{print $2 * 0.2 / 1024}' | cut -f1 -d,`MB
    option cache-timeout 1
    subvolumes readahead
end-volume

volume quickread
    type performance/quick-read
    option cache-timeout 1
    option max-file-size 64kB
    subvolumes iocache
end-volume

volume statprefetch
    type performance/stat-prefetch
    subvolumes quickread
end-volume

Steps to reproduce:

1. install the server config file on rhea in /etc/glusterfs/glusterd.vol
2. start the server on rhea: /etc/init.d/glusterfs start
3. *DO NOT* start the other server in the replication cluster (on desktop002)
4. install the client config file on desktop002 in /etc/glusterfs/srv-tcp.vol
5. mount the partition on desktop002: mount -t glusterfs /etc/glusterfs/srv-tcp.vol /mnt
6. cd /mnt ; touch lockfile ; locktest

Alternatively, you can try to start firefox:

6. export HOME=/mnt ; firefox

Expected results:

* locking should occur as normal (firefox would start, and the locking program would complete properly)

Actual results:

* the client glusterfs process crashes completely and the mount needs to be redone

This is 100% reproducible.

Workaround:

* start the second server in the cluster

Traces: this is the strace output that hinted to me that the problem was with locks, taken from strace firefox:

open("/mnt/anarcat/.mozilla/firefox/profiles.ini", O_RDONLY) = 16
fstat64(16, {st_mode=S_IFREG|0644, st_size=174, ...}) = 0
[...]
read(16, "[General]\nStartWithLastProfile=1\n"..., 174) = 174
_llseek(16, 174, [174], SEEK_SET) = 0
close(16) = 0
[...]
stat64("/mnt/anarcat/.mozilla/firefox/be92ht7k.default", {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0
open("/mnt/anarcat/.mozilla/firefox/be92ht7k.default/.parentlock", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 16
fcntl64(16, F_GETLK, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0, pid=3083141920}) = -1 ECONNABORTED (Software caused connection abort)
close(16) = -1 ENOTCONN (Transport endpoint is not connected)

This is the error in mnt.log:

patchset: v3.0.2-41-g029062c
signal received: 11
time of crash: 2010-05-06 17:31:09
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.0.3
[0xb7fab400]
/usr/lib/libglusterfs.so.0[0xb7f7088e]
/usr/lib/glusterfs/3.0.3/xlator/performance/quick-read.so(qr_lk_cbk+0x6e)[0xb752d42e]
/usr/lib/glusterfs/3.0.3/xlator/performance/io-cache.so(ioc_lk_cbk+0x6e)[0xb75367ce]
/usr/lib/libglusterfs.so.0[0xb7f7088e]
/usr/lib/libglusterfs.so.0[0xb7f7088e]
/usr/lib/glusterfs/3.0.3/xlator/cluster/replicate.so(afr_lk_cbk+0xe5)[0xb7547c55]
/usr/lib/glusterfs/3.0.3/xlator/protocol/client.so(client_lk+0x3f4)[0xb75907d4]
/usr/lib/glusterfs/3.0.3/xlator/cluster/replicate.so(afr_lk_cbk+0x242)[0xb7547db2]
/usr/lib/glusterfs/3.0.3/xlator/protocol/client.so(client_lk_common_cbk+0x11e)[0xb758017e]
/usr/lib/glusterfs/3.0.3/xlator/protocol/client.so(protocol_client_interpret+0x245)[0xb75823e5]
/usr/lib/glusterfs/3.0.3/xlator/protocol/client.so(protocol_client_pollin+0xcf)[0xb758258f]
/usr/lib/glusterfs/3.0.3/xlator/protocol/client.so(notify+0xd2)[0xb7592b22]
/usr/lib/libglusterfs.so.0(xlator_notify+0x3f)[0xb7f6bc8f]
/usr/lib/glusterfs/3.0.3/transport/socket.so(socket_event_poll_in+0x3d)[0xb6cfee6d]
/usr/lib/glusterfs/3.0.3/transport/socket.so(socket_event_handler+0xab)[0xb6cfef2b]
/usr/lib/libglusterfs.so.0[0xb7f8707a]
/usr/lib/libglusterfs.so.0(event_dispatch+0x21)[0xb7f85e61]
/usr/sbin/glusterfs(main+0xb3d)[0x804bded]
/lib/i686/cmov/libc.so.6(__libc_start_main+0xe5)[0xb7df7455]
/usr/sbin/glusterfs[0x8049f11]
---------

I understand it might be expected that both masters be up for locking to work, but this should at the very least not crash the client filesystem completely. I have tested this with fuse 2.7 from lenny and also with the 2.8 backport, without any luck. I haven't tried the patched version documented in the FAQ.
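For what it's worth, after the crash the mountpoint stays in the "Transport endpoint is not connected" state seen in the strace above until the mount is redone. A small hypothetical probe for that symptom (the /mnt path matches the steps above):

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>

int main(void)
{
    struct stat st;

    /* Once the client glusterfs process has died, syscalls on the
     * mountpoint fail with ENOTCONN until it is remounted. */
    if (stat("/mnt", &st) < 0 && errno == ENOTCONN)
        printf("mount is dead: %s\n", strerror(errno));
    else
        printf("mount is alive\n");
    return 0;
}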