| Summary: | client crashes on locking when a master is down in replication | ||
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Antoine Beaupré <anarcat> |
| Component: | locks | Assignee: | Pavan Vilas Sondur <pavan> |
| Status: | CLOSED DUPLICATE | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | low | ||
| Version: | 3.0.3 | CC: | gluster-bugs, vikas |
| Target Milestone: | --- | ||
| Target Release: | --- | ||
| Hardware: | All | ||
| OS: | Linux | ||
| URL: | https://wiki.koumbit.net/Glusterfs#Crashing_whon_locks_with_a_master_down | ||
| Whiteboard: | |||
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | --- |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | Category: | --- | |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
Description
Vikas Gorur
2010-05-06 19:56:13 UTC
I have a setup here where the client glusterfs process on a machine (desktop002) crashes completely when any application tries to acquire a lock on the mounted filesystem, but only if one master is down.
The setup is a bit peculiar because I'm doing some testing: there are two masters ("rhea" and "desktop002") and a single client ("desktop002", which is also a master). If I mount the export on desktop002 while the server daemon is not running on desktop002, any attempt to take a lock on the mount crashes the client:
anarcat@desktop002:~$ /home/anarcat/dist/locktest
type=0, whence=0, start=0, len=0, pid=134514297
fnctl(F_GETLK): Software caused connection abort
Once glusterfsd is started on desktop002, the lock is created properly:
anarcat@desktop002:~$ /home/anarcat/dist/locktest
type=0, whence=0, start=0, len=0, pid=134514297
type=2, whence=0, start=0, len=0, pid=134514297
type=1, whence=0, start=0, len=0, pid=134514297
So to reproduce this, you need the following.
* Debian lenny.
* glusterfs backported from squeeze (3.0.3-1~koumbit50+1 here)
* a lock testing program (firefox 3.0 works fine, but you can also use the one provided here: https://answers.launchpad.net/ubuntu/+source/firefox/+question/5562; a minimal sketch of such a program follows this list)
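For reference, here is a minimal sketch of such a lock tester in C. This is an illustrative stand-in, not necessarily the exact program linked above; the file name and the precise fcntl() sequence are assumptions. It simply exercises fcntl() F_GETLK/F_SETLK on a file, which is the same call path the strace further down shows failing with ECONNABORTED:
/* locktest.c -- minimal fcntl() lock exerciser (illustrative sketch only).
 * Build: gcc -o locktest locktest.c
 * Run:   ./locktest [file]   (e.g. a file on the GlusterFS mount)
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static void dump(const struct flock *fl)
{
    printf("type=%d, whence=%d, start=%ld, len=%ld, pid=%ld\n",
           fl->l_type, fl->l_whence,
           (long) fl->l_start, (long) fl->l_len, (long) fl->l_pid);
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "lockfile";
    struct flock fl;
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    memset(&fl, 0, sizeof fl);
    fl.l_type = F_WRLCK;     /* ask whether a whole-file write lock would succeed */
    fl.l_whence = SEEK_SET;
    fl.l_start = 0;
    fl.l_len = 0;            /* 0 = the whole file */

    /* On the broken setup this call returns -1 with ECONNABORTED
       ("Software caused connection abort") and the mount dies. */
    if (fcntl(fd, F_GETLK, &fl) == -1) { perror("fcntl(F_GETLK)"); return 1; }
    dump(&fl);

    fl.l_type = F_WRLCK;     /* now actually take the write lock ... */
    if (fcntl(fd, F_SETLK, &fl) == -1) { perror("fcntl(F_SETLK)"); return 1; }
    dump(&fl);

    fl.l_type = F_UNLCK;     /* ... and release it */
    if (fcntl(fd, F_SETLK, &fl) == -1) { perror("fcntl(F_UNLCK)"); return 1; }
    dump(&fl);

    close(fd);
    return 0;
}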
This is the server configuration:
## file auto generated by /usr/bin/glusterfs-volgen (export.vol)
# Cmd line:
# $ /usr/bin/glusterfs-volgen --raid 1 --name test rhea:/test desktop002:/test
volume posix1
  type storage/posix
  option directory /test
end-volume

volume locks1
  type features/locks
  subvolumes posix1
end-volume

volume brick1
  type performance/io-threads
  option thread-count 8
  subvolumes locks1
end-volume

volume server-tcp
  type protocol/server
  option transport-type tcp
  option auth.addr.brick1.allow *
  option transport.socket.listen-port 6996
  option transport.socket.nodelay on
  subvolumes brick1
end-volume
And this is the client configuration:
## file auto generated by /usr/bin/glusterfs-volgen (mount.vol)
# Cmd line:
# $ /usr/bin/glusterfs-volgen --raid 1 --name srv rhea:/srv/home desktop002:/srv/home
# RAID 1
# TRANSPORT-TYPE tcp
volume rhea-1
  type protocol/client
  option transport-type tcp
  option remote-host rhea
  option transport.socket.nodelay on
  option transport.remote-port 6996
  option remote-subvolume brick1
end-volume

volume desktop002-1
  type protocol/client
  option transport-type tcp
  option remote-host desktop002
  option transport.socket.nodelay on
  option transport.remote-port 6996
  option remote-subvolume brick1
end-volume

volume mirror-0
  type cluster/replicate
  subvolumes rhea-1 desktop002-1
end-volume

volume writebehind
  type performance/write-behind
  option cache-size 4MB
  subvolumes mirror-0
end-volume

volume readahead
  type performance/read-ahead
  option page-count 4
  subvolumes writebehind
end-volume

volume iocache
  type performance/io-cache
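  # note: the backquoted shell snippet below is meant to yield roughly 20% of MemTotal, in MB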
  option cache-size `grep 'MemTotal' /proc/meminfo | awk '{print $2 * 0.2 / 1024}' | cut -f1 -d,`MB
  option cache-timeout 1
  subvolumes readahead
end-volume

volume quickread
  type performance/quick-read
  option cache-timeout 1
  option max-file-size 64kB
  subvolumes iocache
end-volume

volume statprefetch
  type performance/stat-prefetch
  subvolumes quickread
end-volume
Steps to reproduce:
1. install the server config file on rhea in /etc/glusterfs/glusterd.vol
2. start the server on rhea: /etc/init.d/glusterfs start
3. *DO NOT* start the other server in the replication cluster (on desktop002)
4. install the client config file on desktop002 in /etc/glusterfs/srv-tcp.vol
5. mount the partition on desktop002: mount -t glusterfs /etc/glusterfs/srv-tcp.vol /mnt
6. cd /mnt ; touch lockfile ; locktest
Alternatively, you can try to start firefox:
6. export HOME=/mnt ; firefox
Expected results:
* locking should occur as normal (firefox would start and the locking program would complete properly)
Actual results:
* the client glusterfs process crashes completely and the mount needs to be redone
This is 100% reproducible.
Workaround:
* start the second server in the cluster
Traces:
This is the strace output (from strace firefox) that hinted that the problem was with locks:
open("/mnt/anarcat/.mozilla/firefox/profiles.ini", O_RDONLY) = 16
fstat64(16, {st_mode=S_IFREG|0644, st_size=174, ...}) = 0
[...]
read(16, "[General]\nStartWithLastProfile=1\n"..., 174) = 174
_llseek(16, 174, [174], SEEK_SET) = 0
close(16) = 0
[...]
stat64("/mnt/anarcat/.mozilla/firefox/be92ht7k.default", {st_mode=S_IFDIR|0700, st_size=4096, ...}) = 0
open("/mnt/anarcat/.mozilla/firefox/be92ht7k.default/.parentlock", O_WRONLY|O_CREAT|O_TRUNC, 0666) = 16
fcntl64(16, F_GETLK, {type=F_WRLCK, whence=SEEK_SET, start=0, len=0, pid=3083141920}) = -1 ECONNABORTED (Software caused connection abort)
close(16) = -1 ENOTCONN (Transport endpoint is not connected)
This is the error in mnt.log:
patchset: v3.0.2-41-g029062c
signal received: 11
time of crash: 2010-05-06 17:31:09
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.0.3
[0xb7fab400]
/usr/lib/libglusterfs.so.0[0xb7f7088e]
/usr/lib/glusterfs/3.0.3/xlator/performance/quick-read.so(qr_lk_cbk+0x6e)[0xb752d42e]
/usr/lib/glusterfs/3.0.3/xlator/performance/io-cache.so(ioc_lk_cbk+0x6e)[0xb75367ce]
/usr/lib/libglusterfs.so.0[0xb7f7088e]
/usr/lib/libglusterfs.so.0[0xb7f7088e]
/usr/lib/glusterfs/3.0.3/xlator/cluster/replicate.so(afr_lk_cbk+0xe5)[0xb7547c55]
/usr/lib/glusterfs/3.0.3/xlator/protocol/client.so(client_lk+0x3f4)[0xb75907d4]
/usr/lib/glusterfs/3.0.3/xlator/cluster/replicate.so(afr_lk_cbk+0x242)[0xb7547db2]
/usr/lib/glusterfs/3.0.3/xlator/protocol/client.so(client_lk_common_cbk+0x11e)[0xb758017e]
/usr/lib/glusterfs/3.0.3/xlator/protocol/client.so(protocol_client_interpret+0x245)[0xb75823e5]
/usr/lib/glusterfs/3.0.3/xlator/protocol/client.so(protocol_client_pollin+0xcf)[0xb758258f]
/usr/lib/glusterfs/3.0.3/xlator/protocol/client.so(notify+0xd2)[0xb7592b22]
/usr/lib/libglusterfs.so.0(xlator_notify+0x3f)[0xb7f6bc8f]
/usr/lib/glusterfs/3.0.3/transport/socket.so(socket_event_poll_in+0x3d)[0xb6cfee6d]
/usr/lib/glusterfs/3.0.3/transport/socket.so(socket_event_handler+0xab)[0xb6cfef2b]
/usr/lib/libglusterfs.so.0[0xb7f8707a]
/usr/lib/libglusterfs.so.0(event_dispatch+0x21)[0xb7f85e61]
/usr/sbin/glusterfs(main+0xb3d)[0x804bded]
/lib/i686/cmov/libc.so.6(__libc_start_main+0xe5)[0xb7df7455]
/usr/sbin/glusterfs[0x8049f11]
---------
I understand it may be expected that both masters be up for locking, but this should at the very least not completely crash the client filesystem.
I have tested this with fuse 2.7 from lenny and also with the 2.8 backport, without any luck. I haven't tried the patched version documented in the FAQ.