Bug 763464 (GLUSTER-1732) - VM hangs while autohealing
Summary: VM hangs while autohealing
Keywords:
Status: CLOSED DUPLICATE of bug 762563
Alias: GLUSTER-1732
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.0.5
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Pavan Vilas Sondur
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-09-28 13:57 UTC by Lakshmipathi G
Modified: 2015-12-01 16:45 UTC
2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: RTP
Mount Type: fuse
Documentation: ---
CRM:
Verified Versions:


Attachments
server and client logs (5.05 KB, application/zip)
2010-09-28 10:57 UTC, Lakshmipathi G

Description Lakshmipathi G 2010-09-28 13:57:06 UTC
Reported by: Fred Fischer on the mailing list
---
I'm attaching 3 log files - 2 servers, 1 client - covering the following:
* Creation of a virtual machine with a diskfile called image.raw
* Shutting down brick2
* Waiting for vm to modify its diskfile
* Starting brick2 back up
* Start of autohealing (VM starts hanging here)
* Finish autohealing (VM continues to run here)

--------
Hi,

I have 2 machines running a simple replicate volume to provide highly
available storage for KVM virtual machines.
As soon as auto-healing starts, glusterfs starts blocking the VM's
storage access (apparently writes are what trigger this), leaving the
whole virtual machine hanging.
I can reproduce this bug on both ext3 and ext4 filesystems, on real
machines as well as on VMs.

Any help would be appreciated; we have to run the VMs without glusterfs
at the moment because of this problem :-(

More on my config:

* Ubuntu 10.04 Server 64bit
* Kernel 2.6.32-21-server
* Fuse 2.8.1
* Glusterfs v3.0.2

How to replicate (a command-level sketch follows the list):

* 2 nodes running glusterfs replicate
* Start a KVM virtual machine with its disk file on glusterfs
* Stop glusterfsd on one node
* Make changes to the disk file
* Bring glusterfsd back online (auto-healing starts) (replicate: no
missing files - /image.raw. proceeding to metadata check)
* As soon as the VM starts writing data, it is blocked until
auto-healing finishes, making it completely unresponsive
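
For concreteness, here is a rough command-level sketch of those steps across
the two nodes and the client. The use of killall, the dd writes that stand in
for the VM's own disk activity, and the volfile path are my assumptions, not
taken from the report:

### reproduction sketch (hypothetical) ###
# node 2 (192.168.158.142): stop the brick so replicate runs degraded
killall glusterfsd

# client: modify the disk file while node 2 is down (the running VM does
# this by itself; dd just simulates an in-place write)
dd if=/dev/urandom of=/mnt/glusterfs/image.raw bs=1M count=64 seek=128 conv=notrunc

# node 2: bring the brick back; replicate sees the pending changelog
glusterfsd -f /etc/glusterfs/glusterfsd.vol

# client: the next access to image.raw triggers self-heal, and further
# writes from the VM block until the heal of the whole file finishes
dd if=/dev/zero of=/mnt/glusterfs/image.raw bs=4k count=1 conv=notrunc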

Message from Kernel (Printed several times while healing):

INFO: task kvm:7774 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kvm           D 00000000ffffffff     0  7774      1 0x00000000
ffff8801adcd9e48 0000000000000082 0000000000015bc0 0000000000015bc0
ffff880308d9df80 ffff8801adcd9fd8 0000000000015bc0 ffff880308d9dbc0
0000000000015bc0 ffff8801adcd9fd8 0000000000015bc0 ffff880308d9df80
Call Trace:
[<ffffffff8153f867>] __mutex_lock_slowpath+0xe7/0x170
[<ffffffff8153f75b>] mutex_lock+0x2b/0x50
[<ffffffff8123a1d1>] fuse_file_llseek+0x41/0xe0
[<ffffffff8114238a>] vfs_llseek+0x3a/0x40
[<ffffffff81142fd6>] sys_lseek+0x66/0x80
[<ffffffff810131b2>] system_call_fastpath+0x16/0x1b

Gluster Configuration:

### glusterfsd.vol ###
volume posix
   type storage/posix
   option directory /data/export
end-volume

volume locks
   type features/locks
   subvolumes posix
end-volume

volume brick
   type performance/io-threads
   option thread-count 16
   subvolumes locks
end-volume

volume server
   type protocol/server
   option transport-type tcp
   option transport.socket.nodelay on
   option transport.socket.bind-address 192.168.158.141
   option auth.addr.brick.allow 192.168.158.*
   subvolumes brick
end-volume

### glusterfs.vol ###
volume gluster1
   type protocol/client
   option transport-type tcp
   option remote-host 192.168.158.141
   option remote-subvolume brick
end-volume

volume gluster2
   type protocol/client
   option transport-type tcp
   option remote-host 192.168.158.142
   option remote-subvolume brick
end-volume

volume replicate
   type cluster/replicate
   subvolumes gluster1 gluster2
end-volume
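
For reference, on 3.0.x the server volfile is loaded directly by glusterfsd on
each node; a sketch of the invocation (the path under /etc/glusterfs is my
assumption, mirroring the client-side fstab entry below):

### starting the server bricks (sketch) ###
# run on each node, with that node's bind-address set in its glusterfsd.vol
glusterfsd -f /etc/glusterfs/glusterfsd.vol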

### fstab ###
/etc/glusterfs/glusterfs.vol  /mnt/glusterfs  glusterfs  log-level=DEBUG,direct-io-mode=disable  0  0
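
The same client mount can also be done by hand; a sketch of the two usual
equivalents (the exact spelling of --disable-direct-io-mode on the 3.0 client
is from memory and should be treated as an assumption):

### mounting the client by hand (sketch) ###
mount -t glusterfs /etc/glusterfs/glusterfs.vol /mnt/glusterfs \
  -o log-level=DEBUG,direct-io-mode=disable
# or invoke the FUSE client directly:
glusterfs -f /etc/glusterfs/glusterfs.vol --log-level=DEBUG \
  --disable-direct-io-mode /mnt/glusterfs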


I read that you wanted users to kill -11 the glusterfs process for more
debug info - here it is:

pending frames:
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)
frame : type(1) op(WRITE)

patchset: v3.0.2
signal received: 11
time of crash: 2010-09-28 11:14:31
configuration details:
argp 1
backtrace 1
dlfcn 1
fdatasync 1
libpthread 1
llistxattr 1
setfsid 1
spinlock 1
epoll.h 1
xattr.h 1
st_atim.tv_nsec 1
package-string: glusterfs 3.0.2
/lib/libc.so.6(+0x33af0)[0x7f0c6bf0eaf0]
/lib/libc.so.6(epoll_wait+0x33)[0x7f0c6bfc1c93]
/usr/lib/libglusterfs.so.0(+0x2e261)[0x7f0c6c6ac261]
glusterfs(main+0x852)[0x4044f2]
/lib/libc.so.6(__libc_start_main+0xfd)[0x7f0c6bef9c4d]
glusterfs[0x402ab9]
---------

Comment 1 Pavan Vilas Sondur 2010-10-11 06:03:09 UTC
This is a dup of 831. The issue is hit because self-heal holds a full-file lock for the duration of the heal, so I/O operations are blocked until the self-heal is done. If the files that need to be self-healed are huge, the blocked I/O calls can time out (as in this case).

The fix is for self-heal to hold locks at a finer granularity.

*** This bug has been marked as a duplicate of bug 831 ***

