Bug 763041 (GLUSTER-1309) - self-heal aborts after exactly 2G (2^31 bytes) on invalid argument to lseek
Summary: self-heal aborts after exactly 2G (2^31 bytes) on invalid argument to lseek
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-1309
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.0.4
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Pavan Vilas Sondur
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-08-07 16:07 UTC by stefaandr
Modified: 2015-12-01 16:45 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:



Description stefaandr 2010-08-07 16:07:31 UTC
My configuration: glusterfs-volgen --name vboxstore1 --raid 1 vboxserver1:/space/glusterfs vboxserver2:/space/glusterfs, running 3.0.4 (previously 3.0.5, with the same problems).

Scenario: the disk on vboxserver2:/space crashed and was replaced with a new blank filesystem; self-heal was then attempted by running "find" on a client.

Suddenly, the client repeatedly says
[2010-08-07 10:17:14] D [afr-self-heal-algorithm.c:152:sh_full_write_cbk] mirror-0: write to /oldcurie/virtualbox/bobtest/disk.vdi failed on subvolume vboxserver2-1 (Invalid argument)

On vboxserver2, at the same time, we get
[2010-08-07 10:17:14] E [posix.c:2608:posix_writev] posix1: lseek(-2147483648) on fd=0x1047260 failed: Invalid argument
[2010-08-07 10:17:14] D [server-protocol.c:1939:server_writev_cbk] server-tcp: 617605: WRITEV 4 (5423176) ==> -1 (Invalid argument)
[2010-08-07 10:17:14] E [posix.c:2608:posix_writev] posix1: lseek(-2147418112) on fd=0x1047260 failed: Invalid argument
[2010-08-07 10:17:14] D [server-protocol.c:1939:server_writev_cbk] server-tcp: 617611: WRITEV 4 (5423176) ==> -1 (Invalid argument)
[2010-08-07 10:17:14] E [posix.c:2608:posix_writev] posix1: lseek(-2147352576) on fd=0x1047260 failed: Invalid argument
[2010-08-07 10:17:14] D [server-protocol.c:1939:server_writev_cbk] server-tcp: 617618: WRITEV 4 (5423176) ==> -1 (Invalid argument)
[2010-08-07 10:17:14] E [posix.c:2608:posix_writev] posix1: lseek(-2147287040) on fd=0x1047260 failed: Invalid argument
...
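
The negative offsets in these lseek errors are what one would expect if a 64-bit file offset at or just past 2^31 were squeezed into a signed 32-bit integer somewhere along the write path. That is an interpretation of the log, not a confirmed root cause. A minimal Python sketch of the arithmetic follows; the helper name `wrap_int32` and the 64 KiB step (inferred from the difference between successive logged offsets) are assumptions for illustration only.

```python
import struct

def wrap_int32(offset):
    """Reinterpret the low 32 bits of `offset` as a signed 32-bit integer."""
    return struct.unpack("<i", struct.pack("<I", offset & 0xFFFFFFFF))[0]

# Assumed step: successive offsets in the log differ by 65536 bytes.
WRITE_STEP = 64 * 1024

# Offsets copied from the posix_writev errors above.
logged = [-2147483648, -2147418112, -2147352576, -2147287040]

# True offsets 2^31, 2^31 + 64 KiB, ... after truncation to 32 bits.
wrapped = [wrap_int32(2**31 + i * WRITE_STEP) for i in range(4)]

print(wrapped == logged)
```

Under that assumption the four wrapped values match the four logged offsets, and the last offset that still fits is 2^31 - 1, which agrees with the file on vboxserver2 stopping at exactly 2147483648 bytes.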

Looking at the filesystem where the file was still intact:
vboxserver1:/space/glusterfs/oldcurie/virtualbox/bobtest# ls -ld disk.vdi; du -cs disk.vdi 
-rw------- 1 root root 4284522496 Jul 30 19:13 disk.vdi
4188200	disk.vdi
4188200	total
vboxserver1:/space/glusterfs/oldcurie/virtualbox/bobtest# getfattr -dm "" -e hex . disk.vdi 
# file: .
trusted.afr.vboxserver1-1=0x000000000000000000000000
trusted.afr.vboxserver2-1=0x000000000000000000000000
trusted.posix1.gen=0x4c547d7c00000007

# file: disk.vdi
trusted.afr.vboxserver1-1=0x000000000000000000000000
trusted.afr.vboxserver2-1=0x000000020000000000000000
trusted.posix1.gen=0x4c547d7c0000007f
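
For readers decoding the trusted.afr.* values: to my understanding of the replicate translator in this release series, each value is 12 bytes holding three big-endian 32-bit pending counters in the order data, metadata, entry. Treat that layout as an assumption; the helper name `decode_afr_changelog` is hypothetical. A minimal Python sketch:

```python
import struct

def decode_afr_changelog(hex_value):
    """Split a 12-byte trusted.afr.* value into its pending counters.

    Assumed layout: three big-endian unsigned 32-bit integers,
    in the order data, metadata, entry.
    """
    digits = hex_value[2:] if hex_value.startswith("0x") else hex_value
    raw = bytes.fromhex(digits)
    if len(raw) != 12:
        raise ValueError("expected 12 bytes, got %d" % len(raw))
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}

# Value of trusted.afr.vboxserver2-1 on vboxserver1's copy of disk.vdi.
print(decode_afr_changelog("0x000000020000000000000000"))
```

Read this way, vboxserver1's copy records two pending data operations against vboxserver2-1 and nothing pending for metadata or entries, which is consistent with vboxserver1 being the source for the heal.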

vboxserver2:/space/glusterfs/oldcurie/virtualbox/bobtest# ls -ld disk.vdi; du -cs disk.vdi 
-rw------- 1 root root 2147483648 Jul 30 19:13 disk.vdi
2099208	disk.vdi
2099208	total
(note: 2147483648 = 2^31, the "large file" limit)

vboxserver2:/space/glusterfs/oldcurie/virtualbox/bobtest# getfattr -dm "" -e hex . disk.vdi 
# file: .
trusted.afr.vboxserver1-1=0x000000000000000000000000
trusted.afr.vboxserver2-1=0x000000000000000000000000
trusted.posix1.gen=0x4c5c35c10000013c

# file: disk.vdi
trusted.afr.vboxserver1-1=0x000000000000000000000000
trusted.afr.vboxserver2-1=0x000000000000000000000000
trusted.posix1.gen=0x4c5c78030000007b

Some other files are also clipped at 2G. For some other large (>2G) files, however, the self-heal is successful.

I am on a Debian Lenny system, using the packages from www.backports.org. I first saw the errors on 3.0.5 when both servers were still running a 32-bit OS. Changing the second server to a 64-bit OS or downgrading to 3.0.4 apparently gives the same results. This bug report is against 3.0.4, with vboxserver1 running a 32-bit OS and vboxserver2 running a 64-bit OS.

Comment 1 stefaandr 2010-08-09 19:39:29 UTC
The errors do not seem to appear when all performance translators are disabled. 
Server (I suspect it's the most likely culprit):
#volume brick1
#    type performance/io-threads
#    option thread-count 8
#    subvolumes locks1
#end-volume
Client:
#volume readahead
#    type performance/read-ahead
#    option page-count 4
#    subvolumes mirror-0
#end-volume
#
#volume iocache
#    type performance/io-cache
#    option cache-size `echo $(( $(grep 'MemTotal' /proc/meminfo | sed 's/[^0-9]//g') / 5120 ))`MB
#    option cache-timeout 1
#    subvolumes readahead
#end-volume
#
#volume quickread
#    type performance/quick-read
#    option cache-timeout 1
#    option max-file-size 64kB
#    subvolumes iocache
#end-volume
#
#volume writebehind
#    type performance/write-behind
#    option cache-size 4MB
#    subvolumes quickread
#end-volume
#
#volume statprefetch
#    type performance/stat-prefetch
#    subvolumes writebehind
#end-volume
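
As an aside on the iocache cache-size line in the client volfile above: the backtick expression takes MemTotal from /proc/meminfo (reported in kB) and divides it by 5120, which yields roughly one fifth of physical RAM expressed in MB. A minimal Python sketch of the same calculation, where the function name `io_cache_size_mb` and the sample MemTotal value are made up for illustration:

```python
import re

def io_cache_size_mb(meminfo_text):
    """Return MemTotal (kB) // 5120, i.e. about 20% of RAM in MB."""
    match = re.search(r"^MemTotal:\s+(\d+)\s*kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("no MemTotal line found")
    # kB / 1024 gives MB; dividing by a further 5 gives one fifth of that.
    return int(match.group(1)) // 5120

# Hypothetical /proc/meminfo excerpt for a machine with 4 GiB of RAM.
sample = "MemTotal:        4194304 kB\nMemFree:          123456 kB\n"
print("%dMB" % io_cache_size_mb(sample))
```

On a real host the text would come from reading /proc/meminfo instead of the sample string.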

Comment 2 Amar Tumballi 2010-10-05 09:58:18 UTC
This is fixed in both release 3.0.5 and the master (3.1) branch. Please upgrade.

