Bug 761911 (GLUSTER-179) - fileop reports miscompares on read tests
Summary: fileop reports miscompares on read tests
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: GLUSTER-179
Product: GlusterFS
Classification: Community
Component: libglusterfsclient
Version: mainline
Hardware: All
OS: Linux
Priority: urgent
Severity: high
Target Milestone: ---
Assignee: Shehjar Tikoo
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2009-07-30 19:34 UTC by Shehjar Tikoo
Modified: 2009-08-12 05:58 UTC
CC List: 1 user

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: RTP
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments
tshark NFS trace of the fileop run (576.04 KB, application/octet-stream)
2009-07-30 17:11 UTC, Shehjar Tikoo

Description Shehjar Tikoo 2009-07-30 17:11:45 UTC
Created attachment 49 [details]
System info and Addendum#1

This is a trace file that should be opened using wireshark.

The point of this trace is to show the GETATTR request that follows the 2 WRITEs.
This GETATTR request returns the size of 64KiB even though the 2 WRITEs that were handled before the GETATTR wrote 65 KiB of data.

Comment 1 Shehjar Tikoo 2009-07-30 17:40:41 UTC
The bug seems to be in write-behind: wb_stat needs to not just flush/wind all the pending operations on a file but also wait for the unwinds of the fops that have already been sent.

It is possible that the second writev is the fop that has already been wound but not yet unwound. Without waiting for this unwind, the stat is sent to the lower-level translators. One of those translators is replicate, which allows a stat fop to overtake a writev fop.
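
As a standalone toy simulation only (hypothetical names; this is not the actual write-behind translator code), the ordering argued for above would look roughly like this: a stat on an fd is held back until every write already wound on that fd has unwound.

#include <stdio.h>
#include <stdbool.h>

struct toy_file {
        int  writes_in_flight;   /* wound but not yet unwound */
        bool stat_held_back;     /* a stat arrived while writes were in flight */
};

static void wind_stat (struct toy_file *f)
{
        printf ("STAT wound (writes in flight: %d)\n", f->writes_in_flight);
}

static void writev_wind (struct toy_file *f)
{
        f->writes_in_flight++;
        printf ("WRITEV wound (in flight: %d)\n", f->writes_in_flight);
}

static void writev_unwind (struct toy_file *f)
{
        f->writes_in_flight--;
        printf ("WRITEV unwound (in flight: %d)\n", f->writes_in_flight);
        if (f->writes_in_flight == 0 && f->stat_held_back) {
                f->stat_held_back = false;
                wind_stat (f);           /* safe: size now reflects all writes */
        }
}

static void stat_request (struct toy_file *f)
{
        if (f->writes_in_flight > 0)
                f->stat_held_back = true;  /* hold the stat back */
        else
                wind_stat (f);
}

int main (void)
{
        struct toy_file f = {0, false};
        writev_wind (&f);       /* second 1 KiB write wound by write-behind */
        stat_request (&f);      /* stat arrives before the write unwinds */
        writev_unwind (&f);     /* only now is the stat wound */
        return 0;
}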

Comment 2 Shehjar Tikoo 2009-07-30 19:34:46 UTC
Earlier today, a fix was committed for bug 761910 that addresses one reason for the miscompare errors reported by fileop.

Even with that fix committed, the second reason we're seeing miscompares is the so-called short reads being returned to fileop.

For an example, see the following output from one particular run of fileop, where it writes 65 KiB but, on reading the file back, receives only 64 KiB.

==========================================================================
[root@client12 mount]# /root/shehjart/iozone3_326/src/current/fileop 1 -f 1  -s 65K -b -w -e -t
mkdir:   Dirs =         3 Total Time =  0.273663044 seconds
         Avg mkdir(s)/sec     =        10.96 ( 0.091221015 seconds/op)
         Best mkdir(s)/sec    =       169.38 ( 0.005903959 seconds/op)
         Worst mkdir(s)/sec   =         4.31 ( 0.232119083 seconds/op)

chdir:   Dirs =         3 Total Time =  0.000015020 seconds
         Avg chdir(s)/sec     =    199728.76 ( 0.000005007 seconds/op)
         Best chdir(s)/sec    =    262144.00 ( 0.000003815 seconds/op)
         Worst chdir(s)/sec   =    161319.38 ( 0.000006199 seconds/op)

rmdir:   Dirs =         3 Total Time =  0.045747042 seconds
         Avg rmdir(s)/sec     =        65.58 ( 0.015249014 seconds/op)
         Best rmdir(s)/sec    =       559.61 ( 0.001786947 seconds/op)
         Worst rmdir(s)/sec   =        23.74 ( 0.042126894 seconds/op)

WRITE: sz 66560, returned: 66560

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The above line shows that the write syscall returned with all 65 KiB of data written to the NFS server.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
create:  Files =         1 Total Time =  0.003466129 seconds
         Avg create(s)/sec    =       288.51 ( 0.003466129 seconds/op)
         Best create(s)/sec   =       288.51 ( 0.003466129 seconds/op)
         Worst create(s)/sec  =       288.51 ( 0.003466129 seconds/op)

write:   Files =         0 Total Time =  0.000000000 seconds
         Avg write(s)/sec     =          nan (         nan seconds/op)
         Best write(s)/sec    =         0.00 (999999.900000000 seconds/op)
         Worst write(s)/sec   =          inf ( 0.000000000 seconds/op)

close:   Files =         1 Total Time =  0.000005007 seconds
         Avg close(s)/sec     =    199728.76 ( 0.000005007 seconds/op)
         Best close(s)/sec    =    199728.76 ( 0.000005007 seconds/op)
         Worst close(s)/sec   =    199728.76 ( 0.000005007 seconds/op)

stat:    Files =         1 Total Time =  0.000004053 seconds
         Avg stat(s)/sec      =    246723.76 ( 0.000004053 seconds/op)
         Best stat(s)/sec     =    246723.76 ( 0.000004053 seconds/op)
         Worst stat(s)/sec    =    246723.76 ( 0.000004053 seconds/op)

sz :66560, read: 65536 
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The above line is printed right after the read syscall. It shows that the client receives only 64 KiB of data even though the read size was 65 KiB.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

open:    Files =         1 Total Time =  0.000704050 seconds
         Avg open(s)/sec      =      1420.35 ( 0.000704050 seconds/op)
         Best open(s)/sec     =      1420.35 ( 0.000704050 seconds/op)
         Worst open(s)/sec    =      1420.35 ( 0.000704050 seconds/op)

read:    Files =         1 Total Time =  0.003770113 seconds
         Avg read(s)/sec      =       265.24 ( 0.003770113 seconds/op)
         Best read(s)/sec     =       265.24 ( 0.003770113 seconds/op)
         Worst read(s)/sec    =       265.24 ( 0.003770113 seconds/op)

access:  Files =         1 Total Time =  0.000326157 seconds
         Avg access(s)/sec    =      3066.01 ( 0.000326157 seconds/op)
         Best access(s)/sec   =      3066.01 ( 0.000326157 seconds/op)
         Worst access(s)/sec  =      3066.01 ( 0.000326157 seconds/op)

chmod:   Files =         1 Total Time =  0.001621008 seconds
         Avg chmod(s)/sec     =       616.90 ( 0.001621008 seconds/op)
         Best chmod(s)/sec    =       616.90 ( 0.001621008 seconds/op)
         Worst chmod(s)/sec   =       616.90 ( 0.001621008 seconds/op)

readdir: Files =         1 Total Time =  0.002246141 seconds
         Avg readdir(s)/sec   =       445.21 ( 0.002246141 seconds/op)
         Best readdir(s)/sec  =       445.21 ( 0.002246141 seconds/op)
         Worst readdir(s)/sec =       445.21 ( 0.002246141 seconds/op)

link:    Files =         1 Total Time =  0.049368858 seconds
         Avg link(s)/sec      =        20.26 ( 0.049368858 seconds/op)
         Best link(s)/sec     =        20.26 ( 0.049368858 seconds/op)
         Worst link(s)/sec    =        20.26 ( 0.049368858 seconds/op)

unlink:  Files =         1 Total Time =  0.002592087 seconds
         Avg unlink(s)/sec    =       385.79 ( 0.002592087 seconds/op)
         Best unlink(s)/sec   =       385.79 ( 0.002592087 seconds/op)
         Worst unlink(s)/sec  =       385.79 ( 0.002592087 seconds/op)

delete:  Files =         1 Total Time =  0.044466019 seconds
         Avg delete(s)/sec    =        22.49 ( 0.044466019 seconds/op)
         Best delete(s)/sec   =        22.49 ( 0.044466019 seconds/op)
         Worst delete(s)/sec  =        22.49 ( 0.044466019 seconds/op)

===========================================================================

Once the read is done, fileop performs a byte-by-byte comparison of the buffer it had previously written against the buffer it just read back. Because of the short read, only part of the buffer read back matches what was written to the server; the bytes beyond the ending offset of the received data can contain anything, so fileop reports miscompares.

The client receives only 64 KiB because it requests only that much, not 65 KiB. That in turn happens because, after the client writes 65 KiB of data, it sends an NFS GETATTR request which, surprisingly, returns a file size of 64 KiB. Since the client is told a different size than what it wrote, the NFS client updates its caches. When it comes time to read, it sends only a single 64 KiB read request, since it believes the file is no bigger than that. This 64 KiB buffer is then handed to the fileop read syscall.
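
For reference, the failing sequence can be approximated with plain POSIX calls; this is only a sketch of the reproduction steps (the path is illustrative), not fileop's actual code.

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define SZ 66560   /* 65 KiB, as in the run above */

int main (void)
{
        const char *path = "/mnt/testpath/fileop_repro";   /* illustrative path */
        char *wbuf = malloc (SZ), *rbuf = malloc (SZ);
        memset (wbuf, 'A', SZ);

        int fd = open (path, O_CREAT | O_WRONLY | O_TRUNC, 0600);
        ssize_t written = write (fd, wbuf, SZ);
        printf ("WRITE: sz %d, returned: %zd\n", SZ, written);
        close (fd);

        fd = open (path, O_RDONLY);
        ssize_t got = read (fd, rbuf, SZ);   /* returns 65536 when the bug hits */
        printf ("sz :%d, read: %zd\n", SZ, got);
        close (fd);

        /* fileop then compares the buffers byte by byte; with a short read
         * the tail of rbuf holds unwritten data and the comparison fails. */
        if (got != SZ || memcmp (wbuf, rbuf, SZ) != 0)
                fprintf (stderr, "miscompare / short read\n");

        free (wbuf);
        free (rbuf);
        return 0;
}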


Looking further into the booster/libglusterfsclient logs, this is what I see:
<only pasting relevant portion>
============================================================================
[2009-07-30 06:59:34] N [trace.c:1590:trace_open] tr: 48: (loc {path=/fileop_L1_0/fileop_L1_0_L2_0/fileop_file_0_0_0, ino=5672402965}, flags=1, fd=0x2aaaac
013db0)

[2009-07-30 06:59:34] N [trace.c:116:trace_open_cbk] tr: 48: (op_ret=0, op_errno=117, *fd=0x2aaaac013db0)

[2009-07-30 06:59:34] N [trace.c:1669:trace_writev] tr: 49: (*fd=0x2aaaac013db0, *vector=0x7fff61d27930, count=1, offset=0)

[2009-07-30 06:59:34] N [trace.c:226:trace_writev_cbk] tr: 49: (op_ret=65536, *buf {st_ino=5672402965, st_size=65536, st_blocks=0, st_atime=[Jul 30 12:01:5
2], st_mtime=[Jul 30 12:01:52], st_ctime=[Jul 30 12:01:52]})

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The above trace output shows the first 64 KiB write, since we've mounted the NFS client with a [wr]size of 64 KiB. The stat in the callback shows that the data has been written and the size has increased to 64 KiB. This stat is filled with non-zero data because write-behind is disabled for this file, since unfs3 might have requested it.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

[2009-07-30 06:59:34] D [libglusterfsclient.c:1409:libgf_vmp_search_entry] libglusterfsclient: VMP Entry found: /testpath//fileop_L1_0/fileop_L1_0_L2_0/fil
eop_file_0_0_0: /testpath/
[2009-07-30 06:59:34] D [libglusterfsclient-dentry.c:366:libgf_client_path_lookup] libglusterfsclient: resolved path(/fileop_L1_0/fileop_L1_0_L2_0/fileop_f
ile_0_0_0) to 5672402961/5672402965


[2009-07-30 06:59:34] N [trace.c:1267:trace_stat] tr: 50: (loc {path=/fileop_L1_0/fileop_L1_0_L2_0/fileop_file_0_0_0, ino=5672402965})

[2009-07-30 06:59:34] N [trace.c:150:trace_stat_cbk] tr: 50: (op_ret=0, buf {st_dev=2065, st_ino=5672402965, st_mode=100600, st_nlink=1, st_uid=0, st_gid=0
, st_rdev=0, st_size=65536, st_blksize=4096, st_blocks=0, st_atime=[Jul 30 12:01:52], st_mtime=[Jul 30 12:01:52], st_ctime=[Jul 30 12:01:52]})
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
A subsequent stat cbk above confirms that the file is actually 64KiB in size now.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@


[2009-07-30 06:59:34] D [libglusterfsclient.c:1409:libgf_vmp_search_entry] libglusterfsclient: VMP Entry found: /testpath//fileop_L1_0/fileop_L1_0_L2_0/fil
eop_file_0_0_0: /testpath/
[2009-07-30 06:59:34] D [libglusterfsclient-dentry.c:366:libgf_client_path_lookup] libglusterfsclient: resolved path(/fileop_L1_0/fileop_L1_0_L2_0/fileop_f
ile_0_0_0) to 5672402961/5672402965


[2009-07-30 06:59:34] N [trace.c:1669:trace_writev] tr: 51: (*fd=0x2aaaac013db0, *vector=0x7fff61d27930, count=1, offset=65536)

[2009-07-30 06:59:34] N [trace.c:226:trace_writev_cbk] tr: 51: (op_ret=1024, *buf {st_ino=0, st_size=0, st_blocks=0, st_atime=[Dec 31 16:00:00], st_mtime=[
Dec 31 16:00:00], st_ctime=[Dec 31 16:00:00]})
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Then we see that the next write that comes in starts at offset 65536, as expected. Since we're writing 65 KiB of data, I've confirmed through wireshark that the write data in the vector argument is 1024 bytes. Since write-behind was enabled on this file open, the stat cbk shows a zeroed-out structure.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@



[2009-07-30 06:59:34] D [libglusterfsclient.c:1409:libgf_vmp_search_entry] libglusterfsclient: VMP Entry found: /testpath//fileop_L1_0/fileop_L1_0_L2_0/fil
eop_file_0_0_0: /testpath/
[2009-07-30 06:59:34] D [libglusterfsclient-dentry.c:366:libgf_client_path_lookup] libglusterfsclient: resolved path(/fileop_L1_0/fileop_L1_0_L2_0/fileop_f
ile_0_0_0) to 5672402961/5672402965

[2009-07-30 06:59:34] N [trace.c:1267:trace_stat] tr: 52: (loc {path=/fileop_L1_0/fileop_L1_0_L2_0/fileop_file_0_0_0, ino=5672402965})

[2009-07-30 06:59:34] N [trace.c:150:trace_stat_cbk] tr: 52: (op_ret=0, buf {st_dev=2065, st_ino=5672402965, st_mode=100600, st_nlink=1, st_uid=0, st_gid=0
, st_rdev=0, st_size=65536, st_blksize=4096, st_blocks=0, st_atime=[Jul 30 12:01:52], st_mtime=[Jul 30 12:01:52], st_ctime=[Jul 30 12:01:52]})
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
This is a big catch. After the second write above, we still find that the stat callback returns a structure with a size of 64 KiB, not 65 KiB.
At this point, I can only think that this is because the writing behind of the previous write is not complete yet.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@

I am not sure where this bug is coming from. One could say that inode attr caching in libglusterfsclient is a suspect, but seeing a lower-layer translator return an incorrect size is a bigger worry.

Comment 3 Raghavendra G 2009-07-31 07:21:09 UTC
(In reply to comment #2)
> The bug seems to be in write-behind where the wb_stat, needs to not just
> flush/wind all the pending operations on a file but also wait for the unwinds
> from the fops that have already been sent.
> 
> It is possible that the second writev is the fop that has already been wound
> but not yet unwound. Without the waiting for this unwind, stat is sent to lower
> level translators. One of those translators is replicate which allows a stat
> fop to overtake a writev fop.

I walked through the code and found that such a situation is not possible. The stat is held back from winding until the reply to the write has come.

Comment 4 Shehjar Tikoo 2009-07-31 07:50:59 UTC
(In reply to comment #3)
> (In reply to comment #2)
> > The bug seems to be in write-behind where the wb_stat, needs to not just
> > flush/wind all the pending operations on a file but also wait for the unwinds
> > from the fops that have already been sent.
> > 
> > It is possible that the second writev is the fop that has already been wound
> > but not yet unwound. Without the waiting for this unwind, stat is sent to lower
> > level translators. One of those translators is replicate which allows a stat
> > fop to overtake a writev fop.
> 
> I walked through the code and found that such situation is not possible. Stat
> is held back from winding till reply to write has come.

There is some truth to what you're saying, because if there were any way for the stat to have executed before the writev, trace_writev_cbk would have been logged after trace_stat_cbk.

Comment 5 Shehjar Tikoo 2009-07-31 08:56:03 UTC
Here is more trace output from a run we just did:
We've loaded one trace translator above write-behind, one below it, and one trace each on the 4 protocol/clients.
Credit for this debugging approach goes to Raghu.


[2009-07-30 23:09:27] N [trace.c:1669:trace_writev] tr: 77: (*fd=0x1e4671e0, *vector=0x7fffe0c89980, count=1, offset=65536)
[2009-07-30 23:09:27] N [trace.c:226:trace_writev_cbk] tr: 77: (op_ret=1024, *buf {st_ino=0, st_size=0, st_blocks=0, st_atime=[Dec 31 16:00:00], st_mtime=[
Dec 31 16:00:00], st_ctime=[Dec 31 16:00:00]})

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
The two lines above show that the top level trace writev call is unwound by write-behind, as expected.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@



[2009-07-30 23:09:27] N [trace.c:1669:trace_writev] tr-below-wb: 77: (*fd=0x1e4671e0, *vector=0x1e4677c0, count=1, offset=65536)
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Write-behind then hands that writev to the trace/subvolume below it.
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@


[2009-07-30 23:09:27] N [trace.c:1267:trace_stat] tr: 78: (loc {path=/fileop_L1_0/fileop_L1_0_L2_0/fileop_file_0_0_0, ino=67294997})

[2009-07-30 23:09:27] N [trace.c:1267:trace_stat] tr-below-wb: 78: (loc {path=/fileop_L1_0/fileop_L1_0_L2_0/fileop_file_0_0_0, ino=67294997})

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
Then, and this is unexpected, a stat call comes in, which write-behind hands to the lower translator without waiting for the previous write call to unwind.

The subsequent log lines just show how this stat call is then re-ordered inside replicate and sent before the write, resulting in a stat struct with a size of 64 KiB.

@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@


[2009-07-30 23:09:27] N [trace.c:1267:trace_stat] tr-brick4: 78: (loc {path=/fileop_L1_0/fileop_L1_0_L2_0/fileop_file_0_0_0, ino=67294997})
[2009-07-30 23:09:27] N [trace.c:1154:trace_finodelk_cbk] tr-brick5: 77: op_ret=0, op_errno=0
[2009-07-30 23:09:27] N [trace.c:1669:trace_writev] tr-brick5: 77: (*fd=0x1e4671e0, *vector=0x1e455850, count=1, offset=65536)
[2009-07-30 23:09:27] N [trace.c:1669:trace_writev] tr-brick4: 77: (*fd=0x1e4671e0, *vector=0x1e455850, count=1, offset=65536)
[2009-07-30 23:09:27] N [trace.c:150:trace_stat_cbk] tr-brick4: 78: (op_ret=0, buf {st_dev=2065, st_ino=33647498, st_mode=100600, st_nlink=1, st_uid=0, st_
gid=0, st_rdev=0, st_size=65536, st_blksize=4096, st_blocks=0, st_atime=[Jul 31 04:11:48], st_mtime=[Jul 31 04:11:48], st_ctime=[Jul 31 04:11:48]})
[2009-07-30 23:09:27] N [trace.c:150:trace_stat_cbk] tr-below-wb: 78: (op_ret=0, buf {st_dev=2065, st_ino=67294997, st_mode=100600, st_nlink=1, st_uid=0, s
t_gid=0, st_rdev=0, st_size=65536, st_blksize=4096, st_blocks=0, st_atime=[Jul 31 04:11:48], st_mtime=[Jul 31 04:11:48], st_ctime=[Jul 31 04:11:48]})
[2009-07-30 23:09:27] N [trace.c:150:trace_stat_cbk] tr: 78: (op_ret=0, buf {st_dev=2065, st_ino=67294997, st_mode=100600, st_nlink=1, st_uid=0, st_gid=0, 
st_rdev=0, st_size=65536, st_blksize=4096, st_blocks=0, st_atime=[Jul 31 04:11:48], st_mtime=[Jul 31 04:11:48], st_ctime=[Jul 31 04:11:48]})
[2009-07-30 23:09:27] N [trace.c:226:trace_writev_cbk] tr-brick5: 77: (op_ret=1024, *buf {st_ino=1418100741, st_size=66560, st_blocks=152, st_atime=[Jul 31
 04:11:48], st_mtime=[Jul 31 04:11:48], st_ctime=[Jul 31 04:11:48]})
[2009-07-30 23:09:27] N [trace.c:226:trace_writev_cbk] tr-brick4: 77: (op_ret=1024, *buf {st_ino=33647498, st_size=66560, st_blocks=0, st_atime=[Jul 31 04:
11:48], st_mtime=[Jul 31 04:11:48], st_ctime=[Jul 31 04:11:48]})
[2009-07-30 23:09:27] N [trace.c:226:trace_writev_cbk] tr-below-wb: 77: (op_ret=1024, *buf {st_ino=67294997, st_size=66560, st_blocks=0, st_atime=[Jul 31 0
4:11:48], st_mtime=[Jul 31 04:11:48], st_ctime=[Jul 31 04:11:48]})

Comment 6 Shehjar Tikoo 2009-07-31 09:02:11 UTC
The previous output shows that the stat call is not being ordered on the fd.
Since Raghu was sure the wait-till-unwind mechanism is bug-free, we investigated the behaviour of wb_stat and why the stat is sent right away.

See the gdb log below.

====================================================================

Breakpoint 2, wb_stat (frame=0x14bb0ee0, this=0x14b94500, loc=0x7fff8b8e43a0) at write-behind.c:483
483             wb_file_t   *file = NULL;
(gdb) n
484             fd_t        *iter_fd = NULL;
(gdb)
485             wb_local_t  *local = NULL;
(gdb)
486             uint64_t     tmp_file = 0;
(gdb)
487             call_stub_t *stub = NULL;
(gdb)
489             if (loc->inode) {
(gdb)
491                     iter_fd = fd_lookup (loc->inode, frame->root->pid);
(gdb) n
492                     if (iter_fd) {
(gdb) p iter_fd
$1 = (fd_t *) 0x0
(gdb) p frame->root->pid
$2 = 14620

This is the first sign of the cause of the bug. The stat overtakes the pending write because write-behind
cannot find the fd on which it should order the stat.


========================================================================
Stepping into the fd_lookup:

Breakpoint 2, wb_stat (frame=0x2aaaac003cf0, this=0x14b94500, loc=0x7fff8b8e43a0) at write-behind.c:483
483             wb_file_t   *file = NULL;
(gdb) n
484             fd_t        *iter_fd = NULL;
(gdb) n
485             wb_local_t  *local = NULL;
(gdb) 
486             uint64_t     tmp_file = 0;
(gdb) 
487             call_stub_t *stub = NULL;
(gdb) 
489             if (loc->inode) {
(gdb) 
491                     iter_fd = fd_lookup (loc->inode, frame->root->pid);
(gdb) s
fd_lookup (inode=0x14bb3750, pid=14620) at fd.c:545
545             fd_t *fd = NULL;
(gdb) n
546             fd_t *iter_fd = NULL;
(gdb) n
548             LOCK (&inode->lock);
(gdb) n
550                     if (list_empty (&inode->fd_list)) {
(gdb) n
551                             fd = NULL;
(gdb) n
566             UNLOCK (&inode->lock);
(gdb) n
568             return fd;


Well, this is just incredible. We don't even have any fd associated with the inode.

Further investigation showed that libglusterfsclient does not call fd_bind after an fd_create, which is why the fd_t does not get associated with the inode.
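
As a rough illustration only (the fd_create/fd_bind/fd_lookup names come from the report above, but the signatures are assumed and this is not the literal patch from comments 7 and 8), the fd-creation path needs something like:

/* Sketch of the missing step in the fd-creation path; not the actual patch. */
fd_t *
libgf_open_fd_sketch (inode_t *inode, pid_t pid)
{
        fd_t *fd = fd_create (inode, pid);
        if (!fd)
                return NULL;

        /* Without this call the fd is never added to the inode's fd list,
         * so fd_lookup (inode, pid) in wb_stat returns NULL and the stat
         * is wound without being ordered against the pending writes. */
        return fd_bind (fd);
}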

Comment 7 Anand Avati 2009-07-31 15:40:35 UTC
PATCH: http://patches.gluster.com/patch/848 in master (libglusterfsclient: Bind fd on fd creation code paths)

Comment 8 Anand Avati 2009-07-31 15:40:43 UTC
PATCH: http://patches.gluster.com/patch/847 in release-2.0 (libglusterfsclient: Bind fd on fd creation code paths)

