Bug 763972 - (GLUSTER-2240) Solaris client hangs on file read operations
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: nfs
Version: 3.1.1
Hardware: All
OS: Linux
Priority: low
Severity: medium
Assigned To: Raghavendra G
Reported: 2010-12-20 06:20 EST by Shehjar Tikoo
Modified: 2015-12-01 11:45 EST
Doc Type: Bug Fix
Regression: RTP
Mount Type: nfs
Documentation: DNR

Description Shehjar Tikoo 2010-12-20 06:20:53 EST
Even a cat on a 12 byte file hangs the Solaris client. The problem is somewhere in the translator stack on the read callback path.

Some translator is not propagating the op_errno=ENOENT which is set by posix on seeing an EOF. NFS uses op_errno = ENOENT to let clients know that end of file was reached while reading a file. It has worked till now because the Linux client never reads beyond the file size given in the file's attributes. The Solaris client, on the other hand, keeps reading until the eof flag is set in the read reply.
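
The convention described above can be sketched in a few lines of C. This is only an illustration, not GlusterFS code: the struct and both functions (read_reply, posix_like_read, nfs_like_is_eof) are hypothetical names, and the exact condition under which posix sets ENOENT is an assumption (here: the read reaches or passes the end of the file).

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical, simplified read reply passed up the translator stack. */
struct read_reply {
    ssize_t op_ret;   /* bytes read, or -1 on failure */
    int     op_errno; /* ENOENT doubles as the "EOF reached" marker */
};

/* Stand-in for the bottom of the stack (posix): read `count` bytes at
 * `offset` from a file of `file_size` bytes. Assumption: ENOENT is set
 * whenever the read reaches or passes the end of the file. */
struct read_reply posix_like_read(size_t file_size, size_t offset, size_t count)
{
    struct read_reply r = { 0, 0 };
    size_t avail = offset < file_size ? file_size - offset : 0;

    r.op_ret = (ssize_t)(count < avail ? count : avail);
    if (offset + (size_t)r.op_ret >= file_size)
        r.op_errno = ENOENT;
    return r;
}

/* Stand-in for the top of the stack (NFS server): derive the eof flag
 * of the READ reply from a successful read carrying op_errno == ENOENT. */
bool nfs_like_is_eof(struct read_reply r)
{
    return r.op_ret >= 0 && r.op_errno == ENOENT;
}
```

With these definitions, reading 4096 bytes at offset 0 of a 12 byte file yields op_ret 12 and an eof flag of true, provided every layer in between hands op_errno through unchanged.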

For example, the first NFS read request looks like:

 nfs-nfsv3: XID: cea10fa0, READ: args: FH: hashcount 2, exportid c99db2fc-ab91-406d-a3a5-acc7c2b672d8, gfid 8c862c6b-00d7-4753-b76a-21544ed94363, offset: 0,  count: 4096

i.e. a request to read 4 KB starting at offset 0, to which the NFS server replies with the correct data:

 XID: cea10fa0, READ: NFS: 0(Call completed successfully.), POSIX: -1(Unknown error 18446744073709551615), count: 12, is_eof: 0, vector: count: 1, len: 12

But the EOF bit is not set even though the file is only 12 bytes long, so the Solaris client sends another read request:

XID: cfa10fa0, READ: args: FH: hashcount 2, exportid c99db2fc-ab91-406d-a3a5-acc7c2b672d8, gfid 8c862c6b-00d7-4753-b76a-21544ed94363, offset: 12,  count: 4084

This time the read starts at offset 12, to which the NFS server replies:

XID: cfa10fa0, READ: NFS: 0(Call completed successfully.), POSIX: -1(Unknown error 18446744073709551615), count: 0, is_eof: 0

i.e. it returns no data and still does not set EOF.
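
The difference between the two client behaviours described above can be modelled roughly as follows. This is a simplified simulation, not real client or server code: all names are hypothetical, the propagate_eof switch stands in for whether the EOF information survives the translator stack, and the client requests a fixed chunk size on every call rather than the shrinking count seen in the log. The max_calls bound exists only so that the "hang" shows up as a return value instead of an endless loop.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced NFSv3 READ result. */
struct nfs_read_res {
    size_t count;  /* bytes returned */
    bool   is_eof; /* eof flag in the reply */
};

/* Simulated server. propagate_eof == false models the bug: the data is
 * correct but the eof flag is never set. */
struct nfs_read_res server_read(size_t file_size, size_t offset,
                                size_t count, bool propagate_eof)
{
    size_t avail = offset < file_size ? file_size - offset : 0;
    struct nfs_read_res res;

    res.count = count < avail ? count : avail;
    res.is_eof = propagate_eof && (offset + res.count >= file_size);
    return res;
}

/* Solaris-like client: keeps issuing READs until the eof flag is set.
 * Returns the number of READ calls, or -1 if it gave up after max_calls
 * (which corresponds to the hang). */
int read_until_eof(size_t file_size, size_t chunk,
                   bool propagate_eof, int max_calls)
{
    size_t offset = 0;

    for (int calls = 1; calls <= max_calls; calls++) {
        struct nfs_read_res res =
            server_read(file_size, offset, chunk, propagate_eof);
        offset += res.count;
        if (res.is_eof)
            return calls;
    }
    return -1;
}

/* Linux-like client: stops once it has read the size reported in the
 * file's attributes and never looks at the eof flag. */
int read_until_size(size_t file_size, size_t chunk,
                    bool propagate_eof, int max_calls)
{
    size_t offset = 0;
    int calls = 0;

    while (offset < file_size && calls < max_calls) {
        struct nfs_read_res res =
            server_read(file_size, offset, chunk, propagate_eof);
        calls++;
        if (res.count == 0)
            break;
        offset += res.count;
    }
    return calls;
}
```

For a 12 byte file and 4096 byte requests, read_until_eof finishes after one call when the flag is propagated and gives up (returns -1) when it is not, while read_until_size finishes after one call in both cases.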
Comment 1 Shehjar Tikoo 2010-12-21 00:44:23 EST
The bug is somewhere in io-cache, where it fails to propagate the op_errno from its subvolume to its parent. Still figuring out if there can be a quick fix.
Comment 2 Shehjar Tikoo 2010-12-21 01:28:43 EST
I think the bug is somewhere in ioc_fault_cbk where we need to copy the op_errno so that it gets propagated to parent xlator.
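
A rough sketch of what "copying the op_errno" means for a read callback in an intermediate translator is shown below. It is not the actual ioc_fault_cbk code: the struct and both functions are hypothetical, and modelling the lost value as -1 is an assumption based only on the "POSIX: -1" field in the log lines of the description.

```c
#include <errno.h>
#include <sys/types.h>

/* Hypothetical pair of values a read callback receives from its
 * subvolume and hands on to its parent. */
struct read_cbk_args {
    ssize_t op_ret;   /* bytes read, or -1 on failure */
    int     op_errno; /* ENOENT marks EOF on a successful read */
};

/* Buggy variant: forwards op_ret but not the child's op_errno, so the
 * EOF marker is lost on the way up. */
struct read_cbk_args cache_cbk_buggy(struct read_cbk_args from_child)
{
    struct read_cbk_args to_parent;

    to_parent.op_ret = from_child.op_ret;
    to_parent.op_errno = -1; /* assumption: stale value instead of a copy */
    return to_parent;
}

/* Fixed variant: copies both fields, so ENOENT reaches the parent. */
struct read_cbk_args cache_cbk_fixed(struct read_cbk_args from_child)
{
    struct read_cbk_args to_parent;

    to_parent.op_ret = from_child.op_ret;
    to_parent.op_errno = from_child.op_errno;
    return to_parent;
}
```

Given a child reply of { 12, ENOENT }, only the fixed variant still carries ENOENT when it reaches the parent, which is what the NFS server needs in order to set the eof flag.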
Comment 3 Shehjar Tikoo 2010-12-21 01:34:13 EST
(In reply to comment #1)
> The bug is somewhere in io-cache, where it fails to propagate the op_errno from
> its subvolume to its parent. Still figuring out if there can be a quick fix.

Confirmed that by removing other translators one by one. Adding io-cache introduces the bug.
Comment 4 Anand Avati 2011-02-18 23:32:41 EST
PATCH: http://patches.gluster.com/patch/6115 in master (performance/quick-read: disable caching for fds opened with GF_OPEN_NOWB flags.)
Comment 5 Amar Tumballi 2011-04-13 01:08:58 EDT
The issue is fixed, so we don't need to document this bug as a known issue.
Comment 6 Saurabh 2011-04-15 02:26:35 EDT
bash-3.00# ls -li f.3
6590356411317675395 -rw-r--r--   1 root     root          12 Apr 15 14:54 f.3
bash-3.00# cat f.3
ddd
aaa
ggg
bash-3.00# mount | grep nfs-test
/mnt/nfs-test on nfs://10.1.12.134:38467/dist4 remote/read/write/setuid/devices/proto=tcp/vers=3/xattr/dev=4b40002 on Fri Apr 15 14:58:55 2011
bash-3.00# 

cat on the 12 byte file didn't hang.
