763972 – (GLUSTER-2240) Solaris client hangs on file read operations

Summary: Solaris client hangs on file read operations

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	GLUSTER-2240
Product:	GlusterFS
Classification:	Community
Component:	nfs
Sub Component:
Version:	3.1.1
Hardware:	All
OS:	Linux
Priority:	low
Severity:	medium
Target Milestone:	---
Assignee:	Raghavendra G
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2010-12-20 11:20 UTC by Shehjar Tikoo
Modified:	2015-12-01 16:45 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:
Regression:	RTP
Mount Type:	nfs
Documentation:	DNR
CRM:
Verified Versions:
Embargoed:
Dependent Products:


Description Shehjar Tikoo 2010-12-20 11:20:53 UTC

Even a cat on a 12-byte file hangs the Solaris client. The problem is somewhere in the translator stack, on the read callback path.

Some translator is not propagating the op_errno=ENOENT that posix sets on seeing an EOF. NFS uses op_errno = ENOENT to let clients know that end of file was reached while reading a file. This has worked until now because the Linux client never reads beyond the file size given in the file's attributes. The Solaris client, on the other hand, keeps reading until the eof flag is set in the read reply.

For example, the first NFS read request looks like:
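
A minimal Python model of the mechanism described above, not GlusterFS code: every function name here (posix_readv, passthrough_cbk, lossy_cbk, nfs_read_reply) is hypothetical, and the 12-byte file content is an assumption. It only illustrates how a callback that resets op_errno on the way up makes the NFS layer lose the EOF signal.

```python
import errno

# Assumed 12-byte file content, for illustration only.
FILE_DATA = b"ddd\naaa\nggg\n"

def posix_readv(offset, count):
    """Bottom layer of the model: return (data, op_errno).
    op_errno is ENOENT when the read reaches end of file."""
    data = FILE_DATA[offset:offset + count]
    at_eof = offset + len(data) >= len(FILE_DATA)
    return data, (errno.ENOENT if at_eof else 0)

def passthrough_cbk(data, op_errno):
    """A well-behaved translator callback: hands op_errno to its parent."""
    return data, op_errno

def lossy_cbk(data, op_errno):
    """A buggy translator callback: returns the data but drops op_errno."""
    return data, 0

def nfs_read_reply(offset, count, cbk):
    """Top layer of the model: derive is_eof from op_errno == ENOENT."""
    data, op_errno = cbk(*posix_readv(offset, count))
    return {"count": len(data), "is_eof": int(op_errno == errno.ENOENT)}

print(nfs_read_reply(0, 4096, passthrough_cbk))
print(nfs_read_reply(0, 4096, lossy_cbk))
```

With the pass-through callback the reply carries is_eof set; with the lossy one the same 12 bytes come back with is_eof clear, which matches the symptom in the logs.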

 nfs-nfsv3: XID: cea10fa0, READ: args: FH: hashcount 2, exportid c99db2fc-ab91-406d-a3a5-acc7c2b672d8, gfid 8c862c6b-00d7-4753-b76a-21544ed94363, offset: 0,  count: 4096

i.e. a request to read 4 KB starting at offset 0, to which the NFS server correctly returns the 12 bytes of data:

 XID: cea10fa0, READ: NFS: 0(Call completed successfully.), POSIX: -1(Unknown error 18446744073709551615), count: 12, is_eof: 0, vector: count: 1, len: 12

But the EOF bit is not set, even though the file is only 12 bytes long, so Solaris sends another read request:

XID: cfa10fa0, READ: args: FH: hashcount 2, exportid c99db2fc-ab91-406d-a3a5-acc7c2b672d8, gfid 8c862c6b-00d7-4753-b76a-21544ed94363, offset: 12,  count: 4084

This time the read starts at offset 12, and the NFS server replies:

XID: cfa10fa0, READ: NFS: 0(Call completed successfully.), POSIX: -1(Unknown error 18446744073709551615), count: 0, is_eof: 0

i.e. it returns no data and still does not set EOF.
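
The exchange above can be replayed with a small simulation. This is a sketch under stated assumptions, not actual client code: server_read, read_until_eof and read_up_to_size are hypothetical names, and the retry behaviour after a zero-byte reply is assumed. The second request's count of 4084 is simply 4096 - 12, the rest of the 4 KB block. A request cap stands in for the hang.

```python
def server_read(file_size, offset, count, set_eof):
    """Model server: return (bytes_returned, eof_flag).
    set_eof=False mimics the bug, where the eof flag is never set."""
    n = max(0, min(count, file_size - offset))
    eof = set_eof and offset + n >= file_size
    return n, eof

def read_until_eof(file_size, set_eof, block=4096, max_requests=5):
    """Solaris-like model: keep requesting the rest of the block until
    the reply has the eof flag. Returns (requests, finished)."""
    requests, offset = [], 0
    while len(requests) < max_requests:
        count = block - (offset % block)
        requests.append((offset, count))
        n, eof = server_read(file_size, offset, count, set_eof)
        offset += n
        if eof:
            return requests, True
    return requests, False  # cap reached: the real client would hang

def read_up_to_size(file_size, set_eof, block=4096):
    """Linux-like model: never read past the size from the attributes."""
    requests, offset = [], 0
    while offset < file_size:
        requests.append((offset, block))
        n, _ = server_read(file_size, offset, block, set_eof)
        if n == 0:
            break
        offset += n
    return requests

print(read_until_eof(12, set_eof=False))
print(read_until_eof(12, set_eof=True))
print(read_up_to_size(12, set_eof=False))
```

In this model the size-bounded reader finishes after one request whether or not eof is set, while the eof-driven reader repeats the (12, 4084) request until the cap is hit.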

Comment 1 Shehjar Tikoo 2010-12-21 05:44:23 UTC

The bug is somewhere in io-cache, where it fails to propagate the op_errno from its subvolume to its parent. Still figuring out if there can be a quick fix.

Comment 2 Shehjar Tikoo 2010-12-21 06:28:43 UTC

I think the bug is somewhere in ioc_fault_cbk where we need to copy the op_errno so that it gets propagated to parent xlator.

Comment 3 Shehjar Tikoo 2010-12-21 06:34:13 UTC

(In reply to comment #1)
> The bug is somewhere in io-cache, where it fails to propagate the op_errno from
> its subvolume to its parent. Still figuring out if there can be a quick fix.

Confirmed that by removing other translators one by one. Adding io-cache introduces the bug.
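
The elimination approach can be sketched as follows. This is an illustrative model only: the XLATORS table, the placeholder names xlator-a and xlator-b, and both helper functions are hypothetical, and the only behaviour taken from this report is that adding io-cache loses op_errno.

```python
import errno

def propagate(op_errno):
    """Callback that passes op_errno up unchanged."""
    return op_errno

def drop(op_errno):
    """Callback that loses op_errno on the way up."""
    return 0

# Hypothetical stack members; only io-cache's behaviour comes from the report.
XLATORS = {"xlator-a": propagate, "io-cache": drop, "xlator-b": propagate}

def eof_reaches_nfs(stack):
    """Run ENOENT up the stack (bottom to top) and check it survives."""
    op_errno = errno.ENOENT
    for name in stack:
        op_errno = XLATORS[name](op_errno)
    return op_errno == errno.ENOENT

def find_culprits(stack):
    """Remove one translator at a time; a translator is a culprit if the
    EOF signal survives once it is gone."""
    return [name for name in stack
            if eof_reaches_nfs([n for n in stack if n != name])]

print(find_culprits(["xlator-a", "io-cache", "xlator-b"]))
```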

Comment 4 Anand Avati 2011-02-19 04:32:41 UTC

PATCH: http://patches.gluster.com/patch/6115 in master (performance/quick-read: disable caching for fds opened with GF_OPEN_NOWB flags.)

Comment 5 Amar Tumballi 2011-04-13 05:08:58 UTC

The issue is fixed, so we don't need to document this bug as a known issue.

Comment 6 Saurabh 2011-04-15 06:26:35 UTC

bash-3.00# ls -li f.3
6590356411317675395 -rw-r--r--   1 root     root          12 Apr 15 14:54 f.3
bash-3.00# cat f.3
ddd
aaa
ggg
bash-3.00# mount | grep nfs-test
/mnt/nfs-test on nfs://10.1.12.134:38467/dist4 remote/read/write/setuid/devices/proto=tcp/vers=3/xattr/dev=4b40002 on Fri Apr 15 14:58:55 2011
bash-3.00# 

cat on the 12-byte file didn't hang.