Bug 762310 (GLUSTER-578)

Summary: touch over distribute of 4 bricks randomly fails with ENOENT
Product: [Community] GlusterFS
Reporter: Shehjar Tikoo <shehjart>
Component: distribute
Assignee: Anand Avati <aavati>
Status: CLOSED NOTABUG
Severity: medium
Priority: low
Version: mainline
CC: anush, chrisw, fharshav, gluster-bugs, jdarcy
Hardware: All
OS: Linux
Doc Type: Bug Fix
Regression: RTNR
Attachments:
Description            Flags
Server volume config   none
Client log             none
Client volume spec     none

Description Shehjar Tikoo 2010-01-28 08:12:18 UTC
Created attachment 140 [details]
Patch for sgml-tools-1.0.9-2.i386.rpm

Comment 1 Shehjar Tikoo 2010-01-28 08:12:45 UTC
Created attachment 141 [details]
Patch to fix the problem

Comment 2 Shehjar Tikoo 2010-01-28 08:13:05 UTC
Created attachment 142 [details]
The new bad XF86Config file

Comment 3 Shehjar Tikoo 2010-01-28 11:11:44 UTC
As the subject says, with a 3.0.2rc1 server and client, touch returns "No such file or directory" with no discernible pattern. Sometimes touch works fine; at other times, it fails.

The brick and client vol files and the client log are attached. This was first observed on the storage platform.

For the relevant part of the client log, search for the file "s9". A touch on this non-existent file fails.

FWIW, I think the relevant portion of the client log is:

[2010-01-28 15:58:15] D [client-protocol.c:7023:notify] box3-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:15] D [client-protocol.c:7023:notify] box3-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:15] D [client-protocol.c:6160:client_setvolume_cbk] box3-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:15] D [client-protocol.c:6160:client_setvolume_cbk] box3-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:21] D [client-protocol.c:7023:notify] box5-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:21] D [client-protocol.c:6160:client_setvolume_cbk] box5-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:21] D [dht-diskusage.c:71:dht_du_info_cbk] distribute: on subvolume 'box2-1': avail_percent is: 99.00 and avail_space is: 980362805248
[2010-01-28 15:58:21] D [dht-diskusage.c:71:dht_du_info_cbk] distribute: on subvolume 'box1-1': avail_percent is: 99.00 and avail_space is: 980362801152
[2010-01-28 15:58:21] D [client-protocol.c:7023:notify] box3-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:21] D [client-protocol.c:6160:client_setvolume_cbk] box3-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:21] D [client-protocol.c:7023:notify] box5-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:21] D [client-protocol.c:6160:client_setvolume_cbk] box5-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:25] D [dht-layout.c:184:dht_layout_search] distribute: no subvolume for hash (value) = 3833074946
[2010-01-28 15:58:25] D [dht-helper.c:235:dht_subvol_get_hashed] distribute: could not find subvolume for path=/s9
[2010-01-28 15:58:25] D [dht-common.c:864:dht_lookup] distribute: no subvolume in layout for path=/s9, checking on all the subvols to see if it is a directory
[2010-01-28 15:58:25] D [dht-common.c:114:dht_lookup_dir_cbk] distribute: lookup of /s9 on box5-1 returned error (Transport endpoint is not connected)
[2010-01-28 15:58:25] D [dht-common.c:114:dht_lookup_dir_cbk] distribute: lookup of /s9 on box3-1 returned error (Transport endpoint is not connected)

###############################################################
Overall, it looks like the volumes are up, going by the CHILD_UP events received right at the start of the log portion, but right after that setvolume fails with a stale file handle. If box3-1 is in fact going down, should distribute not receive the CHILD_DOWN event?
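
To make the "no subvolume for hash" message easier to reason about, here is a minimal Python sketch of a range-based layout search. It is not the GlusterFS code. It assumes the 32-bit hash space is split evenly across the four subvolumes, in an order that is purely hypothetical, and that a subvolume whose setvolume failed contributes no range to the layout. The helper names build_layout and layout_search are made up for illustration.

```python
# Illustrative model only: the real range assignment is not in this report.
HASH_SPACE = 2 ** 32


def build_layout(subvols, up):
    """Split the hash space evenly; keep only ranges whose subvolume is up."""
    chunk = HASH_SPACE // len(subvols)
    layout = []
    for i, name in enumerate(subvols):
        start = i * chunk
        # The last range absorbs any remainder of the hash space.
        stop = HASH_SPACE - 1 if i == len(subvols) - 1 else start + chunk - 1
        if name in up:
            layout.append((start, stop, name))
    return layout


def layout_search(layout, hash_value):
    """Return the subvolume owning hash_value, or None if it falls in a hole."""
    for start, stop, name in layout:
        if start <= hash_value <= stop:
            return name
    return None


# Assumed brick order (hypothetical); names are taken from the client log.
subvols = ["box1-1", "box2-1", "box3-1", "box5-1"]
full = build_layout(subvols, up=set(subvols))
holes = build_layout(subvols, up={"box1-1", "box2-1"})

h = 3833074946  # hash value reported for /s9 in the log
print(layout_search(full, h))   # owner when every range is present
print(layout_search(holes, h))  # None: the hash lands in a hole
```

Under these assumptions, a name whose hash falls in a range owned by an unavailable subvolume has no hashed subvolume, which would explain why touch fails for some names and not others.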

Comment 4 Jeff Darcy 2010-02-10 21:00:00 UTC
I'm not sure if it's relevant, but I saw a similar problem with a similar translator.  I tracked it down to a check in client_lookup, which would fail if the *parent* had a non-zero ino (which had come from another subvolume) and inode_ctx_get failed (because that parent had never been looked up on this subvolume).  The same issue is likely to appear in stripe and replicate.  I'm not sure if it can appear in distribute as well, but it might be possible in various rename cases.
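
The check described above can be modelled roughly as follows. This is a simplified Python sketch, not the actual client_lookup code: the Inode class, the per-subvolume ctx dictionary and the lookup_allowed helper are hypothetical stand-ins, and only the condition itself (non-zero parent ino plus a failed inode_ctx_get) comes from the comment.

```python
class Inode:
    """Hypothetical stand-in for an inode with per-subvolume context."""

    def __init__(self, ino):
        self.ino = ino
        self.ctx = {}  # subvolume name -> remote inode number


def inode_ctx_get(inode, subvol):
    """Return (0, value) if a context exists for subvol, else (-1, None)."""
    if subvol in inode.ctx:
        return 0, inode.ctx[subvol]
    return -1, None


def lookup_allowed(parent, subvol):
    """Model of the check: refuse the lookup when the parent already has an
    inode number but was never looked up on this particular subvolume."""
    ret, _ = inode_ctx_get(parent, subvol)
    if parent.ino != 0 and ret < 0:
        return False
    return True


# The parent was looked up on box1-1 only, so it has an ino from that subvolume.
parent = Inode(ino=42)
parent.ctx["box1-1"] = 1001

print(lookup_allowed(parent, "box1-1"))  # True: context exists here
print(lookup_allowed(parent, "box3-1"))  # False: ino set, no context here
```

In this model the lookup is refused on any subvolume that has not yet seen the parent, even though the parent is valid elsewhere.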