Bug 762310 (GLUSTER-578) - touch over distribute of 4 bricks randomly fails with ENOENT
Summary: touch over distribute of 4 bricks randomly fails with ENOENT
Keywords:
Status: CLOSED NOTABUG
Alias: GLUSTER-578
Product: GlusterFS
Classification: Community
Component: distribute
Version: mainline
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: ---
Assignee: Anand Avati
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2010-01-28 11:11 UTC by Shehjar Tikoo
Modified: 2015-12-01 16:45 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed:
Regression: RTNR
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:


Attachments
Server volume config (796 bytes, text/plain)
2010-01-28 08:12 UTC, Shehjar Tikoo
Client log (65.82 KB, text/plain)
2010-01-28 08:12 UTC, Shehjar Tikoo
Client volume spec (1.97 KB, text/plain)
2010-01-28 08:13 UTC, Shehjar Tikoo

Description Shehjar Tikoo 2010-01-28 08:12:18 UTC
Created attachment 140 [details]
Patch for sgml-tools-1.0.9-2.i386.rpm

Comment 1 Shehjar Tikoo 2010-01-28 08:12:45 UTC
Created attachment 141 [details]
Patch to fix the problem

Comment 2 Shehjar Tikoo 2010-01-28 08:13:05 UTC
Created attachment 142 [details]
The new bad XF86Config file

Comment 3 Shehjar Tikoo 2010-01-28 11:11:44 UTC
As the subject says, with a 3.0.2rc1 server and client, touch intermittently returns "No such file or directory", with no discernible pattern. Sometimes touch works fine; at other times, it fails.

The brick and client vol files and the client log are attached. This was first observed on the storage platform.

For the relevant part of the client log, search for the file "s9". A touch on this non-existent file fails.
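For anyone going through the attached client log, here is a minimal sketch of one way to pull out the lines that mention a given path. It assumes every line has the shape seen in the excerpts in this report ([timestamp] LEVEL [file:line:function] translator: message). The function name lines_mentioning and the sample list are made up for illustration and are not part of GlusterFS.

```python
import re

# Pattern for lines shaped like the excerpts in this report:
# [timestamp] LEVEL [file:line:function] translator: message
LOG_LINE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] (?P<level>\w) "
    r"\[(?P<src>[^:\]]+):(?P<lineno>\d+):(?P<func>[^\]]+)\] "
    r"(?P<xlator>[^:]+): (?P<msg>.*)$"
)


def lines_mentioning(log_lines, needle):
    """Return parsed records whose message contains `needle` (plain substring match)."""
    hits = []
    for raw in log_lines:
        m = LOG_LINE.match(raw.strip())
        if m and needle in m.group("msg"):
            hits.append(m.groupdict())
    return hits


# Three lines copied from the log excerpt in this report.
sample = [
    "[2010-01-28 15:58:15] D [client-protocol.c:7023:notify] box3-1: got GF_EVENT_CHILD_UP",
    "[2010-01-28 15:58:25] D [dht-helper.c:235:dht_subvol_get_hashed] distribute: could not find subvolume for path=/s9",
    "[2010-01-28 15:58:25] D [dht-common.c:114:dht_lookup_dir_cbk] distribute: lookup of /s9 on box5-1 returned error (Transport endpoint is not connected)",
]

for rec in lines_mentioning(sample, "/s9"):
    print(rec["ts"], rec["func"], rec["msg"])
```

The match is a plain substring test, so "/s9" would also hit a path such as "/s90"; tighten the needle if the log contains similar names.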

FWIW, I think the relevant portion of the client log is:

[2010-01-28 15:58:15] D [client-protocol.c:7023:notify] box3-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:15] D [client-protocol.c:7023:notify] box3-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:15] D [client-protocol.c:6160:client_setvolume_cbk] box3-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:15] D [client-protocol.c:6160:client_setvolume_cbk] box3-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:21] D [client-protocol.c:7023:notify] box5-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:21] D [client-protocol.c:6160:client_setvolume_cbk] box5-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:21] D [dht-diskusage.c:71:dht_du_info_cbk] distribute: on subvolume 'box2-1': avail_percent is: 99.00 and avail_space is: 980362805248
[2010-01-28 15:58:21] D [dht-diskusage.c:71:dht_du_info_cbk] distribute: on subvolume 'box1-1': avail_percent is: 99.00 and avail_space is: 980362801152
[2010-01-28 15:58:21] D [client-protocol.c:7023:notify] box3-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:21] D [client-protocol.c:6160:client_setvolume_cbk] box3-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:21] D [client-protocol.c:7023:notify] box5-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:21] D [client-protocol.c:6160:client_setvolume_cbk] box5-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:25] D [dht-layout.c:184:dht_layout_search] distribute: no subvolume for hash (value) = 3833074946
[2010-01-28 15:58:25] D [dht-helper.c:235:dht_subvol_get_hashed] distribute: could not find subvolume for path=/s9
[2010-01-28 15:58:25] D [dht-common.c:864:dht_lookup] distribute: no subvolume in layout for path=/s9, checking on all the subvols to see if it is a directory
[2010-01-28 15:58:25] D [dht-common.c:114:dht_lookup_dir_cbk] distribute: lookup of /s9 on box5-1 returned error (Transport endpoint is not connected)
[2010-01-28 15:58:25] D [dht-common.c:114:dht_lookup_dir_cbk] distribute: lookup of /s9 on box3-1 returned error (Transport endpoint is not connected)

###############################################################
Overall, it looks like the volumes are up, going by the CHILD_UP events received right at the start of the log portion, but right after that the setvolume fails with a stale file handle. If box3-1 is in fact going down, should distribute not receive the CHILD_DOWN event?
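
The "no subvolume for hash" message in the log above can be illustrated with a toy model of a hash-range layout. This is only a sketch under assumptions, not the dht code: it assumes the 32-bit hash space is split evenly across the four subvolumes named in the log, in a made-up order, and that a subvolume which is not usable contributes no range. The real layout on these bricks is not shown in this report. The names build_layout and layout_search are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

HASH_SPACE = 2**32  # the hash value in the log fits in 32 bits

# Subvolume names taken from the log; the ORDER here is an assumption.
SUBVOLS = ["box1-1", "box2-1", "box3-1", "box5-1"]


@dataclass
class Range:
    subvol: str
    start: int
    stop: int  # inclusive


def build_layout(subvols, connected):
    """Split the hash space evenly; keep only ranges of connected subvolumes.

    ASSUMPTION: the even split and the 'no range when not connected' rule
    are illustrative, not taken from the GlusterFS sources.
    """
    width = HASH_SPACE // len(subvols)
    layout = []
    for i, name in enumerate(subvols):
        if name not in connected:
            continue  # leaves a hole in the hash space
        start = i * width
        stop = HASH_SPACE - 1 if i == len(subvols) - 1 else start + width - 1
        layout.append(Range(name, start, stop))
    return layout


def layout_search(layout, hash_value) -> Optional[str]:
    """Return the subvolume owning hash_value, or None if it falls in a hole."""
    for r in layout:
        if r.start <= hash_value <= r.stop:
            return r.subvol
    return None


partial = build_layout(SUBVOLS, connected={"box1-1", "box2-1"})
print(layout_search(partial, 3833074946))  # hash value reported for /s9
print(layout_search(partial, 100))         # a hash owned by a connected subvolume
```

Under these assumptions, names that hash into the range of a connected subvolume succeed and names that hash into a hole fail, which would look random from the point of view of touch.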

Comment 4 Jeff Darcy 2010-02-10 21:00:00 UTC
I'm not sure if it's relevant, but I saw a similar problem with a similar translator.  I tracked it down to a check in client_lookup, which would fail if the *parent* had a non-zero ino (which had come from another subvolume) and inode_ctx_get failed (because that parent had never been looked up on this subvolume).  The same issue is likely to appear in stripe and replicate.  I'm not sure if it can appear in distribute as well, but it might be possible in various rename cases.
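
To make the condition described above concrete, here is a toy model of the check. It is a sketch only, not the client_lookup code: the Inode class, ctx_put/ctx_get, and lookup_precheck are hypothetical stand-ins, and the model assumes the inode context is stored per subvolume and that the check simply refuses the lookup when the parent has a non-zero ino but no context for the subvolume being asked.

```python
class Inode:
    """Toy inode with a per-subvolume context table (hypothetical model)."""

    def __init__(self, ino=0):
        self.ino = ino
        self._ctx = {}  # keyed by subvolume name

    def ctx_put(self, subvol, value):
        self._ctx[subvol] = value

    def ctx_get(self, subvol):
        """Return (ok, value); ok is False if nothing was stored for subvol."""
        if subvol in self._ctx:
            return True, self._ctx[subvol]
        return False, None


def lookup_precheck(subvol, parent):
    """Return True if a lookup under `parent` may proceed on `subvol`.

    ASSUMPTION: mirrors the described condition only; the real check and
    the error it returns are not shown in this report.
    """
    if parent is not None and parent.ino != 0:
        ok, _ = parent.ctx_get(subvol)
        if not ok:
            return False  # parent was never looked up on this subvolume
    return True


# Parent got its ino from a lookup on box1-1 only.
parent = Inode(ino=42)
parent.ctx_put("box1-1", 1001)

print(lookup_precheck("box1-1", parent))  # context present
print(lookup_precheck("box3-1", parent))  # non-zero ino, no context here
```

In this model the second call is refused even though the parent directory may well exist on box3-1, which is the kind of failure described in the comment.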

