Bug 762310 (GLUSTER-578)

Summary:

touch over distribute of 4 bricks randomly fails with ENOENT

Product:

[Community] GlusterFS

Reporter:

Shehjar Tikoo <shehjart>

Component:

distribute

Assignee:

Anand Avati <aavati>

Status:

CLOSED NOTABUG

QA Contact:

Severity:

medium

Docs Contact:

Priority:

low

Version:

mainline

CC:

anush, chrisw, fharshav, gluster-bugs, jdarcy

Target Milestone:

---

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

Type:

---

Regression:

RTNR

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
Server volume config	none
Client log	none
Client volume spec	none

Description Shehjar Tikoo 2010-01-28 08:12:18 UTC

Created attachment 140 [details]
Patch for sgml-tools-1.0.9-2.i386.rpm

Comment 1 Shehjar Tikoo 2010-01-28 08:12:45 UTC

Created attachment 141 [details]
Patch to fix the problem

Comment 2 Shehjar Tikoo 2010-01-28 08:13:05 UTC

Created attachment 142 [details]
The new bad XF86Config file

Comment 3 Shehjar Tikoo 2010-01-28 11:11:44 UTC

As the subject says, with a 3.0.2rc1 server and client, touch intermittently returns "No such file or directory" with no discernible pattern. Sometimes touch works fine; at other times, it fails.

The brick and client vol files and the client log are attached. This was first observed on the storage platform.

For the relevant part of the client log, search for the file "s9". A touch on this non-existent file fails.

FWIW, I think the relevant portion of the client log is:

[2010-01-28 15:58:15] D [client-protocol.c:7023:notify] box3-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:15] D [client-protocol.c:7023:notify] box3-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:15] D [client-protocol.c:6160:client_setvolume_cbk] box3-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:15] D [client-protocol.c:6160:client_setvolume_cbk] box3-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:21] D [client-protocol.c:7023:notify] box5-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:21] D [client-protocol.c:6160:client_setvolume_cbk] box5-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:21] D [dht-diskusage.c:71:dht_du_info_cbk] distribute: on subvolume 'box2-1': avail_percent is: 99.00 and avail_space is: 980362805248
[2010-01-28 15:58:21] D [dht-diskusage.c:71:dht_du_info_cbk] distribute: on subvolume 'box1-1': avail_percent is: 99.00 and avail_space is: 980362801152
[2010-01-28 15:58:21] D [client-protocol.c:7023:notify] box3-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:21] D [client-protocol.c:6160:client_setvolume_cbk] box3-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:21] D [client-protocol.c:7023:notify] box5-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:21] D [client-protocol.c:6160:client_setvolume_cbk] box5-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:25] D [dht-layout.c:184:dht_layout_search] distribute: no subvolume for hash (value) = 3833074946
[2010-01-28 15:58:25] D [dht-helper.c:235:dht_subvol_get_hashed] distribute: could not find subvolume for path=/s9
[2010-01-28 15:58:25] D [dht-common.c:864:dht_lookup] distribute: no subvolume in layout for path=/s9, checking on all the subvols to see if it is a directory
[2010-01-28 15:58:25] D [dht-common.c:114:dht_lookup_dir_cbk] distribute: lookup of /s9 on box5-1 returned error (Transport endpoint is not connected)
[2010-01-28 15:58:25] D [dht-common.c:114:dht_lookup_dir_cbk] distribute: lookup of /s9 on box3-1 returned error (Transport endpoint is not connected)

###############################################################
Overall, it looks like the volumes are up, going by the CHILD_UP events received right at the start of the log portion, but right after that the setvolume fails with a stale file handle. If box3-1 is in fact going down, should distribute not receive the CHILD_DOWN event?
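
To make the "no subvolume for hash" message in the log easier to reason about, here is a minimal sketch of a hash-range lookup. It is not GlusterFS code. It assumes the 32-bit hash space is split evenly across the four subvolumes in the order shown, and that the ranges of subvolumes that failed setvolume are simply missing from the layout. The function names, the subvolume order, and the range boundaries are all hypothetical. Only the hash value 3833074946 and the subvolume names come from the log.

```python
# Illustrative model only; not taken from dht-layout.c.
from typing import List, Optional, Set, Tuple

HASH_SPACE = 2 ** 32

# Each entry is (subvolume name, first hash, last hash), both ends inclusive.
Layout = List[Tuple[str, int, int]]


def build_layout(subvols: List[str]) -> Layout:
    """Split the 32-bit hash space evenly across subvols (assumed scheme)."""
    chunk = HASH_SPACE // len(subvols)
    layout = []
    for i, name in enumerate(subvols):
        start = i * chunk
        # The last subvolume absorbs any remainder of the hash space.
        stop = HASH_SPACE - 1 if i == len(subvols) - 1 else start + chunk - 1
        layout.append((name, start, stop))
    return layout


def drop_disconnected(layout: Layout, connected: Set[str]) -> Layout:
    """Keep only ranges owned by connected subvolumes (assumed behaviour)."""
    return [entry for entry in layout if entry[0] in connected]


def layout_search(layout: Layout, hash_value: int) -> Optional[str]:
    """Return the subvolume whose range covers hash_value, or None."""
    for name, start, stop in layout:
        if start <= hash_value <= stop:
            return name
    return None


# Subvolume names from the log; their order here is an assumption.
full = build_layout(["box1-1", "box2-1", "box3-1", "box5-1"])
# box3-1 and box5-1 reported setvolume failures in the log.
partial = drop_disconnected(full, {"box1-1", "box2-1"})

S9_HASH = 3833074946  # hash value reported for /s9 in the log

print(layout_search(full, S9_HASH))     # a subvolume is found
print(layout_search(partial, S9_HASH))  # None: a hole in the layout
```

Under these assumptions, names whose hash lands in a range owned by a connected subvolume resolve normally, while names hashing into a missing range do not. That would be consistent with touch failing for some file names and not others, though the sketch does not establish that this is what happened here.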

Comment 4 Jeff Darcy 2010-02-10 21:00:00 UTC

I'm not sure if it's relevant, but I saw a similar problem with a similar translator.  I tracked it down to a check in client_lookup, which would fail if the *parent* had a non-zero ino (which had come from another subvolume) and inode_ctx_get failed (because that parent had never been looked up on this subvolume).  The same issue is likely to appear in stripe and replicate.  I'm not sure if it can appear in distribute as well, but it might be possible in various rename cases.