| Summary: | touch over distribute of 4 bricks randomly fails with ENOENT | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Shehjar Tikoo <shehjart> |
| Component: | distribute | Assignee: | Anand Avati <aavati> |
| Status: | CLOSED NOTABUG | QA Contact: | |
| Severity: | medium | Docs Contact: | |
| Priority: | low | | |
| Version: | mainline | CC: | anush, chrisw, fharshav, gluster-bugs, jdarcy |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | All | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | | Type: | --- |
| Regression: | RTNR | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Attachments: | | | |

Created attachment 141 [details]
Patch to fix the problem
Created attachment 142 [details]
The new bad XF86Config file
As the subject says, with a 3.0.2rc1 server and client, touch returns "No such file or directory" with no discernible pattern. Sometimes touch works fine; at other times, it fails. The brick and client vol files and the client log are attached. This was first observed on the storage platform. For the relevant part of the client log, search for the file "s9". A touch on this non-existent file fails.

FWIW, I think the relevant portion of the client log is:

[2010-01-28 15:58:15] D [client-protocol.c:7023:notify] box3-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:15] D [client-protocol.c:7023:notify] box3-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:15] D [client-protocol.c:6160:client_setvolume_cbk] box3-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:15] D [client-protocol.c:6160:client_setvolume_cbk] box3-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:21] D [client-protocol.c:7023:notify] box5-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:21] D [client-protocol.c:6160:client_setvolume_cbk] box5-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:21] D [dht-diskusage.c:71:dht_du_info_cbk] distribute: on subvolume 'box2-1': avail_percent is: 99.00 and avail_space is: 980362805248
[2010-01-28 15:58:21] D [dht-diskusage.c:71:dht_du_info_cbk] distribute: on subvolume 'box1-1': avail_percent is: 99.00 and avail_space is: 980362801152
[2010-01-28 15:58:21] D [client-protocol.c:7023:notify] box3-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:21] D [client-protocol.c:6160:client_setvolume_cbk] box3-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:21] D [client-protocol.c:7023:notify] box5-1: got GF_EVENT_CHILD_UP
[2010-01-28 15:58:21] D [client-protocol.c:6160:client_setvolume_cbk] box5-1: setvolume failed (Stale NFS file handle)
[2010-01-28 15:58:25] D [dht-layout.c:184:dht_layout_search] distribute: no subvolume for hash (value) = 3833074946
[2010-01-28 15:58:25] D [dht-helper.c:235:dht_subvol_get_hashed] distribute: could not find subvolume for path=/s9
[2010-01-28 15:58:25] D [dht-common.c:864:dht_lookup] distribute: no subvolume in layout for path=/s9, checking on all the subvols to see if it is a directory
[2010-01-28 15:58:25] D [dht-common.c:114:dht_lookup_dir_cbk] distribute: lookup of /s9 on box5-1 returned error (Transport endpoint is not connected)
[2010-01-28 15:58:25] D [dht-common.c:114:dht_lookup_dir_cbk] distribute: lookup of /s9 on box3-1 returned error (Transport endpoint is not connected)

###############################################################

Overall, it looks like the volumes are up, going by the CHILD_UP events received right at the start of the log portion, but right after that the setvolume fails with a stale file handle. If box3-1 is in fact going down, should distribute not receive the CHILD_DOWN event?

I'm not sure if it's relevant, but I saw a similar problem with a similar translator. I tracked it down to a check in client_lookup, which would fail if the *parent* had a non-zero ino (which had come from another subvolume) and inode_ctx_get failed (because that parent had never been looked up on this subvolume). The same issue is likely to appear in stripe and replicate. I'm not sure if it can appear in distribute as well, but it might be possible in various rename cases.
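
For readers unfamiliar with the "no subvolume for hash" message in the log above, here is a minimal Python sketch of how a hash can fall into a hole in the layout. It is a simplified model and not GlusterFS code. It assumes the 32-bit hash space is split evenly across the four subvolumes, that the subvolumes are ordered as listed, and that the range of any subvolume that failed setvolume is simply missing from the layout. The function names `build_layout` and `layout_search` are hypothetical; only the subvolume names and the hash value come from the log.

```python
# Illustrative model only; this is not GlusterFS source code.
HASH_SPACE = 2 ** 32  # 32-bit hash space


def build_layout(subvols, connected):
    """Split the hash space evenly, keeping only ranges of connected subvolumes.

    Assumption: an unreachable subvolume contributes no range, leaving a hole.
    """
    chunk = HASH_SPACE // len(subvols)
    layout = []
    for i, name in enumerate(subvols):
        start = i * chunk
        # The last range absorbs any remainder so the space is fully covered.
        stop = HASH_SPACE - 1 if i == len(subvols) - 1 else start + chunk - 1
        if name in connected:
            layout.append((start, stop, name))
    return layout


def layout_search(layout, hash_value):
    """Return the subvolume whose range contains hash_value, or None for a hole."""
    for start, stop, name in layout:
        if start <= hash_value <= stop:
            return name
    return None


# Subvolume names are taken from the log; their ordering is an assumption.
subvols = ["box1-1", "box2-1", "box3-1", "box5-1"]
connected = {"box1-1", "box2-1"}  # box3-1 and box5-1 failed setvolume in the log

layout = build_layout(subvols, connected)
print(layout_search(layout, 3833074946))  # hash reported in the log for /s9
print(layout_search(layout, 1000))        # a hash that lands on a connected subvolume
```

Under these assumptions the hash for /s9 lands in the range of an unreachable subvolume, so the search finds nothing, while names that hash into the ranges of box1-1 or box2-1 still resolve. That would be consistent with touch failing only some of the time.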
Created attachment 140 [details] Patch for sgml-tools-1.0.9-2.i386.rpm