Bug 763373 - (GLUSTER-1641) Distribute must return error when dir selfheal has no fix
Status: CLOSED CURRENTRELEASE
Product: GlusterFS
Classification: Community
Component: distribute
Version: nfs-alpha
Hardware: All Linux
Priority: low
Severity: high
Assigned To: Shehjar Tikoo
Blocks: GLUSTER-1643
Reported: 2010-09-18 07:29 EDT by Shehjar Tikoo
Modified: 2015-12-01 11:45 EST
Doc Type: Bug Fix
Regression: RTP
Mount Type: nfs

Description Shehjar Tikoo 2010-09-18 04:31:33 EDT
Complete log at: dev:/share/tickets/1641
Comment 1 Shehjar Tikoo 2010-09-18 04:33:44 EDT
Adding Amar for more advice.
Comment 2 Shehjar Tikoo 2010-09-18 07:29:55 EDT
General remarks first:
o Setting to Blocker because Harsha is facing problems on a customer setup with a deadline.
o Seen on nfs beta rc11.


The first problem is that when distribute subvolumes are down and a top-level translator performs a LOOKUP on root, dht self-heal does not return an error to the root xlator, even when dht realizes there is a problem. For example, see the log lines below:


[2010-09-18 04:29:05] D [dht-layout.c:593:dht_layout_normalize] distribute: path=/ err=Transport endpoint is not connected on subvol=10.1.100.204-10
[2010-09-18 04:29:05] D [dht-layout.c:593:dht_layout_normalize] distribute: path=/ err=Transport endpoint is not connected on subvol=10.1.100.204-11
[2010-09-18 04:29:05] D [dht-layout.c:593:dht_layout_normalize] distribute: path=/ err=Transport endpoint is not connected on subvol=10.1.100.204-12
[2010-09-18 04:29:05] D [dht-common.c:168:dht_lookup_dir_cbk] distribute: fixing assignment on /
[2010-09-18 04:29:05] D [dht-selfheal.c:487:dht_selfheal_directory] distribute: 36 subvolumes down -- not fixing
#########################################################
The layout cannot be fixed, and dht says so, but it does not return an error to nfs, which continues as if the lookup had succeeded and exports the subvolume as normal.
##########################################################
[2010-09-18 04:29:05] T [nfs.c:234:nfs_start_subvol_lookup_cbk] nfs: Started distribute

The code block to blame is:
In dht-selfheal.c:dht_selfheal_directory:487
        if (down) {
                gf_log (this->name, GF_LOG_DEBUG,
                        "%d subvolumes down -- not fixing", down);
                ret = 0; /* Must change to error? */
                goto sorry_no_fix;
        }
.......
.......
.......
sorry_no_fix:
        /* TODO: need to put appropriate local->op_errno */
        dht_selfheal_dir_finish (frame, this, ret);
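
For illustration only, here is a minimal standalone sketch of the behaviour being asked for: when any subvolume is down and the layout cannot be fixed, report a failure upward instead of 0. This is not the GlusterFS code. The struct `selfheal_result` and the functions `count_down_subvols` and `selfheal_directory` are hypothetical names made up for this sketch, and the choice of ESTALE is an assumption taken from the patch title in Comment 4.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for the result handed back to the parent xlator. */
struct selfheal_result {
    int op_ret;   /* 0 on success, -1 on failure */
    int op_errno; /* errno value reported upward when op_ret is -1 */
};

/*
 * Hypothetical helper: subvol_errs[i] is 0 for a reachable subvolume,
 * or ENOTCONN ("Transport endpoint is not connected") for one that is down.
 */
static int count_down_subvols(const int *subvol_errs, int count)
{
    int down = 0;

    for (int i = 0; i < count; i++) {
        if (subvol_errs[i] == ENOTCONN)
            down++;
    }
    return down;
}

/*
 * Sketch of the "not fixing" branch: instead of leaving the return value
 * at 0, flag the failure so the caller does not treat the lookup as good.
 * ESTALE is an assumption based on the patch title in Comment 4.
 */
static struct selfheal_result selfheal_directory(const int *subvol_errs,
                                                 int count)
{
    struct selfheal_result res = { 0, 0 };
    int down = count_down_subvols(subvol_errs, count);

    if (down) {
        fprintf(stderr, "%d subvolumes down -- not fixing\n", down);
        res.op_ret = -1;
        res.op_errno = ESTALE;
    }
    return res;
}
```

A caller such as the nfs start-up lookup callback could then check `op_ret` and refuse to export the subvolume when it is -1, rather than logging "Started distribute".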
Comment 3 Harshavardhana 2010-09-18 18:14:03 EDT
http://dev.gluster.com/~shehjart/nfs-export-on-root-lookup-success.mbox - patch fixes the self-heal issue which was seen before.
Comment 4 Vijay Bellur 2010-09-22 04:14:22 EDT
PATCH: http://patches.gluster.com/patch/4919 in master (distribute: Return ESTALE when dir selfheal finds no fix)
