Complete log at: dev:/share/tickets/1641
Adding Amar for more advice.
General remarks first:
o Setting to Blocker because Harsha is facing problems with a customer setup that has a deadline.
o Seen on nfs beta rc11.

The first problem is that when distribute subvolumes are down and a top-level translator performs a LOOKUP on root, dht self-heal does not return an error to the root xlator, even though dht realizes there is a problem. For example, see the log lines below:

[2010-09-18 04:29:05] D [dht-layout.c:593:dht_layout_normalize] distribute: path=/ err=Transport endpoint is not connected on subvol=10.1.100.204-10
[2010-09-18 04:29:05] D [dht-layout.c:593:dht_layout_normalize] distribute: path=/ err=Transport endpoint is not connected on subvol=10.1.100.204-11
[2010-09-18 04:29:05] D [dht-layout.c:593:dht_layout_normalize] distribute: path=/ err=Transport endpoint is not connected on subvol=10.1.100.204-12
[2010-09-18 04:29:05] D [dht-common.c:168:dht_lookup_dir_cbk] distribute: fixing assignment on /
[2010-09-18 04:29:05] D [dht-selfheal.c:487:dht_selfheal_directory] distribute: 36 subvolumes down -- not fixing
#########################################################
The layout cannot be fixed and dht says so, but it does not return an
error to nfs, which continues thinking that the lookup succeeded and
exports the subvolume as normal.
##########################################################
[2010-09-18 04:29:05] T [nfs.c:234:nfs_start_subvol_lookup_cbk] nfs: Started distribute

The code block to blame is in dht-selfheal.c:dht_selfheal_directory:487

        if (down) {
                gf_log (this->name, GF_LOG_DEBUG,
                        "%d subvolumes down -- not fixing", down);
                ret = 0;   ############### Must change to error? ###########
                goto sorry_no_fix;
        }
        .......
        .......
        .......
sorry_no_fix:
        /* TODO: need to put appropriate local->op_errno */
        dht_selfheal_dir_finish (frame, this, ret);
http://dev.gluster.com/~shehjart/nfs-export-on-root-lookup-success.mbox - this patch fixes the self-heal issue described earlier in this report.
PATCH: http://patches.gluster.com/patch/4919 in master (distribute: Return ESTALE when dir selfheal finds no fix)
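
For readers without access to the patch, here is a minimal standalone sketch of the behaviour its title describes: the "subvolumes down" branch reports a failure with ESTALE instead of ret = 0. This is not the actual patch. The struct selfheal_result type and the functions selfheal_dir_finish() and selfheal_directory() are hypothetical stand-ins that only model the control flow of dht_selfheal_directory(); the real function takes a frame and xlator and stores the errno in local->op_errno.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical result type; the real code reports status through the
 * call frame rather than a returned struct. */
struct selfheal_result {
        int op_ret;     /* 0 on success, -1 on failure */
        int op_errno;   /* meaningful only when op_ret is -1 */
};

/* Hypothetical stand-in for dht_selfheal_dir_finish(). */
static void
selfheal_dir_finish (struct selfheal_result *res, int ret, int op_errno)
{
        assert (res != NULL);
        res->op_ret   = ret;
        res->op_errno = (ret < 0) ? op_errno : 0;
}

/* Simplified model of the down-subvolume check. 'down' is the number
 * of subvolumes that could not be reached during the lookup. */
static struct selfheal_result
selfheal_directory (int down)
{
        struct selfheal_result res = { 0, 0 };
        int ret      = 0;
        int op_errno = 0;

        if (down) {
                fprintf (stderr, "%d subvolumes down -- not fixing\n", down);
                ret      = -1;      /* was: ret = 0 */
                op_errno = ESTALE;  /* assumed from the patch title */
                goto sorry_no_fix;
        }

        /* ... layout fixing would happen here ... */

sorry_no_fix:
        selfheal_dir_finish (&res, ret, op_errno);
        return res;
}
```

With this shape, a caller such as the nfs xlator sees op_ret == -1 and op_errno == ESTALE for selfheal_directory (36), and can decline to export the subvolume instead of treating the root lookup as a success.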