Description of problem:
Unexporting a volume sometimes fails with "Dynamic export addition/deletion failed".

Version-Release number of selected component (if applicable):
nfs-ganesha-2.3.1-4

How reproducible:
Intermittent

Steps to Reproduce:
Not sure about the exact steps, but sometimes unexporting the volume fails with "Dynamic export addition/deletion failed". The following steps were performed when the issue was hit:
1. Created a 4-node ganesha cluster.
2. Created a tiered volume, enabled quota and mounted it on 2 clients using vers=4.
3. Performed I/O from both mount points.
4. Unmounted the volume from both clients.
5. Unexported the volume; it fails with "Dynamic export addition/deletion failed" and the following messages in the logs:

[2016-04-25 16:20:35.129472] E [MSGID: 106472] [glusterd-ganesha.c:145:glusterd_check_ganesha_cmd] 0-management: Handling NFS-Ganesha op failed.
[2016-04-25 16:20:35.129582] E [MSGID: 106123] [glusterd-syncop.c:1411:gd_commit_op_phase] 0-management: Commit of operation 'Volume Set' failed on localhost : Dynamic export addition/deletion failed. Please see log file for details

The volume gets unexported from the node where "ganesha.enable off" was run, but on all the other nodes it does not get unexported:

[root@dhcp37-180 exports]# pwd
/etc/ganesha/exports
[root@dhcp37-180 exports]# ls
[root@dhcp37-180 exports]#

[root@dhcp37-158 exports]# pwd
/etc/ganesha/exports
[root@dhcp37-158 exports]# ls
export.tiervolume.conf
[root@dhcp37-158 exports]#

[root@dhcp37-127 exports]# pwd
/etc/ganesha/exports
[root@dhcp37-127 exports]# ls
export.tiervolume.conf
[root@dhcp37-127 exports]#

[root@dhcp37-174 exports]# pwd
/etc/ganesha/exports
[root@dhcp37-174 exports]# ls
export.tiervolume.conf
[root@dhcp37-174 exports]#

whereas showmount shows the volume as exported on all the nodes:

[root@dhcp37-180 exports]# showmount -e localhost
Export list for localhost:
/tiervolume (everyone)

[root@dhcp37-158 exports]# showmount -e localhost
Export list for localhost:
/tiervolume (everyone)

[root@dhcp37-127 exports]# showmount -e localhost
Export list for localhost:
/tiervolume (everyone)

[root@dhcp37-174 exports]# showmount -e localhost
Export list for localhost:
/tiervolume (everyone)

Peer status is proper in this case:

[root@dhcp37-180 exports]# gluster peer status
Number of Peers: 3

Hostname: dhcp37-158.lab.eng.blr.redhat.com
Uuid: 18fa3cca-c714-4c70-b227-cef260fffa27
State: Peer in Cluster (Connected)

Hostname: dhcp37-127.lab.eng.blr.redhat.com
Uuid: 43649367-7f47-41cf-8d63-97896e3504d4
State: Peer in Cluster (Connected)

Hostname: dhcp37-174.lab.eng.blr.redhat.com
Uuid: 1a5a806a-ab58-462b-b939-0b8158a2d914
State: Peer in Cluster (Connected)

After restarting glusterd on all the nodes, showmount does not list the volume as exported on localhost:

[root@dhcp37-180 exports]# showmount -e localhost
Export list for localhost:
[root@dhcp37-180 exports]# ls
[root@dhcp37-180 exports]#

[root@dhcp37-158 exports]# showmount -e localhost
Export list for localhost:
/tiervolume (everyone)
[root@dhcp37-158 exports]# ls
export.tiervolume.conf
[root@dhcp37-158 exports]#

[root@dhcp37-127 exports]# showmount -e localhost
Export list for localhost:
/tiervolume (everyone)
[root@dhcp37-127 exports]# ls
export.tiervolume.conf
[root@dhcp37-127 exports]#

[root@dhcp37-174 exports]# showmount -e localhost
Export list for localhost:
/tiervolume (everyone)
[root@dhcp37-174 exports]# ls
export.tiervolume.conf
[root@dhcp37-174 exports]#

Actual results:
Unexporting a volume sometimes fails with "Dynamic export addition/deletion failed".
Expected results:
The volume should get unexported without any issues.

Additional info:
No AVCs/SELinux-related issues are seen in the logs.
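For reference, a rough sketch of the reproduction flow described above in CLI terms. This is illustrative only: the brick paths, host names, mount point and the tier-attach syntax are placeholders/assumptions, not the exact commands that were run.

# Assumes a working 4-node ganesha cluster (gluster nfs-ganesha enable already done).
gluster volume create tiervolume replica 2 node1:/bricks/b1 node2:/bricks/b1 node3:/bricks/b1 node4:/bricks/b1
gluster volume start tiervolume
gluster volume tier tiervolume attach replica 2 node1:/bricks/hot1 node2:/bricks/hot1   # tier-attach syntax varies by release
gluster volume quota tiervolume enable
gluster volume set tiervolume ganesha.enable on

# On each of the two clients:
mount -t nfs -o vers=4 node1:/tiervolume /mnt/tiervolume
# ... run I/O from both mounts, then unmount on both clients ...
umount /mnt/tiervolume

# Unexport -- this is the step that intermittently fails:
gluster volume set tiervolume ganesha.enable off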
Hard to reproduce. Document as a known issue for 3.1.3 while we RCA it.
This issue seems to be hit most often when the volume is unexported after doing some I/O. However, the exact steps are still not certain:

[root@dhcp37-180 ~]# gluster volume set newvolume ganesha.enable off
volume set: failed: Dynamic export addition/deletion failed. Please see log file for details

The volume gets unexported from the node where the command is run, and on the other nodes it remains as it is. Since we are hitting this very frequently now, setting the severity to high.
I hit a similar failure while testing another fix. Both appear to be the same issue, and I posted a patch upstream: http://review.gluster.org/#/c/14225/

For me this issue is consistently reproducible with the following steps:
1.) Create and start a volume.
2.) gluster nfs-ganesha enable
3.) cd /usr/libexec/ganesha
4.) Execute the following scripts:
    ./create-export-ganesha.sh /etc/ganesha/ <volname>   (and check whether the export conf file is created)
    ./dbus-send.sh /etc/ganesha/ on <volname>            (and wait till the volume gets exported)
    ./dbus-send.sh /etc/ganesha/ off <volname>
The above scripts are run in the background when "gluster v set <volname> ganesha.enable <on/off>" is executed.
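For context, a rough sketch of what these helper scripts boil down to. This is illustrative only: the exact arguments and the Export_Id parsing are assumptions based on the script names above and the org.ganesha.nfsd.exportmgr D-Bus interface seen in the ganesha logs, and may differ between versions.

#!/bin/sh
# Approximation of what /usr/libexec/ganesha/dbus-send.sh does for a volume "myvol".
CONF=/etc/ganesha/exports/export.myvol.conf   # written by create-export-ganesha.sh

# "on": ask the running ganesha.nfsd to load the EXPORT block from $CONF
dbus-send --system --print-reply --dest=org.ganesha.nfsd \
    /org/ganesha/nfsd/ExportMgr org.ganesha.nfsd.exportmgr.AddExport \
    string:"$CONF" string:"EXPORT(Path=/myvol)"

# "off": remove the export by its Export_Id, parsed out of $CONF
EXPORT_ID=$(sed -n 's/.*Export_Id *= *\([0-9][0-9]*\).*/\1/p' "$CONF")
dbus-send --system --print-reply --dest=org.ganesha.nfsd \
    /org/ganesha/nfsd/ExportMgr org.ganesha.nfsd.exportmgr.RemoveExport \
    uint16:"$EXPORT_ID"

The "Dynamic export addition/deletion failed" error from glusterd appears to correspond to one of these dbus-send calls returning an error or not returning at all.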
Please provide pm ack
Verified this bug with the latest nfs-ganesha-2.3.1-6 build:

[root@dhcp42-20 exports]# rpm -qa|grep ganesha
nfs-ganesha-2.3.1-6.el7rhgs.x86_64
nfs-ganesha-gluster-2.3.1-6.el7rhgs.x86_64
glusterfs-ganesha-3.7.9-4.el7rhgs.x86_64

Observed the same issue in the scenarios below.

>>>> Tried unexporting the volume after the fssanity test run completed, with the volume still mounted on the client:

[root@dhcp42-20 ~]# gluster volume set newvolume ganesha.enable off
volume set: failed: Dynamic export addition/deletion failed. Please see log file for details

The volume got unexported from the node where "ganesha.enable off" was run, and on all the other nodes it remains as it is:

[root@dhcp42-20 ~]# cd /etc/ganesha/exports/
[root@dhcp42-20 exports]# ls
[root@dhcp42-20 exports]# showmount -e localhost
Export list for localhost:
[root@dhcp42-20 exports]#

[root@dhcp42-239 log]# cd /etc/ganesha/exports/
[root@dhcp42-239 exports]# ls
export.newvolume.conf
[root@dhcp42-239 exports]# showmount -e localhost
Export list for localhost:
/newvolume (everyone)

[root@dhcp43-175 log]# cd /etc/ganesha/exports/
[root@dhcp43-175 exports]# ls
export.newvolume.conf
[root@dhcp43-175 exports]# showmount -e localhost
Export list for localhost:
/newvolume (everyone)

[root@dhcp42-196 log]# cd /etc/ganesha/exports/
[root@dhcp42-196 exports]# ls
export.newvolume.conf
[root@dhcp42-196 exports]# showmount -e localhost
Export list for localhost:
/newvolume (everyone)

>>>> Executed the pynfs test suite on the volume and, with the volume still mounted on the client, executed the script below:

[root@dhcp42-20 ~]# cd /usr/libexec/ganesha/
[root@dhcp42-20 ganesha]# ./dbus-send.sh /etc/ganesha/ off newvolume
Error org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

No AVCs are seen in audit.log after this. The volume got unexported from that node and on all the other nodes it remains as it is:

[root@dhcp42-20 ~]# cd /etc/ganesha/exports/
[root@dhcp42-20 exports]# ls
[root@dhcp42-20 exports]# showmount -e localhost
Export list for localhost:
[root@dhcp42-20 exports]#

[root@dhcp42-239 ~]# cd /etc/ganesha/exports/
[root@dhcp42-239 exports]# ls
export.newvolume.conf
[root@dhcp42-239 exports]# showmount -e localhost
Export list for localhost:
/newvolume (everyone)

[root@dhcp43-175 ~]# cd /etc/ganesha/exports/
[root@dhcp43-175 exports]# ls
export.newvolume.conf
[root@dhcp43-175 exports]# showmount -e localhost
Export list for localhost:
/newvolume (everyone)

[root@dhcp42-196 ~]# cd /etc/ganesha/exports/
[root@dhcp42-196 exports]# ls
export.newvolume.conf
[root@dhcp42-196 exports]# showmount -e localhost
Export list for localhost:
/newvolume (everyone)
Can you please provide the sosreport and the ganesha logs of all the systems after the above-mentioned issue is observed?
Here the dbus-send command in the script does not get a reply from the ganesha process and throws an "org.freedesktop.DBus.Error.NoReply" error. As part of cleanup, ganesha flushes all the cache inodes belonging to that export, which enters a deadlock in the cache_inode layer. This makes the ganesha server unresponsive.

A similar issue was reported in https://bugzilla.redhat.com/show_bug.cgi?id=1301450; Soumya sent a patch for fixing the issue [1] and an upstream discussion [2] has started based on it.

[1] https://review.gerrithub.io/#/c/258687/
[2] https://sourceforge.net/p/nfs-ganesha/mailman/message/34811062/

Steps to reproduce:
1.) Create a volume and start it.
2.) Export the volume.
3.) Run the pynfs tests on it (only a subset is enough; I ran the "link" tests).
4.) Try to unexport the volume using the script:
    /usr/libexec/ganesha/dbus-send.sh /etc/ganesha off <volname>
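When the server gets into this state, the hang can be confirmed directly on the affected node. A rough diagnostic sketch (the ShowExports call and the use of gstack are suggestions, not steps taken from this report):

# Any export-manager call times out, even with a generous reply timeout:
dbus-send --system --print-reply --reply-timeout=120000 \
    --dest=org.ganesha.nfsd /org/ganesha/nfsd/ExportMgr \
    org.ganesha.nfsd.exportmgr.ShowExports

# Dump stacks of all ganesha.nfsd threads and look for threads stuck in the
# cache_inode / LRU cleanup path (gstack needs gdb installed):
gstack "$(pidof ganesha.nfsd)" > /tmp/ganesha-stacks.txt
grep -n cache_inode /tmp/ganesha-stacks.txt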
A correction: the above-mentioned issue was observed on bug 1301450, but no fix was provided there. It was under discussion upstream (https://sourceforge.net/p/nfs-ganesha/mailman/message/34811062/). Since there was no definite way of reproducing it, and since the upstream dev branch (next) of the nfs-ganesha codebase went through a major revamp with the introduction of MDCACHE, this issue could not be fixed. Request QE to reproduce and confirm whether they are indeed hitting the same issue. Meanwhile, I will check with upstream how/if we can fix it in the stable V2.3-stable branch to which we have rebased.
Able to reproduce the issue on the first attempt:
1. Started the pynfs test suite and let it complete.
2. Tried unexporting the volume and it failed with the error below:

[root@dhcp42-20 exports]# gluster volume set newvolume ganesha.enable off
volume set: failed: Dynamic export addition/deletion failed. Please see log file for details

Soumya/Jiffin, please use the setup below to debug; it has been kept in the same state.
10.70.42.20
Thanks Shashank and Jiffin. This is indeed the same issue that Jiffin reported in comment#12.

From gdb attached to the ganesha process:

cache_inode_unexport (export=export@entry=0x7f040c002df8) at /usr/src/debug/nfs-ganesha-2.3.1/src/cache_inode/cache_inode_misc.c:498
498             expmap = glist_first_entry(&export->entry_list,
(gdb)
502             if (unlikely(expmap == NULL)) {
(gdb)
508             entry = expmap->entry;
(gdb)
510             status = cache_inode_lru_ref(entry, LRU_FLAG_NONE);
(gdb) p *entry
$1 = {attr_lock = {__data = {__lock = 0, __nr_readers = 0, __readers_wakeup = 0, __writer_wakeup = 0, __nr_readers_queued = 0, __nr_writers_queued = 0, __writer = 0, __shared = 0, __pad1 = 0, __pad2 = 0, __flags = 0}, __size = '\000' <repeats 55 times>, __align = 0}, obj_handle = 0x7f039804e948, fh_hk = {node_k = {left = 0x0, right = 0x0, parent = 139653815383394}, key = {hk = 11004833217579149150, fsal = 0x7f040c006db0, kv = { addr = 0x7f039802fed0, len = 32}}, inavl = false}, type = REGULAR_FILE, flags = 1, icreate_refcnt = 0, change_time = 1463602654000000000, attr_time = 1463602654, lru = {q = {next = 0x7f0424b20440 <LRU+2656>, prev = 0x7f0424b20440 <LRU+2656>}, qid = LRU_ENTRY_CLEANUP, refcnt = 1, pin_refcnt = 1, flags = 1, lane = 10, cf = 7}, exp_root_refcount = 0, state_lock = {__data = {__lock = 0, __nr_readers = 0, __readers_wakeup = 0, __writer_wakeup = 0, __nr_readers_queued = 0, __nr_writers_queued = 0, __writer = 0, __shared = 0, __pad1 = 0, __pad2 = 0, __flags = 0}, __size = '\000' <repeats 55 times>, __align = 0}, list_of_states = {next = 0x7f039804ed18, prev = 0x7f039804ed18}, export_list = {next = 0x7f0398080220, prev = 0x7f0398080220}, first_export = 0x7f040c002df8, layoutrecall_list = {next = 0x7f039804ed40, prev = 0x7f039804ed40}, content_lock = {__data = {__lock = 0, __nr_readers = 0, __readers_wakeup = 0, __writer_wakeup = 0, __nr_readers_queued = 0, __nr_writers_queued = 0, __writer = 0, __shared = 0, __pad1 = 0, __pad2 = 0, __flags = 0}, __size = '\000' <repeats 55 times>, __align = 0}, object = {file = {lock_list = {next = 0x7f039804ed88, prev = 0x7f039804ed88}, nlm_share_list = {next = 0x7f039804ed98, prev = 0x7f039804ed98}, share_state = {share_access_read = 0, share_access_write = 0, share_deny_read = 0, share_deny_write = 0, share_deny_write_v4 = 0}, write_delegated = false, fdeleg_stats = {fds_curr_delegations = 0, fds_deleg_type = OPEN_DELEGATE_NONE, fds_delegation_count = 0, fds_recall_count = 0, fds_avg_hold = 0, fds_last_delegation = 0, fds_last_recall = 0, fds_num_opens = 0, fds_first_open = 0}, anon_ops = 0}, dir = {nbactive = 2550459784, parent = {hk = 139653412089224, fsal = 0x7f039804ed98, kv = {addr = 0x7f039804ed98, len = 0}}, avl = {t = {root = 0x0, cmp_fn = 0x0, height = 0, first = 0x0, last = 0x0, size = 0}, c = {root = 0x0, cmp_fn = 0x0, height = 0, first = 0x0, last = 0x0, size = 0}, collisions = 0}, junction_export = 0x0, export_roots = { next = 0x0, prev = 0x0}}}}
(gdb) n
512             if (status != CACHE_INODE_SUCCESS) {
(gdb)
514                     PTHREAD_RWLOCK_unlock(&export->lock);
(gdb) l
509
510                     status = cache_inode_lru_ref(entry, LRU_FLAG_NONE);
511
512                     if (status != CACHE_INODE_SUCCESS) {
513                             /* This entry was going stale, skip it. */
514                             PTHREAD_RWLOCK_unlock(&export->lock);
515                             continue;
516                     }
517
518                     PTHREAD_RWLOCK_unlock(&export->lock);

Similar to what is mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1301450#c3, there is an entry marked "LRU_ENTRY_CLEANUP" and no other thread is cleaning it up. Maybe the last unref should clean it up and we may have a ref leak.
(gdb) p *entry->objhandle
There is no member named objhandle.
(gdb) p entry->objhandle
There is no member named objhandle.
(gdb) p entry->obj_handle
$2 = (struct fsal_obj_handle *) 0x7f039804e948
(gdb) p (struct glusterfs_handle *)((char *) $2 - 56)
$15 = (struct glusterfs_handle *) 0x7f039804e910
(gdb) p *$15
$16 = {glhandle = 0x7f0398040280, globjhdl = "\177.\001cr,E\246\227\210\227\356rZ,\256Ԋ=\277\217\320K5\265\203EX[\333", <incomplete sequence \312>, glfd = 0x7f039804eef0, openflags = 3, handle = {handles = {next = 0x7f040c006dd0, prev = 0x7f03b802b6b8}, fs = 0x0, fsal = 0x7f040c006db0, obj_ops = {release = 0x7f041056c660 <handle_release>, lookup = 0x7f041056c4d0 <lookup>, readdir = 0x7f041056b9a0 <read_dirents>, create = 0x7f041056c220 <create>, mkdir = 0x7f041056bf70 <makedir>, mknode = 0x7f041056bb70 <makenode>, symlink = 0x7f041056b700 <makesymlink>, readlink = 0x7f041056b610 <readsymlink>, test_access = 0x7f04247f0310 <fsal_test_access>, getattrs = 0x7f041056b4e0 <getattrs>, setattrs = 0x7f041056c750 <setattrs>, link = 0x7f041056b3a0 <linkfile>, fs_locations = 0x7f04247f0fd0 <fs_locations>, rename = 0x7f041056b250 <renamefile>, unlink = 0x7f041056b110 <file_unlink>, open = 0x7f041056b040 <file_open>, reopen = 0x7f04247f0fc0 <file_reopen>, status = 0x7f041056a940 <file_status>, read = 0x7f041056afd0 <file_read>, read_plus = 0x7f04247f1000 <file_read_plus>, write = 0x7f041056af40 <file_write>, write_plus = 0x7f04247f1020 <file_write_plus>, seek = 0x7f04247f1030 <file_seek>, io_advise = 0x7f04247f1040 <file_io_advise>, commit = 0x7f041056aef0 <commit>, lock_op = 0x7f041056ab30 <lock_op>, share_op = 0x7f04247f1070 <share_op>, close = 0x7f041056a960 <file_close>, list_ext_attrs = 0x7f04247f1090 <list_ext_attrs>, getextattr_id_by_name = 0x7f04247f10a0 <getextattr_id_by_name>, getextattr_value_by_name = 0x7f04247f10b0 <getextattr_value_by_name>, getextattr_value_by_id = 0x7f04247f10c0 <getextattr_value_by_id>, setextattr_value = 0x7f04247f10d0 <setextattr_value>, setextattr_value_by_id = 0x7f04247f10e0 <setextattr_value_by_id>, getextattr_attrs = 0x7f04247f10f0 <getextattr_attrs>, remove_extattr_by_id = 0x7f04247f1100 <remove_extattr_by_id>, remove_extattr_by_name = 0x7f04247f1110 <remove_extattr_by_name>, handle_is = 0x7f04247f0ed0 <handle_is>, lru_cleanup = 0x7f041056aa10 <lru_cleanup>, handle_digest = 0x7f041056aa60 <handle_digest>, handle_to_key = 0x7f041056a950 <handle_to_key>, layoutget = 0x7f041056ec00 <pnfs_layout_get>, layoutreturn = 0x7f041056e0f0 <pnfs_layout_return>, layoutcommit = 0x7f041056e150 <pnfs_layout_commit>}, lock = {__data = {__lock = 0, __nr_readers = 0, __readers_wakeup = 0, __writer_wakeup = 0, __nr_readers_queued = 0, __nr_writers_queued = 0, __writer = 0, __shared = 0, __pad1 = 0, __pad2 = 0, __flags = 0}, __size = '\000' <repeats 55 times>, __align = 0}, attrs = 0x7f039804eb10, type = REGULAR_FILE}, attributes = {mask = 385006, type = REGULAR_FILE, filesize = 0, fsid = {major = 1871, minor = 404719}, acl = 0x0, fileid = 13079373988614579146, mode = 493, numlinks = 1, owner = 0, group = 0, rawdev = {major = 0, minor = 0}, atime = {tv_sec = 1463602179, tv_nsec = 0}, creation = {tv_sec = 0, tv_nsec = 0}, ctime = { tv_sec = 1463602654, tv_nsec = 0}, mtime = {tv_sec = 1463602179, tv_nsec = 0}, chgtime = {tv_sec = 1463602654, tv_nsec = 0}, spaceused = 0, change = 1463602654, generation = 0, expire_time_attr = 60}, rd_issued = 0, rd_serial = 0, rw_issued = 0, rw_serial = 0, rw_max_len = 0}
(gdb) p $15->glhandle
$17 = (struct glfs_object *) 0x7f0398040280
(gdb)

As can be seen above, the entry->obj_handle still has a glfd and open state associated with it. So maybe the entry that was opened is never closed, which led to the ref leak. Have to look at the code.
With the latest upstream bits, I can still hit this issue:

[root@dhcp43-116 log]# rpm -qa|grep ganesha
glusterfs-ganesha-3.8.3-0.6.git7956718.el7.centos.x86_64
nfs-ganesha-gluster-2.4-0.rc4.el7.centos.x86_64
nfs-ganesha-debuginfo-2.4-0.rc4.el7.centos.x86_64
nfs-ganesha-2.4-0.rc4.el7.centos.x86_64

Sometimes, after performing I/O, unexporting the volume fails with the message below:

[root@dhcp43-116 ~]# gluster vol set v1 ganesha.enable off
volume set: failed: Dynamic export addition/deletion failed. Please see log file for details

9/09/2016 16:48:47 : epoch 57dbe5c8 : dhcp43-116.lab.eng.blr.redhat.com : ganesha.nfsd-6039[dbus_heartbeat] dbus_message_entrypoint :DBUS :MAJ :Method (RemoveExport) on (org.ganesha.nfsd.exportmgr) failed: name = (org.freedesktop.DBus.Error.InvalidArgs), message = (lookup_export failed with Export id not found)
In the latest nfs-ganesha 2.4, export/unexport works fine with pynfs. Requesting Shashank to retest the same.
With the latest ganesha build, I don't see this issue while unexporting the volume.

[root@dhcp43-110 exports]# rpm -qa|grep ganesha
nfs-ganesha-debuginfo-2.4.0-2.el7rhgs.x86_64
nfs-ganesha-2.4.0-2.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-2.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.0-2.el7rhgs.x86_64

The following were tried:
1) Unexport with client-io-threads set to off on the volume.
2) Multiple exports/unexports on a newly created volume.
3) Multiple exports/unexports after running the pynfs test suite.
This can be closed based on the above comment.
Based on comment 22, closing this bug.