Description of problem:
NFS-Ganesha service stops while running IO from different NFS clients but on the same directory.

Version-Release number of selected component (if applicable):
Ceph: 10.2.5-7.el7cp (59e9fee4a935fdd2bc8197e07596dc4313c410a3)
NFS version: nfs-ganesha-rgw-2.4.1-3.el7cp.x86_64
RHEL: 7.3

How reproducible:
2/2

Steps to Reproduce:
1. Configure the cluster, RGW, and NFS.
2. Mount NFS on 3 clients.
3. Create one directory in the NFS mount.
4. Use the tool below (https://github.com/bengland2/smallfile) to create files and folders; run it from all the clients on the same directory. Command used:
   python smallfile_cli.py --operation create --threads 8 --file-size 1024 --files 2048 --top /mnt/nf/smallfileio/
5. IO failed on one client and the mount point was not accessible; the "ls" command hung.
6. Observed that the ganesha service was not running.

Actual results:
Ganesha service is stopped.

Expected results:
Ganesha service should not stop and the mount should be accessible from all clients.

Additional info:
No failure messages seen in the Ganesha or RGW logs.
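Step 4 above can be sketched as a small driver script. This is a sketch only: the client host names (client1..client3) are placeholders, and it is written as a dry run that prints the per-client commands rather than executing them.

```shell
# Dry run of step 4: launch the same smallfile create workload from every
# client against the shared directory. Host names are placeholder assumptions.
TOP=/mnt/nf/smallfileio
count=0
for client in client1 client2 client3; do
  cmd="ssh $client python smallfile_cli.py --operation create --threads 8 --file-size 1024 --files 2048 --top $TOP"
  echo "$cmd"   # replace echo with eval (or run directly on each client) to reproduce
  count=$((count+1))
done
```

Running the workload concurrently from all clients on the same directory is what triggered the service stop in this report.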
Hi Matt,

With the same test scenario on the latest build (ceph 10.2.5-12.el7cp, nfs-ganesha-rgw-2.4.2-1.el7cp.x86_64), observing the crash below in RGW:

 -1> 2017-01-25 12:35:39.154179 7f5141bf7700  1 -- 10.8.128.23:0/2521978547 <== osd.4 10.8.128.112:6804/2834 3550 ==== osd_op_reply(9093 .dir.95173e61-a772-4331-bfda-20f2d5f4d9ce.53603.1 [call] v57'102548 uv102548 ondisk = 0) v7 ==== 169+0+0 (3525528611 0 0) 0x7f511c01eb60 con 0x7f516a7c8ec0
  0> 2017-01-25 12:35:39.154175 7f4fd4944700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f4fd4944700 thread_name:ganesha.nfsd

 ceph version 10.2.5-12.el7cp (8614488f8c3e7a9be34e58fb1aaf23416156152c)
 1: (()+0x56dc0a) [0x7f515acf0c0a]
 2: (()+0xf370) [0x7f5167545370]
 3: (rgw::RGWLibFS::getattr(rgw::RGWFileHandle*, stat*)+0) [0x7f515ac86860]
 4: (rgw_getattr()+0x11) [0x7f515ac86d31]
 5: (rgw_fsal_open2()+0x7a6) [0x7f51643dfbb6]
 6: (mdcache_open2()+0x34f) [0x7f516908614f]
 7: (fsal_open2()+0x1fb) [0x7f5168fbacfb]
 8: (()+0x2b386) [0x7f5168fa6386]
 9: (nfs4_op_open()+0xaa9) [0x7f5168fee8a9]
 10: (nfs4_Compound()+0x63d) [0x7f5168fe0ded]
 11: (nfs_rpc_execute()+0x5bc) [0x7f5168fd1f9c]
 12: (()+0x585fa) [0x7f5168fd35fa]
 13: (()+0xe2289) [0x7f516905d289]
 14: (()+0x7dc5) [0x7f516753ddc5]
 15: (clone()+0x6d) [0x7f5166c0c73d]
Moving the bug to verified state. Problem not seen in the build below:
ceph: 10.2.5-21.el7cp (4a76f1521d766e24af2b265f9809a6bff411bf12)
nfs-rgw: nfs-ganesha-rgw-2.4.2-4.el7cp.x86_64
The NFS-Ganesha service stopped again while running multiple IO operations across multiple clients. IO operations (crefi, smallfile, dd, scp, iozone, wget) were running from 3 different clients on different directories under the mount point. The IOs hung (the data set was very large) and no IO errors were seen. After leaving it in that state overnight, the client IO operations remained the same: no read/write operation status in "ceph -w" and no error messages in the RGW log file.

Killed all the client IO operations and started re-running them one by one. When I started the 4th IO operation from the 3rd client, it began to error out: the ganesha service had stopped, and no crash was seen.

[root@magna020 /]# ps aux | grep ganesha
root 4192 3.0 1.3 9475868 448776 ? Ssl Feb08 76:19 /usr/bin/ganesha.nfsd -f /etc/ganesha/ganesha.conf
[root@magna020 /]# ps aux | grep ganesha
root 4192 3.0 1.3 9475868 448776 ? Ssl Feb08 76:19 /usr/bin/ganesha.nfsd -f /etc/ganesha/ganesha.conf
[root@magna020 /]# ps aux | grep ganesha
[root@magna020 /]# ps aux | grep ganesha
[root@magna020 /]#
[root@magna020 /]# /usr/bin/ganesha.nfsd -f /etc/ganesha/ganesha.conf
[root@magna020 /]# systemctl restart ceph-radosgw.service
[root@magna020 /]# ps aux | grep ganesha
root 564 0.0 0.0 112648 964 pts/1 S+ 08:08 0:00 grep --color=auto ganesha
root 32239 3.4 0.6 9462568 212996 ? Ssl 07:47 0:43 /usr/bin/ganesha.nfsd -f /etc/ganesha/ganesha.conf
[root@magna020 /]#

Re-opening the BZ as the nfs-ganesha service stopped again. Will update with more details if it is seen again.
Candidate fixes (for multiple issues) pushed to the ceph and nfs-ganesha ceph-2-rhel-patches branches.
Repeated the same operations as mentioned in Comment #13 and hit the crash again.

[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/ganesha.nfsd -f /etc/ganesha/ganesha.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007fba8ecc223b in raise () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install nfs-ganesha-2.4.2-5.el7cp.x86_64
(gdb) bt
#0  0x00007fba8ecc223b in raise () from /lib64/libpthread.so.0
#1  0x00007fba8246cd05 in handle_fatal_signal(int) () from /lib64/librgw.so.2
#2  <signal handler called>
#3  0x00007fba8e2c71d7 in raise () from /lib64/libc.so.6
#4  0x00007fba8e2c88c8 in abort () from /lib64/libc.so.6
#5  0x00007fba8e2c0146 in __assert_fail_base () from /lib64/libc.so.6
#6  0x00007fba8e2c01f2 in __assert_fail () from /lib64/libc.so.6
#7  0x00007fba82401468 in void boost::intrusive::detail::destructor_impl<boost::intrusive::detail::generic_hook<boost::intrusive::get_set_node_algo<void*, false>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, 0> >(boost::intrusive::detail::generic_hook<boost::intrusive::get_set_node_algo<void*, false>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, 0>&, boost::intrusive::detail::link_dispatch<(boost::intrusive::link_mode_type)1>) [clone .isra.584] () from /lib64/librgw.so.2
#8  0x00007fba82407f6d in rgw::RGWFileHandle::~RGWFileHandle() () from /lib64/librgw.so.2
#9  0x00007fba82407fe9 in rgw::RGWFileHandle::~RGWFileHandle() () from /lib64/librgw.so.2
#10 0x00007fba82410dcd in cohort::lru::LRU<std::mutex>::unref(cohort::lru::Object*, unsigned int) () from /lib64/librgw.so.2
#11 0x00007fba82402b5d in rgw_fh_rele () from /lib64/librgw.so.2
#12 0x00007fba8bb5bcb6 in release () from /usr/lib64/ganesha/libfsalrgw.so
#13 0x00007fba907fae12 in mdcache_lru_get ()
#14 0x00007fba90807a1e in mdcache_new_entry ()
#15 0x00007fba907ff954 in mdcache_alloc_and_check_handle ()
#16 0x00007fba90803232 in mdcache_open2 ()
#17 0x00007fba90734e49 in open2_by_name ()
#18 0x00007fba90737c18 in fsal_open2 ()
#19 0x00007fba90723386 in open4_ex ()
#20 0x00007fba9076b8a9 in nfs4_op_open ()
#21 0x00007fba9075dded in nfs4_Compound ()
#22 0x00007fba9074ef9c in nfs_rpc_execute ()
#23 0x00007fba907505fa in worker_run ()
#24 0x00007fba907da2c9 in fridgethr_start_routine ()
#25 0x00007fba8ecbadc5 in start_thread () from /lib64/libpthread.so.0
#26 0x00007fba8e38973d in clone () from /lib64/libc.so.6
(gdb)
> In this smallfile script I added rename and delete_renamed operations, and
> when it reached the delete_renamed operation the script failed, because the
> command I had given in the script was wrong: it should have been
> "delete-renamed" instead of "delete_renamed". As the command exited with
> errors and continued with the next operations, IOs got stuck and remained
> stuck for a while.
> Later, checked the nfs daemon and the daemon was stopped.
> Not sure if this was caused by the rename operation or something else.

Observing the problem with the proper "delete-renamed" value also:

for i in create read stat chmod setxattr getxattr mkdir readdir ls-l rename delete_renamed create delete cleanup ; do python smallfile_cli.py --operation $i --threads 8 --file-size 1024 --files 10 --top /hello/folder2 ; done

If we run only the IO tool above, no issue is seen, but if some other IO is in progress alongside it, the nfs-ganesha service stops every time.
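For reference, the same loop with the operation name corrected to "delete-renamed" (per the quoted comment's own correction), written here as a dry run that only prints each command instead of executing it:

```shell
# Dry-run version of the mixed-operation smallfile loop; prints each command
# instead of running it. /hello/folder2 is the directory from the report.
n=0
for op in create read stat chmod setxattr getxattr mkdir readdir ls-l \
          rename delete-renamed create delete cleanup; do
  cmd="python smallfile_cli.py --operation $op --threads 8 --file-size 1024 --files 10 --top /hello/folder2"
  echo "$cmd"
  n=$((n+1))
done
```

Removing the echo (or piping through sh) runs the 14 operations in sequence as in the original reproduction.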
Moving this bug to verified state. Ran IOs on 4 different directories in the NFS mount from 2 clients using Crefi and Smallfile. IOs completed without any issues, and no service stop was observed.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:1497