Bug 1416041 - NFS-Ganesha service stops while running IO from multiple clients
Summary: NFS-Ganesha service stops while running IO from multiple clients
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Ceph Storage
Classification: Red Hat
Component: RGW
Version: 2.2
Hardware: Unspecified
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: 2.3
Assignee: Matt Benjamin (redhat)
QA Contact: Ramakrishnan Periyasamy
URL:
Whiteboard:
Keywords:
Depends On:
Blocks:
Reported: 2017-01-24 12:52 UTC by Ramakrishnan Periyasamy
Modified: 2017-07-30 15:50 UTC (History)
Last Closed: 2017-06-19 13:28:59 UTC


External Trackers
Tracker | ID | Priority | Status | Summary | Last Updated
Red Hat Product Errata | RHBA-2017:1497 | normal | SHIPPED_LIVE | Red Hat Ceph Storage 2.3 bug fix and enhancement update | 2017-06-19 17:24:11 UTC
Ceph Project Bug Tracker | 19111 | None | None | None | 2017-02-28 21:36 UTC
Ceph Project Bug Tracker | 19112 | None | None | None | 2017-03-01 02:36 UTC

Description Ramakrishnan Periyasamy 2017-01-24 12:52:55 UTC
Description of problem:
The NFS-Ganesha service stops while IO is running from different NFS clients against the same directory.


Version-Release number of selected component (if applicable):
Ceph: 10.2.5-7.el7cp (59e9fee4a935fdd2bc8197e07596dc4313c410a3)
NFS version: nfs-ganesha-rgw-2.4.1-3.el7cp.x86_64
RHEL: 7.3

How reproducible:
2/2

Steps to Reproduce:
1. Configure the cluster, RGW and NFS.
2. Mount the NFS export on 3 clients.
3. Create one directory in the NFS mount.
4. Use the smallfile tool (https://github.com/bengland2/smallfile) to create files and folders, running it from all clients against the same directory. Command used: "python smallfile_cli.py --operation create --threads 8 --file-size 1024 --files 2048 --top /mnt/nf/smallfileio/"
5. IO failed on one client and the mount point became inaccessible; the "ls" command hung.
6. Observed that the ganesha service was not running.
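
The workload in step 4 can be sketched as a small helper that assembles the same command line, so it can be launched identically on every client. This is only an illustration: the helper name `build_smallfile_cmd` is hypothetical, and the default flag values are the ones quoted in step 4.

```python
import shlex


def build_smallfile_cmd(top, operation="create", threads=8,
                        file_size=1024, files=2048):
    """Return the smallfile_cli.py argument list from step 4 (hypothetical helper)."""
    return [
        "python", "smallfile_cli.py",
        "--operation", operation,
        "--threads", str(threads),
        "--file-size", str(file_size),
        "--files", str(files),
        "--top", top,
    ]


if __name__ == "__main__":
    # Print the command line; on each client it could be run from the
    # smallfile checkout, for example with subprocess.run(cmd, check=True).
    print(shlex.join(build_smallfile_cmd("/mnt/nf/smallfileio/")))
```

Running the printed command from all three clients at the same time, against the same `--top` directory, matches the scenario described above.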

Actual results:
Ganesha service is stopped

Expected results:
Ganesha service should not stop and mount should be accessible from all clients.

Additional info:
No failure messages were seen in the Ganesha logs or the rgw logs.

Comment 5 Ramakrishnan Periyasamy 2017-01-25 12:43:21 UTC
Hi Matt,

With the same test scenario on the latest build (ceph:10.2.5-12.el7cp, NFS:nfs-ganesha-rgw-2.4.2-1.el7cp.x86_64), I am observing the crash below in rgw:

   -1> 2017-01-25 12:35:39.154179 7f5141bf7700  1 -- 10.8.128.23:0/2521978547 <== osd.4 10.8.128.112:6804/2834 3550 ==== osd_op_reply(9093 .dir.95173e61-a772-4331-bfda-20f2d5f4d9ce.53603.1 [call] v57'102548 uv102548 ondisk = 0) v7 ==== 169+0+0 (3525528611 0 0) 0x7f511c01eb60 con 0x7f516a7c8ec0
     0> 2017-01-25 12:35:39.154175 7f4fd4944700 -1 *** Caught signal (Segmentation fault) **
 in thread 7f4fd4944700 thread_name:ganesha.nfsd

 ceph version 10.2.5-12.el7cp (8614488f8c3e7a9be34e58fb1aaf23416156152c)
 1: (()+0x56dc0a) [0x7f515acf0c0a]
 2: (()+0xf370) [0x7f5167545370]
 3: (rgw::RGWLibFS::getattr(rgw::RGWFileHandle*, stat*)+0) [0x7f515ac86860]
 4: (rgw_getattr()+0x11) [0x7f515ac86d31]
 5: (rgw_fsal_open2()+0x7a6) [0x7f51643dfbb6]
 6: (mdcache_open2()+0x34f) [0x7f516908614f]
 7: (fsal_open2()+0x1fb) [0x7f5168fbacfb]
 8: (()+0x2b386) [0x7f5168fa6386]
 9: (nfs4_op_open()+0xaa9) [0x7f5168fee8a9]
 10: (nfs4_Compound()+0x63d) [0x7f5168fe0ded]
 11: (nfs_rpc_execute()+0x5bc) [0x7f5168fd1f9c]
 12: (()+0x585fa) [0x7f5168fd35fa]
 13: (()+0xe2289) [0x7f516905d289]
 14: (()+0x7dc5) [0x7f516753ddc5]
 15: (clone()+0x6d) [0x7f5166c0c73d]

Comment 12 Ramakrishnan Periyasamy 2017-02-07 06:03:22 UTC
Moving the bug to verified state.
The problem is not seen in the build below:
ceph: 10.2.5-21.el7cp (4a76f1521d766e24af2b265f9809a6bff411bf12)
nfs-rgw: nfs-ganesha-rgw-2.4.2-4.el7cp.x86_64

Comment 13 Hemanth Kumar 2017-02-10 08:44:10 UTC
The NFS-Ganesha service stopped again while running multiple IO operations across multiple clients.

IO tools such as crefi, smallfile, dd, scp, iozone and wget were running from 3 different clients on different directories under the mount point.
The IOs hung, as the data set was very large, and no IO errors were seen.
After leaving it in that state overnight, the client IO operations remained in the same state. No read/write activity was seen in "ceph -w" and no error messages appeared in the rgw log file.
I killed all the client IO operations and started to re-run them one by one.

As I started the 4th IO operation from the 3rd client, it began to error out. The ganesha service had stopped, and no crash was seen.

[root@magna020 /]# ps aux | grep ganesha
root      4192  3.0  1.3 9475868 448776 ?      Ssl  Feb08  76:19 /usr/bin/ganesha.nfsd -f /etc/ganesha/ganesha.conf
[root@magna020 /]# ps aux | grep ganesha
root      4192  3.0  1.3 9475868 448776 ?      Ssl  Feb08  76:19 /usr/bin/ganesha.nfsd -f /etc/ganesha/ganesha.conf

[root@magna020 /]# ps aux | grep ganesha
[root@magna020 /]# ps aux | grep ganesha
[root@magna020 /]#
[root@magna020 /]# /usr/bin/ganesha.nfsd -f /etc/ganesha/ganesha.conf
[root@magna020 /]# systemctl restart  ceph-radosgw@rgw.magna020.service

[root@magna020 /]# ps aux | grep ganesha
root       564  0.0  0.0 112648   964 pts/1    S+   08:08   0:00 grep --color=auto ganesha
root     32239  3.4  0.6 9462568 212996 ?      Ssl  07:47   0:43 /usr/bin/ganesha.nfsd -f /etc/ganesha/ganesha.conf
[root@magna020 /]#



Re-opening the BZ as the NFS-Ganesha service stopped again. Will update with more details if it is seen again.
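
The manual `ps aux | grep ganesha` check in the console output above could be automated roughly as follows. This is a minimal sketch, not part of the original report: the function name `ganesha_running` is hypothetical, and it assumes the standard 11-column `ps aux` layout shown in the output.

```python
def ganesha_running(ps_output):
    """Return True if any line of `ps aux` output is a ganesha.nfsd process."""
    for line in ps_output.splitlines():
        # `ps aux` prints 10 fixed columns followed by the full command line.
        fields = line.split(None, 10)
        if len(fields) < 11:
            continue
        command = fields[10]
        # Skip the grep process itself, which also mentions "ganesha".
        if command.startswith("grep "):
            continue
        if "ganesha.nfsd" in command:
            return True
    return False
```

Fed with the captured output, it reports the daemon as running for the first two samples and as stopped for the empty ones, so a test harness could poll it while the IO tools run and record when the service disappears.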

Comment 15 Matt Benjamin (redhat) 2017-02-13 22:55:22 UTC
Candidate fixes (for multiple issues) pushed to ceph and nfs-ganesha ceph-2-rhel-patches.

Comment 19 Hemanth Kumar 2017-02-16 08:00:46 UTC
Repeated the same operations as described in Comment #13 and hit the crash again.


[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/bin/ganesha.nfsd -f /etc/ganesha/ganesha.conf'.
Program terminated with signal 6, Aborted.
#0  0x00007fba8ecc223b in raise () from /lib64/libpthread.so.0
Missing separate debuginfos, use: debuginfo-install nfs-ganesha-2.4.2-5.el7cp.x86_64
(gdb) bt
#0  0x00007fba8ecc223b in raise () from /lib64/libpthread.so.0
#1  0x00007fba8246cd05 in handle_fatal_signal(int) () from /lib64/librgw.so.2
#2  <signal handler called>
#3  0x00007fba8e2c71d7 in raise () from /lib64/libc.so.6
#4  0x00007fba8e2c88c8 in abort () from /lib64/libc.so.6
#5  0x00007fba8e2c0146 in __assert_fail_base () from /lib64/libc.so.6
#6  0x00007fba8e2c01f2 in __assert_fail () from /lib64/libc.so.6
#7  0x00007fba82401468 in void boost::intrusive::detail::destructor_impl<boost::intrusive::detail::generic_hook<boost::intrusive::get_set_node_algo<void*, false>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, 0> >(boost::intrusive::detail::generic_hook<boost::intrusive::get_set_node_algo<void*, false>, boost::intrusive::member_tag, (boost::intrusive::link_mode_type)1, 0>&, boost::intrusive::detail::link_dispatch<(boost::intrusive::link_mode_type)1>) [clone .isra.584] () from /lib64/librgw.so.2
#8  0x00007fba82407f6d in rgw::RGWFileHandle::~RGWFileHandle() () from /lib64/librgw.so.2
#9  0x00007fba82407fe9 in rgw::RGWFileHandle::~RGWFileHandle() () from /lib64/librgw.so.2
#10 0x00007fba82410dcd in cohort::lru::LRU<std::mutex>::unref(cohort::lru::Object*, unsigned int) () from /lib64/librgw.so.2
#11 0x00007fba82402b5d in rgw_fh_rele () from /lib64/librgw.so.2
#12 0x00007fba8bb5bcb6 in release () from /usr/lib64/ganesha/libfsalrgw.so
#13 0x00007fba907fae12 in mdcache_lru_get ()
#14 0x00007fba90807a1e in mdcache_new_entry ()
#15 0x00007fba907ff954 in mdcache_alloc_and_check_handle ()
#16 0x00007fba90803232 in mdcache_open2 ()
#17 0x00007fba90734e49 in open2_by_name ()
#18 0x00007fba90737c18 in fsal_open2 ()
#19 0x00007fba90723386 in open4_ex ()
#20 0x00007fba9076b8a9 in nfs4_op_open ()
#21 0x00007fba9075dded in nfs4_Compound ()
#22 0x00007fba9074ef9c in nfs_rpc_execute ()
#23 0x00007fba907505fa in worker_run ()
#24 0x00007fba907da2c9 in fridgethr_start_routine ()
#25 0x00007fba8ecbadc5 in start_thread () from /lib64/libpthread.so.0
#26 0x00007fba8e38973d in clone () from /lib64/libc.so.6
(gdb)

Comment 44 Ramakrishnan Periyasamy 2017-03-01 13:04:28 UTC
> In this smallfile script I added rename and delete_renamed operations and
> when it reached delete_renamed operation , script failed as the command I
> had given in the script was wrong, it should have been "delete-renamed"
> instead of "delete_renamed". As the command exited with errors and continued
> with next operations IOs got stuck and remained stuck for a while
> Later, Checked nfs daemon and the daemon was stopped..
> Not sure if this was caused due to rename operation or something else.
> 
> 

The problem is also observed with the correct "delete-renamed" value.

" for i in create read stat chmod setxattr getxattr mkdir readdir ls-l rename delete_renamed create delete cleanup ; do python smallfile_cli.py --operation $i --threads 8 --file-size 1024 --files 10 --top /hello/folder2 ; done "

If only the above IO tool is run, no issue is seen; but if other IO is in progress at the same time, the nfs-ganesha service stops every time.
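
Since a mistyped operation name ("delete_renamed" instead of "delete-renamed") caused one of the earlier script failures, the operation list can be checked before the loop is started. The sketch below is an assumption-laden illustration: `KNOWN_OPERATIONS` contains only the names that appear in this comment, not smallfile's full list, and `check_operations` is a hypothetical helper.

```python
# Operation names taken from this comment only; smallfile may support others.
KNOWN_OPERATIONS = {
    "create", "read", "stat", "chmod", "setxattr", "getxattr", "mkdir",
    "readdir", "ls-l", "rename", "delete-renamed", "delete", "cleanup",
}


def check_operations(ops):
    """Return the names in `ops` that are not in KNOWN_OPERATIONS, in order."""
    return [op for op in ops if op not in KNOWN_OPERATIONS]
```

An empty result means every name in the loop is recognised; any returned name should be corrected before running the workload, so that a script error is not confused with the service stop being investigated.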

Comment 52 Ramakrishnan Periyasamy 2017-05-16 09:49:08 UTC
Moving this bug to verified state.

Ran IO on 4 different directories in the NFS mount from 2 clients using Crefi and Smallfile. The IO completed without any issues, and no service stop was observed.

Comment 54 errata-xmlrpc 2017-06-19 13:28:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:1497

