Bug 1413501

Summary:	rgw_obj_expirer thread segfaults and nfsd process terminates
Product:	[Red Hat Storage] Red Hat Ceph Storage	Reporter:	shilpa <smanjara>
Component:	Documentation	Assignee:	Bara Ancincova <bancinco>
Status:	CLOSED CURRENTRELEASE	QA Contact:	shilpa <smanjara>
Severity:	unspecified	Docs Contact:
Priority:	unspecified
Version:	2.1	CC:	asriram, cbodley, ceph-eng-bugs, hnallurv, kdreyer, mbenjamin, owasserm, smanjara, sweil
Target Milestone:	rc
Target Release:	2.2
Hardware:	Unspecified
OS:	Unspecified
Last Closed:	2017-03-21 23:48:46 UTC	Type:	Bug

Description shilpa 2017-01-16 08:36:25 UTC

Description of problem:
While trying to mount an NFS share on a client, the ganesha.nfsd process crashes with the following crash dump:

     0> 2017-01-16 07:36:02.166385 7f8a610ec700 -1 *** Caught signal (Aborted) **
 in thread 7f8a610ec700 thread_name:rgw_obj_expirer

 ceph version 10.2.5-3.el7cp (1337a819287fd59af47dbbe186c465dfa1b384e7)
 1: (()+0x56e10a) [0x7f8a8467510a]
 2: (()+0xf370) [0x7f8a90eca370]
 3: (gsignal()+0x37) [0x7f8a904cf1d7]
 4: (abort()+0x148) [0x7f8a904d08c8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f8a78804ab5]
 6: (()+0x5ea26) [0x7f8a78802a26]
 7: (()+0x5ea53) [0x7f8a78802a53]
 8: (()+0x5ec73) [0x7f8a78802c73]
 9: (operator new(unsigned long)+0x7d) [0x7f8a7880320d]
 10: (std::string::_Rep::_S_create(unsigned long, unsigned long, std::allocator<char> const&)+0x59) [0x7f8a78861ce9]
 11: (std::string::_Rep::_M_clone(std::allocator<char> const&, unsigned long)+0x1b) [0x7f8a788628fb]
 12: (std::basic_string<char, std::char_traits<char>, std::allocator<char> >::basic_string(std::string const&)+0x5c) [0x7f8a78862fcc]
 13: (RGWObjectExpirer::process_single_shard(std::string const&, utime_t const&, utime_t const&)+0x133) [0x7f8a844aced3]
 14: (RGWObjectExpirer::inspect_all_shards(utime_t const&, utime_t const&)+0xb2) [0x7f8a844ad542]
 15: (RGWObjectExpirer::OEWorker::entry()+0x7f) [0x7f8a844ad7ef]
 16: (()+0x7dc5) [0x7f8a90ec2dc5]
 17: (clone()+0x6d) [0x7f8a9059173d]


Version-Release number of selected component (if applicable):

Version:
nfs-ganesha-2.4.1-3.el7cp.x86_64
nfs-ganesha-rgw-2.4.1-3.el7cp.x86_64
ceph-radosgw-10.2.5-3.el7cp.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Configure NFS on RGW server and start ganesha.nfsd process
2. Try to mount it on a client:
# mount -t nfs -o nfsvers=4.1,sync,noauto,soft,proto=tcp magna039.ceph.redhat.com:/ /mntr
mount.nfs: access denied by server while mounting magna039.ceph.redhat.com:/
3. After some time, check for the ganesha.nfsd process on the server.


Expected results:
The mount should succeed and the ganesha.nfsd process should not crash.

Additional info:

NFS_Core_Param {
        #Use supplied name other than IP In NSM operations
        NSM_Use_Caller_Name = true;
        #Copy lock states into "/var/lib/nfs/ganesha" dir
        Clustered = false;
        #By default port number '2049' is used for NFS service.
        #Configure ports for MNT, NLM, RQuota services.
        #The ports chosen here are from '/etc/sysconfig/nfs' 
#        MNT_Port = 20048;
        NLM_Port = 32803;
        Rquota_Port = 875;
}

CACHEINODE {
        Entries_HWMark = 25000;
}

EXPORT_DEFAULTS {
       # To reflect nfsnobody
        Anonymous_uid = 65534;
        Anonymous_gid = 65534;
}

EXPORT
{
        Export_ID=1;
        Path = "/";
        Pseudo = "/";
        Access_Type = RW;
        NFS_Protocols = 4;
        Transport_Protocols = TCP;

        FSAL {
                Name = RGW;
                User_Id = testuser;
                Access_Key_Id = "I5P9C2G5VH0Y24ZA7F13";
                Secret_Access_Key = "cMCTan57vOfRpnZIhL5pz8EJE0tFx61gicWcXful";
        }
}

RGW {
    name = "client.rgw.magna039";
    ceph_conf = "/etc/ceph/ceph.conf";
    init_args = "--randomvar=specialk";
}

Comment 6 Matt Benjamin (redhat) 2017-01-16 18:38:35 UTC

After adding more detailed logging directives (see /etc/ganesha/ganesha.conf) and redirecting logging to a file, I see the following likely root cause:

[root@magna039 ganesha]# /usr/bin/ganesha.nfsd -f /etc/ganesha/ganesha.conf -F
2017-01-16 18:34:50.720992 7ff78f6a20c0 -1 auth: unable to find a keyring on /var/lib/ceph/radosgw/-admin/keyring: (2) No such file or directory
2017-01-16 18:34:50.722071 7ff78f6a20c0 -1 monclient(hunting): authenticate NOTE: no keyring found; disabled cephx authentication
2017-01-16 18:34:50.722536 7ff78f6a20c0 -1 Couldn't init storage provider (RADOS)
*** Caught signal (Segmentation fault) **

That is, I think the incorrect path to the radosgw admin keyring is preventing RGW from starting within the NFS-Ganesha instance.
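
One reading consistent with the "-admin" component in the keyring path above is that the cluster name expanded to an empty string and the entity name fell back to the Ceph default, client.admin. The following is a minimal sketch of that expansion, not Ceph's actual code. The template string is an assumption chosen to match the logged path (the real value comes from ceph.conf or Ceph defaults), and expand_keyring is a hypothetical helper.

```python
from string import Template

# Hypothetical keyring template, chosen only because it reproduces the logged path.
KEYRING_TEMPLATE = Template("/var/lib/ceph/radosgw/$cluster-$id/keyring")


def expand_keyring(cluster: str, name: str) -> str:
    """Expand the template for an entity name such as 'client.rgw.magna039'."""
    # Ceph entity names have the form <type>.<id>; $id is the part after the first dot.
    _type, _, entity_id = name.partition(".")
    return KEYRING_TEMPLATE.substitute(cluster=cluster, id=entity_id)


# Neither value reached librgw: empty cluster, default entity client.admin.
print(expand_keyring("", "client.admin"))
# Both values supplied explicitly.
print(expand_keyring("ceph", "client.rgw.magna039"))
```

With an empty cluster and the default entity, the sketch yields the same path that appears in the log, which is why a missing cluster or name setting is a plausible cause.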

Comment 8 Matt Benjamin (redhat) 2017-01-16 22:25:29 UTC

Update with working setup:

1. There is a segfault on shutdown after RADOS fails to initialize, triggered proximately by the misconfiguration (tracker 17638). This won't be fixed in 2.2, but is being worked on.

2. The root cause of the misconfiguration is missing values for the radosgw arguments "--name" and "--cluster". As of 2.1, the correct way to set these values (on an installation that requires them, such as this one) is to pass them as parameters in the RGW FSAL configuration block:

RGW {
    ceph_conf = "/etc/ceph/ceph.conf";
    cluster = "ceph";
    name = "client.rgw.magna039";
    init_args = "-d --debug-rgw=16";
}
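
A quick way to catch this misconfiguration before starting ganesha.nfsd is to check the RGW block for both keys. The sketch below is an illustration only: it uses a simple regular expression rather than the real ganesha.conf parser, handles only a flat top-level RGW block, and the helper names are hypothetical.

```python
import re

REQUIRED_KEYS = ("cluster", "name")  # the two values named in this comment


def rgw_block_settings(conf_text: str) -> dict:
    """Return key/value pairs from the top-level RGW { ... } block, if present."""
    block = re.search(r"^RGW\s*\{(.*?)\}", conf_text, re.DOTALL | re.MULTILINE)
    if block is None:
        return {}
    pairs = re.findall(
        r'^\s*(\w+)\s*=\s*"?([^";]*)"?\s*;', block.group(1), re.MULTILINE
    )
    return dict(pairs)


def missing_rgw_keys(conf_text: str) -> list:
    """List the required keys that are absent or empty in the RGW block."""
    settings = rgw_block_settings(conf_text)
    return [key for key in REQUIRED_KEYS if not settings.get(key)]


# The RGW block from the original report: "name" is set, "cluster" is not.
original = '''RGW {
    name = "client.rgw.magna039";
    ceph_conf = "/etc/ceph/ceph.conf";
    init_args = "--randomvar=specialk";
}'''
print(missing_rgw_keys(original))
```

Run against the block from the original report, the check reports "cluster" as missing; the working block above passes with no missing keys.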

It turns out that currently the "init_args" argument should be passed as a set of separate, null-terminated strings appended to the librgw_create(...) argv argument, not passed in on a single line.
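
The difference between the two forms can be illustrated as follows. This is a sketch of the intended splitting, not the FSAL code: build_librgw_argv is a hypothetical helper, the "nfs-ganesha" argv[0] value is a placeholder, and each list element stands in for one null-terminated C string in the argv array.

```python
import shlex


def build_librgw_argv(base_argv, init_args):
    """Append each token of init_args as its own argv entry."""
    # shlex.split applies shell-style tokenization, so quoted values stay together.
    return list(base_argv) + shlex.split(init_args)


# Passed on a single line: the whole string becomes one argv entry.
single_entry = ["nfs-ganesha", "-d --debug-rgw=16"]

# Passed as separate strings: one argv entry per option.
separate_entries = build_librgw_argv(["nfs-ganesha"], "-d --debug-rgw=16")

print(len(single_entry), single_entry)
print(len(separate_entries), separate_entries)
```

An argument parser that sees the single-entry form receives "-d --debug-rgw=16" as one unrecognized option, whereas the separate form presents "-d" and "--debug-rgw=16" individually.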