Bug 1513736
| Summary: | Client unable to see or mount NFS-Ganesha export from 12 x (4 + 2) distributed-dispersed volume | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Dustin Black <dblack> |
| Component: | nfs-ganesha | Assignee: | Kaleb KEITHLEY <kkeithle> |
| Status: | CLOSED NOTABUG | QA Contact: | Manisha Saini <msaini> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | rhgs-3.3 | CC: | amukherj, asoman, aspandey, dblack, jahernan, japplewh, jthottan, mchangir, pkarampu, rcyriac, rgowdapp, rhinduja, rhs-bugs, skoduri, ssaha, storage-qa-internal |
| Target Milestone: | --- | Keywords: | ZStream |
| Target Release: | --- | Flags: | dblack: needinfo-, dblack: needinfo- |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-02-15 05:20:28 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Attachments: | Output from 1460514 test program; Output from bug1283983 test program | | |
Offline from Soumya:
> Maybe the attributes of the filesystem returned were not valid. Is the fuse-mount of the same volume successful? To narrow down whether it's indeed an issue with only ganesha, could you please try these sample gfapi 'C' programs [1] [2]?
> [1] https://github.com/gluster/glusterfs/blob/master/tests/basic/gfapi/bug1283983.c
> [2] https://github.com/gluster/glusterfs/blob/master/tests/bugs/gfapi/bug-1447266/1460514.c
Yes, the volume mounts successfully via gluster native fuse.
I ran both of the gfapi sample programs for a while and have attached the log files. They both seem to result in critical timeout errors in communicating with the bricks.
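For reference, a rough sketch of the checks involved, assuming the node hostname n1.example.com and volume gluster1 from this setup; the mount point, log path, and the argument order of the sample programs are placeholders and should be verified against each program's source:

# Sanity-check the volume with the native fuse client first
mount -t glusterfs n1.example.com:/gluster1 /mnt/fuse-test
df -h /mnt/fuse-test && umount /mnt/fuse-test

# Build and run one of the sample gfapi programs against the same volume
gcc -o bug1283983 bug1283983.c -lgfapi
./bug1283983 gluster1 n1.example.com /tmp/bug1283983-gfapi.log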
Created attachment 1353008 [details]
Output from 1460514 test program
Created attachment 1353009 [details]
Output from bug1283983 test program

As per comment#2, even simple gfapi programs failed to get executed for this particular volume. So the issue lies in the gluster stack. In the logs, we can see that there are many disconnects between the client and the brick servers. This may have led to there not being enough disperse subvolumes up.

Request Pranith/Du to provide comments.

@Dustin,
Just to rule out network issues, are there any AVCs or firewalld warnings reported?

(In reply to Soumya Koduri from comment #5)
> As per comment#2, even simple gfapi programs failed to get executed for this
> particular volume. So the issue lies in the gluster stack. In the logs, we
> can see that there are many disconnects between the client and the brick
> servers. This may have led to there not being enough disperse subvolumes up.
>
> Request Pranith/Du to provide comments.
>
> @Dustin,
> Just to rule out network issues, are there any AVCs or firewalld warnings
> reported?

The firewalld service was running on the nodes, and all ports and services seemed to be appropriately configured for Gluster, based on the documentation. I never noticed any failures or denials related to AVC or firewalld. My lab cycles were limited, and I didn't get a chance to test again with the firewall disabled. I'll need to work on building out another reproducer lab.

I believe I have found the trigger for the problem. As part of my usual deployment, I set a default gateway on the servers, even though the subnet that I am on isn't actually connected to a router -- so I am setting an invalid or unavailable default gateway on the nodes. My nodes and clients are all on the same subnet, so this shouldn't matter at all, as there is no reason for any of the client-server or server-server communication to be routed. But with this "bad" gateway defined, the client mount will always fail as described here. If I delete the default gateway entries from all of the servers, the clients can mount via NFS with no problem.

Try adding a bogus default gateway to the servers and see if you can reproduce the behavior.
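A minimal sketch of that reproducer, assuming the 192.168.1.0/24 subnet used in this report and a gateway address (192.168.1.254 here) that has no actual router behind it:

# On each gluster node, point the default route at an unreachable gateway
ip route add default via 192.168.1.254

# From the client, the NFS mount should now fail with "access denied"
mount -t nfs -o vers=3 192.168.1.201:/gluster1 /mnt

# Remove the bogus route on the nodes and the mount should succeed again
ip route del default via 192.168.1.254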
Description of problem:
After creating a 12 x (4 + 2) distributed-dispersed volume and configuring NFS-Ganesha with HA, a RHEL NFS client is unable to see the volume exported from the Gluster nodes.

Version-Release number of selected component (if applicable):
RHGS 3.3.0 from ISO

How reproducible:
Consistently reproducible with clean builds in a hardware lab with 6 nodes, each with 12 HDDs.

Steps to Reproduce:
1. Create a 12 x (4 + 2) distributed-dispersed volume across 6 nodes with 12 HDDs each, one LVM stack and brick per HDD
2. Configure NFS-Ganesha for HA per documentation
3. Open firewall services and ports per documentation
4. Attempt 'showmount -e' or 'mount -t nfs -o vers=3' from a RHEL client

Actual results:
'showmount -e' reports no exports from the VIP; the mount command fails with an 'access denied' error.

Expected results:
'showmount -e' displays the gluster volume as exported; the mount command succeeds.

Additional info:
Testing our automated deployment, I have a 12 x (4 + 2) volume -- 6 nodes each with 12 single-disk bricks. Additionally there is an lvmcache layer attached to each brick's thin pool. We are using a 2:3 ratio for NFS-Ganesha nodes, so there are 4 nodes configured to host VIPs and share the volumes.

Everything seems properly configured for NFS-Ganesha to start correctly with HA. VIPs are up, pcs status looks good, the NFS-Ganesha service is running, and the configs show they are exporting the Gluster volume. However, attempting to mount from a client I get:

# mount -t nfs -o vers=3 192.168.1.201:/gluster1 /mnt
mount.nfs: access denied by server while mounting 192.168.1.201:/gluster1

What I immediately find on the server side is a repeating set of W and E messages in the ganesha-gfapi.log file. Here's a snippet of some lines I think are relevant; I'll share more as it's useful:

[2017-11-10 19:04:18.288694] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7efe9e943242] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7efe9e7088ae] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7efe9e7089be] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7efe9e70a130] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7efe9e70abe0] ))))) 0-gluster1-client-45: forced unwinding frame type(GlusterFS Handshake) op(SET_LK_VER(4)) called at 2017-11-10 18:54:57.701296 (xid=0x5)
[2017-11-10 19:04:18.288728] W [MSGID: 114032] [client-handshake.c:190:client_set_lk_version_cbk] 0-gluster1-client-45: received RPC status error [Transport endpoint is not connected]
[2017-11-10 19:04:18.289215] E [rpc-clnt.c:365:saved_frames_unwind] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7efe9e943242] (--> /lib64/libgfrpc.so.0(saved_frames_unwind+0x1de)[0x7efe9e7088ae] (--> /lib64/libgfrpc.so.0(saved_frames_destroy+0xe)[0x7efe9e7089be] (--> /lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0x90)[0x7efe9e70a130] (--> /lib64/libgfrpc.so.0(rpc_clnt_notify+0x2a0)[0x7efe9e70abe0] ))))) 0-gluster1-client-45: forced unwinding frame type(GF-DUMP) op(NULL(2)) called at 2017-11-10 18:54:57.701307 (xid=0x6)
[2017-11-10 19:04:18.289237] W [rpc-clnt-ping.c:203:rpc_clnt_ping_cbk] 0-gluster1-client-45: socket disconnected
[2017-11-10 19:04:18.289286] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-gluster1-client-49: disconnected from gluster1-client-49. Client process will keep trying to connect to glusterd until brick's port is available

As far as I can quickly see, the log lines are all variations on this theme, mostly pointing to different bricks.

Initially, 'showmount -e' does show the exported volume, but after some time it returns an empty list of exports. The output of 'gluster vol status' seems fine with regard to the ports all up and listening. The firewall looks to be correct, with the right services enabled and the gluster brick port range visible in the iptables output. I don't see any selinux denials.

The ganesha.log file gives some more interesting messages:

10/11/2017 18:49:33 : epoch 86200000 : n1.example.com : ganesha.nfsd-94846[dbus_heartbeat] glusterfs_create_export :FSAL :EVENT :Volume gluster1 exported at : '/'
10/11/2017 19:06:18 : epoch 86200000 : n1.example.com : ganesha.nfsd-94846[dbus_heartbeat] posix2fsal_type :FSAL :WARN :Unknown object type: 0
10/11/2017 19:06:18 : epoch 86200000 : n1.example.com : ganesha.nfsd-94846[dbus_heartbeat] posix2fsal_type :FSAL :WARN :Unknown object type: 0
10/11/2017 19:06:18 : epoch 86200000 : n1.example.com : ganesha.nfsd-94846[dbus_heartbeat] mdcache_new_entry :INODE :MAJ :unknown type 4294967295 provided
10/11/2017 19:06:18 : epoch 86200000 : n1.example.com : ganesha.nfsd-94846[dbus_heartbeat] init_export_root :EXPORT :CRIT :Lookup failed on path, ExportId=2 Path=/gluster1 FSAL_ERROR=(Invalid object type,0)
10/11/2017 19:06:43 : epoch 86200000 : n1.example.com : ganesha.nfsd-94846[dbus_heartbeat] gsh_export_addexport :EXPORT :CRIT :0 export entries in /var/run/gluster/shared_storage/nfs-ganesha/exports/export.gluster1.conf added because (invalid param value) errors
10/11/2017 19:06:43 : epoch 86200000 : n1.example.com : ganesha.nfsd-94846[dbus_heartbeat] dbus_message_entrypoint :DBUS :MAJ :Method (AddExport) on (org.ganesha.nfsd.exportmgr) failed: name = (org.freedesktop.DBus.Error.InvalidFileContent), message = (0 export entries in /var/run/gluster/shared_storage/nfs-ganesha/exports/export.gluster1.conf added because (invalid param value) errors. Details: Config File (/var/run/gluster/shared_storage/nfs-ganesha/exports/export.gluster1.conf:4): 1 validation errors in block EXPORT Config File (/var/run/gluster/shared_storage/nfs-ganesha/exports/export.gluster1.conf:4): Errors found in configuration block EXPORT

So first it seems to export the gluster1 volume, and then it complains about it. Looking at the config file that was generated for the volume, I can't see what it would be complaining about.

EXPORT {
    Export_Id = 2;
    Path = "/gluster1";
    FSAL {
        name = GLUSTER;
        hostname = "localhost";
        volume = "gluster1";
    }
    Access_type = RW;
    Disable_ACL = true;
    Squash = "No_root_squash";
    Pseudo = "/gluster1";
    Protocols = "3", "4";
    Transports = "UDP", "TCP";
    SecType = "sys";
}
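Given the AddExport failure reported over D-Bus above, one way to see which exports ganesha actually ended up with is to query its export manager directly; this is a sketch assuming the stock org.ganesha.nfsd D-Bus service shipped with nfs-ganesha is running on the node:

# List the exports ganesha currently knows about
dbus-send --print-reply --system --dest=org.ganesha.nfsd /org/ganesha/nfsd/ExportMgr org.ganesha.nfsd.exportmgr.ShowExports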