Description of problem:
In ganesha-gfapi.log we see this error many times:

[2018-10-24 13:40:51.429812] E [dht-helper.c:90:dht_fd_ctx_set] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.5/xlator/cluster/replicate.so(+0x30c27) [0x7f56a4c20c27] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.5/xlator/cluster/distribute.so(+0x6f46b) [0x7f56a47a146b] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.5/xlator/cluster/distribute.so(+0x6e67) [0x7f56a4738e67] ) 0-prod-dht: invalid argument: fd [Invalid argument]

We get it around 150 times every 15 minutes. The volume is an NFS-Ganesha export over NFSv4.

Version-Release number of selected component (if applicable):
GlusterFS v4.1.5, Ganesha v2.6.3

How reproducible:
I don't know how to reproduce it. It happens on a production cluster during normal operation, and clients have not reported any issues. The workload is mostly reads of small files.

Actual results:

Expected results:

Additional info:

gluster volume info prod:

Volume Name: prod
Type: Replicate
Volume ID: e918bd26-3318-48b3-8902-1a3b1de4f0f3
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gluster1.local:/data/glusterfs/prod/brick1/brick
Brick2: gluster2.local:/data/glusterfs/prod/brick1/brick
Brick3: gluster3.local:/data/glusterfs/prod/brick1/brick
Options Reconfigured:
performance.nl-cache-timeout: 600
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
performance.cache-size: 1GB
performance.parallel-readdir: on
performance.read-ahead: off
cluster.readdir-optimize: on
client.event-threads: 4
server.event-threads: 4
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 200000
auth.allow: 192.168.1.99,192.168.1.98
performance.nl-cache: on
cluster.enable-shared-storage: enable

NFS-Ganesha export:

EXPORT {
    Export_Id = 3;
    Path = "/prod";
    Pseudo = "/prod";
    Access_Type = RW;
    Squash = No_root_squash;
    Disable_ACL = true;
    Protocols = "4";
    Transports = "UDP","TCP";
    SecType = "sys";
    FSAL {
        Name = "GLUSTER";
        Hostname = localhost;
        Volume = "prod";
    }
}
I'm also seeing this: same Gluster version, similar setup, and Ganesha 2.5.5.
I've upgraded to Gluster 4.1.6 and NFS-Ganesha 2.7.0, and I'm still seeing the messages.
The issue is in the AFR xlator, which was passing an invalid NULL fd up to the DHT layer. This bug is now fixed by https://review.gluster.org/21617 (in the master branch); it is yet to be backported to the gluster-4.1 branch.
I also see this bug in the scenario below (glusterfs-server-5.0-1.el7):

# qemu-img create -f qcow2 gluster://$gluster_server/vol0/base.qcow2 20G
Formatting 'gluster://10.73.196.181/vol0/base.qcow2', fmt=qcow2 size=21474836480 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2018-12-25 10:45:41.885856] E [dht-helper.c:90:dht_fd_ctx_set] (-->/usr/lib64/glusterfs/3.12.2/xlator/cluster/replicate.so(+0x2bbc5) [0x7f7a63143bc5] -->/usr/lib64/glusterfs/3.12.2/xlator/cluster/distribute.so(+0x695fb) [0x7f7a62eda5fb] -->/usr/lib64/glusterfs/3.12.2/xlator/cluster/distribute.so(+0x8762) [0x7f7a62e79762] ) 0-vol0-dht: invalid argument: fd [Invalid argument]
[2018-12-25 10:45:41.987675] E [MSGID: 108006] [afr-common.c:4944:__afr_handle_child_down_event] 0-vol0-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2018-12-25 10:45:43.132843] E [MSGID: 108006] [afr-common.c:4944:__afr_handle_child_down_event] 0-vol0-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.