Bug 1642488

Summary: ganesha-gfapi.log contains many E [dht-helper.c:90:dht_fd_ctx_set] 0-prod-dht: invalid argument: fd [Invalid argument]
Product: GlusterFS (Community)
Component: ganesha-nfs
Version: 4.1
Hardware: x86_64
OS: Linux
Status: CLOSED CURRENTRELEASE
Severity: unspecified
Priority: unspecified
Reporter: renaud.fortier
Assignee: Soumya Koduri <skoduri>
CC: atumball, bugs, domonkos.cinke, pasik, renaud.fortier, timao
Keywords: Triaged
Fixed In Version: glusterfs-6.x
Last Closed: 2019-06-14 10:43:03 UTC
Type: Bug
Bug Depends On: 1655532

Description renaud.fortier 2018-10-24 14:02:03 UTC
Description of problem:
In ganesha-gfapi.log we see this error many times:

[2018-10-24 13:40:51.429812] E [dht-helper.c:90:dht_fd_ctx_set] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.5/xlator/cluster/replicate.so(+0x30c27) [0x7f56a4c20c27] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.5/xlator/cluster/distribute.so(+0x6f46b) [0x7f56a47a146b] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.5/xlator/cluster/distribute.so(+0x6e67) [0x7f56a4738e67] ) 0-prod-dht: invalid argument: fd [Invalid argument]

We get it around 150 times every 15 minutes. The NFS-Ganesha export is served over NFSv4.
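
For context, this message is produced by the argument validation at the top of dht_fd_ctx_set() in dht-helper.c, which rejects a NULL fd with EINVAL before doing anything else. Below is a minimal standalone sketch of that pattern; it is simplified (the real code uses GlusterFS's GF_VALIDATE_OR_GOTO macro) and the helper names here are stand-ins, not the actual source:

/* Simplified illustration of the check in dht-helper.c that produces
 * "invalid argument: fd [Invalid argument]".  The real code uses the
 * GF_VALIDATE_OR_GOTO macro; this sketch mimics its behaviour. */
#include <errno.h>
#include <stdio.h>
#include <string.h>

typedef struct fd fd_t;   /* opaque stand-in for the GlusterFS fd type */

static int
dht_fd_ctx_set_sketch(const char *xl_name, fd_t *fd)
{
    /* If the caller handed us a NULL fd, log the error seen in
     * ganesha-gfapi.log and bail out with EINVAL instead of
     * dereferencing it. */
    if (fd == NULL) {
        errno = EINVAL;
        fprintf(stderr, "E [dht-helper.c:dht_fd_ctx_set] 0-%s: "
                "invalid argument: fd [%s]\n", xl_name, strerror(errno));
        return -1;
    }
    /* ... normal path: attach the DHT context to the fd ... */
    return 0;
}

int
main(void)
{
    /* A caller passing a NULL fd up the stack hits exactly this branch. */
    return dht_fd_ctx_set_sketch("prod-dht", NULL) ? 1 : 0;
}

The message therefore indicates that a caller higher in the stack handed DHT a NULL fd; DHT logs and rejects it rather than dereferencing it.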

Version-Release number of selected component (if applicable):

GlusterFS v4.1.5, Ganesha v2.6.3

How reproducible:

I don't know how to reproduce it. It happens on a production cluster during normal operation, and clients have not reported any issues. The workload is mostly reads of small files.

Actual results:


Expected results:


Additional info:

gluster volume info prod:

Volume Name: prod
Type: Replicate
Volume ID: e918bd26-3318-48b3-8902-1a3b1de4f0f3
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gluster1.local:/data/glusterfs/prod/brick1/brick
Brick2: gluster2.local:/data/glusterfs/prod/brick1/brick
Brick3: gluster3.local:/data/glusterfs/prod/brick1/brick
Options Reconfigured:
performance.nl-cache-timeout: 600
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
performance.cache-size: 1GB
performance.parallel-readdir: on
performance.read-ahead: off
cluster.readdir-optimize: on
client.event-threads: 4
server.event-threads: 4
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 200000
auth.allow: 192.168.1.99,192.168.1.98
performance.nl-cache: on
cluster.enable-shared-storage: enable

NFS-Ganesha export:
EXPORT {
  Export_Id = 3;
  Path = "/prod";
  Pseudo = "/prod";
  Access_Type = RW;
  Squash = No_root_squash;
  Disable_ACL = true;
  Protocols = "4";
  Transports = "UDP","TCP";
  SecType = "sys";
  FSAL {
    Name = "GLUSTER";
    Hostname = localhost;
    Volume = "prod";
  }
}

Comment 1 Domonkos Cinke 2018-11-01 08:35:43 UTC
I'm also seeing this with the same Gluster version, a similar setup, and Ganesha 2.5.5.

Comment 2 renaud.fortier 2018-11-23 18:24:43 UTC
I've upgraded to Gluster 4.1.6 and NFS-Ganesha 2.7.0 and I'm still seeing the messages.

Comment 3 Soumya Koduri 2018-11-25 07:04:27 UTC
The issue is in the AFR xlator, which was sending an invalid NULL fd to the upper layer (DHT). This bug is now fixed by https://review.gluster.org/21617, but only on the master branch; it has yet to be backported to the gluster-4.1 branch.
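
To illustrate the bug class described above (this is not the literal patch; see the review link for the actual change), here is a hypothetical sketch of a replicate-style callback that unwinds with the fd saved in its local call state instead of a possibly-NULL fd from a child. All names in it are invented for the example:

/* Hypothetical illustration only: a completion callback that used to
 * unwind with a possibly-NULL fd, which DHT then rejected with EINVAL. */
#include <stdio.h>

typedef struct fd { int dummy; } fd_t;

struct call_local { fd_t *fd; };  /* fd saved when the op was wound down */

/* stand-in for STACK_UNWIND_STRICT: hand the result to the parent xlator */
static void
unwind_to_parent(int op_ret, int op_errno, fd_t *fd)
{
    printf("unwound with fd=%p (op_ret=%d op_errno=%d)\n",
           (void *)fd, op_ret, op_errno);
}

static void
afr_open_cbk_sketch(struct call_local *local, int op_ret, int op_errno,
                    fd_t *child_fd)
{
    /* Buggy pattern (conceptually): pass child_fd straight up; on some
     * paths it is NULL, so dht_fd_ctx_set() logs "invalid argument: fd".
     * Fixed pattern: prefer the fd saved in local, which is the fd the
     * parent actually operated on. */
    fd_t *fd = local->fd ? local->fd : child_fd;
    unwind_to_parent(op_ret, op_errno, fd);
}

int
main(void)
{
    fd_t real_fd = {0};
    struct call_local local = { .fd = &real_fd };
    afr_open_cbk_sketch(&local, 0, 0, NULL);  /* NULL child fd, now safe */
    return 0;
}

Until the backport lands, the messages are noise from DHT rejecting the NULL fd, which matches the reporter's observation that clients see no functional impact.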

Comment 4 Tingting Mao 2018-12-25 10:57:55 UTC
I also see this bug in the scenario below (glusterfs-server-5.0-1.el7):

# qemu-img create -f qcow2 gluster://$gluster_server/vol0/base.qcow2 20G
Formatting 'gluster://10.73.196.181/vol0/base.qcow2', fmt=qcow2 size=21474836480 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2018-12-25 10:45:41.885856] E [dht-helper.c:90:dht_fd_ctx_set] (-->/usr/lib64/glusterfs/3.12.2/xlator/cluster/replicate.so(+0x2bbc5) [0x7f7a63143bc5] -->/usr/lib64/glusterfs/3.12.2/xlator/cluster/distribute.so(+0x695fb) [0x7f7a62eda5fb] -->/usr/lib64/glusterfs/3.12.2/xlator/cluster/distribute.so(+0x8762) [0x7f7a62e79762] ) 0-vol0-dht: invalid argument: fd [Invalid argument]
[2018-12-25 10:45:41.987675] E [MSGID: 108006] [afr-common.c:4944:__afr_handle_child_down_event] 0-vol0-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2018-12-25 10:45:43.132843] E [MSGID: 108006] [afr-common.c:4944:__afr_handle_child_down_event] 0-vol0-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.