Description of problem:
In ganesha-gfapi.log we see this error many times:

[2018-10-24 13:40:51.429812] E [dht-helper.c:90:dht_fd_ctx_set] (-->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.5/xlator/cluster/replicate.so(+0x30c27) [0x7f56a4c20c27] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.5/xlator/cluster/distribute.so(+0x6f46b) [0x7f56a47a146b] -->/usr/lib/x86_64-linux-gnu/glusterfs/4.1.5/xlator/cluster/distribute.so(+0x6e67) [0x7f56a4738e67] ) 0-prod-dht: invalid argument: fd [Invalid argument]

We get it around 150 times every 15 minutes. The volume is an NFS-Ganesha export over NFSv4.

Version-Release number of selected component (if applicable):
GlusterFS v4.1.5, Ganesha v2.6.3

How reproducible:
I don't know how to reproduce it. It happens on a production cluster during normal operation, and clients have not reported any issues. The workload is mostly reads of small files.

Actual results:

Expected results:

Additional info:

gluster volume info prod:

Volume Name: prod
Type: Replicate
Volume ID: e918bd26-3318-48b3-8902-1a3b1de4f0f3
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: gluster1.local:/data/glusterfs/prod/brick1/brick
Brick2: gluster2.local:/data/glusterfs/prod/brick1/brick
Brick3: gluster3.local:/data/glusterfs/prod/brick1/brick
Options Reconfigured:
performance.nl-cache-timeout: 600
performance.client-io-threads: on
nfs.disable: on
transport.address-family: inet
performance.cache-size: 1GB
performance.parallel-readdir: on
performance.read-ahead: off
cluster.readdir-optimize: on
client.event-threads: 4
server.event-threads: 4
features.cache-invalidation: on
features.cache-invalidation-timeout: 600
performance.cache-invalidation: on
performance.md-cache-timeout: 600
network.inode-lru-limit: 200000
auth.allow: 192.168.1.99,192.168.1.98
performance.nl-cache: on
cluster.enable-shared-storage: enable

NFS-Ganesha export:

EXPORT {
    Export_Id = 3;
    Path = "/prod";
    Pseudo = "/prod";
    Access_Type = RW;
    Squash = No_root_squash;
    Disable_ACL = true;
    Protocols = "4";
    Transports = "UDP","TCP";
    SecType = "sys";
    FSAL {
        Name = "GLUSTER";
        Hostname = localhost;
        Volume = "prod";
    }
}
I'm also seeing this: same Gluster version, similar setup, and Ganesha 2.5.5.
I've upgraded to Gluster 4.1.6 and NFS-Ganesha 2.7.0, and I'm still seeing the messages.
The issue is in the AFR xlator, which was passing an invalid NULL fd up to the DHT layer. This bug is now fixed by https://review.gluster.org/21617 (in the master branch); it is yet to be backported to the gluster-4.1 branch.
I also see this bug in the scenario below (glusterfs-server-5.0-1.el7):

# qemu-img create -f qcow2 gluster://$gluster_server/vol0/base.qcow2 20G
Formatting 'gluster://10.73.196.181/vol0/base.qcow2', fmt=qcow2 size=21474836480 cluster_size=65536 lazy_refcounts=off refcount_bits=16
[2018-12-25 10:45:41.885856] E [dht-helper.c:90:dht_fd_ctx_set] (-->/usr/lib64/glusterfs/3.12.2/xlator/cluster/replicate.so(+0x2bbc5) [0x7f7a63143bc5] -->/usr/lib64/glusterfs/3.12.2/xlator/cluster/distribute.so(+0x695fb) [0x7f7a62eda5fb] -->/usr/lib64/glusterfs/3.12.2/xlator/cluster/distribute.so(+0x8762) [0x7f7a62e79762] ) 0-vol0-dht: invalid argument: fd [Invalid argument]
[2018-12-25 10:45:41.987675] E [MSGID: 108006] [afr-common.c:4944:__afr_handle_child_down_event] 0-vol0-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2018-12-25 10:45:43.132843] E [MSGID: 108006] [afr-common.c:4944:__afr_handle_child_down_event] 0-vol0-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.