Bug 1328581

Summary: Continuous "remote operation failed" messages in ganesha-gfapi.log while removing files from ganesha mount.
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Shashank Raj <sraj>
Component: nfs-ganesha
Assignee: Kaleb KEITHLEY <kkeithle>
Status: CLOSED CANTFIX
QA Contact: Manisha Saini <msaini>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: bmohanra, ffilz, jthottan, kkeithle, msaini, ndevos, nlevinki, otira.fuu, pkarampu, renaud.fortier, rgowdapp, rhinduja, rhs-bugs, sanandpa, skoduri, storage-qa-internal, v.badalyan
Target Milestone: ---
Keywords: Reopened, Triaged, ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Known Issue
Doc Text:
After removing a file, the nfs-ganesha process performs a lookup on the removed entry to refresh its attributes in case any hard links are still present. Because the file has already been deleted, this lookup fails with ENOENT, resulting in a misleading log message in ganesha-gfapi.log. This is expected behavior and there is no functional issue; the log message can be ignored in such cases.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-11-15 05:05:15 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 1311843    
Attachments:
ganesha-gfapi.log from mounted node (flags: none)

Description Shashank Raj 2016-04-19 18:28:22 UTC
Created attachment 1148711 [details]
ganesha-gfapi.log from mounted node

Description of problem:
Continuous "remote operation failed" messages in ganesha-gfapi.log while removing files from ganesha mount.

Version-Release number of selected component (if applicable):
ganesha-2.3.1-3

How reproducible:
Always

Steps to Reproduce:
1. Create a 4-node cluster and configure ganesha on it.
2. Create a dist-rep/tiered volume and enable ganesha on it.
3. Mount the volume on a client with vers=4 and create around 1000 files on the mount point.
4. Remove the files and observe that the messages below appear continuously in ganesha-gfapi.log (a minimal reproducer sketch is included after these steps):

[2016-04-19 07:24:51.563297] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-tiervolume-client-15: remote operation failed. Path: <gfid:2223de29-3e78-47d6-9f6f-d01e68791349> (2223de29-3e78-47d6-9f6f-d01e68791349) [No such file or directory]
[2016-04-19 07:24:51.563353] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-tiervolume-client-14: remote operation failed. Path: <gfid:2223de29-3e78-47d6-9f6f-d01e68791349> (2223de29-3e78-47d6-9f6f-d01e68791349) [No such file or directory]
[2016-04-19 07:24:51.566120] W [MSGID: 108008] [afr-read-txn.c:250:afr_read_txn] 0-tiervolume-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 2223de29-3e78-47d6-9f6f-d01e68791349. (Possible split-brain)
[2016-04-19 07:24:51.567044] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-tiervolume-client-15: remote operation failed. Path: <gfid:2223de29-3e78-47d6-9f6f-d01e68791349> (2223de29-3e78-47d6-9f6f-d01e68791349) [No such file or directory]
[2016-04-19 07:24:51.567143] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-tiervolume-client-14: remote operation failed. Path: <gfid:2223de29-3e78-47d6-9f6f-d01e68791349> (2223de29-3e78-47d6-9f6f-d01e68791349) [No such file or directory]
[2016-04-19 07:24:51.568218] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-tiervolume-client-14: remote operation failed. Path: <gfid:2223de29-3e78-47d6-9f6f-d01e68791349> (2223de29-3e78-47d6-9f6f-d01e68791349) [No such file or directory]
[2016-04-19 07:24:51.568270] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-tiervolume-client-15: remote operation failed. Path: <gfid:2223de29-3e78-47d6-9f6f-d01e68791349> (2223de29-3e78-47d6-9f6f-d01e68791349) [No such file or directory]
[2016-04-19 07:24:51.568790] E [MSGID: 109040] [dht-helper.c:1179:dht_migration_complete_check_task] 0-tiervolume-hot-dht: <gfid:2223de29-3e78-47d6-9f6f-d01e68791349>: failed to lookup the file on tiervolume-hot-dht [Stale file handle]
[2016-04-19 07:24:51.571522] W [MSGID: 108008] [afr-read-txn.c:250:afr_read_txn] 0-tiervolume-replicate-0: Unreadable subvolume -1 found with event generation 2 for gfid 2223de29-3e78-47d6-9f6f-d01e68791349. (Possible split-brain)
[2016-04-19 07:24:51.572629] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-tiervolume-client-14: remote operation failed. Path: <gfid:2223de29-3e78-47d6-9f6f-d01e68791349> (2223de29-3e78-47d6-9f6f-d01e68791349) [No such file or directory]
[2016-04-19 07:24:51.572703] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-tiervolume-client-15: remote operation failed. Path: <gfid:2223de29-3e78-47d6-9f6f-d01e68791349> (2223de29-3e78-47d6-9f6f-d01e68791349) [No such file or directory]
[2016-04-19 07:24:51.575765] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-tiervolume-client-15: remote operation failed. Path: <gfid:2223de29-3e78-47d6-9f6f-d01e68791349> (2223de29-3e78-47d6-9f6f-d01e68791349) [No such file or directory]
[2016-04-19 07:24:51.576525] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-tiervolume-client-14: remote operation failed. Path: <gfid:2223de29-3e78-47d6-9f6f-d01e68791349> (2223de29-3e78-47d6-9f6f-d01e68791349) [No such file or directory]
[2016-04-19 07:24:51.576806] E [MSGID: 109040] [dht-helper.c:1179:dht_migration_complete_check_task] 0-tiervolume-tier-dht: <gfid:2223de29-3e78-47d6-9f6f-d01e68791349>: failed to lookup the file on tiervolume-tier-dht [No such file or directory]
[2016-04-19 07:24:51.585415] W [MSGID: 114031] [client-rpc-fops.c:531:client3_3_stat_cbk] 0-tiervolume-client-12: remote operation failed [No such file or directory]
[2016-04-19 07:24:51.586384] W [MSGID: 114031] [client-rpc-fops.c:2974:client3_3_lookup_cbk] 0-tiervolume-client-12: remote operation failed. Path: <gfid:d2e59838-2fa8-4096-a954-7a94d911ebf2> (d2e59838-2fa8-4096-a954-7a94d911ebf2) [No such file or directory]

5. All the files get removed without any issues.
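
For reference, a minimal reproducer sketch in C that performs the create/remove pattern from steps 3-4 over the NFS mount; the mount path and file count are placeholders, not taken from the report:

/* Minimal reproducer sketch: create and then remove many files on the
 * ganesha mount while watching ganesha-gfapi.log on the server nodes.
 * MOUNT_DIR and NFILES are placeholders for this illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define MOUNT_DIR "/mnt/ganesha"   /* assumed NFSv4 mount point */
#define NFILES    1000

int main(void)
{
    char path[256];

    for (int i = 0; i < NFILES; i++) {
        snprintf(path, sizeof(path), MOUNT_DIR "/file-%d", i);
        int fd = open(path, O_CREAT | O_WRONLY, 0644);
        if (fd < 0) {
            perror("open");
            return EXIT_FAILURE;
        }
        close(fd);
    }

    /* Removing the files is what triggers the lookup-after-unlink
     * warnings on the server side. */
    for (int i = 0; i < NFILES; i++) {
        snprintf(path, sizeof(path), MOUNT_DIR "/file-%d", i);
        if (unlink(path) < 0)
            perror("unlink");
    }

    return EXIT_SUCCESS;
}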

Actual results:

Continuous "remote operation failed" messages in ganesha-gfapi.log while removing files from ganesha mount.

Expected results:

There should be no "remote operation failed" messages.

Additional info:

Comment 2 Soumya Koduri 2016-04-20 06:46:55 UTC
This is expected behaviour when using nfs-ganesha. NFS-Ganesha performs a stat on a file that is being removed, in order to refresh its attributes in case the link count has increased in the meantime. This stat is internally converted to a GlusterFS LOOKUP fop, which fails with ENOENT.
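
As an illustration only (not nfs-ganesha code), here is a minimal libgfapi sketch that approximates the sequence: unlink a file and then stat it again, the way ganesha refreshes attributes after a REMOVE. The stat fails with ENOENT; in ganesha the equivalent lookup is resolved by handle (a nameless lookup), which is what protocol/client logs as "remote operation failed". The server name "server1" and volume name "tiervolume" are placeholders:

/* Sketch, not ganesha code: stat-after-unlink via libgfapi.
 * Build with: gcc sketch.c -lgfapi */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <glusterfs/api/glfs.h>

int main(void)
{
    struct stat st;
    glfs_t *fs = glfs_new("tiervolume");                    /* placeholder volume name */
    if (!fs)
        return 1;

    glfs_set_volfile_server(fs, "tcp", "server1", 24007);   /* placeholder server */
    glfs_set_logging(fs, "/tmp/gfapi-sketch.log", 7);       /* 7 = DEBUG */
    if (glfs_init(fs) != 0)
        return 1;

    glfs_fd_t *fd = glfs_creat(fs, "testfile", O_CREAT | O_WRONLY, 0644);
    if (fd)
        glfs_close(fd);

    glfs_unlink(fs, "testfile");

    /* The post-remove attribute refresh: fails with ENOENT because the
     * file is already gone, and the resulting LOOKUP is what gets logged. */
    if (glfs_stat(fs, "testfile", &st) != 0)
        printf("stat after unlink failed: %s\n", strerror(errno));

    glfs_fini(fs);
    return 0;
}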


Upstream discussion on this topic can be found at the link below:
        https://sourceforge.net/p/nfs-ganesha/mailman/message/32573278/

It may be worth re-checking with the community whether this theory still holds.

Comment 3 Soumya Koduri 2016-04-27 10:20:13 UTC
I have submitted a patch upstream to fix this. But as mentioned above, the extra stat seems intentional and may be necessary. More discussion can be found at
 - https://review.gerrithub.io/#/c/273611

Comment 4 Soumya Koduri 2016-04-28 09:49:41 UTC
From the discussions we had with Frank (the nfs-ganesha upstream maintainer), it looks like nfs-ganesha does need to do a stat on the removed entry, for the following reasons.

* We need to kill the cache entry (cache_inode_kill_entry()) after successful removal of the entry, if there are no links left. If we instead wait for cache_inode to reap it when required, there have been issues reported (in cephfs) of the stale entry being used when I/O is performed on the removed file.

* And we cannot call cache_inode_kill_entry() right after the unlink without refreshing the attributes, because other links (followed by fops such as OPEN*) could have been created for the same file during the unlink operation, and we could end up losing their state.

Hence it seems mandatory to do a stat and fetch the attributes of the file entry being removed, which results in the error messages this bug complains about. It looks like this will be a WONTFIX bug. We may need to document it as a known issue.
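
To make the flow above concrete, here is a purely hypothetical sketch of the post-unlink decision it describes; entry_t, refresh_attributes() and kill_cache_entry() are invented names for illustration and are not the real cache_inode interfaces:

#include <stdio.h>

typedef struct {
    unsigned int numlinks;   /* link count from the refreshed attributes */
} entry_t;

/* Stub standing in for the stat/LOOKUP that refreshes attributes; it
 * returns failure (ENOENT) here because the file was just removed. */
static int refresh_attributes(entry_t *e)
{
    (void)e;
    printf("LOOKUP on removed entry -> ENOENT (logged as a warning)\n");
    return -1;
}

/* Invented name, analogous in spirit to cache_inode_kill_entry(). */
static void kill_cache_entry(entry_t *e)
{
    (void)e;
    printf("cache entry dropped\n");
}

static void after_unlink(entry_t *entry)
{
    /* Refresh first: another hard link (possibly with OPEN state) may
     * have been created during the unlink, and its state must be kept. */
    if (refresh_attributes(entry) != 0 || entry->numlinks == 0)
        kill_cache_entry(entry);   /* no links left: drop the entry now */
}

int main(void)
{
    entry_t e = { .numlinks = 0 };
    after_unlink(&e);
    return 0;
}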

Comment 5 Niels de Vos 2016-04-28 16:11:41 UTC
If we reduce the log level for client-rpc-fops.c:2974:client3_3_lookup_cbk, the following entries would be left:

2x dht-helper.c:1179:dht_migration_complete_check_task
2x afr-read-txn.c:250:afr_read_txn

It would reduce the logging for this non-error by 10 out of 14 messages. We need to hear from the AFR and DHT teams whether these messages are sufficient for them to identify potential issues.

Comment 6 Niels de Vos 2016-04-28 16:13:01 UTC
Added Du and Pranith on NEEDINFO, but anyone from the DHT and AFR teams is welcome to give their comments.

Comment 8 Kaleb KEITHLEY 2016-06-20 12:27:09 UTC
Need a decision on reducing the log level of this message as this is expected behavior.

Comment 9 Badalyan Vyacheslav 2016-08-21 16:29:07 UTC
https://github.com/nfs-ganesha/nfs-ganesha/issues/106

Comment 10 Niels de Vos 2016-09-29 14:22:01 UTC
Requested feedback on gluster-devel too:
  http://www.gluster.org/pipermail/gluster-devel/2016-September/051063.html

Comment 11 Raghavendra G 2016-09-30 03:28:40 UTC
(In reply to Niels de Vos from comment #5)
> If we reduce the log level for client-rpc-fops.c:2974:client3_3_lookup_cbk
> there would be the following entries left:

How are you reducing the log level in client-rpc-fops? One thing that puzzles me is that the lookup sent by the resolver (as part of the stat) is a nameless lookup, and when a nameless lookup fails because the gfid handle is missing, it should return ESTALE. However, we are getting ENOENT on the failure of the nameless lookup here, and that is what makes protocol/client log this message. Note that protocol/client avoids logging the file-missing error in lookup_cbk in two cases:

1. If lookup is named lookup (loc->name != NULL) and errno is ENOENT
2. If lookup is nameless lookup and errno is ESTALE

I did a quick check of storage/posix and it seems to return ESTALE if it cannot find the gfid handle on the backend. Is it possible to figure out which xlator on the brick is returning ENOENT instead of ESTALE? We should be fixing that.
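
For illustration, a hedged sketch of the two suppression cases listed above; this is not the literal client3_3_lookup_cbk() code, just the condition the comment describes, with loc_name standing in for loc->name:

#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* A failed LOOKUP is logged as "remote operation failed" only when it
 * is neither a named lookup failing with ENOENT nor a nameless
 * (gfid-based) lookup failing with ESTALE. */
static bool should_log_lookup_failure(const char *loc_name, int op_errno)
{
    bool named_enoent    = (loc_name != NULL) && (op_errno == ENOENT);
    bool nameless_estale = (loc_name == NULL) && (op_errno == ESTALE);

    return !named_enoent && !nameless_estale;
}

int main(void)
{
    /* The case in this bug: a nameless lookup coming back with ENOENT
     * instead of ESTALE, so the warning gets logged. */
    printf("nameless + ENOENT -> log? %d\n", should_log_lookup_failure(NULL, ENOENT));
    printf("nameless + ESTALE -> log? %d\n", should_log_lookup_failure(NULL, ESTALE));
    return 0;
}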

> 
> 2x dht-helper.c:1179:dht_migration_complete_check_task

I think we can mute file-missing error messages during migration, since a missing file can be a valid scenario there and not an error. I'll send a patch to do this.

> 2x afr-read-txn.c:250:afr_read_txn
> 
> it would reduce the logging for this non-error with 10 out of 14 messages.
> We need to know from the AFR and DHT team if these messages are sufficient
> for them to identify potential issues.

Comment 12 Pranith Kumar K 2016-09-30 03:38:47 UTC
(In reply to Niels de Vos from comment #5)
> If we reduce the log level for client-rpc-fops.c:2974:client3_3_lookup_cbk
> there would be the following entries left:
> 
> 2x dht-helper.c:1179:dht_migration_complete_check_task
> 2x afr-read-txn.c:250:afr_read_txn

I am not sure how an inode that is not in split-brain ends up being linked in the 'no read-subvolumes' case. That is something to debug.

> 
> it would reduce the logging for this non-error with 10 out of 14 messages.
> We need to know from the AFR and DHT team if these messages are sufficient
> for them to identify potential issues.

Comment 17 Niels de Vos 2017-08-30 13:00:34 UTC
The LOOKUP that is done on the just-removed file should get caught by the negative-lookup cache. This is a new feature that was not available at the time this bug was reported. Make sure the performance.nl-cache volume option is enabled in case it is disabled.

Comment 18 Niels de Vos 2018-01-25 09:19:39 UTC
This should probably get retested with the idea from comment #17.

Comment 20 renaud.fortier 2018-09-28 15:21:13 UTC
(In reply to Niels de Vos from comment #18)
> This should probably get retested with the idea from comment #17.

I have the same behaviour in ganesha-gfapi.log with gluster 4.1.5, ganesha 2.6.0, performance.nl-cache=on, performance.nl-cache-timeout=600