Bug 1729223 - Ganesha-gfapi logs are flooded with error messages related to "gf_uuid_is_null(gfid)) [Invalid argument]" when lookups are running from multiple clients
Summary: Ganesha-gfapi logs are flooded with error messages related to "gf_uuid_is_null(gfid)) [Invalid argument]" when lookups are running from multiple clients
Keywords:
Status: CLOSED NEXTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: libgfapi
Version: 4.1
Hardware: All
OS: All
Priority: medium
Severity: high
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact: bugs@gluster.org
URL:
Whiteboard:
Depends On: 1718316 1720635 1720636
Blocks: 1717784 1726935 1730489
 
Reported: 2019-07-11 15:53 UTC by Soumya Koduri
Modified: 2019-07-16 21:08 UTC (History)

Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1720635
Environment:
Last Closed: 2019-07-16 20:52:53 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Gluster.org Gerrit 23031 0 None Merged gfapi: fix incorrect initialization of upcall syncop arguments 2019-07-16 20:52:52 UTC

Description Soumya Koduri 2019-07-11 15:53:53 UTC
+++ This bug was initially created as a clone of Bug #1720635 +++

+++ This bug was initially created as a clone of Bug #1718316 +++

+++ This bug was initially created as a clone of Bug #1717784 +++

Description of problem:
=========================

Ganesha-gfapi logs are flooded with error messages related to gf_uuid_is_null(gfid) when linux untars and lookups are running from multiple clients:

---------
[2019-06-06 07:56:12.503603] E [glfs-handleops.c:1892:glfs_h_find_handle] (-->/lib64/libgfapi.so.0(+0xe0ae) [0x7f7e91e8b0ae] -->/lib64/libgfapi.so.0(+0x258f1) [0x7f7e91ea28f1] -->/lib64/libgfapi.so.0(+0x257c4) [0x7f7e91ea27c4] ) 0-glfs_h_find_handle: invalid argument: !(gf_uuid_is_null(gfid)) [Invalid argument]

---------


Version-Release number of selected component (if applicable):
===========================
# cat /etc/redhat-release 
Red Hat Enterprise Linux Server release 7.7 Beta (Maipo)

# rpm -qa | grep ganesha
nfs-ganesha-2.7.3-3.el7rhgs.x86_64
glusterfs-ganesha-6.0-3.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.7.3-3.el7rhgs.x86_64
nfs-ganesha-gluster-2.7.3-3.el7rhgs.x86_64



How reproducible:
=====================
2/2


Steps to Reproduce:
======================
1. Create a 4-node Ganesha cluster.
2. Create a 4x3 distribute-replicate volume. Export the volume via Ganesha.
3. Mount the volume on 4 clients via the v4.1 protocol.
4. Run the following workload:

Client 1: linux untars
Client 2: du -sh in a loop
Client 3: ls -lRt in a loop
Client 4: find in a loop

Actual results:
==================

While the test is running, the ganesha-gfapi logs are flooded with errors related to "gf_uuid_is_null":

======
[2019-06-03 16:54:19.829136] E [glfs-handleops.c:1892:glfs_h_find_handle] (-->/lib64/libgfapi.so.0(+0xe0ae) [0x7ff6902d00ae] -->/lib64/libgfapi.so.0(+0x2594a) [0x7ff6902e794a] -->/lib64/libgfapi.so.0(+0x257c4) [0x7ff6902e77c4] ) 0-glfs_h_find_handle: invalid argument: !(gf_uuid_is_null(gfid)) [Invalid argument]
[2019-06-03 16:54:20.006163] E [glfs-handleops.c:1892:glfs_h_find_handle] (-->/lib64/libgfapi.so.0(+0xe0ae) [0x7ff6902d00ae] -->/lib64/libgfapi.so.0(+0x2594a) [0x7ff6902e794a] -->/lib64/libgfapi.so.0(+0x257c4) [0x7ff6902e77c4] ) 0-glfs_h_find_handle: invalid argument: !(gf_uuid_is_null(gfid)) [Invalid argument]
[2019-06-03 16:54:20.320293] E [glfs-handleops.c:1892:glfs_h_find_handle] (-->/lib64/libgfapi.so.0(+0xe0ae) [0x7ff6902d00ae] -->/lib64/libgfapi.so.0(+0x2594a) [0x7ff6902e794a] -->/lib64/libgfapi.so.0(+0x257c4) [0x7ff6902e77c4] ) 0-glfs_h_find_handle: invalid argument: !(gf_uuid_is_null(gfid)) [Invalid argument]
=====

# cat /var/log/ganesha/ganesha-gfapi.log | grep gf_uuid_is_null | wc -l
605340


Expected results:
===================
There should be no such error messages in the ganesha-gfapi logs.

Additional info:
===================

On narrowing down the test scenario, the error messages seem to appear when just du -sh and ls -lRt are running in a loop from two different clients.

--- Additional comment from RHEL Product and Program Management on 2019-06-06 08:10:27 UTC ---

This bug is automatically being proposed for the next minor release of Red Hat Gluster Storage by setting the release flag 'rhgs‑3.5.0' to '?'. 

If this bug should be proposed for a different release, please manually change the proposed release flag.

--- Additional comment from Soumya Koduri on 2019-06-06 09:48:36 UTC ---

@Manisha,

are these clients connected to different NFS-Ganesha servers? On which machine did you observe these errors? I do not see such messages in the sosreports uploaded. 

>>> On narrowing down the test scenario,Seems to be the error messages are coming when only du -sh and ls -lRt are running in loop from two different clients

Does this mean, these messages are not seen with just linux untar test?

--- Additional comment from Manisha Saini on 2019-06-06 10:16:00 UTC ---

(In reply to Soumya Koduri from comment #3)
> @Manisha,
> 
> are these clients connected to different NFS-Ganesha servers? On which
> machine did you observe these errors? I do not see such messages in the
> sosreports uploaded. 

Hi Soumya,

All the clients are connected to a single server VIP.
I see there is some issue with how sosreport collects the ganesha logs. Not all logs are captured as part of the sosreport.
> 
> >>> On narrowing down the test scenario,Seems to be the error messages are coming when only du -sh and ls -lRt are running in loop from two different clients
> 
> Does this mean, these messages are not seen with just linux untar test?

No. Not seen with only untars.

--- Additional comment from Soumya Koduri on 2019-06-07 10:08:03 UTC ---

Thanks Manisha for sharing the setup and logs.

"0-glfs_h_find_handle: invalid argument: !(gf_uuid_is_null(gfid)) [Invalid argument] "

The above message is logged while processing upcall requests. Somehow the gfid passed has become NULL. IMO there are two issues to be considered here -

* There are so many upcall requests generated even though there is only a single server serving all the clients.

Seems like the data being accessed is huge, and hence the server is trying to clean up the inodes from the lru list. While destroying an inode, the upcall xlator sends a cache-invalidation request to all its clients to notify them that the particular file/inode entry is no longer cached by the server. This logic can be optimized a bit here.
For nameless lookups, the server generates a dummy inode (say inodeD) and later links it into the inode table (if there is no entry already present for that file/dir) in the cbk path. So as part of lookup_cbk, though the inode (inodeD) received is invalid, the upcall xlator creates an inode_ctx entry, as the inode can eventually get linked into the inode table. However, in certain cases, if there is already an inode (say inodeC) present for that particular file, the new inode (inodeD) will be purged, which results in sending upcall notifications to the clients.

In Manisha's test case, as the data created is huge and is being looked up in a loop, there are many such dummy inode entries getting purged, resulting in a huge number of upcall notifications being sent to the clients. We can avoid this issue to an extent by checking whether the inode is valid (i.e., linked) before sending callback notifications.

Note: this has been a day-1 issue, but it would be good to fix.

* Another issue is the gfid becoming NULL in the upcall args.

> I couldn't reproduce this issue on my setup. However, it seems that in the upcall xlator we already check that the gfid is not NULL before sending the notification:

    GF_VALIDATE_OR_GOTO("upcall_client_cache_invalidate",
                        !(gf_uuid_is_null(gfid)), out);

So that means that somewhere in the client-side processing the gfid has become NULL. From further code reading, I see a potential issue in the upcall processing callback function -

In glfs_cbk_upcall_data(),

--
    args->fs = fs;
    args->upcall_data = gf_memdup(upcall_data, sizeof(*upcall_data));
--

gf_memdup() may not be the right routine to use here, as the upcall_data structure contains pointers to other data. This definitely needs to be fixed. However, I would like to re-confirm whether this is what caused the gfid to become NULL. I request Manisha to share the setup (if possible) while the tests are going on, to confirm this theory. Thanks!

--- Additional comment from Worker Ant on 2019-06-07 14:09:44 UTC ---

REVIEW: https://review.gluster.org/22839 (gfapi: fix incorrect initialization of upcall syncop arguments) posted (#1) for review on master by soumya k

--- Additional comment from Worker Ant on 2019-06-14 12:11:53 UTC ---

REVIEW: https://review.gluster.org/22871 (gfapi: fix incorrect initialization of upcall syncop arguments) posted (#1) for review on release-6 by soumya k

--- Additional comment from Worker Ant on 2019-07-05 09:20:16 UTC ---

REVIEW: https://review.gluster.org/22871 (gfapi: fix incorrect initialization of upcall syncop arguments) merged (#7) on release-6 by hari gowtham

Comment 1 Worker Ant 2019-07-11 16:04:19 UTC
REVIEW: https://review.gluster.org/23031 (gfapi: fix incorrect initialization of upcall syncop arguments) posted (#1) for review on release-4.1 by soumya k

Comment 2 Worker Ant 2019-07-16 20:52:53 UTC
REVIEW: https://review.gluster.org/23031 (gfapi: fix incorrect initialization of upcall syncop arguments) merged (#1) on release-4.1 by soumya k

