Bug 1334664

Summary: Excessive error messages are seen in the hot tier's brick logs in a tiered volume
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: krishnaram Karthick <kramdoss>
Component: index
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED ERRATA
QA Contact: krishnaram Karthick <kramdoss>
Severity: high
Priority: unspecified
Version: rhgs-3.1
CC: amukherj, nbalacha, nchilaka, pkarampu, rcyriac, rhinduja, rhs-bugs, rkavunga, sankarshan
Target Milestone: ---
Target Release: RHGS 3.2.0
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.8.4-1
Doc Type: Bug Fix
Story Points: ---
Last Closed: 2017-03-23 05:30:27 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
Category: ---
oVirt Team: ---
Cloudforms Team: ---
Bug Depends On: 1336630, 1341477, 1341482, 1368608
Bug Blocks: 1351522

Description krishnaram Karthick 2016-05-10 09:22:33 UTC
Description of problem:
A lot of error messages related to posix_lookup are seen in the hot tier's brick logs. No functionality is broken; however, the excessive error messages are a concern, and we should look at why they are logged.

[2016-05-10 08:29:54.009247] E [MSGID: 113091] [posix.c:178:posix_lookup] 0-superman-posix: null gfid for path (null)
[2016-05-10 08:29:54.009276] E [MSGID: 113018] [posix.c:196:posix_lookup] 0-superman-posix: lstat on null failed [Invalid argument]
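
As a rough way to gauge how noisy these messages are, the two message IDs can be counted in a brick log on one of the hot-tier nodes. Brick logs live under /var/log/glusterfs/bricks/ and are named after the brick path; the file name below is an example for the /bricks/brick7/sm1 brick:

# count the "null gfid" and "lstat on null" messages in one brick log
grep -c 'MSGID: 113091' /var/log/glusterfs/bricks/bricks-brick7-sm1.log
grep -c 'MSGID: 113018' /var/log/glusterfs/bricks/bricks-brick7-sm1.log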

Volume Name: superman
Type: Tier
Volume ID: afa1866e-2d0b-424f-8e22-a782f9068b25
Status: Started
Number of Bricks: 20
Transport-type: tcp
Hot Tier :
Hot Tier Type : Distributed-Replicate
Number of Bricks: 4 x 2 = 8
Brick1: 10.70.35.133:/bricks/brick7/sm1
Brick2: 10.70.35.10:/bricks/brick7/sm1
Brick3: 10.70.35.11:/bricks/brick7/sm1
Brick4: 10.70.35.225:/bricks/brick7/sm1
Brick5: 10.70.35.239:/bricks/brick7/sm1
Brick6: 10.70.37.60:/bricks/brick7/sm1
Brick7: 10.70.37.120:/bricks/brick7/sm1
Brick8: 10.70.37.101:/bricks/brick7/sm1
Cold Tier:
Cold Tier Type : Distributed-Disperse
Number of Bricks: 2 x (4 + 2) = 12
Brick9: 10.70.37.101:/bricks/brick0/l1
Brick10: 10.70.37.120:/bricks/brick0/l1
Brick11: 10.70.37.60:/bricks/brick0/l1
Brick12: 10.70.35.239:/bricks/brick0/l1
Brick13: 10.70.35.225:/bricks/brick0/l1
Brick14: 10.70.35.11:/bricks/brick0/l1
Brick15: 10.70.35.10:/bricks/brick0/l1
Brick16: 10.70.35.133:/bricks/brick0/l1
Brick17: 10.70.37.101:/bricks/brick1/l1
Brick18: 10.70.37.120:/bricks/brick1/l1
Brick19: 10.70.37.60:/bricks/brick1/l1
Brick20: 10.70.35.239:/bricks/brick1/l1
Options Reconfigured:
cluster.tier-mode: cache
features.ctr-enabled: on
features.quota-deem-statfs: on
features.inode-quota: on
features.quota: on
performance.readdir-ahead: on
cluster.enable-shared-storage: enable
nfs-ganesha: disable


Version-Release number of selected component (if applicable):
glusterfs-server-3.7.9-4.el7rhgs.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Create a disperse volume.
2. From a FUSE or NFS mount, create files and directories, and untar a kernel package.
3. While the above operations are in progress, attach a hot tier.
4. Enable quota.
5. Monitor the hot tier's brick logs (a command-level sketch of these steps follows this list).
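
A command-level sketch of the steps above, assuming a six-brick 4+2 disperse layout; server1..server6 and the brick paths are placeholders based on this setup, and the attach-tier/quota syntax is the one available in the glusterfs 3.7 line (check "gluster volume help" on the installed build):

gluster volume create superman disperse 6 redundancy 2 \
    server{1..6}:/bricks/brick0/l1
gluster volume start superman
mkdir -p /mnt/superman
mount -t glusterfs server1:/superman /mnt/superman
# ... create files/directories and untar a kernel tarball under /mnt/superman ...
gluster volume attach-tier superman replica 2 \
    server1:/bricks/brick7/sm1 server2:/bricks/brick7/sm1 \
    server3:/bricks/brick7/sm1 server4:/bricks/brick7/sm1
gluster volume quota superman enable
tail -f /var/log/glusterfs/bricks/bricks-brick7-sm1.log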

Actual results:
The error messages posted above are logged in the brick logs every 10 minutes.
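
To confirm the roughly 10-minute cadence, the timestamps of the matching lines can be pulled out of a brick log (again, the log file name is an example for the /bricks/brick7/sm1 brick):

# print only the timestamps of the "null gfid" messages to eyeball the interval
grep 'MSGID: 113091' /var/log/glusterfs/bricks/bricks-brick7-sm1.log | awk '{print $1, $2}'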

Expected results:
No error messages should be seen; we need to root-cause (RCA) whether these errors are a concern.

Additional info:
sosreports will be attached shortly.

Comment 3 Nag Pavan Chilakam 2016-05-16 09:54:51 UTC
Seeing this on a non-tiered volume as well:
I am seeing this continuously on my stress setup.
[2016-05-16 09:42:48.004818] E [MSGID: 113018] [posix.c:196:posix_lookup] 0-stressvol-posix: lstat on null failed [Invalid argument]
[2016-05-16 09:42:48.009542] E [MSGID: 113091] [posix.c:178:posix_lookup] 0-stressvol-posix: null gfid for path (null)
[2016-05-16 09:42:48.009563] E [MSGID: 113018] [posix.c:196:posix_lookup] 0-stressvol-posix: lstat on null failed [Invalid argument]
[2016-05-16 09:42:48.009687] E [MSGID: 113091] [posix.c:178:posix_lookup] 0-stressvol-posix: null gfid for path (null)
[2016-05-16 09:42:48.009710] E [MSGID: 113018] [posix.c:196:posix_lookup] 0-stressvol-posix: lstat on null failed [Invalid argument]


It is seen in all brick logs.

[root@dhcp35-135 ~]# rpm -qa|grep gluster
glusterfs-api-3.7.9-4.el7rhgs.x86_64
glusterfs-3.7.9-4.el7rhgs.x86_64
glusterfs-libs-3.7.9-4.el7rhgs.x86_64
glusterfs-fuse-3.7.9-4.el7rhgs.x86_64
glusterfs-cli-3.7.9-4.el7rhgs.x86_64
python-gluster-3.7.9-4.el7rhgs.noarch
glusterfs-client-xlators-3.7.9-4.el7rhgs.x86_64
glusterfs-server-3.7.9-4.el7rhgs.x86_64
[root@dhcp35-135 ~]# gluster v info
 
Volume Name: stressvol
Type: Distributed-Replicate
Volume ID: a485542e-3389-446c-bb80-c43464e92f48
Status: Started
Number of Bricks: 4 x 2 = 8
Transport-type: tcp
Bricks:
Brick1: 10.70.35.135:/rhs/brick1/stressvol
Brick2: 10.70.35.196:/rhs/brick1/stressvol
Brick3: 10.70.35.99:/rhs/brick1/stressvol
Brick4: 10.70.35.14:/rhs/brick1/stressvol
Brick5: 10.70.35.135:/rhs/brick2/stressvol
Brick6: 10.70.35.196:/rhs/brick2/stressvol
Brick7: 10.70.35.99:/rhs/brick2/stressvol
Brick8: 10.70.35.14:/rhs/brick2/stressvol
Options Reconfigured:
performance.readdir-ahead: on
[root@dhcp35-135 ~]# gluster v status
Status of volume: stressvol
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.135:/rhs/brick1/stressvol    49152     0          Y       14311
Brick 10.70.35.196:/rhs/brick1/stressvol    49152     0          Y       14243
Brick 10.70.35.99:/rhs/brick1/stressvol     49152     0          Y       14241
Brick 10.70.35.14:/rhs/brick1/stressvol     49152     0          Y       14224
Brick 10.70.35.135:/rhs/brick2/stressvol    49153     0          Y       14330
Brick 10.70.35.196:/rhs/brick2/stressvol    49153     0          Y       14262
Brick 10.70.35.99:/rhs/brick2/stressvol     49153     0          Y       14260
Brick 10.70.35.14:/rhs/brick2/stressvol     49153     0          Y       14243
NFS Server on localhost                     2049      0          Y       14355
Self-heal Daemon on localhost               N/A       N/A        Y       14352
NFS Server on 10.70.35.14                   2049      0          Y       14265
Self-heal Daemon on 10.70.35.14             N/A       N/A        Y       14270
NFS Server on 10.70.35.196                  2049      0          Y       14284
Self-heal Daemon on 10.70.35.196            N/A       N/A        Y       14289
NFS Server on 10.70.35.99                   2049      0          Y       14282
Self-heal Daemon on 10.70.35.99             N/A       N/A        Y       14287
 
Task Status of Volume stressvol
------------------------------------------------------------------------------
There are no active volume tasks


http://etherpad.corp.redhat.com/3-1-3-systemic-testing

Comment 10 Atin Mukherjee 2016-09-17 14:39:40 UTC
Upstream mainline : http://review.gluster.org/14589
Upstream 3.8 : http://review.gluster.org/14596

The fix is available in rhgs-3.2.0 as part of the rebase to GlusterFS 3.8.4.
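
To check whether a given node already carries the rebased build, comparing the installed package against the Fixed In Version should be enough, e.g.:

rpm -q glusterfs-server        # expect glusterfs-server-3.8.4-1 or later
gluster --version | head -1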

Comment 13 krishnaram Karthick 2016-10-11 12:03:19 UTC
The error messages reported in this bug are no longer seen in build glusterfs-3.8.4-2.

The log messages are not seen with any tier operations. Moving the bug to verified.
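
For reference, the check here boils down to exercising the tier operations on the upgraded build and confirming that neither message ID shows up in any brick log, along the lines of:

# expect no output once the fix is in place
grep -E 'MSGID: 113(091|018)' /var/log/glusterfs/bricks/*.log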

Comment 15 errata-xmlrpc 2017-03-23 05:30:27 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html