Bug 1331493

Summary: [Tiering]: IO failure with 'Remote I/O error' on NFS-Ganesha -V4 mount on a tiered volume
Product: Red Hat Gluster Storage Reporter: krishnaram Karthick <kramdoss>
Component: tierAssignee: Jiffin <jthottan>
Status: CLOSED WONTFIX QA Contact: krishnaram Karthick <kramdoss>
Severity: urgent Docs Contact:
Priority: unspecified    
Version: rhgs-3.1CC: kkeithle, mzywusko, nbalacha, ndevos, rhs-bugs, rkavunga, skoduri
Target Milestone: ---Keywords: ZStream
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2018-02-06 17:52:25 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:

Description krishnaram Karthick 2016-04-28 16:25:26 UTC
Description of problem:
On a tiered volume configured over 4 node gluster cluster, all IO operations halted and errored with 'Remote I/O error' from ganesha-v4 mount. Any lookup/read/write operation on the mountpoint fails with 'Remote I/O error'. 

gluster v info ganesha-tier
 
Volume Name: ganesha-tier
Type: Tier
Volume ID: 5dc054c0-b15c-49dc-9494-38bd04d05819
Status: Started
Number of Bricks: 8
Transport-type: tcp
Hot Tier :
Hot Tier Type : Replicate
Number of Bricks: 1 x 2 = 2
Brick1: 10.70.47.156:/bricks/brick1/l1
Brick2: 10.70.47.156:/bricks/brick0/l1
Cold Tier:
Cold Tier Type : Disperse
Number of Bricks: 1 x (4 + 2) = 6
Brick3: 10.70.47.192:/bricks/brick0/l1
Brick4: 10.70.47.178:/bricks/brick0/l1
Brick5: 10.70.47.160:/bricks/brick0/l1
Brick6: 10.70.47.192:/bricks/brick1/l1
Brick7: 10.70.47.178:/bricks/brick1/l1
Brick8: 10.70.47.160:/bricks/brick1/l1
Options Reconfigured:
cluster.watermark-hi: 10
cluster.watermark-low: 5
cluster.tier-mode: cache
features.ctr-enabled: on
performance.readdir-ahead: on
nfs.disable: on
features.cache-invalidation: on
ganesha.enable: on
features.quota: off
features.inode-quota: off
cluster.enable-shared-storage: enable
nfs-ganesha: enable


log snippet from brick logs of hot tier
=========================================

    [2016-04-28 20:30:29.005256] E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ganesha-tier-posix: null gfid for path (null)
    [2016-04-28 20:30:29.005306] E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ganesha-tier-posix: lstat on null failed [Invalid argument]
    The message "E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ganesha-tier-posix: null gfid for path (null)" repeated 3 times between [2016-04-28 20:30:29.005256] and [2016-04-28 20:30:29.008873]
    The message "E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ganesha-tier-posix: lstat on null failed [Invalid argument]" repeated 3 times between [2016-04-28 20:30:29.005306] and [2016-04-28 20:30:29.008876]
    [2016-04-28 20:40:30.002214] I [dict.c:473:dict_get] (-->/lib64/libglusterfs.so.0(default_getxattr_cbk+0xac) [0x7f136ea53d1c] -->/usr/lib64/glusterfs/3.7.9/xlator/features/marker.so(marker_getxattr_cbk+0xa7) [0x7f135af54917] -->/lib64/libglusterfs.so.0(dict_get+0xac) [0x7f136ea4415c] ) 0-dict: !this || key=() [Invalid argument]
    [2016-04-28 20:40:30.005899] E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ganesha-tier-posix: null gfid for path (null)
    [2016-04-28 20:40:30.005956] E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ganesha-tier-posix: lstat on null failed [Invalid argument]
    The message "E [MSGID: 113091] [posix.c:178:posix_lookup] 0-ganesha-tier-posix: null gfid for path (null)" repeated 3 times between [2016-04-28 20:40:30.005899] and [2016-04-28 20:40:30.009422]
    The message "E [MSGID: 113018] [posix.c:196:posix_lookup] 0-ganesha-tier-posix: lstat on null failed [Invalid argument]" repeated 3 times between [2016-04-28 20:40:30.005956] and [2016-04-28 20:40:30.009428]

log snippet from bricks of cold tier
======================================

l_get+0x5e)[0x7fd89d85790e] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(posix_acl_lookup_cbk+0x1e5)[0x7fd89d85b515] ))))) 0-ganesha-tier-access-control: inode ctx is NULL for 00000000-0000-0000-0000-000000000000
[2016-04-28 21:43:00.174049] E [posix-acl.c:357:posix_acl_ctx_get] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fd8aca36b42] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(posix_acl_ctx_get+0x88)[0x7fd89d8573c8] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(__posix_acl_get+0x11)[0x7fd89d857541] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(posix_acl_get+0x5e)[0x7fd89d85790e] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(posix_acl_lookup_cbk+0x1e5)[0x7fd89d85b515] ))))) 0-ganesha-tier-access-control: inode ctx is NULL for 00000000-0000-0000-0000-000000000000
[2016-04-28 21:43:00.761623] E [posix-acl.c:357:posix_acl_ctx_get] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fd8aca36b42] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(posix_acl_ctx_get+0x88)[0x7fd89d8573c8] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(__posix_acl_get+0x11)[0x7fd89d857541] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(posix_acl_get+0x5e)[0x7fd89d85790e] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(posix_acl_lookup_cbk+0x1e5)[0x7fd89d85b515] ))))) 0-ganesha-tier-access-control: inode ctx is NULL for 00000000-0000-0000-0000-000000000000
[2016-04-28 21:43:01.354477] E [posix-acl.c:357:posix_acl_ctx_get] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fd8aca36b42] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(posix_acl_ctx_get+0x88)[0x7fd89d8573c8] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(__posix_acl_get+0x11)[0x7fd89d857541] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(posix_acl_get+0x5e)[0x7fd89d85790e] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(posix_acl_lookup_cbk+0x1e5)[0x7fd89d85b515] ))))) 0-ganesha-tier-access-control: inode ctx is NULL for 00000000-0000-0000-0000-000000000000
[2016-04-28 21:43:01.361863] E [posix-acl.c:357:posix_acl_ctx_get] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fd8aca36b42] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(posix_acl_ctx_get+0x88)[0x7fd89d8573c8] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(__posix_acl_get+0x11)[0x7fd89d857541] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(posix_acl_get+0x5e)[0x7fd89d85790e] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(posix_acl_lookup_cbk+0x1e5)[0x7fd89d85b515] ))))) 0-ganesha-tier-access-control: inode ctx is NULL for 00000000-0000-0000-0000-000000000000
[2016-04-28 21:43:03.989918] E [posix-acl.c:357:posix_acl_ctx_get] (--> /lib64/libglusterfs.so.0(_gf_log_callingfn+0x192)[0x7fd8aca36b42] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(posix_acl_ctx_get+0x88)[0x7fd89d8573c8] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(__posix_acl_get+0x11)[0x7fd89d857541] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(posix_acl_get+0x5e)[0x7fd89d85790e] (--> /usr/lib64/glusterfs/3.7.9/xlator/features/access-control.so(posix_acl_lookup_cbk+0x1e5)[0x7fd89d85b515] ))))) 0-ganesha-tier-access-control: inode ctx is NULL for 00000000-0000-0000-0000-000000000000


log snippet from /var/log/ganesha.log
=======================================
    28/04/2016 20:39:51 : epoch 572216da : dhcp47-192.lab.eng.blr.redhat.com : ganesha.nfsd-15184[work-4] cache_inode_readdir :NFS READDIR :CRIT :cache_inode_lock_trust_attrs returned CACHE_INODE_SERVERFAULT for file-1 - bailing out
    28/04/2016 20:39:54 : epoch 572216da : dhcp47-192.lab.eng.blr.redhat.com : ganesha.nfsd-15184[work-5] cache_inode_readdir :NFS READDIR :CRIT :cache_inode_lock_trust_attrs returned CACHE_INODE_SERVERFAULT for file-1 - bailing out
    28/04/2016 20:41:43 : epoch 572216da : dhcp47-192.lab.eng.blr.redhat.com : ganesha.nfsd-15184[work-13] cache_inode_readdir :NFS READDIR :CRIT :cache_inode_lock_trust_attrs returned CACHE_INODE_SERVERFAULT for file-1 - bailing out
    28/04/2016 20:41:49 : epoch 572216da : dhcp47-192.lab.eng.blr.redhat.com : ganesha.nfsd-15184[work-8] cache_inode_readdir :NFS READDIR :CRIT :cache_inode_lock_trust_attrs returned CACHE_INODE_SERVERFAULT for file-1 - bailing out
    28/04/2016 20:42:00 : epoch 572216da : dhcp47-192.lab.eng.blr.redhat.com : ganesha.nfsd-15184[work-7] cache_inode_readdir :NFS READDIR :CRIT :cache_inode_lock_trust_attrs returned CACHE_INODE_SERVERFAULT for file-1 - bailing out

Version-Release number of selected component (if applicable):
glusterfs-server-3.7.9-2.el7rhgs.x86_64

How reproducible:
Yet to determine

Steps to Reproduce:
1. create a 4+2 dispersed volume from 3 nodes [2 bricks from each node]
2. Mount the volume using NFS-ganesha v4 protocol
3. create 50k files, 1k directories & perform few rename operations
4. untar linux kernel from another v4 mount
5. Attach a 1x2 hot tier
6. write 10 x 1Gb files from the mount point
7. Perform a lookup on all the dirs and sub-dirs on this volume
8. Allow fix layout to complete on the hot tier

Actual results:
File writes initiated in step 6, fails with 'Remote IO error'.

Expected results:
No IO error should be seen

Additional info:
sosreports and statedumps shall be attached

Comment 2 krishnaram Karthick 2016-04-29 03:43:00 UTC
sosreports are available here --> http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/1331493/

Comment 3 krishnaram Karthick 2016-04-29 05:47:54 UTC
There is an existing bug for 'inode ctx is NULL' error message -->  https://bugzilla.redhat.com/show_bug.cgi?id=1324338

This error msgs can be ignored.

Comment 8 Shyamsundar 2018-02-06 17:52:25 UTC
Thank you for your bug report.

We are not further root causing this bug, as a result this bug is being closed as WONTFIX. Please reopen if the problem continues to be observed after upgrading
to a latest version.