Bug 1730654

Summary: [Ganesha] Linux untar errors out with "Remote I/O error" when finds and lookups run in parallel (v4.1)
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Manisha Saini <msaini>
Component: nfs-ganesha
Assignee: Frank Filz <ffilz>
Status: CLOSED ERRATA
QA Contact: Manisha Saini <msaini>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: rhgs-3.5
CC: amukherj, dang, ffilz, grajoria, jthottan, mbenjamin, moagrawa, pasik, rgowdapp, rhs-bugs, sheggodu, skoduri, storage-qa-internal, vdas
Target Milestone: ---
Keywords: Triaged
Target Release: RHGS 3.5.0
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-6.0-11
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2019-10-30 12:15:39 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1733520
Bug Blocks: 1696809

Description Manisha Saini 2019-07-17 09:34:18 UTC
Description of problem:

An 8*3 Distributed-Replicate volume was mounted on 5 different clients via NFS v4.1.

Linux untar errored out on one of the clients with "Cannot read: Remote I/O error".


The following messages were observed in ganesha.log:
===========
16/07/2019 13:57:33 : epoch 16710000 : f12-h04-000-1029u.rdu2.scalelab.redhat.com : ganesha.nfsd-252315[svc_388] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT
16/07/2019 13:57:33 : epoch 16710000 : f12-h04-000-1029u.rdu2.scalelab.redhat.com : ganesha.nfsd-252315[svc_293] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT
16/07/2019 13:57:33 : epoch 16710000 : f12-h04-000-1029u.rdu2.scalelab.redhat.com : ganesha.nfsd-252315[svc_299] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT
16/07/2019 13:57:33 : epoch 16710000 : f12-h04-000-1029u.rdu2.scalelab.redhat.com : ganesha.nfsd-252315[svc_384] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT
16/07/2019 13:57:33 : epoch 16710000 : f12-h04-000-1029u.rdu2.scalelab.redhat.com : ganesha.nfsd-252315[svc_389] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT
16/07/2019 13:57:33 : epoch 16710000 : f12-h04-000-1029u.rdu2.scalelab.redhat.com : ganesha.nfsd-252315[svc_383] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT
16/07/2019 13:57:33 : epoch 16710000 : f12-h04-000-1029u.rdu2.scalelab.redhat.com : ganesha.nfsd-252315[svc_284] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT
16/07/2019 13:57:33 : epoch 16710000 : f12-h04-000-1029u.rdu2.scalelab.redhat.com : ganesha.nfsd-252315[svc_379] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT
16/07/2019 13:57:33 : epoch 16710000 : f12-h04-000-1029u.rdu2.scalelab.redhat.com : ganesha.nfsd-252315[svc_386] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT
16/07/2019 13:57:33 : epoch 16710000 : f12-h04-000-1029u.rdu2.scalelab.redhat.com : ganesha.nfsd-252315[svc_302] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT
=======================
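For context, errno 107 on Linux is ENOTCONN ("Transport endpoint is not connected"), which matches the glusterfs_setattr2 messages in the fuller log under Additional info. The "(default)" in these messages means posix2fsal_error() fell through to its default case and returned ERR_FSAL_SERVERFAULT; the Linux NFSv4.1 client generally surfaces NFS4ERR_SERVERFAULT as EREMOTEIO, which is the "Remote I/O error" tar reports. A quick way to confirm the errno value on the affected hosts (assuming the kernel headers are installed; the header path can vary by distribution):

# grep ENOTCONN /usr/include/asm-generic/errno.h

This should show ENOTCONN defined as 107 with the "Transport endpoint is not connected" comment.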


Version-Release number of selected component (if applicable):
===================

# rpm -qa | grep ganesha
nfs-ganesha-2.7.3-5.el7rhgs.x86_64
nfs-ganesha-debuginfo-2.7.3-5.el7rhgs.x86_64
nfs-ganesha-gluster-2.7.3-5.el7rhgs.x86_64
glusterfs-ganesha-6.0-7.el7rhgs.x86_64



How reproducible:
==============
1/1

Steps to Reproduce:
==============
1. Create an 8-node ganesha cluster.
2. Create an 8*3 Distributed-Replicate volume.
3. Export the volume via ganesha.
4. Mount the volume on 5 clients via v4.1.
5. Run the following workload (see the sketch after this list):
Client 1: Linux untars for large dirs
Client 2: du -sh in loop
Client 3: ls -lRt in loop
Client 4: find . -mindepth 1 -type f -name _04_* in loop
Client 5: find . -mindepth 1 -type f in loop
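A minimal sketch of the mount and workload steps above; the export path, mount point (/mnt/mani1), and server address are assumptions, while dataset.swbuild_largedir.tar is the tarball named in the actual results below:

# Step 4, on each client (server address and mount point are placeholders)
mount -t nfs -o vers=4.1 <ganesha-vip>:/mani1 /mnt/mani1

# Step 5, one loop per client, all against the same mount
cd /mnt/mani1 && tar xf dataset.swbuild_largedir.tar                                          # Client 1
while true; do du -sh /mnt/mani1; done                                                        # Client 2
while true; do ls -lRt /mnt/mani1 > /dev/null; done                                           # Client 3
cd /mnt/mani1 && while true; do find . -mindepth 1 -type f -name '_04_*' > /dev/null; done    # Client 4
cd /mnt/mani1 && while true; do find . -mindepth 1 -type f > /dev/null; done                  # Client 5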


Actual results:
=============
Linux untar errored out on client 1:


-------
swbuild/CL0_SWBUILD/Dir8/bucket0/CL6_SWBUILD_Dir8_bucket11_pcn84.usnd
swbuild/CL0_SWBUILD/Dir8/bucket0/CL4_SWBUILD_Dir8_bucket13_eim54.oet
swbuild/CL0_SWBUILD/Dir8/bucket0/CL8_SWBUILD_Dir8_bucket1_coalnc52
tar: dataset.swbuild_largedir.tar: Cannot read: Remote I/O error
tar: dataset.swbuild_largedir.tar: Cannot read: Remote I/O error
tar: dataset.swbuild_largedir.tar: Cannot read: Remote I/O error
tar: dataset.swbuild_largedir.tar: Cannot read: Remote I/O error
tar: dataset.swbuild_largedir.tar: Cannot read: Remote I/O error
tar: dataset.swbuild_largedir.tar: Cannot read: Remote I/O error
tar: dataset.swbuild_largedir.tar: Cannot read: Remote I/O error
tar: dataset.swbuild_largedir.tar: Cannot read: Remote I/O error
tar: dataset.swbuild_largedir.tar: Cannot read: Remote I/O error
tar: dataset.swbuild_largedir.tar: Cannot read: Remote I/O error
tar: dataset.swbuild_largedir.tar: Cannot read: Remote I/O error
tar: dataset.swbuild_largedir.tar: Cannot read: Remote I/O error
tar: Too many errors, quitting
tar: swbuild/CL0_SWBUILD/Dir8/bucket0: Cannot utime: Remote I/O error
tar: swbuild/CL0_SWBUILD/Dir8/bucket0: Cannot change ownership to uid 0, gid 0: Remote I/O error
tar: swbuild/CL0_SWBUILD/Dir8/bucket0: Cannot change mode to rwxr-xr-x: Remote I/O error
tar: swbuild/CL0_SWBUILD: Cannot utime: Remote I/O error
tar: swbuild/CL0_SWBUILD: Cannot change ownership to uid 0, gid 0: Remote I/O error
tar: swbuild/CL0_SWBUILD: Cannot change mode to rwxr-xr-x: Remote I/O error
tar: Error is not recoverable: exiting now

[root@f12-h07-000-1029u ganesha]# ls
dataset.swbuild_largedir.tar  swbuild

---------

Checked on the servers; all bricks were up:

----------
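The brick status below looks like standard Gluster CLI output, presumably captured with something like:

# gluster volume status mani1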

Status of volume: mani1
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick f07-h33-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b1                    49154     0          Y       87253
Brick f07-h36-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b1                    49153     0          Y       67344
Brick f07-h35-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b1                    49154     0          Y       231695
Brick f07-h34-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b1                    49154     0          Y       222455
Brick f12-h05-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b1                    49153     0          Y       61215
Brick f12-h02-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b1                    49153     0          Y       17532
Brick f12-h03-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b1                    49152     0          Y       277914
Brick f12-h04-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b1                    49152     0          Y       231834
Brick f07-h33-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b2                    49155     0          Y       87273
Brick f07-h36-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b2                    49154     0          Y       67364
Brick f07-h35-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b2                    49155     0          Y       231715
Brick f07-h34-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b2                    49155     0          Y       222475
Brick f12-h05-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b2                    49154     0          Y       61249
Brick f12-h02-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b2                    49154     0          Y       17552
Brick f12-h03-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b2                    49153     0          Y       277934
Brick f12-h04-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b2                    49153     0          Y       231854
Brick f07-h33-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b3                    49156     0          Y       87295
Brick f07-h36-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b3                    49155     0          Y       67385
Brick f07-h35-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b3                    49156     0          Y       231736
Brick f07-h34-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b3                    49156     0          Y       222496
Brick f12-h05-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b3                    49155     0          Y       61270
Brick f12-h02-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b3                    49155     0          Y       17574
Brick f12-h03-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b3                    49154     0          Y       277957
Brick f12-h04-000-1029u.rdu2.scalelab.redha
t.com:/gluster/brick1/b3                    49154     0          Y       231877
Self-heal Daemon on localhost               N/A       N/A        Y       169548
Self-heal Daemon on f12-h05-000-1029u.rdu2.
scalelab.redhat.com                         N/A       N/A        Y       101644
Self-heal Daemon on f12-h03-000-1029u.rdu2.
scalelab.redhat.com                         N/A       N/A        Y       83409
Self-heal Daemon on f07-h34-000-1029u.rdu2.
scalelab.redhat.com                         N/A       N/A        Y       51214
Self-heal Daemon on f07-h35-000-1029u.rdu2.
scalelab.redhat.com                         N/A       N/A        Y       53725
Self-heal Daemon on f07-h33-000-1029u.rdu2.
scalelab.redhat.com                         N/A       N/A        Y       288660
Self-heal Daemon on f12-h02-000-1029u.rdu2.
scalelab.redhat.com                         N/A       N/A        Y       101600
Self-heal Daemon on f07-h36-000-1029u.rdu2.
scalelab.redhat.com                         N/A       N/A        N       N/A  
 
Task Status of Volume mani1
------------------------------------------------------------------------------
There are no active volume tasks
--------------------


Expected results:
==================
Linux untar should not error out.


Additional info:
=============

ganesha.log

--------
16/07/2019 10:53:27 : epoch 16710000 : f12-h04-000-1029u.rdu2.scalelab.redhat.com : ganesha.nfsd-252315[reaper] nfs_lift_grace_locked :STATE :EVENT :NFS Server Now NOT IN GRACE
16/07/2019 12:47:09 : epoch 16710000 : f12-h04-000-1029u.rdu2.scalelab.redhat.com : ganesha.nfsd-252315[svc_44] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT
16/07/2019 12:47:09 : epoch 16710000 : f12-h04-000-1029u.rdu2.scalelab.redhat.com : ganesha.nfsd-252315[svc_44] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Transport endpoint is not connected
16/07/2019 12:47:09 : epoch 16710000 : f12-h04-000-1029u.rdu2.scalelab.redhat.com : ganesha.nfsd-252315[svc_94] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT
16/07/2019 12:47:09 : epoch 16710000 : f12-h04-000-1029u.rdu2.scalelab.redhat.com : ganesha.nfsd-252315[svc_55] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT
16/07/2019 12:47:09 : epoch 16710000 : f12-h04-000-1029u.rdu2.scalelab.redhat.com : ganesha.nfsd-252315[svc_175] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT
16/07/2019 12:47:09 : epoch 16710000 : f12-h04-000-1029u.rdu2.scalelab.redhat.com : ganesha.nfsd-252315[svc_59] posix2fsal_error :FSAL :CRIT :Mapping 107(default) to ERR_FSAL_SERVERFAULT
16/07/2019 12:47:09 : epoch 16710000 : f12-h04-000-1029u.rdu2.scalelab.redhat.com : ganesha.nfsd-252315[svc_59] glusterfs_setattr2 :FSAL :CRIT :setattrs failed with error Transport endpoint is not connected

Comment 14 Manisha Saini 2019-07-29 12:51:19 UTC
Ran the test mentioned in comment 0 of this BZ with the nfs-ganesha and kernel test builds provided in comment 17 of BZ 1730686.

#  rpm -qa | grep ganesha
nfs-ganesha-gluster-2.7.3-6.el7rhgs.TESTFIX1.x86_64
nfs-ganesha-2.7.3-6.el7rhgs.TESTFIX1.x86_64
glusterfs-ganesha-6.0-9.el7rhgs.TESTFIX.bz1730654.x86_64
nfs-ganesha-debuginfo-2.7.3-6.el7rhgs.TESTFIX1.x86_64

#  rpm -qa | grep kernel
kernel-3.10.0-1062.el7.bz1732427.x86_64
kernel-3.10.0-1058.el7.x86_64
kernel-3.10.0-1061.el7.x86_64
abrt-addon-kerneloops-2.1.11-55.el7.x86_64
kernel-tools-3.10.0-1062.el7.bz1732427.x86_64
kernel-tools-libs-3.10.0-1062.el7.bz1732427.x86_64

With this, the Linux untar completed successfully. No I/O errors were observed.

Comment 19 Manisha Saini 2019-08-21 11:25:23 UTC
Verified this BZ with

# rpm -qa | grep ganesha
nfs-ganesha-2.7.3-7.el7rhgs.x86_64
glusterfs-ganesha-6.0-11.el7rhgs.x86_64
nfs-ganesha-gluster-2.7.3-7.el7rhgs.x86_64


Steps to Reproduce:
==============
1. Create an 8-node ganesha cluster.
2. Create an 8*3 Distributed-Replicate volume.
3. Export the volume via ganesha.
4. Mount the volume on 5 clients via v4.1.
5. Run the following workload:
Client 1: Linux untars for large dirs
Client 2: du -sh in loop
Client 3: ls -lRt in loop
Client 4: find . -mindepth 1 -type f -name _04_* in loop
Client 5: find . -mindepth 1 -type f in loop

Linux untar completed successfully. Moving this BZ to the verified state.

Comment 21 errata-xmlrpc 2019-10-30 12:15:39 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHEA-2019:3252