Bug 1567001 - [Ganesha+EC] Bonnie failed with I/O error while crefi and parallel lookup were going on in parallel from 4 clients
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: disperse
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Xavi Hernandez
QA Contact: Manisha Saini
URL:
Whiteboard:
Depends On:
Blocks: 1503137
Reported: 2018-04-13 09:40 UTC by Manisha Saini
Modified: 2018-09-10 06:58 UTC (History)

Fixed In Version: glusterfs-3.12.2-13
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 06:46:01 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:2607 0 None None None 2018-09-04 06:47:43 UTC

Description Manisha Saini 2018-04-13 09:40:42 UTC
Description of problem:
While running Bonnie and crefi along with parallel lookups, Bonnie failed with an Input/output error.


Version-Release number of selected component (if applicable):

# rpm -qa | grep ganesha
glusterfs-ganesha-3.12.2-7.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-4.el7rhgs.x86_64
nfs-ganesha-2.5.5-4.el7rhgs.x86_64

How reproducible:
1/1


Steps to Reproduce:
1. Create an 8-node ganesha cluster.
2. Create a 5 x (4 + 2) Distributed-Disperse volume using 6 of the 8 nodes and export the volume via ganesha.
3. Mount the volume on 4 different clients using 4 different VIPs.
4. Run the following workload:

Client 1: Using crefi, create deep directories with the following operation pattern in sequence:

create,chmod,hardlink,chgrp,symlink,hardlink,truncate,hardlink 

Client 2: Run Bonnie and lookups: while true;do find;done
Client 3: Lookups: while true;do ls -laRt;done
Client 4: Lookups: while true;do du -sh;done
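For anyone trying to reproduce step 2, here is a rough sketch of how the brick list for a 5 x (4 + 2) volume over 6 nodes could be generated. The host names (node1..node6) and the brick paths are hypothetical and not taken from this report; only the volume name (Ganeshavol1, as seen in the logs) and the 5 x (4 + 2) shape come from the description. The function only prints the command so it can be reviewed first.

```shell
# Hypothetical host names and brick paths; only the volume name and the
# 5 x (4 + 2) shape come from the bug report.
print_volume_create() {
    volname=$1
    subvols=$2
    shift 2                      # remaining arguments: one host per brick in a set
    bricks=""
    i=1
    while [ "$i" -le "$subvols" ]; do
        # One brick per host in each pass, so every group of six consecutive
        # bricks (one disperse set) spans six different hosts.
        for host in "$@"; do
            bricks="$bricks $host:/bricks/brick$i/$volname"
        done
        i=$((i + 1))
    done
    # Printed, not executed: review the line before running it on a cluster.
    echo "gluster volume create $volname disperse-data 4 redundancy 2$bricks"
}

print_volume_create Ganeshavol1 5 node1 node2 node3 node4 node5 node6
```

With 6 hosts and 5 passes this lists 30 bricks, which gluster groups into 5 disperse sets of 6 bricks each (4 data + 2 redundancy).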

Actual results:

Bonnie failed on the client with an I/O error:

Writing a byte at a time...done
Writing intelligently...done
Rewriting...Can't write block.: Input/output error
Bonnie: drastic I/O error (re write(2)): Input/output error

real    66m34.755s
user    0m4.833s
sys     1m31.620s
bonnie failed
0
Total 0 tests were successful
Switching over to the previous working directory
Removing /mnt/ganesha//run5495/

Expected results:
Bonnie should not fail


Additional info:


Excerpt from ganesha-gfapi.log:

[2018-04-12 21:02:13.657683] E [MSGID: 122064] [ec-common.c:1156:ec_prepare_update_cbk] 0-Ganeshavol1-disperse-0: Unable to get version xattr [No such file or directory]
[2018-04-12 21:02:21.206056] E [MSGID: 122034] [ec-common.c:651:ec_child_select] 0-Ganeshavol1-disperse-0: Insufficient available children for this request (have 0, need 4)
[2018-04-12 21:02:21.206185] W [MSGID: 122040] [ec-common.c:1144:ec_prepare_update_cbk] 0-Ganeshavol1-disperse-0: Failed to get size and version [Input/output error]
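The "have 0, need 4" part of the ec_child_select message is consistent with a 4 + 2 disperse set, where at least 4 bricks must be usable for a request to proceed. A small sketch for pulling these numbers out of a log follows; the function name is made up for illustration, and the pattern is based only on the message format shown above.

```shell
# Hypothetical helper: reads log lines on stdin and, for each
# "Insufficient available children" message, prints
# "<subvolume> have=<n> need=<m>". Other lines are ignored.
ec_children_shortfall() {
    sed -n 's/.* 0-\([^ :]*\): Insufficient available children for this request (have \([0-9]*\), need \([0-9]*\)).*/\1 have=\2 need=\3/p'
}
```

Usage, assuming the log has been copied to the current directory: ec_children_shortfall < ganesha-gfapi.log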


Also seeing a large number of "Mismatching GFID's in loc" messages in ganesha-gfapi.log:

=================
cc07bb89304>/level10/level20/level30/level40/level50/level60/level70 (gfid = bf3ddc68-ad9c-45a6-b9e6-5ccccea3e45e) returned -1 [Invalid argument]
[2018-04-13 06:33:32.602353] I [MSGID: 109094] [dht-common.c:1561:dht_revalidate_cbk] 0-Ganeshavol1-dht: Revalidate: subvolume Ganeshavol1-disperse-2 for <gfid:90ecde24-1e83-4f1f-aef6-8cc07bb89304>/level10/level20/level30/level40/level50/level60/level70 (gfid = bf3ddc68-ad9c-45a6-b9e6-5ccccea3e45e) returned -1 [Invalid argument]
[2018-04-13 06:33:32.608159] W [MSGID: 122019] [ec-helpers.c:412:ec_loc_gfid_check] 0-Ganeshavol1-disperse-3: Mismatching GFID's in loc
[2018-04-13 06:33:32.608250] I [MSGID: 109094] [dht-common.c:1561:dht_revalidate_cbk] 0-Ganeshavol1-dht: Revalidate: subvolume Ganeshavol1-disperse-3 for <gfid:90ecde24-1e83-4f1f-aef6-8cc07bb89304>/level10/level20/level30/level40/level50/level60/level70 (gfid = bf3ddc68-ad9c-45a6-b9e6-5ccccea3e45e) returned -1 [Invalid argument]
[2018-04-13 06:33:32.608550] W [MSGID: 122019] [ec-helpers.c:412:ec_loc_gfid_check] 0-Ganeshavol1-disperse-4: Mismatching GFID's in loc
[2018-04-13 06:33:32.608604] I [MSGID: 109094] [dht-common.c:1561:dht_revalidate_cbk] 0-Ganeshavol1-dht: Revalidate: subvolume Ganeshavol1-disperse-4 for <gfid:90ecde24-1e83-4f1f-aef6-8cc07bb89304>/level10/level20/level30/level40/level50/level60/level70 (gfid = bf3ddc68-ad9c-45a6-b9e6-5ccccea3e45e) returned -1 [Invalid argument]
[2018-04-13 06:33:32.608627] E [MSGID: 101046] [dht-common.c:1857:dht_revalidate_cbk] 0-Ganeshavol1-dht: dict is null
The message "W [MSGID: 122019] [ec-helpers.c:412:ec_loc_gfid_check] 0-Ganeshavol1-disperse-2: Mismatching GFID's in loc" repeated 2 times between [2018-04-13 06:33:32.586679] and [2018-04-13 06:33:32.611995]
[2018-04-13 06:33:32.627866] W [MSGID: 122019] [ec-helpers.c:412:ec_loc_gfid_check] 0-Ganeshavol1-disperse-0: Mismatching GFID's in loc
[2018-04-13 06:33:32.627940] I [MSGID: 109094] [dht-common.c:1561:dht_revalidate_cbk] 0-Ganeshavol1-dht: Revalidate: subvolume Ganeshavol1-disperse-0 for <gfid:90ecde24-1e83-4f1f-aef6-8cc07bb89304>/level10/level20/level30/level40/level50/level60/level70 (gfid = bf3ddc68-ad9c-45a6-b9e6-5ccccea3e45e) returned -1 [Invalid argument]
[2018-04-13 06:33:32.628578] W [MSGID: 122019] [ec-helpers.c:412:ec_loc_gfid_check] 0-Ganeshavol1-disperse-1: Mismatching GFID's in loc
[2018-04-13 06:33:32.628640] I [MSGID: 109094] [dht-common.c:1561:dht_revalidate_cbk] 0-Ganeshavol1-dht: Revalidate: subvolume Ganeshavol1-disperse-1 for <gfid:90ecde24-1e83-4f1f-aef6-8cc07bb89304>/level10/level20/level30/level40/level50/level60/level70 (gfid = bf3ddc68-ad9c-45a6-b9e6-5ccccea3e45e) returned -1 [Invalid argument]
[2018-04-13 06:33:32.628919] W [MSGID: 122019] [ec-helpers.c:412:ec_loc_gfid_check] 0-Ganeshavol1-disperse-4: Mismatching GFID's in loc
[2018-04-13 06:33:32.628970] I [MSGID: 109094] [dht-common.c:1561:dht_revalidate_cbk] 0-Ganeshavol1-dht: Revalidate: subvolume Ganeshavol1-disperse-4 for <gfid:90ecde24-1e83-4f1f-aef6-8cc07bb89304>/level10/level20/level30/level40/level50/level60/level70 (gfid = bf3ddc68-ad9c-45a6-b9e6-5ccccea3e45e) returned -1 [Invalid argument]
[2018-04-13 06:33:32.630219] W [MSGID: 122019] [ec-helpers.c:412:ec_loc_gfid_check] 0-Ganeshavol1-disperse-2: Mismatching GFID's in loc
[2018-04-13 06:33:32.630274] I [MSGID: 109094] [dht-common.c:1561:dht_revalidate_cbk] 0-Ganeshavol1-dht: Revalidate: subvolume Ganeshavol1-disperse-2 for <gfid:90ecde24-1e83-4f1f-aef6-8cc07bb89304>/level10/level20/level30/level40/level50/level60/level70 (gfid = bf3ddc68-ad9c-45a6-b9e6-5ccccea3e45e) returned -1 [Invalid argument]
[2018-04-13 06:33:32.630395] W [MSGID: 122019] [ec-helpers.c:412:ec_loc_gfid_check] 0-Ganeshavol1-disperse-3: Mismatching GFID's in loc
[2018-04-13 06:33:32.630405] I [MSGID: 109094] [dht-common.c:1561:dht_revalidate_cbk] 0-Ganeshavol1-dht: Revalidate: subvolume Ganeshavol1-disperse-3 for <gfid:90ecde24-1e83-4f1f-aef6-8cc07bb89304>/level10/level20/level30/level40/level50/level60/level70 (gfid = bf3ddc68-ad9c-45a6-b9e6-5ccccea3e45e) returned -1 [Invalid argument]
[2018-04-13 06:33:32.630458] E [MSGID: 101046] [dht-common.c:1857:dht_revalidate_cbk] 0-Ganeshavol1-dht: dict is null
===================
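To get a feel for how the GFID mismatch warnings are spread across the disperse subvolumes, something like the following could be run against the log. This is only a sketch based on the message format in the excerpt above; the function name is made up. It counts timestamped lines only, so the 'The message "..." repeated N times' summary lines are skipped and suppressed repeats are not included in the totals.

```shell
# Hypothetical helper: prints "<subvolume> <count>" for each disperse
# subvolume that logged "Mismatching GFID's in loc" in the given file.
count_gfid_mismatches() {
    # Keep only lines that start with a timestamp, then extract the
    # subvolume name that precedes the message text.
    grep '^\[' "$1" \
        | grep "Mismatching GFID's in loc" \
        | sed -n 's/.* 0-\([^ ]*-disperse-[0-9]*\): Mismatching.*/\1/p' \
        | sort | uniq -c | awk '{print $2, $1}'
}
```

Usage, assuming the log has been copied to the current directory: count_gfid_mismatches ganesha-gfapi.log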


Detailed logs will be attached shortly.

Comment 4 Daniel Gryniewicz 2018-04-13 12:07:34 UTC
Are there any ganesha logs in these SOS reports?  I can't seem to find any.

Comment 24 Manisha Saini 2018-07-20 10:07:34 UTC
Verified this BZ with

# rpm -qa | grep ganesha
nfs-ganesha-debuginfo-2.5.5-8.el7rhgs.x86_64
nfs-ganesha-gluster-2.5.5-8.el7rhgs.x86_64
nfs-ganesha-2.5.5-8.el7rhgs.x86_64
glusterfs-ganesha-3.12.2-14.el7rhgs.x86_64

Created a 2 x (4 + 2) Distributed-Disperse volume.
Mounted the volume on 4 different clients using 4 different VIPs.
Ran the following workload:

Client 1: Using crefi, create deep directories with the following operation pattern in sequence:

create,chmod,hardlink,chgrp,symlink,hardlink,truncate,hardlink 

Client 2: Run Bonnie
Client 3: Lookups: while true;do ls -laRt;done
Client 4: Lookups: while true;do du -sh;done

Performed failover/failback while I/O was running. No I/O errors were observed. Bonnie completed successfully.

Moving this BZ to verified state.

Comment 26 errata-xmlrpc 2018-09-04 06:46:01 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

