Bug 1278399

Summary: I/O failure on attaching tier on nfs client
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Nag Pavan Chilakam <nchilaka>
Component: tier
Assignee: Mohammed Rafi KC <rkavunga>
Status: CLOSED ERRATA
QA Contact: Bhaskarakiran <byarlaga>
Severity: high
Docs Contact:
Priority: unspecified
Version: rhgs-3.1
CC: asrivast, byarlaga, dlambrig, josferna, jthottan, mzywusko, nbalacha, nchilaka, rcyriac, rhs-bugs, rkavunga, sankarshan, sbhaloth, skoduri, storage-qa-internal
Target Milestone: ---
Keywords: ZStream
Target Release: RHGS 3.1.2
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.7.5-17
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: 1272949
: 1279095
Environment:
Last Closed: 2016-03-01 05:52:37 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1272949
Bug Blocks: 1049181, 1114033, 1139193, 1146338, 1260783, 1260923, 1276742, 1279095, 1279830, 1286064

Description Nag Pavan Chilakam 2015-11-05 11:56:44 UTC
+++ This bug was initially created as a clone of Bug #1272949 +++

Description of problem:

Ongoing I/O fails when a tier is attached.

Version-Release number of selected component (if applicable):


How reproducible:

100%

Steps to Reproduce:
1. Create a distributed-replicate volume.
2. Mount the volume and start a Linux kernel untar on the mount point.
3. Run attach-tier while the untar is in progress.
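
The steps above can be sketched as a command sequence. This is only an illustration: the volume name, host names, brick paths, mount point, tarball file name, and the replica count of 2 are all assumptions that do not appear in this report, and the exact `gluster` CLI syntax may differ between releases. The function only builds the commands; it does not run them.

```python
import shlex


def repro_commands(volume, bricks, hot_bricks, server, mount_point, tarball):
    """Return the commands for the three reproduction steps, in order.

    All arguments are hypothetical placeholders supplied by the caller.
    """
    return [
        # Step 1: create and start a distributed-replicate volume
        # (replica 2 is an assumption, not taken from the report).
        ["gluster", "volume", "create", volume, "replica", "2", *bricks],
        ["gluster", "volume", "start", volume],
        # Step 2: mount over NFS on the client and start the untar.
        ["mount", "-t", "nfs", "-o", "vers=3", f"{server}:/{volume}", mount_point],
        ["tar", "-xf", tarball, "-C", mount_point],
        # Step 3: attach the hot tier while the untar is still running.
        ["gluster", "volume", "attach-tier", volume, "replica", "2", *hot_bricks],
    ]


if __name__ == "__main__":
    commands = repro_commands(
        volume="testvol",
        bricks=["server1:/bricks/b1", "server2:/bricks/b1",
                "server1:/bricks/b2", "server2:/bricks/b2"],
        hot_bricks=["server1:/bricks/hot1", "server2:/bricks/hot1"],
        server="server1",
        mount_point="/mnt/testvol",
        tarball="linux-4.1.1.tar.xz",
    )
    for command in commands:
        print(shlex.join(command))
```

When running this by hand, the `tar` command has to be started in the background (or in a second terminal) so that the attach-tier in step 3 overlaps with the I/O.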

Actual results:

I/O failure

Expected results:

I/O should not fail

Additional info:

--- Additional comment from Vijay Bellur on 2015-10-19 05:46:22 EDT ---

REVIEW: http://review.gluster.org/12376 (dht: heal directory path if the directory is nor present) posted (#2) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-10-19 05:47:10 EDT ---

REVIEW: http://review.gluster.org/12375 (Revert "fuse: resolve complete path after a graph switch") posted (#3) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-10-20 00:51:58 EDT ---

REVIEW: http://review.gluster.org/12375 (Revert "fuse: resolve complete path after a graph switch") posted (#4) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-10-20 00:52:01 EDT ---

REVIEW: http://review.gluster.org/12376 (dht: heal directory path if the directory is nor present) posted (#3) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-10-22 14:54:39 EDT ---

REVIEW: http://review.gluster.org/12375 (Revert "fuse: resolve complete path after a graph switch") posted (#5) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-10-22 14:54:42 EDT ---

REVIEW: http://review.gluster.org/12376 (dht: heal directory path if the directory is nor present) posted (#4) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-10-22 14:54:46 EDT ---

REVIEW: http://review.gluster.org/12414 (dht:heal layout after a nameless lookup) posted (#1) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-10-28 15:52:44 EDT ---

REVIEW: http://review.gluster.org/12375 (Revert "fuse: resolve complete path after a graph switch") posted (#6) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-10-28 15:52:47 EDT ---

REVIEW: http://review.gluster.org/12376 (dht: heal directory path if the directory is not present) posted (#5) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-10-28 15:52:50 EDT ---

REVIEW: http://review.gluster.org/12449 (dht: update cached subvolume during readdirp cbk) posted (#1) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-10-30 07:43:56 EDT ---

REVIEW: http://review.gluster.org/12375 (Revert "fuse: resolve complete path after a graph switch") posted (#7) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-10-30 07:44:00 EDT ---

REVIEW: http://review.gluster.org/12449 (dht: update cached subvolume during readdirp cbk) posted (#2) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-10-30 07:44:09 EDT ---

REVIEW: http://review.gluster.org/12376 (dht: heal directory path if the directory is not present) posted (#6) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-10-30 08:58:17 EDT ---

REVIEW: http://review.gluster.org/12375 (Revert "fuse: resolve complete path after a graph switch") posted (#8) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-10-30 08:58:20 EDT ---

REVIEW: http://review.gluster.org/12449 (dht: update cached subvolume during readdirp cbk) posted (#3) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-10-30 08:58:23 EDT ---

REVIEW: http://review.gluster.org/12376 (dht: heal directory path if the directory is not present) posted (#7) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-11-02 01:28:43 EST ---

REVIEW: http://review.gluster.org/12375 (Revert "fuse: resolve complete path after a graph switch") posted (#9) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-11-02 01:28:46 EST ---

REVIEW: http://review.gluster.org/12449 (dht: update cached subvolume during readdirp cbk) posted (#4) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-11-02 01:28:49 EST ---

REVIEW: http://review.gluster.org/12376 (dht: heal directory path if the directory is not present) posted (#8) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-11-02 09:14:10 EST ---

REVIEW: http://review.gluster.org/12375 (Revert "fuse: resolve complete path after a graph switch") posted (#10) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-11-02 09:14:13 EST ---

REVIEW: http://review.gluster.org/12449 (dht: update cached subvolume during readdirp cbk) posted (#5) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-11-02 09:14:16 EST ---

REVIEW: http://review.gluster.org/12376 (dht: heal directory path if the directory is not present) posted (#9) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-11-03 01:24:02 EST ---

REVIEW: http://review.gluster.org/12375 (Revert "fuse: resolve complete path after a graph switch") posted (#11) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-11-03 01:24:05 EST ---

REVIEW: http://review.gluster.org/12449 (dht: update cached subvolume during readdirp cbk) posted (#6) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-11-03 01:24:07 EST ---

REVIEW: http://review.gluster.org/12376 (dht: heal directory path if the directory is not present) posted (#10) for review on master by mohammed rafi  kc (rkavunga)

--- Additional comment from Vijay Bellur on 2015-11-03 14:01:40 EST ---

REVIEW: http://review.gluster.org/12375 (Revert "fuse: resolve complete path after a graph switch") posted (#12) for review on master by mohammed rafi  kc (rkavunga)

Comment 3 Mohammed Rafi KC 2015-11-10 10:01:55 UTC
I/O failed after attaching the tier because fix-layout had not completed for some directories. As a result, the directory structure on the hot tier was incomplete, and any attempt to access those directories failed.

Fix:

After a nameless lookup, if we get an incomplete layout, we trigger a heal once the full path has been obtained from the server.
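
To make "incomplete layout" concrete, here is a simplified model, not the actual DHT code. It assumes that a directory layout is a set of inclusive 32-bit hash ranges, one per subvolume, and that the layout is complete only when those ranges cover the whole hash space. The function names and the `HASH_MAX` constant are made up for this sketch.

```python
HASH_MAX = 0xFFFFFFFF  # assumed upper bound of the 32-bit hash space


def layout_gaps(ranges):
    """Return the inclusive (start, stop) hash ranges that no subvolume covers.

    ranges: iterable of inclusive (start, stop) pairs, one per subvolume.
    """
    gaps = []
    next_expected = 0
    for start, stop in sorted(ranges):
        if start > next_expected:
            # Nothing covers next_expected .. start - 1.
            gaps.append((next_expected, start - 1))
        next_expected = max(next_expected, stop + 1)
    if next_expected <= HASH_MAX:
        # The tail of the hash space is uncovered.
        gaps.append((next_expected, HASH_MAX))
    return gaps


def needs_heal(ranges):
    """A layout with any uncovered hash range is treated as incomplete."""
    return bool(layout_gaps(ranges))


if __name__ == "__main__":
    complete = [(0x00000000, 0x7FFFFFFF), (0x80000000, 0xFFFFFFFF)]
    missing_half = [(0x00000000, 0x7FFFFFFF)]
    print(needs_heal(complete))
    print(needs_heal(missing_half))
```

In this model, a directory that does not yet exist on the newly attached subvolume shows up as a gap, which is the condition under which the fix described above would trigger a heal.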

Comment 4 Bhaskarakiran 2015-11-17 12:08:13 UTC
Checked on the 3.7.5-6 build; there are still errors on the NFS mount when attach-tier runs during I/O. Once attach-tier completes, I/O resumes normally.

linux-4.1.1/arch/arm/boot/dts/animeo_ip.dts
tar: linux-4.1.1/arch/arm/boot/dts/animeo_ip.dts: Cannot open: Invalid argument
linux-4.1.1/arch/arm/boot/dts/arm-realview-pb1176.dts
tar: linux-4.1.1/arch/arm/boot/dts/arm-realview-pb1176.dts: Cannot open: Invalid argument
linux-4.1.1/arch/arm/boot/dts/armada-370-db.dts
tar: linux-4.1.1/arch/arm/boot/dts/armada-370-db.dts: Cannot open: Invalid argument
linux-4.1.1/arch/arm/boot/dts/armada-370-mirabox.dts
tar: linux-4.1.1/arch/arm/boot/dts/armada-370-mirabox.dts: Cannot open: Invalid argument
linux-4.1.1/arch/arm/boot/dts/armada-370-netgear-rn102.dts
tar: linux-4.1.1/arch/arm/boot/dts/armada-370-netgear-rn102.dts: Cannot open: Invalid argument
linux-4.1.1/arch/arm/boot/dts/armada-370-netgear-rn104.dts
tar: linux-4.1.1/arch/arm/boot/dts/armada-370-netgear-rn104.dts: Cannot open: Invalid argument
linux-4.1.1/arch/arm/boot/dts/armada-370-rd.dts
tar: linux-4.1.1/arch/arm/boot/dts/armada-370-rd.dts: Cannot open: Invalid argument
linux-4.1.1/arch/arm/boot/dts/armada-370-synology-ds213j.dts
tar: linux-4.1.1/arch/arm/boot/dts/armada-370-synology-ds213j.dts: Cannot open: Invalid argument
linux-4.1.1/arch/arm/boot/dts/armada-370-xp.dtsi
tar: linux-4.1.1/arch/arm/boot/dts/armada-370-xp.dtsi: Cannot open: Invalid argument
linux-4.1.1/arch/arm/boot/dts/armada-370.dtsi
tar: linux-4.1.1/arch/arm/boot/dts/armada-370.dtsi: Cannot open: Invalid argument
linux-4.1.1/arch/arm/boot/dts/armada-375-db.dts
tar: linux-4.1.1/arch/arm/boot/dts/armada-375-db.dts: Cannot open: Invalid argument
linux-4.1.1/arch/arm/boot/dts/armada-375.dtsi
tar: linux-4.1.1/arch/arm/boot/dts/armada-375.dtsi: Cannot open: Invalid argument
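
A small helper such as the following could be used to quantify failures like these from a captured tar log. It is a sketch that assumes the error lines always have the `tar: <path>: Cannot open: <reason>` shape shown above; the function name and the regular expression are made up for this example.

```python
import re
from collections import Counter

# Matches lines such as:
#   tar: linux-4.1.1/arch/arm/boot/dts/animeo_ip.dts: Cannot open: Invalid argument
TAR_ERROR = re.compile(r"^tar: (?P<path>.+): Cannot open: (?P<reason>.+)$")


def summarize_tar_errors(lines):
    """Return (failed_paths, reason_counts) for tar 'Cannot open' errors."""
    failed_paths = []
    reason_counts = Counter()
    for line in lines:
        match = TAR_ERROR.match(line.strip())
        if match:
            failed_paths.append(match.group("path"))
            reason_counts[match.group("reason")] += 1
    return failed_paths, reason_counts


if __name__ == "__main__":
    sample = [
        "linux-4.1.1/arch/arm/boot/dts/animeo_ip.dts",
        "tar: linux-4.1.1/arch/arm/boot/dts/animeo_ip.dts: Cannot open: Invalid argument",
        "linux-4.1.1/arch/arm/boot/dts/armada-370.dtsi",
        "tar: linux-4.1.1/arch/arm/boot/dts/armada-370.dtsi: Cannot open: Invalid argument",
    ]
    paths, reasons = summarize_tar_errors(sample)
    print(len(paths), dict(reasons))
```

"Invalid argument" is the message for EINVAL, so each matching line is a failed open() reported to tar on the client, not just a server-side log entry.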

Comment 5 Vivek Agarwal 2015-11-17 12:11:49 UTC
Are these log entries only or are these actual I/O errors?

Comment 6 Mohammed Rafi KC 2015-11-17 14:35:54 UTC
Tried to reproduce this on the same machine and configuration, but could not get I/O to fail. I will keep trying to reproduce it.

Comment 7 Mohammed Rafi KC 2015-11-19 07:30:29 UTC
I reproduced this error on a tiered volume as well as on a non-tiered volume. It happens inconsistently under heavy parallel I/O. Sometimes the mount also hangs indefinitely because the server stops responding after an attach-tier or add-brick. If the mount is hung, restarting the NFS server resumes the I/O gracefully.

This is not a tier-specific problem, as it can be reproduced on a non-tiered volume in the same way as on a tiered volume.

Comment 8 Jiffin 2015-11-20 09:16:08 UTC
Adding a few points to Rafi's comment.

The issue is reproduced on a non-tiered distributed-replicated volume while an add-brick is in progress. Most of the time the I/O hangs (roughly 90% reproducible). I also got "invalidate error on the mount" twice (once on a tiered volume and once on a non-tiered volume).

After discussing with Niels, he suggested running the same test with the number of epoll threads reduced to one. With epoll threads = 1, only one out of four runs hung on the non-tiered distributed-replicated volume, so the issue still exists.

As mentioned in Rafi's comment, restarting the NFS server resumes the I/O gracefully.

I suspect the mount hang on the client may be related to an NFS issue, but the "invalidate error" may be related to some other component.

One more thing: the hang reproduces consistently on my setup because of its low hardware configuration.

Comment 9 Soumya Koduri 2015-11-24 09:16:14 UTC
As mentioned in https://bugzilla.redhat.com/show_bug.cgi?id=1282771#c2, this operation under a large workload seems to work if throttling is disabled. Requesting QE to disable it and confirm whether the issue still persists.

Comment 10 Susant Kumar Palai 2015-11-27 10:32:50 UTC
*** Bug 1049181 has been marked as a duplicate of this bug. ***

Comment 11 Bhaskarakiran 2015-11-30 09:35:13 UTC
I can still see the invalid argument errors even after setting throttling to 0. The packet trace and setup details have been provided to DEV. This also holds for the other bug, https://bugzilla.redhat.com/show_bug.cgi?id=1282771 (Detach-tier + NFS).

Comment 14 Joseph Elwin Fernandes 2016-01-19 12:24:36 UTC
This will probably be fixed by bug 1296048; it is worth revisiting once that bug is fixed.

Comment 19 Bhaskarakiran 2016-01-28 15:53:55 UTC
Verification of bug 1296048 and of this bug is the same. Moving the state.

Comment 21 errata-xmlrpc 2016-03-01 05:52:37 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHBA-2016-0193.html