Bug 1392906 - Input/Output Error seen while running iozone test on nfs-ganesha+mdcache enabled volume.
Summary: Input/Output Error seen while running iozone test on nfs-ganesha+mdcache enabled volume.
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: rpc
Version: rhgs-3.2
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Raghavendra Talur
QA Contact: Arthy Loganathan
URL:
Whiteboard:
Depends On: 1385605
Blocks: 1351528
 
Reported: 2016-11-08 12:58 UTC by Arthy Loganathan
Modified: 2017-03-23 06:17 UTC
CC List: 10 users

Fixed In Version: glusterfs-3.8.4-7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2017-03-23 06:17:29 UTC
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 0 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC

Description Arthy Loganathan 2016-11-08 12:58:07 UTC
Description of problem:
Input/Output Error seen while running iozone test on nfs-ganesha+mdcache enabled volume.

Version-Release number of selected component (if applicable):
nfs-ganesha-gluster-2.4.1-1.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-3.el7rhgs.x86_64


How reproducible:
Always

Steps to Reproduce:
1. Create ganesha cluster and create a volume.
2. Enable nfs-ganesha and mdcache on the volume (see the command sketch after these steps)
3. Mount the volume
4. Run iozone.
    /opt/qa/tools/system_light/run.sh -w /mnt/nfs_v4 -t iozone -l /var/tmp/iozone.log
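For reference, steps 1-4 roughly correspond to the commands below. This is a sketch only: the volume name vol1 and the option values are taken from the "Options Reconfigured" output further down, while <ganesha-vip> and the ganesha HA configuration are placeholders/assumptions not shown here.

# On a cluster node: enable shared storage and nfs-ganesha first
# (assumes the ganesha HA configuration is already in place).
gluster volume set all cluster.enable-shared-storage enable
gluster nfs-ganesha enable
gluster volume set vol1 ganesha.enable on

# md-cache / upcall options, matching "Options Reconfigured" below.
gluster volume set vol1 features.cache-invalidation on
gluster volume set vol1 features.cache-invalidation-timeout 600
gluster volume set vol1 performance.stat-prefetch on
gluster volume set vol1 performance.cache-invalidation on
gluster volume set vol1 performance.md-cache-timeout 600

# On the client: mount over NFSv4 and run the iozone wrapper.
mount -t nfs -o vers=4 <ganesha-vip>:/vol1 /mnt/nfs_v4
/opt/qa/tools/system_light/run.sh -w /mnt/nfs_v4 -t iozone -l /var/tmp/iozone.log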

Actual results:
iozone test fails with IO error

Expected results:
iozone test should succeed without any error

Additional info:
[root@dhcp46-42 ~]# gluster vol info vol1
 
Volume Name: vol1
Type: Distributed-Replicate
Volume ID: ae86ded9-bfb7-4100-8ecf-d63c51376cc4
Status: Started
Snapshot Count: 0
Number of Bricks: 6 x 2 = 12
Transport-type: tcp
Bricks:
Brick1: dhcp46-42.lab.eng.blr.redhat.com:/bricks/brick0/br0
Brick2: dhcp46-101.lab.eng.blr.redhat.com:/bricks/brick0/br0
Brick3: dhcp47-155.lab.eng.blr.redhat.com:/bricks/brick0/br0
Brick4: dhcp47-167.lab.eng.blr.redhat.com:/bricks/brick0/br0
Brick5: dhcp46-42.lab.eng.blr.redhat.com:/bricks/brick1/br1
Brick6: dhcp46-101.lab.eng.blr.redhat.com:/bricks/brick1/br1
Brick7: dhcp47-155.lab.eng.blr.redhat.com:/bricks/brick1/br1
Brick8: dhcp47-167.lab.eng.blr.redhat.com:/bricks/brick1/br1
Brick9: dhcp46-42.lab.eng.blr.redhat.com:/bricks/brick2/br2
Brick10: dhcp46-101.lab.eng.blr.redhat.com:/bricks/brick2/br2
Brick11: dhcp47-155.lab.eng.blr.redhat.com:/bricks/brick2/br2
Brick12: dhcp47-167.lab.eng.blr.redhat.com:/bricks/brick2/br2
Options Reconfigured:
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: on
features.cache-invalidation-timeout: 600
ganesha.enable: on
features.cache-invalidation: on
transport.address-family: inet
performance.readdir-ahead: on
nfs.disable: on
nfs-ganesha: enable
cluster.enable-shared-storage: enable
[root@dhcp46-42 ~]#


[root@dhcp47-176 ~]# /opt/qa/tools/system_light/run.sh -w /mnt/nfs_v4 -t iozone -l /var/tmp/iozone.log

===========================TESTS RUNNING===========================
Changing to the specified mountpoint
/mnt/nfs_v4/run22576
executing iozone
start:13:59:45
read: Input/output error

real	17m16.234s
user	0m1.876s
sys	0m42.307s
end:14:17:02
iozone failed
0
Total 0 tests were successful
Switching over to the previous working directory
Removing /mnt/nfs_v4/run22576/
rmdir: failed to remove ‘/mnt/nfs_v4/run22576/’: Remote I/O error
rmdir failed:Directory not empty

Comment 4 Jiffin 2016-11-16 12:12:06 UTC
Arthy has reproduced this with the following steps:

1.) Mount the volume using v4 on two clients
2.) On one client run iozone -a
3.) On the second client run fileop
Both clients ended up with a remote I/O error while trying to remove the test directory; a command sketch follows.
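A sketch of that two-client reproduction, assuming the same vol1 volume and mount point; <ganesha-vip> and the fileop arguments are assumptions, only the tools and the NFSv4 mount are stated above.

# On both clients: mount the volume over NFSv4.
mount -t nfs -o vers=4 <ganesha-vip>:/vol1 /mnt/nfs_v4

# Client 1: full automatic iozone run inside the mount.
cd /mnt/nfs_v4 && iozone -a

# Client 2, in parallel: metadata-heavy fileop run inside the mount
# (-f sets the per-level file count; the exact arguments used are not recorded).
cd /mnt/nfs_v4 && fileop -f 10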

From the ganesha-gfapi.logs:
The message "W [MSGID: 108008] [afr-read-txn.c:244:afr_read_txn] 0-vol1_new-replicate-1: Unreadable subvolume -1 found with event generation 5 for gfid 00000000-0000-0000-0000-000000000001. (Possible split-brain)" repeated 2 times between [2016-11-16 11:59:31.472371] and [2016-11-16 11:59:31.475652]
The message "W [MSGID: 114031] [client-rpc-fops.c:2938:client3_3_lookup_cbk] 0-vol1_new-client-2: remote operation failed. Path: / (00000000-0000-0000-0000-000000000001) [Transport endpoint is not connected]" repeated 2 times between [2016-11-16 11:59:31.472493] and [2016-11-16 11:59:31.475672]
[2016-11-16 11:59:31.504158] E [MSGID: 114031] [client-rpc-fops.c:2847:client3_3_opendir_cbk] 0-vol1_new-client-2: remote operation failed. Path: /run15413 (859a3b6d-9ea8-48e7-9d03-108d0ba009c5) [Transport endpoint is not connected]

It is clear that the opendir call failed on 0-vol1_new-client-2, and hence the removal ended in failure.

When client3_3_opendir was checked via gdb on the client machine, the request is never sent to this server because, in the client context, the server is marked as disconnected. But the brick process is up and running. Another thing noted: it is not broadcast to the other subvolume (I mean vol1_new-client-3). The end result is that the opendir failed, and hence so did the rmdir.
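For reference, a sketch of how that can be checked on the client; the function name comes from the logs above, while the process name and pid lookup are assumptions.

# Attach gdb to the nfs-ganesha process and break on the opendir fop in the
# client xlator. If the breakpoint never fires while the rmdir path runs,
# the request is being short-circuited client-side because vol1_new-client-2
# is still marked disconnected in the client context.
gdb -p $(pidof ganesha.nfsd) \
    -ex 'break client3_3_opendir' \
    -ex continue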


I can also see the following messages in the brick log (not sure when they were generated):
[2016-11-16 10:46:18.740750] E [server.c:202:server_submit_reply] (-->/usr/lib64/glusterfs/3.8.4/xlator/debug/io-stats.so(+0x15dce) [0x7f2506b3cdce] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x196d0) [0x7f25066e06d0] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x9186) [0x7f25066d0186] ) 0-: Reply submission failed
[2016-11-16 10:46:18.744106] E [MSGID: 113040] [posix-helpers.c:1622:__posix_fd_ctx_get] 0-vol1_new-posix: Failed to get fd context for a non-anonymous fd, file: (null), gfid: 952e9ac0-3700-4ee5-835e-54fc9e60116a
[2016-11-16 10:46:18.744145] W [MSGID: 113006] [posix.c:3506:posix_flush] 0-vol1_new-posix: pfd is NULL on fd=0x7f251ba460f8 [Invalid argument]
[2016-11-16 10:46:18.744207] I [MSGID: 101055] [client_t.c:415:gf_client_unref] 0-vol1_new-server: Shutting down connection dhcp46-124.lab.eng.blr.redhat.com-22631-2016/11/15-10:32:57:485838-vol1_new-client-2-0-1
[2016-11-16 10:46:18.744462] E [server-helpers.c:395:server_alloc_frame] (-->/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x325) [0x7f251b51d765] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0x2da3c) [0x7f25066f4a3c] -->/usr/lib64/glusterfs/3.8.4/xlator/protocol/server.so(+0xd7f4) [0x7f25066d47f4] ) 0-server: invalid argument: client [Invalid argument]
[2016-11-16 10:46:18.744487] E [rpcsvc.c:557:rpcsvc_check_and_reply_error] 0-rpcsvc: rpc actor failed to complete successfully
[2016-11-16 10:46:18.745188] I [MSGID: 101055] [client_t.c:415:gf_client_unref] 0-vol1_new-server: Shutting down connection dhcp46-139.lab.eng.blr.redhat.com-24725-2016/11/15-10:32:57:435295-vol1_new-client-2-0-0

Comment 5 Jiffin 2016-11-16 12:46:11 UTC
After discussing with Pranith, he mentioned it is a race in the client layer. The following logs confirm the same:
[2016-11-16 05:38:40.512423] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 0-vol1_new-client-2: Connected to vol1_new-client-2, attached to remote volume '/bricks/brick0/br0'.
[2016-11-16 05:38:40.512464] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 0-vol1_new-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2016-11-16 06:00:40.351777] I [MSGID: 114018] [client.c:2280:client_rpc_notify] 0-vol1_new-client-2: disconnected from vol1_new-client-2. Client process will keep trying to connect to glusterd until brick's port is available
[2016-11-16 06:00:40.352534] I [MSGID: 114046] [client-handshake.c:1222:client_setvolume_cbk] 0-vol1_new-client-2: Connected to vol1_new-client-2, attached to remote volume '/bricks/brick0/br0'.
[2016-11-16 06:00:40.352556] I [MSGID: 114047] [client-handshake.c:1233:client_setvolume_cbk] 0-vol1_new-client-2: Server and Client lk-version numbers are not same, reopening the fds

Here the disconnection and reconnection happened within a span of milliseconds.
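A quick way to see that window from the log (a sketch; the ganesha-gfapi.log path varies by setup and is an assumption here):

# Print the connect/disconnect transitions for the affected subvolume and
# compare the timestamps; here the disconnect and the reconnect land less
# than a millisecond apart.
grep -E 'vol1_new-client-2: (Connected to|disconnected from)' \
    /var/log/ganesha/ganesha-gfapi.log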

*** This bug has been marked as a duplicate of bug 1385605 ***

Comment 6 Jiffin 2016-11-16 12:48:59 UTC
Since the use case is different, for the time being, changing the component of this bug; it will be retested with the fix.

Comment 9 Raghavendra Talur 2016-11-23 12:50:01 UTC
Patch posted upstream at http://review.gluster.org/#/c/15916

Comment 11 Arthy Loganathan 2016-12-19 05:08:27 UTC
Tested and verified the fix with the following builds:

nfs-ganesha-2.4.1-2.el7rhgs.x86_64
nfs-ganesha-gluster-2.4.1-2.el7rhgs.x86_64
glusterfs-ganesha-3.8.4-8.el7rhgs.x86_64

Comment 13 errata-xmlrpc 2017-03-23 06:17:29 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

