Bug 801398

Summary: [7fec9b41d8e1befa8d58a76d98207debddd60b65]: nfs mount reports IO error when running fs-mark tests concurrently
Product: [Community] GlusterFS Reporter: Rahul C S <rahulcs>
Component: glusterdAssignee: Kaushal <kaushal>
Status: CLOSED WORKSFORME QA Contact:
Severity: high Docs Contact:
Priority: high    
Version: pre-releaseCC: gluster-bugs, vbellur, vinaraya
Target Milestone: ---   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-04-11 06:57:21 UTC Type: ---
Regression: --- Mount Type: nfs
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:

Description Rahul C S 2012-03-08 12:51:33 UTC
Description of problem:
Created a 4 node storage pool. Created a distributed-replicate volume with all the 4 nodes. From both fuse & nfs clients, ran fs-mark tests & on the nfs mount fs_mark failed.

#  /opt/qa/tools/fs_mark  -d  .  -D  4  -t  10  -S  2
#       Version 3.3, 10 thread(s) starting at Wed Mar  7 06:39:39 2012
#       Sync method: SYSTEM SYNC/SINGLE FSYNC: Issue sync() after main write loop and 1 file fsync() per subdirectory.
#       Directories:  Time based hash between directories across 4 subdirectories with 180 seconds per subdirectory.
#       File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#       Files info: size 51200 bytes, written with an IO size of 16384 bytes per write
#       App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
Error in creat: Input/output error
Error in creat: Input/output error
fs_mark: mkdir ./00 failed: Input/output error
Error in creat: Input/output error
Error in creat: Input/output error
Error in creat: Input/output error
Error in creat: Input/output error
Error in creat: Input/output error
Error in creat: Input/output error
Error in creat: Input/output error
fopen failed to open: fs_log.txt.2333
fs_mark: failed to fopen log file: fs_log.txt Input/output error
fs_mark: failed to fopen log file: fs_log.txt Input/output error
fs_mark: failed to fopen log file: fs_log.txt Input/output error
fs_mark: failed to fopen log file: fs_log.txt Input/output error

gluster peer status started to discard one of the machine when the output is reported.

[root@RHEL6 glusterfs.git]# gluster peer status
Number of Peers: 2

Hostname: 10.1.11.153
Uuid: 2cfd617e-b00d-4802-82c1-fa5e7bce230d
State: Peer in Cluster (Connected)

Hostname: 10.1.11.154
Uuid: 6b9483ba-ef12-41a5-a8f5-48ab90eb3085
State: Peer in Cluster (Connected)
[root@RHEL6 glusterfs.git]# gluster peer status
[root@RHEL6 glusterfs.git]# echo $?
110
[root@RHEL6 glusterfs.git]#


The machine 152 which was part of the cluster is not listed & even glusterd is running on the machine.

glusterd logs:
[2012-03-07 06:41:08.008882] I [glusterd-op-sm.c:2110:glusterd_op_txn_complete] 0-glusterd: Cleared local lock
[2012-03-07 07:10:22.608522] E [rpc-clnt.c:217:call_bail] 0-management: bailing out frame type(GLUSTERD-DUMP) op(DUMP(1)) xid = 0x1x sent = 2
012-03-07 06:40:21.444218. timeout = 1800
[2012-03-07 07:10:22.608669] E [glusterd-handshake.c:369:glusterd_peer_dump_version_cbk] 0-: error through RPC layer, retry again later
[2012-03-07 07:20:02.545281] I [glusterd-handler.c:808:glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2012-03-07 07:40:33.814894] E [rpc-clnt.c:217:call_bail] 0-management: bailing out frame type(GLUSTERD-DUMP) op(DUMP(1)) xid = 0x2x sent = 2
012-03-07 07:10:23.612376. timeout = 1800
[2012-03-07 07:40:33.815002] E [glusterd-handshake.c:369:glusterd_peer_dump_version_cbk] 0-: error through RPC layer, retry again later
[2012-03-07 08:10:40.015334] E [rpc-clnt.c:217:call_bail] 0-management: bailing out frame type(GLUSTERD-DUMP) op(DUMP(1)) xid = 0x3x sent = 2
012-03-07 07:40:38.821398. timeout = 1800
[2012-03-07 08:10:40.015445] E [glusterd-handshake.c:369:glusterd_peer_dump_version_cbk] 0-: error through RPC layer, retry again later
[2012-03-07 08:40:55.220929] E [rpc-clnt.c:217:call_bail] 0-management: bailing out frame type(GLUSTERD-DUMP) op(DUMP(1)) xid = 0x4x sent = 2
012-03-07 08:10:45.018283. timeout = 1800


glusterd logs on the machine which is not listed [152]: 
[2012-03-07 06:41:48.929963] I [glusterd-handler.c:1355:glusterd_handle_cluster_unlock] 0-glusterd: Received UNLOCK from uuid: b0e23374-0cf8-4bd2-8072-9c8467efc762
[2012-03-07 06:41:48.930014] I [glusterd-handler.c:1331:glusterd_op_unlock_send_resp] 0-glusterd: Responded to unlock, ret: 0
[2012-03-07 07:11:05.678460] E [rpc-clnt.c:217:call_bail] 0-management: bailing out frame type(Peer mgmt) op(--(4)) xid = 0x3x sent = 2012-03-07 06:41:02.423580. timeout = 1800
[2012-03-07 07:11:08.679042] E [rpc-clnt.c:217:call_bail] 0-management: bailing out frame type(GLUSTERD-DUMP) op(DUMP(1)) xid = 0x1x sent = 2012-03-07 06:41:07.847568. timeout = 1800
[2012-03-07 07:11:08.679097] E [glusterd-handshake.c:369:glusterd_peer_dump_version_cbk] 0-: error through RPC layer, retry again later
[2012-03-07 07:21:04.784258] I [glusterd-handler.c:808:glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2012-03-07 07:41:21.956594] E [rpc-clnt.c:217:call_bail] 0-management: bailing out frame type(GLUSTERD-DUMP) op(DUMP(1)) xid = 0x2x sent = 2012-03-07 07:11:11.683099. timeout = 1800
[2012-03-07 07:41:21.956724] E [glusterd-handshake.c:369:glusterd_peer_dump_version_cbk] 0-: error through RPC layer, retry again later

nfs server log:
[2012-03-07 07:08:45.717419] W [rpcsvc.c:524:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-03-07 07:10:44.921462] E [rpc-clnt.c:217:call_bail] 0-testvol-client-3: bailing out frame type(GF-DUMP) op(DUMP(1)) xid = 0x1x sent = 2
012-03-07 06:40:43.729356. timeout = 1800
[2012-03-07 07:10:44.921575] W [client-handshake.c:1732:client_dump_version_cbk] 0-testvol-client-3: received RPC status error
[2012-03-07 07:10:45.716338] E [nfs3.c:798:nfs3_getattr] 0-nfs-nfsv3: Volume is disabled: testvol
[2012-03-07 07:10:45.716406] W [rpcsvc.c:524:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-03-07 07:11:45.715482] E [nfs3.c:798:nfs3_getattr] 0-nfs-nfsv3: Volume is disabled: testvol
[2012-03-07 07:11:45.715572] W [rpcsvc.c:524:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-03-07 07:13:45.714403] E [nfs3.c:1537:nfs3_access] 0-nfs-nfsv3: Volume is disabled: testvol
[2012-03-07 07:13:45.714479] W [rpcsvc.c:524:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-03-07 07:14:45.713772] E [nfs3.c:1537:nfs3_access] 0-nfs-nfsv3: Volume is disabled: testvol
[2012-03-07 07:14:45.713838] W [rpcsvc.c:524:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-03-07 07:16:45.712689] E [nfs3.c:1537:nfs3_access] 0-nfs-nfsv3: Volume is disabled: testvol
This error has continued over & over.

Comment 2 Kaushal 2012-04-11 06:57:21 UTC
I don't hit this on master. fs_mark runs to completion.
I also didn't hit the other problem mentioned about a peer disappearing.