Description of problem:

Created a 4-node storage pool and a distributed-replicate volume spanning all 4 nodes. Ran fs_mark tests from both FUSE and NFS clients; on the NFS mount, fs_mark failed:

# /opt/qa/tools/fs_mark -d . -D 4 -t 10 -S 2
#	Version 3.3, 10 thread(s) starting at Wed Mar 7 06:39:39 2012
#	Sync method: SYSTEM SYNC/SINGLE FSYNC: Issue sync() after main write loop and 1 file fsync() per subdirectory.
#	Directories: Time based hash between directories across 4 subdirectories with 180 seconds per subdirectory.
#	File names: 40 bytes long, (16 initial bytes of time stamp with 24 random bytes at end of name)
#	Files info: size 51200 bytes, written with an IO size of 16384 bytes per write
#	App overhead is time in microseconds spent in the test not doing file writing related system calls.

FSUse%        Count         Size    Files/sec     App Overhead
Error in creat: Input/output error
Error in creat: Input/output error
fs_mark: mkdir ./00 failed: Input/output error
Error in creat: Input/output error
Error in creat: Input/output error
Error in creat: Input/output error
Error in creat: Input/output error
Error in creat: Input/output error
Error in creat: Input/output error
Error in creat: Input/output error
fopen failed to open: fs_log.txt.2333
fs_mark: failed to fopen log file: fs_log.txt Input/output error
fs_mark: failed to fopen log file: fs_log.txt Input/output error
fs_mark: failed to fopen log file: fs_log.txt Input/output error
fs_mark: failed to fopen log file: fs_log.txt Input/output error

After this, "gluster peer status" began omitting one of the machines from its output:

[root@RHEL6 glusterfs.git]# gluster peer status
Number of Peers: 2

Hostname: 10.1.11.153
Uuid: 2cfd617e-b00d-4802-82c1-fa5e7bce230d
State: Peer in Cluster (Connected)

Hostname: 10.1.11.154
Uuid: 6b9483ba-ef12-41a5-a8f5-48ab90eb3085
State: Peer in Cluster (Connected)
[root@RHEL6 glusterfs.git]# gluster peer status
[root@RHEL6 glusterfs.git]# echo $?
110
[root@RHEL6 glusterfs.git]#

Machine 152, which was part of the cluster, is not listed even though glusterd is running on that machine.

glusterd logs:

[2012-03-07 06:41:08.008882] I [glusterd-op-sm.c:2110:glusterd_op_txn_complete] 0-glusterd: Cleared local lock
[2012-03-07 07:10:22.608522] E [rpc-clnt.c:217:call_bail] 0-management: bailing out frame type(GLUSTERD-DUMP) op(DUMP(1)) xid = 0x1x sent = 2012-03-07 06:40:21.444218. timeout = 1800
[2012-03-07 07:10:22.608669] E [glusterd-handshake.c:369:glusterd_peer_dump_version_cbk] 0-: error through RPC layer, retry again later
[2012-03-07 07:20:02.545281] I [glusterd-handler.c:808:glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2012-03-07 07:40:33.814894] E [rpc-clnt.c:217:call_bail] 0-management: bailing out frame type(GLUSTERD-DUMP) op(DUMP(1)) xid = 0x2x sent = 2012-03-07 07:10:23.612376. timeout = 1800
[2012-03-07 07:40:33.815002] E [glusterd-handshake.c:369:glusterd_peer_dump_version_cbk] 0-: error through RPC layer, retry again later
[2012-03-07 08:10:40.015334] E [rpc-clnt.c:217:call_bail] 0-management: bailing out frame type(GLUSTERD-DUMP) op(DUMP(1)) xid = 0x3x sent = 2012-03-07 07:40:38.821398. timeout = 1800
[2012-03-07 08:10:40.015445] E [glusterd-handshake.c:369:glusterd_peer_dump_version_cbk] 0-: error through RPC layer, retry again later
[2012-03-07 08:40:55.220929] E [rpc-clnt.c:217:call_bail] 0-management: bailing out frame type(GLUSTERD-DUMP) op(DUMP(1)) xid = 0x4x sent = 2012-03-07 08:10:45.018283.
timeout = 1800

glusterd logs on the machine which is not listed [152]:

[2012-03-07 06:41:48.929963] I [glusterd-handler.c:1355:glusterd_handle_cluster_unlock] 0-glusterd: Received UNLOCK from uuid: b0e23374-0cf8-4bd2-8072-9c8467efc762
[2012-03-07 06:41:48.930014] I [glusterd-handler.c:1331:glusterd_op_unlock_send_resp] 0-glusterd: Responded to unlock, ret: 0
[2012-03-07 07:11:05.678460] E [rpc-clnt.c:217:call_bail] 0-management: bailing out frame type(Peer mgmt) op(--(4)) xid = 0x3x sent = 2012-03-07 06:41:02.423580. timeout = 1800
[2012-03-07 07:11:08.679042] E [rpc-clnt.c:217:call_bail] 0-management: bailing out frame type(GLUSTERD-DUMP) op(DUMP(1)) xid = 0x1x sent = 2012-03-07 06:41:07.847568. timeout = 1800
[2012-03-07 07:11:08.679097] E [glusterd-handshake.c:369:glusterd_peer_dump_version_cbk] 0-: error through RPC layer, retry again later
[2012-03-07 07:21:04.784258] I [glusterd-handler.c:808:glusterd_handle_cli_list_friends] 0-glusterd: Received cli list req
[2012-03-07 07:41:21.956594] E [rpc-clnt.c:217:call_bail] 0-management: bailing out frame type(GLUSTERD-DUMP) op(DUMP(1)) xid = 0x2x sent = 2012-03-07 07:11:11.683099. timeout = 1800
[2012-03-07 07:41:21.956724] E [glusterd-handshake.c:369:glusterd_peer_dump_version_cbk] 0-: error through RPC layer, retry again later

nfs server log:

[2012-03-07 07:08:45.717419] W [rpcsvc.c:524:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-03-07 07:10:44.921462] E [rpc-clnt.c:217:call_bail] 0-testvol-client-3: bailing out frame type(GF-DUMP) op(DUMP(1)) xid = 0x1x sent = 2012-03-07 06:40:43.729356. timeout = 1800
[2012-03-07 07:10:44.921575] W [client-handshake.c:1732:client_dump_version_cbk] 0-testvol-client-3: received RPC status error
[2012-03-07 07:10:45.716338] E [nfs3.c:798:nfs3_getattr] 0-nfs-nfsv3: Volume is disabled: testvol
[2012-03-07 07:10:45.716406] W [rpcsvc.c:524:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-03-07 07:11:45.715482] E [nfs3.c:798:nfs3_getattr] 0-nfs-nfsv3: Volume is disabled: testvol
[2012-03-07 07:11:45.715572] W [rpcsvc.c:524:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-03-07 07:13:45.714403] E [nfs3.c:1537:nfs3_access] 0-nfs-nfsv3: Volume is disabled: testvol
[2012-03-07 07:13:45.714479] W [rpcsvc.c:524:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-03-07 07:14:45.713772] E [nfs3.c:1537:nfs3_access] 0-nfs-nfsv3: Volume is disabled: testvol
[2012-03-07 07:14:45.713838] W [rpcsvc.c:524:rpcsvc_handle_rpc_call] 0-rpcsvc: failed to queue error reply
[2012-03-07 07:16:45.712689] E [nfs3.c:1537:nfs3_access] 0-nfs-nfsv3: Volume is disabled: testvol

These errors repeat continuously.
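A quick way to catch this state from a script is to check both the exit status of "gluster peer status" and the reported peer count (in a healthy 4-node pool each node should see 3 peers). The sketch below is hypothetical, not part of the report; it inlines the peer-status output shown above as sample text so it can be read standalone, where a real check would pipe the command output instead:

```shell
#!/bin/sh
# Sketch: flag a dropped peer by comparing the "Number of Peers" value
# against the expected count. Sample output copied from the report above;
# a live check would use: gluster peer status
expected_peers=3   # 4-node pool => each node should see 3 peers

peer_output='Number of Peers: 2

Hostname: 10.1.11.153
Uuid: 2cfd617e-b00d-4802-82c1-fa5e7bce230d
State: Peer in Cluster (Connected)

Hostname: 10.1.11.154
Uuid: 6b9483ba-ef12-41a5-a8f5-48ab90eb3085
State: Peer in Cluster (Connected)'

# "Number of Peers: 2" -> field 4 is the count
reported=$(printf '%s\n' "$peer_output" | awk '/^Number of Peers:/ {print $4}')
if [ "$reported" -ne "$expected_peers" ]; then
    echo "peer count mismatch: expected $expected_peers, got $reported"
fi
```

A live version would also treat a nonzero exit status from the CLI (110 in the session above) as a failure in its own right, since the second invocation here produced no output at all.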
I don't hit this on master: fs_mark runs to completion. I also did not hit the other problem described above, the peer disappearing from peer status.
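For anyone triaging similar reports: the repeating "bailing out frame" entries land roughly every 30 minutes because each frame is abandoned after the 1800-second timeout printed in the message. A minimal, hypothetical filter for tallying them (the sample lines are copied from the glusterd log excerpts above; on a live node you would grep the glusterd log file instead):

```shell
#!/bin/sh
# Sketch: count call_bail errors in a glusterd log excerpt.
# Sample entries copied verbatim from this report.
log_lines='[2012-03-07 07:10:22.608522] E [rpc-clnt.c:217:call_bail] 0-management: bailing out frame type(GLUSTERD-DUMP) op(DUMP(1)) xid = 0x1x sent = 2012-03-07 06:40:21.444218. timeout = 1800
[2012-03-07 07:10:22.608669] E [glusterd-handshake.c:369:glusterd_peer_dump_version_cbk] 0-: error through RPC layer, retry again later
[2012-03-07 07:40:33.814894] E [rpc-clnt.c:217:call_bail] 0-management: bailing out frame type(GLUSTERD-DUMP) op(DUMP(1)) xid = 0x2x sent = 2012-03-07 07:10:23.612376. timeout = 1800'

bails=$(printf '%s\n' "$log_lines" | grep -c 'call_bail')
echo "call_bail entries: $bails"
```

Comparing the "sent" timestamp with the entry's own timestamp confirms the 1800s gap (e.g. sent 06:40:21, bailed 07:10:22).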