Created attachment 941976 [details]
mount log

Description of problem:
With glusterfs 3.6.0 beta1 as the client and glusterfs 3.3.0 as the server, write operations fail, while the mount operation and read operations succeed.

Version-Release number of selected component (if applicable):
glusterfs 3.6.0 beta1

How reproducible:
Mount a gluster 3.3 volume with a gluster 3.6 client.

Steps to Reproduce:
1. Install glusterfs 3.6.0 beta1 on the client.
2. Install glusterfs 3.3.0 on the server.
3. On the client, mount the volume served by gluster 3.3.0.
4. Write a file under the mount point.

Actual results:
The write does not succeed.

Expected results:
The write succeeds.

Additional info:
Hi, here is some more information on the situation.

----->>>>> version of client and server <<<<<-----
client version: glusterfs 3.6.0 beta1
server version: glusterfs 3.3.0

---------->>>> volume info <<<<<-------------
Volume Name: myvol
Type: Distributed-Replicate
Volume ID: c36dfe1c-3f95-4d64-9dae-1b5916b56b19
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.10.10.10:/mnt/xfsd/myvol-0
Brick2: 10.10.10.10:/mnt/xfsd/myvol-1
Brick3: 10.10.10.10:/mnt/xfsd/myvol-2
Brick4: 10.10.10.10:/mnt/xfsd/myvol-3

-------------->>>> mount with --debug <<<<<-------------------
/usr/sbin/glusterfs --volfile-server=10.10.10.10 --volfile-id=myvol /mnt/myvol --debug

Then, in another window on the client, I go to the mount point /mnt/myvol and run "ls". A couple of lines are printed and I see no error message. Next I run:

echo "hello,gluster3.3" > /mnt/myvol/file

and some error-like messages appear. With "ls" I can still find the file named "file" under the mount point, but "cat /mnt/myvol/file" returns nothing, which means it is an empty file. After many tests the conclusion is: I cannot write anything to a file, while I can create a file successfully and can read an existing file successfully.

-------------->>>>> mount log (error-like messages when writing the file) <<<<<<----------------
[2014-09-28 07:41:28.585181] D [logging.c:1781:__gf_log_inject_timer_event] 0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5
[2014-09-28 07:42:42.433338] D [MSGID: 0] [dht-common.c:621:dht_revalidate_cbk] 0-myvol-dht: revalidate lookup of / returned with op_ret 0 and op_errno 117
[2014-09-28 07:43:19.487414] D [MSGID: 0] [dht-common.c:2182:dht_lookup] 0-myvol-dht: Calling fresh lookup for /file on myvol-replicate-1
[2014-09-28 07:43:19.558459] D [MSGID: 0] [dht-common.c:1818:dht_lookup_cbk] 0-myvol-dht: fresh_lookup returned for /file with op_ret -1 and op_errno 2
[2014-09-28 07:43:19.558497] I [dht-common.c:1822:dht_lookup_cbk] 0-myvol-dht: Entry /file missing on subvol myvol-replicate-1
[2014-09-28 07:43:19.558517] D [MSGID: 0] [dht-common.c:1607:dht_lookup_everywhere] 0-myvol-dht: winding lookup call to 2 subvols
[2014-09-28 07:43:19.634376] D [MSGID: 0] [dht-common.c:1413:dht_lookup_everywhere_cbk] 0-myvol-dht: returned with op_ret -1 and op_errno 2 (/file) from subvol myvol-replicate-0
[2014-09-28 07:43:19.634573] D [MSGID: 0] [dht-common.c:1413:dht_lookup_everywhere_cbk] 0-myvol-dht: returned with op_ret -1 and op_errno 2 (/file) from subvol myvol-replicate-1
[2014-09-28 07:43:19.634605] D [MSGID: 0] [dht-common.c:1086:dht_lookup_everywhere_done] 0-myvol-dht: STATUS: hashed_subvol myvol-replicate-1 cached_subvol null
[2014-09-28 07:43:19.634624] D [MSGID: 0] [dht-common.c:1147:dht_lookup_everywhere_done] 0-myvol-dht: There was no cached file and unlink on hashed is not skipped /file
[2014-09-28 07:43:19.634663] D [fuse-resolve.c:83:fuse_resolve_entry_cbk] 0-fuse: 00000000-0000-0000-0000-000000000001/file: failed to resolve (No such file or directory)
[2014-09-28 07:43:19.708608] I [dht-common.c:1822:dht_lookup_cbk] 0-myvol-dht: Entry /file missing on subvol myvol-replicate-1
[2014-09-28 07:43:19.781420] D [logging.c:1937:_gf_msg_internal] 0-logging-infra: Buffer overflow of a buffer whose size limit is 5. About to flush least recently used log message to disk
[2014-09-28 07:43:19.708640] D [MSGID: 0] [dht-common.c:1607:dht_lookup_everywhere] 0-myvol-dht: winding lookup call to 2 subvols
[2014-09-28 07:43:19.781418] D [MSGID: 0] [dht-common.c:1413:dht_lookup_everywhere_cbk] 0-myvol-dht: returned with op_ret -1 and op_errno 2 (/file) from subvol myvol-replicate-0
[2014-09-28 07:43:19.781629] D [MSGID: 0] [dht-common.c:1413:dht_lookup_everywhere_cbk] 0-myvol-dht: returned with op_ret -1 and op_errno 2 (/file) from subvol myvol-replicate-1
[2014-09-28 07:43:19.781653] D [MSGID: 0] [dht-common.c:1086:dht_lookup_everywhere_done] 0-myvol-dht: STATUS: hashed_subvol myvol-replicate-1 cached_subvol null
[2014-09-28 07:43:19.851925] I [dht-common.c:1822:dht_lookup_cbk] 0-myvol-dht: Entry /file missing on subvol myvol-replicate-1
[2014-09-28 07:43:19.851954] D [logging.c:1937:_gf_msg_internal] 0-logging-infra: Buffer overflow of a buffer whose size limit is 5. About to flush least recently used log message to disk
The message "D [MSGID: 0] [dht-common.c:1818:dht_lookup_cbk] 0-myvol-dht: fresh_lookup returned for /file with op_ret -1 and op_errno 2" repeated 2 times between [2014-09-28 07:43:19.558459] and [2014-09-28 07:43:19.851922]
[2014-09-28 07:43:19.851954] D [MSGID: 0] [dht-common.c:1607:dht_lookup_everywhere] 0-myvol-dht: winding lookup call to 2 subvols
[2014-09-28 07:43:19.922764] D [MSGID: 0] [dht-common.c:1413:dht_lookup_everywhere_cbk] 0-myvol-dht: returned with op_ret -1 and op_errno 2 (/file) from subvol myvol-replicate-0
[2014-09-28 07:43:19.922925] D [MSGID: 0] [dht-common.c:1413:dht_lookup_everywhere_cbk] 0-myvol-dht: returned with op_ret -1 and op_errno 2 (/file) from subvol myvol-replicate-1
[2014-09-28 07:43:19.922974] D [fuse-resolve.c:83:fuse_resolve_entry_cbk] 0-fuse: 00000000-0000-0000-0000-000000000001/file: failed to resolve (No such file or directory)
[2014-09-28 07:43:19.997012] D [logging.c:1937:_gf_msg_internal] 0-logging-infra: Buffer overflow of a buffer whose size limit is 5. About to flush least recently used log message to disk
The message "D [MSGID: 0] [dht-common.c:1147:dht_lookup_everywhere_done] 0-myvol-dht: There was no cached file and unlink on hashed is not skipped /file" repeated 2 times between [2014-09-28 07:43:19.634624] and [2014-09-28 07:43:19.922951]
[2014-09-28 07:43:19.997011] D [MSGID: 0] [dht-diskusage.c:96:dht_du_info_cbk] 0-myvol-dht: subvolume 'myvol-replicate-0': avail_percent is: 99.00 and avail_space is: 44000304627712 and avail_inodes is: 99.00
[2014-09-28 07:43:19.997134] D [MSGID: 0] [dht-diskusage.c:96:dht_du_info_cbk] 0-myvol-dht: subvolume 'myvol-replicate-1': avail_percent is: 99.00 and avail_space is: 44000304627712 and avail_inodes is: 99.00
[2014-09-28 07:43:19.997179] D [afr-transaction.c:1166:afr_post_nonblocking_entrylk_cbk] 0-myvol-replicate-1: Non blocking entrylks done. Proceeding to FOP
[2014-09-28 07:43:20.067587] D [afr-lk-common.c:447:transaction_lk_op] 0-myvol-replicate-1: lk op is for a transaction
[2014-09-28 07:43:20.216287] D [afr-transaction.c:1116:afr_post_nonblocking_inodelk_cbk] 0-myvol-replicate-1: Non blocking inodelks done. Proceeding to FOP
[2014-09-28 07:43:20.356844] W [client-rpc-fops.c:850:client3_3_writev_cbk] 0-myvol-client-3: remote operation failed: Transport endpoint is not connected
[2014-09-28 07:43:20.356979] W [client-rpc-fops.c:850:client3_3_writev_cbk] 0-myvol-client-2: remote operation failed: Transport endpoint is not connected
[2014-09-28 07:43:20.357009] D [afr-lk-common.c:447:transaction_lk_op] 0-myvol-replicate-1: lk op is for a transaction
[2014-09-28 07:43:20.428013] W [fuse-bridge.c:1261:fuse_err_cbk] 0-glusterfs-fuse: 14: FLUSH() ERR => -1 (Transport endpoint is not connected)
[2014-09-28 07:43:28.593512] D [logging.c:1816:gf_log_flush_timeout_cbk] 0-logging-infra: Log timer timed out. About to flush outstanding messages if present
The message "D [MSGID: 0] [dht-common.c:621:dht_revalidate_cbk] 0-myvol-dht: revalidate lookup of / returned with op_ret 0 and op_errno 117" repeated 5 times between [2014-09-28 07:42:42.433338] and [2014-09-28 07:43:19.487219]
The message "D [MSGID: 0] [dht-common.c:2182:dht_lookup] 0-myvol-dht: Calling fresh lookup for /file on myvol-replicate-1" repeated 2 times between [2014-09-28 07:43:19.487414] and [2014-09-28 07:43:19.781835]
[2014-09-28 07:43:19.922950] D [MSGID: 0] [dht-common.c:1086:dht_lookup_everywhere_done] 0-myvol-dht: STATUS: hashed_subvol myvol-replicate-1 cached_subvol null
[2014-09-28 07:43:28.593615] D [logging.c:1781:__gf_log_inject_timer_event] 0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5
[2014-09-28 07:44:26.517010] D [MSGID: 0] [dht-common.c:621:dht_revalidate_cbk] 0-myvol-dht: revalidate lookup of / returned with op_ret 0 and op_errno 117
[2014-09-28 07:44:26.950203] D [MSGID: 0] [dht-common.c:2108:dht_lookup] 0-myvol-dht: calling revalidate lookup for /file at myvol-replicate-1
[2014-09-28 07:44:27.022986] D [MSGID: 0] [dht-common.c:621:dht_revalidate_cbk] 0-myvol-dht: revalidate lookup of /file returned with op_ret 0 and op_errno 0

Please give me some ideas to fix this problem. Thanks, all!
Hi all, I have done more tests in my environment. After debugging glusterfs 3.6.0 beta2 I have more information about the compatibility issue.

I find that when I perform a write operation, the variable req->rpc_status is -1 in xlators/protocol/client/src/client-rpc-fops.c:810, function client3_3_writev_cbk. To find out who sets rpc_status to -1, I added debug logging to all the corresponding files and re-ran the process, writing a file. In the log I found that rpc_status is set to -1 in rpc/rpc-lib/src/rpc-clnt.c:579, function rpc_clnt_reply_fill:

if ((rpc_reply_status (replymsg) == MSG_DENIED) ||
    (rpc_accepted_reply_status (replymsg) != SUCCESS)) {
        req->rpc_status = -1;
}

Then I logged rpc_reply_status (replymsg) and rpc_accepted_reply_status (replymsg), which are 0 and 4 respectively. Searching around, 0 means MSG_ACCEPTED and 4 means GARBAGE_ARGS. So the RPC call is accepted, but the server treats the arguments in the RPC message body as garbage. Can anybody tell me what I should do next?
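For anyone else reading this: those two values are the standard ONC RPC (RFC 5531) reply codes, not anything GlusterFS-specific. A minimal illustrative sketch in C of what the numbers mean (the enum values come from the RPC spec; the helper function is hypothetical, only for decoding the log output):

/* Standard ONC RPC reply/accept status values (RFC 5531). */
enum reply_stat  { MSG_ACCEPTED = 0, MSG_DENIED = 1 };
enum accept_stat {
        SUCCESS       = 0,   /* RPC executed successfully           */
        PROG_UNAVAIL  = 1,   /* program not exported by the server  */
        PROG_MISMATCH = 2,   /* program version not supported       */
        PROC_UNAVAIL  = 3,   /* procedure not supported             */
        GARBAGE_ARGS  = 4,   /* server cannot decode the arguments  */
        SYSTEM_ERR    = 5    /* other error on the server           */
};

/* Hypothetical helper to translate the value seen in the debug log. */
static const char *
accept_stat_name (int stat)
{
        switch (stat) {
        case SUCCESS:        return "SUCCESS";
        case PROG_UNAVAIL:   return "PROG_UNAVAIL";
        case PROG_MISMATCH:  return "PROG_MISMATCH";
        case PROC_UNAVAIL:   return "PROC_UNAVAIL";
        case GARBAGE_ARGS:   return "GARBAGE_ARGS";
        case SYSTEM_ERR:     return "SYSTEM_ERR";
        default:             return "UNKNOWN";
        }
}

So reply_stat = 0 (MSG_ACCEPTED) together with accept_stat = 4 (GARBAGE_ARGS) means the server accepted the connection, program and version, but could not decode the arguments of the WRITE call itself.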
Does anybody have ideas about this issue?
This issue still exists in glusterfs 3.6.1.
I saw the same issue.

=== client (3.5) logs:
W [fuse-bridge.c:2242:fuse_writev_cbk] 0-glusterfs-fuse: 224: WRITE => -1 (Transport endpoint is not connected)
W [fuse-bridge.c:1236:fuse_err_cbk] 0-glusterfs-fuse: 225: FSYNC() ERR => -1 (Transport endpoint is not connected)
W [fuse-bridge.c:1236:fuse_err_cbk] 0-glusterfs-fuse: 226: FLUSH() ERR => -1 (Transport endpoint is not connected)

=== server (3.3) logs:
[2015-01-06 14:11:38.281263] E [dict.c:2413:dict_unserialize] (-->/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x93) [0x317c20a5b3] (-->/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x307) [0x317c20a4b7] (-->/usr/lib64/glusterfs/3.3.0.5rhs_iqiyi_9/xlator/protocol/server.so(server_writev+0x24b) [0x7f3607ddd8bb]))) 0-dict: undersized buffer passed. available (55231882) < required (927647052)
[2015-01-06 14:11:38.281365] W [server3_1-fops.c:3393:server_writev] 0-/mnt/xfsd/maintain4: failed to unserialize dictionary (state->xdata)
[2015-01-06 14:13:46.844069] I [server-handshake.c:571:server_setvolume] 0-maintain4-server: accepted client from localhost.localdomain-6469-2015/01/06-06:13:46:269874-maintain4-client-3-0 (version: 3.4.6)
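That server log narrows it down: the 3.3 server fails while unserializing the xdata dictionary attached to the WRITE, because the size it reads from the wire (927647052 bytes) is far larger than the data actually available (55231882 bytes). Roughly, the failing check looks like the sketch below; this is not the actual dict.c code, just an illustration of the bounds check, and the buffer layout and names are assumptions:

/* Illustration only: the kind of sanity check that logs
 * "undersized buffer passed. available (X) < required (Y)". */
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>   /* ntohl() */

static int
unserialize_check (const char *buf, size_t available)
{
        uint32_t required_be;
        uint32_t required;

        if (available < sizeof (required_be))
                return -1;

        /* A length field read from the serialized stream says how many
         * bytes the dictionary needs.  If client and server disagree on
         * how the request is framed, this field is read from the wrong
         * offset and comes out as a nonsense value. */
        memcpy (&required_be, buf, sizeof (required_be));
        required = ntohl (required_be);

        if ((size_t) required > available)
                return -1;   /* -> "undersized buffer passed ..." */

        return 0;
}

A dictionary of roughly 900 MB attached to a small write is clearly not what the client sent, which suggests the server is decoding the request from the wrong offset rather than the buffer genuinely being too small.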
We don't support 3.6 clients using 3.3 servers.
Then the glusterfs client does not quite behave as documented; see:
http://www.gluster.org/community/documentation/index.php/GlusterFS_General_FAQ#Will_GlusterFS_be_upward_compatible.3F

The GlusterFS protocol has a version-checking mechanism, which checks the compatibility of server and client before a connection succeeds. According to the documented versioning scheme, versions with the same major number should be compatible with each other.
The problem seen here seems to point to the change made in processing RPC fragments as the cause; see commit: http://review.gluster.org/#/c/8662/ So it may not matter which client is running: as long as the server is 3.3 or older the issue would appear, and higher server versions have the fix as far as I can tell.
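For context on what "processing RPC fragments" refers to: ONC RPC over TCP uses record marking (RFC 5531), where each request is sent as one or more fragments, each preceded by a 4-byte header whose high bit marks the last fragment and whose lower 31 bits give the fragment length. A minimal illustrative parser of that header (not GlusterFS code, just the standard framing):

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>

#define RPC_LASTFRAG 0x80000000U   /* high bit: last fragment of the record */
#define RPC_FRAGSIZE 0x7fffffffU   /* lower 31 bits: fragment length        */

/* Parse one record-marking header; returns the fragment length and sets
 * *last when this fragment completes the RPC record. */
static uint32_t
parse_fragment_header (const unsigned char hdr[4], int *last)
{
        uint32_t word;

        memcpy (&word, hdr, sizeof (word));
        word  = ntohl (word);
        *last = (word & RPC_LASTFRAG) ? 1 : 0;

        return word & RPC_FRAGSIZE;
}

If the sender splits a call across fragments in a way the receiver does not expect, the receiver starts decoding the call body at the wrong offset; the GARBAGE_ARGS reply and the absurd "required" size in the server log above are what that tends to look like.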