Bug 1147236 - gluster 3.6 compatibility issue with gluster 3.3
Summary: gluster 3.6 compatibility issue with gluster 3.3
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: GlusterFS
Classification: Community
Component: protocol
Version: 3.6.1
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: bugs@gluster.org
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: glusterfs-3.6.0
 
Reported: 2014-09-28 10:09 UTC by Panpan Feng
Modified: 2015-05-06 20:43 UTC
CC List: 5 users

Fixed In Version:
Clone Of:
Environment:
Last Closed: 2015-02-10 12:59:17 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
mount log (45.30 KB, text/plain)
2014-09-28 10:09 UTC, Panpan Feng

Description Panpan Feng 2014-09-28 10:09:25 UTC
Created attachment 941976 [details]
mount log

Description of problem:

With glusterfs 3.6.0 beta1 as the client and glusterfs 3.3.0 as the server, write operations fail while mount and read operations succeed.

Version-Release number of selected component (if applicable):
glusterfs 3.6.0 beta1

How reproducible:

Mount a gluster 3.3 volume with a gluster 3.6 client and write to a file.

Steps to Reproduce:
1. Install glusterfs 3.6.0 beta1 on the client.
2. Install glusterfs 3.3.0 on the server.
3. Mount the gluster 3.3.0 volume from the client.
4. Write a file under the mount point.

Actual results:

The write fails; the file is created but stays empty.

Expected results:

The file can be written successfully.

Additional info:

Hi, maybe I should add some more information about the situation.

----->>>>>  version of client and server  <<<<<-----

client version :  glusterfs 3.6.0 beta1
server version:   glusterfs 3.3.0


---------->>>> volume info <<<<<-------------

Volume Name: myvol
Type: Distributed-Replicate
Volume ID: c36dfe1c-3f95-4d64-9dae-1b5916b56b19
Status: Started
Number of Bricks: 2 x 2 = 4
Transport-type: tcp
Bricks:
Brick1: 10.10.10.10:/mnt/xfsd/myvol-0
Brick2: 10.10.10.10:/mnt/xfsd/myvol-1
Brick3: 10.10.10.10:/mnt/xfsd/myvol-2
Brick4: 10.10.10.10:/mnt/xfsd/myvol-3

-------------->>>> mount with --debug   <<<<<-------------------
/usr/sbin/glusterfs --volfile-server=10.10.10.10 --volfile-id=myvol /mnt/myvol --debug

Then, in another window on the client, I go to the mount point /mnt/myvol and
execute "ls". A few lines are printed and I see no error messages. Next I
execute the command

echo "hello,gluster3.3" > /mnt/myvol/file

and some error-like messages appear. Running "ls" again, I can see a file
named "file" under the mount point, but "cat /mnt/myvol/file" prints nothing,
which means the file is empty! After many tests my conclusion is: I cannot
write anything to a file, although I can create a file successfully and can
read an existing file successfully.
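
For completeness, here is a minimal C reproducer equivalent to the echo
command above. The path /mnt/myvol/file is the one used in this report, and
the expectation that the failure surfaces as ENOTCONN is an assumption based
on the "Transport endpoint is not connected" messages in the mount log below.

/* Minimal reproducer: create a file under the FUSE mount point, write to
 * it, and report any error.  On the failing setup the write (or the
 * following fsync/close) is expected to fail with ENOTCONN, matching the
 * WRITE/FLUSH errors in the mount log. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        const char *path  = "/mnt/myvol/file";   /* mount point from this report */
        const char  msg[] = "hello,gluster3.3\n";

        int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) {
                fprintf(stderr, "open: %s\n", strerror(errno));
                return 1;
        }
        if (write(fd, msg, sizeof(msg) - 1) < 0)
                fprintf(stderr, "write: %s\n", strerror(errno));
        if (fsync(fd) < 0)
                fprintf(stderr, "fsync: %s\n", strerror(errno));
        if (close(fd) < 0)
                fprintf(stderr, "close: %s\n", strerror(errno));
        return 0;
}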


-------------->>>>> mount log (some error-like messages when writing the file)
<<<<<<----------------

[2014-09-28 07:41:28.585181] D [logging.c:1781:__gf_log_inject_timer_event]
0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5
[2014-09-28 07:42:42.433338] D [MSGID: 0]
[dht-common.c:621:dht_revalidate_cbk] 0-myvol-dht: revalidate lookup of /
returned with op_ret 0 and op_errno 117
[2014-09-28 07:43:19.487414] D [MSGID: 0] [dht-common.c:2182:dht_lookup]
0-myvol-dht: Calling fresh lookup for /file on myvol-replicate-1
[2014-09-28 07:43:19.558459] D [MSGID: 0]
[dht-common.c:1818:dht_lookup_cbk] 0-myvol-dht: fresh_lookup returned for
/file with op_ret -1 and op_errno 2
[2014-09-28 07:43:19.558497] I [dht-common.c:1822:dht_lookup_cbk]
0-myvol-dht: Entry /file missing on subvol myvol-replicate-1
[2014-09-28 07:43:19.558517] D [MSGID: 0]
[dht-common.c:1607:dht_lookup_everywhere] 0-myvol-dht: winding lookup call
to 2 subvols
[2014-09-28 07:43:19.634376] D [MSGID: 0]
[dht-common.c:1413:dht_lookup_everywhere_cbk] 0-myvol-dht: returned with
op_ret -1 and op_errno 2 (/file) from subvol myvol-replicate-0
[2014-09-28 07:43:19.634573] D [MSGID: 0]
[dht-common.c:1413:dht_lookup_everywhere_cbk] 0-myvol-dht: returned with
op_ret -1 and op_errno 2 (/file) from subvol myvol-replicate-1
[2014-09-28 07:43:19.634605] D [MSGID: 0]
[dht-common.c:1086:dht_lookup_everywhere_done] 0-myvol-dht: STATUS:
hashed_subvol myvol-replicate-1 cached_subvol null
[2014-09-28 07:43:19.634624] D [MSGID: 0]
[dht-common.c:1147:dht_lookup_everywhere_done] 0-myvol-dht: There was no
cached file and  unlink on hashed is not skipped /file
[2014-09-28 07:43:19.634663] D [fuse-resolve.c:83:fuse_resolve_entry_cbk]
0-fuse: 00000000-0000-0000-0000-000000000001/file: failed to resolve (No
such file or directory)
[2014-09-28 07:43:19.708608] I [dht-common.c:1822:dht_lookup_cbk]
0-myvol-dht: Entry /file missing on subvol myvol-replicate-1
[2014-09-28 07:43:19.781420] D [logging.c:1937:_gf_msg_internal]
0-logging-infra: Buffer overflow of a buffer whose size limit is 5. About
to flush least recently used log message to disk
[2014-09-28 07:43:19.708640] D [MSGID: 0]
[dht-common.c:1607:dht_lookup_everywhere] 0-myvol-dht: winding lookup call
to 2 subvols
[2014-09-28 07:43:19.781418] D [MSGID: 0]
[dht-common.c:1413:dht_lookup_everywhere_cbk] 0-myvol-dht: returned with
op_ret -1 and op_errno 2 (/file) from subvol myvol-replicate-0
[2014-09-28 07:43:19.781629] D [MSGID: 0]
[dht-common.c:1413:dht_lookup_everywhere_cbk] 0-myvol-dht: returned with
op_ret -1 and op_errno 2 (/file) from subvol myvol-replicate-1
[2014-09-28 07:43:19.781653] D [MSGID: 0]
[dht-common.c:1086:dht_lookup_everywhere_done] 0-myvol-dht: STATUS:
hashed_subvol myvol-replicate-1 cached_subvol null
[2014-09-28 07:43:19.851925] I [dht-common.c:1822:dht_lookup_cbk]
0-myvol-dht: Entry /file missing on subvol myvol-replicate-1
[2014-09-28 07:43:19.851954] D [logging.c:1937:_gf_msg_internal]
0-logging-infra: Buffer overflow of a buffer whose size limit is 5. About
to flush least recently used log message to disk
The message "D [MSGID: 0] [dht-common.c:1818:dht_lookup_cbk] 0-myvol-dht:
fresh_lookup returned for /file with op_ret -1 and op_errno 2" repeated 2
times between [2014-09-28 07:43:19.558459] and [2014-09-28 07:43:19.851922]
[2014-09-28 07:43:19.851954] D [MSGID: 0]
[dht-common.c:1607:dht_lookup_everywhere] 0-myvol-dht: winding lookup call
to 2 subvols
[2014-09-28 07:43:19.922764] D [MSGID: 0]
[dht-common.c:1413:dht_lookup_everywhere_cbk] 0-myvol-dht: returned with
op_ret -1 and op_errno 2 (/file) from subvol myvol-replicate-0
[2014-09-28 07:43:19.922925] D [MSGID: 0]
[dht-common.c:1413:dht_lookup_everywhere_cbk] 0-myvol-dht: returned with
op_ret -1 and op_errno 2 (/file) from subvol myvol-replicate-1
[2014-09-28 07:43:19.922974] D [fuse-resolve.c:83:fuse_resolve_entry_cbk]
0-fuse: 00000000-0000-0000-0000-000000000001/file: failed to resolve (No
such file or directory)
[2014-09-28 07:43:19.997012] D [logging.c:1937:_gf_msg_internal]
0-logging-infra: Buffer overflow of a buffer whose size limit is 5. About
to flush least recently used log message to disk
The message "D [MSGID: 0] [dht-common.c:1147:dht_lookup_everywhere_done]
0-myvol-dht: There was no cached file and  unlink on hashed is not skipped
/file" repeated 2 times between [2014-09-28 07:43:19.634624] and
[2014-09-28 07:43:19.922951]
[2014-09-28 07:43:19.997011] D [MSGID: 0]
[dht-diskusage.c:96:dht_du_info_cbk] 0-myvol-dht: subvolume
'myvol-replicate-0': avail_percent is: 99.00 and avail_space is:
44000304627712 and avail_inodes is: 99.00
[2014-09-28 07:43:19.997134] D [MSGID: 0]
[dht-diskusage.c:96:dht_du_info_cbk] 0-myvol-dht: subvolume
'myvol-replicate-1': avail_percent is: 99.00 and avail_space is:
44000304627712 and avail_inodes is: 99.00
[2014-09-28 07:43:19.997179] D
[afr-transaction.c:1166:afr_post_nonblocking_entrylk_cbk]
0-myvol-replicate-1: Non blocking entrylks done. Proceeding to FOP
[2014-09-28 07:43:20.067587] D [afr-lk-common.c:447:transaction_lk_op]
0-myvol-replicate-1: lk op is for a transaction
[2014-09-28 07:43:20.216287] D
[afr-transaction.c:1116:afr_post_nonblocking_inodelk_cbk]
0-myvol-replicate-1: Non blocking inodelks done. Proceeding to FOP
[2014-09-28 07:43:20.356844] W [client-rpc-fops.c:850:client3_3_writev_cbk]
0-myvol-client-3: remote operation failed: Transport endpoint is not
connected
[2014-09-28 07:43:20.356979] W [client-rpc-fops.c:850:client3_3_writev_cbk]
0-myvol-client-2: remote operation failed: Transport endpoint is not
connected
[2014-09-28 07:43:20.357009] D [afr-lk-common.c:447:transaction_lk_op]
0-myvol-replicate-1: lk op is for a transaction
[2014-09-28 07:43:20.428013] W [fuse-bridge.c:1261:fuse_err_cbk]
0-glusterfs-fuse: 14: FLUSH() ERR => -1 (Transport endpoint is not
connected)
[2014-09-28 07:43:28.593512] D [logging.c:1816:gf_log_flush_timeout_cbk]
0-logging-infra: Log timer timed out. About to flush outstanding messages
if present
The message "D [MSGID: 0] [dht-common.c:621:dht_revalidate_cbk]
0-myvol-dht: revalidate lookup of / returned with op_ret 0 and op_errno
117" repeated 5 times between [2014-09-28 07:42:42.433338] and [2014-09-28
07:43:19.487219]
The message "D [MSGID: 0] [dht-common.c:2182:dht_lookup] 0-myvol-dht:
Calling fresh lookup for /file on myvol-replicate-1" repeated 2 times
between [2014-09-28 07:43:19.487414] and [2014-09-28 07:43:19.781835]
[2014-09-28 07:43:19.922950] D [MSGID: 0]
[dht-common.c:1086:dht_lookup_everywhere_done] 0-myvol-dht: STATUS:
hashed_subvol myvol-replicate-1 cached_subvol null
[2014-09-28 07:43:28.593615] D [logging.c:1781:__gf_log_inject_timer_event]
0-logging-infra: Starting timer now. Timeout = 120, current buf size = 5
[2014-09-28 07:44:26.517010] D [MSGID: 0]
[dht-common.c:621:dht_revalidate_cbk] 0-myvol-dht: revalidate lookup of /
returned with op_ret 0 and op_errno 117
[2014-09-28 07:44:26.950203] D [MSGID: 0] [dht-common.c:2108:dht_lookup]
0-myvol-dht: calling revalidate lookup for /file at myvol-replicate-1
[2014-09-28 07:44:27.022986] D [MSGID: 0]
[dht-common.c:621:dht_revalidate_cbk] 0-myvol-dht: revalidate lookup of
/file returned with op_ret 0 and op_errno 0



Please give me some ideas to fix this problem. Thanks all!

Comment 1 Panpan Feng 2014-10-11 01:59:40 UTC
Hi all, I have done more tests in my environment. After debugging glusterfs 3.6.0 beta2, I have more information about the compatibility issue.

I found that when I perform a write operation, the variable req->rpc_status is -1 in xlators/protocol/client/src/client-rpc-fops.c:810 (function client3_3_writev_cbk).

To find out what sets rpc_status to -1, I added some debug logging in all the corresponding files. Then I re-ran the process and wrote a file.

In the log file I found that rpc_status is set to -1 in rpc/rpc-lib/src/rpc-clnt.c:579 (function rpc_clnt_reply_fill):

/* the request is marked failed when the reply is denied or not accepted cleanly */
if ((rpc_reply_status (replymsg) == MSG_DENIED)
    || (rpc_accepted_reply_status (replymsg) != SUCCESS)) {
        req->rpc_status = -1;
}

Then I logged rpc_reply_status (replymsg) and rpc_accepted_reply_status (replymsg), which are 0 and 4 respectively. A search shows that 0 means MSG_ACCEPTED and 4 means GARBAGE_ARGS.
So the RPC call is accepted, but the server treats the arguments in the RPC message body as garbage.
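
For reference, these are the standard ONC RPC reply codes (RFC 5531); the MSG_DENIED and SUCCESS constants compared in rpc_clnt_reply_fill above belong to the same scheme. A small, self-contained illustration of how the two logged values are interpreted:

#include <stdio.h>

/* reply_stat / accept_stat values as defined by the ONC RPC spec (RFC 5531) */
enum reply_stat  { MSG_ACCEPTED = 0, MSG_DENIED = 1 };
enum accept_stat {
        SUCCESS       = 0,  /* RPC executed successfully              */
        PROG_UNAVAIL  = 1,  /* remote has not exported the program    */
        PROG_MISMATCH = 2,  /* remote cannot support the version      */
        PROC_UNAVAIL  = 3,  /* program cannot support the procedure   */
        GARBAGE_ARGS  = 4,  /* procedure cannot decode the arguments  */
        SYSTEM_ERR    = 5   /* other error on the server side         */
};

int main(void)
{
        int reply_status  = 0;  /* value logged for rpc_reply_status()          */
        int accept_status = 4;  /* value logged for rpc_accepted_reply_status() */

        if (reply_status == MSG_ACCEPTED && accept_status == GARBAGE_ARGS)
                printf("server accepted the call but could not decode "
                       "the WRITEV arguments (GARBAGE_ARGS)\n");
        return 0;
}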

Can anybody tell me what I should do next?

Comment 2 Panpan Feng 2014-12-08 11:29:50 UTC
Does anybody have any ideas about this issue?

Comment 3 Panpan Feng 2015-01-06 03:07:03 UTC
This issue still exists in glusterfs 3.6.1.

Comment 4 Frank Lu 2015-01-06 07:20:59 UTC
I saw the same issue.

=== client (3.5) logs:

W [fuse-bridge.c:2242:fuse_writev_cbk] 0-glusterfs-fuse: 224: WRITE => -1 (Transport endpoint is not connected)  
W [fuse-bridge.c:1236:fuse_err_cbk] 0-glusterfs-fuse: 225: FSYNC() ERR => -1 (Transport endpoint is not connected) 
W [fuse-bridge.c:1236:fuse_err_cbk] 0-glusterfs-fuse: 226: FLUSH() ERR => -1 (Transport endpoint is not connected)

=== server (3.3) logs:
[2015-01-06 14:11:38.281263] E [dict.c:2413:dict_unserialize] (-->/usr/lib64/libgfrpc.so.0(rpcsvc_notify+0x93) [0x317c20a5b3] (-->/usr/lib64/libgfrpc.so.0(rpcsvc_handle_rpc_call+0x307) [0x317c20a4b7] (-->/usr/lib64/glusterfs/3.3.0.5rhs_iqiyi_9/xlator/protocol/server.so(server_writev+0x24b) [0x7f3607ddd8bb]))) 0-dict: undersized buffer passed. available (55231882) < required (927647052)
[2015-01-06 14:11:38.281365] W [server3_1-fops.c:3393:server_writev] 0-/mnt/xfsd/maintain4: failed to unserialize dictionary (state->xdata)
[2015-01-06 14:13:46.844069] I [server-handshake.c:571:server_setvolume] 0-maintain4-server: accepted client from localhost.localdomain-6469-2015/01/06-06:13:46:269874-maintain4-client-3-0 (version: 3.4.6)
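
The "undersized buffer passed" message shows that the 3.3 server reads a wildly wrong length out of the write request's xdata payload before the write is even attempted. The following is only a rough sketch, not the actual glusterfs dict code (whose structures and helpers differ), of the kind of length check that produces such a message; check_pair is a hypothetical helper:

#include <arpa/inet.h>   /* ntohl, htonl */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* hypothetical helper: read a key length and a value length from the start
 * of a serialized pair and verify the remaining buffer can hold them */
static int check_pair(const char *buf, size_t available)
{
        uint32_t keylen, vallen;

        if (available < 2 * sizeof(uint32_t))
                return -1;

        memcpy(&keylen, buf, sizeof(keylen));
        memcpy(&vallen, buf + sizeof(keylen), sizeof(vallen));
        keylen = ntohl(keylen);
        vallen = ntohl(vallen);

        size_t required = 2 * sizeof(uint32_t) + (size_t)keylen + vallen;
        if (available < required) {
                fprintf(stderr, "undersized buffer passed. "
                        "available (%zu) < required (%zu)\n",
                        available, required);
                return -1;
        }
        return 0;
}

int main(void)
{
        /* a 12-byte buffer whose value-length field claims ~927 MB, roughly
         * what the server log above reports */
        char     buf[12] = {0};
        uint32_t huge    = htonl(927647052u);

        memcpy(buf + 4, &huge, sizeof(huge));
        return check_pair(buf, sizeof(buf)) == 0 ? 0 : 1;
}

A required size of ~927 MB against ~55 MB available strongly suggests the server is reading the length fields from the wrong offset in the RPC payload, i.e. the request is framed in a way the 3.3 server cannot parse.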

Comment 5 Kaleb KEITHLEY 2015-02-10 12:59:17 UTC
We don't support 3.6 clients using 3.3 servers.

Comment 6 Johannes Martin 2015-04-24 10:40:41 UTC
Then the glusterfs client does not quite behave as documented, see: http://www.gluster.org/community/documentation/index.php/GlusterFS_General_FAQ#Will_GlusterFS_be_upward_compatible.3F

The GlusterFS protocol has a version checking mechanism, which checks for the compatibility of both server and client before making a successful connection. 

According to the documented versioning scheme, versions with the same major number should be compatible with each other.
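
In other words, the documented policy amounts to a plain major-version comparison. The sketch below only illustrates that documented rule; it is not the actual glusterfs handshake code, which negotiates RPC program versions during the connection handshake:

#include <stdio.h>
#include <stdlib.h>

/* extract the leading major version from a string such as "3.6.1" */
static int major_of(const char *version)
{
        return atoi(version);
}

int main(void)
{
        const char *client = "3.6.1";   /* versions from this report */
        const char *server = "3.3.0";

        if (major_of(client) == major_of(server))
                printf("documented policy: %s client and %s server "
                       "should interoperate\n", client, server);
        else
                printf("different major versions: no compatibility promise\n");
        return 0;
}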

Comment 7 Shyamsundar 2015-05-06 20:43:07 UTC
The problem seen here seems to be related to the fix made in RPC fragment processing; see commit: http://review.gluster.org/#/c/8662/

So it may not matter which client is running; as long as the server is 3.3 or older the issue will appear. Newer server versions have the fix, as far as I can tell.
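
For background, standard RPC record marking over TCP (RFC 5531) prefixes every fragment with a 4-byte header: the top bit flags the last fragment of a record and the lower 31 bits carry the fragment length. The sketch below only illustrates this general mechanism, not the specific change in the commit above. If sender and receiver disagree on fragment handling, the receiver ends up decoding arguments from the wrong offsets, which is consistent with the GARBAGE_ARGS and "undersized buffer" symptoms reported here.

#include <arpa/inet.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define RPC_LASTFRAG 0x80000000u   /* top bit: last fragment of this record */

/* decode a 4-byte RPC record-marking header into (length, last-fragment) */
static void parse_fragment_header(const unsigned char hdr[4],
                                  uint32_t *len, int *is_last)
{
        uint32_t raw;

        memcpy(&raw, hdr, sizeof(raw));
        raw      = ntohl(raw);
        *is_last = (raw & RPC_LASTFRAG) != 0;
        *len     = raw & ~RPC_LASTFRAG;
}

int main(void)
{
        /* 0x8000008c on the wire: last fragment, 140 bytes of payload */
        const unsigned char hdr[4] = { 0x80, 0x00, 0x00, 0x8c };
        uint32_t len;
        int      last;

        parse_fragment_header(hdr, &len, &last);
        printf("fragment length = %u bytes, last fragment = %s\n",
               len, last ? "yes" : "no");
        return 0;
}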

