Created attachment 576194 [details]
Mount log

Description of problem:
While running ping_pong on the CIFS mount, I did an add-brick on the
replicate volume. This caused ping_pong to fail, and "gluster volume info"
now shows different values on the two servers.

Server1:

[root@gqac001 ~]# gluster volume info

Volume Name: rep
Type: Distributed-Replicate
Volume ID: 6258c78d-dd00-4b41-a0d5-f6a697f3c68e
Status: Started
Number of Bricks: 1 x 2 = 3
Transport-type: tcp
Bricks:
Brick1: 10.16.157.0:/home/bricks/rep/b1
Brick2: 10.16.157.3:/home/bricks/rep/b1
Brick3: 10.16.157.0:/home/bricks/rep/b2
Options Reconfigured:
diagnostics.brick-log-level: DEBUG

Server2:

Volume Name: rep
Type: Replicate
Volume ID: 6258c78d-dd00-4b41-a0d5-f6a697f3c68e
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.16.157.0:/home/bricks/rep/b1
Brick2: 10.16.157.3:/home/bricks/rep/b1
Brick3: 10.16.157.0:/home/bricks/rep/b2
Options Reconfigured:
diagnostics.brick-log-level: DEBUG

Version-Release number of selected component (if applicable):
3.3.0qa33

How reproducible:
Always

Steps to Reproduce:
1. Create a 1x2 replicate volume.
2. Do a CIFS mount and run ping_pong on it:
   ./ping_pong -rw file.txt 100 50 100
3. Perform an add-brick operation (the full sequence is sketched after this
   list):
   gluster volume add-brick rep replica 3 $Brick3
4. ping_pong fails, and the mount log reports that all subvolumes are down.
Although the volfiles show the correct information, "gluster volume info"
shows "1 x 2 = 3" on one of the nodes.

The mount log file is attached.
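For reference, the whole run as one consolidated shell sketch, referenced
from step 3 above. The IPs and brick paths follow the volume info above;
the Samba share name, mount point, and mount options are assumptions for
illustration:

# On 10.16.157.0: create and start the 1x2 replicate volume
gluster volume create rep replica 2 \
    10.16.157.0:/home/bricks/rep/b1 10.16.157.3:/home/bricks/rep/b1
gluster volume start rep

# On the client: mount the Samba share that exports the volume
# (share name and mount point are assumptions)
mount -t cifs //10.16.157.0/gluster-rep /mnt/rep -o guest

# Exercise fcntl locking on the mount; leave this running
cd /mnt/rep && ./ping_pong -rw file.txt 100 50 100

# While ping_pong runs, grow the replica set on the server
gluster volume add-brick rep replica 3 10.16.157.0:/home/bricks/rep/b2

# Compare the volume definition each peer reports
gluster volume info rep
ssh 10.16.157.3 gluster volume info rep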
Actual results:
Discrepancy in the "gluster volume info" output between the servers.

Expected results:
"gluster volume info" reports the same volume type and brick count on all
servers.

Additional info:

[2012-04-09 15:24:48.491294] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2012-04-09 15:24:50.752843] I [io-cache.c:1558:check_cache_size_ok] 1-rep-quick-read: Max cache size is 50595786752
[2012-04-09 15:24:50.752932] I [io-cache.c:1558:check_cache_size_ok] 1-rep-io-cache: Max cache size is 50595786752
[2012-04-09 15:24:50.755212] I [client.c:2151:notify] 1-rep-client-0: parent translators are ready, attempting connect on transport
[2012-04-09 15:24:50.762014] I [client.c:2151:notify] 1-rep-client-1: parent translators are ready, attempting connect on transport
[2012-04-09 15:24:50.764346] I [client.c:2151:notify] 1-rep-client-2: parent translators are ready, attempting connect on transport
Given volfile:
+------------------------------------------------------------------------------+
  1: volume rep-client-0
  2:     type protocol/client
  3:     option remote-host 10.16.157.0
  4:     option remote-subvolume /home/bricks/rep/b1
  5:     option transport-type tcp
  6:     option username 8991e597-580f-41c3-9747-162dbeaee918
  7:     option password ede9a744-64c3-486d-815c-0fafb487e29a
  8: end-volume
  9:
 10: volume rep-client-1
 11:     type protocol/client
 12:     option remote-host 10.16.157.3
 13:     option remote-subvolume /home/bricks/rep/b1
 14:     option transport-type tcp
 15:     option username 8991e597-580f-41c3-9747-162dbeaee918
 16:     option password ede9a744-64c3-486d-815c-0fafb487e29a
 17: end-volume
 18:
 19: volume rep-client-2
 20:     type protocol/client
 21:     option remote-host 10.16.157.0
 22:     option remote-subvolume /home/bricks/rep/b2
 23:     option transport-type tcp
 24:     option username 8991e597-580f-41c3-9747-162dbeaee918
 25:     option password ede9a744-64c3-486d-815c-0fafb487e29a
 26: end-volume
 27:
 28: volume rep-replicate-0
 29:     type cluster/replicate
 30:     subvolumes rep-client-0 rep-client-1 rep-client-2
 31: end-volume
 32:
 33: volume rep-write-behind
 34:     type performance/write-behind
 35:     subvolumes rep-replicate-0
 36: end-volume
 37:
 38: volume rep-read-ahead
 39:     type performance/read-ahead
 40:     subvolumes rep-write-behind
 41: end-volume
 42:
 43: volume rep-io-cache
 44:     type performance/io-cache
 45:     subvolumes rep-read-ahead
 46: end-volume
 47:
 48: volume rep-quick-read
 49:     type performance/quick-read
 50:     subvolumes rep-io-cache
 51: end-volume
 52:
 53: volume rep-md-cache
 54:     type performance/md-cache
 55:     subvolumes rep-quick-read
 56: end-volume
 57:
 58: volume rep
 59:     type debug/io-stats
 60:     option latency-measurement off
 61:     option count-fop-hits off
 62:     subvolumes rep-md-cache
 63: end-volume
+------------------------------------------------------------------------------+
[2012-04-09 15:24:50.767193] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 1-rep-client-2: changing port to 24020 (from 0)
[2012-04-09 15:24:50.767245] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 1-rep-client-0: changing port to 24011 (from 0)
[2012-04-09 15:24:50.767313] I [client.c:136:client_register_grace_timer] 1-rep-client-2: Registering a grace timer
[2012-04-09 15:24:50.767351] I [client.c:136:client_register_grace_timer] 1-rep-client-0: Registering a grace timer
[2012-04-09 15:24:52.943998] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 1-rep-client-1: changing port to 24011 (from 0)
[2012-04-09 15:24:52.944093] I [client.c:136:client_register_grace_timer] 1-rep-client-1: Registering a grace timer
[2012-04-09 15:24:54.614556] W [client.c:2078:client_rpc_notify] 1-rep-client-2: Cancelling the grace timer
[2012-04-09 15:24:54.614652] I [client-handshake.c:1632:select_server_supported_programs] 1-rep-client-2: Using Program GlusterFS 3.3.0qa33, Num (1298437), Version (330)
[2012-04-09 15:24:54.614906] I [client-handshake.c:1429:client_setvolume_cbk] 1-rep-client-2: Connected to 10.16.157.0:24020, attached to remote volume '/home/bricks/rep/b2'.
[2012-04-09 15:24:54.614929] I [client-handshake.c:1441:client_setvolume_cbk] 1-rep-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2012-04-09 15:24:54.614994] I [afr-common.c:3524:afr_notify] 1-rep-replicate-0: Subvolume 'rep-client-2' came back up; going online.
[2012-04-09 15:24:54.615032] I [client-handshake.c:456:client_set_lk_version_cbk] 1-rep-client-2: Server lk version = 1
[2012-04-09 15:24:54.617407] W [client.c:2078:client_rpc_notify] 1-rep-client-0: Cancelling the grace timer
[2012-04-09 15:24:54.617545] I [client-handshake.c:1632:select_server_supported_programs] 1-rep-client-0: Using Program GlusterFS 3.3.0qa33, Num (1298437), Version (330)
[2012-04-09 15:24:54.617782] I [client-handshake.c:1429:client_setvolume_cbk] 1-rep-client-0: Connected to 10.16.157.0:24011, attached to remote volume '/home/bricks/rep/b1'.
[2012-04-09 15:24:54.617805] I [client-handshake.c:1441:client_setvolume_cbk] 1-rep-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2012-04-09 15:24:54.617886] I [client-handshake.c:456:client_set_lk_version_cbk] 1-rep-client-0: Server lk version = 1
[2012-04-09 15:24:56.620713] W [client.c:2078:client_rpc_notify] 1-rep-client-1: Cancelling the grace timer
[2012-04-09 15:24:56.620858] I [client-handshake.c:1632:select_server_supported_programs] 1-rep-client-1: Using Program GlusterFS 3.3.0qa33, Num (1298437), Version (330)
[2012-04-09 15:24:56.621198] I [client-handshake.c:1429:client_setvolume_cbk] 1-rep-client-1: Connected to 10.16.157.3:24011, attached to remote volume '/home/bricks/rep/b1'.
[2012-04-09 15:24:56.621232] I [client-handshake.c:1441:client_setvolume_cbk] 1-rep-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2012-04-09 15:24:56.632761] I [fuse-bridge.c:4081:fuse_graph_setup] 0-fuse: switched to graph 1
[2012-04-09 15:24:56.632814] I [client-handshake.c:456:client_set_lk_version_cbk] 1-rep-client-1: Server lk version = 1
[2012-04-09 15:24:56.633638] I [afr-common.c:1866:afr_set_root_inode_on_first_lookup] 1-rep-replicate-0: added root inode
[2012-04-09 15:24:56.663467] I [afr-common.c:1204:afr_detect_self_heal_by_lookup_status] 1-rep-replicate-0: entries are missing in lookup of .
[2012-04-09 15:24:56.663498] I [afr-common.c:1329:afr_launch_self_heal] 1-rep-replicate-0: background meta-data data entry missing-entry gfid self-heal triggered. path: , reason: lookup detected pending operations
[2012-04-09 15:24:56.665857] W [client3_1-fops.c:1489:client3_1_inodelk_cbk] 1-rep-client-2: remote operation failed: No such file or directory
[2012-04-09 15:24:56.666275] E [afr-self-heal-metadata.c:548:afr_sh_metadata_post_nonblocking_inodelk_cbk] 1-rep-replicate-0: Non Blocking metadata inodelks failed for .
[2012-04-09 15:24:56.666298] E [afr-self-heal-metadata.c:550:afr_sh_metadata_post_nonblocking_inodelk_cbk] 1-rep-replicate-0: Metadata self-heal failed for .
[2012-04-09 15:24:56.666663] W [client3_1-fops.c:419:client3_1_open_cbk] 1-rep-client-2: remote operation failed: No such file or directory. Path:
[2012-04-09 15:24:56.666692] E [afr-self-heal-data.c:1327:afr_sh_data_open_cbk] 1-rep-replicate-0: open of failed on child rep-client-2 (No such file or directory)
[2012-04-09 15:24:56.666729] E [afr-self-heal-common.c:2042:afr_self_heal_completion_cbk] 1-rep-replicate-0: background meta-data data entry self-heal failed on
[2012-04-09 15:24:56.666997] I [client.c:2160:notify] 0-rep-client-0: current graph is no longer active, destroying rpc_client
[2012-04-09 15:24:56.667055] I [client.c:2160:notify] 0-rep-client-1: current graph is no longer active, destroying rpc_client
[2012-04-09 15:24:56.667072] I [client.c:136:client_register_grace_timer] 0-rep-client-0: Registering a grace timer
[2012-04-09 15:24:56.667093] I [client.c:2099:client_rpc_notify] 0-rep-client-0: disconnected
[2012-04-09 15:24:56.667120] I [client.c:136:client_register_grace_timer] 0-rep-client-1: Registering a grace timer
[2012-04-09 15:24:56.667135] I [client.c:2099:client_rpc_notify] 0-rep-client-1: disconnected
[2012-04-09 15:24:56.667148] E [afr-common.c:3561:afr_notify] 0-rep-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2012-04-09 15:24:56.667592] W [client3_1-fops.c:419:client3_1_open_cbk] 1-rep-client-2: remote operation failed: No such file or directory. Path:
[2012-04-09 15:24:56.667697] I [afr-inode-write.c:437:afr_open_fd_fix] 1-rep-replicate-0: Opening fd 0x956930
[2012-04-09 15:24:56.667959] W [client3_1-fops.c:419:client3_1_open_cbk] 1-rep-client-2: remote operation failed: No such file or directory. Path: <gfid:20fdc776-1ded-422e-b731-6dce9be6d5e6>
[2012-04-09 15:24:56.669089] I [afr-inode-write.c:437:afr_open_fd_fix] 1-rep-replicate-0: Opening fd 0x956930
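Step 4 above notes that the generated volfiles are correct even though
"gluster volume info" disagrees between the peers. A minimal sketch for
checking that from the shell, assuming glusterd's default working directory
/var/lib/glusterd and the usual <volname>-fuse.vol naming (both assumptions
here), with 10.16.157.3 as the second peer:

# Diff glusterd's on-disk volume record between the peers; this is the
# state "gluster volume info" is answered from
ssh 10.16.157.3 cat /var/lib/glusterd/vols/rep/info > /tmp/rep.info.peer
diff /var/lib/glusterd/vols/rep/info /tmp/rep.info.peer

# Diff the generated client volfiles the same way
ssh 10.16.157.3 cat /var/lib/glusterd/vols/rep/rep-fuse.vol > /tmp/rep-fuse.vol.peer
diff /var/lib/glusterd/vols/rep/rep-fuse.vol /tmp/rep-fuse.vol.peer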
Works for me on commit 41bd7281a5fe4062fabe963d7862117aca50cc3d on the master branch.
Works fine on 3.3.0qa34. Although ping_pong fails, there is no discrepancy in the volume info.