Bug 810852

Summary: Add brick causes vol info discrepancy
Product: [Community] GlusterFS
Reporter: Ujjwala <ujjwala>
Component: glusterd
Assignee: krishnan parthasarathi <kparthas>
Status: CLOSED WORKSFORME
Severity: high
Priority: unspecified
Version: pre-2.0
CC: gluster-bugs, nsathyan, sdharane
Hardware: Unspecified
OS: Linux
Doc Type: Bug Fix
Type: Bug
Last Closed: 2012-04-17 04:07:36 EDT

Attachments:
Mount log

Description Ujjwala 2012-04-09 08:04:25 EDT
Created attachment 576194 [details]
Mount log

Description of problem:
While ping-pong was running on a CIFS mount, I performed an add-brick operation on the replicate volume. This caused ping-pong to fail, and gluster volume info now shows different values on the two servers:
Server1:
[root@gqac001 ~]# gluster volume info
 
Volume Name: rep
Type: Distributed-Replicate
Volume ID: 6258c78d-dd00-4b41-a0d5-f6a697f3c68e
Status: Started
Number of Bricks: 1 x 2 = 3
Transport-type: tcp
Bricks:
Brick1: 10.16.157.0:/home/bricks/rep/b1
Brick2: 10.16.157.3:/home/bricks/rep/b1
Brick3: 10.16.157.0:/home/bricks/rep/b2
Options Reconfigured:
diagnostics.brick-log-level: DEBUG

Server2:
Volume Name: rep
Type: Replicate
Volume ID: 6258c78d-dd00-4b41-a0d5-f6a697f3c68e
Status: Started
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: 10.16.157.0:/home/bricks/rep/b1
Brick2: 10.16.157.3:/home/bricks/rep/b1
Brick3: 10.16.157.0:/home/bricks/rep/b2
Options Reconfigured:
diagnostics.brick-log-level: DEBUG
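The Server1 output above is internally inconsistent: a "1 x 2" distributed-replicate layout cannot total 3 bricks. As an illustrative sketch (not part of the original report), the "Number of Bricks" line can be checked mechanically, since distribute-count times replica-count must equal the total:

```shell
# Sketch: parse the "Number of Bricks" line of `gluster volume info`
# output and verify that distribute * replica == total brick count.
check_brick_math() {
  awk -F'[:x=]' '/Number of Bricks/ {
    d = $2 + 0; r = $3 + 0; t = $4 + 0
    if (d * r == t) print "consistent"; else print "inconsistent"
  }'
}

# Sample lines taken from this report: Server1 shows 1 x 2 = 3 (stale),
# Server2 shows 1 x 3 = 3 (correct after the add-brick).
echo "Number of Bricks: 1 x 2 = 3" | check_brick_math   # prints: inconsistent
echo "Number of Bricks: 1 x 3 = 3" | check_brick_math   # prints: consistent
```

Running this against the output of each peer would flag the stale node without eyeballing the two listings.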



Version-Release number of selected component (if applicable):
3.3.0 qa33

How reproducible:
always

Steps to Reproduce:
1. Create a 1x2 replicate volume.
2. Do a CIFS mount and run ping-pong on it:
   ./ping_pong -rw file.txt 100 50 100
3. Perform an add-brick operation:
   gluster volume add-brick rep replica 3 $Brick3
4. This causes ping-pong to fail; the mount log says all subvolumes are down.
   Although the volfiles show the correct info, gluster volume info shows "1 x 2 = 3" on one of the nodes.
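For reference, the steps above can be consolidated into a dry-run script. This is only a sketch: hostnames, brick paths, and the Samba share name are placeholders (the report's actual bricks live under /home/bricks/rep), and run() just prints each command rather than executing it.

```shell
# Dry-run sketch of the reproduction steps. Swap `echo` for real
# execution on a two-node test cluster with Samba exporting the volume.
H1=server1; H2=server2                 # placeholder hostnames
run() { echo "+ $*"; }

run gluster volume create rep replica 2 $H1:/bricks/rep/b1 $H2:/bricks/rep/b1
run gluster volume start rep
run mount -t cifs //$H1/gluster-rep /mnt/rep -o guest
run ./ping_pong -rw /mnt/rep/file.txt 100 50 100   # keep this running
run gluster volume add-brick rep replica 3 $H1:/bricks/rep/b2
run gluster volume info rep            # compare this output on both peers
```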

Attached is the mount log file.
Actual results:
Discrepancy in the gluster volume info output.


Expected results:
gluster volume info shows the same, correct brick configuration on all peers.


Additional info:
[2012-04-09 15:24:48.491294] I [glusterfsd-mgmt.c:64:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2012-04-09 15:24:50.752843] I [io-cache.c:1558:check_cache_size_ok] 1-rep-quick-read: Max cache size is 50595786752
[2012-04-09 15:24:50.752932] I [io-cache.c:1558:check_cache_size_ok] 1-rep-io-cache: Max cache size is 50595786752
[2012-04-09 15:24:50.755212] I [client.c:2151:notify] 1-rep-client-0: parent translators are ready, attempting connect on transport
[2012-04-09 15:24:50.762014] I [client.c:2151:notify] 1-rep-client-1: parent translators are ready, attempting connect on transport
[2012-04-09 15:24:50.764346] I [client.c:2151:notify] 1-rep-client-2: parent translators are ready, attempting connect on transport
Given volfile:
+------------------------------------------------------------------------------+
  1: volume rep-client-0
  2:     type protocol/client
  3:     option remote-host 10.16.157.0
  4:     option remote-subvolume /home/bricks/rep/b1
  5:     option transport-type tcp
  6:     option username 8991e597-580f-41c3-9747-162dbeaee918
  7:     option password ede9a744-64c3-486d-815c-0fafb487e29a
  8: end-volume
  9: 
 10: volume rep-client-1
 11:     type protocol/client
 12:     option remote-host 10.16.157.3
 13:     option remote-subvolume /home/bricks/rep/b1
 14:     option transport-type tcp
 15:     option username 8991e597-580f-41c3-9747-162dbeaee918
 16:     option password ede9a744-64c3-486d-815c-0fafb487e29a
 17: end-volume
 18: 
 19: volume rep-client-2
 20:     type protocol/client
 21:     option remote-host 10.16.157.0
 22:     option remote-subvolume /home/bricks/rep/b2
 23:     option transport-type tcp
 24:     option username 8991e597-580f-41c3-9747-162dbeaee918
 25:     option password ede9a744-64c3-486d-815c-0fafb487e29a
 26: end-volume
 27: 
 28: volume rep-replicate-0
 29:     type cluster/replicate
 30:     subvolumes rep-client-0 rep-client-1 rep-client-2
 31: end-volume
 32: 
 33: volume rep-write-behind
 34:     type performance/write-behind
 35:     subvolumes rep-replicate-0
 36: end-volume
 37: 
 38: volume rep-read-ahead
 39:     type performance/read-ahead
 40:     subvolumes rep-write-behind
 41: end-volume
 42: 
 43: volume rep-io-cache
 44:     type performance/io-cache
 45:     subvolumes rep-read-ahead
 46: end-volume
 47: 
 48: volume rep-quick-read
 49:     type performance/quick-read
 50:     subvolumes rep-io-cache
 51: end-volume
 52: 
 53: volume rep-md-cache
 54:     type performance/md-cache
 55:     subvolumes rep-quick-read
 56: end-volume
 57: 
 58: volume rep
 59:     type debug/io-stats
 60:     option latency-measurement off
 61:     option count-fop-hits off
 62:     subvolumes rep-md-cache
 63: end-volume

+------------------------------------------------------------------------------+

[2012-04-09 15:24:50.767193] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 1-rep-client-2: changing port to 24020 (from 0)
[2012-04-09 15:24:50.767245] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 1-rep-client-0: changing port to 24011 (from 0)
[2012-04-09 15:24:50.767313] I [client.c:136:client_register_grace_timer] 1-rep-client-2: Registering a grace timer
[2012-04-09 15:24:50.767351] I [client.c:136:client_register_grace_timer] 1-rep-client-0: Registering a grace timer
[2012-04-09 15:24:52.943998] I [rpc-clnt.c:1669:rpc_clnt_reconfig] 1-rep-client-1: changing port to 24011 (from 0)
[2012-04-09 15:24:52.944093] I [client.c:136:client_register_grace_timer] 1-rep-client-1: Registering a grace timer
[2012-04-09 15:24:54.614556] W [client.c:2078:client_rpc_notify] 1-rep-client-2: Cancelling the grace timer
[2012-04-09 15:24:54.614652] I [client-handshake.c:1632:select_server_supported_programs] 1-rep-client-2: Using Program GlusterFS 3.3.0qa33, Num (1298437), Version (330)
[2012-04-09 15:24:54.614906] I [client-handshake.c:1429:client_setvolume_cbk] 1-rep-client-2: Connected to 10.16.157.0:24020, attached to remote volume '/home/bricks/rep/b2'.
[2012-04-09 15:24:54.614929] I [client-handshake.c:1441:client_setvolume_cbk] 1-rep-client-2: Server and Client lk-version numbers are not same, reopening the fds
[2012-04-09 15:24:54.614994] I [afr-common.c:3524:afr_notify] 1-rep-replicate-0: Subvolume 'rep-client-2' came back up; going online.
[2012-04-09 15:24:54.615032] I [client-handshake.c:456:client_set_lk_version_cbk] 1-rep-client-2: Server lk version = 1
[2012-04-09 15:24:54.617407] W [client.c:2078:client_rpc_notify] 1-rep-client-0: Cancelling the grace timer
[2012-04-09 15:24:54.617545] I [client-handshake.c:1632:select_server_supported_programs] 1-rep-client-0: Using Program GlusterFS 3.3.0qa33, Num (1298437), Version (330)
[2012-04-09 15:24:54.617782] I [client-handshake.c:1429:client_setvolume_cbk] 1-rep-client-0: Connected to 10.16.157.0:24011, attached to remote volume '/home/bricks/rep/b1'.
[2012-04-09 15:24:54.617805] I [client-handshake.c:1441:client_setvolume_cbk] 1-rep-client-0: Server and Client lk-version numbers are not same, reopening the fds
[2012-04-09 15:24:54.617886] I [client-handshake.c:456:client_set_lk_version_cbk] 1-rep-client-0: Server lk version = 1
[2012-04-09 15:24:56.620713] W [client.c:2078:client_rpc_notify] 1-rep-client-1: Cancelling the grace timer
[2012-04-09 15:24:56.620858] I [client-handshake.c:1632:select_server_supported_programs] 1-rep-client-1: Using Program GlusterFS 3.3.0qa33, Num (1298437), Version (330)
[2012-04-09 15:24:56.621198] I [client-handshake.c:1429:client_setvolume_cbk] 1-rep-client-1: Connected to 10.16.157.3:24011, attached to remote volume '/home/bricks/rep/b1'.
[2012-04-09 15:24:56.621232] I [client-handshake.c:1441:client_setvolume_cbk] 1-rep-client-1: Server and Client lk-version numbers are not same, reopening the fds
[2012-04-09 15:24:56.632761] I [fuse-bridge.c:4081:fuse_graph_setup] 0-fuse: switched to graph 1
[2012-04-09 15:24:56.632814] I [client-handshake.c:456:client_set_lk_version_cbk] 1-rep-client-1: Server lk version = 1
[2012-04-09 15:24:56.633638] I [afr-common.c:1866:afr_set_root_inode_on_first_lookup] 1-rep-replicate-0: added root inode
[2012-04-09 15:24:56.663467] I [afr-common.c:1204:afr_detect_self_heal_by_lookup_status] 1-rep-replicate-0: entries are missing in lookup of .
[2012-04-09 15:24:56.663498] I [afr-common.c:1329:afr_launch_self_heal] 1-rep-replicate-0: background  meta-data data entry missing-entry gfid self-heal triggered. path: , reason: lookup detected pending operations
[2012-04-09 15:24:56.665857] W [client3_1-fops.c:1489:client3_1_inodelk_cbk] 1-rep-client-2: remote operation failed: No such file or directory
[2012-04-09 15:24:56.666275] E [afr-self-heal-metadata.c:548:afr_sh_metadata_post_nonblocking_inodelk_cbk] 1-rep-replicate-0: Non Blocking metadata inodelks failed for .
[2012-04-09 15:24:56.666298] E [afr-self-heal-metadata.c:550:afr_sh_metadata_post_nonblocking_inodelk_cbk] 1-rep-replicate-0: Metadata self-heal failed for .
[2012-04-09 15:24:56.666663] W [client3_1-fops.c:419:client3_1_open_cbk] 1-rep-client-2: remote operation failed: No such file or directory. Path: 
[2012-04-09 15:24:56.666692] E [afr-self-heal-data.c:1327:afr_sh_data_open_cbk] 1-rep-replicate-0: open of  failed on child rep-client-2 (No such file or directory)
[2012-04-09 15:24:56.666729] E [afr-self-heal-common.c:2042:afr_self_heal_completion_cbk] 1-rep-replicate-0: background  meta-data data entry self-heal failed on 
[2012-04-09 15:24:56.666997] I [client.c:2160:notify] 0-rep-client-0: current graph is no longer active, destroying rpc_client 
[2012-04-09 15:24:56.667055] I [client.c:2160:notify] 0-rep-client-1: current graph is no longer active, destroying rpc_client 
[2012-04-09 15:24:56.667072] I [client.c:136:client_register_grace_timer] 0-rep-client-0: Registering a grace timer
[2012-04-09 15:24:56.667093] I [client.c:2099:client_rpc_notify] 0-rep-client-0: disconnected
[2012-04-09 15:24:56.667120] I [client.c:136:client_register_grace_timer] 0-rep-client-1: Registering a grace timer
[2012-04-09 15:24:56.667135] I [client.c:2099:client_rpc_notify] 0-rep-client-1: disconnected
[2012-04-09 15:24:56.667148] E [afr-common.c:3561:afr_notify] 0-rep-replicate-0: All subvolumes are down. Going offline until atleast one of them comes back up.
[2012-04-09 15:24:56.667592] W [client3_1-fops.c:419:client3_1_open_cbk] 1-rep-client-2: remote operation failed: No such file or directory. Path: 
[2012-04-09 15:24:56.667697] I [afr-inode-write.c:437:afr_open_fd_fix] 1-rep-replicate-0: Opening fd 0x956930
[2012-04-09 15:24:56.667959] W [client3_1-fops.c:419:client3_1_open_cbk] 1-rep-client-2: remote operation failed: No such file or directory. Path: <gfid:20fdc776-1ded-422e-b731-6dce9be6d5e6>
[2012-04-09 15:24:56.669089] I [afr-inode-write.c:437:afr_open_fd_fix] 1-rep-replicate-0: Opening fd 0x956930
Comment 1 krishnan parthasarathi 2012-04-17 04:07:36 EDT
Works for me at commit 41bd7281a5fe4062fabe963d7862117aca50cc3d on the master branch.
Comment 2 Ujjwala 2012-04-17 06:01:32 EDT
Works fine on 3.3.0 qa34.
Although ping_pong still fails, there is no discrepancy in the volume info.