Created attachment 788916 [details]
cinder-vol-rebalance.log

Description of problem:

Looks similar to bug 993119. OpenStack Cinder image files are not migrated after an add-brick operation, and the DHT layout (hex ranges) is not reset. If we happen to use this volume from within an instance, the VM is damaged beyond repair, and the only option that works after this stage is "Terminate Instance".

Version-Release number of selected component (if applicable):
glusterfs-3.4.0.21rhs-1.el6_4.x86_64

How reproducible:
Tested once.

Steps to Reproduce (detailed steps can be provided upon request):
1. Create a 6x2 distribute-replicate volume.
2. Set up OpenStack Cinder to automatically mount the volume.
3. Create Glance images and Cinder volumes.
4. Launch instances and attach Cinder volumes to them.
5. Perform add-brick and then rebalance (a CLI sketch follows this comment).
6. Ensure that rebalance status is "completed" for all the nodes.

Actual results:
Empty linkto files still exist on the bricks.

Expected results:
Files should be migrated successfully.

Additional info:

[root@rhsqa-virt06 ~]# ssh 10.70.43.66 ls -l /rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
---------T 2 qemu qemu 0 Aug 21 18:24 /rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
[root@rhsqa-virt06 ~]# ssh 10.70.43.47 ls -l /rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
-rw-rw-rw- 2 qemu qemu 10737418240 Aug 21 11:03 /rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
[root@rhsqa-virt06 ~]# ssh 10.70.43.53 ls -l /rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
-rw-rw-rw- 2 qemu qemu 10737418240 Aug 21 11:03 /rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
[root@rhsqa-virt06 ~]# ssh 10.70.43.67 ls -l /rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
---------T 2 qemu qemu 0 Aug 21 18:24 /rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
[root@rhsqa-virt06 ~]#

[root@localhost ~]# gluster vol info cinder-vol

Volume Name: cinder-vol
Type: Distributed-Replicate
Volume ID: 4d0b2e57-2eb3-44c0-9f9c-5e2a1835b81a
Status: Started
Number of Bricks: 8 x 2 = 16
Transport-type: tcp
Bricks:
Brick1: 10.70.43.66:/rhs/bricks/cinder1
Brick2: 10.70.43.67:/rhs/bricks/cinder1
Brick3: 10.70.43.68:/rhs/bricks/cinder1
Brick4: 10.70.43.69:/rhs/bricks/cinder1
Brick5: 10.70.43.70:/rhs/bricks/cinder1
Brick6: 10.70.43.61:/rhs/bricks/cinder1
Brick7: 10.70.43.56:/rhs/bricks/cinder1
Brick8: 10.70.43.57:/rhs/bricks/cinder1
Brick9: 10.70.43.53:/rhs/bricks/cinder1
Brick10: 10.70.43.47:/rhs/bricks/cinder1
Brick11: 10.70.43.44:/rhs/bricks/cinder1
Brick12: 10.70.43.42:/rhs/bricks/cinder1
Brick13: 10.70.43.40:/rhs/bricks/cinder1
Brick14: 10.70.43.35:/rhs/bricks/cinder1
Brick15: 10.70.43.36:/rhs/bricks/cinder1
Brick16: 10.70.43.37:/rhs/bricks/cinder1
Options Reconfigured:
diagnostics.client-log-level: DEBUG
diagnostics.brick-log-level: DEBUG
storage.owner-gid: 165
storage.owner-uid: 165
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
[root@localhost ~]#

[root@localhost ~]# gluster volume rebalance cinder-vol status
Node            Rebalanced-files    size      scanned    failures    skipped    status       run time in secs
---------       ----------------    ------    -------    --------    -------    ---------    ----------------
localhost       0                   0Bytes    25         0           0          completed    2.00
10.70.43.66     5                   41.0GB    27         0           0          completed    431.00
10.70.43.67     0                   0Bytes    25         0           0          completed    3.00
10.70.43.47     0                   0Bytes    25         0           0          completed    2.00
10.70.43.69     0                   0Bytes    25         0           0          completed    2.00
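For reference, steps 5 and 6 above map roughly to the following CLI. This is a sketch only: the brick host/path pairs shown are illustrative (drawn from the brick list above), not the exact arguments used in this run.

# Sketch of the add-brick + rebalance sequence (steps 5-6).
# Bricks are added in multiples of the replica count (2 here).
gluster volume add-brick cinder-vol \
    10.70.43.36:/rhs/bricks/cinder1 10.70.43.37:/rhs/bricks/cinder1
gluster volume rebalance cinder-vol start
gluster volume rebalance cinder-vol status   # repeat until every node shows "completed"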
10.70.43.70     3                   30.0GB    35         0           2          completed    1813.00
10.70.43.72     0                   0Bytes    25         0           0          completed    2.00
10.70.43.57     0                   0Bytes    25         0           0          completed    2.00
10.70.43.61     0                   0Bytes    25         0           0          completed    3.00
10.70.43.56     6                   46.0GB    39         0           0          completed    2758.00
10.70.43.68     5                   50.0GB    32         0           0          completed    512.00
10.70.43.53     1                   10.0GB    30         0           0          completed    652.00
10.70.43.40     0                   0Bytes    25         0           0          completed    2.00
10.70.43.35     0                   0Bytes    25         0           0          completed    2.00
10.70.43.36     0                   0Bytes    25         0           0          completed    2.00
10.70.43.37     0                   0Bytes    25         0           0          completed    3.00
volume rebalance: cinder-vol: success:
[root@localhost ~]#

[root@rhsqa-virt06 ~]# for i in 10.70.43.66 10.70.43.47 10.70.43.53 10.70.43.67; do ssh $i "(echo $i:; getfattr -d -m . /rhs/bricks/cinder1; echo)"; done
10.70.43.66:
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0sAAAAAAAAAAAAAAAAAAAAAQ==
trusted.glusterfs.dht=0sAAAAAQAAAAAAAAAAH////g==
trusted.glusterfs.volume-id=0sTQsuVy6zRMCfnF4qGDW4Gg==

10.70.43.47:
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0sAAAAAAAAAAAAAAAAAAAAAQ==
trusted.glusterfs.dht=0sAAAAAQAAAAB////8n///+g==
trusted.glusterfs.volume-id=0sTQsuVy6zRMCfnF4qGDW4Gg==

10.70.43.53:
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0sAAAAAAAAAAAAAAAAAAAAAQ==
trusted.glusterfs.dht=0sAAAAAQAAAAB////8n///+g==
trusted.glusterfs.volume-id=0sTQsuVy6zRMCfnF4qGDW4Gg==

10.70.43.67:
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0sAAAAAAAAAAAAAAAAAAAAAQ==
trusted.glusterfs.dht=0sAAAAAQAAAAAAAAAAH////g==
trusted.glusterfs.volume-id=0sTQsuVy6zRMCfnF4qGDW4Gg==

[root@rhsqa-virt06 ~]#
[root@rhsqa-virt06 ~]# for i in 10.70.43.66 10.70.43.47 10.70.43.53 10.70.43.67; do ssh $i "(echo $i:; getfattr -d -m . /rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e; echo)"; done
10.70.43.66:
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
trusted.gfid=0sebxoy0WtQayHIsGc4lpcew==
trusted.glusterfs.dht.linkto="cinder-vol-replicate-4"

10.70.43.47:
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
trusted.afr.cinder-vol-client-8=0sAAAAAAAAAAAAAAAA
trusted.afr.cinder-vol-client-9=0sAAAAAAAAAAAAAAAA
trusted.gfid=0sebxoy0WtQayHIsGc4lpcew==

10.70.43.53:
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
trusted.afr.cinder-vol-client-8=0sAAAAAAAAAAAAAAAA
trusted.afr.cinder-vol-client-9=0sAAAAAAAAAAAAAAAA
trusted.gfid=0sebxoy0WtQayHIsGc4lpcew==

10.70.43.67:
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1/volume-64633c79-6748-4d01-b60b-2923a5d1db3e
trusted.gfid=0sebxoy0WtQayHIsGc4lpcew==
trusted.glusterfs.dht.linkto="cinder-vol-replicate-4"

[root@rhsqa-virt06 ~]#
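A quick way to enumerate such link files directly on a brick, shown as a hedged sketch: DHT link files are zero-byte files whose mode is only the sticky bit (---------T) and which carry the trusted.glusterfs.dht.linkto xattr. The brick path is from this setup; the .glusterfs housekeeping tree is excluded to avoid listing the hard-linked gfid copies.

# Run on a brick host; sketch only.
find /rhs/bricks/cinder1 -not -path '*/.glusterfs/*' -type f -perm 1000 -size 0c \
    -exec getfattr -n trusted.glusterfs.dht.linkto {} \;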
Created attachment 789038 [details]
cinder-vol-rebalance logs from all the nodes

$ for i in 10.70.43.66 10.70.43.67 10.70.43.68 10.70.43.69 10.70.43.70 10.70.43.61 10.70.43.56 10.70.43.57 10.70.43.53 10.70.43.47 10.70.43.44 10.70.43.42 10.70.43.40 10.70.43.35 10.70.43.36 10.70.43.37; do scp root@$i:/var/log/glusterfs/cinder-vol-rebalance.log /tmp/cinder-vol-rebalance.log.$i; done
$ tar -cjvf cinder-vol-rebalance.log.tar.bz2 /tmp/cinder-vol-rebalance.log*
[root@rhsqa-virt06 ~]# for i in 10.70.43.66 10.70.43.47 10.70.43.53 10.70.43.57 10.70.43.67 10.70.43.56 10.70.43.61 10.70.43.44 10.70.43.68 10.70.43.69 10.70.43.72 10.70.43.70 10.70.43.42 10.70.43.76 10.70.43.40 10.70.43.35 10.70.43.36 10.70.43.37; do ssh $i "(echo $i; getfattr -m - -d -e hex /rhs/bricks/cinder1; echo)"; done
10.70.43.66
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000000000001ffffffe
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a

10.70.43.47
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000007ffffffc9ffffffa
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a

10.70.43.53
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000007ffffffc9ffffffa
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a

10.70.43.57
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000005ffffffd7ffffffb
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a

10.70.43.67
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000000000001ffffffe
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a

10.70.43.56
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000005ffffffd7ffffffb
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a

10.70.43.61
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000003ffffffe5ffffffc
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a

10.70.43.44
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.afr.cinder-vol-client-10=0x000000000000000000000000
trusted.afr.cinder-vol-client-11=0x000000000000000000000002
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000009ffffffbbffffff9
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a

10.70.43.68
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000001fffffff3ffffffd
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a

10.70.43.69
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000001fffffff3ffffffd
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a
getfattr: Removing leading '/' from absolute path names

10.70.43.72
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000002492492449249247
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a

10.70.43.70
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000003ffffffe5ffffffc
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a

ssh: connect to host 10.70.43.42 port 22: No route to host
10.70.43.76
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000002492492449249247
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a

10.70.43.40
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000bffffffadffffff8
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a

10.70.43.35
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000bffffffadffffff8
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a

10.70.43.36
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000dffffff9ffffffff
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a

10.70.43.37
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder1
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000dffffff9ffffffff
trusted.glusterfs.volume-id=0x4d0b2e572eb344c09f9c5e2a1835b81a

[root@rhsqa-virt06 ~]#
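To make the layout values above easier to read: the trusted.glusterfs.dht value packs four 32-bit big-endian fields (count, hash type, range start, range end), so the last 16 hex digits give the hash range assigned to that brick. A minimal decoding sketch, using the value observed on 10.70.43.66 above:

# Decode a dht layout xattr value (leading 0x stripped); sketch only.
x=0000000100000000000000001ffffffe
echo "start=0x${x:16:8} end=0x${x:24:8}"
# -> start=0x00000000 end=0x1ffffffe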
Created attachment 789039 [details]
/var/log/glusterfs/bricks/rhs-bricks-cinder1.log logs

# for i in 10.70.43.66 10.70.43.67 10.70.43.68 10.70.43.69 10.70.43.70 10.70.43.61 10.70.43.56 10.70.43.57 10.70.43.53 10.70.43.47 10.70.43.44 10.70.43.42 10.70.43.40 10.70.43.35 10.70.43.36 10.70.43.37; do scp root@$i:/var/log/glusterfs/bricks/rhs-bricks-cinder1.log /tmp/rhs-bricks-cinder1.log.$i; done
# tar -jcvf rhs-bricks-cinder1.logs.tar.bz rhs-bricks-cinder1.log*
Created attachment 789040 [details]
glustershd logs from all nodes

# for i in 10.70.43.66 10.70.43.67 10.70.43.68 10.70.43.69 10.70.43.70 10.70.43.61 10.70.43.56 10.70.43.57 10.70.43.53 10.70.43.47 10.70.43.44 10.70.43.42 10.70.43.40 10.70.43.35 10.70.43.36 10.70.43.37; do scp root@$i:/var/log/glusterfs/glustershd.log /tmp/glustershd.log.$i; done
# tar -jcvf glustershd.logs.tar.bz glustershd.log*

Note: the 10.70.43.42 node was down under mysterious circumstances during this rebalance operation, so you should see a lot of the following messages. I don't think it is relevant, since its replica pair 10.70.43.44 was alive and healthy, but I am still mentioning it here to help with diagnosis.

<snip>
[2013-08-21 15:01:24.595182] D [name.c:155:client_fill_address_family] 0-cinder-vol-client-11: address-family not specified, guessing it to be inet from (remote-host: 10.70.43.42)
[2013-08-21 15:01:24.601564] D [common-utils.c:248:gf_resolve_ip6] 0-resolver: returning ip-10.70.43.42 (port-24007) for hostname: 10.70.43.42 and port: 24007
[2013-08-21 15:01:24.601850] D [name.c:155:client_fill_address_family] 0-glance-vol-client-11: address-family not specified, guessing it to be inet from (remote-host: 10.70.43.42)
[2013-08-21 15:01:24.607525] D [common-utils.c:248:gf_resolve_ip6] 0-resolver: returning ip-10.70.43.42 (port-24007) for hostname: 10.70.43.42 and port: 24007
[2013-08-21 15:01:24.699835] D [socket.c:615:__socket_shutdown] 0-cinder-vol-client-11: shutdown() returned -1. Transport endpoint is not connected
[2013-08-21 15:01:24.699902] D [socket.c:492:__socket_rwv] 0-cinder-vol-client-11: EOF on socket
[2013-08-21 15:01:24.699923] D [socket.c:2237:socket_event_handler] 0-transport: disconnecting now
</snip>
Created attachment 789051 [details]
Client log

Client:

[root@rhs-hpc-srv1 glusterfs(keystone_admin)]# df -hT
Filesystem                    Type            Size  Used  Avail  Use%  Mounted on
/dev/sdb1                     ext4            1.1T  1.5G   1.1T    1%  /
tmpfs                         tmpfs            24G     0    24G    0%  /dev/shm
/dev/sdf1                     ext4             30G  211M    28G    1%  /boot
/dev/sdc1                     ext4            1.1T   28G  1013G    3%  /var
/srv/loopback-device/device1  ext4            960M   34M   876M    4%  /srv/node/device1
10.70.43.44:glance-vol        fuse.glusterfs  300G   29G   272G   10%  /var/lib/glance/images
10.70.43.44:cinder-vol        fuse.glusterfs  400G   29G   372G    8%  /var/lib/cinder/volumes/1d12e17a168a458a2db39ca37ee302fd
10.70.43.44:cinder-vol        fuse.glusterfs  400G   29G   372G    8%  /var/lib/nova/mnt/1d12e17a168a458a2db39ca37ee302fd
[root@rhs-hpc-srv1 glusterfs(keystone_admin)]#

The log file attached is /var/log/glusterfs/var-lib-cinder-volumes-1d12e17a168a458a2db39ca37ee302fd.log
Other than the two errors below, there are no errors reported. Link files still exist because the migration of a few files was skipped due to space constraints. Those should not lead to any failures, as AFR is involved too.

[2013-08-21 01:26:54.073212] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x165) [0x7f056f71cf55] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x7f056f71ca93] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f056f71c9ae]))) 1-cinder-vol-client-11: forced unwinding frame type(GlusterFS 3.3) op(LOOKUP(27)) called at 2013-08-21 01:26:11.904694 (xid=0x443x)
[2013-08-21 01:26:54.083631] E [rpc-clnt.c:368:saved_frames_unwind] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_notify+0x165) [0x7f056f71cf55] (-->/usr/lib64/libgfrpc.so.0(rpc_clnt_connection_cleanup+0xc3) [0x7f056f71ca93] (-->/usr/lib64/libgfrpc.so.0(saved_frames_destroy+0xe) [0x7f056f71c9ae]))) 1-cinder-vol-client-11: forced unwinding frame type(GlusterFS Handshake) op(PING(3)) called at 2013-08-21 01:26:11.904704 (xid=0x444x)
I shall re-test this issue, but I am not clear which log level you are asking me to set: DEBUG or TRACE. I believe TRACE (since I am trying to reproduce it); please correct me if otherwise.
Please re-run the tests with the log level set to DEBUG. With TRACE, there is a possibility of the issue being masked due to excessive logging. (If the issue is hit again and DEBUG does not help, we may need to re-run with TRACE.)
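For the record, the log levels seen under "Options Reconfigured" earlier in this report are set per volume, e.g.:

gluster volume set cinder-vol diagnostics.client-log-level DEBUG
gluster volume set cinder-vol diagnostics.brick-log-level DEBUG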
I could reproduce this. Add-brick followed by rebalance, plus mkfs.ext4 (in this case) on the attached device, causes this (sketch below). Attaching logs and other info shortly.
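For concreteness, the in-guest step is just formatting the attached volume. The device node below is a hypothetical placeholder for however the Cinder volume showed up inside the instance:

# Inside the guest; /dev/vdb is a placeholder device name.
mkfs.ext4 /dev/vdb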
Fixing the summary, since per development:

- ---------T files will still be created if moving a file would lead to volume imbalance w.r.t. space consumption.
- Even if stale ---------T files are left on the back-end, DHT can and will clean them up later during readdirp/access operations (which hit the DHT xlator).
- ---------T files are not an issue here: if they are missing or stale, DHT can clean them up as needed.
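A hedged way to exercise that cleanup path from a client is simply to walk the mount so that readdirp/lookups pass through the DHT xlator. The mount point is taken from the client df output earlier in this report:

# Recursive listing drives readdirp/lookups through dht; sketch only.
ls -lR /var/lib/cinder/volumes/1d12e17a168a458a2db39ca37ee302fd > /dev/null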
[root@rhsqa-virt06 ~]# for i in 10.70.43.178 10.70.43.73 10.70.42.250 10.70.42.185 10.70.42.172 10.70.42.178 10.70.42.255 10.70.42.157 10.70.42.164 10.70.43.95 10.70.42.169 10.70.42.159 10.70.42.156 10.70.43.92 10.70.42.163 10.70.42.180 10.70.42.154 10.70.42.161 10.70.42.173 10.70.42.170; do ssh $i "(echo $i; getfattr -m - -d -e hex /rhs/bricks/cinder-volume; echo)"; done
10.70.43.178
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000e38e38e0ffffffff
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76
getfattr: Removing leading '/' from absolute path names

10.70.43.73
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000e38e38e0ffffffff
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76
getfattr: Removing leading '/' from absolute path names

10.70.42.250
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000005555555471c71c6f
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76
getfattr: Removing leading '/' from absolute path names

10.70.42.185
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000005555555471c71c6f
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76
getfattr: Removing leading '/' from absolute path names

10.70.42.172
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000071c71c708e38e38b
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76
getfattr: Removing leading '/' from absolute path names

10.70.42.178
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000071c71c708e38e38b
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76
getfattr: Removing leading '/' from absolute path names

10.70.42.255
getfattr: /rhs/bricks/cinder-volume: No such file or directory

10.70.42.157
getfattr: /rhs/bricks/cinder-volume: No such file or directory

10.70.42.164
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000038e38e3855555553
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76
getfattr: Removing leading '/' from absolute path names

10.70.43.95
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x000000010000000038e38e3855555553
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76
getfattr: Removing leading '/' from absolute path names

10.70.42.169
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000001c71c71c38e38e37
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76
getfattr: Removing leading '/' from absolute path names

10.70.42.159
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000001c71c71c38e38e37
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76
getfattr: Removing leading '/' from absolute path names

10.70.42.156
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000000000001c71c71b
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76
getfattr: Removing leading '/' from absolute path names

10.70.43.92
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000000000001c71c71b
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76
getfattr: Removing leading '/' from absolute path names

10.70.42.163
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000008e38e38caaaaaaa7
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76

10.70.42.180
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x00000001000000008e38e38caaaaaaa7
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76

10.70.42.154
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000aaaaaaa8c71c71c3
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76

10.70.42.161
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000aaaaaaa8c71c71c3
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76

10.70.42.173
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000c71c71c4e38e38df
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76

10.70.42.170
getfattr: Removing leading '/' from absolute path names
# file: rhs/bricks/cinder-volume
trusted.gfid=0x00000000000000000000000000000001
trusted.glusterfs.dht=0x0000000100000000c71c71c4e38e38df
trusted.glusterfs.volume-id=0xced7fae7c90c4df6953f7148ee2ada76

[root@rhsqa-virt06 ~]#
[root@localhost ~]# gluster vol info cinder-volume

Volume Name: cinder-volume
Type: Distributed-Replicate
Volume ID: ced7fae7-c90c-4df6-953f-7148ee2ada76
Status: Started
Number of Bricks: 9 x 2 = 18
Transport-type: tcp
Bricks:
Brick1: 10.70.43.178:/rhs/bricks/cinder-volume
Brick2: 10.70.43.73:/rhs/bricks/cinder-volume
Brick3: 10.70.42.250:/rhs/bricks/cinder-volume
Brick4: 10.70.42.185:/rhs/bricks/cinder-volume
Brick5: 10.70.42.172:/rhs/bricks/cinder-volume
Brick6: 10.70.42.178:/rhs/bricks/cinder-volume
Brick7: 10.70.42.164:/rhs/bricks/cinder-volume
Brick8: 10.70.43.95:/rhs/bricks/cinder-volume
Brick9: 10.70.42.156:/rhs/bricks/cinder-volume
Brick10: 10.70.43.92:/rhs/bricks/cinder-volume
Brick11: 10.70.42.169:/rhs/bricks/cinder-volume
Brick12: 10.70.42.159:/rhs/bricks/cinder-volume
Brick13: 10.70.42.163:/rhs/bricks/cinder-volume
Brick14: 10.70.42.180:/rhs/bricks/cinder-volume
Brick15: 10.70.42.154:/rhs/bricks/cinder-volume
Brick16: 10.70.42.161:/rhs/bricks/cinder-volume
Brick17: 10.70.42.173:/rhs/bricks/cinder-volume
Brick18: 10.70.42.170:/rhs/bricks/cinder-volume
Options Reconfigured:
diagnostics.brick-log-level: DEBUG
diagnostics.client-log-level: DEBUG
network.remote-dio: enable
cluster.eager-lock: enable
performance.stat-prefetch: off
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
storage.owner-uid: 165
storage.owner-gid: 165
[root@localhost ~]#

[root@localhost ~]# gluster volume rebalance cinder-volume status
Node            Rebalanced-files    size      scanned    failures    skipped    status       run time in secs
---------       ----------------    ------    -------    --------    -------    ---------    ----------------
localhost       4                   4.0GB     78         0           3          completed    47.00
10.70.43.73     0                   0Bytes    74         0           0          completed    2.00
10.70.42.250    10                  10.0GB    88         0           0          completed    101.00
10.70.42.185    0                   0Bytes    74         0           0          completed    2.00
10.70.42.172    8                   8.0GB     93         0           0          completed    83.00
10.70.42.178    0                   0Bytes    74         0           0          completed    2.00
10.70.42.255    0                   0Bytes    74         0           0          completed    2.00
10.70.42.157    0                   0Bytes    74         0           0          completed    2.00
10.70.42.164    4                   4.0GB     88         0           0          completed    243.00
10.70.43.95     0                   0Bytes    74         0           0          completed    2.00
10.70.42.169    8                   8.0GB     94         0           0          completed    84.00
10.70.42.159    0                   0Bytes    74         0           0          completed    2.00
10.70.42.156    0                   0Bytes    74         0           2          completed    2.00
10.70.43.92     0                   0Bytes    74         0           0          completed    2.00
10.70.42.163    0                   0Bytes    74         0           8          completed    2.00
10.70.42.180    0                   0Bytes    74         0           0          completed    2.00
10.70.42.154    0                   0Bytes    74         0           0          completed    2.00
10.70.42.161    0                   0Bytes    74         0           0          completed    2.00
10.70.42.173    0                   0Bytes    74         0           0          completed    2.00
10.70.42.170    0                   0Bytes    74         0           0          completed    2.00
volume rebalance: cinder-volume: success:
[root@localhost ~]#
Created attachment 790895 [details]
var-lib-cinder-volumes-01cbb41254aca5f79b543409c40afa79.log

Client info and logs attached.

[root@rhs-hpc-srv1 glusterfs(keystone_admin)]# df -hT
Filesystem                    Type            Size  Used  Avail  Use%  Mounted on
/dev/sda1                     ext4             30G   10G    18G   36%  /
tmpfs                         tmpfs            24G     0    24G    0%  /dev/shm
/dev/sdf                      ext4            1.1T  318M   1.1T    1%  /var/log
10.70.43.178:/glance-volume   fuse.glusterfs  300G   22G   279G    8%  /var/lib/glance
/srv/loopback-device/device1  ext4            960M   34M   876M    4%  /srv/node/device1
10.70.43.178:/cinder-volume   fuse.glusterfs  450G   22G   429G    5%  /var/lib/cinder/volumes/01cbb41254aca5f79b543409c40afa79
10.70.43.178:/cinder-volume   fuse.glusterfs  450G   22G   429G    5%  /var/lib/nova/mnt/01cbb41254aca5f79b543409c40afa79
[root@rhs-hpc-srv1 glusterfs(keystone_admin)]#
Created attachment 790896 [details]
Screenshot of the instance
On the latest client graph, note these network disconnection log messages:

[2013-08-27 04:23:18.693638] I [afr-common.c:3953:afr_local_init] 2-cinder-volume-replicate-7: no subvolumes up
[2013-08-27 04:23:18.693694] D [dht-common.c:2885:dht_fd_cbk] 2-cinder-volume-dht: subvolume cinder-volume-replicate-7 returned -1 (Transport endpoint is not connected)
[2013-08-27 04:23:18.693710] I [afr-common.c:3953:afr_local_init] 2-cinder-volume-replicate-8: no subvolumes up
[2013-08-27 04:23:18.693721] D [dht-common.c:2885:dht_fd_cbk] 2-cinder-volume-dht: subvolume cinder-volume-replicate-8 returned -1 (Transport endpoint is not connected)
....
[2013-08-27 04:23:18.698527] I [afr-common.c:3953:afr_local_init] 2-cinder-volume-replicate-7: no subvolumes up
[2013-08-27 04:23:18.698565] I [afr-common.c:3953:afr_local_init] 2-cinder-volume-replicate-8: no subvolumes up
[2013-08-27 04:23:21.693287] I [afr-common.c:3953:afr_local_init] 2-cinder-volume-replicate-7: no subvolumes up
[2013-08-27 04:23:21.693342] I [afr-common.c:3953:afr_local_init] 2-cinder-volume-replicate-8: no subvolumes up
[2013-08-27 04:23:23.573663] I [afr-common.c:3953:afr_local_init] 2-cinder-volume-replicate-7: no subvolumes up
[2013-08-27 04:23:23.573720] I [afr-common.c:3953:afr_local_init] 2-cinder-volume-replicate-8: no subvolumes up
More disconnection logs:

[2013-08-27 04:23:18.442073] I [dht-layout.c:633:dht_layout_normalize] 2-cinder-volume-dht: found anomalies in /. holes=1 overlaps=1 missing=0 down=2 misc=0
[2013-08-27 04:23:18.443150] D [dht-layout.c:649:dht_layout_normalize] (-->/usr/lib64/glusterfs/3.4.0.22rhs/xlator/protocol/client.so(client3_3_lookup_cbk+0x626) [0x7f5c2aae71f6] (-->/usr/lib64/glusterfs/3.4.0.22rhs/xlator/cluster/replicate.so(afr_lookup_cbk+0x520) [0x7f5c2a8abe80] (-->/usr/lib64/glusterfs/3.4.0.22rhs/xlator/cluster/distribute.so(dht_lookup_dir_cbk+0x484) [0x7f5c2a633b64]))) 2-cinder-volume-dht: path=/ err=Transport endpoint is not connected on subvol=cinder-volume-replicate-8
[2013-08-27 04:23:18.443181] D [dht-layout.c:649:dht_layout_normalize] (-->/usr/lib64/glusterfs/3.4.0.22rhs/xlator/protocol/client.so(client3_3_lookup_cbk+0x626) [0x7f5c2aae71f6] (-->/usr/lib64/glusterfs/3.4.0.22rhs/xlator/cluster/replicate.so(afr_lookup_cbk+0x520) [0x7f5c2a8abe80] (-->/usr/lib64/glusterfs/3.4.0.22rhs/xlator/cluster/distribute.so(dht_lookup_dir_cbk+0x484) [0x7f5c2a633b64]))) 2-cinder-volume-dht: path=/ err=Transport endpoint is not connected on subvol=cinder-volume-replicate-7
[2013-08-27 04:23:18.443187] D [dht-common.c:499:dht_lookup_dir_cbk] 2-cinder-volume-dht: fixing assignment on /
[2013-08-27 04:23:18.443204] W [dht-selfheal.c:909:dht_selfheal_directory] 2-cinder-volume-dht: 2 subvolumes down -- not fixing
The subvolumes seem to have come back up a few seconds later:

[2013-08-27 04:23:33.237097] I [afr-common.c:3795:afr_notify] 2-cinder-volume-replicate-8: Subvolume 'cinder-volume-client-17' came back up; going online.
[2013-08-27 04:23:37.315468] I [afr-common.c:3795:afr_notify] 2-cinder-volume-replicate-7: Subvolume 'cinder-volume-client-14' came back up; going online.
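One way to line up this disconnect/reconnect timeline is to pull the relevant messages out of the client log; the log file name below is taken from the client info attached earlier:

grep -E "no subvolumes up|came back up|not fixing" \
    /var/log/glusterfs/var-lib-cinder-volumes-01cbb41254aca5f79b543409c40afa79.log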
It looks like the graph switch happened on the 'CHILD_CONNECTING' event, and CHILD_UP came a little late. Because of this, the graph switch completed while the new graph was not yet fully active.
Created attachment 791431 [details]
sosreport-1-to-9-rhs-nodes.tar.bz

# tar -jcvf sosreport-1-to-9-rhs-nodes.tar.bz sosreport-10.70.42.154-20130828203046-eac2.tar.xz sosreport-10.70.42.156-20130828202709-4c1b.tar.xz sosreport-10.70.42.159-20130828202853-9a92.tar.xz sosreport-10.70.42.161-20130828203137-d166.tar.xz sosreport-10.70.42.163-20130828202932-d450.tar.xz sosreport-10.70.42.164-20130828200908-6e11.tar.xz sosreport-10.70.42.169-20130828202816-adac.tar.xz sosreport-10.70.42.170-20130828203245-f0b1.tar.xz sosreport-10.70.42.172-20130828200505-c1d1.tar.xz
Created attachment 791432 [details]
sosreport-10-to-18-rhs-nodes.tar.bz

# tar -jcvf sosreport-10-to-18-rhs-nodes.tar.bz sosreport-10.70.42.173-20130828203211-12a7.tar.xz sosreport-10.70.42.178-20130828200552-8ff6.tar.xz sosreport-10.70.42.180-20130828203008-670e.tar.xz sosreport-10.70.42.185-20130828200344-0b69.tar.xz sosreport-10.70.42.250-20130828200235-ec2a.tar.xz sosreport-10.70.43.178-20130828200124-b8a2.tar.xz sosreport-10.70.43.73-20130828200201-2c23.tar.xz sosreport-10.70.43.92-20130828202742-cbb3.tar.xz sosreport-10.70.43.95-20130828201036-de7b.tar.xz
It looks like we now know the issue with rebalance and self-heal. There is a patch in the works, but as proposed by development, the fix is not 'non-intrusive'; hence we would like a little more soaking time. I proposed it in the Big Bend readout meeting and got approval to push it to Update 1. The patch will be posted, reviewed, and tested upstream in the meantime.
Are we documenting this as a known issue for Big Bend?
<snip>
Internal Whiteboard: Big Bend → Big Bend U1
Flags: rhs-2.1.0+ → rhs-2.1.z?
</snip>

Removing the blocker flag, as it seems to be targeted for Update 1 now.
Shishir, this bug has been identified as a known issue for the Big Bend release. Please provide CCFR information in the Doc Text field.
Shishir, I have edited the Doc Text field. Kindly review and sign off.
A possible fix for this issue is posted upstream at http://review.gluster.org/#/c/5891/
The upstream patch was accepted and has now been posted downstream:
https://code.engineering.redhat.com/gerrit/#/c/16038/
https://code.engineering.redhat.com/gerrit/#/c/16039/
Per integration bug triage on 12/13: request the blocker? flag for Corbett if this failed QA.
Tested and verified on glusterfs-3.4.0.44rhs-1.el6rhs.x86_64. Files migrate successfully without leaving T-files behind.
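For completeness, a sketch of the kind of check that confirms this (host list abbreviated to the bricks examined earlier in this report; a clean run prints nothing):

for i in 10.70.43.66 10.70.43.47 10.70.43.53 10.70.43.67; do
    ssh $i 'find /rhs/bricks/cinder1 -type f -perm 1000 -size 0c'
done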
Pranith, Can you please review the doc text for technical accuracy?
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHEA-2014-0208.html
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days