Description of problem: ----------------------- There were excessive warning messages in rebalance log file, while performing rebalance with self-heal Version-Release number of selected component (if applicable): ------------------------------------------------------------- glusterfs-3.4.0.55rhs-1.el6rhs How reproducible: ----------------- Happened to face it in first attempt, never retried it Steps to Reproduce: ------------------- 1. Added 4 RHSS Nodes to gluster-cluster RHEVM 2. Created a distribute-replicate(2x2) volume using RHEVM UI ( using bricks from only RHSS Node-1 & RHSS Node-2 ) 3. Optimize the volume for virt store 4. Start the volume 5. Create a Data domain (storage domain) backed by gluster volume 6. Create a appvm ( configuration and disk image creation) 7. Poweroff(hard shutdown) RHSS Node-2 from RHEVM UI 8. Install appvm with RHEL 6.5 9. Run IO inside appvm using dd command 10. Add a pair of bricks using RHEVM UI (using bricks from RHSS Node3 and RHSS Node4 ) 11. Perform rebalance and check the image file is under migration If the rebalance is not happening, continue adding bricks and perform rebalance, till rebalance kicks in. If rebalance is in progress (image of appvm is in migration), bring up the RHSS Node-2 that was down earlier, so that self-heal kicks in 12. Trigger self-heal on the RHSS Node, which just up 13. Check self-heal info Actual results: --------------- 1. There are plenty of warning messages flowing in rebalance logs for long time 2. Rebalance by itself took long time to complete migration. 50GB file took 4.25 Hours to complete (15305 seconds) Expected results: ----------------- 1. Rebalance log file should not be overloaded with warning messages. 2. Not sure about the significance of these warning messages, if its the case, then the problem should be resolved Additional info: ---------------- INSTALLATION INFO ================== 1. RHSS Installed using ISO - RHSS-2.1-20140107.n.1-RHS-x86_64-DVD1.iso http://download.eng.blr.redhat.com/composes/nightly/RHSS-2.1-20140107.n.1/2.1/RHS/x86_64/iso/ 2. No extra packages installed SETUP INFO ========== 1. RHSS Node running in VM with 8GB RAM and 4 VCPUs 2. Trusted Storage Pool consists of 4 RHSS Nodes RHSS Node-1 : 10.70.37.86 RHSS Node-2 : 10.70.37.187 RHSS Node-3 : 10.70.37.46 RHSS Node-4 : 10.70.37.198 3. RHEL 6.5 Client 15GB RAM and 4 Cores used as Hypervisor (Client machine to fuse mount the gluster volume) rhs-client15.lab.eng.blr.redhat.com : 10.70.36.39 BRICK INFO =========== 1. 2 bricks from each RHSS Node was used for the volume Brick1: 10.70.37.86:/rhs/brick1/drdir1 Brick2: 10.70.37.187:/rhs/brick1/drdir1 Brick3: 10.70.37.86:/rhs/brick2/drdir2 Brick4: 10.70.37.187:/rhs/brick2/drdir2 2. Node that was brought down for testing self-heal RHSS Node-2 : 10.70.37.187 3. Brick(s) that were down as RHSS Nodes were down Brick2: 10.70.37.187:/rhs/brick1/drdir1 Brick4: 10.70.37.187:/rhs/brick2/drdir2 4. Extended Attributes on all bricks ( Current status ) VOLUME INFO ============ [Thu Jan 9 10:58:19 UTC 2014 root.37.86:~ ] # gluster v i Volume Name: dr-imgstore Type: Distributed-Replicate Volume ID: 532b453b-98c2-4873-9c75-14cf1403680d Status: Started Number of Bricks: 7 x 2 = 14 Transport-type: tcp Bricks: Brick1: 10.70.37.86:/rhs/brick1/drdir1 Brick2: 10.70.37.187:/rhs/brick1/drdir1 Brick3: 10.70.37.86:/rhs/brick2/drdir2 Brick4: 10.70.37.187:/rhs/brick2/drdir2 Brick5: 10.70.37.46:/rhs/brick1/addbrick1 Brick6: 10.70.37.198:/rhs/brick1/addbrick1 Brick7: 10.70.37.46:/rhs/brick1/addbrick2 Brick8: 10.70.37.198:/rhs/brick1/addbrick2 Brick9: 10.70.37.46:/rhs/brick1/addbrick3 Brick10: 10.70.37.198:/rhs/brick1/addbrick3 Brick11: 10.70.37.46:/rhs/brick1/addbrick4 Brick12: 10.70.37.198:/rhs/brick1/addbrick4 Brick13: 10.70.37.46:/rhs/brick1/addbrick5 Brick14: 10.70.37.198:/rhs/brick1/addbrick5 Options Reconfigured: storage.owner-gid: 36 storage.owner-uid: 36 cluster.server-quorum-type: server cluster.quorum-type: auto network.remote-dio: enable cluster.eager-lock: enable performance.stat-prefetch: off performance.io-cache: off performance.read-ahead: off performance.quick-read: off auth.allow: * user.cifs: enable nfs.disable: off [Thu Jan 9 11:06:12 UTC 2014 root.37.86:~ ] # gluster volume status Status of volume: dr-imgstore Gluster process Port Online Pid ------------------------------------------------------------------------------ Brick 10.70.37.86:/rhs/brick1/drdir1 49152 Y 7712 Brick 10.70.37.187:/rhs/brick1/drdir1 N/A Y 2096 Brick 10.70.37.86:/rhs/brick2/drdir2 49153 Y 7723 Brick 10.70.37.187:/rhs/brick2/drdir2 N/A Y 2103 Brick 10.70.37.46:/rhs/brick1/addbrick1 49152 Y 10891 Brick 10.70.37.198:/rhs/brick1/addbrick1 49152 Y 10229 Brick 10.70.37.46:/rhs/brick1/addbrick2 49153 Y 14062 Brick 10.70.37.198:/rhs/brick1/addbrick2 49153 Y 13158 Brick 10.70.37.46:/rhs/brick1/addbrick3 49154 Y 14614 Brick 10.70.37.198:/rhs/brick1/addbrick3 49154 Y 13907 Brick 10.70.37.46:/rhs/brick1/addbrick4 49155 Y 15346 Brick 10.70.37.198:/rhs/brick1/addbrick4 49155 Y 14506 Brick 10.70.37.46:/rhs/brick1/addbrick5 49156 Y 17283 Brick 10.70.37.198:/rhs/brick1/addbrick5 49156 Y 15869 NFS Server on localhost 2049 Y 16133 Self-heal Daemon on localhost N/A Y 16141 NFS Server on 10.70.37.198 2049 Y 15882 Self-heal Daemon on 10.70.37.198 N/A Y 15889 NFS Server on 10.70.37.187 2049 Y 2260 Self-heal Daemon on 10.70.37.187 N/A Y 2469 NFS Server on 10.70.37.46 2049 Y 17295 Self-heal Daemon on 10.70.37.46 N/A Y 17303 Task Status of Volume dr-imgstore ------------------------------------------------------------------------------ Task : Rebalance ID : aca426cc-156f-432e-a236-27997e4f8cfd Status : completed REBALANCE INFO ============== [Thu Jan 9 10:54:36 UTC 2014 root.37.86:~ ] # gluster volume rebalance dr-imgstore status Node Rebalanced-files size scanned failures skipped status run time in secs --------- ----------- ----------- ----------- ----------- ----------- ------------ -------------- localhost 2 50.0GB 11 0 2 completed 15305.00 10.70.37.187 0 0Bytes 9 0 0 completed 0.00 10.70.37.46 3 1.0MB 11 0 0 completed 0.00 10.70.37.198 0 0Bytes 9 0 0 completed 0.00 volume rebalance: dr-imgstore: success: CONSOLE INFO ============= Following the console log when kicking rebalance from RHSS Node-1 [Wed Jan 8 20:04:58 UTC 2014 root.37.86:~ ] # gluster volume rebalance dr-imgstore start volume rebalance: dr-imgstore: success: Starting rebalance on volume dr-imgstore has been successful. ID: aca426cc-156f-432e-a236-27997e4f8cfd [Wed Jan 8 20:07:26 UTC 2014 root.37.86:~ ] # gluster volume rebalance dr-imgstore status Node Rebalanced-files size scanned failures skipped status run time in secs --------- ----------- ----------- ----------- ----------- ----------- ------------ -------------- localhost 1 2.0MB 8 0 2 in progress 17.00 volume rebalance: dr-imgstore: success: [Wed Jan 8 20:07:54 UTC 2014 root.37.86:~ ] # gluster volume rebalance dr-imgstore status volume rebalance: dr-imgstore: failed: Another transaction could be in progress. Please try again after sometime. OTHER INFO ========== 1. All commands are executed from RHSS Node-1, 10.70.37.86 2. Volume was mounted in the client using volfile server from RHSS Node-1, 10.70.37.86
Warning messages that were logged in rebalance log: These error meesages are in lots in rebalance log, providing a snip of it [2014-01-08 20:19:04.083490] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.088348] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.089297] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.102517] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.103579] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.106097] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.107523] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.126431] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.127355] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.129200] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.130336] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.138848] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.140530] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.147754] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.148842] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.174937] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.176108] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.178267] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.187086] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.190869] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.192235] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.219643] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.220420] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.222506] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.223471] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:04.245249] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:05.045824] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:05.155607] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory [2014-01-08 20:19:05.175864] W [client-rpc-fops.c:1579:client3_3_finodelk_cbk] 0-dr-imgstore-client-1: remote operation failed: No such file or directory
Missed xattr info and client mount info in comment 0, XATTR INFO ON BRICKS ===================== [Thu Jan 9 11:24:03 UTC 2014 root.37.86:~ ] # getfattr -d -m. -e hex /rhs/brick1/drdir1/ getfattr: Removing leading '/' from absolute path names # file: rhs/brick1/drdir1/ trusted.afr.dr-imgstore-client-0=0x000000000000000000000000 trusted.afr.dr-imgstore-client-1=0x000000000000000000000000 trusted.gfid=0x00000000000000000000000000000001 trusted.glusterfs.dht=0x00000001000000002492492449249247 trusted.glusterfs.volume-id=0x532b453b98c248739c7514cf1403680d [Thu Jan 9 12:53:18 UTC 2014 root.37.86:~ ] # getfattr -d -m. -e hex /rhs/brick2/drdir2/ getfattr: Removing leading '/' from absolute path names # file: rhs/brick2/drdir2/ trusted.afr.dr-imgstore-client-2=0x000000000000000000000000 trusted.afr.dr-imgstore-client-3=0x000000000000000000000000 trusted.gfid=0x00000000000000000000000000000001 trusted.glusterfs.dht=0x0000000100000000492492486db6db6b trusted.glusterfs.volume-id=0x532b453b98c248739c7514cf1403680d [Thu Jan 9 11:21:38 UTC 2014 root.37.187:~ ] # getfattr -d -m. -e hex /rhs/brick1/drdir1/ getfattr: Removing leading '/' from absolute path names # file: rhs/brick1/drdir1/ trusted.afr.dr-imgstore-client-0=0x000000000000000000000000 trusted.afr.dr-imgstore-client-1=0x000000000000000000000000 trusted.gfid=0x00000000000000000000000000000001 trusted.glusterfs.dht=0x00000001000000002492492449249247 trusted.glusterfs.volume-id=0x532b453b98c248739c7514cf1403680d [Thu Jan 9 12:53:38 UTC 2014 root.37.187:~ ] # getfattr -d -m. -e hex /rhs/brick2/drdir2/ getfattr: Removing leading '/' from absolute path names # file: rhs/brick2/drdir2/ trusted.afr.dr-imgstore-client-2=0x000000000000000000000000 trusted.afr.dr-imgstore-client-3=0x000000000000000000000000 trusted.gfid=0x00000000000000000000000000000001 trusted.glusterfs.dht=0x0000000100000000492492486db6db6b trusted.glusterfs.volume-id=0x532b453b98c248739c7514cf1403680d CLIENT SIDE INFO ================ [root@rhs-client15 ~]# mount /dev/mapper/vg_rhsclient15-lv_root on / type ext4 (rw) proc on /proc type proc (rw) sysfs on /sys type sysfs (rw) devpts on /dev/pts type devpts (rw,gid=5,mode=620) tmpfs on /dev/shm type tmpfs (rw,rootcontext="system_u:object_r:tmpfs_t:s0") /dev/sda1 on /boot type ext4 (rw) /dev/mapper/vg_rhsclient15-lv_home on /home type ext4 (rw) none on /proc/sys/fs/binfmt_misc type binfmt_misc (rw) 10.70.37.86:dr-imgstore on /rhev/data-center/mnt/glusterSD/10.70.37.86:dr-imgstore type fuse.glusterfs (rw,default_permissions,allow_other,max_read=131072) [root@rhs-client15 ~]# df -Th Filesystem Type Size Used Avail Use% Mounted on /dev/mapper/vg_rhsclient15-lv_root ext4 50G 2.5G 45G 6% / tmpfs tmpfs 7.8G 4.0K 7.8G 1% /dev/shm /dev/sda1 ext4 485M 67M 393M 15% /boot /dev/mapper/vg_rhsclient15-lv_home ext4 1.8T 196M 1.7T 1% /home 10.70.37.86:dr-imgstore fuse.glusterfs 700G 93G 608G 14% /rhev/data-center/mnt/glusterSD/10.70.37.86:dr-imgstore
Missed appvm related info, there was only one appvm created with disk of size 50GB and installed with RHEL 6.5 Image file as seen from client mount , [root@rhs-client15 ~]# df -Th Filesystem Type Size Used Avail Use% Mounted on /dev/mapper/vg_rhsclient15-lv_root ext4 50G 2.5G 45G 6% / tmpfs tmpfs 7.8G 4.0K 7.8G 1% /dev/shm /dev/sda1 ext4 485M 67M 393M 15% /boot /dev/mapper/vg_rhsclient15-lv_home ext4 1.8T 196M 1.7T 1% /home 10.70.37.86:dr-imgstore fuse.glusterfs 700G 93G 608G 14% /rhev/data-center/mnt/glusterSD/10.70.37.86:dr-imgstore [root@rhs-client15 ~]# cd /rhev/data-center/mnt/glusterSD/10.70.37.86:dr-imgstore [root@rhs-client15 10.70.37.86:dr-imgstore]# ls c33c0d51-e8f5-409d-9a52-fea048db0645 __DIRECT_IO_TEST__ [root@rhs-client15 10.70.37.86:dr-imgstore]# cd c33c0d51-e8f5-409d-9a52-fea048db0645/ dom_md/ images/ master/ [root@rhs-client15 10.70.37.86:dr-imgstore]# cd c33c0d51-e8f5-409d-9a52-fea048db0645/ dom_md/ images/ master/ [root@rhs-client15 10.70.37.86:dr-imgstore]# ls c33c0d51-e8f5-409d-9a52-fea048db0645/images/94b388b5-2906-43e8-b372-bd6bfce099f6/ ff2fffbf-a14f-4727-9bea-8afa672e9bc8 ff2fffbf-a14f-4727-9bea-8afa672e9bc8.lease ff2fffbf-a14f-4727-9bea-8afa672e9bc8.meta [root@rhs-client15 10.70.37.86:dr-imgstore]# qemu-img info c33c0d51-e8f5-409d-9a52-fea048db0645/images/94b388b5-2906-43e8-b372-bd6bfce099f6/ff2fffbf-a14f-4727-9bea-8afa672e9bc8 image: c33c0d51-e8f5-409d-9a52-fea048db0645/images/94b388b5-2906-43e8-b372-bd6bfce099f6/ff2fffbf-a14f-4727-9bea-8afa672e9bc8 file format: raw virtual size: 50G (53687091200 bytes) disk size: 45G
Also I notice, that rebalance taking more time For 50GB file, it took 4.25 hours. (15305 seconds), which I feel much higher. [Thu Jan 9 10:54:36 UTC 2014 root.37.86:~ ] # gluster volume rebalance dr-imgstore status Node Rebalanced-files size scanned failures skipped status run time in secs --------- ----------- ----------- ----------- ----------- ----------- ------------ -------------- localhost 2 50.0GB 11 0 2 completed 15305.00 10.70.37.187 0 0Bytes 9 0 0 completed 0.00 10.70.37.46 3 1.0MB 11 0 0 completed 0.00 10.70.37.198 0 0Bytes 9 0 0 completed 0.00 volume rebalance: dr-imgstore: success: Apart from that heal info also takes more than 50 minutes for the above case and being tracked under the bug https://bugzilla.redhat.com/show_bug.cgi?id=1046022#c3
adding 3.0 flag and removing 2.1.z
Pranith, Could you provide the patch URL for this bug and the basic RCA as part of finding a fix for this issue.
I am finding this error message once in rebalance logs. Earlier this message used to be logged more frequently ( flooded in reality ) in rebalance logs. Rebalance logs says this: ------------------------- [2014-11-26 19:18:38.725708] E [client-rpc-fops.c:1520:client3_3_inodelk_cbk] 0-vmstore-client-1: remote operation failed: Invalid argument And relevant brick logs says this: ---------------------------------- [2014-11-27 00:38:12.602788] E [inodelk.c:297:__inode_unlock_lock] 0-vmstore-locks: Matching lock not found for unlock 0-9223372036854775807, by fdffffff on 0xf02190 [2014-11-27 00:38:12.602832] E [server-rpc-fops.c:242:server_inodelk_cbk] 0-vmstore-server: 17: INODELK <gfid:36eb7531-c288-4c4f-8617-7fe84d20e026> (36eb7531-c288-4c4f-8617-7fe84d20e026) ==> (Invalid argument) I have tested this with glusterfs-3.6.0.34-1.el6rhs with test steps as described in comment0
Based on comment10, marking this bug as VERIFIED
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://rhn.redhat.com/errata/RHBA-2015-0038.html