Description of problem:
=======================
I am seeing EIO in the rebalance log after issuing a rebalance following add-brick.

Note: I have a 10-node cluster and the bricks are of different sizes.

DHT subvol1:
DHT subvol || Nodes       || Brick size
subvol1    || 1,2,3,4,5,6 || 10GB
subvol1    || 7,8,9       || 195GB
subvol1    || 10          || 8.1GB

Note: data size is 6.7GB on each brick (first subvol).
Subvol2 is of the same size as above.

Mount of the volume (checking size):
10.70.35.215:ecv  129G  54G  75G  42% /mnt/ecv

I checked a couple of files which had the EIO error and found that the linkto (T) file was actually created on all the dst bricks; however, the files were not migrated (that is understandable because the dst bricks are not bigger than the source, and hence force would have to be used).

Example of one file:
[root@dhcp35-214 ~]# ll /*/brick*/ecv//dir1/linux-4.10.4/include/linux/smsc911x.h
-rw-rw-r--. 2 root root 512 Mar 18 16:49 /rhs/brick1/ecv/dir1/linux-4.10.4/include/linux/smsc911x.h
---------T. 2 root root   0 May  3 13:07 /rhs/brick2/ecv/dir1/linux-4.10.4/include/linux/smsc911x.h

Version-Release number of selected component (if applicable):
======
3.8.4-24

How reproducible:

Steps to Reproduce:
1. Had a 1x(8+2) EC volume and created different files from the FUSE mount.
2. Added 1 more distribute subvol to this volume to make it 2x(8+2) (no I/O in progress).
3. Set the log level to DEBUG for testing purposes (no I/O in progress).
4. Triggered rebalance.

I am seeing the EIO logs below.

Actual results:
[2017-05-03 07:37:35.638821] D [MSGID: 0] [dict.c:161:key_value_cmp] 0-ecv-disperse-0: 'security.selinux' is different in two dicts (33, 38)
[2017-05-03 07:37:35.638860] N [MSGID: 122031] [ec-inode-read.c:199:ec_combine_getxattr] 0-ecv-disperse-0: Mismatching dictionary in answers of 'GF_FOP_GETXATTR'
[2017-05-03 07:37:35.638909] D [MSGID: 0] [dict.c:161:key_value_cmp] 0-ecv-disperse-0: 'security.selinux' is different in two dicts (33, 38)
[2017-05-03 07:37:35.638927] N [MSGID: 122031] [ec-inode-read.c:199:ec_combine_getxattr] 0-ecv-disperse-0: Mismatching dictionary in answers of 'GF_FOP_GETXATTR'
[2017-05-03 07:37:35.638975] D [MSGID: 0] [dict.c:161:key_value_cmp] 0-ecv-disperse-0: 'security.selinux' is different in two dicts (33, 38)
[2017-05-03 07:37:35.638993] N [MSGID: 122031] [ec-inode-read.c:199:ec_combine_getxattr] 0-ecv-disperse-0: Mismatching dictionary in answers of 'GF_FOP_GETXATTR'
[2017-05-03 07:37:35.639038] D [MSGID: 0] [dict.c:161:key_value_cmp] 0-ecv-disperse-0: 'security.selinux' is different in two dicts (33, 38)
[2017-05-03 07:37:35.639056] N [MSGID: 122031] [ec-inode-read.c:199:ec_combine_getxattr] 0-ecv-disperse-0: Mismatching dictionary in answers of 'GF_FOP_GETXATTR'
[2017-05-03 07:37:35.639219] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-12: remote operation failed
[2017-05-03 07:37:35.639249] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-12 returned -1 error: Success
[2017-05-03 07:37:35.639254] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-17: remote operation failed
[2017-05-03 07:37:35.639321] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-17 returned -1 error: Success
[2017-05-03 07:37:35.639328] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-15: remote operation failed
[2017-05-03 07:37:35.639346] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-15 returned -1 error: Success
[2017-05-03 07:37:35.639352] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-14: remote operation failed
[2017-05-03 07:37:35.639398] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-14 returned -1 error: Success
[2017-05-03 07:37:35.639429] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-16: remote operation failed
[2017-05-03 07:37:35.639399] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-13: remote operation failed
[2017-05-03 07:37:35.639459] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-13 returned -1 error: Success
[2017-05-03 07:37:35.639461] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-16 returned -1 error: Success
[2017-05-03 07:37:35.639512] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-18: remote operation failed
[2017-05-03 07:37:35.639528] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-18 returned -1 error: Success
[2017-05-03 07:37:35.639619] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-10: remote operation failed
[2017-05-03 07:37:35.639637] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-10 returned -1 error: Success
[2017-05-03 07:37:35.639913] D [MSGID: 0] [dict.c:161:key_value_cmp] 0-ecv-disperse-0: 'security.selinux' is different in two dicts (33, 38)
[2017-05-03 07:37:35.639950] N [MSGID: 122031] [ec-inode-read.c:199:ec_combine_getxattr] 0-ecv-disperse-0: Mismatching dictionary in answers of 'GF_FOP_GETXATTR'
[2017-05-03 07:37:35.639979] D [MSGID: 0] [defaults.c:1346:default_getxattr_cbk] 0-stack-trace: stack-address: 0x7fd1c8028440, ecv-disperse-0 returned -1 error: Input/output error [Input/output error]
[2017-05-03 07:37:35.640047] W [MSGID: 109023] [dht-rebalance.c:1507:dht_migrate_file] 0-ecv-dht: Migrate file failed:/dir1/linux-4.10.4/include/linux/smsc911x.h: failed to get xattr from ecv-disperse-0 (Input/output error)
[2017-05-03 07:37:35.640631] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-11: remote operation failed
[2017-05-03 07:37:35.640691] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-11 returned -1 error: Success
[2017-05-03 07:37:35.641331] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-19: remote operation failed
[2017-05-03 07:37:35.641365] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-19 returned -1 error: Success
[2017-05-03 07:37:35.641384] D [MSGID: 0] [defaults.c:1092:default_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1c40076e0, ecv-disperse-1 returned -1 error: Success
[2017-05-03 07:37:35.641466] D [MSGID: 0] [dht-common.c:1289:dht_lookup_linkfile_create_cbk] 0-ecv-dht: creation of linkto on hashed subvol:ecv-disperse-1, returned with op_ret 0 and op_errno 0: 4ca832d5-190c-4554-b2b3-2c20d66490a0
[2017-05-03 07:37:35.642079] D [MSGID: 0] [dht-rebalance.c:3167:gf_defrag_process_dir] 0-ecv-dht: added file:adf4350.h parent:/dir1/linux-4.10.4/include/linux/iio/frequency to the queue
[2017-05-03 07:37:35.643011] D [MSGID: 0] [defaults.c:228:default_fallocate_failure_cbk] 0-stack-trace: stack-address: 0x7fd1c40076e0, ecv-disperse-1 returned -1 error: Operation not supported [Operation not supported]
[2017-05-03 07:37:35.643291] E [MSGID: 109023] [dht-rebalance.c:740:__dht_rebalance_create_dst_file] 0-ecv-dht: fallocate failed for /dir1/linux-4.10.4/include/linux/smpboot.h on ecv-disperse-1 (Operation not supported)
[2017-05-03 07:37:35.643320] E [MSGID: 0] [dht-rebalance.c:1516:dht_migrate_file] 0-ecv-dht: Create dst failed on - ecv-disperse-1 for file - /dir1/linux-4.10.4/include/linux/smpboot.h
[2017-05-03 07:37:35.644545] I [dht-rebalance.c:3182:gf_defrag_process_dir] 0-ecv-dht: Migration operation on dir /dir1/linux-4.10.4/include/linux/iio/frequency took 0.03 secs
^C

[root@dhcp35-45 glusterfs]# gluster v rebal ecv status
        Node  Rebalanced-files        size     scanned    failures     skipped        status  run time in h:m:s
   ---------       -----------  ----------  ----------  ----------  ----------  ------------     --------------
   localhost                22    608Bytes       62411           0       31197     completed            0:10:58
10.70.35.130                 0      0Bytes           0           0           0     completed            0:04:59
10.70.35.122                 0      0Bytes           0           0           0     completed            0:04:57
 10.70.35.23                 0      0Bytes           0           0           0     completed            0:04:58
10.70.35.112                 0      0Bytes           0           0           0     completed            0:04:53
10.70.35.138                 0      0Bytes           0           0           0     completed            0:05:02
10.70.35.192                 0      0Bytes           0           0           0     completed            0:04:47
10.70.35.215                 0      0Bytes           0           0           0     completed            0:04:50
10.70.35.214                 0      0Bytes           0           0           0     completed            0:04:43
10.70.46.130                 0      0Bytes           0           0           0     completed            0:05:00
volume rebalance: ecv: success

[root@dhcp35-45 glusterfs]# gluster v status
Status of volume: ecv
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.45:/rhs/brick1/ecv           49156     0          Y       30438
Brick 10.70.35.130:/rhs/brick1/ecv          49152     0          Y       18964
Brick 10.70.35.122:/rhs/brick1/ecv          49152     0          Y       21342
Brick 10.70.35.23:/rhs/brick1/ecv           49152     0          Y       7287
Brick 10.70.35.112:/rhs/brick1/ecv          49152     0          Y       28214
Brick 10.70.35.138:/rhs/brick1/ecv          49152     0          Y       20003
Brick 10.70.35.192:/rhs/brick1/ecv          49152     0          Y       24601
Brick 10.70.35.215:/rhs/brick1/ecv          49152     0          Y       18902
Brick 10.70.35.214:/rhs/brick1/ecv          49152     0          Y       3516
Brick 10.70.46.130:/bricks/brick1/ecv       49152     0          Y       32099
Brick 10.70.35.45:/rhs/brick2/ecv           49157     0          Y       10117
Brick 10.70.35.130:/rhs/brick2/ecv          49153     0          Y       23504
Brick 10.70.35.122:/rhs/brick2/ecv          49153     0          Y       25969
Brick 10.70.35.23:/rhs/brick2/ecv           49153     0          Y       12033
Brick 10.70.35.112:/rhs/brick2/ecv          49153     0          Y       19079
Brick 10.70.35.138:/rhs/brick2/ecv          49153     0          Y       24002
Brick 10.70.35.192:/rhs/brick2/ecv          49153     0          Y       26986
Brick 10.70.35.215:/rhs/brick2/ecv          49153     0          Y       21366
Brick 10.70.35.214:/rhs/brick2/ecv          49153     0          Y       23114
Brick 10.70.46.130:/bricks/brick2/ecv       49153     0          Y       19940
Self-heal Daemon on localhost               N/A       N/A        Y       10137
Self-heal Daemon on 10.70.35.130            N/A       N/A        Y       23530
Self-heal Daemon on 10.70.35.122            N/A       N/A        Y       25995
Self-heal Daemon on 10.70.35.23             N/A       N/A        Y       12053
Self-heal Daemon on 10.70.35.215            N/A       N/A        Y       21386
Self-heal Daemon on 10.70.35.192            N/A       N/A        Y       27006
Self-heal Daemon on 10.70.35.214            N/A       N/A        Y       23140
Self-heal Daemon on 10.70.35.112            N/A       N/A        Y       19099
Self-heal Daemon on 10.70.35.138            N/A       N/A        Y       24022
Self-heal Daemon on 10.70.46.130            N/A       N/A        Y       19960

Task Status of Volume ecv
------------------------------------------------------------------------------
Task                 : Rebalance
ID                   : 6f574dcc-3019-4cfa-a8fd-5a6307e58989
Status               : completed

[root@dhcp35-45 glusterfs]# gluster v info

Volume Name: ecv
Type: Distributed-Disperse
Volume ID: ec0e53fb-8c99-41b6-bdcd-fa55dd33c429
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (8 + 2) = 20
Transport-type: tcp
Bricks:
Brick1: 10.70.35.45:/rhs/brick1/ecv
Brick2: 10.70.35.130:/rhs/brick1/ecv
Brick3: 10.70.35.122:/rhs/brick1/ecv
Brick4: 10.70.35.23:/rhs/brick1/ecv
Brick5: 10.70.35.112:/rhs/brick1/ecv
Brick6: 10.70.35.138:/rhs/brick1/ecv
Brick7: 10.70.35.192:/rhs/brick1/ecv
Brick8: 10.70.35.215:/rhs/brick1/ecv
Brick9: 10.70.35.214:/rhs/brick1/ecv
Brick10: 10.70.46.130:/bricks/brick1/ecv
Brick11: 10.70.35.45:/rhs/brick2/ecv
Brick12: 10.70.35.130:/rhs/brick2/ecv
Brick13: 10.70.35.122:/rhs/brick2/ecv
Brick14: 10.70.35.23:/rhs/brick2/ecv
Brick15: 10.70.35.112:/rhs/brick2/ecv
Brick16: 10.70.35.138:/rhs/brick2/ecv
Brick17: 10.70.35.192:/rhs/brick2/ecv
Brick18: 10.70.35.215:/rhs/brick2/ecv
Brick19: 10.70.35.214:/rhs/brick2/ecv
Brick20: 10.70.46.130:/bricks/brick2/ecv
Options Reconfigured:
diagnostics.client-log-level: DEBUG
transport.address-family: inet
nfs.disable: on

[root@dhcp35-45 glusterfs]# rpm -qa|grep gluster
glusterfs-api-3.8.4-24.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-24.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-server-3.8.4-24.el7rhgs.x86_64
glusterfs-3.8.4-24.el7rhgs.x86_64
glusterfs-events-3.8.4-24.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-libs-3.8.4-24.el7rhgs.x86_64
glusterfs-fuse-3.8.4-24.el7rhgs.x86_64
python-gluster-3.8.4-24.el7rhgs.noarch
glusterfs-rdma-3.8.4-24.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-devel-3.8.4-24.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-24.el7rhgs.x86_64
glusterfs-api-devel-3.8.4-24.el7rhgs.x86_64
glusterfs-cli-3.8.4-24.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-24.el7rhgs.x86_64
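With the client log level at DEBUG, the few entries that matter are buried among the getxattr and fsetxattr noise. The following is a minimal sketch, not part of GlusterFS, of how the entry header could be parsed to pick out the migration failures. The names `log_header`, `parse_log_header` and `is_migration_failure` are hypothetical, and the filter assumes that MSGID 109023 (the ID carried by the two dht-rebalance.c failure entries in the log above) is the one of interest.

```c
#include <stdio.h>

/* Parsed header of one GlusterFS log entry. */
struct log_header {
    char level;  /* severity letter as it appears in the log: D, I, N, W or E */
    long msgid;  /* MSGID value, or -1 when the entry carries none */
};

/*
 * Parse "[timestamp] L [MSGID: n] ..." into hdr.
 * Returns 0 on success, -1 if the line does not start like a log entry.
 */
static int parse_log_header(const char *line, struct log_header *hdr)
{
    char level = 0;
    long msgid = -1;

    /* "%*[^]]" skips the timestamp, i.e. everything up to the first ']'. */
    int n = sscanf(line, "[%*[^]]] %c [MSGID: %ld]", &level, &msgid);
    if (n < 1)
        return -1;

    hdr->level = level;
    hdr->msgid = (n == 2) ? msgid : -1;
    return 0;
}

/*
 * Return 1 if the entry is a warning or error with MSGID 109023.
 * Assumption: this ID is taken from the log in this report, not from
 * the GlusterFS message catalogue.
 */
static int is_migration_failure(const char *line)
{
    struct log_header hdr;

    if (parse_log_header(line, &hdr) != 0)
        return 0;
    return hdr.msgid == 109023 && (hdr.level == 'W' || hdr.level == 'E');
}
```

Fed line by line with the rebalance log, `is_migration_failure` would select only the "Migrate file failed" and "fallocate failed" entries. The "Create dst failed" entry is logged with MSGID 0 and would need a separate match.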
RCA:
The rebalance code was recently changed to use fallocate when creating the dst file. However, it looks like this is not supported on EC volumes:

    /* EC rejects every fallocate request with ENOTSUP. */
    int32_t ec_gf_fallocate(call_frame_t * frame, xlator_t * this, fd_t * fd,
                            int32_t keep_size, off_t offset, size_t len,
                            dict_t * xdata)
    {
        default_fallocate_failure_cbk(frame, ENOTSUP);

        return 0;
    }

With this change, rebalance will fail on EC volumes. Waiting to hear from the EC team on the reason for fallocate not being supported.
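To illustrate the failure mode in isolation, here is a small POSIX sketch of a caller that tolerates a backend refusing preallocation. It is not the rebalance code and not the fix that was merged (see the patches linked in this report). The names `prealloc_fn`, `reserve_or_extend` and `prealloc_unsupported` are hypothetical, and the fallback of extending the file with ftruncate() is only one possible policy, since it sets the size without reserving any blocks.

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* A preallocation routine: returns 0 on success, -1 with errno set. */
typedef int (*prealloc_fn)(int fd, off_t offset, off_t len);

/*
 * Try to reserve [offset, offset + len) through `prealloc`. If the backend
 * reports the operation as unsupported, fall back to extending the file
 * to offset + len (never shrinking it). Any other error is passed through.
 */
static int reserve_or_extend(int fd, off_t offset, off_t len,
                             prealloc_fn prealloc)
{
    struct stat st;

    if (prealloc(fd, offset, len) == 0)
        return 0;
    if (errno != ENOTSUP && errno != EOPNOTSUPP)
        return -1;

    if (fstat(fd, &st) != 0)
        return -1;
    if (st.st_size >= offset + len)
        return 0;               /* already large enough, nothing to do */
    return ftruncate(fd, offset + len);
}

/* Hypothetical stand-in for a backend that always refuses, as EC does above. */
static int prealloc_unsupported(int fd, off_t offset, off_t len)
{
    (void)fd;
    (void)offset;
    (void)len;
    errno = ENOTSUP;
    return -1;
}
```

Calling `reserve_or_extend(fd, 0, 4096, prealloc_unsupported)` on an empty file leaves it 4096 bytes long instead of failing. A caller that treats ENOTSUP as fatal, as the rebalance log above shows, aborts the whole migration at this point.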
Upstream Patch:
https://review.gluster.org/#/c/15200/ (master)
https://review.gluster.org/#/c/17369/ (3.11)

Downstream Patch:
https://code.engineering.redhat.com/gerrit/#/c/107051/
on_qa validation:
1) Not seeing the errors anymore
2) Rebalance passes
3) Replace brick passes
4) fallocate command works

Hence moving to verified.
Version of test: 3.8.4-28
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2017:2774