Bug 1447559 - Seeing Input/Output error in rebalance logs during a rebalance on an ec volume(log level=Debug)
Summary: Seeing Input/Output error in rebalance logs during a rebalance on an ec volum...
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat Storage
Component: disperse
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: medium
Target Milestone: ---
Target Release: RHGS 3.3.0
Assignee: Sunil Kumar Acharya
QA Contact: Nag Pavan Chilakam
URL:
Whiteboard:
Depends On:
Blocks: 1417151 1448293 1448307 1454686 1455241
 
Reported: 2017-05-03 08:08 UTC by Nag Pavan Chilakam
Modified: 2017-09-21 04:41 UTC (History)

Fixed In Version: glusterfs-3.8.4-26
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1448293 1448307 (view as bug list)
Environment:
Last Closed: 2017-09-21 04:41:45 UTC
Embargoed:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Product Errata RHBA-2017:2774 0 normal SHIPPED_LIVE glusterfs bug fix and enhancement update 2017-09-21 08:16:29 UTC

Description Nag Pavan Chilakam 2017-05-03 08:08:01 UTC
Description of problem:
=======================
I am seeing EIO (Input/output error) in the rebalance log after triggering a rebalance following add-brick.

Note: I have a 10-node cluster and the bricks are of different sizes.

DHT subvol1:

DHT subvol || Nodes       || Brick size ||

subvol1    || 1,2,3,4,5,6 || 10GB
subvol1    || 7,8,9       || 195GB
subvol1    || 10          || 8.1GB

Note: data size is 6.7GB on each brick (first subvol).

Subvol2 has the same brick sizes as above.

Mount of the volume (checking size):
10.70.35.215:ecv        129G   54G   75G  42% /mnt/ecv




I checked a couple of the files that hit EIO and found that the T (linkto) file was actually created on all the dst bricks, but the files themselves were not migrated. (That part is understandable: the dst bricks are not bigger than the source, so migration would require force.)

Example for one file:
[root@dhcp35-214 ~]# ll /*/brick*/ecv//dir1/linux-4.10.4/include/linux/smsc911x.h
-rw-rw-r--. 2 root root 512 Mar 18 16:49 /rhs/brick1/ecv/dir1/linux-4.10.4/include/linux/smsc911x.h
---------T. 2 root root   0 May  3 13:07 /rhs/brick2/ecv/dir1/linux-4.10.4/include/linux/smsc911x.h


Version-Release number of selected component (if applicable):
======
3.8.4-24

How reproducible:


Steps to Reproduce:
1. Created a 1x(8+2) EC volume and created various files from the FUSE mount.
2. Added one more distribute subvol to this volume to make it 2x(8+2) (no I/O in progress).
3. Set the log level to DEBUG for testing purposes (no I/O in progress).
4. Triggered rebalance.

I am seeing the EIO logs below.



Actual results:

[2017-05-03 07:37:35.638821] D [MSGID: 0] [dict.c:161:key_value_cmp] 0-ecv-disperse-0: 'security.selinux' is different in two dicts (33, 38)
[2017-05-03 07:37:35.638860] N [MSGID: 122031] [ec-inode-read.c:199:ec_combine_getxattr] 0-ecv-disperse-0: Mismatching dictionary in answers of 'GF_FOP_GETXATTR'
[2017-05-03 07:37:35.638909] D [MSGID: 0] [dict.c:161:key_value_cmp] 0-ecv-disperse-0: 'security.selinux' is different in two dicts (33, 38)
[2017-05-03 07:37:35.638927] N [MSGID: 122031] [ec-inode-read.c:199:ec_combine_getxattr] 0-ecv-disperse-0: Mismatching dictionary in answers of 'GF_FOP_GETXATTR'
[2017-05-03 07:37:35.638975] D [MSGID: 0] [dict.c:161:key_value_cmp] 0-ecv-disperse-0: 'security.selinux' is different in two dicts (33, 38)
[2017-05-03 07:37:35.638993] N [MSGID: 122031] [ec-inode-read.c:199:ec_combine_getxattr] 0-ecv-disperse-0: Mismatching dictionary in answers of 'GF_FOP_GETXATTR'
[2017-05-03 07:37:35.639038] D [MSGID: 0] [dict.c:161:key_value_cmp] 0-ecv-disperse-0: 'security.selinux' is different in two dicts (33, 38)
[2017-05-03 07:37:35.639056] N [MSGID: 122031] [ec-inode-read.c:199:ec_combine_getxattr] 0-ecv-disperse-0: Mismatching dictionary in answers of 'GF_FOP_GETXATTR'
[2017-05-03 07:37:35.639219] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-12: remote operation failed
[2017-05-03 07:37:35.639249] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-12 returned -1 error: Success
[2017-05-03 07:37:35.639254] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-17: remote operation failed
[2017-05-03 07:37:35.639321] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-17 returned -1 error: Success
[2017-05-03 07:37:35.639328] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-15: remote operation failed
[2017-05-03 07:37:35.639346] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-15 returned -1 error: Success
[2017-05-03 07:37:35.639352] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-14: remote operation failed
[2017-05-03 07:37:35.639398] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-14 returned -1 error: Success
[2017-05-03 07:37:35.639429] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-16: remote operation failed
[2017-05-03 07:37:35.639399] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-13: remote operation failed
[2017-05-03 07:37:35.639459] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-13 returned -1 error: Success
[2017-05-03 07:37:35.639461] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-16 returned -1 error: Success
[2017-05-03 07:37:35.639512] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-18: remote operation failed
[2017-05-03 07:37:35.639528] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-18 returned -1 error: Success
[2017-05-03 07:37:35.639619] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-10: remote operation failed
[2017-05-03 07:37:35.639637] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-10 returned -1 error: Success
[2017-05-03 07:37:35.639913] D [MSGID: 0] [dict.c:161:key_value_cmp] 0-ecv-disperse-0: 'security.selinux' is different in two dicts (33, 38)
[2017-05-03 07:37:35.639950] N [MSGID: 122031] [ec-inode-read.c:199:ec_combine_getxattr] 0-ecv-disperse-0: Mismatching dictionary in answers of 'GF_FOP_GETXATTR'
[2017-05-03 07:37:35.639979] D [MSGID: 0] [defaults.c:1346:default_getxattr_cbk] 0-stack-trace: stack-address: 0x7fd1c8028440, ecv-disperse-0 returned -1 error: Input/output error [Input/output error]
[2017-05-03 07:37:35.640047] W [MSGID: 109023] [dht-rebalance.c:1507:dht_migrate_file] 0-ecv-dht: Migrate file failed:/dir1/linux-4.10.4/include/linux/smsc911x.h: failed to get xattr from ecv-disperse-0 (Input/output error)
[2017-05-03 07:37:35.640631] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-11: remote operation failed
[2017-05-03 07:37:35.640691] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-11 returned -1 error: Success
[2017-05-03 07:37:35.641331] W [MSGID: 114031] [client-rpc-fops.c:1893:client3_3_fsetxattr_cbk] 0-ecv-client-19: remote operation failed
[2017-05-03 07:37:35.641365] D [MSGID: 0] [client-rpc-fops.c:1897:client3_3_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1d807c7c0, ecv-client-19 returned -1 error: Success
[2017-05-03 07:37:35.641384] D [MSGID: 0] [defaults.c:1092:default_fsetxattr_cbk] 0-stack-trace: stack-address: 0x7fd1c40076e0, ecv-disperse-1 returned -1 error: Success
[2017-05-03 07:37:35.641466] D [MSGID: 0] [dht-common.c:1289:dht_lookup_linkfile_create_cbk] 0-ecv-dht: creation of linkto on hashed subvol:ecv-disperse-1, returned with op_ret 0 and op_errno 0: 4ca832d5-190c-4554-b2b3-2c20d66490a0
[2017-05-03 07:37:35.642079] D [MSGID: 0] [dht-rebalance.c:3167:gf_defrag_process_dir] 0-ecv-dht: added file:adf4350.h parent:/dir1/linux-4.10.4/include/linux/iio/frequency to the queue 
[2017-05-03 07:37:35.643011] D [MSGID: 0] [defaults.c:228:default_fallocate_failure_cbk] 0-stack-trace: stack-address: 0x7fd1c40076e0, ecv-disperse-1 returned -1 error: Operation not supported [Operation not supported]
[2017-05-03 07:37:35.643291] E [MSGID: 109023] [dht-rebalance.c:740:__dht_rebalance_create_dst_file] 0-ecv-dht: fallocate failed for /dir1/linux-4.10.4/include/linux/smpboot.h on ecv-disperse-1 (Operation not supported)
[2017-05-03 07:37:35.643320] E [MSGID: 0] [dht-rebalance.c:1516:dht_migrate_file] 0-ecv-dht: Create dst failed on - ecv-disperse-1 for file - /dir1/linux-4.10.4/include/linux/smpboot.h
[2017-05-03 07:37:35.644545] I [dht-rebalance.c:3182:gf_defrag_process_dir] 0-ecv-dht: Migration operation on dir /dir1/linux-4.10.4/include/linux/iio/frequency took 0.03 secs
^C





[root@dhcp35-45 glusterfs]# gluster v rebal ecv status
                                    Node Rebalanced-files          size       scanned      failures       skipped               status  run time in h:m:s
                               ---------      -----------   -----------   -----------   -----------   -----------         ------------     --------------
                               localhost               22      608Bytes         62411             0         31197            completed        0:10:58
                            10.70.35.130                0        0Bytes             0             0             0            completed        0:04:59
                            10.70.35.122                0        0Bytes             0             0             0            completed        0:04:57
                             10.70.35.23                0        0Bytes             0             0             0            completed        0:04:58
                            10.70.35.112                0        0Bytes             0             0             0            completed        0:04:53
                            10.70.35.138                0        0Bytes             0             0             0            completed        0:05:02
                            10.70.35.192                0        0Bytes             0             0             0            completed        0:04:47
                            10.70.35.215                0        0Bytes             0             0             0            completed        0:04:50
                            10.70.35.214                0        0Bytes             0             0             0            completed        0:04:43
                            10.70.46.130                0        0Bytes             0             0             0            completed        0:05:00
volume rebalance: ecv: success
[root@dhcp35-45 glusterfs]# gluster v status
Status of volume: ecv
Gluster process                             TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick 10.70.35.45:/rhs/brick1/ecv           49156     0          Y       30438
Brick 10.70.35.130:/rhs/brick1/ecv          49152     0          Y       18964
Brick 10.70.35.122:/rhs/brick1/ecv          49152     0          Y       21342
Brick 10.70.35.23:/rhs/brick1/ecv           49152     0          Y       7287 
Brick 10.70.35.112:/rhs/brick1/ecv          49152     0          Y       28214
Brick 10.70.35.138:/rhs/brick1/ecv          49152     0          Y       20003
Brick 10.70.35.192:/rhs/brick1/ecv          49152     0          Y       24601
Brick 10.70.35.215:/rhs/brick1/ecv          49152     0          Y       18902
Brick 10.70.35.214:/rhs/brick1/ecv          49152     0          Y       3516 
Brick 10.70.46.130:/bricks/brick1/ecv       49152     0          Y       32099
Brick 10.70.35.45:/rhs/brick2/ecv           49157     0          Y       10117
Brick 10.70.35.130:/rhs/brick2/ecv          49153     0          Y       23504
Brick 10.70.35.122:/rhs/brick2/ecv          49153     0          Y       25969
Brick 10.70.35.23:/rhs/brick2/ecv           49153     0          Y       12033
Brick 10.70.35.112:/rhs/brick2/ecv          49153     0          Y       19079
Brick 10.70.35.138:/rhs/brick2/ecv          49153     0          Y       24002
Brick 10.70.35.192:/rhs/brick2/ecv          49153     0          Y       26986
Brick 10.70.35.215:/rhs/brick2/ecv          49153     0          Y       21366
Brick 10.70.35.214:/rhs/brick2/ecv          49153     0          Y       23114
Brick 10.70.46.130:/bricks/brick2/ecv       49153     0          Y       19940
Self-heal Daemon on localhost               N/A       N/A        Y       10137
Self-heal Daemon on 10.70.35.130            N/A       N/A        Y       23530
Self-heal Daemon on 10.70.35.122            N/A       N/A        Y       25995
Self-heal Daemon on 10.70.35.23             N/A       N/A        Y       12053
Self-heal Daemon on 10.70.35.215            N/A       N/A        Y       21386
Self-heal Daemon on 10.70.35.192            N/A       N/A        Y       27006
Self-heal Daemon on 10.70.35.214            N/A       N/A        Y       23140
Self-heal Daemon on 10.70.35.112            N/A       N/A        Y       19099
Self-heal Daemon on 10.70.35.138            N/A       N/A        Y       24022
Self-heal Daemon on 10.70.46.130            N/A       N/A        Y       19960
 
Task Status of Volume ecv
------------------------------------------------------------------------------
Task                 : Rebalance           
ID                   : 6f574dcc-3019-4cfa-a8fd-5a6307e58989
Status               : completed           
 
[root@dhcp35-45 glusterfs]# gluster v info
 
Volume Name: ecv
Type: Distributed-Disperse
Volume ID: ec0e53fb-8c99-41b6-bdcd-fa55dd33c429
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (8 + 2) = 20
Transport-type: tcp
Bricks:
Brick1: 10.70.35.45:/rhs/brick1/ecv
Brick2: 10.70.35.130:/rhs/brick1/ecv
Brick3: 10.70.35.122:/rhs/brick1/ecv
Brick4: 10.70.35.23:/rhs/brick1/ecv
Brick5: 10.70.35.112:/rhs/brick1/ecv
Brick6: 10.70.35.138:/rhs/brick1/ecv
Brick7: 10.70.35.192:/rhs/brick1/ecv
Brick8: 10.70.35.215:/rhs/brick1/ecv
Brick9: 10.70.35.214:/rhs/brick1/ecv
Brick10: 10.70.46.130:/bricks/brick1/ecv
Brick11: 10.70.35.45:/rhs/brick2/ecv
Brick12: 10.70.35.130:/rhs/brick2/ecv
Brick13: 10.70.35.122:/rhs/brick2/ecv
Brick14: 10.70.35.23:/rhs/brick2/ecv
Brick15: 10.70.35.112:/rhs/brick2/ecv
Brick16: 10.70.35.138:/rhs/brick2/ecv
Brick17: 10.70.35.192:/rhs/brick2/ecv
Brick18: 10.70.35.215:/rhs/brick2/ecv
Brick19: 10.70.35.214:/rhs/brick2/ecv
Brick20: 10.70.46.130:/bricks/brick2/ecv
Options Reconfigured:
diagnostics.client-log-level: DEBUG
transport.address-family: inet
nfs.disable: on
[root@dhcp35-45 glusterfs]# rpm -qa|grep gluster
glusterfs-api-3.8.4-24.el7rhgs.x86_64
glusterfs-geo-replication-3.8.4-24.el7rhgs.x86_64
vdsm-gluster-4.17.33-1.1.el7rhgs.noarch
glusterfs-server-3.8.4-24.el7rhgs.x86_64
glusterfs-3.8.4-24.el7rhgs.x86_64
glusterfs-events-3.8.4-24.el7rhgs.x86_64
gluster-nagios-addons-0.2.8-1.el7rhgs.x86_64
glusterfs-libs-3.8.4-24.el7rhgs.x86_64
glusterfs-fuse-3.8.4-24.el7rhgs.x86_64
python-gluster-3.8.4-24.el7rhgs.noarch
glusterfs-rdma-3.8.4-24.el7rhgs.x86_64
gluster-nagios-common-0.2.4-1.el7rhgs.noarch
glusterfs-devel-3.8.4-24.el7rhgs.x86_64
glusterfs-client-xlators-3.8.4-24.el7rhgs.x86_64
glusterfs-api-devel-3.8.4-24.el7rhgs.x86_64
glusterfs-cli-3.8.4-24.el7rhgs.x86_64
glusterfs-debuginfo-3.8.4-24.el7rhgs.x86_64

Comment 2 Nithya Balachandran 2017-05-04 17:03:11 UTC
RCA: 

The rebalance code was recently changed to use fallocate when creating the dst file. However, it looks like fallocate is not supported on EC volumes:


int32_t ec_gf_fallocate(call_frame_t * frame, xlator_t * this, fd_t * fd,
                        int32_t keep_size, off_t offset, size_t len,
                        dict_t * xdata)
{
    default_fallocate_failure_cbk(frame, ENOTSUP);

    return 0;
}


With this change, rebalance will fail on EC volumes.
Waiting to hear from the EC team on the reason for fallocate not being supported.
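
The failure mode described above, a destination that answers fallocate with "Operation not supported", can be reproduced in miniature outside GlusterFS. The following is a minimal sketch on plain POSIX/Linux file descriptors. It is not the rebalance code and not what the patches linked in this bug do. The names `preallocate_or_truncate` and `file_size` are hypothetical. It only shows a caller that treats an unsupported fallocate as non-fatal and falls back to setting the file size.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper (not GlusterFS code): try to reserve `len` bytes for
 * fd. If the underlying filesystem does not support fallocate(), fall back
 * to ftruncate(), which sets the size but does not reserve blocks. */
int preallocate_or_truncate(int fd, off_t len)
{
    assert(len > 0);

    if (fallocate(fd, 0, 0, len) == 0)
        return 0;

    /* Any error other than "not supported" is a real failure. */
    if (errno != EOPNOTSUPP && errno != ENOSYS)
        return -1;

    return ftruncate(fd, len);
}

/* Returns the current size of fd in bytes, or -1 on error. */
off_t file_size(int fd)
{
    struct stat st;

    return fstat(fd, &st) == 0 ? st.st_size : -1;
}
```

The trade-off in this sketch is that the fallback gives up the space guarantee: a later write can still fail with ENOSPC, which matters when destination bricks are small.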

Comment 3 Atin Mukherjee 2017-05-08 08:07:00 UTC
upstream patch : https://review.gluster.org/#/c/15200/

Comment 6 Sunil Kumar Acharya 2017-05-23 11:02:55 UTC
Upstream Patch:
https://review.gluster.org/#/c/15200/ (master)
https://review.gluster.org/#/c/17369/ (3.11)

Downstream Patch:
https://code.engineering.redhat.com/gerrit/#/c/107051/

Comment 8 Nag Pavan Chilakam 2017-06-19 12:04:30 UTC
ON_QA validation:
1) Not seeing the errors anymore
2) Rebalance passes
3) Replace-brick passes
4) The fallocate command works
Hence moving to VERIFIED.

Version tested: 3.8.4-28

Comment 10 errata-xmlrpc 2017-09-21 04:41:45 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:2774

