Description of problem:
======================
I was verifying BZ#1427159 when I hit this problem. Since the issue described in BZ#1427159 itself is fixed, I am raising a new BZ.

This problem is not seen when only one brick is brought down and back up (on a 4+2 volume).

While a file is being appended to continuously, bring down one brick. While the append is still running, bring down a second redundant brick and then bring the first brick back up. The heal never completes as long as the append keeps happening. Comparing the file's xattrs on the sink with those on the source shows that size, dirty and version never catch up with the source (there is always a slight lag).

To confirm this, disable server-side and client-side heal, stop the append, and then bring down another brick. Only 3 good bricks now remain (the first brick, as noted above, is not completely healed). If you now do a read or md5sum of the file (from a new client, as the old client might have cached data), you will hit an I/O error after some time.

Version-Release number of selected component (if applicable):
=========
3.8.4-35

How reproducible:
=============
always

Steps to Reproduce:
1. Create a 4+2 EC volume.
2. Keep appending to a file.
3. Bring down b1.
4. Wait for a minute or so and bring down b2.
5. After another minute or so, bring b1 back up.
6. Check the xattrs (use the watch command). b1 starts to get healed, but never catches up with the healthy bricks (there is always a difference in the xattr values) as long as the I/O is happening.
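The xattr comparison in step 6 can be sketched roughly as below. This is only an illustration, not the procedure used in the report: it assumes the xattrs were dumped with `getfattr -d -m . -e hex`, and it assumes `trusted.ec.size` holds one 64-bit big-endian integer while `trusted.ec.version` and `trusted.ec.dirty` each hold two 64-bit big-endian counters (data, metadata). The function names, the brick paths in the comments, and the sample values are hypothetical.

```python
import re

# Assumption (not stated in the report): these disperse xattrs are packed as
# 64-bit big-endian integers; size has one field, version and dirty have two
# (data counter, metadata counter).
EC_KEYS = ("trusted.ec.size", "trusted.ec.version", "trusted.ec.dirty")
XATTR_RE = re.compile(r"^(trusted\.ec\.\w+)=0x([0-9a-fA-F]+)$")


def parse_getfattr_hex(text):
    """Turn `getfattr -d -m . -e hex` output into {xattr name: tuple of ints}."""
    result = {}
    for line in text.splitlines():
        match = XATTR_RE.match(line.strip())
        if not match or match.group(1) not in EC_KEYS:
            continue  # skips the "# file:" header and unrelated xattrs
        raw = bytes.fromhex(match.group(2))
        result[match.group(1)] = tuple(
            int.from_bytes(raw[i:i + 8], "big") for i in range(0, len(raw), 8)
        )
    return result


def lag(source, sink):
    """Per-field difference (source minus sink) for every xattr present on both."""
    return {
        key: tuple(s - k for s, k in zip(source[key], sink[key]))
        for key in EC_KEYS
        if key in source and key in sink
    }


# Hypothetical sample dumps: a healthy brick (source) and the healing brick (sink).
SOURCE_DUMP = """# file: bricks/healthy/file
trusted.ec.dirty=0x00000000000000000000000000000000
trusted.ec.size=0x0000000000002000
trusted.ec.version=0x000000000000000a0000000000000001
"""
SINK_DUMP = """# file: bricks/healing/file
trusted.ec.dirty=0x00000000000000010000000000000000
trusted.ec.size=0x0000000000001000
trusted.ec.version=0x00000000000000070000000000000001
"""

difference = lag(parse_getfattr_hex(SOURCE_DUMP), parse_getfattr_hex(SINK_DUMP))
```

If the brick had caught up, every field in `difference` would be zero. In the scenario described here, the size and version fields would be expected to stay non-zero for as long as the append is running.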
shd log after b1 was brought back up:

[2017-07-27 09:26:56.884820] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-tv-client-5: Connected to tv-client-5, attached to remote volume '/rhs/brick2/ec'.
[2017-07-27 09:26:56.884832] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-tv-client-5: Server and Client lk-version numbers are not same, reopening the fds
[2017-07-27 09:26:56.884934] I [MSGID: 122061] [ec.c:323:ec_up] 0-tv-disperse-0: Going UP
[2017-07-27 09:26:56.885071] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-tv-client-5: Server lk version = 1
[2017-07-27 09:27:47.861249] I [glusterfsd-mgmt.c:54:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-07-27 09:27:47.877049] I [glusterfsd-mgmt.c:54:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-07-27 09:27:47.879429] I [glusterfsd-mgmt.c:1823:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2017-07-27 09:30:56.006812] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331304960-331300352, mode: 100644-100644)
[2017-07-27 09:30:56.006858] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.007054] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331300352-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.007535] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331300864-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.007556] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331300864-331300352, mode: 100644-100644)
The message "N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'" repeated 4 times between [2017-07-27 09:30:56.006858] and [2017-07-27 09:30:56.007564]
[2017-07-27 09:30:56.008981] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331305984-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.009051] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.009104] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331305984-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.009120] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.009169] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331304960-331305984, mode: 100644-100644)
[2017-07-27 09:30:56.009185] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.009400] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331304960-331305984, mode: 100644-100644)
[2017-07-27 09:30:56.009418] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.009519] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331305984-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.009534] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:31:56.020590] W [MSGID: 114031] [client-rpc-fops.c:2940:client3_3_lookup_cbk] 0-tv-client-2: remote operation failed. Path: <gfid:8e769011-8db5-4f7a-b886-c77de542ca83> (8e769011-8db5-4f7a-b886-c77de542ca83) [No such file or directory]
[2017-07-27 09:31:56.020668] W [MSGID: 114031] [client-rpc-fops.c:2940:client3_3_lookup_cbk] 0-tv-client-4: remote operation failed. Path: <gfid:8e769011-8db5-4f7a-b886-c77de542ca83> (8e769011-8db5-4f7a-b886-c77de542ca83) [No such file or directory]
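To make the size mismatches in the shd log above easier to read, the `size: A-B` pairs from the "Failed to combine iatt" warnings can be pulled out with a short script. This is a rough sketch only: the function name is hypothetical, and the regular expression is written against the message format seen in this log, which may differ in other releases.

```python
import re

# Matches the "size: A-B" pair inside a "Failed to combine iatt (...)" warning.
# [^)]* keeps the match inside the parenthesised iatt dump, even if the log
# text has been wrapped across lines.
SIZE_RE = re.compile(r"Failed to combine iatt \([^)]*?size: (\d+)-(\d+)")


def size_mismatches(log_text):
    """Return (size_a, size_b, absolute difference) for every iatt warning."""
    mismatches = []
    for match in SIZE_RE.finditer(log_text):
        size_a, size_b = int(match.group(1)), int(match.group(2))
        mismatches.append((size_a, size_b, abs(size_a - size_b)))
    return mismatches


# One warning copied from the shd log in this report.
SAMPLE = (
    "[2017-07-27 09:30:56.006812] W [MSGID: 122006] "
    "[ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt "
    "(inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, "
    "gid: 0-0, rdev: 0-0, size: 331304960-331300352, mode: 100644-100644)"
)

pairs = size_mismatches(SAMPLE)
```

For the sample line, the two sizes in the answers being combined differ by 4608 bytes. Running the same function over the whole shd log lists every pair, which makes it easy to see the sizes moving while the append continues.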
sosreports and logs @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1475789
This issue is fixed by: https://review.gluster.org/#/c/16772/
I verified the fix on my test machine.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:2607