Bug 1475789

Summary:          As long as appends keep happening on a file, healing never
                  completes on a brick when another brick is brought down in between
Product:          [Red Hat Storage] Red Hat Gluster Storage
Component:        disperse
Reporter:         Nag Pavan Chilakam <nchilaka>
Assignee:         Sunil Kumar Acharya <sheggodu>
QA Contact:       Upasana <ubansal>
Status:           CLOSED ERRATA
Severity:         high
Priority:         high
Version:          rhgs-3.3
Target Release:   RHGS 3.4.0
Hardware:         Unspecified
OS:               Unspecified
Whiteboard:       rebase
Fixed In Version: glusterfs-3.12.2-1
Doc Type:         If docs needed, set a value
Type:             Bug
Last Closed:      2018-09-04 06:34:23 UTC
CC:               amukherj, aspandey, bturner, jbyers, rabhat, rhinduja,
                  rhs-bugs, sheggodu, storage-qa-internal
Bug Blocks:       1503134

Description Nag Pavan Chilakam 2017-07-27 11:26:41 UTC
Description of problem:
======================
I was verifying BZ#1427159 when I hit this problem.
However, given that the issue mentioned in BZ#1427159 is fixed, I am raising a new bz.

This problem is not seen when we bring down and bring up only one brick.


(talking about 4+2)
When we keep appending to a file and bring down one brick, then, while the append is still going on, bring down another redundant brick and bring the first brick back up, the heal never completes as long as the append keeps happening.

We can see the xattrs for the file on the sink brick and compare them with the source bricks: the size, dirty, and version xattrs never catch up with the source (there is always a slight lag). A sketch of the comparison follows below.
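For reference, a minimal sketch of that comparison (the file name appendfile is a placeholder; the brick path matches this setup, so adjust it to your own layout):

    # On each brick node, dump the EC xattrs of the file being appended to;
    # trusted.ec.version, trusted.ec.size and trusted.ec.dirty are the values
    # that lag behind on the sink brick.
    getfattr -d -m trusted.ec. -e hex /rhs/brick2/ec/appendfile

    # Run it under watch on a healthy brick and on the healing brick side by
    # side; on the sink the three values stay slightly behind the source.
    watch -n 1 "getfattr -d -m trusted.ec. -e hex /rhs/brick2/ec/appendfile"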


Also, to confirm the same, disable server-side and client-side heal, stop the append, and then bring down another brick, which means there are now only 3 good bricks (since, as noted above, the first brick is not completely healed). If we now do a read or md5sum (from a new client, as the old client might have cached data), it can be seen that after some time you will hit an IO error. A scripted sketch follows below.
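A rough sketch of that verification (the volume name tv comes from the logs below; server1 and appendfile are placeholders, and the exact heal option names can vary across versions, so treat these as illustrative):

    # Disable the self-heal daemon and client-side background heals.
    gluster volume heal tv disable
    gluster volume set tv disperse.background-heals 0

    # Stop the append, then kill a third brick: only 3 fully good bricks
    # remain, below the 4 fragments a 4+2 volume needs to reconstruct data.

    # Read from a fresh mount to avoid client-side caching.
    mount -t glusterfs server1:/tv /mnt/tv-new
    md5sum /mnt/tv-new/appendfile    # eventually fails with an IO error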

Version-Release number of selected component (if applicable):
=========
3.8.4-35

How reproducible:
=============
always

Steps to Reproduce:
1. Create a 4+2 EC volume.
2. Keep appending to a file.
3. Bring down b1.
4. Wait for a minute or so and bring down b2.
5. After another minute or so, bring b1 back up.
6. Check the xattrs (use the watch command): it can be seen that b1 starts to get healed but never catches up with the other healthy bricks (there is always a difference in the xattr values) as long as the IO is happening. A scripted version of these steps follows below.
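A rough scripted version of the above (host names, brick PIDs and the file name are placeholders; bricks are brought down by killing the brick processes listed in `gluster volume status tv`):

    # 1. Create, start and mount a 4+2 disperse volume.
    gluster volume create tv disperse-data 4 redundancy 2 \
        server{1..6}:/rhs/brick2/ec force
    gluster volume start tv
    mount -t glusterfs server1:/tv /mnt/tv

    # 2. Keep appending to a file in the background.
    while true; do echo data >> /mnt/tv/appendfile; done &

    # 3-5. Kill b1's brick process, wait, kill b2's, wait, restart b1.
    kill -9 <b1-brick-pid>
    sleep 60
    kill -9 <b2-brick-pid>
    sleep 60
    gluster volume start tv force    # note: this restarts *all* downed
    kill -9 <b2-brick-pid>           # bricks, so kill b2 again

    # 6. Compare the EC xattrs as shown in the description above.
    watch -n 1 "getfattr -d -m trusted.ec. -e hex /rhs/brick2/ec/appendfile"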

Comment 2 Nag Pavan Chilakam 2017-07-27 11:30:47 UTC
shd log from after b1 was brought back up


[2017-07-27 09:26:56.884820] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-tv-client-5: Connected to tv-client-5, attached to remote volume '/rhs/brick2/ec'.
[2017-07-27 09:26:56.884832] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-tv-client-5: Server and Client lk-version numbers are not same, reopening the fds
[2017-07-27 09:26:56.884934] I [MSGID: 122061] [ec.c:323:ec_up] 0-tv-disperse-0: Going UP
[2017-07-27 09:26:56.885071] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-tv-client-5: Server lk version = 1
[2017-07-27 09:27:47.861249] I [glusterfsd-mgmt.c:54:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-07-27 09:27:47.877049] I [glusterfsd-mgmt.c:54:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-07-27 09:27:47.879429] I [glusterfsd-mgmt.c:1823:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2017-07-27 09:30:56.006812] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331304960-331300352, mode: 100644-100644)
[2017-07-27 09:30:56.006858] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.007054] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331300352-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.007535] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331300864-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.007556] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331300864-331300352, mode: 100644-100644)
The message "N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'" repeated 4 times between [2017-07-27 09:30:56.006858] and [2017-07-27 09:30:56.007564]
[2017-07-27 09:30:56.008981] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331305984-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.009051] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.009104] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331305984-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.009120] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.009169] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331304960-331305984, mode: 100644-100644)
[2017-07-27 09:30:56.009185] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.009400] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331304960-331305984, mode: 100644-100644)
[2017-07-27 09:30:56.009418] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.009519] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331305984-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.009534] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:31:56.020590] W [MSGID: 114031] [client-rpc-fops.c:2940:client3_3_lookup_cbk] 0-tv-client-2: remote operation failed. Path: <gfid:8e769011-8db5-4f7a-b886-c77de542ca83> (8e769011-8db5-4f7a-b886-c77de542ca83) [No such file or directory]
[2017-07-27 09:31:56.020668] W [MSGID: 114031] [client-rpc-fops.c:2940:client3_3_lookup_cbk] 0-tv-client-4: remote operation failed. Path: <gfid:8e769011-8db5-4f7a-b886-c77de542ca83> (8e769011-8db5-4f7a-b886-c77de542ca83) [No such file or directory]

Comment 3 Nag Pavan Chilakam 2017-07-27 11:46:18 UTC
Sosreports and logs @ http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1475789

Comment 5 Sunil Kumar Acharya 2017-09-14 11:13:20 UTC
This issue is fixed by: https://review.gluster.org/#/c/16772/
Verified on my test machine.

Comment 15 errata-xmlrpc 2018-09-04 06:34:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607