Bug 1475789 - As long as appends keep happening on a file healing never completes on a brick when another brick is brought down in between
Status: MODIFIED
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: disperse
Version: 3.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Assigned To: Sunil Kumar Acharya
QA Contact: nchilaka
rebase
Reported: 2017-07-27 07:26 EDT by nchilaka
Modified: 2017-10-17 09:28 EDT (History)
Type: Bug
Description nchilaka 2017-07-27 07:26:41 EDT
Description of problem:
======================
I hit this problem while verifying BZ#1427159.
Since the issue described in BZ#1427159 itself is fixed, I am raising a new BZ for this one.

This problem is not seen when only one brick is brought down and brought back up.

On a 4+2 volume: while a file is being appended to continuously, bring down one brick. While the append is still going on, bring down a second redundant brick and then bring the first brick back up. The heal of the first brick never completes as long as the append keeps happening.

The xattrs for the file on the sink can be compared with those on the source: size, dirty, and version never catch up with the source (there is always a slight lag).
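A minimal sketch of how the sink/source comparison could be scripted, assuming the xattrs were dumped on each brick with `getfattr -d -m . -e hex <file-on-brick>`. The byte layout used here is an assumption: trusted.ec.size is treated as one big-endian 64-bit value, and trusted.ec.version and trusted.ec.dirty as two big-endian 64-bit counters each. The function names and the sample values in the usage note are hypothetical, not taken from this setup.

```python
import struct


def parse_getfattr(text):
    """Parse `getfattr -d -m . -e hex` output into {xattr name: raw bytes}."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "# file: ..." header line.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if value.startswith("0x"):
            attrs[name] = bytes.fromhex(value[2:])
    return attrs


def decode_ec(attrs):
    """Decode the EC xattrs (layout is an assumption, see the text above)."""
    out = {}
    if "trusted.ec.size" in attrs:
        out["size"] = struct.unpack(">Q", attrs["trusted.ec.size"])[0]
    for key in ("version", "dirty"):
        raw = attrs.get("trusted.ec." + key)
        if raw is not None:
            out[key] = struct.unpack(">QQ", raw)
    return out


def lag(source, sink):
    """Return {field: (source value, sink value)} for every differing field."""
    return {k: (v, sink.get(k)) for k, v in source.items() if v != sink.get(k)}
```

Calling `lag(decode_ec(parse_getfattr(src_dump)), decode_ec(parse_getfattr(sink_dump)))` repeatedly while the append is running would show whether the sink ever reaches an empty difference.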


To confirm this, disable the server-side and client-side heal, stop the append, and then bring down another brick. This leaves only 3 good bricks, since the first brick, as noted above, is not completely healed. If we now do a read or md5sum of the file (from a new client, as the old client might have cached data), an I/O error is hit after some time.
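The brick arithmetic behind that check can be written out as a small sketch. It assumes only the general property of a dispersed volume that a read needs at least as many good bricks as there are data bricks; the function name is hypothetical.

```python
def readable(data_bricks, redundancy, unusable):
    """Return True if enough good bricks remain to serve a read.

    `unusable` counts bricks that are down or not fully healed.
    """
    total = data_bricks + redundancy
    good = total - unusable
    return good >= data_bricks


# 4+2 volume with b2 down and b1 not fully healed: 4 good bricks remain.
print(readable(4, 2, 2))
# One more brick brought down: only 3 good bricks remain.
print(readable(4, 2, 3))
```

With three unusable bricks the volume drops below the four good bricks a 4+2 layout needs, which matches the I/O error seen on read.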

Version-Release number of selected component (if applicable):
=========
3.8.4-35

How reproducible:
=============
always

Steps to Reproduce:
1. Create a 4+2 EC volume.
2. Keep appending to a file.
3. Bring down b1.
4. Wait for a minute or so and bring down b2.
5. After another minute or so, bring b1 back up.
6. Check the xattrs (use the watch command). b1 starts to get healed but never catches up with the other healthy bricks (there is always a difference in the xattr values) as long as the I/O is happening.
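For step 2, a minimal sketch of a continuous appender is shown below. It is not the tool used in this report; the function names, chunk size, and interval are hypothetical, and the path passed in would be a file on the client mount of the volume. The `demo` helper only exercises the loop against a local temporary file.

```python
import os
import tempfile
import time


def append_forever(path, chunk=b"x" * 4096, interval=0.0, max_writes=None):
    """Append `chunk` to `path` repeatedly and return the number of writes.

    With max_writes=None the loop runs until the process is interrupted.
    """
    writes = 0
    with open(path, "ab") as f:
        while max_writes is None or writes < max_writes:
            f.write(chunk)
            f.flush()
            os.fsync(f.fileno())  # push each append down to the file system
            writes += 1
            if interval:
                time.sleep(interval)
    return writes


def demo():
    """Run a bounded append on a temporary local file and return its size."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "appendfile")
        append_forever(path, chunk=b"abcd", max_writes=3)
        return os.path.getsize(path)
```

Running `append_forever` with `max_writes=None` against a file on the mount keeps the append going through steps 3 to 6.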
Comment 2 nchilaka 2017-07-27 07:30:47 EDT
shd log from after b1 was brought back up:


[2017-07-27 09:26:56.884820] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-tv-client-5: Connected to tv-client-5, attached to remote volume '/rhs/brick2/ec'.
[2017-07-27 09:26:56.884832] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-tv-client-5: Server and Client lk-version numbers are not same, reopening the fds
[2017-07-27 09:26:56.884934] I [MSGID: 122061] [ec.c:323:ec_up] 0-tv-disperse-0: Going UP
[2017-07-27 09:26:56.885071] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-tv-client-5: Server lk version = 1
[2017-07-27 09:27:47.861249] I [glusterfsd-mgmt.c:54:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-07-27 09:27:47.877049] I [glusterfsd-mgmt.c:54:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-07-27 09:27:47.879429] I [glusterfsd-mgmt.c:1823:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2017-07-27 09:30:56.006812] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331304960-331300352, mode: 100644-100644)
[2017-07-27 09:30:56.006858] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.007054] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331300352-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.007535] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331300864-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.007556] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331300864-331300352, mode: 100644-100644)
The message "N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'" repeated 4 times between [2017-07-27 09:30:56.006858] and [2017-07-27 09:30:56.007564]
[2017-07-27 09:30:56.008981] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331305984-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.009051] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.009104] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331305984-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.009120] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.009169] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331304960-331305984, mode: 100644-100644)
[2017-07-27 09:30:56.009185] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.009400] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331304960-331305984, mode: 100644-100644)
[2017-07-27 09:30:56.009418] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.009519] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331305984-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.009534] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:31:56.020590] W [MSGID: 114031] [client-rpc-fops.c:2940:client3_3_lookup_cbk] 0-tv-client-2: remote operation failed. Path: <gfid:8e769011-8db5-4f7a-b886-c77de542ca83> (8e769011-8db5-4f7a-b886-c77de542ca83) [No such file or directory]
[2017-07-27 09:31:56.020668] W [MSGID: 114031] [client-rpc-fops.c:2940:client3_3_lookup_cbk] 0-tv-client-4: remote operation failed. Path: <gfid:8e769011-8db5-4f7a-b886-c77de542ca83> (8e769011-8db5-4f7a-b886-c77de542ca83) [No such file or directory]
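The size mismatches reported by the ec_iatt_combine warnings above can be pulled out of the log with a short script. This is a sketch that assumes the message format shown in this log excerpt; the regular expression and function name are hypothetical helpers, not part of any Gluster tool.

```python
import re

# Matches the "size: A-B" pair inside a "Failed to combine iatt (...)" message.
IATT_RE = re.compile(r"Failed to combine iatt \(.*?size: (\d+)-(\d+),")


def size_gaps(log_text):
    """Return the absolute size difference for each combine failure."""
    gaps = []
    for match in IATT_RE.finditer(log_text):
        first, second = int(match.group(1)), int(match.group(2))
        gaps.append(abs(first - second))
    return gaps
```

For the first warning in the log (size: 331304960-331300352) this gives a gap of 4608 bytes, which is the kind of small, persistent lag described in the problem statement.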
Comment 3 nchilaka 2017-07-27 07:46:18 EDT
sosreports and logs: http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1475789
Comment 5 Sunil Kumar Acharya 2017-09-14 07:13:20 EDT
This issue is fixed by: https://review.gluster.org/#/c/16772/
Verified on my test machine.
