Bug 1475789 - As long as appends keep happening on a file healing never completes on a brick when another brick is brought down in between
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: disperse
Version: rhgs-3.3
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.4.0
Assignee: Sunil Kumar Acharya
QA Contact: Upasana
URL:
Whiteboard: rebase
Depends On:
Blocks: 1503134
Reported: 2017-07-27 11:26 UTC by nchilaka
Modified: 2018-09-12 18:36 UTC (History)

Fixed In Version: glusterfs-3.12.2-1
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Environment:
Last Closed: 2018-09-04 06:34:23 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2018:2607 None None None 2018-09-04 06:35:54 UTC

Description nchilaka 2017-07-27 11:26:41 UTC
Description of problem:
======================
I hit this problem while verifying BZ#1427159.
However, since the issue described in BZ#1427159 is fixed, I am raising a new BZ.

This problem is not seen when only one brick is brought down and brought back up.


(This applies to a 4+2 disperse volume.)
When a file is being appended to continuously and we bring down one brick, then bring down another redundant brick while the append is still going on, and then bring up the first brick, the heal never completes as long as the append keeps happening.

Comparing the xattrs of the file on the sink with those on the source shows that size, dirty and version never catch up with the source (there is always a slight lag).


To confirm this, disable server-side and client-side heal, stop the append, and then bring down another brick. This leaves only 3 good bricks, because the first brick is, as noted above, not completely healed. If we now read the file or run md5sum on it (from a new client, as the old client might have cached data), an I/O error is hit after some time.
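
The brick arithmetic behind that I/O error can be sketched as follows. This is only an illustration of the counting in the paragraph above, not Gluster code; the function name and its parameters are hypothetical.

```python
# Illustration only: counts usable bricks for a 4+2 disperse volume.
# Names below are hypothetical and not part of any Gluster API.
DATA_BRICKS = 4   # fragments needed to reconstruct the file
REDUNDANCY = 2    # bricks that may be lost without losing data


def can_read(bricks_down: int, unhealed_bricks: int) -> bool:
    """Return True if enough fully healed bricks remain to serve a read."""
    total = DATA_BRICKS + REDUNDANCY
    usable = total - bricks_down - unhealed_bricks
    return usable >= DATA_BRICKS


# Scenario from the report: b2 and one more brick are down,
# and b1 is up but not completely healed.
print(can_read(bricks_down=2, unhealed_bricks=1))
# Same scenario if b1 had been healed completely.
print(can_read(bricks_down=2, unhealed_bricks=0))
```

With two bricks down and b1 still lagging, only 3 usable bricks remain, which is below the 4 needed for a 4+2 volume.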

Version-Release number of selected component (if applicable):
=========
3.8.4-35

How reproducible:
=============
always

Steps to Reproduce:
1. Create a 4+2 EC volume.
2. Keep appending to a file.
3. Bring down b1.
4. Wait for a minute or so and bring down b2.
5. After another minute or so, bring b1 back up.
6. Check the xattrs (use the watch command). b1 starts to get healed but never catches up with the other healthy bricks (there is always a difference in the xattr values) as long as the I/O is happening.
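
The xattr comparison in step 6 could be scripted along these lines. This is a minimal sketch that assumes the xattrs were dumped in hex as `name=0x...` lines (for example with `getfattr -e hex`) and that the relevant names start with `trusted.ec.`; the sample values are invented for illustration and are not taken from this bug.

```python
# Sketch: compare EC xattrs dumped from a source brick and a sink brick.
# Assumes hex-encoded "name=0x..." lines; sample values are made up.

def parse_ec_xattrs(dump: str) -> dict:
    """Collect trusted.ec.* xattrs from a hex dump as integers."""
    values = {}
    for line in dump.splitlines():
        name, sep, value = line.strip().partition("=")
        if sep and name.startswith("trusted.ec.") and value.startswith("0x"):
            values[name] = int(value, 16)
    return values


def lagging_xattrs(source: dict, sink: dict) -> dict:
    """Return {name: (source_value, sink_value)} for every mismatch."""
    return {
        name: (value, sink.get(name))
        for name, value in source.items()
        if sink.get(name) != value
    }


source_dump = """# file: brick/path/on/source (hypothetical)
trusted.ec.size=0x0000000000002000
trusted.ec.version=0x00000000000000200000000000000001
"""
sink_dump = """# file: brick/path/on/sink (hypothetical)
trusted.ec.size=0x0000000000001e00
trusted.ec.version=0x000000000000001e0000000000000001
"""

diff = lagging_xattrs(parse_ec_xattrs(source_dump), parse_ec_xattrs(sink_dump))
for name, (src, snk) in sorted(diff.items()):
    print(f"{name}: source={src:#x} sink={snk:#x}")
```

An empty result would mean the sink has caught up; in the scenario reported here the result stays non-empty while the append is running.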

Comment 2 nchilaka 2017-07-27 11:30:47 UTC
shd log from after b1 was brought up:


[2017-07-27 09:26:56.884820] I [MSGID: 114046] [client-handshake.c:1215:client_setvolume_cbk] 0-tv-client-5: Connected to tv-client-5, attached to remote volume '/rhs/brick2/ec'.
[2017-07-27 09:26:56.884832] I [MSGID: 114047] [client-handshake.c:1226:client_setvolume_cbk] 0-tv-client-5: Server and Client lk-version numbers are not same, reopening the fds
[2017-07-27 09:26:56.884934] I [MSGID: 122061] [ec.c:323:ec_up] 0-tv-disperse-0: Going UP
[2017-07-27 09:26:56.885071] I [MSGID: 114035] [client-handshake.c:201:client_set_lk_version_cbk] 0-tv-client-5: Server lk version = 1
[2017-07-27 09:27:47.861249] I [glusterfsd-mgmt.c:54:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-07-27 09:27:47.877049] I [glusterfsd-mgmt.c:54:mgmt_cbk_spec] 0-mgmt: Volume file changed
[2017-07-27 09:27:47.879429] I [glusterfsd-mgmt.c:1823:mgmt_getspec_cbk] 0-glusterfs: No change in volfile,continuing
[2017-07-27 09:30:56.006812] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331304960-331300352, mode: 100644-100644)
[2017-07-27 09:30:56.006858] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.007054] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331300352-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.007535] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331300864-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.007556] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331300864-331300352, mode: 100644-100644)
The message "N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'" repeated 4 times between [2017-07-27 09:30:56.006858] and [2017-07-27 09:30:56.007564]
[2017-07-27 09:30:56.008981] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331305984-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.009051] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.009104] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331305984-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.009120] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.009169] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331304960-331305984, mode: 100644-100644)
[2017-07-27 09:30:56.009185] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.009400] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331304960-331305984, mode: 100644-100644)
[2017-07-27 09:30:56.009418] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:30:56.009519] W [MSGID: 122006] [ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, uid: 0-0, gid: 0-0, rdev: 0-0, size: 331305984-331304960, mode: 100644-100644)
[2017-07-27 09:30:56.009534] N [MSGID: 122029] [ec-generic.c:683:ec_combine_lookup] 0-tv-disperse-0: Mismatching iatt in answers of 'GF_FOP_LOOKUP'
[2017-07-27 09:31:56.020590] W [MSGID: 114031] [client-rpc-fops.c:2940:client3_3_lookup_cbk] 0-tv-client-2: remote operation failed. Path: <gfid:8e769011-8db5-4f7a-b886-c77de542ca83> (8e769011-8db5-4f7a-b886-c77de542ca83) [No such file or directory]
[2017-07-27 09:31:56.020668] W [MSGID: 114031] [client-rpc-fops.c:2940:client3_3_lookup_cbk] 0-tv-client-4: remote operation failed. Path: <gfid:8e769011-8db5-4f7a-b886-c77de542ca83> (8e769011-8db5-4f7a-b886-c77de542ca83) [No such file or directory]
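
To see how far apart the answers are, the size pairs in the "Failed to combine iatt" warnings above can be extracted with a small script. This is a sketch that relies only on the message format visible in this log; the regular expression and function name are my own.

```python
import re

# Matches the "size: A-B" pair inside a "Failed to combine iatt" warning,
# based solely on the message format shown in the log above.
IATT_SIZE = re.compile(r"Failed to combine iatt \(.*?size: (\d+)-(\d+),")


def size_mismatches(log_text: str) -> list:
    """Return (first, second, first - second) for each iatt warning."""
    return [
        (int(a), int(b), int(a) - int(b))
        for a, b in IATT_SIZE.findall(log_text)
    ]


sample = (
    "[2017-07-27 09:30:56.006812] W [MSGID: 122006] "
    "[ec-combine.c:208:ec_iatt_combine] 0-tv-disperse-0: Failed to combine "
    "iatt (inode: 10162049781215076630-10162049781215076630, links: 1-1, "
    "uid: 0-0, gid: 0-0, rdev: 0-0, size: 331304960-331300352, "
    "mode: 100644-100644)"
)

for first, second, delta in size_mismatches(sample):
    print(first, second, delta)
```

For the first warning in the log the two sizes differ by 4608 bytes.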

Comment 3 nchilaka 2017-07-27 11:46:18 UTC
sosreports and logs @  http://rhsqe-repo.lab.eng.blr.redhat.com/sosreports/nchilaka/bug.1475789

Comment 5 Sunil Kumar Acharya 2017-09-14 11:13:20 UTC
This issue is fixed by : https://review.gluster.org/#/c/16772/
Verified it on my test machine.

Comment 15 errata-xmlrpc 2018-09-04 06:34:23 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:2607

