Bug 1423065
| Summary: | Deadlock in gf_timer calls and a possible core (illegal memory access) in AFR on dropping FUSE cache | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Shyamsundar <srangana> |
| Component: | replicate | Assignee: | Ravishankar N <ravishankar> |
| Status: | CLOSED DUPLICATE | QA Contact: | |
| Severity: | unspecified | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.10 | CC: | bugs, jdarcy, ravishankar, srangana |
| Target Milestone: | --- | | |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2017-02-20 02:57:51 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 1416031 | | |
**Description** Shyamsundar 2017-02-17 00:56:21 UTC
I just hit this again on the same test case, with the same exact stack etc. Requesting some attention to this, as it blocks performance regression testing for the release (and the failure also seems to be consistent).

This is disturbing, as it looks somewhat similar to bug 1421721. Seems like something might have gone awry in the timer code recently.

Nothing has changed in the timer code recently. However, check out https://bugzilla.redhat.com/show_bug.cgi?id=1421721#c4 (just added). That's not *directly* applicable here, since it involves glusterd code and we're not in glusterd, but it seems pretty likely that something similar is corrupting the timer list.

----------------------------
Notes to self:

    (gdb) p *ctx
    $4 = {read_subvol = 4379182848, spb_choice = -1, timer = 0x2b1a134558a5f533,
          need_refresh = (_gf_true | unknown: 1487271126)}

- ctx->timer is not NULL. But ctx->timer is initialized only when split-brain-resolution-related setfattr commands are executed from the mount, which was not done in this test run.
- ctx->read_subvol is supposed to be a bitmap array of readable subvols (https://github.com/gluster/glusterfs/blob/v3.10.0rc0/xlators/cluster/afr/src/afr-common.c#L162), but converting the value 4379182848 to binary to get the data/metadata/event_gen bits of the bitmap gives gibberish results.
- ctx->need_refresh not being just _gf_true (i.e. 0x01) but a large non-zero value (1487271126) is again gibberish.
----------------------------

Shyam, it looks like the inode context values are corrupt here, and the problem is likely to be BZ 1423385, where the inode context of a different xlator is returned when we get it with the key of a given xlator. (I don't see any bug in AFR itself from looking at the code.)

Can you wait until https://review.gluster.org/#/c/16655/ gets merged to see if it fixes the problem?
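The "gibberish" observation about read_subvol can be made concrete by splitting the packed word. Below is a minimal Python sketch of that decode, assuming the packing used by the read_subvol helpers in afr-common.c around v3.10 (metadata map in the low 16 bits, data map in the next 16 bits, event generation in the top 32 bits) — verify the exact layout against the linked source before relying on it.

```python
# Hedged sketch: split AFR's packed read_subvol word into its three
# presumed fields. The bit layout is an assumption taken from the
# v3.10-era afr-common.c helpers; check the tree you are debugging.

def decode_read_subvol(val):
    metadatamap = val & 0xffff          # presumed: metadata-readable map
    datamap = (val >> 16) & 0xffff      # presumed: data-readable map
    event = val >> 32                   # presumed: event generation
    return datamap, metadatamap, event

# The corrupt value from the gdb session above:
datamap, metadatamap, event = decode_read_subvol(4379182848)
print(hex(datamap), hex(metadatamap), event)

# For a 2- or 3-way replica, only the low child_count bits of each
# map should ever be set; high bits being set here is consistent with
# the "gibberish" noted in the comment above.
```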
(In reply to Ravishankar N from comment #4)

> Shyam, it looks like the inode context values are corrupt here and the problem is likely to be BZ 1423385 where the inode context of a different xlator is returned when we get it with a key of a given xlator. (I don't see any bug in AFR itself from looking at the code.)
>
> Can you wait until https://review.gluster.org/#/c/16655/ gets merged to see if it fixes the problem?

I intend to do that. I had a separate IRC conversation with Poornima, and I might give this fix a shot in my setup to see if the problem goes away. I can also possibly narrow the test case a bit in the process.

Retested with the possible fix posted here: https://review.gluster.org/#/c/16655/

The problem is not reproducible. Since this hung/crashed for me every time prior to the fix, and the last 3 runs have not, I am marking this as a duplicate of bug #1423385.

*** This bug has been marked as a duplicate of bug 1423385 ***

(In reply to Shyamsundar from comment #6)

> Retested with the possible fix posted here: https://review.gluster.org/#/c/16655/
>
> The problem is not reproducible. Since this hung/crashed for me every time prior to the fix, and the last 3 runs have not, I am marking this as a duplicate of bug #1423385.
>
> *** This bug has been marked as a duplicate of bug 1423385 ***

Many thanks for testing, Shyam!
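The failure mode suspected in BZ 1423385 — a context lookup keyed by one xlator returning another xlator's slot — can be illustrated with a toy model. This is a hypothetical Python sketch, not GlusterFS code: the class, the `ctx_get_buggy` method, and the `"md-cache"` slot are illustrative assumptions standing in for the real `inode_ctx_get()`/`inode_ctx_set()` machinery.

```python
# Hypothetical model of a per-xlator inode-context table. Each xlator
# stores its private context under its own key (itself), mirroring
# how inode_ctx_set()/inode_ctx_get() are keyed by xlator pointer.

class Inode:
    def __init__(self):
        self._ctx = {}  # key: xlator, value: that xlator's context

    def ctx_set(self, xlator, value):
        self._ctx[xlator] = value

    def ctx_get_correct(self, xlator):
        # Correct lookup: match on the requesting xlator's key.
        return self._ctx.get(xlator)

    def ctx_get_buggy(self, xlator):
        # Buggy lookup: returns the first populated slot regardless of
        # key, handing one xlator another xlator's context blob.
        return next(iter(self._ctx.values()), None)

inode = Inode()
inode.ctx_set("md-cache", {"generation": 1487271126})
inode.ctx_set("afr", {"read_subvol": 0b0101, "timer": None,
                      "need_refresh": False})

# AFR asks for its own context:
print(inode.ctx_get_correct("afr"))  # AFR's sane context
print(inode.ctx_get_buggy("afr"))    # md-cache's slot: garbage to AFR
```

Reinterpreting a foreign slot through AFR's struct layout would produce exactly the symptoms in comment #4: a non-NULL `timer` that was never set, and a `need_refresh` holding a large non-boolean value.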