Bug 1315560

Summary: ./tests/basic/tier/tier-file-create.t dumping core fairly often on build machines in Linux
Product: [Community] GlusterFS
Reporter: Krutika Dhananjay <kdhananj>
Component: tiering
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED CURRENTRELEASE
QA Contact: bugs <bugs>
Severity: unspecified
Docs Contact:
Priority: unspecified
Version: mainline
CC: bugs, dlambrig, josferna, nbalacha, pkarampu
Target Milestone: ---
Keywords: Triaged
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version: glusterfs-3.8rc2
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 1318428 1322520
Environment:
Last Closed: 2016-06-16 13:59:45 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 1318428, 1322520

Description Krutika Dhananjay 2016-03-08 04:50:48 UTC
Description of problem:

http://www.gluster.org/pipermail/gluster-devel/2016-March/048568.html

https://build.gluster.org/job/rackspace-regression-2GB-triggered/18872/consoleFull
https://build.gluster.org/job/rackspace-regression-2GB-triggered/18793/console


I have set the assignee to the author of the test script to begin with.

Comment 1 Vijay Bellur 2016-03-08 04:52:23 UTC
REVIEW: http://review.gluster.org/13632 (tests: Move tier-file-create.t to bad tests) posted (#1) for review on master by Krutika Dhananjay (kdhananj)

Comment 2 Vijay Bellur 2016-03-08 06:45:07 UTC
REVIEW: http://review.gluster.org/13632 (tests: Move tier-file-create.t to bad tests) posted (#2) for review on master by Krutika Dhananjay (kdhananj)

Comment 3 Vijay Bellur 2016-03-08 11:29:20 UTC
REVIEW: http://review.gluster.org/13632 (tests: Move tier-file-create.t to bad tests) posted (#3) for review on master by Krutika Dhananjay (kdhananj)

Comment 4 Vijay Bellur 2016-03-08 20:00:44 UTC
COMMIT: http://review.gluster.org/13632 committed in master by Jeff Darcy (jdarcy) 
------
commit 66d62edd08be5701407e4adcb153a676702ff8b8
Author: Krutika Dhananjay <kdhananj>
Date:   Tue Mar 8 10:21:14 2016 +0530

    tests: Move tier-file-create.t to bad tests
    
    Change-Id: Iaddb244699b0e2647a67a75f257e4c47e0e69e0d
    BUG: 1315560
    Signed-off-by: Krutika Dhananjay <kdhananj>
    Reviewed-on: http://review.gluster.org/13632
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Dan Lambright <dlambrig>
    Reviewed-by: Jeff Darcy <jdarcy>

Comment 5 Vijay Bellur 2016-03-11 10:48:50 UTC
REVIEW: http://review.gluster.org/13680 (cluster/ec: Do not ref dictionary in lookup) posted (#1) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 6 Vijay Bellur 2016-03-14 11:40:03 UTC
COMMIT: http://review.gluster.org/13680 committed in master by Xavier Hernandez (xhernandez) 
------
commit 64cba025b13aad7fb3020a04930cfa22fbfcb859
Author: Pranith Kumar K <pkarampu>
Date:   Tue Mar 8 23:05:08 2016 +0530

    cluster/ec: Do not ref dictionary in lookup
    
    Problem:
    1) dict_for_each loops over the elements without taking any locks, so the
       members of the dictionary can be ref'd/unref'd by another thread while
       dict_for_each is executing, leading to crashes.
    
    This shows up with distributed ec as the cold tier and distributed
    replicate as the hot tier. Tier sends a lookup which fails on ec; by this
    time the dict already contains the ec xattrs. Tier then hits its
    lookup_everywhere code path, which triggers the hashed lookup on each
    distribute. Those lookups fail as well, so the cold and hot dht's run
    lookup_everywhere in two parallel epoll threads. When ec sets
    trusted.ec.version/dirty/size as keys in the dictionary, the older values
    stored against the same keys get erased. If, while this erasing is going
    on, the thread doing the lookup on afr's subvolume accesses these keys,
    either in dict_copy_with_ref or in the client xlator while serializing,
    the result is a crash or a hang, depending on whether the spin/mutex lock
    is taken on invalid memory.
    
    2) EC deletes GF_CONTENT_KEY from the dictionary, which may lead to extra
       reads in case of lookup-everywhere for tiered volumes.
    
    Fix:
    Do dict_copy_with_ref() for the lookup dictionary.
    This avoids the 1st problem rather than actually fixing it.
    The 2nd problem is fixed by this change.
    
    Change-Id: I5427aa14c48cb7572977d4de9a28c5ffff2b4b95
    BUG: 1315560
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/13680
    Smoke: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.com>
    Reviewed-by: Xavier Hernandez <xhernandez>
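
Editor's sketch of the idea behind the fix above, in Python rather than GlusterFS C. It is an analogy under stated assumptions, not the code from the patch: `ec_like_lookup` and `lookup_on_private_copy` are hypothetical names, the string "GF_CONTENT_KEY" is only a placeholder key, and a plain Python dict stands in for the GlusterFS dictionary. The point it illustrates is that a branch which adds and removes keys works on its own shallow copy, so the dictionary shared with the other lookup branches is never modified underneath them.

```python
def ec_like_lookup(xattr_req):
    """Hypothetical stand-in for a translator that edits the request dict."""
    # Overwrites any older values stored against these keys.
    xattr_req["trusted.ec.version"] = b"\x00"
    xattr_req["trusted.ec.dirty"] = b"\x00"
    xattr_req["trusted.ec.size"] = b"\x00"
    # Drops a key that other branches may still want to see.
    xattr_req.pop("GF_CONTENT_KEY", None)


def lookup_on_private_copy(shared_req):
    """Run the lookup on a private shallow copy of the shared request dict."""
    # New container, same value objects: the shared dict's key set and
    # key-to-value bindings are left alone for the other branches.
    local_req = dict(shared_req)
    ec_like_lookup(local_req)
    return local_req


if __name__ == "__main__":
    shared = {"GF_CONTENT_KEY": 4096}
    local = lookup_on_private_copy(shared)
    print(sorted(shared))
    print(sorted(local))
```

As the commit message itself notes, copying only sidesteps the unlocked iteration; it does not make iterating over a dictionary that is still shared between threads safe.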

Comment 7 Niels de Vos 2016-06-16 13:59:45 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1]. Packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user