1302201 – Scrubber crash (list corruption)

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	bitrot
Sub Component:
Version:	mainline
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Venky Shankar
QA Contact:
Docs Contact:	bugs@gluster.org
URL:
Whiteboard:
Depends On:	1302199
Blocks:

Reported:	2016-01-27 06:56 UTC by Venky Shankar
Modified:	2016-06-16 13:55 UTC
CC List:	5 users
Fixed In Version:	glusterfs-3.8rc2
Clone Of:	1302199
Environment:
Last Closed:	2016-06-16 13:55:37 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Description Venky Shankar 2016-01-27 06:56:44 UTC

+++ This bug was initially created as a clone of Bug #1302199 +++

Description of problem:

Emmanuel reported a scrubber crash on NetBSD. The backtrace shows list corruption when the bitrot scrubber tries to fetch an item to scrub from a set of bricks.

Backtrace:

(gdb) bt
#0  0xbb213b74 in list_del_init (old=0x0) at /home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/libglusterfs/src/list.h:87
#1  0xbb21682f in _br_scrubber_get_entry (child=0xbb106924, fsentry=0xb84fcfc0)
    at /home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/features/bit-rot/src/bitd/bit-rot-scrub.c:1033
#2  0xbb2168b0 in _br_scrubber_find_scrubbable_entry (fsscrub=0xbb106cf0, fsentry=0xb84fcfc0)
    at /home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/features/bit-rot/src/bitd/bit-rot-scrub.c:1055
#3  0xbb216959 in br_scrubber_pick_entry (fsscrub=0xbb106cf0, fsentry=0xb84fcfc0)
    at /home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/features/bit-rot/src/bitd/bit-rot-scrub.c:1077
#4  0xbb216b0f in br_scrubber_proc (arg=<error reading variable: Cannot access memory at address 0xb84fcfd8>)
    at /home/jenkins/root/workspace/rackspace-netbsd7-regression-triggered/xlators/features/bit-rot/src/bitd/bit-rot-scrub.c:1153

Version-Release number of selected component (if applicable):
3.7

How reproducible:
Intermittently

Steps to Reproduce:
Run the following test case:

    ./tests/bitrot/br-state-check.t

Actual results:
The test case fails intermittently and the scrubber crashes

Expected results:
Test case should pass (and generate no cores)

Additional info:

--- Additional comment from Venky Shankar on 2016-01-27 01:56:09 EST ---

_br_scrubber_find_scrubbable_entry() does a pthread_cond_wait(...) to get signalled when ->scrublist is non-empty:

    if (list_empty (&fsscrub->scrublist))
        pthread_cond_wait (&fsscrub->cond, &fsscrub->mutex);

pthread_cond_wait() is prone to spurious wakeups, as documented in pthread_cond_wait(3), so callers are expected to re-check the condition after it returns. In the code above the condition is tested once with an if and never re-checked, so if pthread_cond_wait() returns spuriously while ->scrublist is still empty, taking the first element of ->scrublist via list_entry() yields garbage.
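The usual remedy is to re-test the predicate in a loop around pthread_cond_wait(). The following is a minimal sketch of that pattern, not the committed patch: struct work_queue, struct work_item, queue_init(), queue_push() and queue_pop() are hypothetical names invented for illustration, and a plain singly linked list stands in for ->scrublist.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for the scrubber's queue; not GlusterFS code. */
struct work_item {
    int value;
    struct work_item *next;
};

struct work_queue {
    pthread_mutex_t mutex;
    pthread_cond_t cond;
    struct work_item *head;     /* NULL means the queue is empty */
};

static void queue_init(struct work_queue *q)
{
    pthread_mutex_init(&q->mutex, NULL);
    pthread_cond_init(&q->cond, NULL);
    q->head = NULL;
}

/* Producer side: add an item and wake one waiter. */
static void queue_push(struct work_queue *q, struct work_item *item)
{
    pthread_mutex_lock(&q->mutex);
    item->next = q->head;
    q->head = item;
    pthread_cond_signal(&q->cond);
    pthread_mutex_unlock(&q->mutex);
}

/* Consumer side: block until an item is available, then detach it. */
static struct work_item *queue_pop(struct work_queue *q)
{
    struct work_item *item;

    pthread_mutex_lock(&q->mutex);
    /* "while", not "if": the predicate is re-tested after every wakeup,
     * so a spurious return from pthread_cond_wait() just waits again. */
    while (q->head == NULL)
        pthread_cond_wait(&q->cond, &q->mutex);

    item = q->head;
    assert(item != NULL);       /* guaranteed by the loop above */
    q->head = item->next;
    item->next = NULL;
    pthread_mutex_unlock(&q->mutex);

    return item;
}
```

With an if in place of the while, a spurious wakeup would fall through to the dequeue step with the queue still empty, which is the failure mode described above.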

Comment 1 Vijay Bellur 2016-01-27 11:44:31 UTC

REVIEW: http://review.gluster.org/13302 (features / bitrot: Prevent spurious pthread_cond_wait() wakeup) posted (#1) for review on master by Venky Shankar (vshankar)

Comment 2 Vijay Bellur 2016-01-28 05:12:59 UTC

REVIEW: http://review.gluster.org/13302 (features / bitrot: Prevent spurious pthread_cond_wait() wakeup) posted (#2) for review on master by Pranith Kumar Karampuri (pkarampu)

Comment 3 Vijay Bellur 2016-01-28 13:22:33 UTC

COMMIT: http://review.gluster.org/13302 committed in master by Venky Shankar (vshankar) 
------
commit 786a8b395b09126a1151865c57ec2753a26facbb
Author: Venky Shankar <vshankar>
Date:   Wed Jan 27 17:04:18 2016 +0530

    features / bitrot: Prevent spurious pthread_cond_wait() wakeup
    
    pthread_cond_wait() is prone to spurious wakeups, so it is essential
    to check a boolean predicate before the thread continues.
    
    See man(3) pthread_cond_wait() for details.
    
    The following is done in bitrot scrubber:
    
        if (list_empty (&fsscrub->scrublist))
           pthread_cond_wait (&fsscrub->cond, &fsscrub->mutex);
    
    followed by:
    
        list_first_entry (&fsscrub->scrublist, ...)
    
    A spurious wakeup from pthread_cond_wait() with the absence of
    list_empty() check causes list_first_entry() to return garbage.
    
    Change-Id: I08786b9686b5503fcad6127e4c2a2cfac4bb7849
    BUG: 1302201
    Signed-off-by: Venky Shankar <vshankar>
    Reviewed-on: http://review.gluster.org/13302
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    Tested-by: Pranith Kumar Karampuri <pkarampu>
    Smoke: Gluster Build System <jenkins.com>
    CentOS-regression: Gluster Build System <jenkins.com>
    NetBSD-regression: NetBSD Build System <jenkins.org>
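To illustrate why the commit message says a first-entry lookup on an empty list returns garbage: in a circular doubly linked list, an empty head points at itself, so "the first node" is the head rather than a node embedded in a real entry. The sketch below is a simplified model written for this note, assuming a layout similar in spirit to libglusterfs list.h. struct list_node, struct scrub_entry, ENTRY_OF and the helper functions are hypothetical names, not the GlusterFS definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative circular doubly linked list; names are hypothetical. */
struct list_node {
    struct list_node *next;
    struct list_node *prev;
};

struct scrub_entry {            /* hypothetical payload type */
    int id;
    struct list_node link;      /* embedded list linkage */
};

/* Recover the enclosing entry from a pointer to its embedded link.
 * Only meaningful when ptr really points into a struct scrub_entry. */
#define ENTRY_OF(ptr) \
    ((struct scrub_entry *)((char *)(ptr) - offsetof(struct scrub_entry, link)))

static void node_init(struct list_node *head)
{
    head->next = head;          /* empty list: head points at itself */
    head->prev = head;
}

static int node_empty(const struct list_node *head)
{
    return head->next == head;
}

static void node_add_tail(struct list_node *item, struct list_node *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

/* Guarded lookup: refuses to apply ENTRY_OF to the head itself. */
static struct scrub_entry *first_entry_or_null(struct list_node *head)
{
    if (node_empty(head))
        return NULL;
    return ENTRY_OF(head->next);
}
```

Without the emptiness check, ENTRY_OF(head->next) on an empty list would be applied to the head itself and produce a pointer that does not refer to any real entry.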

Comment 4 Niels de Vos 2016-06-16 13:55:37 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.8.0, please open a new bug report.

glusterfs-3.8.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://blog.gluster.org/2016/06/glusterfs-3-8-released/
[2] http://thread.gmane.org/gmane.comp.file-systems.gluster.user