1379655 – Recording (ffmpeg) processes on FUSE get hung

Summary: Recording (ffmpeg) processes on FUSE get hung

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	GlusterFS
Classification:	Community
Component:	write-behind
Sub Component:
Version:	mainline
Hardware:	x86_64
OS:	Linux
Priority:	high
Severity:	high
Target Milestone:	---
Assignee:	Raghavendra G
QA Contact:
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	1378131 1385618 1385620 1385622

Reported:	2016-09-27 10:56 UTC by Raghavendra G
Modified:	2017-03-06 17:28 UTC
CC List:	8 users
Fixed In Version:	glusterfs-3.10.0
Clone Of:	1378131
Clones:	1385618 1385620 1385622
Environment:
Last Closed:	2017-03-06 17:28:10 UTC
Regression:	---
Mount Type:	---
Documentation:	---
CRM:
Verified Versions:
Embargoed:
Dependent Products:

Comment 1 Worker Ant 2016-09-27 11:37:50 UTC

REVIEW: http://review.gluster.org/15579 (performance/write-behind: remove the request from liability queue in wb_fulfill_request) posted (#2) for review on master by Raghavendra G (rgowdapp)

Comment 2 Soumya Koduri 2016-10-04 12:17:30 UTC

Please provide a public description for this bug.

Comment 3 Worker Ant 2016-10-13 11:51:41 UTC

REVIEW: http://review.gluster.org/15579 (performance/write-behind: remove the request from liability queue in wb_fulfill_request) posted (#3) for review on master by Raghavendra G (rgowdapp)

Comment 4 Worker Ant 2016-10-14 03:27:55 UTC

REVIEW: http://review.gluster.org/15579 (performance/write-behind: remove the request from liability queue in wb_fulfill_request) posted (#4) for review on master by Raghavendra G (rgowdapp)

Comment 5 Worker Ant 2016-10-17 04:53:14 UTC

COMMIT: http://review.gluster.org/15579 committed in master by Raghavendra G (rgowdapp) 
------
commit a8b2a981881221925bb5edfe7bb65b25ad855c04
Author: Raghavendra G <rgowdapp>
Date:   Tue Sep 27 16:35:08 2016 +0530

    performance/write-behind: remove the request from liability queue in
    wb_fulfill_request
    
    Before this patch, a request is removed from liability queue only when
    ref count of request hits 0. Though, wb_fulfill_request does an unref,
    it need not be the last unref and hence the request may survive in
    liability queue till the last unref. Let,
    
    T1: the time at which wb_fulfill_request is invoked
    T2: the time at which last unref is done on request
    
    Let's consider a case of T2 > T1. In the time window between T1 and
    T2, any other request (waiter) conflicting with request in liability
    queue (blocker - basically a write which has been lied) is blocked
    from winding. If T2 happens to be when wb_do_unwinds is invoked, no
    further processing of request list happens and "waiter" would get
    blocked forever. An example imaginary sequence of events is given
    below:
    
    1. A write request w1 is picked up for unwinding in __wb_pick_unwinds
       (but unwind is not done _yet_ and hence reference
       remains). However, w1 is moved to liability queue. Let's call this
       invocation of wb_process_queue by wb_writev as PQ1.
    
    2. A flush (f1) request hits write behind. Since the liability queue
       of inode is not empty, f1 is not picked for unwinding. Let's call
       the invocation of wb_process_queue by wb_flush as PQ2.
    
    3. PQ2 continues and picks w1 for fulfilling and invokes
       wb_fulfill. As part of successful wb_fulfill_cbk,
       wb_fulfill_request (w1) is invoked. But, w1 is not freed (and hence
       not removed from liability queue) as w1 is not unwound _yet_ and a
       ref remains (PQ1 has not invoked wb_do_unwinds _yet_).
    
    4. wb_fulfill_cbk (triggered by PQ2) invokes a wb_process_queue (let's
       say PQ3). f1 is not resumed in PQ3 as w1 is still in liability
       queue. At this time, PQ2 and PQ3 are complete.
    
    5. PQ1 continues, unwinds w1 and does last unref on w1 and w1 is freed
       (and removed from liability queue). Since PQ1 didn't invoke
       wb_fulfill on any other write requests, there won't be any future
       codepaths that would invoke wb_process_queue and f1 is stuck
       forever.
    
    With this fix, w1 is removed from liability queue in step 3 above and
    PQ3 resumes f1 in step 4 (as there are no requests conflicting with f1
    in liability queue during execution of PQ3).
    
    Signed-off-by: Raghavendra G <rgowdapp>
    BUG: 1379655
    Change-Id: Idacda1fcd520ac27f30224f8dfe8360dba6ac6cb
    Reviewed-on: http://review.gluster.org/15579
    CentOS-regression: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
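
The five-step sequence in the commit message above can be replayed with a small toy model. This is only a sketch in Python, not the write-behind code: the class and method names (`Request`, `Inode`, `fulfill_request`, `process_queue`) and the `remove_on_fulfill` switch are hypothetical, and every request is assumed to conflict with every other one.

```python
class Request:
    """Toy stand-in for a write-behind request (hypothetical)."""
    def __init__(self, name, refs=0):
        self.name = name
        self.refs = refs


class Inode:
    """Toy per-inode state: a liability queue plus blocked waiters."""
    def __init__(self, remove_on_fulfill):
        self.remove_on_fulfill = remove_on_fulfill  # True models the fix
        self.liability = []  # writes acknowledged early, not yet fulfilled
        self.waiting = []    # requests blocked behind the liability queue
        self.resumed = []    # requests that were allowed to proceed

    def unref(self, req):
        # Old behaviour: only the last unref takes the request off the queue.
        req.refs -= 1
        if req.refs == 0 and req in self.liability:
            self.liability.remove(req)

    def fulfill_request(self, req):
        # With the fix, the request leaves the liability queue here,
        # even though other references to it may still exist.
        if self.remove_on_fulfill and req in self.liability:
            self.liability.remove(req)
        self.unref(req)

    def process_queue(self):
        # Simplification: waiters resume only when the queue is empty.
        if not self.liability:
            self.resumed.extend(self.waiting)
            self.waiting.clear()


def run(remove_on_fulfill):
    inode = Inode(remove_on_fulfill)
    # Step 1 (PQ1): w1 is lied and queued; its unwind is still pending,
    # so it holds two refs (pending unwind + pending fulfill).
    w1 = Request("w1", refs=2)
    inode.liability.append(w1)
    # Step 2 (PQ2): flush f1 arrives and is blocked behind w1.
    inode.waiting.append(Request("f1"))
    inode.process_queue()
    # Step 3 (PQ2): w1 is fulfilled, but one ref remains.
    inode.fulfill_request(w1)
    # Step 4 (PQ3): the fulfill callback reprocesses the queue.
    inode.process_queue()
    # Step 5 (PQ1): w1 is unwound (last unref); nothing reprocesses the queue.
    inode.unref(w1)
    return [r.name for r in inode.resumed]


print("without fix:", run(remove_on_fulfill=False))
print("with fix:   ", run(remove_on_fulfill=True))
```

In this model, f1 is never resumed when removal waits for the last unref, and it is resumed in step 4 when removal happens at fulfill time.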

Comment 6 Worker Ant 2016-12-26 18:08:39 UTC

REVIEW: http://review.gluster.org/16285 (performance/write-behind: Add debug messages) posted (#5) for review on master by Raghavendra G (rgowdapp)

Comment 7 Raghavendra G 2016-12-27 03:39:39 UTC

There is one more patch that adds debug messages, which will help diagnose similar issues in the future. Hence, moving the bz back to POST.

Comment 8 Worker Ant 2017-01-05 05:31:39 UTC

REVIEW: http://review.gluster.org/16285 (performance/write-behind: Add debug messages) posted (#6) for review on master by Raghavendra G (rgowdapp)

Comment 9 Worker Ant 2017-01-09 04:33:49 UTC

COMMIT: http://review.gluster.org/16285 committed in master by Raghavendra G (rgowdapp) 
------
commit 521c55c53bd42bfdcc0919019ee81c81305382a2
Author: Raghavendra G <rgowdapp>
Date:   Mon Dec 26 15:16:10 2016 +0530

    performance/write-behind: Add debug messages
    
    Change-Id: I2ea1350fcbe4b6c06dcb8093b28316f734cd3b48
    BUG: 1379655
    Signed-off-by: Raghavendra G <rgowdapp>
    Reviewed-on: http://review.gluster.org/16285
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    CentOS-regression: Gluster Build System <jenkins.org>

Comment 10 Raghavendra G 2017-02-10 04:50:52 UTC

Description of the bug: the (f)stat syscall hung. Looking at the 3 statedumps collected in comment 8 (glusterdump.303.dump.1474546152, glusterdump.303.dump.1474546203, glusterdump.303.dump.1474546233), I have the following observations:

1. The call stacks in all three are identical.
2. There are a bunch of fstat and flush calls stuck in write-behind.
3. I don't see any stack involving frames that belong to any of the children of write-behind.

Based on the above observations, my deductions are:
1. The stacks represent a hang, not in-progress operations.
2. The culprit is write-behind. There seems to be a bug in write-behind that is causing the hang.

A description of the fix can be found in comment #5.
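
The reasoning above (frames that persist unchanged across several dumps, with nothing pending below write-behind) can be sketched as a small check. This is an illustration only: it does not parse real statedump files, the `(stack id, translator, fop)` tuples are made-up sample data, and the translator names in `graph_order` are placeholders for an assumed top-to-bottom graph.

```python
def persistent_frames(dumps):
    """Return the frames that appear in every dump (likely hung, not in flight)."""
    common = set(dumps[0])
    for dump in dumps[1:]:
        common &= set(dump)
    return common


def deepest_stuck_layer(dumps, graph_order):
    """Return the lowest translator in graph_order that holds a persistent frame.

    If nothing below that translator holds a persistent frame, the calls were
    never wound further down, which points at that translator as the culprit.
    """
    layers = {xlator for (_, xlator, _) in persistent_frames(dumps)}
    for xlator in reversed(graph_order):
        if xlator in layers:
            return xlator
    return None


# Hypothetical graph, top (closest to the application) to bottom.
graph_order = ["fuse", "write-behind", "dht", "client"]

# Hypothetical frames seen in three dumps taken some seconds apart.
stuck = {
    (1, "fuse", "FSTAT"), (1, "write-behind", "FSTAT"),
    (2, "fuse", "FLUSH"), (2, "write-behind", "FLUSH"),
}
dumps = [
    stuck,
    stuck | {(3, "fuse", "WRITE"), (3, "client", "WRITE")},  # transient call
    stuck,
]

print(deepest_stuck_layer(dumps, graph_order))
```

The transient write in the second dump is ignored because it does not show up in all three snapshots.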

Comment 11 Shyamsundar 2017-03-06 17:28:10 UTC

This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.0, please open a new bug report.

glusterfs-3.10.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/gluster-users/2017-February/030119.html
[2] https://www.gluster.org/pipermail/gluster-users/