Bug 1378131 - [GSS] - Recording (ffmpeg) processes on FUSE get hung
Summary: [GSS] - Recording (ffmpeg) processes on FUSE get hung
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Gluster Storage
Classification: Red Hat
Component: write-behind
Version: rhgs-3.1
Hardware: x86_64
OS: Linux
Priority: high
Severity: high
Target Milestone: ---
Target Release: RHGS 3.2.0
Assignee: Raghavendra G
QA Contact: nchilaka
URL:
Whiteboard:
Depends On: 1379655 1385618 1385620 1385622
Blocks: 1351528
 
Reported: 2016-09-21 14:33 UTC by Mukul Malhotra
Modified: 2017-03-23 05:47 UTC (History)
CC: 11 users

Fixed In Version: glusterfs-3.8.4-7
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
: 1379655 (view as bug list)
Environment:
Last Closed: 2017-03-23 05:47:59 UTC




Links
System ID Priority Status Summary Last Updated
Red Hat Product Errata RHSA-2017:0486 normal SHIPPED_LIVE Moderate: Red Hat Gluster Storage 3.2.0 security, bug fix, and enhancement update 2017-03-23 09:18:45 UTC

Comment 11 Mukul Malhotra 2016-09-23 12:11:57 UTC
Hello,

Do we have any update on this BZ?

Mukul

Comment 12 Raghavendra G 2016-09-26 04:48:57 UTC
(In reply to Mukul Malhotra from comment #11)
> Hello,
> 
> Do we have any update on this BZ?

I'll be working on it today. Will probably have some updates over the course of the day.

> 
> Mukul

Comment 17 Mukul Malhotra 2016-09-28 07:16:09 UTC
Raghavendra,

The customer does have reproducible steps beyond the original ffmpeg scenario: as described in comment#10, rsync processes copying hundreds of thousands of .jpg files onto the gluster volume result in the hung state.

Mukul

Comment 18 Raghavendra G 2016-09-28 08:32:01 UTC
(In reply to Mukul Malhotra from comment #17)
> Raghavendra,
> 
> The customer does have reproducible steps beyond the original ffmpeg
> scenario: as described in comment#10, rsync processes copying hundreds of
> thousands of .jpg files onto the gluster volume result in the hung state.

Good :). Is it possible to share the steps? A script would be even better :).

> 
> Mukul

Comment 19 Raghavendra G 2016-09-28 08:34:27 UTC
Did you actually mean that the alternative reproducer is the one mentioned in comment 10, where a bunch of rsync commands copy *.jpg files?

Comment 20 Mukul Malhotra 2016-09-28 09:39:37 UTC
Raghavendra,

>Did you actually mean that the alternative reproducer is the one mentioned in comment 10, where a bunch of rsync commands copy *.jpg files?

Yes, correct; the customer does not have any other reproducer.
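
For illustration, a rough sketch of such an rsync-based reproducer (the mount point, source data set, and number of parallel processes are assumptions for illustration, not the customer's actual values):

#!/bin/bash
# Launch several concurrent rsync processes copying large numbers of .jpg
# files onto a FUSE-mounted gluster volume (mount point assumed below).
SRC=/data/jpg-set                # assumed source tree containing many .jpg files
DST=/mnt/glusterfs/rsync-test    # assumed FUSE mount point of the volume
mkdir -p "$DST"
for i in $(seq 1 8); do
    rsync -a --include='*/' --include='*.jpg' --exclude='*' "$SRC/" "$DST/run-$i/" &
done
wait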

Mukul

Comment 25 Mukul Malhotra 2016-10-05 11:45:02 UTC
Raghavendra,

As per the current update, the customer's issue has been resolved after we suggested disabling the "performance.client-io-threads" option as a workaround.
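
For reference, the workaround is a single volume option change; a minimal sketch (VOLNAME stands for the customer's actual volume name):

gluster volume set VOLNAME performance.client-io-threads off

The same option can be set back to "on" once a fixed build is installed.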

The customer has now requested a hotfix to be installed on the production system and does not require a test build of the fix.

The customer has been informed that once the patch is available we will initiate the hotfix process and provide an update by next week.

Mukul

Comment 33 Alok 2016-11-02 07:58:31 UTC
@Mukul, please point the customer to the KCS article that discusses "client-io-thread" support for disperse volumes. If the customer does not hit the reported issue with "client-io-thread" disabled, that should be the recommendation. Also, please inform the customer that the RHGS team is working on enabling "client-io-thread" for other volume types in the upcoming RHGS 3.2 release, but this is tentative at this stage.

Comment 34 Mukul Malhotra 2016-11-02 09:16:16 UTC
Alok,

>@Mukul, please point the customer to the KCS article that discusses "client-io-thread" support for disperse volumes. If the customer does not hit the reported issue with "client-io-thread" disabled, that should be the recommendation.

Thanks; this information (disabling "client-io-thread") was already provided to the customer as a workaround earlier, and it fixes the issue.

We also suggested that this option is recommended for erasure-coded volumes with the FUSE client.

>Also, please inform the customer that the RHGS team is working on enabling "client-io-thread" for other volume types in the upcoming RHGS 3.2 release, but this is tentative at this stage.

Yes, we suggested the same, and the case has been closed by the customer.

Mukul

Comment 35 Atin Mukherjee 2016-11-07 04:37:58 UTC
upstream mainline : http://review.gluster.org/15579 (merged)
upstream 3.8 : http://review.gluster.org/15658 (merged)

Comment 40 nchilaka 2017-02-20 14:27:12 UTC
Hi Raghavendra,
Can you help me with the questions below:

I am trying to come up with test case(s) to validate this fix based on the above conversations:

TC#1: run rsync from multiple locations to a gluster volume
TC#2: the customer scenario, which is as below:
"When the customer launches 16 ffmpeg processes, each recording to 2 mp3 files at a 256K bit rate, the issue appears: every hour, one to six processes get hung waiting for a filesystem response, specifically from the glusterfs FUSE driver."
QE will have to see whether this is feasible with the infrastructure we have; a rough sketch of this scenario is included after these questions.

Question to Developer: is there any other way of testing this fix, without all the above complications?
Can you suggest any new/alternate test cases?
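
For illustration, a rough sketch of how TC#2 might be scripted (the synthetic audio source, availability of an mp3 encoder, and paths are assumptions; the customer's actual ffmpeg invocation is not known):

#!/bin/bash
# Launch 16 ffmpeg processes, each writing two 256k mp3 files onto the
# FUSE mount (mount point assumed below); each recording runs for one hour.
DST=/mnt/glusterfs/recordings
mkdir -p "$DST"
for i in $(seq 1 16); do
    ffmpeg -f lavfi -i "sine=frequency=440" \
           -t 3600 -b:a 256k "$DST/rec-${i}-a.mp3" \
           -t 3600 -b:a 256k "$DST/rec-${i}-b.mp3" &
done
wait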

Also, regarding volume settings, I have the questions below:
1) The customer's volume has the options below; I am planning to set them all.
Question to Developer: are you OK with me setting all the options below (which the customer has set)? (A sketch of applying these, along with the 2x2 volume creation, is at the end of this comment.)

features.barrier: disable
auth.allow: 10.110.14.63,10.110.14.64,10.110.14.65,10.100.77.18
performance.open-behind: on
performance.quick-read: on
performance.client-io-threads: on
server.event-threads: 6
client.event-threads: 4
cluster.lookup-optimize: on
performance.readdir-ahead: on
auto-delete: enable


2) I am going to try a 2x2 volume.
Question to Developer: let me know if you want any change in the volume type.

3) I see that in comment#25, client-io-threads was to be disabled as a workaround.
Question to Developer: client-io-threads is now disabled by default on replicate volumes; do you want me to enable it for testing this fix? (I think it must be enabled, as the customer too has it enabled.)


Note: these will be tested on nodes that are VMs, with FUSE as the client access protocol.
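
A rough sketch of the planned test-volume setup described above (host names, brick paths, and the volume name are placeholders):

# Create and start a 2x2 distributed-replicate volume
gluster volume create testvol replica 2 \
    server1:/bricks/b1/testvol server2:/bricks/b1/testvol \
    server3:/bricks/b1/testvol server4:/bricks/b1/testvol
gluster volume start testvol

# Apply the customer's volume options
gluster volume set testvol features.barrier disable
gluster volume set testvol auth.allow 10.110.14.63,10.110.14.64,10.110.14.65,10.100.77.18
gluster volume set testvol performance.open-behind on
gluster volume set testvol performance.quick-read on
gluster volume set testvol performance.client-io-threads on
gluster volume set testvol server.event-threads 6
gluster volume set testvol client.event-threads 4
gluster volume set testvol cluster.lookup-optimize on
gluster volume set testvol performance.readdir-ahead on
# auto-delete is a snapshot setting rather than a volume option
gluster snapshot config auto-delete enable

# Mount the volume on a client over FUSE
mount -t glusterfs server1:/testvol /mnt/glusterfs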

Comment 41 Raghavendra G 2017-02-21 16:32:48 UTC
(In reply to nchilaka from comment #40)
> Hi Raghavendra,
> Can you help me with the questions below:
> 
> I am trying to come up with test case(s) to validate this fix based on
> the above conversations:
> 
> TC#1: run rsync from multiple locations to a gluster volume
> TC#2: the customer scenario, which is as below:
> "When the customer launches 16 ffmpeg processes, each recording to 2 mp3
> files at a 256K bit rate, the issue appears: every hour, one to six
> processes get hung waiting for a filesystem response, specifically from
> the glusterfs FUSE driver."
> QE will have to see whether this is feasible with the infrastructure we have.
> 
> Question to Developer: is there any other way of testing this fix, without
> all the above complications?
> Can you suggest any new/alternate test cases?

It's a race condition, so it is quite difficult to hit. I don't have an easy reproducer. In fact, as one of the comments mentions, I tried to reproduce the issue by running rsync for a day (without the fix), but without success. The fix posted was arrived at through code review.

> 
> Also , regarding volume settings, I have below questions
> 1)the customer volume is having below options, I am planning to set them
> all. 
> Question to Developer: are you ok with me setting all the below
> options(which customer has set)

Yes. Please set the same options as the customer.

> 
> features.barrier: disable
> auth.allow: 10.110.14.63,10.110.14.64,10.110.14.65,10.100.77.18
> performance.open-behind: on
> performance.quick-read: on
> performance.client-io-threads: on
> server.event-threads: 6
> client.event-threads: 4
> cluster.lookup-optimize: on
> performance.readdir-ahead: on
> auto-delete: enable
> 
> 
> 2) I am going to try a 2x2 volume.
> Question to Developer: let me know if you want any change in the volume type.

No changes required.

> 
> 3) I see that in comment#25, client-io-threads was to be disabled as a
> workaround.
> Question to Developer: client-io-threads is now disabled by default on
> replicate volumes; do you want me to enable it for testing this fix? (I
> think it must be enabled, as the customer too has it enabled.)

Please have it enabled.

> 
> 
> Note: these will be tested on nodes that are VMs, with FUSE as the client
> access protocol.

Comment 44 errata-xmlrpc 2017-03-23 05:47:59 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html

