Bug 1378131
| Summary: | [GSS] - Recording (ffmpeg) processes on FUSE get hung | | |
|---|---|---|---|
| Product: | [Red Hat Storage] Red Hat Gluster Storage | Reporter: | Mukul Malhotra <mmalhotr> |
| Component: | write-behind | Assignee: | Raghavendra G <rgowdapp> |
| Status: | CLOSED ERRATA | QA Contact: | Nag Pavan Chilakam <nchilaka> |
| Severity: | high | Docs Contact: | |
| Priority: | high | | |
| Version: | rhgs-3.1 | CC: | amukherj, asrivast, kdhananj, mmalhotr, pkarampu, ravishankar, rcyriac, rgowdapp, rhinduja, rhs-bugs, sankarshan |
| Target Milestone: | --- | | |
| Target Release: | RHGS 3.2.0 | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.8.4-7 | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 1379655 (view as bug list) | Environment: | |
| Last Closed: | 2017-03-23 05:47:59 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | 1379655, 1385618, 1385620, 1385622 | | |
| Bug Blocks: | 1351528 | | |
Comment 11
Mukul Malhotra
2016-09-23 12:11:57 UTC
(In reply to Mukul Malhotra from comment #11)
> Hello,
>
> Do we have any update on this bz ?

I'll be working on it today. Will probably have some updates over the course of the day.

> Mukul

Raghavendra,

The customer actually does have reproducible steps other than comment#10, where the rsync processes used to copy hundreds of thousands of .jpg files onto the gluster volume, which results in the hung state.

Mukul

(In reply to Mukul Malhotra from comment #17)
> Raghavendra,
>
> Customer actually does have the reproducible steps other then comment#10
> where the rsync processes used to copy hundreds of thousands .jpg files on
> the gluster volume which results in hung state.

Good :). Is it possible to share the steps across? A script is even better :).

> Mukul

Did you actually mean the alternative reproducer is mentioned in comment 10, where a bunch of rsync commands are copying *.jpg files?

Raghavendra,
>Did you actually mean the alternative reproducer is mentioned in comment 10, where a bunch of rsync commands are copying *.jpg files?
Yes, correct; the customer does not have any other reproducer.
Mukul
Raghavendra,

As per the current update, the customer's issue has been resolved after suggesting a workaround to disable the "performance.client-io-threads" option. The customer has now requested a hotfix to be installed on a production system and does not require a test build. The customer has been informed that once the patch is available we will initiate the hotfix process and provide an update by next week.

Mukul

@Mukul, please guide the customer to the KCS article which mentions "client-io-threads" support for disperse volumes. If the customer does not hit the reported issue with "client-io-threads" disabled, the same should be recommended. Also, please update the customer that the RHGS team is working on enabling "client-io-threads" for other volume types for the upcoming RHGS 3.2 release, but it is tentative at this stage.

Alok,

> @Mukul, please guide the customer to the KCS article which mentions
> "client-io-threads" support for disperse volumes. If the customer does not
> hit the reported issue with "client-io-threads" disabled, the same should be
> recommended.

Thanks; this information (disabling "client-io-threads") was already provided to the customer as a workaround earlier, and it fixes the issue. I also suggested that this option is recommended with erasure-coded volumes and the FUSE client.

> Also, please update the customer that the RHGS team is working on enabling
> "client-io-threads" for other volume types for the upcoming RHGS 3.2
> release, but it is tentative at this stage.

Yes, I suggested the same, and the case has been closed by the customer.
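For reference, the workaround discussed above (disabling client I/O threads on the affected volume) corresponds to the standard gluster CLI invocation below; `<volname>` is a placeholder for the actual volume name.

```shell
# Workaround from this thread: disable the client-io-threads translator
# on the affected volume.
gluster volume set <volname> performance.client-io-threads off

# To undo the workaround once a fixed build is installed:
# gluster volume set <volname> performance.client-io-threads on
```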
Mukul

upstream mainline : http://review.gluster.org/15579 (merged)
upstream 3.8 : http://review.gluster.org/15658 (merged)

Hi Raghavendra,

Can you help me with the questions below? I am trying to come up with testcase(s) to validate this fix based on the above conversations:

TC#1: do rsync from multiple locations to a gluster volume

TC#2: the customer scenario, which is as below:
"When the customer launches 16 ffmpeg processes, where each one of them records to 2 mp3 files with a 256K bit rate, the issue appears where every hour one to six processes get hung, waiting for a filesystem response, specifically from the glusterfs FUSE driver."
QE will have to see if this is feasible as part of the infra we have.

Question to Developer: is there any other way of testing this fix, without all the above complications? Can you suggest any new/alternate cases?

Also, regarding volume settings, I have the questions below:

1) The customer volume has the options below; I am planning to set them all.
Question to Developer: are you OK with me setting all the options below (which the customer has set)?

features.barrier: disable
auth.allow: 10.110.14.63,10.110.14.64,10.110.14.65,10.100.77.18
performance.open-behind: on
performance.quick-read: on
performance.client-io-threads: on
server.event-threads: 6
client.event-threads: 4
cluster.lookup-optimize: on
performance.readdir-ahead: on
auto-delete: enable

2) I am going to try on a 2x2 volume.
Question to Developer: let me know if you want any change in the volume type.
3) I see that in comment#25, client-io-threads were to be disabled as a workaround.
Question to Developer: client-io-threads in a replicate volume is now disabled by default; do you want me to enable it for testing this fix (which I think must be enabled, as the customer too has enabled it)?

Note: these will be tested on nodes which are VMs, with FUSE as the access protocol.

(In reply to nchilaka from comment #40)
> Hi Raghavendra,
> Can you help me with below questions:
>
> I am trying to come up with a testcase(s) to validate this fix based on
> above conversations:
>
> TC#1: do rsync from multiple locations to a gluster volume
> TC#2: customer scenario which is as below
> "When customer launches 16 ffmpeg processes, where each one of them records
> to 2 mp3 files with 256K bit rate, then the issue appears where every hour
> from one to six processes get hung, waiting for a filesystem response,
> specifically from Fuse-driver glusterfs."
> QE will have to see if this is feasible as part of the infra we have.
>
> Question to Developer: is there any other way of testing this fix, without
> all the above complications?
> Can you suggest me with any new/alternate cases ?

It's a race condition, so it is quite difficult to hit, and I don't have an easy reproducer. In fact, as one of the comments mentions, I tried to reproduce the issue by running rsync for a day (without the fix), but without success. The fix posted was arrived at through code review.

> Also, regarding volume settings, I have below questions
> 1) the customer volume is having below options, I am planning to set them
> all.
> Question to Developer: are you ok with me setting all the below
> options (which customer has set)

Yes. Please have the same options set as the customer.
> features.barrier: disable
> auth.allow: 10.110.14.63,10.110.14.64,10.110.14.65,10.100.77.18
> performance.open-behind: on
> performance.quick-read: on
> performance.client-io-threads: on
> server.event-threads: 6
> client.event-threads: 4
> cluster.lookup-optimize: on
> performance.readdir-ahead: on
> auto-delete: enable
>
> 2) I am going to try on 2x2 volume
> Question to Developer: let me know if you want any change in the volume type?

No changes required.

> 3) I see that in comment#25, client-io-threads were to be disabled, as
> workaround
> Question to Developer: client-io-threads in a replicate volume is now
> disabled by default, do you want me to enable it for testing this fix
> (which I think must be enabled, as customer too has enabled)

Please have it enabled.

> Note: these will be tested on nodes which are VMs and fuse as the n/w
> protocol

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with the resolution ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://rhn.redhat.com/errata/RHSA-2017-0486.html
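For completeness, the TC#2 ffmpeg workload discussed in the test plan could be approximated with a launcher like the sketch below. This is a hypothetical reconstruction from the bug description, not the customer's actual command line: the mount point, the synthetic sine-wave input, and the flags are assumptions; the sketch only defines the launcher function and leaves starting it to the operator.

```shell
# Hypothetical launcher for the TC#2 workload: 16 ffmpeg processes, each
# recording two 256 kbit/s mp3 files onto the FUSE mount. The lavfi
# sine-wave source stands in for the customer's real capture input.
MNT="${MNT:-/mnt/glustervol}"   # assumed glusterfs FUSE mount point

launch_recorders() {
    for n in $(seq 1 16); do
        ffmpeg -loglevel error -f lavfi -i "sine=frequency=440" \
            -map 0:a -b:a 256k "$MNT/rec_${n}_a.mp3" \
            -map 0:a -b:a 256k "$MNT/rec_${n}_b.mp3" &
    done
}

# The reported failure mode: after roughly an hour of such recording, one
# to six of the processes would hang waiting on the FUSE layer.
# launch_recorders; sleep 3600; jobs -p | xargs -r kill
```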