Bug 1626780
| Field | Value |
| --- | --- |
| Summary | sas workload job getting stuck after sometime |
| Product | [Red Hat Storage] Red Hat Gluster Storage |
| Reporter | Nag Pavan Chilakam <nchilaka> |
| Component | write-behind |
| Assignee | Raghavendra G <rgowdapp> |
| Status | CLOSED ERRATA |
| QA Contact | Nag Pavan Chilakam <nchilaka> |
| Severity | urgent |
| Docs Contact | |
| Priority | unspecified |
| Version | rhgs-3.4 |
| CC | apaladug, bmarson, rgowdapp, rhs-bugs, sanandpa, sankarshan, sheggodu |
| Target Milestone | --- |
| Keywords | ZStream |
| Target Release | RHGS 3.4.z Batch Update 1 |
| Hardware | Unspecified |
| OS | Unspecified |
| Whiteboard | |
| Fixed In Version | glusterfs-3.12.2-20 |
| Doc Type | If docs needed, set a value |
| Doc Text | |
| Story Points | --- |
| Clone Of | |
| | 1626787 (view as bug list) |
| Environment | |
| Last Closed | 2018-10-31 08:46:14 UTC |
| Type | Bug |
| Regression | --- |
| Mount Type | --- |
| Documentation | --- |
| CRM | |
| Verified Versions | |
| Category | --- |
| oVirt Team | --- |
| RHEL 7.3 requirements from Atomic Host | |
| Cloudforms Team | --- |
| Target Upstream Version | |
| Embargoed | |
| Bug Depends On | |
| Bug Blocks | 1626787 |
| Attachments | |
Description (Nag Pavan Chilakam, 2018-09-08 14:55:02 UTC)
Created attachment 1481795: statedump of the fuse mount process from the node where the SAS job was stuck
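For reference, a statedump of a fuse mount process like the one attached is typically captured by sending SIGUSR1 to the glusterfs client process. A minimal sketch follows; the mount point, PID lookup, and dump paths are illustrative assumptions, not details taken from this report.

```sh
# Minimal sketch, assuming a hypothetical fuse mount point /mnt/sasvol and
# the default statedump directory /var/run/gluster.

MOUNT=/mnt/sasvol                          # hypothetical mount point
PID=$(pgrep -f "glusterfs.*${MOUNT}")      # client process serving that mount

# glusterfs processes write a statedump when they receive SIGUSR1
kill -USR1 "${PID}"

# dumps land in the statedump directory (default /var/run/gluster)
ls -lt /var/run/gluster/glusterdump.${PID}.dump.*

# the write-behind section lists pending requests per inode,
# e.g. the stuck WRITE/FLUSH entries quoted below
grep -A20 'xlator.performance.write-behind.wb_inode' \
    /var/run/gluster/glusterdump.${PID}.dump.*
```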
The statedump in comment #3 somehow didn't have the flush call trace. I took another statedump, and a flush request was indeed stuck in write-behind:

```
[xlator.performance.write-behind.wb_inode]
path=/SAS_workAB39000020E2_rhs-client44.lab.eng.blr.redhat.com/TD_5/FCST/process/SUBG/gp_h_filter.sas7bdat.lck
inode=0x7f4f9c463c40
gfid=59e84e6d-76f5-40f0-8888-8d3745291011
window_conf=1048576
window_current=65536
transit-size=0
dontsync=0

[.WRITE]
unique=17337841
refcount=1
wound=no
generation-number=6
req->op_ret=65536
req->op_errno=0
sync-attempts=0
sync-in-progress=no
size=65536
offset=65536
lied=-1
append=0
fulfilled=0
go=-1

[.FLUSH]
unique=17337847
refcount=1
wound=no
generation-number=9
req->op_ret=0
req->op_errno=0
sync-attempts=0
```

Testing so far on 3.4.1, i.e. glusterfs-3.12.2-21: we started with 4 clients, all performance xlators off and write-behind (WB) on. However, we started hitting the ENOENT problem described in BZ#1627617 (which was FailedQA), and along the way WB was turned off to unblock the ENOENT problem (as a starting point; nothing conclusive yet). Now, with WB off, no SAS job has gotten stuck through 500 jobs so far (still in progress). But this bug was raised with WB on, so I believe ON_QA validation of this bug is blocked here.

Retested with WB on, and no jobs got stuck; hence moving to Verified. Tested on 3.12.2-21 with 4 jobs per client across a total of 4 clients; the tests ran and completed with WB on.

Note: there is still another issue being tracked separately, which appears to be inconsistent: BZ#1627617 - SAS job aborts complaining that a file doesn't exist. However, since we are not seeing any job getting stuck, unlike when this bug was raised (where jobs got stuck within the first few jobs), I am moving this to Verified.

I think it's important to understand that the other issue noted in comment #11 happens early in the testing. It is possible that we could be hitting that bug sooner (from elevating the job count per node from 4 to 6) than we have typically seen this bug occur. Once engineering dreams up a fix for that other bug ;) this one could potentially return when we elevate the job count.

Barry

(In reply to Barry Marson from comment #12)
> I think it's important to understand that the other issue noted in comment
> #11 happens early in the testing. It is possible that we could be hitting
> that bug sooner (from elevating job count per node from 4 to 6) than we have
> typically seen this bug occur. Once engineering dreams up a fix for that
> other bug ;) this one potentially could return when we elevate job count.

Note that "the bug can return" is a hypothesis :). Along the same lines of hypothesizing, with my understanding of the issue, I would say it is highly unlikely. Running SAS on GlusterFS has issues, but it is unlikely this particular issue will come back.

>
> Barry

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3432
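For reference, the write-behind toggling mentioned in the verification comments above is normally done through the gluster CLI. A minimal sketch, using a hypothetical volume name ("sasvol") that is not taken from this report:

```sh
# disable write-behind (the "WB off" runs)
gluster volume set sasvol performance.write-behind off

# re-enable it for the verification runs with WB on
gluster volume set sasvol performance.write-behind on

# confirm the current setting
gluster volume get sasvol performance.write-behind
```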