Bug 1630735

Summary: SAS job aborts complaining of a lock acquisition failure
Product: [Red Hat Storage] Red Hat Gluster Storage
Reporter: Raghavendra G <rgowdapp>
Component: glusterfs
Assignee: Raghavendra G <rgowdapp>
Status: CLOSED DUPLICATE
QA Contact: Raghavendra G <rgowdapp>
Severity: high
Docs Contact:
Priority: medium
Version: rhgs-3.4
CC: bmarson, nchilaka, rhs-bugs, sankarshan, vbellur
Target Milestone: ---
Keywords: ZStream
Target Release: ---
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2018-10-22 04:54:26 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:

Description Raghavendra G 2018-09-19 07:04:14 UTC
Description of problem:
While running SAS jobs from 4 nodes, with 6 jobs per node, the SAS jobs failed.

The error messages reported by SAS are below:

16SEP2018_19_52_35_refresh_mp.log..di_wrap_mg_process_unit.250.log:ERROR: A lock is not available for P_MDCPS.PARTITION_518_375.DATA.
16SEP2018_19_52_35_refresh_mp.log..di_wrap_mg_process_unit.250.log:ERROR: File GR.LOCAL_VARS.DATA does not exist.
16SEP2018_19_52_35_refresh_mp.log..di_wrap_mg_process_unit.250.log:ERROR: File GR.MODEL_SPEC_LOCAL.DATA does not exist.
16SEP2018_19_52_35_refresh_mp.log..di_wrap_mg_process_unit.250.log:ERROR: File GS.LOCAL_VARS.DATA does not exist.
16SEP2018_19_52_35_refresh_mp.log..di_wrap_mg_process_unit.250.log:ERROR: File GF.LOCAL_VARS.DATA does not exist.
16SEP2018_19_52_35_refresh_mp.log..di_wrap_mg_process_unit.251.log:ERROR: File P_DF.PARTITION_130_1.DATA is damaged. I/O processing did not complete.
16SEP2018_19_52_35_refresh_mp.log..di_wrap_mg_process_unit.251.log:ERROR: File P_DF.PARTITION_130_1.DATA is damaged.

That should correspond to the file:

/sasdata/bulked/model_data_calib_ps/partition_518_375.sas7bdat

I'm not sure about the file referenced in the last two entries. My search turns up several potential candidates:

-bash-4.2# find /sasdata -name \*partition_130_1\*t
/sasdata/bulked/model_param_est/partition_130_1.sas7bdat
/sasdata/bulked/scoring_param_avg/partition_130_1.sas7bdat
/sasdata/bulked/disagg_factor/partition_130_1.sas7bdat
/sasdata/bulked/bayesian_model/partition_130_1.sas7bdat
/sasdata/bulked/model_exception/partition_130_1.sas7bdat
/sasdata/bulked/attr_map/partition_130_1.sas7bdat
/sasdata/bulked/attr_reg/partition_130_1_9.sas7bdat
/sasdata/bulked/attr_reg/partition_130_1_21.sas7bdat
/sasdata/bulked/attr_reg/partition_130_1_25.sas7bdat
/sasdata/bulked/attr_reg/partition_130_1_23.sas7bdat
/sasdata/bulked/attr_reg/partition_130_1_24.sas7bdat
/sasdata/bulked/attr_reg/partition_130_1_3.sas7bdat
/sasdata/bulked/attr_reg/partition_130_1_1.sas7bdat
/sasdata/bulked/attr_reg/partition_130_1_22.sas7bdat


Volume Name: sasdata
Type: Distributed-Replicate
Volume ID: a220c97b-d6e5-456f-a99f-9787a0a2a016
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: xyz:/gluster/brick1/sasdata
Brick2: abc:/gluster/brick1/sasdata
Brick3: def:/gluster/brick2/sasdata-arbiter (arbiter)
Brick4: machine1:/gluster/brick3/sasdata
Brick5: machine2:/gluster/brick3/sasdata
Brick6: machine3:/gluster/brick3/sasdata-arbiter (arbiter)
Options Reconfigured:
locks.trace: on
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.quick-read: off
performance.stat-prefetch: off
performance.read-ahead: off
performance.io-cache: off
performance.open-behind: off
performance.write-behind: on
performance.readdir-ahead: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
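
For context, the non-default options listed above would normally be applied with "gluster volume set". A minimal sketch follows; the volume name comes from the report, but the exact sequence run on this setup is an assumption:

# Sketch: applying the non-default options shown in the volume info above
# (assumed commands, not copied from the reporter's setup).
gluster volume set sasdata performance.quick-read off
gluster volume set sasdata performance.stat-prefetch off
gluster volume set sasdata performance.read-ahead off
gluster volume set sasdata performance.io-cache off
gluster volume set sasdata performance.open-behind off
gluster volume set sasdata performance.readdir-ahead off
gluster volume set sasdata performance.client-io-threads off
gluster volume set sasdata diagnostics.latency-measurement on
gluster volume set sasdata diagnostics.count-fop-hits on
gluster volume set sasdata locks.trace on

Disabling the client-side caching translators (quick-read, stat-prefetch, io-cache, read-ahead, open-behind) is a common step when chasing cache-coherency or locking problems seen from multiple clients.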

Version-Release number of selected component (if applicable):
glusterfs-client-xlators-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-cli-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-resource-agents-3.12.2-18.2.gita99ede286b.el7rhgs.noarch
glusterfs-debuginfo-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
python2-gluster-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-api-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-server-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-rdma-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-libs-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-fuse-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-events-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64

Private build on top of v3.12.2-18. The build had the following two extra patches:
https://review.gluster.org/21123
https://review.gluster.org/21146

How reproducible:
2/3 runs

Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 2 Raghavendra G 2018-09-19 07:05:37 UTC
Statedumps were collected, and there was a run with locks.trace enabled. Neither indicated any stale POSIX locks in the features/locks translator. Will upload logs and statedumps shortly.
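
For reference, a minimal sketch of how such statedumps are usually gathered and checked for leftover POSIX locks; the exact commands used for this run are an assumption, and the dump location can vary (it is configurable via server.statedump-path):

# Server side: dump the state of every brick process of the volume.
# Dump files normally land under /var/run/gluster/.
gluster volume statedump sasdata

# Client side: a FUSE mount process writes a statedump on SIGUSR1.
kill -USR1 $(pgrep -f 'glusterfs.*sasdata')

# Look for lock entries recorded by features/locks; granted/blocked POSIX
# locks appear in the inode sections of the dump.
grep -E 'posixlk|ACTIVE|BLOCKED' /var/run/gluster/*.dump.*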

Comment 3 Raghavendra G 2018-09-19 07:08:31 UTC
Note that saswork was mounted on a local filesystem; only the sasdata directory was on glusterfs.

Comment 4 Raghavendra G 2018-10-22 04:54:26 UTC

*** This bug has been marked as a duplicate of bug 1627617 ***