Description of problem:

While running SAS jobs from 4 nodes with 6 jobs per node, SAS failed. The messages from SAS are below:

16SEP2018_19_52_35_refresh_mp.log..di_wrap_mg_process_unit.250.log:ERROR: A lock is not available for P_MDCPS.PARTITION_518_375.DATA.
16SEP2018_19_52_35_refresh_mp.log..di_wrap_mg_process_unit.250.log:ERROR: File GR.LOCAL_VARS.DATA does not exist.
16SEP2018_19_52_35_refresh_mp.log..di_wrap_mg_process_unit.250.log:ERROR: File GR.MODEL_SPEC_LOCAL.DATA does not exist.
16SEP2018_19_52_35_refresh_mp.log..di_wrap_mg_process_unit.250.log:ERROR: File GS.LOCAL_VARS.DATA does not exist.
16SEP2018_19_52_35_refresh_mp.log..di_wrap_mg_process_unit.250.log:ERROR: File GF.LOCAL_VARS.DATA does not exist.
16SEP2018_19_52_35_refresh_mp.log..di_wrap_mg_process_unit.251.log:ERROR: File P_DF.PARTITION_130_1.DATA is damaged. I/O processing did not complete.
16SEP2018_19_52_35_refresh_mp.log..di_wrap_mg_process_unit.251.log:ERROR: File P_DF.PARTITION_130_1.DATA is damaged.

The locked file should correspond to:
/sasdata/bulked/model_data_calib_ps/partition_518_375.sas7bdat

I'm not sure about the file referenced in the last two entries. My search comes up with several potential candidates:

-bash-4.2# find /sasdata -name \*partition_130_1\*t
/sasdata/bulked/model_param_est/partition_130_1.sas7bdat
/sasdata/bulked/scoring_param_avg/partition_130_1.sas7bdat
/sasdata/bulked/disagg_factor/partition_130_1.sas7bdat
/sasdata/bulked/bayesian_model/partition_130_1.sas7bdat
/sasdata/bulked/model_exception/partition_130_1.sas7bdat
/sasdata/bulked/attr_map/partition_130_1.sas7bdat
/sasdata/bulked/attr_reg/partition_130_1_9.sas7bdat
/sasdata/bulked/attr_reg/partition_130_1_21.sas7bdat
/sasdata/bulked/attr_reg/partition_130_1_25.sas7bdat
/sasdata/bulked/attr_reg/partition_130_1_23.sas7bdat
/sasdata/bulked/attr_reg/partition_130_1_24.sas7bdat
/sasdata/bulked/attr_reg/partition_130_1_3.sas7bdat
/sasdata/bulked/attr_reg/partition_130_1_1.sas7bdat
/sasdata/bulked/attr_reg/partition_130_1_22.sas7bdat

Volume configuration (gluster volume info):

Volume Name: sasdata
Type: Distributed-Replicate
Volume ID: a220c97b-d6e5-456f-a99f-9787a0a2a016
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: xyz:/gluster/brick1/sasdata
Brick2: abc:/gluster/brick1/sasdata
Brick3: def:/gluster/brick2/sasdata-arbiter (arbiter)
Brick4: machine1:/gluster/brick3/sasdata
Brick5: machine2:/gluster/brick3/sasdata
Brick6: machine3:/gluster/brick3/sasdata-arbiter (arbiter)
Options Reconfigured:
locks.trace: on
diagnostics.count-fop-hits: on
diagnostics.latency-measurement: on
performance.quick-read: off
performance.stat-prefetch: off
performance.read-ahead: off
performance.io-cache: off
performance.open-behind: off
performance.write-behind: on
performance.readdir-ahead: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Version-Release number of selected component (if applicable):

glusterfs-client-xlators-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-cli-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-resource-agents-3.12.2-18.2.gita99ede286b.el7rhgs.noarch
glusterfs-debuginfo-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
python2-gluster-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-api-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-server-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-geo-replication-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-rdma-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-libs-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-fuse-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64
glusterfs-events-3.12.2-18.2.gita99ede286b.el7rhgs.x86_64

This is a private build on top of v3.12.2-18. The build had the following two patches extra:
https://review.gluster.org/21123
https://review.gluster.org/21146

How reproducible:
2 out of 3 runs

Steps to Reproduce:
1. Mount the sasdata volume on 4 client nodes via the GlusterFS FUSE client.
2. Run SAS jobs from the 4 nodes with 6 jobs per node against /sasdata.
3. Check the SAS logs for errors.

Actual results:
SAS jobs fail with "A lock is not available", "File ... does not exist", and "File ... is damaged" errors.

Expected results:
All SAS jobs complete without errors.

Additional info:
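For reference, the layout above (two distribute subvolumes, each replica 3 with one arbiter brick) and the reconfigured options correspond to commands along the following lines. This is an illustrative sketch reconstructed from the gluster volume info output, not the exact commands that were run:

# Illustrative reconstruction of the reported topology; hostnames and
# brick paths are taken from the volume info output above.
gluster volume create sasdata replica 3 arbiter 1 \
    xyz:/gluster/brick1/sasdata abc:/gluster/brick1/sasdata def:/gluster/brick2/sasdata-arbiter \
    machine1:/gluster/brick3/sasdata machine2:/gluster/brick3/sasdata machine3:/gluster/brick3/sasdata-arbiter
gluster volume start sasdata

# Lock tracing and fop diagnostics as shown under "Options Reconfigured":
gluster volume set sasdata locks.trace on
gluster volume set sasdata diagnostics.latency-measurement on
gluster volume set sasdata diagnostics.count-fop-hits on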
Statedumps were collected, and there was also a run with locks.trace enabled. Neither indicated any stale POSIX locks at the features/locks translator. Logs and statedumps will be uploaded shortly.
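For reference, the statedumps were taken the usual way; a sketch of the commands used to collect and inspect them (default statedump path assumed, dump file names vary by brick path and PID):

# Trigger a statedump on every brick process of the volume; the dumps
# land in /var/run/gluster by default.
gluster volume statedump sasdata

# Check the features/locks sections of the dumps for lingering posix locks.
grep posixlk /var/run/gluster/*.dump.*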
Note that saswork was mounted on a local filesystem; only the sasdata directory was on GlusterFS.
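In other words, each client node looked roughly like this (illustrative; the exact mount options and server name used were not captured in this report):

# sasdata is a GlusterFS FUSE mount:
mount -t glusterfs xyz:/sasdata /sasdata
# saswork lives on a local filesystem on each node, so SAS scratch I/O
# never touches gluster; only /sasdata does.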
*** This bug has been marked as a duplicate of bug 1627617 ***