Description of problem:
=======================
As part of validating the SAS workload (to help resolve the ENEL issue BZ#1581306), the SAS workload is getting stuck after certain jobs have completed; job 5 is stuck. This is effectively within the first 10 minutes of the actual test. This is on rhs-client44. It's not in a D state, and I'm not sure what state it is in; I believe it's probably waiting on a mutex that failed to be handled correctly.

26551 ?  SNl  0:00 /usr/local/SAS/SASFoundation/9.4/utilities/bin/cntspawn -service 1443 -noscript -sascmd /sasdata/bulked/_scripts/spawner_sas.sh
26552 ?  SN   0:00  \_ sasels 7 4 8 2 111bc5
 7939 ?  SN   0:00  \_ /usr/local/SAS/SASFoundation/9.4/utilities/bin/sasauth
 7940 ?  SN   0:00      \_ /usr/local/SAS/SASFoundation/9.4/utilities/bin/sasauth
 8416 ?  SN   0:00          \_ /bin/ksh /sasdata/bulked/_scripts/spawner_sas.sh -DMR -COMAMID TCP -NOXCMD -NOTERMINAL
 8418 ?  SNl  2:16              \_ /usr/local/SAS/SASFoundation/9.4/sasexe/sas -config ../_scripts/sas_distr.cfg -DMR -COMAMID TCP -NOXCMD -NOTERMINAL
 8420 ?  SN   0:00                  \_ sasels 13 10 14 2 fea46

Something else was going on as well. The kernel's hung-task detector flagged a sas process (pid 26556, whose parent is the ksh spawner 8416 above) stuck in a FUSE flush during close():

[Fri Sep 7 20:55:16 2018] INFO: task sas:26556 blocked for more than 120 seconds.
[Fri Sep 7 20:55:16 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Sep 7 20:55:16 2018] sas  D ffff9974bd270000  0 26556  8416 0x00000080
[Fri Sep 7 20:55:16 2018] Call Trace:
[Fri Sep 7 20:55:16 2018]  [<ffffffff98318db9>] schedule+0x29/0x70
[Fri Sep 7 20:55:16 2018]  [<ffffffffc07124e5>] __fuse_request_send+0xf5/0x2e0 [fuse]
[Fri Sep 7 20:55:16 2018]  [<ffffffffc0715bb4>] ? fuse_get_req_nofail_nopages+0xc4/0x1e0 [fuse]
[Fri Sep 7 20:55:16 2018]  [<ffffffff97cbef10>] ? wake_up_atomic_t+0x30/0x30
[Fri Sep 7 20:55:16 2018]  [<ffffffffc07126e2>] fuse_request_send+0x12/0x20 [fuse]
[Fri Sep 7 20:55:16 2018]  [<ffffffffc071ba6f>] fuse_flush+0xff/0x150 [fuse]
[Fri Sep 7 20:55:16 2018]  [<ffffffff97e1c5e7>] filp_close+0x37/0x90
[Fri Sep 7 20:55:16 2018]  [<ffffffff97e3f356>] __close_fd+0x96/0xc0
[Fri Sep 7 20:55:16 2018]  [<ffffffff97e1e1a3>] SyS_close+0x23/0x50
[Fri Sep 7 20:55:16 2018]  [<ffffffff9832579b>] system_call_fastpath+0x22/0x27
[Fri Sep 7 20:55:16 2018]  [<ffffffff983256e1>] ? system_call_after_swapgs+0xae/0x146

[The identical call trace was logged every two minutes after that, from 20:57:16 through 21:13:16.]
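For triage, a minimal sketch of how to inspect a task stuck like this one. The pid (26556) is taken from the log above; root access is assumed, and the last check assumes the fusectl filesystem is mounted:

# Scheduler state ('D' = uninterruptible sleep) and the wait channel.
ps -o pid,stat,wchan:30,cmd -p 26556

# Kernel-side stack of the blocked task; should show the fuse_flush path above.
cat /proc/26556/stack

# Outstanding requests per FUSE connection
# (mount -t fusectl none /sys/fs/fuse/connections first if the directory is empty).
for c in /sys/fs/fuse/connections/*/waiting; do
    echo "$c: $(cat "$c")"
done

A non-zero "waiting" count that never drains would be consistent with a request stuck inside the gluster client stack rather than in the kernel.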
I'm not sure whether the stuck process is related to the next problem, but the driver process appears to be virtually stuck trying to prepare new jobs to run on the two nodes. The process has been more or less idle for an hour, which shouldn't happen. It may be stuck for a similar reason.
Its pid is below; this is on rhs-client44 as well:

26397 pts/0  S   0:00  | \_ /bin/sh ./do-calibration-gluster-run.sh writebehind
26645 pts/0  S   0:00  | |   \_ /bin/ksh ./run_distr.sh
26646 pts/0  Sl  3:57  | |       \_ /usr/local/SAS/SASFoundation/9.4/sasexe/sas -config ../_scripts/sas_distr.cfg -sysin ../_scripts/run.sas -work ../_run
26647 pts/0  S   0:00  | |           \_ sasels 7 4 8 2 feaa6

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
-------------------
[Putting together all the information that Barry (bmarson) and I have discussed over time, for better understanding and for future reference.]

1. Created a 4-node cluster (16GB RAM, 10gig NIC, 3 x 1.8TB storage).
2. All bricks were created directly on PVs formatted with XFS (instead of on thin LVs); this was because perf improved from ~40mbps to ~170mbps.
3. Created 2 volumes, sasdata and saswork, with the vol configs below (a sketch of equivalent setup commands follows after this section):

Volume Name: sasdata
Type: Distributed-Replicate
Volume ID: 39ebc7b5-0302-4dcc-86ad-cf018c98bced
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick1/sasdata
Brick2: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick1/sasdata
Brick3: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick2/sasdata-arbiter (arbiter)
Brick4: rhs-client38.lab.eng.blr.redhat.com:/gluster/brick3/sasdata
Brick5: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick3/sasdata
Brick6: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick3/sasdata-arbiter (arbiter)
Options Reconfigured:
diagnostics.client-log-level: INFO
performance.write-behind: on
performance.readdir-ahead: off
performance.open-behind: off
performance.io-cache: off
performance.read-ahead: off
performance.stat-prefetch: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off

Volume Name: saswork
Type: Distribute
Volume ID: cc46536a-9704-4400-a12e-cc387f5746d9
Status: Started
Snapshot Count: 0
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick1/saswork
Brick2: rhs-client38.lab.eng.blr.redhat.com:/gluster/brick1/saswork
Options Reconfigured:
diagnostics.client-log-level: INFO
performance.write-behind: on
performance.readdir-ahead: off
performance.open-behind: off
performance.io-cache: off
performance.read-ahead: off
performance.stat-prefetch: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on

sasdata is where the input and output results go; saswork is a scratch work area for temp files.

4. The SAS job is started as follows from a single client: a single SAS job is started, then scaled to 2, then 4, and so on up to 750, as there are about 750 jobs in this model. 4 jobs take about ~25 min.
5. Over time this is also triggered from 3 other clients (making 4 in total).
6. First run: on the first client, the SAS job got stuck after about 4 jobs.
7. Restarted and mounted the volume on the client again (fresh mount on the client), as the SAS job was getting into a zombie state. Hit the issue again.

Actual results:
---------------

Expected results:
-----------------
The SAS job should not get stuck.
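For reference, a minimal sketch of how the two volumes above could be created and tuned. The exact commands were not recorded in this report, so the brick ordering and the start/set sequence are reconstructed (an assumption) from the gluster volume info output:

# Hypothetical reconstruction of the volume setup described above.
gluster volume create sasdata replica 3 arbiter 1 \
    rhs-client19.lab.eng.blr.redhat.com:/gluster/brick1/sasdata \
    rhs-client25.lab.eng.blr.redhat.com:/gluster/brick1/sasdata \
    rhs-client32.lab.eng.blr.redhat.com:/gluster/brick2/sasdata-arbiter \
    rhs-client38.lab.eng.blr.redhat.com:/gluster/brick3/sasdata \
    rhs-client19.lab.eng.blr.redhat.com:/gluster/brick3/sasdata \
    rhs-client25.lab.eng.blr.redhat.com:/gluster/brick3/sasdata-arbiter

gluster volume create saswork \
    rhs-client32.lab.eng.blr.redhat.com:/gluster/brick1/saswork \
    rhs-client38.lab.eng.blr.redhat.com:/gluster/brick1/saswork

# Same option profile on both volumes: every perf xlator off except write-behind.
for v in sasdata saswork; do
    gluster volume set $v performance.write-behind on
    for opt in readdir-ahead open-behind io-cache read-ahead stat-prefetch quick-read; do
        gluster volume set $v performance.$opt off
    done
    gluster volume set $v nfs.disable on
    gluster volume start $v
done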
Created attachment 1481795 [details]
Statedump of the fuse mount process from the node where the SAS job was stuck
The statedump in comment #3 somehow didn't have the flush call trace. I took another statedump, and a flush request was indeed stuck in write-behind:

[xlator.performance.write-behind.wb_inode]
path=/SAS_workAB39000020E2_rhs-client44.lab.eng.blr.redhat.com/TD_5/FCST/process/SUBG/gp_h_filter.sas7bdat.lck
inode=0x7f4f9c463c40
gfid=59e84e6d-76f5-40f0-8888-8d3745291011
window_conf=1048576
window_current=65536
transit-size=0
dontsync=0

[.WRITE]
unique=17337841
refcount=1
wound=no
generation-number=6
req->op_ret=65536
req->op_errno=0
sync-attempts=0
sync-in-progress=no
size=65536
offset=65536
lied=-1
append=0
fulfilled=0
go=-1

[.FLUSH]
unique=17337847
refcount=1
wound=no
generation-number=9
req->op_ret=0
req->op_errno=0
sync-attempts=0
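For anyone reproducing this analysis, a sketch of how such a client-side statedump can be taken. The dump directory and the pgrep match are assumptions (both can vary per build and per number of mounts on the box):

# A gluster FUSE client dumps its state on SIGUSR1; dumps land in
# /var/run/gluster by default. With multiple mounts, match the right
# glusterfs process on its mount point in ps output instead.
CLIENT_PID=$(pgrep -x glusterfs | head -1)
kill -USR1 "$CLIENT_PID"

# Inspect write-behind inode records in the freshest dump.
DUMP=$(ls -t /var/run/gluster/glusterdump.*.dump.* 2>/dev/null | head -1)
grep -A8 'xlator.performance.write-behind.wb_inode' "$DUMP"

Here the [.WRITE] record shows lied=-1/fulfilled=0 with wound=no, i.e. write-behind acknowledged the write to the application but never synced it down, so the subsequent [.FLUSH] can never complete.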
https://review.gluster.org/21123
Testing so far is on 3.4.1, i.e. 3.12.2-21.

We started with 4 clients, all perf xlators off, and WB=on; however, we started to hit the ENOENT problem as in BZ#1627617 (which was FailedQA), and along the way WB was turned off to unblock the ENOENT problem (as a start, but nothing conclusive yet).

Now with WB off, so far no SAS job has gotten stuck through 500 jobs (still in progress). But this bug was raised with WB=on, and hence I believe testing is blocked here for this bug's ON_QA validation.
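For the record, a sketch of the per-volume toggle involved (the exact invocations used during this run were not recorded here; volume name from the setup above):

gluster volume set sasdata performance.write-behind off   # turned off to unblock the ENOENT problem
gluster volume set sasdata performance.write-behind on    # back on for validating this bug

The same would apply to saswork; option changes take effect on live mounts without a remount.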
Retested with WB=on, and no jobs got stuck; hence moving to VERIFIED.

Tested on 3.12.2-21 with 4 jobs per client across a total of 4 clients; the tests ran and completed with WB=on.

Note: there is still another issue being tracked separately, which seems to be inconsistent, i.e. BZ#1627617 - SAS job aborts complaining that a file doesn't exist. However, as we are not seeing any job getting stuck, unlike when this bug was raised (where it was getting stuck within the initial few jobs itself), I am moving this to VERIFIED.
I think it's important to understand that the other issue noted in comment #11 happens early in the testing. It is possible that we could be hitting that bug sooner (from elevating job count per node from 4 to 6) than we have typically seen this bug occur. Once engineering dreams up a fix for that other bug ;) this one potentially could return when we elevate job count. Barry
(In reply to Barry Marson from comment #12)
> I think it's important to understand that the other issue noted in comment
> #11 happens early in the testing. It is possible that we could be hitting
> that bug sooner (from elevating job count per node from 4 to 6) than we have
> typically seen this bug occur. Once engineering dreams up a fix for that
> other bug ;) this one potentially could return when we elevate job count.

Note that "the bug can return" is a hypothesis :). On similar lines of hypothesizing, with my understanding of the issue, I would say it's highly unlikely. Running SAS on Glusterfs has issues, but it's unlikely this particular issue will come back.

>
> Barry
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2018:3432