Bug 1626780

Summary: SAS workload job getting stuck after some time

Product: Red Hat Gluster Storage
Component: write-behind
Version: rhgs-3.4
Hardware: Unspecified
OS: Unspecified
Status: CLOSED ERRATA
Severity: urgent
Priority: unspecified
Reporter: nchilaka <nchilaka>
Assignee: Raghavendra G <rgowdapp>
QA Contact: nchilaka <nchilaka>
Docs Contact:
CC: apaladug, bmarson, rgowdapp, rhs-bugs, sanandpa, sankarshan, sheggodu
Target Milestone: ---
Target Release: RHGS 3.4.z Batch Update 1
Keywords: ZStream
Flags: nchilaka: needinfo? (rgowdapp)
Whiteboard:
Fixed In Version: glusterfs-3.12.2-20
Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of:
Clones: 1626787 (view as bug list)
Environment:
Last Closed: 2018-10-31 08:46:14 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Bug Depends On:
Bug Blocks: 1626787

Attachments:
statedump of fusemount process from node where SAS job was stuck (flags: none)

Description nchilaka 2018-09-08 14:55:02 UTC
Description of problem:
======================
As part of validating the SAS workload (to help resolve the ENEL issue, BZ#1581306), the SAS workload is getting stuck after a certain number of jobs have completed.

Job 5 is stuck. This is effectively the first 10 minutes of the actual test. This is rhs-client44. It's not in a D state... I'm not sure what state it's in. I believe it's probably waiting on a mutex that failed to be handled correctly.

26551 ?        SNl    0:00 /usr/local/SAS/SASFoundation/9.4/utilities/bin/cntspawn -service 1443 -noscript -sascmd /sasdata/bulked/_scripts/spawner_sas.sh
26552 ?        SN     0:00  \_ sasels  7 4 8 2 111bc5
 7939 ?        SN     0:00      \_ /usr/local/SAS/SASFoundation/9.4/utilities/bin/sasauth
 7940 ?        SN     0:00      \_ /usr/local/SAS/SASFoundation/9.4/utilities/bin/sasauth
 8416 ?        SN     0:00      \_ /bin/ksh /sasdata/bulked/_scripts/spawner_sas.sh -DMR -COMAMID TCP -NOXCMD -NOTERMINAL
 8418 ?        SNl    2:16          \_ /usr/local/SAS/SASFoundation/9.4/sasexe/sas -config ../_scripts/sas_distr.cfg -DMR -COMAMID TCP -NOXCMD -NOTERMINAL
 8420 ?        SN     0:00              \_ sasels  13 10 14 2 fea46

Something else was going on as well.

[Fri Sep  7 20:55:16 2018] INFO: task sas:26556 blocked for more than 120 seconds.
[Fri Sep  7 20:55:16 2018] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[Fri Sep  7 20:55:16 2018] sas             D ffff9974bd270000     0 26556   8416 0x00000080
[Fri Sep  7 20:55:16 2018] Call Trace:
[Fri Sep  7 20:55:16 2018]  [<ffffffff98318db9>] schedule+0x29/0x70
[Fri Sep  7 20:55:16 2018]  [<ffffffffc07124e5>] __fuse_request_send+0xf5/0x2e0 [fuse]
[Fri Sep  7 20:55:16 2018]  [<ffffffffc0715bb4>] ? fuse_get_req_nofail_nopages+0xc4/0x1e0 [fuse]
[Fri Sep  7 20:55:16 2018]  [<ffffffff97cbef10>] ? wake_up_atomic_t+0x30/0x30
[Fri Sep  7 20:55:16 2018]  [<ffffffffc07126e2>] fuse_request_send+0x12/0x20 [fuse]
[Fri Sep  7 20:55:16 2018]  [<ffffffffc071ba6f>] fuse_flush+0xff/0x150 [fuse]
[Fri Sep  7 20:55:16 2018]  [<ffffffff97e1c5e7>] filp_close+0x37/0x90
[Fri Sep  7 20:55:16 2018]  [<ffffffff97e3f356>] __close_fd+0x96/0xc0
[Fri Sep  7 20:55:16 2018]  [<ffffffff97e1e1a3>] SyS_close+0x23/0x50
[Fri Sep  7 20:55:16 2018]  [<ffffffff9832579b>] system_call_fastpath+0x22/0x27
[Fri Sep  7 20:55:16 2018]  [<ffffffff983256e1>] ? system_call_after_swapgs+0xae/0x146
[The identical call trace for task sas:26556 repeated in the kernel log every 120 seconds, from 20:57:16 through 21:13:16; the repeats are elided here.]
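For context, the 120-second threshold quoted in these messages comes from the kernel's hung-task detector, which the log itself references via /proc/sys/kernel. A minimal diagnostic sketch (RHEL 7 defaults assumed):

```shell
# Show the current hung-task threshold; 120 s by default on RHEL 7,
# matching the "blocked for more than 120 seconds" messages above.
cat /proc/sys/kernel/hung_task_timeout_secs

# Writing 0 silences the warnings, but only hides the symptom --
# the close()/FLUSH is still stuck in the fuse mount:
# echo 0 > /proc/sys/kernel/hung_task_timeout_secs
```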
I'm not sure if the stuck process is related to the next problem, but the driver process seems virtually stuck trying to prepare new jobs to run on the two nodes. The process has been more or less idle for an hour. This shouldn't happen. Maybe it's stuck for a similar reason. Its PID is below, on rhs-client44 as well:

26397 pts/0    S      0:00  |       \_ /bin/sh ./do-calibration-gluster-run.sh writebehind
26645 pts/0    S      0:00  |       |   \_ /bin/ksh ./run_distr.sh
26646 pts/0    Sl     3:57  |       |       \_ /usr/local/SAS/SASFoundation/9.4/sasexe/sas -config ../_scripts/sas_distr.cfg -sysin ../_scripts/run.sas -work ../_run
26647 pts/0    S      0:00  |       |           \_ sasels  7 4 8 2 feaa6





Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
------------------
[Putting together all the information that Barry (bmarson) and I have discussed over time, for better understanding and for future reference.]

1. Created a 4-node cluster (16 GB RAM, 10 GbE NIC, 3 x 1.8 TB storage).
2. All bricks are created directly on PVs formatted with XFS (instead of on thin LVs); this was because performance improved from ~40 mbps to ~170 mbps.
3. Created two volumes, sasdata and saswork, with the volume configurations below:
Volume Name: sasdata
Type: Distributed-Replicate
Volume ID: 39ebc7b5-0302-4dcc-86ad-cf018c98bced
Status: Started
Snapshot Count: 0
Number of Bricks: 2 x (2 + 1) = 6
Transport-type: tcp
Bricks:
Brick1: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick1/sasdata
Brick2: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick1/sasdata
Brick3: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick2/sasdata-arbiter (arbiter)
Brick4: rhs-client38.lab.eng.blr.redhat.com:/gluster/brick3/sasdata
Brick5: rhs-client19.lab.eng.blr.redhat.com:/gluster/brick3/sasdata
Brick6: rhs-client25.lab.eng.blr.redhat.com:/gluster/brick3/sasdata-arbiter (arbiter)
Options Reconfigured:
diagnostics.client-log-level: INFO
performance.write-behind: on
performance.readdir-ahead: off
performance.open-behind: off
performance.io-cache: off
performance.read-ahead: off
performance.stat-prefetch: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: off
 
Volume Name: saswork
Type: Distribute
Volume ID: cc46536a-9704-4400-a12e-cc387f5746d9
Status: Started
Snapshot Count: 0
Number of Bricks: 2
Transport-type: tcp
Bricks:
Brick1: rhs-client32.lab.eng.blr.redhat.com:/gluster/brick1/saswork
Brick2: rhs-client38.lab.eng.blr.redhat.com:/gluster/brick1/saswork
Options Reconfigured:
diagnostics.client-log-level: INFO
performance.write-behind: on
performance.readdir-ahead: off
performance.open-behind: off
performance.io-cache: off
performance.read-ahead: off
performance.stat-prefetch: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on


sasdata is where the input and output results go
saswork is a scratch work area for temp files
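For reproducibility, the option profile shown in the volume info above (write-behind on, all other client-side performance xlators off) can be applied with standard `gluster volume set` commands; a hedged sketch, with the volume name taken from this report:

```shell
# Apply the test profile used in this bug: write-behind enabled,
# the remaining performance xlators disabled (same list as above).
VOL=sasdata   # repeat for saswork (which does not list client-io-threads)
gluster volume set "$VOL" performance.write-behind on
for opt in readdir-ahead open-behind io-cache read-ahead \
           stat-prefetch quick-read client-io-threads; do
    gluster volume set "$VOL" "performance.$opt" off
done
gluster volume set "$VOL" diagnostics.client-log-level INFO
gluster volume set "$VOL" nfs.disable on
```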

4. The SAS job is started as follows:
from a single client, a single SAS job is started, then scaled to 2, 4, and so on up to 750, as there are about 750 jobs in this model.

4 jobs take about ~25 min.

5. Over time this is also triggered from 3 other clients (making 4 in total).

6. On the first client, however, the SAS job got stuck after about 4 jobs had completed.

7. Restarted and mounted the volume on the client again (a fresh mount on the client), as the SAS job was getting into a zombie state.

Hit the issue again.



Actual results:
-------------


Expected results:
------------------

The SAS job should not get stuck.

Comment 3 nchilaka 2018-09-08 15:07:15 UTC
Created attachment 1481795 [details]
statedump of fusemount process from node where sas job was stuck

Comment 4 Raghavendra G 2018-09-08 16:11:42 UTC
The statedump in comment #3 somehow didn't have the flush call trace. I took another statedump, and a flush request was indeed stuck in write-behind.

<raghug> [xlator.performance.write-behind.wb_inode]
<raghug> path=/SAS_workAB39000020E2_rhs-client44.lab.eng.blr.redhat.com/TD_5/FCST/process/SUBG/gp_h_filter.sas7bdat.lck
<raghug> inode=0x7f4f9c463c40
<raghug> gfid=59e84e6d-76f5-40f0-8888-8d3745291011
<raghug> window_conf=1048576
<raghug> window_current=65536
<raghug> transit-size=0
<raghug> dontsync=0
<raghug> [.WRITE]
<raghug> unique=17337841
<raghug> refcount=1
<raghug> wound=no
<raghug> generation-number=6
<raghug> req->op_ret=65536
<raghug> req->op_errno=0
<raghug> sync-attempts=0
<raghug> sync-in-progress=no
<raghug> size=65536
<raghug> offset=65536
<raghug> lied=-1
<raghug> append=0
<raghug> fulfilled=0
<raghug> go=-1
<raghug> [.FLUSH]
<raghug> unique=17337847
<raghug> refcount=1
<raghug> wound=no
<raghug> generation-number=9
<raghug> req->op_ret=0
<raghug> req->op_errno=0
<raghug> sync-attempts=0
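For anyone repeating this analysis: a statedump of the fuse client (like the excerpt above) is obtained by sending SIGUSR1 to the glusterfs mount process, which writes the dump under /var/run/gluster by default. A minimal sketch; the pgrep pattern is an assumption about how the sasdata mount was started:

```shell
# Trigger a statedump of the fuse mount process. In the dump, the
# [xlator.performance.write-behind.wb_inode] sections show any
# pending WRITE/FLUSH requests, as in the excerpt above.
pid=$(pgrep -f 'glusterfs.*volfile-id.*sasdata')   # assumed mount of sasdata
kill -USR1 "$pid"
ls /var/run/gluster/glusterdump."$pid".dump.*      # default dump location
```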

Comment 5 Raghavendra G 2018-09-10 04:16:37 UTC
https://review.gluster.org/21123

Comment 10 nchilaka 2018-10-08 10:41:40 UTC
Testing so far on 3.4.1, i.e. 3.12.2-21:

We started with 4 clients, all perf xlators off and WB=on;
however, we started to hit the ENOENT problem as in BZ#1627617 (which had FailedQA),
and along the way WB was turned off to unblock the ENOENT problem (as a start, but nothing conclusive yet).
Now with WB off, so far no SAS job has gotten stuck up to 500 jobs (still in progress).
But this bug was raised with WB=on, hence I believe testing is blocked here
for this bug's ON_QA validation.

Comment 11 nchilaka 2018-10-15 16:18:30 UTC
Retested with WB=on, and no jobs got stuck; hence moving to verified.
Tested on 3.12.2-21 with 4 jobs per client across a total of 4 clients;
tests ran and completed with WB=on.
Note: there is still another issue being tracked separately, which appears to be intermittent, i.e. BZ#1627617 - SAS job aborts complaining that a file doesn't exist.

However, as we are not seeing any job getting stuck, unlike previously when this bug was raised (where it got stuck within the first few jobs), I am moving it to verified.

Comment 12 Barry Marson 2018-10-15 16:52:55 UTC
I think it's important to understand that the other issue noted in comment #11 happens early in the testing.  It is possible that we could be hitting that bug sooner (from elevating job count per node from 4 to 6) than we have typically seen this bug occur.  Once engineering dreams up a fix for that other bug ;) this one potentially could return when we elevate job count.

Barry

Comment 13 Raghavendra G 2018-10-16 01:38:19 UTC
(In reply to Barry Marson from comment #12)
> I think it's important to understand that the other issue noted in comment
> #11 happens early in the testing.  It is possible that we could be hitting
> that bug sooner (from elevating job count per node from 4 to 6) than we have
> typically seen this bug occur.  Once engineering dreams up a fix for that
> other bug ;) this one potentially could return when we elevate job count.

Note that "the bug can return" is a hypothesis :). On similar lines of hypothesizing, with my understanding of the issue, I would say it's highly unlikely.

Running SAS on Glusterfs has issues, but it is unlikely that this particular issue will come back.

> 
> Barry

Comment 15 errata-xmlrpc 2018-10-31 08:46:14 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2018:3432