Description of problem:

The ioprocess log is missing basic information required to debug it when storage is slow.

In https://bugzilla.redhat.com/show_bug.cgi?id=1270220#c19, the unlink syscall took 90 seconds, but ioprocess timed out after 180 seconds. It seems that the timeout was caused by waiting in the request queue, because all threads were busy processing requests on very slow NFS storage. However, we don't have any information about the ioprocess queues or the time a request spent waiting in the queue, so we can only guess.

I think we need these additional logs at debug level (see the illustrative sketch at the end of this comment):

1. When a request starts running:
   Starting request <id> <parameters>

2. When a request is queued:
   Queuing request <id> (<queue size>)

3. When a request is finished:
   Finished request <id> (status=<ok/fail> wait_time=<wait time> run_time=<run time>)

4. When the queue is full - error:
   Queue full, rejecting request <id>
   (Hopefully we do not block in this case, creating an unlimited delay in the caller.)

These logs should make it easy to tune the ioprocess queue size and number of threads in the field, to adapt to slow storage or other trouble. It should also be easy to follow a request in the logs, hence the request id.

--

Note: ignore the sub component; this was the only option.
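For illustration only, a minimal sketch of how a worker could record these timings. This is not the actual ioprocess code: the glib calls (g_debug, g_warning, g_get_monotonic_time) are real, but the Request struct and the request_* functions are hypothetical names invented for this example.

#include <glib.h>

typedef struct {
    long   id;          /* request id, so a request can be followed in the logs */
    gint64 queued_us;   /* monotonic time when the request was queued */
    gint64 started_us;  /* monotonic time when a worker picked it up */
} Request;

static void request_queued(Request *req, int queue_size)
{
    req->queued_us = g_get_monotonic_time();
    g_debug("Queuing request %li (%d)", req->id, queue_size);
}

static void request_started(Request *req)
{
    req->started_us = g_get_monotonic_time();
    g_debug("Starting request %li", req->id);
}

static void request_finished(Request *req, gboolean ok)
{
    gint64 now = g_get_monotonic_time();
    g_debug("Finished request %li (status=%s wait_time=%li run_time=%li)",
            req->id, ok ? "ok" : "fail",
            (long) ((req->started_us - req->queued_us) / 1000),   /* ms */
            (long) ((now - req->started_us) / 1000));             /* ms */
}

static void request_rejected(Request *req)
{
    /* Queue full: log and reject instead of blocking the caller. */
    g_warning("Queue full, rejecting request %li", req->id);
}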
The attachments are not relevant, ignore them.
Current ioprocess logs can contain junk, which makes them harder to use; tracked in bug 1275062.
In oVirt, testing is done on a single release by default, so I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have testing resources to handle the 4.0 clone.
Bug tickets that are moved to testing must have a target release set to make sure the tester knows what to test. Please set the correct target release before moving to ON_QA.
QE, please verify it while testing https://bugzilla.redhat.com/show_bug.cgi?id=1270220. Thank you, P.
ioprocess-0.15.0-5.el7ev.x86_64

It's OK for all ioprocess debug messages, but as I'm unable to simulate slow NFS (I even tried Nir's slowfs, but I couldn't activate slowfs via NFS in the engine because sanlock didn't play well with it), I can't see the logging about a full queue.

2016-01-15 15:41:25+0100 3242 [13244]: s5 lockspace 9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426:2:/rhev/data-center/mnt/10.34.63.223:_slowfs/9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426/dom_md/ids:0
2016-01-15 15:41:46+0100 3263 [973]: s5:r3 resource 9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426:SDM:/rhev/data-center/mnt/10.34.63.223:_slowfs/9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426/dom_md/leases:1048576 for 2,9,8591
2016-01-15 15:41:56+0100 3273 [17652]: 9f2747f8 aio timeout 0 0x7f5f340008c0:0x7f5f340008d0:0x7f5f581da000 ioto 10 to_count 1
2016-01-15 15:41:56+0100 3273 [17652]: s5 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/10.34.63.223:_slowfs/9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426/dom_md/ids
2016-01-15 15:41:56+0100 3273 [17652]: s5 renewal error -202 delta_length 10 last_success 3242
2016-01-15 15:42:07+0100 3284 [17652]: 9f2747f8 aio timeout 0 0x7f5f34000910:0x7f5f34000920:0x7f5f580d8000 ioto 10 to_count 2
2016-01-15 15:42:07+0100 3284 [17652]: s5 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/10.34.63.223:_slowfs/9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426/dom_md/ids
2016-01-15 15:42:07+0100 3284 [17652]: s5 renewal error -202 delta_length 11 last_success 3242
2016-01-15 15:42:18+0100 3295 [17652]: 9f2747f8 aio timeout 0 0x7f5f34000960:0x7f5f34000970:0x7f5f3b6fe000 ioto 10 to_count 3
2016-01-15 15:42:18+0100 3295 [17652]: s5 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/10.34.63.223:_slowfs/9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426/dom_md/ids
2016-01-15 15:42:18+0100 3295 [17652]: s5 renewal error -202 delta_length 11 last_success 3242

But it's OK for the other ioprocess debug messages:

ioprocess communication (13350)::DEBUG::2016-01-15 15:43:09,407::__init__::411::IOProcess::(_processLogs) (1640) Start request for method 'access' (waitTime=28)
ioprocess communication (13350)::DEBUG::2016-01-15 15:43:09,407::__init__::411::IOProcess::(_processLogs) Queuing request (slotsLeft=20)
ioprocess communication (13350)::DEBUG::2016-01-15 15:43:09,407::__init__::411::IOProcess::(_processLogs) (1640) Finished request for method 'access' (runTime=216)
ioprocess communication (13350)::DEBUG::2016-01-15 15:43:09,407::__init__::411::IOProcess::(_processLogs) (1640) Dequeuing request (slotsLeft=21)

Any hints? Or is it OK regarding item 4 above?
FYI, the 'id' is missing in the 'Queuing request' log message. It seems odd with respect to comment #0.
(In reply to Jiri Belka from comment #11)
> FYI, the 'id' is missing in the 'Queuing request' log message. It seems odd
> with respect to comment #0.

I agree, but adding the id requires a rather large change in ioprocess (the id is not available when this message is logged). I plan to improve this in ioprocess 1.0.
(In reply to Jiri Belka from comment #10)
> ioprocess-0.15.0-5.el7ev.x86_64
>
> It's OK for all ioprocess debug messages, but as I'm unable to simulate slow
> NFS (I even tried Nir's slowfs, but I couldn't activate slowfs via NFS in
> the engine because sanlock didn't play well with it), I can't see the
> logging about a full queue.

To simulate a full queue, you need to configure ioprocess with a smaller queue size and submit several tasks that vdsm performs in parallel.

However, the patch attached to this bug did not change the logs in the case of a full queue. The current code logs this warning when the queue is full:

    g_warning("(%li) Request queue full", reqId);

Since the patch did not change this behavior, there is nothing to verify, unless you want to verify that the behavior did not change. (A sketch of what this path presumably looks like is below.)
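To make the current behavior concrete, a minimal sketch of what such a queue-full path could look like. This is hypothetical, not the actual ioprocess source: only the g_warning line is quoted from the current code, while queue_request, slotsLeft, and the rejection logic are illustrative names and assumptions (including the assumption that a full queue rejects rather than blocks).

#include <glib.h>

/* Hypothetical sketch; only the g_warning line is taken from the
 * current code. */
static gboolean queue_request(long reqId, int *slotsLeft)
{
    if (*slotsLeft <= 0) {
        /* Assumption: a full queue rejects the request instead of
         * blocking the caller indefinitely. */
        g_warning("(%li) Request queue full", reqId);
        return FALSE;
    }

    (*slotsLeft)--;
    /* Matches the observed debug message, e.g.
     * "Queuing request (slotsLeft=20)". */
    g_debug("Queuing request (slotsLeft=%d)", *slotsLeft);
    return TRUE;
}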
OK for now with ioprocess-0.15.0-5.el7ev.x86_64; based on comment #13 it's OK. Improvements to the full-queue and queuing-request-id logging are on the dev's TODO list.