Bug 1273122

Summary: Missing information about ioprocess state
Product: [oVirt] ovirt-distribution
Component: ioprocess
ioprocess sub component: General
Reporter: Nir Soffer <nsoffer>
Assignee: Yeela Kaplan <ykaplan>
QA Contact: Jiri Belka <jbelka>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
CC: amureini, bugs, oourfali, sbonazzo, tspeetje, ybronhei, yjog, ykaplan, ylavi
Version: ioprocess-0.15.0
Flags: oourfali: ovirt-3.6.z?
       rule-engine: planning_ack?
       oourfali: devel_ack+
       pstehlik: testing_ack+
Target Milestone: ovirt-3.6.1
Target Release: ioprocess-0.15.0
Hardware: Unspecified
OS: Unspecified
Whiteboard: infra
Fixed In Version: ioprocess-0.15.0-4
Doc Type: Bug Fix
Last Closed: 2016-01-19 15:36:53 UTC
Type: Bug
oVirt Team: Infra
Clones: 1286963 (view as bug list)
Bug Depends On: 1275062
Bug Blocks: 1270220, 1286963

Description Nir Soffer 2015-10-19 16:37:20 UTC
Description of problem:

The ioprocess log is missing basic information required for debugging when
storage is slow.

In https://bugzilla.redhat.com/show_bug.cgi?id=1270220#c19, the unlink
syscall took 90 seconds, but ioprocess timed out after 180 seconds.
It seems that the timeout was caused by waiting in the request queue,
because all threads were busy processing requests against very slow
NFS storage.

However, we don't have any information about the ioprocess queues or the
time a request spent waiting in the queue, so we can only guess.

I think we need the following additional log messages at debug level
(a rough sketch follows after the list):

1. When a request starts running
   Starting request <id> <parameters>

2. When a request is queued
   Queuing request <id> (<queue size>)

3. When a request is finished
   Finished request <id> (status=<ok/fail> wait_time=<wait time>
       run_time=<run time>)

4. When the queue is full - error
   Queue full, rejecting request <id>

   (Hopefully we do not block in this case, creating unlimited delay
    in the caller).

These logs should make it easy to tune the ioprocess queue size and number of
threads in the field, to adapt to slow storage or other trouble.

It should also be easy to follow a request in the logs, hence the request
id.
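
A minimal sketch of how these messages could be emitted from ioprocess's
GLib-based C code; the types, helper names, and fields here (Request,
log_request_queued, req->id, and so on) are illustrative assumptions, not the
actual ioprocess implementation:

    #include <glib.h>

    /* Hypothetical request record; the real ioprocess structures differ. */
    typedef struct {
        long id;
        const char *method;
    } Request;

    /* 2. When a request is queued */
    static void log_request_queued(Request *req, GQueue *queue)
    {
        g_debug("(%li) Queuing request (queue size: %u)",
                req->id, g_queue_get_length(queue));
    }

    /* 1. When a request starts running */
    static void log_request_start(Request *req)
    {
        g_debug("(%li) Starting request for method '%s'",
                req->id, req->method);
    }

    /* 3. When a request is finished */
    static void log_request_finished(Request *req, gboolean ok,
                                     gint64 wait_ms, gint64 run_ms)
    {
        g_debug("(%li) Finished request (status=%s wait_time=%" G_GINT64_FORMAT
                " run_time=%" G_GINT64_FORMAT ")",
                req->id, ok ? "ok" : "fail", wait_ms, run_ms);
    }

    /* 4. When the queue is full - error */
    static void log_queue_full(long req_id)
    {
        g_warning("(%li) Queue full, rejecting request", req_id);
    }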

--
Note: ignore the sub component, this was the only option.

Comment 3 Nir Soffer 2015-10-21 12:41:48 UTC
The attachments are not relevant, ignore them.

Comment 4 Nir Soffer 2015-10-26 09:09:57 UTC
Current ioprocess logs can contain junk, which makes them harder to use;
this is tracked in bug 1275062.

Comment 5 Yaniv Lavi 2015-10-29 12:46:08 UTC
In oVirt, testing is done on a single release by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note that we might not have the testing resources to handle the 4.0 clone.

Comment 7 Red Hat Bugzilla Rules Engine 2015-12-06 11:59:53 UTC
Bug tickets that are moved to testing must have target release set to make sure tester knows what to test. Please set the correct target release before moving to ON_QA.

Comment 9 Pavel Stehlik 2015-12-23 11:57:59 UTC
QE, please verify it while testing https://bugzilla.redhat.com/show_bug.cgi?id=1270220. Thank you, P.

Comment 10 Jiri Belka 2016-01-15 14:48:56 UTC
ioprocess-0.15.0-5.el7ev.x86_64

It's OK for all ioprocess debug messages, but as I'm unable to simulate slow NFS (I even tried Nir's slowfs, but I couldn't activate slowfs via NFS in the engine because sanlock didn't play well with it), I can't see the logging about a full queue.

2016-01-15 15:41:25+0100 3242 [13244]: s5 lockspace 9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426:2:/rhev/data-center/mnt/10.34.63.223:_slowfs/9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426/dom_md/ids:0
2016-01-15 15:41:46+0100 3263 [973]: s5:r3 resource 9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426:SDM:/rhev/data-center/mnt/10.34.63.223:_slowfs/9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426/dom_md/leases:1048576 for 2,9,8591
2016-01-15 15:41:56+0100 3273 [17652]: 9f2747f8 aio timeout 0 0x7f5f340008c0:0x7f5f340008d0:0x7f5f581da000 ioto 10 to_count 1
2016-01-15 15:41:56+0100 3273 [17652]: s5 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/10.34.63.223:_slowfs/9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426/dom_md/ids
2016-01-15 15:41:56+0100 3273 [17652]: s5 renewal error -202 delta_length 10 last_success 3242
2016-01-15 15:42:07+0100 3284 [17652]: 9f2747f8 aio timeout 0 0x7f5f34000910:0x7f5f34000920:0x7f5f580d8000 ioto 10 to_count 2
2016-01-15 15:42:07+0100 3284 [17652]: s5 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/10.34.63.223:_slowfs/9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426/dom_md/ids
2016-01-15 15:42:07+0100 3284 [17652]: s5 renewal error -202 delta_length 11 last_success 3242
2016-01-15 15:42:18+0100 3295 [17652]: 9f2747f8 aio timeout 0 0x7f5f34000960:0x7f5f34000970:0x7f5f3b6fe000 ioto 10 to_count 3
2016-01-15 15:42:18+0100 3295 [17652]: s5 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/10.34.63.223:_slowfs/9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426/dom_md/ids
2016-01-15 15:42:18+0100 3295 [17652]: s5 renewal error -202 delta_length 11 last_success 3242

The other ioprocess debug messages look OK:

ioprocess communication (13350)::DEBUG::2016-01-15 15:43:09,407::__init__::411::IOProcess::(_processLogs) (1640) Start request for method 'access' (waitTime=28)
ioprocess communication (13350)::DEBUG::2016-01-15 15:43:09,407::__init__::411::IOProcess::(_processLogs) Queuing request (slotsLeft=20)
ioprocess communication (13350)::DEBUG::2016-01-15 15:43:09,407::__init__::411::IOProcess::(_processLogs) (1640) Finished request for method 'access' (runTime=216)
ioprocess communication (13350)::DEBUG::2016-01-15 15:43:09,407::__init__::411::IOProcess::(_processLogs) (1640) Dequeuing request (slotsLeft=21)

Any hints? Or is it OK with respect to item 4 above?

Comment 11 Jiri Belka 2016-01-15 14:52:48 UTC
FYI, the 'id' is missing from the 'Queuing request' log message. It seems odd with respect to comment #0.

Comment 12 Nir Soffer 2016-01-15 15:25:50 UTC
(In reply to Jiri Belka from comment #11)
> FYI, the 'id' is missing from the 'Queuing request' log message. It seems
> odd with respect to comment #0.

I agree; adding the id requires a rather large change in ioprocess (the id is
not available when this message is logged). I plan to improve this in ioprocess 1.0.

Comment 13 Nir Soffer 2016-01-15 15:36:26 UTC
(In reply to Jiri Belka from comment #10)
> ioprocess-0.15.0-5.el7ev.x86_64
> 
> It's OK for all ioprocess debug messages, but as I'm unable to simulate slow
> NFS (I even tried Nir's slowfs, but I couldn't activate slowfs via NFS in the
> engine because sanlock didn't play well with it), I can't see the logging
> about a full queue.

To simulate a full queue, you need to configure ioprocess with a smaller queue
size and submit several tasks that vdsm performs in parallel. However, the
patch attached to this bug did not change the logging in the full-queue case.

The current code is logging this warning when the queue is full:

    g_warning("(%li) Request queue full", reqId);
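
For illustration, an enqueue path of roughly the following shape would produce
that warning; this is only a sketch under assumed names (queue_request,
max_queued), not the actual ioprocess code:

    #include <glib.h>

    /* Sketch of an enqueue path that rejects requests when the queue is
       full instead of blocking. Names and the limit are assumptions. */
    static gboolean queue_request(GQueue *pending, gpointer request,
                                  long req_id, guint max_queued)
    {
        if (g_queue_get_length(pending) >= max_queued) {
            g_warning("(%li) Request queue full", req_id);
            return FALSE;   /* caller should fail the request immediately */
        }

        g_queue_push_tail(pending, request);
        return TRUE;
    }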

Since the patch did not change this behavior, there is nothing to verify, unless
you want to verify that the behavior did not change.

Comment 14 Jiri Belka 2016-01-15 17:01:11 UTC
ok for now, ioprocess-0.15.0-5.el7ev.x86_64

Based on comment #13 it's OK. Improvements for the full-queue log and the queuing request id are on the developers' TODO list.