Bug 1273122 - Missing information about ioprocess state
Summary: Missing information about ioprocess state
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-distribution
Classification: oVirt
Component: ioprocess
Version: ioprocess-0.15.0
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: high
Target Milestone: ovirt-3.6.1
Target Release: ioprocess-0.15.0
Assignee: Yeela Kaplan
QA Contact: Jiri Belka
URL:
Whiteboard: infra
Depends On: 1275062
Blocks: 1270220 1286963
 
Reported: 2015-10-19 16:37 UTC by Nir Soffer
Modified: 2016-01-19 15:36 UTC
CC List: 9 users

Fixed In Version: ioprocess-0.15.0-4
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 1286963 (view as bug list)
Environment:
Last Closed: 2016-01-19 15:36:53 UTC
oVirt Team: Infra
Embargoed:
oourfali: ovirt-3.6.z?
rule-engine: planning_ack?
oourfali: devel_ack+
pstehlik: testing_ack+


Attachments


Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 47646 0 None None None Never
oVirt gerrit 47658 0 None None None Never

Description Nir Soffer 2015-10-19 16:37:20 UTC
Description of problem:

ioprocess log is missing basic information required to debug it when 
storage is slow.

In https://bugzilla.redhat.com/show_bug.cgi?id=1270220#c19, unlink 
syscall took 90 seconds, but ioprocess timed out after 180 seconds.
It seems that the timeout was caused by waiting in the request
queue because all threads were busy processing requests with very
slow nfs storage.

However, we don't have any information about ioprocess queues or the
time a request spent waiting in the queue, so we can only guess.

I think we need these additional logs in debug level:

1. When a request starts running
   Starting request <id> <parameters>

2. When a request is queued
   Queuing request <id> (<queue size>)

3. When a request is finished
   Finished request <id> (status=<ok/fail> wait_time=<wait time>
       run_time=<run time>)

4. When queue is full - error
   Queue full, rejecting request <id>

   (Hopefully we do not block in this case, creating unlimited delay
    in the caller).

These logs should make it easy to tune the ioprocess queue size and number of
threads in the field, to adapt to slow storage or other trouble.

It should also be easy to follow a request in the logs, hence the request
id.
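
A minimal sketch (not the actual ioprocess code) of what these debug messages
could look like, assuming GLib logging as already used for the queue-full
warning; the function and variable names (reqId, queueSize, waitTime, runTime)
and message formats are hypothetical:

    /* sketch.c - illustrative only, not the ioprocess implementation.
     * Build: gcc sketch.c $(pkg-config --cflags --libs glib-2.0)
     * Run with G_MESSAGES_DEBUG=all so the g_debug() output is printed. */
    #include <glib.h>

    static void log_request_lifecycle(long reqId, const char *method,
                                      int queueSize, long waitTime,
                                      long runTime, gboolean ok)
    {
        /* 2. Request is queued, with the current queue size for tuning */
        g_debug("(%li) Queuing request (queue size: %d)", reqId, queueSize);

        /* 1. Request starts running */
        g_debug("(%li) Starting request for method '%s'", reqId, method);

        /* 3. Request finished, with wait and run times */
        g_debug("(%li) Finished request (status=%s wait_time=%ld run_time=%ld)",
                reqId, ok ? "ok" : "fail", waitTime, runTime);
    }

    static void reject_request(long reqId)
    {
        /* 4. Queue is full - reject instead of blocking the caller */
        g_warning("(%li) Queue full, rejecting request", reqId);
    }

    int main(void)
    {
        log_request_lifecycle(1640, "access", 20, 28, 216, TRUE);
        reject_request(1641);
        return 0;
    }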

--
Note: ignore the sub component, this was the only option.

Comment 3 Nir Soffer 2015-10-21 12:41:48 UTC
The attachments are not relevant, ignore them.

Comment 4 Nir Soffer 2015-10-26 09:09:57 UTC
Current ioprocess logs can contain junk, which makes them harder to use;
tracked in bug 1275062.

Comment 5 Yaniv Lavi 2015-10-29 12:46:08 UTC
In oVirt, testing is done on a single release by default. Therefore I'm removing the 4.0 flag. If you think this bug must be tested in 4.0 as well, please re-add the flag. Please note we might not have the testing resources to handle the 4.0 clone.

Comment 7 Red Hat Bugzilla Rules Engine 2015-12-06 11:59:53 UTC
Bug tickets that are moved to testing must have a target release set to make sure the tester knows what to test. Please set the correct target release before moving to ON_QA.

Comment 9 Pavel Stehlik 2015-12-23 11:57:59 UTC
QE, please verify it while testing https://bugzilla.redhat.com/show_bug.cgi?id=1270220, thank you, P.

Comment 10 Jiri Belka 2016-01-15 14:48:56 UTC
ioprocess-0.15.0-5.el7ev.x86_64

It's OK for all ioprocess debug messages, but as I'm unable to simulate slow NFS (I even tried Nir's slowfs, but I couldn't activate slowfs via NFS in the engine because sanlock didn't play well with it), I can't see the logging about a full queue.

2016-01-15 15:41:25+0100 3242 [13244]: s5 lockspace 9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426:2:/rhev/data-center/mnt/10.34.63.223:_slowfs/9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426/dom_md/ids:0
2016-01-15 15:41:46+0100 3263 [973]: s5:r3 resource 9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426:SDM:/rhev/data-center/mnt/10.34.63.223:_slowfs/9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426/dom_md/leases:1048576 for 2,9,8591
2016-01-15 15:41:56+0100 3273 [17652]: 9f2747f8 aio timeout 0 0x7f5f340008c0:0x7f5f340008d0:0x7f5f581da000 ioto 10 to_count 1
2016-01-15 15:41:56+0100 3273 [17652]: s5 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/10.34.63.223:_slowfs/9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426/dom_md/ids
2016-01-15 15:41:56+0100 3273 [17652]: s5 renewal error -202 delta_length 10 last_success 3242
2016-01-15 15:42:07+0100 3284 [17652]: 9f2747f8 aio timeout 0 0x7f5f34000910:0x7f5f34000920:0x7f5f580d8000 ioto 10 to_count 2
2016-01-15 15:42:07+0100 3284 [17652]: s5 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/10.34.63.223:_slowfs/9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426/dom_md/ids
2016-01-15 15:42:07+0100 3284 [17652]: s5 renewal error -202 delta_length 11 last_success 3242
2016-01-15 15:42:18+0100 3295 [17652]: 9f2747f8 aio timeout 0 0x7f5f34000960:0x7f5f34000970:0x7f5f3b6fe000 ioto 10 to_count 3
2016-01-15 15:42:18+0100 3295 [17652]: s5 delta_renew read rv -202 offset 0 /rhev/data-center/mnt/10.34.63.223:_slowfs/9f2747f8-3a88-4ba4-a4b3-cb3ac6f21426/dom_md/ids
2016-01-15 15:42:18+0100 3295 [17652]: s5 renewal error -202 delta_length 11 last_success 3242

But it's OK for the other ioprocess debug messages:

ioprocess communication (13350)::DEBUG::2016-01-15 15:43:09,407::__init__::411::IOProcess::(_processLogs) (1640) Start request for method 'access' (waitTime=28)
ioprocess communication (13350)::DEBUG::2016-01-15 15:43:09,407::__init__::411::IOProcess::(_processLogs) Queuing request (slotsLeft=20)
ioprocess communication (13350)::DEBUG::2016-01-15 15:43:09,407::__init__::411::IOProcess::(_processLogs) (1640) Finished request for method 'access' (runTime=216)
ioprocess communication (13350)::DEBUG::2016-01-15 15:43:09,407::__init__::411::IOProcess::(_processLogs) (1640) Dequeuing request (slotsLeft=21)

Any hints? Or is it OK regarding item 4 above?

Comment 11 Jiri Belka 2016-01-15 14:52:48 UTC
FYI, the 'id' is missing in the 'Queuing request' log message. It seems odd with respect to comment #0.

Comment 12 Nir Soffer 2016-01-15 15:25:50 UTC
(In reply to Jiri Belka from comment #11)
> FYI, the 'id' is missing in the 'Queuing request' log message. It seems odd
> with respect to comment #0.

I agree; adding the id requires a rather large change in ioprocess (the id is not
available when this is logged). I plan to improve this in ioprocess 1.0.

Comment 13 Nir Soffer 2016-01-15 15:36:26 UTC
(In reply to Jiri Belka from comment #10)
> ioprocess-0.15.0-5.el7ev.x86_64
> 
> It's OK for all ioprocess debug messages, but as I'm unable to simulate slow
> NFS (I even tried Nir's slowfs, but I couldn't activate slowfs via NFS in the
> engine because sanlock didn't play well with it), I can't see the logging
> about a full queue.

To simulate a full queue, you need to configure ioprocess with a smaller queue
size and submit several tasks that vdsm performs in parallel. However, the
patch attached to this bug did not change the logs in the case of a full queue.

The current code is logging this warning when the queue is full:

    g_warning("(%li) Request queue full", reqId);

Since the patch did not change this behavior, there is nothing to verify, unless
you want to verify that the behavior did not change.
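
For illustration only, a minimal sketch (not the actual ioprocess code) of the
kind of bounded-queue check behind this warning, using GLib's GAsyncQueue; the
struct, field and function names are hypothetical. Shrinking the configured
limit and pushing more parallel requests than it allows is what makes the
warning fire:

    /* queue_full_sketch.c - illustrative only, not the ioprocess code.
     * Build: gcc queue_full_sketch.c $(pkg-config --cflags --libs glib-2.0) */
    #include <glib.h>

    typedef struct {
        GAsyncQueue *requests;      /* pending requests */
        gint maxQueuedRequests;     /* configured queue size limit */
    } RequestQueue;

    /* Returns TRUE if the request was queued, FALSE if it was rejected. */
    static gboolean queue_request(RequestQueue *q, gpointer request, long reqId)
    {
        if (g_async_queue_length(q->requests) >= q->maxQueuedRequests) {
            /* Existing behavior: warn and reject, do not block the caller. */
            g_warning("(%li) Request queue full", reqId);
            return FALSE;
        }
        g_async_queue_push(q->requests, request);
        return TRUE;
    }

    int main(void)
    {
        /* A queue limited to one slot: the second request is rejected. */
        RequestQueue q = { g_async_queue_new(), 1 };
        int dummy = 0;

        queue_request(&q, &dummy, 1);   /* queued */
        queue_request(&q, &dummy, 2);   /* rejected, logs "Request queue full" */

        g_async_queue_unref(q.requests);
        return 0;
    }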

Comment 14 Jiri Belka 2016-01-15 17:01:11 UTC
OK for now, ioprocess-0.15.0-5.el7ev.x86_64.

Based on comment #13 it's OK. Improvements to the full-queue logging and the queuing request id are on the developers' TODO list.

