Bug 1401976 - 100% cpu when even a single host is out
Summary: 100% cpu when even a single host is out
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: vdsm-jsonrpc-java
Classification: oVirt
Component: Core
Version: 1.2.6
Hardware: Unspecified
OS: Unspecified
Priority: unspecified
Severity: urgent
Target Milestone: ovirt-4.0.6
Target Release: 1.2.10
Assignee: Roy Golan
QA Contact: Petr Kubica
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2016-12-06 13:52 UTC by Roy Golan
Modified: 2017-01-18 07:29 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When the engine connects to a host, it checks, with a timeout, whether the connection has completed. This check ran in a loop whose timeout was never reached, occupying the CPU. The fix makes sure the timeout is actually reached, so the check no longer occupies the CPU without limit.
Clone Of:
Environment:
Last Closed: 2017-01-18 07:29:22 UTC
oVirt Team: Infra
Embargoed:
rule-engine: ovirt-4.0.z+
rule-engine: blocker+


Attachments
high cpu utilization after blocking a host (1.74 MB, image/png)
2016-12-07 13:10 UTC, Roy Golan


Links
System        ID     Private  Priority   Status  Summary                            Last Updated
oVirt gerrit  67881  0        master     MERGED  Timeout handling in finishConnect  2016-12-07 08:29:51 UTC
oVirt gerrit  67975  0        ovirt-4.0  MERGED  Timeout handling in finishConnect  2016-12-08 10:21:35 UTC

Description Roy Golan 2016-12-06 13:52:01 UTC
Description of problem:

When a client tries to connect and cannot, we busy-wait until the connection timeout:

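while (!this.channel.finishConnect()) {
    // Problem: the timeout is obtained again on every iteration, so it keeps moving forward.
    final long timeout = getTimeout(policy.getRetryTimeOut(), policy.getTimeUnit());

    // A new task is scheduled on each pass; it returns null right away
    // unless the timeout has already passed.
    final FutureTask<SocketChannel> connectTask = scheduleTask(new Retryable<>(() -> {
        if (System.currentTimeMillis() >= timeout) {
            throw new ConnectException("Connection timeout");
        }
        return null;
    }, this.policy));
    connectTask.get();
}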

This occupies the Reactor thread and causes it to continuously insert new tasks to check the connection, without any backoff strategy. It also creates tons of objects that go to waste.


The timeout handling is also broken: the timeout is always incremented inside the while loop, so the loop never exits.
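The moving-timeout problem described above can be illustrated with a minimal sketch. This is not the merged patch; it only shows the general idea of computing the deadline once, before the loop, and pausing between checks. The class name FixedDeadlineWait, the method awaitConnect, the pollMillis parameter, and the BooleanSupplier standing in for channel.finishConnect() are all assumptions made for illustration.

```java
import java.net.ConnectException;
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of vdsm-jsonrpc-java.
final class FixedDeadlineWait {

    /**
     * Polls finishConnect until it returns true or the deadline passes.
     * The deadline is computed once, before the loop, so it cannot move.
     */
    static void awaitConnect(BooleanSupplier finishConnect, long timeout, TimeUnit unit,
                             long pollMillis) throws ConnectException, InterruptedException {
        final long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (!finishConnect.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new ConnectException("Connection timeout");
            }
            Thread.sleep(pollMillis); // give up the CPU between checks
        }
    }

    public static void main(String[] args) throws Exception {
        try {
            // Stand-in for a connection that never completes.
            awaitConnect(() -> false, 200, TimeUnit.MILLISECONDS, 10);
        } catch (ConnectException e) {
            System.out.println("gave up: " + e.getMessage());
        }
    }
}
```

With the deadline fixed, the loop is bounded by the timeout regardless of how often the check runs; the sleep only limits how much CPU is spent while waiting.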


Version-Release number of selected component (if applicable):
1.1.8 and engine-3.6.3 

How reproducible:
100%

Steps to Reproduce:
1. Have 3 hosts and 1 or 2 domains
2. Start the engine and see that all is up
3. Disconnect the engine from the network. The client code goes into a loop shortly after the heartbeat is exceeded


Expected results:

Keep the timeout parameter stable and fail after the exact timeout value.
Don't busy-wait on the timeout check; use a different strategy.
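One possible "different strategy", sketched here under assumptions, is to let a java.nio Selector block until the socket becomes connectable or the remaining time runs out. This is not how the project's Reactor is structured and not the merged fix; the class SelectorConnect and its connect method are hypothetical names, and a dedicated Selector is opened per call only to keep the example self-contained.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Hypothetical helper, not part of vdsm-jsonrpc-java.
final class SelectorConnect {

    /** Connects a non-blocking channel, sleeping in select() instead of polling. */
    static SocketChannel connect(SocketAddress address, long timeoutMillis) throws IOException {
        SocketChannel channel = SocketChannel.open();
        try (Selector selector = Selector.open()) {
            channel.configureBlocking(false);
            if (channel.connect(address)) {
                return channel; // connected immediately
            }
            channel.register(selector, SelectionKey.OP_CONNECT);
            // Computed once, so the deadline cannot move.
            final long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
            while (true) {
                long remainingMillis = (deadline - System.nanoTime()) / 1_000_000L;
                if (remainingMillis <= 0) {
                    throw new ConnectException("Connection timeout");
                }
                // Blocks until the socket is connectable or the time is up.
                selector.select(remainingMillis);
                selector.selectedKeys().clear();
                if (channel.finishConnect()) {
                    return channel;
                }
            }
        } catch (IOException e) {
            channel.close();
            throw e;
        }
    }

    public static void main(String[] args) throws IOException {
        // Local listener on an ephemeral port, used only to demonstrate the call.
        try (ServerSocketChannel server = ServerSocketChannel.open()) {
            server.bind(new InetSocketAddress("127.0.0.1", 0));
            try (SocketChannel client = connect(server.getLocalAddress(), 1000)) {
                System.out.println("connected: " + client.isConnected());
            }
        }
    }
}
```

The thread spends the wait blocked inside select() rather than scheduling new tasks, and the timeout is enforced by the remaining-time calculation.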

Comment 1 Roy Golan 2016-12-06 13:53:20 UTC
Thanks to froland for the keen eye that found the timeout issue.

Comment 2 Piotr Kliczewski 2016-12-06 14:04:09 UTC
The busy wait should be slowed down by [1]. What timeout setting do you use?



[1] https://github.com/oVirt/vdsm-jsonrpc-java/blob/master/client/src/main/java/org/ovirt/vdsm/jsonrpc/client/utils/retry/Retryable.java#L38

Comment 3 Roy Golan 2016-12-06 15:36:31 UTC
That is true for tasks that throw an exception. The 'wait for finish task' [1] does not throw an exception.

My timeout settings are all default.


[1] 'wait for finish task'

https://github.com/oVirt/vdsm-jsonrpc-java/blob/master/client/src/main/java/org/ovirt/vdsm/jsonrpc/client/reactors/ReactorClient.java#L119
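The point made in this comment can be illustrated with a generic retry helper. This is a hypothetical sketch and not the project's Retryable class; RetrySketch, callWithRetry, maxAttempts, and backoffMillis are names invented here. It only shows why a wrapper that pauses on failure gives no slowdown to a task that returns normally.

```java
import java.util.concurrent.Callable;

// Hypothetical helper, not the project's Retryable.
final class RetrySketch {

    /** Retries a task, pausing only after an attempt that threw. */
    static <T> T callWithRetry(Callable<T> task, int maxAttempts, long backoffMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call(); // a normal return leaves at once, with no pause
            } catch (Exception e) {
                last = e;
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis); // the pause exists only on this path
                }
            }
        }
        throw last;
    }

    public static void main(String[] args) throws Exception {
        long start = System.nanoTime();
        // Returns null without throwing, so the 10-second backoff is never used.
        callWithRetry(() -> null, 3, 10_000);
        System.out.println("elapsed ms: " + (System.nanoTime() - start) / 1_000_000);
    }
}
```

A check that returns null when the timeout has not passed takes the first path every time, so an outer loop calling it repeatedly spins without any delay.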

Comment 4 Piotr Kliczewski 2016-12-06 16:14:19 UTC
Usually waiting for the connection takes less than a second. The issue here is that a new timeout is obtained each time; fixing that would stop the never-ending spinning wait. Without this bug, the current strategy seems legitimate.

Thanks for finding it.

Comment 5 Roy Golan 2016-12-07 13:10:27 UTC
Created attachment 1229054
high cpu utilization after blocking a host

Comment 6 Petr Kubica 2016-12-16 11:42:47 UTC
Verified. vdsm-jsonrpc-java-1.2.10-1.el7ev.noarch (repository 4.0.6-7)

