Bug 1401976

Summary: 100% CPU when even a single host is out
Product: [oVirt] vdsm-jsonrpc-java
Reporter: Roy Golan <rgolan>
Component: Core
Assignee: Roy Golan <rgolan>
Status: CLOSED CURRENTRELEASE
QA Contact: Petr Kubica <pkubica>
Severity: urgent
Docs Contact:
Priority: unspecified
Version: 1.2.6
CC: bugs, mperina, oourfali, pkliczew, rgolan
Target Milestone: ovirt-4.0.6
Keywords: Performance, ZStream
Target Release: 1.2.10
Flags: rule-engine: ovirt-4.0.z+, rule-engine: blocker+
Hardware: Unspecified
OS: Unspecified
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
When the engine connects to a host, it checks whether the connection has completed, subject to a timeout. This check ran in a loop, occupying the CPU. The fix makes sure the timeout actually expires, so the CPU is no longer occupied without limit.
Story Points: ---
Clone Of:
Environment:
Last Closed: 2017-01-18 07:29:22 UTC
Type: Bug
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
Description Flags
high cpu utilization after blocking a host none

Description Roy Golan 2016-12-06 13:52:01 UTC
Description of problem:

When a client tries to connect and can't, we busy-wait until the connection timeout:

while (!this.channel.finishConnect()) {
    final long timeout = getTimeout(policy.getRetryTimeOut(), policy.getTimeUnit());

    final FutureTask<SocketChannel> connectTask = scheduleTask(new Retryable<>(() -> {
        if (System.currentTimeMillis() >= timeout) {
            throw new ConnectException("Connection timeout");
        }
        return null;
    }, this.policy));
    connectTask.get();
}

This occupies the Reactor thread and causes it to continuously insert new tasks to check the connection, without any backoff strategy. It also creates tons of objects that go to waste.


The timeout handling is also broken: the timeout is recomputed, and therefore incremented, on every pass through the while loop, so the loop never exits.
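
To illustrate the timeout problem, here is a minimal sketch of computing the deadline once, before the loop, so that it cannot slide forward. This is not the actual vdsm-jsonrpc-java fix; the class `FixedDeadline`, the method `await`, and the injectable clock parameter are all hypothetical names introduced only for this example.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

// Hypothetical helper for illustration only, not part of vdsm-jsonrpc-java.
class FixedDeadline {
    /**
     * Polls `finished` until it returns true or the deadline passes.
     * Returns true when finished, false on timeout.
     * `nowMillis` is an injectable clock (an assumption made for testability).
     */
    static boolean await(BooleanSupplier finished, long timeout, TimeUnit unit,
                         LongSupplier nowMillis) {
        // Computed once, outside the loop, so the deadline stays fixed.
        final long deadline = nowMillis.getAsLong() + unit.toMillis(timeout);
        while (!finished.getAsBoolean()) {
            if (nowMillis.getAsLong() >= deadline) {
                // A caller could raise ConnectException("Connection timeout") here.
                return false;
            }
        }
        return true;
    }
}
```

This sketch only addresses the sliding deadline. It still polls in a tight loop, so it does not by itself remove the busy wait.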


Version-Release number of selected component (if applicable):
1.1.8 and engine-3.6.3 

How reproducible:
100%

Steps to Reproduce:
1. Have 3 hosts and 1 or 2 domains.
2. Start the engine and verify that everything is up.
3. Disconnect the engine from the network. The client code enters the loop shortly after the heartbeat interval is exceeded.


Expected results:

Stabilize the timeout parameter and fail after the exact timeout value.
Don't busy-wait on the timeout check; use a different strategy.
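
One possible "different strategy" is to pause between checks with a capped exponential backoff. The sketch below is only an assumption about what such a strategy could look like, not the change that was merged; `BackoffWait`, `await`, and the 1 ms / 50 ms pause bounds are hypothetical values chosen for the example.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration only, not part of vdsm-jsonrpc-java.
class BackoffWait {
    /**
     * Checks `finished` with growing pauses between checks.
     * Returns true when finished, false once `timeoutMillis` has elapsed.
     */
    static boolean await(BooleanSupplier finished, long timeoutMillis) {
        // Fixed deadline, computed once.
        final long deadlineNanos =
                System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        long pauseNanos = TimeUnit.MILLISECONDS.toNanos(1);           // assumed initial pause
        final long maxPauseNanos = TimeUnit.MILLISECONDS.toNanos(50); // assumed cap
        while (!finished.getAsBoolean()) {
            final long remaining = deadlineNanos - System.nanoTime();
            if (remaining <= 0) {
                return false; // timed out after the requested interval
            }
            // Yield the CPU instead of spinning; never sleep past the deadline.
            LockSupport.parkNanos(Math.min(pauseNanos, remaining));
            pauseNanos = Math.min(pauseNanos * 2, maxPauseNanos);
        }
        return true;
    }
}
```

With these assumed bounds, a 100 ms timeout results in only a handful of checks rather than a continuous stream of scheduled tasks. Sleeping on the Reactor thread still blocks it, so a selector-based wait may be preferable where the thread must stay responsive.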

Comment 1 Roy Golan 2016-12-06 13:53:20 UTC
Thanks to froland's keen eye for finding the timeout issue.

Comment 2 Piotr Kliczewski 2016-12-06 14:04:09 UTC
The busy wait should be slowed down by [1]. What timeout setting do you use?



[1] https://github.com/oVirt/vdsm-jsonrpc-java/blob/master/client/src/main/java/org/ovirt/vdsm/jsonrpc/client/utils/retry/Retryable.java#L38

Comment 3 Roy Golan 2016-12-06 15:36:31 UTC
That is true for tasks that throw an exception. The 'wait for finish task' [1] does not throw an exception.

My timeout settings are all defaults.


[1] 'wait for finish task'

https://github.com/oVirt/vdsm-jsonrpc-java/blob/master/client/src/main/java/org/ovirt/vdsm/jsonrpc/client/reactors/ReactorClient.java#L119

Comment 4 Piotr Kliczewski 2016-12-06 16:14:19 UTC
Waiting for a connect usually takes less than a second. The issue here is that a new timeout is fetched on every iteration; fixing that would stop the never-ending spin wait. Without this bug, the current strategy seems legitimate.

Thanks for finding it.

Comment 5 Roy Golan 2016-12-07 13:10:27 UTC
Created attachment 1229054
high cpu utilization after blocking a host

Comment 6 Petr Kubica 2016-12-16 11:42:47 UTC
Verified. vdsm-jsonrpc-java-1.2.10-1.el7ev.noarch (repository 4.0.6-7)