1401976 – 100% CPU when even a single host is out

Bug 1401976 - 100% CPU when even a single host is out

Summary: 100% CPU when even a single host is out

Keywords:
Status:	CLOSED CURRENTRELEASE
Alias:	None
Product:	vdsm-jsonrpc-java
Classification:	oVirt
Component:	Core
Sub Component:
Version:	1.2.6
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	urgent
Target Milestone:	ovirt-4.0.6
Target Release:	1.2.10
Assignee:	Roy Golan
QA Contact:	Petr Kubica
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:

Reported:	2016-12-06 13:52 UTC by Roy Golan
Modified:	2017-01-18 07:29 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2017-01-18 07:29:22 UTC
oVirt Team:	Infra
Embargoed:
Flags:	rule-engine: ovirt-4.0.z+ rule-engine: blocker+

Attachments
high cpu utilization after blocking a host (1.74 MB, image/png) 2016-12-07 13:10 UTC, Roy Golan

Links
System	ID	Private	Priority	Status	Summary	Last Updated
oVirt gerrit	67881	0	master	MERGED	Timeout handling in finishConnect	2016-12-07 08:29:51 UTC
oVirt gerrit	67975	0	ovirt-4.0	MERGED	Timeout handling in finishConnect	2016-12-08 10:21:35 UTC

Description Roy Golan 2016-12-06 13:52:01 UTC

Description of problem:

When a client tries to connect and cannot, we busy wait until the connection timeout:

while (!this.channel.finishConnect()) {
    final long timeout = getTimeout(policy.getRetryTimeOut(), policy.getTimeUnit());

    final FutureTask<SocketChannel> connectTask = scheduleTask(new Retryable<>(() -> {
        if (System.currentTimeMillis() >= timeout) {
            throw new ConnectException("Connection timeout");
        }
        return null;
    }, this.policy));
    connectTask.get();
}

This occupies the Reactor thread and causes it to continuously schedule new tasks that check the connection, without any backoff strategy. It also creates large numbers of short-lived objects that go to waste.


The timeout handling is also broken: the timeout is recomputed on every iteration of the while loop, so it keeps moving forward and the loop never exits.
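
To illustrate the timeout problem in isolation, here is a minimal sketch of computing the deadline once, before the loop, so it cannot move. This is not the merged patch; the class and method names are hypothetical, and the clock is injected only to make the example deterministic. It still polls, so it addresses only the moving deadline, not the busy wait.

```java
import java.util.function.BooleanSupplier;
import java.util.function.LongSupplier;

// Hypothetical illustration, not code from vdsm-jsonrpc-java.
class FixedDeadlineSketch {

    /**
     * Polls `finished` until it returns true or the deadline passes.
     * Returns true if the operation finished in time, false on timeout.
     */
    static boolean awaitWithFixedDeadline(BooleanSupplier finished,
                                          LongSupplier clockMillis,
                                          long timeoutMillis) {
        // Computed exactly once. Computing this inside the loop would push
        // the deadline forward on every iteration and the check would never fire.
        final long deadline = clockMillis.getAsLong() + timeoutMillis;
        while (!finished.getAsBoolean()) {
            if (clockMillis.getAsLong() >= deadline) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Fake clock that advances 10 ms on every reading.
        long[] now = {0};
        LongSupplier clock = () -> {
            now[0] += 10;
            return now[0];
        };
        // The operation never finishes, so the call must time out.
        boolean connected = awaitWithFixedDeadline(() -> false, clock, 100);
        System.out.println("connected=" + connected);
    }
}
```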


Version-Release number of selected component (if applicable):
1.1.8 and engine-3.6.3 

How reproducible:
100%

Steps to Reproduce:
1. Have 3 hosts and 1 or 2 domains.
2. Start the engine and verify that everything is up.
3. Disconnect the engine from the network. The client code goes into a loop shortly after the heartbeat interval is exceeded.


Expected results:

Keep the timeout stable so that the connection attempt fails after exactly the configured timeout value.
Don't busy wait on the timeout check; use a different strategy.
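
One possible "different strategy", sketched with standard java.nio calls only: register the channel for OP_CONNECT and block in Selector.select(timeout) instead of polling finishConnect(). This is an assumption about what a fix could look like, not the change merged in gerrit; the class and method names are hypothetical, and a real Reactor would reuse its existing selector rather than open a new one.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.SocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.SocketChannel;
import java.util.concurrent.TimeUnit;

// Hypothetical illustration, not code from vdsm-jsonrpc-java.
class SelectorConnectSketch {

    /** Connects without spinning; throws ConnectException once the timeout elapses. */
    static SocketChannel connect(SocketAddress address, long timeoutMillis) throws IOException {
        SocketChannel channel = SocketChannel.open();
        try (Selector selector = Selector.open()) {
            channel.configureBlocking(false);
            if (channel.connect(address)) {
                return channel; // connected immediately
            }
            channel.register(selector, SelectionKey.OP_CONNECT);

            // Deadline is fixed once, outside the loop.
            final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            while (true) {
                long remainingMillis = TimeUnit.NANOSECONDS.toMillis(deadline - System.nanoTime());
                if (remainingMillis <= 0) {
                    throw new ConnectException("Connection timeout");
                }
                // Blocks until the connect completes or the remaining time runs out.
                selector.select(remainingMillis);
                selector.selectedKeys().clear();
                if (channel.finishConnect()) {
                    return channel;
                }
            }
        } catch (IOException e) {
            channel.close(); // do not leak the channel on failure or timeout
            throw e;
        }
    }
}
```

The thread sleeps inside select() while the connect is pending, so no tasks are scheduled and no objects are allocated per check.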

Comment 1 Roy Golan 2016-12-06 13:53:20 UTC

Thanks to froland's keen eye for spotting the timeout issue.

Comment 2 Piotr Kliczewski 2016-12-06 14:04:09 UTC

The busy wait should be slowed down by [1]. What timeout setting do you use?



[1] https://github.com/oVirt/vdsm-jsonrpc-java/blob/master/client/src/main/java/org/ovirt/vdsm/jsonrpc/client/utils/retry/Retryable.java#L38

Comment 3 Roy Golan 2016-12-06 15:36:31 UTC

That is true for tasks that throw an exception. The 'wait for finish task' [1] does not throw an exception.

My timeout setting is all default.


[1] 'wait for finish task'

https://github.com/oVirt/vdsm-jsonrpc-java/blob/master/client/src/main/java/org/ovirt/vdsm/jsonrpc/client/reactors/ReactorClient.java#L119
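
The point can be shown with a generic retry helper. This is a hypothetical sketch and not the actual Retryable class; it only assumes, as discussed in this thread, that the delay is applied when the task throws. A task that returns normally never reaches the sleep, so calling it in a tight loop is not slowed down.

```java
import java.util.concurrent.Callable;
import java.util.function.LongConsumer;

// Hypothetical illustration, not the Retryable class from vdsm-jsonrpc-java.
class BackoffOnFailureSketch {

    /**
     * Calls the task up to maxAttempts times, sleeping between failed attempts.
     * The sleeper is injected so the example can count delays instead of waiting.
     */
    static <T> T callWithRetry(Callable<T> task, int maxAttempts,
                               long delayMillis, LongConsumer sleeper) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call(); // success path: returns at once, no delay
            } catch (Exception e) {
                last = e;
                sleeper.accept(delayMillis); // delay happens only after a failure
            }
        }
        throw last;
    }
}
```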

Comment 4 Piotr Kliczewski 2016-12-06 16:14:19 UTC

Waiting for the connect usually takes less than a second. The issue here is that a new timeout is obtained on each iteration; fixing that would stop the never-ending spinning wait. Without this bug, the current strategy seems legitimate.

Thanks for finding it.

Comment 5 Roy Golan 2016-12-07 13:10:27 UTC

Created attachment 1229054
high cpu utilization after blocking a host

Comment 6 Petr Kubica 2016-12-16 11:42:47 UTC

Verified. vdsm-jsonrpc-java-1.2.10-1.el7ev.noarch (repository 4.0.6-7)
