Bug 1401976
| Summary: | 100% CPU when even a single host is out | ||||||
|---|---|---|---|---|---|---|---|
| Product: | [oVirt] vdsm-jsonrpc-java | Reporter: | Roy Golan <rgolan> | ||||
| Component: | Core | Assignee: | Roy Golan <rgolan> | ||||
| Status: | CLOSED CURRENTRELEASE | QA Contact: | Petr Kubica <pkubica> | ||||
| Severity: | urgent | Docs Contact: | |||||
| Priority: | unspecified | ||||||
| Version: | 1.2.6 | CC: | bugs, mperina, oourfali, pkliczew, rgolan | ||||
| Target Milestone: | ovirt-4.0.6 | Keywords: | Performance, ZStream | ||||
| Target Release: | 1.2.10 | Flags: | rule-engine: ovirt-4.0.z+, rule-engine: blocker+ | ||||
| Hardware: | Unspecified | ||||||
| OS: | Unspecified | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | When the engine connects to a host, it checks whether the connection has completed, subject to a timeout. This check ran in a loop, occupying the CPU. The fix ensures that the timeout actually expires, so the check no longer occupies the CPU without limit. | Story Points: | --- | ||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2017-01-18 07:29:22 UTC | Type: | Bug | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | Infra | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
| Attachments: |
|
||||||
Thanks to froland's keen eye for finding the timeout issue.

The busy wait should be slowed down by [1]. What timeout setting do you use?
[1] https://github.com/oVirt/vdsm-jsonrpc-java/blob/master/client/src/main/java/org/ovirt/vdsm/jsonrpc/client/utils/retry/Retryable.java#L38

That is true for tasks that throw an exception. The 'wait for finish task' [1] does not throw an exception. My timeout settings are all default.
[1] 'wait for finish task' https://github.com/oVirt/vdsm-jsonrpc-java/blob/master/client/src/main/java/org/ovirt/vdsm/jsonrpc/client/reactors/ReactorClient.java#L119

Usually the wait for connect takes less than a second. The issue here is that a new timeout is obtained on each iteration; fixing that would stop the never-ending spinning wait. Without this bug, the current strategy seems legitimate. Thanks for finding it.

Created attachment 1229054: high cpu utilization after blocking a host

Verified. vdsm-jsonrpc-java-1.2.10-1.el7ev.noarch (repository 4.0.6-7)
Description of problem:
When a client is trying to connect and can't, we busy-wait until the connection timeout:

    while (!this.channel.finishConnect()) {
        final long timeout = getTimeout(policy.getRetryTimeOut(), policy.getTimeUnit());
        final FutureTask<SocketChannel> connectTask = scheduleTask(new Retryable<>(() -> {
            if (System.currentTimeMillis() >= timeout) {
                throw new ConnectException("Connection timeout");
            }
            return null;
        }, this.policy));
        connectTask.get();
    }

This occupies the Reactor thread and causes it to continuously insert new tasks to check the connection, without any backoff strategy. It also creates tons of objects that go to waste.
The timeout handling is broken as well: the timeout is recomputed, and therefore pushed forward, on every pass through the while loop, so the loop never exits.

Version-Release number of selected component (if applicable):
1.1.8 and engine-3.6.3

How reproducible:
100%

Steps to Reproduce:
1. Have 3 hosts and 1 or 2 domains.
2. Start the engine and see that all is up.
3. Disconnect the engine from the network. The client code will go into a loop shortly after exceeding the heartbeat.

Expected results:
Keep the timeout parameter stable and fail after the exact timeout value.
Don't busy-wait on the timeout check; use a different strategy.
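The two expected results above can be illustrated with a minimal sketch. This is not the patch that was merged into vdsm-jsonrpc-java; it only shows the general idea under stated assumptions. The class `ConnectWait`, the interface `ConnectCheck` (a stand-in for `SocketChannel.finishConnect()`), the method `awaitConnect`, and the `pollMillis` parameter are all hypothetical names introduced for illustration. The deadline is computed once, before the loop, and the thread sleeps between checks instead of spinning.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.util.concurrent.TimeUnit;

public class ConnectWait {

    /** Hypothetical stand-in for SocketChannel.finishConnect(). */
    @FunctionalInterface
    public interface ConnectCheck {
        boolean finished() throws IOException;
    }

    /**
     * Polls the check until it reports completion or the deadline passes.
     * Returns the number of unsuccessful polls before the connect finished.
     */
    public static int awaitConnect(ConnectCheck check, long timeout, TimeUnit unit,
                                   long pollMillis)
            throws IOException, InterruptedException {
        // The deadline is fixed once, outside the loop, so it cannot move forward.
        final long deadline = System.nanoTime() + unit.toNanos(timeout);
        int failedPolls = 0;
        while (!check.finished()) {
            failedPolls++;
            // Overflow-safe comparison of System.nanoTime() values.
            if (System.nanoTime() - deadline >= 0) {
                throw new ConnectException("Connection timeout");
            }
            // Simple fixed backoff: yield the CPU instead of spinning.
            Thread.sleep(pollMillis);
        }
        return failedPolls;
    }
}
```

With this shape, a connect that never completes fails after roughly the configured timeout (give or take one poll interval) rather than looping forever. Sleeping still blocks the calling thread, so in a reactor a selector-based wait on `SelectionKey.OP_CONNECT` may be a better fit; the sketch only addresses the moving deadline and the missing backoff.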