Bug 1457471 - Flood of getHardwareInfo (every 2 seconds or so) by ovirt-ha-agent
Summary: Flood of getHardwareInfo (every 2 seconds or so) by ovirt-ha-agent
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: ovirt-hosted-engine-ha
Classification: oVirt
Component: Agent
Version: 2.1.0.6
Hardware: Unspecified
OS: Unspecified
Priority: high
Severity: high
Target Milestone: ovirt-4.2.0
Target Release: 2.2.1
Assignee: Simone Tiraboschi
QA Contact: Nikolai Sednev
URL:
Whiteboard:
Depends On: 1417708
Blocks:
 
Reported: 2017-05-31 18:58 UTC by Yaniv Kaul
Modified: 2018-01-23 04:52 UTC
CC List: 7 users

Fixed In Version:
Doc Type: Enhancement
Doc Text:
Previously, ovirt-ha-agent was renewing its JSON RPC connection to VDSM on each message using getHardwareInfo to verify it. Now that the JSON RPC client internally supports a reconnect mechanism, there is no need to always create a fresh connection on the ovirt-ha-agent side.
Clone Of:
Environment:
Last Closed: 2017-12-20 11:42:21 UTC
oVirt Team: Integration
Embargoed:
rule-engine: ovirt-4.2+
rule-engine: exception+




Links
System ID Private Priority Status Summary Last Updated
oVirt gerrit 77679 0 master MERGED jsonrpc: validate the connection with ping command 2017-06-09 13:57:30 UTC
oVirt gerrit 78985 0 master ABANDONED jsonrpc: avoid validating connection 2017-11-29 09:37:42 UTC
oVirt gerrit 78993 0 v2.1.z ABANDONED jsonrpc: avoid validating connection 2017-10-16 10:56:38 UTC

Description Yaniv Kaul 2017-05-31 18:58:40 UTC
Description of problem:
For some unknown reason, there's a flood of getHardwareInfo calls sent every 2 seconds.
This is a hosted-engine host; I'm not sure if that's related.

Comment 2 Yaniv Bronhaim 2017-06-01 11:05:39 UTC
It can't be an internal VDSM issue - someone is calling that verb from somewhere.
I see in vdsm.log that this version doesn't include the new log format that specifies the sender address. In engine.log I don't see any bursts of these requests, so it's either MOM or some other local service that calls it.
I'm not sure whether this is the actual "issue" related to the ticket you attached.

Comment 3 Yaniv Bronhaim 2017-06-01 11:09:53 UTC
I see in the case discussion that it's indeed not related. But regarding the getHardwareInfo calls, I want to ask the MOM and HE maintainers to check their logs here too and see whether they are the reason for this burst of requests.

Comment 4 Martin Perina 2017-06-01 11:34:53 UTC
Martin/Simone could you please take a look?

Comment 5 Martin Sivák 2017-06-01 11:39:41 UTC
I do not see any trace of that verb in either of the two mentioned projects.

[msivak@dhcp131-31 ~]$ cd Work/mom/
[msivak@dhcp131-31 mom]$ git grep -i gethardware
[msivak@dhcp131-31 mom]$ cd ../ovirt-hosted-engine-ha/
[msivak@dhcp131-31 ovirt-hosted-engine-ha]$ git grep -i gethardware
[msivak@dhcp131-31 ovirt-hosted-engine-ha]$

Comment 6 Yaniv Kaul 2017-06-01 13:19:25 UTC
Note that network_caps is also called far more often than it should be - every 5 or 10 seconds.
It's enough for there to be some leak (an FD leak, for example) in either path, and we have an issue.
Example:

MainProcess|jsonrpc/3::DEBUG::2017-05-30 08:57:55,785::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/3::DEBUG::2017-05-30 08:58:05,164::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/3::DEBUG::2017-05-30 08:58:10,193::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/4::DEBUG::2017-05-30 08:58:15,554::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/5::DEBUG::2017-05-30 08:58:25,937::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/4::DEBUG::2017-05-30 08:58:35,306::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/7::DEBUG::2017-05-30 08:58:45,699::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/1::DEBUG::2017-05-30 08:58:56,068::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/2::DEBUG::2017-05-30 08:59:05,452::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/0::DEBUG::2017-05-30 08:59:15,831::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/2::DEBUG::2017-05-30 08:59:25,201::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}

Comment 7 Simone Tiraboschi 2017-06-01 15:31:26 UTC
On master we merged https://gerrit.ovirt.org/#/c/76212/ to reuse an existing jsonrpc connection instead of always opening a new one. Since, as per https://bugzilla.redhat.com/show_bug.cgi?id=1376843 , the jsonrpc client doesn't provide any keep-alive or reconnect mechanism, we are directly calling getVdsHardwareInfo to validate the connection at the application level.

But this shouldn't affect 4.1 as reported, and 2 seconds is far too frequent even for the master code.
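The connection-reuse pattern Simone describes (cache one jsonrpc client instead of opening a fresh connection per message) can be sketched roughly as follows. This is a minimal illustration, not the actual ovirt-hosted-engine-ha code; all names (`get_client`, `drop_client`, `_create_client`) are hypothetical:

```python
import threading

_client = None
_lock = threading.Lock()


def _create_client():
    # Hypothetical stand-in for the real connection factory
    # (in the real agent this would open a jsonrpc socket to VDSM).
    return object()


def get_client():
    """Return a cached client, creating one only on first use.

    This replaces the original behaviour of opening a brand-new
    connection for every single message sent to VDSM.
    """
    global _client
    with _lock:
        if _client is None:
            _client = _create_client()
        return _client


def drop_client():
    """Forget the cached client so the next get_client() reconnects."""
    global _client
    with _lock:
        _client = None
```

The lock makes the cache safe to share across the agent's monitoring threads, which is relevant here since each monitor was independently triggering a fresh connection.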

Comment 8 Martin Sivák 2017-06-01 15:43:44 UTC
Oh, getVdsHardwareInfo... well, let me check again then. MOM does not use that one either.

But hosted engine can call that quite often.

Simone: we call util.connect_vdsm_json_rpc from many places, including almost all monitors. And the first call from connect is __vdsm_json_rpc_check, which always tries to call getHwInfo at least once.

I guess we are missing some kind of Ping() command to check whether a connection is alive.
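The check Martin describes - validate the connection with a cheap ping verb instead of getHardwareInfo - is roughly what the merged gerrit change 77679 ("jsonrpc: validate the connection with ping command") did. A hedged sketch of that pattern, with a hypothetical client interface (any object exposing `ping()`):

```python
def check_connection(client):
    """Return True if the jsonrpc connection answers a cheap ping.

    Hypothetical sketch: the agent originally issued Host.getHardwareInfo
    just to probe the connection; a no-op ping verb is a much cheaper probe.
    """
    try:
        client.ping()  # cheap no-op verb instead of getHardwareInfo
        return True
    except Exception:
        return False


def connect_checked(connect, client=None):
    """Reuse `client` if it still answers pings, otherwise reconnect."""
    if client is not None and check_connection(client):
        return client
    return connect()
```

With each monitor calling something like `connect_checked` instead of unconditionally reconnecting, the per-call probe stops generating a getHardwareInfo flood.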

Comment 9 Yaniv Kaul 2017-06-01 16:17:28 UTC
(In reply to Simone Tiraboschi from comment #7)
> On master we merged https://gerrit.ovirt.org/#/c/76212/ to reuse an existing
> jsonrpc connection instead of always opening a new one; since, as for
> https://bugzilla.redhat.com/show_bug.cgi?id=1376843 , the jsonrpc  doesn't
> provide any keep-alive or reconnect mechanism we are directly calling
> getVdsHardwareInfo to validate the connection at application level.

That's the cheapest command to validate the connection? 
I hope the result is at least cached on the vdsm side. 

> 
> But this shouldn't affect 4.1 as reported and 2 seconds is by far too
> frequent also on master code.

Comment 10 Simone Tiraboschi 2017-06-01 16:41:16 UTC
OK, reproduced on 4.1.
When I stop ovirt-ha-agent, the getHardwareInfo calls stop as well.

Also on 4.1 we are using Host.getHardwareInfo to validate the fresh jsonrpc connection.

As soon as I restart ovirt-ha-agent, I see a burst of Host.getHardwareInfo calls (one for each monitoring thread) every 10 seconds.

2017-06-01 18:30:31,765+0200 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:31,776+0200 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:31,919+0200 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:32,016+0200 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:32,137+0200 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:32,182+0200 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:32,182+0200 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:32,183+0200 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:32,183+0200 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.01 seconds (__init__:585)
2017-06-01 18:30:42,216+0200 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:42,257+0200 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:42,306+0200 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:42,327+0200 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:52,336+0200 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:52,430+0200 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:52,430+0200 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:53,294+0200 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.01 seconds (__init__:585)
2017-06-01 18:30:53,297+0200 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:53,428+0200 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)


But after just a few minutes the monitoring threads probably go out of sync and
the getHardwareInfo calls start spreading out over time.

2017-06-01 18:37:41,757+0200 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:42,221+0200 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:43,397+0200 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:47,967+0200 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:48,096+0200 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:48,100+0200 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:48,106+0200 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:48,233+0200 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.01 seconds (__init__:585)
2017-06-01 18:37:48,332+0200 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:48,469+0200 INFO  (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:48,474+0200 INFO  (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:51,617+0200 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:51,783+0200 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:52,286+0200 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:53,474+0200 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.01 seconds (__init__:585)
2017-06-01 18:37:58,498+0200 INFO  (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:58,626+0200 INFO  (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:58,629+0200 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:58,640+0200 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:58,771+0200 INFO  (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:58,877+0200 INFO  (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:59,019+0200 INFO  (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:59,024+0200 INFO  (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)

Comment 11 Yaniv Kaul 2017-06-01 18:16:48 UTC
Note that the customer host has 56 cores. That can drive any multi-threaded Python application nuts, as we well know (with respect to accurate thread dispatching).

Comment 12 Martin Sivák 2017-06-02 13:55:10 UTC
The Python GIL is not an issue for hosted engine. All the threads are IO bound and spend most of the time waiting.

VDSM currently uses core pinning to avoid the issue.

But we should still call the cheaper ping, as Simone already did in his patch.

Comment 13 Martin Perina 2017-06-02 14:05:56 UTC
There is no longer any need to call additional commands to validate whether the connection is open. We already have a patch [1] in 4.1.z which raises an exception if the connection is unavailable, so you should just catch this exception during normal communication and reconnect.

We will also work on BZ1376843 and backport it into 4.1.4, which should be the ultimate solution for this issue, because it will perform automatic reconnection within the Python jsonrpc client, so you can remove all reconnection-handling code from HE.


[1] https://gerrit.ovirt.org/76181
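Martin's suggested approach - no up-front validation call at all; let the ordinary request raise when the connection is gone, then reconnect and retry - can be sketched like this. The exception class and method names here are hypothetical stand-ins, not the real client's API:

```python
class ClientClosedError(Exception):
    """Hypothetical stand-in for the exception the jsonrpc client raises
    when the underlying connection is unavailable (cf. gerrit 76181)."""


def call_with_reconnect(connect, method, *args, retries=1):
    """Issue an RPC call; on a closed connection, reconnect and retry.

    No separate validation verb is sent: the ordinary request itself
    detects a dead connection via the raised exception.
    """
    client = connect()
    for attempt in range(retries + 1):
        try:
            return getattr(client, method)(*args)
        except ClientClosedError:
            if attempt == retries:
                raise  # reconnect didn't help; propagate to the caller
            client = connect()  # transparent reconnect, then retry
```

Compared with the ping-before-every-call scheme, this removes the extra round trip entirely, at the cost of retry logic in the caller - which BZ1376843 later moved inside the jsonrpc client itself.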

Comment 14 Sandro Bonazzola 2017-08-08 09:40:54 UTC
Re-targeting to 4.1.6 since we need bug #1471617 to be fixed in order to close this one.

Comment 15 Sandro Bonazzola 2017-10-16 10:55:26 UTC
Re-targeting to 4.2.0 since its dependency bug #1471617 has been closed deferred.
Moving the dependency to the 4.2.0 bug #1417708

Comment 16 Nikolai Sednev 2017-12-13 09:22:21 UTC
Using ovirt-engine-appliance-4.2-20171210.1.el7.centos.noarch and ovirt-hosted-engine-setup-2.2.1-0.0.master.20171206172737.gitd3001c8.el7.centos.noarch, during and after deployment of SHE on a pair of hosts over Gluster, and after adding an NFS data storage domain, I did not reproduce the original issue, hence moving to VERIFIED.

Comment 17 Sandro Bonazzola 2017-12-20 11:42:21 UTC
This bugzilla is included in oVirt 4.2.0 release, published on Dec 20th 2017.

Since the problem described in this bug report should be
resolved in oVirt 4.2.0 release, published on Dec 20th 2017, it has been closed with a resolution of CURRENT RELEASE.

If the solution does not work for you, please open a new bug report.

