Description of problem: For some unknown reason, there is a flood of getHardwareInfo calls, one every 2 seconds. This is a HostedEngine host; not sure if that is related.
It can't be an internal vdsm issue; something is calling that verb from somewhere. I see in vdsm.log that this version doesn't include the new log format that specifies the sender address. In engine.log I don't see any bursts of these requests, so it's either MOM or some other local service calling it. I'm not sure this is the actual "issue" related to the ticket you attached.
I see in the case discussion that it's indeed not related. But regarding the getHardwareInfo calls, I'd like to ask the MOM and hosted-engine folks to check their logs here too and see whether they are the source of this request burst.
Martin/Simone could you please take a look?
I do not see any trace of that verb in either of those two projects:

[msivak@dhcp131-31 ~]$ cd Work/mom/
[msivak@dhcp131-31 mom]$ git grep -i gethardware
[msivak@dhcp131-31 mom]$ cd ../ovirt-hosted-engine-ha/
[msivak@dhcp131-31 ovirt-hosted-engine-ha]$ git grep -i gethardware
[msivak@dhcp131-31 ovirt-hosted-engine-ha]$
Note that network_caps is also called far more often than it should be, every 5 or 10 seconds. If there is some leak (an FD leak, for example) in either path, we have a problem. Example:

MainProcess|jsonrpc/3::DEBUG::2017-05-30 08:57:55,785::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/3::DEBUG::2017-05-30 08:58:05,164::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/3::DEBUG::2017-05-30 08:58:10,193::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/4::DEBUG::2017-05-30 08:58:15,554::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/5::DEBUG::2017-05-30 08:58:25,937::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/4::DEBUG::2017-05-30 08:58:35,306::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/7::DEBUG::2017-05-30 08:58:45,699::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/1::DEBUG::2017-05-30 08:58:56,068::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/2::DEBUG::2017-05-30 08:59:05,452::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/0::DEBUG::2017-05-30 08:59:15,831::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
MainProcess|jsonrpc/2::DEBUG::2017-05-30 08:59:25,201::supervdsmServer::93::SuperVdsm.ServerCallback::(wrapper) call network_caps with () {}
On master we merged https://gerrit.ovirt.org/#/c/76212/ to reuse an existing jsonrpc connection instead of always opening a new one. Since, as per https://bugzilla.redhat.com/show_bug.cgi?id=1376843, jsonrpc doesn't provide any keep-alive or reconnect mechanism, we directly call getVdsHardwareInfo to validate the connection at the application level. But this shouldn't affect 4.1 as reported, and every 2 seconds is far too frequent even for master code.
Oh, getVdsHardwareInfo... well, let me check again then. MOM does not use that one either, but hosted engine can call it quite often. Simone: we call util.connect_vdsm_json_rpc from many places, including almost all monitors, and the first thing connect does is __vdsm_json_rpc_check, which always calls getHwInfo at least once. I guess we are missing some kind of Ping() command to check whether a connection is alive.
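To illustrate why this multiplies the RPC traffic: the pattern described above amounts to probing a cached connection with a real verb on every connect. The following is a minimal sketch of that pattern, not the actual ovirt-hosted-engine-ha code; all class and function names here are hypothetical.

```python
# Hypothetical sketch of the validate-on-connect pattern described above.
# Every connect_vdsm_json_rpc() call issues one extra RPC (here
# Host.getHardwareInfo) purely to prove the connection is alive, so N
# monitor threads reconnecting every cycle means N probe calls per cycle.

class FakeJsonRpcClient:
    """Stand-in for a jsonrpc client; a real one would talk to vdsmd."""
    def __init__(self):
        self.calls = []

    def call(self, verb):
        self.calls.append(verb)
        return {}  # pretend the RPC succeeded


_cached_client = None


def _check(client):
    # The liveness probe: one extra RPC per connection attempt.
    client.call("Host.getHardwareInfo")


def connect_vdsm_json_rpc():
    # Reuse the cached connection, but validate it first; the probe
    # runs even when the cached connection is perfectly healthy.
    global _cached_client
    if _cached_client is None:
        _cached_client = FakeJsonRpcClient()
    _check(_cached_client)
    return _cached_client
```

With many monitor threads each calling connect_vdsm_json_rpc() on a short period, the probe verb alone produces the kind of flood seen in the logs.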
(In reply to Simone Tiraboschi from comment #7)
> On master we merged https://gerrit.ovirt.org/#/c/76212/ to reuse an existing
> jsonrpc connection instead of always opening a new one; since, as for
> https://bugzilla.redhat.com/show_bug.cgi?id=1376843 , the jsonrpc doesn't
> provide any keep-alive or reconnect mechanism we are directly calling
> getVdsHardwareInfo to validate the connection at application level.

That's the cheapest command to validate the connection? I hope the result is at least cached on the vdsm side.

> But this shouldn't affect 4.1 as reported and 2 seconds is by far too
> frequent also on master code.
OK, reproduced on 4.1. Stopping ovirt-ha-agent, the getHardwareInfo calls stop as well. Also on 4.1 we are using Host.getHardwareInfo to validate the fresh jsonrpc connection. As soon as I restart ovirt-ha-agent, I see a burst of Host.getHardwareInfo calls (one for each monitoring thread) every 10 seconds:

2017-06-01 18:30:31,765+0200 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:31,776+0200 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:31,919+0200 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:32,016+0200 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:32,137+0200 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:32,182+0200 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:32,182+0200 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:32,183+0200 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:32,183+0200 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.01 seconds (__init__:585)
2017-06-01 18:30:42,216+0200 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:42,257+0200 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:42,306+0200 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:42,327+0200 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:52,336+0200 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:52,430+0200 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:52,430+0200 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:53,294+0200 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.01 seconds (__init__:585)
2017-06-01 18:30:53,297+0200 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:30:53,428+0200 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)

But after just a few minutes the monitoring threads probably drift out of sync, and the getHardwareInfo calls start spreading out over time:

2017-06-01 18:37:41,757+0200 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:42,221+0200 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:43,397+0200 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:47,967+0200 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:48,096+0200 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:48,100+0200 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:48,106+0200 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:48,233+0200 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.01 seconds (__init__:585)
2017-06-01 18:37:48,332+0200 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:48,469+0200 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:48,474+0200 INFO (jsonrpc/3) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:51,617+0200 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:51,783+0200 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:52,286+0200 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:53,474+0200 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.01 seconds (__init__:585)
2017-06-01 18:37:58,498+0200 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:58,626+0200 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:58,629+0200 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:58,640+0200 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:58,771+0200 INFO (jsonrpc/4) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:58,877+0200 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:59,019+0200 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
2017-06-01 18:37:59,024+0200 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Host.getHardwareInfo succeeded in 0.00 seconds (__init__:585)
Note that the customer host has 56 cores. As we well know, that can drive any multi-threaded Python application nuts (with regard to accurate thread dispatching).
The Python GIL is not an issue for hosted engine: all the threads are I/O bound and spend most of their time waiting. VDSM currently uses core pinning to avoid the issue. But we should still call the cheaper ping, as Simone already did in his patch.
It is no longer necessary to call any additional command to validate whether the connection is open. We already have a patch [1] in 4.1.z which raises an exception if the connection is unavailable, so you should just catch this exception during normal communication and reconnect (no need to call any extra verb to validate the connection). We will also work on BZ1376843 and backport it into 4.1.4, which should be the ultimate solution for this issue, because it will perform automatic reconnection within the Python jsonrpc client, so you could remove all the reconnection-handling code from HE.

[1] https://gerrit.ovirt.org/76181
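The catch-and-reconnect approach suggested above can be sketched as follows. This is an illustrative example only: the exception class and client here are hypothetical stand-ins, not the real jsonrpc-client API from the patch.

```python
# Sketch of the suggested approach: instead of probing with an extra
# RPC, just issue the real call and reconnect only when the connection
# raises. ClientConnectionError and FlakyClient are illustrative names.

class ClientConnectionError(Exception):
    """Raised when the underlying connection is unavailable."""


class FlakyClient:
    """Stand-in client whose first call can fail like a stale connection."""
    def __init__(self, fail_first=False):
        self._fail = fail_first

    def call(self, verb):
        if self._fail:
            self._fail = False  # only the stale connection fails
            raise ClientConnectionError("connection closed")
        return {"verb": verb}


def call_with_reconnect(client_factory, client, verb):
    # Try the real call; on a connection error, rebuild the client once
    # and retry. No standalone validation RPC is ever issued.
    try:
        return client, client.call(verb)
    except ClientConnectionError:
        client = client_factory()
        return client, client.call(verb)
```

The point is that a healthy connection costs zero extra RPCs per cycle, while a broken one costs exactly one reconnect, which is why this removes the getHardwareInfo flood.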
Re-targeting to 4.1.6 since we need bug #1471617 to be fixed in order to close this one.
Re-targeting to 4.2.0 since its dependency, bug #1471617, has been closed as deferred. Moving the dependency to the 4.2.0 bug #1417708.
Using ovirt-engine-appliance-4.2-20171210.1.el7.centos.noarch and ovirt-hosted-engine-setup-2.2.1-0.0.master.20171206172737.gitd3001c8.el7.centos.noarch, during and after deployment of SHE on a pair of hosts over Gluster, and after adding an NFS data storage domain, I did not reproduce the original issue; hence moving to verified.
This bugzilla is included in the oVirt 4.2.0 release, published on Dec 20th 2017. Since the problem described in this bug report should be resolved in that release, it has been closed with a resolution of CURRENT RELEASE. If the solution does not work for you, please open a new bug report.