+++ This bug is a downstream clone. The original bug is: +++
+++ bug 1828669 +++
======================================================================

Description of problem:

The user force selected the SPM, which succeeded, but after that the engine lost communication to all hosts until it was restarted.

1. User force selected the SPM:

2020-04-14 12:00:59,977+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-333) [6f52780d-6629-4088-8ad1-6e5476cc773d] EVENT_ID: USER_FORCE_SELECTED_SPM(159), Host **** was force selected by ****

2. SPM start:

2020-04-14 12:01:10,220+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-41) [18f7fe20] START, SpmStartVDSCommand(HostName = ****, SpmStartVDSCommandParameters:{hostId='d040c2f0-e658-46c2-8af4-9e60e65299b5', storagePoolId='2fcbca30-a4ed-46b8-96fb-87357d3bb010', prevId='-1', prevLVER='-1', storagePoolFormatType='V5', recoveryMode='Manual', SCSIFencing='false'}), log id: 6bbb49bc
2020-04-14 12:01:11,243+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-41) [18f7fe20] FINISH, SpmStartVDSCommand, return: SpmStatusResult:{SPM Id='1', SPM LVER='23', SPM Status='SPM'}, log id: 6bbb49bc

3. IRS connection:

2020-04-14 12:01:11,281+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /****

Problem: from this timestamp on, nothing the engine sent reached the hosts, and every command started failing with vdsTimeout.

4. Here is the first vdsTimeout (set to the default of 180 s; a sketch of this timeout pattern follows the numbered list):

2020-04-14 12:04:11,777+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-26) [] Command 'GetAllVmStatsVDSCommand(HostName = ****, VdsIdVDSCommandParametersBase:{hostId='456ce074-fbaa-4cfc-ac85-e7be6cb3ed7b'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

5. Every other command to every other host behaves the same way until the engine is restarted; after the restart everything is green.
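For context, the ~3-minute gap between the IRS connect at 12:01:11 and the first failure at 12:04:11 matches the 180 s vdsTimeout. The sketch below is not the vdsm-jsonrpc-java implementation; it is only a minimal, hypothetical illustration (class name PendingCallSketch and all identifiers are made up) of the general request/response-correlation pattern with a per-call timeout, which explains why every command surfaces as "Message timeout" exactly vdsTimeout after it was sent once the broken connection stops delivering responses.

```java
import java.util.Map;
import java.util.concurrent.*;

/**
 * Minimal sketch (NOT the real vdsm-jsonrpc-java code) of a request/response
 * correlator with a per-call timeout: each outgoing call registers a pending
 * future, and a scheduler fails it after the configured timeout if no
 * response is ever correlated back.
 */
public class PendingCallSketch {
    private final Map<String, CompletableFuture<String>> pending = new ConcurrentHashMap<>();
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();
    private final long timeoutSeconds;

    public PendingCallSketch(long timeoutSeconds) {   // e.g. 180, mirroring the default vdsTimeout
        this.timeoutSeconds = timeoutSeconds;
    }

    /** Register an outgoing call; the caller is expected to put the request on the wire itself. */
    public CompletableFuture<String> send(String callId) {
        CompletableFuture<String> future = new CompletableFuture<>();
        pending.put(callId, future);
        timer.schedule(() -> {
            // If no response has been correlated back in time, fail the call.
            CompletableFuture<String> f = pending.remove(callId);
            if (f != null) {
                f.completeExceptionally(
                    new TimeoutException("Message timeout which can be caused by communication issues"));
            }
        }, timeoutSeconds, TimeUnit.SECONDS);
        return future;
    }

    /** Called by the reader thread when a response with a matching id arrives. */
    public void onResponse(String callId, String payload) {
        CompletableFuture<String> f = pending.remove(callId);
        if (f != null) {
            f.complete(payload);
        }
    }

    public static void main(String[] args) throws Exception {
        PendingCallSketch client = new PendingCallSketch(2);   // short timeout for the demo
        CompletableFuture<String> reply = client.send("GetAllVmStats-1");
        try {
            reply.get();                                       // no response ever arrives...
        } catch (ExecutionException e) {
            System.out.println("failed: " + e.getCause().getMessage());  // ...so the call times out
        }
        client.timer.shutdown();
    }
}
```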
From the host side, we can see the engine's getAllVmStats calls stop arriving after the IRS connection, while the local MoM calls keep coming (a log-scanning sketch follows after this comment):

2020-04-14 12:00:20,533+0200 INFO (jsonrpc/5) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48) <---- engine IP
2020-04-14 12:00:26,772+0200 INFO (jsonrpc/3) [api.host] START getAllVmStats() from=::1,47490 (api:48) <---- MoM
2020-04-14 12:00:35,537+0200 INFO (jsonrpc/6) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48)
2020-04-14 12:00:41,791+0200 INFO (jsonrpc/4) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:00:50,541+0200 INFO (jsonrpc/0) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48)
2020-04-14 12:00:56,810+0200 INFO (jsonrpc/2) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:05,546+0200 INFO (jsonrpc/3) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48)
2020-04-14 12:01:11,282+0200 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:****:38574 (protocoldetector:61) <---- IRS
2020-04-14 12:01:11,819+0200 INFO (jsonrpc/5) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:26,841+0200 INFO (jsonrpc/3) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:41,864+0200 INFO (jsonrpc/7) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:56,887+0200 INFO (jsonrpc/6) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:11,907+0200 INFO (jsonrpc/4) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:26,926+0200 INFO (jsonrpc/1) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:41,947+0200 INFO (jsonrpc/0) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:56,969+0200 INFO (jsonrpc/2) [api.host] START getAllVmStats() from=::1,47490 (api:48)

After the engine was restarted, everything was green and working again for all hosts.

Version-Release number of selected component (if applicable):
vdsm-jsonrpc-java-1.4.18-1.el7ev.noarch
ovirt-engine-4.3.9.4-11.el7.noarch
vdsm-4.30.44-1.el7ev.x86_64

How reproducible:
Unknown.

Actual results:
Everything keeps failing until ovirt-engine is restarted.

Expected results:
Communication keeps working normally.

(Originally by Germano Veit Michel)
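The following is a hypothetical helper, not part of vdsm or the engine (the class name and the /var/log/vdsm/vdsm.log default path are assumptions): it scans a vdsm.log for "START getAllVmStats" lines and prints the last timestamp seen per caller address, which makes the pattern above (engine polling stops, MoM at ::1 continues) easy to confirm on other hosts.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

/**
 * Hypothetical log-scanning helper: report the last "START getAllVmStats"
 * timestamp per caller, so engine-originated polling (engine IP) can be told
 * apart from local MoM polling (::1) at a glance.
 */
public class GetAllVmStatsLastSeen {
    // Example line:
    // 2020-04-14 12:01:11,819+0200 INFO (jsonrpc/5) [api.host] START getAllVmStats() from=::1,47490 (api:48)
    private static final Pattern LINE = Pattern.compile(
        "^(\\S+ \\S+) .*START getAllVmStats\\(\\) from=([^,]+),");

    public static void main(String[] args) throws IOException {
        String path = args.length > 0 ? args[0] : "/var/log/vdsm/vdsm.log";
        Map<String, String> lastSeen = new LinkedHashMap<>();
        try (Stream<String> lines = Files.lines(Paths.get(path))) {
            lines.forEach(line -> {
                Matcher m = LINE.matcher(line);
                if (m.find()) {
                    lastSeen.put(m.group(2), m.group(1));   // caller address -> last timestamp
                }
            });
        }
        lastSeen.forEach((caller, ts) ->
            System.out.println("last getAllVmStats from " + caller + " at " + ts));
    }
}
```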
I failed to recreate the issue on my dev setup on both the mentioned 4.3.9 and 4.4-master. Unfortunately, the provided logs do not contain enough information to say what the core issue was. I reviewed the JSON-RPC client code and added some more logging, but frankly speaking, I do not think that would help a lot. It would be extremely useful to get access to the affected environment (host + engine) and check a thread dump. For now I think I have exhausted all the ideas I had to verify/recreate the issue.

(Originally by Artur Socha)
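For reference, a thread dump of the engine JVM can be captured externally with the standard JDK tools (jstack <pid> or jcmd <pid> Thread.print). The snippet below is only a generic in-process illustration of the same information, not engine code; it prints the stack of every live thread in the current JVM.

```java
import java.util.Map;

/**
 * Generic illustration (not engine code): print the stack trace of every live
 * thread in the current JVM, roughly the information an external thread dump
 * taken with jstack/jcmd would provide.
 */
public class ThreadDumpSketch {
    public static void main(String[] args) {
        Map<Thread, StackTraceElement[]> dump = Thread.getAllStackTraces();
        for (Map.Entry<Thread, StackTraceElement[]> entry : dump.entrySet()) {
            Thread t = entry.getKey();
            System.out.println("\"" + t.getName() + "\" state=" + t.getState());
            for (StackTraceElement frame : entry.getValue()) {
                System.out.println("    at " + frame);
            }
        }
    }
}
```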
Verified on ovirt-engine-4.3.11-0.1.el7.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Red Hat Virtualization Engine security, bug fix 4.3.11), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2020:4112