Description of problem:

The user force-selected the SPM, which succeeded, but after that the engine lost communication with all hosts until it was restarted.

1. User force-selected the SPM:

2020-04-14 12:00:59,977+02 INFO [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-333) [6f52780d-6629-4088-8ad1-6e5476cc773d] EVENT_ID: USER_FORCE_SELECTED_SPM(159), Host **** was force selected by ****

2. SPM start:

2020-04-14 12:01:10,220+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-41) [18f7fe20] START, SpmStartVDSCommand(HostName = ****, SpmStartVDSCommandParameters:{hostId='d040c2f0-e658-46c2-8af4-9e60e65299b5', storagePoolId='2fcbca30-a4ed-46b8-96fb-87357d3bb010', prevId='-1', prevLVER='-1', storagePoolFormatType='V5', recoveryMode='Manual', SCSIFencing='false'}), log id: 6bbb49bc
2020-04-14 12:01:11,243+02 INFO [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-41) [18f7fe20] FINISH, SpmStartVDSCommand, return: SpmStatusResult:{SPM Id='1', SPM LVER='23', SPM Status='SPM'}, log id: 6bbb49bc

3. IRS connect:

2020-04-14 12:01:11,281+02 INFO [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /****

Problem: from this timestamp on, nothing the engine sent reached the hosts, and everything started failing with vdsTimeout.

4. Here is the first vdsTimeout (at the default 180 s):

2020-04-14 12:04:11,777+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-26) [] Command 'GetAllVmStatsVDSCommand(HostName = ****, VdsIdVDSCommandParametersBase:{hostId='456ce074-fbaa-4cfc-ac85-e7be6cb3ed7b'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

5. Commands to all other hosts behaved the same way until the engine was restarted; after the restart everything was green again.
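Note that the gap between the IRS connect and the first failure matches the 180 s vdsTimeout exactly. A minimal triage sketch for confirming that on similar logs (the engine.log path and the timestamp/message formats are taken from the excerpts above and are assumptions for illustration; in practice you would also narrow the scan to the relevant time window):

from datetime import datetime

LOG = "/var/log/ovirt-engine/engine.log"  # default location, adjust as needed

def ts(line):
    # engine.log lines start with e.g. "2020-04-14 12:01:11,281"
    return datetime.strptime(line[:23], "%Y-%m-%d %H:%M:%S,%f")

connect_at = first_timeout = None
with open(LOG) as f:
    for line in f:
        if connect_at is None and "ReactorClient" in line and "Connecting to" in line:
            connect_at = ts(line)
        elif connect_at and "VDSNetworkException: Message timeout" in line:
            first_timeout = ts(line)
            break

if connect_at and first_timeout:
    print("gap between IRS connect and first timeout:",
          (first_timeout - connect_at).total_seconds(), "s")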
From the host side, we can see the engine's getAllVmStats calls stop arriving after the IRS connection:

2020-04-14 12:00:20,533+0200 INFO (jsonrpc/5) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48) <---- engine IP
2020-04-14 12:00:26,772+0200 INFO (jsonrpc/3) [api.host] START getAllVmStats() from=::1,47490 (api:48) <---- MoM
2020-04-14 12:00:35,537+0200 INFO (jsonrpc/6) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48)
2020-04-14 12:00:41,791+0200 INFO (jsonrpc/4) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:00:50,541+0200 INFO (jsonrpc/0) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48)
2020-04-14 12:00:56,810+0200 INFO (jsonrpc/2) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:05,546+0200 INFO (jsonrpc/3) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48)
2020-04-14 12:01:11,282+0200 INFO (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:****:38574 (protocoldetector:61) <---- IRS
2020-04-14 12:01:11,819+0200 INFO (jsonrpc/5) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:26,841+0200 INFO (jsonrpc/3) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:41,864+0200 INFO (jsonrpc/7) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:56,887+0200 INFO (jsonrpc/6) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:11,907+0200 INFO (jsonrpc/4) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:26,926+0200 INFO (jsonrpc/1) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:41,947+0200 INFO (jsonrpc/0) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:56,969+0200 INFO (jsonrpc/2) [api.host] START getAllVmStats() from=::1,47490 (api:48)

After the engine was restarted, everything was green and working again on all hosts.

Version-Release number of selected component (if applicable):
vdsm-jsonrpc-java-1.4.18-1.el7ev.noarch
ovirt-engine-4.3.9.4-11.el7.noarch
vdsm-4.30.44-1.el7ev.x86_64

How reproducible:
Unknown.

Actual results:
All engine-to-host communication fails with vdsTimeout until ovirt-engine is restarted.

Expected results:
Engine-to-host communication keeps working normally.
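Since MoM's local calls from ::1 kept succeeding, vdsm itself was healthy and only the engine-originated traffic stopped. A host-side triage sketch that makes this visible (the vdsm.log path and line format are assumed from the excerpt above; vdsm-client is the standard CLI shipped with vdsm):

import subprocess

# Report the last engine-originated getAllVmStats, skipping MoM's
# local calls, which arrive from ::1.
last_engine_call = None
with open("/var/log/vdsm/vdsm.log") as f:
    for line in f:
        if "START getAllVmStats" in line and "from=::1," not in line:
            last_engine_call = line.strip()
print("last engine-originated call:", last_engine_call)

# Independently confirm vdsm still answers locally, as MoM's calls suggest.
subprocess.run(["vdsm-client", "Host", "getAllVmStats"], check=True)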
I failed to recreate the issue on my dev setup on both the mentioned 4.3.9 and 4.4-master. Unfortunately, the provided logs do not contain enough information to identify the core issue. I reviewed the JSON-RPC client code and added some more logging, but frankly speaking, I do not think that will help much. It would be extremely useful to get access to the affected environment (host + engine) and check a thread dump. For now, I think I have exhausted all the ideas I had to verify/recreate the issue.
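If the environment becomes available again, the thread dump can be captured without restarting the engine: SIGQUIT makes a JVM print a full thread dump to its stdout while it keeps running (jstack <pid> works equally well). A minimal sketch; the pgrep pattern, the assumption of a single engine JVM, and the console-log destination are guesses about a stock ovirt-engine install:

import os
import signal
import subprocess

# The engine's WildFly JVM is typically launched via jboss-modules.jar;
# take the first matching pid, assuming a single engine JVM on the box.
pid = int(subprocess.check_output(["pgrep", "-f", "jboss-modules.jar"]).split()[0])

# Ask the JVM for a thread dump without stopping it.
os.kill(pid, signal.SIGQUIT)
# The dump should then appear in the JVM's console output,
# e.g. /var/log/ovirt-engine/console.log on a default setup.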
Verified on ovirt-engine-4.4.1.2-0.10.el8ev.noarch
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHSA-2020:3247