Bug 1845152

Summary: After SPM select the engine lost communication to all hosts until restarted [improved logging] [RHV clone - 4.3.11]
Product: Red Hat Enterprise Virtualization Manager
Reporter: RHV bug bot <rhv-bugzilla-bot>
Component: ovirt-engine
Assignee: Artur Socha <asocha>
Status: CLOSED ERRATA
QA Contact: Petr Matyáš <pmatyas>
Severity: medium
Priority: medium
Docs Contact:
Version: 4.3.9
CC: lleistne, mperina, pelauter
Target Milestone: ovirt-4.3.11
Keywords: CodeChange, ZStream
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version: rhv-4.3.11-2
Doc Type: No Doc Update
Doc Text:
Story Points: ---
Clone Of: 1828669
Environment:
Last Closed: 2020-09-30 10:07:13 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: Infra
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On: 1828669
Bug Blocks:

Description RHV bug bot 2020-06-08 14:35:50 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1828669 +++
======================================================================

Description of problem:

The user force-selected the SPM, which succeeded, but afterwards the engine lost communication to all hosts until it was restarted.

1. User force selected SPM
2020-04-14 12:00:59,977+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-333) [6f52780d-6629-4088-8ad1-6e5476cc773d] EVENT_ID: USER_FORCE_SELECTED_SPM(159), Host **** was force selected by ****

2. Start SPM
2020-04-14 12:01:10,220+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-41) [18f7fe20] START, SpmStartVDSCommand(HostName = ****, SpmStartVDSCommandParameters:{hostId='d040c2f0-e658-46c2-8af4-9e60e65299b5', storagePoolId='2fcbca30-a4ed-46b8-96fb-87357d3bb010', prevId='-1', prevLVER='-1', storagePoolFormatType='V5', recoveryMode='Manual', SCSIFencing='false'}), log id: 6bbb49bc
2020-04-14 12:01:11,243+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-41) [18f7fe20] FINISH, SpmStartVDSCommand, return: SpmStatusResult:{SPM Id='1', SPM LVER='23', SPM Status='SPM'}, log id: 6bbb49bc

3. Connect IRS
2020-04-14 12:01:11,281+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /****

Problem: from this timestamp onward, nothing the engine sent reached the hosts, and every call started failing with a vdsTimeout.

4. Here is the first vdsTimeout (the timeout was at its default of 180 s)

2020-04-14 12:04:11,777+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-26) [] Command 'GetAllVmStatsVDSCommand(HostName = ****, VdsIdVDSCommandParametersBase:{hostId='456ce074-fbaa-4cfc-ac85-e7be6cb3ed7b'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

5. Every other call, to every other host, behaves the same way until the engine is restarted; after that everything is green.
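
The gap between the IRS connection (12:01:11,281) and the first failure (12:04:11,777) matches the default 180 s vdsTimeout: requests sent after that connection sit unanswered until the timeout fires. A minimal Python sketch of the arithmetic, using the timestamps copied from the excerpts above (illustration only):

from datetime import datetime

# Timestamps copied from the engine.log excerpts above; both carry the same
# +02 offset, so it is dropped here for simplicity.
irs_connect   = datetime.strptime("2020-04-14 12:01:11,281", "%Y-%m-%d %H:%M:%S,%f")
first_timeout = datetime.strptime("2020-04-14 12:04:11,777", "%Y-%m-%d %H:%M:%S,%f")

# Prints roughly 180.5 - the default vdsTimeout of 180 seconds plus scheduling noise.
print((first_timeout - irs_connect).total_seconds())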


From the host side, we can see the engine-originated getAllVmStats calls stop arriving after the IRS connection:

2020-04-14 12:00:20,533+0200 INFO  (jsonrpc/5) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48)     <---- engine IP
2020-04-14 12:00:26,772+0200 INFO  (jsonrpc/3) [api.host] START getAllVmStats() from=::1,47490 (api:48)             <---- MoM
2020-04-14 12:00:35,537+0200 INFO  (jsonrpc/6) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48)
2020-04-14 12:00:41,791+0200 INFO  (jsonrpc/4) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:00:50,541+0200 INFO  (jsonrpc/0) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48)
2020-04-14 12:00:56,810+0200 INFO  (jsonrpc/2) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:05,546+0200 INFO  (jsonrpc/3) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48)
2020-04-14 12:01:11,282+0200 INFO  (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:****:38574 (protocoldetector:61)   <---- IRS
2020-04-14 12:01:11,819+0200 INFO  (jsonrpc/5) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:26,841+0200 INFO  (jsonrpc/3) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:41,864+0200 INFO  (jsonrpc/7) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:56,887+0200 INFO  (jsonrpc/6) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:11,907+0200 INFO  (jsonrpc/4) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:26,926+0200 INFO  (jsonrpc/1) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:41,947+0200 INFO  (jsonrpc/0) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:56,969+0200 INFO  (jsonrpc/2) [api.host] START getAllVmStats() from=::1,47490 (api:48)

After the engine was restarted, everything was green and working again on all hosts.
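
One way to confirm this pattern on a host is to group the getAllVmStats START lines in vdsm.log by caller address; after the IRS connection only the local MoM caller (::1) should remain. A minimal Python sketch (the default vdsm log path is assumed; the engine IP is masked in the excerpt above):

import re
from collections import Counter

# Count getAllVmStats calls per caller address. A healthy host shows calls
# from both the engine IP and ::1 (MoM); in this bug the engine entries stop.
pattern = re.compile(r"START getAllVmStats\(\) from=(\S+?),\d+")

counts = Counter()
with open("/var/log/vdsm/vdsm.log") as log:
    for line in log:
        match = pattern.search(line)
        if match:
            counts[match.group(1)] += 1

for source, count in counts.most_common():
    print(source, count)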

Version-Release number of selected component (if applicable):
vdsm-jsonrpc-java-1.4.18-1.el7ev.noarch
ovirt-engine-4.3.9.4-11.el7.noarch
vdsm-4.30.44-1.el7ev.x86_64

How reproducible:
Unknown.

Actual results:
All communication from the engine to the hosts keeps failing until ovirt-engine is restarted

Expected results:
Working normally

(Originally by Germano Veit Michel)

Comment 6 RHV bug bot 2020-06-08 14:36:01 UTC
I failed to recreate the issue on my dev setup, on both the mentioned 4.3.9 and 4.4-master. Unfortunately, the provided logs do not contain enough information to say what the core issue was.
I reviewed the JSON-RPC client code and added some more logging, but frankly speaking, I do not think that alone will help much.
It would be extremely useful to get access to an affected environment (host + engine) and check the thread dump.

For now I think I have exhausted all the ideas I had to verify/recreate the issue.

(Originally by Artur Socha)
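
A thread dump of the engine JVM can be captured by sending it SIGQUIT, which makes the JVM print all thread stacks without terminating. A minimal Python sketch, assuming the engine writes its pid to /var/run/ovirt-engine.pid (the pid-file path is an assumption; the dump typically lands in the engine console log):

import os
import signal

# Read the ovirt-engine JVM pid (pid-file path is an assumption, adjust as needed).
with open("/var/run/ovirt-engine.pid") as pid_file:
    engine_pid = int(pid_file.read().strip())

# SIGQUIT asks the JVM for a full thread dump; the process keeps running.
os.kill(engine_pid, signal.SIGQUIT)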

Comment 13 Petr Matyáš 2020-06-22 13:15:36 UTC
Verified on ovirt-engine-4.3.11-0.1.el7.noarch

Comment 16 errata-xmlrpc 2020-09-30 10:07:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Virtualization Engine security, bug fix 4.3.11), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4112