Bug 1845152 - After SPM select the engine lost communication to all hosts until restarted [improved logging] [RHV clone - 4.3.11]
Summary: After SPM select the engine lost communication to all hosts until restarted [improved logging] [RHV clone - 4.3.11]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.3.9
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ovirt-4.3.11
Assignee: Artur Socha
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On: 1828669
Blocks:
 
Reported: 2020-06-08 14:35 UTC by RHV bug bot
Modified: 2020-09-30 10:07 UTC
CC: 3 users

Fixed In Version: rhv-4.3.11-2
Doc Type: No Doc Update
Doc Text:
Clone Of: 1828669
Environment:
Last Closed: 2020-09-30 10:07:13 UTC
oVirt Team: Infra
Target Upstream Version:




Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5044661 0 None None None 2020-06-08 14:36:23 UTC
Red Hat Product Errata RHBA-2020:4112 0 None None None 2020-09-30 10:07:44 UTC
oVirt gerrit 109161 0 master MERGED Ignored unit tests re-enabled and stabilized 2020-10-21 18:40:49 UTC
oVirt gerrit 109319 0 master MERGED More verbose logging 2020-10-21 18:40:35 UTC
oVirt gerrit 109322 0 master MERGED ReactorClient: half flag made atomic 2020-10-21 18:40:35 UTC
oVirt gerrit 109365 0 master MERGED Error logging fixed and cleanups 2020-10-21 18:40:35 UTC
oVirt gerrit 109510 0 ovirt-engine-4.3 MERGED Error logging fixed 2020-10-21 18:40:35 UTC
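
The gerrit patches above point at the suspected area: a connection-state flag in ReactorClient that was not thread-safe (gerrit 109322, "ReactorClient: half flag made atomic"). Below is a minimal sketch of that pattern, for illustration only; the class and method names are hypothetical and this is not the actual vdsm-jsonrpc-java code:

import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative sketch of the race: a flag read by the reactor thread and
// written by caller threads needs atomic semantics, otherwise one thread
// may never observe the other's update and the client can stay wedged.
public class ConnectionFlag {
    // Before such a fix the flag would be a plain field (private boolean half;),
    // which gives no cross-thread visibility guarantee.
    private final AtomicBoolean half = new AtomicBoolean(false);

    public void markHalfOpen()  { half.set(true);  }
    public void markConnected() { half.set(false); }
    public boolean isHalfOpen() { return half.get(); }
}

A volatile field would also restore visibility; AtomicBoolean additionally offers compare-and-set if the state transition itself must be race-free.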

Description RHV bug bot 2020-06-08 14:35:50 UTC
+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1828669 +++
======================================================================

Description of problem:

The user force-selected an SPM, which succeeded. But after that, the engine lost communication to all hosts until it was restarted.

1. User force selected SPM
2020-04-14 12:00:59,977+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-333) [6f52780d-6629-4088-8ad1-6e5476cc773d] EVENT_ID: USER_FORCE_SELECTED_SPM(159), Host **** was force selected by ****

2. Start SPM
2020-04-14 12:01:10,220+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-41) [18f7fe20] START, SpmStartVDSCommand(HostName = ****, SpmStartVDSCommandParameters:{hostId='d040c2f0-e658-46c2-8af4-9e60e65299b5', storagePoolId='2fcbca30-a4ed-46b8-96fb-87357d3bb010', prevId='-1', prevLVER='-1', storagePoolFormatType='V5', recoveryMode='Manual', SCSIFencing='false'}), log id: 6bbb49bc
2020-04-14 12:01:11,243+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-41) [18f7fe20] FINISH, SpmStartVDSCommand, return: SpmStatusResult:{SPM Id='1', SPM LVER='23', SPM Status='SPM'}, log id: 6bbb49bc

3. Connect IRS
2020-04-14 12:01:11,281+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /****

Problem: from this timestamp on, nothing the engine sent reached the hosts, and everything started failing with vdsTimeout.

4. Here is the first vdsTimeout (set to default 180s)

2020-04-14 12:04:11,777+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-26) [] Command 'GetAllVmStatsVDSCommand(HostName = ****, VdsIdVDSCommandParametersBase:{hostId='456ce074-fbaa-4cfc-ac85-e7be6cb3ed7b'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues
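
For context, the timeout above is the engine-side cap on how long a single VDSM call may wait for its response (the vdsTimeout value, 180 seconds by default). A minimal sketch of the pattern, with a hypothetical helper class; this is not the vdsm-jsonrpc-java implementation:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, illustration only: the client tracks each pending call
// and fails it when no response arrives in time. With the transport silently
// broken, as in this bug, every call to the host ends up on this path.
public final class TimedCall {
    public static <T> T await(CompletableFuture<T> response, long timeoutSeconds)
            throws Exception {
        try {
            return response.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            // Mirrors the engine log above: the request was queued or sent,
            // but no reply was ever observed.
            throw new Exception(
                    "Message timeout which can be caused by communication issues", e);
        }
    }
}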

5. Commands to all other hosts behave the same way until the engine is restarted; after the restart, everything is green.


From the host side, we can see the engine's getAllVmStats calls stop arriving after the IRS connection:

2020-04-14 12:00:20,533+0200 INFO  (jsonrpc/5) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48)     <---- engine IP
2020-04-14 12:00:26,772+0200 INFO  (jsonrpc/3) [api.host] START getAllVmStats() from=::1,47490 (api:48)             <---- MoM
2020-04-14 12:00:35,537+0200 INFO  (jsonrpc/6) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48)
2020-04-14 12:00:41,791+0200 INFO  (jsonrpc/4) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:00:50,541+0200 INFO  (jsonrpc/0) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48)
2020-04-14 12:00:56,810+0200 INFO  (jsonrpc/2) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:05,546+0200 INFO  (jsonrpc/3) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48)
2020-04-14 12:01:11,282+0200 INFO  (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:****:38574 (protocoldetector:61)   <---- IRS
2020-04-14 12:01:11,819+0200 INFO  (jsonrpc/5) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:26,841+0200 INFO  (jsonrpc/3) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:41,864+0200 INFO  (jsonrpc/7) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:56,887+0200 INFO  (jsonrpc/6) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:11,907+0200 INFO  (jsonrpc/4) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:26,926+0200 INFO  (jsonrpc/1) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:41,947+0200 INFO  (jsonrpc/0) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:56,969+0200 INFO  (jsonrpc/2) [api.host] START getAllVmStats() from=::1,47490 (api:48)

After the engine was restarted, everything was green and working again on all hosts.

Version-Release number of selected component (if applicable):
vdsm-jsonrpc-java-1.4.18-1.el7ev.noarch
ovirt-engine-4.3.9.4-11.el7.noarch
vdsm-4.30.44-1.el7ev.x86_64

How reproducible:
Unknown.

Actual results:
All communication with hosts fails until ovirt-engine is restarted.

Expected results:
Communication with hosts keeps working normally after the SPM selection.

(Originally by Germano Veit Michel)

Comment 6 RHV bug bot 2020-06-08 14:36:01 UTC
I failed to recreate the issue on my dev setup, on both the mentioned 4.3.9 and 4.4-master. The provided logs unfortunately do not contain enough information to say what the core issue was.
I reviewed the JSON-RPC client code and added some more logging, but frankly speaking, I do not think that will help a lot.
It would be extremely useful to get access to an affected environment (host + engine) and check the thread dump.
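
For reference, a thread dump of the engine JVM can be captured externally with jstack <engine-pid>, or from inside the JVM via the standard ThreadMXBean API. A minimal, self-contained sketch (the class name is illustrative):

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Prints all live threads of the current JVM, including held monitors and
// ownable synchronizers -- enough detail to spot a stuck reactor thread.
public class ThreadDump {
    public static void main(String[] args) {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        for (ThreadInfo info : threads.dumpAllThreads(true, true)) {
            System.out.print(info);
        }
    }
}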

For now, I think I have exhausted all the ideas I had to verify/recreate the issue.

(Originally by Artur Socha)

Comment 13 Petr Matyáš 2020-06-22 13:15:36 UTC
Verified on ovirt-engine-4.3.11-0.1.el7.noarch

Comment 16 errata-xmlrpc 2020-09-30 10:07:13 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Red Hat Virtualization Engine security, bug fix 4.3.11), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2020:4112

