Bug 1828669 - After SPM select the engine lost communication to all hosts until restarted [improved logging]
Summary: After SPM select the engine lost communication to all hosts until restarted [improved logging]
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ovirt-engine
Version: 4.3.9
Hardware: x86_64
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ovirt-4.4.1
Assignee: Artur Socha
QA Contact: Petr Matyáš
URL:
Whiteboard:
Depends On:
Blocks: 1845152
 
Reported: 2020-04-28 05:04 UTC by Germano Veit Michel
Modified: 2023-12-15 17:46 UTC
CC: 2 users

Fixed In Version: rhv-4.4.1-3
Doc Type: No Doc Update
Doc Text:
Clone Of:
: 1845152 (view as bug list)
Environment:
Last Closed: 2020-08-04 13:22:48 UTC
oVirt Team: Infra
Target Upstream Version:
Embargoed:


Attachments


Links
System ID Private Priority Status Summary Last Updated
Red Hat Knowledge Base (Solution) 5044661 0 None None None 2020-05-04 05:59:45 UTC
Red Hat Product Errata RHSA-2020:3247 0 None None None 2020-08-04 13:23:02 UTC
oVirt gerrit 109161 0 master MERGED Ingnored unit tests re-enabled and stabilized 2021-01-18 18:36:50 UTC
oVirt gerrit 109319 0 master MERGED More verbose logging 2021-01-18 18:36:50 UTC
oVirt gerrit 109322 0 master MERGED ReactorClient: half flag made atomic 2021-01-18 18:36:50 UTC
oVirt gerrit 109365 0 master MERGED Error logging fixed and cleanups 2021-01-18 18:36:50 UTC
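
The linked gerrit patches show what changed for this bug: re-enabled unit tests, more verbose logging, error-logging cleanups, and making the ReactorClient "half" flag atomic, which points at a visibility/race hazard on a plain boolean shared between threads. The following is a minimal sketch of that general technique only; the class and member names are illustrative and not the actual vdsm-jsonrpc-java code.

// Illustrative sketch only -- not the actual vdsm-jsonrpc-java ReactorClient code.
// It shows the general fix pattern behind "half flag made atomic": a connection-state
// flag that is read and written from different threads is changed from a plain
// boolean (no visibility or atomicity guarantees) to an AtomicBoolean.
import java.util.concurrent.atomic.AtomicBoolean;

class ChannelState {
    // Before: `private boolean half;` -- updates made by one thread may be
    // seen late (or interleave badly) on another thread.
    private final AtomicBoolean half = new AtomicBoolean(false);

    /** Marks the channel half-closed; only the thread that actually flips the flag gets true. */
    boolean markHalfClosed() {
        return half.compareAndSet(false, true);
    }

    boolean isHalfClosed() {
        return half.get();
    }
}

compareAndSet both publishes the new value to other threads and ensures at most one thread believes it performed the half-close transition.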

Description Germano Veit Michel 2020-04-28 05:04:19 UTC
Description of problem:

The user force-selected the SPM, which succeeded, but after that the engine lost communication to all hosts until it was restarted.

1. User force selected SPM
2020-04-14 12:00:59,977+02 INFO  [org.ovirt.engine.core.dal.dbbroker.auditloghandling.AuditLogDirector] (default task-333) [6f52780d-6629-4088-8ad1-6e5476cc773d] EVENT_ID: USER_FORCE_SELECTED_SPM(159), Host **** was force selected by ****

2. Start SPM
2020-04-14 12:01:10,220+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-41) [18f7fe20] START, SpmStartVDSCommand(HostName = ****, SpmStartVDSCommandParameters:{hostId='d040c2f0-e658-46c2-8af4-9e60e65299b5', storagePoolId='2fcbca30-a4ed-46b8-96fb-87357d3bb010', prevId='-1', prevLVER='-1', storagePoolFormatType='V5', recoveryMode='Manual', SCSIFencing='false'}), log id: 6bbb49bc
2020-04-14 12:01:11,243+02 INFO  [org.ovirt.engine.core.vdsbroker.vdsbroker.SpmStartVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-41) [18f7fe20] FINISH, SpmStartVDSCommand, return: SpmStatusResult:{SPM Id='1', SPM LVER='23', SPM Status='SPM'}, log id: 6bbb49bc

3. Connect IRS
2020-04-14 12:01:11,281+02 INFO  [org.ovirt.vdsm.jsonrpc.client.reactors.ReactorClient] (SSL Stomp Reactor) [] Connecting to /****

Problem: from this timestamp onwards, nothing the engine sent reached the hosts, and everything started failing with vdsTimeout.

4. Here is the first vdsTimeout (left at the default of 180 seconds); a minimal illustration of this timeout mechanism follows these steps:

2020-04-14 12:04:11,777+02 ERROR [org.ovirt.engine.core.vdsbroker.vdsbroker.GetAllVmStatsVDSCommand] (EE-ManagedScheduledExecutorService-engineScheduledThreadPool-Thread-26) [] Command 'GetAllVmStatsVDSCommand(HostName = ****, VdsIdVDSCommandParametersBase:{hostId='456ce074-fbaa-4cfc-ac85-e7be6cb3ed7b'})' execution failed: VDSGenericException: VDSNetworkException: Message timeout which can be caused by communication issues

5. Every other command, to every other host, behaves the same way until the engine is restarted; after the restart everything is green again.
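
For context on the timeout in step 4: the engine fails a call with "Message timeout which can be caused by communication issues" when no response arrives within vdsTimeout (180 seconds by default). The sketch below illustrates that general pattern only; it is not the engine's actual implementation and all names are hypothetical.

// Minimal sketch, assuming a pending JSON-RPC call is represented by a CompletableFuture:
// if no response completes it within vdsTimeout, it is failed with a timeout error
// instead of hanging forever. Not the engine's real code; names are hypothetical.
import java.util.concurrent.*;

class PendingCallTracker {
    private final ScheduledExecutorService timer = Executors.newSingleThreadScheduledExecutor();

    CompletableFuture<String> track(long vdsTimeoutSeconds) {
        CompletableFuture<String> response = new CompletableFuture<>();
        timer.schedule(() -> {
            // If the request never reached the host, nothing ever completes the future,
            // so this timeout path is the only way the caller learns about the failure.
            response.completeExceptionally(new TimeoutException(
                    "Message timeout which can be caused by communication issues"));
        }, vdsTimeoutSeconds, TimeUnit.SECONDS);
        return response;
    }
}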


From the host side, we can see getAllVmStats calls from the engine stop arriving after the IRS connection is accepted:

2020-04-14 12:00:20,533+0200 INFO  (jsonrpc/5) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48)     <---- engine IP
2020-04-14 12:00:26,772+0200 INFO  (jsonrpc/3) [api.host] START getAllVmStats() from=::1,47490 (api:48)             <---- MoM
2020-04-14 12:00:35,537+0200 INFO  (jsonrpc/6) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48)
2020-04-14 12:00:41,791+0200 INFO  (jsonrpc/4) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:00:50,541+0200 INFO  (jsonrpc/0) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48)
2020-04-14 12:00:56,810+0200 INFO  (jsonrpc/2) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:05,546+0200 INFO  (jsonrpc/3) [api.host] START getAllVmStats() from=::ffff:****,50904 (api:48)
2020-04-14 12:01:11,282+0200 INFO  (Reactor thread) [ProtocolDetector.AcceptorImpl] Accepted connection from ::ffff:****:38574 (protocoldetector:61)   <---- IRS
2020-04-14 12:01:11,819+0200 INFO  (jsonrpc/5) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:26,841+0200 INFO  (jsonrpc/3) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:41,864+0200 INFO  (jsonrpc/7) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:01:56,887+0200 INFO  (jsonrpc/6) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:11,907+0200 INFO  (jsonrpc/4) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:26,926+0200 INFO  (jsonrpc/1) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:41,947+0200 INFO  (jsonrpc/0) [api.host] START getAllVmStats() from=::1,47490 (api:48)
2020-04-14 12:02:56,969+0200 INFO  (jsonrpc/2) [api.host] START getAllVmStats() from=::1,47490 (api:48)

After the engine was restarted, everything was green and working again for all hosts.

Version-Release number of selected component (if applicable):
vdsm-jsonrpc-java-1.4.18-1.el7ev.noarch
ovirt-engine-4.3.9.4-11.el7.noarch
vdsm-4.30.44-1.el7ev.x86_64

How reproducible:
Unknown.

Actual results:
All engine-to-host communication keeps failing until ovirt-engine is restarted.

Expected results:
Engine-to-host communication keeps working normally after the SPM force select.

Comment 6 Artur Socha 2020-06-01 08:19:54 UTC
I failed to reproduce the issue on my dev setup, on both the mentioned 4.3.9 and 4.4-master. Unfortunately, the provided logs do not contain enough information to say what the core issue was.
I reviewed the jsonrpc client code and added some more logging, but frankly speaking, I do not think that will help a lot.
It would be extremely useful to get access to an affected environment (host + engine) and check a thread dump of the engine.

For now, I think I have exhausted all the ideas I had to verify/reproduce the issue.
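
A note on the thread dump requested above: on an affected engine it can be captured with the JDK's jstack tool against the ovirt-engine JVM process, or with kill -3, which writes the dump to the JVM's console log. The snippet below is a purely illustrative way to collect the same information from inside a JVM using only standard JDK APIs; it is not part of the engine.

// Minimal sketch: capture the stack traces of all live threads using plain JDK APIs.
// The output is roughly equivalent in content to a `jstack <pid>` thread dump.
import java.util.Map;

public final class ThreadDumper {
    public static String dump() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            sb.append('"').append(t.getName()).append("\" state=").append(t.getState()).append('\n');
            for (StackTraceElement frame : e.getValue()) {
                sb.append("    at ").append(frame).append('\n');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}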

Comment 14 Petr Matyáš 2020-06-15 12:19:38 UTC
Verified on ovirt-engine-4.4.1.2-0.10.el8ev.noarch

Comment 16 errata-xmlrpc 2020-08-04 13:22:48 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory (Important: RHV Manager (ovirt-engine) 4.4 security, bug fix, and enhancement update), and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHSA-2020:3247

