+++ This bug is a downstream clone. The original bug is: +++
+++   bug 1498327 +++
======================================================================

Description of problem:

Scenario: 2 Hosted-Engine hosts
1. Engine running on Host A.
2. From time to time Host B tries to start the HE VM.
3. Host B's start attempt fails, and vdsm emits a notification to the engine that the start failed.
4. Engine displays "VM HostedEngine is down with error. Exit message: resource busy: Failed to acquire lock: Lease is held by another host."
5. User is flooded with emails about state changes on Host B.

Nothing really breaks, but 4 and 5 disturb the user, and the message that HostedEngine is down is quite alarming even though it is false.

I can't find the exact reason for this in the logs, but here are some observations:

A) Host B seems to regularly cycle through these states:
   EngineDown -> EngineStart -> EngineStarting -> EngineMaybeAway -> EngineDown
   EngineStart[ing] fails because the HE VM is running on Host A, as expected.

B) I found it a bit peculiar that on Host A every 5th cycle takes longer: usually 58-59 seconds, then every 5th cycle 72-73 seconds.

# egrep 'initialize_vdsm' agent.log | awk -F '[ :]' '{curr=$6*60*60+$7*60+$8} {diff=curr-prev} {print $16,diff" ("$5" "$6":"$7":"$8")"} {prev=curr}'
(_initialize_vdsm) 57 (2017-10-03 18:26:44,972)
(_initialize_vdsm) 58 (2017-10-03 18:27:42,986)
(_initialize_vdsm) 58 (2017-10-03 18:28:40,875)
(_initialize_vdsm) 59 (2017-10-03 18:29:39,449)
(_initialize_vdsm) 73 (2017-10-03 18:30:52,554)
(_initialize_vdsm) 58 (2017-10-03 18:31:50,876)
(_initialize_vdsm) 58 (2017-10-03 18:32:48,838)
(_initialize_vdsm) 58 (2017-10-03 18:33:46,784)
(_initialize_vdsm) 58 (2017-10-03 18:34:44,959)
(_initialize_vdsm) 73 (2017-10-03 18:35:57,512)
(_initialize_vdsm) 58 (2017-10-03 18:36:55,824)
(_initialize_vdsm) 58 (2017-10-03 18:37:53,564)
(_initialize_vdsm) 58 (2017-10-03 18:38:51,460)
(_initialize_vdsm) 58 (2017-10-03 18:39:49,754)
(_initialize_vdsm) 73 (2017-10-03 18:41:02,316)
(_initialize_vdsm) 58 (2017-10-03 18:42:00,227)

The extra time is mostly spent extracting the OVFs: every 5th cycle, OVF extraction takes longer once the OVF images are found. There are no extra volumes on the Hosted Engine storage domain besides the HE HA ones.

Could this be causing the metadata sometimes not to be updated in time by Host A, making B consider A not alive and then start the HE VM?

C) On Host A, vdsm sometimes takes longer to list the volumes and to get their info for the agent; this seems to match the longer cycles.
2017-10-03 18:30:10,286+0000 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call StorageDomain.getVolumes succeeded in 2.40 seconds (__init__:539)
2017-10-03 18:30:12,741+0000 INFO (jsonrpc/0) [jsonrpc.JsonRpcServer] RPC call Volume.getInfo succeeded in 2.44 seconds (__init__:539)
2017-10-03 18:30:15,170+0000 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call StorageDomain.getVolumes succeeded in 2.40 seconds (__init__:539)
2017-10-03 18:30:32,318+0000 INFO (jsonrpc/7) [jsonrpc.JsonRpcServer] RPC call Volume.getInfo succeeded in 17.14 seconds (__init__:539)
2017-10-03 18:30:34,825+0000 INFO (jsonrpc/5) [jsonrpc.JsonRpcServer] RPC call StorageDomain.getVolumes succeeded in 2.48 seconds (__init__:539)
2017-10-03 18:30:37,342+0000 INFO (jsonrpc/6) [jsonrpc.JsonRpcServer] RPC call Volume.getInfo succeeded in 2.49 seconds (__init__:539)
2017-10-03 18:30:39,856+0000 INFO (jsonrpc/1) [jsonrpc.JsonRpcServer] RPC call StorageDomain.getVolumes succeeded in 2.48 seconds (__init__:539)

For example, here is a long one:

2017-10-03 19:10:54,080+0000 INFO (jsonrpc/2) [vdsm.api] START getVolumeInfo(sdUUID=u'1f9ee93c-7372-48e2-8566-a49199e95704', spUUID=u'00000000-0000-0000-0000-000000000000', imgUUID=u'e962cf89-a79c-4bd9-881c-33c5bb871b3b', volUUID=u'bab52922-65e6-48db-9740-7d67e74f3c3b', options=None) from=::1,17398 (api:46)
2017-10-03 19:11:08,701+0000 INFO (jsonrpc/2) [storage.VolumeManifest] Info request: sdUUID=1f9ee93c-7372-48e2-8566-a49199e95704 imgUUID=e962cf89-a79c-4bd9-881c-33c5bb871b3b volUUID = bab52922-65e6-48db-9740-7d67e74f3c3b (volume:238)
2017-10-03 19:11:08,718+0000 INFO (jsonrpc/2) [storage.LVM] Refreshing lvs: vg=1f9ee93c-7372-48e2-8566-a49199e95704 lvs=['leases'] (lvm:1291)
2017-10-03 19:11:08,718+0000 INFO (jsonrpc/2) [storage.LVM] Refreshing LVs (vg=1f9ee93c-7372-48e2-8566-a49199e95704, lvs=['leases']) (lvm:1319)
2017-10-03 19:11:11,339+0000 INFO (jsonrpc/2) [storage.VolumeManifest] 1f9ee93c-7372-48e2-8566-a49199e95704/e962cf89-a79c-4bd9-881c-33c5bb871b3b/bab52922-65e6-48db-9740-7d67e74f3c3b info is {'status': 'OK', 'domain': '1f9ee93c-7372-48e2-8566-a49199e95704', 'voltype': 'LEAF', 'description': 'hosted-engine.metadata', 'parent': '00000000-0000-0000-0000-000000000000', 'format': 'RAW', 'generation': 0, 'image': 'e962cf89-a79c-4bd9-881c-33c5bb871b3b', 'ctime': '1499973228', 'disktype': '2', 'legality': 'LEGAL', 'mtime': '0', 'apparentsize': '134217728', 'children': [], 'pool': '', 'capacity': '134217728', 'uuid': u'bab52922-65e6-48db-9740-7d67e74f3c3b', 'truesize': '134217728', 'type': 'PREALLOCATED', 'lease': {'owners': [], 'version': None}} (volume:272)
2017-10-03 19:11:11,339+0000 INFO (jsonrpc/2) [vdsm.api] FINISH getVolumeInfo return={'info': {'status': 'OK', 'domain': '1f9ee93c-7372-48e2-8566-a49199e95704', 'voltype': 'LEAF', 'description': 'hosted-engine.metadata', 'parent': '00000000-0000-0000-0000-000000000000', 'format': 'RAW', 'generation': 0, 'image': 'e962cf89-a79c-4bd9-881c-33c5bb871b3b', 'ctime': '1499973228', 'disktype': '2', 'legality': 'LEGAL', 'mtime': '0', 'apparentsize': '134217728', 'children': [], 'pool': '', 'capacity': '134217728', 'uuid': u'bab52922-65e6-48db-9740-7d67e74f3c3b', 'truesize': '134217728', 'type': 'PREALLOCATED', 'lease': {'owners': [], 'version': None}}} from=::1,17398 (api:52)
2017-10-03 19:11:11,340+0000 INFO (jsonrpc/2) [jsonrpc.JsonRpcServer] RPC call Volume.getInfo succeeded in 17.26 seconds (__init__:539)

D) HE storage is FC:

LUN 360002ac0000000000000000e0001dd6e (253:6)

          DEV        tps  rd_sec/s  wr_sec/s  avgrq-sz  avgqu-sz  await  svctm  %util
Average:  dev253-6  25.07  1260.12   182.29    57.52     0.01      0.53   0.20   0.49

Version-Release number of selected component (if applicable):
Both hosts:
ovirt-hosted-engine-ha-2.1.5-1.el7ev.noarch
vdsm-4.19.20-1.el7ev.x86_64

How reproducible:
Not yet.

(Originally by Germano Veit Michel)
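To make the timing analysis in observations B and C above easier to repeat, here is a minimal, hypothetical Python sketch (not part of vdsm or ovirt-hosted-engine-ha; the script name, file arguments, and the 5-second threshold are just illustrative). It only assumes the log line formats visible in the excerpts above: agent.log entries carrying a "YYYY-MM-DD HH:MM:SS,mmm" timestamp plus the "(_initialize_vdsm)" token, and vdsm.log entries containing "RPC call <method> succeeded in <N> seconds".

#!/usr/bin/env python
# Hypothetical helper: reproduces the awk one-liner from the description
# (interval between consecutive _initialize_vdsm cycles) and flags slow
# RPC calls in vdsm.log. Relies only on the log formats quoted above.
import re
import sys
from datetime import datetime

TS_RE = re.compile(r'(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}')
RPC_RE = re.compile(r'RPC call (\S+) succeeded in ([\d.]+) seconds')


def parse_ts(line):
    """Return the leading timestamp of a log line as datetime, or None."""
    m = TS_RE.search(line)
    if not m:
        return None
    return datetime.strptime(m.group(1), '%Y-%m-%d %H:%M:%S')


def cycle_intervals(agent_log):
    """Print the seconds elapsed between consecutive _initialize_vdsm entries."""
    prev = None
    for line in open(agent_log):
        if '_initialize_vdsm' not in line:
            continue
        ts = parse_ts(line)
        if ts is None:
            continue
        if prev is not None:
            print('%4d s  (%s)' % ((ts - prev).total_seconds(), ts))
        prev = ts


def slow_rpc_calls(vdsm_log, threshold=5.0):
    """Print RPC calls that took at least `threshold` seconds."""
    for line in open(vdsm_log):
        m = RPC_RE.search(line)
        if m and float(m.group(2)) >= threshold:
            print('%s  %-30s %6.2f s' % (parse_ts(line), m.group(1), float(m.group(2))))


if __name__ == '__main__':
    # Usage: ./ha_log_timing.py agent.log vdsm.log
    cycle_intervals(sys.argv[1])
    slow_rpc_calls(sys.argv[2])

Run against Host A's logs, the first function should show the 58/73-second pattern from observation B, and the second should surface outliers like the 17-second Volume.getInfo call from observation C.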
I am pretty sure 2.1.5 should be much faster than 60 seconds per cycle. I am assigning this to Denis, who wrote the optimization that should guarantee that.

(Originally by Martin Sivak)
Failed to reproduce on regular iSCSI, Gluster 3.12, or NFS environments with a simple deployment using ovirt-hosted-engine-setup-2.1.4-1.el7ev.noarch on a pair of clean RHEL 7.4 hosts. The EngineMaybeAway status was fixed in https://bugzilla.redhat.com/show_bug.cgi?id=1504150#c2.
Moving to VERIFIED as I've failed to reproduce the issue. Please feel free to reopen in case it still happens.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA.

For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2017:3136