Bug 1599110
| Summary: | Hosted-Engine HA failed and started looping, not recovering itself. | | |
|---|---|---|---|
| Product: | Red Hat Enterprise Virtualization Manager | Reporter: | Germano Veit Michel <gveitmic> |
| Component: | ovirt-hosted-engine-ha | Assignee: | Martin Sivák <msivak> |
| Status: | CLOSED DUPLICATE | QA Contact: | meital avital <mavital> |
| Severity: | medium | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 4.2.3 | CC: | lsurette |
| Target Milestone: | --- | Flags: | lsvaty: testing_plan_complete- |
| Target Release: | --- | | |
| Hardware: | x86_64 | | |
| OS: | Linux | | |
| Whiteboard: | | | |
| Fixed In Version: | | Doc Type: | If docs needed, set a value |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | Environment: | |
| Last Closed: | 2018-07-09 06:05:27 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | SLA | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
(In reply to Germano Veit Michel from comment #0)

> There was a problem with the hosted-engine metadata LV (somehow it got deactivated).

I think this could have been caused by the vdsm restart, as it cleans up "stale" LVs on start-up. Will try to reproduce.

Reproduced the LV deactivation by vdsm, but the problem does not fully reproduce in 4.2.4: it recovers automatically because the HA broker restarts.
```
# date && lvs | grep 8a9577f7-2e95-49a2-8ab8-8447c935446f
Mon Jul 9 15:58:40 AEST 2018
  8a9577f7-2e95-49a2-8ab8-8447c935446f 6c873880-0244-4823-90d5-03e4ec562cd8 -wi-a----- 1.00g

# systemctl restart vdsmd

vdsm.log:
2018-07-09 15:59:02,877+1000 INFO (hsm/init) [storage.LVM] Deactivating lvs: vg=6c873880-0244-4823-90d5-03e4ec562cd8 lvs=['8a9577f7-2e95-49a2-8ab8-8447c935446f', 'ad62b2ec-3f99-47cd-a8e0-d22d4b1b3d69', '3fc2c155-65b7-4b2c-988a-b8beab04a01a', 'c9042982-1ec4-4b74-8c97-b7ab119ac8d6'] (lvm:684)

# date && lvs | grep 8a9577f7-2e95-49a2-8ab8-8447c935446f
Mon Jul 9 15:59:23 AEST 2018
  8a9577f7-2e95-49a2-8ab8-8447c935446f 6c873880-0244-4823-90d5-03e4ec562cd8 -wi------- 1.00g
```
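In the `lvs` output above, the fifth character of the attribute field is the LV state bit: it flips from `a` (active) to `-` once vdsm deactivates the LV. A minimal sketch of checking that bit (the helper name `lv_is_active` is mine, not from the vdsm or HA codebase):

```python
def lv_is_active(lv_attr: str) -> bool:
    """Return True if an `lvs` attribute string (the lv_attr column)
    has the 'active' state bit set. That bit is the 5th character:
    '-wi-a-----' means active, '-wi-------' means deactivated."""
    return len(lv_attr) >= 5 and lv_attr[4] == "a"

# Attribute strings taken from the lvs output above:
assert lv_is_active("-wi-a-----")      # before the vdsm restart
assert not lv_is_active("-wi-------")  # after vdsm deactivated the LV
```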
```
StatusStorageThread::ERROR::2018-07-09 15:59:09,945::storage_broker::161::ovirt_hosted_engine_ha.broker.storage_broker.StorageBroker::(get_raw_stats) Failed to read metadata from /var/run/vdsm/storage/6c873880-0244-4823-90d5-03e4ec562cd8/8be3e565-99ee-4a07-b515-58ca38c2e8ec/3fc2c155-65b7-4b2c-988a-b8beab04a01a
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 152, in get_raw_stats
    f = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)
OSError: [Errno 2] No such file or directory: '/var/run/vdsm/storage/6c873880-0244-4823-90d5-03e4ec562cd8/8be3e565-99ee-4a07-b515-58ca38c2e8ec/3fc2c155-65b7-4b2c-988a-b8beab04a01a'
StatusStorageThread::ERROR::2018-07-09 15:59:09,946::status_broker::98::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to read state.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 94, in run
    self._storage_broker.get_raw_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 163, in get_raw_stats
    .format(str(e)))
RequestError: failed to read metadata: [Errno 2] No such file or directory: '/var/run/vdsm/storage/6c873880-0244-4823-90d5-03e4ec562cd8/8be3e565-99ee-4a07-b515-58ca38c2e8ec/3fc2c155-65b7-4b2c-988a-b8beab04a01a'
StatusStorageThread::ERROR::2018-07-09 15:59:09,947::status_broker::70::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(trigger_restart) Trying to restart the broker
```
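The traceback above shows the failure path: `get_raw_stats` opens the metadata path with `O_DIRECT` and, when the open fails, wraps the `OSError` into a `RequestError`, which the status thread then turns into a broker restart. A simplified sketch of that error path, assuming the signatures seen in the traceback (the `RequestError` class here is a stand-in for the project's own exception):

```python
import os


class RequestError(Exception):
    """Stand-in for the RequestError raised by the real broker code."""


def get_raw_stats(path, block_size=4096):
    """Simplified from storage_broker.py: read one block of HA metadata.

    If the metadata volume's symlink under /var/run/vdsm/storage is gone
    (e.g. the LV was deactivated), os.open fails with ENOENT and we
    re-raise as RequestError, as seen in the log above.
    """
    direct_flag = getattr(os, "O_DIRECT", 0)  # O_DIRECT exists on Linux only
    try:
        fd = os.open(path, direct_flag | os.O_RDONLY | os.O_SYNC)
    except OSError as e:
        raise RequestError("failed to read metadata: {0}".format(str(e)))
    try:
        return os.read(fd, block_size)
    finally:
        os.close(fd)
```

In 4.2.4 this `RequestError` reaches `trigger_restart`, so the broker restarts and re-prepares its storage, which is why the issue recovers there.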
Tracked it down to this: BZ1583712
*** This bug has been marked as a duplicate of bug 1583712 ***
Description of problem:

There was a problem with the hosted-engine metadata LV (somehow it got deactivated). After this the HA mechanism was unable to recover from it. It looped through states, sending tons of emails, and never attempted to prepareImages/refresh the SD to bring that LV up again and recover from the incident. Storage was available all the time.

See the details:

1. At 15:58:25, the hosted-engine metadata volume LV was somehow deactivated. VDSM logs from this time have rotated; all I could find was that vdsm restarted 1 minute earlier, and it was not through soft fencing.

```
MainThread::INFO::2018-07-04 15:57:55,791::hosted_engine::506::ovirt_hosted_engine_ha.agent.hosted_engine.HostedEngine::(_monitoring_loop) Required service vdsmd is not up.
StatusStorageThread::ERROR::2018-07-04 15:58:25,676::status_broker::92::ovirt_hosted_engine_ha.broker.status_broker.StatusBroker.Update::(run) Failed to read state.
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/status_broker.py", line 88, in run
    self._storage_broker.get_raw_stats()
  File "/usr/lib/python2.7/site-packages/ovirt_hosted_engine_ha/broker/storage_broker.py", line 163, in get_raw_stats
    .format(str(e)))
RequestError: failed to read metadata: [Errno 2] No such file or directory: '/var/run/vdsm/storage/2fc4a59d-f27b-41be-a851-940694b94314/e14df275-5499-49fe-a158-4460948e54aa/f3efb31c-ee23-4a24-9c96-c695831409d6'
```

The missing path is indeed the metadata volume:

```
$ grep f3efb31c-ee23-4a24-9c96-c695831409d6 etc/ovirt-hosted-engine/hosted-engine.conf
metadata_volume_UUID=f3efb31c-ee23-4a24-9c96-c695831409d6
```

2. Then the agent restarts (due to vdsm restarting, I understand):

```
MainThread::INFO::2018-07-04 15:58:38,094::agent::89::ovirt_hosted_engine_ha.agent.agent.Agent::(run) Agent shutting down
MainThread::INFO::2018-07-04 15:58:49,824::agent::67::ovirt_hosted_engine_ha.agent.agent.Agent::(run) ovirt-hosted-engine-ha agent 2.2.11 started
```

3. The HA mechanism did not recover from this anymore.
After the ha-agent restart it loops through the states below while the HE VM was running on another host (HE disk lock acquired):

EngineDown - EngineStart - EngineStarting - EngineForceStop - ReinitializeFSM

This sent dozens of emails, as this specific host was unable to read the metadata LV to see the other hosts' states and therefore kept trying (and failing) to start the HE VM. For example, after 15:58, host id 1's timestamp (the logs above are from host id 5) is always 2265573 from the point of view of this failed host. But the agent does not try to refresh the storage domain or prepare images to bring up the metadata LV and read a newer status of the other hosts. It just keeps looping.

This is a 5-host HE cluster; only host id 5 faced this problem, all other 4 hosts were fine.

Version-Release number of selected component (if applicable):
ovirt-hosted-engine-ha-2.2.11-1.el7ev.noarch

How reproducible:
Unknown

Actual results:
HA agent of a single host looping, not recovering.

Expected results:
Bring up the missing LV (prepareVolumes/refresh storage) and recover.
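The reported loop versus the expected recovery can be sketched as a tiny state machine. This is purely illustrative (the class, method names, and metadata contents below are mine, not the agent's real FSM): without a storage-refresh step a host that cannot read the metadata keeps cycling through force-stop/reinit; with a refresh attempt it can reactivate the volume and settle back to EngineDown.

```python
class HostedEngineFSMSketch:
    """Illustrative model of the reported loop and the proposed fix."""

    def __init__(self, storage, refresh_on_error=False):
        self.storage = storage              # dict standing in for the metadata LV
        self.refresh_on_error = refresh_on_error
        self.state = "EngineDown"
        self.history = [self.state]

    def _move(self, state):
        self.state = state
        self.history.append(state)

    def step(self):
        if "metadata" not in self.storage:
            if self.refresh_on_error:
                # Proposed behaviour: prepareVolumes / refresh the storage
                # domain, which brings the deactivated LV back up.
                self.storage["metadata"] = "host 1 ts=2265580"
                self._move("EngineDown")    # re-read peer state, stay down
                return
            # Reported behaviour: blind start attempt, force stop, reinit.
            for s in ("EngineStart", "EngineStarting",
                      "EngineForceStop", "ReinitializeFSM", "EngineDown"):
                self._move(s)


looping = HostedEngineFSMSketch({})
looping.step(); looping.step()        # cycles through EngineForceStop twice

fixed = HostedEngineFSMSketch({}, refresh_on_error=True)
fixed.step(); fixed.step()            # recovers the metadata, never force-stops
```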